Collecting Images for Classifiers

Dr. Bryan Patrick Wood

June 05, 2021

Looks like I might beat my previous time between posts in a landslide. I was told it was about a year in between my first and second posts. Not fair as this one will be less content-rich. Also, please forgive the self-promotion.


I've always loved the quote "I'm a great believer in luck. The harder I work, the more luck I have." You're asked to build a model. Models need data. If you're lucky, that data already exists somewhere, or existing models were trained with it in mind. Most of the time that's not the case.

One-class classification is an entire topic in itself, and my current thinking is that the typical approach may not be the best one. Let's say it's a binary image classifier. Those typically need images of both positive and negative examples. Usually a lot of both, even when using transfer learning. That sounds like a huge pain. And it is.


Threw together something pretty quick to address a need. Done in some spare time over a weekend which makes this a fairly rare instance of something shareable that was work-adjacent. Rough around the edges for sure but did the job it needed to.

More info about usage here. @bpw1621/imgscrape is a pretty simple Node-based image web scraper. It uses puppeteer for controlling the browser and provides a yargs-based CLI. It's the first npm module I have taken the time to publish (and I'm glad to have gone through that process now). Please visit the links for more information.

For those who just want to use this as a tool (it wasn't immediately clear to me how to install it and do that), it's as simple as, for instance, the following:

npx imgscrape-cli -t narwhal -e google

Some engines work better than others at the moment, and all of them worked better when I first wrote it. I find Yandex usually works the best in terms of volume, usually in the thousands, while the rest stop in the hundreds of images. YMMV.

The Code

Almost all the logic is in lib/scrapeImages.js, which clocks in at a little over 200 lines of code and should be pretty approachable. The puppeteer package does all the heavy lifting here. It's Node code, so there is a lot of async and await, which I prefer to callbacks and explicit promises given the choice.

After instantiating the browser object and a little more setup, you're brought to a large switch statement with the details about the individual image search engines (e.g., URL, CSS selectors for the images, etc.). That part could definitely use some refactoring. Next we go to the page and scroll down looking for images, making sure to click the site-specific "more results" button if it pops up.
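A rough sketch of what that refactoring might look like: a lookup table keyed by engine name instead of a switch. This is not the code in lib/scrapeImages.js. The `ENGINES` table and the `getEngine` helper are names made up for illustration, the CSS selectors are placeholders, and the search URLs are assumptions about what each engine accepts today.

```javascript
// Hypothetical per-engine settings. The selectors are placeholders, not the
// real ones from lib/scrapeImages.js, and the URLs are assumed and may change.
const ENGINES = {
  google: {
    searchUrl: (term) =>
      `https://www.google.com/search?tbm=isch&q=${encodeURIComponent(term)}`,
    imageSelector: 'img.placeholder-result',
    moreButtonSelector: 'input.placeholder-more',
  },
  yandex: {
    searchUrl: (term) =>
      `https://yandex.com/images/search?text=${encodeURIComponent(term)}`,
    imageSelector: 'img.placeholder-result',
    moreButtonSelector: null, // assumed: no "more results" button to click
  },
};

// Look up an engine by name (case-insensitive) and fail loudly on unknown ones.
function getEngine(name) {
  const key = String(name).toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(ENGINES, key)) {
    const known = Object.keys(ENGINES).join(', ');
    throw new Error(`Unknown engine "${name}". Known engines: ${known}`);
  }
  return ENGINES[key];
}
```

Adding an engine then means adding one entry to the table rather than another case, and the scraping loop only ever sees the three fields.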

Both URL and data URI images are supported. There is also logic to try to determine if the engine is just returning duplicate images or has run out of results, and to bail if that is the case. This is another part that could use a look: it worked well when it was first written, but I think some engines have changed aspects of their results pages since then, and those do not work great. Lastly, information about the successful, failed, and duplicate URLs is dumped out to JSON files along with the images.
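To make the scroll-and-bail idea concrete, here is a minimal sketch, not the actual implementation. It assumes `page` behaves like a Puppeteer Page (only `page.$$eval`, `page.evaluate`, `page.$`, and `ElementHandle.click` are used). The function name `collectImageSources` and its options (`maxStalledRounds`, `maxRounds`) are hypothetical, and "bail" here simply means giving up after a few scroll rounds in a row that turn up nothing new.

```javascript
// Sketch only: scroll a results page, collect unique image sources, and stop
// when several consecutive rounds yield nothing new (duplicates or no results).
async function collectImageSources(page, options) {
  const {
    imageSelector,
    moreButtonSelector = null,
    maxStalledRounds = 3, // hypothetical knob: rounds with no new images before bailing
    maxRounds = 50,       // hypothetical hard cap on scroll rounds
  } = options;

  const seen = new Set();
  const urlImages = [];  // regular http(s) sources, to be downloaded
  const dataImages = []; // inline data: URIs, to be decoded
  let stalledRounds = 0;

  for (let round = 0; round < maxRounds && stalledRounds < maxStalledRounds; round++) {
    // Read the src of every matching <img> currently in the DOM.
    const sources = await page.$$eval(imageSelector, (imgs) => imgs.map((img) => img.src));

    let added = 0;
    for (const src of sources) {
      if (!src || seen.has(src)) continue; // skip empty and already-seen sources
      seen.add(src);
      (src.startsWith('data:') ? dataImages : urlImages).push(src);
      added++;
    }
    stalledRounds = added === 0 ? stalledRounds + 1 : 0;

    // Scroll one viewport down, then click the "more results" button if present.
    await page.evaluate(() => window.scrollBy(0, window.innerHeight));
    if (moreButtonSelector) {
      const button = await page.$(moreButtonSelector);
      if (button) await button.click();
    }
  }
  return { urlImages, dataImages };
}
```

With real Puppeteer, `page` would come from `browser.newPage()` after `puppeteer.launch()` and a `page.goto(...)` to the engine's search URL. Since the function only needs those few methods, it can also be exercised with a stub page object.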

"Yargs be a node.js library fer hearties tryin' ter parse optstrings." Love the whimsy ... cli/imgscrape-cli.js sets up the CLI, parses the command line options using the yargs package, and calls the scrapeImages function from lib/scrapeImages. I had not used yargs before and ended up pleased with it. It supports subcommands, detailed option specifications, example commands, aliases for long and short style options, and a couple of other niceties. The API supports method chaining, which I also liked.
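For a flavor of that chaining style, here is a small sketch that is not the contents of cli/imgscrape-cli.js. Only the short flags `-t` and `-e` come from the command shown earlier. The long names `term` and `engine`, the descriptions, the default, the list of choices, and the `configureCli` helper are all assumptions for illustration.

```javascript
// `y` is expected to be a yargs instance, created for example with:
//   const yargs = require('yargs/yargs');
//   const { hideBin } = require('yargs/helpers');
//   const argv = configureCli(yargs(hideBin(process.argv))).parse();
function configureCli(y) {
  return y
    .usage('Usage: $0 -t <term> [-e <engine>]')
    .option('term', {
      alias: 't', // short flag from the example command; long name is assumed
      type: 'string',
      demandOption: true,
      describe: 'Search term to scrape images for',
    })
    .option('engine', {
      alias: 'e', // short flag from the example command; long name is assumed
      type: 'string',
      choices: ['google', 'yandex'], // assumed subset of supported engines
      default: 'google',             // assumed default
      describe: 'Image search engine to use',
    })
    .example('$0 -t narwhal -e google', 'Scrape narwhal images from Google')
    .help();
}
```

Each call returns the yargs instance, which is what makes the whole CLI definition read as one chained expression.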

Finalized at 9:48 PM.