Collecting Images for Classifiers

Dr. Bryan Patrick Wood

June 05, 2021

Looks like I might beat my previous time between posts in a landslide. I was told it was about a year in between my first and second posts. Not fair as this one will be less content-rich. Also, please forgive the self-promotion.


I've always loved the quote "I’m a great believer in luck. The harder I work, the more luck I have." You're asked to build a model. Models need data. If you're lucky that data already exists somewhere or existing models were trained with them in mind. Most of the time that's not the case.

Let's say it's a binary image classifier. Those typically need images of both positive and negative examples. (One-class classification is an entire topic in itself, and my current thinking is that it may not typically be the best approach.) Usually a lot of both, even when using transfer learning. That sounds like a huge pain. And it is.


Threw together something pretty quick to address a need. Done in some spare time over a weekend which makes this a fairly rare instance of something shareable that was work-adjacent. Rough around the edges for sure but did the job it needed to.

More info about usage here. @bpw1621/imgscrape [1][2] is a pretty simple node-based image web scraper. It uses puppeteer for controlling the browser and provides a yargs-based CLI. First npm module I have taken the time to publish (and glad to have gone through that process now). Please visit the links for more information.

For those that just want to use this as a tool, because it wasn't clear to me immediately how to just install this and do that, it's as simple as, for instance, the following

npx imgscrape-cli -t narwhal -e google

Some engines work better than others at the moment and all worked better when I had first written it. I find Yandex usually works the best in terms of volume, usually in the thousands, while the rest stop in the hundreds of images. YMMV.

The Code

Almost all the logic is in lib/scrapeImages.js which clocks in at a little over 200 lines of code and should be pretty approachable. The puppeteer package does all the heavy lifting here. It's node code, so there is a lot of async and await, which I prefer to callbacks and explicitly using promises given the choice.

After instantiating the browser object, and a little more setup you're brought to a large switch statement with the details about the individual image search engines (e.g., URL, CSS selectors for the images, etc.). That part could definitely use some refactoring. Next we go to the page and scroll down looking for images making sure to find the site specific more results button if it pops up.
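One way the switch statement could be refactored is into a lookup table of per-engine settings. The sketch below is in Python rather than node, purely to illustrate the shape of the idea; the engine names, URLs, and CSS selectors are hypothetical placeholders and not the values imgscrape actually uses.

```python
from dataclasses import dataclass
from urllib.parse import quote_plus


@dataclass(frozen=True)
class EngineConfig:
    """Per-engine settings that would otherwise live in a switch branch."""
    search_url: str            # template with a {query} placeholder
    image_selector: str        # CSS selector matching result images
    more_button_selector: str  # CSS selector for the "more results" button


# Hypothetical values for illustration only; these are NOT the real
# URLs or selectors used by imgscrape.
ENGINES = {
    "engine_a": EngineConfig(
        search_url="https://images.example.com/search?q={query}",
        image_selector="img.result",
        more_button_selector="button.more",
    ),
    "engine_b": EngineConfig(
        search_url="https://example.org/images?text={query}",
        image_selector="div.tile img",
        more_button_selector="a.load-more",
    ),
}


def build_search_url(engine: str, term: str) -> str:
    """Look up the engine and fill in the URL-encoded search term."""
    try:
        config = ENGINES[engine]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine!r}") from None
    return config.search_url.format(query=quote_plus(term))
```

With a table like this, supporting a new engine would mean adding one entry rather than another switch branch.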

Supports both URL and data images. There is also logic to try to determine if the engine is just returning duplicate images or has run out of results and bail if that is the case. This is another part that could use a look: it worked well when it was first written, but I think some engines changed aspects of their page results since then, and those do not work great. Lastly, information about the successful, failed, and duplicate URLs is dumped out to JSON files along with the images.
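The duplicate detection and bail-out idea can be sketched roughly as follows. This is a Python illustration under my own assumptions, not the module's actual node code: the function name, the max_stale_batches threshold, and the report layout are all made up for the example.

```python
import json
from typing import Iterable, List


def collect_unique(batches: Iterable[List[str]], max_stale_batches: int = 3) -> dict:
    """Record new vs duplicate URLs, bailing after too many batches with nothing new."""
    seen = set()
    unique, duplicates = [], []
    stale = 0
    for batch in batches:
        new_in_batch = 0
        for url in batch:
            if url in seen:
                duplicates.append(url)
            else:
                seen.add(url)
                unique.append(url)
                new_in_batch += 1
        # Reset the counter when a batch brought something new, otherwise count it as stale.
        stale = 0 if new_in_batch else stale + 1
        if stale >= max_stale_batches:
            break  # the engine appears to have run out of fresh results
    return {"unique": unique, "duplicates": duplicates}


report = collect_unique(
    [["a.jpg", "b.jpg"], ["b.jpg"], ["a.jpg"], ["c.jpg"]],
    max_stale_batches=2,
)
print(json.dumps(report, indent=2))
```

In this toy run the loop stops after two consecutive batches with nothing new, so the last batch is never visited.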

Yargs Logo

"Yargs be a node.js library fer hearties tryin' ter parse optstrings." Love the whimsy ... Using the yargs package, cli/imgscrape-cli.js sets up the CLI, parses the command line options, and calls the scrapeImages function from lib/scrapeImages. I had not used yargs before and ended up pleased with it. It supported subcommands, detailed option specifications, example commands, aliases for long and short style options, and a couple of other niceties. The API supports method chaining which I also liked.

Finalized at 9:48 PM.

Streamlit Topic Modeling

Dr. Bryan Patrick Wood

April 22, 2021

What does it take to create and deploy a topic modeling web application quickly? I endeavored to find this out using Python Natural Language Processing (NLP) packages for topic modeling, Streamlit for the web application framework, and Streamlit Sharing for deployment.

I had been directed to use topic modeling on a project professionally (never mind what I might have done if not explicitly directed), so I already had direct experience with relevant techniques on a challenging real-world problem. However, I encountered several unexpected difficulties sharing topic modeling results with a non-technical audience.

Is this a topic modeling?

Shortly after, I was consulted on implementing a topic modeling feature in a product system operating at scale. Here, again, the group I was trying to assist had a hard time understanding exactly what to expect out of topic modeling and keeping the important differences between supervised, semi-supervised, and unsupervised machine learning approaches straight [8]. In short, in supervised learning you have labeled data and want to predict labels on new data; in unsupervised learning you have no labels and try to find meaningful patterns in the data; and semi-supervised learning is a hybrid of the two.

This motivated me to put something together to show, don't tell, so to speak. I wanted something tangible for the folks I was dealing with to play around with. This was also a good excuse to use Streamlit and try out Streamlit Sharing. I had been proselytizing Streamlit for a few use-cases professionally (specifically, use-cases where additional data scientists would be a bigger asset to the effort than adding a team of frontend software developers) when really I had only played around with a few toy examples. Deploying via Streamlit Sharing was new and piqued my curiosity.


First, the application is still very much a work in progress / prototype (il meglio è l'inimico del bene: the best is the enemy of the good). There is functionality stubbed out that is not implemented (e.g., using non-negative matrix factorization [1]). The code all needs to be refactored out of a sprawling 250 line script too. The focus was on getting enough of the piece parts working well enough to allude to robust capabilities that could be implemented and having enough of a complete application to stimulate discussion. Second, Streamlit had good support for literate programming [2]. As a result, some narrative is repeated from the application here, so if you have already gone to the application you can skim some of what follows.

Topic Modeling

As I would find out, topic modeling can mean different things to different people. The words topic and model are common enough that most people can look at them and formulate an opinion on what the technique must accomplish when successful.

Without additional qualifications, the term topic modeling usually refers to types of statistical models used in the discovery of abstract topics that occur in a collection of documents [6]. (The word document here really ends up meaning any ordered collection of strings.) These techniques are almost always fully unsupervised, although semi-supervised and supervised variants do exist (for example, Guided LDA or Seeded LDA in the semi-supervised case and Labeled LDA in the supervised case). Among the most commonly used techniques, and the one that is fully implemented in the application, is Latent Dirichlet Allocation (LDA) [7].

For the representation, there is even this whole graphical language to describe the latent structure of the model, inscrutable as it is.

LDA Plate Notation

At a superficial level, LDA is just a matrix factorization of the word-document relationship matrix (viz., below) into two relationship matrices: words to topics and topics to documents. The theory posits an underlying distribution of words in topics and topics in documents, but that is more of interest if one wishes to understand the underlying theory, which is well exposed elsewhere.

Topic Modeling as Matrix Factorization

Not going deep into LDA theory here: that is a topic for its own blog post and another time.
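To make the matrix factorization view a little more concrete, here is a toy sketch in plain Python. All the numbers are made up, and it only illustrates the shapes involved (words by topics times topics by documents); it says nothing about how LDA actually estimates those matrices.

```python
# Toy sizes: 4 words, 2 topics, 3 documents (all numbers are made up).
# Each column is a probability distribution, so it sums to 1.
words_by_topics = [  # P(word | topic), shape 4 x 2
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
topics_by_docs = [  # P(topic | document), shape 2 x 3
    [1.0, 0.5, 0.1],
    [0.0, 0.5, 0.9],
]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# P(word | document), shape 4 x 3: the matrix the factorization approximates.
words_by_docs = matmul(words_by_topics, topics_by_docs)
column_sums = [sum(row[j] for row in words_by_docs) for j in range(3)]
```

Each column of the product is again a distribution over words, which is what one would expect for a document described as a mixture of topics.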

I had done a lot of experimentation on the professional project. That experimentation is not directly useful outside its context, which I cannot share. (In my professional use-case, data cleaning and preprocessing were probably the most important aspects.) I will highlight some snippets of code that may be of use to an aspiring topic modeler.

The Pareto principle of data science claims that 80% of time gets expended on data preparation and, as the story goes, the other 20% complaining about it. Preprocessing is vitally important in all machine learning problems. In NLP problems, there tend to be a lot more choices than in other domains. For topic modeling specifically, one usually wants to remove various types of named entities before applying modeling. The following function was used to denoise the text documents.

In words, grab text out of a data frame column, remove some uninformative entity types, and run the documents through gensim.utils.simple_preprocess removing stopwords from nltk.corpus.stopwords.

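import pandas as pd
import regex
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

# NOTE: only URL_REGEX_STR appears in the post; the three patterns below are
# simple illustrative stand-ins, not the ones used in the application.
EMAIL_REGEX_STR = r'\S+@\S+'
MENTION_REGEX_STR = r'@\w+'
HASHTAG_REGEX_STR = r'#\w+'
URL_REGEX_STR = r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'

def denoise_docs(texts_df: pd.DataFrame, text_column: str):
    texts = texts_df[text_column].values.tolist()
    remove_regex = regex.compile(f'({EMAIL_REGEX_STR}|{MENTION_REGEX_STR}|{HASHTAG_REGEX_STR}|{URL_REGEX_STR})')
    texts = [remove_regex.sub('', text) for text in texts]
    # build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    docs = [[w for w in simple_preprocess(doc, deacc=True) if w not in stop_words] for doc in texts]
    return docs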

I also experimented with using bigram and trigram phrases through gensim.models.Phrases and gensim.models.phrases.Phraser (left in as application options) but did not see a big lift. Using the bigrams and trigrams themselves rather than as a preprocessing step may have been more impactful. The final step in document preprocessing was using spaCy to perform lemmatization.

import pandas as pd
import spacy

def generate_docs(texts_df: pd.DataFrame, text_column: str, ngrams: str = None):
    docs = denoise_docs(texts_df, text_column)

    # bigram / trigram preprocessing ...

    lemmatized_docs = []
    # the parser and NER pipeline components are not needed for lemmatization
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for doc in docs:
        doc = nlp(' '.join(doc))
        lemmatized_docs.append([token.lemma_ for token in doc])

    return lemmatized_docs
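The bigram step is elided above. As a rough idea of what phrase detection does, here is a deliberately simplified stand-in in plain Python: it merges adjacent token pairs by raw count only, whereas gensim's Phrases uses a collocation score with a threshold. The function name and the min_count default are my own choices for the example.

```python
from collections import Counter
from typing import List


def merge_frequent_bigrams(docs: List[List[str]], min_count: int = 2) -> List[List[str]]:
    """Join adjacent token pairs seen at least min_count times into 'a_b' tokens."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [["new", "york", "flight", "delayed"], ["flight", "to", "new", "york"]]
result = merge_frequent_bigrams(docs)
```

Here only the pair ("new", "york") occurs twice, so it is the only one joined into a single token.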

The modeling code is gensim standard fare

import gensim
from gensim import corpora

def prepare_training_data(docs):
    id2word = corpora.Dictionary(docs)
    corpus = [id2word.doc2bow(doc) for doc in docs]
    return id2word, corpus

def train_model(docs, num_topics: int = 10, per_word_topics: bool = True):
    id2word, corpus = prepare_training_data(docs)
    model = gensim.models.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, per_word_topics=per_word_topics)
    return model

I had intended to add more modeling options but ran out of time. At the very least, I will be adding an option to use Non-negative Matrix Factorization (NMF) [1] in the future. Anecdotally, NMF can produce better topics depending on the dataset being investigated. Adding any method that is not unsupervised will be a much bigger lift.

pyLDAvis termite plot

For visualization, I liberally took from "Topic modeling visualization – How to present the results of LDA models?" specifically for the model result visualizations: it is a good reference for visualizing topic model results.

pyLDAvis [9] is also a good topic modeling visualization but did not fit great with embedding in an application. Termite plots [10] are another interesting topic modeling visualization available in Python using the textacy package.

The most involved visualization I had time for was the word clouds and since there was already a Python package to do just that the task was trivial

from wordcloud import WordCloud

WORDCLOUD_FONT_PATH = r'./data/Inkfree.ttf'

def generate_wordcloud(docs, collocations: bool = False):
    wordcloud_text = (' '.join(' '.join(doc) for doc in docs))
    wordcloud = WordCloud(font_path=WORDCLOUD_FONT_PATH, width=700, height=600, background_color='white', collocations=collocations).generate(wordcloud_text)
    return wordcloud

Airline Tweets Wordcloud

The word cloud settings required a little playing around with to get something that looked decent. Adding additional visualizations is the main place I felt like I ran out of time and will likely revisit.


Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science (plagiarized directly from their documentation). I will focus on the create here and on the sharing in the sequel, although sharing on a trusted local area network is trivial (for example, between coworkers on a corporate network if the firewall rules are not too draconian).

The main value proposition is taking a data science or machine learning artifact to web application quickly for the purpose of sharing with folks that would not be comfortable with something like a Jupyter notebook. On this it delivers. I went from script to web application in a couple hours. That allowed me to share a web application with a group of decision makers that were trying to make heads or tails of what topic modeling even meant. I was very impressed from the provider-end and received feedback of the same from the receiver-end.

Another benefit would be its pure Python nature (i.e., no HTML, CSS, JS, etc.) so no need to require data scientists to learn wonky web technologies they have no interest in learning. A comparison with Plotly Dash probably deserves its own blog post but the Dash approach is very much more in the camp of making React easier to do in Python. It is very much focused on

Probably not a big concern for most folks thinking about using this technology given its target, but it's worth noting, for those with experience in traditional GUI application frameworks, that Streamlit works more like an immediate mode user interface [11]. That is, it reruns the script from top to bottom each time a UI action (e.g., a button click) is performed. Aggressive caching, via the @st.cache decorator, allows for efficient execution: only rerun the code you need to on each change. This requires the user to make those decisions, and the arguments to be hashable.
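The rerun-plus-cache model can be illustrated without Streamlit at all. The sketch below uses the standard library's functools.lru_cache purely as an analogy; it is not how Streamlit implements its caching, and expensive_step and run_script are hypothetical names standing in for a slow computation and one pass of the script.

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=None)
def expensive_step(num_topics: int, ngrams: tuple) -> str:
    """Stand-in for slow work such as training a model."""
    global call_count
    call_count += 1
    return f"model(num_topics={num_topics}, ngrams={ngrams})"


def run_script(num_topics: int) -> str:
    """One top-to-bottom 'rerun' of the script after a UI event."""
    # Arguments must be hashable: a tuple works, a list would raise TypeError.
    return expensive_step(num_topics, ("bigram",))


run_script(10)  # first run: computes
run_script(10)  # rerun with the same inputs: served from the cache
run_script(20)  # an input changed: computes again
```

Three reruns trigger only two real computations, which is the whole point of caching in a rerun-everything model.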

It even supports screencast recording natively! Helps with showing folks how to use what you are sharing.

Record a Screencast

This video shows the use of the st.sidebar context manager: a staple for application documentation, settings, and even navigation. Streamlit does not really have proper multipage application support yet.

This next video shows usage of the new st.beta_expander context manager: it is fantastic, especially for adding literate exposition sections that the user will want to collapse after they have read them to regain the screen real estate.

The last thing I'll highlight in the application is the usage of the new st.beta_columns context manager that was used to create a grid of word clouds for the individual topics.

Topic Wordclouds

Here is the code

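import random
import streamlit as st
from wordcloud import WordCloud

# model, num_topics, collocations, COLORS, and WORDCLOUD_FONT_PATH are assumed to be defined earlier in the script
st.subheader('Top N Topic Keywords Wordclouds')
topics = model.show_topics(formatted=False, num_topics=num_topics)
cols = st.beta_columns(3)
colors = random.sample(COLORS, k=len(topics))
for index, topic in enumerate(topics):
    wc = WordCloud(font_path=WORDCLOUD_FONT_PATH, width=700, height=600, background_color='white', collocations=collocations, prefer_horizontal=1.0, color_func=lambda *args, **kwargs: colors[index])
    with cols[index % 3]:
        # topic is (topic_id, [(word, weight), ...]); the cloud must be generated before it can be rendered
        wc.generate_from_frequencies(dict(topic[1]))
        st.image(wc.to_image(), caption=f'Topic #{index}', use_column_width=True)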

There is a ton more to dive into here but nothing that cannot be gained from jumping in and trying something yourself.

Streamlit Sharing

From their website: a gif is worth a lot of words.

Streamlit Sharing Workflow

The Streamlit Sharing tagline is pretty good: deploy, manage, and share your apps with the world, directly from Streamlit — all for free.

The ease of sharing a machine learning application prototype was delightful. I had originally deployed on an Amazon AWS EC2 instance to meet a deadline (viz., below). Given my background and experience with AWS I would not say it was overly difficult to deploy this way, but I know talented machine learning professionals that might have had trouble here. Moreover, most would not want to spend their time on role access, security settings, setting up DNS records, etc. And yes, their time is most definitely better spent on what they do best.

In order to use this service you need to request and be granted an account. You do that here. You will get a transactional email letting you know that you are in the queue for access, but the invite is not coming just yet. I requested access on February 11th and received access on March 2nd, or about 20 days. I would suggest, if you think you will want to try this out anytime soon, that you sign up right away.

Once access is granted it is pretty easy to just follow the gif above or the directions here. There are a couple of things that required me to iterate on my GitHub repository to get everything working, including

  • Using / setup.cfg correctly means you do not need a requirements.txt file, but the service requires one
  • It is common for machine learning packages to download data and models
    • In the case of spaCy, they transitioned to making their models available as Python packages so nothing to do there but add the model I needed to requirements.txt
    • In the case of NLTK, I had to add a call to to grab stopwords in my main application script

Otherwise it was completely straight-forward and accomplished in just a couple clicks. Official guidance on deployment of Streamlit applications can be found here.

Nonetheless, very generous of the Streamlit team! Running on AWS was only 43¢ a day (about $157 a year or $13 a month) but free is certainly an improvement.

It is also important to note that this is absolutely not a replacement for production deployment. Each user is limited to 3 applications. Individual applications are limited to being run in a shared environment that can get up to 1 CPU, 800 MB of RAM, and 800 MB of dedicated storage [3]. So not the right place for your next start-up's web application, but a great value proposition for sharing quick prototypes.

Wrap Up

If you have gotten this far I would like to thank you for taking the time. If on reading this you were interested enough to play around with the application and have feedback I would love to hear it.

The prototype application can be accessed on Streamlit Sharing [4] and the code is available on GitHub [5]. The intention is to augment and improve what is there, time permitting. I plan to get my thoughts for improvements and expansion into GitHub issues as I have time to work on them.

Experiment With Style Transfer

Dr. Bryan Patrick Wood

May 29, 2020

What is style transfer? Style transfer is a machine learning technique to transfer the style of one datum onto another. Truth in advertising this time. I will be looking at image style transfer specifically. The technique is a few years old now. I can remember when I first saw the paper's results I thought ... magic!

The process used below follows the seminal paper Image Style Transfer Using Convolutional Neural Networks which uses the pre-trained VGG19 Imagenet model. The PyTorch implementation was taken liberally from the Style Transfer section of the Udacity Deep Learning Nanodegree Program. I had planned to go into more detail about the paper and implementation approach but this post is already long enough just with the results so that will be deferred to a possible future post.
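For readers who want a small taste of the mechanics before any future post: in the cited paper, style is captured through Gram matrices, the correlations between the feature maps of a network layer. The plain-Python sketch below is only an illustration under simplifying assumptions: it uses made-up numbers in place of VGG19 activations and a plain mean squared difference in place of the paper's normalization, and it is not the implementation used for the results below.

```python
from typing import List


def gram_matrix(features: List[List[float]]) -> List[List[float]]:
    """Channel-by-channel inner products of flattened feature maps."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
        for row_i in features
    ]


def style_loss(generated: List[List[float]], style: List[List[float]]) -> float:
    """Mean squared difference between the two Gram matrices (simplified scaling)."""
    g, s = gram_matrix(generated), gram_matrix(style)
    size = len(g) * len(g[0])
    return sum(
        (g[i][j] - s[i][j]) ** 2 for i in range(len(g)) for j in range(len(g[0]))
    ) / size


# Two channels, each flattened to three spatial positions (made-up numbers).
style_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
print(gram_matrix(style_features))
```

Because the Gram matrix discards where in the image a feature occurs, it describes texture and color correlations rather than layout, which fits the observations below about which style images transfer well.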

I used a fixed picture for the content image throughout and varied the style image.

I chose a picture of my eldest daughter sleeping for the content image: tell me that's not the picture of an angel ensconced in a cloud-like comforter?

Sleeping Angel

All appreciation in advance for not making her the transfer learning Lenna! [1]

The first two examples are from the fantastic and surreal comic artist Mœbius. Jean Henri Gaston Giraud was a French artist who achieved great acclaim under the Mœbius pseudonym.

The first example came out pretty well and trippy: a lot like his originals. The vibrant shades of purple really pop.

First Moebius Style Transfer Example

First Moebius Style Image

First one from Moebius. This is one of a series of fantasy prints from a collection called Mystere of Montrouge (this one is plate 10). [2]

A second example by the same artist; again, pretty good and trippy. Arguably a better result as the comforter looks like a natural fit for transferring the mountains and waterfalls to the content image.

Second Moebius Style Transfer Example

Second Moebius Style Image

Another by Moebius. This one was done for the famous brand Hermès, who was releasing a new perfume and asked him if he could do some artwork on the theme of Hermès Voyages. [3]

I think there were a couple different aspects that made these style images work particularly well with the content image. First, the style image used bright, vibrant colors: as we will see further on, style images dominated by darker colors produced worse results. Second, since the style image was an illustration with very precise underlying line art, the subsequent transfer of style seemed to be more intricate as well. Most of the style images below that are dominated by larger blobs of color (e.g., Rothko) did not produce as pleasing a result.

The next two use the style from two pieces by Piet Mondrian. Pieter Cornelis Mondriaan was a Dutch painter who is regarded as one of the greatest artists of the 20th century.

Not so long ago I participated in an art class to make a fused glass creation and this is what I came up with.

Fused Glass

When the instructor came around and mentioned my creation was very Mondrian-esque, poor confused me just smiled and nodded. I later found out what she meant and I have to agree it is quite uncanny.

The first one is a little busier than many of his other works but has a nice aesthetic.

First Mondrian Style Transfer Example

Tableau No. 2/Composition No. VII by Piet Mondrian. [4]

The second one is probably much more representative of what most folks think of when they picture Mondrian's artwork.

Second Mondrian Style Transfer Example

Mondrian Composition

It looks like Mondrian made a ton of similar looking pieces under the umbrella of Composition. I spent very little time looking for the one I used here before giving up.

I actually was pleased with the results here although when I showed them to my wife she did not care for them. With these styles being so abstract I was not sure what exactly to expect. Of the two cases, I think the first one was trained just about right whereas the second one might have had a better result if trained for fewer epochs.

Gustav Klimt was an Austrian symbolist painter.

Next up is a painting by Gustav Klimt of a flower garden that I thought would produce a nice style effect.

Klimt Style Transfer Example

Blumengarten (Flower Garden) by Gustav Klimt. [5]

My wife did like this one and it is not that I dislike it as much as it almost looks to me like she has either boils or bad poison ivy on her face. To each their own on that one. It is not an altogether unpleasant effect. One thing to note is that the presence of all the circular flower patterns in the style image transfer fairly literally to the content image.

I know Jackson Pollock is viewed as a divisive figure in the art world but I have always enjoyed his work without thinking about whether a four year-old could produce the same effect given a canvas, buckets of random paint colors, and a couple brushes.

I thought his unique style might be a good fit for style transfer and I picked a representative work to try out.

Pollock Style Transfer Example

Nice looking painting called Convergence by Jackson Pollock. [6]

Was not terribly impressed with the results here. Maybe a different work or more training epochs could have improved the result.

René François Ghislain Magritte was a Belgian Surrealist artist. My wife and I are both huge fans of René Magritte and his surrealist art. In our home we have two of his works up on the walls.

My wife is partial to his Le fils de l'homme (The Son of Man) [7], which is great, but I have always had a soft spot for the cerebral La trahison des images (The Treachery of Images) [8]. For those that are not already familiar.

I love this quote

When Magritte was once asked about La trahison des images, he replied that of course it was not a pipe, just try to fill it with tobacco. [9]

However, for style transfer I did not think either was a particularly good choice. I chose a work I was not very familiar with but that was very stylized, from his earlier works.

Magritte Style Transfer Example

L’écuyère (Woman on horseback) by René Magritte. It was hard to find a good citation for this one for some reason.

My assessment: okay. The blocky colors in the style image led to a muddled transfer to the content. Not awful, and maybe training for fewer epochs would have improved the situation, but middling at best in comparison to the rest of the results.

Henri Émile Benoît Matisse was a French artist, known for both his use of color and his fluid and original style.

I am not a huge fan of Matisse in general but I was on the hunt for cool looking styles and I thought I had found one by him.

For this I chose what some consider his masterpiece.

Matisse Style Transfer Example

The Dessert: Harmony in Red (The Red Room) by Henri Matisse. [10]

The red steamrolled the transfer, and I do not think there is much that could be done to fix this situation without modifying the style image itself. For example, if I were going to spend more time to improve the result here, I might try to apply an image filter to desaturate the style image.
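A desaturation filter of the kind mentioned here could look roughly like the following. It is a minimal pure-Python sketch operating on a list of RGB tuples, not something I ran on the actual style image; the function name and the amount parameter are illustrative, and the gray value uses the common Rec. 601 luma weights.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def desaturate(pixels: List[Pixel], amount: float) -> List[Pixel]:
    """Blend each RGB pixel toward its gray value; amount=0 keeps it, 1 is fully gray."""
    out = []
    for r, g, b in pixels:
        gray = 0.299 * r + 0.587 * g + 0.114 * b  # Rec. 601 luma weights
        out.append(tuple(round(c + (gray - c) * amount) for c in (r, g, b)))
    return out


strong_red = [(220, 20, 30)]
print(desaturate(strong_red, 0.5))
```

In practice an image library would do this far faster, but the arithmetic is the same: pull each channel part of the way toward the pixel's gray value so the red no longer dominates.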

Georgia Totto O'Keeffe was an American artist known for her paintings of enlarged flowers, New York skyscrapers, and New Mexico landscapes.

I know very little about Georgia O'Keeffe but I happened upon a couple of works of flowers that I thought might be good choices for style transfer.

This one was very pleasant with light color tones of whites and greens, so based on prior experience it seemed like a good bet.

O'Keeffe Style Transfer Example

An Orchid by Georgia O'Keeffe. [11]

I like this one. It is not perfect, but it is pretty. Not sure if there's a way to just transfer the color information but that might have led to a better result in this case since the flower morphology does not add much when transferred to the aesthetic.

Mark Rothko was an American painter generally identified as an abstract expressionist.

Mark Rothko is another artist whose paintings I have always found somehow pleasing even if I could not articulate what exactly it was about the work that I liked. Another divisive art world figure I suppose.

Picked one I guessed would be good for style transfer.

Rothko Style Transfer Example

White Center (Yellow, Pink and Lavender on Rose) by Mark Rothko. [12]

The results were pretty bad but somewhat predictable in hindsight. Not sure this could be made to work much better given my newly found appreciation for what does and does not work well in the realm of image style transfer at least with the knob I know how to turn.

Far from leaving the best for last, here is my final attempt at style transfer for this post.

Wassily Wassilyevich Kandinsky was a Russian painter generally credited as the pioneer of abstract art.

This one from Kandinsky looked promising at the time I was collecting style images.

Kandinsky Style Transfer Example

Several Circles by Wassily Kandinsky. [13]

The result is less than awesome although it has grown on me a little. Too much black in the style image or too many training epochs.

I did collect many more style images during the exploration phase of writing this post, so I may get around to writing a follow-up post at some point, spending a little more time on aspects that were not discussed in any great detail: perhaps tuning, lessons learned, or alternative models. All in all, this was a pretty fun experiment with some cool output that I would not mind revisiting in the future.
