Bulk Labelling the Mona Lisa

Oren Zeev Ben Mordehai
Dec 5, 2022
Image by Peter Schmidt from Pixabay

When your dataset is not yet labelled, or you suspect that some of the labels were wrongly annotated, it would be nice to bootstrap and have a visual way to select the most interesting data points for (re)labelling / verification. This can be an iterative process in which you label some of the data points, train a model, use the model's predictions to (re)examine the data, and label some more. But how do you start, and what tools can help with this chore?

I came across two short videos from Explosion (the company behind spaCy and Prodigy) by Vincent Warmerdam on image labelling. Those two videos are packed with useful tips and know-how, and I strongly recommend watching them. I’ll do my best to share the main ideas below.

Quick, Draw! Eiffel tower (finding bad labels, straightforward embeddings).

Cats vs. Dogs (bootstrap labelling, pretrained CNN embeddings and task-specific bootstrapped embeddings).

The suggested pipeline is to create 2-D embeddings for each data point and to plot them as a point cloud. Then you can hopefully see a few interesting clusters emerge that are worth investigating. The next step is to select a cluster and examine a few data points from it, to check whether there is indeed something special about the images (say) in that cluster. Maybe it turns out that all the data points in the cluster should be labelled the same. Even if only 90% of them are cats, for example, it would be nice to have those defaulted to ‘cat’, which can speed up labelling with a tool such as Prodigy. Identifying a cluster of bad labels (and removing those data points or fixing the labels) can be a boon. You can also identify “hard” regions, where the embedding does not do a good job of separating the classes, and consider starting the manual labelling with examples from there.

Let’s enumerate the steps above:

  1. Create 2-D embeddings for the data points. The embeddings should be meaningful in the sense that they help in visually clustering different samples with some relevance to the downstream task(s). How to get there? It is a matter of trial and error, but Vincent suggests trying some SOTA n-dimensional embeddings and using UMAP as a dimensionality reduction technique to wind up with two dimensions (see the sketch after this list).
  2. Visualize as a scatter plot. This can be done in a Jupyter notebook. Your 2-D embeddings are the ‘x’ and the ‘y’ of each point.
  3. Visually select a cluster. We can identify an interesting rectangle and draw it on the scatter plot to verify that we got the relevant ‘x’s and ‘y’s.
  4. Examine data points from the selected cluster. We can sample data points that fall in that rectangle and, for each of those, plot the relevant image (or other relevant information). As a complementary tool to notebooks, Vincent provides the ‘bulk’ utility, installable as a Python package. The added value of ‘bulk’ is that we can select points visually, and the functionality for displaying the selected data points, as well as exporting them, is built in.
  5. Export the findings about the selected points in the format expected by the next tool in the chain. This should be done with some combination of scripts and/or the bulk utility, with knowledge and support of the target labelling tool.
  6. Use the findings in additional labelling tools and/or machine learning models, for example Prodigy by Explosion.
  7. Keep track of history of data points and labelling to avoid repetitive work and infinite cycles.
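
To make item 1 concrete, here is a minimal sketch for the Quick, Draw! bitmaps, using the raw 28×28 pixels as the straightforward embeddings and UMAP (from the umap-learn package) for the reduction. The file name is a placeholder for wherever you saved the category’s numpy bitmaps; x_tfm matches the snippet that follows.

import numpy as np
from umap import UMAP  # pip install umap-learn

# Quick, Draw! numpy bitmaps: one flattened 28x28 drawing per row.
imgs = np.load("mona_lisa.npy")     # placeholder path; shape (n_samples, 784)
X = imgs.astype("float32") / 255.0  # scale pixel values to [0, 1]

# Reduce the 784-D pixel vectors to 2-D for plotting.
x_tfm = UMAP(n_components=2, random_state=42).fit_transform(X)

Example Python snippet for item 1 above.
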
import matplotlib
import matplotlib.pyplot as plt

# Scatter plot of the 2-D embeddings; a low alpha reveals dense clusters.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x_tfm[:, 0], x_tfm[:, 1], alpha=0.002)
# Draw the visually selected cluster as a red rectangle.
ax.add_patch(
    matplotlib.patches.Rectangle((3.0, 6.6), 3, 2, color='r', fill=False))

Example Python snippet for items 2–3 above.
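
A possible continuation for item 4, before reaching for bulk: mask the points that fall inside the rectangle above and plot a sample of the corresponding drawings. The bounds follow from that rectangle; the sample size of 12 is arbitrary.

import numpy as np
import matplotlib.pyplot as plt

# Points inside the rectangle above: x in [3.0, 6.0], y in [6.6, 8.6].
inside = ((x_tfm[:, 0] > 3.0) & (x_tfm[:, 0] < 6.0) &
          (x_tfm[:, 1] > 6.6) & (x_tfm[:, 1] < 8.6))

# Sample a dozen drawings from the cluster and display them in a grid.
idx = np.random.choice(np.where(inside)[0], size=12, replace=False)
fig, axes = plt.subplots(3, 4, figsize=(8, 6))
for ax, i in zip(axes.ravel(), idx):
    ax.imshow(imgs[i].reshape(28, 28), cmap="gray_r")
    ax.axis("off")
plt.show()

Example Python snippet for item 4 above.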

I decided to replicate what Vincent shows in the first video. He was using images from Google’s Quick, Draw! Vincent specifically selected the Eiffel Tower dataset, where each image is a 28×28 monochrome bitmap, and in the video we learn to quickly find images that are not really good samples of an Eiffel Tower drawing. The dataset is interesting: users were given 20 seconds or so to quickly draw an image representing the Eiffel Tower or another category (345 categories in total). And surprisingly enough, a machine learning model can now figure out, very fast, what you are drawing (out of the given categories). But if we want to build such an ML model ourselves, it makes sense to first remove samples that are obviously wrong.

Vincent picked images of the Eiffel Tower, and since all those images are of the same category, we hope that in a good clustering visualization, whatever seems to be grouped together is indeed similar. In Vincent’s case, he got a single big blob of the “good examples” and a smaller, identifiable blob of mostly bad Eiffel Tower samples.

For me, with the Mona Lisa category, I found an even nicer pattern. I used a Jupyter notebook and the bulk utility. To prepare an input for the bulk utility, I used code similar to the snippet below:
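
Something along these lines, assuming the imgs and x_tfm arrays from the earlier snippets (the output directory and CSV name are placeholders of mine):

import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

# Write each 28x28 bitmap to disk as a PNG so bulk can display it.
out = Path("mona_lisa_pngs")
out.mkdir(exist_ok=True)
paths = []
for i, img in enumerate(imgs):
    p = out / f"{i}.png"
    plt.imsave(p, img.reshape(28, 28), cmap="gray_r")
    paths.append(str(p))

# bulk expects the 2-D coordinates plus a path column for the images;
# point bulk at the resulting CSV (see bulk's README for the exact command).
pd.DataFrame({"x": x_tfm[:, 0], "y": x_tfm[:, 1], "path": paths}).to_csv(
    "mona_lisa.csv", index=False)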

preparing a CSV for the bulk utility: x, y, path (disclaimer: a URL path may not be supported yet in bulk)
Created in a notebook (with matplotlib)

After examining the clusters above (in the bulk utility, actually), I believe the following tags are representative: ‘single frame’, ‘double frame’, ‘no frame’, and ‘not really Mona Lisa’.

Sample drawings from the clusters: no frame, single frame, double frame, and “Mona Lisa???”

