Ben Chuanlong Du's Blog

It is never too late to learn.

Label Image Data Quickly Without Crowdsourcing

If you have to label images for a project but have no budget for crowdsourcing, here are some simple tips that can significantly reduce the time spent on manual labeling.

Approach 0: Train a Model on Already Labeled Data and Use it to Label New Data

If you already have some labeled data, you can train a simple model on it and use it to help you label new data.

  1. Train a (simple) model on already labeled data.

  2. Run the model on new data to generate labels. If the model is not very accurate, have it output the probability for each class and accept a label only when the predicted probability is high.

  3. Manually revise the generated labels.

  4. Now that you have more labeled data, repeat steps 1-3 if necessary.
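Step 2 above can be sketched as follows. This is only a minimal illustration, assuming the model's output is already available as one list of class probabilities per image; the function name `split_by_confidence`, the 0.9 threshold, and the sample probabilities are hypothetical and not part of any particular library.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_confidence(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> Tuple[Dict[int, int], List[int]]:
    """Split predictions into auto-accepted labels and items left for review.

    probabilities[i][c] is the predicted probability that image i is class c.
    """
    accepted: Dict[int, int] = {}  # image index -> accepted class index
    needs_review: List[int] = []   # image indices left for manual labeling
    for index, row in enumerate(probabilities):
        # The predicted class is the one with the highest probability.
        best_class = max(range(len(row)), key=lambda c: row[c])
        if row[best_class] >= threshold:
            accepted[index] = best_class
        else:
            needs_review.append(index)
    return accepted, needs_review


# Hypothetical model output for three images and two classes.
probs = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]]
accepted, needs_review = split_by_confidence(probs, threshold=0.9)
```

Images listed in `needs_review` are the ones to label by hand, while the accepted labels still deserve a quick manual pass as described in step 3.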

This approach works well, but it assumes that you already have some labeled data. Below are some approaches for quickly getting labeled data from scratch.

Approach 1: Use Pretrained Models to Identify Similar Images

The most convenient way is perhaps to run pretrained vision models on your data to identify similar images.

  1. Run a pretrained model on your data.

  2. Pick an image that you want to label as, say, class A and check the label the pretrained model generates for it. Let's say that it is the numerical label 5.

  3. Select the images to which the pretrained model assigned the label 5 and move them into a subdirectory named 5.

  4. In most cases, the majority of the images selected into the subdirectory 5 are images that you want to label as class A. If not, you can try another pretrained model and repeat steps 1-3.

  5. Manually label images in the subdirectories.
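The file shuffling in step 3 can be sketched as below. It assumes the pretrained model's predictions have already been collected into a mapping from image file name to numerical label; how that mapping is produced depends on the model you pick and is not shown. The function name `move_into_label_dirs` and the argument names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict


def move_into_label_dirs(predictions: Dict[str, int], image_dir: str) -> Dict[int, int]:
    """Move each image into a subdirectory named after its predicted label.

    predictions maps an image file name (relative to image_dir) to the
    numerical label generated by the pretrained model.
    Returns the number of images moved into each subdirectory.
    """
    root = Path(image_dir)
    counts: Dict[int, int] = {}
    for file_name, label in predictions.items():
        source = root / file_name
        if not source.is_file():
            continue  # skip entries whose file is missing or was already moved
        target_dir = root / str(label)
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(source), str(target_dir / source.name))
        counts[label] = counts.get(label, 0) + 1
    return counts
```

A call such as `move_into_label_dirs({"cat1.jpg": 5, "cat2.jpg": 5}, "images")` would leave both files under `images/5/`, ready for the manual pass in step 5.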

Approach 2: Pretrained Models and K-means Clustering

This method is similar to approach 1. However, instead of using a pretrained model to generate final labels for your data, you extract image embeddings (or features) and then run a k-means algorithm on them.

  1. Pick a pretrained model and remove its output layer.

  2. Run the customized pretrained model on your data to generate image embeddings (or features).

  3. Pick a few representative images for each class and run the customized pretrained model on them to generate the initial center points for the clusters.

  4. Run a k-means algorithm on the image embeddings, using the center points generated in step 3 as the initial centers.
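Steps 3 and 4 can be sketched with NumPy as below, assuming the embeddings from step 2 are already available as an array with one row per image. Averaging the embeddings of each class's representative images to get its initial center is an assumption of this sketch, and the names `seeded_kmeans` and `nearest_center` as well as the toy data are hypothetical.

```python
import numpy as np


def nearest_center(embeddings, centers):
    """Return, for every embedding, the index of its closest center."""
    diffs = embeddings[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)


def seeded_kmeans(embeddings, seed_groups, n_iter=20):
    """Plain k-means whose initial centers come from hand-picked images.

    embeddings:  array of shape (n_images, n_features)
    seed_groups: one array per class, each of shape (n_seeds, n_features),
                 holding the embeddings of that class's representative images
    Returns (cluster index per image, final centers).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Assumption: the initial center of a class is the mean of its seed embeddings.
    centers = np.stack([np.asarray(g, dtype=float).mean(axis=0) for g in seed_groups])
    for _ in range(n_iter):
        assignments = nearest_center(embeddings, centers)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = embeddings[assignments == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return nearest_center(embeddings, centers), centers


# Toy embeddings: two well-separated groups of 20 points in 4 dimensions.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
seeds = [toy[:2], toy[20:22]]  # two representative "images" per class
labels, centers = seeded_kmeans(toy, seeds)
```

Because cluster `k` starts from the seeds of class `k`, the cluster indices line up with your classes. If you prefer a library implementation, scikit-learn's `KMeans` also accepts an array of initial centers through its `init` parameter.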

Approach 3: Rule-Based Algorithms

You can also develop a rule-based algorithm to help you separate images. However, this usually takes more time than approach 1.
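As a purely illustrative sketch of what such a rule might look like, the snippet below separates images by their mean brightness. The rule itself, the threshold of 128, and all names are hypothetical; it also assumes the images have already been loaded as rows of 0-255 grayscale pixel values. A real project would need rules tailored to its own classes.

```python
from typing import Callable, Dict, List, Sequence

GrayImage = Sequence[Sequence[int]]  # rows of 0-255 grayscale pixel values


def mean_brightness(image: GrayImage) -> float:
    """Average pixel value over the whole image."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)


def label_by_brightness(image: GrayImage, threshold: float = 128.0) -> str:
    """Hypothetical rule: call an image "bright" or "dark" by its mean pixel value."""
    return "bright" if mean_brightness(image) >= threshold else "dark"


def apply_rule(
    images: Dict[str, GrayImage],
    rule: Callable[[GrayImage], str],
) -> Dict[str, List[str]]:
    """Group image names by the label that the rule assigns to each image."""
    groups: Dict[str, List[str]] = {}
    for name, image in images.items():
        groups.setdefault(rule(image), []).append(name)
    return groups


# Two tiny made-up 2x2 "images".
toy_images = {
    "night.png": [[10, 20], [30, 40]],
    "snow.png": [[250, 240], [230, 220]],
}
groups = apply_rule(toy_images, label_by_brightness)
```

Keeping the rule as a separate function passed to `apply_rule` makes it easy to swap in a different rule once you see where the first one fails.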
