‘Google Images’ is a great source of relevant images when constructing a dataset for a classification problem. Let’s take the problem of classifying movie posters by genre. We’re going to take three classes that have the least overlap: romance, horror, and superhero.
Creating the Dataset
Getting a list of URLs:
The first step is to get a list of URLs from where we can download our images. To do this, go to Google Images and search for the images you are interested in. Scroll down until you’ve seen all the images you want to download, or until you see a button that says ‘Show more results’. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
Then, we upload the URLs’ files to our working directory, in the folder created for our movie dataset:
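As a sketch, the folder structure can be set up with plain Python before uploading the URL files (the ‘data/posters’ root and the folder names here are assumptions, not necessarily the exact layout used in this project):

```python
from pathlib import Path

# Hypothetical layout: one subfolder per genre under a dataset root,
# plus one uploaded URL file per genre (e.g. urls_horror.csv).
classes = ['romance', 'horror', 'superhero']
path = Path('data/posters')

for c in classes:
    (path / c).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in path.iterdir()))
```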
Downloading images and viewing the data:
We can use the ‘download_images’ function provided by fast.ai to download all the images using the URL files, and the ‘show_batch’ method to view them:
Training the Model
Now that our dataset is ready, we can train the model. I’ve used pre-trained weights of the ResNet-34 model for transfer learning.
We have an accuracy of 80.5% now.
Often, low accuracy is due to problems with the dataset, such as mislabeled images in the training or validation set, images that could plausibly belong to more than one class, and so on. We can look at the confusion matrix and the most-confused images as follows:
In most cases, we can see clearly why the classifier got confused. For example, the first movie poster (Guardians of the Galaxy) is mislabeled as ‘romance’ in our dataset; the second image has nothing scary in it to justify its ‘horror’ label; the 5th poster (Batman) looks a lot like a ‘horror’ movie poster; the 6th poster seems irrelevant; and the 8th poster (Jack Reacher) would be impossible even for a human to classify as a ‘superhero’ movie from the poster alone, without any other context.
Now that we have identified the images causing problems, we can clean up our dataset by deleting them. We could either use the ‘ImageCleaner’ widget provided by fast.ai or write a custom function to view the images with the top losses and delete the ones we feel are problematic (make sure you don’t delete images that evidently belong to their labeled class just because the classifier got them wrong). Also, create a copy of the original dataset, in case you want to go back to the old version.
After you’re done with cleaning, train the model again to see if there’s any improvement.
Here are my results, after cleaning up the dataset and training again:
Testing our classifier
Find an image that doesn’t belong to your dataset to test your classifier. Upload it to your working directory and test as follows:
In order to try and improve the classifier, I thought of and tried out a few techniques:
1. Reducing the number of training images: The motivation behind this was that as you scroll down in a Google Images search, the pictures get less relevant. So, to keep irrelevant pictures out of my dataset, I limited the number of images to 80 per class. I got an accuracy of 68.8% before unfreezing, and 82.2% after doing so. Even after cleaning up, the accuracy didn’t improve much; in fact, it went lower because there were fewer images. This clearly shows that a well-trained model needs a good amount of data.
2. Increasing the number of training images: By using 600 images, I noticed that the accuracy without unfreezing went down to 43%, and after unfreezing and cleaning up, the best I could get was 49.4%. The low accuracy is most probably due to the larger number of irrelevant images, as mentioned in the previous point. It turns out it’s better to have fewer but more relevant images.
3. Handpicking images: Although highly impractical in most cases, it would be interesting to try and handpick about 100 images in each category and see the outcome.
4. Using ResNet-50 instead of ResNet-34 for transfer learning: In transfer learning, the base model we use is very important. We could try out different models and pick the best one.
5. Adding/changing the classes: During this experiment, I noticed that horror and superhero movie posters have a very similar theme (dark, gloomy, serious) and trying out more mutually exclusive classes would probably give better results.
Our classifier has gotten a general idea of how typical movie posters of the three chosen genres look. Although we tried to minimize overlap between classes, it turns out that many superhero movie posters look like horror posters, many horror posters look romantic, and so on. Still, an accuracy of 87.2% is pretty good, given that we built the dataset from scratch and the classes had a high chance of overlapping.