MIT removes dataset that leads to misogynistic, racist AI models

Ryan Daws is a senior editor at TechForge Media, with a seasoned background spanning over a decade in tech journalism. His expertise lies in identifying the latest technological trends, dissecting complex topics, and weaving compelling narratives around the most cutting-edge developments. His articles and interviews with leading industry figures have gained him recognition as a key influencer by organisations such as Onalytica. Publications under his stewardship have since gained recognition from leading analyst houses like Forrester for their performance. Find him on X (@gadget_ry) or Mastodon.

MIT has apologised for, and taken offline, a dataset which trains AI models with misogynistic and racist tendencies.

The dataset in question is called 80 Million Tiny Images and was created in 2008. Designed for training AIs to detect objects, the dataset is a huge collection of pictures which are individually labelled based on what they feature.

Machine-learning models are trained using these images and their labels. Feed an image of a street into an AI trained on such a dataset and it can tell you what the scene contains: cars, streetlights, pedestrians, and bikes.

Two researchers – Vinay Prabhu, chief scientist at UnifyID, and Abeba Birhane, a PhD candidate at University College Dublin in Ireland – analysed the images and found thousands of concerning labels.

MIT’s training set was found to label women as “bitches” or “whores,” and people from BAME communities with the kind of derogatory terms I’m sure you don’t need me to write. The Register notes the dataset also contained close-up images of female genitalia labelled with the C-word.

The Register alerted MIT to the issues Prabhu and Birhane found with the dataset, and the institution promptly took it offline. MIT went a step further and urged anyone using the dataset to stop using it and delete any copies.

A statement on MIT’s website claims it was unaware of the offensive labels and they were “a consequence of the automated data collection procedure that relied on nouns from WordNet.”

The statement goes on to explain that, because the dataset contains 80 million images at just 32×32 pixels each, manual inspection would be almost impossible and could not guarantee that every offensive image had been removed.
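To illustrate why automated collection of this kind can go wrong, here is a minimal sketch of the pattern MIT describes: every noun in a word list becomes a search query, and whatever images come back inherit that noun verbatim as their label, with no human review step. The function names and noun list below are hypothetical placeholders for illustration, not MIT’s actual pipeline or the real WordNet noun set.

```python
def fetch_thumbnails(query, limit=3):
    """Placeholder: a real pipeline would call an image-search API here
    and download thumbnails for the query term."""
    return [f"{query}_{i}.png" for i in range(limit)]

def build_dataset(nouns):
    """Label every fetched image with the noun used to find it.

    Note there is no review step anywhere in this loop: if the noun
    list contains slurs, the dataset inherits them unchecked.
    """
    dataset = []
    for noun in nouns:
        for image in fetch_thumbnails(noun):
            dataset.append((image, noun))  # label inherited verbatim
    return dataset

# Hypothetical noun list standing in for WordNet's noun synsets.
sample = build_dataset(["car", "streetlight"])
print(sample[0])  # ('car_0.png', 'car')
```

At 80 million images, even spot-checking a fraction of the output of such a loop is impractical, which is the problem MIT’s statement points to.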

“Biases, offensive and prejudicial images, and derogatory terminology alienates an important part of our community – precisely those that we are making efforts to include. It also contributes to harmful biases in AI systems trained on such data,” wrote Antonio Torralba, Rob Fergus, and Bill Freeman from MIT.

“Additionally, the presence of such prejudicial images hurts efforts to foster a culture of inclusivity in the computer vision community. This is extremely unfortunate and runs counter to the values that we strive to uphold.”

A full pre-print copy of Prabhu and Birhane’s paper is available here (PDF).

(Photo by Clay Banks on Unsplash)

