LAION, the German research org that created the data used to train Stable Diffusion, among other generative AI models, has released a new data set that it claims has been “thoroughly cleaned of known links to suspected child sexual abuse material (CSAM).”
The new data set, Re-LAION-5B, is actually a re-release of an old data set, LAION-5B, but with “fixes” implemented on recommendations from the nonprofit Internet Watch Foundation, the Canadian Centre for Child Protection and the now-defunct Stanford Internet Observatory. It’s available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe, both of which were filtered for thousands of links to known and suspected CSAM, LAION says.
The release of Re-LAION-5B comes after a December 2023 investigation by the Stanford Internet Observatory found that LAION-5B (specifically a subset called LAION-5B 400M) included links to at least 1,679 illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained “a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.”
While the Stanford co-authors of the report noted that it would be difficult to remove the offending content and that the presence of CSAM doesn’t necessarily influence the output of models trained on the data set, LAION said it would temporarily take the data sets offline.
The Stanford report recommended that models trained on LAION-5B “should be deprecated and distribution ceased where feasible.” Perhaps relatedly, AI startup Runway recently removed its Stable Diffusion 1.5 model from the model hosting platform Hugging Face; we’ve reached out to the company for more information. (Runway in 2022 partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model.)
The new Re-LAION-5B data set contains around 5.5 billion text-image pairs and is released under an Apache license. LAION says that third parties can use its metadata to clean existing copies of LAION-5B by removing the matching illegal content.
“In all, 2,236 links [to suspected CSAM] were removed after matching with the lists of link and image hashes provided by our partners,” LAION wrote in a blog post. “These links also subsume 1008 links found by the Stanford Internet Observatory report in December 2023.”
It’s important to note that LAION’s data sets don’t, and never did, contain images. Rather, they’re indexes of links to images and image alt text that LAION scrapes.