(Ken Silva, Headline USA) A massive dataset used to power artificial intelligence models has been removed by the organization that created it due to the discovery of child pornography, according to a Thursday report from 404 Media.
Citing a study from the Stanford Internet Observatory, 404 Media reported that the LAION-5B dataset includes thousands of illegal images, a figure that does not count the intimate imagery published and gathered non-consensually.
“If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have [child porn], unless you took some extraordinary measures to stop it,” David Thiel, lead author of the study and chief technologist at the Stanford Internet Observatory, told 404 Media.
The LAION-5B machine learning dataset, used by Stable Diffusion and other major AI products, was taken down out of “an abundance of caution,” a LAION spokesperson told 404 Media.
The LAION-5B dataset is used to train the most popular AI generation models currently on the market. It is reportedly made up of more than five billion links to images scraped from the open web, including from user-generated content on social media platforms.
Researchers reportedly believe the dataset contained child porn because it indiscriminately collected data from across the open internet.
“Child abuse material likely got into LAION because the organization compiled the dataset using tools that scrape the web, and CSAM isn’t relegated to the realm of the ‘dark web,’ but proliferates on the open web and on many mainstream platforms,” 404 Media reported.
“In 2022, Facebook made more than 21 million reports of CSAM to the National Center for Missing and Exploited Children (NCMEC) tipline, while Instagram made 5 million reports, and Twitter made 98,050.”
Responding to the misguided notion that a few thousand child porn images won’t affect a dataset of billions, 404 Media said it’s the real-life victims who are affected most.
“[Victims] knowing that their content is in a dataset that’s allowing a machine to create other images—which have learned from their abuse—that’s not something I think anyone would have expected to happen, but it’s clearly not a welcome development,” Dan Sexton, chief technology officer at the UK-based Internet Watch Foundation, told the outlet.
“For any child that’s been abused and their imagery circulated, excluding it anywhere on the internet, including datasets, is massive.”
Ken Silva is a staff writer at Headline USA. Follow him at twitter.com/jd_cashless.