AI Tools Are Secretly Training on Real Images of Children

Over 170 images of, and personal details about, children from Brazil have been scraped into an open-source dataset without their knowledge or consent and used to train AI, claims a new report from Human Rights Watch released Monday.

The images were scraped from content posted as recently as 2023 and as far back as the mid-1990s, according to the report, long before any internet user could have anticipated that their content might be used to train AI. Human Rights Watch claims that personal details of these children, alongside links to their photographs, were included in LAION-5B, a dataset that has been a popular source of training data for AI startups.

“Their privacy is violated in the first instance when their photo is scraped and swept into these datasets. And then these AI tools are trained on this data and therefore can create realistic imagery of children,” says Hye Jung Han, children’s rights and technology researcher at Human Rights Watch and the researcher who found these images. “The technology is developed in such a way that any child who has any photo or video of themselves online is now at risk because any malicious actor could take that photo, and then use these tools to manipulate them however they want.”

LAION-5B is based on Common Crawl—a repository of data that was created by scraping the web and made available to researchers—and has been used to train several AI models, including Stability AI’s Stable Diffusion image generation tool. Created by the German nonprofit organization LAION, the dataset is openly accessible and now includes more than 5.85 billion pairs of images and captions, according to its website.

The images of children that researchers found came from mommy blogs and other personal, maternity, or parenting blogs, as well as stills from YouTube videos with small view counts, seemingly uploaded to be shared with family and friends.

“Children should not have to live in fear that their photos might be stolen and weaponized against them,” says Hye.

LAION did not immediately respond to WIRED's request for comment, but it confirmed to Human Rights Watch that the images its researchers identified exist and agreed to remove them. Still, Hye worries that what she was able to find is just the beginning. Her team reviewed only a “tiny slice” of the data, she says, less than 0.0001 percent of all the data in LAION-5B, and she suspects that similar images have found their way into the dataset from all over the world.

Last year, a German ad campaign used an AI-generated deepfake to caution parents against posting photos of children online, warning that their children's images could be used to bully them or to create child sexual abuse material (CSAM). But that does not address images that are already published, or that are decades old but still available online.

“Removing links from a LAION dataset does not remove this content from the web,” says Tyler, a spokesperson for LAION. The images can still be found and used, even if not through LAION. “This is a larger and very concerning issue, and as a nonprofit, volunteer organization, we will do our part to help.”

Hye says the responsibility to protect children and their parents from this type of abuse falls on governments and regulators. Brazil's legislature is currently considering laws to regulate deepfake creation, and in the US, Representative Alexandria Ocasio-Cortez of New York has proposed the DEFIANCE Act, which would allow people to sue if they can prove that a deepfake of their likeness was made without their consent.

“I think that children and their parents shouldn’t be made to shoulder responsibility for protecting kids against a technology that’s fundamentally impossible to protect against,” Hye says. “It’s not their fault.”

Source: Wired