Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content

In 2023, OpenAI told the UK parliament that it was “impossible” to train leading AI models without using copyrighted materials. It’s a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.

Two announcements Wednesday offer evidence that large language models can in fact be trained without the permissionless use of copyrighted materials.

A group of researchers backed by the French government have released what is thought to be the largest AI training dataset composed entirely of text that is in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification for a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built in a different way to the AI industry’s contentious norm.

“There’s no fundamental reason why someone couldn’t train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after quitting his executive role at image generation startup Stability AI because he disagreed with its policy of scraping content without permission.

Fairly Trained offers a certification to companies willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain. When the nonprofit launched, some critics pointed out that it hadn’t yet identified a large language model that met those requirements.

Today, Fairly Trained announced it has certified its first large language model. It’s called KL3M and was developed by Chicago-based legal tech consultancy startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents.

The public domain dataset, called Common Corpus, is a collaboration coordinated by the French startup Pleias, in association with a variety of other AI groups, including Allen AI, Nomic AI, and EleutherAI. It’s backed by the French Ministry of Culture and claims to include the largest open dataset to date in French. It aspires to be multicultural, though, as well as multipurpose: a way to offer researchers and startups across a wide variety of fields access to a vetted training set, free from concerns over potential infringement.

The new dataset also comes with limitations. A lot of public domain data is antiquated—in the US, for example, copyright protection usually lasts over seventy years from the death of the author—so this type of dataset won’t be able to ground an AI model in current affairs or, say, how to spin up a blog post using current slang. (On the flip side, it might write a mean Proust pastiche.)

“As far as I am aware, this is currently the largest public domain dataset to date for training LLMs,” says Stella Biderman, the executive director of EleutherAI, an open source, collective project that releases AI models. “It’s an invaluable resource.”

Projects like this are also exceedingly rare. No other LLMs besides 273’s have been submitted to Fairly Trained for certification. But some who want to make AI fairer to artists whose works have been slurped into systems like GPT-4 hope Common Corpus and KL3M can demonstrate that there is a pocket of the AI world skeptical of arguments justifying permissionless data scraping.

“It’s a selling point,” says Mary Rasenberger, CEO of the Authors Guild, which represents book authors. “We’re starting to see much more licensing, and requests for licensing. It’s a growing trend.” The Authors Guild, along with actors and radio artists labor union SAG-AFTRA and a few additional professional groups, was recently named an official supporter of Fairly Trained.

Although it doesn’t have additional LLMs on its docket, Fairly Trained recently certified its first company to offer AI voice models, the Spanish voice-changing startup VoiceMod, as well as its first “AI band,” a heavy-metal project called Frostbite Orckings.

“We were always going to see legally and ethically created large language models spring up,” says Newton-Rex. “It just took a bit of time.”

Updated March 20, 2024, 2:45 pm EST: The Common Corpus dataset contains 500 billion tokens, not 500 million.

Source: Wired