So far, when AI companies have trained on YouTube’s invaluable stash of videos, captions, and other content, they’ve done so without permission. An AI-focused content licensing startup called Calliope Networks is hoping to change that with its new “License to Scrape,” a program aimed directly at YouTube stars.
“There’s obvious demand from AI companies to scrape YouTube content. We see that by their actions. So what we’re trying to do is to create a tool that makes it legal and simple for them,” says Calliope Networks CEO Dave Davis. Unlike other big social platforms, like Reddit, YouTube hasn’t struck deals with AI bigwigs to scrape its videos. The appeal of the License to Scrape is that it sidesteps the company itself, providing a large volume of YouTube content in one go by corralling a group of creators and negotiating a blanket license on their behalf.
Davis has a background in traditional media licensing; he left a gig at the Motion Picture Licensing Corporation to launch Calliope, betting that the AI industry would eventually move away from permissionless scraping and toward licensing as a norm. He’s not alone in this belief; it’s a boom time for AI data licensing startups. Calliope Networks is a founding member of the Datasets Providers Alliance, a trade group that advocates requiring creators and rights holders to opt in before their work can be scraped.
Here’s how Davis hopes it’ll work: YouTube creators who want to license their data will enter into a contract with Calliope, which will then sublicense their work out for training generative AI foundational models. The program will need a critical mass of content to make the deal attractive to the AI players, so it must get YouTubers on board before it can properly get up and running. Calliope would take a percentage of the licensing fees paid by the AI companies.
Although there’s nothing quite like this in the AI world yet, Davis modeled the scraping license format off other parts of the entertainment industry, like Broadcast Music Inc. (BMI) and the American Society of Composers, Authors, and Publishers (ASCAP), which both use blanket licenses for music.
“It’s early in the recruitment process,” Davis says. He estimates that Calliope will need to offer a minimum of 25,000 to 50,000 hours of YouTube content before it’s taken seriously by the AI industry. That this volume of footage is the likely threshold for blanket licenses demonstrates why banding together could be some creators’ best bet for making money from AI training: in this business, volume matters, and video generators are powered by vast amounts of data.
There aren’t any marquee names endorsing the license yet, but Calliope has already enlisted a few influencer marketing agencies, like Viral Nation, to get clients on board. “I’ve been getting really good feedback from creators,” says Bianca Serafini, Viral Nation’s head of content licensing. She is confident that a large portion of the company’s client roster, which includes close to 900 YouTubers, will participate. “No one has presented something like this to us before.”
And what does YouTube make of all this? Davis hasn’t directly worked with the company on this project, but he believes it’s in line with the video behemoth’s wishes. “My take is that YouTube wants to give creators more control,” Davis says.
While YouTube won’t comment on specific licensing companies, it does indeed support its users striking their own agreements. “Generally speaking, creators can enter into deals with third-party companies regarding their content on our platform,” says YouTube spokesperson Jack Malon, who noted that the company recently published a blog post emphasizing its intentions to allow YouTubers “more control” in the age of AI. The crucial thing for YouTube is authorization, or getting explicit permission: “Unauthorized access of creator content is prohibited by YouTube’s Terms of Service, and we’ll continue to employ measures to ensure third parties respect these terms.”
Whether the License to Scrape program succeeds will depend on more than just securing big-name YouTubers. It will require a major shift in how AI companies approach foundational training. With more than 30 copyright cases involving permissionless data-scraping winding through US courts, that type of shift may end up legally mandated. Even if it isn’t, text-to-video generation tools need large amounts of high-quality data to work well, and the hunt for more sources of that data may itself necessitate a different approach.
Until then, though, it’s not at all clear that the AI bigwigs plan to stop scraping what they call “publicly available” data from websites like YouTube. (When they do reach agreements that include foundational model training, like video-focused AI startup Runway inking a deal with movie studio Lionsgate, the data involved is typically not “publicly available.”) Most of the deals they are striking with platforms and publishers are focused on providing content for AI search products like SearchGPT rather than foundational model training. Recently, after it received a legal threat from the popular UK-based parenting forum Mumsnet, OpenAI told WIRED that it is primarily interested in licensing large datasets that aren’t publicly available.
In the meantime, supporters of this project believe it’s time to press forward, rather than wait for AI companies to signal interest. “We just have to get ahead of this,” Serafini says.
Source: Wired