Grand Theft AI

Surprise! Your favorite chatbot's brain was trained using material stolen from the Web.

Never steal anything small. Source: Midjourney.

When I was just a fledgling editor, shortly before the invention of electricity, my boss at the time called me into his office and offered me these sage words of advice: "Good writers borrow," he said. "Mature writers steal." [1]

I think he was trying to tell me that I needed to start with things that already work, and build on them. That, or he was just messing with my head. But his words stuck. 

It turns out that stealing from existing writers (and artists, photographers, musicians, etc.) is also how mature AI companies operate. An expansive story in the New York Times this week reveals just how much material companies like OpenAI, Google, Meta, et al. purloined to train their very sophisticated, very expensive Large Language Models – the nerd-brains behind ChatGPT, Google Gemini, and their chatbot cousins. [2]

How much did they steal? How about nearly everything on the Internet, and then some? And they did it knowing full well they were violating millions of copyrights along the way.

Per the Times:

The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law.... 

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming A.I. industry. Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates. 

Note the careful way the Times describes this. By "bending the law," they mean twisting it into the world's largest pretzel.

This is now the world's second largest pretzel. Source: Guinness World Records

Steal first, get sued later

None of this is terribly shocking. What's amazing is the brazenness. 

OpenAI went so far as to build a speech-recognition tool called Whisper and use it to transcribe the audio from more than a million hours of YouTube videos, which very much violates Google's policies against automated data extraction (aka "scraping").

But it turns out that Google also harvested YouTube videos to train its own AI. To make that legal, the company's phalanx of lawyers tweaked the language in YouTube's privacy policies to allow this type of use, rolling out the changes over the July 4th weekend in the hope that nobody would notice.

Source: NY Times, by way of Google.

Nothing unusual about that, right? [3]

"Steal first, get sued later" is the mantra of Silicon Valley. That's how YouTube got created in the first place. Long before Google acquired it, YouTube's founders were blatantly uploading copyrighted material to the site. Viacom eventually sued YouTube to the tune of $1 billion for alleged copyright violations. And by "alleged," I mean absolutely without a doubt hell yes they did. Ars Technica has the scoop on some of the dirt that came out in that suit.

It’s always the emails that get you. Source: Ars Technica

Seven years later the suit was settled on undisclosed terms. Much of Viacom's argument was undercut by federal legislation that shields the owners of digital platforms from liability for copyright violations committed by their users. [4]

In 2005 Google was sued by the Authors Guild for scanning millions of published works for its Google Books project. Ten years later the courts decided that suit in Google's favor, ruling that because Google displayed only snippets of each work, Books fell under the Fair Use rules that protect the reproduction of copyrighted works when the use is "transformative." [5]

Copyrights and wrongs

The Times is suing OpenAI and Microsoft for allegedly scraping millions of Times articles to train the brains behind ChatGPT and Copilot. The paper joins a growing list of folks whose content was used to make these chatbots smarter, and who also want a seat on the AI money train.

The AI companies are leaning hard on the "transformative" argument, and they might win. But given the vast amounts of revenue generative AI could produce, they shouldn't get away without paying something for their crimes.

The other interesting tidbit that emerged from the Times story is even more shocking: these companies are running out of data to train the next generation of chatbots. In other words, they've used up the Internet. The next version of your favorite chatbot may have to be trained on "synthetic data" — fake content generated by other AI engines.

Imagine what that will be like. We know what happens when families intermarry; soon we'll get to see the AI version of the Habsburg chin.

How much do you think companies should pay for using our data to train their AI brains? Share your thoughts in the comments or email me: crankyolddan AT gmail DOT com.

[1] It turns out he stole that quote from T. S. Eliot. So I guess he was right?

[2] Technically, I should say they borrowed this material — but without the permission or compensation of the people who created it. Given how much money these companies are worth, and how much cash they're likely to generate as a result, that feels like theft to me.

[3] Yet another reason why privacy policies are generally not worth the paper they're not printed on. Companies can change them at will for any reason, and 99 percent of the time nobody even bothers to check.

[4] The Digital Millennium Copyright Act (DMCA), one of the most poorly written, widely abused pieces of tech legislation ever passed.

[5] Apparently, attaching a search function to parts of an existing work is "transformative" in the courts' eyes. I'm a big fan of Fair Use rules if they're applied fairly; unfortunately, the courts almost always favor corporate interests, regardless of which side of the argument they're on. Most of our nation's aging federal judges are still looking for that elusive "any" key.
