LLMs can hoover up data from books, judge rules
Anthropic scores a qualified victory in fair use case, but gets slapped for using over 7 million pirated copies
Iain Thomson
Tue 24 Jun 2025 // 19:52 UTC
One of the most tech-savvy judges in the US has ruled that Anthropic is within its rights to scan purchased books to train its Claude AI model, but that pirating content is legally out of bounds.
In training its model, Anthropic bought millions of books, many second-hand, then cut them up and digitized the content. It also downloaded over 7 million pirated books from the Books3 dataset, Library Genesis (Libgen), and the Pirate Library Mirror (PiLiMi), and that was the sticking point for Judge William Alsup of California's Northern District court.
On Monday, he ruled that simply digitizing a print copy counted as fair use under current US law: because the printed pages were destroyed after they were scanned, the copyrighted work was not duplicated. However, Anthropic may have to face trial over the use of pirated material.
"As Anthropic trained successive LLMs, it became convinced that using books was the most cost-effective means to achieve a world-class LLM," Alsup wrote [PDF] in Monday's ruling. "During this time, however, Anthropic became 'not so gung ho about' training on pirated books 'for legal reasons.' It kept them anyway."
The case was filed by three authors - Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson - who claimed that Anthropic illegally used their fiction and non-fiction works to train Claude. At least two of each author's books were included in the pirated material used by Anthropic.
Alsup noted that Anthropic hired the former head of partnerships at Google’s book-scanning project, Tom Turvey, who began conversations with publishers about licensing content, as other AI developers have done. But these talks were abandoned in favor of simply buying millions of books, taking the pages out, and scanning them, which the judge ruled was fair use.
"We are pleased that the Court recognized that using 'works to train LLMs was transformative — spectacularly so,'" an Anthropic spokesperson told The Register.
"Consistent with copyright’s purpose in enabling creativity and fostering scientific progress, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different."
On the matter of piracy, however, Alsup noted that in January or February 2021, Anthropic cofounder Ben Mann "downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated." In June, he downloaded "at least five million copies of books" from Libgen, and in July 2022, another two million copies were downloaded from PiLiMi, both of which Alsup classified as "pirate libraries."