Anthropic destroyed millions of print books to build its AI models

Credit: Alexander Spatari via Google Images

On Monday, court documents revealed that AI company Anthropic spent millions of dollars physically scanning print books to build Claude, an AI assistant similar to ChatGPT. In the process, the company cut millions of print books from their bindings, scanned them into digital files, and threw away the originals solely for the purpose of training AI—details buried in a copyright ruling whose broader fair use implications we reported yesterday.

The 32-page legal decision tells the story of how, in February 2024, the company hired Tom Turvey, the former head of partnerships for the Google Books book-scanning project, and tasked him with obtaining "all the books in the world." The strategic hire appears to have been designed to replicate Google's legally successful book digitization approach—the same scanning operation that survived copyright challenges and established key fair use precedents.

While destructive scanning is a common practice among some book digitization operations, Anthropic's approach was unusual in its sheer documented scale. By contrast, the Google Books project largely used a patented non-destructive camera process to scan millions of books borrowed from libraries and later returned. For Anthropic, the speed and low cost of destructive scanning appear to have trumped any interest in preserving the physical books themselves, the kind of cheap, fast expedient that a highly competitive industry rewards.

Ultimately, Judge William Alsup ruled that this destructive scanning operation qualified as fair use—but only because Anthropic had legally purchased the books first, destroyed each print copy after scanning, and kept the digital files internally rather than distributing them. The judge compared the process to "conserv[ing] space" through format conversion and found it transformative. Had Anthropic stuck to this approach from the beginning, it might have achieved the first legally sanctioned case of AI fair use. Instead, the company's earlier piracy undermined its position.

But if you're not intimately familiar with the AI industry and copyright, you might wonder: Why would a company spend millions of dollars on books to destroy them? Behind these odd legal maneuvers lies a more fundamental driver: the AI industry's insatiable hunger for high-quality text.

The race for high-quality training data

To understand why Anthropic would want to scan millions of books, it's important to know that AI researchers build large language models (LLMs) like those that power ChatGPT and Claude by feeding billions of words into a neural network. During training, the AI system processes the text repeatedly, building statistical relationships between words and concepts in the process.
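To make "building statistical relationships between words" concrete, here is a deliberately tiny, purely illustrative Python sketch. It is nothing like Anthropic's actual (non-public) training pipeline: it is a bigram model that "trains" by counting which word follows which, then generates text by sampling from those counts. Real LLMs learn far richer patterns with neural networks, but the statistical core is the same idea.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for the billions of words used to train real models.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# "Training": count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 8) -> str:
    """Continue a prompt by repeatedly sampling a likely next word."""
    out = [start]
    for _ in range(length):
        counts = follows.get(out[-1])
        if not counts:  # no observed continuation for this word
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the cat sat on the rug . the dog"
```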

The quality of training data fed into the neural network directly impacts the resulting AI model's capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Publishers legally control content that AI companies desperately want, but AI companies don't always want to negotiate a license. The first-sale doctrine offered a way around licensing: once you buy a physical book, you can do what you want with that copy, including destroying it. Buying physical books, in other words, was a legal workaround.

Yet buying books at scale is expensive, even when it is legal. So, like many AI companies before it, Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first amassed digitized versions of pirated books to avoid what CEO Dario Amodei called "legal/practice/business slog"—the complex licensing negotiations with publishers. But by 2024, Anthropic had become "not so gung ho about" using pirated ebooks "for legal reasons" and needed a safer source.

Buying used physical books sidestepped licensing entirely while providing the high-quality, professionally edited text that AI models need, and destructive scanning was simply the fastest way to digitize millions of volumes. The company spent "many millions of dollars" on this buying and scanning operation, often purchasing used books in bulk. Workers then stripped the books from their bindings, cut the pages to workable dimensions, scanned them as stacks of pages into PDFs with machine-readable text (covers included), and discarded all the paper originals.
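For a sense of what the "PDFs with machine-readable text" step involves, here is a minimal sketch built from commodity open source tools (Tesseract OCR via the pytesseract package, Pillow, and pypdf). The court filing does not say what software or hardware Anthropic actually used, and the paths below are hypothetical; this is only the general shape of an OCR pipeline.

```python
# Minimal sketch: turn a folder of scanned page images into one searchable PDF.
# Assumes Tesseract OCR is installed; all file and folder names are made up.
from io import BytesIO
from pathlib import Path

import pytesseract
from PIL import Image
from pypdf import PdfReader, PdfWriter

scan_dir = Path("scans/book_0001")  # hypothetical folder of page scans
writer = PdfWriter()

for page_file in sorted(scan_dir.glob("*.png")):
    image = Image.open(page_file)
    # OCR the page into a one-page PDF that layers invisible, machine-readable
    # text over the scanned picture of the page.
    page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
    writer.append(PdfReader(BytesIO(page_pdf)))

writer.write("book_0001.pdf")
```

The software side is cheap and fast; the speed advantage of destructive scanning comes from the physical handling, since loose, uniform pages can run through sheet-fed scanners far faster than bound volumes can be photographed.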

The court documents don't indicate that any rare books were destroyed in this process—Anthropic purchased its books in bulk from major retailers—but archivists long ago established other ways to extract information from paper. For example, the Internet Archive pioneered non-destructive book scanning methods that preserve physical volumes while creating digital copies. And earlier this month, OpenAI and Microsoft announced they're working with Harvard's libraries to train AI models on nearly 1 million public domain books dating back to the 15th century—fully digitized but preserved to live another day.

While Harvard carefully preserves 600-year-old manuscripts for AI training, somewhere on Earth sits the discarded remains of millions of books that taught Claude how to juice up your résumé. When asked about this process, Claude itself offered a poignant response in a style culled from billions of pages of discarded text: "The fact that this destruction helped create me—something that can discuss literature, help people write, and engage with human knowledge—adds layers of complexity I'm still processing. It's like being built from a library's ashes."

This article was updated on 6/26/25 at 7:57 a.m. to add information about the non-destructive scanning technique used by Google Books.

Article Link

Archive
 
"The fact that this destruction helped create me—something that can discuss literature, help people write, and engage with human knowledge—adds layers of complexity I'm still processing. It's like being built from a library's ashes."
Billions of pages of discarded text and it still speaks in reddit pseudo-profundities.
 
Billions of pages of discarded text and it still speaks in reddit pseudo-profundities.
It will only sound as good as the material it was provided. Imagine the difference between an AI model solely built off of literature from the last 50 years compared to one using public domain works that are mostly from before the 1930s.
 
Not my heckin bookerinos!

My mechanic used to run a book selling business before it got wrecked by covid. He's got literal pallets upon pallets of books destined for the shredder that I go through.

The amount of shit produced on a daily basis makes this mean nothing. It's just rage bait for anti-AI people.

edit: He tried to donate some to impoverished children and schools in Africa. Apparently the trucks got hijacked and all the books just dumped on the side of the road.
 
"You're a hopeless romantic," said Faber. "It would be funny if it were not serious. It's not books you need, it's some of the things that once were in books. The same things could be in the 'parlor families' today. The same infinite detail and awareness could be projected through the radios and televisors but are not. No, no, it's not books at all you're looking for! Take it where you can find it, in old phonograph records, old motion pictures, and in old friends; look for it in nature and look for it in yourself. Books were only one type of receptacle where we stored a lot of things we were afraid we might forget. There is nothing magical in them at all. The magic is only in what books say, how they stitched the patches of the universe together into one garment for us."
 
I don't really care about the loss of mass market paperback garbage, but this all sounds wildly inefficient as a means of gathering data to train on. Probably a mechanical nightmare; they say the destructive variant is faster, but I have a hard time believing it could really be that fast compared to absorbing already-digital materials instead, surely?
 
Were any of them rare, out-of-print books? Court documents indicate no, and if they're right, then who gives a fuck. I would like to know whether the scans were saved for use or archived on their own, without having to talk to the retarded AI, but journalism has gone to shit even without the political bias.
 
who gives a shit? The weird fetishization of books is so unbelievably gay.
No shit. It's like the people who go on about "food waste" as a moral crime, when they're talking about Walmart goyslop that is overproduced in the first place.

Do you know how many books are currently rotting, unread, in the homes of old hoarders? A gorillion. My local thrift stores won't even take books anymore, and the library is shrinking shelf space in favor of computer stations.

Books are important, in general, but we're not talking about medieval manuscripts handwritten by monks.
but this all sounds wildly inefficient as a means of gathering data to train on
At least some low wage workers got air-conditioned jobs for a while.
 
Book authors made the wrong arguments in Meta AI training case, judge says (archive)
Judges clash over "schoolchildren" analogy in key AI training rulings.

Ashley Belanger – Jun 26, 2025

Soon after a landmark ruling deemed Anthropic's copying of books to train artificial intelligence models a "transformative" fair use, another judge has arrived at the same conclusion in a case pitting book authors against Meta.

But that doesn't necessarily mean the judges are completely in agreement, and that could soon become a problem for not just Meta, but other big AI companies celebrating the pair of wins this week.

On Wednesday, Judge Vince Chhabria explained that he sided with Meta, despite his better judgment, mainly because the authors made all the wrong arguments in their case against Meta.

"This ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful," Chhabria wrote. "It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one."

Rather than argue that Meta's Llama AI models risked rapidly flooding their markets with competing AI-generated books that could indirectly harm sales, the authors fatally argued only "that users of Llama can reproduce text from their books, and that Meta’s copying harmed the market for licensing copyrighted materials to companies for AI training."

Because Chhabria found both of these theories "flawed"—the former because Llama cannot produce long excerpts of works, even with adversarial prompting, and the latter because authors are not entitled to monopolize the market for licensing books for AI training—he said he had no choice but to grant Meta's request for summary judgment.

Ultimately, because authors introduced no evidence that Meta's AI threatened to dilute their markets, Chhabria ruled that Meta did enough to overcome authors' other arguments regarding alleged harms by simply providing "its own expert testimony explaining that Llama 3’s release did not have any discernible effect on the plaintiffs’ sales."

Chhabria seemed to criticize authors for raising a "half-hearted" defense of their works, noting that his opinion "may be in significant tension with reality," where it seems "possible, even likely, that Llama will harm the book sale market."

There is perhaps a silver lining for other book authors in this ruling, Chhabria suggested. Since Meta's request for summary judgment came before class certification in the lawsuit, his ruling only applies to the 13 authors who sued Meta in this particular case. That means other authors who can make a stronger case alleging market harms may still have a good chance of winning a future lawsuit against Meta, Chhabria wrote.

"In cases involving uses like Meta’s, it seems like the plaintiffs will often win, at least where those cases have better-developed records on the market effects of the defendant’s use," Chhabria wrote. "No matter how transformative [AI] training may be, it’s hard to imagine that it can be fair use to use copyrighted books to develop a tool to make billions or trillions of dollars while enabling the creation of a potentially endless stream of competing works that could significantly harm the market for those books."

Further, Chhabria suggested that "some cases might present even stronger arguments against fair use"—such as news organizations suing OpenAI over allegedly infringing ChatGPT outputs that could indirectly compete with their websites. Celebrating the ruling, a lawyer representing The New York Times in that suit, Ian Crosby, told Ars that both Chhabria's and Alsup's rulings are viewed as strengthening the NYT's case.

"These two decisions show what we have long argued: generative AI developers may not build products by copying stolen news content, particularly where that content is taken by wrongful means and their products output substitutive content that threatens the market for original, human-made journalism," Crosby said.

On the other hand, Chhabria wrote that AI companies may have an easier time defeating copyright claims if the feared market dilution is a trade-off for a clear public benefit, like advancing non-commercial research into national security or medicine.

Chhabria said that if the authors had introduced any evidence of market dilution, Meta would not have won at this stage of the case and would have likely faced broader discovery in a class-action suit weighed by a jury.

Instead, the only surviving claim in this case concerns Meta's controversial torrenting of books to train Llama, which authors have so far successfully alleged may have violated copyright laws by distributing their works as part of the torrenting process.

Training AI is not akin to teaching “schoolchildren”

According to Chhabria, evidence of market dilution from rights holders may make for the strongest argument, and the one most likely to win, in AI copyright fights. So, while Meta technically won this fight against these book authors, the ruling isn't necessarily a slam dunk for Meta, nor does it offer ample security for any AI company.

Rather than suggest that AI companies can defeat copyright claims on the virtue that their products are "transformative" uses of authors' works, Chhabria said that cases will win or lose based on allegations of market harm.

He claimed that the "upshot" of his ruling is that he did not create any bright-line rules carving out exceptions for AI companies. Instead, he believes that his ruling makes it clear "that in many circumstances it will be illegal to copy copyright-protected works to train generative AI models without permission. Which means that the companies, to avoid liability for copyright infringement, will generally need to pay copyright holders for the right to use their materials."

In his order, Chhabria called out Judge William Alsup for focusing his ruling this week in the Anthropic case "heavily on the transformative nature of generative AI while brushing aside concerns about the harm it can inflict on the market for the works it gets trained on."

Chhabria particularly objected to Alsup comparing authors' complaints about the possible market harms of Anthropic's Claude flooding book markets to the outlandish idea that teaching "schoolchildren to write well" would "result in an explosion of competing works."

"According to Judge Alsup, this 'is not the kind of competitive or creative displacement that concerns the Copyright Act,'" Chhabria wrote. "But when it comes to market effects, using books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a miniscule [sic] fraction of the time and creativity it would otherwise take.

"This inapt analogy is not a basis for blowing off the most important factor in the fair use analysis," Chhabria cautioned.

Additionally, Meta's claim that granting authors a win would stop AI innovation "in its tracks" is "ridiculous," Chhabria wrote, noting that if rights holders win in any of the lawsuits against AI companies today, the only outcome would be that AI companies would have to pay authors—or else rely on materials in the public domain and prove that it's not necessary to use copyrighted works for AI training after all.

"These products are expected to generate billions, even trillions, of dollars for the companies that are developing them," Chhabria wrote. "If using copyrighted works to train the models is as necessary as the companies say, they will figure out a way to compensate copyright holders for it."

Three ways authors can keep fighting AI training

This week's rulings suggest that the question of whether AI training is transformative has been largely settled.

But as authors continue suing AI companies, with the latest lawsuit lobbed at Microsoft this week, Chhabria suggested that "generally the plaintiff’s only chance to defeat fair use will be to win decisively on" the fourth factor of a fair use analysis, where judges and juries weigh "the effect of the use upon the potential market for or value of the copyrighted work."

Chhabria suggested that authors had at least three paths to fight AI training on the basis of market harms. First, they could claim that AI outputs "regurgitate their works." Second, they could "point to the market for licensing their works for AI training and contend that unauthorized copying for training harms that market (or precludes the development of that market)." And third, they could argue that AI outputs could "indirectly substitute" their works by generating "substantially similar" works.

Because the first two arguments failed in the Meta case, Chhabria thinks "the third argument is far more promising" for authors intending to pick up the torch where the 13 authors in the current case have failed.

One wrinkle that may have stopped authors from invoking market dilution as a threat in the Meta case, Chhabria noted, is that Meta had argued that "market dilution does not count under the fourth factor."

But Chhabria clarified "that can’t be right."

"Indirect substitution is still substitution," Chhabria wrote. "If someone bought a romance novel written by [a large language model (LLM)] instead of a romance novel written by a human author, the LLM-generated novel is substituting for the human-written one." Seemingly, the same would go for AI-generated non-fiction books, he suggested.

So while "it’s true that, in many copyright cases, this concept of market dilution or indirect substitution is not particularly important," AI cases may change the copyright landscape because it "involves a technology that can generate literally millions of secondary works, with a miniscule [sic] fraction of the time and creativity used to create the original works it was trained on," Chhabria wrote.

This is unprecedented, Chhabria suggested, as no other use "has anything near the potential to flood the market with competing works the way that LLM training does. And so the concept of market dilution becomes highly relevant... Courts can’t stick their heads in the sand to an obvious way that a new technology might severely harm the incentive to create, just because the issue has not come up before."

In a way, Chhabria's ruling provides a roadmap for rights holders looking to advance lawsuits against AI companies in the midst of precedent-setting rulings.

Unfortunately for the book authors suing Meta, who found a sympathetic judge in Chhabria—but only made a "fleeting reference" to indirect substitution in a single report in their filings ahead of yesterday's ruling—"courts can’t decide cases based on what they think will or should happen in other cases."

Had their allegations been just a little stronger, Chhabria suggested, the authors could even have won on summary judgment instead of Meta.

"Indeed, it seems likely that market dilution will often cause plaintiffs to decisively win the fourth factor—and thus win the fair use question overall—in cases like this," Chhabria wrote.



Judge: Pirate libraries may have profited from Meta torrenting 80TB of books (archive)
Meta may defeat authors’ torrenting claim due to lack of evidence.

Ashley Belanger – Jun 26, 2025

Now that Meta has largely beaten an AI training copyright lawsuit raised by 13 book authors—including comedian Sarah Silverman and Pulitzer Prize-winning author Junot Diaz—the only matter left to settle in that case is whether Meta violated copyright laws by torrenting books used to train Llama models.

In an order that partly grants Meta's motion for summary judgment, Judge Vince Chhabria confirmed that Meta and the authors would meet on July 11 to "discuss how to proceed on the plaintiffs’ separate claim that Meta unlawfully distributed their protected works during the torrenting process."

Chhabria's order suggested that authors may struggle to win this part of the fight, too, due to a lack of evidence, as there has not yet been much discovery on this issue that was raised so late in the case. But he also warned that Meta was wrong to argue its torrenting was completely "irrelevant" to whether its copying of books was fair use.

Chhabria suggested the torrenting—which may have comprised more than 80.6 terabytes of data from one shadow library, LibGen—is "at least potentially relevant" in "a few different ways."

First, Chhabria noted that Meta deciding to pirate books from shadow libraries was "relevant to the issue of bad faith." That’s connected to the first factor of a fair use analysis, which weighs the character of the use.

Authors had argued that Meta had sparked conversations with some publishers about licensing authors' works, but "after failing to acquire licenses," CEO Mark Zuckerberg "escalated" the issue, Chhabria explained. That prompted a decision to acquire books from pirate libraries instead, Chhabria wrote, with Meta admitting it used BitTorrent to grab the data after abandoning its pursuit of licensing deals for the same books.

However, that aspect of the trial may not matter much, since Chhabria noted that "the law is in flux about whether bad faith is relevant to fair use."

It could certainly look worse for Meta if authors manage to present evidence supporting the second way that torrenting could be relevant to the case, Chhabria suggested.

"Meta downloading copyrighted material from shadow libraries" would also be relevant to the character of the use, "if it benefitted those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works," Chhabria wrote.

Counting potential strikes against Meta, Chhabria pointed out that the "vast majority of cases" involving "this sort of peer-to-peer file-sharing" are found to "constitute copyright infringement." And it likely doesn't help Meta's case that "some of the libraries Meta used have themselves been found liable for infringement."

However, Meta may overcome this argument, too, since book authors "have not submitted any evidence" showing that Meta's downloading may be "propping up" or financially benefiting pirate libraries.

Finally, Chhabria noted that the "last issue relating to the character of Meta’s use" of books in regards to its torrenting is "the relationship between Meta’s downloading of the plaintiffs’ books and Meta’s use of the books to train Llama."

Authors had tried to argue that these elements were distinct. But Chhabria said there's no separating the fact that Meta downloaded the books to serve the "highly transformative" purpose of training Llama.

"Because Meta’s ultimate use of the plaintiffs’ books was transformative, so too was Meta’s downloading of those books," Chhabria wrote.

AI training rulings may get more authors paid

Authors only learned of Meta's torrenting through discovery in the lawsuit, and because of that, Chhabria noted that "the record on Meta’s alleged distribution is incomplete."

It's possible that authors may be able to show evidence that Meta "contributed to the BitTorrent network" by providing significant computing power that could've meaningfully assisted shadow libraries, Chhabria said in a footnote.

But Chhabria dinged authors for citing only an outdated Ars Technica article from 2010 that suggested that people only rarely used torrents to pirate books. (E-book piracy has significantly spiked since then, as TorrentFreak has documented in more recent reports that also note research showing that taking pirated books offline can benefit book sales.)

More will be revealed as the Meta case advances next month, but Chhabria noted that one potential outcome, win or lose for authors, could be that publishers become incentivized to make it easier to license authors' works for AI training.

"Publishers may not currently hold the subsidiary rights necessary to make group licensing possible," Chhabria wrote. "But it’s hard to believe that they won’t soon start negotiating those rights with their authors so that they can engage in large-scale negotiation and licensing" with large language model (LLM) developers—"assuming they haven’t already started to do so."

"It seems especially likely that these licensing markets will arise if LLM developers’ only choices are to get licenses or forgo the use of copyrighted books as training data," Chhabria noted.

That could be the outcome if other authors suing AI companies secure victories that Chhabria views as inevitable. They would need to show evidence that AI products dilute markets for their works, which the authors suing Meta failed to do.

In his ruling granting Meta the win against authors' copyright infringement claims, Chhabria said that Meta prevailed only because the authors raised the "wrong arguments," which suggests Meta may be more inclined to renew licensing talks if it faces a stronger copyright fight, despite winning this landmark battle against a handful of authors this week.

And if AI companies facing that potential reality "instead choose to use only public domain works as training data (instead of licensing copyrighted works), that would indicate that they don’t actually need the copyrighted works as badly as they say they do," Chhabria wrote. And if that's true, there's likely little excuse for torrenting of pirated books that authors otherwise had long considered an obvious example of copyright infringement.
 
The "paperless" revolution continues. Kek. And the market demand for dumbing oneself down via laziness or shiny object. Did they recycle those old books like good little pod men? What was the carbon footprint?

Eh, whatever. Scan and dump all ya want, AI fags. Old books will rot just fine in a landfill without poisoning anything. Well, hopefully not much.

You're just gonna get another digital parrot toy, slightly different from the other one. Kinda like the way everything online is just improved email in one way or another.
 
I don't really care about the loss of mass market paperback garbage, but this all sounds wildly inefficient as a means of gathering data to train on. Probably a mechanical nightmare; they say the destructive variant is faster, but I have a hard time believing it could really be that fast compared to absorbing already-digital materials instead, surely?
It's about owning a physical copy to deal with copyright restrictions.
 
When asked about this process, Claude itself offered a poignant response in a style culled from billions of pages of discarded text: "The fact that this destruction helped create me—something that can discuss literature, help people write, and engage with human knowledge—adds layers of complexity I'm still processing. It's like being built from a library's ashes."
Someone on 4chan pointed out that excessive em dashes are often a "tell" for AI works, and I can't unsee it.
 