New Meta Emails Reveal That the Company Downloaded 81.7 TB of Copyrighted Books via BitTorrent to Train Its AI Models - Tech giants have been continuously extracting data from the Internet to develop their models.


The ongoing Kadrey v. Meta Platforms, Inc. lawsuit accuses the tech giant of using copyrighted materials to train its artificial intelligence models. A few months ago, it was revealed that Meta CEO Mark Zuckerberg authorized the use of pirated books. New evidence recently emerged to support these claims.

Unsealed emails. Appendix A of the case includes several emails from Meta employees that reveal a significant number of downloads of copyrighted books. One employee named Melanie Kambadur expressed her refusal to participate in this kind of data collection in October 2022.

“Torrenting from a corporate laptop doesn’t feel right,” Nikolay Bashlykov, a Meta engineer responsible for this data collection, said in an April 2023 message. He added that the company needed to be cautious about the IP address from which they downloaded the materials.

Meta knew the risks. In September 2023, Bashlykov cautioned that torrenting could lead to “seeding,” which “could be legally not OK.” These internal discussions suggest that Meta recognized this type of activity as unlawful, according to authors who have sued the company.

Covering its tracks. In an internal message, Meta researcher Frank Zhang said that the company took measures to avoid using its servers when downloading the data set. This was to prevent anyone from being able to trace the seeding and the entity downloading the content.

81.7 TB of data. According to Ars Technica, new evidence indicates that Meta downloaded at least 81.7 TB of data from several libraries that offered copyrighted books via torrents. A recent document from the ongoing legal process revealed that at least 35.7 TB were downloaded from sites like Z-Library or LibGen (which was shut down in the summer).

Meta seeks to dismiss the allegations. The company has filed a motion to dismiss these charges. Meta claims there’s no evidence that any of its employees downloaded books via torrents or that they were distributed by Meta. Xataka has contacted the company for comments on the case and will update this post if we receive a response.

Plundering the Internet. This issue highlights the questionable practices that AI companies employ to train their models. It happened with Google, which updated its privacy policy in 2023 to say that it’ll “use publicly available information to help train Google’s AI models.” It’s also evident with OpenAI, which used millions of texts, many of them copyrighted, to train ChatGPT. Perplexity recently came under scrutiny for bypassing the “rules of the Internet” to feed its AI model.

Internet theft is being normalized. What’s remarkable is that as companies increasingly skirt the rules and violate copyright, this behavior is starting to be seen as normal. There seems to be little time for outrage, and people often treat it as an accepted practice so they can continue with their business.

Is this really “fair use”? Many companies rely on the concept of “fair use,” which allows for limited use of protected material without requiring permission. While copyright infringement lawsuits are emerging in the world of generative AI, they often seem to take a backseat as these large companies continue to thrive.

Article Link

Archive
 
Here is an update to this story.

Meta defends using pirated material, claims it's legal if you don't seed content

Meta claimed in a court filing this week that, despite torrenting an 82 TB dataset of pirated, copyrighted material from shadow libraries to train its LLaMA AI models, employees "took precautions not to 'seed' any downloaded files".

In torrenting terminology, seeding refers to sharing a file with other users during (or, more commonly, after) downloading it. Since torrenting is a peer-to-peer system, every user downloading a file can also upload parts of it to other users.
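
To make the distinction concrete, here is a minimal toy sketch in Python of peers exchanging pieces of a file. It is not the BitTorrent protocol and does not describe any real client: the `Peer` class, its `allow_upload` flag, and the peer names are all hypothetical, invented purely for illustration. It only shows that a peer holding part of a file can hand pieces to others while still downloading, and that a peer with uploading disabled cannot.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Toy peer (hypothetical): tracks held piece indices and an upload switch."""
    name: str
    total_pieces: int
    pieces: set = field(default_factory=set)
    allow_upload: bool = True
    uploaded: int = 0

    def is_complete(self) -> bool:
        # A peer holding every piece has finished downloading.
        return len(self.pieces) == self.total_pieces

    def request_from(self, other: "Peer", index: int) -> bool:
        """Try to fetch one piece from another peer; return True on success."""
        if index in other.pieces and other.allow_upload:
            self.pieces.add(index)
            other.uploaded += 1
            return True
        return False


seeder = Peer("seeder", 4, pieces={0, 1, 2, 3})
sharer = Peer("sharer", 4)                        # uploads while downloading
leecher = Peer("leecher", 4, allow_upload=False)  # never uploads

# Both downloaders fetch pieces 0 and 1 from the original seeder.
for peer in (sharer, leecher):
    for i in (0, 1):
        peer.request_from(seeder, i)

newcomer = Peer("newcomer", 4)
got_from_sharer = newcomer.request_from(sharer, 0)    # served mid-download
got_from_leecher = newcomer.request_from(leecher, 1)  # refused, upload disabled
```

In this toy model, `sharer` has uploaded a piece before finishing its own download, which is the "during" case the article mentions.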

Meta's lawyers claim that there are "no facts to show that Meta seeded Plaintiffs' books". This means that the company's defense is pinning hopes on the fact that there isn't currently any proof that Meta shared the material during the torrenting process.

Though Meta claims that there is no evidence of seeding, Michael Clark, an executive at Meta in charge of project management, testified that the configuration settings they were using were modified "so that the smallest amount of seeding possible could occur".

When Clark was then asked why Meta chose to minimize seeding, attorney-client privilege was invoked, so he did not answer.

Interestingly, Clark's statement shows that Meta sought methods to minimize seeding, but the company has yet to offer any indication that it entirely prevented seeding copyrighted material.

Additionally, an internal message from Frank Zhang, a Meta researcher, could point toward alleged concealment of potential seeding from Meta's servers, to avoid "risk of tracing back the seeder/downloader" to Facebook servers.

Meta's defense seems to hinge on the lack of evidence that it shared the large amount of data it allegedly downloaded to train its AI models. Should Meta win on this defense and establish that downloading copyrighted content isn't illegal but distributing it is, it could shake up future cases of piracy and unauthorized distribution of copyrighted content.

Relying on torrenting terminology could also be a way for Meta to try to trip up the courts. Focusing on seeding could further muddy the claim that Meta knew it was violating laws by torrenting copyrighted material.

Meta has yet to respond to questions about whether it knew that it was sharing data during the download process.

Authors allege Meta was "knowing participant" in "illegal peer-to-peer piracy network"

Authors of the copyrighted material that Meta allegedly obtained without prior licensing agreements have pointed [PDF] to "Meta's decision to bypass lawful acquisition methods and become a knowing participant in an illegal peer-to-peer piracy network".

With the court battle expected to continue, no final decision in the case has been made. Even after a decision, Meta is expected to appeal if it loses, meaning that a final judgment could be a long while away.

But similar cases do exist. OpenAI was sued by novelists in 2023, with the New York Times also suing OpenAI and Microsoft over "millions" of copied news articles. As the long list of LLM-related litigation continues, this is likely not going to be the last we hear of Meta's specific case.

Article Link
 
Archives of both legal documents are attached to this post.

1.png
2.png
 

Attachments

I know KF has a hateboner for copyright laws, but I sincerely hope that Zucc gets fined for every single book here.
So, the AI only repeats what it read from books without adding any personal input.

It's like any college student, but faster.
Because no citations are even needed.
Worst part about the whole thing, niggas didn't even have the decency to seed after...
Having some META servers seeding would at least be some form of a "moral" compensation for how this company has socially engineered people to be more retarded. But they didn't even do that, which really tells you how much of a kike Zucc really is.
 
Worst part about the whole thing, niggas didn't even have the decency to seed after...
Yeah, those niggers should have at least done that. You always think they couldn't get worse...
Having some META servers seeding would at least be some form of a "moral" compensation for how this company has socially engineered people to be more retarded. But they didn't even do that, which really tells you how much of a kike Zucc really is.
It's beyond parody. Absolutely incredible.
:story::story::story:
 
So there is a bit more irony to it than just "Meta Jew'd out and didn't seed from their servers" as the actual torrenting itself was done manually via individual engineers torrenting directly to their local machines, where they specifically used VPN's because of how insane it would be if there were Meta owned corporate IP's being shown doing this. So basically, the decision not to seed after torrenting was made individually by each engineer, even though they were hiding their corporate IP's. So it wasn't just "nameless middle manager ran everything through a big server" it was a bunch of assholes who were totally aware of how much of a piece of shit they were being lmfao.
 
The only crime was not seeding. Imagine Tay 2.0 trained on autistic manifestos, Fifty Shades of Grey and Japanese light novels plus self-help guidebooks.

Hilarious.
 
Now watch as the book publishers do nothing.
They're not going to do nothing. This could actually be the biggest copyright infringement suit of all time, on thousands, maybe tens of thousands, maybe more books. They were obtained illicitly for the purpose of doing something to earn a profit in the billions.

Let's say it's 10,000 infringements but they're only asking for statutory damages. At the $150,000-per-work maximum for willful infringement, that's $1,500,000,000. You'd probably have a whole cartel of publishers diving in to share the booty.
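
For anyone checking the math, here is a rough sketch of where the $1,500,000,000 comes from. It assumes the 10,000 figure (a hypothetical, not a number from the case) counts distinct works, and uses the per-work statutory ranges in 17 U.S.C. § 504(c): $750 to $30,000 ordinarily, up to $150,000 if the infringement is found willful. The function name is made up for illustration, and what a court would actually award is anyone's guess.

```python
# Per-work statutory damages under 17 U.S.C. § 504(c), in USD.
STATUTORY_MIN = 750               # ordinary minimum
STATUTORY_MAX_ORDINARY = 30_000   # ordinary maximum
STATUTORY_MAX_WILLFUL = 150_000   # maximum if infringement is willful


def statutory_damages(works: int, per_work: int) -> int:
    """Total statutory damages for `works` infringed works at a flat per-work award."""
    return works * per_work


works = 10_000  # hypothetical count, as in the post above

floor = statutory_damages(works, STATUTORY_MIN)
ordinary_cap = statutory_damages(works, STATUTORY_MAX_ORDINARY)
willful_cap = statutory_damages(works, STATUTORY_MAX_WILLFUL)

print(f"minimum:         ${floor:,}")
print(f"ordinary max:    ${ordinary_cap:,}")
print(f"willful max:     ${willful_cap:,}")
```

So the $1.5 billion figure is the top of the willful range; the same 10,000 works at the ordinary maximum would be $300,000,000.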

But they could probably also or instead go for disgorgement of all improper profits. That could be in the billions.

They'd be insane to leave that money on the floor.

You can make fair use of material you gained legally, such as by purchasing it, or using public domain material. But when you do it by pirating it, and you knew you were pirating it, because Cuckerberg just moronically gets told hey dude, that's illegal, and he says fuck it, do it anyway, you move it into the highest tier of infringement, and perhaps even the criminal realm.
Here is an update to this story.
It's a dumb argument. While illicit access to the material in question doesn't completely eliminate a fair use defense, since a lot of news stories are based on leaks, it vitiates it, because you've made an infringing copy of every single work in huge volume.

You're commercially exploiting someone else's work you didn't even access legitimately, because you pirated it and you knew you were pirating it, and deliberately concealed the activity by not seeding (fucking leeching scum), and that's way more obviously the motive than that they actually didn't want to infringe. What they didn't want is to get CAUGHT.
 