New Meta Emails Reveal That the Company Downloaded 81.7 TB of Copyrighted Books via BitTorrent to Train Its AI Models - Tech giants have been continuously extracting data from the Internet to develop their models.


The ongoing Kadrey v. Meta Platforms, Inc. lawsuit accuses the tech giant of using copyrighted materials to train its artificial intelligence models. A few months ago, it was revealed that Meta CEO Mark Zuckerberg authorized the use of pirated books. New evidence recently emerged to support these claims.

Unsealed emails. Appendix A of the case includes several emails from Meta employees that reveal a significant number of downloads of copyrighted books. One employee, Melanie Kambadur, expressed her refusal to participate in this kind of data collection in October 2022.

“Torrenting from a corporate laptop doesn’t feel right,” Nikolay Bashlykov, a Meta engineer responsible for this data collection, said in an April 2023 message. He added that the company needed to be cautious about the IP address from which they downloaded the materials.

Meta knew the risks. In September 2023, Bashlykov cautioned that torrenting could lead to “seeding,” which “could be legally not OK.” These internal discussions suggest that Meta recognized this type of activity as unlawful, according to authors who have sued the company.
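
For readers wondering why "torrenting" automatically drags "seeding" into the picture: BitTorrent clients normally upload the pieces of a file they have already received to other peers while their own download is still in progress, so a downloader is usually also a distributor. The toy script below (plain Python, not a real BitTorrent client; all names and numbers are made up for illustration) sketches that dynamic.

```python
# Toy model of BitTorrent-style piece exchange -- NOT the real protocol.
# The point: a peer that is still downloading already uploads ("seeds")
# the pieces it holds to other peers.

import random

NUM_PIECES = 8  # pretend the shared file is split into 8 pieces


class Peer:
    def __init__(self, name, pieces=None):
        self.name = name
        self.pieces = set(pieces if pieces is not None else [])  # piece indices held
        self.uploaded = 0                                        # pieces served to others

    def exchange(self, other):
        """One round of exchange: each side fetches a piece the other has."""
        for receiver, sender in ((self, other), (other, self)):
            wanted = sender.pieces - receiver.pieces
            if wanted:
                piece = random.choice(sorted(wanted))
                receiver.pieces.add(piece)
                sender.uploaded += 1  # sender redistributes even if incomplete


seeder = Peer("seeder", pieces=range(NUM_PIECES))  # has the full file
alice = Peer("alice")                              # still downloading
bob = Peer("bob")                                  # also still downloading

for _ in range(4):    # Alice grabs a few pieces from the seeder...
    alice.exchange(seeder)
for _ in range(10):   # ...then trades with Bob before she has finished.
    alice.exchange(bob)

print(f"alice: {len(alice.pieces)}/{NUM_PIECES} pieces, uploaded {alice.uploaded}")
print(f"bob:   {len(bob.pieces)}/{NUM_PIECES} pieces, uploaded {bob.uploaded}")
# Alice has uploaded pieces to Bob before completing her own download --
# the "seeding" that Bashlykov flagged as "could be legally not OK."
```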

Covering its tracks. In an internal message, Meta researcher Frank Zhang said that the company took measures to avoid using its own servers when downloading the data set. This was to prevent anyone from tracing the seeding or the downloads back to Meta.

81.7 TB of data. According to Ars Technica, new evidence indicates that Meta downloaded at least 81.7 TB of data from several shadow libraries that offered copyrighted books via torrents. A recent filing in the case revealed that at least 35.7 TB were downloaded from sites such as Z-Library and LibGen (which was shut down in the summer).
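
To get a sense of the scale, here's a quick back-of-the-envelope script. The average book size and connection speed are illustrative assumptions, not figures from the filings:

```python
# Rough arithmetic on the reported figures.
# Assumptions (not from the court documents): ~1 MB per e-book, a sustained 1 Gbps link.

TOTAL_TB = 81.7        # total reportedly downloaded via torrents
ZLIB_LIBGEN_TB = 35.7  # portion attributed to Z-Library / LibGen

AVG_BOOK_MB = 1        # assumed average e-book size
LINK_GBPS = 1          # assumed sustained download speed

approx_books = (TOTAL_TB * 1_000_000) / AVG_BOOK_MB  # TB -> MB, decimal units
seconds = (TOTAL_TB * 1e12 * 8) / (LINK_GBPS * 1e9)  # total bits / link speed
days = seconds / 86_400

print(f"~{approx_books:,.0f} books at {AVG_BOOK_MB} MB each")
print(f"Z-Library/LibGen share: {ZLIB_LIBGEN_TB / TOTAL_TB:.0%} of the total")
print(f"~{days:.1f} days of continuous downloading at {LINK_GBPS} Gbps")
```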

Meta seeks to dismiss the allegations. The company has filed a motion to dismiss these claims, arguing there is no evidence that any of its employees downloaded books via torrents or that Meta distributed them. Xataka has contacted the company for comment on the case and will update this post if it receives a response.

Plundering the Internet. This issue highlights the questionable practices that AI companies employ to train their models. It happened with Google, which updated its privacy policy in 2023 to say that it’ll “use publicly available information to help train Google’s AI models.” It’s also evident with OpenAI, which used millions of texts, many of them copyrighted, to train ChatGPT. Perplexity recently came under scrutiny for bypassing the “rules of the Internet” to feed its AI model.

Internet theft is being normalized. What’s remarkable is that as companies increasingly skirt the rules and violate copyright, this behavior is starting to be seen as normal. There seems to be little time for outrage, and people often treat it as an accepted practice so they can continue with their business.

Is this really “fair use”? Many companies rely on the concept of “fair use,” which allows for limited use of protected material without requiring permission. While copyright infringement lawsuits are emerging in the world of generative AI, they often seem to take a backseat as these large companies continue to thrive.

Article Link

Archive
 
The referenced legal documents are archived at the bottom of this post.

Attachments

0.png
1.png

Download all that data to still get a model that is lobotomized to all hell and pretty much can't tell a good story to save its life. I love generative AI software (as toys), but they are shitty toys that you get bored with after a week the way they are and have been for the past two years. If the courts declare this method of scraping illegal, what happens then?
 
If the courts declare this method of scraping illegal, what happens then?
They retrain on old-ass books and whatever else they can scrounge up that's in the public domain, or they license content (undesirable unless they negotiate favorable terms). Meta surely uses other content too, like whatever brainfarts get posted on Facebook.
 
Download all that data to still get a model that is lobotomized to all hell and pretty much can't tell a good story to save its life. I love generative AI software (as toys), but they are shitty toys that you get bored with after a week the way they are and have been for the past two years. If the courts declare this method of scraping illegal, what happens then?
AI will be a fad, like 3D and VR. It gets popular for a few years when the technology improves, then nobody gives a shit for a few years because it's not practical. Rinse and repeat.
 
Back