Warning: long sperg essay ahead. I feel like matters of copyright and fair use affect KF rather deeply and wanted to give this issue the attention it deserves.
The US Copyright Office has waded into the AI/fair use legal battle and ignited some controversy in the process. Their latest conclusions could have far-reaching impact on AI development, and more broadly how fair use is interpreted.
The USCO has been gradually developing a series of reports on the subject of AI and how they are planning to treat it moving forward. They've released three of them:
Part 1 - Where they stress the need for regulation regarding unauthorized replicas and deepfakes, but acknowledge that style cannot be copyrighted:
https://www.copyright.gov/ai/Copyri...telligence-Part-1-Digital-Replicas-Report.pdf (archive)
Part 2 - Where they find that raw AI prompting likely doesn't rise to the level of copyrightability, but even minor edits that show human authorship, like inpainting, are often enough to make an AI work copyrightable, and to date, the USCO has already granted over 1000 copyrights to AI works:
https://www.copyright.gov/ai/Copyri...telligence-Part-2-Copyrightability-Report.pdf (archive)
Part 3 (Pre-publication) - Where they declare that AI training is likely not fair use:
https://www.copyright.gov/ai/Copyri...I-Training-Report-Pre-Publication-Version.pdf (archive)
It is this third report which has understandably generated a lot of hubbub and controversy.
It was published on May 9th as a "pre-publication" version.
Copyright Lately (a) notes that it is not standard practice to release such reports this way. The same day, the Trump administration fired the Librarian of Congress Carla Hayden, and on the following day the head of the Copyright Office, Shira Perlmutter, was also fired. There is rampant speculation that their firings are due to this report. You'll find many confident declarations of this on social media and comments sections, though there is no official statement confirming it.
The Register (a) speculates that the timing of this firing might just be coincidental:
There’s another possible explanation for Perlmutter’s ousting: The Copyright Office is a department of the Library of Congress, whose leader was last week fired on grounds of “quite concerning things that she had done … in the pursuit of DEI [diversity, equity, and inclusion] and putting inappropriate books in the library for children," according to White House press secretary Karoline Leavitt.
So maybe this is just the Trump administration enacting its policy on diversity without regard to the report’s possible impact on donors or Elon Musk.
Regardless, this latest report is deeply flawed, seemingly partisan, and attempts to reinterpret fair use in concerning ways. Even though the USCO aren't lawmakers, the report could have a troubling impact on ongoing and future court cases regarding AI, the idea of fair use, and the scope of where the Copyright Office is allowed to poke their nose into ongoing litigation.
Here's a copyright law professor's take on the report:
https://chatgptiseatingtheworld.com...s-flawed-both-procedurally-and-substantively/ (archive)
Advising Congress is one thing. But making rulings on each factor of fair use based in part on comments submitted from some of the parties on both sides in the 36 copyright lawsuits being litigated in courts is quite another.
Indeed, I cannot think of any novel legal issue related to fair use in the pending 36 copyright lawsuits–from transformative purpose in AI training, use of pirated books datasets, deployment of AI in RAG search, and a new and untested theory of market dilution–that the Copyright Office didn’t rule on.
And by “rule on,” I mean the Office took a legal position, favoring one side or even party in the litigation.
Section 701 of the Copyright Act does not authorize the Copyright Office to hold its own shadow proceeding ruling on legal issues being raised by parties in active litigation. To do so usurps the province of the federal courts, especially given that fair use is a quintessentially judge-made doctrine. It’s for the courts, not the Copyright Office, to embark on “uncharted territory” in fair use. If the Office wanted to give its opinions on live controversies involving fair use, the proper procedure for the Office to use is filing an amicus brief in the pending litigation, which comports with due process and gives the opponent a chance to respond to the Office’s views. Indeed, the Copyright Office’s Report is itself unprecedented. To my knowledge, the Office has not ever issued a Report taking the side of one party in active litigation on novel issues of copyright law.
I am not a lawyer. I am just a retard on the Internet. But it isn't difficult to read parts of this report and question how their claims make any sense if applied to other examples of what's considered fair use.
I'm going to focus on one specific section which I see as particularly ridiculous: their take on fair use factor three, "the amount and substantiality of the portion used in relation to the copyrighted work as a whole." While no one factor of fair use is necessarily more important than the others, their examination of this point at least highlights some of the flaws in their judgement.
The USCO's commentary on "the amount used" is on page 55. In a 113-page report, this factor is glossed over in just two paragraphs:
The Supreme Court has said that courts assessing the amount and substantiality must consider both the quantity of material used and its quality and importance. Copying even a small portion of a work may weigh against fair use where it is the “heart” of the work. In general, “[t]he larger the amount, or the more important the part, of the original that is copied, the greater the likelihood that the secondary work might serve as an effectively competing substitute for the original, and might therefore diminish the original rights holder's sales and profits.”
Downloading works, curating them into a training dataset, and training on that dataset generally involve using all or substantially all of those works. Such wholesale taking ordinarily weighs against fair use.
Where to even start.
Downloading and organizing works prior to publishing anything with them is not considered "use" at that point. An artist might download hundreds of pose references in order to draw a character which isn't substantially similar to any of those references. That doesn't mean their drawing "used" all of those works, since they're not present anywhere in the final drawing. If downloading those works was piracy or illegal for some reason, that would be its own separate matter...and in general, web scraping of publicly available data has been established as a legal practice in the US. Fair use is about how the work you used is present in the final result you share with the world, there isn't a "fruit of the poisonous tree" doctrine where the works you reference during the creation process get you into trouble.
To say that training AI involves using "all" of the work is laughable. A completed AI model does not store any of the works it was trained on. There's a pervasive myth that AI "chops up" pictures into smaller chunks and copy-pastes them back together, but that's not how generative AI works. AI is a complex field and difficult to explain, but a gen AI model is a set of learned mathematical weights that guide the model as it diffuses an image out of random noise. It retains hardly any information from any individual image. Models are trained on billions of images, yet the finished product is only a few gigabytes. If it were somehow compressing the images it trained on, it would be the most impossibly mindblowing compression known to man. It's true that some images do end up memorized by the model, but these are errors in the training process, cases where one image made its way into the dataset thousands of times and was trained on repeatedly. In general, each image should only be seen once, and one pass is nowhere near enough to reproduce any of it substantially.
For most modern diffusion models, when AI trains on an image, it looks at it and "learns" somewhere from two to five bytes from it, and those bytes don't even really represent the image. It's baffling to claim that "all" of the image is in those two to five bytes. That's fucking nothing. It's barely even enough data to write one word.
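You can sanity-check that bytes-per-image figure yourself with back-of-envelope arithmetic. The numbers below are rough, public ballpark figures (a Stable Diffusion-class checkpoint of a few gigabytes, a LAION-scale dataset of about two billion images), not exact values for any specific model:

```python
# Back-of-envelope: how much model capacity exists per training image?
# Illustrative assumptions, not exact figures for any particular model:
model_size_bytes = 4 * 10**9   # ~4 GB checkpoint
training_images = 2 * 10**9    # ~2 billion images (LAION-2B scale)

bytes_per_image = model_size_bytes / training_images
print(f"{bytes_per_image:.1f} bytes of model capacity per training image")
# Even if every single weight encoded training data (it doesn't),
# that's ~2 bytes per image -- nowhere near enough to store a copy.

# For contrast, a single 512x512 JPEG is typically ~100 KB:
jpeg_size_bytes = 100 * 1024
ratio = jpeg_size_bytes / bytes_per_image
print(f"each source image is ~{ratio:,.0f}x larger than that per-image budget")
```

Even under the most generous assumption that the entire checkpoint is dedicated to memorizing training data, the arithmetic leaves a couple of bytes per image, which is why "the model contains all of the work" doesn't survive contact with the model's actual file size.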
Compare this to other longstanding examples of fair use.
Look at Wikipedia's plot summaries of movies and books. Does it make any sense to say that a summary of The Avengers used "all" of the movie because the wiki editor had to watch the entire movie in order to write their summary of it? Or does it use basically nothing of the movie, because that text summary doesn't contain any screenshots, music, sound, or lines of dialogue?
If you summarize the broad points covered by Nick Rekieta during an unhinged rant on stream, did you use all of the stream because you had to watch it in order to know those points?
Here's a random case where the court ruled in favor of fair use:
https://www.arl.org/blog/celebrating-fair-use-in-films/ (archive)
Documentary filmmakers have relied heavily on the doctrine of fair use, which makes a lot of sense. If documentary filmmakers constantly had to rely on permission and licenses—which would also mean that a rightholder could refuse to grant permission—the result could be that these documentaries lacked proper historical references and context. In a 1996 case, the Southern District of New York refused to grant Turner Broadcasting’s motion for injunctive relief, finding that the clips of a boxing match film involving Muhammad Ali and George Foreman in a documentary about Muhammad Ali was likely a fair use. In Monster Communications, Inc. v. Turner Broadcasting Systems, the court noted that only a small portion of the total film—just 41 seconds—was taken and that the documentary used it for informational purposes.
Ah, but did the court consider that the documentarians had to watch the entire film in order to find the 41 seconds they ended up using? The USCO would say that they used all of it.
That's the pants-on-head retardation we're talking about here.
There's a substantial footnote below the Copyright Office's declaration that AI training uses all of the works it trains on. I wonder where they got this idea that looking at something means you used all of it.
See NMPA Initial Comments at 10 (“When a pre-existing work is used to train an AI model, it is analyzed in its entirety. For some models, developers will compress each training example into a compact representation and then cause the developing model to predictively reconstruct it.“); Karla Ortiz Initial Comments (“I found that almost the entire body of my work, the work of almost every artist I know, and the work of hundreds of thousands of other artists, was taken without our consent, credit or compensation to train these for-profit technologies.”); Katherine Lee et al. Initial Comments at 100. Some have argued that generative AI training in fact uses little of the training data. See Meta Initial Comments at 15 (stating the process “meant to extract, relatively speaking, a miniscule amount of information from each piece of training data.”); Oren Bracha, The Work of Copyright in the Age of Machine Production, UNIV. OF TEXAS LAW 1, 25 (last updated Feb. 16, 2024),
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4581738 (arguing that what developers take in the training process is not protectible expression at all but a form of “meta information” about the works). As discussed above, however, the Office finds that the training process regularly uses expressive elements of the underlying works. See supra text accompanying notes 268–71.
Oh, it's based on the word of plaintiff litigants who want to stop AI training. Lovely. When the National Music Publishers' Association tells you that AI uses every bit of the works it trains on, don't you just find that to be completely trustworthy?
The fact that they cite Karla Ortiz of all people, as if she were some kind of expert on these matters... Karla is a metaphorically blue-haired sperg who was one of the first artists to fly into a rage and sue every AI company under the sun before knowing anything about it, and as a result had the majority of her case thrown out. (a) Most embarrassingly, she was trying to sue for copyright infringement on works she hadn't even registered with the Copyright Office. Her quote above about the entire body of her work being taken is from 2022, a time when she would've had no way of knowing or proving whether any of her works had been trained on by AI. And beyond that, the quote has nothing to do with whether the technical process of training AI uses all of a work. She knows nothing about generative AI and has no capacity to speak on whether training uses all or most of a work, so why is the Copyright Office treating her like an authority on the subject?
"The Office finds that the training process regularly uses expressive elements of the underlying works." Really? Which works are infringed upon? The model doesn't collect enough information from any one image to possess any of its expressive elements. It's true that after examining many works you can create something that seems expressive in a way those works share in common, something like a common pose or a way of framing a car in an advertisement, but no individual work had its expressive elements taken. It's not physically possible with the amount of data we're talking about.
The whole thing is a shitshow, and I hope Perlmutter WAS fired for this report. It would threaten the very idea of fair use if it had any legal bearing whatsoever, which it still might, if it ends up cited by any judges in their upcoming rulings.