A recent U.S. federal court decision has shed important light on the complex copyright issues raised when artificial intelligence companies use vast libraries of books to train large language models (LLMs). In the seminal case of Bartz et al. v. Anthropic PBC, Judge William Alsup of the Northern District of California issued a partial summary judgment on whether the unauthorized copying of books by AI firm Anthropic qualified as “fair use” under Section 107 of the Copyright Act.
The decision reveals a key tension between innovation and intellectual property rights and invites a comparison with EU copyright law, especially Article 4 of the Copyright in the Digital Single Market Directive (CDSM Directive).
What Happened Leading Up To The Lawsuit?
Anthropic, the creator of the Claude AI system, was found to have copied over seven million books from pirated and purchased sources, using them in two principal ways:
- To build a permanent internal research library.
- To train its LLMs, including Claude, by selecting subsets of these books.
The court assessed whether these uses qualified as “fair use” under Section 107 of the U.S. Copyright Act, analyzing four distinct uses:
- Training LLMs on books (including “transformative use” of content).
- Digitizing legally purchased print books.
- Creating a centralized digital library from pirated copies.
- Retaining unused pirated books for possible future use.
What Did The Court Find?
- Generally, training LLMs with books is “transformative” and “fair use”
The court found that training Claude on copyrighted books did not reproduce or distribute infringing output. Instead, training transformed the books into new statistical relationships and generative capabilities, a new use that does not substitute for the originals. It therefore ruled that this use was fair under U.S. law.
- Digitizing legally purchased hard copy books is also “fair use”
Anthropic destroyed the physical copies after scanning them and used the digital versions internally. The court likened this to prior rulings in which format-shifting (e.g., analog to digital) was accepted when it did not lead to redistribution. No surplus copies were created, and accordingly no new market was harmed.
- Storing pirated books for library purposes is not “fair use”
The court rejected the idea that simply maintaining a comprehensive research library of pirated books is protected. The judge emphasized that building a “library of all the books in the world” through piracy is inherently infringing, even if some of the books are later used in fair-use contexts such as training.
How Would EU Law Assess This? CDSM Directive Article 4 Compared
Key Contrasts:
| Aspect | U.S. “Fair Use” | EU Article 4 CDSM |
| --- | --- | --- |
| Purpose | Open-ended (“transformative use”) | Open to any purpose, but rights may be reserved |
| Opt-out | No opt-out; “fair use” is assessed ex post | Explicit opt-out mechanism for rightsholders |
| Pirated copies | Pirated sources never allowed | Pirated sources never allowed |
| Retention for future use | Possibly infringing if not justified | “As long as is necessary for the purposes of text and data mining”, but only with lawful access |
| Licensing obligation | No requirement if use is fair | License required where an opt-out has been exercised |
In the EU, Anthropic’s use of pirated copies would also clearly have fallen outside Article 4 of the CDSM Directive. The text and data mining exception requires lawful access, meaning that content taken from pirate libraries such as Books3 or LibGen would not qualify. Furthermore, EU rightsholders can opt out by marking their content with machine-readable reservations, which bars AI firms from using that content unless they obtain a license beforehand.
Takeaway
- In the U.S., “fair use” remains flexible but is increasingly context-sensitive. “Transformative use”, especially in AI training, is a strong defense, but it does not excuse initial unlawful acquisition.
- In the EU, the lawful-access requirement and rightsholders’ opt-out reservations place stronger limits on AI training datasets, even when the ultimate use is innovative.
- Companies developing AI systems should:
  - Audit datasets for provenance and licensing.
  - Build mechanisms to comply with opt-out signals under CDSM.
  - Avoid reliance on pirated content, even for internal or non-public uses.
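As a rough illustration of the three recommendations above, the following Python sketch shows how a dataset audit might flag works before they enter a training corpus. It is a minimal sketch under stated assumptions, not a compliance tool: the `WorkRecord` fields, the `Source` categories, and the `audit` function are all hypothetical names invented for this example, and the rules are a deliberate simplification of the takeaways in this article rather than a statement of what either legal regime requires in a given case.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Hypothetical provenance categories for a work in a training dataset."""
    PURCHASED = "purchased"
    LICENSED = "licensed"
    PUBLIC_WEB = "public_web"
    PIRATE_LIBRARY = "pirate_library"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class WorkRecord:
    """Hypothetical metadata kept for each work (all field names are assumptions)."""
    title: str
    source: Source
    tdm_rights_reserved: bool = False  # a machine-readable opt-out was detected
    has_license: bool = False          # a license covering training was obtained


def audit(record: WorkRecord, eu_scope: bool = True) -> tuple[bool, str]:
    """Return (usable, reason) under the simplified rules assumed here."""
    # Pirated sources are excluded regardless of how the copy would be used.
    if record.source is Source.PIRATE_LIBRARY:
        return False, "unlawful source"
    # Without provenance, lawful access cannot be demonstrated.
    if record.source is Source.UNKNOWN:
        return False, "provenance unknown"
    # Where the EU opt-out applies, a reservation blocks use unless licensed.
    if eu_scope and record.tdm_rights_reserved and not record.has_license:
        return False, "rights reserved, no license"
    return True, "ok"


if __name__ == "__main__":
    sample = [
        WorkRecord("Scanned print copy", Source.PURCHASED),
        WorkRecord("Shadow-library download", Source.PIRATE_LIBRARY),
        WorkRecord("Publisher site, opted out", Source.PUBLIC_WEB, tdm_rights_reserved=True),
    ]
    for record in sample:
        print(record.title, audit(record))
```

The checks run in order of severity, so a pirated copy is rejected before any opt-out or licensing question is reached, mirroring the point that an unlawful acquisition is not cured by a later permissible use.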
The ruling in Bartz v. Anthropic signals growing judicial awareness of the nuanced ways in which AI interacts with copyright law. While the court supported innovation through “fair use”, it firmly drew the line at piracy, a position that aligns with EU law’s more formalized safeguards.