
AI Copyright & Training Data Liability

The Copyright Battle Over AI#

At the heart of modern AI development lies a legal question worth billions: Can AI companies use copyrighted works to train their models without permission or payment?

The answer is shaping up to be: it depends on how the works were obtained, what the AI produces, and whether the output competes with the original.

Over 50 copyright lawsuits have been filed against AI developers. Three federal judges have ruled on fair use, with split results. The U.S. Copyright Office has weighed in. European courts have issued their first rulings. And the outcomes will determine not just AI companies’ liability, but whether anyone deploying AI systems faces secondary copyright claims.

The Fair Use Framework
#

U.S. copyright law permits “fair use” of copyrighted works without permission for purposes like criticism, comment, news reporting, teaching, scholarship, or research. Courts evaluate four factors:

  1. Purpose and character of the use (including whether transformative and commercial)
  2. Nature of the copyrighted work
  3. Amount and substantiality of the portion used
  4. Effect on the potential market for the original

AI companies argue training is transformative: the model learns patterns rather than making copies, and its outputs don’t compete with the originals. Rights holders counter that wholesale copying for commercial gain isn’t fair use, especially when AI outputs substitute for licensed content.

Landmark Cases
#

Bartz v. Anthropic (N.D. Cal., 2025)
#

The first definitive fair use ruling in an AI training case came on June 23, 2025, when Judge William Alsup issued a split decision in the class action against Anthropic.

The Claims: Authors alleged Anthropic infringed their copyrights by using their books to train Claude, including works obtained from piracy sources.

The Ruling:

Training Is Fair Use: Judge Alsup characterized AI training as “exceedingly transformative,” “spectacularly so,” and “quintessentially transformative”: “among the most transformative many of us will see in our lifetimes.”

He reasoned that authors cannot exclude others from using their works to learn. Training doesn’t copy for expressive purposes; it extracts statistical patterns to generate new content.

Format-Shifting Is Fair Use: Anthropic’s “destructive scanning” of lawfully purchased print books, stripping bindings and digitizing contents, was transformative because it facilitated storage and searchability without increasing copy numbers.

Pirated Copies Are NOT Fair Use: The court drew a critical line: Anthropic’s acquisition and retention of pirated copies to build a permanent digital library was not justified. The eventual transformative use for training could not “retroactively excuse the initial act of piracy.”

Settlement: On August 26, 2025, the parties announced a class-wide settlement. Terms remain confidential, but class counsel described the outcome as “historic” and beneficial to authors.

Key Takeaway: Training on lawfully obtained works is likely fair use. Training on pirated works is not. AI developers must document provenance of training data.

Kadrey v. Meta (N.D. Cal., 2025)
#

In a parallel ruling, Judge Vince Chhabria granted Meta summary judgment on fair use claims brought by authors over Llama training.

The Limitation: Judge Chhabria expressed doubts about whether mass training qualifies as fair use and based his ruling primarily on authors’ failure to demonstrate market harm. The decision is narrower than Bartz and less enthusiastic about training’s transformative nature.

Thomson Reuters v. ROSS Intelligence (D. Del., 2025)
#

On February 11, 2025, Judge Bibas delivered the first ruling against an AI company on fair use grounds.

The Facts: ROSS Intelligence sought to create a legal research AI to compete with Westlaw. When Thomson Reuters declined to license its content, ROSS commissioned “Bulk Memos,” written by a third party, that contained Westlaw headnotes. ROSS used these to train its AI.

The Ruling: The court found ROSS’s copying was not fair use:

  • Factor 1 (Purpose): ROSS’s use was commercial and not transformative. Using headnotes as training data doesn’t transform them; it extracts their value.
  • Factor 4 (Market Effect): ROSS intended to create a “market substitute” for Westlaw, potentially undermining Thomson Reuters’ licensing market.

Critical Distinction: The court emphasized ROSS’s AI was “not generative AI”; it retrieved existing content rather than creating new material. Whether generative AI training would be treated differently remains open.

Implications: Training AI on copyrighted works to create direct competitors faces significant fair use hurdles, particularly when the training data source declined to license.

New York Times v. OpenAI/Microsoft (S.D.N.Y., ongoing)
#

The highest-profile AI copyright case continues to advance through federal court.

The Claims: The New York Times alleges OpenAI and Microsoft used millions of its articles to train ChatGPT and Bing without permission, creating a “market substitute” for its journalism that damages both subscription revenue and advertising.

Key Ruling (March 2025): Judge Sidney Stein rejected OpenAI’s motion to dismiss, allowing the core copyright infringement claims to proceed. Discovery is ongoing.

The Evidence Dispute: The Times claims ChatGPT can regurgitate articles verbatim when prompted. OpenAI argues the Times manipulated prompts to force specific outputs. OpenAI has resisted turning over user conversation logs, arguing privacy concerns.

Stakes: The Times seeks “billions of dollars” in damages. A ruling against OpenAI could require licensing agreements with publishers industry-wide.

Disney/NBCUniversal v. Midjourney (N.D. Cal., filed June 2025)
#

Hollywood studios filed their first major lawsuit against an AI company on June 11, 2025.

The Claims: Disney and Universal allege Midjourney operates as “a virtual vending machine, generating endless unauthorized copies” of their characters, including Darth Vader, Elsa, Bart Simpson, Shrek, and the Minions.

Financial Context: Midjourney had $200 million in revenue in 2023 and reportedly $300 million in 2024, with 21 million users.

Midjourney’s Response: In August 2025, Midjourney moved to dismiss, arguing fair use and claiming the studios themselves use generative AI tools.

Status: Pending. The case will test whether image generators that produce recognizable copyrighted characters face different treatment than text models.

RIAA v. Suno/Udio (D. Mass. & S.D.N.Y., filed June 2024)
#

On June 24, 2024, major record labels filed landmark lawsuits against AI music generators.

The Claims: Sony, UMG, and Warner allege Suno and Udio trained on copyrighted recordings without permission. The RIAA claims the AI systems produce outputs strikingly similar to Michael Jackson’s “Billie Jean,” the Beach Boys’ “I Get Around,” Mariah Carey’s “All I Want For Christmas Is You,” and others, and even reproduce producer tags.

Damages Sought: Up to $150,000 per work infringed, potentially billions of dollars.
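The path from a per-work ceiling to “billions” is simple statutory-damages multiplication. A minimal sketch, assuming a hypothetical work count for illustration (the per-work maximum reflects the willful-infringement ceiling under 17 U.S.C. § 504(c); the catalog size is not from the complaint):

```python
STATUTORY_MAX_PER_WORK = 150_000  # willful-infringement ceiling, 17 U.S.C. § 504(c)(2)
works_in_suit = 10_000            # hypothetical number of registered works at issue

# Maximum statutory exposure scales linearly with the number of works infringed.
exposure = STATUTORY_MAX_PER_WORK * works_in_suit
print(f"Maximum statutory exposure: ${exposure:,}")  # $1,500,000,000
```

Even a modest fraction of a major label catalog pushes the theoretical ceiling past a billion dollars, which is why per-work statutory damages dominate settlement dynamics in these cases.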

The Companies’ Defense: Suno’s CEO claims the technology is “transformative” and “designed to generate completely new outputs, not to memorize and regurgitate pre-existing content.”

Status: Pending. The first music industry test of AI training fair use.

Concord Music Publishers v. Anthropic (N.D. Cal., ongoing)
#

Music publishers sued Anthropic in October 2023, alleging Claude reproduces copyrighted song lyrics when prompted.

December 2024 Agreement: Anthropic agreed to maintain guardrails preventing lyrics output while contesting the underlying claims.

March 2025 Ruling: Judge Eumi Lee denied a preliminary injunction, finding publishers hadn’t proven market harm because the guardrails address the concern. However, the lawsuit continues on the underlying infringement claims.

Significance: The case may establish whether implementing guardrails after a lawsuit reduces liability exposure, a question relevant for any AI company facing copyright claims.

International Developments
#

GEMA v. OpenAI (Munich Regional Court, November 2025)
#

The first European ruling against an AI developer came from Germany on November 11, 2025, when the Munich Regional Court ruled in favor of GEMA, Germany’s music collecting society.

The Claims: GEMA alleged OpenAI used German song lyrics, including “Atemlos” and “Männer”, to train GPT-4 without a license.

Key Findings:

  • Training data becomes embedded in model weights through “memorization” and remains retrievable
  • OpenAI is responsible as the developer/operator, not end users
  • The EU text and data mining exception doesn’t justify outputs that replicate original works
  • OpenAI’s non-profit arguments failed; its commercial subsidiaries preclude that exemption

Remedies: OpenAI must compensate GEMA for damages including unpaid royalties. The company must cease storing unlicensed German lyrics on German infrastructure. The judgment must be published in a local newspaper.

Status: OpenAI is appealing. The decision may reach the Court of Justice of the European Union.

Implications: AI companies deploying in the EU face heightened risk if they cannot demonstrate lawful sourcing of training data. The ruling conflicts with U.S. Bartz reasoning on transformative use.

U.S. Copyright Office Guidance#

On May 9, 2025, the U.S. Copyright Office released its third and final report on AI and copyright, focusing on training and fair use.

Key Findings:

  • AI training is not inherently transformative simply because it’s for non-expressive purposes
  • “Making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries”
  • Model weights themselves may infringe reproduction and derivative work rights where outputs are substantially similar to training inputs
  • The report supports development of licensing solutions

Limitations: The report isn’t binding law. Courts will make final determinations. The report was released amid political turmoil, one day after the Librarian of Congress was dismissed and one day before the Register of Copyrights was dismissed.

Practical Effect: The Copyright Office position provides ammunition for rights holders in litigation but doesn’t resolve the fair use question.

The Liability Chain
#

Copyright exposure extends beyond AI developers to anyone in the AI deployment chain.

Direct Infringement
#

AI Developers: Companies that train on copyrighted works without authorization face direct infringement claims for:

  • Reproduction (copying works into training datasets)
  • Derivative works (creating models that incorporate copyrighted elements)
  • Distribution (making infringing outputs available)

AI Users: Users who prompt AI systems to generate infringing content may face direct infringement liability for the outputs they create and distribute.

Secondary Liability
#

Contributory Infringement: AI providers may face contributory liability if they:

  • Know users are generating infringing content
  • Materially contribute to that infringement
  • Fail to take reasonable steps to prevent it

The Concord v. Anthropic guardrails agreement suggests implementing content filters may reduce secondary liability exposure.

Vicarious Liability: Companies that:

  • Have the right and ability to supervise infringing activity
  • Have a direct financial interest in the infringement

may face vicarious liability for user-generated outputs.

Enterprise Deployers
#

Organizations that deploy AI tools face potential liability for:

  • Generating content that infringes third-party copyrights
  • Using outputs commercially without verifying rights
  • Failing to implement reasonable safeguards

Risk Factors:

  • High-volume content generation increases infringement probability
  • Using AI outputs in commercial contexts (advertising, publishing) elevates exposure
  • Lack of human review before distribution compounds risk

Implications for AI Deployers
#

Due Diligence Requirements
#

Organizations using AI tools should:

1. Vendor Assessment:

  • Review AI vendor’s training data sources and documentation
  • Assess whether vendor has faced copyright litigation
  • Evaluate vendor’s indemnification provisions
  • Understand what content filters the vendor implements

2. Use Case Analysis:

  • Identify high-risk use cases (creative content, music, visual media)
  • Assess whether AI outputs could substitute for licensed content
  • Consider whether outputs might reproduce recognizable copyrighted elements

3. Output Review:

  • Implement human review for commercially published AI content
  • Check AI-generated text for verbatim reproduction
  • Screen AI-generated images for copyrighted characters or distinctive styles
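The verbatim-reproduction check in the list above can be partially automated before human review. A minimal sketch using a word n-gram overlap heuristic; the window size and threshold are illustrative screening parameters, not legal standards for substantial similarity:

```python
def ngram_overlap(candidate: str, source: str, n: int = 8) -> float:
    """Fraction of the candidate's word n-grams found verbatim in the source.

    A high score flags possible verbatim reproduction for human review;
    this is a screening heuristic, not a substantial-similarity test.
    """
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        return 0.0  # candidate shorter than the window: nothing to compare
    return len(cand & ngrams(source)) / len(cand)

# Illustrative threshold: route anything above it to human editorial review.
REVIEW_THRESHOLD = 0.3
```

A production pipeline would additionally normalize punctuation, compare against a corpus of protected sources rather than a single document, and log flagged outputs for the review workflow.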

Contractual Protections
#

Vendor Agreements:

  • Seek indemnification for copyright claims arising from AI outputs
  • Require vendors to maintain documentation of training data sources
  • Include representations regarding lawful data acquisition

A 2024 study found:

  • 88% of AI vendors impose liability caps limiting damages to subscription fees
  • Only 17% provide compliance warranties

Customer Agreements:

  • Disclaim warranties regarding third-party IP rights in AI outputs
  • Allocate responsibility for output verification to users
  • Include provisions addressing AI-generated content

Insurance Considerations
#

Many professional liability and E&O policies don’t clearly cover AI-generated copyright infringement. Organizations should:

  • Review existing coverage for AI-related exclusions
  • Consider supplemental coverage for AI liability
  • Document AI governance procedures for potential claims

See Insurance Coverage Analysis for detailed guidance.

Industry-Specific Considerations
#

Publishing and Media
#

Highest Risk:

  • AI-generated articles that reproduce copyrighted source material
  • Summarization tools that substitute for original reporting
  • Content mills using AI without editorial oversight

Mitigation:

  • Human editorial review before publication
  • Citation and attribution protocols
  • Licensing agreements for AI training data

Creative Industries
#

Highest Risk:

  • AI-generated images resembling copyrighted characters
  • AI music that reproduces recognizable melodies or sounds
  • AI writing that mimics distinctive authorial styles

Mitigation:

  • Content filters blocking recognizable IP
  • Style-based rather than character-based prompting
  • Output screening for copyrighted elements

Legal and Professional Services#

Highest Risk:

  • AI legal research reproducing copyrighted headnotes (per Thomson Reuters v. ROSS)
  • AI-generated documents incorporating copyrighted language
  • Client deliverables containing AI-generated content without disclosure

Mitigation:

  • Original drafting rather than AI generation of key content
  • Human review of all AI-assisted work product
  • Client disclosure regarding AI use

The Path Forward
#

Judicial Timeline
#

No additional summary judgment decisions on fair use in AI training are expected until summer 2026. Key cases to watch:

  • NYT v. OpenAI: Discovery ongoing; trial date TBD
  • RIAA v. Suno/Udio: First music training fair use test
  • Disney v. Midjourney: First major studio visual AI case

Legislative Developments
#

Federal: No comprehensive AI copyright legislation has passed. The Copyright Office report stops short of recommending specific legislation but emphasizes licensing solutions.

International: The EU AI Act doesn’t directly address training data copyright but imposes transparency requirements. The German GEMA ruling may influence EU-wide standards.

Licensing Evolution
#

Emerging Solutions:

  • Stock media companies offering AI training licenses
  • News organizations negotiating platform deals
  • Collective licensing organizations developing AI frameworks

Challenges:

  • Scale of training data requirements
  • Retroactive licensing for existing models
  • International coordination

Practical Takeaways
#

For AI Developers:

  1. Document training data provenance meticulously
  2. Avoid pirated sources: Bartz makes clear this isn’t fair use
  3. Implement output filters for recognizable copyrighted content
  4. Consider licensing for high-risk content categories
  5. Prepare for jurisdiction-specific compliance (especially EU)

For AI Deployers:

  1. Conduct vendor due diligence on training data sources
  2. Implement human review for commercial AI outputs
  3. Screen outputs for copyrighted elements before publication
  4. Review insurance coverage for AI-related IP claims
  5. Include appropriate provisions in vendor and customer contracts

For Rights Holders:

  1. Monitor AI outputs for reproduction of your works
  2. Document instances of verbatim or substantially similar outputs
  3. Consider the litigation landscape before filing (fair use rulings are mixed)
  4. Explore licensing opportunities with AI developers

The fair use question for AI training remains genuinely unsettled. Bartz and Kadrey favor AI developers; Thomson Reuters and GEMA favor rights holders. The NYT case may prove dispositive. Until then, prudent organizations will assume AI outputs carry copyright risk and implement appropriate safeguards.
