OpenAI: Copyrighted data ‘impossible’ to avoid for AI training

Ryan Daws is a senior editor at TechForge Media with over a decade of experience in crafting compelling narratives and making complex topics accessible. His articles and interviews with industry leaders have earned him recognition as a key influencer by organisations like Onalytica. Under his leadership, publications have been praised by analyst firms such as Forrester for their excellence and performance. Connect with him on X (@gadget_ry) or Mastodon.

OpenAI made waves this week with its bold assertion to a UK parliamentary committee that it would be “impossible” to develop today’s leading AI systems without using vast amounts of copyrighted data.

The company argued that advanced AI tools like ChatGPT require such broad training that excluding copyrighted material would be utterly unworkable.

In written testimony, OpenAI stated that between expansive copyright laws and the ubiquity of protected online content, “virtually every sort of human expression” would be off-limits for training data. From news articles to forum comments to digital images, little online content can be utilised freely and legally.

According to OpenAI, attempts to create capable AI while avoiding copyright infringement would fail: “Limiting training data to public domain books and drawings created more than a century ago … would not provide AI systems that meet the needs of today’s citizens.”

While defending its practices as compliant, OpenAI conceded that partnerships and compensation schemes with publishers may be warranted to “support and empower creators.” But the company gave no indication that it intends to dramatically restrict its harvesting of online data, including paywalled journalism and literature.

This stance has opened OpenAI up to multiple lawsuits, including from media outlets like The New York Times alleging copyright breaches.

Nonetheless, OpenAI appears unwilling to fundamentally alter its data collection and training processes, given the “impossible” constraints that voluntarily excluding copyrighted material would bring. The company instead hopes to rely on broad interpretations of fair use allowances to legally leverage vast swathes of copyrighted data.

As advanced AI continues to demonstrate an uncanny ability to emulate human expression, legal experts expect vigorous courtroom battles over infringement by systems intrinsically designed to absorb enormous volumes of protected text, media, and other creative output.

For now, OpenAI is betting against copyright maximalists in favour of near-boundless copying to drive ongoing AI development.

(Photo by Levart_Photographer on Unsplash)

