Attorneys representing The New York Times and Daily News claim OpenAI engineers mistakenly deleted data relevant to their lawsuit over the unauthorized use of the publications’ content for AI training.
In the fall, OpenAI provided two virtual machines for lawyers representing The Times and Daily News to search its AI training data for copyrighted content.
The publishers’ attorneys stated that since November 1, 2024, the publishers and their consultants have spent more than 150 hours reviewing OpenAI’s training data for possible copyright violations. However, on November 14, OpenAI engineers deleted the search results stored on one of the virtual machines, according to a declaration filed late Wednesday in the US District Court for the Southern District of New York.
A lawyer for The New York Times and other outlets says that Open AI engineers erased information produced during a search of training data for instances of the news organizations’ work: https://t.co/28AiTv9EQn pic.twitter.com/2u3UcbBncl
— Ben Mullin (@BenMullin) November 21, 2024
OpenAI attempted to recover the lost data and was partially successful. However, on November 19, 2024, the publishers determined that the recovered data, which lacked its original file organization and directory names, was unreliable and could not be used to trace where their copyrighted articles were used in OpenAI’s models.
On November 19, 2024, The New York Times and Daily News informed OpenAI of their plan to file a status letter updating the Court on issues related to inspecting OpenAI’s training data. They proposed filing the letter jointly, but OpenAI declined, according to the declaration.
The publishers’ counsel stated that the plaintiffs had to redo significant work because the recovered data was unusable. The counsel clarified that the publishers did not believe the deletion was intentional, but emphasized that OpenAI is better positioned to search its own datasets for potentially infringing content, as reported by TechCrunch. This led to the filing of a supplemental letter.
The Legal Battle Over AI Training Data
In December 2023, The New York Times sued OpenAI and Microsoft, accusing them of using its copyrighted articles to develop AI models without acquiring proper permissions.
OpenAI asserts that using publicly available data, including The New York Times and Daily News articles, to train models like GPT-4 qualifies as fair use. The company argues it isn’t required to license or pay for the data, even though it profits from the models built using it.
Meanwhile, OpenAI has secured licensing agreements with major publishers including the Associated Press, News Corp, and Dotdash Meredith, with reports suggesting Dotdash Meredith receives at least $16 million annually for its content.
If the case is not settled and OpenAI and Microsoft are found liable, they could face damages and be ordered to delete the data. The outcome may set legal precedents and lead to stricter AI regulations, affecting future negotiations and compensation models between publishers and tech companies.