OpenAI has been accused by many parties of training its AI on copyrighted content without a license. Now a new paper from an AI watchdog organization makes the serious allegation that the company has increasingly relied on non-public books it did not license to train more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on a lot of data (books, movies, TV shows and so on), they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it is simply drawing on its vast knowledge to produce an approximation. It isn’t arriving at anything new.
Although many AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have abandoned real-world data entirely. That is likely because training on purely synthetic data comes with risks, such as degrading a model’s performance.
The new paper, from the AI Disclosures Project, a nonprofit founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, concludes that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)
In ChatGPT, GPT-4o is the default model. O’Reilly does not have a licensing agreement with OpenAI, the paper says.
“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content […] compared to OpenAI’s earlier model GPT-3.5 Turbo,” the paper’s co-authors wrote. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”
The paper uses a method called DE-COP, first introduced in a 2024 academic paper, that is designed to detect copyrighted content in language models’ training data. Also known as a “membership inference attack,” the method tests whether a model can reliably distinguish human-authored text from paraphrased, AI-generated versions of the same text. If it can, that suggests the model may have prior knowledge of the text from its training data.
The paper’s co-authors (O’Reilly, Strauss and AI researcher Sruly Rosenblat) say they probed what GPT-4o, GPT-3.5 Turbo and other OpenAI models know about O’Reilly Media books published before and after the models’ training cutoff dates. They used 13,962 excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt was included in a model’s training dataset.
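To make the idea concrete, here is a minimal Python sketch of a multiple-choice test of this general kind. It is an illustration under stated assumptions, not the paper’s code: the `pick` function stands in for a query to a language model, and the helper names, the toy passages and the four-option format are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical interface: given candidate passages, return the index of the
# one the model believes is the verbatim, human-written original.
Picker = Callable[[Sequence[str]], int]
Item = Tuple[str, List[str]]  # (original passage, AI-written paraphrases)


def quiz_accuracy(items: Sequence[Item], pick: Picker, seed: int = 0) -> float:
    """Return the fraction of quizzes in which `pick` selects the original."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options = [original] + list(paraphrases)
        rng.shuffle(options)  # shuffle so position cannot give the answer away
        if options[pick(options)] == original:
            correct += 1
    return correct / len(items)


def memorizing_picker(memorized: Set[str]) -> Picker:
    """Toy stand-in for a model that has seen `memorized` during training."""
    def pick(options: Sequence[str]) -> int:
        for index, text in enumerate(options):
            if text in memorized:
                return index  # recognized a passage it was "trained" on
        return 0  # nothing recognized: fall back to a fixed, uninformed guess
    return pick


def toy_items(prefix: str, count: int) -> List[Item]:
    """Build made-up passages, each with three made-up paraphrases."""
    return [
        (f"{prefix} passage {i}",
         [f"{prefix} passage {i} (paraphrase {k})" for k in range(3)])
        for i in range(count)
    ]


seen_items = toy_items("pre-cutoff", 40)     # pretend these were in training
unseen_items = toy_items("post-cutoff", 40)  # pretend these were not
model = memorizing_picker({original for original, _ in seen_items})

seen_rate = quiz_accuracy(seen_items, model)
unseen_rate = quiz_accuracy(unseen_items, model)
print(f"pre-cutoff: {seen_rate:.2f}, post-cutoff: {unseen_rate:.2f}")
```

With four options per quiz, blind guessing lands near 25%, so a model that scores far higher on pre-cutoff passages than on post-cutoff ones shows the kind of gap this style of test looks for. A real study would replace the toy picker with queries to the model under test and would need far more controls.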
According to the paper’s results, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That holds even after accounting for potential confounders, the authors say, such as improvements in newer models’ ability to figure out whether text was human-authored.
“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” the co-authors wrote.
It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method is not foolproof and that OpenAI may have collected the paywalled book excerpts from users who copied and pasted them into ChatGPT.
Muddying the waters further, the co-authors did not evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It is possible that these models were not trained on paywalled O’Reilly book data, or were trained on less of it than GPT-4o was.
That being said, it is no secret that OpenAI, which has advocated for looser restrictions on developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has even gone so far as to hire journalists to help fine-tune its models’ outputs. That is a trend across the wider industry: AI companies are recruiting experts in areas such as science and physics to, in effect, have them feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing agreements in place with news publishers, social networks, stock media libraries and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright owners to flag content they do not want the company to use for training purposes.
Still, with OpenAI fighting several lawsuits in U.S. courts over its training data practices and treatment of copyright law, the O’Reilly paper isn’t the most flattering look.
OpenAI did not respond to a request for comment.