OpenAI’s models ‘memorized’ copyrighted content, new study suggests | TechCrunch


A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.

OpenAI is embroiled in lawsuits brought by authors, programmers, and other rights holders who accuse the company of using their works (books, codebases, and so on) to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that U.S. copyright law contains no carve-out for training data.

The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data “memorized” by models served behind APIs, such as OpenAI’s.

Models are prediction engines. Trained on large amounts of data, they learn patterns; that’s how they’re able to generate essays, photos, and more. Most outputs are not verbatim copies of the training data, but because of the way models “learn,” some inevitably are. Image models have been found to regurgitate screenshots from the movies they were trained on, while language models have been observed effectively plagiarizing news articles.

The method relies on words the co-authors call “high-surprisal,” that is, words that stand out as statistically unlikely in the context of a larger body of work. For example, the word “radar” in the sentence “Jack and I sat perfectly still with the radar buzzing” would be considered high-surprisal because it is statistically less likely than more common words to appear before “buzzing.”
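To make the idea concrete, here is a minimal sketch (not from the study) of how per-word surprisal could be scored using an openly available language model such as GPT-2 through the Hugging Face transformers library; the candidate words “engine” and “radio” are illustrative stand-ins chosen for this example, not comparisons taken from the paper.

```python
# Hedged illustration: score surprisal -log p(word | context) with GPT-2.
# A word is "high-surprisal" when its probability given the preceding context is low.
# Assumes the `transformers` and `torch` packages are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(context: str, word: str) -> float:
    """Return -log p(word | context), summed over the word's subword tokens."""
    ids = tokenizer.encode(context, return_tensors="pt")
    word_ids = tokenizer.encode(" " + word)  # leading space marks a word boundary for GPT-2
    total = 0.0
    for wid in word_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]          # next-token logits
        log_probs = torch.log_softmax(logits, dim=-1)
        total += -log_probs[wid].item()                # accumulate negative log-probability
        ids = torch.cat([ids, torch.tensor([[wid]])], dim=1)  # extend context with this token
    return total

context = "Jack and I sat perfectly still with the"
for candidate in ["radar", "engine", "radio"]:
    print(candidate, round(surprisal(context, candidate), 2))
# A rarer continuation such as "radar" should score higher surprisal than "engine" or "radio".
```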

The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to “guess” which words had been masked. If the models guessed correctly, the co-authors concluded, they likely memorized those snippets during training.
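As a rough illustration of that probing setup (a hedged sketch under stated assumptions, not the authors’ actual code), the snippet below masks a single word and asks an OpenAI model to fill it in via the official openai Python client; the prompt wording and helper function are assumptions made for this example.

```python
# Illustrative sketch of a masked-word guessing probe against an OpenAI model.
# Assumes the official `openai` Python package and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def guess_masked_word(masked_snippet: str, model: str = "gpt-4") -> str:
    """Ask the model to predict the single word hidden behind [MASK]."""
    prompt = (
        "One word in the following passage has been replaced with [MASK]. "
        "Reply with only the missing word.\n\n" + masked_snippet
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes guesses easier to compare
    )
    return response.choices[0].message.content.strip()

# Mask the high-surprisal word "radar" from the article's example sentence.
snippet = "Jack and I sat perfectly still with the [MASK] buzzing."
print(guess_masked_word(snippet))
# A correct guess of "radar" is treated as a hint that the passage was seen during training.
```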

An example of a model “guessing” a high-surprisal word. Image source: OpenAI

Based on the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted e-book samples called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively low rate.

Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the “contentious data” the models may have been trained on.

“In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically,” Ravichander said. “Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency across the whole ecosystem.”

OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that allow copyright owners to flag content they would prefer the company not use for training, it has lobbied several governments to codify “fair use” rules around AI training.

