OpenAI’s new reasoning AI models hallucinate more | TechCrunch

OpenAI recently launched its o3 and o4-mini AI models, which are state-of-the-art in many respects. However, the new models still hallucinate, or make things up. In fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, affecting even today’s best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn’t appear to be the case for o3 and o4-mini.

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models (o1, o1-mini, and o3-mini) as well as OpenAI’s traditional “non-reasoning” models, such as GPT-4o.

Perhaps more worrying, the ChatGPT maker doesn’t really know why this is happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations get worse as reasoning models are scaled up. o3 and o4-mini perform better in certain areas, including coding- and math-related tasks. But because they “make more claims overall,” they tend to produce “more accurate claims as well as more inaccurate/hallucinated claims.”

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That is roughly double the hallucination rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. o4-mini did even worse on PersonQA, hallucinating 48% of the time.
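For context, a hallucination rate on a benchmark like this is essentially the fraction of graded answers flagged as containing a fabricated claim. The sketch below illustrates that calculation; the data format and the toy grader are assumptions for illustration, not OpenAI’s actual PersonQA evaluation harness.

```python
# Hypothetical sketch of computing a hallucination rate on a QA benchmark.
# The data format and grader are illustrative assumptions, not OpenAI's code.

def is_hallucinated(answer: str, accepted_facts: set[str]) -> bool:
    """Toy grader: flag the answer if it contains none of the accepted facts."""
    return not any(fact.lower() in answer.lower() for fact in accepted_facts)

def hallucination_rate(results: list[tuple[str, set[str]]]) -> float:
    """Fraction of model answers flagged as hallucinated."""
    flagged = sum(is_hallucinated(answer, facts) for answer, facts in results)
    return flagged / len(results)

# Example: 3 answers, 1 flagged -> 33%, mirroring the o3 figure reported above.
results = [
    ("Ada Lovelace was born in 1815.", {"1815"}),
    ("Ada Lovelace was born in 1914.", {"1815"}),   # fabricated date
    ("She worked with Charles Babbage.", {"babbage"}),
]
print(f"{hallucination_rate(results):.0%}")  # prints "33%"
```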

Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took while arriving at an answer. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the numbers into its answer. While o3 has access to some tools, it can’t do that.

“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that standard post-training pipelines usually mitigate (but don’t fully erase),” Neil Chowdhury, a Transluce researcher and former OpenAI employee, said in an email to TechCrunch.

Transluce co-founder Sarah Schwettmann added that o3’s hallucination rate may make it less useful than it otherwise would be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team has been testing o3 in its coding workflows and found it to be a step above the competition. However, Katanforoosh says o3 tends to hallucinate broken website links; the model will supply a link that doesn’t work when clicked.

Hallucinations may help models arrive at interesting ideas and be creative in their “thinking,” but they also make certain models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn’t be pleased with a model that inserts factual errors into client contracts.

One promising way to improve model accuracy is to give models web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA. Potentially, search could improve reasoning models’ hallucination rates as well, at least in cases where users are willing to expose prompts to a third-party search provider.
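As a rough illustration of that grounding pattern, the sketch below fetches search snippets and includes them in the prompt before asking the model to answer. The `search_web` helper is a hypothetical stand-in for any search API, and this is not OpenAI’s built-in web search tooling; only the standard chat completions call is real.

```python
# A sketch of grounding a model's answer in web search results.
# `search_web` is a hypothetical placeholder for a search provider's API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_web(query: str) -> list[str]:
    """Placeholder: return a few text snippets from a search provider."""
    raise NotImplementedError("Plug in a real search API here.")

def answer_with_search(question: str) -> str:
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided search results. "
                        "If they don't contain the answer, say you don't know."},
            {"role": "user",
             "content": f"Search results:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Constraining the model to the retrieved snippets is what drives the accuracy gain: the model is asked to cite or decline rather than invent.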

If scaling up reasoning models does indeed continue to worsen hallucinations, the hunt for a solution will become all the more urgent.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” OpenAI spokesperson Niko Felix said in an email to TechCrunch.

Over the past year, the broader AI industry has pivoted to focus on reasoning models after techniques for improving traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of compute and data during training. Yet it seems that reasoning may also lead to more hallucinating, presenting a challenge.

Source link
