A new research paper from OpenAI asks why large language models like GPT-5 and the chatbots built on them still hallucinate, and whether anything can be done to reduce those hallucinations.
In a blog post summarizing the paper, OpenAI defines hallucinations as “reasonable but false statements produced by language models” and acknowledges that, despite improvements, hallucinations “still remain fundamental challenges for all large language models,” ones that will never be completely eliminated.
To illustrate the point, the researchers say that when they asked a “widely used chatbot” for the title of Adam Tauman Kalai’s Ph.D. dissertation, they got three different answers, all of them wrong. (Kalai is one of the paper’s authors.) They then asked about his birthday and received three different dates. Again, all of them were wrong.
How can a chatbot be so wrong, and sound so confident in its mistakes? The researchers believe that hallucinations stem in part from the pretraining process, which focuses on getting the model to correctly predict the next word without true-or-false labels attached to the training statements: “The model sees only positive examples of language and must approximate the overall distribution.”
“Spelling and parentheses follow consistent patterns, so errors there disappear with scale,” they write. “But arbitrary low-frequency facts, such as a pet’s birthday, cannot be predicted from patterns alone and hence lead to hallucinations.”
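The distinction can be made concrete with a toy sketch. The following is not the paper’s or OpenAI’s training code, just a minimal bigram language model with made-up example sentences; it shows that the pretraining objective only rewards reproducing the text the model sees, with no notion of whether a statement is factually true.

```python
# Toy sketch (illustrative, not OpenAI's training pipeline): a bigram language
# model trained only on "positive examples" of text. The loss measures how well
# the model reproduces the training text; there is no true/false label, so an
# arbitrary (possibly false) fact is learned exactly like a well-patterned one.
from collections import Counter, defaultdict
import math

corpus = [
    "the capital of france is paris",      # regular, well-patterned statement
    "the pet's birthday is june 3rd",      # arbitrary low-frequency "fact" (made up)
]

# Count next-word frequencies: the only supervision is the text itself.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def negative_log_likelihood(sentence):
    """Standard next-word prediction loss: lower is 'better', regardless of truth."""
    words = sentence.split()
    return -sum(math.log(max(next_word_prob(p, n), 1e-12))
                for p, n in zip(words, words[1:]))

for sentence in corpus:
    print(f"{negative_log_likelihood(sentence):6.2f}  {sentence}")
```

Both sentences end up with the same loss: the objective cannot tell the verifiable statement from the arbitrary one, which is the gap the paper says hallucinations grow out of.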
The paper’s proposed solution, however, focuses less on the initial pretraining process and more on how large language models are evaluated. It argues that current evaluations do not cause hallucinations themselves, but rather “set the wrong incentives.”
The researchers compare these evaluations to multiple-choice tests on which random guessing makes sense, because “you might get lucky and be right,” while leaving the answer blank “guarantees a zero.” On a four-option question, for instance, a blind guess still has an expected score of 25 percent, so guessing always beats abstaining when only accuracy counts.
“In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know.’”
The proposed solution, then, is similar to tests (such as the SAT) that include “negative [scoring] for wrong answers or partial credit for leaving questions blank to discourage blind guessing.” Similarly, OpenAI says that model evaluations need to “penalize confident errors more than uncertainty and give partial credit for appropriate expressions of uncertainty.”
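To see why that changes behavior, here is a minimal, illustrative sketch, not OpenAI’s evaluation code: it compares accuracy-only grading with a SAT-style rule, where the penalty of 1.0 for a wrong answer and the 0.25 credit for abstaining are made-up values chosen only for illustration.

```python
# Illustrative comparison of two grading schemes for a model that can either
# answer or say "I don't know". Scoring values are hypothetical.

def accuracy_score(correct, abstained):
    # Accuracy-only grading: a wrong answer and an abstention both score 0.
    return 1.0 if (correct and not abstained) else 0.0

def penalized_score(correct, abstained, wrong_penalty=1.0, abstain_credit=0.25):
    # SAT-style grading: wrong answers are penalized, abstaining earns partial credit.
    if abstained:
        return abstain_credit
    return 1.0 if correct else -wrong_penalty

def expected_score(p_correct, abstain, scorer):
    """Expected score of answering with probability p_correct of being right."""
    if abstain:
        return scorer(correct=False, abstained=True)
    return p_correct * scorer(True, False) + (1 - p_correct) * scorer(False, False)

for p in (0.9, 0.5, 0.2):
    for scorer in (accuracy_score, penalized_score):
        guess = expected_score(p, abstain=False, scorer=scorer)
        idk = expected_score(p, abstain=True, scorer=scorer)
        best = "guess" if guess > idk else "say 'I don't know'"
        print(f"{scorer.__name__:16s} p(correct)={p:.1f}  "
              f"guess={guess:+.2f}  abstain={idk:+.2f}  -> {best}")
```

Under accuracy-only grading, guessing is never worse than abstaining, even at 20 percent confidence, so the scoreboard rewards it; under the penalized scheme, the model only comes out ahead by answering when it is reasonably confident, which is exactly the incentive the paper argues evaluations should create.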
The researchers argue that it is not enough to introduce “a few new uncertainty-aware tests on the side.” Instead, “the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing.”
“If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess,” the researchers say.