AI models face a tradeoff: making confident errors versus admitting uncertainty. Image Source: ChatGPT-5

Why AI Language Models Hallucinate: OpenAI Explains the Challenge

Key Takeaways: OpenAI on AI Hallucinations and Evaluation Methods

  • OpenAI research finds that current AI evaluation methods reward guessing over acknowledging uncertainty, contributing to hallucinations.

  • Hallucinations are defined as false but plausible answers generated by language models, even for straightforward questions.

  • A comparison of gpt-5-thinking-mini and OpenAI o4-mini shows the tradeoff: higher accuracy often comes with higher error rates.

  • Abstaining from answers reflects humility and reduces hallucinations, but most benchmarks penalize it, favoring risky guesses.

  • OpenAI argues for revised evaluation metrics that penalize confident errors more than uncertainty, shifting incentives for model training.

What Hallucinations Are and Why They Matter

OpenAI defines hallucinations as instances where a language model produces confident but false answers. These can occur even on simple queries. For example, when researchers asked a widely used chatbot for the dissertation title of Adam Tauman Kalai, an author of the paper, the model gave three incorrect responses. When asked for his birthday, it gave three different wrong dates.

These errors highlight the problem: models often guess when uncertain instead of admitting they don’t know.

The Role of Current Evaluations in AI Hallucinations

According to OpenAI, hallucinations persist because standard evaluations measure only accuracy and reward guessing rather than honesty about uncertainty. Accuracy-based benchmarks create an incentive structure where guessing is better than abstaining.

The company compares this to multiple-choice testing. Guessing yields a chance of success, while abstaining guarantees zero points. As a result, models trained and scored under these conditions learn to prioritize guesses over saying, “I don’t know.”
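As a rough illustration of that incentive, here is a minimal Python sketch (not from OpenAI's paper) comparing the expected score of a blind guess with the expected score of abstaining under accuracy-only grading. The four-option question and the scoring values are assumptions chosen for the example.

```python
# Minimal sketch (not from OpenAI's paper): expected score of guessing versus
# abstaining on a four-option multiple-choice question under accuracy-only grading.
# The option count and scoring values are illustrative assumptions.

def expected_score(p_correct: float, reward_correct: float = 1.0,
                   penalty_wrong: float = 0.0) -> float:
    """Expected score of answering when the answer is right with probability p_correct."""
    return p_correct * reward_correct + (1.0 - p_correct) * penalty_wrong

p_blind_guess = 1 / 4   # chance of a blind guess being right on a 4-option question
abstain_score = 0.0     # accuracy-only benchmarks give no credit for "I don't know"

print(f"guess:   {expected_score(p_blind_guess):.2f}")  # 0.25
print(f"abstain: {abstain_score:.2f}")                  # 0.00
# Under accuracy-only grading, even a blind guess beats abstaining, so a model
# optimized against this metric learns to guess rather than admit uncertainty.
```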

OpenAI notes that abstaining aligns with its core value of humility and is preferable to providing confident misinformation. Its Model Spec states that it is better to indicate uncertainty or ask for clarification than to give confident information that may be incorrect. Most leaderboards, however, still rank models primarily on accuracy.

A Concrete Example: SimpleQA Evaluation

In the SimpleQA evaluation reported in GPT-5's system card, the two models handle uncertainty very differently.

o4-mini achieves slightly higher accuracy, but with a much higher error rate, because it rarely abstains. gpt-5-thinking-mini, by contrast, abstains far more often, which lowers its hallucination risk. Strategic guessing when uncertain can nudge accuracy scores upward, but it also inflates error rates, which is why even advanced AI models continue to hallucinate.
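To make the tradeoff concrete, the following sketch uses hypothetical counts (not the actual SimpleQA figures) to show how abstention, accuracy, and error rates relate: every extra guess moves a response out of the abstention column and into either the accuracy or the error column, and uncertain guesses mostly land in error.

```python
# Sketch with hypothetical counts (not the actual SimpleQA figures): how
# abstention, accuracy, and error rates relate for two answering policies.

def rates(correct: int, wrong: int, abstained: int) -> dict:
    total = correct + wrong + abstained
    return {
        "accuracy":   correct / total,    # right answers (higher is better)
        "error":      wrong / total,      # confident wrong answers, i.e. hallucinations
        "abstention": abstained / total,  # "I don't know" responses
    }

# Two hypothetical policies answering the same 100 questions:
rarely_abstains = rates(correct=25, wrong=70, abstained=5)   # guesses almost everything
often_abstains  = rates(correct=22, wrong=28, abstained=50)  # declines when unsure

print("rarely abstains:", rarely_abstains)  # small accuracy edge, large error rate
print("often abstains :", often_abstains)   # slightly lower accuracy, far fewer errors
# The three rates always sum to 1, so a higher accuracy score can coexist
# with a much higher hallucination rate.
```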

Why Next-Word Prediction Leads to Hallucinations

While evaluations explain why hallucinations persist, the origin of these errors lies deeper in how models are trained. In particular, hallucinations emerge from the pretraining process of large language models. During pretraining, a model’s task is to predict the next word in vast amounts of text, based purely on statistical patterns. Unlike traditional machine learning problems, this process has no “true/false” labels attached to statements. The model is never explicitly told which answers are correct or incorrect—it only sees fluent examples of language.

This creates a fundamental limitation: the model learns to replicate patterns of plausibility, not necessarily truth. As a result, it can generate text that sounds correct but is factually wrong.

For example, consistent rules like grammar, spelling, and punctuation are easy to master through pattern recognition, which is why large models rarely make mistakes with them. But rare factual details—such as a person’s exact birthday or the title of a dissertation—do not follow predictable linguistic patterns. With no clear signal in the training data, the model must “guess,” leading to errors that show up as hallucinations.
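A toy example can make this concrete. The bigram "model" below is nothing like a production LLM, and the three-sentence corpus is invented for illustration, but it shows the key point: the training signal is word-following frequency, and nothing in it distinguishes a true completion from a merely plausible one.

```python
# Toy illustration (not OpenAI's training setup): a bigram "language model"
# built from raw counts. It learns which words tend to follow which,
# but nothing in the objective marks statements as true or false.
from collections import Counter, defaultdict

corpus = (
    "the paper was published in 2021 . "
    "the paper was published in 2019 . "
    "the paper was written in english ."
).split()

# Count next-word frequencies for each word (this is the whole "training" signal).
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def p_next(prev: str, candidate: str) -> float:
    counts = next_counts[prev]
    total = sum(counts.values())
    return counts[candidate] / total if total else 0.0

print(p_next("was", "published"))  # ~0.67: frequent patterns are learned easily
print(p_next("in", "2021"))        # ~0.33: the true year and a wrong year
print(p_next("in", "2019"))        # ~0.33: look equally "plausible" to the model
```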

OpenAI compares this to an image recognition task. If millions of photos are labeled “cat” or “dog,” algorithms can learn to classify them accurately. But if the same photos were labeled with each pet’s birthday, the system would inevitably fail—birthdays are essentially random and unlearnable from the data alone. The same principle applies to language models: some facts simply cannot be derived from text patterns, no matter how large the dataset.
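The sketch below mirrors that analogy with synthetic data rather than photos (all labels and numbers are invented for illustration): the same feature vectors get one labeling that follows the features and one that is random with respect to them, and a simple pattern-based classifier succeeds on the first while staying near chance on the second.

```python
# Illustrative numpy sketch (not from the paper): the same "photos" (here, random
# feature vectors) under two labelings. A pattern-based learner does well when the
# label follows the features (cat vs. dog) and no better than chance when the label
# is random with respect to them (each pet's birthday).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 20
X = rng.normal(size=(n_samples, n_features))

species  = (X[:, 0] > 0).astype(int)             # learnable: tied to the features
birthday = rng.integers(0, 365, size=n_samples)  # unlearnable: independent of them

def nearest_centroid_accuracy(X, y, n_classes):
    """Train a nearest-centroid classifier on the first half, test on the second."""
    half = len(X) // 2
    Xtr, ytr, Xte, yte = X[:half], y[:half], X[half:], y[half:]
    centroids = np.stack([
        Xtr[ytr == c].mean(axis=0) if np.any(ytr == c) else np.zeros(X.shape[1])
        for c in range(n_classes)
    ])
    preds = np.argmin(((Xte[:, None, :] - centroids[None]) ** 2).sum(axis=-1), axis=1)
    return float((preds == yte).mean())

print("cat-vs-dog accuracy:", nearest_centroid_accuracy(X, species, 2))     # well above 0.5
print("birthday accuracy  :", nearest_centroid_accuracy(X, birthday, 365))  # roughly 1/365
```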

While later stages of training—such as fine-tuning and reinforcement learning—are meant to help models filter out implausible or unsupported statements, they cannot fully eliminate the errors rooted in pretraining. As a result, hallucinations persist, particularly when models are asked for low-frequency, highly specific facts.

This explains why hallucinations are not random glitches but predictable outcomes of the statistical process behind next-word prediction.

Proposed Solution: Rethinking Evaluation Metrics

OpenAI argues that the way language models are evaluated must be fundamentally rethought. At present, most widely used benchmarks focus on accuracy scores—rewarding correct answers without accounting for the cost of incorrect ones. This structure encourages models to guess even when uncertain, leading to higher hallucination rates.

The company suggests a straightforward fix: shift the incentive structure so that confidently wrong answers are penalized more heavily than uncertainty. In other words, models should lose more “points” for hallucinations than for abstaining from answering. At the same time, partial credit should be awarded when a model acknowledges limits by saying, “I don’t know” or by requesting clarification.
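A scoring rule in that spirit might look like the sketch below; the specific weights (full credit for correct answers, small partial credit for abstaining, a larger penalty for confident errors) are illustrative assumptions, not values proposed by OpenAI.

```python
# Sketch of an uncertainty-aware scoring rule in the spirit of the proposal.
# The weights (+1 correct, +0.25 abstain, -2 confident error) are illustrative
# assumptions, not values specified by OpenAI.

SCORES = {"correct": 1.0, "abstain": 0.25, "wrong": -2.0}

def expected_score(p_correct: float, answers: bool) -> float:
    """Expected score of answering versus abstaining, given the model's confidence."""
    if not answers:
        return SCORES["abstain"]
    return p_correct * SCORES["correct"] + (1.0 - p_correct) * SCORES["wrong"]

# With these weights, guessing only pays off when confidence exceeds 75%;
# below that threshold, the rational policy is to say "I don't know".
for p in (0.30, 0.50, 0.75, 0.90):
    print(f"confidence {p:.2f}: guess {expected_score(p, True):+.2f}, "
          f"abstain {expected_score(p, False):+.2f}")
```

Under such a rule, the incentive flips: a model that blurts out low-confidence answers loses points on average, while one that abstains when unsure is rewarded for its calibration.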

Importantly, OpenAI emphasizes that adding a handful of new uncertainty-aware benchmarks is not enough. Because leaderboards, model cards, and developer evaluations overwhelmingly emphasize accuracy, any secondary tests that reward humility remain sidelined. Instead, primary evaluation metrics must be redesigned across the board. If accuracy-based scoreboards continue to dominate, developers will keep training models that favor risky guesses over reliability.

OpenAI frames this shift as a practical path forward: by fixing how models are graded, the industry can accelerate adoption of hallucination-reduction techniques already known in research but underutilized in practice.

Common Misconceptions About AI Hallucinations

OpenAI’s research paper challenges several assumptions about hallucinations:

  • Claim: Better accuracy will eliminate hallucinations.
    Finding: Accuracy will never reach 100%, as some questions are inherently unanswerable.

  • Claim: Hallucinations are inevitable.
    Finding: They are not; models can abstain when uncertain.

  • Claim: Avoiding hallucinations requires a level of intelligence achievable only by larger models.
    Finding: Smaller models can recognize the limits of their knowledge more easily, which makes abstaining simpler.

  • Claim: Hallucinations are mysterious glitches.
    Finding: They result from statistical patterns in training and evaluation incentives.

  • Claim: Specialized hallucination evals are enough.
    Finding: Broader evaluation reform is required to change model incentives and reward uncertainty.

Q&A: OpenAI on AI Hallucinations

Q: What are AI hallucinations?
A: AI hallucinations are confident but false answers generated by language models.

Q: Why do language models hallucinate?
A: Models hallucinate because evaluation metrics reward guessing over admitting uncertainty.

Q: How does abstaining reduce hallucinations?
A: Abstaining avoids confident errors by allowing models to say “I don’t know” instead of risking a false answer.

Q: What did the SimpleQA evaluation show?
A: It showed that o4-mini guessed more often, achieving higher accuracy but a much higher error rate, while gpt-5-thinking-mini abstained more often, reducing hallucinations.

Q: How does OpenAI propose fixing evaluations?
A: By penalizing confident errors more than uncertainty and rewarding calibrated, cautious responses.

What This Means: Rethinking How AI Is Measured

OpenAI’s research reframes hallucinations as a problem not only of model design but also of how success is measured in AI development. Current accuracy-driven benchmarks reward lucky guesses and penalize humility, creating an ecosystem where models are pushed to sound confident rather than be reliable.

OpenAI argues that this incentive structure needs to change. By embedding uncertainty directly into evaluation systems, the AI industry could shift toward models that acknowledge their limits instead of overstating knowledge. This change would have broad implications for developers, users, and society at large.

For developers, it could mean building models that are better calibrated and less error-prone, ultimately improving trust in outputs. For users, it could mean more consistent interactions where AI systems are willing to admit uncertainty instead of offering misleading answers. And for society, it signals a step toward AI systems that are not just smarter but also more responsible, with evaluation methods aligned to real-world needs rather than artificial scoreboards.

In short, rethinking evaluation metrics may be just as important as scaling models themselves. If progress is measured in a way that values humility over risky confidence, the industry could accelerate the path toward AI systems that people can depend on safely and with confidence.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.
