In February 2023, Google unveiled Bard, the chatbot it would later rename Gemini, hoping to catch up with ChatGPT. In one of the promotional pieces, the chatbot claimed that the James Webb telescope had taken the very first pictures of a planet outside our solar system. It sounded impressive.
Within days, astronomers pointed out publicly that it was false: the first image of an exoplanet dates from 2004 and came from a ground-based telescope, long before James Webb launched. The model had stated a false fact in a demo Google had designed to look flawless.
The Verge ran the story with clinical precision: hallucinations reach even the launches that should be perfect. A few months later, ChatGPT invented case law in the Schwartz case. The problem isn't one model. It's structural.
The mechanism: why predicting isn't knowing
To understand hallucinations you have to understand an uncomfortable thing about how language models work.
An LLM — large language model — doesn't have an internal database of verified facts. It doesn't have a lookup table that says "Varghese v. China Southern Airlines: doesn't exist, don't cite". What it has is a statistical representation, built from hundreds of billions of words, of which word tends to come after which other word in which context.
When you ask a question, the model doesn't look up the answer. It generates a response token by token, choosing at each step a likely next token given the prior sequence. If the sequence leads to a pattern shaped like a legal citation (two names, a versus, a court), the model completes with that shape. The shape is right. The content can be real or invented. The model doesn't tell the two apart.
That's the root of the problem. There's no default "I don't know" mode because there's no internal module marking the difference between knowing and not knowing.
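A toy version of the mechanism, as a rough sketch rather than how a real LLM is built: a bigram counter trained on three made-up citations, completing greedily with the most frequent next token. The corpus, the names in it, and the `complete` helper are all hypothetical. Real models use neural networks and usually sample rather than always taking the top token, but the failure has the same shape.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus of citation-shaped strings (all names are made up).
corpus = [
    "Smith v. Jones , 2nd Cir. 2004",
    "Smith v. Acme Airlines , 2nd Cir. 2011",
    "Garcia v. Acme Airlines , 9th Cir. 2011",
]

# Count which token follows which token.
follows = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1


def complete(start, max_tokens=10):
    """Greedy completion: always append the most frequent next token."""
    out = start.split()
    for _ in range(max_tokens):
        options = follows.get(out[-1])
        if not options:
            break  # no known continuation for the last token
        out.append(options.most_common(1)[0][0])
    return " ".join(out)


completion = complete("Garcia v.")
print(completion)
print("appears in corpus:", completion in corpus)
```

The completion has the right shape (two parties, a court, a year), yet that exact citation is nowhere in the corpus. Nothing in the counting step records which combinations actually existed.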
Where it goes wrong most often
Six categories of errors where you should raise your guard:
Specific dates. "When was such-and-such agreement signed?" The model gives you a day and month. Sometimes it's right, sometimes it swaps the year.
Financial figures. Prices, revenues, valuations. Numbers move often, and the model sticks with old versions or invents updates.
Direct quotes. Sentences attributed to someone. Part exists, part is altered, and you can't tell which is which.
Code. A function that doesn't exist, a library that doesn't install with that command, syntax the language doesn't accept. Code is where you catch it fastest, because it simply doesn't run.
Math. Multi-step operations where it loses the thread. Modern tools mitigate this with code execution, but if you don't have that on, watch out.
Legal and medical topics. High-risk because they're areas where the plausibility of the format (article X, subsection Y, dose Z) masks the possibility that the content is false.
The rates you can actually cite
There's a public reference worth knowing: the Vectara Hallucination Leaderboard. It's a benchmark that measures how much each model hallucinates on a specific task — summarizing a document you hand it — and it gets updated periodically.
The numbers move with each model version, but the general picture of the last year is consistent: Claude, GPT-4, and Gemini all sit at low single-digit rates, with differences that in practice are smaller than the marketing suggests. Two years ago the differences were much larger. Today, picking the "best model" is no longer the main lever against hallucinations.
If you want to cite a specific number, cite the source — for example, "Claude 3.5 Sonnet scored X percent per Vectara, checked April 2026." Without a verifiable source, you don't put a number. It's a craft rule.
What works and what doesn't to reduce it
Three techniques have solid evidence behind them:
Native web search. Claude, ChatGPT, and Gemini today can search the internet while they respond. When you turn that on, the task shifts from "remembering" to "looking up", and the hallucination rate drops noticeably. It's the simplest and most effective technique.
RAG (Retrieval-Augmented Generation). A system where before answering, the model queries a specific database of yours — company manuals, contracts, internal documentation. Heavily used in enterprise applications. Cuts risk a lot but doesn't eliminate it: the model can still invent within the retrieved document.
Prompting with explicit instructions. Phrases like "if you aren't sure, say so", "cite the exact source", "separate what you know from what you're inferring". It helps, but it's the weakest of the three.
What doesn't work as well: asking the model "how confident are you?" and trusting its answer. Models have a calibration problem — they report similar confidence for things they know and things they invent. Useful as a signal, not as truth.
A practical verification protocol
This is what I use and what I recommend in consulting when an AI output is going to reach a client.
- Identify what's verifiable and what isn't. Verifiable: numbers, dates, laws, names, quotes, URLs. Not verifiable: opinions, interpretations, suggestions on phrasing.
- For what's verifiable, apply two steps. First, ask the model to search the web and cite sources. Second, open the sources and check them. If a source won't open, or doesn't say what the model claims, treat the claim as invented until you confirm it elsewhere.
- For what isn't verifiable, use your judgment. Your experience and your knowledge of the domain. The AI proposes; you decide.
- Never sign, send, or file anything you haven't reviewed. Even when the AI sounds confident. Schwartz learned that at the cost of a court sanction and a very public embarrassment.
- When the cost of an error is high, ask two different models. It takes five extra minutes, and any point where the two answers disagree is where to verify first.
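The first step of the protocol, separating what's verifiable, can be partly mechanized. The sketch below is one possible approach, not a complete one: the regular expressions and the `verification_checklist` helper are assumptions made for illustration, and they only flag candidates for a human to check. They will miss names and laws, which need your own reading.

```python
import re

# Hypothetical patterns for items a human should check by hand.
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?(?:%|percent)"),
    "quote": re.compile(r'"([^"]+)"'),
}


def verification_checklist(text):
    """Return {kind: [matches]} for every pattern that hits the text."""
    found = {}
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found


# Invented draft text standing in for an AI output.
draft = (
    'Revenue grew 12.5% in 2023, and the CEO said "we expect more" '
    "according to https://example.com/report and other sources."
)
checklist = verification_checklist(draft)
for kind, items in checklist.items():
    print(kind, items)
```

Each entry in the checklist is something to open, look up, or confirm before the text goes to a client. An empty checklist does not mean the text is safe, only that these four patterns found nothing.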
To close
Hallucinations aren't a defect of one version. They're the natural shape of a system that learned to complete patterns before it learned to say "I don't know". They'll improve over time — Dario Amodei said in 2024 that "hallucinations won't be eliminated entirely, they'll become rarer and more verifiable" — but they won't disappear.
That means the verification protocol is part of the craft of using AI, not an accessory. The good news is the protocol isn't complicated: one of the most effective moves is to ask it to search the web and cite sources.
If you want to go deeper into the full workflow, The café method is the next step. If you're just starting out and want to understand what a prompt is and how to write one well, start with What a prompt is.
Where in your work can't you afford an AI error — and what protocol did you build to protect yourself there?
In September 2025, a group of OpenAI researchers published a short paper that is required reading: Why Language Models Hallucinate. The text has the rare virtue, in recent technical literature, of admitting the problem has no clean solution. It describes the mechanism that generates hallucinations, shows that known mitigations are partial, and openly acknowledges that the industry needs to live with the problem rather than promise to eliminate it. Dario Amodei has said something in the same vein in a public interview: hallucinations won't be eliminated entirely, they'll become rarer and more verifiable.
The thesis of this piece is that understanding why there's no clean solution — and why that isn't a failure but a structural property — is the step that separates the mature user from the naive evangelist.
Autoregression and the maximum-likelihood problem
An LLM is trained with one concrete objective: given a sequence of tokens, predict the probability distribution of the next token. Training uses maximum likelihood — adjust the weights so tokens actually observed in the corpus receive higher probability. It's an efficient procedure with solid mathematical theory behind it. It produces models that complete text remarkably well.
The flip side is that this procedure contains no mechanism to distinguish "this sequence is factually correct" from "this sequence appears often in the corpus and therefore has high probability". To the model they're indistinguishable. If the corpus contains many biographies with a certain structure, the model learns that structure. If you ask for the biography of a real person whose details aren't well represented in the corpus, the model produces something with the right structure and invented content that fits it.
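In standard notation, the training objective described above is the negative log-likelihood of the corpus. The symbols are introduced here for illustration: \theta for the model weights, \mathcal{D} for the corpus, x_t for the token at position t, and x_{<t} for the tokens before it.

```latex
\mathcal{L}(\theta) = -\sum_{x \in \mathcal{D}} \sum_{t=1}^{|x|} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Every term rewards assigning high probability to what appears in \mathcal{D}. No term refers to whether a sequence is true.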
Hallucination isn't a model error. It's the expected behavior of a system whose only learning signal is "how likely is this sequence?" and which operates in a domain where corpus likelihood doesn't equal real-world truth.
No ground truth and the evaluation problem
There's an additional complication the OpenAI paper emphasizes. In many domains — including the ones that matter most in professional applications — there's no labeled ground truth at scale.
To train an image classifier you can label millions of photos with "cat" or "not cat". To train a language model that "doesn't hallucinate" you'd need to label millions of responses with "true" or "false", which requires costly human verification case by case and doesn't scale to the volume modern models need. The alternative — using heuristics to generate labels automatically — drags the heuristics' errors into the trained model.
This means the problem doesn't get solved with "more data". It gets partially solved with better-curated data and evaluation processes that add a truthfulness signal on top of the likelihood signal. But that layer is always incomplete.
RLHF and the perverse incentive
On top of autoregressive training, modern models apply reinforcement learning from human feedback — RLHF. Humans rank responses, a reward model is trained on those rankings, and the model is tuned to maximize the reward.
The intent is to orient the model toward useful, safe, and honest responses. The observed side effect — documented by OpenAI itself — is that human annotators tend to prefer responses that sound confident and complete over responses that admit uncertainty. "Here are the three relevant cases" ranks higher than "I'm not sure there are relevant cases for this jurisdiction". The reward model learns that preference. The trained model learns to sound confident even when it isn't.
This is sometimes described as the confident hallucinator problem: a model that combines high technical capability with a training incentive rewarding the appearance of certainty. It isn't a bug; it's a consequence of how the reward function was defined.
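A toy illustration of how that preference ends up inside a reward model. This is a deliberately minimal sketch, not how production RLHF is implemented: it fits a Bradley-Terry style reward with a single feature (whether the answer hedges), and the 80/20 annotator split is invented for the example.

```python
import math

# Hypothetical annotation data: each item is one comparison between a confident
# answer and a hedged one. True means the annotator preferred the confident one.
# The 80/20 split is invented for illustration.
confident_wins = [True] * 80 + [False] * 20


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_hedging_weight(outcomes, steps=2000, lr=0.5):
    """Fit a one-feature Bradley-Terry reward: reward = w * is_hedged.

    P(confident preferred) = sigmoid(reward_confident - reward_hedged)
                           = sigmoid(0 - w) = sigmoid(-w).
    Uses gradient ascent on the mean log-likelihood of the observed outcomes.
    """
    w = 0.0
    n = len(outcomes)
    for _ in range(steps):
        p = sigmoid(-w)  # predicted chance the confident answer wins
        # Each confident win contributes -(1 - p) to the gradient,
        # each hedged win contributes +p.
        grad = sum(-(1.0 - p) if win else p for win in outcomes) / n
        w += lr * grad
    return w


w_hedged = fit_hedging_weight(confident_wins)
print(f"learned reward for hedging: {w_hedged:.3f}")
```

The learned weight comes out negative: under these assumed preferences, admitting uncertainty is scored as worse. A model tuned to maximize that reward is pushed away from hedging, regardless of whether the hedge was warranted.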
Constitutional AI, the method Anthropic uses, tries to mitigate this by training the model against explicit written principles, including acknowledging uncertainty when appropriate. It works partially: Claude has a reputation for being one of the models that most often says "I don't know". But even with Constitutional AI, the underlying incentive of the autoregressive training objective is still there.
Why RAG helps but doesn't solve
Retrieval-Augmented Generation is the most popular mitigation against hallucinations in enterprise applications. The idea is simple: before responding, the model queries a specific database (internal documentation, manuals, verified information repositories), retrieves the relevant fragments, and generates the response conditioned on those fragments.
RAG reduces the hallucination rate measurably. The premise is that a model with verified information in context is less likely to invent. In practice it works — with three limitations worth knowing.
First, the model can hallucinate within the retrieved document. It can attribute something to a source that's in the context even though the source doesn't say it. Second, if retrieval fails and brings back an irrelevant fragment, the model can generate a plausible-but-wrong response by combining the fragment with what it "thinks" it knows. Third, quality depends entirely on the database — if the base has errors, the model amplifies them.
RAG is a real improvement, not a solution. The same model that with RAG produces more reliable responses can still invent citations inside the retrieved content if you don't instrument it with explicit verification.
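To make the retrieve-then-generate flow and the "explicit verification" step concrete, here is a small sketch under heavy simplifying assumptions. Retrieval is plain word overlap rather than embeddings, no real model is called (the two answers are hand-written stand-ins), and the document names, prompt wording, and helper functions are all hypothetical.

```python
import re

# Hypothetical knowledge base; in practice this would be your own documents.
DOCUMENTS = {
    "refunds.txt": "Refunds are issued within 14 days of a written request.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days.",
}


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question):
    """Return (name, text) of the document sharing the most words with the question."""
    q = tokenize(question)
    return max(DOCUMENTS.items(), key=lambda item: len(q & tokenize(item[1])))


def build_prompt(question, passage):
    """Condition the model on the retrieved passage (prompt wording is an assumption)."""
    return (
        "Answer using only the passage below. Quote it verbatim.\n"
        f"Passage: {passage}\nQuestion: {question}"
    )


def unsupported_quotes(answer, passage):
    """List quoted spans in the answer that do not appear verbatim in the passage."""
    return [q for q in re.findall(r'"([^"]+)"', answer) if q not in passage]


question = "How many days do refunds take?"
name, passage = retrieve(question)
prompt = build_prompt(question, passage)

# Two hand-written stand-ins for model output; no real model is called here.
grounded = 'The policy says "within 14 days of a written request".'
invented = 'The policy says "within 30 days, no questions asked".'

print(name)
print(unsupported_quotes(grounded, passage))
print(unsupported_quotes(invented, passage))
```

The verbatim check catches the first limitation, a quote attributed to the source that the source doesn't contain. It does nothing for the other two: a bad retrieval or an error in the database passes straight through.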
The calibration problem
One of the least understood — and most important — aspects of hallucinations is the calibration problem.
A well-calibrated model would be one that, when it reports 80 percent confidence, is actually right 80 percent of the time. Current LLMs are systematically miscalibrated. They report similar confidence for claims with high probability of being correct and for claims they invented. This is what the literature calls overconfidence.
The practical consequence is you can't trust the model's self-assessment. If you ask "how sure are you?" and it says "95 percent", that doesn't equate to a real 95 percent probability of correctness. You can treat the response as a weak signal — a "5 percent" probably indicates real uncertainty, a "95 percent" guarantees you nothing — but not as a calibrated probability.
There's active research on calibration techniques — fine-tuning with confidence rankings, ensembles, pairing with external verifiers — but none is mature enough to replace human verification in high-stakes applications.
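If you log what a model says about its own confidence alongside whether it turned out right, you can measure the gap yourself. The sketch below uses invented numbers and a simplified version of expected calibration error, where each distinct stated confidence is its own bucket. The data and function names are assumptions for illustration.

```python
# Hypothetical evaluation log: (confidence the model stated, whether it was right).
# The numbers are invented to illustrate overconfidence.
results = (
    [(0.95, True)] * 6 + [(0.95, False)] * 4    # says 95%, right 60% of the time
    + [(0.60, True)] * 5 + [(0.60, False)] * 5  # says 60%, right 50% of the time
)


def calibration_table(records):
    """Group answers by stated confidence and compute observed accuracy per group."""
    groups = {}
    for confidence, correct in records:
        groups.setdefault(confidence, []).append(correct)
    return {conf: sum(hits) / len(hits) for conf, hits in groups.items()}


def expected_calibration_error(records):
    """Weighted average gap between stated confidence and observed accuracy."""
    table = calibration_table(records)
    total = len(records)
    return sum(
        abs(conf - acc) * sum(1 for c, _ in records if c == conf) / total
        for conf, acc in table.items()
    )


table = calibration_table(results)
ece = expected_calibration_error(results)
for conf, acc in table.items():
    print(f"stated {conf:.0%} -> observed {acc:.0%}")
print(f"calibration error: {ece:.3f}")
```

A well-calibrated model would show stated and observed values close together and an error near zero. In this made-up log, the 95 percent bucket is the one that misleads most.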
Recent contributions that mitigate without solving
The latest model generation incorporates several techniques that reduce — but don't eliminate — hallucinations.
Native chain-of-thought. Models like o1, o3, and Claude with extended thinking reason step by step before responding. This reduces errors in math and chained reasoning, but doesn't resolve factual hallucinations because the step-by-step reasoning can rest on invented premises.
Integrated native web search. Claude, ChatGPT, and Gemini can query the internet while they respond. It's probably the most effective mitigation available to end users today: it transfers the burden from "remembering" to "looking up live sources". It doesn't eliminate the problem — the AI can summarize a source badly or cite it wrong — but it reduces it significantly.
External verifiers. Systems where one model generates the response and a second model (or the same one run with a different prompt) verifies factual claims against sources. More robust but slower and more expensive.
Tool use. Modern models can invoke calculators, code executors, databases. This removes most of an entire category of errors, arithmetic, by transferring the computation to a deterministic system. The model still has to set up the calculation correctly.
No single technique solves the problem. Combined, they reduce the effective rate significantly, but not to zero.
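As an illustration of the tool-use idea, here is what the deterministic side of a calculator tool might look like. It is a sketch under assumptions: the `calculate` function and its allowed operators are hypothetical, and how a given platform wires a tool to the model is not shown.

```python
import ast
import operator

# Operators the calculator tool is allowed to apply.
OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate(expression):
    """Evaluate a plain arithmetic expression deterministically, without eval()."""

    def visit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](visit(node.operand))
        # Anything else (names, calls, attributes) is rejected.
        raise ValueError("unsupported expression")

    return visit(ast.parse(expression, mode="eval").body)


result = calculate("1234 * 5678 + 91011")
print(result)
```

The model's job shrinks to writing the expression; the arithmetic itself is no longer predicted token by token. If the model writes the wrong expression, the tool will compute the wrong answer exactly.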
Editorial thesis: the problem is in the sales pitch, not the tech
I'll close with a thesis that goes past reporting.
The AI industry got it wrong by selling these models as sources of truth. They're generators of probable responses. They're remarkably useful for drafts, analysis, synthesis, writing, code, and many other tasks where "probable and coherent" is enough. They're inadequate, used without a protocol, for tasks where "true" is a requirement.
That category error is the root of the Schwartz case and the James Webb demo. A lawyer assumed ChatGPT worked like a case-law database. A company assumed its chatbot could produce factually correct ad content without review. Both premises are wrong, and the industry encouraged them by positioning its products as "the future of search" or "your personal assistant that knows everything".
The responsibility for mitigation is shared across three layers:
The user has to understand these tools aren't encyclopedias and build their verification protocol. It's the first and main filter.
The company (the one buying or integrating AI into its products) has to design flows where AI output goes through human review before reaching an external client or a critical process. It's not optional.
The platform (OpenAI, Anthropic, Google) has to keep investing in calibration, integrated web search, verifiers, and interpretability — and has to communicate the limits honestly. Anthropic is the one doing the most of this today. The others are improving on that dimension but from a worse starting point.
User maturity is the main mitigation. It's the only one under your direct control and the one with the biggest impact on actual risk in your work. Everything else — which model you pick, which platform you use, the sophisticated prompting techniques — is secondary if you don't have the basic reflex to verify.
A concrete and simple way to start: ask it to search the web and cite sources. Then open the sources and check them. That's the floor of the craft.
What's your personal empirical test for deciding when an AI output needs verification before being delivered, and when you can trust it as is?