Last Friday a client sent me a screenshot of a table with four columns of numbers and two little green arrows, comparing GPT-4o and Claude. The message said: "Which one is better? The table shows this one wins three out of four."
I spent forty minutes on the phone explaining that those numbers mislead more than they appear to. Not because the company is lying with malice. Because whoever writes the test picks what to ask, how to grade it, and which result to show you.
What benchmarks are, without the jargon
Think of an AI as a student. To know if she's any good, people give her tests. Those tests are called "benchmarks" — that's just the technical name.
There are tests for general knowledge (history, biology, law), for math, for code, for reasoning. The AI answers, you count the correct ones, out comes a percentage. Sounds objective.
The trouble: humans write the tests, other AIs sometimes grade them, and the companies selling the AIs know exactly which test is going to get administered. Imagine a school where the teacher announces a month in advance what will be on the exam and, on top of that, the student can study the exact exam beforehand. That's the benchmark market today.
The traps worth knowing
Contamination. AIs learn by reading the internet. If the exam questions were already on the internet when the model learned, it memorized them. That's taking the test after seeing the answers.
Teaching to the test. When a company knows which benchmark it'll be judged on, it trains specifically for that test. The AI comes out brilliant there and mediocre everywhere else.
Cherry-picking. The company shows you the three benchmarks it won and hides the seven it lost. Pick the ripe cherries, toss the rest.
Sneaky versions. The headline says "beats GPT-4 by 15%." What it doesn't say: the comparison is against the original GPT-4, not GPT-4o, which came out more than a year later and is quite a bit better.
The five that still work
MMLU. Thousands of multiple-choice questions across 57 subjects. Saturated today — frontier models score between 85 and 92. It works as a floor: any AI that doesn't clear 75 isn't serious.
GPQA Diamond. PhD-level questions in hard sciences, designed so you can't solve them by Googling. It's the hard exam of the moment.
SWE-Bench Verified. The AI gets a real GitHub bug and the test measures if it actually fixes it. It's the closest thing to "does actual programmer work."
Chatbot Arena. Two anonymous models answer, a person votes which they prefer. Thousands of votes, a chess-style ranking. Hardest to game because the voter doesn't know which AI they're rating.
Your own work week. The only definitive test. Doesn't appear in any paper.
What to take away
Three ideas to read the next benchmark table without buying smoke:
- Numbers don't lie. The people picking them can. When you see "beats X by Y percent," ask what test, who wrote it, when it was measured, against which exact version. No footnote with those answers? Suspect.
- Chatbot Arena is worth more than almost any official table. Because the voters are random humans who don't know what model they're ranking. When a company reports an in-house metric and ends up lower on Arena, believe Arena.
- The test that actually matters is a week of real work. Ask the AI for the things you ask yourself for. See if it saves time or doubles the work. That data isn't in any table — and it's the only one that decides.
In June 2024, a team of independent researchers affiliated with Scale AI published results on GPQA Diamond, a benchmark of PhD-level questions in physics, chemistry, and biology, written so that Google wouldn't help. When they ran GPT-4 against that test under identical conditions, the model scored 12 points lower than OpenAI had reported in its own launch materials a few months earlier.
The gap wasn't statistical noise. It was the signal of a problem the industry has been carrying for a while and that's worth naming directly: benchmarks have owners, and when the owner is the same party selling the model, the number moves.
This piece is so that the next time an AI provider shows you a table and says "we're better," you know how to read between the rows.
The five benchmarks still worth looking at
MMLU (Massive Multitask Language Understanding). Designed in 2020 by Berkeley academics. 14,000 multiple-choice questions across 57 disciplines: history, law, medicine, math, ethics. In 2020, GPT-3 scored about 44%. By 2024, frontier models were pushing 90%. The benchmark is saturated: when every model sits between 85 and 92, the metric stops distinguishing anything. Today it works as a floor, not a ceiling: if a model doesn't clear 75, skip it.
GPQA Diamond. Published in 2023. PhD-level questions in hard science, calibrated so that a human with Google access still struggles. It's the current hard exam: the one separating models with genuine reasoning from those that memorize well. Being newer, it's also less contaminated.
SWE-Bench Verified. The most honest benchmark in the code space. The AI gets a real GitHub issue with an attached repository and the test measures whether it fixes the bug. The Verified version (August 2024) is a manually filtered set of the 500 best-constructed tasks from the 2,294 originals, which removes ambiguity in the success criterion. It's the closest thing to measuring "does actual programmer work."
Chatbot Arena (lmarena.ai). Two anonymous models answer the same prompt, the user votes which they preferred without knowing which is which. Thousands of votes produce an Elo-style ranking. It's the hardest benchmark to inflate for a structural reason: there's no predefined correct answer to memorize, and the voters are random humans. Its known biases — longer answers win more, prompts tend to be short — are minor problems compared to the fragility of everything else.
Real tasks from your own work. The only definitive test. It's your private eval. More on this in two minutes.
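To make the "Elo-style ranking" behind Chatbot Arena less abstract, here is a minimal sketch of how a list of blind votes can turn into a ranking. It uses the textbook Elo update; the K-factor of 32, the starting rating of 1000, and the model names are assumptions for illustration, and Arena's actual pipeline may differ in its details.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_votes(votes, k: float = 32.0, start: float = 1000.0) -> dict:
    """Apply one Elo update per vote.

    votes: iterable of (winner, loser) model names, in the order they were cast.
    k and start are illustrative defaults, not Arena's published settings.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)  # an upset win moves ratings more
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical blind votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ratings = elo_from_votes(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

The point of the sketch is structural: nothing in the update looks at a "correct answer," only at which response a person preferred, which is why there is nothing for a model to memorize.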
The four standard ways to pump a number
Contamination. If the test questions appeared in the training data, the model memorized them. The MMLU case in 2023 was notorious: independent analyses found significant portions of the test set had circulated on technical forums before several models were trained. "Official" scores dropped between 10 and 25 points when measured on clean questions.
Training aimed at the benchmark (gaming). A company knows it'll be judged on a particular test. It trains specifically for that test. The model shines there and underperforms everywhere else. Goodhart's law applied to machine learning: when a measure becomes a target, it stops being a good measure.
Cherry-picking. The launch report shows the three benchmarks the model won. The five where it lost don't appear. If a company only shows tables it wins on, that isn't evaluation — it's marketing.
Comparisons against older versions. "Beats GPT-4 by 15%." Against which GPT-4? The original from March 2023 is dated compared to GPT-4o (May 2024) or GPT-4-turbo. When the exact version and measurement date aren't specified, the headline is built to mislead.
Benchmark vs evaluation: the distinction that matters
A benchmark is a standardized test anyone can run. An evaluation is testing a model for your specific case. Benchmarks give you a general picture; evaluations tell you whether the model works for you.
Rule: benchmarks for filtering (rule out the bad), your own eval for deciding (pick among the good).
The private eval is what serious companies do before signing a contract with a provider. They call it exactly that — a "private eval." You collect 10 to 15 real tasks from the team's actual work, run each task against each candidate model under identical conditions, and score them against your own criteria. The winning model isn't the one with the highest MMLU — it's the one that solves your team's specific tasks best.
A small operational note: a private eval isn't expensive to set up. One afternoon with 15 well-chosen prompts and three candidate models gives you a far more useful signal than any official table.
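The "every task against every model, identical conditions" part can be sketched in a few lines. This is a skeleton under stated assumptions: `ask` is a hypothetical function you would write yourself to call each provider with the same settings, and the task names and prompts are placeholders.

```python
from itertools import product
from typing import Callable


def run_private_eval(
    tasks: dict[str, str],
    models: list[str],
    ask: Callable[[str, str], str],
) -> dict[tuple[str, str], str]:
    """Run every task prompt against every candidate model.

    ask(model, prompt) is a hypothetical caller you supply; it should start
    a fresh conversation each time so no model benefits from prior context.
    """
    results = {}
    for (task_id, prompt), model in product(tasks.items(), models):
        results[(task_id, model)] = ask(model, prompt)
    return results


def fake_ask(model: str, prompt: str) -> str:
    """Stand-in for a real provider call, so the sketch runs offline."""
    return f"[{model}] draft for: {prompt}"


tasks = {
    "rejection_email": "Draft a polite rejection email to a client.",
    "contract_summary": "Summarize the attached contract in ten bullet points.",
}
models = ["candidate_1", "candidate_2", "candidate_3"]
outputs = run_private_eval(tasks, models, fake_ask)
print(len(outputs), "outputs to score")
```

With 15 tasks and three models that is 45 outputs, which is about what one afternoon of reading can cover.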
An honest contrast between providers
There's an observable difference in how the two largest labs report their numbers, and it's worth stating plainly.
Anthropic tends to publish benchmarks with comparable conditions (same version, same settings, same shot count), and its system cards include detailed methodology sections. When Claude loses a benchmark, Anthropic usually says so.
OpenAI has historically been less transparent about exact settings: several of the peaks reported at GPT-4 and GPT-4o launches were measured under special configurations (more inference compute, assisted prompting, retries) that aren't what a typical user runs in production.
It's not that one company lies and the other tells the truth. It's that one publishes more data that allows auditing, and the other publishes less. The evidentiary regime is different.
The counterweight: Chatbot Arena is more honest than any number either of them reports. Because the voters don't know which model they're rating.
Three questions for reading any comparison
Before believing a table, ask three questions. If you can't answer all three, the number isn't evidence — it's decoration.
Public or private test? Public tests are auditable but contaminable. Private ones (provider holdout) are clean but not verifiable.
Who picked which benchmarks to report? An honest report includes tests where the model loses, not just where it wins.
What exact version was measured, and when? Models change. A January benchmark may not apply to a June model — and that's a common trap in cross-provider comparisons.
Why Claude appears lower on benchmarks and higher on Arena
There's an observable pattern worth understanding. On individual benchmarks, Claude usually sits 2 to 5 points below the peak of the moment. On Chatbot Arena, Claude consistently sits in the top three. The gap isn't inconsistency; it's design.
Anthropic optimizes for robustness over peak: they'd rather ship a model that stably scores 88 than one that sometimes scores 95 and sometimes 62. Benchmarks measure moments; human experience measures averages and variance. That's why in blind voting by real users, Claude wins more often. People don't notice the competitor's peak — they notice its inconsistency.
It's the difference between two students: the one who sometimes scores 10 and sometimes 4, and the one who always scores 8. On the year-end report, the second wins.
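The two-student comparison is simple arithmetic, sketched here under the assumption that the erratic student gets 10 and 4 equally often. The numbers are the illustration's, not measurements of any model.

```python
from statistics import mean, pstdev

# Assumption: the erratic student alternates 10 and 4 with equal frequency.
report_cards = {
    "erratic": [10, 4, 10, 4],
    "steady": [8, 8, 8, 8],
}

# For each student: (average score, spread around that average).
summary = {
    name: (mean(scores), pstdev(scores))
    for name, scores in report_cards.items()
}

for name, (avg, spread) in summary.items():
    print(f"{name}: average {avg}, standard deviation {spread}")
```

The erratic student averages 7 with a spread of 3; the steady one averages 8 with no spread at all. A single benchmark run can catch the 10, but a week of use experiences the average and the spread.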
The test no benchmark replaces
Use the model for a full week in your real work. Ask it for the things you ask yourself for. Note three concrete things: how many times it saved you time (and how much), how many times it doubled your work because you had to review its output, and how many times it understood on the first try without you rephrasing.
That data isn't in any paper. But it's the only one that answers the question you actually care about: does this tool work for me?
Which of the five benchmarks I named feels like it weighs the most in how you'd pick an AI for your work, or do you have your own test that would beat them all? If you want to go deeper into the competitive picture, read "The AI race." If you want to understand how frontier models actually differ in real use, "The differences that do matter" is the next step.
MMLU was introduced in 2020 and published in 2021 by an academic team led by Dan Hendrycks at Berkeley. 14,000 multiple-choice questions across 57 disciplines, from abstract algebra to jurisprudence. In the publication paper (Hendrycks et al., 2021), the authors reported GPT-3 scoring 43.9%, against an estimated expert-human accuracy of 89.8%. The paper's conclusion was clear: language models had a long road ahead in general knowledge.
Three years later, in 2024, frontier models — Claude 3 Opus, GPT-4 Turbo, Gemini 1.5 Pro — scored between 86% and 88.7%. By 2025, several crossed 90%. Today in 2026, almost no lab reports MMLU as a differentiating metric: it's the floor, not the ceiling. The useful lifespan of a public benchmark at the current scale frontier is roughly two years. What gets measured gets optimized; what gets optimized gets saturated.
That sequence — academic invention, adoption as standard, saturation, obsolescence — is the normal life cycle of any AI benchmark. Understanding it is the foundation for evaluating any number you read over the coming months.
Taxonomy of current evaluation
The model-evaluation space today splits into five families with distinct methodological properties.
Structured knowledge and reasoning. MMLU (2021), MMLU-Pro (2024), GPQA Diamond (2023), BIG-Bench Hard. Multiple-choice or short-answer format, programmatic grading, simple metric. Strength: reproducibility and cross-comparison. Weakness: contaminable via training data leaks, saturable when the community converges on the ceiling. Original MMLU is saturated; MMLU-Pro was the response (more questions, harder), and it's starting to saturate too.
Executable programming. HumanEval (2021), MBPP, SWE-Bench (2023), SWE-Bench Verified (2024), LiveCodeBench. What's measured is whether the code runs and passes the tests. Low ambiguity. SWE-Bench is particularly interesting because it measures resolution of real GitHub issues with entire repositories as context, which approaches a genuine software engineering task. The Verified version is a manually filtered set of 500 tasks from the 2,294 originals, chosen to remove ambiguity in the test cases; it addresses the criticism that the base benchmark was structurally noisy.
Mathematical reasoning. GSM8K (2021), MATH (2021), AIME. The last two gained relevance with the arrival of extended chain-of-thought models (OpenAI's o1, o3, Anthropic's Opus 4.x), because the gap between think-longer and answer-directly becomes measurable.
Blind human preference. Chatbot Arena / LMSYS. Elo system based on human votes in blind pair battles of models. The key methodological property: the user doesn't know which model is answering. It's the closest thing we have to a utility metric at scale. Its known limitations are three: sample bias (voters are self-selected and concentrated in technically literate regions), length bias (longer answers systematically win at equal content), and prompt concentration on short queries (the benchmark measures quality on short exchanges well, sustained performance on long tasks worse).
Agentic evaluation. AgentBench, WebArena, GAIA, τ-bench (tau-bench). Emerging. They measure execution of tool-chained actions — browse, search, write code, persist context, interact with APIs. These are the benchmarks that matter for the next two years and where saturation hasn't arrived yet.
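The length bias listed for blind human preference is something you can check on any log of pairwise votes, including your own. A rough sketch follows, assuming a hypothetical vote format with the two answer lengths and the winner; a rate well above 0.5 is a hint, not proof, since longer answers are sometimes simply better.

```python
from typing import Optional


def longer_answer_win_rate(votes: list[dict]) -> Optional[float]:
    """Share of votes won by the longer of the two answers.

    Each vote is a dict with 'len_a', 'len_b' (answer lengths) and
    'winner' ('a' or 'b'). This record format is an assumption for the
    sketch. Votes where both answers have the same length are skipped.
    """
    decided = [v for v in votes if v["len_a"] != v["len_b"]]
    if not decided:
        return None
    longer_wins = sum(
        1 for v in decided
        if (v["winner"] == "a") == (v["len_a"] > v["len_b"])
    )
    return longer_wins / len(decided)


# Hypothetical vote log.
votes = [
    {"len_a": 900, "len_b": 300, "winner": "a"},
    {"len_a": 250, "len_b": 700, "winner": "b"},
    {"len_a": 400, "len_b": 1200, "winner": "a"},
    {"len_a": 1500, "len_b": 600, "winner": "a"},
    {"len_a": 500, "len_b": 500, "winner": "b"},  # equal length, skipped
]
rate = longer_answer_win_rate(votes)
print(rate)
```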
The central methodological problem: contamination
When a public test appears in the training data, the model memorizes rather than solves it. The effect is measurable with memorization detection techniques (for example, blanking out part of a test item and measuring how much probability the model assigns to the exact original text), but providers rarely publish those analyses.
A concrete example: during 2023, independent analyses on several open models found substantial fragments of the MMLU test set appeared in public pre-training datasets (The Pile, RefinedWeb, C4). When those models were re-measured against "paraphrase-resistant" versions generated post-training, scores dropped between 10 and 25 percentage points. Few labs publish this kind of analysis proactively.
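One simple way to look for this kind of leak is word n-gram overlap between test questions and a corpus sample. Note that this is a different and cruder technique than the probability-based detection mentioned above, and it needs no model access. The 5-word window and the example strings are assumptions for illustration; published contamination analyses use various window sizes and normalization rules.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text, ignoring punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(questions: list[str], corpus_docs: list[str], n: int = 5) -> float:
    """Fraction of questions sharing at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not questions:
        return 0.0
    flagged = sum(1 for q in questions if ngrams(q, n) & corpus_grams)
    return flagged / len(questions)


# Hypothetical test items and a hypothetical scraped forum post.
questions = [
    "Which organelle is the main site of ATP production in eukaryotic cells?",
    "What is the derivative of x squared with respect to x?",
]
corpus = [
    "Forum post: which organelle is the main site of ATP production "
    "in eukaryotic cells? Answer: mitochondria.",
]
rate = contamination_rate(questions, corpus, n=5)
print(rate)
```

Exact overlap misses paraphrased leaks, which is one reason the re-measurement described above relied on paraphrase-resistant versions rather than string matching.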
The community's partial response is maintaining private evaluation holdouts that each lab uses internally. The methodological cost is clear: non-reproducibility. Anthropic has tended to combine public benchmark reports with explicitly declared internal holdout results; OpenAI traditionally publishes less detail about specific settings; DeepMind has been more explicit with eval cards. That's not an equivalence — it's an observable difference in transparency policy.
The other response is re-generating the benchmark. GPQA Diamond, MMLU-Pro, and SWE-Bench Verified are examples of that race: every time a benchmark saturates or becomes contaminated, a harder or cleaner version appears. The cycle repeats every two to three years.
Goodhart's law applied to training
"When a measure becomes a target, it stops being a good measure." In LLM training, Goodhart's law shows up as the gaming problem: a lab optimizes explicitly against a known benchmark, gets superior scores on that test, and degrades performance out of distribution.
The best-documented cases involved several Chinese models that reported exceptional GSM8K results during 2024; later analyses found the evaluation set had circulated on technical forums before training. The correction was 25 to 30 percentage points when measured against a clean post-training set.
The technical antidote is out-of-distribution evaluation: generate problems with the same structure but distinct from the public set, and run them only after training. Few providers do this systematically given the cost of generating new sets of equivalent methodological quality.
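For problems with a fixed structure, "same structure, distinct from the public set" can be sketched with a template and fresh numbers. This is a toy under stated assumptions: the template, the number ranges, and the seed are invented for illustration, and real out-of-distribution sets need far more varied problem types than one word problem.

```python
import random

# Hypothetical problem template; the answer is a * b - c.
TEMPLATE = (
    "A shop sells {a} boxes with {b} pens each and gives away {c} pens. "
    "How many pens are left?"
)


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Draw fresh numbers for the template and compute the true answer."""
    a = rng.randint(3, 20)
    b = rng.randint(2, 12)
    c = rng.randint(1, a * b - 1)  # keeps the answer strictly positive
    return TEMPLATE.format(a=a, b=b, c=c), a * b - c


def fresh_set(size: int, seed: int, public_questions: set[str]) -> list[tuple[str, int]]:
    """Generate problems that are unique and absent from the public set."""
    rng = random.Random(seed)
    seen = set(public_questions)
    problems = []
    while len(problems) < size:
        question, answer = make_problem(rng)
        if question not in seen:
            seen.add(question)
            problems.append((question, answer))
    return problems


public = {TEMPLATE.format(a=5, b=4, c=3)}  # pretend this one already leaked
held_out = fresh_set(size=5, seed=2026, public_questions=public)
for question, answer in held_out:
    print(answer, "|", question)
```

The generator is only useful if it runs after the model's training cutoff and the seed stays private until then; otherwise the new set is just another public test.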
The LLM judge problem
When evaluation requires quality judgment ("which of these two answers is better"), a common practice is using another LLM as judge — typically GPT-4 evaluating other models' responses. The problem is circularity: if you evaluate model A against model B with model C as judge, the result depends on C's preferences.
Recent studies (Zheng et al. 2023; follow-ups through 2024-2025) show LLM judges have three systematic biases: verbosity bias (prefer longer answers), self-enhancement bias (prefer answers they generated themselves or that stylistically resemble their own), and position bias (prefer the first answer shown in a comparison). Chatbot Arena mitigates this by using humans as judges, but introduces different human biases: verbosity is still present, and voters aren't a representative sample.
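Of the three judge biases, position bias is the easiest to counter in practice: ask the judge twice with the answers swapped and only accept a verdict that survives the swap. A minimal sketch, where `judge` is a hypothetical function wrapping whatever judge model you use, and the two toy judges exist only to make the example run offline.

```python
from typing import Callable, Optional

# Hypothetical judge signature:
# (prompt, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> Optional[str]:
    """Return 'A' or 'B' only if the judge agrees with itself after swapping."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return None  # verdict flipped with position: treat as a tie


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias (position-independent)."""
    return "first" if len(first) >= len(second) else "second"


biased = debiased_verdict(always_first, "q", "short", "a much longer answer")
verbose = debiased_verdict(longer_wins, "q", "short", "a much longer answer")
print(biased, verbose)
```

The swap neutralizes the position-biased judge (its verdict becomes a tie) but does nothing about the verbosity-biased one, which is the practical reason a single judging source is never enough.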
The emerging methodological consensus is that no single source of judgment is sufficient: robust evaluation requires combining programmatic tests (code correctness, verifiable math), human preference (Arena), and private out-of-distribution evaluations. No provider today publishes a full stack.
Differential transparency between providers
Worth contrasting with precision how the major labs report their numbers, because the difference is operational, not rhetorical.
Anthropic publishes detailed system cards at every major release, includes explicit sections on evaluation settings (shot count, temperature, prompting strategy), reports results on benchmarks where Claude loses (not just where it wins), and maintains a declared internal holdout policy. The comparative tables it publishes tend to use the same version and settings across competitors. This isn't neutrality — Anthropic is still an interested party — but it's an evidentiary regime that allows auditing.
OpenAI has historically published less detail about specific settings. Several peaks reported at GPT-4 and GPT-4o launches were measured under configurations the typical user doesn't run in production: more inference compute, engineering-assisted prompting, majority-vote retries. These settings were sometimes documented in later technical reports but not in the launch material that captures the news cycle.
Google DeepMind sits between them: Gemini releases include structured eval cards, but methodology is sometimes communicated with less granularity than Anthropic's.
The editorial point: this isn't an accusation of fraud. It's the observation that the evidentiary regime varies between providers, and the critical reader needs to know against which standard each table should be read.
The honest counterweight: Chatbot Arena, not controlled by any of the three, is more honest than any figure any of them reports. When Claude sits near the Arena top and simultaneously loses some academic benchmarks, the useful signal is Arena — because it's blind voting, not a table someone built.
Recommended metrics for 2026
For filtering (rule out the bad): MMLU as floor. If a model doesn't clear 75, it isn't serious for professional work.
For hard reasoning: GPQA Diamond and AIME. Both require genuine thought and are recent-generation, less contaminated.
For code: SWE-Bench Verified. The least gameable metric in the programming space.
For human preference in everyday use: Chatbot Arena. Hard to inflate, known biases that are compensable.
For agents and tool use: AgentBench, τ-bench. Emerging, unsaturated, relevant for what's coming.
For your specific case: none of the above. Your evaluation is your private eval.
How to build a private eval that actually works
This section is the most important in the piece and the one that rarely appears in mainstream coverage.
A well-built private eval has four elements.
Representative tasks. Between 10 and 15 real tasks you do in a normal work week. Not hypothetical tasks, not benchmark tasks — the ones you get paid to solve. Consulting examples: drafting a client rejection email, summarizing a 30-page contract, drafting a commercial proposal, debugging unfamiliar code, translating a formal letter. Developer examples: implement a function with tests, refactor a messy module, review a pull request. Each role has its own set.
Canonical prompts. Write the exact prompt as you'd use it in your normal flow. Don't optimize it for the model — part of the eval is seeing which model best understands your natural way of asking.
Measurement protocol. Run each task against each candidate model under identical conditions — same version, same day, no prior context. Ideally double-blind: someone else labels the outputs before you see them.
Scoring rubric. Two minimum dimensions: output quality (1 to 5) and editing cost to make it presentable (1 to 5, inverted — less editing is better). Optional: a third confidence dimension, "would I hand this in without detailed review?" Sum scores per task, average per model.
The result is an internal table that tells you, with your own evidence, which model performs best on your specific work. That table is worth more than any benchmark paper.
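The rubric arithmetic fits in a short script. This sketch makes one assumption explicit: editing effort is logged from 1 (none) to 5 (heavy) and inverted as 6 minus effort, so that less editing scores higher. The model names, task names, and scores are placeholders.

```python
from collections import defaultdict
from statistics import mean


def task_score(quality: int, edit_effort: int) -> int:
    """Quality (1-5) plus inverted editing effort (1 = none, 5 = heavy).

    The 6 - effort inversion is an assumed reading of "less editing is
    better"; the combined score ranges from 2 to 10.
    """
    if not (1 <= quality <= 5 and 1 <= edit_effort <= 5):
        raise ValueError("quality and edit_effort must be between 1 and 5")
    return quality + (6 - edit_effort)


def model_averages(scores: list[tuple[str, str, int, int]]) -> dict[str, float]:
    """scores: (model, task_id, quality, edit_effort) rows; average per model."""
    per_model = defaultdict(list)
    for model, _task_id, quality, edit_effort in scores:
        per_model[model].append(task_score(quality, edit_effort))
    return {model: mean(values) for model, values in per_model.items()}


# Placeholder scores from a hypothetical two-task, two-model run.
scores = [
    ("candidate_1", "rejection_email", 4, 2),
    ("candidate_1", "contract_summary", 3, 4),
    ("candidate_2", "rejection_email", 5, 1),
    ("candidate_2", "contract_summary", 4, 2),
]
averages = model_averages(scores)
best = max(averages, key=averages.get)
print(averages, "->", best)
```

Keeping the per-task rows rather than only the averages is deliberate: a model that wins on average but fails badly on the one task you do daily is not the winner for you.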
Serious companies — consultancies, law firms, fintechs, product teams — do this before adopting a provider. It's what Anthropic, OpenAI and Google informally call a "bakeoff": the client tests the three models with their private eval and decides. In my consulting experience, the bakeoff winner varies by case — but the frequency with which Claude wins on high-stakes production work is the reason it's my default.
Editorial thesis
I'll close with a thesis that goes past methodological reporting and into opinion.
The most valuable benchmark for a serious professional isn't published in any paper and doesn't appear in any launch table. It's the one you build, with 10 to 15 real tasks from your work, repeated across each candidate model, scored against criteria you define, and refreshed every three or six months. That private eval is what serious companies do before adopting a provider, what well-calibrated consultants do before recommending a tool, and what AI press rarely explains to its readers.
The reason it doesn't get explained is structural: public benchmarks generate headlines; private evals generate decisions. And the incentives of the news cycle prioritize headlines. That's why most of the "which AI is better" coverage in 2026 is still a naive read of tables that should be read as interested-party evidence.
This paper's editorial line is explicit on the point: official benchmarks are reported with their declared evidentiary regime, Chatbot Arena is cited as contrast when appropriate, and the operational recommendation always points to the private eval. Not because it's more glamorous — the opposite, it's more work. Because it's the only one that answers the question that matters.
Do you have a private eval set up for your work, or are you still picking AI by table? If the answer is the second, the afternoon it takes you to build the 10 tasks and run them is probably the best tooling-infrastructure investment you'll make this year.