Analysis · The AI Landscape · Edition #0019

How AIs actually get measured

Benchmarks are AI's report cards. Five of them still tell you something — and only one test can't be gamed: using the tool for a week in your actual job.

Germán Falcioni April 20, 2026
✦ Reading: 10 min
A table full of percentages isn't enough to decide which AI to use. The real test lasts a week, not a screenshot.
TL;DR

Benchmarks are the standardized tests used to compare AI models. They sound objective but can be inflated in predictable ways: data contamination, training aimed at the test, cherry-picked results, and comparisons against cherry-picked versions. Five still earn your attention: MMLU (floor), GPQA Diamond (hard reasoning), SWE-Bench Verified (real code), Chatbot Arena (blind human voting), and your own private eval built from 10 to 15 tasks out of your job. Anthropic tends to report under comparable conditions; OpenAI has been less transparent about settings. The most honest public metric is Chatbot Arena, because voters don't know which model they're rating. The definitive one is your week of real use. The most valuable benchmark for a professional is unpublished: the one you build yourself with real tasks and your own criteria.

✦ Summarized with Claude at publish time

In June 2024, a team of independent researchers affiliated with Scale AI published results on GPQA Diamond, a benchmark of PhD-level questions in physics, chemistry, and biology, written so that a Google search wouldn't help. When they ran GPT-4 against that test under identical conditions, the model scored 12 points lower than OpenAI had reported in its own launch materials a few months earlier.

The gap wasn't statistical noise. It was a sign of a problem the industry has been carrying for a while, and one worth naming directly: benchmarks have owners, and when the owner is the same party selling the model, the number moves.

This piece exists so that the next time an AI provider shows you a table and says "we're better," you know how to read between the rows.

The five benchmarks still worth looking at

MMLU (Massive Multitask Language Understanding). Designed in 2020 by Berkeley academics. About 14,000 multiple-choice questions across 57 disciplines: history, law, medicine, math, ethics. In 2020, the best models scored around 45%. By 2024, frontier models were pushing 90%. The benchmark is saturated: when every model sits between 85 and 92, the metric stops distinguishing anything. Today it works as a floor, not a ceiling: if a model doesn't clear 75, skip it.

GPQA Diamond. Published in 2023. PhD-level questions in hard science, calibrated so that a human with Google access still struggles. It's the current hard exam: the one separating models with genuine reasoning from those that memorize well. Being newer, it's also less contaminated.

SWE-Bench Verified. The most honest benchmark in the code space. The AI gets a real GitHub issue with its repository attached, and the test measures whether it fixes the bug. The Verified version (August 2024) is a manually reviewed subset: the 500 best-constructed tasks out of the 2,294 originals, with ambiguous success criteria removed. It's the closest thing we have to measuring whether a model can do a programmer's actual work.

Chatbot Arena (lmarena.ai). Two anonymous models answer the same prompt, and the user votes for the answer they prefer without knowing which model wrote it. Thousands of votes produce an Elo-style ranking. It's the hardest benchmark to inflate for a structural reason: there's no predefined correct answer to memorize, and the voters are random humans. Its known biases (longer answers win more often, prompts tend to be short) are minor problems compared to the fragility of everything else.

Real tasks from your own work. The only definitive test. It's your private eval. More on this in two minutes.
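To make the "Elo-style ranking" in the Chatbot Arena entry above concrete, here is a minimal sketch of how pairwise votes can turn into ratings. It uses the classic online Elo update, which is an assumption for illustration: the live leaderboard's exact statistical method may differ. The model names, the starting rating of 1000, and the K-factor of 32 are all hypothetical.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_votes(votes, k: float = 32.0, start: float = 1000.0) -> dict:
    """votes: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: start)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        # An upset (low expected score, actual win) moves the rating the most.
        delta = k * (outcome[winner] - exp_a)
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


# Hypothetical blind votes: each tuple is one user comparing two answers.
votes = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "a"),
    ("model-y", "model-x", "a"),
    ("model-x", "model-z", "tie"),
]
ratings = elo_from_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Notice what the update never looks at: a reference answer. The only input is which response a human preferred, which is why there is nothing to memorize.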

The four standard ways to pump a number

Contamination. If the test questions appeared in the training data, the model memorized them. The MMLU case in 2023 was notorious: independent analyses found significant portions of the test set had circulated on technical forums before several models were trained. "Official" scores dropped between 10 and 25 points when measured on clean questions.

Training aimed at the benchmark (gaming). A company knows it'll be judged on a particular test. It trains specifically for that test. The model shines there and underperforms everywhere else. Goodhart's law applied to machine learning: when a measure becomes a target, it stops being a good measure.

Cherry-picking. The launch report shows the three benchmarks the model won. The five where it lost don't appear. If a company only shows tables it wins on, that isn't evaluation — it's marketing.

Comparisons against older versions. "Beats GPT-4 by 15%." Against which GPT-4? The original from March 2023 is dated compared to GPT-4 Turbo (November 2023) or GPT-4o (May 2024). When the exact version and measurement date aren't specified, the headline is built to mislead.
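The contamination problem described above can be checked, roughly, by looking for test questions whose word sequences already appear in a corpus. The sketch below is a simplified n-gram overlap check, not the method any particular lab or audit used; the function names, the sample questions, and the n-gram length are assumptions chosen for illustration.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Lowercased word n-grams of a text, as a set of tuples."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_questions, corpus_docs, n: int = 8) -> float:
    """Share of test questions that share at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [q for q in test_questions if ngrams(q, n) & corpus_grams]
    return len(flagged) / len(test_questions) if test_questions else 0.0


# Hypothetical data: one question leaked onto a forum, one did not.
questions = [
    "What is the boiling point of water at sea level in Celsius?",
    "Which planet has the largest number of confirmed moons?",
]
corpus = [
    "Forum post: what is the boiling point of water at sea level in celsius? Answer: 100",
    "Unrelated text about cooking pasta for ten minutes in salted water.",
]
rate = contamination_rate(questions, corpus, n=5)
print(f"Flagged as possibly contaminated: {rate:.0%}")
```

A check like this only catches near-verbatim leaks. Paraphrased or translated questions slip through, so a low rate is weak evidence of a clean test, not proof.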

Benchmark vs evaluation: the distinction that matters

A benchmark is a standardized test anyone can run. An evaluation is testing a model for your specific case. Benchmarks give you a general picture; evaluations tell you whether the model works for you.

Rule: benchmarks for filtering (rule out the bad), your own eval for deciding (pick among the good).

The private eval is what serious companies do before signing a contract with a provider. They call it exactly that — a "private eval." You collect 10 to 15 real tasks from the team's actual work, run each task against each candidate model under identical conditions, and score them against your own criteria. The winning model isn't the one with the highest MMLU — it's the one that solves your team's specific tasks best.

A small operational note: a private eval isn't expensive to set up. One afternoon with 15 well-chosen prompts and three candidate models gives you a far more useful signal than any official table.
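As a rough sketch of what that afternoon could look like in code: a list of tasks, each with its own scoring rule, run against every candidate under the same conditions. Everything named here is hypothetical. The two "models" are stand-in functions rather than real API calls, and the scoring rules are deliberately crude; in practice you would plug in each provider's client and your own criteria.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    score: Callable[[str], float]  # your own criterion, from 0.0 to 1.0


def run_private_eval(
    tasks: List[Task], models: Dict[str, Callable[[str], str]]
) -> Dict[str, float]:
    """Run every task against every model and return the mean score per model."""
    results = {}
    for model_name, ask in models.items():
        scores = [task.score(ask(task.prompt)) for task in tasks]
        results[model_name] = mean(scores)
    return results


# Hypothetical tasks taken from a team's real work.
tasks = [
    Task("refund-email", "Reply to a customer asking about a late refund.",
         lambda answer: 1.0 if "refund" in answer.lower() else 0.0),
    Task("sql-query", "Write a SQL query listing all orders.",
         lambda answer: 1.0 if answer.strip().upper().startswith("SELECT") else 0.0),
]

# Stand-ins for real model calls (assumption: replace with actual clients).
models = {
    "model-a": lambda p: "SELECT * FROM orders" if "SQL" in p else "We are sorry for the delay.",
    "model-b": lambda p: "SELECT id FROM orders" if "SQL" in p else "Your refund has been issued.",
}

results = run_private_eval(tasks, models)
winner = max(results, key=results.get)
print(results, "->", winner)
```

The design choice that matters is the `score` function per task: it forces you to write down what "good" means for your work before you see any model's answer.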

An honest contrast between providers

There's an observable difference in how the two largest labs report their numbers, and it's worth stating plainly.

Anthropic tends to publish benchmarks with comparable conditions (same version, same settings, same shot count), and its system cards include detailed methodology sections. When Claude loses a benchmark, Anthropic usually says so.

OpenAI has historically been less transparent about exact settings: several of the peaks reported at GPT-4 and GPT-4o launches were measured under special configurations (more inference compute, assisted prompting, re-tries) that aren't what a typical user runs in production.

It's not that one company lies and the other tells the truth. It's that one publishes more data that allows auditing, and the other publishes less. The evidentiary regime is different.

The counterweight: Chatbot Arena is more honest than any number either of them reports, because the voters don't know which model they're rating.

Three questions for reading any comparison

Before believing a table, ask three questions. If you can't answer all three, the number isn't evidence — it's decoration.

Public or private test? Public tests are auditable but contaminable. Private ones (provider holdout) are clean but not verifiable.

Who picked which benchmarks to report? An honest report includes tests where the model loses, not just where it wins.

What exact version was measured, and when? Models change. A January benchmark may not apply to a June model — and that's a common trap in cross-provider comparisons.

Why Claude appears lower on benchmarks and higher on Arena

There's an observable pattern worth understanding. On individual benchmarks, Claude usually sits 2 to 5 points below the peak of the moment. On Chatbot Arena, Claude consistently sits in the top three. The gap isn't inconsistency. It's design.

Anthropic optimizes for robustness over peak: they'd rather ship a model that stably scores 88 than one that sometimes scores 95 and sometimes 62. Benchmarks measure moments; human experience measures averages and variance. That's why in blind voting by real users, Claude wins more often. People don't notice the competitor's peak — they notice its inconsistency.

Same difference as between two students: the one who sometimes scores 10 and sometimes 4, versus the one who always scores 8. On the year-end report, the second wins.
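The arithmetic behind the two students is worth spelling out. The sketch below assumes the erratic student scores 10 and 4 equally often, which the analogy doesn't actually specify; the numbers are illustrative, not measurements of any model.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Return (average, spread) for a list of scores."""
    return mean(scores), pstdev(scores)


# Assumption: the erratic student alternates evenly between 10 and 4.
erratic = [10, 4, 10, 4]
steady = [8, 8, 8, 8]

erratic_avg, erratic_spread = summarize(erratic)
steady_avg, steady_spread = summarize(steady)

print(f"erratic: best={max(erratic)} average={erratic_avg} spread={erratic_spread:.1f}")
print(f"steady:  best={max(steady)} average={steady_avg} spread={steady_spread:.1f}")
```

A leaderboard that reports only the best run ranks the erratic student first (10 versus 8). Averaged over the year, the steady one wins (8 versus 7) with no spread at all.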

The test no benchmark replaces

Use the model for a full week in your real work. Give it the tasks you would otherwise do yourself. Note three concrete things: how many times it saved you time (and how much), how many times it doubled your work because you had to review its output, and how many times it understood on the first try without you rephrasing.

That data isn't in any paper. But it's the only one that answers the question you actually care about: does this tool work for me?
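If a spreadsheet feels like too much, the three tallies described above fit in a few lines. This is one possible shape for the log, with hypothetical field names and made-up entries; adapt the fields to whatever you actually want to count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    task: str
    minutes_saved: int   # 0 if it saved nothing
    doubled_work: bool   # True if reviewing the output cost more than doing it yourself
    first_try: bool      # True if it understood without rephrasing


def summarize_week(log: List[Entry]) -> dict:
    """Count the three things worth noting after a week of real use."""
    return {
        "times_saved": sum(1 for e in log if e.minutes_saved > 0),
        "minutes_saved": sum(e.minutes_saved for e in log),
        "times_doubled": sum(1 for e in log if e.doubled_work),
        "first_try": sum(1 for e in log if e.first_try),
    }


# Made-up entries for illustration only.
log = [
    Entry("draft client proposal", 40, False, True),
    Entry("summarize contract", 0, True, False),
    Entry("clean up spreadsheet formulas", 15, False, True),
]
summary = summarize_week(log)
print(summary)
```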

Which of the five benchmarks I named feels like it weighs the most in how you'd pick an AI for your work? Or do you have your own test that would beat them all? If you want to go deeper into the competitive picture, read "The AI race." If you want to understand how frontier models actually differ in real use, "The differences that do matter" is the next step.
