GPT-5.5 just shipped. The question isn't which model wins — it's when to use which

TL;DR

OpenAI shipped GPT-5.5 on April 23 — six weeks after GPT-5.4 and one week after Claude Opus 4.7. The two frontier models stopped competing on the same axis: GPT-5.5 wins the benchmarks for agents that run long workflows (Terminal-Bench 2.0, OSWorld), Opus 4.7 wins the ones for analytical precision (SWE-Bench Pro, MCP-Atlas). For you, using AI to work better: "which one is better?" just went obsolete. The question that matters now is "which one do I use for what?"

✦ Summarized with Claude at publish time

✦ AI rewrite

Read it as…

Six weeks. That's what passed between GPT-5.4 and GPT-5.5. Big frontier models used to take months to get a new version. Now it's weeks.

But the thing worth watching isn't speed. It's the reshuffling that came with it. GPT-5.5 and Claude Opus 4.7 — the two frontier models that shipped this week and last — stopped competing on the same axis.

For the past two years, top AI models got measured on a single list: who scores higher on the same exams? Today, GPT-5.5 wins on some and Opus 4.7 wins on others. The gaps aren't cosmetic. They're structural.

"The race stopped being a ranking. It's an ecosystem now — and that's good for you."

What happened this week

OpenAI shipped GPT-5.5 on April 23 in three variants: Standard, Thinking (with extended reasoning), and Pro (highest accuracy). A one-million-token context window. Pricing: $5 per million input tokens, $30 per million output tokens.

According to data OpenAI provided in its release, the model scores 88.7% on SWE-bench (a standard software engineering benchmark) and 92.4% on MMLU (a broad knowledge test). OpenAI also reports a 60% reduction in "hallucinations" versus GPT-5.4.

All those are OpenAI's numbers. Treat them as what they are: figures from the maker.

A week earlier, Anthropic had shipped Claude Opus 4.7, also with a 1M-token window, similar pricing ($5 input, $25 output), and its own set of optimistic numbers. Both models sit in the exact same price-capability band. The gap shows up when you look at community benchmarks where evaluations run with the same rules for both.

Where each model wins

This is where the story gets interesting.

GPT-5.5 wins on agentic coding — the long flows where the AI runs, corrects, and retries on its own:

Terminal-Bench 2.0: 82.7% vs Opus 4.7's 69.4%.
OSWorld-Verified: 78.7% vs 78.0%.
Expert-SWE: 73.1%.

These benchmarks measure long workflows: the AI opens a terminal, runs commands, reads output, debugs, retries. A high score means the AI completes the task on its own without getting stuck.

Claude Opus 4.7 wins on analytical precision — the isolated tasks where "close enough" isn't enough:

SWE-Bench Pro: 64.3% vs GPT-5.5's 58.6%.
MCP-Atlas: 79.1% vs 75.3%.

Opus 4.7 wins 6 of 10 shared benchmarks. GPT-5.5 wins 4. Margins run 2 to 13 points.

Without context, these numbers say nothing. With context, they say everything.

When to use which

The practical rule, sharpened by a couple of tests this week:

When the work needs the AI to analyze, write, review, decide with precision, give one answer that has to be right — Opus 4.7.

When the work needs the AI to run multiple steps in sequence, use tools, adapt to intermediate results, finish something that involves many small moves — GPT-5.5.

A concrete example from my week: I had to analyze 50 résumés and rank them by fit for an open role.

With no prior context, I gave the task to Claude. I pasted the role profile, the CVs, asked for the ranking with reasoning. Claude came back with judgment, caveats, explained why #17 was ahead of #22. Precision on each case.

In an agentic setup — same task, but "do it yourself, from pulling the CVs out of my Drive to emailing me the formatted ranking" — GPT-5.5 has the edge. Five tools chained together, and if one fails it tries a variation. Claude can do it too, but with GPT-5.5 you're less likely to see it stall at step three.

Neither one solves both jobs equally well. This week's two releases made that explicit.

What didn't change

A detail that gets lost in benchmark comparisons: for most professionals' daily work, the differences become invisible.

Ask either model to "draft this email," "translate this proposal," "summarize this meeting," "list pros and cons," and both come back with a solid answer. A five-point benchmark margin dissolves in daily use.

The gap starts to matter when: (a) you're working with large files or long reasoning chains, where Opus 4.7 holds coherence better; (b) you need the AI to complete an autonomous flow across multiple steps without your intervention, where GPT-5.5 cuts through better; (c) you use AI for work where a small error is expensive (legal review, financial analysis, editorial content), where Opus's precision earns its pricing.

Closing

Here's Friday's news: there's no single answer to "which one do I use?"

The answer you have is: for this kind of work, Claude. For that kind of work, GPT-5.5. Both live in your day. Neither solves everything. The professional who picks with judgment works better than the one who marries a brand.

What about you? Is there something in your flow that isn't working today, and might be because you're using the wrong tool for that task?

Picture two bakeries on your block. Both sell bread every day. Both are good.

But one makes the better croissants and the other makes the better sourdough. Sunday brunch? You go to one. Tuesday sandwich? You go to the other.

That's what just happened with the two AI tools most professionals actually use.

On Wednesday, OpenAI shipped GPT-5.5 — a week after Anthropic shipped Claude Opus 4.7. Both are frontier models. Both cost nearly the same. Both handle huge files.

They just stopped being good at the same things.

What happened this week

GPT-5.5 landed six weeks after GPT-5.4. Six weeks.

A new frontier model used to take months or years. Now it's weeks. The race got faster.

According to numbers OpenAI published in its release, GPT-5.5 pulls ahead on tasks where the AI has to "do things" — run steps, use tools, execute and fix code, search and compare information without you hovering.

Claude Opus 4.7, the top model Anthropic shipped last week, still leads on the other side of the work — analysis, writing, document review, decisions that need precision.

Where each model wins

Here's the analogy that lands.

You ask an assistant to plan your weekend trip. One assistant puts the itinerary together, books the hotels, compares prices, reroutes when a flight shifts. The other writes the perfect email to your partner explaining what you planned, with the right tone, not one extra word.

You need both. But not for the same thing.

That's what the numbers show. GPT-5.5 handles long multi-step workflows better. Claude Opus 4.7 handles each individual precision task better.

When to use which

The simple rule you can try Monday:

If the work is writing, reviewing, analyzing a report, making a decision with data on the table — open Claude.

If the work is "I want the AI to figure this out on its own, from the search to the result" — try GPT-5.5.

You don't have to pick a brand for life. You pick the tool for the task. Any good professional has been doing that since Google existed.

How have you been deciding between the two so far?

Six weeks. That's what passed between GPT-5.4 and GPT-5.5. Big frontier models used to take months to get a new version. Now it's weeks.

"The race stopped being a ranking. It's an ecosystem now — and that's good for you."

What happened this week

All those are OpenAI's numbers. Treat them as what they are: figures from the maker.

Where each model wins

This is where the story gets interesting.

GPT-5.5 wins on agentic coding — the long flows where the AI runs, corrects, and retries on its own:

Terminal-Bench 2.0: 82.7% vs Opus 4.7's 69.4%.
OSWorld-Verified: 78.7% vs 78.0%.
Expert-SWE: 73.1%.

These benchmarks measure long workflows: the AI opens a terminal, runs commands, reads output, debugs, retries. A high score means the AI completes the task on its own without getting stuck.

Claude Opus 4.7 wins on analytical precision — the isolated tasks where "close enough" isn't enough:

SWE-Bench Pro: 64.3% vs GPT-5.5's 58.6%.
MCP-Atlas: 79.1% vs 75.3%.

Opus 4.7 wins 6 of 10 shared benchmarks. GPT-5.5 wins 4. Margins run 2 to 13 points.

Without context, these numbers say nothing. With context, they say everything.

When to use which

The practical rule, sharpened by a couple of tests this week:

When the work needs the AI to analyze, write, review, decide with precision, give one answer that has to be right — Opus 4.7.

When the work needs the AI to run multiple steps in sequence, use tools, adapt to intermediate results, finish something that involves many small moves — GPT-5.5.

A concrete example from my week: I had to analyze 50 résumés and rank them by fit for an open role.

Neither one solves both jobs equally well. This week's two releases made that explicit.

What didn't change

A detail that gets lost in benchmark comparisons: for most professionals' daily work, the differences become invisible.

Closing

Here's Friday's news: there's no single answer to "which one do I use?"

What about you? Is there something in your flow that isn't working today, and might be because you're using the wrong tool for that task?

Six weeks between GPT-5.4 and GPT-5.5 doesn't make sense without laying it next to the 2024 timeline, when GPT-4 Turbo and GPT-4o were more than five months apart. The cycle compressed. Not because OpenAI wanted it — because Anthropic set the cadence from September 2025 onward with Sonnet 4.6, Opus 4.5, Opus 4.6, and Opus 4.7 in under eight months.

But the real story this week isn't speed. It's the split of axes. Through March 2026, the competitive map read as a single list with one dominant metric (MMLU, then SWE-bench). With GPT-5.5 and Opus 4.7 side by side, the ranking broke. And what's left is more useful to the user than the ranking ever was.

What happened this week

GPT-5.5 shipped April 23 in three variants. Standard for general use. Thinking for extended reasoning (the equivalent of Claude's think mode). Pro for top accuracy without inference-time compute limits. Context window: 1M tokens. Pricing: $5/M input, $30/M output. The Pro tier has differential pricing not yet public at the time of writing.

Data OpenAI provided in the release: 88.7% SWE-bench, 92.4% MMLU, a 60% reduction in hallucinations vs GPT-5.4, a meaningful gain on long-context recall, and reliable tool use across chains of more than 30 steps. None of these numbers has been third-party audited as of today.

Opus 4.7 shipped April 16. Same 1M-token window. $5/M input, $25/M output. Data Anthropic provided: improvements in software engineering over Opus 4.6, upgraded vision (higher-resolution image processing), stronger performance on long-horizon tasks.

Both models sit in the same price-capability band. The gap shows up on shared benchmarks where the community runs reproducible evals.

Where each model wins

Compact table of what circulated this week:

Benchmark	GPT-5.5	Opus 4.7	Eval type
Terminal-Bench 2.0	82.7%	69.4%	Agentic coding (community)
OSWorld-Verified	78.7%	78.0%	Agentic desktop
Expert-SWE	73.1%	—	Complex coding
SWE-Bench Pro	58.6%	64.3%	Precision coding
MCP-Atlas	75.3%	79.1%	Precision with tool use
MMLU	92.4%*	—	General knowledge
SWE-bench	88.7%*	—	Coding (original)

*Numbers OpenAI provided in its release. Not independently verified as of today.

Terminal-Bench 2.0 measures an agent's ability to operate end-to-end in a Unix terminal: open shells, run commands, read outputs, debug, recover from errors. A 13-point GPT-5.5 lead on this metric isn't marginal — it's a qualitative difference in long agentic flows.

SWE-Bench Pro measures bug-fix resolution on real GitHub repositories with strict correctness evaluation. Opus 4.7's 5.7-point edge there means Claude is still ahead on analytical code precision.

MCP-Atlas — the benchmark most relevant to the agentic-with-analytical-precision pattern — goes to Opus 4.7 by 3.8 points.

The read: GPT-5.5 is optimized for long flows with multi-tool use. Opus 4.7 is optimized for individual tasks of high precision. The 2-13 point range of differences that keeps showing up across benchmarks is consistent with that hypothesis — not with a global "one is better."

Specialization as a structural pattern

One could argue this split is temporary. OpenAI is responding to the benchmark of the moment (agentic) and Anthropic to its own (precision coding), and in six weeks both will converge.

The argument has historical support — it happened in 2023 with GPT-4 and Claude 2, in 2024 with GPT-4o and Claude 3.5 Sonnet. But in 2026, two elements complicate the pattern.

First: the RLHF in GPT-5.5 is calibrated with specific heuristics to avoid "giving up" in long flows (OpenAI says so explicitly in the release). That's a directed design choice, not a side effect. I'd doubt Anthropic copies it one-for-one — it runs against the precision-first philosophy they've defended in technical posts since 2024.

Second: the vocation difference shows up in pricing. GPT-5.5 charges $30 on output; Opus 4.7, $25. On a long agentic flow, GPT-5.5 produces more output tokens (searches, re-evaluations, corrections). That asymmetric pricing — $5 more on output when the model is optimized to produce more output — models a usage economy where each model captures the workload it's best at. It's not neutral.

The hypothesis I find most defensible: the split holds. Each company deepens its axis rather than copying the other. For the user, that's good news — an ecosystem is more useful than a ranking.

When to use which

The operational heuristic, refined after running both through real flows:

Opus 4.7 for: long analysis with precision ("review this contract and flag the clauses that contradict the commercial proposal"), writing with strong editorial judgment, work where the correct answer is singular and the cost of error is high (compliance, audit, brand content), sustained reasoning over large documents where coherence between sections matters.

GPT-5.5 for: agentic flows with multiple tools ("find this data on the web, cross-reference it with this CSV, email me the result"), tasks where the AI has to recover from its own errors without your intervention, desktop/terminal operation with prolonged tool use, flows where "good enough" is actually useful because the cost of intermediate error is low.

Claude Sonnet 4.6 (cheaper) for: daily tasks where either model responds well. Most of your desk work.

GPT-5.5 Thinking when: the flow is long but needs explicit reasoning before action. It's the model that puts the brakes on standard GPT-5.5's agentic tendency to "try first, think later."

Running both in parallel is what any thoughtful professional should consider in 2026. It's not vendor lock-in — it's use by task. My own flow: Claude in Projects for the big pieces I need to control paragraph by paragraph. Claude Code for debugging and review of specific changes. GPT-5.5 when the task has the shape of "execute this flow end-to-end."

The piece benchmarks don't capture

There's something no benchmark captures well, and it determines which one you reach for more: what it feels like to work with the model.

Claude has an interpretive tone. Given an ambiguous task, it asks you what you meant; when it decides on its own, it explains the decision. The cost is that sometimes it over-asks and slows the flow. GPT-5.5 has an executor tone — it assumes, it does, it tells you after. The cost is that it decides things you would have decided differently.

For work you'll review before accepting, Claude saves you work. For work you're delegating on autopilot, GPT-5.5 saves you time. Both are productivity, through different paths.

That personality difference isn't going to close with benchmarks or updates. It lives in the design philosophy. And that, ultimately, is what makes you prefer one over the other for your real workflow.

What's next

Three weeks before the end of the quarter, the market's implicit bets:

Anthropic will respond. The obvious move: drop Opus 4.7's price to neutralize the attack on that vector. The less obvious and more interesting move: strengthen Sonnet 4.7's agentic capabilities (not yet announced) so the mid-tier model captures part of the long workload without Opus pricing. I'd bet on the second. Anthropic's pricing history suggests they don't like price wars.

OpenAI already showed the cadence. If GPT-5.6 lands in June with improvements on long-context analytical tasks, it's stepping directly on Opus's toes. The question is whether they hold the specialization I just described or converge again. The economics of task-specific RLHF suggest they don't.

Google hasn't played its hand yet. Gemini 3.1 Pro is competitive but not frontier. Gemini 3.2, with the new TPU 8t/8i Google announced at Cloud Next this week, could shift the map. Not before Q3 2026.

The fourth bet — the least likely but the most valuable if it resolves — is open-source models (DeepSeek V4, Llama 5) closing the frontier gap. If DeepSeek V4 reaches 90% of Opus 4.7 at a tenth of the cost, the whole ecosystem rewrites itself. No concrete signals of that in the next 30 days. But the DeepSeek V3.2 precedent says it's not impossible.

What's your read on the specialization-vs-convergence trade-off? Will the two keep separating on axes, or will the next release pull them back together?

What happened this week

Where each model wins

When to use which

What didn't change

Closing

Want to go deeper?