Primera Plana · The AI Landscape · Edition #0059

GPT-5.5 just shipped. The question isn't which model wins — it's when to use which

OpenAI shipped a frontier model six weeks after its last one and one week after Anthropic shipped Opus 4.7. The two models are no longer competing on the same axis.

Germán Falcioni April 24, 2026
TL;DR

OpenAI shipped GPT-5.5 on April 23 — six weeks after GPT-5.4 and one week after Claude Opus 4.7. The two frontier models stopped competing on the same axis: GPT-5.5 wins the benchmarks for agents that run long workflows (Terminal-Bench 2.0, OSWorld), Opus 4.7 wins the ones for analytical precision (SWE-Bench Pro, MCP-Atlas). For you, using AI to work better: "which one is better?" just went obsolete. The question that matters now is "which one do I use for what?"

✦ Summarized with Claude at publish time

Six weeks. That's what passed between GPT-5.4 and GPT-5.5. Big frontier models used to take months to get a new version. Now it's weeks.

But the thing worth watching isn't speed. It's the reshuffling that came with it. GPT-5.5 and Claude Opus 4.7 — the two frontier models that shipped this week and last — stopped competing on the same axis.

For the past two years, top AI models got measured on a single list: who scores higher on the same exams? Today, GPT-5.5 wins on some and Opus 4.7 wins on others. The gaps aren't cosmetic. They're structural.

"The race stopped being a ranking. It's an ecosystem now — and that's good for you."

What happened this week

OpenAI shipped GPT-5.5 on April 23 in three variants: Standard, Thinking (with extended reasoning), and Pro (highest accuracy). A one-million-token context window. Pricing: $5 per million input tokens, $30 per million output tokens.

According to data OpenAI provided in its release, the model scores 88.7% on SWE-bench (a standard software engineering benchmark) and 92.4% on MMLU (a broad knowledge test). OpenAI also reports a 60% reduction in "hallucinations" versus GPT-5.4.

All of those are OpenAI's numbers. Treat them as what they are: figures from the maker.

A week earlier, Anthropic had shipped Claude Opus 4.7, also with a 1M-token window, similar pricing ($5 input, $25 output per million tokens), and its own set of optimistic numbers. Both models sit in roughly the same price-capability band. The gap shows up when you look at community benchmarks, where evaluations run under the same rules for both.
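To see what the price difference means in practice, here is a minimal Python sketch that applies the per-million-token prices quoted above to one job. The workload size (200,000 tokens in, 20,000 tokens out) is a made-up example, and `job_cost` is an illustrative helper, not part of either vendor's tooling.

```python
# Prices per million tokens as quoted in this edition (USD).
PRICES = {
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the quoted per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 200,000 tokens in, 20,000 tokens out.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 200_000, 20_000):.2f}")
# GPT-5.5: $1.60
# Claude Opus 4.7: $1.50
```

Input pricing is identical, so the difference only grows with how much the model writes back.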

Where each model wins

This is where the story gets interesting.

GPT-5.5 wins on agentic coding — the long flows where the AI runs, corrects, and retries on its own:

  • Terminal-Bench 2.0: 82.7% vs Opus 4.7's 69.4%.
  • OSWorld-Verified: 78.7% vs 78.0%.
  • Expert-SWE: 73.1%.

These benchmarks measure long workflows: the AI opens a terminal, runs commands, reads output, debugs, retries. A high score means the AI completes the task on its own without getting stuck.

Claude Opus 4.7 wins on analytical precision — the isolated tasks where "close enough" isn't enough:

  • SWE-Bench Pro: 64.3% vs GPT-5.5's 58.6%.
  • MCP-Atlas: 79.1% vs 75.3%.

Opus 4.7 wins 6 of 10 shared benchmarks. GPT-5.5 wins 4. Margins run from under 1 point to about 13.
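For the four head-to-head results listed above, the margins can be checked with a few lines of Python. This is only a sketch over the figures quoted in this edition. It does not cover the other shared benchmarks, and `margin` is an illustrative helper name.

```python
# Head-to-head scores quoted above, in percent: (GPT-5.5, Opus 4.7).
SCORES = {
    "Terminal-Bench 2.0": (82.7, 69.4),
    "OSWorld-Verified": (78.7, 78.0),
    "SWE-Bench Pro": (58.6, 64.3),
    "MCP-Atlas": (75.3, 79.1),
}


def margin(benchmark: str) -> float:
    """GPT-5.5 score minus Opus 4.7 score, in points (positive: GPT-5.5 leads)."""
    gpt, opus = SCORES[benchmark]
    return round(gpt - opus, 1)


for name in SCORES:
    m = margin(name)
    leader = "GPT-5.5" if m > 0 else "Opus 4.7"
    print(f"{name}: {leader} by {abs(m):.1f} points")
# Terminal-Bench 2.0: GPT-5.5 by 13.3 points
# OSWorld-Verified: GPT-5.5 by 0.7 points
# SWE-Bench Pro: Opus 4.7 by 5.7 points
# MCP-Atlas: Opus 4.7 by 3.8 points
```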

Without context, these numbers say nothing. With context, they say everything.

When to use which

The practical rule, sharpened by a couple of tests this week:

When the work needs the AI to analyze, write, review, decide with precision, give one answer that has to be right — Opus 4.7.

When the work needs the AI to run multiple steps in sequence, use tools, adapt to intermediate results, finish something that involves many small moves — GPT-5.5.
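If you route tasks programmatically, the rule above fits in a small lookup. What follows is a hedged sketch, not a recommendation engine: the task labels are invented for illustration, the returned strings are display names rather than API model identifiers, and the "either" default for everything else is an assumption of this example.

```python
# Illustrative task labels (assumptions of this sketch, not an official taxonomy).
PRECISION_WORK = {"analyze", "write", "review", "decide"}
AGENTIC_WORK = {"multi_step", "tool_use", "autonomous_flow"}


def pick_model(kind: str) -> str:
    """Map a kind of work to the model this edition suggests for it."""
    if kind in PRECISION_WORK:
        return "Claude Opus 4.7"  # one answer that has to be right
    if kind in AGENTIC_WORK:
        return "GPT-5.5"  # many small moves chained together
    return "either"  # assumed default for work outside both lists


print(pick_model("review"))      # Claude Opus 4.7
print(pick_model("tool_use"))    # GPT-5.5
print(pick_model("draft_email")) # either
```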

A concrete example from my week: I had to analyze 50 résumés and rank them by fit for an open role.

With no prior context, I gave the task to Claude. I pasted the role profile and the CVs, and asked for a ranking with reasoning. Claude came back with judgment and caveats, and explained why #17 ranked ahead of #22. Precision on each case.

In an agentic setup — same task, but "do it yourself, from pulling the CVs out of my Drive to emailing me the formatted ranking" — GPT-5.5 has the edge. Five tools chained together, and if one fails it tries a variation. Claude can do it too, but with GPT-5.5 you're less likely to see it stall at step three.
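"Chained together, and if one fails it tries a variation" can be pictured as the loop below. This is a toy Python sketch of the general pattern, not how GPT-5.5 or any agent framework is implemented. The tool functions (`fetch_flaky`, `fetch_backup`, `rank`) are hypothetical stand-ins.

```python
from typing import Any, Callable, Sequence

# One step = a preferred tool call followed by fallback variations.
Step = Sequence[Callable[[Any], Any]]


def run_step(variants: Step, payload: Any) -> Any:
    """Try each variation in order; return the first result that does not raise."""
    errors = []
    for variant in variants:
        try:
            return variant(payload)
        except Exception as exc:  # a real agent would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(variants)} variations failed: {errors}")


def run_chain(steps: Sequence[Step], payload: Any) -> Any:
    """Feed each step's output into the next step."""
    for variants in steps:
        payload = run_step(variants, payload)
    return payload


# Hypothetical stand-ins for real tools.
def fetch_flaky(folder: str) -> list:
    raise TimeoutError("Drive did not respond")


def fetch_backup(folder: str) -> list:
    return [f"{folder}/cv_{i}.pdf" for i in range(3)]


def rank(files: list) -> list:
    return sorted(files, reverse=True)


result = run_chain([[fetch_flaky, fetch_backup], [rank]], "resumes")
print(result)
# ['resumes/cv_2.pdf', 'resumes/cv_1.pdf', 'resumes/cv_0.pdf']
```

Stalling at step three, in these terms, is a step whose variations all fail, which is the case `run_step` surfaces as an error instead of retrying forever.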

Neither one solves both jobs equally well. This week's two releases made that explicit.

What didn't change

A detail that gets lost in benchmark comparisons: for most professionals' daily work, the differences become invisible.

Ask either model to "draft this email," "translate this proposal," "summarize this meeting," "list pros and cons," and both come back with a solid answer. A five-point benchmark margin dissolves in daily use.

The gap starts to matter when: (a) you're working with large files or long reasoning chains, where Opus 4.7 holds coherence better; (b) you need the AI to complete an autonomous flow across multiple steps without your intervention, where GPT-5.5 cuts through better; (c) you use AI for work where a small error is expensive (legal review, financial analysis, editorial content), where Opus's precision earns its pricing.

Closing

Here's Friday's news: there's no single answer to "which one do I use?"

The answer you do have: for this kind of work, Claude. For that kind of work, GPT-5.5. Both live in your day. Neither solves everything. The professional who picks with judgment works better than the one who marries a brand.

What about you? Is there something in your flow that isn't working today, possibly because you're using the wrong tool for the task?

Keep exploring

Want to go deeper?

01 So which one should I actually use for my work?

It depends almost entirely on the kind of work you do. If most of your day is writing, reviewing text, analyzing complex information, preparing documents, or making decisions that need precision, Opus 4.7 is still the option least likely to let you down.

If instead you build flows where the AI has to run several steps in sequence (search the web, read files, write code, run it, fix it, retry, all without you in the middle), GPT-5.5 gets you to the result with fewer stumbles.

This isn't dogma. It's specialization. The professional who uses both tools with judgment works better than the one who marries a brand.

02 Are the benchmark numbers real, or are they marketing?

Both kinds of numbers coexist. The 88.7% on SWE-bench and 92.4% on MMLU are figures OpenAI provided in its own release. Anthropic did the same for Opus 4.7 last week. Both are maker numbers.

The healthy way to read them: look at ecosystem benchmarks (Terminal-Bench 2.0, SWE-Bench Pro, MCP-Atlas), where the community runs evaluations with the same rules for everyone. That's where the differences hold.

And above all: test it yourself, with your real work, before drawing conclusions. The benchmark that actually matters is the one you run on a Monday at 3 p.m.

03 Does this mean Claude is falling behind?

No. It means the race changed shape. Through 2024 and 2025, frontier models were measured on a single list: who scores higher on the same exams? In April 2026, GPT-5.5 and Opus 4.7 are effectively tied on general use cases and split on specialized ones.

Opus 4.7 wins 6 of 10 shared benchmarks. GPT-5.5 wins 4. The margins range from under 1 point to about 13.

For daily use, neither one leaves you stranded on what it does well. Falling behind would mean not having a competitive product, and that's not the case. With Opus 4.7, Anthropic has a competitive product by a wide margin.
