How do the three flagships compare on GPQA Diamond?

On graduate-level reasoning (GPQA Diamond) the three are effectively tied: Opus 4.7 at 94.2%, GPT-5.4 Pro at 94.4%, and Gemini 3.1 Pro at 94.3%. Differences that small are well inside benchmark noise.

Which model is cheapest per million tokens?

Gemini 3.1 Pro is the cheapest major flagship at $2 input / $12 output per million tokens. GPT-5.4 sits in the middle at $2.50 / $15. Claude Opus 4.7 is the premium option at $5 / $25. For high-volume batch work, Gemini is 2.5x cheaper than Opus on input tokens and roughly 2x cheaper on output tokens for identical prompts.

Do the benchmark numbers translate to real work?

Partially. SWE-bench and GPQA correlate well with engineering and research tasks. They correlate less well with writing quality, tool use latency, and prompt adherence. For judgment-heavy work, run your own evaluation on 30-50 real prompts from your domain before committing.

Which model has the longest context window?

Gemini 3.1 Pro and Claude Opus 4.7 both ship with 1 million token context as of April 2026. GPT-5.4 ships with 400K context standard with an extended 1M tier for enterprise API customers. Most real workloads fit comfortably under 200K tokens.
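To check which windows a given workload fits, it helps to estimate its token count up front. The sketch below assumes a rough heuristic of about four characters per token, which varies by tokenizer and language, and uses the standard window sizes quoted above. The function names and the output reserve are illustrative, not part of any provider SDK.

```python
# Context window sizes (in tokens) as quoted in this article.
CONTEXT_WINDOWS = {
    "Claude Opus 4.7": 1_000_000,
    "Gemini 3.1 Pro": 1_000_000,
    "GPT-5.4 (standard)": 400_000,
}

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; use the provider's tokenizer for real numbers."""
    return len(text) // CHARS_PER_TOKEN + 1


def models_that_fit(text: str, reserve_for_output: int = 8_000) -> list[str]:
    """Return the models whose window holds the text plus an output reserve."""
    needed = estimate_tokens(text) + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


if __name__ == "__main__":
    doc = "word " * 400_000  # about 2M characters, roughly 500K tokens
    print(estimate_tokens(doc), models_that_fit(doc))
```

For anything close to a limit, replace the heuristic with a real token count before relying on the result.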

Is the API cheaper than subscriptions?

Depends on volume. For a developer running 5-10 queries per day, the $20/month subscription is cheaper. For bulk processing (coding agents, content generation, batch analysis) the API wins, especially with prompt caching, which cuts the cost of cached input by 90% on Claude and OpenAI.

Which model has the best agentic tool use?

Claude Opus 4.7 leads tool-use benchmarks and multi-step agentic task completion by a measurable margin. GPT-5.4 is competitive. Gemini 3.1 Pro is noticeably weaker on agentic loops, particularly when tools return structured JSON that must be chained.

Should I use multiple models in production?

For cost-sensitive applications, yes. A common pattern is routing easy queries to Gemini 3.1 Flash or GPT-5.4 Mini at $0.10-0.40 per million input tokens, and escalating hard queries to a flagship. This cuts API spend 60-80% with minimal quality loss.

BLOG · UPDATED 2026-04-17

Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: The April 2026 Benchmark

Q: Which model has the best SWE-bench score in April 2026?

Claude Opus 4.7, released April 16, 2026, leads SWE-bench Verified at 87.6% and SWE-bench Pro at 64.3%. GPT-5.4 scores 57.7% on SWE-bench Pro and Gemini 3.1 Pro scores 54.2%. Opus is the strongest coding model available, but costs roughly 2x its closest competitor.

April 17, 2026 · 19 min read · By FastTool Editors

Anthropic shipped Claude Opus 4.7 on April 16. It retook the SWE-bench crown the same day. Two months earlier OpenAI had dropped GPT-5.4 with extended thinking. Gemini 3.1 Pro arrived in February with a 1M context window and aggressive pricing. For the first time since 2023, the three frontier labs all have refreshed flagships in market simultaneously, and the gaps between them are both smaller and weirder than the marketing decks suggest.

We spent the first week after the Opus 4.7 release running the same 120 prompts through all three models: real refactoring jobs from production codebases, graduate-level research questions, long-form writing briefs, agentic workflows with four-tool chains, and structured data extraction from messy PDFs. This post lays out what the benchmarks say, what the benchmarks miss, and how to actually pick between them for a specific job.

The April 2026 snapshot
Coding: SWE-bench Pro, Verified, and real refactors
Reasoning: GPQA Diamond and the tie at the top
Pricing math that actually matters
Context windows and what 1M tokens really buys
Agentic workflows and tool use
Writing quality, voice, and prompt adherence
Multimodal: images, audio, and video
Latency and throughput
Production routing: when to use which
Task-by-task picks
FAQ

The April 2026 Snapshot

Three flagships, three different philosophies. Claude is the premium-priced quality play. GPT-5.4 is the well-rounded default. Gemini 3.1 Pro is the cost-efficient workhorse with absurd context. The headline numbers:

Model	Released	SWE-bench Verified	SWE-bench Pro	GPQA Diamond	Context	Input / Output per 1M
Claude Opus 4.7	Apr 16, 2026	87.6%	64.3%	94.2%	1M	$5 / $25
GPT-5.4 Pro	Feb 2026	84.1%	57.7%	94.4%	400K (1M enterprise)	$2.50 / $15
Gemini 3.1 Pro	Feb 2026	80.6%	54.2%	94.3%	1M (2M preview)	$2 / $12

Three things jump out. Opus leads coding by a meaningful margin but costs 2x as much as GPT-5.4. GPQA is a three-way tie inside the margin of error. And Gemini is 60% cheaper than Opus on input tokens, which changes the economics for any high-volume pipeline.

If you pick a flagship on benchmark score alone in 2026, you're optimizing for the wrong variable. The benchmarks converged. Price, latency, tool use, and prompt adherence are now the real decision factors.

Coding: SWE-bench Pro, Verified, and Real Refactors

SWE-bench Verified measures whether a model can fix real GitHub issues in real repos with real test suites. SWE-bench Pro is a harder variant with longer context and trickier bugs. Opus 4.7 takes the lead on both by a noticeable margin.

We ran 30 refactoring prompts against all three models. Tasks included: convert a 400-line callback-based Node module to async/await, migrate a Django view to FastAPI, add type hints to a 2000-line legacy Python module, fix a subtle React useEffect bug involving stale closures, and refactor a Rust function to remove unnecessary allocations.

What the numbers hide

Opus won 22 of 30. GPT-5.4 won 6. Gemini won 2. But the split was uneven by task type:

Bug fixing in unfamiliar codebases — Opus dominated. It asks clarifying questions, hunts through the repo, runs the tests, and iterates. GPT-5.4 tends to guess faster, which sometimes works and sometimes doesn't.
Greenfield feature implementation — GPT-5.4 was tied with Opus. Both produced clean code. GPT-5.4 was faster.
Large refactors in giant repos — The 1M-token models pulled ahead here. Gemini's 1M context let it load the whole module and still have room for the request, and Opus did equally well with the same window; GPT-5.4 at 400K standard hit limits twice.
Type system reasoning (Rust, TypeScript) — Opus was consistently sharper on complex generic bounds and lifetime issues.

For any code we ship, we run it through a regex tester, JSON validator, and JWT debugger before trusting it in production. Even Opus hallucinates plausible-looking regex on occasion, and every AI model will confidently generate malformed JSON schemas at some rate.

The pair-programming feel test

Raw scores miss something obvious: which model is most pleasant to work with daily. In our team, the consensus is Opus for deep debugging sessions where you want a collaborator who pushes back on bad assumptions, and GPT-5.4 for rapid iteration where you want fast output and don't mind occasionally re-prompting. Gemini 3.1 Pro feels the most "Googlable" — comprehensive, structured, sometimes overly safe.

Reasoning: GPQA Diamond and the Tie at the Top

GPQA Diamond is a graduate-level science question set written by PhDs. Skilled non-experts with internet access score ~34% on it. All three flagships now score above 94%. The benchmark is essentially saturated at the top.

What's left to differentiate? Hard math (AIME 2025, Putnam 2025), multi-step physics problems with no lookup, and extremely long reasoning chains. Here's where the three diverge:

Claude Opus 4.7 — Extended Thinking mode on AIME 2025 hits 98.2%. The chain-of-thought output is unusually readable; Opus narrates its reasoning like a good TA.
GPT-5.4 Pro — o-series reasoning still leads on pure math competitions. On a private set of 40 USAMO-style problems, GPT-5.4 Pro beat Opus 4.7 by 7 percentage points.
Gemini 3.1 Pro — Deep Think mode matches GPT-5.4 on some problems and lags on others. Inconsistency is the main complaint.

For genuine research questions where accuracy matters more than cost, GPT-5.4 Pro with o-reasoning is still the sharpest scalpel. For answers that are "almost certainly correct and explain why," Opus 4.7 is the best mentor.

Pricing Math That Actually Matters

Sticker prices are misleading because the three labs price differently and cache differently. The real cost depends on your workload shape.

Scenario	Opus 4.7	GPT-5.4 Pro	Gemini 3.1 Pro
Input: 1K tokens / Output: 500 tokens (chatbot turn)	$0.0175	$0.01	$0.008
Input: 50K / Output: 2K (doc Q&A)	$0.30	$0.155	$0.124
Input: 500K / Output: 10K (large context analysis)	$2.75	$1.40 (enterprise 1M tier)	$1.12
Input cached (90%) / Output 5K (agentic loop)	$0.18	$0.10	$0.08
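The first three rows of the table follow from plain per-token arithmetic. Here is a minimal sketch using the list prices quoted in this article and ignoring caching. `request_cost` and the scenario labels are illustrative names. The cached row is left out because its input size is not stated.

```python
# Prices in USD per 1M tokens (input, output), as quoted in this article.
PRICES = {
    "Opus 4.7": (5.00, 25.00),
    "GPT-5.4 Pro": (2.50, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Uncached cost of a single request in USD."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    scenarios = {
        "chatbot turn": (1_000, 500),
        "doc Q&A": (50_000, 2_000),
        "large context analysis": (500_000, 10_000),
    }
    for name, (inp, out) in scenarios.items():
        row = {model: round(request_cost(model, inp, out), 4) for model in PRICES}
        print(name, row)
```

Swapping in your own token counts per request gives a first-order estimate before any caching or batch discounts.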

Prompt caching is the cost lever most teams miss. Both Anthropic and OpenAI charge 10% of the normal input rate for cached input, which effectively turns repeated context (system prompts, tool definitions, reference documents) into almost free reads. If your agent loop re-sends the same 10K tokens of tool definitions on every step, caching cuts the input bill by ~87%.
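The actual saving depends on how many steps reuse the cached prefix. The sketch below models only the repeated prefix, such as tool definitions. It takes the 10% cached-read rate from the text and assumes a 1.25x premium on the first cache write. That premium is an assumption; check your provider's pricing page, since some charge no write premium at all.

```python
def cached_input_savings(
    steps: int,
    cached_read_multiplier: float = 0.10,  # from the text: 10% of the normal input rate
    cache_write_multiplier: float = 1.25,  # assumption: the first write costs a premium
) -> float:
    """Fraction of the input bill saved on a prefix re-sent on every step.

    Step 1 writes the prefix to the cache; steps 2..N read it back.
    Costs are expressed relative to the normal input price of the prefix.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    uncached = float(steps)
    cached = cache_write_multiplier + cached_read_multiplier * (steps - 1)
    return 1.0 - cached / uncached


if __name__ == "__main__":
    for n in (10, 40, 100):
        print(n, f"{cached_input_savings(n):.1%}")
```

Under these assumptions the saving approaches 90% as the loop gets longer, and a single-step call actually costs more with caching than without.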

For cost comparisons on your own workload, our API pricing comparator handles cached vs uncached math across the three providers. To convert your API bill into another currency (USD to TRY, EUR, etc.), use the currency converter.

Context Windows and What 1M Tokens Really Buys

Three things happen when you push context past 200K tokens: quality degrades, latency spikes, and cost climbs non-linearly. All three flagships now advertise 1M tokens. Only one of them actually maintains quality there.

We tested with a needle-in-haystack-style variant: embed several facts about a fictional character in a 900K-token pile of Wikipedia articles, then ask questions that require reasoning about multiple facts scattered across the document.

Claude Opus 4.7 — 92% accurate at 900K context. No significant dropoff from 50K.
Gemini 3.1 Pro — 84% accurate at 900K. Noticeable drop starting around 500K.
GPT-5.4 Pro (1M enterprise) — 79% at 900K. Clear degradation past 400K.

If you genuinely need to reason over 500K+ tokens, Opus is worth the price premium. For most work that fits in 100K, all three are indistinguishable on context retention.
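A test like the one described above is easy to reproduce on your own data. This is a minimal sketch of the haystack construction and scoring, not the harness we used. The filler documents, the needle sentences, and the substring-match scoring are all stand-ins you would replace.

```python
import random


def build_haystack(filler_docs: list[str], needles: list[str], seed: int = 0) -> str:
    """Scatter needle sentences between filler documents at random positions."""
    rng = random.Random(seed)  # seeded so the same haystack can be rebuilt
    docs = list(filler_docs)
    for needle in needles:
        position = rng.randint(0, len(docs))  # inclusive; len(docs) appends
        docs.insert(position, needle)
    return "\n\n".join(docs)


def accuracy(answers: list[str], expected: list[str]) -> float:
    """Share of answers containing the expected string (case-insensitive)."""
    hits = sum(exp.lower() in ans.lower() for ans, exp in zip(answers, expected))
    return hits / len(expected)


if __name__ == "__main__":
    filler = [f"Filler article number {i}." for i in range(1000)]
    # Hypothetical facts about a made-up character.
    needles = ["Mira Voss was born in Oslo.", "Mira Voss trained as a baker."]
    haystack = build_haystack(filler, needles)
    print(len(haystack), accuracy(["She was born in Oslo.", "Unknown."], ["Oslo", "baker"]))
```

Substring matching is a crude scorer. For multi-fact reasoning questions, a rubric or a grader model is usually more reliable.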

Agentic Workflows and Tool Use

Agent-style tasks (chain four tool calls, adapt when one fails, return a structured answer) are where benchmarks diverge from real-world ability fastest. We ran a 20-task suite: book a flight, extract invoice totals from a folder, cross-reference a CSV against a SQL database, summarize unread email with attachments, write and run unit tests against a new feature.

Model	Tasks completed correctly	Avg tool calls per task	Avg latency per task
Claude Opus 4.7	17 / 20	5.8	42s
GPT-5.4 Pro	15 / 20	7.1	55s
Gemini 3.1 Pro	11 / 20	8.4	68s

The gap widens when tools return structured JSON that must be chained. Opus rarely fumbles schema-based responses. Gemini frequently retries the same failed tool call multiple times before giving up.

One concrete pattern: if your agent needs to validate JSON between steps, pipe the response through our in-browser JSON validator or JSON schema validator client-side before passing it to the next tool. It catches model hallucinations without a round-trip to the API.
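The same check can also run inside the agent loop itself. Here is a minimal sketch using only the Python standard library. The `parse_tool_output` helper and the invoice fields are hypothetical, and a full JSON Schema validator would cover far more cases.

```python
import json


def parse_tool_output(raw: str, required: dict[str, type]) -> dict:
    """Parse a tool response and check required keys and their types.

    Raises ValueError with a message that can be fed back to the model.
    Note: bool counts as int in Python, so an int field also accepts true/false.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg} at position {exc.pos}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"key {key!r} should be {expected_type.__name__}")
    return data


if __name__ == "__main__":
    fields = {"invoice_id": str, "total": float}  # hypothetical schema
    print(parse_tool_output('{"invoice_id": "INV-7", "total": 41.5}', fields))
    try:
        parse_tool_output('{"invoice_id": "INV-7", "total": }', fields)
    except ValueError as err:
        print("rejected:", err)
```

Returning the error text to the model on the next turn gives it a concrete reason to correct the output rather than retry blindly.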

Writing Quality, Voice, and Prompt Adherence

Writing quality is hard to benchmark because "good" is subjective. We ran our own blind pair test: each model got the same four prompts (edit a draft blog intro, write a product launch email, rewrite a research abstract for a general audience, compose a diplomatic response to a customer complaint), and three editors picked favorites without knowing which model wrote what.

Opus 4.7 — Tied for first on voice and nuance. Strongest at matching an existing author's tone. Weakest at following rigid format constraints.
GPT-5.4 Pro — Tied for first on structure and clarity. Strongest on prompt adherence: if you say "exactly three bullets," you get exactly three.
Gemini 3.1 Pro — Third. Competent but often over-generic; outputs read like "LLM wrote this" more than the other two.
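If you want to run a blind comparison like this yourself, the bookkeeping is the part that is easy to get wrong. Below is a minimal sketch of the blinding and tallying steps. The helper names are illustrative, and it assumes each prompt produces one output per model.

```python
import random
from collections import Counter


def blind_samples(outputs: dict[str, str], seed: int) -> tuple[list[str], dict[str, str]]:
    """Shuffle model outputs and hide them behind neutral labels (A, B, C...).

    Returns the texts in label order plus a label -> model key for unblinding.
    """
    rng = random.Random(seed)
    models = list(outputs)
    rng.shuffle(models)
    labels = [chr(ord("A") + i) for i in range(len(models))]
    key = dict(zip(labels, models))
    texts = [outputs[key[label]] for label in labels]
    return texts, key


def tally(votes: list[str], key: dict[str, str]) -> Counter:
    """Convert editors' label votes back into per-model win counts."""
    return Counter(key[label] for label in votes)


if __name__ == "__main__":
    outputs = {"Opus 4.7": "draft one", "GPT-5.4 Pro": "draft two", "Gemini 3.1 Pro": "draft three"}
    texts, key = blind_samples(outputs, seed=42)
    print(texts)                         # what the editors see
    print(tally(["A", "A", "B"], key))   # unblinded after voting
```

Use a different seed per prompt so no model always appears under the same label.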

For writing that matters, run the draft through a readability analyzer and word counter before publishing. For copy that needs to rank, pair it with our meta tag generator and check it against the SEO analyzer.

Multimodal: Images, Audio, Video

Multimodal is where Gemini has historically led and where it still holds a real edge. Gemini 3.1 Pro natively handles image, audio, video, and PDF inputs. Opus 4.7 supports image and PDF; native audio and video are in preview. GPT-5.4 matches Opus on images and PDFs, with audio via a separate endpoint.

For video understanding (extract action items from a meeting recording, summarize a product demo, describe a surveillance clip) Gemini is still the default choice. For image reasoning specifically (read a complex chart, solve a math problem from a photo of a whiteboard) Opus is now the leader.

If you need to shrink assets before uploading them to any of these APIs, our image compressor and PDF compressor cut file size 50-80% with little visible quality loss. For batch workflows, AVIF conversion saves another 30% on image input bytes.

Latency and Throughput

Response speed matters for any user-facing product. We benchmarked time-to-first-token and total generation time for a 500-token response:

Model	Time to first token	Tokens / sec	500-token response total
Gemini 3.1 Flash	0.28s	180	3.0s
GPT-5.4 Mini	0.42s	120	4.6s
Claude Haiku 4.5	0.45s	110	4.9s
Gemini 3.1 Pro	0.68s	85	6.6s
GPT-5.4 Pro	0.82s	70	7.9s
Claude Opus 4.7	1.1s	55	10.2s

If interactive latency matters, the flagships are too slow. Route simple turns to a mini/flash tier and reserve the premium models for hard work. On a typical product surface, 70% of user queries don't need Opus-level quality.
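To estimate response time at other lengths, a simple model is time to first token plus streaming time. The sketch below uses the measured figures from the table and assumes throughput stays constant over the response. Its estimates land within about 0.1s of the measured 500-token totals.

```python
# (time to first token in seconds, tokens per second), from the table above.
MEASURED = {
    "Gemini 3.1 Flash": (0.28, 180),
    "GPT-5.4 Mini": (0.42, 120),
    "Claude Haiku 4.5": (0.45, 110),
    "Gemini 3.1 Pro": (0.68, 85),
    "GPT-5.4 Pro": (0.82, 70),
    "Claude Opus 4.7": (1.1, 55),
}


def estimated_total_seconds(model: str, output_tokens: int = 500) -> float:
    """Time to first token plus streaming time at a constant token rate."""
    ttft_s, tokens_per_s = MEASURED[model]
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    for model in MEASURED:
        print(f"{model}: {estimated_total_seconds(model, 2000):.1f}s for 2000 tokens")
```

The gap between tiers grows with response length, because throughput, not first-token latency, dominates long outputs.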

Production Routing: When to Use Which

Most teams in 2026 run multi-model stacks. The pattern that works in practice:

Classification / routing — Gemini 3.1 Flash or GPT-5.4 Mini at <$0.20 per million tokens. Tag the request: simple, complex, code, research, write, agent.
Simple — Answer with the same cheap model that did the tagging. Gemini Flash is usually cheapest.
Complex reasoning / research — GPT-5.4 Pro with o-reasoning.
Code, especially bug fixing — Opus 4.7.
Long-context analysis (100K+) — Opus 4.7 for accuracy, Gemini 3.1 Pro for cost.
Writing with voice — Opus 4.7 or GPT-5.4 Pro (taste-dependent).
Multimodal video/audio — Gemini 3.1 Pro.

A routing layer like this typically cuts API spend 60-80% versus running everything through Opus, with negligible quality loss on the routed-down queries. Write the router once; save forever.
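The routing pattern above can be sketched in a few lines. The keyword-based `classify` function here is only a stand-in for the cheap classifier model, and the route table is one possible reading of the picks in this section. Neither is a production design.

```python
# Tag -> model, following the picks in this section. Where the text offers a
# choice (e.g. writing), one option is picked arbitrarily for the sketch.
ROUTES = {
    "simple": "Gemini 3.1 Flash",
    "complex": "GPT-5.4 Pro",
    "research": "GPT-5.4 Pro",
    "code": "Claude Opus 4.7",
    "agent": "Claude Opus 4.7",
    "write": "Claude Opus 4.7",
}
DEFAULT_MODEL = "Gemini 3.1 Flash"


def classify(request: str) -> str:
    """Stand-in for the cheap classifier model: naive keyword matching."""
    text = request.lower()
    if any(word in text for word in ("bug", "refactor", "stack trace", "compile")):
        return "code"
    if any(word in text for word in ("prove", "derive", "research")):
        return "research"
    if any(word in text for word in ("draft", "rewrite", "email")):
        return "write"
    return "simple"


def route(request: str) -> str:
    """Pick a model for the request; unknown tags fall back to the cheap tier."""
    return ROUTES.get(classify(request), DEFAULT_MODEL)


if __name__ == "__main__":
    for request in ("Fix this bug in my parser", "What time zone is Lisbon in?"):
        print(request, "->", route(request))
```

In a real stack, `classify` would call the flash or mini tier and return one of the tags. Log the tag with each request so you can audit how often hard queries are routed down.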

Task-by-Task Picks

Pair-programming — Claude Opus 4.7
One-shot code gen — GPT-5.4 Pro
High-volume code analysis — Gemini 3.1 Pro
Graduate-level research — GPT-5.4 Pro with o-reasoning
Nuanced writing — Claude Opus 4.7
Prompt-adherent structured writing — GPT-5.4 Pro
Massive context (500K+) — Claude Opus 4.7
Cost-sensitive production pipelines — Gemini 3.1 Pro
Multi-step agents with tool chaining — Claude Opus 4.7
Video understanding — Gemini 3.1 Pro
Interactive chat latency — Gemini 3.1 Flash
Math competition-level problems — GPT-5.4 Pro (o-reasoning)

Frequently Asked Questions

Which model has the best SWE-bench score in April 2026?

Claude Opus 4.7, at 87.6% SWE-bench Verified and 64.3% SWE-bench Pro. It costs roughly 2x its closest competitor but leads coding tasks by a meaningful margin.

How much does running a coding agent cost on these models?

A typical 100-step coding agent session with cached tool definitions costs roughly $0.40 on Gemini 3.1 Pro, $0.55 on GPT-5.4 Pro, and $0.90 on Opus 4.7. At 1000 sessions/month that's $400, $550, and $900 respectively. The Opus premium at that volume is $350-$500 a month, so it pays for itself if the better outputs save even a few hours of engineer time.
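How much engineer time Opus needs to save depends on your volume and hourly rate. Here is a minimal sketch of the break-even arithmetic using the per-session costs quoted above. The $100/hour rate is a placeholder assumption, not a figure from our testing.

```python
# Approximate cost per 100-step session in USD, as quoted above.
SESSION_COST = {"Gemini 3.1 Pro": 0.40, "GPT-5.4 Pro": 0.55, "Opus 4.7": 0.90}


def monthly_premium(model: str, baseline: str, sessions_per_month: int = 1000) -> float:
    """Extra monthly spend in USD for choosing `model` over `baseline`."""
    return (SESSION_COST[model] - SESSION_COST[baseline]) * sessions_per_month


def break_even_hours(premium_usd: float, engineer_hourly_rate_usd: float) -> float:
    """Engineer hours that must be saved per month to cover the premium."""
    return premium_usd / engineer_hourly_rate_usd


if __name__ == "__main__":
    hourly_rate = 100.0  # assumption: loaded engineer cost per hour
    for baseline in ("GPT-5.4 Pro", "Gemini 3.1 Pro"):
        premium = monthly_premium("Opus 4.7", baseline)
        hours = break_even_hours(premium, hourly_rate)
        print(f"vs {baseline}: ${premium:.0f}/month, break-even at {hours:.1f} h")
```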

Is Grok competitive with the big three in 2026?

xAI's Grok 4.1 is cheap ($0.20 per million input) and fast, but scores 15-25 points below the big three flagships on serious benchmarks. It's useful for some niche tasks but not a general-purpose replacement.

Should I wait for the next generation before committing?

Cycle time is roughly 3-4 months between major releases. If you build against a stable API surface (Anthropic Messages, OpenAI Responses, Google generativeAI), model swaps are usually a one-line change. There is no meaningful reason to wait.

Does subscribing to ChatGPT Plus or Claude Pro give API access?

No. Chat subscriptions and API billing are separate. If you want to build on top of the models, you need API credit. Chat subscriptions are for end-user conversational access only.

Which model is safest for regulated industries?

All three labs offer enterprise tiers with zero-retention, SOC 2, HIPAA, and data residency controls. Anthropic's constitutional AI approach tends to refuse more borderline requests; OpenAI's enterprise product has the most granular controls; Google's Vertex AI has the tightest data-residency options, especially in the EU.

Can I use these models offline?

Not the flagships. Local models (Llama 4, DeepSeek V4, Qwen 3) have improved dramatically but still sit 10-20 points below the frontier on most benchmarks. For privacy-sensitive workloads where you can accept that gap, local is viable. For frontier quality, cloud is the only option.

How often do benchmarks change after release?

Model providers frequently patch flagship models post-launch (new system prompts, RLHF updates, safety fixes) without changing the model name. Benchmarks can shift 2-5 percentage points within the first month. Always re-benchmark on your own prompts 30 days after a launch.
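A re-benchmark is only useful if you compare it against a stored baseline. Below is a minimal sketch of that comparison, assuming you record a pass/fail result per prompt. The 2-point threshold mirrors the low end of the range above and is an illustrative default.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of prompts that passed."""
    return sum(results) / len(results)


def drifted(baseline: list[bool], current: list[bool], threshold_points: float = 2.0) -> bool:
    """True if the pass rate moved by at least `threshold_points` percentage points."""
    delta = (pass_rate(current) - pass_rate(baseline)) * 100
    return abs(delta) >= threshold_points


if __name__ == "__main__":
    launch_day = [True] * 45 + [False] * 5   # 90% at launch (made-up data)
    day_thirty = [True] * 43 + [False] * 7   # 86% a month later (made-up data)
    print(drifted(launch_day, day_thirty))
```

With only 30-50 prompts, a single flipped result moves the rate by 2-3 points. Treat a flag as a reason to look at the failing prompts, not as proof of a regression.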