

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

For the first time since ChatGPT launched in November 2022, there is no clear best AI model.
Not because the models aren't good — they're extraordinary. But because "best" has fractured into a dozen dimensions, and each company has claimed different ones. Google's Gemini 3.1 Pro tops math, science, and the overall intelligence index. OpenAI's GPT-5.3 Codex leads agentic coding. Anthropic's Claude Opus 4.6 dominates economically valuable work. And each one falls short in ways that matter.
I've spent the past month working with all three in production — building agents, writing code, analyzing documents, and pushing each model to its limits. Here's what I found, backed by actual benchmarks, real pricing, and honest assessments of where each one falls short.
Three companies. Three flagship models. Three very different philosophies.
Google Gemini 3.1 Pro — The newest kid on the block, launched February 19, 2026, and still in preview. A major leap over Gemini 3 Pro (November 2025), it immediately claimed the #1 spot on the Artificial Analysis Intelligence Index (57/100, rank #1 of 115 models). It's Google's "throw everything at the wall" model: 1 million token context, reasoning capabilities, and benchmarks that would have seemed impossible a year ago — including a 148% relative improvement on ARC-AGI-2 over its predecessor.
Anthropic Claude Opus 4.6 — Released February 5, 2026. Anthropic's approach is precision over spectacle. Opus 4.6 briefly topped Terminal-Bench 2.0 before GPT-5.3 Codex reclaimed it the same day, but it leads by 144 Elo points on economically valuable work tasks (GDPval-AA) and introduced "adaptive thinking" — a feature where the model decides how hard to think about your problem. No other model does this.
OpenAI GPT-5.3 Codex — Launched February 5, 2026, roughly 30 minutes after Anthropic published Claude Opus 4.6. GPT-5.3 Codex merged the previous GPT-5.2 and GPT-5.2-Codex into a single general-purpose agentic model that transcends the IDE — handling end-to-end knowledge work from Jira tickets to deployment pipelines. It immediately reclaimed the Terminal-Bench crown, and a lightweight variant, GPT-5.3 Codex Spark, followed a week later running on Cerebras hardware at 1,000+ tokens per second.

Let's skip the benchmarks that don't translate to real-world performance and focus on the ones that do.
Coding is where most developers will focus, and the picture is surprisingly clear.
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.3 Codex |
|---|---|---|---|
| SWE-Bench Verified | 80.6% | ~78%* | 80.2% |
| SWE-Bench Pro | 54.2% | — | 56.8% |
| Terminal-Bench 2.0 | 68.5% | 69.9% | 75.1% |
| OSWorld-Verified | — | — | 64.7% |
| LiveCodeBench Pro Elo | 2,887 | — | — |
*Opus 4.6 specific SWE-Bench Verified number not published; predecessor Sonnet 4.5 scored 77.2%
The takeaway: Gemini 3.1 Pro barely edges ahead on SWE-Bench Verified (80.6% vs GPT-5.3 Codex's 80.2%), but GPT-5.3 Codex dominates agentic coding — the kind of multi-step, multi-file software engineering that reflects actual development work. Terminal-Bench 2.0 measures exactly this: can the model navigate a real codebase, understand context across files, and ship working code? GPT-5.3 Codex sits at #1 with 75.1%, a meaningful lead over Claude Opus 4.6's 69.9%. Its 64.7% on OSWorld-Verified (up from GPT-5.2's 38.2%) shows dramatic improvement in computer-use tasks as well.
For competitive programming and algorithmic challenges, Gemini 3.1 Pro's LiveCodeBench Elo of 2,887 is in a league of its own.
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.3 Codex |
|---|---|---|---|
| GPQA Diamond (grad-level science) | 94.3% | — | 87.7%*** |
| AIME 2025 (math competition) | 95.2%** | — | 96.7%*** |
| ARC-AGI-2 (abstract reasoning) | 77.1% | — | 52.9% |
| Humanity's Last Exam | 51.4%**** | — | — |
**Gemini 3 Flash score; Gemini 3.1 Pro's AIME 2025 score not published separately
***o3 scores; GPT-5.3 Codex's reasoning capability is essentially unchanged from GPT-5.2 (identical GDPval score of 70.9%), so these approximations hold
****With search and code tools; 44.4% without tools
Google dominates here. Gemini 3.1 Pro's 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 represent significant leads. The ARC-AGI-2 gap is particularly striking — Google nearly doubles OpenAI's score on what's considered one of the hardest reasoning benchmarks in existence. Gemini 3 Flash's 95.2% on AIME 2025 (99.7% with code tools) is also remarkable for a model priced at $0.50/$3.00.
This is where it gets interesting. Anthropic introduced a benchmark called GDPval-AA that measures performance on "economically valuable" tasks — the kind of work that actually generates revenue in professional settings.
| Benchmark | Claude Opus 4.6 | GPT-5.3 Codex | Gemini 3.1 Pro |
|---|---|---|---|
| GDPval-AA (economic value) | +144 Elo vs GPT-5.3 Codex | Baseline | 1,317 Elo |
| BrowseComp (web research) | Best in class | — | 85.9% |
| MRCR v2 128k (long-context retrieval) | 76% | — | 84.9% |
Claude Opus 4.6 is 144 Elo points ahead of GPT-5.3 Codex on economically valuable work. That's not a marginal difference — that's a generational gap on the tasks that matter most for professional use.

Let me be direct. After extensive use, each model has a clear domain of dominance.
Best for: Mathematical reasoning, scientific analysis, competitive programming, multilingual tasks, processing massive documents, software engineering at scale.
Gemini 3.1 Pro launched just yesterday (February 19) and immediately took the #1 spot on the Artificial Analysis Intelligence Index with a score of 57/100 across 115 models. Its 94.3% on GPQA Diamond means it can handle graduate-level physics, chemistry, and biology questions with near-expert accuracy. The jump on ARC-AGI-2 — from 31.1% (Gemini 3 Pro) to 77.1% — is the most dramatic single-generation improvement on any reasoning benchmark this cycle.
On coding, it leads SWE-Bench Verified at 80.6% while scoring a competitive 68.5% on Terminal-Bench 2.0. GitHub testing showed 35% higher accuracy in resolving software engineering challenges, and JetBrains reported a 50% improvement in benchmark task resolution compared to the predecessor.
The 1 million token context window is genuinely useful, not just a marketing number. I've fed it entire codebases, 500-page documents, and multi-hour meeting transcripts. Retrieval accuracy at 128k tokens (84.9% on MRCR v2) is the best of any model.
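Before shipping a whole repository into a prompt, it's worth a sanity check that it actually fits. The sketch below uses the rough rule of thumb of ~4 bytes per token; Gemini's real tokenizer will count differently, so treat the numbers as ballpark estimates only.

```python
# Rough check of whether a set of documents fits a 1M-token context window.
# Uses the common ~4 bytes-per-token heuristic; the provider's actual
# tokenizer will differ, so this is a ballpark estimate, not an exact count.

def estimate_tokens(text: str) -> int:
    """Approximate token count from UTF-8 byte length."""
    return max(1, len(text.encode("utf-8")) // 4)

def fits_in_context(documents: list[str], window: int = 1_000_000) -> bool:
    """True if the estimated total token count fits in the window."""
    return sum(estimate_tokens(d) for d in documents) <= window

# Example: a hundred ~40 KB source files is roughly 1M tokens of input.
files = ["x" * 40_000] * 100  # stand-in for real file contents
print(sum(estimate_tokens(f) for f in files), fits_in_context(files))
```

In practice you'd read real file contents off disk and leave headroom for the system prompt and the model's response, since the window covers both.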
Where it falls short: Still in preview — Google's own documentation warns against using it for mission-critical production workloads. Computer use is notably absent (that's on Gemini 3 Flash, not 3.1 Pro). Time-to-first-token of ~31 seconds is painfully slow compared to Claude Opus 4.6's 1.7 seconds, a consequence of its reasoning approach. And output quality can be inconsistent compared to Claude, particularly in creative writing and nuanced instruction-following.
Best for: Software engineering, agentic workflows, code review, technical writing, professional knowledge work, computer use automation.
There's a reason Claude has become the default model for serious software engineering work. Opus 4.6 leads GDPval-AA by a massive margin (+144 Elo over GPT-5.3 Codex on economically valuable tasks), scores a competitive 69.9% on Terminal-Bench 2.0, and introduced a feature called adaptive thinking that fundamentally changes how the model approaches problems.
Here's what adaptive thinking means in practice: instead of you deciding whether the model should "think hard" about a problem (and paying for reasoning tokens either way), Opus 4.6 automatically scales its reasoning depth to match the complexity of your query. Simple question? Instant response. Complex multi-step coding task? Deep, extended reasoning. This isn't a gimmick — it's a meaningful UX improvement that no other model offers.
The 128K max output is also the highest in the Claude family and lets the model generate entire features, complete with tests, in a single response.
Pricing tells a story too. Opus 4.6 is $5/$25 per million tokens (input/output), where the previous-generation Opus 4 was $15/$75: a substantially more capable model at a third of the price. That's the kind of deflation that reshapes entire markets.
Where it falls short: No audio or video input. The 1M context window is still in beta and restricted to high-tier API users. Extended thinking costs can be unpredictable because you're billed for the full internal reasoning, not just the summary you see.
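That billing quirk is worth quantifying. Since every generated token is billed at the output rate, including internal reasoning you never see, cost scales with how hard the model decides to think. The reasoning-to-visible ratio below is a hypothetical illustration, not a published figure.

```python
# Back-of-the-envelope for extended-thinking cost. All generated tokens are
# billed at the output rate ($25 per 1M tokens for Opus 4.6), including
# internal reasoning. The reasoning ratio here is a made-up illustration.

OUTPUT_PRICE_PER_M = 25.00  # Opus 4.6 output price, USD per 1M tokens

def thinking_cost(visible_tokens: int, reasoning_ratio: float) -> float:
    """Cost of a response whose internal reasoning is reasoning_ratio
    times the length of the visible output."""
    billed = visible_tokens * (1 + reasoning_ratio)
    return billed / 1_000_000 * OUTPUT_PRICE_PER_M

# A 2,000-token visible answer:
print(f"no thinking:   ${thinking_cost(2_000, 0):.2f}")   # $0.05
print(f"10x reasoning: ${thinking_cost(2_000, 10):.2f}")  # $0.55
```

The same visible answer can vary by an order of magnitude in cost depending on reasoning depth, which is exactly why the bill is hard to predict up front.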
Best for: Agentic software engineering, computer-use automation, end-to-end knowledge work, rapid iteration, tasks that benefit from integrated tool use.
GPT-5.3 Codex's strength is momentum. It reclaimed the Terminal-Bench 2.0 crown at 75.1%, leapfrogging Claude Opus 4.6, and its 64.7% on OSWorld-Verified — up 26.5 points from GPT-5.2 — represents the largest single-generation improvement in computer-use capability we've seen. It's no longer just a coding model: OpenAI positions it as a "general work agent" that handles Jira tickets, deployment pipelines, SQL queries, and PDF reports alongside code.
The interactive real-time collaboration feature is genuinely new. Via the Codex macOS app, you can steer the model mid-task — give feedback, add context, redirect — without breaking the workflow. It's approximately 25% faster than GPT-5.2-Codex, generating ~65-70 tokens per second on NVIDIA GB200 NVL72 hardware.
The GPT-5.3 Codex Spark variant deserves its own mention. Running on Cerebras WSE-3 wafer-scale chips (a first for OpenAI — no NVIDIA), Spark hits 1,000+ tokens per second, roughly 15x faster than the flagship. It nearly matches the flagship on SWE-Bench Pro (~56%) but drops significantly on Terminal-Bench 2.0 (58.4%) — speed costs you depth on complex agentic tasks. Currently a research preview for ChatGPT Pro subscribers ($200/month).
Where it falls short: Price. At $6/$30 per million tokens, GPT-5.3 Codex is the most expensive flagship model by a significant margin. In a real-world Ruby on Rails benchmark, it achieved ~0.70 quality score at under $1/ticket — higher quality than Claude's ~0.61, but Claude did it at ~$5/ticket with deeper architectural reasoning. GPT-5.3 Codex optimizes for momentum, not elegance. And the reasoning capability is essentially unchanged from GPT-5.2 (identical GDPval score of 70.9%), so don't expect new breakthroughs in math or science.
Let's talk money. Here's what you'll actually pay to use each flagship model through their APIs.
| | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.3 Codex |
|---|---|---|---|
| Input (per 1M tokens) | $2.00 - $4.00 | $5.00 | $6.00 |
| Output (per 1M tokens) | $12.00 - $18.00 | $25.00 | $30.00 |
| Context window | 1M tokens | 200K (1M beta) | 400K tokens |
| Max output | 64K tokens | 128K tokens | 128K tokens |
| Batch discount | 50% off | 50% off | Available |
| Prompt caching | Yes (storage fees) | Yes (90% savings on hits) | Yes (24hr retention free) |
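To see what these list prices mean per request, here's a quick sketch using the table above. It takes the low end of Gemini's tiered pricing and models prompt caching as a flat 90% discount on cached input tokens, a simplification that glosses over the cache-write and storage fees providers actually charge.

```python
# Per-request cost sketch from the list prices in the table above
# (USD per 1M tokens). Gemini uses the low end of its tiered pricing;
# caching is simplified to a 90% discount on cache-hit input tokens,
# ignoring cache-write and storage fees.

PRICES = {  # (input, output) per 1M tokens
    "gemini-3.1-pro": (2.00, 12.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5.3-codex": (6.00, 30.00),
}

def request_cost(model: str, in_tok: int, out_tok: int,
                 cached_frac: float = 0.0) -> float:
    """Cost of one request, with an optional fraction of cached input."""
    in_price, out_price = PRICES[model]
    fresh = in_tok * (1 - cached_frac)
    cached = in_tok * cached_frac * 0.10  # 90% savings on cache hits
    return ((fresh + cached) * in_price + out_tok * out_price) / 1_000_000

# A 500K-token prompt (80% cached) with a 20K-token response:
for model in PRICES:
    print(f"{model}: ${request_cost(model, 500_000, 20_000, 0.8):.2f}")
```

Run over a realistic agentic workload, the gap compounds fast: output tokens dominate, which is why the $25 vs $30 output spread matters more than the input prices suggest.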
The budget analysis: if you're cost-sensitive, the mid-tier models tell a different story entirely.
| | Gemini 3 Flash | Claude Sonnet 4.6 | GPT-5 mini |
|---|---|---|---|
| Input | $0.50 - $1.00 | $3.00 | $0.25 |
| Output | $3.00 | $15.00 | $2.00 |
Gemini 3 Flash at $0.50/$3.00 with 90.4% on GPQA Diamond is arguably the best deal in AI right now. Claude Sonnet 4.6 is pricier but users preferred it over Opus 4.5 in 59% of comparisons — you're getting near-Opus intelligence at Sonnet pricing.
Beyond raw benchmarks, each company has shipped unique capabilities that fundamentally change what you can build.
The model dynamically decides when and how much to reason. No more guessing whether to enable "thinking mode." No more paying for deep reasoning on simple questions. This is the kind of UX innovation that seems obvious in retrospect but nobody else has shipped.
Every Gemini model — even the budget Flash-Lite — gets the full 1 million token context window. Anthropic and OpenAI gate their largest context windows behind beta programs or premium tiers. Google's approach is more democratic.
GPT-5.3 Codex isn't just a coding model anymore — it's a full knowledge-work agent. It handles engineering, ops, product planning, analysis, and communication in a single model. The interactive collaboration feature lets you steer it mid-task via the Codex app, and the Spark variant on Cerebras hardware delivers 1,000+ tokens per second for rapid iteration.
All three companies now offer computer use — models that can see your screen and interact with it through clicks and keystrokes. Claude Sonnet 4.6 achieved 94% on an insurance industry computer-use benchmark. GPT-5.3 Codex scored 64.7% on OSWorld-Verified (up 26.5 points from GPT-5.2's 38.2%), closing the gap dramatically. Google's implementation is available on Gemini 3 Flash (notably, not on 3.1 Pro, which trades computer use for stronger reasoning).
Google's Deep Research Agent (available via Gemini 3 models) autonomously conducts multi-step research: planning queries, reading sources, iterating on findings, and synthesizing results. It costs $2-5 per research task and can run for up to 60 minutes. Nothing from Anthropic or OpenAI directly competes with this as a built-in feature.

Here's my honest recommendation based on the use case:
"I'm building a coding agent or developer tool" → It depends. GPT-5.3 Codex leads Terminal-Bench 2.0 at 75.1% and optimizes for speed and momentum — ideal for rapid agentic iteration. Claude Opus 4.6 scores lower (69.9%) but leads on economically valuable work tasks and provides deeper architectural reasoning. Many developers now use a hybrid approach: Claude for planning and complex logic, GPT-5.3 Codex for rapid execution.
"I need to analyze scientific papers or solve math problems" → Gemini 3.1 Pro. The GPQA Diamond and ARC-AGI-2 scores speak for themselves. The 1M context window means you can feed it entire research corpora.
"I want a general-purpose AI assistant" → GPT-5.3 Codex in ChatGPT. The integrated tool ecosystem (search, code execution, image generation, file analysis) plus interactive steering mid-task is unmatched for power users.
"I'm on a budget but need real capability" → Gemini 3 Flash at $0.50/$3.00 or GPT-5 mini at $0.25/$2.00. Both deliver shocking performance for their price. Flash's 90.4% on GPQA Diamond at a fraction of Pro pricing is hard to argue with.
"I need computer use / RPA automation" → Claude Sonnet 4.6. 94% on insurance industry computer-use tasks. This isn't close.
"I need the absolute largest context window" → Gemini (any tier). 1M tokens, available everywhere, no beta program required.
What strikes me most about the current landscape isn't who's winning; it's that nobody can hold a lead for more than a few weeks. By my count there have been nine lead changes in six months, and three flagship models shipped in the span of two weeks in February alone. The competitive intensity is unprecedented.
The real winner in this race is developers. Prices are cratering — Anthropic's Opus tier dropped from $15/$75 to $5/$25, o3's pricing fell 80% in a single month — while capabilities accelerate. A year ago, 70% on SWE-Bench was state-of-the-art. Today, three different companies exceed 77%.
There is no best AI model in 2026. There is only the best model for your specific task, at your specific budget, at this specific moment in time. And that might change by the time you finish reading this post.
Model data sourced from official documentation (ai.google.dev, docs.anthropic.com, platform.openai.com), Artificial Analysis Intelligence Index, published benchmark reports, and tech press coverage as of February 20, 2026. Self-reported benchmarks may differ from independent evaluations. Pricing is subject to change.
