
LLM benchmark comparison

Side-by-side scores for 9 frontier models across 13 standard benchmarks covering coding, agentic work, reasoning, math, long context, and multimodal tasks. Scores were verified in June 2026 against publisher system cards and official blog posts.

The best score in each row is highlighted in the accent color. A dash (—) means the publisher has not released a comparable number for that benchmark.
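The dash convention matters when the tables are processed programmatically: a missing score should be skipped, not read as zero. Below is a minimal Python sketch, assuming rows are copied as tab-separated text in the layout used on this page and that higher is better in every row. The names `parse_row` and `best_in_row` are illustrative and not part of any Anvat API.

```python
from __future__ import annotations

MISSING = "—"  # the page's marker for "no comparable number published"


def parse_row(header: str, row: str) -> tuple[str, dict[str, float | None]]:
    """Split one tab-separated row into (label, {model: score or None})."""
    models = header.split("\t")[1:]  # drop the leading "Benchmark" cell
    label, *cells = row.split("\t")
    scores = {
        model: None if cell.strip() == MISSING else float(cell)
        for model, cell in zip(models, cells)
    }
    return label, scores


def best_in_row(scores: dict[str, float | None]) -> tuple[str, float]:
    """Return the (model, score) pair with the highest published score.

    Missing cells are ignored; raises ValueError if no score is published.
    """
    published = {m: s for m, s in scores.items() if s is not None}
    return max(published.items(), key=lambda item: item[1])


header = "\t".join([
    "Benchmark", "Claude Opus 4.8", "Claude Opus 4.7", "Claude Sonnet 4.6",
    "GPT-5.5", "GPT-5 Mini", "Gemini 3.5 Flash", "Gemini 3.1 Pro",
    "DeepSeek V4 Pro", "DeepSeek V4 Flash",
])
row = "\t".join([
    "Toolathlon % solved", "—", "—", "—", "55.6", "—", "56.5", "49.4", "51.8", "—",
])

label, scores = parse_row(header, row)
print(best_in_row(scores))  # ('Gemini 3.5 Flash', 56.5)
```

Treating the dash as `None` keeps a model with no published number from being ranked last by accident.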

Reading this table

• Coding: SWE-Bench Verified measures fixes to real GitHub issues; LiveCodeBench is competitive programming; Terminal-Bench is agentic coding in a shell.
• Agentic: MCP Atlas, Toolathlon, OSWorld-Verified, and GDPval-AA (Elo) all measure tool use and long-horizon work.
• Reasoning: Humanity's Last Exam (HLE) poses expert-level academic questions; ARC-AGI-2 tests novel abstraction.
• Long context: MRCR at 128K covers the context sizes most production workloads reach; MRCR at 1M applies only to models with native 1M-token context support.

Coding

Benchmark	Claude Opus 4.8	Claude Opus 4.7	Claude Sonnet 4.6	GPT-5.5	GPT-5 Mini	Gemini 3.5 Flash	Gemini 3.1 Pro	DeepSeek V4 Pro	DeepSeek V4 Flash
SWE-Bench Verified, % solved	81.5	80.8	70.3	67.4	52.1	55.1	54.2	80.6	62.4
LiveCodeBench Pass@1, score	89.4	88.8	80.2	85.2	71.4	—	91.7	93.5	82.6
Terminal-Bench 2.1, % solved	68.2	66.1	60.3	78.2	58.4	76.2	70.3	67.9	49.1

SWE-Bench Verified: Real GitHub issues drawn from merged PRs; a task counts as solved when the model's patch passes the repository's tests. The standard coding-quality benchmark.

LiveCodeBench Pass@1: Competitive programming problem solving on a continuously refreshed problem set.

Terminal-Bench 2.1: Agentic coding in a real terminal with full filesystem + shell access.

Agentic + tool use

Benchmark	Claude Opus 4.8	Claude Opus 4.7	Claude Sonnet 4.6	GPT-5.5	GPT-5 Mini	Gemini 3.5 Flash	Gemini 3.1 Pro	DeepSeek V4 Pro	DeepSeek V4 Flash
MCP Atlas, % solved	79.4	79.1	69.5	75.3	64.2	83.6	78.2	73.6	56.6
Toolathlon, % solved	—	—	—	55.6	—	56.5	49.4	51.8	—
OSWorld-Verified, % solved	78.2	78.0	72.5	78.7	66.1	78.4	76.2	—	—
GDPval-AA (Elo)	1781	1753	1676	1769	1502	1656	1314	—	—

MCP Atlas: Multi-step tool-use workflows via Model Context Protocol.

Toolathlon: Long-horizon tool composition across heterogeneous tool sets.

OSWorld-Verified: Computer-use task completion — actual screen + keyboard control.

GDPval-AA (Elo): Economically valuable knowledge work — task-by-task pairwise Elo against human experts.
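To read an Elo gap as a head-to-head expectation, the conventional Elo formula can be applied. This is a sketch that assumes the standard logistic curve with a 400-point scale; the page does not state which scale or fitting procedure GDPval-AA uses, so the output is illustrative only. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo model.

    The 400-point scale is an assumption, not a documented GDPval-AA setting.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Claude Opus 4.8 (1781) against GPT-5.5 (1769), ratings taken from the table above.
opus_vs_gpt = elo_expected_score(1781, 1769)
print(round(opus_vs_gpt, 3))  # 0.517
```

Under these assumptions a 12-point gap is close to a coin flip, so small Elo differences in the table should not be over-read.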

Reasoning

Benchmark	Claude Opus 4.8	Claude Opus 4.7	Claude Sonnet 4.6	GPT-5.5	GPT-5 Mini	Gemini 3.5 Flash	Gemini 3.1 Pro	DeepSeek V4 Pro	DeepSeek V4 Flash
Humanity's Last Exam, % correct	47.2	46.9	33.2	41.4	27.4	40.2	44.4	37.7	—
ARC-AGI-2, % solved	76.1	75.8	58.3	84.6	—	72.1	77.1	—	—

Humanity's Last Exam: Hardest expert-level academic reasoning questions across all disciplines.

ARC-AGI-2: Novel abstract reasoning puzzles. Tests genuine generalisation, not memorisation.

Math

Benchmark	Claude Opus 4.8	Claude Opus 4.7	Claude Sonnet 4.6	GPT-5.5	GPT-5 Mini	Gemini 3.5 Flash	Gemini 3.1 Pro	DeepSeek V4 Pro	DeepSeek V4 Flash
HMMT 2026 Feb (math), Pass@1	96.5	96.2	—	—	—	—	—	95.2	—

HMMT 2026 Feb (math): Harvard-MIT Mathematics Tournament problems — competition math at the olympiad level.

Long context

Benchmark	Claude Opus 4.8	Claude Opus 4.7	Claude Sonnet 4.6	GPT-5.5	GPT-5 Mini	Gemini 3.5 Flash	Gemini 3.1 Pro	DeepSeek V4 Pro	DeepSeek V4 Flash
MRCR v2 (128K, 8-needle), % recall	60.1	59.3	84.9	94.8	88.4	77.3	84.9	—	—
MRCR v2 (1M, 8-needle), % recall	—	—	—	—	—	26.6	26.3	83.5	78.7

MRCR v2 (128K, 8-needle): Multi-needle long-context recall at 128K tokens, the context size most relevant to production RAG workloads.

MRCR v2 (1M, 8-needle): Multi-needle long-context recall at 1M tokens. Only DeepSeek V4 and some Gemini variants compete here.

Multimodal

Benchmark	Claude Opus 4.8	Claude Opus 4.7	Claude Sonnet 4.6	GPT-5.5	GPT-5 Mini	Gemini 3.5 Flash	Gemini 3.1 Pro	DeepSeek V4 Pro	DeepSeek V4 Flash
CharXiv Reasoning, % correct	82.3	82.1	72.4	84.1	76.2	84.2	83.3	—	—

CharXiv Reasoning: Scientific chart understanding — information synthesis from complex multi-panel figures.

Pick by use case, not by overall winner

No single model wins across every benchmark. Use this table to pick the right model for the workload, then run all of them through a single key with Anvat.

• Hardest coding: Opus 4.8 (SWE-Bench Verified 81.5%) or DeepSeek V4 Pro (80.6% at roughly 1/20th the cost)
• Agentic + tool use: Gemini 3.5 Flash (MCP Atlas 83.6%) or GPT-5.5 (OSWorld-Verified 78.7%; Toolathlon 55.6%, second to Gemini 3.5 Flash's 56.5%)
• Hardest reasoning: Opus 4.8 (HLE 47.2%) or GPT-5.5 (ARC-AGI-2 84.6%)
• Long-context recall: GPT-5.5 (MRCR 128K 94.8%) or DeepSeek V4 Pro (MRCR 1M 83.5%)
• Multimodal: Gemini 3.5 Flash (CharXiv 84.2%) or GPT-5.5 (84.1%)
• Cost-sensitive coding: DeepSeek V4 Flash or Gemini 3.5 Flash
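The selection logic in the list above can be sketched as a lookup: choose the benchmark that stands in for the workload, then rank only the models you are willing to pay for. The scores below are copied from the tables on this page for a subset of models. The workload-to-benchmark mapping and the names `WORKLOAD_BENCHMARK` and `rank_models` are assumptions made for illustration.

```python
from __future__ import annotations

# Scores copied from the tables on this page; None stands for a dash.
SCORES: dict[str, dict[str, float | None]] = {
    "SWE-Bench Verified": {
        "Claude Opus 4.8": 81.5, "GPT-5.5": 67.4, "Gemini 3.5 Flash": 55.1,
        "DeepSeek V4 Pro": 80.6, "DeepSeek V4 Flash": 62.4,
    },
    "MCP Atlas": {
        "Claude Opus 4.8": 79.4, "GPT-5.5": 75.3, "Gemini 3.5 Flash": 83.6,
        "DeepSeek V4 Pro": 73.6, "DeepSeek V4 Flash": 56.6,
    },
    "MRCR v2 (128K, 8-needle)": {
        "Claude Opus 4.8": 60.1, "GPT-5.5": 94.8, "Gemini 3.5 Flash": 77.3,
        "DeepSeek V4 Pro": None, "DeepSeek V4 Flash": None,
    },
}

# Hypothetical mapping from a workload label to the benchmark used as its proxy.
WORKLOAD_BENCHMARK = {
    "coding": "SWE-Bench Verified",
    "tool use": "MCP Atlas",
    "long-context recall": "MRCR v2 (128K, 8-needle)",
}


def rank_models(workload: str, candidates: set[str] | None = None) -> list[tuple[str, float]]:
    """Rank models that have a published score for the workload, best first."""
    row = SCORES[WORKLOAD_BENCHMARK[workload]]
    ranked = [
        (model, score)
        for model, score in row.items()
        if score is not None and (candidates is None or model in candidates)
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


budget_models = {"DeepSeek V4 Flash", "Gemini 3.5 Flash"}
print(rank_models("coding", budget_models))
# [('DeepSeek V4 Flash', 62.4), ('Gemini 3.5 Flash', 55.1)]
print(rank_models("long-context recall")[0])  # ('GPT-5.5', 94.8)
```

Restricting the candidate set first and ranking second mirrors the cost-sensitive bullet: the best affordable model is rarely the overall row winner.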


Methodology

• All scores from publisher system cards or official blog posts (Anthropic, OpenAI, Google DeepMind, DeepSeek).
• Best-mode numbers (max reasoning tokens, max compute budget) where the publisher reports multiple modes.
• Cross-verified against Artificial Analysis where the benchmark is in their tracking set.
• Dash (—) where the publisher hasn't released a comparable number, not a zero score.
• Updated as new model versions ship.

What this table doesn't measure

• Latency. Gemini 3.5 Flash is 4× faster than the other frontier models, which benchmark scores do not show.
• Cost per task. See /cost-calculator: Opus 4.8 is ~21× more expensive than DeepSeek V4 per output token.
• Ecosystem maturity. Tool-use tooling (LangChain, MCP) is more mature for some providers than others, which affects production reliability.
• Reliability at scale. Benchmark numbers are single-pass; production reliability also depends on retries, rate limits, and regional availability.
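A rough way to combine the cost and retry points is expected cost per solved task: if each attempt succeeds independently with probability p, the expected number of attempts until the first success is 1/p. The sketch below makes several assumptions that the page does not confirm: attempts are independent, both models use the same number of output tokens per attempt, the SWE-Bench Verified rate approximates the per-attempt solve rate, and the ~21× price ratio applies to DeepSeek V4 Pro. The function name is hypothetical.

```python
def expected_cost_per_solved_task(cost_per_attempt: float, solve_rate: float) -> float:
    """Expected spend until the first success when retrying independent attempts."""
    if not 0.0 < solve_rate <= 1.0:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_attempt / solve_rate


# Relative cost units: the cheaper model costs 1.0 per attempt, the pricier one ~21.
opus = expected_cost_per_solved_task(21.0, 0.815)
deepseek = expected_cost_per_solved_task(1.0, 0.806)
print(round(opus / deepseek, 1))  # 20.8
```

Under these assumptions the small gap in solve rate barely changes the price ratio, which is why per-task cost needs to be measured on your own workload instead of inferred from benchmark scores.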