
LLM benchmark comparison

Side-by-side scores for 9 frontier models across 13 standard benchmarks — coding, agentic, reasoning, math, long-context, and multimodal. Numbers verified June 2026 from publisher system cards.

Best score in each row highlighted in accent. Dash (—) means the publisher hasn't released a comparable number on that benchmark.

Reading this table

  • Coding: SWE-Bench Verified measures real PR fixes; LiveCodeBench is competitive programming; Terminal-Bench is agentic coding in a shell.
  • Agentic: MCP Atlas, Toolathlon, OSWorld, and GDPval-AA (Elo) all measure tool use and long-horizon work.
  • Reasoning: HLE (Humanity's Last Exam) is the hardest expert-level exam; ARC-AGI-2 tests novel abstraction.
  • Long context: MRCR at 128K is the practical ceiling for production workloads; MRCR at 1M applies only to models with native 1M-token context support.

Coding

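Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash
SWE-Bench Verified (% solved) | 81.5 | 80.8 | 70.3 | 67.4 | 52.1 | 55.1 | 54.2 | 80.6 | 62.4
LiveCodeBench Pass@1 (score); 8 of 9 models reported, in column order with the dash cell omitted: 89.4, 88.8, 80.2, 85.2, 71.4, 91.7, 93.5, 82.6
Terminal-Bench 2.1 (% solved) | 68.2 | 66.1 | 60.3 | 78.2 | 58.4 | 76.2 | 70.3 | 67.9 | 49.1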

SWE-Bench Verified: Real GitHub issues resolved as merged PRs. The standard coding-quality benchmark.

LiveCodeBench Pass@1: Competitive programming problem solving on a continuously refreshed problem set.

Terminal-Bench 2.1: Agentic coding in a real terminal with full filesystem + shell access.
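
For readers unfamiliar with the Pass@1 metric reported for LiveCodeBench, the sketch below shows the widely used unbiased pass@k estimator. Whether LiveCodeBench's own harness computes its score exactly this way is an assumption here, and the function names and sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over problems, as a percentage.

    results holds one (n_samples, n_correct) pair per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical run: three problems, ten samples each.
score = benchmark_pass_at_1([(10, 10), (10, 5), (10, 0)])
print(f"{score:.1f}")
```

With k = 1 the estimator reduces to the fraction of samples that pass, so a Pass@1 score is the average first-try success rate across problems.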

Agentic + tool use

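Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash
MCP Atlas (% solved) | 79.4 | 79.1 | 69.5 | 75.3 | 64.2 | 83.6 | 78.2 | 73.6 | 56.6
Toolathlon (% solved); 4 of 9 models reported, in column order with dash cells omitted: 55.6, 56.5, 49.4, 51.8
OSWorld-Verified (% solved); 7 of 9 models reported, in column order with dash cells omitted: 78.2, 78.0, 72.5, 78.7, 66.1, 78.4, 76.2
GDPval-AA (Elo); 7 of 9 models reported, in column order with dash cells omitted: 1781, 1753, 1676, 1769, 1502, 1656, 1314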

MCP Atlas: Multi-step tool-use workflows via Model Context Protocol.

Toolathlon: Long-horizon tool composition across heterogeneous tool sets.

OSWorld-Verified: Computer-use task completion — actual screen + keyboard control.

GDPval-AA (Elo): Economically valuable knowledge work — task-by-task pairwise Elo against human experts.
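
To give a feel for what a gap in Elo points means, here is a minimal sketch of the standard Elo expected-score formula. It assumes GDPval-AA uses the conventional 400-point logistic scale, which the table does not state; the function name is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise wins for A over B under the standard Elo model.

    The 400-point scale is an assumption, not something GDPval-AA confirms here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# The two highest ratings in the GDPval-AA row are 1781 and 1769.
close_gap = elo_expected_score(1781, 1769)
# The highest and lowest ratings in the row are 1781 and 1314.
wide_gap = elo_expected_score(1781, 1314)

print(f"12-point gap:  {close_gap:.3f}")
print(f"467-point gap: {wide_gap:.3f}")
```

Under that assumption a 12-point lead is close to a coin flip (about 0.52), while a 467-point lead corresponds to winning roughly 94% of pairwise comparisons.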

Reasoning

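Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash
Humanity's Last Exam (% correct); 8 of 9 models reported, in column order with the dash cell omitted: 47.2, 46.9, 33.2, 41.4, 27.4, 40.2, 44.4, 37.7
ARC-AGI-2 (% solved); 6 of 9 models reported, in column order with dash cells omitted: 76.1, 75.8, 58.3, 84.6, 72.1, 77.1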

Humanity's Last Exam: Hardest expert-level academic reasoning questions across all disciplines.

ARC-AGI-2: Novel abstract reasoning puzzles. Tests genuine generalisation, not memorisation.

Math

Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash
HMMT 2026 Feb (math) (Pass@1); 3 of 9 models reported, in column order with dash cells omitted: 96.5, 96.2, 95.2

HMMT 2026 Feb (math): Harvard-MIT Mathematics Tournament problems — competition math at the olympiad level.

Long context

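Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash
MRCR v2 (128K, 8-needle) (% recall); 7 of 9 models reported, in column order with dash cells omitted: 60.1, 59.3, 84.9, 94.8, 88.4, 77.3, 84.9
MRCR v2 (1M, 8-needle) (% recall); 4 of 9 models reported, in column order with dash cells omitted: 26.6, 26.3, 83.5, 78.7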

MRCR v2 (128K, 8-needle): Multi-needle long-context recall at 128K tokens, the usual reference point for production RAG context ceilings.

MRCR v2 (1M, 8-needle): Multi-needle long-context recall at 1M tokens. Only DeepSeek V4 and some Gemini variants compete here.

Multimodal

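Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash
CharXiv Reasoning (% correct); 7 of 9 models reported, in column order with dash cells omitted: 82.3, 82.1, 72.4, 84.1, 76.2, 84.2, 83.3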

CharXiv Reasoning: Scientific chart understanding — information synthesis from complex multi-panel figures.

Pick by use case, not by overall winner

No single model wins across every benchmark. Use this table to pick the right model for the workload, then run all of them through a single key with Anvat.

  • Hardest coding — Opus 4.8 (81.5% SWE-Bench) or DeepSeek V4 Pro (80.6% at 1/20th the cost)
  • Agentic + tool use — Gemini 3.5 Flash (MCP Atlas 83.6%) or GPT-5.5 (Toolathlon 55.6%)
  • Hardest reasoning — Opus 4.8 (HLE 47.2%) or GPT-5.5 (ARC-AGI-2 84.6%)
  • Long-context recall — GPT-5.5 (MRCR 128K 94.8%) or DeepSeek V4 (MRCR 1M 83.5%)
  • Multimodal — Gemini 3.5 Flash (CharXiv 84.2%) or GPT-5.5 (84.1%)
  • Cost-sensitive coding — DeepSeek V4 Flash or Gemini 3.5 Flash
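
Picking per workload can be sketched as a lookup over the table. The example below uses only the three rows that report all nine models (SWE-Bench Verified, Terminal-Bench 2.1, MCP Atlas), since the other rows contain unreleased scores; the variable and function names are hypothetical.

```python
MODELS = [
    "Claude Opus 4.8", "Claude Opus 4.7", "Claude Sonnet 4.6",
    "GPT-5.5", "GPT-5 Mini", "Gemini 3.5 Flash",
    "Gemini 3.1 Pro", "DeepSeek V4 Pro", "DeepSeek V4 Flash",
]

# Scores copied from the rows above that report every model, in column order.
ROWS = {
    "SWE-Bench Verified": [81.5, 80.8, 70.3, 67.4, 52.1, 55.1, 54.2, 80.6, 62.4],
    "Terminal-Bench 2.1": [68.2, 66.1, 60.3, 78.2, 58.4, 76.2, 70.3, 67.9, 49.1],
    "MCP Atlas": [79.4, 79.1, 69.5, 75.3, 64.2, 83.6, 78.2, 73.6, 56.6],
}


def top_models(benchmark: str, n: int = 2) -> list[tuple[str, float]]:
    """Return the n highest-scoring (model, score) pairs for one benchmark."""
    ranked = sorted(zip(MODELS, ROWS[benchmark]), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


for name in ROWS:
    print(name, top_models(name))
```

Each of the three rows has a different leader (Claude Opus 4.8, GPT-5.5, and Gemini 3.5 Flash respectively), which is the point of choosing by workload rather than by an overall winner.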

Methodology

  • All scores from publisher system cards or official blog posts (Anthropic, OpenAI, Google DeepMind, DeepSeek).
  • Best-mode numbers (max reasoning tokens, max compute budget) where the publisher reports multiple modes.
  • Cross-verified against Artificial Analysis where the benchmark is in their tracking set.
  • A dash marks a benchmark where the publisher hasn't released a comparable number; it is not a zero score.
  • Updated as new model versions ship.

What this table doesn't measure

  • Latency. Gemini 3.5 Flash is 4× faster than the other frontier models, which benchmark scores don't show.
  • Cost per task. See /cost-calculator: Opus 4.8 is ~21× more expensive than DeepSeek V4 per output token.
  • Ecosystem maturity. Tool-use frameworks (LangChain, MCP) are more mature for some providers than others, which affects production reliability.
  • Reliability at scale. Benchmark numbers are single-pass; production reliability also depends on retries, rate limits, and regional availability.
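
To show how cost and retries interact with a single-pass score, here is a rough sketch of expected cost per solved task. It assumes attempts are independent, every attempt uses the same number of output tokens, the SWE-Bench Verified success rates carry over to production, and the ~21× price ratio applies to DeepSeek V4 Pro; none of these is established by the table, and the function names are hypothetical.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean attempts until the first success, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def cost_per_solved_task(price_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one solved task, in the same unit as the price."""
    return price_per_attempt * expected_attempts(success_rate)


# Relative prices: DeepSeek V4 Pro = 1, Opus 4.8 = 21 (the ~21x figure above).
# Success rates are the SWE-Bench Verified scores: 81.5% and 80.6%.
opus_cost = cost_per_solved_task(21.0, 0.815)
deepseek_cost = cost_per_solved_task(1.0, 0.806)
ratio = opus_cost / deepseek_cost

print(f"Opus 4.8 costs about {ratio:.1f}x more per solved task")
```

Under these assumptions the higher success rate barely moves the ratio (about 20.8× instead of 21×), so on this benchmark price dominates the per-task cost.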