LLM benchmark comparison
Side-by-side scores for 9 frontier models across 13 standard benchmarks — coding, agentic, reasoning, math, long-context, and multimodal. Numbers verified June 2026 from publisher system cards.
The best score in each row is highlighted in the accent color. A dash (—) means the publisher hasn't released a comparable number on that benchmark.
Reading this table
- Coding: SWE-Bench Verified measures fixes to real GitHub issues; LiveCodeBench is competitive programming; Terminal-Bench is agentic coding in a shell.
- Agentic: MCP Atlas, Toolathlon, OSWorld-Verified, and GDPval-AA (Elo) all measure tool use and long-horizon work.
- Reasoning: HLE (Humanity's Last Exam) is the hardest expert exam; ARC-AGI-2 tests novel abstraction.
- Long context: MRCR v2 at 128K is treated here as the practical ceiling for production workloads; MRCR v2 at 1M applies only to models with native 1M-token context support.
Coding
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|---|
| SWE-Bench Verified (% solved) | 81.5 | 80.8 | 70.3 | 67.4 | 52.1 | 55.1 | 54.2 | 80.6 | 62.4 |
| LiveCodeBench (Pass@1 score) | 89.4 | 88.8 | 80.2 | 85.2 | 71.4 | — | 91.7 | 93.5 | 82.6 |
| Terminal-Bench 2.1 (% solved) | 68.2 | 66.1 | 60.3 | 78.2 | 58.4 | 76.2 | 70.3 | 67.9 | 49.1 |
SWE-Bench Verified: Real GitHub issues, each originally resolved by a merged PR; the model's patch must pass the repository's tests. The standard coding-quality benchmark.
LiveCodeBench Pass@1: Competitive programming problem solving on a continuously refreshed problem set.
Terminal-Bench 2.1: Agentic coding in a real terminal with full filesystem + shell access.
Agentic + tool use
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|---|
| MCP Atlas (% solved) | 79.4 | 79.1 | 69.5 | 75.3 | 64.2 | 83.6 | 78.2 | 73.6 | 56.6 |
| Toolathlon (% solved) | — | — | — | 55.6 | — | 56.5 | 49.4 | 51.8 | — |
| OSWorld-Verified (% solved) | 78.2 | 78.0 | 72.5 | 78.7 | 66.1 | 78.4 | 76.2 | — | — |
| GDPval-AA (Elo) | 1781 | 1753 | 1676 | 1769 | 1502 | 1656 | 1314 | — | — |
MCP Atlas: Multi-step tool-use workflows via Model Context Protocol.
Toolathlon: Long-horizon tool composition across heterogeneous tool sets.
OSWorld-Verified: Computer-use task completion — actual screen + keyboard control.
GDPval-AA (Elo): Economically valuable knowledge work — task-by-task pairwise Elo against human experts.
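To get a feel for what the Elo gaps in this row mean, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the conventional Elo logistic curve with a 400-point scale; whether GDPval-AA uses exactly that scale is an assumption, and `expected_win_rate` is a name made up for this illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: GDPval-AA ratings follow the usual 400-point Elo scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the GDPval-AA row above.
opus_48, gpt_55, gemini_31_pro = 1781, 1769, 1314

print(round(expected_win_rate(opus_48, gpt_55), 3))         # small gap (12 points)
print(round(expected_win_rate(opus_48, gemini_31_pro), 3))  # large gap (467 points)
```

Under that assumption, the 12-point gap between Opus 4.8 and GPT-5.5 works out to roughly a 52% expected win rate, which is close to a coin flip, while the 467-point gap to Gemini 3.1 Pro is far more decisive.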
Reasoning
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|---|
| Humanity's Last Exam (% correct) | 47.2 | 46.9 | 33.2 | 41.4 | 27.4 | 40.2 | 44.4 | 37.7 | — |
| ARC-AGI-2 (% solved) | 76.1 | 75.8 | 58.3 | 84.6 | — | 72.1 | 77.1 | — | — |
Humanity's Last Exam: Hardest expert-level academic reasoning questions across all disciplines.
ARC-AGI-2: Novel abstract reasoning puzzles. Tests genuine generalisation, not memorisation.
Math
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|---|
| HMMT 2026 Feb (math), Pass@1 | 96.5 | 96.2 | — | — | — | — | — | 95.2 | — |
HMMT 2026 Feb (math): Harvard-MIT Mathematics Tournament problems — competition math at the olympiad level.
Long context
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|---|
| MRCR v2 (128K, 8-needle), % recall | 60.1 | 59.3 | 84.9 | 94.8 | 88.4 | 77.3 | 84.9 | — | — |
| MRCR v2 (1M, 8-needle), % recall | — | — | — | — | — | 26.6 | 26.3 | 83.5 | 78.7 |
MRCR v2 (128K, 8-needle): Multi-needle long-context recall at 128K tokens, used here as the reference point for production RAG limits.
MRCR v2 (1M, 8-needle): Multi-needle long-context recall at 1M tokens. Only DeepSeek V4 and some Gemini variants compete here.
Multimodal
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5.5 | GPT-5 Mini | Gemini 3.5 Flash | Gemini 3.1 Pro | DeepSeek V4 Pro | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|---|
| CharXiv Reasoning (% correct) | 82.3 | 82.1 | 72.4 | 84.1 | 76.2 | 84.2 | 83.3 | — | — |
CharXiv Reasoning: Scientific chart understanding — information synthesis from complex multi-panel figures.
Pick by use case, not by overall winner
No single model wins across every benchmark. Use this table to pick the right model for the workload, then run all of them through a single key with Anvat.
- Hardest coding: Opus 4.8 (81.5% SWE-Bench Verified) or DeepSeek V4 Pro (80.6% at roughly 1/20th the cost)
- Agentic + tool use: Gemini 3.5 Flash (MCP Atlas 83.6%, Toolathlon 56.5%) or GPT-5.5 (OSWorld-Verified 78.7%)
- Hardest reasoning: Opus 4.8 (HLE 47.2%) or GPT-5.5 (ARC-AGI-2 84.6%)
- Long-context recall: GPT-5.5 (MRCR 128K 94.8%) or DeepSeek V4 Pro (MRCR 1M 83.5%)
- Multimodal: Gemini 3.5 Flash (CharXiv 84.2%) or GPT-5.5 (84.1%)
- Cost-sensitive coding: DeepSeek V4 Flash or Gemini 3.5 Flash
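The per-workload picks above amount to "take the row for the benchmark you care about and find the highest published score". A minimal sketch of that lookup follows, using a small subset of the scores from the tables; the `SCORES` layout and the `best_model` helper are illustrative assumptions, not part of any Anvat API.

```python
from typing import Optional

# Subset of scores copied from the tables above.
# None stands for a dash (no published number), not a zero.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-Bench Verified": {
        "Claude Opus 4.8": 81.5, "GPT-5.5": 67.4,
        "Gemini 3.5 Flash": 55.1, "DeepSeek V4 Pro": 80.6,
    },
    "MCP Atlas": {
        "Claude Opus 4.8": 79.4, "GPT-5.5": 75.3,
        "Gemini 3.5 Flash": 83.6, "DeepSeek V4 Pro": 73.6,
    },
    "ARC-AGI-2": {
        "Claude Opus 4.8": 76.1, "GPT-5.5": 84.6,
        "Gemini 3.5 Flash": 72.1, "DeepSeek V4 Pro": None,
    },
    "MRCR v2 (1M, 8-needle)": {
        "Claude Opus 4.8": None, "GPT-5.5": None,
        "Gemini 3.5 Flash": 26.6, "DeepSeek V4 Pro": 83.5,
    },
}


def best_model(benchmark: str) -> tuple[str, float]:
    """Return the (model, score) pair with the highest published score."""
    published = {m: s for m, s in SCORES[benchmark].items() if s is not None}
    model = max(published, key=lambda m: published[m])
    return model, published[model]


for name in SCORES:
    print(name, "->", best_model(name))
```

Models without a published number are skipped rather than ranked last, so a missing score never counts against a model.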
Methodology
- All scores from publisher system cards or official blog posts (Anthropic, OpenAI, Google DeepMind, DeepSeek).
- Best-mode numbers (max reasoning tokens, max compute budget) where the publisher reports multiple modes.
- Cross-verified against Artificial Analysis where the benchmark is in their tracking set.
- Dash (—) where the publisher hasn't released a comparable number, not a zero score.
- Updated as new model versions ship; see /changelog for history.
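The "dash is not a zero" rule matters as soon as the table is processed programmatically. The sketch below parses one pipe-delimited row, maps dashes to `None`, and averages only the published values. The helper names `parse_row` and `mean_published` are hypothetical, and the sample row reuses the Toolathlon values from the table above.

```python
from typing import Optional


def parse_row(row: str) -> tuple[str, list[Optional[float]]]:
    """Split a pipe-delimited table row into a label and scores.

    A dash (or an empty cell) becomes None rather than 0.0.
    Rows with or without outer pipes are accepted.
    """
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    label, raw_scores = cells[0], cells[1:]
    scores = [None if cell in {"—", "-", ""} else float(cell) for cell in raw_scores]
    return label, scores


def mean_published(scores: list[Optional[float]]) -> Optional[float]:
    """Average only the scores that were actually published."""
    published = [s for s in scores if s is not None]
    return sum(published) / len(published) if published else None


row = "| Toolathlon | — | — | — | 55.6 | — | 56.5 | 49.4 | 51.8 | — |"
label, scores = parse_row(row)

print(label, scores)
print("mean of published scores:", mean_published(scores))
print("misleading zero-filled mean:", sum(s or 0.0 for s in scores) / len(scores))
```

Treating the five missing Toolathlon entries as zeros would drag the row average from the low 50s down to the low 20s, which is why the dash is kept distinct from a score.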
What this table doesn't measure
- Latency. Gemini 3.5 Flash is 4× faster than other frontier models; this is not visible in benchmark scores.
- Cost per task. See /cost-calculator. Opus 4.8 is ~21× more expensive than DeepSeek V4 per output token.
- Ecosystem maturity. Tool-use frameworks (LangChain, MCP) are more mature for some providers than others, which affects production reliability.
- Reliability at scale. Benchmark numbers are single-pass; production reliability also depends on retries, rate limits, and regional availability.
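The cost-per-task gap can be combined with the solve rates for a rough cost per solved task. The sketch below rests on several assumptions that the table does not confirm: both models use the same number of output tokens per attempt, failed attempts are retried independently until one succeeds (so the expected number of attempts is 1 / solve rate), and the ~21× figure applies to DeepSeek V4 Pro. The function name is illustrative.

```python
def relative_cost_per_solved_task(
    price_ratio: float, solve_rate_a: float, solve_rate_b: float
) -> float:
    """Cost per solved task of model A relative to model B.

    price_ratio:  A's price per attempt divided by B's (assumed equal token usage).
    solve_rate_*: single-pass solve rates in (0, 1].
    Assumes independent retries, so expected attempts = 1 / solve_rate.
    """
    expected_attempts_a = 1.0 / solve_rate_a
    expected_attempts_b = 1.0 / solve_rate_b
    return price_ratio * expected_attempts_a / expected_attempts_b


# Opus 4.8 vs DeepSeek V4 Pro on SWE-Bench Verified (81.5% vs 80.6%),
# with the ~21x per-output-token price gap quoted above.
ratio = relative_cost_per_solved_task(21.0, 0.815, 0.806)
print(f"{ratio:.1f}x")
```

Because the two solve rates are within a point of each other, the slightly higher solve rate barely offsets the price gap under these assumptions; real token usage per task may differ enough between models to change the picture.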