Anvat / Leaderboard · Week of 2026-06-06
AI cost-efficiency leaderboard.
Frontier LLMs ranked by intelligence-per-dollar. Each table divides a benchmark score by the Anvat effective blended token price ($/MTok, 1:3 input:output weighting). Higher value = more answer quality per dollar spent.
Updated every week from publisher pricing + system-card benchmark numbers. Source data: /benchmarks · /pricing.
What's on the site right now
- OpenAI-compatible
- Anthropic-compatible
- MCP-ready
- One API key
- 30% off list
- 2× credit on prepaid
- No card to start
- 5 languages
How to read this: The "Value" column is score ÷ Anvat $/MTok (blended). Numbers are comparable WITHIN a category, not across categories — SWE-Bench % and HLE % aren't the same scale. All prices already include Anvat's 30% off list rates.
Coding
Best $ per point — SWE-Bench Verified
| Rank | Model | Score | List $/MTok | Anvat $/MTok | Value (score/$) |
|---|---|---|---|---|---|
| #1 | DeepSeek V4 FlashDeepSeek | 62.4 | $0.473 | $0.331 | 189 |
| #2 | Gemini 3.5 FlashGoogle | 55.1 | $0.487 | $0.341 | 161 |
| #3 | GPT-5 MiniOpenAI | 52.1 | $1.56 | $1.09 | 47.6 |
| #4 | DeepSeek V4 ProDeepSeek | 80.6 | $3.04 | $2.13 | 37.8 |
| #5 | GPT-5.5OpenAI | 67.4 | $7.81 | $5.47 | 12.3 |
| #6 | Gemini 3.1 ProGoogle | 54.2 | $7.81 | $5.47 | 9.91 |
| #7 | Claude Sonnet 4.6Anthropic | 70.3 | $12.0 | $8.40 | 8.37 |
| #8 | Claude Opus 4.8Anthropic | 81.5 | $60.0 | $42.0 | 1.94 |
| #9 | Claude Opus 4.7Anthropic | 80.8 | $60.0 | $42.0 | 1.92 |
Agentic + tool use
Best $ per point — MCP Atlas
| Rank | Model | Score | List $/MTok | Anvat $/MTok | Value (score/$) |
|---|---|---|---|---|---|
| #1 | Gemini 3.5 FlashGoogle | 83.6 | $0.487 | $0.341 | 245 |
| #2 | DeepSeek V4 FlashDeepSeek | 56.6 | $0.473 | $0.331 | 171 |
| #3 | GPT-5 MiniOpenAI | 64.2 | $1.56 | $1.09 | 58.7 |
| #4 | DeepSeek V4 ProDeepSeek | 73.6 | $3.04 | $2.13 | 34.5 |
| #5 | Gemini 3.1 ProGoogle | 78.2 | $7.81 | $5.47 | 14.3 |
| #6 | GPT-5.5OpenAI | 75.3 | $7.81 | $5.47 | 13.8 |
| #7 | Claude Sonnet 4.6Anthropic | 69.5 | $12.0 | $8.40 | 8.27 |
| #8 | Claude Opus 4.8Anthropic | 79.4 | $60.0 | $42.0 | 1.89 |
| #9 | Claude Opus 4.7Anthropic | 79.1 | $60.0 | $42.0 | 1.88 |
Reasoning
Best $ per point — Humanity's Last Exam
| Rank | Model | Score | List $/MTok | Anvat $/MTok | Value (score/$) |
|---|---|---|---|---|---|
| #1 | Gemini 3.5 FlashGoogle | 40.2 | $0.487 | $0.341 | 118 |
| #2 | GPT-5 MiniOpenAI | 27.4 | $1.56 | $1.09 | 25.1 |
| #3 | DeepSeek V4 ProDeepSeek | 37.7 | $3.04 | $2.13 | 17.7 |
| #4 | Gemini 3.1 ProGoogle | 44.4 | $7.81 | $5.47 | 8.12 |
| #5 | GPT-5.5OpenAI | 41.4 | $7.81 | $5.47 | 7.57 |
| #6 | Claude Sonnet 4.6Anthropic | 33.2 | $12.0 | $8.40 | 3.95 |
| #7 | Claude Opus 4.8Anthropic | 47.2 | $60.0 | $42.0 | 1.12 |
| #8 | Claude Opus 4.7Anthropic | 46.9 | $60.0 | $42.0 | 1.12 |
Math
Best $ per point — HMMT 2026 Feb (math)
| Rank | Model | Score | List $/MTok | Anvat $/MTok | Value (score/$) |
|---|---|---|---|---|---|
| #1 | DeepSeek V4 ProDeepSeek | 95.2 | $3.04 | $2.13 | 44.7 |
| #2 | Claude Opus 4.8Anthropic | 96.5 | $60.0 | $42.0 | 2.30 |
| #3 | Claude Opus 4.7Anthropic | 96.2 | $60.0 | $42.0 | 2.29 |
Long context
Best $ per point — MRCR v2 (128K, 8-needle)
| Rank | Model | Score | List $/MTok | Anvat $/MTok | Value (score/$) |
|---|---|---|---|---|---|
| #1 | Gemini 3.5 FlashGoogle | 77.3 | $0.487 | $0.341 | 227 |
| #2 | GPT-5 MiniOpenAI | 88.4 | $1.56 | $1.09 | 80.8 |
| #3 | GPT-5.5OpenAI | 94.8 | $7.81 | $5.47 | 17.3 |
| #4 | Gemini 3.1 ProGoogle | 84.9 | $7.81 | $5.47 | 15.5 |
| #5 | Claude Sonnet 4.6Anthropic | 84.9 | $12.0 | $8.40 | 10.1 |
| #6 | Claude Opus 4.8Anthropic | 60.1 | $60.0 | $42.0 | 1.43 |
| #7 | Claude Opus 4.7Anthropic | 59.3 | $60.0 | $42.0 | 1.41 |
Multimodal
Best $ per point — CharXiv Reasoning
| Rank | Model | Score | List $/MTok | Anvat $/MTok | Value (score/$) |
|---|---|---|---|---|---|
| #1 | Gemini 3.5 FlashGoogle | 84.2 | $0.487 | $0.341 | 247 |
| #2 | GPT-5 MiniOpenAI | 76.2 | $1.56 | $1.09 | 69.7 |
| #3 | GPT-5.5OpenAI | 84.1 | $7.81 | $5.47 | 15.4 |
| #4 | Gemini 3.1 ProGoogle | 83.3 | $7.81 | $5.47 | 15.2 |
| #5 | Claude Sonnet 4.6Anthropic | 72.4 | $12.0 | $8.40 | 8.62 |
| #6 | Claude Opus 4.8Anthropic | 82.3 | $60.0 | $42.0 | 1.96 |
| #7 | Claude Opus 4.7Anthropic | 82.1 | $60.0 | $42.0 | 1.95 |
Pure price ranking
Cheapest models — ignoring quality
Don't pick a model from this table alone — pair it with the benchmark you actually care about above. Useful for budget-bounded workloads (classification, extraction, background processing).
| Rank | Model | List $/MTok | Anvat $/MTok | Δ vs Opus 4.8 |
|---|---|---|---|---|
| #1 | DeepSeek V4 FlashDeepSeek | $0.473 | $0.331 | 127× cheaper |
| #2 | Gemini 3.5 FlashGoogle | $0.487 | $0.341 | 123× cheaper |
| #3 | GPT-5 MiniOpenAI | $1.56 | $1.09 | 38× cheaper |
| #4 | DeepSeek V4 ProDeepSeek | $3.04 | $2.13 | 20× cheaper |
| #5 | GPT-5.5OpenAI | $7.81 | $5.47 | 8× cheaper |
| #6 | Gemini 3.1 ProGoogle | $7.81 | $5.47 | 8× cheaper |
| #7 | Claude Sonnet 4.6Anthropic | $12.0 | $8.40 | 5× cheaper |
| #8 | Claude Opus 4.8Anthropic | $60.0 | $42.0 | — |
| #9 | Claude Opus 4.7Anthropic | $60.0 | $42.0 | — |
Why this ranking moves every week
Frontier model pricing is genuinely volatile. New tiers ship (DeepSeek V4 Flash, Gemini 3.5 Flash, GPT-5 Mini), benchmark scores get re-run on cleaner harnesses, and providers cut prices when competition forces them to. The industry feed on the right tracks the underlying provider releases that drive this ranking.
The news digest covers the underlying provider releases that move this leaderboard.
Methodology
- • Blended price = (input + 3 × output) / 4, USD per million tokens. The 1:3 ratio mirrors a typical chat / coding workload.
- • Anvat effective price = list × 0.70 (30% off). Combined with 2× credit match on prepaid packs, realised cost is roughly half of provider-direct.
- • Value = benchmark score / Anvat blended $/MTok. A model that scores 80 at $1/MTok beats one that scores 90 at $10/MTok — usually.
- • Refresh cadence: weekly, or sooner when a provider ships a price change or a new model.
Caveats — read these
- • Benchmark scores are single-pass, best-effort, from publisher system cards. Production reliability varies.
- • Cheap-and-fast isn't the same as cheap-and-good. A 90% model that takes 1 retry effectively costs 2× its sticker.
- • Long-context and multimodal pricing is often quoted with different premiums — the blended ratio is conservative.
- • Prepaid-credit advantage isn't visible here. Real $/dollar drops further if you use Anvat's 2× match.
Run any of these models from one key.
Switch models with a single param change. No vendor lock-in, 30% off published list, 2× credit on prepaid packs.