Anvat / Leaderboard · Week of 2026-06-06

AI cost-efficiency leaderboard.

Frontier LLMs ranked by intelligence-per-dollar. Each table divides a benchmark score by the Anvat effective blended token price ($/MTok, 1:3 input:output weighting). Higher value = more answer quality per dollar spent.

Updated every week from publisher pricing + system-card benchmark numbers. Source data: /benchmarks · /pricing.

What's on the site right now

How to read this: The "Value" column is score ÷ Anvat $/MTok (blended). Numbers are comparable WITHIN a category, not across categories: SWE-Bench % and HLE % aren't on the same scale. The Anvat $/MTok column already includes Anvat's 30% discount off list rates; the List $/MTok column does not.

Coding

Best $ per point — SWE-Bench Verified

Rank | Model             | Provider  | Score | List $/MTok | Anvat $/MTok | Value (score/$)
#1   | DeepSeek V4 Flash | DeepSeek  | 62.4  | $0.473      | $0.331       | 189
#2   | Gemini 3.5 Flash  | Google    | 55.1  | $0.487      | $0.341       | 161
#3   | GPT-5 Mini        | OpenAI    | 52.1  | $1.56       | $1.09        | 47.6
#4   | DeepSeek V4 Pro   | DeepSeek  | 80.6  | $3.04       | $2.13        | 37.8
#5   | GPT-5.5           | OpenAI    | 67.4  | $7.81       | $5.47        | 12.3
#6   | Gemini 3.1 Pro    | Google    | 54.2  | $7.81       | $5.47        | 9.91
#7   | Claude Sonnet 4.6 | Anthropic | 70.3  | $12.0       | $8.40        | 8.37
#8   | Claude Opus 4.8   | Anthropic | 81.5  | $60.0       | $42.0        | 1.94
#9   | Claude Opus 4.7   | Anthropic | 80.8  | $60.0       | $42.0        | 1.92

Agentic + tool use

Best $ per point — MCP Atlas

Rank | Model             | Provider  | Score | List $/MTok | Anvat $/MTok | Value (score/$)
#1   | Gemini 3.5 Flash  | Google    | 83.6  | $0.487      | $0.341       | 245
#2   | DeepSeek V4 Flash | DeepSeek  | 56.6  | $0.473      | $0.331       | 171
#3   | GPT-5 Mini        | OpenAI    | 64.2  | $1.56       | $1.09        | 58.7
#4   | DeepSeek V4 Pro   | DeepSeek  | 73.6  | $3.04       | $2.13        | 34.5
#5   | Gemini 3.1 Pro    | Google    | 78.2  | $7.81       | $5.47        | 14.3
#6   | GPT-5.5           | OpenAI    | 75.3  | $7.81       | $5.47        | 13.8
#7   | Claude Sonnet 4.6 | Anthropic | 69.5  | $12.0       | $8.40        | 8.27
#8   | Claude Opus 4.8   | Anthropic | 79.4  | $60.0       | $42.0        | 1.89
#9   | Claude Opus 4.7   | Anthropic | 79.1  | $60.0       | $42.0        | 1.88

Reasoning

Best $ per point — Humanity's Last Exam

Rank | Model             | Provider  | Score | List $/MTok | Anvat $/MTok | Value (score/$)
#1   | Gemini 3.5 Flash  | Google    | 40.2  | $0.487      | $0.341       | 118
#2   | GPT-5 Mini        | OpenAI    | 27.4  | $1.56       | $1.09        | 25.1
#3   | DeepSeek V4 Pro   | DeepSeek  | 37.7  | $3.04       | $2.13        | 17.7
#4   | Gemini 3.1 Pro    | Google    | 44.4  | $7.81       | $5.47        | 8.12
#5   | GPT-5.5           | OpenAI    | 41.4  | $7.81       | $5.47        | 7.57
#6   | Claude Sonnet 4.6 | Anthropic | 33.2  | $12.0       | $8.40        | 3.95
#7   | Claude Opus 4.8   | Anthropic | 47.2  | $60.0       | $42.0        | 1.12
#8   | Claude Opus 4.7   | Anthropic | 46.9  | $60.0       | $42.0        | 1.12

Math

Best $ per point — HMMT 2026 Feb (math)

Rank | Model             | Provider  | Score | List $/MTok | Anvat $/MTok | Value (score/$)
#1   | DeepSeek V4 Pro   | DeepSeek  | 95.2  | $3.04       | $2.13        | 44.7
#2   | Claude Opus 4.8   | Anthropic | 96.5  | $60.0       | $42.0        | 2.30
#3   | Claude Opus 4.7   | Anthropic | 96.2  | $60.0       | $42.0        | 2.29
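
As a rough illustration of how the Value column and the rank order fall out of the numbers, here is a minimal Python sketch using the three Math rows. The `rows` list and the `rank_by_value` helper are hypothetical names for this example, not part of any Anvat API, and the inputs are the rounded prices shown in the table.

```python
# Rows copied from the Math table: (model, score, Anvat blended $/MTok).
rows = [
    ("Claude Opus 4.8", 96.5, 42.0),
    ("DeepSeek V4 Pro", 95.2, 2.13),
    ("Claude Opus 4.7", 96.2, 42.0),
]


def rank_by_value(rows):
    """Return (model, value) pairs sorted by score per dollar, best first."""
    scored = [(model, score / price) for model, score, price in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for rank, (model, value) in enumerate(rank_by_value(rows), start=1):
    print(f"#{rank} {model}: {value:.2f}")
```

The model with the highest raw score (Claude Opus 4.8) lands second here because its blended price is about 20× that of DeepSeek V4 Pro.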

Long context

Best $ per point — MRCR v2 (128K, 8-needle)

Rank | Model             | Provider  | Score | List $/MTok | Anvat $/MTok | Value (score/$)
#1   | Gemini 3.5 Flash  | Google    | 77.3  | $0.487      | $0.341       | 227
#2   | GPT-5 Mini        | OpenAI    | 88.4  | $1.56       | $1.09        | 80.8
#3   | GPT-5.5           | OpenAI    | 94.8  | $7.81       | $5.47        | 17.3
#4   | Gemini 3.1 Pro    | Google    | 84.9  | $7.81       | $5.47        | 15.5
#5   | Claude Sonnet 4.6 | Anthropic | 84.9  | $12.0       | $8.40        | 10.1
#6   | Claude Opus 4.8   | Anthropic | 60.1  | $60.0       | $42.0        | 1.43
#7   | Claude Opus 4.7   | Anthropic | 59.3  | $60.0       | $42.0        | 1.41

Multimodal

Best $ per point — CharXiv Reasoning

Rank | Model             | Provider  | Score | List $/MTok | Anvat $/MTok | Value (score/$)
#1   | Gemini 3.5 Flash  | Google    | 84.2  | $0.487      | $0.341       | 247
#2   | GPT-5 Mini        | OpenAI    | 76.2  | $1.56       | $1.09        | 69.7
#3   | GPT-5.5           | OpenAI    | 84.1  | $7.81       | $5.47        | 15.4
#4   | Gemini 3.1 Pro    | Google    | 83.3  | $7.81       | $5.47        | 15.2
#5   | Claude Sonnet 4.6 | Anthropic | 72.4  | $12.0       | $8.40        | 8.62
#6   | Claude Opus 4.8   | Anthropic | 82.3  | $60.0       | $42.0        | 1.96
#7   | Claude Opus 4.7   | Anthropic | 82.1  | $60.0       | $42.0        | 1.95

Pure price ranking

Cheapest models — ignoring quality

Don't pick a model from this table alone — pair it with the benchmark you actually care about above. Useful for budget-bounded workloads (classification, extraction, background processing).

Rank | Model             | Provider  | List $/MTok | Anvat $/MTok | Δ vs Opus 4.8
#1   | DeepSeek V4 Flash | DeepSeek  | $0.473      | $0.331       | 127× cheaper
#2   | Gemini 3.5 Flash  | Google    | $0.487      | $0.341       | 123× cheaper
#3   | GPT-5 Mini        | OpenAI    | $1.56       | $1.09        | 38× cheaper
#4   | DeepSeek V4 Pro   | DeepSeek  | $3.04       | $2.13        | 20× cheaper
#5   | GPT-5.5           | OpenAI    | $7.81       | $5.47        | 8× cheaper
#6   | Gemini 3.1 Pro    | Google    | $7.81       | $5.47        | 8× cheaper
#7   | Claude Sonnet 4.6 | Anthropic | $12.0       | $8.40        | 5× cheaper
#8   | Claude Opus 4.8   | Anthropic | $60.0       | $42.0        | -
#9   | Claude Opus 4.7   | Anthropic | $60.0       | $42.0        | -
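
The "Δ vs Opus 4.8" figures appear to be the Opus 4.8 price divided by each model's price, rounded to a whole multiple. That rule is inferred from the table rather than stated on the page, so treat the sketch below as an assumption. `LIST_PRICE` and `times_cheaper` are hypothetical names. List prices are used because the 30% discount scales every price equally and leaves the ratio unchanged.

```python
# List blended prices ($/MTok) copied from the price ranking table.
LIST_PRICE = {
    "DeepSeek V4 Flash": 0.473,
    "Gemini 3.5 Flash": 0.487,
    "GPT-5 Mini": 1.56,
    "DeepSeek V4 Pro": 3.04,
    "GPT-5.5": 7.81,
    "Gemini 3.1 Pro": 7.81,
    "Claude Sonnet 4.6": 12.0,
    "Claude Opus 4.8": 60.0,
}


def times_cheaper(model, baseline="Claude Opus 4.8"):
    """Baseline price divided by the model's price, rounded to a whole multiple.

    Assumed rule: inferred from the published column, not documented.
    """
    return round(LIST_PRICE[baseline] / LIST_PRICE[model])


for model in LIST_PRICE:
    if model != "Claude Opus 4.8":
        print(f"{model}: {times_cheaper(model)}x cheaper")
```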

Why this ranking moves every week

Frontier model pricing is genuinely volatile. New tiers ship (DeepSeek V4 Flash, Gemini 3.5 Flash, GPT-5 Mini), benchmark scores get re-run on cleaner harnesses, and providers cut prices when competition forces them to.

The news digest covers the underlying provider releases that move this leaderboard.

Methodology

  • Blended price = (input + 3 × output) / 4, USD per million tokens. The 1:3 ratio mirrors a typical chat / coding workload.
  • Anvat effective price = list × 0.70 (30% off). The 2× credit match on prepaid packs is not included in the tables; if the match applies to the full spend, realised cost is roughly 0.35 × list, well under half of provider-direct.
  • Value = benchmark score / Anvat blended $/MTok. A model that scores 80 at $1/MTok beats one that scores 90 at $10/MTok — usually.
  • Refresh cadence: weekly, or sooner when a provider ships a price change or a new model.
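
The three formulas above can be sketched in a few lines of Python. The function names are hypothetical, and the per-direction prices in the example ($3 input, $15 output) are illustrative values chosen only because their blend reproduces the $12.0 list price shown for Claude Sonnet 4.6; the page itself does not publish the input/output split.

```python
ANVAT_DISCOUNT_FACTOR = 0.70  # list × 0.70, i.e. 30% off


def blended_price(input_price, output_price):
    """Blended $/MTok with 1:3 input:output weighting."""
    return (input_price + 3 * output_price) / 4


def anvat_price(list_blended):
    """Anvat effective price: list blended price with the 30% discount applied."""
    return list_blended * ANVAT_DISCOUNT_FACTOR


def value(score, anvat_blended):
    """Benchmark score per Anvat blended dollar."""
    return score / anvat_blended


# Illustrative inputs (assumed, not from the page): $3 input, $15 output per MTok.
list_blended = blended_price(input_price=3.0, output_price=15.0)
effective = anvat_price(list_blended)
swe_bench_value = value(70.3, effective)  # 70.3 = Sonnet 4.6 SWE-Bench Verified score
print(f"list ${list_blended:.2f}, Anvat ${effective:.2f}, value {swe_bench_value:.2f}")
# prints: list $12.00, Anvat $8.40, value 8.37
```

Small differences between a hand calculation and the published Value column are expected wherever the tables show prices rounded to three significant figures.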

Caveats — read these

  • Benchmark scores are single-pass, best-effort, from publisher system cards. Production reliability varies.
  • Cheap-and-fast isn't the same as cheap-and-good. A 90% model that takes 1 retry effectively costs 2× its sticker.
  • Long-context and multimodal pricing is often quoted with different premiums; the blended ratio is conservative.
  • Prepaid-credit advantage isn't visible here. Real $/MTok drops further if you use Anvat's 2× match.
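
To make the retry and prepaid-credit caveats concrete, here is a minimal sketch of how the sticker price could be adjusted. All three helpers are hypothetical. The expected-cost formula assumes independent attempts of equal size, and the credit helper assumes the 2× match doubles usable credit on the full spend; neither assumption is confirmed by the page.

```python
def cost_with_retries(price_per_mtok, retries):
    """Sticker price scaled by the number of attempts (1 original + retries)."""
    return price_per_mtok * (1 + retries)


def expected_cost(price_per_mtok, success_rate):
    """Expected cost when failed calls are retried until one succeeds.

    Assumes independent attempts of equal size, so the expected number of
    attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_mtok / success_rate


def price_after_credit_match(anvat_price_per_mtok, credit_multiplier=2.0):
    """Realised $/MTok if each prepaid dollar buys `credit_multiplier` dollars of usage."""
    return anvat_price_per_mtok / credit_multiplier


# One retry on GPT-5 Mini's $1.09 Anvat price doubles the effective cost.
print(f"{cost_with_retries(1.09, retries=1):.2f}")
# A 90% success rate, averaged over many calls, adds about 11%.
print(f"{expected_cost(1.09, success_rate=0.9):.2f}")
# Sonnet 4.6 at $8.40 with the assumed full 2x match.
print(f"{price_after_credit_match(8.40):.2f}")
```

Dividing a benchmark score by one of these adjusted prices instead of the sticker price gives a Value figure closer to what a production workload would see.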

Run any of these models from one key.

Switch models with a single param change. No vendor lock-in, 30% off published list, 2× credit on prepaid packs.