DeepSeek V4 Pro deep-dive: 1M context, 80.6% SWE-Bench, and the 1/30th cost claim (2026)

DeepSeek V4 launched April 24, 2026 — the fourth major release in the series that started commoditising frontier-tier AI when V3 shipped in December 2024. V4 isn't a quality leap; the benchmark numbers are at parity with frontier closed models on coding, behind on the hardest reasoning. The story is the economics + the agentic positioning. Here's the honest assessment.

The release at a glance

Two MoE checkpoints shipped, both at 1M-token context, both MIT-licensed:

Model	Total params	Active params	Context	License
DeepSeek V4 Pro	1.6T	49B	1M	MIT
DeepSeek V4 Flash	284B	13B	1M	MIT

Plus base versions (V4-Pro-Base, V4-Flash-Base) for fine-tuning, and "-Max" inference modes that enable extended reasoning tokens for higher benchmark scores.

The benchmark table that matters

V4-Pro-Max vs the frontier closed models, per the DeepSeek paper:

Benchmark	V4-Pro-Max	Opus 4.6	GPT-5.4 xHigh	Gemini 3.1 Pro
SWE-Bench Verified	80.6%	80.8%	—	80.6%
LiveCodeBench Pass@1	93.5	88.8	—	91.7
Codeforces Rating	3206	—	3168	3052
MCPAtlas Public	73.6	73.8	—	—
Terminal-Bench 2.0	67.9	—	75.1	68.5
Toolathlon	51.8	—	—	48.8
HMMT 2026 Feb (math)	95.2	96.2	—	—
HLE (Humanity's Last Exam)	37.7	40.0	—	—
MRCR 1M (long-context recall)	83.5	—	—	—

Where V4-Pro-Max leads:

LiveCodeBench Pass@1 (highest of any frontier model)
Codeforces Rating
Toolathlon

Where V4-Pro-Max is competitive:

SWE-Bench Verified (within 0.2 points of Opus 4.6)
MCPAtlas (within 0.2 points of Opus 4.6)
1M context retrieval (no closed peer at this length)

Where V4-Pro-Max trails:

Terminal-Bench 2.0 (GPT-5.4 leads by 7 points)
HMMT math (Opus 4.6 leads by 1 point)
HLE (Opus 4.6 leads by 2.3 points)

The honest read: DeepSeek V4 Pro is genuinely frontier-class on coding + agentic tasks, ~6-12 months behind on the hardest reasoning.

The cost angle (the actual story)

V4-Pro: $1.74 / $3.48 per MTok input/output.

For comparison:

Model	Input	Output	Output ratio to V4-Pro
DeepSeek V4-Flash	$0.14	$0.28	0.08×
DeepSeek V4-Pro	$1.74	$3.48	1×
Claude Sonnet 4.6	$3.00	$15.00	4.3×
Gemini 3.5 Flash	$0.30	$2.50	0.72×
GPT-5.5	$1.25	$10.00	2.9×
Claude Opus 4.7	$5.00	$25.00	7.2×

V4-Pro output is 7.2× cheaper than Opus 4.7 at SWE-Bench-equivalent quality on coding workloads. (Anthropic dropped the Opus tier from $15/$75 to $5/$25 starting with the 4.5 release, so this gap is materially smaller than it would have been at launch — but still very much in V4-Pro's favour for high-volume coding-agent work where output token count dominates the bill.)

When DeepSeek V4 actually wins

Be precise about the use case:

Winning case 1 — High-volume coding tasks where cost dominates

Pattern: bulk PR review, code-quality classification across thousands of repos, automated test generation at scale. V4-Pro at $3.48/MTok output vs Opus 4.7's $25 is a 7.2× cost compression. The SWE-Bench quality gap (0.2 points) is invisible at the per-task level.

Winning case 2 — Long-context analysis where 200K isn't enough

V4 supports 1M tokens natively without separate pricing tiers (Gemini 2.5 Pro doubles the price above 200K; Anthropic's Sonnet 4.6 has a separate 1M tier at 2× the base rate). For workloads that genuinely need 500K-1M context (full codebase analysis, large legal doc review), V4 is the cheapest path.

Winning case 3 — Open-weight requirement

Regulated environments, audit requirements, or "we need to be able to run this on-prem" mandates that exclude closed models. V4 is MIT- licensed. You can self-host it via vLLM/SGLang/TGI.

Winning case 4 — Pairing in a router

V4-Pro on coding-heavy first-pass, Opus 4.7 on the 10-15% of tasks where the model returns "uncertain" or downstream eval flags low confidence. Typical production saving: 50-70% vs pure-Opus.

When NOT to use V4

Don't use case 1 — Hardest reasoning

If your workload is dominated by math olympiad-style problems, novel proof construction, or PhD-level science reasoning, the 2-3 point HLE

HMMT gap matters. Stay on Opus 4.7 or wait for Gemini 3.5 Pro.

Don't use case 2 — Multimodal

V4 is text-only. No vision, no audio. If your workload needs vision (GPT-5-style multimodal or Gemini 3.5 Flash native multimodal), V4 doesn't compete.

Don't use case 3 — Tool ecosystem maturity matters more than cost

V4 is well-supported in LangChain, but the agent-framework ecosystem is still primarily built around OpenAI + Anthropic SDKs. If you're shipping fast and ecosystem maturity > 5× cost savings, stay with the closed-model SDKs.

The architecture story

The benchmarks are competitive; what's actually new in V4 is the inference-cost architecture:

Mixture-of-Experts with 49B active. Out of 1.6T total parameters, only 49B are activated per token. Inference cost (compute + memory) scales with active params, not total — that's how a 1.6T model serves at $1.74/MTok.
FP4 + FP8 mixed precision. V4-Pro ships in FP4 + FP8 mixed precision rather than the more common BF16. ~50% memory footprint reduction; minor accuracy cost amortised by the larger total parameter count.
Interleaved thinking. V4 uses an interleaved-thinking pattern for reasoning-heavy tasks rather than the separate <thinking> block of earlier reasoners. Lower overhead, faster TTFT, but the pattern is harder to integrate into agent frameworks expecting the classic separator.

The "expensive frontier" tier is now a smaller wedge. Opus 4.7 and GPT-5.5 keep their lead on the absolute hardest tasks. Most production AI workloads don't need that ceiling. V4-Pro covers the 80% of the value distribution at 1/20th the cost.
Open weights at frontier quality is a 2026 fact. Two years ago open-source meant "trail closed models by 12 months." V4 trails by ~6 months on hardest tasks, leads on coding. The gap is now small enough that "use the open model" is a defensible default for most workloads.
Pricing pressure on the closed labs increases. When V4-Pro delivers near-Opus quality at $3.48/MTok output, the Opus 4.7 $75/MTok price becomes harder to justify for any workload that isn't doing genuinely hardest-tier reasoning. Expect closed-model pricing to compress in the second half of 2026.

Run V4-Pro alongside Claude, GPT, and Gemini on one key

Anvat is OpenAI- and Anthropic-compatible. DeepSeek V4-Pro routes through the same /v1/chat/completions endpoint as every other model — no per-provider integration. 30% off list price.

Try free → →