deepseekopen-sourcemodelscomparison

DeepSeek V4 Pro deep-dive: 1M context, 80.6% SWE-Bench, and the 1/30th cost claim (2026)

DeepSeek V4 Pro hits 80.6% SWE-Bench Verified (within 0.2 points of Claude Opus 4.6) and 93.5 LiveCodeBench (highest of any model) at $1.74/$3.48 per MTok. The full benchmark + when to actually use it.

Anvat team7 min read

DeepSeek V4 launched April 24, 2026 — the fourth major release in the series that started commoditising frontier-tier AI when V3 shipped in December 2024. V4 isn't a quality leap; the benchmark numbers are at parity with frontier closed models on coding, behind on the hardest reasoning. The story is the economics + the agentic positioning. Here's the honest assessment.

The release at a glance

Two MoE checkpoints shipped, both at 1M-token context, both MIT-licensed:

ModelTotal paramsActive paramsContextLicense
DeepSeek V4 Pro1.6T49B1MMIT
DeepSeek V4 Flash284B13B1MMIT

Plus base versions (V4-Pro-Base, V4-Flash-Base) for fine-tuning, and "-Max" inference modes that enable extended reasoning tokens for higher benchmark scores.

The benchmark table that matters

V4-Pro-Max vs the frontier closed models, per the DeepSeek paper:

BenchmarkV4-Pro-MaxOpus 4.6GPT-5.4 xHighGemini 3.1 Pro
SWE-Bench Verified80.6%80.8%80.6%
LiveCodeBench Pass@193.588.891.7
Codeforces Rating320631683052
MCPAtlas Public73.673.8
Terminal-Bench 2.067.975.168.5
Toolathlon51.848.8
HMMT 2026 Feb (math)95.296.2
HLE (Humanity's Last Exam)37.740.0
MRCR 1M (long-context recall)83.5

Where V4-Pro-Max leads:

  • LiveCodeBench Pass@1 (highest of any frontier model)
  • Codeforces Rating
  • Toolathlon

Where V4-Pro-Max is competitive:

  • SWE-Bench Verified (within 0.2 points of Opus 4.6)
  • MCPAtlas (within 0.2 points of Opus 4.6)
  • 1M context retrieval (no closed peer at this length)

Where V4-Pro-Max trails:

  • Terminal-Bench 2.0 (GPT-5.4 leads by 7 points)
  • HMMT math (Opus 4.6 leads by 1 point)
  • HLE (Opus 4.6 leads by 2.3 points)

The honest read: DeepSeek V4 Pro is genuinely frontier-class on coding + agentic tasks, ~6-12 months behind on the hardest reasoning.

The cost angle (the actual story)

V4-Pro: $1.74 / $3.48 per MTok input/output.

For comparison:

ModelInputOutputOutput ratio to V4-Pro
DeepSeek V4-Flash$0.14$0.280.08×
DeepSeek V4-Pro$1.74$3.48
Claude Sonnet 4.6$3.00$15.004.3×
Gemini 3.5 Flash$0.30$2.500.72×
GPT-5.5$1.25$10.002.9×
Claude Opus 4.7$15.00$75.0021.6×

V4-Pro output is 21.6× cheaper than Opus 4.7 at SWE-Bench-equivalent quality on coding workloads. If you're doing high-volume coding-agent work where output token count dominates the bill, V4-Pro changes the economics enough to potentially restructure your stack.

When DeepSeek V4 actually wins

Be precise about the use case:

Winning case 1 — High-volume coding tasks where cost dominates

Pattern: bulk PR review, code-quality classification across thousands of repos, automated test generation at scale. V4-Pro at $3.48/MTok output vs Opus 4.7's $75 is a 21.6× cost compression. The SWE-Bench quality gap (0.2 points) is invisible at the per-task level.

Winning case 2 — Long-context analysis where 200K isn't enough

V4 supports 1M tokens natively without separate pricing tiers (Gemini 2.5 Pro doubles the price above 200K; Anthropic's Sonnet 4.6 has a separate 1M tier at 2× the base rate). For workloads that genuinely need 500K-1M context (full codebase analysis, large legal doc review), V4 is the cheapest path.

Winning case 3 — Open-weight requirement

Regulated environments, audit requirements, or "we need to be able to run this on-prem" mandates that exclude closed models. V4 is MIT- licensed. You can self-host it via vLLM/SGLang/TGI.

Winning case 4 — Pairing in a router

V4-Pro on coding-heavy first-pass, Opus 4.7 on the 10-15% of tasks where the model returns "uncertain" or downstream eval flags low confidence. Typical production saving: 50-70% vs pure-Opus.

When NOT to use V4

Don't use case 1 — Hardest reasoning

If your workload is dominated by math olympiad-style problems, novel proof construction, or PhD-level science reasoning, the 2-3 point HLE

  • HMMT gap matters. Stay on Opus 4.7 or wait for Gemini 3.5 Pro.

Don't use case 2 — Multimodal

V4 is text-only. No vision, no audio. If your workload needs vision (GPT-5-style multimodal or Gemini 3.5 Flash native multimodal), V4 doesn't compete.

Don't use case 3 — Tool ecosystem maturity matters more than cost

V4 is well-supported in LangChain, but the agent-framework ecosystem is still primarily built around OpenAI + Anthropic SDKs. If you're shipping fast and ecosystem maturity > 5× cost savings, stay with the closed-model SDKs.

The architecture story

The benchmarks are competitive; what's actually new in V4 is the inference-cost architecture:

  • Mixture-of-Experts with 49B active. Out of 1.6T total parameters, only 49B are activated per token. Inference cost (compute + memory) scales with active params, not total — that's how a 1.6T model serves at $1.74/MTok.
  • FP4 + FP8 mixed precision. V4-Pro ships in FP4 + FP8 mixed precision rather than the more common BF16. ~50% memory footprint reduction; minor accuracy cost amortised by the larger total parameter count.
  • Interleaved thinking. V4 uses an interleaved-thinking pattern for reasoning-heavy tasks rather than the separate <thinking> block of earlier reasoners. Lower overhead, faster TTFT, but the pattern is harder to integrate into agent frameworks expecting the classic separator.

How to use V4 today

Three paths:

Path 1: Cloud API

DeepSeek's own API at api.deepseek.com — pay-as-you-go at the $1.74/$3.48 rates. Standard OpenAI-compatible wire format.

Path 2: Self-host

V4-Flash (284B / 13B active) runs on a single 8× H100 node. V4-Pro (1.6T / 49B active) needs 16-32× H100 depending on the quantization. vLLM, SGLang, and TGI all support V4 as of release.

Path 3: Multi-provider gateway

Anvat (the gateway we build) exposes V4-Pro alongside Claude, GPT, and Gemini on a single OpenAI-compatible key. Route per-task: V4 for bulk coding, Opus for reasoning, GPT for multimodal — no key juggling.

What V4 changes about the market

Three implications:

  1. The "expensive frontier" tier is now a smaller wedge. Opus 4.7 and GPT-5.5 keep their lead on the absolute hardest tasks. Most production AI workloads don't need that ceiling. V4-Pro covers the 80% of the value distribution at 1/20th the cost.

  2. Open weights at frontier quality is a 2026 fact. Two years ago open-source meant "trail closed models by 12 months." V4 trails by ~6 months on hardest tasks, leads on coding. The gap is now small enough that "use the open model" is a defensible default for most workloads.

  3. Pricing pressure on the closed labs increases. When V4-Pro delivers near-Opus quality at $3.48/MTok output, the Opus 4.7 $75/MTok price becomes harder to justify for any workload that isn't doing genuinely hardest-tier reasoning. Expect closed-model pricing to compress in the second half of 2026.

Run V4-Pro alongside Claude, GPT, and Gemini on one key

Anvat is OpenAI- and Anthropic-compatible. DeepSeek V4-Pro routes through the same /v1/chat/completions endpoint as every other model — no per-provider integration. 30% off list price.

Try free →