Gemini 3.5 Pro launch tracker (June 2026) + Flash benchmark deep-dive

Google shipped Gemini 3.5 Flash at I/O 2026 on May 19 — and it already outperforms last year's flagship Gemini 3.1 Pro on most agentic and coding benchmarks. Sundar Pichai confirmed on stage that Gemini 3.5 Pro arrives in June, currently in internal use. This is the tracker — what Flash actually delivers today, where it regressed relative to 3.1 Pro (and why Pro exists to fix it), and what to plan around.

The Flash story in one paragraph

A model wearing the "Flash" badge (Google's lightweight tier) just outscored Gemini 3.1 Pro on Terminal-Bench, MCP Atlas, SWE-Bench Pro, Toolathlon, and OSWorld. That's never happened before in the Gemini lineup. The implication: Gemini 3.5 Pro should land at a level the benchmark community hasn't seen yet, but it won't ship for another ~3 weeks, most likely because Google has reasoning + long-context regressions to fix first.

Gemini 3.5 Flash benchmarks vs Gemini 3.1 Pro

Wins:

Benchmark	3.5 Flash	3.1 Pro	Delta
Terminal-Bench 2.1	76.2%	70.3%	+5.9
MCP Atlas	83.6%	78.2%	+5.4
Finance Agent v2	57.9%	43.0%	+14.9
GDPval-AA (Elo)	1656	1314	+342
OSWorld-Verified	78.4%	76.2%	+2.2
Blueprint-Bench 2	33.6%	26.5%	+7.1
Toolathlon	56.5%	49.4%	+7.1
CharXiv Reasoning	84.2%	83.3%	+0.9
SWE-Bench Pro (Public)	55.1%	54.2%	+0.9

Where Flash regressed vs 3.1 Pro:

Benchmark	3.5 Flash	3.1 Pro	Delta
Humanity's Last Exam	40.2%	44.4%	−4.2
ARC-AGI-2	72.1%	77.1%	−5.0
MRCR v2 (128K)	77.3%	84.9%	−7.6

The regression pattern is consistent: hardest expert reasoning + long-context retrieval. That's exactly what a "Pro" tier exists to restore.
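The Delta columns are easy to recompute if you want to sanity-check them or extend the comparison with your own numbers. Here is a minimal sketch in Python; the scores are copied from the two tables above, and the function names are only illustrative.

```python
# Scores copied from the two tables above: (benchmark, 3.5 Flash, 3.1 Pro).
# GDPval-AA is an Elo rating; every other row is a percentage.
SCORES = [
    ("Terminal-Bench 2.1", 76.2, 70.3),
    ("MCP Atlas", 83.6, 78.2),
    ("Finance Agent v2", 57.9, 43.0),
    ("GDPval-AA (Elo)", 1656, 1314),
    ("OSWorld-Verified", 78.4, 76.2),
    ("Blueprint-Bench 2", 33.6, 26.5),
    ("Toolathlon", 56.5, 49.4),
    ("CharXiv Reasoning", 84.2, 83.3),
    ("SWE-Bench Pro (Public)", 55.1, 54.2),
    ("Humanity's Last Exam", 40.2, 44.4),
    ("ARC-AGI-2", 72.1, 77.1),
    ("MRCR v2 (128K)", 77.3, 84.9),
]


def deltas(scores):
    """Return {benchmark: flash - pro}, rounded to one decimal place."""
    return {name: round(flash - pro, 1) for name, flash, pro in scores}


def split_wins_and_regressions(scores):
    """Partition benchmark names by the sign of the Flash-minus-Pro delta."""
    d = deltas(scores)
    wins = [name for name, value in d.items() if value > 0]
    regressions = [name for name, value in d.items() if value < 0]
    return wins, regressions


if __name__ == "__main__":
    for name, delta in deltas(SCORES).items():
        print(f"{name:<24} {delta:+.1f}")
    wins, regressions = split_wins_and_regressions(SCORES)
    print(f"{len(wins)} wins, {len(regressions)} regressions")
```

Mixing Elo and percentage rows in one list is fine for a sign check, but the deltas are not comparable across the two units.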

What Pro almost certainly does

If 3.5 Pro is going to justify its premium positioning over 3.5 Flash, it has to:

Restore Humanity's Last Exam to >44.4% (3.1 Pro baseline).
Restore ARC-AGI-2 to >77.1%.
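Restore MRCR v2 (128K), the long-context retrieval benchmark, to >85% (3.1 Pro scored 84.9%).
Match or exceed Flash on the agentic benchmarks: Terminal-Bench, MCP Atlas, etc.
Hold Flash's GDPval-AA gain. Flash's 1656 Elo is already 342 points above 3.1 Pro's 1314, though it still trails Opus 4.7 (1753) and GPT-5.5 (1769).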

If Google delivers all five, Gemini 3.5 Pro could become the strongest production frontier model at the long-context + reasoning intersection. Surpassing both Opus 4.7 (1753) and GPT-5.5 (1769) on GDPval-AA Elo as well would take a gain of more than 113 points over Flash's 1656.

If they only fix the regression at the cost of agentic speed, it becomes "Pro for hard tasks, Flash for fast" — useful but more narrowly positioned.
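When Pro's numbers are published, the five conditions above can be checked mechanically. The sketch below is one way to do it in Python. The thresholds come from this article; the candidate scores are made up purely to exercise the check, and the "etc." in condition four is narrowed to the two benchmarks it names.

```python
# Floors for the three regressed benchmarks, taken from the list above.
# Pro must strictly exceed these.
STRICT_FLOORS = {
    "Humanity's Last Exam": 44.4,
    "ARC-AGI-2": 77.1,
    "MRCR v2 (128K)": 85.0,
}

# Flash's own scores from the benchmark tables. Pro must match or exceed them.
FLASH_SCORES = {
    "Terminal-Bench 2.1": 76.2,
    "MCP Atlas": 83.6,
    "GDPval-AA (Elo)": 1656,
}


def check_pro(candidate):
    """Return {benchmark: passed} for a dict of candidate scores.

    A benchmark missing from `candidate` counts as a failure.
    """
    report = {}
    for name, floor in STRICT_FLOORS.items():
        report[name] = name in candidate and candidate[name] > floor
    for name, flash in FLASH_SCORES.items():
        report[name] = name in candidate and candidate[name] >= flash
    return report


def delivers_all(candidate):
    """True only if every condition passes."""
    return all(check_pro(candidate).values())


if __name__ == "__main__":
    # Hypothetical numbers, not a prediction or a leak.
    hypothetical_pro = {
        "Humanity's Last Exam": 46.0,
        "ARC-AGI-2": 78.0,
        "MRCR v2 (128K)": 86.0,
        "Terminal-Bench 2.1": 76.2,
        "MCP Atlas": 84.0,
        "GDPval-AA (Elo)": 1700,
    }
    for name, passed in check_pro(hypothetical_pro).items():
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
    print("all five delivered:", delivers_all(hypothetical_pro))
```

A report with the three reasoning and long-context rows passing and the agentic rows failing is the "Pro for hard tasks, Flash for fast" outcome.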

Release window

Signal	Date	What it means
Pichai keynote	May 19, 2026	"Coming next month" — confirmed June launch intent
Currently in internal testing	May 19, 2026	Late-stage eval, not early development
Expected GA	Mid-to-late June 2026	Based on "next month" framing

A more aggressive interpretation: "next month" said on May 19 means June ship, not "June announcement." Watch the Gemini API blog posts in the second week of June for the announcement.

What's live today (Flash)

Available now via:

Gemini API in Google AI Studio
Vertex AI
Antigravity (Google's IDE — Flash is the default model)
Android Studio
Gemini app + AI Mode in Search

Pricing is unchanged from Gemini 3.1 Flash. Throughput: ~289 tokens/sec in Antigravity-optimised inference (4× faster than other frontier models per Google's published numbers).

Comparison: Gemini 3.5 Flash vs the field

Per Google's own published table (Gemini DeepMind page):

Benchmark	3.5 Flash	Sonnet 4.6	Opus 4.7	GPT-5.5
Terminal-Bench 2.1	76.2%	—	66.1%	78.2%
SWE-Bench Pro	55.1%	—	64.3%	58.6%
MCP Atlas	83.6%	69.5%	79.1%	75.3%
OSWorld-Verified	78.4%	72.5%	78.0%	78.7%
GDPval-AA (Elo)	1656	1676	1753	1769
MRCR v2 (128K)	77.3%	84.9%	59.3%	94.8%

Reading the table fairly: GPT-5.5 leads on knowledge work economics (GDPval-AA Elo) and long-context recall, and narrowly on Terminal-Bench and OSWorld. Opus 4.7 leads on hard coding (SWE-Bench Pro 64.3%). Gemini 3.5 Flash leads on tool-use composition (MCP Atlas) and sits within two points of the leader on Terminal-Bench and OSWorld at Flash-tier speed and price. On OSWorld the top three are within a point of each other; Sonnet 4.6 trails by about six.
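To see who tops each row without eyeballing it, the leader per benchmark can be recomputed from the table. This is a rough Python sketch; the rows are copied from the table above, `None` stands for the cells shown as a dash, and the helper name is only illustrative.

```python
# Column order matches the comparison table above.
MODELS = ["3.5 Flash", "Sonnet 4.6", "Opus 4.7", "GPT-5.5"]

# None marks a score the table does not report.
TABLE = {
    "Terminal-Bench 2.1": [76.2, None, 66.1, 78.2],
    "SWE-Bench Pro": [55.1, None, 64.3, 58.6],
    "MCP Atlas": [83.6, 69.5, 79.1, 75.3],
    "OSWorld-Verified": [78.4, 72.5, 78.0, 78.7],
    "GDPval-AA (Elo)": [1656, 1676, 1753, 1769],
    "MRCR v2 (128K)": [77.3, 84.9, 59.3, 94.8],
}


def leaders(table, models):
    """Return {benchmark: (leading model, its score, spread across reported scores)}."""
    result = {}
    for bench, row in table.items():
        scored = [(score, model) for score, model in zip(row, models) if score is not None]
        top_score, top_model = max(scored)
        low_score, _ = min(scored)
        result[bench] = (top_model, top_score, round(top_score - low_score, 1))
    return result


if __name__ == "__main__":
    for bench, (model, score, spread) in leaders(TABLE, MODELS).items():
        print(f"{bench:<20} {model:<11} {score:>7}  spread {spread}")
```

The spread column is a quick way to tell a real lead from a near-tie; a fraction of a point on a vendor-published table is well within what a different harness or prompt could move.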

3.5 Pro is presumably being engineered to put Google ahead on the dimensions where Flash trails the field: GDPval-AA Elo and long-context recall.

What to do until Pro lands

Run Flash today for agentic work. It's strictly better than 3.1 Pro on agentic/coding/tool benchmarks. Migration is free — same wire format, lower-priced tier.
Hold Pro-class hard-reasoning workloads on 3.1 Pro or Opus 4.7 for the next 2-3 weeks. Flash regresses on Humanity's Last Exam and ARC-AGI-2. Wait for 3.5 Pro to catch up there.
Define your Flash → Pro A/B eval before June 18. Same advice we'd give for GPT-5.6: don't lose the comparison by flipping the model on launch day. Lock the eval first.
Test the long-context regression yourself. MRCR v2 (128K) dropped 7.6 points on Flash. If your RAG pipeline depends on the 128K range, run your eval to verify before migrating.
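"Lock the eval first" can be as simple as fingerprinting the test cases and refusing to score anything if the set has changed. The Python sketch below shows the idea under some assumptions: the models are passed in as plain callables (wire them to whatever client you already use), scoring is exact match, and every name here is hypothetical.

```python
import hashlib
import json


def fingerprint(cases):
    """SHA-256 over a canonical JSON dump, so any edit to the eval set is detectable."""
    blob = json.dumps(cases, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_ab(cases, locked_fingerprint, model_a, model_b):
    """Score two models on the same frozen cases.

    `model_a` and `model_b` are callables taking a prompt string and returning
    the model's answer as a string. Raises ValueError if the cases no longer
    match the fingerprint recorded when the eval was locked.
    """
    if not cases:
        raise ValueError("eval set is empty")
    if fingerprint(cases) != locked_fingerprint:
        raise ValueError("eval set changed since it was locked")
    hits = {"a": 0, "b": 0}
    for case in cases:
        hits["a"] += int(model_a(case["prompt"]).strip() == case["expected"])
        hits["b"] += int(model_b(case["prompt"]).strip() == case["expected"])
    return {arm: count / len(cases) for arm, count in hits.items()}


if __name__ == "__main__":
    # Toy cases; a real long-context eval would embed the answer deep in a
    # ~128K-token prompt built from your own documents.
    cases = [
        {"prompt": "Which port does the staging proxy use?", "expected": "8443"},
        {"prompt": "Who approved change 17?", "expected": "R. Ortiz"},
    ]
    lock = fingerprint(cases)  # record this value before launch day

    # Stand-ins for real API calls.
    def incumbent(prompt):
        return "8443" if "port" in prompt else "R. Ortiz"

    def challenger(prompt):
        return "8443"

    print(run_ab(cases, lock, incumbent, challenger))
```

Store the fingerprint next to the results. When Pro ships, rerun with the same cases and the same fingerprint and swap only the challenger callable.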

What this means for builders

The dominant frontier-model release pattern in mid-2026 is:

Smaller / faster tier ships first — fast iteration on the most cost-sensitive workloads
Premium tier ships ~30 days later — restores reasoning + long-context

This is the second time in 2026 we've seen this (Google did the same with Gemini 3 Flash → 3 Pro). It's becoming the playbook. Plan your adoption around it — Flash on launch day, Pro 4-6 weeks later for the reasoning-intensive portion of the stack.

Pricing comparison (current GA)

Model	Input / MTok	Output / MTok
Gemini 3.5 Flash	$0.30	$2.50
Gemini 3.5 Pro (TBD, rumoured)	~$2.50	~$15.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Opus 4.7	$5.00	$25.00
GPT-5.5	$1.25	$10.00

At those numbers, Gemini 3.5 Flash is the cheapest frontier-tier model for agentic work. A typical Cursor-style autocomplete request (2K input, 200 output) costs roughly $0.0011 on Flash vs $0.0090 on Sonnet 4.6, about 8× cheaper.
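The per-request arithmetic is worth scripting so you can plug in your own request shape. Here is a minimal sketch in Python using the per-million-token prices from the table above; the 2K-input, 200-output shape is just the example from this section, and the function names are illustrative.

```python
# USD per million tokens as (input, output), copied from the pricing table above.
# Gemini 3.5 Pro is left out because its price is not yet announced.
PRICES = {
    "Gemini 3.5 Flash": (0.30, 2.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (1.25, 10.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at list price, ignoring caching and batch discounts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_ratio(model, baseline, input_tokens, output_tokens):
    """How many times more expensive `model` is than `baseline` for the same request."""
    return request_cost(model, input_tokens, output_tokens) / request_cost(
        baseline, input_tokens, output_tokens
    )


if __name__ == "__main__":
    shape = (2_000, 200)  # input tokens, output tokens
    for model in PRICES:
        ratio = cost_ratio(model, "Gemini 3.5 Flash", *shape)
        print(f"{model:<18} ${request_cost(model, *shape):.4f}  ({ratio:.1f}x Flash)")
```

The ratio moves with the input-to-output mix, so run it with your real token counts before quoting a multiplier.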

If Pro lands at the rumoured $2.50/$15 tier, it becomes price-competitive with Sonnet 4.6 while expected to outperform it on agentic + coding — the strongest cost-quality position in the lineup.

Run Gemini 3.5 Flash + every frontier model on one key

Anvat exposes Gemini, Claude, GPT, and DeepSeek through one OpenAI- + Anthropic-compatible key. Day-0 launch support for the upcoming GPT-5.6 + Gemini 3.5 Pro releases.
