Google shipped Gemini 3.5 Flash at I/O 2026 on May 19 — and it already outperforms last year's flagship Gemini 3.1 Pro on most agentic and coding benchmarks. Sundar Pichai confirmed on stage that Gemini 3.5 Pro arrives in June, currently in internal use. This is the tracker — what Flash actually delivers today, where it regressed relative to 3.1 Pro (and why Pro exists to fix it), and what to plan around.
The Flash story in one paragraph
A model wearing the "Flash" badge — Google's lightweight tier — just outscored Gemini 3.1 Pro on Terminal-Bench, MCP Atlas, SWE-Bench Pro, Toolathlon, and OSWorld. That's never happened before in the Gemini lineup. The implication: Gemini 3.5 Pro will land at a level the benchmark community hasn't seen yet — but it won't ship for another ~3 weeks because Google has reasoning + long-context regressions to fix first.
Gemini 3.5 Flash benchmarks vs Gemini 3.1 Pro
Wins:
| Benchmark | 3.5 Flash | 3.1 Pro | Delta |
|---|---|---|---|
| Terminal-Bench 2.1 | 76.2% | 70.3% | +5.9 |
| MCP Atlas | 83.6% | 78.2% | +5.4 |
| Finance Agent v2 | 57.9% | 43.0% | +14.9 |
| GDPval-AA (Elo) | 1656 | 1314 | +342 |
| OSWorld-Verified | 78.4% | 76.2% | +2.2 |
| Blueprint-Bench 2 | 33.6% | 26.5% | +7.1 |
| Toolathlon | 56.5% | 49.4% | +7.1 |
| CharXiv Reasoning | 84.2% | 83.3% | +0.9 |
| SWE-Bench Pro (Public) | 55.1% | 54.2% | +0.9 |
Where Flash regressed vs 3.1 Pro:
| Benchmark | 3.5 Flash | 3.1 Pro | Delta |
|---|---|---|---|
| Humanity's Last Exam | 40.2% | 44.4% | −4.2 |
| ARC-AGI-2 | 72.1% | 77.1% | −5.0 |
| MRCR v2 (128K) | 77.3% | 84.9% | −7.6 |
The regression pattern is consistent: hardest expert reasoning + long-context retrieval. That's exactly what a "Pro" tier exists to restore.
What Pro almost certainly does
If 3.5 Pro is going to justify its premium positioning over 3.5 Flash, it has to:
- Restore Humanity's Last Exam to >44.4% (3.1 Pro baseline).
- Restore ARC-AGI-2 to >77.1%.
- Restore MRCR v2 (128K) to >85% — long-context retrieval.
- Match or exceed Flash on Terminal-Bench, MCP Atlas, etc.
- Hold the GDPval-AA lead. Flash's 1656 Elo is already very high.
If Google delivers all five, Gemini 3.5 Pro becomes the strongest production frontier model — surpassing both Opus 4.7 (Elo 1753) and GPT-5.5 (Elo 1769) at the long-context + reasoning intersection.
If they only fix the regression at the cost of agentic speed, it becomes "Pro for hard tasks, Flash for fast" — useful but more narrowly positioned.
Release window
| Signal | Date | What it means |
|---|---|---|
| Pichai keynote | May 19, 2026 | "Coming next month" — confirmed June launch intent |
| Currently in internal testing | May 19, 2026 | Late-stage eval, not early development |
| Expected GA | Mid-to-late June 2026 | Based on "next month" framing |
A more aggressive interpretation: "next month" said on May 19 means June ship, not "June announcement." Watch the Gemini API blog posts in the second week of June for the announcement.
What's live today (Flash)
Available now via:
- Gemini API in Google AI Studio
- Vertex AI
- Antigravity (Google's IDE — Flash is the default model)
- Android Studio
- Gemini app + AI Mode in Search
Pricing is unchanged from Gemini 3.1 Flash. Throughput: ~289 tokens/sec in Antigravity-optimised inference (4× faster than other frontier models per Google's published numbers).
Comparison: Gemini 3.5 Flash vs the field
Per Google's own published table (Gemini DeepMind page):
| Benchmark | 3.5 Flash | Sonnet 4.6 | Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|
| Terminal-Bench 2.1 | 76.2% | — | 66.1% | 78.2% |
| SWE-Bench Pro | 55.1% | — | 64.3% | 58.6% |
| MCP Atlas | 83.6% | 69.5% | 79.1% | 75.3% |
| OSWorld-Verified | 78.4% | 72.5% | 78.0% | 78.7% |
| GDPval-AA (Elo) | 1656 | 1676 | 1753 | 1769 |
| MRCR v2 (128K) | 77.3% | 84.9% | 59.3% | 94.8% |
Reading the table fairly: GPT-5.5 leads on knowledge work economics (GDPval-AA Elo) and long-context recall. Opus 4.7 leads on hard coding (SWE-Bench Pro 64.3%) and reasoning. Gemini 3.5 Flash leads on tool-use composition (MCP Atlas) and agentic benchmarks where speed matters. All four are close on OSWorld.
3.5 Pro is being engineered to put Google ahead on the dimensions Flash gave up — specifically the GDPval-AA Elo lead and long-context recall.
What to do until Pro lands
- Run Flash today for agentic work. It's strictly better than 3.1 Pro on agentic/coding/tool benchmarks. Migration is free — same wire format, lower-priced tier.
- Hold Pro-class hard-reasoning workloads on 3.1 Pro or Opus 4.7 for the next 2-3 weeks. Flash regresses on Humanity's Last Exam and ARC-AGI-2. Wait for 3.5 Pro to catch up there.
- Define your Flash → Pro A/B eval before June 18. Same advice we'd give for GPT-5.6: don't lose the comparison by flipping the model on launch day. Lock the eval first.
- Test the long-context regression yourself. MRCR v2 (128K) dropped 7.6 points on Flash. If your RAG pipeline depends on the 128K range, run your eval to verify before migrating.
What this means for builders
The dominant frontier-model release pattern in mid-2026 is:
- Smaller / faster tier ships first — fast iteration on the most cost-sensitive workloads
- Premium tier ships ~30 days later — restores reasoning + long-context
This is the second time in 2026 we've seen this (Google did the same with Gemini 3 Flash → 3 Pro). It's becoming the playbook. Plan your adoption around it — Flash on launch day, Pro 4-6 weeks later for the reasoning-intensive portion of the stack.
Pricing comparison (current GA)
| Model | Input / MTok | Output / MTok |
|---|---|---|
| Gemini 3.5 Flash | $0.30 | $2.50 |
| Gemini 3.5 Pro | TBD (expected ~$2.50 / $15) | — |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.7 | $15.00 | $75.00 |
| GPT-5.5 | $1.25 | $10.00 |
At those numbers, Gemini 3.5 Flash is the cheapest frontier-tier model for agentic work. A typical Cursor-style autocomplete request (2K input, 200 output) costs roughly $0.0011 on Flash vs $0.0065 on Sonnet 4.6 — 6× cheaper.
If Pro lands at the rumoured $2.50/$15 tier, it becomes price-competitive with Sonnet 4.6 while expected to outperform it on agentic + coding — the strongest cost-quality position in the lineup.
Related coverage
Run Gemini 3.5 Flash + every frontier model on one key
Anvat exposes Gemini, Claude, GPT, and DeepSeek through one OpenAI- + Anthropic-compatible key. Day-0 launch support for the upcoming GPT-5.6 + Gemini 3.5 Pro releases.
Start free → →