Prompt caching deep-dive: how to cut Claude input costs by 90% (2026 guide)

Prompt caching is the single highest-ROI optimisation available to Claude API users in 2026 — bigger than batching, bigger than model routing, bigger than switching to a discount gateway. It's also the one most teams misunderstand: they implement it wrong, miss the cache hit, and conclude it doesn't work.

This is the complete 2026 guide: how the mechanism works, what to cache, what NOT to cache, the math behind a 90% effective discount, and the production patterns we see deliver 60-85% cache hit rates on real Claude Code traffic.

The 30-second TL;DR

Anthropic charges 1.25× input price to write a cache entry, 10% of input price to read it on subsequent calls within the TTL.
Default TTL is 5 minutes from last access; a 1-hour TTL beta exists at 2× input write cost.
You opt in per-block with cache_control: { type: "ephemeral" } on system, messages, or tools content.
Cached content must be at least 1024 tokens (Sonnet/Opus) or 2048 tokens (Haiku) — smaller chunks aren't cached even with the flag.
Cache hit ≠ free — you still pay 10% of input price. The win is the delta on the cacheable portion of your prompt.

How the math actually works

Take a representative Claude Code request: 25K input tokens (mostly system prompt + tool definitions + conversation history), 1.5K output. Sonnet 4.6 list pricing: $3/MTok input, $15/MTok output.

Scenario	Input cost	Output cost	Total per request
No caching	$0.075	$0.0225	$0.0975
First call (cache write, full 25K cached)	$0.094	$0.0225	$0.116
Cached call (100% hit on 25K)	$0.0075	$0.0225	$0.030
Cached call (80% hit, 20% fresh)	$0.021	$0.0225	$0.044

For a developer running 200 Claude Code requests per day, that 80%-hit scenario means $8.80/day instead of $19.50/day — a $325/month saving on a single setting change. At 75-85% cache hit rate (typical production), total bill reduction lands at 60-80% of input cost.

Cache writes are NOT amortized across users — each individual Anthropic account warms its own cache. If you operate a multi-tenant SaaS, every customer's first request pays the 1.25× warm-up cost.

What to cache

The single rule: cache the stable prefix, skip the volatile tail.

The cacheable parts of a typical Claude request, in priority order:

1. System prompt (highest priority)

If your system prompt is more than ~1K tokens (most agent system prompts are 5-20K), this is the biggest win available. Wrap it:

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: largeSystemPrompt, // 5-20K tokens, stable
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});

2. Tool definitions

If you ship 20+ MCP tools with verbose schemas, the JSON definitions can easily eat 5-15K tokens. Cache them once per session.

tools: [
  {
    type: "custom",
    name: "search_codebase",
    description: "...",
    input_schema: { /* ... */ },
    cache_control: { type: "ephemeral" }, // applies to ALL prior tools
  },
],

Note: cache_control on a tool caches that tool and everything before it in the array. Place on the last tool to cache the whole set.

3. Long stable conversation history

If you're running a multi-turn agent where the early conversation stabilises (e.g. project context loaded in turn 1, then 20 turns of back-and-forth), cache the early stable messages:

messages: [
  { role: "user", content: projectContextPrompt }, // turn 1, stable
  {
    role: "assistant",
    content: [
      {
        type: "text",
        text: previousResponse,
        cache_control: { type: "ephemeral" }, // cache up through here
      },
    ],
  },
  { role: "user", content: currentTurnPrompt }, // volatile
],

4. Retrieved documents in RAG

If your RAG retriever returns the same top-K documents for a stable knowledge base, the doc prefix is cacheable:

system: [
  { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  {
    type: "text",
    text: retrievedDocs.join("\n\n---\n\n"),
    cache_control: { type: "ephemeral" }, // doc prefix cached too
  },
],

This works well for domain-narrow RAG (e.g. customer support over a fixed product manual) and poorly for open-ended RAG with high doc churn.

What to NOT cache

Short prompts (<1K tokens). Below the minimum cacheable size, the flag is silently ignored. You pay the 1.25× write surcharge for nothing.
The current user message. This is volatile by definition — caching it makes the cache key change every request, defeating the purpose.
High-cardinality dynamic content that changes per request: timestamps in the prompt, request IDs, user-specific data injected into system. Strip these out or move to the volatile message tail.

The cache-key invariants

The cache key is computed from the byte content of every cached block, in order, up through the cache boundary. Even one different byte invalidates the cache and forces a fresh write.

Common silent killers:

Timestamps in system prompts. "Today's date is 2026-06-06" rotates daily → cache miss every midnight.
User name interpolation. Putting "User: Alice" in the system prompt forces a cache key per user — cache size explodes, hit rate drops.
Trailing whitespace differences. A trailing newline difference is a different key. Normalise prompts before sending.
Random tool ordering. If your tool array re-orders between requests (e.g. iterating a Map), the tool definitions cache misses every time.

Choosing TTL

Anthropic ships two cache TTLs:

TTL	Write cost	Read cost	Best for
5 minutes (default)	1.25× input	0.10× input	Bursty traffic, multi-user concurrent
1 hour (beta)	2× input	0.10× input	Single-user long sessions, batch evals

Math: 5-minute TTL pays off if you'll hit the cache within 5 minutes at least once (1.25× cost recovered by 1 read at 90% savings). 1-hour TTL pays off if you'll get 8+ hits over the hour (2× cost amortised across more reads).

For most production Claude Code / Cursor / agent traffic, 5-minute is the right default — concurrent users share warm caches and the per-user session length covers the TTL.

Production cache-hit patterns we see

Real-world cache hit rates on Anvat-monitored traffic, by workload:

Workload	Cache hit rate (input tokens)
Claude Code (active developer)	78%
Cursor Composer	45%
RAG with stable corpus	86%
Customer support chatbot	62%
Stateless single-shot agents	8%

The single biggest factor: how much of your prompt is genuinely stable across calls. Engineering this stability is the engineering work.

Verifying it's working

Every Claude response includes a usage object with cache-specific fields:

{
  "usage": {
    "input_tokens": 1500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 23500,
    "output_tokens": 1500
  }
}

cache_creation_input_tokens > 0 means you paid the 1.25× write surcharge
cache_read_input_tokens > 0 means you got a cache hit at 10% rate
input_tokens is the fresh-input portion at full rate

If cache_read_input_tokens is always 0 across many requests, your cache key is invalidating. Audit for the silent killers above.

On Anvat, the dashboard surfaces cache hit rate per key, per day — useful for spotting cache regressions after a prompt change.

Caching + a discount gateway

A discounted gateway like Anvat applies its discount on top of the cached rate. Cached read rate on Sonnet 4.6 = 10% of input = $0.30/MTok list → $0.21/MTok Anvat effective. Output rate = $15/MTok list → $10.50/MTok Anvat effective.

Stack the two and the same 80%-cache-hit Claude Code workload above costs $6.16/day instead of $19.50/day — about 68% off versus no-optimisation direct-Anthropic billing.

Common gotchas

MCP tools loading after first call. Some MCP setups dynamically add tools mid-session. New tool added → tool definition array changed → cache invalidates. Lock the tool set at session start.
Conversation rollover. If you trim old turns to stay under context limits, the surviving conversation prefix changes → invalidation. Trim on TURN boundaries that align with your cache breakpoints.
Different models = different caches. Sonnet 4.6 and Opus 4.8 maintain separate caches. Switching models mid-conversation pays the warm-up cost again on the new model.

Bottom line

Prompt caching is the highest-leverage optimisation on the Claude API. The 30-minute implementation pays back within hours of being deployed. Combine with a discount gateway and the same agent workload runs at roughly a third of the original Anthropic-direct cost — with identical model output quality.

Full Anthropic pricing breakdown → Cheap Claude API in 2026 →

Stack caching with a 30% gateway discount

Anvat passes Anthropic's cache_control through unchanged + adds 30% off every rate (input, output, cached read). $2 free credit on signup.

Start free → →