Anvat for RAG pipelines
Cheap, fast RAG with prompt caching + 30%-off frontier models
RAG workloads are unusually well-suited to a discounted gateway with passthrough caching. Most of your input tokens are the same large system prompt or retrieved-document prefix on every request — exactly what prompt caching is designed to optimise. Anvat passes Anthropic's cache_control through unchanged, then layers a 30% discount on top.
RAG is input-token-heavy and the bill stacks fast
A typical RAG turn: 80K-150K tokens of retrieved context, 1-2K of generated answer. Input tokens dominate the bill. At 10K requests per day that is roughly 0.8-1.5 billion input tokens daily, so paying the full input rate on every request costs hundreds to thousands of dollars per day, depending on the model. Skip prompt caching and you pay that full rate on the same chunks of context over and over.
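To make the "input tokens dominate" point concrete, here is a minimal sketch that splits the cost of one RAG turn into input and output. The two per-million-token rates are placeholder assumptions, not figures from this page; substitute the list prices for the model you actually use.

```typescript
// Hypothetical list rates in USD per million tokens (assumptions for illustration).
const INPUT_USD_PER_MTOK = 3;
const OUTPUT_USD_PER_MTOK = 15;

interface TurnCost {
  inputUsd: number;
  outputUsd: number;
  inputShare: number; // fraction of the turn's cost spent on input tokens
}

// Cost of a single request, with no caching and no discount applied.
function turnCost(inputTokens: number, outputTokens: number): TurnCost {
  const inputUsd = (inputTokens / 1_000_000) * INPUT_USD_PER_MTOK;
  const outputUsd = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MTOK;
  return { inputUsd, outputUsd, inputShare: inputUsd / (inputUsd + outputUsd) };
}

// 100K tokens of retrieved context, 1.5K tokens of generated answer.
const turn = turnCost(100_000, 1_500);
console.log(turn.inputShare.toFixed(2)); // prints "0.93"
```

Under these assumed rates, more than nine tenths of the turn's cost is input, even though output tokens are priced five times higher per token.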
Cache the system prompt + retrieved-doc prefix, then discount
Mark the end of your stable prefix (system instructions, tool definitions, the top-K retrieved documents) with cache_control: { type: 'ephemeral' }; everything up to and including the marked block is cached. The first request pays 1.25× the input price to warm the cache; every subsequent request that hits the cache within the 5-minute TTL pays 10% of the input price for those same tokens. On Anvat, that 10% is discounted by another 30%, landing at 7% of the original list price. A $300/day RAG bill typically drops to $50-80 with both optimisations stacked.
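The multipliers above combine into a single blended figure once you know how often requests hit a warm cache. The sketch below uses the 1.25× write, 10% read, and 30% discount numbers from the text; it assumes (this is not stated on this page) that the gateway discount applies equally to cache writes and cache reads, and the function name is illustrative.

```typescript
// Multipliers relative to the list input price, taken from the text above.
const CACHE_WRITE = 1.25; // first request warms the cache
const CACHE_READ = 0.10;  // later requests that hit the cache inside the TTL
// Assumption: the 30% gateway discount applies uniformly to writes and reads.
const GATEWAY_FACTOR = 0.70;

// Blended cost of the cached prefix as a fraction of the uncached,
// undiscounted list price, for a given cache hit rate in [0, 1].
function blendedPrefixCost(hitRate: number): number {
  if (hitRate < 0 || hitRate > 1) throw new RangeError("hitRate must be in [0, 1]");
  const cacheOnly = hitRate * CACHE_READ + (1 - hitRate) * CACHE_WRITE;
  return cacheOnly * GATEWAY_FACTOR;
}

const allHits = blendedPrefixCost(1);     // 0.10 × 0.70, about 0.07
const noHits = blendedPrefixCost(0);      // 1.25 × 0.70 = 0.875
const ninetyPct = blendedPrefixCost(0.9); // about 0.1505
console.log({ allHits, noHits, ninetyPct });
```

Note the zero-hit case: a prefix that is written but never read back costs 0.875 of list, which is worse than the 0.70 you would pay with the discount alone. Caching only pays when the same prefix actually recurs within the TTL.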
Why this beats the obvious alternative
Stacked savings: caching × discount
10% × 70% = 7% of original input cost on cached tokens.
1M context tier on Sonnet 4.6
When 200K isn't enough: extended context at an effective $4.20/$15.75 per million input/output tokens.
No infrastructure for caching
Anthropic + Anvat handle cache TTL, eviction, billing — your code just adds the cache_control flag.
Same wire format as direct Anthropic
Existing RAG implementations (LangChain, LlamaIndex, custom) work unchanged — only the base URL differs.
Batch API for backfills
Offline doc-classification or embedding-generation jobs get an additional 50% off via Anthropic's batch endpoint.
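For the batch backfill case, a rough sketch of how a doc-classification job might be assembled. The entry shape (a custom_id plus ordinary Messages parameters) mirrors Anthropic's Message Batches format; the prompt, the label set, the id scheme, and the helper name are all hypothetical, and whether the gateway exposes the batch endpoint at the same path is an assumption to verify before relying on it.

```typescript
// Local type mirroring one Message Batches entry: a custom_id plus Messages params.
interface BatchEntry {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    system: string;
    messages: { role: "user"; content: string }[];
  };
}

// Hypothetical classifier prompt and label set, for illustration only.
const CLASSIFY_PROMPT =
  "Classify the document as one of: policy, howto, changelog. Reply with the label only.";

// Build one batch entry per document; custom_id lets you match results back to docs.
function buildClassificationBatch(docs: { id: string; text: string }[]): BatchEntry[] {
  return docs.map((doc) => ({
    custom_id: `classify-${doc.id}`,
    params: {
      model: "claude-sonnet-4-6",
      max_tokens: 16,
      system: CLASSIFY_PROMPT,
      messages: [{ role: "user", content: doc.text }],
    },
  }));
}

const requests = buildClassificationBatch([
  { id: "doc1", text: "Version 2.3 adds export to CSV." },
  { id: "doc2", text: "To reset your password, open Settings." },
]);

// With an Anthropic SDK client configured as in the Quickstart, submission
// would look like:
//   const batch = await client.messages.batches.create({ requests });
console.log(requests.map((r) => r.custom_id));
```

Keeping the builder separate from the network call makes the payload easy to inspect and unit-test before any tokens are spent.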
Quickstart
TypeScript — RAG with prompt caching through Anvat
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.anvat.app/v1",
  authToken: process.env.ANVAT_API_KEY,
});

// Your retriever; assumed here to take an options object and return an array of strings.
const retrievedDocs = await retrieve(query, { topK: 8 });

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT, // stable across requests
      cache_control: { type: "ephemeral" },
    },
    {
      type: "text",
      text: retrievedDocs.join("\n\n"),
      cache_control: { type: "ephemeral" }, // cache the doc prefix too
    },
  ],
  messages: [{ role: "user", content: query }],
});

FAQ
- How long does the cache last?
- Anthropic's default ephemeral cache TTL is 5 minutes from last access. A 1-hour TTL beta exists (2× input write cost). For RAG workloads with bursty traffic, the 5-minute TTL usually suffices since concurrent users share warm caches.
- Can I cache retrieved documents that change per query?
- Partially. Cache the truly-stable prefix (system prompt + tool definitions) for max hit rate. Document chunks that change per query won't cache effectively unless your retriever returns the same top-K for many queries. For mostly-stable top-K (e.g. domain-narrow knowledge bases), caching still pays.
- What about embeddings — does Anvat host an embedding endpoint?
- Yes — OpenAI's text-embedding-3-large and Voyage AI embeddings are available through /v1/embeddings at standard discounted rates. Use for indexing your corpus before query time.
- Does the 1M-context Sonnet tier work through Anvat?
- Yes. Use the same model name (claude-sonnet-4-6) and Anthropic auto-routes to the extended-context tier when input > 200K. Pricing changes at the threshold ($6/$22.50 per million input/output tokens at list, $4.20/$15.75 on Anvat), and Anvat passes the correct billing through.
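On the question of whether per-query documents cache effectively, the most direct answer comes from measuring it. Each Messages response carries a usage object with cache_creation_input_tokens and cache_read_input_tokens alongside input_tokens; the sketch below aggregates those into a single read share. The helper name and the sample numbers are illustrative, and it assumes the gateway returns the usage fields unchanged.

```typescript
// Subset of the usage object returned on each Messages response.
interface CacheUsage {
  input_tokens: number; // uncached prompt tokens
  cache_creation_input_tokens?: number | null; // tokens written to the cache
  cache_read_input_tokens?: number | null;     // tokens served from the cache
}

// Fraction of all prompt tokens that were served from cache across responses.
function cacheReadShare(usages: CacheUsage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    const r = u.cache_read_input_tokens ?? 0;
    const w = u.cache_creation_input_tokens ?? 0;
    read += r;
    total += r + w + u.input_tokens;
  }
  return total === 0 ? 0 : read / total;
}

// Illustrative numbers: the first request writes a 90K prefix, the second reads it.
const share = cacheReadShare([
  { input_tokens: 200, cache_creation_input_tokens: 90_000, cache_read_input_tokens: 0 },
  { input_tokens: 180, cache_creation_input_tokens: 0, cache_read_input_tokens: 90_000 },
]);
console.log(share.toFixed(3));
```

If this share stays near zero for the retrieved-document block in production, the retriever is returning a different top-K too often for that block to be worth marking, and caching only the system prompt and tool definitions is the better choice.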
Try Anvat for RAG pipelines
$2 free credit on signup, no card required. Setup is two env vars — reversible in 60 seconds.