Anvat for RAG pipelines

Cheap, fast RAG with prompt caching + 30%-off frontier models

RAG workloads are unusually well-suited to a discounted gateway with passthrough caching. Most of your input tokens are the same large system prompt or retrieved-document prefix on every request — exactly what prompt caching is designed to optimise. Anvat passes Anthropic's cache_control through unchanged, then layers a 30% discount on top.

RAG is input-token-heavy and the bill stacks fast

A typical RAG turn: 80K-150K tokens of retrieved context, 1-2K of generated answer. Input tokens dominate the bill. Pay the full input rate on every request and a 10K-request-per-day knowledge base costs hundreds per day. Skip prompt caching and you're paying the full rate on the same chunks of context over and over.

Cache the system prompt + retrieved-doc prefix, then discount

Wrap your stable prefix (system instructions, tool definitions, the top-K retrieved documents) in cache_control: { type: "ephemeral" }. The first request pays 1.25× the input price to warm the cache; every subsequent request within 5 minutes pays 10% of the input price for those same tokens. On Anvat, that 10% is discounted by another 30%, landing at 7% of the original list price. A $300/day RAG bill typically drops to $50-80 with both optimisations stacked.
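The arithmetic above can be checked with a small calculator. This is only a sketch: the function name, the `inputPricePerMTok` parameter and the example workload are illustrative assumptions, and it further assumes the 30% discount applies to cache writes as well as cache reads. The multipliers are the ones quoted in this section.

```typescript
// Multipliers quoted in the text above; adjust if pricing changes.
const CACHE_WRITE_MULTIPLIER = 1.25; // first request warms the cache
const CACHE_READ_MULTIPLIER = 0.1;   // later requests while the cache is warm
const GATEWAY_DISCOUNT = 0.3;        // 30% off list (assumed to cover writes too)

interface PrefixCostInput {
  prefixTokens: number;      // tokens in the cached prefix
  requests: number;          // requests (>= 1) that reuse the prefix while it stays warm
  inputPricePerMTok: number; // list input price in USD per million tokens (assumed value)
}

// Cost of the prefix tokens only: uncached at list, cached at list, cached + discounted.
function prefixCost({ prefixTokens, requests, inputPricePerMTok }: PrefixCostInput) {
  const perRequestList = (prefixTokens / 1_000_000) * inputPricePerMTok;
  const uncachedList = perRequestList * requests;
  const cachedList =
    perRequestList * CACHE_WRITE_MULTIPLIER +
    perRequestList * CACHE_READ_MULTIPLIER * (requests - 1);
  const cachedDiscounted = cachedList * (1 - GATEWAY_DISCOUNT);
  return { uncachedList, cachedList, cachedDiscounted };
}

// Hypothetical workload: 100K-token prefix, 50 requests in one warm window, $3 per MTok.
const example = prefixCost({ prefixTokens: 100_000, requests: 50, inputPricePerMTok: 3 });
console.log(example);
```

As the number of requests per warm window grows, the ratio of `cachedDiscounted` to `uncachedList` approaches the 7% figure; with few requests per window the 1.25× write cost keeps it higher.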

Why this beats the obvious alternative

  • Stacked savings: caching × discount

    10% × 70% = 7% of original input cost on cached tokens.

  • 1M context tier on Sonnet 4.6

    When 200K isn't enough: extended context at an effective $4.20/$15.75 per million input/output tokens.

  • No infrastructure for caching

    Anthropic + Anvat handle cache TTL, eviction, billing — your code just adds the cache_control flag.

  • Same wire format as direct Anthropic

    Existing RAG implementations (LangChain, LlamaIndex, custom) work unchanged — only the base URL differs.

  • Batch API for backfills

    Offline jobs built on the Messages API, such as doc classification, get an additional 50% off via Anthropic's batch endpoint.
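For the batch bullet above, a backfill is usually expressed as a list of independent Messages requests, each tagged with a `custom_id` so results can be matched back to documents. The sketch below only builds that payload. The `Doc` type, `buildClassificationBatch` and the prompt wording are hypothetical, and it assumes the gateway forwards the SDK's `client.messages.batches.create` call unchanged.

```typescript
interface Doc { id: string; text: string } // hypothetical corpus record

interface BatchRequest {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: { role: "user"; content: string }[];
  };
}

// Builds one batch entry per document; ids are assumed to be short and
// limited to letters, digits, "_" and "-" so they are valid custom_ids.
function buildClassificationBatch(docs: Doc[], model = "claude-sonnet-4-6"): BatchRequest[] {
  return docs.map((doc): BatchRequest => ({
    custom_id: `classify-${doc.id}`,
    params: {
      model,
      max_tokens: 16,
      messages: [
        { role: "user", content: `Classify this document with a single label:\n\n${doc.text}` },
      ],
    },
  }));
}

const batchRequests = buildClassificationBatch([
  { id: "doc-1", text: "Quarterly revenue grew in all regions." },
  { id: "doc-2", text: "Reset your password from the account page." },
]);
console.log(batchRequests.length);
// With the client from the Quickstart, this would be submitted as:
//   await client.messages.batches.create({ requests: batchRequests });
```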

Quickstart

TypeScript — RAG with prompt caching through Anvat

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.anvat.app/v1",
  authToken: process.env.ANVAT_API_KEY,
});

// retrieve() is your own retriever; it is assumed to resolve to an array of text chunks.
const retrievedDocs: string[] = await retrieve(query, { topK: 8 });

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT, // stable across requests
      cache_control: { type: "ephemeral" },
    },
    {
      type: "text",
      text: retrievedDocs.join("\n\n"),
      cache_control: { type: "ephemeral" }, // cache the doc prefix too
    },
  ],
  messages: [{ role: "user", content: query }],
});
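To confirm the cache is actually being hit, inspect the `usage` object on the response. Anthropic's Messages API reports `cache_creation_input_tokens` and `cache_read_input_tokens` alongside `input_tokens`. The helper name and the hit-rate definition below are assumptions for illustration, and the sketch assumes the gateway returns these fields unchanged.

```typescript
interface CacheUsage {
  input_tokens: number;                        // input tokens billed at the normal rate
  cache_creation_input_tokens?: number | null; // tokens written to the cache on this request
  cache_read_input_tokens?: number | null;     // tokens served from the cache on this request
}

// Fraction of all input tokens on this request that were read from the cache.
function cacheHitRate(usage: CacheUsage): number {
  const written = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const total = usage.input_tokens + written + read;
  return total === 0 ? 0 : read / total;
}

// After the Quickstart call this would be: cacheHitRate(response.usage)
const warmRate = cacheHitRate({
  input_tokens: 1_000,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 9_000,
});
console.log(`cache hit rate: ${(warmRate * 100).toFixed(1)}%`);
```

A first request should show a non-zero `cache_creation_input_tokens` and zero reads; if later requests still show zero reads, the prefix is probably changing between calls.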

FAQ

How long does the cache last?
Anthropic's default ephemeral cache TTL is 5 minutes from last access. A 1-hour TTL beta exists (2× input write cost). For RAG workloads with bursty traffic, the 5-minute TTL usually suffices since concurrent users share warm caches.
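One way to pick between the two TTLs is to look at the typical gap between requests that share a prefix. The sketch below is a rough heuristic, not a rule from Anthropic or Anvat: `chooseCacheControl` and its thresholds are assumptions, and the `ttl: "1h"` field may require a beta header depending on the API version in use.

```typescript
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const ONE_HOUR_MS = 60 * 60 * 1000;

// expectedGapMs: typical time between two requests that reuse the same prefix.
// Returns undefined when the gap is too long for either TTL to produce hits.
function chooseCacheControl(expectedGapMs: number): CacheControl | undefined {
  if (expectedGapMs < FIVE_MINUTES_MS) {
    return { type: "ephemeral" }; // default 5-minute TTL, refreshed on each hit
  }
  if (expectedGapMs < ONE_HOUR_MS) {
    return { type: "ephemeral", ttl: "1h" }; // higher write cost, longer lifetime
  }
  return undefined; // caching would only add write cost
}

console.log(chooseCacheControl(20 * 60 * 1000));
```

Because the 1-hour write costs more than the 5-minute write, it only pays off when the prefix really is reused several times within the hour.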
Can I cache retrieved documents that change per query?
Partially. Cache the truly-stable prefix (system prompt + tool definitions) for max hit rate. Document chunks that change per query won't cache effectively unless your retriever returns the same top-K for many queries. For mostly-stable top-K (e.g. domain-narrow knowledge bases), caching still pays.
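Cache hits require the cached prefix to be identical from one request to the next, so a retriever that returns the same chunks in a different order will still miss. A minimal sketch of one mitigation follows, assuming each chunk carries a stable `id`; the `Chunk` type, `buildSystemBlocks` and the `cacheDocs` switch are hypothetical.

```typescript
interface Chunk { id: string; text: string } // hypothetical retriever output

type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

function buildSystemBlocks(systemPrompt: string, chunks: Chunk[], cacheDocs: boolean): SystemBlock[] {
  // Sort by id so the same set of chunks always serialises to the same text,
  // regardless of the order in which the retriever ranked them.
  const ordered = [...chunks].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const docs: SystemBlock = { type: "text", text: ordered.map((c) => c.text).join("\n\n") };
  if (cacheDocs) docs.cache_control = { type: "ephemeral" };
  return [
    // The stable instructions are always cached.
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    // The documents are cached only when the top-K is expected to repeat.
    docs,
  ];
}

const blocks = buildSystemBlocks(
  "You answer from the provided documents.",
  [{ id: "b", text: "Second doc" }, { id: "a", text: "First doc" }],
  true,
);
console.log(blocks[1].text);
```

Sorting by id discards the retriever's ranking order, which can matter for answer quality; it is a trade-off to test against your own data rather than a default.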
What about embeddings — does Anvat host an embedding endpoint?
Yes. OpenAI's text-embedding-3-large and Voyage AI embeddings are available through /v1/embeddings at standard discounted rates. Use them to index your corpus before query time.
Does the 1M-context Sonnet tier work through Anvat?
Yes. Use the same model name (claude-sonnet-4-6) and Anthropic auto-routes to the extended-context tier when the input exceeds 200K tokens. Pricing changes at the threshold ($6/$22.50 per million input/output tokens at list, $4.20/$15.75 on Anvat), and Anvat passes the correct billing through.
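The threshold pricing can be sketched as a small estimator. It assumes the quoted rates are per million tokens and that the extended-context rate applies to every token of a request once its input exceeds 200K; `longContextCost` is a hypothetical helper, and it returns `null` below the threshold because standard-tier rates are not listed here.

```typescript
const LONG_CONTEXT_THRESHOLD = 200_000;            // input tokens
const LIST_LONG_RATES = { input: 6, output: 22.5 }; // USD per million tokens at list (assumed unit)
const ANVAT_DISCOUNT = 0.3;                        // 30% off list

// Estimated USD cost of one extended-context request, or null if it stays in the standard tier.
function longContextCost(inputTokens: number, outputTokens: number): number | null {
  if (inputTokens <= LONG_CONTEXT_THRESHOLD) return null;
  const inputRate = LIST_LONG_RATES.input * (1 - ANVAT_DISCOUNT);   // about 4.20
  const outputRate = LIST_LONG_RATES.output * (1 - ANVAT_DISCOUNT); // about 15.75
  return (inputTokens / 1_000_000) * inputRate + (outputTokens / 1_000_000) * outputRate;
}

console.log(longContextCost(500_000, 2_000));
```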

Try Anvat for RAG pipelines

$2 free credit on signup, no card required. Setup is two env vars — reversible in 60 seconds.
