How to build a production coding agent in 2026 (architecture + cost guide)

Practical guide to building a coding agent that ships — tool design, model routing, prompt caching, cost control, and the patterns that actually scale from prototype to production.

Anvat team · 7 min read

Building a coding agent in 2026 is no longer a research project. Claude Opus 4.8 is reliable enough to handle multi-file refactors autonomously, the tooling ecosystem (MCP, Anthropic Tool Use, OpenAI function calling) is mature, and the unit economics are predictable.

But "no longer a research project" is not the same as "trivially easy". Production coding agents fail in specific, predictable ways. This guide covers the architecture decisions that actually matter for scaling beyond the demo.

The architecture

Five layers, top to bottom:

┌────────────────────────────────────────┐
│  1. Front-end / chat surface           │
├────────────────────────────────────────┤
│  2. Orchestrator (your agent loop)     │
├────────────────────────────────────────┤
│  3. Tool layer (MCP servers / native)  │
├────────────────────────────────────────┤
│  4. LLM gateway (one key, many models) │
├────────────────────────────────────────┤
│  5. Provider APIs (Claude, GPT, …)     │
└────────────────────────────────────────┘

The interesting design decisions live in layers 2-4. Layers 1 and 5 are mostly off-the-shelf.

Layer 2: the orchestrator

The single most important design decision: how does the agent know when to stop? Three patterns we see work in production:

Pattern A: Bounded tool-call loop

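let turns = 0;
const MAX_TURNS = 25;

while (turns < MAX_TURNS) {
  const response = await llm.create({ ... });
  turns++;  // count every model call so no stop_reason can spin the loop forever
  // Keep the assistant turn in history; tool results must follow the call that requested them.
  messages.push({ role: "assistant", content: response.content });
  // end_turn, max_tokens, or anything else that isn't a tool request ends the loop.
  if (response.stop_reason !== "tool_use") break;
  const toolResults = await executeTool(response.tool_uses);
  messages.push({ role: "user", content: toolResults });
}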

Simple, works for 90% of cases. The MAX_TURNS cap is your safety net against runaway loops. Set it generously (25-50) but always have it.

Pattern B: Plan-then-execute

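// 1. Use Opus 4.8 to plan
const planResponse = await opus.create({ messages: [{ role: "user", content: planningPrompt }] });
const plan = parsePlan(planResponse);  // hypothetical helper: turns the reply into { steps: [{ prompt }] }

// 2. Use Sonnet 4.6 for each step in the plan
for (const step of plan.steps) {
  const result = await sonnet.create({ messages: [...history, { role: "user", content: step.prompt }] });
  // Carry each step's outcome forward so later steps can build on it.
  history.push({ role: "user", content: step.prompt }, { role: "assistant", content: result.content });
}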

Higher quality for complex tasks (multi-file refactors, large feature implementations). Higher cost per task. Use Opus for planning where quality matters, Sonnet for execution where cost matters.

Pattern C: Subagent dispatch

const plan = await dispatcher.create({ ... });  // Opus

const results = await Promise.all(
  plan.subtasks.map((task) =>
    subagent.create({  // Sonnet or Haiku
      system: subagentSystemPrompt,
      messages: [{ role: "user", content: task.prompt }],
    })
  )
);

Use this for parallelisable work (10 files to refactor independently, batch analysis). Claude Code supports it natively through its subagent feature; for custom agents, fan the subtasks out with Promise.all, typically to a cheaper model than the dispatcher.
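One caveat with a bare Promise.all is that it rejects as soon as any single subagent call fails, and it starts every call at once. The following is a minimal sketch of a bounded fan-out that keeps per-task outcomes. The Subtask shape, the run callback, and the default limit of 4 are assumptions for illustration, not part of any SDK.

```typescript
// Hypothetical shapes: neither Subtask nor the runner signature comes from a real SDK.
type Subtask = { id: string; prompt: string };
type Outcome<T> =
  | { id: string; ok: true; value: T }
  | { id: string; ok: false; error: string };

// Run subtasks with at most `limit` in flight; one failure never discards the others.
async function dispatchAll<T>(
  tasks: Subtask[],
  run: (task: Subtask) => Promise<T>,
  limit = 4,
): Promise<Outcome<T>[]> {
  const outcomes: Outcome<T>[] = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the queue is empty.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      const task = tasks[i];
      try {
        outcomes[i] = { id: task.id, ok: true, value: await run(task) };
      } catch (err) {
        outcomes[i] = { id: task.id, ok: false, error: String(err) };
      }
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcomes;
}
```

The dispatcher can then retry or report only the entries with ok: false instead of redoing the whole batch.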

Layer 3: tool design

Three rules that pay off:

1. Make tools idempotent

The agent will retry. Tools must handle "I just ran this" gracefully — return cached results, no-op duplicate writes, return current state on repeated calls. Non-idempotent tools cause cascade failures when the agent loops on a partial result.
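As a rough illustration of what "handle repeated calls gracefully" can look like, here is a sketch with two pieces: a write that no-ops when nothing would change, and a wrapper that returns the cached result for a repeated read. The in-memory files map, writeFile, callKey, and cached are all hypothetical names invented for this example.

```typescript
// Hypothetical in-memory store standing in for a real workspace.
const files = new Map<string, string>();

type WriteResult = { ok: true; path: string; changed: boolean };

// Idempotent write: repeating the same call leaves state unchanged and says so.
function writeFile(path: string, content: string): WriteResult {
  const changed = files.get(path) !== content;
  if (changed) files.set(path, content);
  return { ok: true, path, changed };
}

// Stable key for a call: same name and same top-level arguments (in any key order)
// produce the same key. Nested objects are not re-sorted in this sketch.
function callKey(name: string, args: Record<string, unknown>): string {
  const sorted = Object.keys(args).sort().map((k) => [k, args[k]]);
  return `${name}:${JSON.stringify(sorted)}`;
}

// Wrap a read-only tool so a repeated call returns the earlier result instead of re-running.
function cached<R>(name: string, fn: (args: Record<string, unknown>) => Promise<R>) {
  const seen = new Map<string, Promise<R>>();
  return (args: Record<string, unknown>): Promise<R> => {
    const key = callKey(name, args);
    let hit = seen.get(key);
    if (!hit) {
      hit = fn(args);
      seen.set(key, hit);
    }
    return hit;
  };
}
```

A real tool layer would also need to drop cached reads once a write touches the same path; that invalidation is left out here to keep the sketch short.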

2. Return errors as data, not exceptions

// Bad — agent has no information to recover from
tool.run({ ... });  // throws "FileNotFoundError"
 
// Good — agent can read the error and adapt
tool.run({ ... });
// → { ok: false, error: "file_not_found", path: "src/missing.ts",
//     suggestion: "list directory to find correct path" }

The model is much better at reading structured error data and correcting course than it is at recovering from a thrown exception, which usually reaches it as an opaque message or aborts the loop before it sees anything.
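One way to get there without rewriting every tool is a thin wrapper at the tool boundary. The sketch below assumes Node-style errors that carry a string code property; the ToolResult shape, safeRun, and the SUGGESTIONS table are hypothetical and only mirror the "good" example above.

```typescript
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string; message: string; suggestion?: string };

// Hypothetical mapping from error codes to recovery hints; extend per tool.
const SUGGESTIONS: Record<string, string> = {
  ENOENT: "list directory to find correct path",
  EACCES: "check permissions or pick a path inside the workspace",
};

// Run a tool body and convert any thrown error into structured data the model can read.
async function safeRun<T>(fn: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    const e = err as { code?: unknown; message?: unknown } | null;
    const code = typeof e?.code === "string" ? e.code : "unknown_error";
    const message = typeof e?.message === "string" ? e.message : String(err);
    return { ok: false, error: code, message, suggestion: SUGGESTIONS[code] };
  }
}
```

The orchestrator serialises the returned object into the tool result either way, so the model sees the failure and the hint on its next turn.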

3. Limit tool surface area

20 well-defined tools beat 100 vague ones. The model spends context budget on tool definitions; bloated tool sets eat your prompt cache and make tool selection less reliable. Audit ruthlessly — every tool needs to justify its presence.

Layer 4: the LLM gateway

This is where most teams accidentally lock themselves in. Decisions to make BEFORE you ship:

Decision: provider-direct vs gateway

If you're calling a single provider with one key, direct is fine. The moment you have:

  • Multiple providers (Claude + GPT + Gemini)
  • Multi-tenant traffic needing attribution
  • Need to swap providers without redeploys
  • Cost optimisation pressure

…a gateway pays for itself. Options:

  • OpenRouter — broad model catalog, ~5.5% markup
  • LiteLLM — self-hosted, MIT, ops burden
  • Portkey — governance + guardrails, hosted SaaS
  • Anvat — discount-focused, OpenAI + Anthropic shapes, 30% off list

Full gateway comparison →

Decision: which API shape?

Both Anthropic's /v1/messages and OpenAI's /v1/chat/completions work fine for agent loops. Pick by which provider you'll use most often, then let your gateway translate the rest.

Anvat exposes both shapes on one key. The pattern:

// Anthropic-shape calls for Claude
const anthropic = new Anthropic({
  baseURL: "https://api.anvat.app/v1",
  authToken: process.env.ANVAT_API_KEY,
});
 
// OpenAI-shape calls for GPT
const openai = new OpenAI({
  baseURL: "https://api.anvat.app/v1",
  apiKey: process.env.ANVAT_API_KEY,
});

One key, both SDKs, every frontier model.

Cost control patterns

Coding agents are unusually expensive. Defense in depth:

1. Prompt caching (mandatory)

Mark the system prompt, the tool definitions, and the stable conversation prefix with cache_control breakpoints. Roughly 30 minutes of implementation work typically cuts input cost by 60-80%.
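For an Anthropic-shape request, the breakpoints might be placed as in the sketch below. It only builds the request body, so it runs without an SDK. The buildRequest helper, the local types, and the max_tokens value are assumptions for illustration; the cache_control: { type: "ephemeral" } marker on tool and system blocks follows the Messages API.

```typescript
type CacheControl = { type: "ephemeral" };
type TextBlock = { type: "text"; text: string; cache_control?: CacheControl };
type ToolDef = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
};
type Message = { role: "user" | "assistant"; content: string };

// Build a request body whose stable prefix (tools, then system prompt) is cacheable.
function buildRequest(model: string, systemPrompt: string, tools: ToolDef[], messages: Message[]) {
  const ephemeral: CacheControl = { type: "ephemeral" };
  // A breakpoint on the last tool covers the whole tool list as one prefix.
  const cachedTools: ToolDef[] = tools.map((t, i) =>
    i === tools.length - 1 ? { ...t, cache_control: ephemeral } : t,
  );
  // A breakpoint on the system block covers tools + system prompt together.
  const system: TextBlock[] = [{ type: "text", text: systemPrompt, cache_control: ephemeral }];
  // Volatile values (timestamps, request IDs) belong in `messages`, never in the prefix.
  return { model, max_tokens: 4096, tools: cachedTools, system, messages };
}
```

Anything that changes between requests and sits before a breakpoint silently turns every call into a cache miss, which is why the helper takes the system prompt and tools as plain inputs and adds nothing of its own to them.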

Full prompt caching guide →

2. Model routing

Don't default to Opus. Cheap classifier (Haiku) → escalate to Sonnet by default → escalate to Opus only for genuinely hard problems. Typical production traffic mix: 20% Opus, 60% Sonnet, 20% Haiku — yields ~50% of pure-Opus cost.
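To sanity-check a traffic mix against your own prices, the blended cost is just a weighted sum. The sketch below is a hypothetical helper pair: route maps a classifier label to a tier, and blendedCost returns cost as a fraction of routing everything to Opus. The Difficulty labels and the relative prices you pass in are assumptions; the resulting fraction depends entirely on the price ratios in effect when you run it.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

// Hypothetical difficulty label produced by the cheap classifier call.
type Difficulty = "trivial" | "standard" | "hard";

// Sonnet is the default; only the extremes move up or down a tier.
function route(difficulty: Difficulty): Tier {
  if (difficulty === "trivial") return "haiku";
  if (difficulty === "hard") return "opus";
  return "sonnet";
}

// mix: share of traffic per tier (should sum to 1).
// relativePrice: price per tier divided by the Opus price (so opus = 1.0).
// Returns blended cost as a fraction of sending everything to Opus.
function blendedCost(mix: Record<Tier, number>, relativePrice: Record<Tier, number>): number {
  const tiers: Tier[] = ["haiku", "sonnet", "opus"];
  return tiers.reduce((sum, t) => sum + mix[t] * relativePrice[t], 0);
}
```

Because the Opus share is multiplied by 1.0, it sets a floor on the result: a 20% Opus share can never cost less than 20% of pure-Opus spend, however cheap the other tiers are.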

3. Per-tenant spend caps

In a multi-tenant SaaS, one runaway customer can eat your monthly budget. Cap per-tenant daily spend BEFORE letting users at the agent.

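// Tenant is assumed to carry at least { id: string; dailyLimit: number }.
async function dispatch(tenant: Tenant, prompt: string) {
  const spent = await tenantDailySpend(tenant.id);
  // >= so a tenant sitting exactly at the limit cannot start another run.
  if (spent >= tenant.dailyLimit) {
    throw new Error("Daily AI budget exceeded");
  }
  return await agent.run(prompt);
}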

Anvat tracks per-tenant spend via custom headers and surfaces it in the dashboard.

4. Streaming + early-stop

If your agent generates a multi-step plan, stream the first few tokens to detect "the agent is going down a wrong path" and cancel before paying for the full generation. Easier said than done — but worth it at scale.
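A minimal sketch of the mechanics, assuming the stream is exposed as an async iterable of text chunks and that the producer honours an AbortSignal. Both are assumptions about your client, and streamWithEarlyStop and looksWrong are hypothetical names. The hard part, deciding what "wrong path" means, lives entirely in the predicate you pass in.

```typescript
// Consume a text stream and cancel as soon as `looksWrong` fires on the text so far.
// The check runs only after at least `checkEveryChars` new characters have arrived.
async function streamWithEarlyStop(
  stream: AsyncIterable<string>,
  controller: AbortController,
  looksWrong: (textSoFar: string) => boolean,
  checkEveryChars = 200,
): Promise<{ text: string; aborted: boolean }> {
  let text = "";
  let lastCheck = 0;
  for await (const chunk of stream) {
    text += chunk;
    if (text.length - lastCheck < checkEveryChars) continue;
    lastCheck = text.length;
    if (looksWrong(text)) {
      controller.abort(); // tells the producer to stop generating
      return { text, aborted: true };
    }
  }
  return { text, aborted: false };
}
```

The predicate can start as a cheap string check (for example, the plan mentions a file outside the workspace) and grow from there; whether cancelling actually stops billing for the remaining tokens depends on the provider.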

Failure modes to test for

In order of cost:

Failure                                  | Cost impact            | Mitigation
-----------------------------------------|------------------------|-------------------------------------------------
Tool loop (calls same tool 20× in a row) | Catastrophic           | MAX_TURNS cap, duplicate-detection in tool layer
Context window overflow                  | Request fails outright | Conversation trimming, summarisation
Cache invalidation (silent miss)         | 10× per-request cost   | Audit prompt for non-stable bytes
Provider outage                          | Total agent failure    | Multi-provider failover
Tool error → agent doesn't recover       | Wasted turns           | Structured error responses
Hallucinated tool call                   | Wasted turn            | Strict schema validation

Build observability for all six before scaling. The "we'll add monitoring when we need it" path leads to surprise four-figure invoices.
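The duplicate-detection mitigation for tool loops can be quite small. This sketch counts consecutive identical calls and tells the orchestrator when to stop executing them; createRepeatGuard, the ToolCall shape, and the default threshold of 3 are assumptions for illustration.

```typescript
type ToolCall = { name: string; input: unknown };

// Returns a checker that reports true once the same call (same name, same serialised
// input) has been seen more than `maxRepeats` times in a row.
function createRepeatGuard(maxRepeats = 3) {
  let lastKey = "";
  let streak = 0;
  return function shouldBlock(call: ToolCall): boolean {
    const key = `${call.name}:${JSON.stringify(call.input)}`;
    streak = key === lastKey ? streak + 1 : 1;
    lastKey = key;
    return streak > maxRepeats;
  };
}
```

When the guard fires, returning a structured error such as { ok: false, error: "repeated_call" } in place of running the tool gives the model a chance to change approach well before MAX_TURNS is reached.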

Observability checklist

What you need in your logs from day one:

  • Per-request cost (input/output tokens + cache stats × rate)
  • Per-tenant aggregate spend
  • Tool call success/failure rate per tool
  • Agent loop turn count distribution
  • Cache hit rate distribution
  • p50/p95/p99 turn latency

Anvat ships all six in the dashboard out of the box. Self-hosted options: LiteLLM + Langfuse, Helicone, Portkey (in maintenance mode for new projects).
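If you are logging these yourself, the per-request numbers reduce to a few small functions. The sketch below assumes usage is reported with cached tokens counted separately from regular input tokens, and that rates are quoted in USD per million tokens. The Usage and Rates shapes are hypothetical, and no real prices appear here.

```typescript
// Assumed usage shape: inputTokens excludes tokens read from or written to the cache.
type Usage = {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
};

// USD per million tokens; supply your provider's current prices.
type Rates = { input: number; output: number; cacheRead: number; cacheWrite: number };

function requestCostUsd(u: Usage, r: Rates): number {
  const total =
    u.inputTokens * r.input +
    u.outputTokens * r.output +
    u.cacheReadTokens * r.cacheRead +
    u.cacheWriteTokens * r.cacheWrite;
  return total / 1e6;
}

// Share of prompt tokens served from cache for one request.
function cacheHitRate(u: Usage): number {
  const prompt = u.inputTokens + u.cacheReadTokens + u.cacheWriteTokens;
  return prompt === 0 ? 0 : u.cacheReadTokens / prompt;
}

// Nearest-rank percentile (p in 0..100) over a sample such as turn latencies.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Logging the raw Usage record alongside the computed cost makes it possible to re-price historical traffic when rates change.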

Where this is going

Three trends worth designing around:

  1. Models keep getting better at multi-step planning. The MAX_TURNS cap will keep relaxing. Build for 50+ turns, not 10.
  2. Subagent dispatch is becoming the dominant pattern. Anthropic's first-class subagent support in Claude Code is a leading indicator.
  3. Per-tenant cost attribution will be table stakes. Customers will demand to see their AI bill broken out by feature.

Bottom line

Building a coding agent that ships is mostly engineering, not ML research. The model is good enough. The hard parts are:

  1. A good orchestrator with bounded loops
  2. Idempotent tools with structured error data
  3. Prompt caching + model routing for cost
  4. A gateway that gives you optionality
  5. Observability from day one

Get those right and the agent works. Skip any one and you'll learn about it the expensive way.
