Building a coding agent in 2026 is no longer a research project. Claude Opus 4.8 is reliable enough to handle multi-file refactors autonomously, the tooling ecosystem (MCP, Anthropic Tool Use, OpenAI function calling) is mature, and the unit economics are predictable.
But "no longer a research project" is not the same as "trivially easy". Production coding agents fail in specific, predictable ways. This guide covers the architecture decisions that actually matter for scaling beyond the demo.
The architecture
Five layers, top to bottom:
┌────────────────────────────────────────┐
│ 1. Front-end / chat surface            │
├────────────────────────────────────────┤
│ 2. Orchestrator (your agent loop)      │
├────────────────────────────────────────┤
│ 3. Tool layer (MCP servers / native)   │
├────────────────────────────────────────┤
│ 4. LLM gateway (one key, many models)  │
├────────────────────────────────────────┤
│ 5. Provider APIs (Claude, GPT, …)      │
└────────────────────────────────────────┘
The interesting design decisions live in layers 2-4. Layers 1 and 5 are mostly off-the-shelf.
Layer 2: the orchestrator
The single most important design decision: how does the agent know when to stop? Three patterns we see work in production:
Pattern A: Bounded tool-call loop
let turns = 0;
const MAX_TURNS = 25;
while (turns < MAX_TURNS) {
  const response = await llm.create({ ... });
  // Record the assistant turn so the tool results that follow have context
  messages.push({ role: "assistant", content: response.content });
  if (response.stop_reason === "end_turn") break;
  if (response.stop_reason === "tool_use") {
    const toolResults = await executeTool(response.tool_uses);
    messages.push({ role: "user", content: toolResults });
  }
  // Count every iteration, so an unexpected stop_reason cannot spin forever
  turns++;
}
Simple, and works for roughly 90% of cases. The MAX_TURNS cap is your safety net against runaway loops. Set it generously (25-50) but always have it.
Pattern B: Plan-then-execute
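// 1. Use Opus 4.8 to plan
const plan = await opus.create({ messages: [{ role: "user", content: planningPrompt }] });
// 2. Use Sonnet 4.6 for each step in the plan
// (assumes the plan response has been parsed into steps, each with a prompt)
for (const step of plan.steps) {
  const result = await sonnet.create({ messages: [...history, { role: "user", content: step.prompt }] });
  // Carry the step and its result forward so later steps can see earlier work
  history.push({ role: "user", content: step.prompt }, { role: "assistant", content: result.content });
}
Higher quality on complex tasks (multi-file refactors, large feature implementations), at a higher cost per task. Use Opus for planning where quality matters, Sonnet for execution where cost matters.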
Pattern C: Subagent dispatch
const plan = await dispatcher.create({ ... }); // Opus
const results = await Promise.all(
  plan.subtasks.map((task) =>
    subagent.create({ // Sonnet or Haiku
      system: subagentSystemPrompt,
      messages: [{ role: "user", content: task.prompt }],
    })
  )
);
For parallelisable work (10 files to refactor independently, batch analysis). Anthropic supports this natively through Claude Code's subagent feature; for custom agents, implement with Promise.all against the same model.
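One caveat with a bare Promise.all is that it rejects as soon as any single subagent call fails, and it starts every subtask at once. The following is a minimal sketch of a more forgiving dispatcher, not a definitive implementation. `runSubagent` is a hypothetical stand-in for the `subagent.create` call above, and the `Subtask` shape and the default concurrency of 4 are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; not part of any SDK.
type Subtask = { id: string; prompt: string };
type SubtaskResult =
  | { id: string; ok: true; output: string }
  | { id: string; ok: false; error: string };

async function dispatchSubtasks(
  subtasks: Subtask[],
  runSubagent: (task: Subtask) => Promise<string>,
  concurrency = 4, // illustrative default, tune to your rate limits
): Promise<SubtaskResult[]> {
  const results: SubtaskResult[] = new Array(subtasks.length);
  let next = 0;

  // Each worker claims the next unclaimed subtask until none are left.
  async function worker(): Promise<void> {
    while (next < subtasks.length) {
      const index = next++;
      const task = subtasks[index];
      try {
        const output = await runSubagent(task);
        results[index] = { id: task.id, ok: true, output };
      } catch (err) {
        // A failed subtask is recorded as data so its siblings keep running.
        results[index] = { id: task.id, ok: false, error: String(err) };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, subtasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as the input subtasks
}
```

The dispatcher can then decide per subtask whether to retry, escalate to a stronger model, or report the failure, instead of losing the whole batch to one error.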
Layer 3: tool design
Three rules that pay off:
1. Make tools idempotent
The agent will retry. Tools must handle "I just ran this" gracefully — return cached results, no-op duplicate writes, return current state on repeated calls. Non-idempotent tools cause cascade failures when the agent loops on a partial result.
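One way to get the "no-op duplicate writes" behaviour is to wrap each tool so that a repeated call with the same arguments returns the earlier result. This is a minimal in-memory sketch under stated assumptions: `makeIdempotent` and `callKey` are hypothetical names, the cache lives for the life of the process, and only top-level argument keys are normalised.

```typescript
type ToolArgs = Record<string, unknown>;
type ToolFn<R> = (args: ToolArgs) => Promise<R>;

// Build a stable key so { a, b } and { b, a } count as the same call.
// Only top-level keys are sorted; nested objects are a known gap in this sketch.
function callKey(toolName: string, args: ToolArgs): string {
  const sorted = Object.keys(args)
    .sort()
    .map((k) => [k, args[k]]);
  return `${toolName}:${JSON.stringify(sorted)}`;
}

function makeIdempotent<R>(toolName: string, fn: ToolFn<R>): ToolFn<R> {
  const seen = new Map<string, Promise<R>>();
  return (args) => {
    const key = callKey(toolName, args);
    const cached = seen.get(key);
    if (cached) return cached; // repeated call: hand back the earlier result
    const pending = fn(args);
    seen.set(key, pending);
    // Failures are not cached, so a retry after an error really re-runs.
    pending.catch(() => seen.delete(key));
    return pending;
  };
}
```

A production version would likely scope the cache to one agent task and expire entries, since a read tool that caches forever would return stale state.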
2. Return errors as data, not exceptions
// Bad: agent has no information to recover from
tool.run({ ... }); // throws "FileNotFoundError"
// Good: agent can read the error and adapt
tool.run({ ... });
// → { ok: false, error: "file_not_found", path: "src/missing.ts",
//     suggestion: "list directory to find correct path" }
The model is much better at reading structured error data and correcting course than it is at handling thrown exceptions.
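A thin wrapper at the tool boundary is one way to guarantee that nothing reaches the agent as a thrown exception. The sketch below is illustrative only: `runTool`, the `ToolResult` shape, and the `SUGGESTIONS` table are hypothetical names, and the error codes shown are the Node.js-style `code` strings such as `ENOENT`.

```typescript
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string; message: string; suggestion?: string };

// Hypothetical mapping from error codes to recovery hints for the model.
const SUGGESTIONS: Record<string, string> = {
  ENOENT: "list directory to find correct path",
  EACCES: "check permissions or choose a different path",
};

async function runTool<T>(fn: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    // Pull a machine-readable code off the error when one is present.
    const code =
      typeof err === "object" && err !== null
        ? (err as { code?: unknown }).code
        : undefined;
    const error = typeof code === "string" ? code : "unknown_error";
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error, message, suggestion: SUGGESTIONS[error] };
  }
}
```

The result object can be serialised straight into the tool result message, so the model sees the error code and the hint on its next turn.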
3. Limit tool surface area
20 well-defined tools beat 100 vague ones. The model spends context budget on tool definitions; bloated tool sets eat your prompt cache and make tool selection less reliable. Audit ruthlessly — every tool needs to justify its presence.
Layer 4: the LLM gateway
This is where most teams accidentally lock themselves in. Decisions to make BEFORE you ship:
Decision: provider-direct vs gateway
If you're calling a single provider with one key, direct is fine. The moment you have:
- Multiple providers (Claude + GPT + Gemini)
- Multi-tenant traffic needing attribution
- Need to swap providers without redeploys
- Cost optimisation pressure
…a gateway pays for itself. Options:
- OpenRouter: broad model catalog, ~5.5% markup
- LiteLLM: self-hosted, MIT-licensed, you carry the ops burden
- Portkey: governance + guardrails, hosted SaaS
- Anvat: discount-focused, OpenAI + Anthropic shapes, 30% off list
Decision: which API shape?
Both Anthropic's /v1/messages and OpenAI's /v1/chat/completions work
fine for agent loops. Pick by which provider you'll use most often, then
let your gateway translate the rest.
Anvat exposes both shapes on one key. The pattern:
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
// Anthropic-shape calls for Claude
const anthropic = new Anthropic({
  baseURL: "https://api.anvat.app/v1",
  authToken: process.env.ANVAT_API_KEY,
});
// OpenAI-shape calls for GPT
const openai = new OpenAI({
  baseURL: "https://api.anvat.app/v1",
  apiKey: process.env.ANVAT_API_KEY,
});
One key, both SDKs, every frontier model.
Cost control patterns
Coding agents are unusually expensive. Defense in depth:
1. Prompt caching (mandatory)
Wrap system prompts + tool definitions + stable conversation prefix in
cache_control. Expect a 60-80% input cost reduction in exchange for roughly 30 minutes of implementation work.
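As a rough sketch of where the cache markers go, the function below builds an Anthropic-shape request body with `cache_control` on the system prompt and on the last tool definition, which marks the end of the cacheable prefix. The field shapes follow Anthropic's prompt caching format as commonly documented, so check them against the current API reference. `buildCachedRequest`, the `max_tokens` value, and the model string passed in are illustrative assumptions.

```typescript
type CacheControl = { type: "ephemeral" };
type TextBlock = { type: "text"; text: string; cache_control?: CacheControl };
type ToolDef = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
};
type Message = { role: "user" | "assistant"; content: string };

function buildCachedRequest(
  model: string,
  systemPrompt: string,
  tools: ToolDef[],
  messages: Message[],
) {
  // Marking the last tool caches the whole tool list up to that point.
  const cachedTools: ToolDef[] = tools.map((tool, i) =>
    i === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } }
      : tool,
  );
  // The system prompt is sent as a block so it can carry its own marker.
  const system: TextBlock[] = [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ];
  // max_tokens is an illustrative value, not a recommendation.
  return { model, max_tokens: 4096, tools: cachedTools, system, messages };
}
```

Anything that changes between requests (timestamps, request IDs, per-user text) belongs after the marked prefix, otherwise every call writes a new cache entry instead of reading one.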
2. Model routing
Don't default to Opus. Cheap classifier (Haiku) → escalate to Sonnet by default → escalate to Opus only for genuinely hard problems. Typical production traffic mix: 20% Opus, 60% Sonnet, 20% Haiku — yields ~50% of pure-Opus cost.
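The blended-cost figure depends entirely on the price ratio between tiers, so it is worth computing for your own rates. The sketch below uses illustrative relative prices with Opus normalised to 1.0. These ratios are assumptions, not published rates, and the calculation also assumes each tier sees similar token volume per request.

```typescript
type Tier = "opus" | "sonnet" | "haiku";

// Relative per-token prices, Opus = 1.0.
// Illustrative assumptions only; substitute your actual rates.
const RELATIVE_PRICE: Record<Tier, number> = { opus: 1.0, sonnet: 0.4, haiku: 0.1 };

// Returns blended cost as a fraction of what all-Opus traffic would cost.
function blendedCostRatio(
  mix: Record<Tier, number>,
  price: Record<Tier, number> = RELATIVE_PRICE,
): number {
  const tiers: Tier[] = ["opus", "sonnet", "haiku"];
  const totalShare = tiers.reduce((sum, t) => sum + mix[t], 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error("traffic shares must sum to 1");
  }
  const blended = tiers.reduce((sum, t) => sum + mix[t] * price[t], 0);
  return blended / price.opus;
}

// 0.2 * 1.0 + 0.6 * 0.4 + 0.2 * 0.1 = 0.46 of pure-Opus cost
const ratio = blendedCostRatio({ opus: 0.2, sonnet: 0.6, haiku: 0.2 });
console.log(ratio.toFixed(2));
```

With these assumed ratios the 20/60/20 mix lands near the "about half" figure. A narrower price gap between tiers shrinks the saving, and the classifier calls themselves add a small overhead that this calculation ignores.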
3. Per-tenant spend caps
In a multi-tenant SaaS, one runaway customer can eat your monthly budget. Cap per-tenant daily spend BEFORE letting users at the agent.
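interface Tenant {
  id: string;
  dailyLimit: number; // same unit that tenantDailySpend returns
}
async function dispatch(tenant: Tenant, prompt: string) {
  const spent = await tenantDailySpend(tenant.id);
  // Refuse once the cap is reached, before any more tokens are spent
  if (spent >= tenant.dailyLimit) {
    throw new Error("Daily AI budget exceeded");
  }
  return await agent.run(prompt);
}
Anvat tracks per-tenant spend via custom headers and surfaces it in the dashboard.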
4. Streaming + early-stop
If your agent generates a multi-step plan, stream the first few tokens to detect "the agent is going down a wrong path" and cancel before paying for the full generation. Easier said than done — but worth it at scale.
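The mechanics of early-stop can be sketched with an AbortController and any async iterable of text chunks. This is a minimal illustration, not a provider integration: `startStream` is a hypothetical function that begins a streaming request and honours the abort signal, `isOffTrack` is whatever check you choose to apply, and the 400-character inspection window is an arbitrary assumption.

```typescript
// Consume a text stream and abort as soon as the opening looks wrong.
async function streamWithEarlyStop(
  startStream: (signal: AbortSignal) => AsyncIterable<string>,
  isOffTrack: (textSoFar: string) => boolean,
  checkFirstChars = 400, // illustrative: only the start of the plan is inspected
): Promise<{ text: string; cancelled: boolean }> {
  const controller = new AbortController();
  let text = "";
  let windowClosed = false;

  for await (const chunk of startStream(controller.signal)) {
    text += chunk;
    if (!windowClosed) {
      if (isOffTrack(text.slice(0, checkFirstChars))) {
        controller.abort(); // signals the underlying request to stop
        return { text, cancelled: true };
      }
      // Stop checking once the inspection window has been fully seen.
      windowClosed = text.length >= checkFirstChars;
    }
  }
  return { text, cancelled: false };
}
```

The hard part is `isOffTrack` itself. A keyword or regex check is cheap but crude, while a second model call is more accurate but adds latency and cost of its own.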
Failure modes to test for
In order of cost:
| Failure | Cost impact | Mitigation |
|---|---|---|
| Tool loop (calls same tool 20× in a row) | Catastrophic | MAX_TURNS cap, duplicate-detection in tool layer |
| Context window overflow | Request fails outright | Conversation trimming, summarisation |
| Cache invalidation (silent miss) | 10× per-request cost | Audit prompt for non-stable bytes |
| Provider outage | Total agent failure | Multi-provider failover |
| Tool error → agent doesn't recover | Wasted turns | Structured error responses |
| Hallucinated tool call | Wasted turn | Strict schema validation |
Build observability for all six before scaling. The "we'll add monitoring when we need it" path leads to surprise four-figure invoices.
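For the first row of the table, duplicate detection in the tool layer can be as simple as counting consecutive identical calls. The guard below is a sketch under stated assumptions: `createDuplicateCallGuard` is a hypothetical helper, the threshold of 3 is illustrative, and arguments are compared by their JSON serialisation, so key order matters.

```typescript
// Flags a probable loop once the same tool call repeats too many times in a row.
function createDuplicateCallGuard(maxRepeats = 3) {
  let lastKey: string | null = null;
  let repeatCount = 0;

  // Returns true when this call should be blocked.
  return function shouldBlock(toolName: string, args: unknown): boolean {
    const key = `${toolName}:${JSON.stringify(args)}`;
    if (key === lastKey) {
      repeatCount++;
    } else {
      // A different call resets the streak.
      lastKey = key;
      repeatCount = 1;
    }
    return repeatCount > maxRepeats;
  };
}
```

When the guard trips, returning a structured error such as `{ ok: false, error: "repeated_call" }` to the agent gives it a chance to change approach before the MAX_TURNS cap ends the run.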
Observability checklist
What you need in your logs from day one:
- Per-request cost (input/output tokens + cache stats × rate)
- Per-tenant aggregate spend
- Tool call success/failure rate per tool
- Agent loop turn count distribution
- Cache hit rate distribution
- p50/p95/p99 turn latency
Anvat ships all six in the dashboard out of the box. Self-hosted options: LiteLLM + Langfuse, Helicone, Portkey (in maintenance mode for new projects).
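If you are computing these numbers yourself, the first and last checklist items reduce to two small functions. This is a sketch with assumed names: the `Usage` and `Rates` shapes are hypothetical, rates are expressed in USD per million tokens, and the percentile uses the nearest-rank method.

```typescript
type Usage = {
  inputTokens: number; // uncached input tokens
  outputTokens: number;
  cacheReadTokens: number; // input tokens served from cache
  cacheWriteTokens: number; // input tokens written to cache
};

// USD per million tokens. Supply your provider's actual rates.
type Rates = { input: number; output: number; cacheRead: number; cacheWrite: number };

function requestCostUsd(usage: Usage, rates: Rates): number {
  const total =
    usage.inputTokens * rates.input +
    usage.outputTokens * rates.output +
    usage.cacheReadTokens * rates.cacheRead +
    usage.cacheWriteTokens * rates.cacheWrite;
  return total / 1e6;
}

// Nearest-rank percentile over a sample, with p in the range 0..100.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Logging the four token counts separately, not just the computed cost, makes it possible to recompute spend later when rates change and to derive the cache hit rate from the same record.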
Where this is going
Three trends worth designing around:
- Models keep getting better at multi-step planning. The MAX_TURNS cap will keep relaxing. Build for 50+ turns, not 10.
- Subagent dispatch is becoming the dominant pattern. Anthropic's first-class subagent support in Claude Code is a leading indicator.
- Per-tenant cost attribution will be table stakes. Customers will demand to see their AI bill broken out by feature.
Bottom line
Building a coding agent that ships is mostly engineering, not ML research. The model is good enough. The hard parts are:
- A good orchestrator with bounded loops
- Idempotent tools with structured error data
- Prompt caching + model routing for cost
- A gateway that gives you optionality
- Observability from day one
Get those right and the agent works. Skip any one and you'll learn about it the expensive way.
Ship coding agents at half the cost
Anvat is OpenAI- and Anthropic-compatible — one key, both SDKs, every frontier model at 30% off list. Per-tenant attribution + cost dashboard included.