Anthropic's "Building Effective Agents" is the single best filter we know for what to build with LLMs in 2026. The thesis is plain: most "agent" projects should be workflows, not agents - because workflows ship and most agent demos do not. When you do need agentic behavior, three patterns dominate the production-ready set: routing (an LLM decides which workflow to invoke), orchestrator-workers (one LLM coordinates multiple workers), and evaluator-optimizer (one LLM judges another in a loop).
Across the engagements we audit, these three account for roughly 90% of the agent code that actually ships and stays shipped. The pattern that keeps failing - peer-to-peer multi-agent chatter - is also the most overhyped in 2025-era marketing. This is the field guide to the three production agent patterns: when each fits, what fails, and real examples we have seen go to production.
The TL;DR
- Three production AI agent patterns ship reliably in 2026: routing, orchestrator-workers, evaluator-optimizer.
- Routing fits triage, classification, and "which workflow do we run for this input."
- Orchestrator-workers fits complex tasks where the subtasks are not predictable until runtime (like coding, research synthesis, complex extraction).
- Evaluator-optimizer fits tasks with clear success criteria and where iteration measurably improves output (translation, code generation, structured-data extraction).
- Peer-to-peer multi-agent (agents talking to each other freely) fails in production: feedback loops, false consensus, runaway token spend.
- Anthropic's guidance is consistent: start with the simplest pattern that works, and add complexity only when it clearly pays.
Why "production AI agent patterns" is the right framing in 2026
Multi-agent frameworks proliferated through 2024-2025; production data through 2026 narrowed the field. The patterns that shipped reliably share three traits: control flow is mostly known (the LLM picks from a small, engineered, instrumented option set), failure modes are observable (bad behavior shows up in dashboards before user complaints), and cost is bounded (each pattern has a clean answer to worst-case spend per task; peer-to-peer multi-agent does not). Anthropic's agentic systems writeup makes this explicit: prefer workflows, use agents only when justified, and within agent territory the three patterns below are the ones to reach for first.
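To make "cost is bounded" concrete, here is a rough sketch of the worst-case call count for each of the three patterns. The function names are hypothetical, and the model is simplified: it assumes each routed branch is a single LLM call and that every call shares the same token ceiling and price, which real systems rarely do.

```python
def routing_max_calls() -> int:
    """One classifier call plus one call in the chosen branch (simplifying assumption)."""
    return 1 + 1


def orchestrator_max_calls(max_workers: int, retries_per_worker: int = 0) -> int:
    """One planning call, at most (1 + retries) calls per worker, one synthesis call."""
    return 1 + max_workers * (1 + retries_per_worker) + 1


def evaluator_optimizer_max_calls(max_iterations: int) -> int:
    """Each iteration is one generator call plus one evaluator call."""
    return 2 * max_iterations


def worst_case_cost_usd(max_calls: int, max_tokens_per_call: int,
                        usd_per_1k_tokens: float) -> float:
    """Upper bound on spend per task under a flat, hypothetical per-token price."""
    return max_calls * max_tokens_per_call / 1000 * usd_per_1k_tokens
```

A free-form peer-to-peer conversation has no equivalent formula unless you impose a turn or token ceiling from outside.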
Pattern 1: Routing
What it is
A first LLM call classifies the input and decides which downstream workflow or model handles it. The classifier is small, fast, and bounded. Each downstream branch is a separate workflow with its own prompt, tools, and guardrails.
The pattern is mature enough that several frameworks support it directly: LangGraph's conditional edges, OpenAI Agents SDK handoffs, and tool use in the Anthropic SDK, where the model's choice of tool acts as the routing decision.
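Stripped of any framework, routing reduces to a classifier plus a dispatch table. The sketch below is a minimal illustration, not any library's API: `make_router`, `keyword_classifier`, and the handler names are all hypothetical, and the keyword classifier stands in for what would be a small LLM call in practice.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_router(classify: Callable[[str], str],
                handlers: Dict[str, Handler],
                fallback: str) -> Handler:
    """Build a flat router: one classification, then one known branch."""
    if fallback not in handlers:
        raise ValueError(f"fallback branch {fallback!r} has no handler")

    def route(message: str) -> str:
        label = classify(message).strip().lower()
        # Any label outside the known branch set goes to the fallback branch.
        handler = handlers.get(label, handlers[fallback])
        return handler(message)

    return route


def keyword_classifier(message: str) -> str:
    """Stand-in for the LLM classifier call (assumption for this sketch)."""
    text = message.lower()
    if "invoice" in text or "charge" in text:
        return "billing"
    if "password" in text or "login" in text:
        return "account_access"
    return "unknown"


handlers: Dict[str, Handler] = {
    "billing": lambda m: "billing workflow",
    "account_access": lambda m: "account access workflow",
    "escalation": lambda m: "handed to a human",
}
route = make_router(keyword_classifier, handlers, fallback="escalation")
```

The fallback matters: an unrecognized label should land somewhere safe (here, a human) rather than raise or guess.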
When it fits
- Customer support triage: an inbound message is classified into intents such as billing, technical, refund, or sales lead, and each routes to a workflow tuned for that flow. Production deflection lands at 55-70% with this pattern, and CSAT lift is typically +18% in 90 days.
- Cost-tier routing: easy queries go to a budget-tier model like Haiku or Gemini Flash; hard queries go to Claude Opus or Sonnet 4.6. This cuts cost 60-90% with no quality hit on easy queries, and it is the single biggest cost lever in agent economics.
- Document classification: "is this an invoice, contract, claim, or purchase order?" routes to the right extractor.
- Email and ticket triage on internal queues.
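The cost-tier arithmetic is simple enough to sanity-check before building anything. The sketch below computes blended cost per query and the saving against a frontier-only setup. The function names are hypothetical, and the inputs in the usage note are illustrative placeholders, not real model prices.

```python
def blended_cost_per_query(easy_share: float, cheap_cost: float,
                           frontier_cost: float, router_cost: float = 0.0) -> float:
    """Average cost per query when easy queries go to the cheap tier."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return router_cost + easy_share * cheap_cost + (1.0 - easy_share) * frontier_cost


def savings_vs_frontier_only(easy_share: float, cheap_cost: float,
                             frontier_cost: float, router_cost: float = 0.0) -> float:
    """Fractional saving relative to sending every query to the frontier model."""
    blended = blended_cost_per_query(easy_share, cheap_cost, frontier_cost, router_cost)
    return 1.0 - blended / frontier_cost
```

For example, with 80% of queries classed as easy, a hypothetical $0.001 cheap-tier cost, $0.02 frontier cost, and $0.0005 router overhead per query, `savings_vs_frontier_only(0.8, 0.001, 0.02, 0.0005)` comes out at roughly 73%. The saving is driven almost entirely by the share of traffic that is genuinely easy, so measure that share before promising a number.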
What fails
- Routing into routing into routing: three layers of classifier is a debugger's nightmare. Stay flat.
- Open-ended classification: "decide what to do with this" is too broad. Routing works with a known, small branch set, typically 3-12.
- Skipping the eval set: 5% misclassification means 1 in 20 customers lands in the wrong workflow. Build an eval set of 50-200 real examples with ground-truth labels and measure against it on every prompt change.
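A routing eval does not need tooling to get started. The following sketch assumes the eval set is a list of `(text, expected_label)` pairs and that `classify` wraps whatever router prompt is under test; both names are hypothetical. It reports accuracy plus a count of which labels get confused with which.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def evaluate_router(classify: Callable[[str], str],
                    examples: Iterable[Tuple[str, str]]) -> Tuple[float, Counter]:
    """Return (accuracy, confusions) where confusions counts (expected, predicted) misses."""
    total = 0
    correct = 0
    confusions: Counter = Counter()
    for text, expected in examples:
        predicted = classify(text)
        total += 1
        if predicted == expected:
            correct += 1
        else:
            confusions[(expected, predicted)] += 1
    if total == 0:
        raise ValueError("eval set is empty")
    return correct / total, confusions


def passes_gate(accuracy: float, threshold: float) -> bool:
    """Simple regression gate to run on every prompt change."""
    return accuracy >= threshold
```

The confusion counts are often more useful than the headline accuracy: a router that is 95% accurate overall but sends every refund request to billing has one specific prompt problem, not a general one.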
Real example
A B2B SaaS we audited routes inbound support to four flows: account access, billing, product question, escalation. The router is Claude Haiku. Account access and billing are deterministic workflows that resolve without an agent. Product questions go to a RAG-backed Sonnet workflow over company docs. Escalations route to humans with a summary attached. Result: 64% tier-1 deflection, $0.04 per ticket, and routing accuracy above 94% for nine months. The pattern worked because it never tried to be clever. For the broader framing, see AI agents for business 2026.
Pattern 2: Orchestrator-workers
What it is
A central orchestrator LLM dynamically decomposes a task, dispatches subtasks to worker LLMs, collects results, and synthesizes the final output. The workers are typically the same model class as the orchestrator but with narrower prompts and tools. The orchestrator decides how many workers to spawn and what each one is responsible for.
This is the pattern Anthropic uses in Claude Code for complex code-modification tasks: one orchestrator decides "we need to update the auth module, the tests, and the migration script," dispatches a worker per surface, and synthesizes the final PR.
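The control flow described above (plan, dispatch, collect, synthesize) can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not how any particular product implements it: `plan`, `work`, and `synthesize` are hypothetical callables that would each wrap an LLM call, and `Subtask` is a made-up container.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Subtask:
    name: str
    instructions: str
    context: str  # only the slice of context this worker needs


def run_orchestrator(task: str,
                     plan: Callable[[str], List[Subtask]],
                     work: Callable[[Subtask], str],
                     synthesize: Callable[[str, List[Tuple[Subtask, str]]], str],
                     max_workers: int = 4) -> str:
    """Two-level flow: the orchestrator plans, workers run in parallel, then one synthesis."""
    subtasks = plan(task)
    # Hard ceiling on fan-out so the cost of a single task stays bounded.
    if len(subtasks) > max_workers:
        raise ValueError(f"plan produced {len(subtasks)} subtasks; limit is {max_workers}")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with subtasks.
        results = list(pool.map(work, subtasks))
    return synthesize(task, list(zip(subtasks, results)))
```

Workers receive a `Subtask` and nothing else, so they have no handle for spawning further workers and no access to the full conversation.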
When it fits
- Complex coding tasks: multi-file refactors, "implement this feature across the stack," large-scale code review.
- Research synthesis: "compare these 5 vendors against our criteria." Each worker handles one vendor or criterion, and the orchestrator synthesizes.
- Multi-source extraction: a contract has 12 clauses, one worker handles each clause class, and the orchestrator merges the results into the final structured object.
- Long-form document generation where each section has its own data sources and constraints.
What fails
- Workers calling workers: the design collapses into peer-to-peer the moment workers can spawn workers. Keep it two-level.
- Untrimmed worker context: passing the full conversation to every worker blows the token budget and latency. Workers should receive only the subtask and the context they need.
- No worker timeouts: a stuck worker stalls the whole task. Set a hard timeout per worker and let the orchestrator decide whether to retry, skip, or escalate.
- Synthesis as an afterthought: the synthesis prompt is where most quality is won or lost. It should not be "here are five outputs, combine them" but a real prompt with constraints, conflict resolution, and a quality bar.
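Per-worker timeouts are easy to sketch with `asyncio.wait_for`. The helper names below are hypothetical, and `worker` is assumed to be an async function wrapping one LLM call. A worker that keeps timing out yields `None`, which leaves the retry, skip, or escalate decision with the orchestrator.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def run_worker_with_timeout(worker: Callable[[str], Awaitable[str]],
                                  subtask: str,
                                  timeout_s: float,
                                  max_retries: int = 1) -> Optional[str]:
    """Run one worker under a hard timeout; return None if every attempt times out."""
    for _attempt in range(1 + max_retries):
        try:
            # wait_for cancels the worker coroutine when the timeout expires.
            return await asyncio.wait_for(worker(subtask), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue
    return None


async def run_all(worker: Callable[[str], Awaitable[str]],
                  subtasks: List[str],
                  timeout_s: float,
                  max_retries: int = 1) -> List[Optional[str]]:
    """Run all workers concurrently; results keep the order of subtasks."""
    return await asyncio.gather(
        *(run_worker_with_timeout(worker, s, timeout_s, max_retries) for s in subtasks)
    )
```

With this shape the worst-case latency of the fan-out step is roughly `timeout_s * (1 + max_retries)`, regardless of how many workers are running.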
Real example
A claims-processing engagement we built uses this pattern. A Sonnet 4.6 orchestrator ingests a claim packet (PDF + supporting docs + customer notes) and dispatches workers that extract claim form fields, extract supporting-doc fields, flag inconsistencies between them, and pull the customer's prior claim history from internal systems. The orchestrator synthesizes a structured claim record plus a confidence score and flagged anomalies. Anything below threshold routes to a human adjuster with the work pre-staged. Result: 89% straight-through processing vs the previous 53% with rule-based RPA, and 40% more accurate on variable layouts. The supporting human-in-the-loop architecture and API integration patterns are covered elsewhere.
Pattern 3: Evaluator-optimizer
What it is
One LLM generates a candidate output. A second LLM evaluates it against a rubric and either approves it or returns critique. The first LLM revises based on critique. The loop continues until the evaluator approves or a max-iteration ceiling hits.
This is the production-ready version of the "self-critique" pattern. The key separation: the generator and evaluator are separate calls with separate prompts, often the same model class but tuned for different objectives.
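The loop itself is small. Below is a minimal sketch with the iteration ceiling built in; `generate` and `evaluate` are hypothetical callables standing in for the two separately prompted LLM calls, and `Verdict` and `LoopResult` are made-up containers for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    approved: bool
    critique: str = ""


@dataclass
class LoopResult:
    output: str
    iterations: int
    approved: bool


def refine(task: str,
           generate: Callable[[str, Optional[str]], str],
           evaluate: Callable[[str, str], Verdict],
           max_iterations: int = 3) -> LoopResult:
    """Generate, evaluate, and revise until approved or the ceiling is reached."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    critique: Optional[str] = None
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = generate(task, critique)   # critique is None on the first pass
        verdict = evaluate(task, draft)
        if verdict.approved:
            return LoopResult(draft, iteration, True)
        critique = verdict.critique
    # Ceiling reached without approval: the caller should route this to a human.
    return LoopResult(draft, max_iterations, False)
```

Returning an explicit `approved=False` result, rather than raising or silently shipping the last draft, keeps the human-escalation path visible in the calling code.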
When it fits
- Translation and localization: the generator translates; the evaluator checks fluency, register, and terminology against a glossary.
- Code generation with tests: the generator writes; the evaluator runs tests; failures route back as critique.
- Structured-data extraction with validation: the generator extracts; the evaluator checks against the schema and named constraints.
- Long-form writing against a style guide and fact list.
What fails
- Same prompt for generator and evaluator: the evaluator just rubber-stamps. Use different prompts with different objectives.
- No iteration ceiling: you can burn 20+ calls on an edge case. Cap the loop at 3-5 iterations and route cases that hit the cap to humans.
- Vague rubrics: "is this good?" gets you mush. The evaluator needs a checklist with named criteria and a binary or numeric score per criterion.
- No ground-truth eval set: you think the loop improves quality; you should know.
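A rubric with named criteria and a binary score per criterion might be represented like this. It is a sketch under assumptions: `Criterion`, `score_draft`, and `critique_from` are hypothetical names, and each `check` could be a deterministic function (as in the usage note) or a wrapper around an evaluator LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str], bool]  # True means the draft passes this criterion


def score_draft(draft: str, rubric: List[Criterion]) -> Dict[str, bool]:
    """Binary score per named criterion."""
    return {criterion.name: criterion.check(draft) for criterion in rubric}


def passed(scores: Dict[str, bool]) -> bool:
    """The draft is approved only if every criterion passes."""
    return all(scores.values())


def critique_from(scores: Dict[str, bool]) -> str:
    """Turn failed criteria into critique text for the generator's next pass."""
    failed = [name for name, ok in scores.items() if not ok]
    return "" if not failed else "Failed criteria: " + ", ".join(failed)
```

For example, a rubric of `Criterion("has_title", lambda d: d.startswith("# "))` and `Criterion("under_50_words", lambda d: len(d.split()) <= 50)` gives the generator a named failure to fix instead of a general "make it better". Per-criterion results also make it possible to track which criteria fail most often across a ground-truth eval set.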
Real example
A SaaS support-content team we work with uses this for AI-drafted help articles. Generator (Sonnet) drafts from a Linear ticket plus relevant existing articles. Evaluator (also Sonnet, different prompt) checks against a rubric: factual accuracy vs source ticket, no broken jargon, house style, right structure, working links. Failed criteria return critique and the generator revises. In production: 71% pass on iteration one, 24% on two, 4% on three; 1% hit the max-iteration ceiling and route to a human. Average 12k tokens per article; 8-12 hours saved per article vs the previous human-only flow. The testing discipline that makes this work is in testing AI features with golden sets.
Why peer-to-peer multi-agent mostly fails in production
The pattern hyped through 2024-2025 - multiple "specialist" agents that talk to each other freely and converge on an answer - has a poor production record in 2026. The consistent failure modes:
- Feedback loops: two agents reinforce each other's mistakes; the consensus is wrong and the conversation looks reasonable.
- Runaway token spend: multi-agent chatter uses ~15x the tokens of equivalent single-agent flows, and a bug can run a session into thousands of dollars.
- No clean owner of the final output: who decided? The lead agent? The last to speak? The highest-confidence claim?
- Fragility to upstream changes.
The fix is mostly to not do it. Almost every peer-to-peer flow we have audited is better expressed as orchestrator-workers - same parallelism, clear hierarchy, bounded cost. The exception is research-style workflows exploring a wide problem space (Claude's Computer Use research, multi-agent debate research). Those are research surfaces, not production. Treat them as such.
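Whatever the pattern, bounded cost can also be enforced at runtime rather than only by design. The sketch below is one hypothetical way to do that: a per-task token budget that every LLM call charges against, and that fails loudly when a loop or fan-out goes further than planned. `TokenBudget` and `BudgetExceeded` are illustrative names, not part of any SDK.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task tries to spend past its token ceiling."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.max_tokens - self.used

    def charge(self, tokens: int) -> None:
        """Record token usage; refuse the charge if it would exceed the ceiling."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"task would use {self.used + tokens} tokens; ceiling is {self.max_tokens}"
            )
        self.used += tokens
```

One budget object per task, passed to every call site, turns a runaway session into a single logged exception instead of an invoice.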
A decision table for picking the pattern
| If the task is... | Use this pattern | Avoid |
|---|---|---|
| Triage to one of N known workflows | Routing | Building a single big agent |
| Cost optimization across model tiers | Routing | Frontier-only architectures |
| Complex task with unpredictable subtasks | Orchestrator-workers | Peer-to-peer multi-agent |
| Multi-source extraction with synthesis | Orchestrator-workers | Single-agent loop |
| Output has clear success criteria + iterates well | Evaluator-optimizer | Single-shot generation |
| Translation, structured extraction, code with tests | Evaluator-optimizer | Hand review on every output |
| Open-ended exploration with no clear stopping point | Reconsider whether you need an agent | Peer-to-peer "agent debate" |
Where to start
If you are scoping or auditing an agent project in 2026:
- Read Anthropic's Building Effective Agents first. Twenty minutes. Single biggest filter on the field.
- Default to a workflow. If your problem has a known path, you do not need an agent.
- If it is genuinely agentic, pick one of the three patterns. Routing, orchestrator-workers, or evaluator-optimizer. Not a custom architecture you read about on Twitter.
- Build the eval set before the agent. Without eval, you cannot tell whether a prompt change improved or regressed the system.
- Instrument from day one. Token counts, retry counts, tool failure rates, cost per task.
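As a starting point for the instrumentation item, the four metrics named in the list fit in one small per-task record. This is a hypothetical sketch: the field names are made up, and the prices passed to `cost_usd` are placeholders that should come from your provider's current price sheet.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    input_tokens: int = 0
    output_tokens: int = 0
    retries: int = 0
    tool_calls: int = 0
    tool_failures: int = 0

    def record_call(self, input_tokens: int, output_tokens: int,
                    retried: bool = False) -> None:
        """Record one LLM call; mark it as a retry if it repeats a failed attempt."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        if retried:
            self.retries += 1

    def record_tool(self, ok: bool) -> None:
        """Record one tool invocation and whether it succeeded."""
        self.tool_calls += 1
        if not ok:
            self.tool_failures += 1

    @property
    def tool_failure_rate(self) -> float:
        return self.tool_failures / self.tool_calls if self.tool_calls else 0.0

    def cost_usd(self, usd_per_1k_input: float, usd_per_1k_output: float) -> float:
        """Cost per task, given per-1k-token prices supplied by the caller."""
        return (self.input_tokens / 1000 * usd_per_1k_input
                + self.output_tokens / 1000 * usd_per_1k_output)
```

Emitting one such record per task to whatever logging or metrics system is already in place is usually enough to catch cost and reliability regressions before users do.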
We covered the broader landscape in AI agents for business 2026, team economics in the economics of an AI-augmented engineering team, and supporting architecture in human-in-the-loop architecture. For the customer-facing UX side, conversational chat agent UI design and designing for trust in UX with AI features are the right reads.
If you are figuring out which pattern fits a roadmap project, that is the conversation we run as part of our AI Integration, SaaS Development, and API Integration services. We will tell you straight when the answer is "this is a workflow, not an agent."
Want a second opinion on an agent project or which pattern fits? Contact us for a free 30-minute consultation and we will run the patterns above against your specific case.