Multi-Agent Orchestration Academy
Design, build, and orchestrate teams of specialized AI agents. From single-agent limits to production-grade multi-agent architectures — with interactive code examples.
When to Go Multi-Agent
Single agent limits and the case for teams
Multi-agent is not always the answer. A single well-prompted agent with good context engineering handles most tasks. Multi-agent architecture introduces coordination cost, latency, and complexity. Use it when — and only when — the benefits outweigh these costs.
“The future of AI is not a single model doing everything, but specialized agents collaborating — each with its own context, tools, and expertise.”
Andrew Ng
Founder, DeepLearning.AI
“Multi-agent systems let you decompose complex workflows into manageable pieces. Each agent gets a clean context window focused on what it does best.”
Harrison Chase
CEO, LangChain
When a Single Agent Breaks Down
Context Window Pollution
When a single agent handles research, coding, testing, and review, its context fills with unrelated information. Research papers dilute coding context, test output obscures security review, and quality degrades across all tasks.
Tool Overload
A single agent given 20+ tools struggles to select the right one. Reported evaluations suggest tool-selection errors rise sharply (roughly threefold in some studies) once toolsets exceed about 30 tools, and the model wastes tokens reading irrelevant tool descriptions.
Conflicting Objectives
Asking one agent to both write code AND review it for security creates conflicting incentives. The agent tends to be lenient with its own code. Separate agents provide genuine independent review.
Sequential Bottleneck
A single agent can only do one thing at a time. If a task has independent subtasks (e.g., frontend + backend + tests), a single agent runs them sequentially while a team runs them in parallel.
Decision Framework: Single vs Multi-Agent
Does the task involve more than 2 distinct roles?
Yes
Multi-agent: Give each role a focused agent
No
Single agent with role-switching instructions
Does the agent need more than 10 tools?
Yes
Multi-agent: Split tools across specialists
No
Single agent with dynamic tool selection (RAG)
Are there independent subtasks that could run in parallel?
Yes
Multi-agent: Fan out to parallel agents
No
Single agent with sequential execution
Does the task require self-evaluation or peer review?
Yes
Multi-agent: Separate creator and reviewer
No
Single agent with output validation
Is the task expected to run for 30+ minutes?
Yes
Multi-agent: Isolate contexts to prevent degradation
No
Single agent with compaction may suffice
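As a sketch, the framework above can be collapsed into a checklist function. The `TaskProfile` shape and the two-or-more-yes threshold are illustrative assumptions, not a published rule:

```typescript
// Sketch of the decision framework as a checklist function. The
// "2+ yes answers" cutoff is an illustrative heuristic, not a standard.
interface TaskProfile {
  distinctRoles: number;        // distinct roles the task involves
  toolCount: number;            // tools a single agent would need
  hasParallelSubtasks: boolean; // independent subtasks that could run concurrently
  needsPeerReview: boolean;     // requires genuinely independent review
  expectedMinutes: number;      // expected runtime
}

function recommendArchitecture(p: TaskProfile): "single-agent" | "multi-agent" {
  const yesAnswers = [
    p.distinctRoles > 2,
    p.toolCount > 10,
    p.hasParallelSubtasks,
    p.needsPeerReview,
    p.expectedMinutes >= 30,
  ].filter(Boolean).length;
  // One "yes" can often be absorbed by a well-prompted single agent;
  // two or more is where context isolation starts to pay off.
  return yesAnswers >= 2 ? "multi-agent" : "single-agent";
}
```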
Cost / Benefit Analysis
| Factor | Single Agent | Multi-Agent | Verdict |
|---|---|---|---|
| Latency | Lower (one LLM call per step) | Higher (orchestration + multiple calls) | Single agent wins for simple tasks |
| Token Cost | Lower total tokens | Higher total, but each call is more focused | Multi-agent costs 2-5x more |
| Quality (complex tasks) | Degrades with context pollution | Maintains quality via isolation | Multi-agent wins for complex tasks |
| Reliability | Single failure point | Multiple failure points, but recoverable | Depends on error handling |
| Debuggability | One conversation to trace | Multiple conversations + coordination logs | Single agent easier to debug |
Key Insight
Anthropic's own recommendation: “Start with a single agent. Only split into multi-agent when you have clear evidence that a single agent's context is being degraded by mixed concerns.” The best multi-agent system is one where every agent earns its place through measurable quality improvement.
Orchestration Patterns
Sequential, parallel, hierarchical, swarm
There are five fundamental patterns for orchestrating multiple agents. Each pattern makes different tradeoffs between latency, cost, quality, and complexity. Choose the simplest pattern that meets your requirements.
Sequential Pipeline
Parallel Fan-Out / Fan-In
Hierarchical (Supervisor)
Swarm / Peer-to-Peer
Debate / Consensus
Agents execute in a fixed order, each passing output to the next. The simplest multi-agent pattern and the right starting point for most workflows.
How It Works
- 1.Define a fixed sequence of agents with clear roles
- 2.Each agent receives the previous agent's output as input
- 3.The orchestrator passes structured summaries between stages
- 4.Each agent has its own context window, tools, and system prompt
- 5.Final output is the last agent's result
Strengths
- +Simple to implement and debug
- +Clear data flow — easy to trace failures
- +Each agent gets a clean, focused context
- +Deterministic execution order
Weaknesses
- -No parallelism — total latency is sum of all agents
- -A failure in any stage blocks the entire pipeline
- -Can't handle tasks with independent subtasks
Real-World Example
Claude Code uses a sequential pipeline for complex refactoring: analyze codebase -> plan changes -> implement -> run tests -> fix failures.
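A minimal sketch of the sequential pattern, with placeholder `research`/`implement`/`verify` functions standing in for real LLM-backed agents:

```typescript
// Minimal sequential pipeline: each stage receives only the previous
// stage's output, keeping every agent's context clean and focused.
type Agent = (input: string) => Promise<string>;

async function runPipeline(stages: Agent[], task: string): Promise<string> {
  let output = task;
  for (const stage of stages) {
    output = await stage(output); // each stage sees only the prior summary
  }
  return output;
}

// Stand-in agents; a real system would make an LLM call in each.
const research: Agent = async (t) => `${t} | researched`;
const implement: Agent = async (t) => `${t} | implemented`;
const verify: Agent = async (t) => `${t} | verified`;
```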
An orchestrator fans out independent subtasks to multiple agents running in parallel, then fans in (aggregates) their results.
How It Works
- 1.Orchestrator decomposes task into independent subtasks
- 2.Independent subtasks are dispatched to agents in parallel (Promise.all)
- 3.Each agent works in its own context window simultaneously
- 4.An aggregator agent combines results into a final output
- 5.Dependencies are resolved before fan-out to avoid deadlocks
Strengths
- +Significantly faster for tasks with independent subtasks
- +Near-linear speedup for embarrassingly parallel tasks
- +Each parallel agent gets a clean context
- +Partial results available if some agents fail
Weaknesses
- -Higher token cost (multiple simultaneous LLM calls)
- -Aggregation step can be complex for conflicting outputs
- -Not suitable when subtasks depend on each other
Real-World Example
LangGraph uses fan-out to process multiple documents in parallel during RAG, then fans in to produce a single synthesized answer.
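A sketch of fan-out/fan-in, using `Promise.allSettled` (rather than `Promise.all`) so that partial results survive individual agent failures:

```typescript
// Fan-out/fan-in: independent subtasks run concurrently; failed
// branches are counted rather than sinking the whole batch.
type Worker = (subtask: string) => Promise<string>;

async function fanOutFanIn(
  subtasks: string[],
  worker: Worker,
  aggregate: (results: string[]) => string
): Promise<{ result: string; failures: number }> {
  const settled = await Promise.allSettled(subtasks.map(worker));
  const ok = settled
    .filter((s): s is PromiseFulfilledResult<string> => s.status === "fulfilled")
    .map((s) => s.value);
  return { result: aggregate(ok), failures: settled.length - ok.length };
}
```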
A supervisor agent manages a team of worker agents, delegating tasks, reviewing results, and deciding next steps dynamically based on intermediate outcomes.
How It Works
- 1.Supervisor receives the task and decides which agents to involve
- 2.Supervisor delegates subtasks with specific instructions
- 3.Workers execute and return results to the supervisor
- 4.Supervisor reviews results, decides if revision is needed
- 5.Supervisor can re-delegate, request changes, or finalize
Strengths
- +Dynamic — supervisor adapts strategy based on results
- +Quality control — supervisor reviews before finalizing
- +Can handle complex, unpredictable workflows
- +Natural escalation path for difficult subtasks
Weaknesses
- -Supervisor becomes a single point of failure and bottleneck
- -Supervisor's context grows with each worker interaction
- -Higher latency due to review loops
- -Supervisor LLM costs can exceed worker costs
Real-World Example
CrewAI's hierarchical process uses a manager agent that delegates to crew members, reviews their output, and decides whether to accept or request revisions.
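The delegate-review-revise loop can be sketched as follows; the `acceptable` predicate stands in for the supervisor LLM's judgment, and the revision cap is an illustrative guard against endless review loops:

```typescript
// Supervisor loop sketch: delegate, review, and re-delegate up to a cap.
type WorkerFn = (instruction: string) => Promise<string>;

async function supervise(
  worker: WorkerFn,
  instruction: string,
  acceptable: (draft: string) => boolean, // stand-in for supervisor review
  maxRevisions = 2
): Promise<{ result: string; revisions: number }> {
  let draft = await worker(instruction);
  let revisions = 0;
  while (!acceptable(draft) && revisions < maxRevisions) {
    revisions++;
    draft = await worker(`${instruction} (revision ${revisions})`);
  }
  return { result: draft, revisions };
}
```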
Agents hand off control to each other directly, without a central orchestrator. Each agent decides which agent should handle the conversation next.
How It Works
- 1.Each agent has a list of other agents it can transfer to
- 2.When an agent determines another is better suited, it transfers
- 3.The conversation context transfers with the handoff (or a summary)
- 4.No central coordinator — agents self-organize
- 5.Each agent includes transfer functions in its tool definitions
Strengths
- +No single point of failure (no central orchestrator)
- +Natural for customer service routing and triage
- +Low overhead — no orchestrator LLM calls
- +Agents can specialize without rigid pipeline structure
Weaknesses
- -Difficult to debug — no central coordination log
- -Agents can enter infinite handoff loops
- -Context transfer between agents can be lossy
- -Hard to ensure all necessary work gets done
Real-World Example
OpenAI Swarm uses this pattern — agents include transfer_to_agent() functions in their tools. A triage agent routes to specialists who can route to each other.
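A sketch of peer-to-peer handoffs; the agent names and hop cap are invented for illustration (real frameworks like Swarm express handoffs as tool calls that return another agent):

```typescript
// Swarm-style handoffs: each agent either answers or names the next
// agent. The hop cap guards against the infinite-handoff-loop failure
// mode described above.
interface HandoffResult { answer?: string; transferTo?: string }
type SwarmAgent = (message: string) => HandoffResult;

function runSwarm(
  agents: Record<string, SwarmAgent>,
  start: string,
  message: string,
  maxHops = 5
): string {
  let current = start;
  for (let hop = 0; hop < maxHops; hop++) {
    const out = agents[current](message);
    if (out.answer !== undefined) return out.answer;
    if (!out.transferTo || !(out.transferTo in agents)) break;
    current = out.transferTo;
  }
  throw new Error(`no answer after ${maxHops} hops (possible handoff loop)`);
}
```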
Multiple agents independently generate solutions, then debate or vote to reach consensus. Used when correctness matters more than speed.
How It Works
- 1.Multiple agents independently analyze the same problem
- 2.Each agent produces a solution with reasoning
- 3.Agents are shown each other's solutions and asked to critique
- 4.Multiple rounds of debate refine positions
- 5.A judge agent (or majority vote) selects the final answer
Strengths
- +Higher accuracy through independent verification
- +Catches errors that a single agent might miss
- +Reduces hallucination through cross-validation
- +Especially effective for reasoning and analysis tasks
Weaknesses
- -3-5x more expensive than single-agent (multiple full analyses)
- -Significantly higher latency (multiple rounds)
- -Debate can converge on a wrong answer if all agents share a bias
- -Diminishing returns beyond 3 agents in most cases
Real-World Example
Research on multi-agent debate, from Google DeepMind and others, has shown improved factual accuracy and reasoning: multiple agents independently verify claims and argue for their conclusions.
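The simplest consensus judge is a majority vote over the independently produced answers (real systems typically add critique rounds before voting):

```typescript
// Majority-vote judge: pick the most common answer among agents.
// Ties fall back to the earliest answer, an arbitrary but deterministic choice.
function majorityVote(answers: string[]): string {
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  let best = answers[0];
  for (const [answer, n] of counts) {
    if (n > (counts.get(best) ?? 0)) best = answer;
  }
  return best;
}
```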
Choosing the Right Pattern
Most teams should start with Sequential Pipeline and only adopt more complex patterns when they hit specific limitations:
- 1.Need speed? Move independent steps to Parallel Fan-Out
- 2.Need quality control? Add a Supervisor for review loops
- 3.Need flexible routing? Use Swarm for agent-to-agent handoffs
- 4.Need accuracy? Use Debate for critical decisions
Agent Communication
Message passing, shared state, event-driven
How agents communicate determines the quality, speed, and reliability of your multi-agent system. The wrong communication pattern creates context pollution, race conditions, and debugging nightmares. The right pattern keeps each agent focused while enabling effective coordination.
Direct Message Passing
Shared Blackboard
Event-Driven / Pub-Sub
Request-Response (RPC-style)
Agents communicate by sending structured messages directly to each other through the orchestrator. The orchestrator acts as a message router, deciding which agent receives which messages.
How It Works
- 1.Agent A completes its task and produces a structured output
- 2.The orchestrator extracts relevant information from the output
- 3.The orchestrator formats the information as input for Agent B
- 4.Agent B receives only the filtered, relevant context
- 5.The orchestrator logs all inter-agent messages for debugging
Strengths
- +Simple to implement and reason about
- +Orchestrator controls information flow — prevents context pollution
- +Easy to add message validation and filtering
- +Clear audit trail of what each agent received
Weaknesses
- -Orchestrator is a bottleneck for all communication
- -Sequential by nature — agents can't communicate concurrently
- -Orchestrator's context grows with each message routed
A shared data structure (the 'blackboard') that all agents can read from and write to. Agents coordinate by reading the current state and contributing their results. Originally from AI research of the 1970s and '80s (e.g., the Hearsay-II speech-understanding system).
How It Works
- 1.Define a typed shared state object with sections per agent
- 2.Each agent reads the blackboard to understand current state
- 3.Agents write their results to their designated section
- 4.Other agents reactively read updated sections when needed
- 5.Locking prevents concurrent write conflicts
Strengths
- +Agents can work independently — no direct coupling
- +Any agent can read any other agent's results
- +Natural fit for parallel execution patterns
- +State persists across agent restarts and failures
Weaknesses
- -Requires careful access control to prevent corruption
- -Agents can become dependent on stale data
- -Debugging is harder — no clear message trail
- -Need locking or versioning for concurrent writes
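A sketch of a typed blackboard using a version counter for optimistic concurrency, a lightweight alternative to locking; the section names are illustrative:

```typescript
// Blackboard with per-agent sections and a version counter: a write
// only succeeds if nothing changed since the writer last read, so
// stale writers must re-read and retry instead of corrupting state.
interface Blackboard {
  research?: string;
  code?: string;
  review?: string;
}

class VersionedBlackboard {
  private state: Blackboard = {};
  private version = 0;

  read(): { state: Blackboard; version: number } {
    return { state: { ...this.state }, version: this.version };
  }

  write(section: keyof Blackboard, value: string, expectedVersion: number): boolean {
    if (expectedVersion !== this.version) return false; // stale write rejected
    this.state[section] = value;
    this.version++;
    return true;
  }
}
```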
Agents publish events when they complete work or need assistance. Other agents subscribe to relevant events and react accordingly. Decouples producers from consumers.
How It Works
- 1.Define event types: TaskCompleted, NeedsReview, ErrorOccurred, etc.
- 2.Agents publish events to an event bus when they complete steps
- 3.Other agents subscribe to events they care about
- 4.Events carry minimal payload — just enough to trigger the subscriber
- 5.Subscribers fetch full context from shared state if needed
Strengths
- +Highly decoupled — agents don't need to know about each other
- +Easy to add new agents without modifying existing ones
- +Natural fit for reactive, event-driven architectures
- +Supports complex coordination without central bottleneck
Weaknesses
- -Harder to reason about overall flow (event spaghetti)
- -Debugging requires event tracing infrastructure
- -Event ordering and exactly-once delivery are non-trivial
- -Can lead to cascading event storms if not throttled
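A minimal in-process event bus sketch; the event type names mirror the examples above, and a production system would use a durable broker instead:

```typescript
// Minimal pub-sub bus: publishers and subscribers are fully decoupled,
// and new agents subscribe without modifying existing ones.
type EventType = "TaskCompleted" | "NeedsReview" | "ErrorOccurred";
type Handler = (payload: { taskId: string; summary: string }) => void;

class EventBus {
  private subscribers = new Map<EventType, Handler[]>();

  subscribe(type: EventType, handler: Handler): void {
    const list = this.subscribers.get(type) ?? [];
    list.push(handler);
    this.subscribers.set(type, list);
  }

  publish(type: EventType, payload: { taskId: string; summary: string }): void {
    for (const handler of this.subscribers.get(type) ?? []) handler(payload);
  }
}
```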
Agents call each other like function calls — sending a request with parameters and waiting for a structured response. The most familiar pattern for developers, mapping cleanly to async/await.
How It Works
- 1.Define typed request/response interfaces for each agent
- 2.Calling agent sends a typed request to the target agent
- 3.Target agent processes the request and returns a typed response
- 4.Calling agent blocks (or awaits) until the response arrives
- 5.Timeouts prevent indefinite waiting on failed agents
Strengths
- +Familiar mental model — just like function calls
- +Strong typing catches integration errors at compile time
- +Easy to test each agent independently with mock requests
- +Clear request-response pairing simplifies debugging
Weaknesses
- -Synchronous by default — blocks the calling agent
- -Tight coupling between caller and callee interfaces
- -Cascading failures if a downstream agent is slow or down
- -Not suitable for broadcast or multi-receiver communication
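A sketch of a typed request-response call with a timeout guard; the `ReviewRequest`/`ReviewResponse` interfaces and the reviewer agent are invented for illustration:

```typescript
// RPC-style agent call: typed request/response plus a timeout so a
// slow or dead agent cannot block the caller indefinitely.
interface ReviewRequest { code: string }
interface ReviewResponse { approved: boolean; notes: string }

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`agent timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// Stand-in reviewer agent; a real one would be an LLM call.
async function reviewerAgent(req: ReviewRequest): Promise<ReviewResponse> {
  return { approved: req.code.includes("test"), notes: "checked tests" };
}
```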
Communication Protocol Design Principles
Keep payloads under 500 tokens
Inter-agent messages should be summaries, not dumps. If you're passing more than 500 tokens between agents, you're leaking context that should stay in the sender's window.
Use structured formats, not free text
Define TypeScript interfaces for all inter-agent messages. Free-text messages are unpredictable and hard to validate. Structured messages can be schema-validated before delivery.
Include metadata with every message
Every message should carry: sender ID, timestamp, message type, and a correlation ID linking it to the original task. This metadata is essential for debugging and observability.
Separate data from control signals
Don't mix task results with coordination instructions in the same message. Use different message types for 'here are my results' vs 'please review this' vs 'I failed, need fallback'.
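The four principles above might be encoded like this; the envelope fields are illustrative, as is the chars-per-token assumption (~4 chars/token, so 500 tokens is roughly 2,000 characters):

```typescript
// Structured message envelope: metadata on every message, and a
// discriminated union separating data ("result") from control signals
// ("review-request", "failure").
interface MessageMeta {
  sender: string;        // sender agent ID
  timestamp: number;     // epoch millis
  correlationId: string; // links the message to the originating task
}

type AgentMessage =
  | ({ kind: "result"; summary: string } & MessageMeta)
  | ({ kind: "review-request"; target: string } & MessageMeta)
  | ({ kind: "failure"; reason: string } & MessageMeta);

// Cheap pre-delivery validation; a real system might use Zod instead.
function isValidMessage(m: AgentMessage, maxSummaryChars = 2000): boolean {
  if (!m.sender || !m.correlationId) return false;
  if (m.kind === "result" && m.summary.length > maxSummaryChars) return false;
  return true;
}
```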
Key Insight
The best inter-agent communication feels like a well-run standup meeting: each participant shares a brief status update with just enough context for others to do their job. If your agents are sending each other novels, your communication pattern is wrong.
Task Decomposition
Breaking complex tasks into agent-sized pieces
Task decomposition is the art of breaking a complex task into agent-sized pieces — subtasks that are small enough for a single agent to handle with a focused context, yet large enough to produce meaningful output. Good decomposition is the difference between a multi-agent system that works and one that wastes tokens on coordination overhead.
Tasks are split into predefined stages at development time. The pipeline structure is fixed — every task follows the same sequence of agents regardless of complexity.
How It Works
- 1.Define a fixed sequence of stages at development time (research -> code -> test)
- 2.Every task flows through all stages in order
- 3.Each stage maps to a specific agent with a predetermined role
- 4.No runtime adaptation — the pipeline is the same for simple and complex tasks
Pros
- +Predictable execution
- +Easy to debug and monitor
- +No orchestrator LLM calls needed
Cons
- -Wastes time on unnecessary stages for simple tasks
- -Can't handle tasks that need additional stages
- -One-size-fits-all approach
An orchestrator LLM analyzes the task and dynamically generates a set of subtasks with dependencies. The decomposition adapts to task complexity — simple tasks get 2 subtasks, complex tasks get 8+.
How It Works
- 1.Orchestrator LLM receives the task description and list of available agent types
- 2.LLM produces a JSON array of subtasks with agent assignments and dependencies
- 3.Dependencies are validated as a DAG (no circular dependencies)
- 4.Subtasks with met dependencies execute in parallel
- 5.Orchestrator can add subtasks mid-execution if needed
Pros
- +Adapts to task complexity
- +Enables parallel execution of independent subtasks
- +Can handle novel task types without code changes
Cons
- -Orchestrator LLM call adds latency and cost
- -Decomposition quality depends on orchestrator model
- -Risk of over-decomposition (too many subtasks)
A top-level orchestrator breaks the task into major phases, then sub-orchestrators further decompose each phase into atomic subtasks. Scales to very complex projects.
How It Works
- 1.Top-level orchestrator splits task into 2-4 major phases
- 2.Each phase is assigned to a sub-orchestrator
- 3.Sub-orchestrators decompose their phase into atomic subtasks
- 4.Atomic subtasks are assigned to worker agents
- 5.Results flow back up: workers -> sub-orchestrators -> top-level orchestrator
Pros
- +Scales to very complex tasks
- +Each level has manageable scope
- +Sub-orchestrators can specialize
Cons
- -Multiple levels of orchestration overhead
- -Coordination between sub-orchestrators is complex
- -Debugging across levels is challenging
Task Graph Concepts
Dynamic decomposition produces a dependency graph — a DAG (Directed Acyclic Graph) that determines execution order and parallelization opportunities.
Dependency Graph (DAG)
A directed acyclic graph where nodes are subtasks and edges represent dependencies: if subtask B depends on A, then A must complete before B starts. The graph must contain no cycles.
Topological Sort
An ordering of subtasks such that every dependency comes before the tasks that depend on it. This determines the execution order and identifies which tasks can run in parallel.
Critical Path
The longest chain of dependent subtasks that determines the minimum total execution time. Optimizing the critical path has the biggest impact on overall latency.
Fan-Out Width
The maximum number of subtasks that can run in parallel at any point. Higher fan-out means more parallelism but also more concurrent LLM calls and higher peak cost.
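These concepts tie together in one small routine: computing execution "waves" with a level-by-level variant of Kahn's algorithm. The number of waves is the critical-path length, and the widest wave is the fan-out width. A sketch:

```typescript
// Given subtask -> dependency-list pairs, group subtasks into waves.
// Tasks in the same wave have all dependencies met and can run in
// parallel; an empty wave before completion means the graph has a cycle.
function executionWaves(deps: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const waves: string[][] = [];
  const all = Object.keys(deps);
  while (done.size < all.length) {
    const wave = all.filter(
      (t) => !done.has(t) && deps[t].every((d) => done.has(d))
    );
    if (wave.length === 0) throw new Error("cycle detected in task graph");
    wave.forEach((t) => done.add(t));
    waves.push(wave);
  }
  return waves;
}
```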
Decomposition Guidelines
Each subtask should be completable by one agent in one call
If a subtask requires multiple LLM calls or multiple tools, it's not atomic enough. Break it down further until each subtask is a single, focused unit of work.
Subtask descriptions must be self-contained
An agent receiving a subtask should have all the context it needs in the subtask description plus the shared state. It should never need to 'guess' what previous agents did.
Minimize inter-subtask dependencies
The more dependencies, the less parallelism. Design subtasks to be as independent as possible. If two subtasks share a dependency, consider merging them or restructuring.
Set complexity estimates for capacity planning
Tag each subtask as low/medium/high complexity. This helps the orchestrator allocate resources: simple subtasks might use a smaller model, complex ones get the strongest model.
Include validation criteria in each subtask
Every subtask should define what 'done' looks like. This enables automatic validation before passing results downstream and prevents garbage from propagating through the pipeline.
Key Insight
The ideal subtask size is one that fills 30-50% of an agent's context window. Too small and you waste tokens on coordination overhead. Too large and the agent's context gets polluted with mixed concerns. When in doubt, err on the side of fewer, larger subtasks — you can always split further if quality degrades.
Error Handling & Recovery
Retries, fallbacks, and graceful degradation
In a multi-agent system, failures are multiplicative, not additive. If each agent has a 90% success rate, a 5-agent pipeline succeeds only 59% of the time (0.9^5). Production multi-agent systems need defense in depth: retries, fallbacks, circuit breakers, and graceful degradation at every level.
- 2 agents @ 90% each -> 81%
- 3 agents @ 90% each -> 73%
- 5 agents @ 90% each -> 59%
- 8 agents @ 90% each -> 43%
Pipeline success rate = (individual agent success rate) ^ (number of agents). This is why error handling isn't optional in multi-agent systems — it's the difference between a 59% and 99% success rate.
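The compounding math can be checked directly:

```typescript
// Pipeline success compounds multiplicatively across agents.
const pipelineSuccessRate = (perAgent: number, agents: number): number =>
  Math.pow(perAgent, agents);
// e.g. pipelineSuccessRate(0.9, 5) ≈ 0.59
```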
Common Failure Modes
Agent Output Failure
Agent produces malformed, incomplete, or hallucinated output that doesn't match the expected schema.
Rate Limiting / Throttling
LLM API returns 429 (rate limit) or 529 (overloaded). Multiple agents hitting the same API amplifies this.
Context Overflow
Agent's input exceeds the model's context window limit, causing the API call to fail entirely.
Deadlock / Circular Wait
Agent A waits for Agent B's output, but Agent B is waiting for Agent A. Pipeline hangs indefinitely.
Orchestrator Failure
The orchestrator itself fails — hallucinating a bad plan, crashing mid-coordination, or exceeding its own context limit.
Recovery Patterns
Automatically retry failed agent calls with increasing delays between attempts. The simplest and most effective error handling pattern. Handles transient failures (rate limits, network issues) without code changes.
Implementation
- 1.Set max retries per agent (typically 2-3)
- 2.Initial delay: 1 second, then 2s, 4s, 8s (exponential)
- 3.Add jitter (random 0-1s) to prevent thundering herd
- 4.Log each retry attempt with the error for debugging
- 5.After max retries, fall through to fallback strategy
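The steps above can be sketched as a generic retry wrapper; delays are parameters so callers (and tests) can tune them:

```typescript
// Retry with exponential backoff plus jitter. After maxRetries the
// error is rethrown so the caller can fall through to a fallback.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
  jitterMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // exhausted: caller falls back
      const delay = baseDelayMs * 2 ** attempt + Math.random() * jitterMs;
      console.error(`attempt ${attempt + 1} failed, retrying in ${Math.round(delay)}ms`);
      await sleep(delay);
    }
  }
}
```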
For critical pipeline steps, define an alternative agent that can produce acceptable (if lower quality) output when the primary agent fails. Trade quality for reliability.
Implementation
- 1.Identify critical pipeline steps where failure is unacceptable
- 2.Define a fallback agent: simpler model, fewer tools, more constrained prompt
- 3.Fallback activates only after primary exhausts all retries
- 4.Fallback output is tagged so downstream agents know it's degraded
- 5.Monitor fallback activation rate — if too high, fix the primary agent
If an agent fails repeatedly, stop calling it entirely for a cooldown period. Prevents wasting tokens and time on a consistently failing agent. Borrowed from microservice architecture.
Implementation
- 1.Track failure rate per agent in a sliding window (e.g., last 10 calls)
- 2.If failure rate exceeds threshold (e.g., 50%), open the circuit
- 3.While open: all calls to that agent immediately return the fallback
- 4.After cooldown period (e.g., 60 seconds), allow one test call
- 5.If test succeeds, close the circuit. If it fails, extend cooldown.
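A sketch of the sliding-window breaker described above, with the example thresholds (last 10 calls, 50% failures, 60-second cooldown) as defaults:

```typescript
// Circuit breaker over a sliding window of recent call outcomes.
// While open, callers should skip the agent and use the fallback.
class CircuitBreaker {
  private outcomes: boolean[] = []; // true = success
  private openedAt: number | null = null;

  constructor(
    private windowSize = 10,
    private failureThreshold = 0.5,
    private cooldownMs = 60_000
  ) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    const failures = this.outcomes.filter((o) => !o).length;
    if (this.outcomes.length >= this.windowSize / 2 &&
        failures / this.outcomes.length >= this.failureThreshold) {
      this.openedAt = Date.now();
    }
  }

  isOpen(now = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // cooldown elapsed: allow one test call
      return false;
    }
    return true;
  }
}
```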
Persist intermediate results after each pipeline step. On failure, resume from the last successful checkpoint instead of restarting the entire pipeline.
Implementation
- 1.After each agent completes, save its output to a durable store (DB, S3)
- 2.Each checkpoint includes: task ID, step name, output, timestamp
- 3.On pipeline failure, query checkpoints for the failing task
- 4.Resume execution from the step after the last successful checkpoint
- 5.Expired checkpoints are cleaned up after a configurable TTL
When non-critical agents fail, continue the pipeline with reduced functionality rather than failing entirely. Return partial results with clear indicators of what's missing.
Implementation
- 1.Classify each agent as critical (must succeed) or optional (nice to have)
- 2.If an optional agent fails, skip it and continue the pipeline
- 3.Mark the skipped step in the output so consumers know what's missing
- 4.Collect all partial results and return them with a completeness score
- 5.Example: code review skipped but code + tests still produced
Error Handling Checklist
Every agent call is wrapped in try/catch with structured error logging
Retries with exponential backoff on all LLM API calls (2-3 attempts)
Timeouts on every agent call (30-120 seconds depending on task complexity)
Schema validation (Zod) on agent outputs before passing downstream
Fallback agents defined for all critical pipeline steps
Checkpointing after each successful pipeline step
Circuit breaker on agents with high failure rates
Graceful degradation for non-critical agents
Deadlock detection via DAG validation before execution
Pipeline-level timeout to catch undetected hangs
Key Insight
The goal is not to prevent all failures — that's impossible with LLMs. The goal is to make failures recoverable. A well-designed multi-agent system fails gracefully: it retries, falls back, checkpoints progress, and returns partial results rather than nothing. The user should rarely see a raw error.
Frameworks
CrewAI, AutoGen, LangGraph, custom solutions
You don't need to build multi-agent orchestration from scratch. Several frameworks handle the hard parts — message routing, state management, error handling, and coordination. Choose based on your workflow complexity, team expertise, and production requirements.
CrewAI
LangGraph
AutoGen
OpenAI Swarm
Custom Solution
AI agents that work together like a real crew
CrewAI is built around the metaphor of a human team: you define agents with roles (Researcher, Writer, Editor), assign them tasks, and let them collaborate. Supports sequential and hierarchical processes with built-in memory and tool integration.
Key Features
Strengths
- +Intuitive role-based API — easy for beginners
- +Built-in sequential and hierarchical orchestration
- +Agent memory and learning across tasks
- +Large ecosystem of pre-built tools
- +Active community and frequent updates
Limitations
- -Less control over state management than LangGraph
- -Hierarchical process can be opaque (manager decisions)
- -Python-first (TypeScript support is newer)
- -Debugging complex interactions can be challenging
Build complex agent workflows as graphs
LangGraph models agent workflows as directed graphs with nodes (agents/functions) and edges (transitions). Provides fine-grained control over state, conditional routing, cycles, and human-in-the-loop. The most flexible framework for complex workflows.
Key Features
Strengths
- +Fine-grained control over state and transitions
- +Conditional edges for dynamic routing
- +Supports cycles (revision loops, iterative refinement)
- +Built-in checkpointing and state persistence
- +Human-in-the-loop patterns are first-class
- +TypeScript and Python SDKs
Limitations
- -Steeper learning curve (graph programming model)
- -More boilerplate than CrewAI for simple workflows
- -Requires more upfront design (defining state schema, edges)
- -Graph debugging tools are still maturing
Multi-agent conversations with humans in the loop
AutoGen (from Microsoft Research) models multi-agent systems as conversations between agents. Agents take turns speaking, can call tools, and can involve human participants. Excels at debate, brainstorming, and iterative refinement patterns.
Key Features
Strengths
- +Natural conversational model — agents chat with each other
- +Strong human-in-the-loop support
- +Group chat with multiple agents
- +Code execution sandbox built-in
- +Backed by Microsoft Research
Limitations
- -Conversation-based model doesn't fit all workflows
- -Agent ordering in group chat can be unpredictable
- -Heavier resource usage (multi-turn conversations)
- -API has undergone significant changes (v0.1 to v0.4)
Simple agent handoffs, nothing more
Swarm is OpenAI's experimental, educational framework that focuses on one thing: agent-to-agent handoffs. Agents include transfer functions in their tools, enabling fluid routing between specialists. Intentionally minimal — no state management, no orchestrator, no persistence.
Key Features
Strengths
- +Extremely simple — under 500 lines of code
- +No abstraction overhead — just functions and handoffs
- +Easy to understand and extend
- +Great for learning multi-agent concepts
- +Natural fit for customer support routing
Limitations
- -No built-in state management
- -No error handling or retry logic
- -No persistence or checkpointing
- -Experimental — not production-ready as-is
- -OpenAI models only (no multi-provider support)
Build exactly what you need, nothing you don't
Build your own orchestration layer using raw LLM APIs. Maximum control, maximum responsibility. Justified when existing frameworks don't support your specific orchestration pattern, or when you need tight integration with existing infrastructure.
Key Features
Strengths
- +Complete control over every aspect of orchestration
- +No framework overhead or abstractions to work around
- +Tight integration with existing infrastructure
- +Can optimize for specific performance requirements
- +No dependency on framework maintenance/updates
Limitations
- -Months of development for robust orchestration
- -Must build retry logic, state management, error handling
- -No community support or shared patterns
- -High maintenance burden over time
- -Risk of reinventing existing solutions poorly
Framework Comparison Matrix
| Criterion | CrewAI | LangGraph | AutoGen | Swarm | Custom |
|---|---|---|---|---|---|
| Learning Curve | Low | Medium-High | Medium | Very Low | High |
| Flexibility | Medium | Very High | Medium | Low | Maximum |
| Production Readiness | High | High | Medium | Low | Depends |
| TypeScript Support | Partial | Full | Limited | No | Full |
| State Management | Built-in | Advanced | Basic | None | Build it |
| Error Handling | Built-in | Built-in | Basic | None | Build it |
Choosing Your Framework
- 1.CrewAI if you want to ship a role-based agent team this week
- 2.LangGraph if you need fine-grained control over complex, stateful workflows
- 3.AutoGen if your use case is naturally conversational (debate, brainstorming, iterative code)
- 4.Swarm for learning, prototyping, or simple handoff routing
- 5.Custom only when existing frameworks genuinely can't support your pattern
Production Patterns
Scaling, monitoring, and deploying agent teams
Getting a multi-agent system working in development is the easy part. Getting it to work reliably at scale in production requires careful attention to scaling, monitoring, deployment strategies, and cost management. These patterns separate toy demos from production systems.
Scaling Patterns
Decouple agent execution from request handling using message queues. Each agent polls a queue for tasks, processes them, and publishes results. Enables horizontal scaling — add more agent workers when load increases.
How It Works
- 1.Incoming tasks are published to a task queue (Redis, SQS, Kafka)
- 2.Agent workers poll their designated queue for tasks
- 3.Each worker processes one task at a time with isolated context
- 4.Results are published to an output queue or stored in shared state
- 5.The orchestrator coordinates by publishing and consuming queue messages
Key Benefit
Scale each agent type independently. Add more 'researcher' workers during peak research demand without scaling the entire system.
Maintain a pool of pre-initialized agents that can handle tasks immediately. A load balancer distributes incoming subtasks across available agents, preventing any single agent from becoming a bottleneck.
How It Works
- 1.Initialize a pool of N agents per role at startup
- 2.Each agent in the pool is pre-configured with system prompt and tools
- 3.Load balancer assigns incoming subtasks to idle agents
- 4.Agents return to the pool after completing their task
- 5.Pool size auto-scales based on queue depth and latency metrics
Key Benefit
Eliminates cold-start latency. Agents are ready to work immediately, reducing per-task overhead from seconds to milliseconds.
Not every agent needs the most powerful (and expensive) model. Route simple subtasks to smaller, cheaper models and reserve large models for complex reasoning and decision-making.
How It Works
- 1.Classify each subtask by complexity: low, medium, high
- 2.Low-complexity tasks route to small models (GPT-4o-mini, Claude Haiku)
- 3.Medium tasks use standard models (GPT-4o, Claude Sonnet)
- 4.High-complexity tasks get the strongest model (Claude Opus, GPT-4.5)
- 5.The orchestrator itself can use a mid-tier model for routing decisions
Key Benefit
Reduce costs by 60-80% without quality degradation. Most subtasks in a pipeline don't need the strongest model.
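The routing logic might look like the following sketch. The model identifiers echo the examples above, and the classification rule is a placeholder assumption; real classifiers would use task metadata or a cheap model:

```typescript
// Route subtasks to model tiers by complexity.
type Complexity = "low" | "medium" | "high";
type Subtask = { toolCount: number; requiresPlanning: boolean };

const MODEL_TIERS: Record<Complexity, string> = {
  low: "claude-haiku",     // cheap, fast: formatting, extraction
  medium: "claude-sonnet", // standard: most coding and analysis
  high: "claude-opus",     // expensive: planning, hard reasoning
};

// A deterministic classifier is often enough; only fall back to an LLM
// classifier when complexity cannot be inferred from task metadata.
function classify(task: Subtask): Complexity {
  if (task.requiresPlanning) return "high";
  return task.toolCount > 2 ? "medium" : "low";
}

function pickModel(task: Subtask): string {
  return MODEL_TIERS[classify(task)];
}
```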
Key Metrics to Monitor
Pipeline Success Rate
Percentage of end-to-end pipeline executions that complete without error. Target: >95% for production.
Agent-Level Success Rate
Success rate per individual agent. Identifies which specific agent is the weak link in the pipeline.
Orchestration Overhead Ratio
Tokens spent on coordination (routing, summarizing, decision-making) vs actual task work. High overhead means your architecture is too complex.
End-to-End Latency (P95)
95th percentile of total pipeline duration. Helps identify outliers caused by slow agents, retries, or deadlocks.
Token Cost per Task
Total token spend across all agents for a single task. Track this per task type to identify expensive patterns.
Retry Rate
How often agent calls need retries. High retry rates indicate systemic issues with prompts, tools, or model selection.
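Two of these metrics, the orchestration overhead ratio and P95 latency, can be computed directly from per-call records. The `CallRecord` shape is an assumption about what your logging layer captures:

```typescript
// Per-call record as emitted by the logging layer (assumed shape).
type CallRecord = { agent: string; tokens: number; latencyMs: number; isCoordination: boolean };

// Fraction of total tokens spent on coordination rather than task work.
function overheadRatio(calls: CallRecord[]): number {
  const coord = calls.filter((c) => c.isCoordination).reduce((s, c) => s + c.tokens, 0);
  const total = calls.reduce((s, c) => s + c.tokens, 0);
  return total === 0 ? 0 : coord / total;
}

// P95: the latency below which 95% of samples fall (nearest-rank method).
function p95(latencies: number[]): number {
  const sorted = [...latencies].sort((x, y) => x - y);
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```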
Deployment Strategies
Blue-Green Deployment
Run two versions of your agent pipeline simultaneously. Route traffic to the 'blue' (current) version while testing 'green' (new). Switch traffic when green is validated.
- 1.Deploy new agent configurations to 'green' environment
- 2.Run evaluation suite against green with production-like traffic
- 3.Compare quality metrics between blue and green
- 4.If green passes, switch traffic to it. If not, discard green and keep serving blue.
- 5.Keep blue running for instant rollback if issues emerge
Canary Releases
Update one agent at a time, routing a small percentage of traffic to the new version. Monitor for regressions before rolling out fully.
- 1.Update a single agent (e.g., researcher) to a new version
- 2.Route 5% of traffic to the new version, 95% to the old
- 3.Monitor quality and latency metrics for the canary
- 4.Gradually increase canary traffic: 5% -> 25% -> 50% -> 100%
- 5.Rollback to old version if any metric degrades
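The traffic split in step 2 can be done deterministically by hashing the request ID, so a given request always sees the same version as the percentage ramps up. This is a sketch, not any specific gateway's API:

```typescript
// Hash a request ID into a stable bucket in [0, 100).
function hashToPercent(id: string): number {
  let h = 0;
  for (const ch of id) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

// Requests whose bucket falls under the rollout percentage hit the canary.
// The same request ID always routes the same way at a given percentage.
function pickVersion(requestId: string, canaryPercent: number): "canary" | "stable" {
  return hashToPercent(requestId) < canaryPercent ? "canary" : "stable";
}
```

Ramping 5% to 25% to 50% to 100% is then just a config change to `canaryPercent`, and rollback is setting it to 0.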
Feature Flags
Use feature flags to enable or disable specific agent capabilities (new tools, prompt changes, model upgrades) without redeploying.
- 1.Wrap new agent capabilities in feature flags
- 2.Enable flags for internal testing first
- 3.Gradually roll out to production users by percentage
- 4.Kill switch: disable any flag instantly if issues arise
- 5.Track metrics per flag to measure capability impact
Observability Stack
Tracing
End-to-end trace of every pipeline execution, showing which agents ran, in what order, what they produced, and how long each took.
Logging
Structured logs for every agent call: input tokens, output tokens, model used, latency, success/failure, retry count.
Metrics
Aggregated dashboards showing pipeline success rates, per-agent metrics, cost trends, and latency distributions.
Alerting
Automated alerts when metrics cross thresholds: success rate drops, latency spikes, cost anomalies, or agent-specific failures.
The Production Readiness Checklist
Before deploying a multi-agent system to production, verify:
- 1.Every agent call has retry logic, timeouts, and fallback strategies
- 2.Pipeline results are checkpointed after each step
- 3.End-to-end tracing is enabled for every pipeline execution
- 4.Dashboards show per-agent success rate, latency, and token cost
- 5.Alerts fire on pipeline success rate drops and cost anomalies
- 6.You can roll back individual agents without redeploying the pipeline
Interactive Examples
See multi-agent patterns in action with live code
See multi-agent patterns in action. Each example shows a common mistake and its production-ready fix. Toggle between them to understand the difference.
When one agent tries to do everything vs specialized agents
// BAD: Single agent handles research, code, tests, review
const response = await llm.generate({
  system: `You are a senior engineer. You must:
1. Research the best database for this use case
2. Design the schema
3. Write the migration code
4. Write integration tests
5. Review the code for security issues
6. Write deployment documentation
Do ALL of this in one conversation.`,
  tools: [
    searchWeb, readDocs, queryDatabase,
    writeFile, readFile, runCommand,
    runTests, lintCode, checkSecurity,
    writeMarkdown, deployService,
    // 20+ more tools...
  ],
  messages, // Context gets polluted with mixed concerns
});
Why this fails
A single agent juggling 6 different roles pollutes its context with unrelated information. Research findings dilute coding context, test output obscures security review, and the agent loses track of what it was doing. Tool overload (20+ tools) further degrades performance.
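For contrast, a hedged sketch of the fixed version: three focused agents, each with two or three tools, chained by passing compact outputs rather than sharing one polluted context. The `llm.generate` object below is a local stub standing in for a real client so the structure is runnable; in practice each call would hit an actual model:

```typescript
// Stub standing in for a real LLM client (illustrative, not a real API).
const llm = {
  generate(opts: { system: string; tools: string[]; input: string }): string {
    return `[${opts.tools.length} tools] handled: ${opts.input}`;
  },
};

// GOOD: each role is a focused agent with a narrow toolset and clean context.
function researcher(task: string): string {
  return llm.generate({
    system: "You are a database researcher. Recommend one database and explain why.",
    tools: ["searchWeb", "readDocs"], // 2 tools, not 20+
    input: task,
  });
}

function coder(researchSummary: string): string {
  return llm.generate({
    system: "You write schema migrations from a research summary.",
    tools: ["writeFile", "readFile", "runCommand"],
    input: researchSummary, // a summary, never the researcher's full transcript
  });
}

function securityReviewer(code: string): string {
  return llm.generate({
    system: "You are an independent security reviewer. You did not write this code.",
    tools: ["lintCode", "checkSecurity"],
    input: code,
  });
}

// The orchestrator chains focused agents, passing only compact handoffs.
const review = securityReviewer(coder(researcher("pick a database for analytics")));
```

Note that the reviewer genuinely did not write the code it reviews, which removes the conflicting-objectives problem from the single-agent version.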
All Examples Quick Reference
Single Agent vs Agent Team
When one agent tries to do everything vs specialized agents
Inter-Agent Context Passing
How to pass information between agents without context pollution
Multi-Agent Error Handling
Graceful failure when agents in a pipeline fail
Shared State Management
How agents coordinate through shared state without conflicts
Task Decomposition
Breaking complex tasks into parallelizable subtasks
Framework Selection
Choosing the right multi-agent framework for your use case
Anti-Patterns & Failure Modes
Agent sprawl, deadlocks, context leakage, and more
Multi-agent systems introduce unique failure modes that don't exist in single-agent architectures. These anti-patterns are documented from real production failures across the industry — understanding them is the difference between a demo that works and a system that stays working.
Agent Sprawl
Creating too many agents for a task that doesn't need them, adding complexity without proportional benefit.
Cause
Prematurely splitting into multi-agent before validating that a single agent can't handle the task. Adding agents for every minor subtask.
Symptom
High latency from excessive inter-agent communication. Increased costs from multiple LLM calls. More failure points than a single-agent approach. Simple tasks taking 10x longer.
Fix
Start with a single agent. Only split when you have clear evidence of context pollution or tool overload. A good heuristic: if an agent needs fewer than 3 tools and its context stays under 50% of the window, it doesn't need splitting.
Communication Overload
Agents sending too many messages or too much data between each other, overwhelming the system with inter-agent chatter.
Cause
No protocol for what information gets passed between agents. Agents forwarding entire conversation histories instead of summaries. No message size limits.
Symptom
Orchestrator's context window fills up with agent outputs. Exponential token costs from message passing. Agents spending more time communicating than working. Latency spikes from serializing large messages.
Fix
Define a strict communication protocol. Pass structured summaries (under 500 tokens) between agents, not raw outputs. Use a shared state store for large artifacts instead of message passing. Implement message budgets per agent.
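Such a protocol can be as simple as a typed handoff message with a hard size budget. The field names and the 2,000-character proxy for a ~500-token limit are illustrative choices:

```typescript
// A strict handoff message: structured fields, never a raw transcript.
interface Handoff {
  fromAgent: string;
  decisions: string[];      // key decisions only
  artifactKeys: string[];   // large outputs live in shared state, passed by key
  openQuestions: string[];
}

// Rough character proxy for a ~500-token budget (assumed ratio).
const MAX_HANDOFF_CHARS = 2000;

function validateHandoff(h: Handoff): Handoff {
  const size = JSON.stringify(h).length;
  if (size > MAX_HANDOFF_CHARS) {
    throw new Error(`handoff too large: ${size} chars (limit ${MAX_HANDOFF_CHARS})`);
  }
  return h;
}
```

Rejecting oversized handoffs at the boundary forces agents to summarize, and routing large artifacts through shared state keeps them out of every downstream context.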
Orchestrator Single Point of Failure
The orchestrator agent becomes a bottleneck — if it fails, the entire system fails with no recovery path.
Cause
All coordination runs through one agent with no redundancy. No checkpointing of intermediate results. No fallback orchestration strategy.
Symptom
A single orchestrator failure loses all progress from completed subtasks. Rate limit on the orchestrator model blocks the entire pipeline. If the orchestrator hallucinates a bad plan, all downstream agents execute the wrong tasks.
Fix
Checkpoint results after each pipeline step. Use a durable state store (database, not in-memory) so progress survives crashes. Implement a simple fallback orchestrator that can resume from checkpoints. Consider hierarchical orchestration where sub-orchestrators handle independent branches.
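A checkpoint layer can be sketched in a few lines. A `Map` stands in here for the durable store the fix recommends; the run and step names are invented for illustration:

```typescript
// In-memory stand-in for a durable checkpoint store (a database in production).
const store = new Map<string, string>();

// Persist a completed step's result so a crashed run can resume.
function checkpoint(runId: string, step: string, result: string): void {
  store.set(`${runId}:${step}`, result);
}

// Given the full step list, return only the steps that still need to run.
function resumeFrom(runId: string, steps: string[]): string[] {
  return steps.filter((s) => !store.has(`${runId}:${s}`));
}

checkpoint("run-1", "research", "use postgres");
checkpoint("run-1", "schema", "CREATE TABLE ...");
const remaining = resumeFrom("run-1", ["research", "schema", "migrate", "review"]);
```

A fallback orchestrator only needs `resumeFrom` to pick up exactly where the primary died, without redoing completed work.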
Context Leakage
Sensitive information from one agent's context unintentionally leaks into another agent's context through shared state or message passing.
Cause
No access controls on shared state. Agents passing full context instead of filtered summaries. PII or secrets included in inter-agent messages without scrubbing.
Symptom
Customer PII from a support agent ends up in a reporting agent's context. API keys from a deployment agent leak into a logging agent. Agents reference information they shouldn't have access to.
Fix
Implement scoped state access — each agent can only read/write designated state sections. Scrub PII and secrets from inter-agent messages. Use role-based access control for shared state. Audit what information flows between agents.
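Scoped writes can be enforced at the state-access layer. This sketch assumes the simplest policy, each agent owning the section named after it; real systems would use a richer access-control map:

```typescript
// Shared state keyed by section; each agent owns exactly one section.
type SharedState = Record<string, Record<string, string>>;

// Agents may read anything but write only their own section.
function writeState(state: SharedState, agent: string, section: string, key: string, value: string): void {
  if (agent !== section) {
    throw new Error(`access denied: agent "${agent}" cannot write section "${section}"`);
  }
  if (!state[section]) state[section] = {};
  state[section][key] = value;
}

const state: SharedState = {};
writeState(state, "researcher", "researcher", "recommendation", "postgres");
```

PII scrubbing and audit logging would hook into the same choke point, since every write passes through one function.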
Orchestration Overhead
The coordination cost between agents exceeds the actual work being done, making multi-agent slower and more expensive than single-agent.
Cause
Too many small agents that each do trivial work. Synchronous coordination when async would suffice. The orchestrator makes an LLM call for every routing decision.
Symptom
80% of token spend goes to orchestration, not task execution. Simple tasks take 30+ seconds due to sequential agent coordination. Cost per task is 5-10x higher than single-agent baseline with no quality improvement.
Fix
Measure orchestration overhead explicitly: track tokens spent on coordination vs actual work. Use deterministic routing (code, not LLM) when the routing logic is simple. Batch small tasks into a single agent call. Only use LLM-based orchestration when the routing decision genuinely requires intelligence.
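When the routing rule is simple, plain code replaces an orchestrator LLM call entirely, as in this keyword-based sketch; the roles and patterns are illustrative:

```typescript
// Deterministic routing: zero tokens, zero latency, fully testable.
type Route = "coder" | "researcher" | "reviewer";

// First matching pattern wins; order encodes priority.
const KEYWORD_ROUTES: Array<[RegExp, Route]> = [
  [/\b(review|audit|security)\b/i, "reviewer"],
  [/\b(implement|fix|refactor)\b/i, "coder"],
];

function route(task: string): Route {
  for (const [pattern, target] of KEYWORD_ROUTES) {
    if (pattern.test(task)) return target;
  }
  return "researcher"; // default: anything ambiguous goes to research first
}
```

Reserve the LLM router for the residual cases this default catches, and the per-task orchestration cost drops to near zero for the common paths.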
Deadlock
Agents waiting on each other in a cycle, causing the pipeline to hang indefinitely.
Cause
Agent A waits for Agent B's output, but Agent B is waiting for Agent A. Poorly defined dependency graphs with cycles. No timeout on agent-to-agent waiting.
Symptom
Pipeline hangs indefinitely with no error message. Agents appear idle but are actually waiting on each other. Resource usage stays constant (agents are alive but blocked).
Fix
Validate the dependency graph is a DAG (directed acyclic graph) before execution. Implement timeouts on all inter-agent waits. Use topological sort to determine execution order. Add cycle detection to your orchestration layer.
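The DAG validation and execution-order steps can both be handled by Kahn's algorithm, sketched here over a map from each agent to the agents that consume its output:

```typescript
// Kahn's algorithm: repeatedly emit nodes with no remaining dependencies.
// `deps` maps each agent to the agents that depend on its output.
function topoSort(deps: Record<string, string[]>): string[] {
  const indegree = new Map<string, number>();
  for (const node of Object.keys(deps)) indegree.set(node, 0);
  for (const targets of Object.values(deps)) {
    for (const t of targets) indegree.set(t, (indegree.get(t) ?? 0) + 1);
  }
  const queue = [...indegree.keys()].filter((n) => indegree.get(n) === 0);
  const order: string[] = [];
  while (queue.length > 0) {
    const n = queue.shift()!;
    order.push(n);
    for (const t of deps[n] ?? []) {
      indegree.set(t, indegree.get(t)! - 1);
      if (indegree.get(t) === 0) queue.push(t);
    }
  }
  // If some node never reached indegree 0, the graph contains a cycle.
  if (order.length !== indegree.size) {
    throw new Error("dependency graph has a cycle: agents would deadlock");
  }
  return order;
}
```

Running this check before execution turns a silent production hang into an immediate, debuggable error.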
Best Practices Checklist
Production-ready guidelines from leading AI teams
Production-ready guidelines distilled from Anthropic, LangChain, Microsoft Research, CrewAI, and teams running multi-agent systems at scale. These practices prevent the anti-patterns described in the previous section.
Start with one agent, split only when needed
Begin with a single well-prompted agent. Only introduce multi-agent when you see clear evidence of context pollution, tool overload, or quality degradation from mixed concerns.
Each agent gets a focused role and 3-5 tools
Specialized agents with narrow tool sets and focused system prompts outperform generalist agents. If an agent needs 10+ tools, it's doing too much.
Use the simplest orchestration pattern that works
Sequential pipelines cover 80% of use cases. Only add parallel execution, hierarchical coordination, or swarm patterns when the task genuinely requires them.
Design for partial failure
Every agent call can fail. Design pipelines where completed steps are preserved, failed steps can retry, and the system can return partial results rather than nothing.
Pass summaries, not full context, between agents
When handing off between agents, send a structured summary (under 500 tokens) of key decisions, artifacts, and open questions — not the entire conversation history.
Use typed, scoped shared state
Define a TypeScript interface for shared state. Give each agent read access to all state but write access only to its designated section. Use locks for concurrent writes.
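A sketch of what that interface and a minimal section lock might look like; the state shape is invented for illustration, and a real system would use durable storage rather than an in-process lock:

```typescript
// One typed interface for all shared state; each agent owns one section.
interface PipelineState {
  research: { recommendation: string; sources: string[] };
  review: { approved: boolean; findings: string[] };
}

const locks = new Set<keyof PipelineState>();

// Serialize concurrent writers to the same section.
function withSectionLock<K extends keyof PipelineState>(
  state: PipelineState,
  section: K,
  update: (section: PipelineState[K]) => void,
): void {
  if (locks.has(section)) throw new Error(`section "${section}" is locked`);
  locks.add(section);
  try {
    update(state[section]);
  } finally {
    locks.delete(section);
  }
}

const pipeline: PipelineState = {
  research: { recommendation: "", sources: [] },
  review: { approved: false, findings: [] },
};
withSectionLock(pipeline, "review", (r) => { r.approved = true; });
```

The type parameter `K` means each agent's update callback only ever sees its own section's shape, so cross-section writes fail at compile time, not in production.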
Checkpoint after every pipeline step
Persist intermediate results to a durable store after each agent completes. This enables resume-from-checkpoint on failure and makes debugging much easier.
Define explicit communication protocols
Standardize how agents communicate: message format, maximum size, required fields. Ad-hoc message passing leads to context bloat and debugging nightmares.
Implement retries with exponential backoff
Agent calls fail due to rate limits, network issues, and model errors. Retry 2-3 times with increasing delays before falling back to an alternative strategy.
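A synchronous sketch of the retry loop with an injectable `sleep`, so the backoff schedule is testable without real delays; production code would be async, use real timers, and typically add jitter:

```typescript
// Retry with exponential backoff. `sleep` is injected so the schedule
// can be observed in tests; pass a real timer in production.
function retry<T>(
  fn: () => T,
  attempts: number,
  baseDelayMs: number,
  sleep: (ms: number) => void,
): T {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
      // Double the delay each attempt: base, 2x base, 4x base, ...
      if (i < attempts - 1) sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError; // all attempts failed; caller falls back to an alternative
}
```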
Use fallback agents for critical steps
For critical pipeline steps, define a fallback agent (e.g., a simpler model) that can produce acceptable output when the primary agent fails.
Set timeouts on all agent operations
Agents can hang due to infinite loops, model latency spikes, or deadlocks. Set timeouts on every agent call and inter-agent wait to prevent the entire system from freezing.
Validate agent outputs before passing downstream
Use schema validation (Zod) to verify each agent's output structure before passing it to the next agent. Catch malformed outputs early, before they corrupt downstream context.
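The same validation idea without the Zod dependency, as a hand-rolled type guard; the `ResearchOutput` shape is an assumed example of one agent's contract:

```typescript
// The contract the next agent expects from the researcher (assumed shape).
interface ResearchOutput {
  recommendation: string;
  confidence: number;
}

// Validate structure before the output reaches the next agent's context.
function parseResearchOutput(raw: unknown): ResearchOutput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("malformed researcher output: not an object");
  }
  const o = raw as { recommendation?: unknown; confidence?: unknown };
  if (typeof o.recommendation !== "string" || typeof o.confidence !== "number") {
    throw new Error("malformed researcher output: missing or mistyped fields");
  }
  if (o.confidence < 0 || o.confidence > 1) {
    throw new Error(`confidence out of range: ${o.confidence}`);
  }
  return { recommendation: o.recommendation, confidence: o.confidence };
}
```

With Zod this collapses to a `z.object` schema plus `safeParse`, but the boundary is the same: malformed output fails loudly here instead of silently corrupting downstream context.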
Track token spend per agent, not just total
Know which agents consume the most tokens. Often, the orchestrator uses more tokens than the workers. Identify and optimize the most expensive agents first.
Log the full agent dependency graph per run
For every pipeline execution, log which agents ran, in what order, what they produced, and how long each took. This is essential for debugging multi-agent failures.
Measure orchestration overhead ratio
Track the ratio of tokens spent on coordination (orchestrator decisions, message passing) vs actual work (agent task execution). If overhead exceeds 30%, simplify your architecture.
Set up alerts for agent failure rates
Monitor each agent's success rate independently. An agent failing 20% of the time may be acceptable in isolation but catastrophic in a 5-agent pipeline (0.8^5 ≈ 33% pipeline success rate).
Use queue-based orchestration for scalability
Decouple agent execution from request handling using message queues (Redis, SQS, Kafka). This lets you scale agents independently and handle burst traffic without losing requests.
Implement graceful degradation for each agent
If a non-critical agent (e.g., formatting, logging) fails, the pipeline should continue with reduced functionality rather than failing entirely. Only critical agents should block the pipeline.
Version your agent configurations independently
Each agent's system prompt, tools, and model version should be versioned separately. This lets you update one agent without redeploying the entire pipeline, and roll back individual agents on regression.
Load test with realistic agent interaction patterns
Test the full multi-agent pipeline under load, not just individual agents. Inter-agent communication, shared state contention, and orchestrator bottlenecks only appear at scale.
The Guiding Principle
A multi-agent system should be the simplest architecture that achieves your quality bar. Every additional agent adds latency, cost, and failure surface. If you can't measure a concrete quality improvement from splitting an agent, keep it as one. The best multi-agent system is one where every agent earns its place through measurable improvement.
— Anthropic, Building Effective Agents
Resources & Further Reading
Docs, papers, repos, and courses
Essential documentation, research papers, repositories, and courses for mastering multi-agent orchestration.
Building Effective Agents
Anthropic's official guide on agent architectures, including when and how to use multi-agent patterns with Claude.
CrewAI Documentation
Official docs for CrewAI — the framework for orchestrating role-based AI agent teams with sequential and hierarchical processes.
AutoGen: Enabling Next-Gen LLM Applications
The foundational paper on AutoGen's multi-agent conversation framework, enabling complex LLM workflows through inter-agent communication.
LangGraph Multi-Agent Guide
LangGraph's guide to building multi-agent systems with stateful graphs, conditional routing, and human-in-the-loop patterns.
OpenAI Swarm (Experimental)
OpenAI's lightweight, experimental framework for multi-agent handoffs. Educational reference for understanding agent-to-agent transfers.
Multi-Agent Systems with LLMs: A Survey
Comprehensive survey of LLM-based multi-agent systems covering communication, coordination, and evaluation frameworks.
AgentScope: A Flexible yet Robust Multi-Agent Platform
Platform for building robust multi-agent applications with built-in fault tolerance, distributed execution, and monitoring.
The Landscape of Emerging AI Agent Architectures
Survey of single-agent and multi-agent architectures including hierarchical, peer-to-peer, and debate patterns for production use.
Claude Code: Multi-Agent Orchestration
How Claude Code uses multi-agent patterns internally — task delegation, context isolation, and parallel execution.
Building Multi-Agent Systems (DeepLearning.AI)
Andrew Ng's short course on building multi-agent systems with CrewAI, covering role-playing agents, task delegation, and collaboration.
Recommended Learning Path
- 1.Start with Anthropic's Building Effective Agents for foundational principles on when and how to use multi-agent patterns.
- 2.Take Andrew Ng's Multi-Agent Systems course on DeepLearning.AI for hands-on experience with CrewAI.
- 3.Read the LangGraph Multi-Agent Guide to understand stateful graph-based orchestration patterns.
- 4.Study OpenAI Swarm's source code (under 500 lines) to understand the simplest possible multi-agent implementation.
- 5.Read the academic surveys for comprehensive coverage of patterns, evaluation, and open research questions.