The Context Problem
As Andrej Karpathy puts it, the LLM context window is like RAM: it requires careful curation, not maximization.
This matters more for multi-agent systems. Anthropic’s research shows multi-agent architectures can consume 15x more tokens than single-agent approaches. Without context engineering, costs spiral, performance degrades, and agents fail.
The goal isn’t to fill context windows. It’s to find the smallest possible set of high-signal tokens that maximize desired outcomes.
Context Failure Modes
Before discussing solutions, understand the problems:
| Failure Mode | What Happens |
|---|---|
| Context Rot | Model performance degrades as token count increases |
| Context Poisoning | Hallucinations enter stored information, compound over time |
| Context Distraction | Excessive information overwhelms the model |
| Context Confusion | Irrelevant content influences responses |
| Context Clash | Conflicting information within the same context |
These aren’t theoretical. We’ve observed all of them in production multi-agent systems. Context rot alone caused a 23% accuracy drop in one research agent operating above 60% context utilization.
The 40% Smart Zone
Claude Code triggers auto-compact at 95% context saturation. But performance degrades long before that threshold.
Based on our testing across 45 agents in 8 squads:
| Context Utilization | Performance Impact |
|---|---|
| 0-30% | Optimal reasoning quality |
| 30-40% | Smart Zone - good balance of context and reasoning |
| 40-60% | Noticeable degradation on complex tasks |
| 60-80% | Significant quality loss, more hallucinations |
| 80-95% | Severe degradation, unreliable outputs |
| 95%+ | Auto-compact triggers, context loss |
Recommendation: Design agents to operate within the 30-40% Smart Zone. Build in checkpoints and summarization before context utilization crosses 40%.
The Four Techniques
Based on Anthropic’s official guidance, context management falls into four categories.
1. Write Context (Externalize)
Pattern: Persist information outside the context window. Retrieve when needed.
| Approach | Implementation | Use Case |
|---|---|---|
| Scratchpads | Tool calls write to runtime state | Multi-step reasoning |
| Long-term memory | Agent synthesizes to storage | Cross-session learning |
| Todo lists | Progress trackers as files | Complex tasks |
| Notes files | NOTES.md, CLAUDE.md patterns | Project context |
In practice: Agents should write summaries to files rather than accumulating them in context.
- Instead of: keeping all findings in working memory
- Do: write findings to files and keep only a summary in context
A research agent investigating 10 sources should write each source analysis to a file, keeping only a 2-3 sentence summary in context. Total context: ~500 tokens instead of ~15,000.
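A minimal sketch of this write-then-summarize pattern. The `summarize` helper is a placeholder for whatever model call you use; only the short summary and a file pointer stay in working context:

```python
from pathlib import Path

NOTES_DIR = Path("agent_notes")
NOTES_DIR.mkdir(exist_ok=True)

def summarize(text: str, max_sentences: int = 3) -> str:
    """Placeholder: in a real agent this would be an LLM call that
    condenses the analysis to 2-3 sentences."""
    return " ".join(text.split(". ")[:max_sentences])

def analyze_source(source_id: str, raw_analysis: str, working_context: list[str]) -> None:
    # Write the full analysis to disk (externalized memory)...
    (NOTES_DIR / f"{source_id}.md").write_text(raw_analysis)
    # ...and keep only a short summary plus a file reference in context.
    working_context.append(
        f"{source_id}: {summarize(raw_analysis)} (full notes: agent_notes/{source_id}.md)"
    )

context: list[str] = []
analyze_source("source_01", "Long analysis of the first source. It covers X. It concludes Y.", context)
print(context)  # one short line per source instead of the full analysis
```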
2. Select Context (Retrieve)
Pattern: Dynamically fetch only relevant information at runtime.
| Approach | Implementation | Use Case |
|---|---|---|
| RAG retrieval | Embeddings + vector search | Large knowledge bases |
| Static files | CLAUDE.md loaded upfront | Project conventions |
| Just-in-time | Glob/grep during execution | Code exploration |
| Tool descriptions | RAG on tool docs | Large tool sets |
Evidence: RAG on tool descriptions showed 3x improvement in tool selection accuracy for agents with 20+ available tools.
In practice:
- Load CLAUDE.md files upfront (always relevant)
- Use file search for discovery, read only what’s needed
- Don’t pre-load “just in case”
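A sketch of runtime tool selection. Keyword overlap stands in for embedding similarity here; a production system would run a vector search over the tool docs instead, but the shape is the same: surface only the few relevant descriptions, not the whole catalog.

```python
def score(query: str, description: str) -> float:
    """Stand-in for embedding similarity: fraction of query words
    that appear in the tool description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / max(len(q), 1)

TOOL_DOCS = {
    "grep_code": "search source code for a pattern and return matching lines",
    "read_file": "read the contents of a file at a given path",
    "run_tests": "execute the project test suite and report failures",
    "create_issue": "open a tracking issue with a title and body",
    # ...imagine 20+ more tools here
}

def select_tools(task: str, k: int = 3) -> list[str]:
    # Load only the k most relevant tool descriptions into context
    # instead of every tool's documentation.
    ranked = sorted(TOOL_DOCS, key=lambda name: score(task, TOOL_DOCS[name]), reverse=True)
    return ranked[:k]

print(select_tools("search the code for the function that reads a file"))
```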
3. Compress Context (Summarize)
Pattern: Condense information while preserving critical decisions.
| Approach | Threshold | Trade-off |
|---|---|---|
| Auto-compact | 95% context saturation | May lose subtle context |
| Recursive summarization | At agent boundaries | Compression artifacts |
| Context trimming | Remove older messages | Lost history |
| Tool result clearing | After processing | Safest approach |
In practice:
- Summarize after completing subtasks (2-3 sentences)
- Drop tool outputs after extracting conclusions
- Keep decisions and rationale, not raw data
Warning: Overly aggressive compression risks losing subtle but critical context. Test compression on representative tasks before deploying.
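A sketch of a compression checkpoint, assuming a rough 4-characters-per-token heuristic and a placeholder `llm_summarize` call. Bulky tool outputs are condensed once the context nears the Smart Zone ceiling; messages marked as decisions are left untouched.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str                   # "tool", "assistant", "user"
    content: str
    compressible: bool = True   # mark decisions/rationale as False

@dataclass
class Context:
    messages: list[Message] = field(default_factory=list)

    def tokens(self) -> int:
        # Rough heuristic: ~4 characters per token.
        return sum(len(m.content) for m in self.messages) // 4

def llm_summarize(text: str) -> str:
    """Placeholder for an LLM call that keeps conclusions, not raw data."""
    return text[:200] + " ..."

def checkpoint(ctx: Context, window: int = 200_000) -> None:
    """After a subtask: if utilization nears 40%, replace bulky tool
    outputs with short summaries; keep decisions and rationale intact."""
    if ctx.tokens() < int(window * 0.4):
        return
    for m in ctx.messages:
        if m.role == "tool" and m.compressible:
            m.content = llm_summarize(m.content)
```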
4. Isolate Context (Sub-agents)
Pattern: Specialized agents with clean, focused context windows.
| Approach | Benefit | Cost |
|---|---|---|
| Task-specific sub-agents | Deep focus | Coordination overhead |
| Parallel execution | More total tokens on problem | 15x token multiplier |
| Condensed handoffs | Clean interfaces | Information loss risk |
In practice:
- Sub-agents return summaries (1,000-2,000 tokens), not full results
- Each sub-agent focuses on one concern
- Parent coordinates, doesn’t duplicate work
Evidence: Splitting complex research across sub-agents (each with isolated context) significantly outperformed single-agent approaches—90% improvement on multi-source synthesis tasks.
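A sketch of the isolation pattern. `run_subagent` is a stand-in for however your framework spawns an agent with a fresh context window; the point is that each worker receives a minimal brief and the parent keeps only the condensed result.

```python
def run_subagent(brief: str) -> str:
    """Placeholder: in a real system this starts a fresh agent with its
    own clean context window and returns its final summary."""
    return f"Summary of work on: {brief[:60]}..."

def investigate(topics: list[str]) -> list[str]:
    findings = []
    for topic in topics:
        # Each sub-agent gets a minimal brief, not the parent's history.
        brief = (
            f"## Task\nResearch {topic}\n"
            "## Constraints\n- Cite sources\n- Return at most 1,500 tokens\n"
            "## Expected Output\nShort synthesis with key conclusions"
        )
        # Only the condensed summary enters the parent's context.
        findings.append(run_subagent(brief))
    return findings

print(investigate(["context rot", "tool selection accuracy"]))
```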
Agent-Specific Budgets
Different agent types have different context needs:
| Agent Type | Context Target | Budget/Run | Timeout | Rationale |
|---|---|---|---|---|
| Monitor | < 20% | $0.50-1.00 | 5 min | Fetch → Report (focused) |
| Analyzer | < 30% | $1.00-2.00 | 10 min | Read upstream → Synthesize |
| Generator | < 40% | $2.00-5.00 | 15 min | Create artifacts (needs more context) |
| Orchestrator | < 25% | $2.00-3.00 | 15 min | Coordinate, don’t accumulate |
| Reviewer | < 30% | $1.00-2.00 | 5 min | Diff + rules (bounded input) |
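These budgets are easiest to enforce when expressed as configuration. A sketch with illustrative names mirroring the table above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentBudget:
    context_target: float   # max fraction of the context window
    max_cost_usd: float     # hard budget per run
    timeout_minutes: int

# Values mirror the table above; the keys are illustrative.
BUDGETS = {
    "monitor":      AgentBudget(0.20, 1.00, 5),
    "analyzer":     AgentBudget(0.30, 2.00, 10),
    "generator":    AgentBudget(0.40, 5.00, 15),
    "orchestrator": AgentBudget(0.25, 3.00, 15),
    "reviewer":     AgentBudget(0.30, 2.00, 5),
}

def within_budget(agent_type: str, context_fraction: float, cost_usd: float) -> bool:
    b = BUDGETS[agent_type]
    return context_fraction <= b.context_target and cost_usd <= b.max_cost_usd
```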
Input/Output Patterns
Monitors (scheduled data fetching):
- Inputs: Config only (no upstream context)
- Outputs: Structured reports (markdown + JSON)
- Context: Fresh each run

Analyzers (synthesis agents):
- Inputs: Upstream data (bounded, recent only)
- Outputs: Analysis + structured data
- Context: Read 5 previous reports max

Orchestrators (lead agents):
- Inputs: Briefs, requests
- Outputs: Issues, coordination artifacts
- Context: Pass minimal viable context to workers
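A sketch of a monitor run following this pattern. The `fetch_metrics` step and report paths are placeholders; the run starts with a fresh context and produces only structured artifacts for downstream analyzers.

```python
import json
from datetime import date
from pathlib import Path

def fetch_metrics() -> dict:
    """Placeholder for the monitor's single data-fetching step."""
    return {"open_issues": 12, "failed_runs": 1}

def run_monitor(report_dir: Path = Path("reports")) -> None:
    # Fresh context each run: the only input is configuration, the only
    # output is a structured report that downstream agents can read.
    report_dir.mkdir(exist_ok=True)
    data = fetch_metrics()
    stamp = date.today().isoformat()
    (report_dir / f"{stamp}.json").write_text(json.dumps(data, indent=2))
    (report_dir / f"{stamp}.md").write_text(
        f"# Monitor report {stamp}\n\n"
        f"- Open issues: {data['open_issues']}\n"
        f"- Failed runs: {data['failed_runs']}\n"
    )

run_monitor()
```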
Handoff Protocol
When passing context between agents, structure matters:
Good Handoff (Minimal Viable Context)
```markdown
## Task
Investigate context engineering patterns for multi-agent systems

## Constraints
- Max 2 hours
- Focus on practical techniques
- Cite sources

## Context Summary
We're building production multi-agent systems. Need to understand
how to manage context across agents without degradation.

## Expected Output
Deep-dive document with evidence-backed recommendations
```
Total: ~150 tokens
Bad Handoff (Context Hoarding)
```markdown
## Full Conversation History
[10,000 tokens of prior discussion]

## All Files Read
[5,000 tokens of file contents]

## Everything Just In Case
[3,000 tokens of tangentially related information]
```
Total: ~18,000 tokens
The bad handoff poisons the sub-agent’s context before it even starts working.
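A sketch of enforcing the good-handoff shape programmatically, using a rough 4-characters-per-token estimate. Rejecting oversized handoffs forces the parent to summarize before delegating.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(text) // 4

def build_handoff(task: str, constraints: list[str], summary: str,
                  expected_output: str, max_tokens: int = 2_000) -> str:
    handoff = (
        f"## Task\n{task}\n\n"
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints) + "\n\n"
        f"## Context Summary\n{summary}\n\n"
        f"## Expected Output\n{expected_output}\n"
    )
    # Refuse to hand off hoarded context: pass conclusions, not raw data.
    if estimate_tokens(handoff) > max_tokens:
        raise ValueError("Handoff too large: summarize before delegating")
    return handoff
```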
Warning Signs
Yellow Zone (30-40% context)
Watch for:
- Reading 5+ files without producing output
- Multiple large file reads in sequence
- Tool outputs accumulating without summarization
- Conversation going 10+ turns on same task
- Large search result sets being read in full
Action: Pause, summarize current state, consider spawning sub-agent.
Red Zone (>40% context)
Immediate actions:
- Stop accumulating
- Summarize current state
- Spawn fresh sub-agent with summary only
- Or: trigger manual checkpoint
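A sketch of the zone check, assuming a 200K-token window; the returned zone maps directly to the actions above.

```python
def context_zone(used_tokens: int, window_tokens: int = 200_000) -> str:
    """Classify utilization into the zones described above."""
    pct = used_tokens / window_tokens
    if pct < 0.30:
        return "green"   # keep working
    if pct < 0.40:
        return "yellow"  # pause, summarize, consider a sub-agent
    return "red"         # stop accumulating, hand off with a summary only

for used in (40_000, 70_000, 120_000):
    print(used, context_zone(used))
```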
Anti-Patterns
1. Context Hoarding
Pattern: Reading files “just in case”. Fix: Only read what you need now.
2. History Dependency
Pattern: Relying on “what we discussed earlier”. Fix: State it directly or write it to an external file.
3. Output Verbosity
Pattern: Including full file contents in responses. Fix: Summaries with file references.
4. Tool Output Accumulation
Pattern: Running many tools without processing results. Fix: Process → summarize → proceed.
5. Bloated Tool Sets
Pattern: Tools with overlapping functionality. Fix: Minimal viable tool set, unambiguous selection.
Measuring Context Efficiency
Track these metrics:
| Metric | Target | How to Measure |
|---|---|---|
| Context utilization | < 40% average | Trace analysis |
| Cost per outcome | Decreasing trend | Budget tracking |
| Sub-agent spawn rate | 20-30% of complex tasks | Execution logs |
| Handoff token size | < 2,000 tokens | Trace analysis |
| Compression ratio | 10:1 for tool outputs | Before/after comparison |
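A sketch of computing these metrics from execution traces. The trace schema shown is an assumption; adapt the field names to whatever your tracing actually captures.

```python
from statistics import mean

# Assumed trace schema: one dict per agent run.
traces = [
    {"used_tokens": 55_000, "window": 200_000, "handoff_tokens": 1_200,
     "tool_output_tokens": 9_000, "tool_summary_tokens": 800},
    {"used_tokens": 90_000, "window": 200_000, "handoff_tokens": 3_500,
     "tool_output_tokens": 12_000, "tool_summary_tokens": 1_500},
]

avg_utilization = mean(t["used_tokens"] / t["window"] for t in traces)
oversized_handoffs = sum(t["handoff_tokens"] > 2_000 for t in traces)
compression_ratio = mean(t["tool_output_tokens"] / t["tool_summary_tokens"] for t in traces)

print(f"avg context utilization: {avg_utilization:.0%}")      # target < 40%
print(f"handoffs over 2,000 tokens: {oversized_handoffs}")     # target 0
print(f"tool output compression: {compression_ratio:.1f}:1")   # target ~10:1
```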
Implementation Checklist
Agent Design
- Context target defined (% of window)
- Token budget set (per run)
- Inputs are bounded and specific
- Outputs are structured and summarized
- Tools are minimal and unambiguous
During Execution
- Reading only necessary files
- Summarizing after subtasks
- Dropping tool outputs after processing
- Spawning sub-agents for deep work
- Writing to external files for persistence
Handoffs
- Passing conclusions, not raw data
- Specifying constraints clearly
- Defining expected output format
- Limiting scope to single concern
The Economics
Context engineering isn’t just about quality—it’s about cost.
At $3-15 per million tokens:
- An agent running at 80% context uses 2x the tokens of one at 40%
- Multi-agent systems multiply this across every agent
- Inefficient handoffs compound these costs with every delegation
A well-engineered multi-agent system operating in the Smart Zone can cost 60-70% less than an equivalent unoptimized system while producing better results.
The 40% Smart Zone isn’t just optimal for reasoning—it’s optimal for economics.
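A back-of-envelope illustration, assuming a 200K-token window and $3 per million input tokens (illustrative pricing; adjust for your model and for output tokens):

```python
# Same window, same price, different utilization.
WINDOW = 200_000
PRICE_PER_TOKEN = 3 / 1_000_000

for utilization in (0.40, 0.80):
    tokens = int(WINDOW * utilization)
    print(f"{utilization:.0%} utilization -> {tokens:,} tokens "
          f"~ ${tokens * PRICE_PER_TOKEN:.2f} per full-context call")
# An agent sitting at 80% pays roughly twice as much per call as one at 40%,
# and a multi-agent system multiplies that across every agent and handoff.
```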
Summary
| Technique | When to Use | Token Savings |
|---|---|---|
| Write (Externalize) | Multi-step reasoning, cross-session | 50-80% |
| Select (Retrieve) | Large knowledge bases, many tools | 30-60% |
| Compress (Summarize) | After subtasks, tool outputs | 40-70% |
| Isolate (Sub-agents) | Complex tasks, parallel work | Enables 15x parallelization |
Context engineering is the discipline of curating tokens, not maximizing them. Multi-agent systems make this critical—and make the payoff substantial.
Sources: Anthropic Engineering (Effective Context Engineering for AI Agents, 2025), LangChain (Context Engineering for Agents), Chroma Research (Context Rot), internal analysis of 45 agents across 8 squads.