Engineering

Context Optimization: Keeping AI Agents in the Smart Zone

By Agents Squads · 10 min

“The best agents use the least context to do the most work.”

Why Context Is Currency

Every token in Claude’s context window comes with costs. More tokens means higher API bills—that’s the obvious one. But more tokens also means slower responses, because the model has to process everything in context before generating output. And perhaps most importantly, bloated context degrades reasoning quality. When an agent has 150,000 tokens of information to sort through, it struggles to focus on what actually matters.

We’ve seen agents balloon past 150K tokens and watched them become simultaneously expensive, slow, and confused. The fix isn’t better prompting—it’s disciplined context management.

The Smart Zone

We target under 40% context utilization for most operations. With Claude’s 200K token context window, that means staying under 80K tokens. Going above 60% of the context window (around 120K tokens) puts you in what we call the danger zone—performance degradation becomes noticeable.

Why 40%? Because agent work is unpredictable. Tool outputs can be large. Multi-turn reasoning needs room to work. Unexpected complexity requires space to explore. If you start a task already using 70% of your context, you have no margin for anything that takes more room than expected.
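
To make the arithmetic concrete, here is a minimal sketch in Python, assuming the 200K window and the 40%/60% thresholds above; the function name and zone labels are ours, not part of any SDK.

```python
# Sketch: classify context utilization against the thresholds discussed above.
# Window size and cutoffs mirror the article's numbers; names are illustrative.

CONTEXT_WINDOW = 200_000  # Claude's context window, per the figures above

def utilization_zone(tokens_used: int, window: int = CONTEXT_WINDOW) -> str:
    """Return 'smart', 'caution', or 'danger' based on utilization."""
    utilization = tokens_used / window
    if utilization < 0.40:      # under ~80K tokens: comfortable headroom
        return "smart"
    if utilization < 0.60:      # ~80K-120K: workable, but watch it
        return "caution"
    return "danger"             # above ~120K: expect degraded performance

print(utilization_zone(70_000))   # smart
print(utilization_zone(130_000))  # danger
```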

Estimating Context Usage

Before you can optimize context, you need to understand how much you’re using. Here are the rules of thumb we rely on.

For text, one token is roughly four characters in English, or about three-quarters of a word. Code tends to be denser—figure around 2.5 tokens per line. Markdown is lighter at about 1.5 tokens per line. JSON is verbose at roughly 3 tokens per line because of all the structural characters.

For file sizes, a small file under 100 lines typically consumes around 250 tokens. A medium file of 100-500 lines runs about 1,250 tokens. Large files over 500 lines can easily consume 2,500 tokens or more.

Common operations have predictable costs: reading a small file adds 200-500 tokens to context, reading a large file adds 2,000-5,000, grep results with ten matches add 500-1,000, scraping a web article adds 2,000-4,000, and injecting squad memory adds 500-2,000 depending on how much you’ve accumulated.
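
Here is a rough estimator that encodes those heuristics; it is a rule-of-thumb calculator built from the figures above, not a tokenizer.

```python
# Sketch: back-of-the-envelope token estimates using the heuristics above.
# These are rules of thumb, not tokenizer output.

TOKENS_PER_LINE = {
    "code": 2.5,      # denser than prose
    "markdown": 1.5,  # lighter
    "json": 3.0,      # structural characters add up
}

def estimate_tokens_from_text(text: str) -> int:
    """Roughly four characters per token for English prose."""
    return len(text) // 4

def estimate_tokens_from_lines(line_count: int, kind: str = "code") -> int:
    """Per-line estimate for code, markdown, or JSON files."""
    return int(line_count * TOKENS_PER_LINE[kind])

print(estimate_tokens_from_lines(100, "code"))           # ~250: a small file
print(estimate_tokens_from_lines(500, "code"))           # ~1,250: a medium file
print(estimate_tokens_from_text("hello world " * 100))   # ~300 tokens of prose
```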

Techniques That Work

Read Only What You Need

The most common waste we see is reading entire files to find one piece of information. If you need to find a specific function in a 2,000-line file, don’t read the whole file. Grep for the function name first to find its location, then read just the relevant section.

This sounds obvious, but agents default to thorough when they should default to targeted. Explicitly designing for minimal reads makes a huge difference.
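
As a sketch of the grep-first pattern, here is one way it might look, assuming plain `grep` driven through `subprocess`; the example path and the 40-line window are illustrative.

```python
# Sketch: find a symbol with grep first, then read only the lines around the
# match instead of loading the whole file. Path and window size are illustrative.
import subprocess

def read_around_match(pattern: str, path: str, window: int = 40) -> str:
    """Grep for `pattern` in `path`, then return ~`window` lines around the first hit."""
    result = subprocess.run(
        ["grep", "-n", pattern, path],   # -n gives us line numbers
        capture_output=True, text=True,
    )
    if not result.stdout:
        return ""                        # no match; nothing to read
    first_hit = int(result.stdout.splitlines()[0].split(":", 1)[0])
    with open(path) as f:
        lines = f.readlines()
    start = max(first_hit - window // 2, 0)
    return "".join(lines[start:start + window])

# Read ~40 lines around the match instead of the whole 2,000-line file.
print(read_around_match("def validate_session", "src/auth.py"))
```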

Progressive Disclosure

Start narrow and expand only if needed. Begin by reading file lists—that’s cheap, usually just a few hundred tokens. If you need more, read relevant file headers or just the first section. Only read full implementations when you’ve confirmed they’re actually relevant to the task.

Cheap operations come first; expensive operations happen only when necessary. This ordering keeps context lean while still gathering the information agents need.
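
One way to express that escalation, sketched with hypothetical helper names; the point is the ordering, not the specific functions.

```python
# Sketch: progressive disclosure as a ladder of increasingly expensive steps.
# Helper names are hypothetical; the point is the ordering, cheapest first.
from pathlib import Path

def list_files(root: str) -> str:
    """Cheapest step: just the file names (a few hundred tokens at most)."""
    return "\n".join(str(p) for p in Path(root).rglob("*.py"))

def read_header(path: str, lines: int = 30) -> str:
    """Middle step: only the top of a file -- imports, docstring, signatures."""
    with open(path) as f:
        return "".join(f.readlines()[:lines])

def read_full(path: str) -> str:
    """Most expensive step: the whole file. Use only once relevance is confirmed."""
    return Path(path).read_text()

# Escalate only as needed:
listing = list_files("src")             # step 1: is there an obvious candidate?
header = read_header("src/auth.py")     # step 2: does its header look relevant?
full = read_full("src/auth.py")         # step 3: confirmed relevant, read it all
```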

Summarize Before Injecting

Raw memory dumps are expensive. If you’re injecting 3,000 tokens of conversation history, consider summarizing it first. A summary that captures key findings, current status, and blockers might only take 300 tokens while preserving what the agent actually needs to know.

The same applies to tool outputs, research results, and any other large data that needs to enter context. Ask: can this be summarized without losing essential information?
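
Here is a sketch of summarizing history with a cheap model before injecting it, assuming the Anthropic Python SDK; the prompt wording and the 300-token budget are our choices, not a fixed recipe.

```python
# Sketch: compress raw history with a cheap model before injecting it into the
# main agent's context. Assumes the Anthropic Python SDK; prompt wording is ours.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize_for_context(raw_history: str, budget_tokens: int = 300) -> str:
    """Boil ~3,000 tokens of history down to key findings, status, and blockers."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # cheap model; summarization doesn't need Opus
        max_tokens=budget_tokens,
        messages=[{
            "role": "user",
            "content": "Summarize the following working notes into key findings, "
                       "current status, and open blockers. Be terse.\n\n" + raw_history,
        }],
    )
    return response.content[0].text

summary = summarize_for_context(raw_history=open("session_notes.txt").read())
```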

Spawn Sub-Agents for Heavy Work

When a task requires reading many files or doing extensive research, consider spawning a sub-agent to handle it. The sub-agent does the heavy reading in its own context, then returns a summary to the main agent.

This pattern keeps the main agent’s context bounded. Instead of consuming 50K tokens reading twenty files, the main agent consumes maybe 2K tokens receiving a summary of what those files contained.
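
One shape the pattern can take, again assuming the Anthropic Python SDK; `spawn_research_subagent`, the model alias, and the file-gathering logic are illustrative, not a published framework API.

```python
# Sketch: a sub-agent reads the heavy material in its own request and hands the
# main agent only a short summary. Names and model alias are illustrative.
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()
SUBAGENT_MODEL = "claude-sonnet-4-5"  # substitute whichever model alias you use

def spawn_research_subagent(question: str, paths: list[str]) -> str:
    """Read many files in a separate request and return only a summary."""
    corpus = "\n\n".join(f"# {p}\n{Path(p).read_text()}" for p in paths)
    response = client.messages.create(
        model=SUBAGENT_MODEL,
        max_tokens=1_000,   # the summary, not the corpus, comes back to the caller
        messages=[{
            "role": "user",
            "content": f"{question}\n\nAnswer from these files, in a few paragraphs:\n{corpus}",
        }],
    )
    return response.content[0].text

# The main agent pays ~2K tokens for the answer instead of ~50K for twenty files.
summary = spawn_research_subagent(
    "How is authentication handled?",
    paths=[str(p) for p in Path("src").rglob("*.py")],
)
```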

Use Appropriate Models

Not every task needs the most powerful (and most expensive) model. Simple classification can run on Haiku at a fraction of the cost. Standard work fits Sonnet well. Reserve Opus for genuinely complex reasoning where the capability matters.

Using Opus tokens for Haiku tasks is wasteful in both cost and context. Match the model to the task.
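
A sketch of the idea as a simple routing table; the tiers and model aliases are assumptions about how you might bucket tasks, not a prescribed mapping.

```python
# Sketch: route each task to the cheapest model that can handle it.
# The tiers and aliases are illustrative; adjust to your own task taxonomy.

MODEL_FOR_TIER = {
    "simple": "claude-3-5-haiku-latest",   # classification, extraction, formatting
    "standard": "claude-sonnet-4-5",       # most day-to-day agent work
    "complex": "claude-opus-4-1",          # genuinely hard multi-step reasoning
}

def pick_model(task_tier: str) -> str:
    return MODEL_FOR_TIER[task_tier]

print(pick_model("simple"))   # don't spend Opus tokens on a Haiku task
```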

Patterns to Avoid

The Kitchen Sink

“Let me read all 50 files in this directory to understand the codebase.”

The result: 100K tokens consumed, agent confused by information overload, user frustrated waiting. The agent has so much information that it can’t figure out what’s relevant, so it produces vague or incomplete answers despite having nominally thorough context.

The Hoarder

“I’ll keep all previous conversation in context in case it’s relevant later.”

The result: context grows every turn. Eventually you hit the limit and things break, or performance degrades gradually until the agent becomes nearly useless. Most old conversation context isn’t relevant—summarize it or let it go.

The Over-Researcher

“Before I can answer this simple question, let me thoroughly investigate the entire relevant domain.”

The result: ten minutes of research and 50K tokens consumed to produce a one-sentence answer. The agent’s thoroughness instinct, usually a strength, becomes a liability when it triggers unnecessarily.

A Decision Framework

When you notice context getting heavy, ask these questions in order; a code sketch of the same flow follows below:

Is the task complete? If yes, summarize what you found and finish. Don’t keep reading once you have the answer.

Can a sub-agent handle the remaining research? If the work involves reading many files or sources, spawn a sub-agent to do it and return a summary.

Is the remaining work critical? If you genuinely need more context to complete an important task, accept the cost. But verify it’s actually necessary.

Can you summarize partial results? If you’ve gathered useful information but can’t complete the full task efficiently, return what you have. Partial results are better than blown context limits.
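
Here is that flow as a small function; the boolean inputs are placeholders for whatever signals your agent actually tracks.

```python
# Sketch: the four questions above as an ordered decision function.
# The boolean inputs are placeholders for signals your agent already has.

def next_step(task_complete: bool,
              research_remaining: bool,
              remaining_work_critical: bool) -> str:
    if task_complete:
        return "summarize findings and finish"
    if research_remaining:
        return "spawn a sub-agent and take back a summary"
    if remaining_work_critical:
        return "accept the cost, but verify it's actually necessary"
    return "return partial results rather than blow the context limit"

print(next_step(task_complete=False, research_remaining=True,
                remaining_work_critical=False))
```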

Monitoring in Practice

Track context usage patterns across your agents. A healthy session typically stays in the 20-40K token range. Sessions that regularly hit 60-80K deserve investigation—something is probably inefficient. Sessions that exceed 100K almost always indicate a design problem.
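
A sketch of how you might track this per session, with made-up names and the bands above as thresholds.

```python
# Sketch: a tiny per-session tally so high-context sessions stand out in logs.
# The health bands mirror the session-level ranges above; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class SessionMeter:
    session_id: str
    tokens: int = 0
    events: list[tuple[str, int]] = field(default_factory=list)

    def record(self, operation: str, token_cost: int) -> None:
        self.tokens += token_cost
        self.events.append((operation, token_cost))

    def health(self) -> str:
        if self.tokens <= 40_000:
            return "healthy"
        if self.tokens <= 100_000:
            return "worth investigating"
        return "likely a design problem"

meter = SessionMeter("sess-042")
meter.record("read src/auth.py", 2_400)
meter.record("grep 'auth'", 700)
print(meter.tokens, meter.health())
```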

When you see patterns of high context usage, investigate: What’s consuming the tokens? Is it necessary? Can the agent work smarter?

A Concrete Example

Task: “Find where authentication is handled.”

The bloated approach reads all files in src/, then all files in lib/, then all config files. Eventually it finds the auth handler. Total cost: 80K tokens.

The optimized approach greps for “auth” and finds five relevant files. It reads the header of auth.ts—about 100 lines—and finds what it needs. Total cost: 8K tokens.

Same result. One-tenth the context. The optimized agent is faster, cheaper, and has plenty of room for follow-up work.

The Core Principle

Think about context quality as relevance times information density, divided by tokens used.

You want high relevance—everything in context should matter for the current task. You want high density—the information should be meaningful, not padded with irrelevant detail. And you want minimal tokens—the smallest representation that preserves what you need.
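
Read as a selection heuristic, it might look like this sketch, with made-up 0-to-1 scores for relevance and density.

```python
# Sketch: the principle as a scoring heuristic for deciding what earns a place
# in context. The 0-1 relevance/density scores are made up for illustration.

def context_quality(relevance: float, density: float, tokens: int) -> float:
    """Higher is better: relevant, dense information in few tokens."""
    return (relevance * density) / tokens

full_file = context_quality(relevance=0.9, density=0.3, tokens=5_000)   # padded
summary   = context_quality(relevance=0.9, density=0.9, tokens=400)     # distilled
print(summary > full_file)  # True: the summary earns its place
```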

Good agents are context-efficient by design. They don’t become efficient through post-hoc optimization—efficiency is built into how they approach tasks from the start.

Monitor your squad’s context usage:

squads dash

Or learn more about our CLI.


This is part of the Engineering AI Agents series—practical patterns for building autonomous AI systems.
