The Hidden Cost Model
Every token in a context window has a cost—not just monetary, but computational. At $3-15 per million tokens (depending on model and direction), the naive approach is to minimize tokens.
But token minimization is the wrong optimization target.
The real question: when does injecting information upfront save more than it costs?
The Basic Math
Consider two approaches to providing an AI agent with project state:
- Upfront injection: ~870 tokens, instant
- Tool-call discovery: ~920 tokens + latency
Token cost is roughly equivalent. So why does injection often win?
The hidden factor is relevance rate—how often the injected information gets used.
The Value Formula
Value = (tokens_saved × usage_rate) - tokens_injected
For a status injection that costs 870 tokens, replaces ~920 tokens of tool-call discovery, and gets used in 80% of sessions:
Value = (920 × 0.8) - 870 = 736 - 870 = -134 tokens
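In code, that's just the value formula applied to this example (a minimal Python sketch; the function name is ours, not from any library):

```python
def injection_value(tokens_saved: float, usage_rate: float, tokens_injected: float) -> float:
    """Net token value of injecting context upfront (positive means injection wins)."""
    return tokens_saved * usage_rate - tokens_injected

# Status example: ~920 discovery tokens avoided, used in 80% of sessions,
# at an upfront cost of 870 injected tokens.
print(injection_value(920, 0.8, 870))  # -> -134.0
```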
Slightly negative on pure token math. But this ignores latency.
Each tool call adds an API roundtrip—typically 200-500ms. The agent also spends “thinking tokens” deciding whether to check state. When you account for these factors, upfront injection often wins despite the token cost.
When High-Density Injection Wins
Inject upfront when:
- Information is used in >70% of sessions
- It eliminates “what’s the current state?” discovery phases
- It reduces multiple tool calls to zero
- The agent would otherwise waste thinking tokens on “should I check X?”
Real example: A session status summary (squad states, recent activity, active goals) gets referenced in nearly every interaction. The 870 tokens of upfront context save 2-3 tool calls and 500+ thinking tokens per session.
When It Loses
Avoid upfront injection when:
- Information is used in <30% of sessions
- One-size-fits-all injection for diverse task types
- Static snapshot when fresh data is critically important
- The information is easily discoverable via targeted queries
Real example: Full project history (10,000+ tokens) when most sessions only need recent commits. Better to query on demand.
Practical Measurements
We measured actual token costs for common context injections:
| Context Type | Chars | Tokens | Use Case |
|---|---|---|---|
| Minimal status | ~800 | ~200 | Session hooks (always) |
| Full status | 3,493 | ~870 | Most sessions |
| Full dashboard | 9,367 | ~2,340 | Deep analysis |
| Project CLAUDE.md | 8,000 | ~2,000 | Always relevant |
| Full codebase index | 40,000+ | ~10,000 | Rarely needed upfront |
At session start, ~970 tokens of context represents less than 1% of a 200K token window. That’s cheap insurance against discovery overhead.
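These estimates follow the rough rule of thumb of about 4 characters per token, which matches the table above; a quick sketch (the ratio is a heuristic, not a real tokenizer call, so use the model's tokenizer when precision matters):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count."""
    return round(len(text) / chars_per_token)

print(estimate_tokens("x" * 3493))  # -> 873, close to the ~870 measured above
```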
Progressive Density Strategy
The optimal approach isn’t “inject everything” or “inject nothing”—it’s progressive density based on relevance probability.
# Level 1: Always inject (100% relevance)
# Squad names, activity flags, critical state
# ~200 tokens

# Level 2: High-relevance sessions (70%+)
# Full status, recent goals, active work
# ~870 tokens

# Level 3: Deep analysis (specific tasks only)
# Full history, complete memory, all context
# ~2,340 tokens
The key insight: don’t optimize globally—optimize per session type.
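One way to implement this is a small lookup from session type to context level; a sketch with illustrative session types and block names (adapt both to your own task mix):

```python
# Hypothetical mapping from session type to injected context blocks.
CONTEXT_LEVELS = {
    "default":  ["minimal_status"],                                      # Level 1, ~200 tokens
    "dev":      ["minimal_status", "full_status"],                       # Level 2, ~870 tokens
    "analysis": ["minimal_status", "full_status", "full_dashboard"],     # Level 3, ~2,340 tokens
}

def select_context(session_type: str) -> list[str]:
    """Return the context blocks to inject for this session type."""
    return CONTEXT_LEVELS.get(session_type, CONTEXT_LEVELS["default"])

print(select_context("dev"))  # ['minimal_status', 'full_status']
```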
The Extended Formula
A more complete value calculation:
Value = (tokens_saved × usage_rate × sessions)
- (tokens_injected × sessions)
+ (latency_saved_ms × value_per_ms)
+ (thinking_tokens_saved × usage_rate)
Where:
- tokens_saved = tokens the agent would spend discovering via tools
- usage_rate = probability the information gets used (0-1)
- latency_saved_ms = API roundtrip time avoided
- value_per_ms = productivity cost of waiting
- thinking_tokens_saved = tokens spent deciding whether to query
For interactive sessions, latency dominates. A 500ms roundtrip feels slow. For background automation, token cost dominates.
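Here is the extended formula as a Python sketch. The parameter names mirror the definitions above; value_per_ms is whatever you decide a millisecond of waiting is worth in token-equivalents, and the example numbers are illustrative:

```python
def extended_injection_value(
    tokens_saved: float,          # tokens the agent would spend discovering via tools
    usage_rate: float,            # probability the information gets used (0-1)
    tokens_injected: float,       # upfront cost of the injection
    sessions: int = 1,            # sessions the injection applies to
    latency_saved_ms: float = 0,  # API roundtrip time avoided
    value_per_ms: float = 0,      # productivity cost of waiting, in token-equivalents
    thinking_tokens_saved: float = 0,
) -> float:
    return (
        tokens_saved * usage_rate * sessions
        - tokens_injected * sessions
        + latency_saved_ms * value_per_ms
        + thinking_tokens_saved * usage_rate
    )

# Same status example, now crediting two avoided roundtrips and some thinking tokens.
print(extended_injection_value(920, 0.8, 870, sessions=1,
                               latency_saved_ms=600, value_per_ms=0.5,
                               thinking_tokens_saved=500))  # -> 566.0
```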
Applying This to Agent Design
Prompt Engineering
Structure prompts with relevance-aware sections:
## Context (Always Relevant)
{minimal_state}
## Extended Context (If Needed)
{full_state if complex_task else "Use tools to query"}
## Task-Specific Context
{injected only for matching task types}
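A minimal assembly sketch for that structure (the state variables and the complex_task flag are placeholders for your own session state):

```python
def build_prompt(minimal_state: str, full_state: str, complex_task: bool,
                 task_context: str | None = None) -> str:
    """Assemble a prompt from relevance-aware sections."""
    sections = [
        "## Context (Always Relevant)\n" + minimal_state,
        "## Extended Context (If Needed)\n"
        + (full_state if complex_task else "Use tools to query"),
    ]
    if task_context:  # injected only for matching task types
        sections.append("## Task-Specific Context\n" + task_context)
    return "\n\n".join(sections)
```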
Tool Descriptions
High-density descriptions for frequently-used tools pay off:
{
  "name": "search_codebase",
  "description": "Semantic search across all source files. Returns top 10 matches with surrounding context. Use for: finding implementations, understanding patterns, locating related code. Prefer over file reads when location unknown."
}
The longer description (~50 tokens) saves the thinking tokens otherwise spent deciding which tool to use.
Memory Loading
Load memory progressively:
- Session start: Active goals, recent decisions (500 tokens)
- On research task: Full topic memory (2,000 tokens)
- On complex analysis: Everything relevant (5,000+ tokens)
Don’t load the full knowledge base for a simple commit message.
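A sketch of what progressive loading might look like (the tier names and task types are hypothetical):

```python
def load_memory(task_type: str) -> list[str]:
    """Load memory tiers progressively instead of loading everything up front."""
    memory = ["active_goals", "recent_decisions"]            # session start, ~500 tokens
    if task_type == "research":
        memory.append("full_topic_memory")                   # ~2,000 tokens
    elif task_type == "complex_analysis":
        memory += ["full_topic_memory", "related_history"]   # 5,000+ tokens
    return memory

print(load_memory("commit_message"))  # ['active_goals', 'recent_decisions']
```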
The Counter-Intuitive Insight
Teams optimizing for token minimization often create slower, more expensive agents.
Why? An agent that doesn’t have context:
- Spends tokens deciding what context it needs
- Spends latency calling tools to discover
- May miss relevant information due to incomplete discovery
- Repeats discovery across sessions
An agent with appropriate upfront context:
- Starts working immediately
- References injected information without tool calls
- Completes tasks faster with fewer total tokens
- Maintains coherence across interactions
The goal isn’t minimum tokens—it’s maximum value per token.
Measurement Framework
To optimize your own context strategy:
- Track usage rates: Log which injected context actually gets referenced
- Measure discovery costs: Count tool calls spent fetching context the agent could have had upfront
- Time sessions: Compare task completion with different context levels
- Calculate total tokens: Include thinking, discovery, and injected tokens
The numbers will tell you where to optimize.
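A sketch of the bookkeeping this requires (the log fields are assumptions about what you would track, not an existing framework):

```python
from dataclasses import dataclass

@dataclass
class SessionLog:
    injected_tokens: int = 0        # context injected at session start
    injected_used: bool = False     # did the agent actually reference it?
    discovery_tool_calls: int = 0   # tool calls spent fetching context
    discovery_tokens: int = 0       # tokens those calls returned
    thinking_tokens: int = 0        # tokens spent deciding what to fetch

def usage_rate(logs: list[SessionLog]) -> float:
    """Fraction of sessions where the injected context was actually referenced."""
    return sum(log.injected_used for log in logs) / len(logs) if logs else 0.0
```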
Key Takeaways
- Relevance rate matters more than token count
- 70%+ usage rate justifies upfront injection
- Latency savings often exceed token costs
- Progressive density beats one-size-fits-all
- Measure actual usage, don’t assume
The formula that matters:
Value = (tokens_saved × relevance) - tokens_injected + latency_savings
Optimize for relevance first, then density.
Note: Token estimates based on Claude tokenization. GPT and other models may vary by 10-20%. The principles apply regardless of specific counts.