
Context Window Economics: The Math Behind LLM Token Optimization

By Agents Squads · 7 min

The Hidden Cost Model

Every token in a context window has a cost—not just monetary, but computational. At $3-15 per million tokens (depending on model and direction), the naive approach is to minimize tokens.

But token minimization is the wrong optimization target.

The real question: when does injecting information upfront save more than it costs?

The Basic Math

Consider two approaches to providing an AI agent with project state:

Upfront injection:     ~870 tokens, instant
Tool-call discovery:   ~920 tokens + latency

Token cost is roughly equivalent. So why does injection often win?

The hidden factor is relevance rate—how often the injected information gets used.

The Value Formula

Value = (tokens_saved × usage_rate) - tokens_injected

For a status command that costs 870 tokens with 80% session usage:

Value = (920 × 0.8) - 870 = 736 - 870 = -134 tokens

Slightly negative on pure token math. But this ignores latency.

Each tool call adds an API roundtrip—typically 200-500ms. The agent also spends “thinking tokens” deciding whether to check state. When you account for these factors, upfront injection often wins despite the token cost.
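
As a quick check, here is a minimal sketch of this arithmetic in Python. The 870/920 token figures and 80% usage rate come from the example above; the latency credit at the end is an illustrative assumption, not a measured value.

# Token-only value of upfront injection (figures from the example above).
tokens_injected = 870    # cost of injecting the status summary every session
tokens_saved = 920       # tokens the discovery tool calls would have cost
usage_rate = 0.8         # fraction of sessions that actually use the context

value_tokens = tokens_saved * usage_rate - tokens_injected
print(f"token-only value: {value_tokens:.0f}")                       # -134

# Break-even usage rate: on tokens alone, injection only pays above ~95%.
print(f"break-even usage rate: {tokens_injected / tokens_saved:.0%}")

# Illustrative latency credit (assumption): treat two avoided ~300 ms roundtrips
# as worth ~100 token-equivalents each, and the sign flips.
print(f"value with latency credit: {value_tokens + 2 * 100:.0f}")    # +66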

When High-Density Injection Wins

Inject upfront when:

  - The information is referenced in most sessions (70%+ usage rate)
  - The payload is compact (hundreds of tokens, not thousands)
  - Discovery would require multiple tool calls or add noticeable latency
  - The session is interactive and roundtrip delays are felt by the user

Real example: A session status summary (squad states, recent activity, active goals) gets referenced in nearly every interaction. The 870 tokens of upfront context save 2-3 tool calls and 500+ thinking tokens per session.

When It Loses

Avoid upfront injection when:

  - The content is large (thousands of tokens) but rarely referenced
  - Only specific task types need it, so the average usage rate is low
  - An on-demand query can return exactly the slice that is needed

Real example: Full project history (10,000+ tokens) when most sessions only need recent commits. Better to query on demand.

Practical Measurements

We measured actual token costs for common context injections:

Context Type          Chars     Tokens    Use Case
Minimal status        ~800      ~200      Session hooks (always)
Full status           3,493     ~870      Most sessions
Full dashboard        9,367     ~2,340    Deep analysis
Project CLAUDE.md     8,000     ~2,000    Always relevant
Full codebase index   40,000+   ~10,000   Rarely needed upfront

At session start, ~970 tokens of context represents less than 1% of a 200K token window. That’s cheap insurance against discovery overhead.
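
To build a table like this for your own contexts, a character-based estimate is usually close enough. The sketch below assumes the rough ~4 characters per token ratio visible in the measurements above; for exact counts, use your model provider's tokenizer.

# Rough token estimate from character count. The 4 chars/token ratio is a
# heuristic consistent with the table above, not an exact tokenizer.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return round(len(text) / chars_per_token)

# Hypothetical example: a 3,493-character status file lands near ~870 tokens.
with open("status.md") as f:   # hypothetical file name
    print(estimate_tokens(f.read()))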

Progressive Density Strategy

The optimal approach isn’t “inject everything” or “inject nothing”—it’s progressive density based on relevance probability.

# Level 1: Always inject (100% relevance)
# Squad names, activity flags, critical state
~200 tokens

# Level 2: High-relevance sessions (70%+)
# Full status, recent goals, active work
~870 tokens

# Level 3: Deep analysis (specific tasks only)
# Full history, complete memory, all context
~2,340 tokens

The key insight: don’t optimize globally—optimize per session type.
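
One way to encode progressive density is a small per-session-type selector. The session types, token budgets, and part names below are illustrative, mirroring the three levels above rather than any fixed API.

# Map each session type to an injection level; budgets echo the levels above.
LEVELS = {
    "hook":      {"budget": 200,  "parts": ["squad_names", "activity_flags"]},
    "standard":  {"budget": 870,  "parts": ["full_status", "recent_goals", "active_work"]},
    "deep_dive": {"budget": 2340, "parts": ["full_history", "memory", "all_context"]},
}

def build_context(session_type: str, sources: dict[str, str]) -> str:
    """Assemble upfront context from the parts configured for this session type."""
    level = LEVELS.get(session_type, LEVELS["standard"])
    return "\n\n".join(sources[p] for p in level["parts"] if p in sources)

The point is the per-type budget: a lightweight hook never pays the deep-dive cost, and a deep-analysis session never starts context-starved.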

The Extended Formula

A more complete value calculation:

Value = (tokens_saved × usage_rate × sessions)
      - (tokens_injected × sessions)
      + (latency_saved_ms × value_per_ms)
      + (thinking_tokens_saved × usage_rate)

Where:

  - tokens_saved: discovery tokens the agent avoids spending (tool calls, file reads)
  - usage_rate: fraction of sessions that actually reference the injected context
  - sessions: number of sessions in the window you're optimizing for
  - latency_saved_ms: roundtrip time avoided by skipping discovery calls
  - value_per_ms: how much a millisecond of user-facing latency is worth to you
  - thinking_tokens_saved: tokens the agent no longer spends deciding what to look up

For interactive sessions, latency dominates. A 500ms roundtrip feels slow. For background automation, token cost dominates.
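
Expressed as code, the extended formula looks like the sketch below. The example call reuses the 870/920-token case from earlier over a hypothetical 100-session window; the per-millisecond valuation is an assumption you would calibrate for your own workload.

def injection_value(
    tokens_saved: float,          # discovery tokens avoided per session
    usage_rate: float,            # fraction of sessions referencing the context
    sessions: int,                # sessions in the measurement window
    tokens_injected: float,       # upfront cost paid every session
    latency_saved_ms: float,      # total roundtrip time avoided over the window
    value_per_ms: float,          # worth of 1 ms, in token-equivalents
    thinking_tokens_saved: float, # total decision overhead avoided over the window
) -> float:
    return (
        tokens_saved * usage_rate * sessions
        - tokens_injected * sessions
        + latency_saved_ms * value_per_ms
        + thinking_tokens_saved * usage_rate
    )

# Assumed example: 100 sessions, ~400 ms and ~500 thinking tokens saved per session.
print(injection_value(920, 0.8, 100, 870,
                      latency_saved_ms=100 * 400,
                      value_per_ms=0.5,
                      thinking_tokens_saved=100 * 500))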

Applying This to Agent Design

Prompt Engineering

Structure prompts with relevance-aware sections:

## Context (Always Relevant)
{minimal_state}

## Extended Context (If Needed)
{full_state if complex_task else "Use tools to query"}

## Task-Specific Context
{injected only for matching task types}
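
A minimal sketch of assembling such a prompt, assuming hypothetical minimal_state and full_state strings and a simple complexity flag:

def build_prompt(task: str, minimal_state: str, full_state: str,
                 complex_task: bool) -> str:
    # Always-relevant context is injected unconditionally.
    sections = [f"## Context (Always Relevant)\n{minimal_state}"]
    # Extended context only when the task is likely to need it;
    # otherwise point the agent at its tools.
    extended = full_state if complex_task else "Use tools to query project state."
    sections.append(f"## Extended Context (If Needed)\n{extended}")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)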

Tool Descriptions

High-density descriptions for frequently-used tools pay off:

{
  "name": "search_codebase",
  "description": "Semantic search across all source files. Returns top 10 matches with surrounding context. Use for: finding implementations, understanding patterns, locating related code. Prefer over file reads when location unknown."
}

The longer description (~50 tokens) pays for itself by saving the thinking tokens the agent would otherwise spend deciding which tool to use.

Memory Loading

Load memory progressively:

Session start: Active goals, recent decisions (500 tokens)
On research task: Full topic memory (2,000 tokens)
On complex analysis: Everything relevant (5,000+ tokens)

Don’t load the full knowledge base for a simple commit message.
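
A sketch of that loading policy, assuming a hypothetical memory store keyed by scope; the token figures in the comments echo the levels above.

def load_memory(memory: dict[str, str], task_type: str) -> str:
    """Load memory progressively: small always, larger only when the task needs it."""
    parts = [memory["active_goals"], memory["recent_decisions"]]  # ~500 tokens, always
    if task_type == "research":
        parts.append(memory["topic_memory"])                      # ~2,000 tokens
    elif task_type == "complex_analysis":
        parts.append(memory["full_knowledge_base"])               # 5,000+ tokens
    return "\n\n".join(parts)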

The Counter-Intuitive Insight

Teams optimizing for token minimization often create slower, more expensive agents.

Why? An agent that doesn’t have context:

  1. Spends tokens deciding what context it needs
  2. Spends latency calling tools to discover
  3. May miss relevant information due to incomplete discovery
  4. Repeats discovery across sessions

An agent with appropriate upfront context:

  1. Starts working immediately
  2. References injected information without tool calls
  3. Completes tasks faster with fewer total tokens
  4. Maintains coherence across interactions

The goal isn’t minimum tokens—it’s maximum value per token.

Measurement Framework

To optimize your own context strategy:

  1. Track usage rates: Log which injected context actually gets referenced
  2. Measure discovery costs: Count tool calls that inject context
  3. Time sessions: Compare task completion with different context levels
  4. Calculate total tokens: Include thinking, discovery, and injected tokens

The numbers will tell you where to optimize.
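
A lightweight way to get those numbers is to log one record per session and compute usage rates offline. The record fields and file name below are illustrative.

import json
from dataclasses import dataclass, asdict

@dataclass
class SessionRecord:
    session_id: str
    injected_tokens: int        # context pushed at session start
    injected_referenced: bool   # did the agent actually use it?
    discovery_tool_calls: int   # tool calls spent rediscovering state
    discovery_tokens: int
    wall_clock_s: float

def log_session(record: SessionRecord, path: str = "context_metrics.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def usage_rate(path: str = "context_metrics.jsonl") -> float:
    rows = [json.loads(line) for line in open(path)]
    return sum(r["injected_referenced"] for r in rows) / len(rows)

Once the measured usage rate for a context type drops well below the ~70% threshold, that is the signal to demote it to on-demand discovery.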

Key Takeaways

  1. Relevance rate matters more than token count
  2. 70%+ usage rate justifies upfront injection
  3. Latency savings often exceed token costs
  4. Progressive density beats one-size-fits-all
  5. Measure actual usage, don’t assume

The formula that matters:

Value = (tokens_saved × relevance) - tokens_injected + latency_savings

Optimize for relevance first, then density.


Note: Token estimates based on Claude tokenization. GPT and other models may vary by 10-20%. The principles apply regardless of specific counts.
