
Context Window Economics: The Math Behind LLM Token Optimization

By Agents Squads · 7 min

The Hidden Cost Model

Every token in a context window has a cost—not just monetary, but computational. At $3-15 per million tokens (depending on model and direction), the naive approach is to minimize tokens.

But token minimization is the wrong optimization target.

The real question: when does injecting information upfront save more than it costs?

The Basic Math

Consider two approaches to providing an AI agent with project state:

Upfront injection:     ~870 tokens, instant
Tool-call discovery:   ~920 tokens + latency

Token cost is roughly equivalent. So why does injection often win?

The hidden factor is relevance rate—how often the injected information gets used.

The Value Formula

Value = (tokens_saved × usage_rate) - tokens_injected

For a status command that costs 870 tokens with 80% session usage:

Value = (920 × 0.8) - 870 = 736 - 870 = -134 tokens

Slightly negative on pure token math. But this ignores latency.

Each tool call adds an API roundtrip—typically 200-500ms. The agent also spends “thinking tokens” deciding whether to check state. When you account for these factors, upfront injection often wins despite the token cost.
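
As a quick check, here is a minimal sketch of this arithmetic in Python. The 870/920 token figures and 80% usage rate come from the example above; the latency credit at the end is an illustrative assumption, not a measured value.

# Token-only value of upfront injection (figures from the example above).
tokens_injected = 870    # cost of injecting the status summary every session
tokens_saved = 920       # tokens the discovery tool calls would have cost
usage_rate = 0.8         # fraction of sessions that actually use the context

value_tokens = tokens_saved * usage_rate - tokens_injected
print(f"token-only value: {value_tokens:.0f}")                       # -134

# Break-even usage rate: on tokens alone, injection only pays above ~95%.
print(f"break-even usage rate: {tokens_injected / tokens_saved:.0%}")

# Illustrative latency credit (assumption): treat two avoided ~300 ms roundtrips
# as worth ~100 token-equivalents each, and the sign flips.
print(f"value with latency credit: {value_tokens + 2 * 100:.0f}")    # +66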

When High-Density Injection Wins

Inject upfront when:

  - The information is referenced in most sessions (70%+ usage rate)
  - The payload is compact (hundreds of tokens, not thousands)
  - Discovery would require multiple tool calls or add noticeable latency
  - The session is interactive and roundtrip delays are felt by the user

Real example: A session status summary (squad states, recent activity, active goals) gets referenced in nearly every interaction. The 870 tokens of upfront context save 2-3 tool calls and 500+ thinking tokens per session.

When It Loses

Avoid upfront injection when:

  - The content is large (thousands of tokens) but rarely referenced
  - Only specific task types need it, so the average usage rate is low
  - An on-demand query can return exactly the slice that is needed

Real example: Full project history (10,000+ tokens) when most sessions only need recent commits. Better to query on demand.

Practical Measurements

We measured actual token costs for common context injections:

Context Type          Chars     Tokens    Use Case
Minimal status        ~800      ~200      Session hooks (always)
Full status           3,493     ~870      Most sessions
Full dashboard        9,367     ~2,340    Deep analysis
Project CLAUDE.md     8,000     ~2,000    Always relevant
Full codebase index   40,000+   ~10,000   Rarely needed upfront

At session start, ~970 tokens of context represents less than 1% of a 200K token window. That’s cheap insurance against discovery overhead.
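
To build a table like this for your own contexts, a character-based estimate is usually close enough. The sketch below assumes the rough ~4 characters per token ratio visible in the measurements above; for exact counts, use your model provider's tokenizer.

# Rough token estimate from character count. The 4 chars/token ratio is a
# heuristic consistent with the table above, not an exact tokenizer.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return round(len(text) / chars_per_token)

# Hypothetical example: a 3,493-character status file lands near ~870 tokens.
with open("status.md") as f:   # hypothetical file name
    print(estimate_tokens(f.read()))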

Progressive Density Strategy

The optimal approach isn’t “inject everything” or “inject nothing”—it’s progressive density based on relevance probability.

# Level 1: Always inject (100% relevance)
# Squad names, activity flags, critical state
~200 tokens

# Level 2: High-relevance sessions (70%+)
# Full status, recent goals, active work
~870 tokens

# Level 3: Deep analysis (specific tasks only)
# Full history, complete memory, all context
~2,340 tokens

The key insight: don’t optimize globally—optimize per session type.
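
One way to encode progressive density is a small per-session-type selector. The session types, token budgets, and part names below are illustrative, mirroring the three levels above rather than any fixed API.

# Map each session type to an injection level; budgets echo the levels above.
LEVELS = {
    "hook":      {"budget": 200,  "parts": ["squad_names", "activity_flags"]},
    "standard":  {"budget": 870,  "parts": ["full_status", "recent_goals", "active_work"]},
    "deep_dive": {"budget": 2340, "parts": ["full_history", "memory", "all_context"]},
}

def build_context(session_type: str, sources: dict[str, str]) -> str:
    """Assemble upfront context from the parts configured for this session type."""
    level = LEVELS.get(session_type, LEVELS["standard"])
    return "\n\n".join(sources[p] for p in level["parts"] if p in sources)

The point is the per-type budget: a lightweight hook never pays the deep-dive cost, and a deep-analysis session never starts context-starved.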

The Extended Formula

A more complete value calculation:

Value = (tokens_saved × usage_rate × sessions)
      - (tokens_injected × sessions)
      + (latency_saved_ms × value_per_ms)
      + (thinking_tokens_saved × usage_rate)

Where:

  - tokens_saved: discovery tokens the agent avoids spending (tool calls, file reads)
  - usage_rate: fraction of sessions that actually reference the injected context
  - sessions: number of sessions in the window you're optimizing for
  - latency_saved_ms: roundtrip time avoided by skipping discovery calls
  - value_per_ms: how much a millisecond of user-facing latency is worth to you
  - thinking_tokens_saved: tokens the agent no longer spends deciding what to look up

For interactive sessions, latency dominates. A 500ms roundtrip feels slow. For background automation, token cost dominates.
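
Expressed as code, the extended formula looks like the sketch below. The example call reuses the 870/920-token case from earlier over a hypothetical 100-session window; the per-millisecond valuation is an assumption you would calibrate for your own workload.

def injection_value(
    tokens_saved: float,          # discovery tokens avoided per session
    usage_rate: float,            # fraction of sessions referencing the context
    sessions: int,                # sessions in the measurement window
    tokens_injected: float,       # upfront cost paid every session
    latency_saved_ms: float,      # total roundtrip time avoided over the window
    value_per_ms: float,          # worth of 1 ms, in token-equivalents
    thinking_tokens_saved: float, # total decision overhead avoided over the window
) -> float:
    return (
        tokens_saved * usage_rate * sessions
        - tokens_injected * sessions
        + latency_saved_ms * value_per_ms
        + thinking_tokens_saved * usage_rate
    )

# Assumed example: 100 sessions, ~400 ms and ~500 thinking tokens saved per session.
print(injection_value(920, 0.8, 100, 870,
                      latency_saved_ms=100 * 400,
                      value_per_ms=0.5,
                      thinking_tokens_saved=100 * 500))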

Applying This to Agent Design

Prompt Engineering

Structure prompts with relevance-aware sections:

## Context (Always Relevant)
{minimal_state}

## Extended Context (If Needed)
{full_state if complex_task else "Use tools to query"}

## Task-Specific Context
{injected only for matching task types}
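
A minimal sketch of assembling such a prompt, assuming hypothetical minimal_state and full_state strings and a simple complexity flag:

def build_prompt(task: str, minimal_state: str, full_state: str,
                 complex_task: bool) -> str:
    # Always-relevant context is injected unconditionally.
    sections = [f"## Context (Always Relevant)\n{minimal_state}"]
    # Extended context only when the task is likely to need it;
    # otherwise point the agent at its tools.
    extended = full_state if complex_task else "Use tools to query project state."
    sections.append(f"## Extended Context (If Needed)\n{extended}")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)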

Tool Descriptions

High-density descriptions for frequently-used tools pay off:

{
  "name": "search_codebase",
  "description": "Semantic search across all source files. Returns top 10 matches with surrounding context. Use for: finding implementations, understanding patterns, locating related code. Prefer over file reads when location unknown."
}

The longer description (~50 tokens) pays for itself by saving the thinking tokens the agent would otherwise spend deciding which tool to use.

Memory Loading

Load memory progressively:

Session start: Active goals, recent decisions (500 tokens)
On research task: Full topic memory (2,000 tokens)
On complex analysis: Everything relevant (5,000+ tokens)

Don’t load the full knowledge base for a simple commit message.
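
A sketch of that loading policy, assuming a hypothetical memory store keyed by scope; the token figures in the comments echo the levels above.

def load_memory(memory: dict[str, str], task_type: str) -> str:
    """Load memory progressively: small always, larger only when the task needs it."""
    parts = [memory["active_goals"], memory["recent_decisions"]]  # ~500 tokens, always
    if task_type == "research":
        parts.append(memory["topic_memory"])                      # ~2,000 tokens
    elif task_type == "complex_analysis":
        parts.append(memory["full_knowledge_base"])               # 5,000+ tokens
    return "\n\n".join(parts)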

The Counter-Intuitive Insight

Teams optimizing for token minimization often create slower, more expensive agents.

Why? An agent that doesn’t have context:

  1. Spends tokens deciding what context it needs
  2. Spends latency calling tools to discover
  3. May miss relevant information due to incomplete discovery
  4. Repeats discovery across sessions

An agent with appropriate upfront context:

  1. Starts working immediately
  2. References injected information without tool calls
  3. Completes tasks faster with fewer total tokens
  4. Maintains coherence across interactions

The goal isn’t minimum tokens—it’s maximum value per token.

Measurement Framework

To optimize your own context strategy:

  1. Track usage rates: Log which injected context actually gets referenced
  2. Measure discovery costs: Count tool calls that inject context
  3. Time sessions: Compare task completion with different context levels
  4. Calculate total tokens: Include thinking, discovery, and injected tokens

The numbers will tell you where to optimize.
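
A lightweight way to get those numbers is to log one record per session and compute usage rates offline. The record fields and file name below are illustrative.

import json
from dataclasses import dataclass, asdict

@dataclass
class SessionRecord:
    session_id: str
    injected_tokens: int        # context pushed at session start
    injected_referenced: bool   # did the agent actually use it?
    discovery_tool_calls: int   # tool calls spent rediscovering state
    discovery_tokens: int
    wall_clock_s: float

def log_session(record: SessionRecord, path: str = "context_metrics.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def usage_rate(path: str = "context_metrics.jsonl") -> float:
    rows = [json.loads(line) for line in open(path)]
    return sum(r["injected_referenced"] for r in rows) / len(rows)

Once the measured usage rate for a context type drops well below the ~70% threshold, that is the signal to demote it to on-demand discovery.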

Key Takeaways

  1. Relevance rate matters more than token count
  2. 70%+ usage rate justifies upfront injection
  3. Latency savings often exceed token costs
  4. Progressive density beats one-size-fits-all
  5. Measure actual usage, don’t assume

The formula that matters:

Value = (tokens_saved × relevance) - tokens_injected + latency_savings

Optimize for relevance first, then density.


Note: Token estimates based on Claude tokenization. GPT and other models may vary by 10-20%. The principles apply regardless of specific counts.
