The Cost Trap
The instinct is natural: tokens cost money, so minimize tokens.
This thinking is wrong.
Token minimization optimizes for the wrong metric. The right question isn’t “how few tokens can we use?” It’s “how much value can we create per token?”
The Real Math
Consider two agents solving the same problem:
Agent A (Token-Minimized)
- Tokens used: 10,000
- Cost: $0.15
- Success rate: 60%
- Rework required: 40% of tasks
- Effective cost per successful outcome: $0.25
Agent B (Context-Rich)
- Tokens used: 25,000
- Cost: $0.375
- Success rate: 95%
- Rework required: 5% of tasks
- Effective cost per successful outcome: $0.40
Agent A looks cheaper. But failed tasks have to be re-run, so factor in rework:
- Agent A: $0.25 + (0.4 × $0.25) = $0.35 effective cost
- Agent B: $0.40 + (0.05 × $0.40) = $0.42 effective cost
Still close. Now factor in human time:
- Rework requires 15 minutes of human review per failure
- At $100/hour engineering cost, that’s $25 per rework
- Agent A effective cost: $0.35 + (0.4 × $25) = $10.35
- Agent B effective cost: $0.42 + (0.05 × $25) = $1.67
Agent B costs about 6x less when you include human time.
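The arithmetic is worth keeping as a reusable helper. A minimal sketch in Python, using the illustrative figures above (the $25 rework figure is 15 minutes at $100/hour):

```python
def effective_cost(token_cost: float, success_rate: float,
                   human_cost_per_rework: float) -> float:
    """All-in cost per successful task: tokens, retries, and human review."""
    cost_per_success = token_cost / success_rate         # amortize failed runs
    failure_rate = 1 - success_rate
    with_rework = cost_per_success * (1 + failure_rate)  # re-run failed tasks once
    return with_rework + failure_rate * human_cost_per_rework

HUMAN_REVIEW = 25.0  # 15 minutes at $100/hour

print(effective_cost(0.15, 0.60, HUMAN_REVIEW))   # Agent A: ~$10.35
print(effective_cost(0.375, 0.95, HUMAN_REVIEW))  # Agent B: ~$1.66
```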
The lesson: Token cost is noise. Outcome cost is signal.
Value Per Token
The metric that matters:
Value Per Token = Business Outcome Value / Tokens Consumed
For a customer service agent:
- Resolved ticket value: $50 (cost of human agent resolution avoided)
- Tokens consumed: 15,000
- Cost at $10/M tokens: $0.15
- Value per token: $50 / 15,000 = $0.0033
For every token, you’re generating $0.0033 in value. Against $0.15 of token cost for $50 of value, that’s roughly a 333x return.
When value per token is high, the optimization target is throughput, not token reduction.
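Both metrics are one-liners; a sketch using the customer service numbers above:

```python
def value_per_token(outcome_value: float, tokens: int) -> float:
    """Dollars of business value generated per token consumed."""
    return outcome_value / tokens

def return_on_token_spend(outcome_value: float, tokens: int,
                          price_per_m_tokens: float) -> float:
    """Outcome value divided by what the tokens actually cost."""
    return outcome_value / (tokens / 1_000_000 * price_per_m_tokens)

print(value_per_token(50, 15_000))            # ~$0.0033 per token
print(return_on_token_spend(50, 15_000, 10))  # ~333x
```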
Model Selection Economics
Different models have different economics:
| Model | Input Cost/M | Output Cost/M | Best For |
|---|---|---|---|
| Claude Opus 4.1 | $15 | $75 | Complex reasoning, high-stakes |
| Claude Sonnet 4 | $3 | $15 | General production work |
| Claude Haiku 3 | $0.25 | $1.25 | High volume, simple tasks |
| GPT-4o | $2.50 | $10 | Multimodal, fast iteration |
The Model Arbitrage Strategy
Use expensive models for high-value decisions, cheap models for low-value operations:
Example: Code Review Pipeline
- Haiku scans for obvious issues (syntax, formatting): $0.001 per file
- Sonnet reviews logic and architecture: $0.02 per file
- Opus analyzes security-critical code: $0.10 per file
Total cost for reviewing 100 files:
- Haiku (all 100): $0.10
- Sonnet (30 flagged): $0.60
- Opus (5 security-sensitive): $0.50
- Total: $1.20
vs. Opus for all 100 files: $10.00
8x cost reduction with equivalent or better outcomes, because each model operates where it provides the best value.
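The routing itself can be a few lines. A sketch of the escalation pipeline above; the per-file costs and flagging counts are the illustrative ones from this example, not real model rates:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    model: str
    cost_per_file: float

# Illustrative per-file costs from the pipeline above.
HAIKU = Tier("haiku", 0.001)   # syntax and formatting scan
SONNET = Tier("sonnet", 0.02)  # logic and architecture review
OPUS = Tier("opus", 0.10)      # security-critical analysis

def pipeline_cost(total_files: int, flagged: int, security_sensitive: int) -> float:
    """Every file gets the cheap pass; only flagged files escalate."""
    return (total_files * HAIKU.cost_per_file
            + flagged * SONNET.cost_per_file
            + security_sensitive * OPUS.cost_per_file)

print(pipeline_cost(100, 30, 5))  # $1.20
print(100 * OPUS.cost_per_file)   # $10.00 if Opus reviewed everything
```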
Batch vs Interactive Economics
Token costs vary by usage pattern:
Interactive (Real-time)
- User waiting for response
- Latency critical
- Often willing to pay a premium
- Example: Customer service chat
Batch (Background)
- No user waiting
- Can use prompt caching aggressively
- Can retry on cheaper models first
- Example: Nightly report generation
Anthropic’s prompt caching prices cached input tokens at roughly 10% of the base rate, cutting input costs by 90% for repeated context. For batch operations with consistent prompts, this is transformative:
Without caching: $15/M input tokens
With fully cached context: $1.50/M effective
A nightly batch job running 1M tokens of analysis:
- Without caching: $15
- With caching: $1.50
- Monthly savings: $405
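With the Anthropic Python SDK, enabling caching means marking the stable prefix of the prompt as cacheable. A sketch; the model ID and context file are placeholders for your own setup:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical stable corpus reused on every nightly run.
shared_context = open("analysis_context.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute your production model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": shared_context,
        # Mark the large, unchanging prefix as cacheable; later calls that
        # reuse it pay the discounted cache-read rate instead of full price.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Generate tonight's analysis report."}],
)
print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```

Note that writing to the cache carries a modest premium over the base input rate, so the savings accrue on the repeated reads.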
The Budget Framework
Structure AI spending like engineering infrastructure:
1. Fixed Costs (Predictable)
- Monitoring agents (scheduled runs)
- Batch processing (daily/weekly reports)
- Integration maintenance
Budget these like any infrastructure cost. They should decrease as a percentage of value over time.
2. Variable Costs (Scales with Value)
- Customer-facing agents (per interaction)
- Development assistance (per developer)
- Research tasks (per investigation)
These should scale with business value. More interactions = more cost = more value.
3. Investment Costs (Capability Building)
- Training data generation
- New agent development
- Model fine-tuning
Treat these as R&D. Expect payback over quarters, not days.
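A low-tech way to keep the three buckets honest is to tag every workload with its category at logging time. A sketch with hypothetical workload names:

```python
from collections import defaultdict

# Hypothetical mapping from workload to budget category.
CATEGORIES = {
    "nightly_reports": "fixed",
    "support_chat": "variable",
    "fine_tune_experiments": "investment",
}

def monthly_rollup(spend_events):
    """spend_events: iterable of (workload, dollars) tuples."""
    totals = defaultdict(float)
    for workload, dollars in spend_events:
        totals[CATEGORIES.get(workload, "variable")] += dollars
    return dict(totals)

print(monthly_rollup([("nightly_reports", 45.0), ("support_chat", 310.0),
                      ("fine_tune_experiments", 1200.0)]))
```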
Cost Anomaly Detection
Watch for these patterns:
Red Flags
| Pattern | Likely Cause | Action |
|---|---|---|
| Sudden 10x spike | Agent loop | Kill and investigate |
| Gradual daily increase | Context bloat | Review prompts |
| High variation between runs | Inconsistent inputs | Standardize |
| Cost increasing, outcomes flat | Model inefficiency | Re-evaluate model choice |
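These patterns are cheap to check automatically. A sketch that maps the red-flag table onto threshold rules; the thresholds are illustrative and assume at least a few days of history:

```python
from statistics import mean

def cost_alerts(daily_costs: list[float], daily_outcomes: list[float]) -> list[str]:
    """Flag cost patterns from the table above against a trailing baseline."""
    alerts = []
    baseline = mean(daily_costs[:-1])
    today = daily_costs[-1]
    if today > 10 * baseline:
        alerts.append("sudden spike: possible agent loop, kill and investigate")
    if all(a < b for a, b in zip(daily_costs, daily_costs[1:])):
        alerts.append("steady daily increase: review prompts for context bloat")
    if today > 1.2 * baseline and daily_outcomes[-1] <= mean(daily_outcomes[:-1]):
        alerts.append("cost up, outcomes flat: re-evaluate model choice")
    return alerts

print(cost_alerts([10, 11, 12, 14, 150], [40, 41, 40, 42, 41]))
```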
Optimization Triggers
When to optimize:
- Cost per outcome increasing for 3+ months
- Token efficiency < 50% of comparable tasks
- Human intervention rate > 20%
When NOT to optimize:
- Outcomes are excellent
- Cost is within budget
- System is stable
Don’t optimize working systems for marginal token savings.
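The triggers reduce to a guard clause; a sketch, with the inputs assumed to come from your own metrics pipeline:

```python
def should_optimize(months_cost_per_outcome_rising: int,
                    token_efficiency_vs_comparable: float,
                    human_intervention_rate: float) -> bool:
    """Encode the optimization triggers above; stable, in-budget systems stay untouched."""
    return (months_cost_per_outcome_rising >= 3
            or token_efficiency_vs_comparable < 0.50
            or human_intervention_rate > 0.20)
```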
The Agent Cost Stack
For multi-agent systems, costs compound:
Total Cost = Σ (Agent Tokens × Agent Cost Rate)
+ Coordination Overhead
+ Retry Costs
+ Human Escalation Costs
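As a sketch, with per-agent rates in dollars per million tokens and an assumed blended rate for coordination tokens:

```python
def multi_agent_cost(agent_tokens: dict[str, int],
                     rate_per_m: dict[str, float],
                     coordination_tokens: int,
                     retry_cost: float,
                     escalation_cost: float,
                     coordination_rate: float = 3.0) -> float:
    """Sum per-agent token cost, then add coordination, retries, escalations.
    coordination_rate is an assumed blended $/M for handoff and routing tokens."""
    agent_cost = sum(tokens / 1e6 * rate_per_m[name]
                     for name, tokens in agent_tokens.items())
    return (agent_cost
            + coordination_tokens / 1e6 * coordination_rate
            + retry_cost
            + escalation_cost)

# Two agents plus the "typical" 10,000 tokens of handoff overhead from below:
print(multi_agent_cost({"researcher": 40_000, "writer": 20_000},
                       {"researcher": 3.0, "writer": 3.0},
                       coordination_tokens=10_000,
                       retry_cost=0.05, escalation_cost=0.0))  # ~$0.26
```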
Coordination Overhead
Each agent handoff costs tokens:
- Handoff context: 500-2,000 tokens per handoff
- Routing decisions: 200-500 tokens per decision
- Result summarization: 300-1,000 tokens per summary
For a 5-agent workflow with 4 handoffs:
- Minimum overhead: 4 × 1,000 = 4,000 tokens
- Typical overhead: 4 × 2,500 = 10,000 tokens
Design implication: Fewer, more capable agents often beat many specialized agents on cost efficiency.
When Multi-Agent Is Worth It
Multi-agent systems cost more in tokens but can deliver more value:
| Scenario | Single-Agent | Multi-Agent | Winner |
|---|---|---|---|
| Simple task | Lower cost | Higher cost | Single |
| Complex research | Lower quality | Higher quality | Multi |
| Time-critical | Sequential | Parallel | Multi |
| High-stakes | One perspective | Multiple perspectives | Multi |
The break-even: Multi-agent systems win when the quality/speed improvement exceeds the coordination overhead.
Pricing Your AI Products
If you’re building AI products, token economics affect pricing strategy:
Cost-Plus Pricing
Price = (Token Cost × Markup) + Fixed Costs
- Predictable margins
- Doesn’t capture value
- Vulnerable to model price drops
Value-Based Pricing
Price = (Value Delivered × Capture Rate)
- Aligns with customer outcomes
- Higher margins possible
- Requires measuring value
Outcome-Based Pricing
Price = (Per Outcome Fee)
- Customer pays for results
- You bear efficiency risk
- Highest trust signal
Sierra’s $100M ARR came from outcome-based pricing: pay per resolved ticket. They capture value, not cost.
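The three schemes are easy to compare side by side; illustrative numbers only:

```python
def cost_plus(token_cost: float, markup: float, fixed: float) -> float:
    return token_cost * markup + fixed

def value_based(value_delivered: float, capture_rate: float) -> float:
    return value_delivered * capture_rate

def outcome_based(resolved: int, fee_per_resolution: float) -> float:
    return resolved * fee_per_resolution

# The same resolved support ticket under each scheme:
print(cost_plus(0.15, 3.0, 0.10))  # $0.55 -- stable margin, value left uncaptured
print(value_based(50.0, 0.20))     # $10.00 -- price tracks delivered value
print(outcome_based(1, 8.00))      # $8.00 -- charged only when the ticket resolves
```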
Key Metrics Dashboard
Track these for AI economics:
| Metric | Formula | Target |
|---|---|---|
| Cost per outcome | Total tokens × rate / successful outcomes | Decreasing |
| Value per token | Business value / tokens consumed | Increasing |
| Model efficiency | Outcomes / model-specific tokens | Compare across models |
| Retry rate | Retried tasks / total tasks | < 10% |
| Human escalation rate | Human interventions / total tasks | < 5% |
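All of these metrics fall out of a per-task log. A sketch with a hypothetical record shape:

```python
from dataclasses import dataclass

@dataclass
class RunLog:
    """Hypothetical per-task record emitted by your agent runtime."""
    tokens: int
    succeeded: bool
    retried: bool
    escalated: bool
    business_value: float

def dashboard(runs: list[RunLog], rate_per_m: float) -> dict[str, float]:
    total_tokens = sum(r.tokens for r in runs)
    successes = sum(r.succeeded for r in runs) or 1  # avoid divide-by-zero
    return {
        "cost_per_outcome": total_tokens / 1e6 * rate_per_m / successes,
        "value_per_token": sum(r.business_value for r in runs) / total_tokens,
        "retry_rate": sum(r.retried for r in runs) / len(runs),
        "human_escalation_rate": sum(r.escalated for r in runs) / len(runs),
    }
```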
Summary
Token economics principles:
- Optimize for value, not cost: Value per token matters more than tokens consumed
- Model arbitrage: Use expensive models for high-value decisions only
- Batch when possible: Caching and batching dramatically reduce costs
- Measure outcomes: Cost per successful outcome, not cost per token
- Include human time: Token costs are often <5% of total cost including human time
The goal isn’t to minimize tokens. It’s to maximize value per token.
Related: Context Window Economics covers when to inject context vs discover via tools. Profitable AI analyzes companies successfully monetizing AI products.