Pillar: cost-optimization | Date: April 2026
Scope: Actual benchmarks on prompt caching effectiveness (TTL, hit rate, cache_read vs fresh tokens on Max). Model selection ROI: Opus 4.7 vs Sonnet 4.6 vs Haiku 4.5 for specific task types with real numbers. MCP manifest hygiene: always-on vs on-demand activation cost difference. Subagent dispatch ROI — when it pays to spawn vs inline. Fast mode tradeoffs. Claude Max-specific patterns: rate limits, weekly caps, burst behavior. No marketing — real numbers only.
Sources: 18 gathered, consolidated, synthesized.
The full optimization stack reduces Claude costs by 87% without any quality tradeoff: an unoptimized Opus-only workflow running 1,289 requests costs $300.90; applying default caching (84% hit rate), model routing to Sonnet for 80% of tasks, and session hygiene brings the same workload to $73.56 — and moving to a 95% cache hit rate with Haiku subagents and MCP deferral takes a $1,320/month baseline to $176/month, achievable entirely through configuration changes.[3][18]
The structural engine behind Claude cost management is a single ratio: cache reads cost 0.1x base input price, while 5-minute cache writes cost 1.25x — a 12.5x price differential between the two states.[4] The 1-hour TTL tier widens this to 20x (2.0x write, 0.1x read). Two production datasets quantify what this means in practice: an 84% baseline cache hit rate across 1,289 requests and 100.3M tokens reduced per-request cost from $0.23 to $0.06 — a 76% reduction.[2] Optimized sessions reach 95% hit rates by turn 20, compressing the cost of a 400,000-token Sonnet request from $1.20 to $0.18.[3] Critically: cache hits do not reduce latency — "cached requests weren't any faster" — making caching a pure cost lever with no quality or speed upside beyond economics.[2]
A documented infrastructure regression beginning March 6, 2026 converted Claude Code sessions from 1-hour TTL (the intended behavior, active February 1–March 5) back to 5-minute TTL, generating $949.08 in avoidable costs across 119,866 API calls from a single developer over four months.[1] The transition was abrupt: on March 5, 5-minute tokens were 0% of cache writes; by March 8, they were 83%; by March 21, 93%.[16] The February 2026 baseline — when 1-hour TTL was correctly active — showed only 1.1% cost waste; March 2026 waste climbed to 25.9%. The 17.1% total waste ratio was identical across Sonnet and Opus because waste is driven entirely by the token split between TTL tiers, not per-token price. Neither the regression nor its cost impact was communicated to users; the only reliable detection method is monitoring cache_creation_input_tokens vs cache_read_input_tokens in API response usage fields.[1][16]
MCP tool definitions inject full parameter schemas at every conversation turn, not once at session start — making always-on MCP configurations a per-message cost multiplier that scales with both tool count and session length. A heavy configuration with 120 tools across 25 turns generates 362,350 schema tokens before any user content is processed; at enterprise scale (200 tools, 1,000 daily conversations at Sonnet pricing), MCP schema overhead reaches $21,000/month before a single user prompt is served.[8] On-demand (deferred) tool loading eliminates 96–99% of this overhead: the same 120-tool, 25-turn session drops from 362,350 tokens to 5,181 tokens with no reduction in task pass rate.[8] A Google Drive-to-Salesforce workflow was measured at 150,000 tokens with always-on MCP and 2,000 tokens with on-demand loading — a 98.7% reduction.[9] Official Claude Code documentation confirms MCP tools are deferred by default; the cost problem arises when developers override this behavior or accumulate servers without auditing which are used.
Model selection is the single highest-leverage cost intervention after caching. The pricing cascade is stark: Opus 4.7 output costs $25/MTok — 5x Haiku 4.5's $5/MTok — making Opus output the dominant cost driver in any agentic session.[6] In production optimization data, routing 80% of tasks to Sonnet instead of Opus contributed 20% of total cost savings; routing 60% of agent calls to Haiku cuts total cost by approximately 40%.[18][7] Haiku 4.5's SWE-bench Verified score of 73.3% (vs Sonnet 4.5's 77.2%) confirms the quality gap is real but narrow for software tasks — a 3.9 percentage point difference that does not justify 3–5x cost on the majority of a workload.[7] The opusplan preset captures the optimal routing automatically: Opus 4.7 for planning (superior reasoning), Sonnet 4.6 for execution (code generation), delivering Opus-quality architecture at Sonnet execution cost.[15] Setting CLAUDE_CODE_SUBAGENT_MODEL=haiku routes all subagents to Haiku 4.5, saving 83% of subagent compute cost versus Opus.[15]
Subagent dispatch cost is superlinear, not proportional. At 3 active agents the cost multiplier is 3–4x; at 10 agents it is 8–12x; at 49 agents it reaches 30–50x, with an estimated session cost of $8,000–$15,000.[10] A documented real-world runaway incident sustained 887,000 tokens per minute for 2.5 hours across 49 simultaneous agents, with initialization overhead of 5,000–15,000 tokens per subagent and 1,000–5,000 tokens per inter-agent collaboration event.[10] Official documentation warns that agent teams use approximately 7x more tokens than standard sessions when teammates run in plan mode.[11] The cache mitigation factor (90%+ of tokens served as cache reads at $0.50/MTok for Opus) partially offsets these costs — list-price calculations overstate actual subagent cost by roughly an order of magnitude when cache hit rates are high. There is one non-obvious ROI case for subagent dispatch: routing verbose output (10k–50k tokens) to a subagent isolates it from the main context, protecting the cache prefix and keeping downstream turns cheaper.[18]
Claude Max quota accounting is opaque and non-deterministic. A documented 5-hour session consumed 65% of the Max $100/month quota with token counts that do not reconcile with published pricing. Analysis of 5,396 mitmproxy-captured API calls from the same account on the same day found tokens-per-percent-quota ranging from 2,517 to 18,531,900 — a 1,500x variance.[14] Three confirmed quota inflators exist: silent auto-upgrade to Opus (generates ~3x more tokens for identical tasks than Sonnet), a version 2.1.51 billing tier change that routed 200K+ context calls to an "Extra Usage" tier, and the 5m TTL regression forcing full-price cache re-creation.[14] Peak-hour throttling, introduced March 26, 2026 and affecting ~7% of users, caused Max 5x limits to exhaust in ~90 minutes on normal tasks and a single Max 20x prompt to jump from 21% to 100% quota consumption.[5] Anthropic's formula for converting tokens to quota percentage is not published; the API usage fields are the only reliable cost signal — the dashboard reflects quota depletion under an unknown weighting function.
Fast mode delivers up to 2.5x higher output tokens per second at a 6x price premium ($30/$150 input/output vs $5/$25 standard for Opus 4.6), is currently in beta requiring waitlist access, and applies to Opus 4.6 only — not Opus 4.7, Sonnet, or Haiku.[12] Time to first token is unchanged; only output generation speed improves. A critical technical constraint makes fast mode a committed session choice: falling back from fast to standard speed causes a prompt cache miss, because fast and standard modes do not share cached prefixes.[12] A session that starts fast and hits fast-mode rate limits then rebuilds its entire context at cache write rates — incurring 6x cost on fast-mode turns plus forced cache recreation at 1.25x write rate on all subsequent standard turns. Fast mode's ROI is positive only when developer wait time has direct monetary value and the developer is the active bottleneck; overnight autonomous runs, batch processing, and unattended pipelines receive zero benefit from faster output generation.[12]
Practitioners should treat the optimization stack as ordered, not optional: default caching achieves 76% cost reduction automatically; adding deliberate model routing (opusplan + CLAUDE_CODE_SUBAGENT_MODEL=haiku) adds 20–40% on top; session hygiene (/compact and /clear between tasks) contributes a documented 25% of remaining savings; MCP deferral eliminates 96–99% of schema overhead at zero quality cost. The monitoring prerequisite for all of this is tracking cache_creation_input_tokens vs cache_read_input_tokens per session — the March 2026 TTL regression went undetected for weeks and cost individual developers hundreds of dollars because this signal was not watched. Infrastructure regressions (TTL changes, thinking deletion, billing tier reclassifications) have occurred silently with no proactive user notification; treating cost as a stable baseline and only auditing on billing shock is the pattern that generated the $949 avoidable-cost case study documented here.
Anthropic's prompt caching system operates on two TTL tiers with radically different economics: a 5-minute tier (default, 1.25x write / 0.1x read) and a 1-hour tier (2.0x write / 0.1x read).[4] Cache reads are priced at 0.1x base input — making them 10x cheaper than uncached input and up to 12.5x cheaper than a 5-minute cache write on Sonnet 4.6.[4]
| Model | Base Input | 5m Write (1.25x) | 1h Write (2.0x) | Cache Read (0.1x) | Write/Read Ratio |
|---|---|---|---|---|---|
| Claude Opus 4.7[6] | $5/MTok | $6.25/MTok | $10.00/MTok | $0.50/MTok | 12.5x (5m) / 20x (1h) |
| Claude Sonnet 4.6[6] | $3/MTok | $3.75/MTok | $6.00/MTok | $0.30/MTok | 12.5x (5m) / 20x (1h) |
| Claude Haiku 4.5[6][7] | $1/MTok | $1.25/MTok | — (N/A) | $0.10/MTok | 12.5x (5m only) |
Prompts below the minimum threshold are processed without caching — no error is returned, savings simply do not apply.[4]
| Threshold | Models |
|---|---|
| 4,096 tokens | Claude Opus 4.7, 4.6, 4.5; Claude Haiku 4.5[4] |
| 2,048 tokens | Claude Sonnet 4.6; Claude Haiku 3.5[4] |
| 1,024 tokens | Claude Sonnet 4.5, 4, 3.7; Claude Opus 4.1, 4[4] |
The official invalidation hierarchy is Tools → System → Messages: changes at any level cascade downward and invalidate all subsequent layers.[4] Eight confirmed triggers exist in practice:[18]
/clear commands/compact compression/rewind operationsAdditional invalidation triggers from official documentation: changing tool_choice parameter, adding/removing images, enabling extended thinking, switching between fast and standard speed modes.[4]
| Cacheable ✓ | Not Cacheable ✗ |
|---|---|
| Tool definitions, system messages, text messages | Thinking blocks with explicit cache_control |
| Images, documents, tool use blocks, tool results | Empty text blocks |
| Thinking blocks in prior assistant turns | — |
Cache economics are visible in API response usage fields. Total input cost = cache_read_input_tokens (0.1x) + cache_creation_input_tokens (1.25x or 2.0x) + input_tokens (1.0x).[4] Up to 4 explicit cache breakpoints per request are supported; static content should be positioned first, with the breakpoint placed on the last identical block across requests.[4]
Key finding: The write-to-read price differential (12.5x for 5m tier, 20x for 1h tier) is the single most important number in Claude cost management. Every optimization decision flows from this ratio: cache hits are not just cheaper — they are an order of magnitude cheaper than cache writes.[4]
Two independent production datasets provide ground truth on achievable cache hit rates: an 84% baseline study from unoptimized usage and a 95% ceiling from deliberately optimized sessions. Both use actual Claude Code API calls with real workloads.
| Metric | Value |
|---|---|
| Total requests | 1,289[2][17] |
| Total input tokens | 100.3M |
| Cached tokens | 84.2M (84%) |
| Uncached tokens | 16.1M (16%) |
| Cost without caching | $300.90 |
| Cost with 84% cache hit | $73.56 |
| Savings | $227.34 (76% reduction) |
| Per-request cost | $0.23 → $0.06 |
Latency note: "Cached requests weren't any faster. Caching reduces costs, not latency." The model still processes all tokens; speed is unchanged by cache hits.[2][17]
By turn 20 of an optimized session, 95%+ of input tokens are typically served from cache.[3] A tracked case study moved from ~70% to ~92% hit rate through targeted interventions:[3]
| Metric | Before Optimization | After Optimization | Reduction |
|---|---|---|---|
| Input tokens per turn | 350k–450k | 80k–120k | ~73% |
| Daily cost | $60 | $8 | 87% |
| Monthly cost | $1,320 | $176 | 87% |
| Intervention | % of Total Savings |
|---|---|
| More precise prompts | 35%[18] |
Timely /compact//clear |
25%[18] |
| Routing 80% of tasks to Sonnet instead of Opus | 20%[18] |
Top 3 contributors representing 80% of measured savings. Remaining ~20% attributed to miscellaneous factors including verbose output reduction and tool selection.[18]
For a 400,000-token request at Sonnet 4.6 pricing:[3][18]
Key finding: The difference between 70% and 95% cache hit rate is not incremental — it is the difference between $1,320/month and $176/month for the same workload. Optimization effort compounds: the first 84% is free from Claude Code's default caching behavior; the gap to 95% requires deliberate session hygiene.[3][18]See also: Autonomous Build Loop (session structure that maximizes cache prefix stability)
A silent infrastructure regression beginning March 6, 2026 shifted Claude Code sessions from 1-hour cache TTL (intended behavior) back to 5-minute TTL, generating $949.08 in avoidable costs across 119,866 API calls from a single developer across two machines and two accounts over 4 months.[1][16]
| Phase | Dates | Observed Behavior | Evidence |
|---|---|---|---|
| 1 | Jan 11–31, 2026 | 5-minute TTL only | ephemeral_1h absent or zero |
| 2 | Feb 1–Mar 5, 2026 | 1-hour TTL only (intended) | ephemeral_5m = 0; 33 consecutive days, two machines |
| 3 | Mar 6–7, 2026 | Transition — 5m tokens reappear | First 5m tokens logged after 33-day absence |
| 4 | Mar 8–Apr 11, 2026 | 5-minute TTL dominant | 5m tier surges to 83%–93% of all cache tokens |
| Date | 5m Tokens | 1h Tokens | 5m Share |
|---|---|---|---|
| 2026-03-05 | 0.00M | 6.55M | 0% — last clean 1h day |
| 2026-03-06 | 0.29M | 0.22M | 57% — first 5m reappearance |
| 2026-03-07 | 4.56M | 0.50M | 90% |
| 2026-03-08 | 16.86M | 3.44M | 83% |
| 2026-03-21 | 21.37M | 1.70M | 93% |
| Month | API Calls | Actual Cost | Cost at 1h TTL | Overpaid | Waste % |
|---|---|---|---|---|---|
| Jan 2026 | 2,639 | $78.99 | $37.54 | $41.45 | 52.5% |
| Feb 2026 | 27,220 | $1,120.43 | $1,108.11 | $12.32 | 1.1% ← 1h working |
| Mar 2026 | 68,264 | $2,776.11 | $2,057.01 | $719.09 | 25.9% |
| Apr 2026 | 21,743 | $1,193.01 | $1,016.78 | $176.23 | 14.8% |
| Total | 119,866 | $5,561.17 | $4,612.09 | $949.08 | 17.1% |
| Month | Actual Cost | Cost at 1h TTL | Overpaid | Waste % |
|---|---|---|---|---|
| Jan 2026 | $131.65 | $62.57 | $69.08 | 52.5% |
| Feb 2026 | $1,867.38 | $1,846.85 | $20.53 | 1.1% |
| Mar 2026 | $4,626.84 | $3,428.36 | $1,198.49 | 25.9% |
| Apr 2026 | $1,988.35 | $1,694.64 | $293.71 | 14.8% |
| Total | $9,268.97 | $7,687.17 | $1,581.80 | 17.1% |
The 17.1% waste ratio is identical for Sonnet and Opus because it is driven entirely by the 5m/1h token split ratio — the same proportion of tokens fell on the wrong TTL tier regardless of per-token price.[1] Model switching cannot mitigate a TTL regression; only session hygiene (shorter idle gaps, fewer invalidating operations) helps.
Key finding: The February 2026 baseline — 1.1% waste when 1h TTL was functioning — demonstrates the theoretical optimum. The March 2026 regression to 25.9% waste was not user error; it was an infrastructure change that cost individual developers hundreds of dollars with no notification. Monitoringcache_creation_input_tokensvscache_read_input_tokensin API responses is the only way to detect this class of regression.[1][16]
As of April 2026, three production models cover the cost-capability tradeoff: Opus 4.7 ($5/$25 input/output), Sonnet 4.6 ($3/$15), and Haiku 4.5 ($1/$5).[6] Opus 4.7 output pricing at $25/MTok is the dominant cost driver in agentic sessions with high turn counts.
| Feature | Opus 4.7 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|
| Input price | $5/MTok | $3/MTok | $1/MTok |
| Output price | $25/MTok | $15/MTok | $5/MTok |
| Input vs Haiku ratio | 5x more | 3x more | baseline |
| Extended Thinking | No | Yes | Yes |
| Adaptive Thinking | Yes | Yes | No |
| Context Window | 1M tokens | 1M tokens | 200k tokens |
| Max Output | 128k tokens | 64k tokens | 64k tokens |
| Knowledge Cutoff | Jan 2026 | Aug 2025 | Feb 2025 |
Haiku 4.5 narrows the quality gap significantly against Sonnet 4.5:[7]
| Benchmark | Haiku 4.5 | Sonnet 4.5 | Gap |
|---|---|---|---|
| SWE-bench Verified | 73.3% | 77.2% | 3.9 pp |
| OSWorld Computer Use | 50.7% | — | Highest Haiku score ever |
Haiku 4.5 delivers "capabilities comparable to Sonnet 4" — the model Anthropic considered cutting-edge at launch in August 2025.[7]
| Task | Scale | Haiku 4.5 | Sonnet 4.5 | Savings |
|---|---|---|---|---|
| Chatbot | 100K sessions/month | $2,250/month | $6,750/month | $4,500 (3x) |
| Agent w/ Extended Thinking | 10K tasks/month | $700/month | $2,100/month | $1,400 (3x) |
| Batch Processing w/ Caching | 100 requests | $1.56 | $4.67 | $3.11 (3x) |
Multi-agent routing insight: Routing 60% of agent calls to Haiku cuts total cost by approximately 40%.[7]
| Plan | Default Model |
|---|---|
| Max and Team Premium | Opus 4.7 |
| Pro, Team Standard, Enterprise, Anthropic API | Sonnet 4.6 |
| Bedrock, Vertex, Foundry | Sonnet 4.5 |
Note: "Claude Code may automatically fall back to Sonnet if you hit a usage threshold with Opus."[15] This fallback is silent and unannounced in the UI.
| Effort Level | Relative Thinking Tokens | Recommended Use |
|---|---|---|
low |
Lowest | Scoped, latency-sensitive, non-intelligence-sensitive tasks |
medium |
Reduced | Cost-sensitive work |
high |
Balanced | Intelligence-sensitive minimum |
xhigh |
High | Best results for coding/agentic tasks (Opus 4.7 default as of v2.1.117) |
max |
Highest | Demanding tasks; session-only |
Note: Exact token reduction percentages per effort level are not disclosed by Anthropic. Labels reflect relative thinking token consumption; actual savings vary by task complexity and model.[15]
Default as of v2.1.117: xhigh on Opus 4.7, high on Opus 4.6 and Sonnet 4.6.[15]
The opusplan model preset routes plan mode to Opus 4.7 (superior reasoning for architecture) while automatically switching to Sonnet 4.6 for execution mode (cheaper code generation) — using a standard 200K context window for the Opus plan phase, not the full 1M.[15] This delivers Opus planning intelligence at Sonnet execution cost.
| Variable | Effect | Estimated Savings |
|---|---|---|
CLAUDE_CODE_SUBAGENT_MODEL=haiku |
Routes all subagents to Haiku 4.5 | 83% vs Opus for subagent compute[15] |
CLAUDE_CODE_EFFORT_LEVEL=medium |
System-wide effort level reduction | Significant thinking token reduction[15] |
MAX_THINKING_TOKENS=8000 |
Caps thinking token budget | Prevents runaway extended thinking costs[11] |
Thinking tokens are billed as output tokens, and the default budget can reach tens of thousands of tokens per request.[11] At Opus 4.7 output pricing ($25/MTok), a 40,000-token thinking budget = $1.00 per request in thinking alone — before any response tokens are counted.
| Model | Input | Output | Status |
|---|---|---|---|
| Opus 4.7 | $5/MTok | $25/MTok | Current |
| Opus 4.6 | $5/MTok | $25/MTok | Deprecated June 15, 2026 |
| Opus 4.1 | $15/MTok | $75/MTok | Deprecated — 3x more expensive than current |
| Sonnet 4 | — | — | Deprecated June 15, 2026 |
Key finding: The 20% savings from routing 80% of tasks to Sonnet (from the optimization breakdown) and the 40% savings from routing 60% of agent calls to Haiku confirm that model routing — not prompt engineering — is the highest-leverage single intervention available. A developer on Max (defaulting to Opus) running all tasks through Opus is spending 3–5x more than necessary for the majority of their workload.[18][7]
MCP tool definitions are injected at every conversation turn — not only at session start — making always-on MCP configurations a persistent per-message cost multiplier. A heavy always-on configuration can consume 66,000 tokens before the first user interaction.[8]
| MCP Server | Tools | Schema Overhead (tokens) | Per-Tool Average |
|---|---|---|---|
| Jira | varies | ~17,000 | — |
| mcp-omnisearch | 20 | ~14,100 | 705 |
| GitHub MCP | varies | 8,000–12,000 | — |
| Gmail | 7 | ~2,640 | 377 |
| Playwright | 22 | ~3,442 | 156 |
| Codex | 2 | 610 | 305 |
| SQLite | 6 | 385 | 64 |
Notable outlier: a single gmail_create_draft tool definition consumed 820 tokens due to verbose schema alone.[8]
| Scenario | Tools Active | Turns | Total Schema Tokens | Monthly Cost (1K conversations/day @ Sonnet) |
|---|---|---|---|---|
| Light | 30 | 15 | 54,450 | — |
| Moderate | 80 | 20 | 193,240 | — |
| Heavy | 120 | 25 | 362,350 | — |
| Enterprise | 200 | 25 | 358,425 | ~$21,000/month in schema overhead alone |
| Active Tools | Turns | Native MCP Tokens | mcp2cli Tokens | Savings |
|---|---|---|---|---|
| 30 | 15 | 54,525 | 2,309 | 96% |
| 80 | 20 | 193,240 | 3,871 | 98% |
| 120 | 25 | 362,350 | 5,181 | 99% |
| Tools Connected | Input Token Reduction | Task Pass Rate |
|---|---|---|
| 96 tools | 58% | 100% |
| 251 tools | 84% | 100% |
| 508 tools | 92% | 100% |
Reference case: Google Drive-to-Salesforce workflow reduced from 150,000 tokens → 2,000 tokens (98.7% reduction) via on-demand loading.[9]
MCP servers inject full tool schemas at every single message turn, not just at session start. At 100 tools × 20 turns = 2,000 schema injections per session — each including name, description, and full parameter schema. Complex enterprise tools (Jira, Salesforce) have schemas of 1,000–5,000 tokens each.[9]
"MCP tool definitions are deferred by default, so only tool names enter context until Claude uses a specific tool." — Official Claude Code documentation.[11]
Official prioritized strategies:[11]
gh, aws, gcloud) over MCP servers — no per-tool listing overhead/mcp/context to see what is consuming spaceKey finding: The "always-on" MCP configuration is not a configuration — it is an unintentional billing structure. At scale (200 tools, 1,000 daily conversations), MCP schema overhead alone reaches $21,000/month before a single user prompt is processed. Deferral (on-demand loading) eliminates 96–99% of this overhead with no task pass rate reduction, making it the highest-ROI zero-quality-tradeoff optimization available.[8][9]See also: New Tooling 2026 (which MCP servers to install); Heavy-User Configs (tuning recommendations)
Subagent dispatch introduces a 7x token multiplier (plan mode teams) and a nonlinear cost curve as agent count grows — but positive ROI is well-documented for specific task types when prompt cache reads absorb the overhead.
| Active Agents | Cost Multiplier | Estimated Hourly Cost |
|---|---|---|
| 1 | 1x | $3–$8 |
| 3 | 3–4x | $15–$40 |
| 10 | 8–12x | $50–$150 |
| 25+ | 15–25x | $200–$500 |
| 49 | 30–50x | $3,000–$6,000 |
A documented real-world incident of runaway subagent spawning:[10]
| Use Case | ROI Range | Why Positive |
|---|---|---|
| TypeScript verification | 150–370% | Replaces senior dev time |
| API documentation | 300–650% | High parallelization, low quality risk |
| Database migration | 213–400% | Deterministic task with clear success criteria |
"Over 90% of tokens in a typical heavy session are prompt cache reads at $0.50/MTok for Opus, which dramatically softens the apparent cost of subagent expansion."[10] List-price calculations of subagent cost are typically overstated by an order of magnitude when cache hit rates are high.
At 10 agents × 18,000 tokens/minute:[10][15]
| Model | Rate (tokens/min) | Per-Minute Cost | Hourly Cost | vs Opus |
|---|---|---|---|---|
| Opus 4.7 ($5/MTok) | 180,000 | $5.40/min | $324/hour | baseline |
| Haiku 4.5 ($1/MTok) | 180,000 | $1.08/min | $64.80/hour | 83% savings |
From official Claude Code documentation:[11]
A non-obvious use of subagents: preventing verbose output (10k–50k tokens) from entering the main context. This protects the main session's cache prefix, keeping downstream turns cheaper by maintaining a stable cache anchor.[18][11]
Key finding: Subagent cost is not linear — it is superlinear. At 10 agents the multiplier is 8–12x; at 49 agents it reaches 30–50x. The cache mitigation factor (90%+ cache reads) provides a real but bounded offset. The practical guideline from the data: 3 focused agents with CLAUDE_CODE_SUBAGENT_MODEL=haiku achieves the same throughput as 10 Opus agents at a fraction of the cost — Haiku subagents deliver 83% cost savings vs Opus for the same token volume.[10][15]
See also: Autonomous Build Loop (multi-CLI coordination architecture); Multi-CLI Coordination (parallel agent patterns)
Claude Max quota accounting is non-deterministic and undisclosed. A documented 1,500x variance in tokens-per-percent-quota across sessions from the same account exposes that Anthropic applies a dynamic weighting formula that is not published.[14]
| Plan | Monthly Price | 5-Hour Window | Weekly Cap |
|---|---|---|---|
| Free | $0 | ~40 messages/day | No Code access |
| Pro | $20 | ~45 prompts | 40–80 Sonnet hours |
| Max 5x | $100 | 50–800 prompts | Up to 480 Sonnet hours |
| Max 20x | $200 | 50–800 prompts | 40 Opus hours |
Anthropic implemented intentional peak-hour metering (announced March 26, 2026), causing faster 5-hour session limit consumption during high-demand periods.[5]
| Parameter | Value |
|---|---|
| Peak hours | Weekdays 5am–11am PT / 1pm–7pm GMT[5] |
| Users affected | ~7% of user base[5] |
| Weekly limits | Unchanged — only 5-hour window consumption rate changes[5] |
| Max 5x drain rate | Limits exhausted in ~90 minutes on normal tasks[5] |
| Max 20x single-prompt jump | 21% → 100% on a single prompt[5] |
| Pro drain rate | 5-hour sessions draining in 1–2 hours[5] |
A documented 5-hour session consumed 65% of Max $100/month quota with these token counts:[14]
The math does not reconcile with published pricing. From Issue #22435 (5,396 mitmproxy samples, same account, same day):[14]
| Measurement | Tokens per 1% Quota |
|---|---|
| Minimum | 2,517 |
| Maximum | 18,531,900 |
| Ratio | 1,500x variance |
For API rate limits: "only input_tokens + cache_creation_input_tokens count toward your ITPM limit" — cache reads do not count toward API rate limits.[14] Max subscription quota accounting may differ and is not disclosed by Anthropic.
| Team Size | Recommended TPM per User | Recommended RPM per User |
|---|---|---|
| 1–5 users | 200k–300k | 5–7 |
| 5–20 users | 100k–150k | 2.5–3.5 |
| 20–50 users | 50k–75k | 1.25–1.75 |
| 50–100 users | 25k–35k | 0.62–0.87 |
| 100–500 users | 15k–20k | 0.37–0.47 |
| 500+ users | 10k–15k | 0.25–0.35 |
Key finding: Anthropic does not publish its Max subscription quota formula. The 1,500x token-per-percent-quota variance demonstrates that quota consumption is not proportional to tokens processed — it is subject to opaque model-specific weighting and billing tier classification. Monitoring the API response usage fields (not the dashboard) is the only reliable signal for cost tracking; the dashboard reflects quota depletion under an unknown formula.[14]
Fast mode delivers up to 2.5x higher output tokens per second for Claude Opus 4.6 only, at a 6x price premium ($30/$150 input/output vs $5/$25 standard).[12] It is currently beta (research preview) and requires waitlist access.
| Pricing Component | Standard Opus 4.6 | Fast Mode Opus 4.6 | Multiplier |
|---|---|---|---|
| Input | $5/MTok | $30/MTok | 6x |
| Output | $25/MTok | $150/MTok | 6x |
Caching multipliers and data residency multipliers stack on top of fast mode pricing.[12]
/fast command or Ctrl+F — enabling the same 2.5x output TPS acceleration on the Opus 4.6 model for interactive sessions.[15]"Falling back from fast to standard speed will result in a prompt cache miss. Requests at different speeds do not share cached prefixes."[12]
Practical consequence: if a fast-mode session hits fast mode rate limits and falls back to standard, the entire cached context is invalidated — forcing full-price input token re-upload for the complete conversation history at standard rates. This is a compounding penalty: 6x cost for fast mode + forced cache recreation at write rate (1.25x) instead of read rate (0.1x).
| Use Case | Fast Mode Justified? | Reason |
|---|---|---|
| Interactive debugging (developer watching) | Yes | Developer wait time has direct opportunity cost |
| Rapid iteration in flow state | Yes | Flow state disruption has real productivity cost |
| Tight deadline, billable hours | Yes | Speed reduces human wait time with monetary value |
| Overnight autonomous agent runs | No | Nobody is waiting; speed adds zero value |
| Batch processing pipelines | No | Throughput, not latency, is the constraint |
| Unattended code review pipelines | No | No human wait time to reduce |
Key finding: Fast mode's 6x cost premium is justified only when developer wait time has monetary value and the developer is the bottleneck, not Claude. The cache incompatibility constraint (speed switches destroy cached prefixes) makes fast mode a committed choice for a session — starting fast and falling back to standard is the worst-case scenario: 6x cost for fast mode turns plus forced cache rebuild at write rates for all subsequent standard turns.[12]
Three simultaneous infrastructure issues compounded in March–April 2026, creating a cost crisis for power users: a cache TTL regression, a thinking deletion bug, and peak-hour throttling. Anthropic acknowledged all three in an official engineering postmortem on April 23, 2026 and reset usage limits for all subscribers.[13]
| Issue | Period | Root Cause | Quality Impact | Cost Impact |
|---|---|---|---|---|
| Reasoning Effort Downgrade | Mar 4 – Apr 7, 2026 | Default reasoning switched from high to medium (addressing UI freezing reports) |
Users reported Claude felt "less intelligent" | Lower: fewer thinking tokens generated |
| Thinking Deletion Bug | Mar 26 – Apr 10, 2026 | Implementation error caused reasoning history to be repeatedly cleared during sessions; sessions idle >1 hour triggered continuous deletion | Forgetfulness, repetition, odd tool choices | Higher: forced fresh cache writes on every effective session restart |
| System Prompt Verbosity Constraint | Apr 16–20, 2026 | Added "keep text between tool calls to ≤25 words" constraint | 3% quality drop confirmed by ablation testing on Opus 4.6 and 4.7 | Neutral — reduced output verbosity |
From March 2026, all three infrastructure problems coexisted and compounded:[13][1][5]
This explains why power users hit quota limits for the first time in March 2026, even without any change to their usage patterns: every session restart (from bug #2) forced a cache write (at 12.5x the read rate, worsened by bug #1) during peak hours (billed at accelerated quota rate by issue #3).[13][1][5]
Key finding: The March–April 2026 compound event demonstrates that infrastructure regressions — not user behavior — drove the first wave of power-user quota exhaustion. Monitoringcache_creation_input_tokensvscache_read_input_tokensin API responses would have flagged the TTL regression (Issue 1) within hours of onset. The thinking deletion bug (Issue 2) is detectable by tracking whether session restart costs increase mid-session. Neither regression was communicated to users proactively; the postmortem was published only after widespread public complaints.[13]
From official Claude Code documentation, prioritized by estimated impact:[11][15]
| Priority | Strategy | Mechanism | Estimated Impact |
|---|---|---|---|
| 1 | opusplan model preset |
Opus reasoning for planning, Sonnet cost for execution | 33–60% vs all-Opus[15] |
| 2 | CLAUDE_CODE_SUBAGENT_MODEL=haiku |
Routes all subagents to Haiku 4.5 | 83% subagent cost reduction vs Opus[15] |
| 3 | Route 60–80% of tasks to Sonnet/Haiku | Model routing by task complexity | 20–40% of total cost[18][7] |
| 4 | Keep MCP tools deferred (default behavior) | Tools load on-demand, not at every turn | 96–99% of schema overhead[8] |
| 5 | CLAUDE.md under 200 lines (official) / 500 lines (community) | Move specialized workflows to on-demand skills | Reduces always-on system prompt tokens[11] |
| 6 | Timely /compact//clear between tasks |
Prevents stale context accumulating | 25% of optimization savings[18] |
| 7 | Preprocessing hooks for verbose output | Grep errors before Claude reads logs (10k lines → hundreds) | Prevents massive context inflation[11] |
| 8 | Subagents for verbose output isolation | Verbose output stays in subagent context, not main cache prefix | Protects main session cache anchor[18] |
| 9 | Lower extended thinking budget | MAX_THINKING_TOKENS=8000 for simple tasks |
Prevents runaway thinking at $25/MTok output[11] |
| 10 | Prefer CLI tools over MCP servers | No per-tool listing overhead for gh, aws, gcloud |
Eliminates server schema injection entirely[11] |
| 11 | Plan mode before implementation | Forces architecture review before token-intensive implementation; prevents expensive re-work sessions | Avoids full session cost of incorrect implementations[11] |
| Signal | API Field | What It Reveals |
|---|---|---|
| TTL regression | cache_creation_input_tokens spike |
Sessions re-uploading context instead of reading cache — indicates 5m TTL regression or frequent invalidation |
| Cache hit rate | cache_read_input_tokens / (cache_read + cache_creation + input_tokens) |
Should be ≥80% for optimized sessions; <60% indicates structural invalidation problem |
| Thinking cost | output_tokens disproportionately high |
Extended thinking running uncapped — check MAX_THINKING_TOKENS |
| Subagent sprawl | Session-level token totals vs expected | Cascade spawning — 887k tokens/min is detectable if tracked per minute[10] |
Starting from unoptimized Opus-only usage ($300.90 per 1,289 requests baseline)[2], applying interventions in order:
| Intervention Applied | Cumulative Cost Reduction | Source |
|---|---|---|
| Default caching (84% hit rate) | 76% → $73.56 | [2] |
| + Route 80% tasks to Sonnet | Additional 20% of remaining | [18] |
| + Precise prompts + session hygiene | Additional 35–25% of remaining | [18] |
| + Optimize to 95% hit rate | Total: ~$176/month from $1,320/month | [3] |
| + Haiku subagents | 83% subagent cost reduction | [15] |
| + MCP deferral | 96–99% of schema overhead eliminated | [8] |
Key finding: The optimization stack is compounding, not additive. Default caching alone achieves 76% cost reduction. Adding model routing achieves the next 20%. Adding session hygiene achieves the next 25%. The fully optimized state ($176/month from a $1,320/month baseline) represents an 87% total reduction — achievable without any quality degradation, using only configuration changes and workflow adjustments documented in official Anthropic documentation.[3][18][11]