Context Continuity & Instant Resume

Executive Summary

Critical finding: Sub-3-second agent startup is achievable only below 50K stable tokens with a warm cache — cold cache at 500K tokens takes 35 seconds to first token, a 10× latency penalty that no session flag or memory MCP can overcome.^[3]^[11]

The 5-minute prompt cache TTL is the binding constraint in every sub-3s startup stack. With a warm cache, a 50K-token stable context achieves ~1.7s TTFT; crossing 500K on a warm cache still costs ~3.5s; cold at 500K hits 35 seconds. A 1M-token cold session extrapolates to 60–90 seconds — making 1M context antithetical to fast boot. The implication is structural: cache hit rate matters more than raw context size. A 50K-token context with a 95% cache hit rate outperforms — in both cost and latency — a 10K-token context that's cold on every request.^[11] Anthropic's prompt cache requires a minimum of 1,024 tokens to activate; below that threshold, cache writes actually regress TTFT by 10–18%.^[11]

CLAUDE.md is the single highest-leverage cost variable at session start because every token it contains loads before any user message, on every API call. Published guidance splits into two tiers: official docs cap it at <200 lines per file; practitioner guidance tightens to <500 tokens, with a practical template of "3–5 key rules + 3 file pointers" totaling ~200 tokens.^[2]^[13] A lean startup context budget — system prompt (~2,700 tokens), tool definitions (~16,800 tokens), 150-line CLAUDE.md (~1,500 tokens), MEMORY.md index (~400 tokens), and skill descriptions (~800 tokens for 10 skills) — totals ~3,000–5,000 tokens before the first user message.^[10]^[1] Path-scoped rules (`.claude/rules/` with `paths:` frontmatter) are the largest available saving: they load only when Claude reads a matching file, not at launch.

Skills have a fundamentally different cost model from CLAUDE.md. Only skill descriptions load at startup — each capped at 1,536 characters (~384 tokens); full content loads on invocation only.^[4]^[16] After auto-compaction, Claude re-attaches the most recently invoked skills up to a combined ceiling of 25,000 tokens, prioritizing recency and dropping older skills if budget is exceeded. The high-ROI skills for practitioners are `/simplify` (spawns 3 parallel review agents), `/batch` (parallelizes across 5–30 isolated worktree agents for large migrations), and proactive `/compact` before phase transitions.^[15] Dead-weight patterns: embedding background facts directly in SKILL.md bodies, mixing process steps with reference docs, and leaving admin or destructive skills with model invocation enabled when they should carry `disable-model-invocation: true` — which removes the description from context entirely.^[4]

Session resume performance is directly tied to transcript size, not to feature flags. The JSONL append-only architecture chain-patches UUID anchors across every prior message on every resume call: sessions under 50K tokens resume in under 3 seconds; sessions over 100K tokens introduce meaningful latency from deserialization. A structured handoff note under 5K tokens loads near-instantly; a raw 100K+ transcript does not.^[12]^[5] Three open bugs in `--resume` (`#3138`, `#43696`, `#46445`) cause recurring regressions in context restoration fidelity as of 2026, making automated resume unreliable in CI/automation contexts — `--resume` with explicit session names is the safer flag for pipelines.^[5]

The compaction reserve buffer was reduced from ~45,000 tokens to ~33,000 tokens in early 2026, yielding ~12K additional usable tokens and pushing the auto-compaction trigger from ~77% to 83.5% context fill.^[1]^[14] The compaction mechanism runs five sequential stages before full summarization: budget reduction → snip → microcompact → context collapse → auto-compact.^[12] The critical quality problem: "the model is at its least intelligent point when compacting" because context rot accumulates as fill approaches the limit. Proactive `/compact` at 70–75% fill — before starting a new work phase, not reactively — produces materially better summaries.^[8] The `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` env var shifts the trigger threshold; StatusLine hooks that receive `remaining_percentage` provide the only live context monitor available for implementing this proactively.^[14]

Native auto-memory uses LLM-based file header scanning — not vector similarity — to select relevant topic files. The MEMORY.md index loads at every session start (hard cap: first 200 lines or 25KB); individual topic files load on demand when Claude judges them relevant.^[10] This design is intentional: inspectability and reliability over semantic richness. The tradeoff is speed — LLM retrieval takes seconds vs. milliseconds for vector MCPs. `mcp-memory-service` (doobidoo) with SQLite-vec achieves 5ms retrieval latency and scores DevBench Recall@5 of 91.1%,^[6] with a SessionEnd auto-harvest hook and hybrid retrieval combining vector similarity, BM25, and a typed knowledge graph. `claude-mem` (thedotmack) adds a 3-layer progressive disclosure pattern — index queries at ~50–100 tokens, full detail at ~500–1,000 tokens — claiming ~3–10× token savings over loading full session history.^[7] Neither vector MCP addresses multi-machine state in a zero-config way; only `mcp-memory-service` supports optional hybrid local+cloud sync via Milvus or Zilliz.

Structured handoff protocols solve the dual failure modes of agent resume: information overload (full history triggers "Lost in the Middle" degradation) and aggressive compression (stripping evidence removes verification ability). XTrace's structured briefing model — decisions/constraints always present, artifacts and preferences queryable, timeline as a chronological log — reduced average 15-minute handoff delays to seconds and shrinks context reconstruction from 5–10 turns to 1–2 turns.^[9] The `/clear` + distilled brief pattern is the manual equivalent: more effort per session, but it eliminates context rot entirely and gives the practitioner control over what survives the boundary.

The 1M context window addresses capacity, not startup speed, and carries three concrete costs. Accuracy degrades measurably: Opus 4.6 drops from 93% to 76% retrieval accuracy between 256K and 1M tokens; Sonnet 4.5 scores 18.5% MRCR at 1M — effectively unreliable and not recommended for long-context work.^[3] Crossing the 200K token threshold triggers a pricing step function — ALL tokens price at 2× input and 1.5× output (Sonnet 4.6: $3.00/M → $6.00/M input), and cache read/write costs also double.^[3] A 400K-token session costs roughly 5× per turn vs. a post-compaction 80K session, making intentional session management frequently cheaper than relying on window capacity.

Practitioners who need sub-3s startup must stack five mechanisms simultaneously: keep the stable prefix (CLAUDE.md + memory index + tool definitions) under 50K tokens; inject dynamic session recovery data after the stable prefix to preserve cache hits; use semantic retrieval (5ms vector MCP) rather than full history injection; keep structured handoffs under 5K tokens; and manage session discipline with proactive `/compact` before phase transitions and `/clear` on task pivots. None of these mechanisms alone is sufficient — cold cache at 500K tokens is a 35-second wall that no session flag overrides, and the 5-minute TTL means sessions with natural pauses will periodically pay full cold-start cost regardless of context size. The unsolved problem in 2026 is cache warming: there is no mechanism to pre-warm the cache before a session starts, and the recurring `--resume` regressions make raw transcript resume unreliable for production agent loops.

Section 1: CLAUDE.md Size, Cost Tradeoffs & Loading Hierarchy

Every token in a CLAUDE.md file is loaded before any user message is processed, making it the single highest-leverage cost variable in session startup. A 5,000-token CLAUDE.md consumes 5,000 tokens on every API call with zero user interaction.^[2]^[13] Published targets converge on strict size limits with different granularity: the official Claude Code docs specify <200 lines per file,^[10] while practitioner guidance tightens this to <500 tokens with a practical template of "3–5 key rules + 3 file pointers" (~200 tokens total).^[2]^[13]

Size Targets: Official vs. Practitioner

Loading Hierarchy (Official)

Source	Size Target	Rationale
Official Claude Code Docs^[10]	<200 lines per file	Longer files reduce model adherence, not just context budget
buildtolaunch.substack.com^[2]^[13]	<500 tokens	Explicit token budget; 200 tokens with "5 rules + 3 pointers" template
Practical minimum^[13]	~200 tokens	"3–5 key rules + 3 file pointers" — approximately 200 tokens total

CLAUDE.md files load by walking UP the directory tree from the working directory. The five-tier cascade, in order of loading:^[10]

Subdirectory CLAUDE.md files load on demand when Claude reads files in those subdirectories — NOT at launch. This is the primary token-saving behavior in multi-project repositories.^[10]

Delivery Mechanism and Context Drift

CLAUDE.md is delivered as a user message after the system prompt, not as the system prompt itself.^[10] This has a critical implication: in long sessions, the model deprioritizes earlier CLAUDE.md instructions in favor of recent conversation history — a phenomenon documented as "context drift."^[16] Skills (loaded only when invoked) are more robust to context drift than CLAUDE.md baseline injection.^[16]

Token-Saving Techniques

Typical Startup Context Budget

Technique	Mechanism	Token Impact	Source
HTML comment stripping	`<!-- notes -->` stripped before injection	Zero cost for maintainer notes	^[10]
@Import syntax	References external files that expand at load	Organizes but does NOT reduce context; max 5-hop depth	^[10]
Path-scoped rules	`.claude/rules/` with `paths:` frontmatter	Load only when matching file is read; largest available saving	^[10]
Skills for procedures	Move playbooks out of CLAUDE.md into `.claude/skills/`	Pay only on invocation, not every turn	^[4]

Component	Typical Size	Notes
System prompt^[1]	2,700 tokens (1.3% of 200K)	Fixed Anthropic baseline
System tools^[1]	16,800 tokens (8.4%)	Tool definitions; largest non-user component
CLAUDE.md (150 lines)^[10]	~1,500 tokens	Scales linearly with file size
MEMORY.md index (100 entries)^[10]	~400 tokens	200-line / 25KB cap enforced
Skill descriptions (10 skills)^[4]	~800 tokens	Descriptions only; full content loads on invocation
User skills (raw session data)^[1]	1,000 tokens (0.5%)	Observed baseline
Total startup overhead (lean)	~3,000–5,000 tokens	Before first user message

Path-scoped rules architecture is the official recommendation for sub-10K startup overhead: keep core always-on rules in CLAUDE.md (<200 lines), language/area-specific guidance in .claude/rules/, and procedures in skills.^[10]

Section 2: Context Buffer Mechanics and Auto-Compaction

The compaction reserve buffer was reduced from ~45,000 tokens (22.5% of 200K) to ~33,000 tokens (16.5%) in early 2026 — an undocumented change that delivers ~12K additional usable tokens before compaction triggers.^[1]^[14] Compaction now fires at ~83.5% context usage (previously 77–78%), yielding approximately 167K usable tokens in a 200K window before automatic summarization begins.^[14]

Five Graduated Compaction Layers

Claude Code does NOT jump directly to full summarization. VILA-Lab's systematic analysis documents five sequential compression stages:^[12]

Buffer Controls Available

Context Window Construction Order

Layer	Name	Action
1	Budget Reduction	Reduces tool output verbosity
2	Snip	Removes oldest non-critical messages
3	Microcompact	Compresses recent messages
4	Context Collapse	Read-time, non-destructive
5	Auto-Compact	Last resort — full summarization

Control	Mechanism	Effect	Source
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE`	Env var (1–100)	Shifts when compaction fires relative to total window	^[1]^[14]
`sonnet[1m]`	Model selection	1M context window; dramatically pushes back compaction threshold	^[14]
Manual `/compact`	User command	Proactive summarization while context quality is high; accepts steering	^[14]
StatusLine hook	Hook receives `remaining_percentage`	Only live context monitor available; enables proactive compaction triggers	^[14]
`CLAUDE_CODE_MAX_OUTPUT_TOKENS`	Env var	Controls output length — does NOT change compaction buffer (common misconception)	^[14]

Nine ordered sources load into the context window on session start (VILA-Lab systematic analysis):^[12]

Session Architecture: Append-Only JSONL

Claude Code uses append-only JSONL transcripts across three channels: global prompt history, subagent sidechains, and transcript logs.^[12] Chain-patching at read time uses UUID anchors (headUuid/anchorUuid/tailUuid) — nothing is destructively edited on disk.^[12] Permissions are never restored on resume; trust is re-established per session as a deliberate security design choice.^[12]

Section 3: Skills System — ROI Analysis: High Value vs. Dead Weight

Skills have a fundamentally different cost model from CLAUDE.md: "Unlike CLAUDE.md content, a skill's body loads only when it's used, so long reference material costs almost nothing until you need it."^[4] This makes skills the primary mechanism for keeping session startup cheap while preserving access to complex procedures.

Skills Cost Model

Skills Architecture: Location Options

Invocation Control Matrix

Scenario	What Loads	Token Cost	Source
Regular session (not invoked)	Skill descriptions only	Up to ~384 tokens per description (1,536-character cap^[4]^[16]); total skill-description budget capped at 1% of context window (~2,000 tokens at 8,000-character fallback)	^[4]^[16]
Regular session (invoked)	Full skill content	Full file size (up to 500 lines)	^[4]
Post-compaction	Most recently invoked skills, first 5K tokens each	Max 25,000 tokens total across all re-attached skills	^[4]
Subagent with preloaded skills	Full skill content at startup	Higher than regular session; different cost model	^[4]

Scope	Path	Applies to
Enterprise	Managed settings	All org users
Personal	`~/.claude/skills/<skill-name>/SKILL.md`	All your projects
Project	`.claude/skills/<skill-name>/SKILL.md`	This project only
Plugin	`<plugin>/skills/<skill-name>/SKILL.md`	Where plugin is enabled

Frontmatter	User can invoke	Claude can invoke	Description in context
(default)	Yes	Yes	Always in context
`disable-model-invocation: true`	Yes	No	NOT in context (description also removed)
`user-invocable: false`	No	Yes	Always in context

Token-saving insight: disable-model-invocation: true removes the skill description from context entirely — useful for rarely-used deployment or admin skills to reduce baseline overhead.^[4]

Size Limits and Budget Constraints

High-ROI Skills (Practitioner Assessment)

Low-Value / Dead Weight Patterns

Dynamic Context Injection (Power Feature)

Limit	Value	Impact if Exceeded	Source
Max SKILL.md length	500 lines	Degraded performance	^[4]
Description truncation	1,536 characters per entry	Keywords stripped; Claude misses invocation triggers	^[4]^[16]
Total description budget	1% of context window OR 8,000 characters (fallback)	Descriptions shortened across all skills to fit	^[4]^[16]
Budget override	`SLASH_COMMAND_TOOL_CHAR_BUDGET` env var	Raises limit; use when many skills needed	^[4]

Skill	ROI Assessment	Optimal Trigger	Source
`/simplify`	Highest ROI — spawns 3 parallel review agents; catches unused imports, redundant variables, shared logic	After refactoring or AI-generated code	^[15]
`/batch`	Maximum impact for large-scale changes — parallelizes across 5–30 independent units using isolated worktree agents	Large migrations	^[15]
`/review`	Complementary correctness validation — bugs, logic errors, edge cases; pairs with /simplify	Before commit/PR	^[15]
`/compact`	Essential for session sustainability — proactive compression before new work phases	Before phase transition, not at context limit	^[15]

Pattern	Problem	Fix	Source
Background facts / style guides in SKILL.md body	Bloats file; context cost every invocation	Move to supporting files in skill directory	^[4]
Skills that should be CLAUDE.md rules	Wrong abstraction; pay invocation overhead for always-on content	Static always-on context belongs in CLAUDE.md	^[4]
Large reference docs embedded directly	5,000-token skill body loaded on every invocation	Reference from SKILL.md; load only when explicitly needed	^[4]
Skills with process steps + background context mixed	"Packing skills with background context bloats the file and degrades performance"	Skills contain process steps only	^[16]
`/loop` skill	Limited real usage; useful only for specific monitoring scenarios	Situational — use only when actually polling	^[15]
`/debug` skill	Situational troubleshooting only — “niche skill, but invaluable when you need it”; not for routine use	Keep but invoke only for specific debugging scenarios; not a session-start default	^[15]
Admin/destructive skills with model invocation enabled	Risk of unintended invocation; always in context budget	Set `disable-model-invocation: true`	^[4]

The !`command` syntax in skill files executes shell commands at invocation time — output replaces the placeholder before Claude sees the skill content.^[4] This enables sub-second injection of live data (git diffs, test results, PR comments) at skill invocation rather than baking stale data into the session start:

This pattern decouples "what's always in context" from "what's injected when needed" — a key technique for sub-3s startup with full situational awareness.^[4]

Skills vs. CLAUDE.md Decision Rule

Content Type	Mechanism	Cost Profile	Source
Facts, rules, patterns (always relevant)	CLAUDE.md	Pay every turn	^[4]
Procedures, playbooks, workflows	Skills	Pay only when invoked	^[4]
Large reference docs	Supporting files in skill directory	Pay only when explicitly loaded via skill	^[4]

Skills are also hot-reloaded — edits take effect within the current session without restart. Creating a new top-level skills/ directory requires restart.^[4]

Section 4: Session Continuity Flags: --continue, --resume, /compact, /clear

Claude Code provides four mechanisms for session continuity with distinct tradeoffs: --continue (auto-resume latest), --resume (targeted resume), /compact (summarize-in-place), and /clear (fresh start with manual brief). Each addresses a different failure mode in the context lifecycle.^[5]^[8]

--continue (-c) vs. --resume (-r)

Flag	Behavior	Ideal Use Case	Source
`--continue` (`-c`)	Auto-loads most recent conversation in current directory without user interaction	Single active task, single project	^[5]
`--resume` (`-r`)	Interactive picker or explicit name/ID selection across sessions	Multiple parallel projects; targeted resume	^[5]

What gets restored with either flag: complete message history, all tool results (files read, commands executed, code modifications), full conversation context.^[5]

--resume Interactive Session Picker Controls

Session Storage Format

Key	Action
Up/Down arrows	Navigate sessions
Enter	Resume selected session
P	Preview content
R	Rename session
/	Search/filter
B	Filter by Git branch

Sessions are grouped by Git repository, including worktrees. No automatic cleanup or expiration — indefinite retention by default.^[5]

/compact vs. /clear: Official Framework

When to Use /compact

When to Use /clear

Five Official Decision Points After Each Task

Sub-3-Second Resume Performance Reality

Approach	Method	Tradeoff	Source
`/compact`	Model summarizes automatically; accepts steering via inline flags	Lossy but thorough; Claude decides relevance; can be guided with explicit focus	^[8]
`/clear`	User writes distilled brief manually; zero-rot fresh session	More effort per session but you control relevance; eliminates model deprioritization of old content	^[8]

#	Action	When
1	Continue (next message)	Same task, same context, no rot
2	/rewind (Esc Esc)	Jump back to previous message and retry
3	/clear + distilled brief	Task pivot, poisoned context, fresh start needed
4	/compact	Summarize session and keep working in same stream
5	Subagents	Delegate work to agent with clean context; preserve main session for decision-making

The session system was NOT designed for sub-3s resumption of large sessions. Performance characteristics by session size:^[5]^[12]

Known Issues and Regressions (As of 2026)

Section 5: Auto-Memory System (MEMORY.md + ~/.claude/memory/)

Session Size	Resume Latency	Bottleneck
<50K tokens	<3 seconds	None — fast
50K–100K tokens	Variable; can be significant	Serialization/deserialization of large tool outputs in history
>100K tokens	Meaningful latency	JSONL chain patching + UUID anchor resolution across full transcript
Structured handoff notes (<5K tokens)	Near-instant	JSONL deserialization of small payload is sub-second

Issue	Symptom	Status
GitHub #3138^[5]	`--resume` fails to maintain context after hitting usage/context limits	Known bug
GitHub #43696^[5]	`--continue` and `--resume` do not restore prior conversation context (regression in 2.1.x)	Patched but re-appears
GitHub #46445^[5]	`/continue` and `/resume` show all sessions across projects instead of current project only (regression in 2.1.101)	Regression
Non-interactive mode^[5]	`--continue` may create new sessions in CI/automation	Use `--resume` with explicit session names for reliable automation

Claude Code provides two official memory mechanisms with different persistence models and token costs. CLAUDE.md files hold instructions and rules (loaded in full every session); auto-memory holds Claude's learned patterns and decisions (indexed summary loaded at startup, details loaded on demand).^[10]

Two Official Memory Mechanisms

Directory Structure

Loading Behavior: Budget-Gated Index Pattern

Retrieval Mechanism: LLM File Header Scan vs. Vector MCP

Design Philosophy: Concise Index Pattern

Enable/Disable Controls

What Survives /compact

Section 6: Vector-Store Memory MCPs

Mechanism	Who Writes	What It Contains	Loaded Into
CLAUDE.md files^[10]	Developer	Instructions, rules, project context	Every session (full content)
Auto memory^[10]	Claude (agent-written)	Learnings, patterns, decisions, gotchas	Every session (first 200 lines or 25KB of MEMORY.md)

File	When Loaded	Token Budget
MEMORY.md^[10]	Every session start	First 200 lines OR first 25KB — hard cap
Topic files (e.g., debugging.md)^[10]	On demand — Claude reads when relevant	Full file size; only when Claude judges it necessary

Property	Native Auto-Memory	Vector-Store MCP
Retrieval method^[12]	LLM scans file headers; selects up to 5 relevant files	Embedding similarity + BM25
Retrieval speed	Seconds (LLM call required)	~5ms (local SQLite-vec)
Inspectability	Full (plain Markdown files, readable by human)	Partial (requires query tool)
External dependencies	None	Requires MCP server running
Auto-capture	No — Claude decides what's worth saving	Yes — SessionEnd hooks + lifecycle hooks
Token efficiency	200-line MEMORY.md cap; topic files on demand	Semantic filtering; only relevant context injected

Content	Survives /compact?	Mechanism
Project-root CLAUDE.md^[10]	Yes	Re-read from disk after compaction
Nested CLAUDE.md in subdirectories^[10]	Deferred	Re-loaded next time Claude reads a file in that subdirectory
Instructions given only in conversation^[10]	No	Must be added to CLAUDE.md to persist
MEMORY.md^[10]	Yes (disk-persisted)	Re-loaded at next session start

Two open-source tools provide vector-backed persistent memory for Claude Code: mcp-memory-service (doobidoo) targeting production AI agent pipelines, and claude-mem (thedotmack) targeting session capture and progressive disclosure within Claude Code workflows.^[6]^[7]

mcp-memory-service (doobidoo)

Architecture: Three Access Modes

Vector Storage Backends

Embedding Model

Mode	Interface	Use Case
REST API	15 endpoints	Framework-agnostic HTTP clients (LangGraph, CrewAI, AutoGen)
MCP Server	Native tool integration	Claude Desktop, Claude Code native integration
Remote MCP via HTTPS/OAuth	Browser-accessible	Browser-based claude.ai integration

Backend	Use Case	Notes
SQLite-vec (default)	Local, no external dependencies	5ms retrieval; recommended for Claude Code
Milvus (Lite / self-hosted / Zilliz Cloud)	Production scale	Managed or self-hosted options
Cloudflare Vectorize	Cloud deployment	Edge deployment
Hybrid local + cloud sync	Resilience / multi-machine	Addresses machine-local limitation of native auto-memory

All-MiniLM-L6-v2, 384-dimensional embeddings, running locally via ONNX — no external API dependencies, sub-millisecond embedding inference.^[6]

Hybrid Retrieval: Four Signals Combined

Performance Benchmarks

Claude Code Integration Path

Benchmark	Metric	Score
LongMemEval (500 questions)^[6]	Recall@5	80.4%
LongMemEval^[6]	NDCG@10	82.2%
LongMemEval^[6]	MRR	89.1%
DevBench (practical workflows)^[6]	Recall@5	91.1%
DevBench^[6]	Overall MRR	0.861
SQLite-vec retrieval speed^[6]	Retrieval latency	5ms — adds ~5–15ms total to startup including query formulation

Reliability fix in v10.40.3: Eliminated socket hang-ups via keepAlive: false and explicit Connection: close headers. Critical for production agent loops that restart frequently.^[6]

claude-mem (thedotmack)

Session Capture: Five Lifecycle Hooks

3-Layer Progressive Disclosure Pattern

Hook	When It Fires	What It Does
`SessionStart`^[7]	Session begins	Initialize session context
`UserPromptSubmit`^[7]	User sends message	Capture user intent
`PostToolUse`^[7]	After each tool call	Record tool usage and observations
`Stop`^[7]	Session paused	Checkpoint state
`SessionEnd`^[7]	Session ends	Final AI-powered compression and storage

The key architectural innovation for token efficiency — access detail at the granularity you need:^[7]

Layer	Tool	Tokens per Result	Use Case
1 — Index	`search`	~50–100 tokens	Find relevant observations; get IDs
2 — Timeline	`timeline`	~200–500 tokens (estimated; exact token cost not documented in source^[7] — estimate based on intermediate granularity between index and full-detail layers)	Chronological context around results
3 — Detail	`get_observations`	~500–1,000 tokens	Full details for filtered IDs only

Token efficiency claim: ~10x savings by filtering before fetching vs. loading full session history.^[7] Conservative re-estimate based on corpus analysis: ~3x reduction (50 observations × 500 tokens = 25K full load; vs. 3,750 token index + ~7,500 for 3–5 relevant observations).^[7]

Infrastructure

Full Comparison: Native vs. Vector MCP Options

Section 7: Baton/Handoff Protocols for Agent Resume

Property	Native Auto-Memory	mcp-memory-service	claude-mem
Retrieval method^[10]^[6]^[7]	LLM file header scan	Hybrid vector + BM25 + knowledge graph	Vector + FTS5
Retrieval speed^[12]^[6]^[7]	~Seconds (LLM call)	5ms	~Milliseconds
Token efficiency	200-line MEMORY.md cap; topic files on demand	Semantic filtering; decay + compression pipeline	3-layer progressive disclosure; ~3–10x claimed reduction
External dependencies	None	SQLite-vec or Milvus	Node.js 18+, Bun, Chroma
Inspectability	Full (plain Markdown)	Partial (query tool required)	Partial (web viewer available)
Auto-capture	No — Claude decides	SessionEnd auto-harvest hook	5 lifecycle hooks (every tool call)
Multi-machine support	No (machine-local)	Yes (Hybrid local+cloud or Zilliz)	No (local SQLite)
Production benchmarks	N/A	DevBench Recall@5: 91.1%	Not published

Agent context handoffs fail through two symmetric failure modes: information overload (passing complete message histories triggers "Lost in the Middle" effect) and aggressive compression (stripping away supporting evidence removes the ability to verify or extend prior analysis).^[9] Both modes degrade quality; neither achieves sub-3s productive startup.

Root Cause: Why Context Dumps Fail

The Structured Briefing Model: Four Information Layers

Practical Baton Format (Widely Adopted)

Failure Mode	Mechanism	Result	Source
Information Overload	Full message history passed; model buries critical signals	"Lost in the Middle" — performance degrades with large context dumps	^[9]
Aggressive Compression	Summarization strips evidence and reasoning chains	Receiving agent cannot verify or extend prior analysis	^[9]

Layer	Content	Availability
1	Decisions and non-negotiable constraints	Always present — inject at every session start
2	Actual artifacts (not summaries) for reference	Available on demand — query by ID
3	Accumulated preferences and patterns	Retrievable via query
4	Timeline context (what happened and when)	Queryable chronological log

Synthesized from XTrace research^[9] and Anthropic's /clear + brief guidance:^[8]

Performance Benchmarks

/clear + Manual Brief = Manual Baton Pass

Metric	Without Structured Handoff	With Structured Handoff	Source
Handoff delay	Average 15 minutes	Seconds	^[9]
Context reconstruction turns	5–10 turns to re-derive prior state	1–2 turns from structured baton	^[9]
Resume latency (JSONL)	Significant for >100K session transcripts	Near-instant for <5K structured notes	^[12]

The /clear + distilled brief pattern is the manual version of a structured baton pass:^[8]

Queryable Memory as Infrastructure

"When a new agent joins a workflow, it doesn't receive a dump. It queries the memory for a briefing."^[9] This architecture allows agents to access accumulated knowledge immediately, avoid cold starts, and avoid degradation through successive summarization rounds — the core problem with repeated /compact cycles.^[9]

Subagent Strategy for Context Isolation

Condition	Strategy
Only the conclusion needed (not intermediate output)	Use subagent — intermediate output stays in subagent context; only conclusion returns to parent
Output will be referenced again in main session	Keep in main session — subagent return loses the working context

Subagents keep main context clean by using isolated Claude instances that return only distilled results.^[16]

Section 8: Prompt Caching & Sub-3-Second Startup Architecture

The 5-minute prompt cache TTL is the fundamental constraint for sub-3s startup. With a warm cache (<5 min since last request), TTFT at 500K tokens is ~3.5 seconds. Cold cache at the same size takes 35–90 seconds — a 10–26x latency penalty.^[3]^[11]

TTFT Benchmarks by Context Size and Cache State

Context Size	Cache State	TTFT	Source
50K tokens	Warm (<5 min)	~1.7s	^[11]
50K tokens (GPT-4o baseline)	Warm, cached	1,699ms (down from 4,290ms)	^[11]
500K tokens	Warm (<5 min)	~3.5s	^[3]
500K tokens	Cold (>5 min)	~35s	^[3]
1M tokens (extrapolated)	Cold	60–90s	^[3]

Model note: Row 1 (~1.7s at 50K warm cache) is a Claude-inferred estimate extrapolated from cross-provider caching improvement data in [11]. Row 2 (1,699ms at 50K warm cache) is a direct GPT-4o measurement from arxiv 2601.06007^[11]; it is included as a cross-provider reference point. Rows 3–5 (500K and 1M) are Claude measurements from claudecodecamp.com^[3].

Cache Hit Rate Impact on Cost

Cache Minimum Threshold

The Cache Paradox

Model	Cost Reduction from Caching	Source
GPT-5.2^[11]	79–81%	^[11]
Claude Sonnet 4.5^[11]	78–79%	^[11]
Average across providers^[11]	41–80%	^[11]
TTFT improvement from caching^[11]	13–31% across providers	^[11]

Provider	Minimum Tokens for Cache	Below Minimum
Anthropic^[11]	1,024 tokens	TTFT regression of 10–18% (worse than no caching)
OpenAI^[11]	4,096 tokens	Cache does not activate

"Full context caching can paradoxically increase latency" when dynamic content like tool results gets cached without producing subsequent hits. Cache writes consume compute; if the cached content isn't reused, you pay write cost with no benefit.^[11]

Strategic insight: "System prompt only caching provides the most consistent benefits."^[11] Operational rule:

Optimal Sub-3s Startup Architecture

Content Type	Cache Strategy	Reason
CLAUDE.md + skills + static tools^[11]	Maximize cache stability — never change between calls	Stable prefix → consistent cache hit rate
Dynamic content (tool results, git status)^[11]	Do NOT cache; generate fresh each turn	Cached dynamic content rarely gets a hit; pays write cost for nothing
Session recovery data^[11]	Inject AFTER stable prefix	Dynamic injection after stable prefix does not break cache on prefix

Synthesized from six sources — five independent approaches that combine into a coherent stack:^[11]^[6]^[3]^[12]^[8]

Section 9: 1M Context Window — When to Use It

Step	Action	Target	Why
1^[11]	Keep stable context (CLAUDE.md + memory index + static tools) under 50K tokens	TTFT ~1.7s warm	Below 50K: consistent sub-2s warm TTFT
2^[11]	Inject dynamic session recovery data AFTER stable prefix	Cache hit on prefix preserved	Dynamic suffix doesn't break cache on the stable portion
3^[6]	Use semantic retrieval (5ms) to select what dynamic content to inject	~5–15ms memory retrieval overhead	Only relevant prior context injected, not full history
4^[3]	Stay under 200K tokens	No long-context price cliff	200K crossing doubles all per-token costs
5^[12]	Keep structured handoff (baton) under 5K tokens	Near-instant JSONL deserialization	Chain-patching small JSONL is sub-second; large transcripts are not
6^[6]	Store session summaries + key decisions in vector store at session end; semantic query on startup	Relevant prior context in 5ms	Avoids cold-start reconstruction; externalized memory survives /clear

The 1M context window addresses a different problem than sub-3s startup. It eliminates automatic summarization for most coding sessions but introduces a 200K token pricing cliff (2× per-token cost), a 5-minute cache TTL cold-restart penalty, and measurable accuracy degradation above 256K tokens.^[3]

Long-Context Pricing Cliff (200K Threshold)

When crossing 200K tokens, ALL tokens get premium pricing — not just those above the threshold. Step function, not linear scale:^[3]

Model	Standard Input	Long Context Input (2×)	Standard Output	Long Context Output (1.5×)
Opus 4.6	$5.00/M	$10.00/M	$25.00/M	$37.50/M
Sonnet 4.6	$3.00/M	$6.00/M	$15.00/M	$22.50/M

Infrastructure cost justification: long-context sessions require hundreds of gigabytes of GPU memory reserved per user and consume 25× more compute on cold starts.^[3]

Cache pricing at the 200K threshold (Opus 4.6): Cache reads double from $0.50/M (standard) to $1.00/M; cache writes from $6.25/M to $12.50/M — a 2× multiplier on both. This cache cost step-up is additive to the standard input/output price increase: sub-3s startup stacks relying on prompt caching face double cache overhead when crossing 200K tokens.^[3]

Accuracy Degradation at Long Context

Model	256K Context Accuracy	1M Context Accuracy	Metric
Opus 4.6^[3]	93%	76%	General retrieval accuracy; "lost in the middle" effect
Sonnet 4.5^[3]	N/A	18.5% MRCR score	Effectively unreliable at 1M; avoid entirely

Critical information should be at the beginning or end of context for best retrieval at long context lengths.^[3]

When 1M Context Is Justified

1M Window vs. Sub-3s Startup

Section 10: Integrated Decision Framework for Sub-3s Resume

Sub-3s productive startup requires stacking five independent mechanisms. No single tool achieves it alone; the architecture requires: stable prefix caching, lean startup context, structured handoff format, semantic memory retrieval for dynamic injection, and session lifecycle management aligned with the 5-minute cache TTL.

The Sub-3s Startup Stack

Context Management Decision Tree

Practitioners Who Have Achieved Sub-3s Startup

Use Case	1M Appropriate?	Reason
Single-shot analysis of large codebases/documents^[3]	Yes	Full corpus in context; no compaction interruption
Deep debugging requiring full context preservation^[3]	Yes	Compaction during debugging loses critical trace data
Multi-agent shared state^[3]	Yes	Eliminates need for aggressive handoff compression
Compliance work requiring exact citations^[3]	Yes	Source material must remain uncompacted
Routine coding sessions^[3]	No	Most sessions peak at 80–120K before compaction anyway
Sessions with frequent pauses^[3]	No	5-minute cache TTL → painful cold restarts (35–90s)
Sonnet 4.5 at 1M^[3]	No	18.5% MRCR — unreliable; avoid entirely

Layer	Mechanism	Target	Failure Mode Without It
Stable context^[11]^[10]	CLAUDE.md <500 tokens + path-scoped rules + lean MEMORY.md	<50K tokens stable prefix	Large stable prefix → TTFT >3.5s even warm
Cache hit rate^[11]	Keep stable prefix unchanged between requests; stay under 200K tokens	>90% cache hit on stable prefix	Cold cache at 500K = 35s startup
Session format^[12]	Structured baton <5K tokens instead of full transcript resume	Near-instant JSONL deserialization	100K+ JSONL transcript deserialization adds meaningful latency
Dynamic injection^[6]	Semantic retrieval (5ms) to select relevant prior context	5–15ms retrieval overhead	Full history injection bloats context; destroys cache hit rate
Session discipline^[8]^[15]	Proactive /compact before phase transitions; /clear on task pivots	Session stays under 100K tokens	Context rot + auto-compaction at worst intelligence point

Situation	Recommended Action	Source
Same task, clean context	Continue (send next message)	^[8]
Wrong direction, same task	/rewind (Esc Esc) to prior message	^[8]
Same task, starting new phase	Proactive /compact with steering before phase transition	^[15]^[8]
Task pivot or poisoned context	/clear + manual distilled brief (baton)	^[8]
Subtask where only conclusion needed	Subagent delegation; intermediate output stays in subagent context	^[8]^[16]
Resume after >5 min break	Expect cold cache; use structured baton (<5K tokens) for fastest productive start	^[3]^[12]
New session, need prior context	Vector MCP retrieval (5ms) OR MEMORY.md index + demand-load topic files	^[6]^[10]

Practitioner / Source	Stack Described	Key Metric
doobidoo/mcp-memory-service^[6]	SQLite-vec local + SessionEnd auto-harvest + semantic retrieval on startup	5ms retrieval; adds 5–15ms total startup overhead for memory layer
XTrace structured handoff^[9]	Layered briefing format + queryable memory infrastructure	15-minute handoff delays reduced to seconds; 5–10 turns reconstructions → 1–2 turns
claudefa.st (tool vendor)^[1]^[14]	CLAUDE_AUTOCOMPACT_PCT_OVERRIDE + StatusLine hook real-time monitoring	Compaction quality metric, not startup latency: proactive compaction before buffer limit preserves context quality; no before/after startup time reported
arxiv 2601.06007 (academic paper — GPT-4o measurement)^[11]	Stable prefix caching + dynamic suffix injection (cross-provider caching study, not a Claude Code deployment)	GPT-4o direct measurement: 50K warm cache TTFT = 1,699ms (down from 4,290ms); 13–31% TTFT improvement from caching. Included as cross-provider reference; not a Claude practitioner benchmark.

Note on corpus coverage: The entries above document tools and approaches from corpus sources (vendor documentation, academic papers). No independent practitioner case studies — developers who measured their own Claude Code cold-start latency, applied a config stack, and published before/after numbers — were found in raw_1.md through raw_16.md. To add such entries, search: "claude code startup latency optimization site:reddit.com OR site:github.com OR site:dev.to" and "claude code resume performance blog 2025 2026".

Gaps and Unsolved Problems

Areas where corpus data identifies open problems without current solutions:^[5]^[3]^[11]

Section 11: Plan Mode & Saved Plans — Pre-Positioned Context

Research gap — content removed pending source verification. This section originally cited sources [17]–[20], none of which appear in the consolidated corpus source index (raw_1.md–raw_16.md). Specific claims about “Ultraplan,” a Ctrl+G editor integration, plan approval-flow options, and plan persistence paths could not be verified against the corpus. The “Ultraplan” product name has no corpus anchor. All content relying on sources [17]–[20] has been removed per corpus verification policy.

To restore this section: fetch the official Claude Code plan mode documentation (the URL pattern from confirmed sources suggests code.claude.com/docs/en/plan-mode or the Anthropic docs equivalent), add the real URLs to the consolidated source index, and re-synthesize using only verified claims with proper citation numbers.

The EnterPlanMode and ExitPlanMode tools are confirmed real Claude Code API tools (they appear in the Claude Code tool declaration list). No other plan mode mechanics are documented in the consolidated corpus (raw_1.md–raw_16.md).

Executive Summary

Table of Contents

Section 1: CLAUDE.md Size, Cost Tradeoffs & Loading Hierarchy

Size Targets: Official vs. Practitioner

Loading Hierarchy (Official)

Delivery Mechanism and Context Drift

Token-Saving Techniques

Typical Startup Context Budget

Section 2: Context Buffer Mechanics and Auto-Compaction

Five Graduated Compaction Layers

Buffer Controls Available

Context Window Construction Order

Session Architecture: Append-Only JSONL

Section 3: Skills System — ROI Analysis: High Value vs. Dead Weight

Skills Cost Model

Skills Architecture: Location Options

Invocation Control Matrix

Size Limits and Budget Constraints

High-ROI Skills (Practitioner Assessment)

Low-Value / Dead Weight Patterns

Dynamic Context Injection (Power Feature)

Skills vs. CLAUDE.md Decision Rule

Section 4: Session Continuity Flags: --continue, --resume, /compact, /clear

--continue (-c) vs. --resume (-r)

--resume Interactive Session Picker Controls

Session Storage Format

/compact vs. /clear: Official Framework

When to Use /compact

When to Use /clear

Five Official Decision Points After Each Task

Sub-3-Second Resume Performance Reality

Known Issues and Regressions (As of 2026)

Section 5: Auto-Memory System (MEMORY.md + ~/.claude/memory/)

Two Official Memory Mechanisms

Directory Structure

Loading Behavior: Budget-Gated Index Pattern

Retrieval Mechanism: LLM File Header Scan vs. Vector MCP

Design Philosophy: Concise Index Pattern

Enable/Disable Controls

What Survives /compact

Section 6: Vector-Store Memory MCPs

mcp-memory-service (doobidoo)

Architecture: Three Access Modes

Vector Storage Backends

Embedding Model

Hybrid Retrieval: Four Signals Combined

Performance Benchmarks

Claude Code Integration Path

claude-mem (thedotmack)

Session Capture: Five Lifecycle Hooks

3-Layer Progressive Disclosure Pattern

Infrastructure

Full Comparison: Native vs. Vector MCP Options

Section 7: Baton/Handoff Protocols for Agent Resume

Root Cause: Why Context Dumps Fail

The Structured Briefing Model: Four Information Layers

Practical Baton Format (Widely Adopted)

Performance Benchmarks

/clear + Manual Brief = Manual Baton Pass

Queryable Memory as Infrastructure

Subagent Strategy for Context Isolation

Section 8: Prompt Caching & Sub-3-Second Startup Architecture

TTFT Benchmarks by Context Size and Cache State

Cache Hit Rate Impact on Cost

Cache Minimum Threshold

The Cache Paradox

Optimal Sub-3s Startup Architecture

Section 9: 1M Context Window — When to Use It

Long-Context Pricing Cliff (200K Threshold)

Accuracy Degradation at Long Context

When 1M Context Is Justified

1M Window vs. Sub-3s Startup

Section 10: Integrated Decision Framework for Sub-3s Resume

The Sub-3s Startup Stack

Context Management Decision Tree

Practitioners Who Have Achieved Sub-3s Startup

Gaps and Unsolved Problems

Section 11: Plan Mode & Saved Plans — Pre-Positioned Context

Sources