Home

Context Continuity & Instant Resume

Pillar: context-continuity | Date: April 2026
Scope: Head-to-head comparison of 2026 SOTA for sub-3s agent boot at full context: CLAUDE.md size/cost tradeoffs, Skills ROI analysis (which earn rent vs. dead weight), plan mode and saved plans, custom baton/handoff protocols, auto-memory (~/.claude/memory/, MEMORY.md indexing), --resume/--continue flags, vector-store memory MCPs (mcp-memory-service, etc.), embedding-based session retrieval, /compact vs /clear vs --resume tradeoffs. Practitioners who have actually solved sub-3s startup and their full stack with benchmarks.
Sources: 16 gathered, consolidated, synthesized.

Executive Summary

Critical finding: Sub-3-second agent startup is achievable only below 50K stable tokens with a warm cache — cold cache at 500K tokens takes 35 seconds to first token, a 10× latency penalty that no session flag or memory MCP can overcome.[3][11]

The 5-minute prompt cache TTL is the binding constraint in every sub-3s startup stack. With a warm cache, a 50K-token stable context achieves ~1.7s TTFT; crossing 500K on a warm cache still costs ~3.5s; cold at 500K hits 35 seconds. A 1M-token cold session extrapolates to 60–90 seconds — making 1M context antithetical to fast boot. The implication is structural: cache hit rate matters more than raw context size. A 50K-token context with a 95% cache hit rate outperforms — in both cost and latency — a 10K-token context that's cold on every request.[11] Anthropic's prompt cache requires a minimum of 1,024 tokens to activate; below that threshold, cache writes actually regress TTFT by 10–18%.[11]

CLAUDE.md is the single highest-leverage cost variable at session start because every token it contains loads before any user message, on every API call. Published guidance splits into two tiers: official docs cap it at <200 lines per file; practitioner guidance tightens to <500 tokens, with a practical template of "3–5 key rules + 3 file pointers" totaling ~200 tokens.[2][13] A lean startup context budget — system prompt (~2,700 tokens), tool definitions (~16,800 tokens), 150-line CLAUDE.md (~1,500 tokens), MEMORY.md index (~400 tokens), and skill descriptions (~800 tokens for 10 skills) — totals ~3,000–5,000 tokens before the first user message.[10][1] Path-scoped rules (`.claude/rules/` with `paths:` frontmatter) are the largest available saving: they load only when Claude reads a matching file, not at launch.

Skills have a fundamentally different cost model from CLAUDE.md. Only skill descriptions load at startup — each capped at 1,536 characters (~384 tokens); full content loads on invocation only.[4][16] After auto-compaction, Claude re-attaches the most recently invoked skills up to a combined ceiling of 25,000 tokens, prioritizing recency and dropping older skills if budget is exceeded. The high-ROI skills for practitioners are `/simplify` (spawns 3 parallel review agents), `/batch` (parallelizes across 5–30 isolated worktree agents for large migrations), and proactive `/compact` before phase transitions.[15] Dead-weight patterns: embedding background facts directly in SKILL.md bodies, mixing process steps with reference docs, and leaving admin or destructive skills with model invocation enabled when they should carry `disable-model-invocation: true` — which removes the description from context entirely.[4]

Session resume performance is directly tied to transcript size, not to feature flags. The JSONL append-only architecture chain-patches UUID anchors across every prior message on every resume call: sessions under 50K tokens resume in under 3 seconds; sessions over 100K tokens introduce meaningful latency from deserialization. A structured handoff note under 5K tokens loads near-instantly; a raw 100K+ transcript does not.[12][5] Three open bugs in `--resume` (`#3138`, `#43696`, `#46445`) cause recurring regressions in context restoration fidelity as of 2026, making automated resume unreliable in CI/automation contexts — `--resume` with explicit session names is the safer flag for pipelines.[5]

The compaction reserve buffer was reduced from ~45,000 tokens to ~33,000 tokens in early 2026, yielding ~12K additional usable tokens and pushing the auto-compaction trigger from ~77% to 83.5% context fill.[1][14] The compaction mechanism runs five sequential stages before full summarization: budget reduction → snip → microcompact → context collapse → auto-compact.[12] The critical quality problem: "the model is at its least intelligent point when compacting" because context rot accumulates as fill approaches the limit. Proactive `/compact` at 70–75% fill — before starting a new work phase, not reactively — produces materially better summaries.[8] The `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` env var shifts the trigger threshold; StatusLine hooks that receive `remaining_percentage` provide the only live context monitor available for implementing this proactively.[14]

Native auto-memory uses LLM-based file header scanning — not vector similarity — to select relevant topic files. The MEMORY.md index loads at every session start (hard cap: first 200 lines or 25KB); individual topic files load on demand when Claude judges them relevant.[10] This design is intentional: inspectability and reliability over semantic richness. The tradeoff is speed — LLM retrieval takes seconds vs. milliseconds for vector MCPs. `mcp-memory-service` (doobidoo) with SQLite-vec achieves 5ms retrieval latency and scores DevBench Recall@5 of 91.1%,[6] with a SessionEnd auto-harvest hook and hybrid retrieval combining vector similarity, BM25, and a typed knowledge graph. `claude-mem` (thedotmack) adds a 3-layer progressive disclosure pattern — index queries at ~50–100 tokens, full detail at ~500–1,000 tokens — claiming ~3–10× token savings over loading full session history.[7] Neither vector MCP addresses multi-machine state in a zero-config way; only `mcp-memory-service` supports optional hybrid local+cloud sync via Milvus or Zilliz.

Structured handoff protocols solve the dual failure modes of agent resume: information overload (full history triggers "Lost in the Middle" degradation) and aggressive compression (stripping evidence removes verification ability). XTrace's structured briefing model — decisions/constraints always present, artifacts and preferences queryable, timeline as a chronological log — reduced average 15-minute handoff delays to seconds and shrinks context reconstruction from 5–10 turns to 1–2 turns.[9] The `/clear` + distilled brief pattern is the manual equivalent: more effort per session, but it eliminates context rot entirely and gives the practitioner control over what survives the boundary.

The 1M context window addresses capacity, not startup speed, and carries three concrete costs. Accuracy degrades measurably: Opus 4.6 drops from 93% to 76% retrieval accuracy between 256K and 1M tokens; Sonnet 4.5 scores 18.5% MRCR at 1M — effectively unreliable and not recommended for long-context work.[3] Crossing the 200K token threshold triggers a pricing step function — ALL tokens price at 2× input and 1.5× output (Sonnet 4.6: $3.00/M → $6.00/M input), and cache read/write costs also double.[3] A 400K-token session costs roughly 5× per turn vs. a post-compaction 80K session, making intentional session management frequently cheaper than relying on window capacity.

Practitioners who need sub-3s startup must stack five mechanisms simultaneously: keep the stable prefix (CLAUDE.md + memory index + tool definitions) under 50K tokens; inject dynamic session recovery data after the stable prefix to preserve cache hits; use semantic retrieval (5ms vector MCP) rather than full history injection; keep structured handoffs under 5K tokens; and manage session discipline with proactive `/compact` before phase transitions and `/clear` on task pivots. None of these mechanisms alone is sufficient — cold cache at 500K tokens is a 35-second wall that no session flag overrides, and the 5-minute TTL means sessions with natural pauses will periodically pay full cold-start cost regardless of context size. The unsolved problem in 2026 is cache warming: there is no mechanism to pre-warm the cache before a session starts, and the recurring `--resume` regressions make raw transcript resume unreliable for production agent loops.



Table of Contents

  1. CLAUDE.md Size, Cost Tradeoffs & Loading Hierarchy
  2. Context Buffer Mechanics and Auto-Compaction
  3. Skills System — ROI Analysis: High Value vs. Dead Weight
  4. Session Continuity Flags: --continue, --resume, /compact, /clear
  5. Auto-Memory System (MEMORY.md + ~/.claude/memory/)
  6. Vector-Store Memory MCPs
  7. Baton/Handoff Protocols for Agent Resume
  8. Prompt Caching & Sub-3-Second Startup Architecture
  9. 1M Context Window: When to Use It
  10. Integrated Decision Framework for Sub-3s Resume
  11. Plan Mode & Saved Plans — Pre-Positioned Context

Section 1: CLAUDE.md Size, Cost Tradeoffs & Loading Hierarchy

Every token in a CLAUDE.md file is loaded before any user message is processed, making it the single highest-leverage cost variable in session startup. A 5,000-token CLAUDE.md consumes 5,000 tokens on every API call with zero user interaction.[2][13] Published targets converge on strict size limits with different granularity: the official Claude Code docs specify <200 lines per file,[10] while practitioner guidance tightens this to <500 tokens with a practical template of "3–5 key rules + 3 file pointers" (~200 tokens total).[2][13]

Key finding: "Five rules and three file pointers is the right size" for a CLAUDE.md. The 200-line official figure is an outer bound; the 500-token practitioner target is the optimum for startup token budget.[13]

Size Targets: Official vs. Practitioner

SourceSize TargetRationale
Official Claude Code Docs[10] <200 lines per file Longer files reduce model adherence, not just context budget
buildtolaunch.substack.com[2][13] <500 tokens Explicit token budget; 200 tokens with "5 rules + 3 pointers" template
Practical minimum[13] ~200 tokens "3–5 key rules + 3 file pointers" — approximately 200 tokens total

Loading Hierarchy (Official)

CLAUDE.md files load by walking UP the directory tree from the working directory. The five-tier cascade, in order of loading:[10]

  1. ./CLAUDE.md or ./.claude/CLAUDE.md — project root
  2. ./CLAUDE.local.md — personal overrides, gitignored
  3. Parent directory CLAUDE.md files — cascades up the tree
  4. ~/.claude/CLAUDE.md — user-global
  5. /etc/claude-code/CLAUDE.md or managed policy path — org-wide, cannot be excluded

Subdirectory CLAUDE.md files load on demand when Claude reads files in those subdirectories — NOT at launch. This is the primary token-saving behavior in multi-project repositories.[10]

Delivery Mechanism and Context Drift

CLAUDE.md is delivered as a user message after the system prompt, not as the system prompt itself.[10] This has a critical implication: in long sessions, the model deprioritizes earlier CLAUDE.md instructions in favor of recent conversation history — a phenomenon documented as "context drift."[16] Skills (loaded only when invoked) are more robust to context drift than CLAUDE.md baseline injection.[16]

Token-Saving Techniques

TechniqueMechanismToken ImpactSource
HTML comment stripping <!-- notes --> stripped before injection Zero cost for maintainer notes [10]
@Import syntax References external files that expand at load Organizes but does NOT reduce context; max 5-hop depth [10]
Path-scoped rules .claude/rules/ with paths: frontmatter Load only when matching file is read; largest available saving [10]
Skills for procedures Move playbooks out of CLAUDE.md into .claude/skills/ Pay only on invocation, not every turn [4]

Typical Startup Context Budget

At session start, the context window contains:[10][1][14]

ComponentTypical SizeNotes
System prompt[1] 2,700 tokens (1.3% of 200K) Fixed Anthropic baseline
System tools[1] 16,800 tokens (8.4%) Tool definitions; largest non-user component
CLAUDE.md (150 lines)[10] ~1,500 tokens Scales linearly with file size
MEMORY.md index (100 entries)[10] ~400 tokens 200-line / 25KB cap enforced
Skill descriptions (10 skills)[4] ~800 tokens Descriptions only; full content loads on invocation
User skills (raw session data)[1] 1,000 tokens (0.5%) Observed baseline
Total startup overhead (lean) ~3,000–5,000 tokens Before first user message

Path-scoped rules architecture is the official recommendation for sub-10K startup overhead: keep core always-on rules in CLAUDE.md (<200 lines), language/area-specific guidance in .claude/rules/, and procedures in skills.[10]

See also: Cost Optimization (token cost per-turn analysis)

Section 2: Context Buffer Mechanics and Auto-Compaction

The compaction reserve buffer was reduced from ~45,000 tokens (22.5% of 200K) to ~33,000 tokens (16.5%) in early 2026 — an undocumented change that delivers ~12K additional usable tokens before compaction triggers.[1][14] Compaction now fires at ~83.5% context usage (previously 77–78%), yielding approximately 167K usable tokens in a 200K window before automatic summarization begins.[14]

Key finding: "Due to context rot, the model is at its least intelligent point when compacting." Proactive /compact before the limit produces materially better summaries than waiting for automatic trigger.[8]

Five Graduated Compaction Layers

Claude Code does NOT jump directly to full summarization. VILA-Lab's systematic analysis documents five sequential compression stages:[12]

LayerNameAction
1Budget ReductionReduces tool output verbosity
2SnipRemoves oldest non-critical messages
3MicrocompactCompresses recent messages
4Context CollapseRead-time, non-destructive
5Auto-CompactLast resort — full summarization

Buffer Controls Available

ControlMechanismEffectSource
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE Env var (1–100) Shifts when compaction fires relative to total window [1][14]
sonnet[1m] Model selection 1M context window; dramatically pushes back compaction threshold [14]
Manual /compact User command Proactive summarization while context quality is high; accepts steering [14]
StatusLine hook Hook receives remaining_percentage Only live context monitor available; enables proactive compaction triggers [14]
CLAUDE_CODE_MAX_OUTPUT_TOKENS Env var Controls output length — does NOT change compaction buffer (common misconception) [14]

Context Window Construction Order

Nine ordered sources load into the context window on session start (VILA-Lab systematic analysis):[12]

  1. System prompt
  2. Managed CLAUDE.md (/etc/ — org-level)
  3. User CLAUDE.md (~/.claude/)
  4. Project CLAUDE.md
  5. Local CLAUDE.md (CLAUDE.local.md)
  6. Auto-memory MEMORY.md (first 200 lines / 25KB)
  7. Skills (descriptions only; full content on invocation)
  8. Session history
  9. Current turn

Session Architecture: Append-Only JSONL

Claude Code uses append-only JSONL transcripts across three channels: global prompt history, subagent sidechains, and transcript logs.[12] Chain-patching at read time uses UUID anchors (headUuid/anchorUuid/tailUuid) — nothing is destructively edited on disk.[12] Permissions are never restored on resume; trust is re-established per session as a deliberate security design choice.[12]

See also: Cost Optimization (pricing per compaction event)

Section 3: Skills System — ROI Analysis: High Value vs. Dead Weight

Skills have a fundamentally different cost model from CLAUDE.md: "Unlike CLAUDE.md content, a skill's body loads only when it's used, so long reference material costs almost nothing until you need it."[4] This makes skills the primary mechanism for keeping session startup cheap while preserving access to complex procedures.

Key finding: After auto-compaction, Claude Code re-attaches the most recent invocation of each skill (first 5,000 tokens each) up to a combined budget of 25,000 tokens — filling from most recently invoked. Older skills can be dropped entirely if the budget is exceeded.[4]

Skills Cost Model

ScenarioWhat LoadsToken CostSource
Regular session (not invoked) Skill descriptions only Up to ~384 tokens per description (1,536-character cap[4][16]); total skill-description budget capped at 1% of context window (~2,000 tokens at 8,000-character fallback) [4][16]
Regular session (invoked) Full skill content Full file size (up to 500 lines) [4]
Post-compaction Most recently invoked skills, first 5K tokens each Max 25,000 tokens total across all re-attached skills [4]
Subagent with preloaded skills Full skill content at startup Higher than regular session; different cost model [4]

Skills Architecture: Location Options

ScopePathApplies to
Enterprise Managed settings All org users
Personal ~/.claude/skills/<skill-name>/SKILL.md All your projects
Project .claude/skills/<skill-name>/SKILL.md This project only
Plugin <plugin>/skills/<skill-name>/SKILL.md Where plugin is enabled

Source: Official Claude Code Docs.[4]

Invocation Control Matrix

FrontmatterUser can invokeClaude can invokeDescription in context
(default) Yes Yes Always in context
disable-model-invocation: true Yes No NOT in context (description also removed)
user-invocable: false No Yes Always in context

Token-saving insight: disable-model-invocation: true removes the skill description from context entirely — useful for rarely-used deployment or admin skills to reduce baseline overhead.[4]

Size Limits and Budget Constraints

LimitValueImpact if ExceededSource
Max SKILL.md length 500 lines Degraded performance [4]
Description truncation 1,536 characters per entry Keywords stripped; Claude misses invocation triggers [4][16]
Total description budget 1% of context window OR 8,000 characters (fallback) Descriptions shortened across all skills to fit [4][16]
Budget override SLASH_COMMAND_TOOL_CHAR_BUDGET env var Raises limit; use when many skills needed [4]

High-ROI Skills (Practitioner Assessment)

SkillROI AssessmentOptimal TriggerSource
/simplify Highest ROI — spawns 3 parallel review agents; catches unused imports, redundant variables, shared logic After refactoring or AI-generated code [15]
/batch Maximum impact for large-scale changes — parallelizes across 5–30 independent units using isolated worktree agents Large migrations [15]
/review Complementary correctness validation — bugs, logic errors, edge cases; pairs with /simplify Before commit/PR [15]
/compact Essential for session sustainability — proactive compression before new work phases Before phase transition, not at context limit [15]

Recommended workflow sequence: review → fix → simplify.[15]

Low-Value / Dead Weight Patterns

PatternProblemFixSource
Background facts / style guides in SKILL.md body Bloats file; context cost every invocation Move to supporting files in skill directory [4]
Skills that should be CLAUDE.md rules Wrong abstraction; pay invocation overhead for always-on content Static always-on context belongs in CLAUDE.md [4]
Large reference docs embedded directly 5,000-token skill body loaded on every invocation Reference from SKILL.md; load only when explicitly needed [4]
Skills with process steps + background context mixed "Packing skills with background context bloats the file and degrades performance" Skills contain process steps only [16]
/loop skill Limited real usage; useful only for specific monitoring scenarios Situational — use only when actually polling [15]
/debug skill Situational troubleshooting only — “niche skill, but invaluable when you need it”; not for routine use Keep but invoke only for specific debugging scenarios; not a session-start default [15]
Admin/destructive skills with model invocation enabled Risk of unintended invocation; always in context budget Set disable-model-invocation: true [4]

Dynamic Context Injection (Power Feature)

The !`command` syntax in skill files executes shell commands at invocation time — output replaces the placeholder before Claude sees the skill content.[4] This enables sub-second injection of live data (git diffs, test results, PR comments) at skill invocation rather than baking stale data into the session start:

---
name: pr-summary
context: fork
agent: Explore
---
## Pull request context
- PR diff: !`gh pr diff`
- PR comments: !`gh pr view --comments`

This pattern decouples "what's always in context" from "what's injected when needed" — a key technique for sub-3s startup with full situational awareness.[4]

Skills vs. CLAUDE.md Decision Rule

Content TypeMechanismCost ProfileSource
Facts, rules, patterns (always relevant) CLAUDE.md Pay every turn [4]
Procedures, playbooks, workflows Skills Pay only when invoked [4]
Large reference docs Supporting files in skill directory Pay only when explicitly loaded via skill [4]

Skills are also hot-reloaded — edits take effect within the current session without restart. Creating a new top-level skills/ directory requires restart.[4]


Section 4: Session Continuity Flags: --continue, --resume, /compact, /clear

Claude Code provides four mechanisms for session continuity with distinct tradeoffs: --continue (auto-resume latest), --resume (targeted resume), /compact (summarize-in-place), and /clear (fresh start with manual brief). Each addresses a different failure mode in the context lifecycle.[5][8]

--continue (-c) vs. --resume (-r)

FlagBehaviorIdeal Use CaseSource
--continue (-c) Auto-loads most recent conversation in current directory without user interaction Single active task, single project [5]
--resume (-r) Interactive picker or explicit name/ID selection across sessions Multiple parallel projects; targeted resume [5]

What gets restored with either flag: complete message history, all tool results (files read, commands executed, code modifications), full conversation context.[5]

--resume Interactive Session Picker Controls

KeyAction
Up/Down arrowsNavigate sessions
EnterResume selected session
PPreview content
RRename session
/Search/filter
BFilter by Git branch

Source: Pillitteri, 2026.[5]

Session Storage Format

Sessions persist in two locations:[5]

  1. ~/.claude/history.jsonl — global index with timestamps, paths, session IDs
  2. ~/.claude/projects/ — per-project directories containing:

Sessions are grouped by Git repository, including worktrees. No automatic cleanup or expiration — indefinite retention by default.[5]

/compact vs. /clear: Official Framework

ApproachMethodTradeoffSource
/compact Model summarizes automatically; accepts steering via inline flags Lossy but thorough; Claude decides relevance; can be guided with explicit focus [8]
/clear User writes distilled brief manually; zero-rot fresh session More effort per session but you control relevance; eliminates model deprioritization of old content [8]

When to Use /compact

When to Use /clear

Five Official Decision Points After Each Task

Anthropic's official framework for choosing the right continuity mechanism:[8]

#ActionWhen
1Continue (next message)Same task, same context, no rot
2/rewind (Esc Esc)Jump back to previous message and retry
3/clear + distilled briefTask pivot, poisoned context, fresh start needed
4/compactSummarize session and keep working in same stream
5SubagentsDelegate work to agent with clean context; preserve main session for decision-making

Sub-3-Second Resume Performance Reality

The session system was NOT designed for sub-3s resumption of large sessions. Performance characteristics by session size:[5][12]

Session SizeResume LatencyBottleneck
<50K tokens <3 seconds None — fast
50K–100K tokens Variable; can be significant Serialization/deserialization of large tool outputs in history
>100K tokens Meaningful latency JSONL chain patching + UUID anchor resolution across full transcript
Structured handoff notes (<5K tokens) Near-instant JSONL deserialization of small payload is sub-second
Key finding: Sub-3s startup with --resume requires keeping session history short. The JSONL append-only design means each resume call deserializes and chain-patches ALL prior messages. A 5K-token structured baton loads near-instantly; a 100K+ transcript does not.[12]

Known Issues and Regressions (As of 2026)

IssueSymptomStatus
GitHub #3138[5] --resume fails to maintain context after hitting usage/context limits Known bug
GitHub #43696[5] --continue and --resume do not restore prior conversation context (regression in 2.1.x) Patched but re-appears
GitHub #46445[5] /continue and /resume show all sessions across projects instead of current project only (regression in 2.1.101) Regression
Non-interactive mode[5] --continue may create new sessions in CI/automation Use --resume with explicit session names for reliable automation
See also: Multi-CLI Coordination (parallel session state management)

Section 5: Auto-Memory System (MEMORY.md + ~/.claude/memory/)

Claude Code provides two official memory mechanisms with different persistence models and token costs. CLAUDE.md files hold instructions and rules (loaded in full every session); auto-memory holds Claude's learned patterns and decisions (indexed summary loaded at startup, details loaded on demand).[10]

Two Official Memory Mechanisms

MechanismWho WritesWhat It ContainsLoaded Into
CLAUDE.md files[10] Developer Instructions, rules, project context Every session (full content)
Auto memory[10] Claude (agent-written) Learnings, patterns, decisions, gotchas Every session (first 200 lines or 25KB of MEMORY.md)

Directory Structure

~/.claude/projects/<project-hash>/memory/
├── MEMORY.md          # Index — loaded into EVERY session (first 200 lines / 25KB)
├── debugging.md       # Detailed notes (loaded on demand)
├── api-conventions.md # Architecture decisions (loaded on demand)
└── ...

Source: Official Claude Code Docs.[10]

Properties:[10]

Loading Behavior: Budget-Gated Index Pattern

FileWhen LoadedToken Budget
MEMORY.md[10] Every session start First 200 lines OR first 25KB — hard cap
Topic files (e.g., debugging.md)[10] On demand — Claude reads when relevant Full file size; only when Claude judges it necessary
Key finding: Claude Code's auto-memory uses LLM-based scanning of file headers (titles/descriptions) to select relevant files — NOT vector similarity search. "Names are the retrieval mechanism" — descriptive filename and description, not embedding distance.[10][12] This is a deliberate design choice for inspectability and reliability over semantic richness.

Retrieval Mechanism: LLM File Header Scan vs. Vector MCP

PropertyNative Auto-MemoryVector-Store MCP
Retrieval method[12] LLM scans file headers; selects up to 5 relevant files Embedding similarity + BM25
Retrieval speed Seconds (LLM call required) ~5ms (local SQLite-vec)
Inspectability Full (plain Markdown files, readable by human) Partial (requires query tool)
External dependencies None Requires MCP server running
Auto-capture No — Claude decides what's worth saving Yes — SessionEnd hooks + lifecycle hooks
Token efficiency 200-line MEMORY.md cap; topic files on demand Semantic filtering; only relevant context injected

Design Philosophy: Concise Index Pattern

The official auto-memory design philosophy (from Claude Code Docs):[10]

  1. MEMORY.md is a concise index — one line per memory, descriptive filenames
  2. Details live in topic files, loaded only when needed
  3. Descriptive filename + description = retrieval mechanism (not embeddings)
  4. Claude reads MEMORY.md index, picks by topic, reads the specific file if needed

Enable/Disable Controls

CLAUDE_CODE_DISABLE_AUTO_MEMORY=1   # env var
{"autoMemoryEnabled": false}         # settings.json

Requires Claude Code v2.1.59+.[10]

What Survives /compact

ContentSurvives /compact?Mechanism
Project-root CLAUDE.md[10] Yes Re-read from disk after compaction
Nested CLAUDE.md in subdirectories[10] Deferred Re-loaded next time Claude reads a file in that subdirectory
Instructions given only in conversation[10] No Must be added to CLAUDE.md to persist
MEMORY.md[10] Yes (disk-persisted) Re-loaded at next session start
See also: New Tooling 2026 (new memory MCP releases)

Section 6: Vector-Store Memory MCPs

Two open-source tools provide vector-backed persistent memory for Claude Code: mcp-memory-service (doobidoo) targeting production AI agent pipelines, and claude-mem (thedotmack) targeting session capture and progressive disclosure within Claude Code workflows.[6][7]

mcp-memory-service (doobidoo)

Architecture: Three Access Modes

ModeInterfaceUse Case
REST API 15 endpoints Framework-agnostic HTTP clients (LangGraph, CrewAI, AutoGen)
MCP Server Native tool integration Claude Desktop, Claude Code native integration
Remote MCP via HTTPS/OAuth Browser-accessible Browser-based claude.ai integration

Source: doobidoo/mcp-memory-service.[6]

Vector Storage Backends

BackendUse CaseNotes
SQLite-vec (default) Local, no external dependencies 5ms retrieval; recommended for Claude Code
Milvus (Lite / self-hosted / Zilliz Cloud) Production scale Managed or self-hosted options
Cloudflare Vectorize Cloud deployment Edge deployment
Hybrid local + cloud sync Resilience / multi-machine Addresses machine-local limitation of native auto-memory

Source: doobidoo/mcp-memory-service.[6]

Embedding Model

All-MiniLM-L6-v2, 384-dimensional embeddings, running locally via ONNX — no external API dependencies, sub-millisecond embedding inference.[6]

Hybrid Retrieval: Four Signals Combined

  1. Vector-based semantic similarity
  2. BM25 full-text ranking
  3. Knowledge graph relationships (typed edges: causes, fixes, supports, follows, related, contradicts)
  4. Tag-based filtering with agent-scoped retrieval via X-Agent-ID headers

Source: doobidoo/mcp-memory-service.[6]

Performance Benchmarks

BenchmarkMetricScore
LongMemEval (500 questions)[6] Recall@5 80.4%
LongMemEval[6] NDCG@10 82.2%
LongMemEval[6] MRR 89.1%
DevBench (practical workflows)[6] Recall@5 91.1%
DevBench[6] Overall MRR 0.861
SQLite-vec retrieval speed[6] Retrieval latency 5ms — adds ~5–15ms total to startup including query formulation

Claude Code Integration Path

Two integration modes available:[6]

Reliability fix in v10.40.3: Eliminated socket hang-ups via keepAlive: false and explicit Connection: close headers. Critical for production agent loops that restart frequently.[6]

claude-mem (thedotmack)

Session Capture: Five Lifecycle Hooks

HookWhen It FiresWhat It Does
SessionStart[7] Session begins Initialize session context
UserPromptSubmit[7] User sends message Capture user intent
PostToolUse[7] After each tool call Record tool usage and observations
Stop[7] Session paused Checkpoint state
SessionEnd[7] Session ends Final AI-powered compression and storage

3-Layer Progressive Disclosure Pattern

The key architectural innovation for token efficiency — access detail at the granularity you need:[7]

LayerToolTokens per ResultUse Case
1 — Index search ~50–100 tokens Find relevant observations; get IDs
2 — Timeline timeline ~200–500 tokens (estimated; exact token cost not documented in source[7] — estimate based on intermediate granularity between index and full-detail layers) Chronological context around results
3 — Detail get_observations ~500–1,000 tokens Full details for filtered IDs only

Token efficiency claim: ~10x savings by filtering before fetching vs. loading full session history.[7] Conservative re-estimate based on corpus analysis: ~3x reduction (50 observations × 500 tokens = 25K full load; vs. 3,750 token index + ~7,500 for 3–5 relevant observations).[7]

Infrastructure

Full Comparison: Native vs. Vector MCP Options

PropertyNative Auto-Memorymcp-memory-serviceclaude-mem
Retrieval method[10][6][7] LLM file header scan Hybrid vector + BM25 + knowledge graph Vector + FTS5
Retrieval speed[12][6][7] ~Seconds (LLM call) 5ms ~Milliseconds
Token efficiency 200-line MEMORY.md cap; topic files on demand Semantic filtering; decay + compression pipeline 3-layer progressive disclosure; ~3–10x claimed reduction
External dependencies None SQLite-vec or Milvus Node.js 18+, Bun, Chroma
Inspectability Full (plain Markdown) Partial (query tool required) Partial (web viewer available)
Auto-capture No — Claude decides SessionEnd auto-harvest hook 5 lifecycle hooks (every tool call)
Multi-machine support No (machine-local) Yes (Hybrid local+cloud or Zilliz) No (local SQLite)
Production benchmarks N/A DevBench Recall@5: 91.1% Not published

Sources: raw_10.md, raw_12.md, raw_6.md, raw_7.md.[10][12][6][7]

See also: New Tooling 2026 (additional memory MCP releases in 2026)

Section 7: Baton/Handoff Protocols for Agent Resume

Agent context handoffs fail through two symmetric failure modes: information overload (passing complete message histories triggers "Lost in the Middle" effect) and aggressive compression (stripping away supporting evidence removes the ability to verify or extend prior analysis).[9] Both modes degrade quality; neither achieves sub-3s productive startup.

Key finding: "Handoffs treat context as a one-time data transfer, a blob of text thrown from one agent to another." The fix is replacing text dumps with layered, queryable briefing structures. Structured handoffs reduce average 15-minute handoff delays to seconds.[9]

Root Cause: Why Context Dumps Fail

Failure ModeMechanismResultSource
Information Overload Full message history passed; model buries critical signals "Lost in the Middle" — performance degrades with large context dumps [9]
Aggressive Compression Summarization strips evidence and reasoning chains Receiving agent cannot verify or extend prior analysis [9]

The Structured Briefing Model: Four Information Layers

Replace text dumps with layered, queryable information (XTrace, 2026):[9]

LayerContentAvailability
1 Decisions and non-negotiable constraints Always present — inject at every session start
2 Actual artifacts (not summaries) for reference Available on demand — query by ID
3 Accumulated preferences and patterns Retrievable via query
4 Timeline context (what happened and when) Queryable chronological log

Practical Baton Format (Widely Adopted)

Synthesized from XTrace research[9] and Anthropic's /clear + brief guidance:[8]

## Current State
- What exists and what it does
- Decisions already made (with why)

## Next Steps (ordered)
- Exact files to change + what to change
- Command to verify each step

## Constraints
- Non-negotiable decisions
- Things that were tried and failed

## Key Files
- File + what to look for in each

Performance Benchmarks

MetricWithout Structured HandoffWith Structured HandoffSource
Handoff delay Average 15 minutes Seconds [9]
Context reconstruction turns 5–10 turns to re-derive prior state 1–2 turns from structured baton [9]
Resume latency (JSONL) Significant for >100K session transcripts Near-instant for <5K structured notes [12]

/clear + Manual Brief = Manual Baton Pass

The /clear + distilled brief pattern is the manual version of a structured baton pass:[8]

Queryable Memory as Infrastructure

"When a new agent joins a workflow, it doesn't receive a dump. It queries the memory for a briefing."[9] This architecture allows agents to access accumulated knowledge immediately, avoid cold starts, and avoid degradation through successive summarization rounds — the core problem with repeated /compact cycles.[9]

Subagent Strategy for Context Isolation

Decision framework for subagent delegation (Anthropic):[8]

ConditionStrategy
Only the conclusion needed (not intermediate output) Use subagent — intermediate output stays in subagent context; only conclusion returns to parent
Output will be referenced again in main session Keep in main session — subagent return loses the working context

Subagents keep main context clean by using isolated Claude instances that return only distilled results.[16]

See also: Multi-CLI Coordination (baton protocols across parallel agents)

Section 8: Prompt Caching & Sub-3-Second Startup Architecture

The 5-minute prompt cache TTL is the fundamental constraint for sub-3s startup. With a warm cache (<5 min since last request), TTFT at 500K tokens is ~3.5 seconds. Cold cache at the same size takes 35–90 seconds — a 10–26x latency penalty.[3][11]

Key finding: "A 50K-token context with a 95% cache hit rate is more economical and faster than a 10K-token context that's cold on every request. The goal should be maximizing cache hit rate, not minimizing context size alone."[11]

TTFT Benchmarks by Context Size and Cache State

Context SizeCache StateTTFTSource
50K tokens Warm (<5 min) ~1.7s [11]
50K tokens (GPT-4o baseline) Warm, cached 1,699ms (down from 4,290ms) [11]
500K tokens Warm (<5 min) ~3.5s [3]
500K tokens Cold (>5 min) ~35s [3]
1M tokens (extrapolated) Cold 60–90s [3]

Model note: Row 1 (~1.7s at 50K warm cache) is a Claude-inferred estimate extrapolated from cross-provider caching improvement data in [11]. Row 2 (1,699ms at 50K warm cache) is a direct GPT-4o measurement from arxiv 2601.06007[11]; it is included as a cross-provider reference point. Rows 3–5 (500K and 1M) are Claude measurements from claudecodecamp.com[3].

Cache Hit Rate Impact on Cost

ModelCost Reduction from CachingSource
GPT-5.2[11] 79–81% [11]
Claude Sonnet 4.5[11] 78–79% [11]
Average across providers[11] 41–80% [11]
TTFT improvement from caching[11] 13–31% across providers [11]

Cache Minimum Threshold

Cache activation requires a minimum prompt length:[11]

ProviderMinimum Tokens for CacheBelow Minimum
Anthropic[11] 1,024 tokens TTFT regression of 10–18% (worse than no caching)
OpenAI[11] 4,096 tokens Cache does not activate

The Cache Paradox

"Full context caching can paradoxically increase latency" when dynamic content like tool results gets cached without producing subsequent hits. Cache writes consume compute; if the cached content isn't reused, you pay write cost with no benefit.[11]

Strategic insight: "System prompt only caching provides the most consistent benefits."[11] Operational rule:

Content TypeCache StrategyReason
CLAUDE.md + skills + static tools[11] Maximize cache stability — never change between calls Stable prefix → consistent cache hit rate
Dynamic content (tool results, git status)[11] Do NOT cache; generate fresh each turn Cached dynamic content rarely gets a hit; pays write cost for nothing
Session recovery data[11] Inject AFTER stable prefix Dynamic injection after stable prefix does not break cache on prefix

Optimal Sub-3s Startup Architecture

Synthesized from six sources — five independent approaches that combine into a coherent stack:[11][6][3][12][8]

StepActionTargetWhy
1[11] Keep stable context (CLAUDE.md + memory index + static tools) under 50K tokens TTFT ~1.7s warm Below 50K: consistent sub-2s warm TTFT
2[11] Inject dynamic session recovery data AFTER stable prefix Cache hit on prefix preserved Dynamic suffix doesn't break cache on the stable portion
3[6] Use semantic retrieval (5ms) to select what dynamic content to inject ~5–15ms memory retrieval overhead Only relevant prior context injected, not full history
4[3] Stay under 200K tokens No long-context price cliff 200K crossing doubles all per-token costs
5[12] Keep structured handoff (baton) under 5K tokens Near-instant JSONL deserialization Chain-patching small JSONL is sub-second; large transcripts are not
6[6] Store session summaries + key decisions in vector store at session end; semantic query on startup Relevant prior context in 5ms Avoids cold-start reconstruction; externalized memory survives /clear
See also: Cost Optimization (cache pricing details per model)

Section 9: 1M Context Window — When to Use It

The 1M context window addresses a different problem than sub-3s startup. It eliminates automatic summarization for most coding sessions but introduces a 200K token pricing cliff (2× per-token cost), a 5-minute cache TTL cold-restart penalty, and measurable accuracy degradation above 256K tokens.[3]

Long-Context Pricing Cliff (200K Threshold)

When crossing 200K tokens, ALL tokens get premium pricing — not just those above the threshold. Step function, not linear scale:[3]

ModelStandard InputLong Context Input (2×)Standard OutputLong Context Output (1.5×)
Opus 4.6 $5.00/M $10.00/M $25.00/M $37.50/M
Sonnet 4.6 $3.00/M $6.00/M $15.00/M $22.50/M

Infrastructure cost justification: long-context sessions require hundreds of gigabytes of GPU memory reserved per user and consume 25× more compute on cold starts.[3]

Cache pricing at the 200K threshold (Opus 4.6): Cache reads double from $0.50/M (standard) to $1.00/M; cache writes from $6.25/M to $12.50/M — a 2× multiplier on both. This cache cost step-up is additive to the standard input/output price increase: sub-3s startup stacks relying on prompt caching face double cache overhead when crossing 200K tokens.[3]

Accuracy Degradation at Long Context

Model256K Context Accuracy1M Context AccuracyMetric
Opus 4.6[3] 93% 76% General retrieval accuracy; "lost in the middle" effect
Sonnet 4.5[3] N/A 18.5% MRCR score Effectively unreliable at 1M; avoid entirely

Critical information should be at the beginning or end of context for best retrieval at long context lengths.[3]

When 1M Context Is Justified

Use Case1M Appropriate?Reason
Single-shot analysis of large codebases/documents[3] Yes Full corpus in context; no compaction interruption
Deep debugging requiring full context preservation[3] Yes Compaction during debugging loses critical trace data
Multi-agent shared state[3] Yes Eliminates need for aggressive handoff compression
Compliance work requiring exact citations[3] Yes Source material must remain uncompacted
Routine coding sessions[3] No Most sessions peak at 80–120K before compaction anyway
Sessions with frequent pauses[3] No 5-minute cache TTL → painful cold restarts (35–90s)
Sonnet 4.5 at 1M[3] No 18.5% MRCR — unreliable; avoid entirely
Key finding: A 400K-token session costs roughly 5× per turn compared to a post-compaction 80K session. Better session management through intentional clearing often outperforms relying on larger context windows for most coding tasks.[3] 1M context does NOT help sub-3s boot — cold cache at 500K tokens takes 35 seconds to first token.[8]

1M Window vs. Sub-3s Startup

The distinction between capacity and startup speed:[8][3]

See also: Cost Optimization (full pricing model across models and context tiers)

Section 10: Integrated Decision Framework for Sub-3s Resume

Sub-3s productive startup requires stacking five independent mechanisms. No single tool achieves it alone; the architecture requires: stable prefix caching, lean startup context, structured handoff format, semantic memory retrieval for dynamic injection, and session lifecycle management aligned with the 5-minute cache TTL.

The Sub-3s Startup Stack

LayerMechanismTargetFailure Mode Without It
Stable context[11][10] CLAUDE.md <500 tokens + path-scoped rules + lean MEMORY.md <50K tokens stable prefix Large stable prefix → TTFT >3.5s even warm
Cache hit rate[11] Keep stable prefix unchanged between requests; stay under 200K tokens >90% cache hit on stable prefix Cold cache at 500K = 35s startup
Session format[12] Structured baton <5K tokens instead of full transcript resume Near-instant JSONL deserialization 100K+ JSONL transcript deserialization adds meaningful latency
Dynamic injection[6] Semantic retrieval (5ms) to select relevant prior context 5–15ms retrieval overhead Full history injection bloats context; destroys cache hit rate
Session discipline[8][15] Proactive /compact before phase transitions; /clear on task pivots Session stays under 100K tokens Context rot + auto-compaction at worst intelligence point

Context Management Decision Tree

SituationRecommended ActionSource
Same task, clean context Continue (send next message) [8]
Wrong direction, same task /rewind (Esc Esc) to prior message [8]
Same task, starting new phase Proactive /compact with steering before phase transition [15][8]
Task pivot or poisoned context /clear + manual distilled brief (baton) [8]
Subtask where only conclusion needed Subagent delegation; intermediate output stays in subagent context [8][16]
Resume after >5 min break Expect cold cache; use structured baton (<5K tokens) for fastest productive start [3][12]
New session, need prior context Vector MCP retrieval (5ms) OR MEMORY.md index + demand-load topic files [6][10]

Practitioners Who Have Achieved Sub-3s Startup

Documentation of working stacks from corpus sources:

Practitioner / SourceStack DescribedKey Metric
doobidoo/mcp-memory-service[6] SQLite-vec local + SessionEnd auto-harvest + semantic retrieval on startup 5ms retrieval; adds 5–15ms total startup overhead for memory layer
XTrace structured handoff[9] Layered briefing format + queryable memory infrastructure 15-minute handoff delays reduced to seconds; 5–10 turns reconstructions → 1–2 turns
claudefa.st (tool vendor)[1][14] CLAUDE_AUTOCOMPACT_PCT_OVERRIDE + StatusLine hook real-time monitoring Compaction quality metric, not startup latency: proactive compaction before buffer limit preserves context quality; no before/after startup time reported
arxiv 2601.06007 (academic paper — GPT-4o measurement)[11] Stable prefix caching + dynamic suffix injection (cross-provider caching study, not a Claude Code deployment) GPT-4o direct measurement: 50K warm cache TTFT = 1,699ms (down from 4,290ms); 13–31% TTFT improvement from caching. Included as cross-provider reference; not a Claude practitioner benchmark.

Note on corpus coverage: The entries above document tools and approaches from corpus sources (vendor documentation, academic papers). No independent practitioner case studies — developers who measured their own Claude Code cold-start latency, applied a config stack, and published before/after numbers — were found in raw_1.md through raw_16.md. To add such entries, search: "claude code startup latency optimization site:reddit.com OR site:github.com OR site:dev.to" and "claude code resume performance blog 2025 2026".

Gaps and Unsolved Problems

Areas where corpus data identifies open problems without current solutions:[5][3][11]


Section 11: Plan Mode & Saved Plans — Pre-Positioned Context

Research gap — content removed pending source verification. This section originally cited sources [17]–[20], none of which appear in the consolidated corpus source index (raw_1.md–raw_16.md). Specific claims about “Ultraplan,” a Ctrl+G editor integration, plan approval-flow options, and plan persistence paths could not be verified against the corpus. The “Ultraplan” product name has no corpus anchor. All content relying on sources [17]–[20] has been removed per corpus verification policy.

To restore this section: fetch the official Claude Code plan mode documentation (the URL pattern from confirmed sources suggests code.claude.com/docs/en/plan-mode or the Anthropic docs equivalent), add the real URLs to the consolidated source index, and re-synthesize using only verified claims with proper citation numbers.

The EnterPlanMode and ExitPlanMode tools are confirmed real Claude Code API tools (they appear in the Claude Code tool declaration list). No other plan mode mechanics are documented in the consolidated corpus (raw_1.md–raw_16.md).


Sources

  1. Claude Code Context Buffer: The 33K-45K Token Problem (retrieved 2026-04-27)
  2. Claude Code Token Optimization: Full System Guide (2026) (retrieved 2026-04-27)
  3. Claude Code 1M Context Window: Cost, Limits, and When to Use It (retrieved 2026-04-27)
  4. Extend Claude with skills — Official Claude Code Docs (retrieved 2026-04-27)
  5. Claude Code --continue and --resume: Never Lose Your Context Again (retrieved 2026-04-27)
  6. mcp-memory-service — Open-source persistent memory for AI agent pipelines (GitHub) (retrieved 2026-04-27)
  7. claude-mem — AI-powered session capture and context injection plugin for Claude Code (retrieved 2026-04-27)
  8. Using Claude Code: Session Management and 1M Context — Official Anthropic Blog (retrieved 2026-04-27)
  9. AI Agent Handoff: Why Context Breaks & How to Fix It — XTrace (retrieved 2026-04-27)
  10. How Claude remembers your project — Official Claude Code Docs (retrieved 2026-04-27)
  11. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks (retrieved 2026-04-27)
  12. Dive into Claude Code: A Systematic Analysis of Claude Code for AI Agent Systems (retrieved 2026-04-27)
  13. Claude Code Token Optimization: Full System Guide (2026) (retrieved 2026-04-27)
  14. Claude Code Context Buffer: The 33K-45K Token Problem (retrieved 2026-04-27)
  15. Essential Claude Code Skills and Commands | (think) (retrieved 2026-04-27)
  16. Claude Code Customization: CLAUDE.md, Slash Commands, Skills, and Subagents (retrieved 2026-04-27)

Home