Pillar: context-continuity | Date: April 2026
Scope: Head-to-head comparison of 2026 SOTA for sub-3s agent boot at full context: CLAUDE.md size/cost tradeoffs, Skills ROI analysis (which earn rent vs. dead weight), plan mode and saved plans, custom baton/handoff protocols, auto-memory (~/.claude/memory/, MEMORY.md indexing), --resume/--continue flags, vector-store memory MCPs (mcp-memory-service, etc.), embedding-based session retrieval, /compact vs /clear vs --resume tradeoffs. Practitioners who have actually solved sub-3s startup and their full stack with benchmarks.
Sources: 16 gathered, consolidated, synthesized.
Critical finding: Sub-3-second agent startup is achievable only below 50K stable tokens with a warm cache — cold cache at 500K tokens takes 35 seconds to first token, a 10× latency penalty that no session flag or memory MCP can overcome.[3][11]
The 5-minute prompt cache TTL is the binding constraint in every sub-3s startup stack. With a warm cache, a 50K-token stable context achieves ~1.7s TTFT; crossing 500K on a warm cache still costs ~3.5s; cold at 500K hits 35 seconds. A 1M-token cold session extrapolates to 60–90 seconds — making 1M context antithetical to fast boot. The implication is structural: cache hit rate matters more than raw context size. A 50K-token context with a 95% cache hit rate outperforms — in both cost and latency — a 10K-token context that's cold on every request.[11] Anthropic's prompt cache requires a minimum of 1,024 tokens to activate; below that threshold, cache writes actually regress TTFT by 10–18%.[11]
CLAUDE.md is the single highest-leverage cost variable at session start because every token it contains loads before any user message, on every API call. Published guidance splits into two tiers: official docs cap it at <200 lines per file; practitioner guidance tightens to <500 tokens, with a practical template of "3–5 key rules + 3 file pointers" totaling ~200 tokens.[2][13] A lean startup context budget — system prompt (~2,700 tokens), tool definitions (~16,800 tokens), 150-line CLAUDE.md (~1,500 tokens), MEMORY.md index (~400 tokens), and skill descriptions (~800 tokens for 10 skills) — totals ~3,000–5,000 tokens before the first user message.[10][1] Path-scoped rules (`.claude/rules/` with `paths:` frontmatter) are the largest available saving: they load only when Claude reads a matching file, not at launch.
Skills have a fundamentally different cost model from CLAUDE.md. Only skill descriptions load at startup — each capped at 1,536 characters (~384 tokens); full content loads on invocation only.[4][16] After auto-compaction, Claude re-attaches the most recently invoked skills up to a combined ceiling of 25,000 tokens, prioritizing recency and dropping older skills if budget is exceeded. The high-ROI skills for practitioners are `/simplify` (spawns 3 parallel review agents), `/batch` (parallelizes across 5–30 isolated worktree agents for large migrations), and proactive `/compact` before phase transitions.[15] Dead-weight patterns: embedding background facts directly in SKILL.md bodies, mixing process steps with reference docs, and leaving admin or destructive skills with model invocation enabled when they should carry `disable-model-invocation: true` — which removes the description from context entirely.[4]
Session resume performance is directly tied to transcript size, not to feature flags. The JSONL append-only architecture chain-patches UUID anchors across every prior message on every resume call: sessions under 50K tokens resume in under 3 seconds; sessions over 100K tokens introduce meaningful latency from deserialization. A structured handoff note under 5K tokens loads near-instantly; a raw 100K+ transcript does not.[12][5] Three open bugs in `--resume` (`#3138`, `#43696`, `#46445`) cause recurring regressions in context restoration fidelity as of 2026, making automated resume unreliable in CI/automation contexts — `--resume` with explicit session names is the safer flag for pipelines.[5]
The compaction reserve buffer was reduced from ~45,000 tokens to ~33,000 tokens in early 2026, yielding ~12K additional usable tokens and pushing the auto-compaction trigger from ~77% to 83.5% context fill.[1][14] The compaction mechanism runs five sequential stages before full summarization: budget reduction → snip → microcompact → context collapse → auto-compact.[12] The critical quality problem: "the model is at its least intelligent point when compacting" because context rot accumulates as fill approaches the limit. Proactive `/compact` at 70–75% fill — before starting a new work phase, not reactively — produces materially better summaries.[8] The `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` env var shifts the trigger threshold; StatusLine hooks that receive `remaining_percentage` provide the only live context monitor available for implementing this proactively.[14]
Native auto-memory uses LLM-based file header scanning — not vector similarity — to select relevant topic files. The MEMORY.md index loads at every session start (hard cap: first 200 lines or 25KB); individual topic files load on demand when Claude judges them relevant.[10] This design is intentional: inspectability and reliability over semantic richness. The tradeoff is speed — LLM retrieval takes seconds vs. milliseconds for vector MCPs. `mcp-memory-service` (doobidoo) with SQLite-vec achieves 5ms retrieval latency and scores DevBench Recall@5 of 91.1%,[6] with a SessionEnd auto-harvest hook and hybrid retrieval combining vector similarity, BM25, and a typed knowledge graph. `claude-mem` (thedotmack) adds a 3-layer progressive disclosure pattern — index queries at ~50–100 tokens, full detail at ~500–1,000 tokens — claiming ~3–10× token savings over loading full session history.[7] Neither vector MCP addresses multi-machine state in a zero-config way; only `mcp-memory-service` supports optional hybrid local+cloud sync via Milvus or Zilliz.
Structured handoff protocols solve the dual failure modes of agent resume: information overload (full history triggers "Lost in the Middle" degradation) and aggressive compression (stripping evidence removes verification ability). XTrace's structured briefing model — decisions/constraints always present, artifacts and preferences queryable, timeline as a chronological log — reduced average 15-minute handoff delays to seconds and shrinks context reconstruction from 5–10 turns to 1–2 turns.[9] The `/clear` + distilled brief pattern is the manual equivalent: more effort per session, but it eliminates context rot entirely and gives the practitioner control over what survives the boundary.
The 1M context window addresses capacity, not startup speed, and carries three concrete costs. Accuracy degrades measurably: Opus 4.6 drops from 93% to 76% retrieval accuracy between 256K and 1M tokens; Sonnet 4.5 scores 18.5% MRCR at 1M — effectively unreliable and not recommended for long-context work.[3] Crossing the 200K token threshold triggers a pricing step function — ALL tokens price at 2× input and 1.5× output (Sonnet 4.6: $3.00/M → $6.00/M input), and cache read/write costs also double.[3] A 400K-token session costs roughly 5× per turn vs. a post-compaction 80K session, making intentional session management frequently cheaper than relying on window capacity.
Practitioners who need sub-3s startup must stack five mechanisms simultaneously: keep the stable prefix (CLAUDE.md + memory index + tool definitions) under 50K tokens; inject dynamic session recovery data after the stable prefix to preserve cache hits; use semantic retrieval (5ms vector MCP) rather than full history injection; keep structured handoffs under 5K tokens; and manage session discipline with proactive `/compact` before phase transitions and `/clear` on task pivots. None of these mechanisms alone is sufficient — cold cache at 500K tokens is a 35-second wall that no session flag overrides, and the 5-minute TTL means sessions with natural pauses will periodically pay full cold-start cost regardless of context size. The unsolved problem in 2026 is cache warming: there is no mechanism to pre-warm the cache before a session starts, and the recurring `--resume` regressions make raw transcript resume unreliable for production agent loops.
Every token in a CLAUDE.md file is loaded before any user message is processed, making it the single highest-leverage cost variable in session startup. A 5,000-token CLAUDE.md consumes 5,000 tokens on every API call with zero user interaction.[2][13] Published targets converge on strict size limits with different granularity: the official Claude Code docs specify <200 lines per file,[10] while practitioner guidance tightens this to <500 tokens with a practical template of "3–5 key rules + 3 file pointers" (~200 tokens total).[2][13]
Key finding: "Five rules and three file pointers is the right size" for a CLAUDE.md. The 200-line official figure is an outer bound; the 500-token practitioner target is the optimum for startup token budget.[13]
| Source | Size Target | Rationale |
|---|---|---|
| Official Claude Code Docs[10] | <200 lines per file | Longer files reduce model adherence, not just context budget |
| buildtolaunch.substack.com[2][13] | <500 tokens | Explicit token budget; 200 tokens with "5 rules + 3 pointers" template |
| Practical minimum[13] | ~200 tokens | "3–5 key rules + 3 file pointers" — approximately 200 tokens total |
CLAUDE.md files load by walking UP the directory tree from the working directory. The five-tier cascade, in order of loading:[10]
./CLAUDE.md or ./.claude/CLAUDE.md — project root./CLAUDE.local.md — personal overrides, gitignored~/.claude/CLAUDE.md — user-global/etc/claude-code/CLAUDE.md or managed policy path — org-wide, cannot be excludedSubdirectory CLAUDE.md files load on demand when Claude reads files in those subdirectories — NOT at launch. This is the primary token-saving behavior in multi-project repositories.[10]
CLAUDE.md is delivered as a user message after the system prompt, not as the system prompt itself.[10] This has a critical implication: in long sessions, the model deprioritizes earlier CLAUDE.md instructions in favor of recent conversation history — a phenomenon documented as "context drift."[16] Skills (loaded only when invoked) are more robust to context drift than CLAUDE.md baseline injection.[16]
| Technique | Mechanism | Token Impact | Source |
|---|---|---|---|
| HTML comment stripping | <!-- notes --> stripped before injection |
Zero cost for maintainer notes | [10] |
| @Import syntax | References external files that expand at load | Organizes but does NOT reduce context; max 5-hop depth | [10] |
| Path-scoped rules | .claude/rules/ with paths: frontmatter |
Load only when matching file is read; largest available saving | [10] |
| Skills for procedures | Move playbooks out of CLAUDE.md into .claude/skills/ |
Pay only on invocation, not every turn | [4] |
At session start, the context window contains:[10][1][14]
| Component | Typical Size | Notes |
|---|---|---|
| System prompt[1] | 2,700 tokens (1.3% of 200K) | Fixed Anthropic baseline |
| System tools[1] | 16,800 tokens (8.4%) | Tool definitions; largest non-user component |
| CLAUDE.md (150 lines)[10] | ~1,500 tokens | Scales linearly with file size |
| MEMORY.md index (100 entries)[10] | ~400 tokens | 200-line / 25KB cap enforced |
| Skill descriptions (10 skills)[4] | ~800 tokens | Descriptions only; full content loads on invocation |
| User skills (raw session data)[1] | 1,000 tokens (0.5%) | Observed baseline |
| Total startup overhead (lean) | ~3,000–5,000 tokens | Before first user message |
Path-scoped rules architecture is the official recommendation for sub-10K startup overhead: keep core always-on rules in CLAUDE.md (<200 lines), language/area-specific guidance in .claude/rules/, and procedures in skills.[10]
The compaction reserve buffer was reduced from ~45,000 tokens (22.5% of 200K) to ~33,000 tokens (16.5%) in early 2026 — an undocumented change that delivers ~12K additional usable tokens before compaction triggers.[1][14] Compaction now fires at ~83.5% context usage (previously 77–78%), yielding approximately 167K usable tokens in a 200K window before automatic summarization begins.[14]
Key finding: "Due to context rot, the model is at its least intelligent point when compacting." Proactive /compact before the limit produces materially better summaries than waiting for automatic trigger.[8]
Claude Code does NOT jump directly to full summarization. VILA-Lab's systematic analysis documents five sequential compression stages:[12]
| Layer | Name | Action |
|---|---|---|
| 1 | Budget Reduction | Reduces tool output verbosity |
| 2 | Snip | Removes oldest non-critical messages |
| 3 | Microcompact | Compresses recent messages |
| 4 | Context Collapse | Read-time, non-destructive |
| 5 | Auto-Compact | Last resort — full summarization |
| Control | Mechanism | Effect | Source |
|---|---|---|---|
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE |
Env var (1–100) | Shifts when compaction fires relative to total window | [1][14] |
sonnet[1m] |
Model selection | 1M context window; dramatically pushes back compaction threshold | [14] |
Manual /compact |
User command | Proactive summarization while context quality is high; accepts steering | [14] |
| StatusLine hook | Hook receives remaining_percentage |
Only live context monitor available; enables proactive compaction triggers | [14] |
CLAUDE_CODE_MAX_OUTPUT_TOKENS |
Env var | Controls output length — does NOT change compaction buffer (common misconception) | [14] |
Nine ordered sources load into the context window on session start (VILA-Lab systematic analysis):[12]
/etc/ — org-level)~/.claude/)CLAUDE.local.md)Claude Code uses append-only JSONL transcripts across three channels: global prompt history, subagent sidechains, and transcript logs.[12] Chain-patching at read time uses UUID anchors (headUuid/anchorUuid/tailUuid) — nothing is destructively edited on disk.[12] Permissions are never restored on resume; trust is re-established per session as a deliberate security design choice.[12]
Skills have a fundamentally different cost model from CLAUDE.md: "Unlike CLAUDE.md content, a skill's body loads only when it's used, so long reference material costs almost nothing until you need it."[4] This makes skills the primary mechanism for keeping session startup cheap while preserving access to complex procedures.
Key finding: After auto-compaction, Claude Code re-attaches the most recent invocation of each skill (first 5,000 tokens each) up to a combined budget of 25,000 tokens — filling from most recently invoked. Older skills can be dropped entirely if the budget is exceeded.[4]
| Scenario | What Loads | Token Cost | Source |
|---|---|---|---|
| Regular session (not invoked) | Skill descriptions only | Up to ~384 tokens per description (1,536-character cap[4][16]); total skill-description budget capped at 1% of context window (~2,000 tokens at 8,000-character fallback) | [4][16] |
| Regular session (invoked) | Full skill content | Full file size (up to 500 lines) | [4] |
| Post-compaction | Most recently invoked skills, first 5K tokens each | Max 25,000 tokens total across all re-attached skills | [4] |
| Subagent with preloaded skills | Full skill content at startup | Higher than regular session; different cost model | [4] |
| Scope | Path | Applies to |
|---|---|---|
| Enterprise | Managed settings | All org users |
| Personal | ~/.claude/skills/<skill-name>/SKILL.md |
All your projects |
| Project | .claude/skills/<skill-name>/SKILL.md |
This project only |
| Plugin | <plugin>/skills/<skill-name>/SKILL.md |
Where plugin is enabled |
Source: Official Claude Code Docs.[4]
| Frontmatter | User can invoke | Claude can invoke | Description in context |
|---|---|---|---|
| (default) | Yes | Yes | Always in context |
disable-model-invocation: true |
Yes | No | NOT in context (description also removed) |
user-invocable: false |
No | Yes | Always in context |
Token-saving insight: disable-model-invocation: true removes the skill description from context entirely — useful for rarely-used deployment or admin skills to reduce baseline overhead.[4]
| Limit | Value | Impact if Exceeded | Source |
|---|---|---|---|
| Max SKILL.md length | 500 lines | Degraded performance | [4] |
| Description truncation | 1,536 characters per entry | Keywords stripped; Claude misses invocation triggers | [4][16] |
| Total description budget | 1% of context window OR 8,000 characters (fallback) | Descriptions shortened across all skills to fit | [4][16] |
| Budget override | SLASH_COMMAND_TOOL_CHAR_BUDGET env var |
Raises limit; use when many skills needed | [4] |
| Skill | ROI Assessment | Optimal Trigger | Source |
|---|---|---|---|
/simplify |
Highest ROI — spawns 3 parallel review agents; catches unused imports, redundant variables, shared logic | After refactoring or AI-generated code | [15] |
/batch |
Maximum impact for large-scale changes — parallelizes across 5–30 independent units using isolated worktree agents | Large migrations | [15] |
/review |
Complementary correctness validation — bugs, logic errors, edge cases; pairs with /simplify | Before commit/PR | [15] |
/compact |
Essential for session sustainability — proactive compression before new work phases | Before phase transition, not at context limit | [15] |
Recommended workflow sequence: review → fix → simplify.[15]
| Pattern | Problem | Fix | Source |
|---|---|---|---|
| Background facts / style guides in SKILL.md body | Bloats file; context cost every invocation | Move to supporting files in skill directory | [4] |
| Skills that should be CLAUDE.md rules | Wrong abstraction; pay invocation overhead for always-on content | Static always-on context belongs in CLAUDE.md | [4] |
| Large reference docs embedded directly | 5,000-token skill body loaded on every invocation | Reference from SKILL.md; load only when explicitly needed | [4] |
| Skills with process steps + background context mixed | "Packing skills with background context bloats the file and degrades performance" | Skills contain process steps only | [16] |
/loop skill |
Limited real usage; useful only for specific monitoring scenarios | Situational — use only when actually polling | [15] |
/debug skill |
Situational troubleshooting only — “niche skill, but invaluable when you need it”; not for routine use | Keep but invoke only for specific debugging scenarios; not a session-start default | [15] |
| Admin/destructive skills with model invocation enabled | Risk of unintended invocation; always in context budget | Set disable-model-invocation: true |
[4] |
The !`command` syntax in skill files executes shell commands at invocation time — output replaces the placeholder before Claude sees the skill content.[4] This enables sub-second injection of live data (git diffs, test results, PR comments) at skill invocation rather than baking stale data into the session start:
---
name: pr-summary
context: fork
agent: Explore
---
## Pull request context
- PR diff: !`gh pr diff`
- PR comments: !`gh pr view --comments`
This pattern decouples "what's always in context" from "what's injected when needed" — a key technique for sub-3s startup with full situational awareness.[4]
| Content Type | Mechanism | Cost Profile | Source |
|---|---|---|---|
| Facts, rules, patterns (always relevant) | CLAUDE.md | Pay every turn | [4] |
| Procedures, playbooks, workflows | Skills | Pay only when invoked | [4] |
| Large reference docs | Supporting files in skill directory | Pay only when explicitly loaded via skill | [4] |
Skills are also hot-reloaded — edits take effect within the current session without restart. Creating a new top-level skills/ directory requires restart.[4]
Claude Code provides four mechanisms for session continuity with distinct tradeoffs: --continue (auto-resume latest), --resume (targeted resume), /compact (summarize-in-place), and /clear (fresh start with manual brief). Each addresses a different failure mode in the context lifecycle.[5][8]
| Flag | Behavior | Ideal Use Case | Source |
|---|---|---|---|
--continue (-c) |
Auto-loads most recent conversation in current directory without user interaction | Single active task, single project | [5] |
--resume (-r) |
Interactive picker or explicit name/ID selection across sessions | Multiple parallel projects; targeted resume | [5] |
What gets restored with either flag: complete message history, all tool results (files read, commands executed, code modifications), full conversation context.[5]
| Key | Action |
|---|---|
| Up/Down arrows | Navigate sessions |
| Enter | Resume selected session |
| P | Preview content |
| R | Rename session |
| / | Search/filter |
| B | Filter by Git branch |
Source: Pillitteri, 2026.[5]
Sessions persist in two locations:[5]
~/.claude/history.jsonl — global index with timestamps, paths, session IDs~/.claude/projects/ — per-project directories containing:
.jsonl files (complete transcripts)sessions-index.json (summaries, metadata)memory/ subdirectory for persistent notesSessions are grouped by Git repository, including worktrees. No automatic cleanup or expiration — indefinite retention by default.[5]
| Approach | Method | Tradeoff | Source |
|---|---|---|---|
/compact |
Model summarizes automatically; accepts steering via inline flags | Lossy but thorough; Claude decides relevance; can be guided with explicit focus | [8] |
/clear |
User writes distilled brief manually; zero-rot fresh session | More effort per session but you control relevance; eliminates model deprioritization of old content | [8] |
/compact focus on the auth refactor, drop the test debugging[8]Anthropic's official framework for choosing the right continuity mechanism:[8]
| # | Action | When |
|---|---|---|
| 1 | Continue (next message) | Same task, same context, no rot |
| 2 | /rewind (Esc Esc) | Jump back to previous message and retry |
| 3 | /clear + distilled brief | Task pivot, poisoned context, fresh start needed |
| 4 | /compact | Summarize session and keep working in same stream |
| 5 | Subagents | Delegate work to agent with clean context; preserve main session for decision-making |
The session system was NOT designed for sub-3s resumption of large sessions. Performance characteristics by session size:[5][12]
| Session Size | Resume Latency | Bottleneck |
|---|---|---|
| <50K tokens | <3 seconds | None — fast |
| 50K–100K tokens | Variable; can be significant | Serialization/deserialization of large tool outputs in history |
| >100K tokens | Meaningful latency | JSONL chain patching + UUID anchor resolution across full transcript |
| Structured handoff notes (<5K tokens) | Near-instant | JSONL deserialization of small payload is sub-second |
Key finding: Sub-3s startup with --resume requires keeping session history short. The JSONL append-only design means each resume call deserializes and chain-patches ALL prior messages. A 5K-token structured baton loads near-instantly; a 100K+ transcript does not.[12]
| Issue | Symptom | Status |
|---|---|---|
| GitHub #3138[5] | --resume fails to maintain context after hitting usage/context limits |
Known bug |
| GitHub #43696[5] | --continue and --resume do not restore prior conversation context (regression in 2.1.x) |
Patched but re-appears |
| GitHub #46445[5] | /continue and /resume show all sessions across projects instead of current project only (regression in 2.1.101) |
Regression |
| Non-interactive mode[5] | --continue may create new sessions in CI/automation |
Use --resume with explicit session names for reliable automation |
Claude Code provides two official memory mechanisms with different persistence models and token costs. CLAUDE.md files hold instructions and rules (loaded in full every session); auto-memory holds Claude's learned patterns and decisions (indexed summary loaded at startup, details loaded on demand).[10]
| Mechanism | Who Writes | What It Contains | Loaded Into |
|---|---|---|---|
| CLAUDE.md files[10] | Developer | Instructions, rules, project context | Every session (full content) |
| Auto memory[10] | Claude (agent-written) | Learnings, patterns, decisions, gotchas | Every session (first 200 lines or 25KB of MEMORY.md) |
~/.claude/projects/<project-hash>/memory/
├── MEMORY.md # Index — loaded into EVERY session (first 200 lines / 25KB)
├── debugging.md # Detailed notes (loaded on demand)
├── api-conventions.md # Architecture decisions (loaded on demand)
└── ...
Source: Official Claude Code Docs.[10]
Properties:[10]
autoMemoryDirectory setting allows custom location| File | When Loaded | Token Budget |
|---|---|---|
| MEMORY.md[10] | Every session start | First 200 lines OR first 25KB — hard cap |
| Topic files (e.g., debugging.md)[10] | On demand — Claude reads when relevant | Full file size; only when Claude judges it necessary |
Key finding: Claude Code's auto-memory uses LLM-based scanning of file headers (titles/descriptions) to select relevant files — NOT vector similarity search. "Names are the retrieval mechanism" — descriptive filename and description, not embedding distance.[10][12] This is a deliberate design choice for inspectability and reliability over semantic richness.
| Property | Native Auto-Memory | Vector-Store MCP |
|---|---|---|
| Retrieval method[12] | LLM scans file headers; selects up to 5 relevant files | Embedding similarity + BM25 |
| Retrieval speed | Seconds (LLM call required) | ~5ms (local SQLite-vec) |
| Inspectability | Full (plain Markdown files, readable by human) | Partial (requires query tool) |
| External dependencies | None | Requires MCP server running |
| Auto-capture | No — Claude decides what's worth saving | Yes — SessionEnd hooks + lifecycle hooks |
| Token efficiency | 200-line MEMORY.md cap; topic files on demand | Semantic filtering; only relevant context injected |
The official auto-memory design philosophy (from Claude Code Docs):[10]
CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 # env var
{"autoMemoryEnabled": false} # settings.json
Requires Claude Code v2.1.59+.[10]
| Content | Survives /compact? | Mechanism |
|---|---|---|
| Project-root CLAUDE.md[10] | Yes | Re-read from disk after compaction |
| Nested CLAUDE.md in subdirectories[10] | Deferred | Re-loaded next time Claude reads a file in that subdirectory |
| Instructions given only in conversation[10] | No | Must be added to CLAUDE.md to persist |
| MEMORY.md[10] | Yes (disk-persisted) | Re-loaded at next session start |
Two open-source tools provide vector-backed persistent memory for Claude Code: mcp-memory-service (doobidoo) targeting production AI agent pipelines, and claude-mem (thedotmack) targeting session capture and progressive disclosure within Claude Code workflows.[6][7]
| Mode | Interface | Use Case |
|---|---|---|
| REST API | 15 endpoints | Framework-agnostic HTTP clients (LangGraph, CrewAI, AutoGen) |
| MCP Server | Native tool integration | Claude Desktop, Claude Code native integration |
| Remote MCP via HTTPS/OAuth | Browser-accessible | Browser-based claude.ai integration |
Source: doobidoo/mcp-memory-service.[6]
| Backend | Use Case | Notes |
|---|---|---|
| SQLite-vec (default) | Local, no external dependencies | 5ms retrieval; recommended for Claude Code |
| Milvus (Lite / self-hosted / Zilliz Cloud) | Production scale | Managed or self-hosted options |
| Cloudflare Vectorize | Cloud deployment | Edge deployment |
| Hybrid local + cloud sync | Resilience / multi-machine | Addresses machine-local limitation of native auto-memory |
Source: doobidoo/mcp-memory-service.[6]
All-MiniLM-L6-v2, 384-dimensional embeddings, running locally via ONNX — no external API dependencies, sub-millisecond embedding inference.[6]
X-Agent-ID headersSource: doobidoo/mcp-memory-service.[6]
| Benchmark | Metric | Score |
|---|---|---|
| LongMemEval (500 questions)[6] | Recall@5 | 80.4% |
| LongMemEval[6] | NDCG@10 | 82.2% |
| LongMemEval[6] | MRR | 89.1% |
| DevBench (practical workflows)[6] | Recall@5 | 91.1% |
| DevBench[6] | Overall MRR | 0.861 |
| SQLite-vec retrieval speed[6] | Retrieval latency | 5ms — adds ~5–15ms total to startup including query formulation |
Two integration modes available:[6]
claude /plugin marketplace add doobidoo/mcp-memory-serviceSessionEnd auto-harvest hook — safe-by-default, zero npm dependencies, 5-second timeoutReliability fix in v10.40.3: Eliminated socket hang-ups via keepAlive: false and explicit Connection: close headers. Critical for production agent loops that restart frequently.[6]
| Hook | When It Fires | What It Does |
|---|---|---|
SessionStart[7] |
Session begins | Initialize session context |
UserPromptSubmit[7] |
User sends message | Capture user intent |
PostToolUse[7] |
After each tool call | Record tool usage and observations |
Stop[7] |
Session paused | Checkpoint state |
SessionEnd[7] |
Session ends | Final AI-powered compression and storage |
The key architectural innovation for token efficiency — access detail at the granularity you need:[7]
| Layer | Tool | Tokens per Result | Use Case |
|---|---|---|---|
| 1 — Index | search |
~50–100 tokens | Find relevant observations; get IDs |
| 2 — Timeline | timeline |
~200–500 tokens (estimated; exact token cost not documented in source[7] — estimate based on intermediate granularity between index and full-detail layers) | Chronological context around results |
| 3 — Detail | get_observations |
~500–1,000 tokens | Full details for filtered IDs only |
Token efficiency claim: ~10x savings by filtering before fetching vs. loading full session history.[7] Conservative re-estimate based on corpus analysis: ~3x reduction (50 observations × 500 tokens = 25K full load; vs. 3,750 token index + ~7,500 for 3–5 relevant observations).[7]
npx claude-mem install (requires Node.js 18+)[7]| Property | Native Auto-Memory | mcp-memory-service | claude-mem |
|---|---|---|---|
| Retrieval method[10][6][7] | LLM file header scan | Hybrid vector + BM25 + knowledge graph | Vector + FTS5 |
| Retrieval speed[12][6][7] | ~Seconds (LLM call) | 5ms | ~Milliseconds |
| Token efficiency | 200-line MEMORY.md cap; topic files on demand | Semantic filtering; decay + compression pipeline | 3-layer progressive disclosure; ~3–10x claimed reduction |
| External dependencies | None | SQLite-vec or Milvus | Node.js 18+, Bun, Chroma |
| Inspectability | Full (plain Markdown) | Partial (query tool required) | Partial (web viewer available) |
| Auto-capture | No — Claude decides | SessionEnd auto-harvest hook | 5 lifecycle hooks (every tool call) |
| Multi-machine support | No (machine-local) | Yes (Hybrid local+cloud or Zilliz) | No (local SQLite) |
| Production benchmarks | N/A | DevBench Recall@5: 91.1% | Not published |
Sources: raw_10.md, raw_12.md, raw_6.md, raw_7.md.[10][12][6][7]
See also: New Tooling 2026 (additional memory MCP releases in 2026)Agent context handoffs fail through two symmetric failure modes: information overload (passing complete message histories triggers "Lost in the Middle" effect) and aggressive compression (stripping away supporting evidence removes the ability to verify or extend prior analysis).[9] Both modes degrade quality; neither achieves sub-3s productive startup.
Key finding: "Handoffs treat context as a one-time data transfer, a blob of text thrown from one agent to another." The fix is replacing text dumps with layered, queryable briefing structures. Structured handoffs reduce average 15-minute handoff delays to seconds.[9]
| Failure Mode | Mechanism | Result | Source |
|---|---|---|---|
| Information Overload | Full message history passed; model buries critical signals | "Lost in the Middle" — performance degrades with large context dumps | [9] |
| Aggressive Compression | Summarization strips evidence and reasoning chains | Receiving agent cannot verify or extend prior analysis | [9] |
Replace text dumps with layered, queryable information (XTrace, 2026):[9]
| Layer | Content | Availability |
|---|---|---|
| 1 | Decisions and non-negotiable constraints | Always present — inject at every session start |
| 2 | Actual artifacts (not summaries) for reference | Available on demand — query by ID |
| 3 | Accumulated preferences and patterns | Retrievable via query |
| 4 | Timeline context (what happened and when) | Queryable chronological log |
Synthesized from XTrace research[9] and Anthropic's /clear + brief guidance:[8]
## Current State
- What exists and what it does
- Decisions already made (with why)
## Next Steps (ordered)
- Exact files to change + what to change
- Command to verify each step
## Constraints
- Non-negotiable decisions
- Things that were tried and failed
## Key Files
- File + what to look for in each
| Metric | Without Structured Handoff | With Structured Handoff | Source |
|---|---|---|---|
| Handoff delay | Average 15 minutes | Seconds | [9] |
| Context reconstruction turns | 5–10 turns to re-derive prior state | 1–2 turns from structured baton | [9] |
| Resume latency (JSONL) | Significant for >100K session transcripts | Near-instant for <5K structured notes | [12] |
The /clear + distilled brief pattern is the manual version of a structured baton pass:[8]
"When a new agent joins a workflow, it doesn't receive a dump. It queries the memory for a briefing."[9] This architecture allows agents to access accumulated knowledge immediately, avoid cold starts, and avoid degradation through successive summarization rounds — the core problem with repeated /compact cycles.[9]
Decision framework for subagent delegation (Anthropic):[8]
| Condition | Strategy |
|---|---|
| Only the conclusion needed (not intermediate output) | Use subagent — intermediate output stays in subagent context; only conclusion returns to parent |
| Output will be referenced again in main session | Keep in main session — subagent return loses the working context |
Subagents keep main context clean by using isolated Claude instances that return only distilled results.[16]
See also: Multi-CLI Coordination (baton protocols across parallel agents)The 5-minute prompt cache TTL is the fundamental constraint for sub-3s startup. With a warm cache (<5 min since last request), TTFT at 500K tokens is ~3.5 seconds. Cold cache at the same size takes 35–90 seconds — a 10–26x latency penalty.[3][11]
Key finding: "A 50K-token context with a 95% cache hit rate is more economical and faster than a 10K-token context that's cold on every request. The goal should be maximizing cache hit rate, not minimizing context size alone."[11]
| Context Size | Cache State | TTFT | Source |
|---|---|---|---|
| 50K tokens | Warm (<5 min) | ~1.7s | [11] |
| 50K tokens (GPT-4o baseline) | Warm, cached | 1,699ms (down from 4,290ms) | [11] |
| 500K tokens | Warm (<5 min) | ~3.5s | [3] |
| 500K tokens | Cold (>5 min) | ~35s | [3] |
| 1M tokens (extrapolated) | Cold | 60–90s | [3] |
Model note: Row 1 (~1.7s at 50K warm cache) is a Claude-inferred estimate extrapolated from cross-provider caching improvement data in [11]. Row 2 (1,699ms at 50K warm cache) is a direct GPT-4o measurement from arxiv 2601.06007[11]; it is included as a cross-provider reference point. Rows 3–5 (500K and 1M) are Claude measurements from claudecodecamp.com[3].
| Model | Cost Reduction from Caching | Source |
|---|---|---|
| GPT-5.2[11] | 79–81% | [11] |
| Claude Sonnet 4.5[11] | 78–79% | [11] |
| Average across providers[11] | 41–80% | [11] |
| TTFT improvement from caching[11] | 13–31% across providers | [11] |
Cache activation requires a minimum prompt length:[11]
| Provider | Minimum Tokens for Cache | Below Minimum |
|---|---|---|
| Anthropic[11] | 1,024 tokens | TTFT regression of 10–18% (worse than no caching) |
| OpenAI[11] | 4,096 tokens | Cache does not activate |
"Full context caching can paradoxically increase latency" when dynamic content like tool results gets cached without producing subsequent hits. Cache writes consume compute; if the cached content isn't reused, you pay write cost with no benefit.[11]
Strategic insight: "System prompt only caching provides the most consistent benefits."[11] Operational rule:
| Content Type | Cache Strategy | Reason |
|---|---|---|
| CLAUDE.md + skills + static tools[11] | Maximize cache stability — never change between calls | Stable prefix → consistent cache hit rate |
| Dynamic content (tool results, git status)[11] | Do NOT cache; generate fresh each turn | Cached dynamic content rarely gets a hit; pays write cost for nothing |
| Session recovery data[11] | Inject AFTER stable prefix | Dynamic injection after stable prefix does not break cache on prefix |
Synthesized from six sources — five independent approaches that combine into a coherent stack:[11][6][3][12][8]
| Step | Action | Target | Why |
|---|---|---|---|
| 1[11] | Keep stable context (CLAUDE.md + memory index + static tools) under 50K tokens | TTFT ~1.7s warm | Below 50K: consistent sub-2s warm TTFT |
| 2[11] | Inject dynamic session recovery data AFTER stable prefix | Cache hit on prefix preserved | Dynamic suffix doesn't break cache on the stable portion |
| 3[6] | Use semantic retrieval (5ms) to select what dynamic content to inject | ~5–15ms memory retrieval overhead | Only relevant prior context injected, not full history |
| 4[3] | Stay under 200K tokens | No long-context price cliff | 200K crossing doubles all per-token costs |
| 5[12] | Keep structured handoff (baton) under 5K tokens | Near-instant JSONL deserialization | Chain-patching small JSONL is sub-second; large transcripts are not |
| 6[6] | Store session summaries + key decisions in vector store at session end; semantic query on startup | Relevant prior context in 5ms | Avoids cold-start reconstruction; externalized memory survives /clear |
The 1M context window addresses a different problem than sub-3s startup. It eliminates automatic summarization for most coding sessions but introduces a 200K token pricing cliff (2× per-token cost), a 5-minute cache TTL cold-restart penalty, and measurable accuracy degradation above 256K tokens.[3]
When crossing 200K tokens, ALL tokens get premium pricing — not just those above the threshold. Step function, not linear scale:[3]
| Model | Standard Input | Long Context Input (2×) | Standard Output | Long Context Output (1.5×) |
|---|---|---|---|---|
| Opus 4.6 | $5.00/M | $10.00/M | $25.00/M | $37.50/M |
| Sonnet 4.6 | $3.00/M | $6.00/M | $15.00/M | $22.50/M |
Infrastructure cost justification: long-context sessions require hundreds of gigabytes of GPU memory reserved per user and consume 25× more compute on cold starts.[3]
Cache pricing at the 200K threshold (Opus 4.6): Cache reads double from $0.50/M (standard) to $1.00/M; cache writes from $6.25/M to $12.50/M — a 2× multiplier on both. This cache cost step-up is additive to the standard input/output price increase: sub-3s startup stacks relying on prompt caching face double cache overhead when crossing 200K tokens.[3]
| Model | 256K Context Accuracy | 1M Context Accuracy | Metric |
|---|---|---|---|
| Opus 4.6[3] | 93% | 76% | General retrieval accuracy; "lost in the middle" effect |
| Sonnet 4.5[3] | N/A | 18.5% MRCR score | Effectively unreliable at 1M; avoid entirely |
Critical information should be at the beginning or end of context for best retrieval at long context lengths.[3]
| Use Case | 1M Appropriate? | Reason |
|---|---|---|
| Single-shot analysis of large codebases/documents[3] | Yes | Full corpus in context; no compaction interruption |
| Deep debugging requiring full context preservation[3] | Yes | Compaction during debugging loses critical trace data |
| Multi-agent shared state[3] | Yes | Eliminates need for aggressive handoff compression |
| Compliance work requiring exact citations[3] | Yes | Source material must remain uncompacted |
| Routine coding sessions[3] | No | Most sessions peak at 80–120K before compaction anyway |
| Sessions with frequent pauses[3] | No | 5-minute cache TTL → painful cold restarts (35–90s) |
| Sonnet 4.5 at 1M[3] | No | 18.5% MRCR — unreliable; avoid entirely |
Key finding: A 400K-token session costs roughly 5× per turn compared to a post-compaction 80K session. Better session management through intentional clearing often outperforms relying on larger context windows for most coding tasks.[3] 1M context does NOT help sub-3s boot — cold cache at 500K tokens takes 35 seconds to first token.[8]
The distinction between capacity and startup speed:[8][3]
Sub-3s productive startup requires stacking five independent mechanisms. No single tool achieves it alone; the architecture requires: stable prefix caching, lean startup context, structured handoff format, semantic memory retrieval for dynamic injection, and session lifecycle management aligned with the 5-minute cache TTL.
| Layer | Mechanism | Target | Failure Mode Without It |
|---|---|---|---|
| Stable context[11][10] | CLAUDE.md <500 tokens + path-scoped rules + lean MEMORY.md | <50K tokens stable prefix | Large stable prefix → TTFT >3.5s even warm |
| Cache hit rate[11] | Keep stable prefix unchanged between requests; stay under 200K tokens | >90% cache hit on stable prefix | Cold cache at 500K = 35s startup |
| Session format[12] | Structured baton <5K tokens instead of full transcript resume | Near-instant JSONL deserialization | 100K+ JSONL transcript deserialization adds meaningful latency |
| Dynamic injection[6] | Semantic retrieval (5ms) to select relevant prior context | 5–15ms retrieval overhead | Full history injection bloats context; destroys cache hit rate |
| Session discipline[8][15] | Proactive /compact before phase transitions; /clear on task pivots | Session stays under 100K tokens | Context rot + auto-compaction at worst intelligence point |
| Situation | Recommended Action | Source |
|---|---|---|
| Same task, clean context | Continue (send next message) | [8] |
| Wrong direction, same task | /rewind (Esc Esc) to prior message | [8] |
| Same task, starting new phase | Proactive /compact with steering before phase transition | [15][8] |
| Task pivot or poisoned context | /clear + manual distilled brief (baton) | [8] |
| Subtask where only conclusion needed | Subagent delegation; intermediate output stays in subagent context | [8][16] |
| Resume after >5 min break | Expect cold cache; use structured baton (<5K tokens) for fastest productive start | [3][12] |
| New session, need prior context | Vector MCP retrieval (5ms) OR MEMORY.md index + demand-load topic files | [6][10] |
Documentation of working stacks from corpus sources:
| Practitioner / Source | Stack Described | Key Metric |
|---|---|---|
| doobidoo/mcp-memory-service[6] | SQLite-vec local + SessionEnd auto-harvest + semantic retrieval on startup | 5ms retrieval; adds 5–15ms total startup overhead for memory layer |
| XTrace structured handoff[9] | Layered briefing format + queryable memory infrastructure | 15-minute handoff delays reduced to seconds; 5–10 turns reconstructions → 1–2 turns |
| claudefa.st (tool vendor)[1][14] | CLAUDE_AUTOCOMPACT_PCT_OVERRIDE + StatusLine hook real-time monitoring | Compaction quality metric, not startup latency: proactive compaction before buffer limit preserves context quality; no before/after startup time reported |
| arxiv 2601.06007 (academic paper — GPT-4o measurement)[11] | Stable prefix caching + dynamic suffix injection (cross-provider caching study, not a Claude Code deployment) | GPT-4o direct measurement: 50K warm cache TTFT = 1,699ms (down from 4,290ms); 13–31% TTFT improvement from caching. Included as cross-provider reference; not a Claude practitioner benchmark. |
Note on corpus coverage: The entries above document tools and approaches from corpus sources (vendor documentation, academic papers). No independent practitioner case studies — developers who measured their own Claude Code cold-start latency, applied a config stack, and published before/after numbers — were found in raw_1.md through raw_16.md. To add such entries, search: "claude code startup latency optimization site:reddit.com OR site:github.com OR site:dev.to" and "claude code resume performance blog 2025 2026".
Areas where corpus data identifies open problems without current solutions:[5][3][11]
Research gap — content removed pending source verification. This section originally cited sources [17]–[20], none of which appear in the consolidated corpus source index (raw_1.md–raw_16.md). Specific claims about “Ultraplan,” a Ctrl+G editor integration, plan approval-flow options, and plan persistence paths could not be verified against the corpus. The “Ultraplan” product name has no corpus anchor. All content relying on sources [17]–[20] has been removed per corpus verification policy.
To restore this section: fetch the official Claude Code plan mode documentation (the URL pattern from confirmed sources suggests code.claude.com/docs/en/plan-mode or the Anthropic docs equivalent), add the real URLs to the consolidated source index, and re-synthesize using only verified claims with proper citation numbers.
The EnterPlanMode and ExitPlanMode tools are confirmed real Claude Code API tools (they appear in the Claude Code tool declaration list). No other plan mode mechanics are documented in the consolidated corpus (raw_1.md–raw_16.md).