Pillar: sediment-patterns | Date: April 2026
Scope: Custom orchestration sediment: what heavy orchestration builders deleted and why. Specific patterns: AI strategic planner personas, multi-agent debate systems (Tribunal-style), tool-profile abstractions, knowledge graphs nobody queries, skills loaded into context but never invoked, commands mirroring native functionality, hooks firing on every event with no observable behavior change, write-only databases. Net deletion stories — what practitioners tore down, what they kept, and why.
Sources: 65 gathered, consolidated, synthesized.
Orchestration sediment is the residue of solutions whose problems have been solved. The defining characteristic is asymmetric growth: additions are triggered by incidents and failures; removals have no trigger at all. Every bad output causes an instruction, guardrail, or wrapper to be added. Good output generates no feedback loop. The result is unbounded accumulation toward zero signal-to-noise ratio.[11]
Key finding: "If your skill files only ever grow — if you add new guidance whenever something goes wrong but never audit what's already there — context rot is accumulating by default."[11] The asymmetry is structural: the addition trigger is automatic; the removal trigger does not exist.
| Pathway | Trigger | What Gets Added | Why It Persists | Source |
|---|---|---|---|---|
| Failed-promise residue | Autonomous AI promised, autonomy failed anyway | Strategic planner personas, multi-agent debate, complex tool profiles | Built to chase a goal that never arrived; teardown requires deliberate effort | [19] |
| Platform absorption | Platform ships feature natively | Custom RAG pipelines, context management scripts, external artifact directories | Custom layer remains even after native feature covers identical ground | [1][62] |
| Speculative abstractions | "Just in case" / imagined future requirements | Dynamic variables, abstraction layers, error handlers for scenarios that never occur | Removal seems risky; nobody can confirm the scenario won't occur | [43][65] |
| Framework expiration | Underlying problem space stabilizes | Abstraction layers built to compensate for early-era model/API inconsistency | Framework predates convergence; nobody re-evaluates whether it still solves anything | [3] |
Between 2024 and 2026, three previously-custom orchestration categories became sediment when their underlying platform shipped native equivalents.[62]
Similarly, developers who built manual context-clearing workflows, external artifact directories (thoughts/ directories), elaborately verbose prompts, and custom session segmentation found all of these absorbed by automatic compaction, native plan mode, and platform session management.[1]
The Karpathy principle[65] provides the core deletion heuristic: speculative abstractions (for imagined futures), speculative features (for potential use cases), and speculative error handlers (for scenarios that have not occurred) are all sediment. They were added for futures that never arrived; now they consume context without producing value. The deletion criterion: if the scenario the code handles has never occurred in production, the code is speculative sediment.[43]
See also: Cost Optimization (token costs of loaded sediment), New Tooling 2026 (which new tools survived this cycle)

The LangChain exit is the most thoroughly documented net-deletion story in AI orchestration. Multiple teams arrived at the same conclusion independently, across different application domains. Each built on LangChain when it solved a real problem (inconsistent vendor APIs, 2022–2023), then tore it down when that problem disappeared (vendor API convergence, 2025).
Key finding: "The abstraction LangChain provided in 2022 was solving a temporary problem. By 2025, the model providers solved most of it themselves."[3] Every framework has an expiration date tied to the stability of its underlying problem space.
Octomind removed LangChain from an AI agent system that used multiple LLMs to automatically create and fix end-to-end Playwright tests, after 12+ months in production.[31] The deletion was driven by three specific capability gaps.
Abstraction stacking anti-pattern: The framework layers "abstractions on top of other abstractions," forcing developers to navigate "nested abstractions" and debug framework internals. Octomind spent "as much time understanding and debugging LangChain as it did building features."[31] What replaced it: direct LLM client libraries, carefully selected external packages, and simple, comprehensible low-level code.
For another practitioner, the turning point came when "a LangChain agent called a DELETE endpoint I had not intended to expose."[44] The framework passed LLM tool selection directly to execution without an intervening validation layer. The replacement: 80 lines of Python. The root anti-pattern: LangChain treats LLM outputs as program output; the 2026 pattern treats them as untrusted user input requiring validation.
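A minimal sketch of that validation gate, with illustrative tool names and a deliberately tiny dispatcher; the point is the allowlist check sitting between the model's tool selection and execution:

```python
# Treat the model's tool choice as untrusted input: validate it against
# an explicit allowlist before anything runs. Tool names and handlers
# are illustrative, not from the cited incident.
ALLOWED_TOOLS = {
    "search_tests": lambda args: f"searched for {args.get('query', '')}",
    "read_file": lambda args: open(args["path"]).read(),
}

def run_tool_call(name: str, args: dict) -> str:
    if name not in ALLOWED_TOOLS:
        # The DELETE-endpoint incident happened because no check like
        # this existed between tool selection and execution.
        raise PermissionError(f"Model requested non-allowlisted tool: {name}")
    return ALLOWED_TOOLS[name](args)
```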
Production teams migrating to raw SDKs reported consistent, measurable gains:[3]
| Metric | LangChain (Before) | Raw SDK (After) | Improvement |
|---|---|---|---|
| Customer support agent code volume | 1,200 lines | 630 lines | 47% reduction |
| Debugging time per incident | 90–240 minutes | 30–60 minutes | 3–4x faster |
| Developer onboarding | 1–2 weeks (10 engineer-days) | 1–2 days (2 engineer-days) | 5x faster |
| Request latency per tool call | 10–30ms | 2–5ms | 5–6x lower |
| Framework-attributed incidents | 5–15% of all LLM incidents | Near zero | 70–90% drop |
| Annual version migration cost | 1–3 engineer-weeks | 0 | Eliminated |
| Total monthly overhead | $8K–$30K per agent | Baseline | Eliminated |
Post-migration outcomes: a 70–90% drop in framework incidents and 40–60% code reduction, with a 2–6 month amortization period.[3] Observable sediment indicator: stack traces spanning "15 to 40 frames of internal framework code," burying the debugging signal under abstraction layers that carry no signal of their own.
An insurance company built a multi-agent conversation system for a training simulator using AutoGen, then deleted the entire system; the before/after comparison appears in the net deletion stories below.[49]
Community feedback on LangChain converged on the same anti-pattern, and LangChain CEO Harrison Chase acknowledged the structural cause: "The initial version of LangChain was pretty high level and absolutely abstracted away too much."[4]
| Deleted Component | Why Deleted | What Replaced It |
|---|---|---|
| AgentExecutor | Hid execution flow; obscured validation gap | Explicit loop with validation gate |
| ConversationBufferMemory | Unnecessary once vendor APIs converged | Direct conversation list management |
| Custom output parsers | JSON mode / structured output now native | Pydantic / Instructor |
| Tool class hierarchies | Added abstraction without hiding complexity | Plain Python functions with typed signatures |
| Chain composition primitives | Python function composition already does this | Regular Python functions |
Preferred replacement stack (2026): Vanilla Python + direct SDK + LlamaIndex for retrieval + Pydantic/Instructor for structured output.[56]
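A minimal sketch of the structured-output half of that stack, assuming a placeholder model id and prompt; Pydantic validation replaces the custom output parsers deleted in the table above:

```python
# Direct SDK call plus Pydantic validation in place of framework output
# parsers. The model id, prompt, and schema are placeholders.
from anthropic import Anthropic
from pydantic import BaseModel

class TriageResult(BaseModel):
    category: str
    priority: int

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; pin whatever you run in production
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": 'Return JSON {"category": ..., "priority": ...} for this ticket: printer on fire',
    }],
)

# Malformed output fails loudly here instead of propagating downstream.
result = TriageResult.model_validate_json(response.content[0].text)
```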
See also: Failure Containment (active failure modes when abstraction layers fail)

Multi-agent systems are the single largest source of orchestration sediment because they are the hardest to delete: each agent appears to have a purpose, and decomposing the system requires understanding how the agents coordinate. The empirical evidence shows that coordination overhead routinely exceeds the capability benefit.
Key finding: A Google DeepMind study found unstructured multi-agent networks amplify errors "up to 17.2 times compared to single-agent baselines."[14] A system with 95% per-step success drops to 59.9% reliability across 10 sequential steps and 35.8% across 20 steps.
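The compounding arithmetic behind those reliability figures, as a one-glance check:

```python
# Per-step success compounds multiplicatively across sequential steps,
# so a 5% per-step error rate collapses end-to-end reliability.
for steps in (10, 20):
    print(steps, round(0.95 ** steps, 3))  # -> 10 0.599 / 20 0.358
```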
| Study / Source | Finding | Metric |
|---|---|---|
| Google DeepMind | Error amplification in unstructured multi-agent networks | 17.2x vs. single-agent baseline[14] |
| MAST study | Failure rate across 7 frameworks | 41%–86.7% failure; 36.9% from coordination breakdowns[14] |
| Coordination saturation threshold | Gains plateau beyond 4 agents | 10+ agent systems typically fail; overhead exceeds gains[14] |
| Capability saturation | Once a single agent exceeds 45% accuracy, adding agents yields negative returns | 35% performance drop in Minecraft planning tests from coordination overhead[5] |
| Budget utilization | Agents given 100-tool budgets used only 14.24 searches on average | Doubling the budget improved accuracy by 0.2 percentage points[5] |
| Token multiplication | Multi-agent conversation systems | 3.5x–5x baseline token consumption[14][49] |
The net performance calculation for multi-agent systems:[5]
Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)
Three failure mechanisms materialize when the subtraction side exceeds the addition side.
The "Tribunal" pattern — advocate agent + attacker agent + radical agent — is structurally identical to the "Bag of Agents" anti-pattern that produces the 17.2x error amplification: agents interact without enforced contracts or structured topology.[14]
Specific exotic topologies documented as sediment-generating:[9]
| Pattern | Why It Gets Built | Why It Gets Deleted |
|---|---|---|
| Swarm / handoff topology | "Autonomous agent coordination" promise | Amplifies invisible state problems instead of fixing them; "distributes" state rather than managing it[9] |
| LLM-as-Judge | Quality assurance without human review | Adds biases, doubles latency and cost; "Agents critiquing each other: nearly impossible to reproduce or audit"[9] |
| CodeAct (agents emitting Python) | Flexible execution | Dramatically increases hallucination blast radius[9] |
| CUGA (Computer Use Generalist Agent) | UI automation without custom tooling | Reliability and security challenges at production scale[9] |
| Multi-planner debate | Consensus through argument | Circular disagreements, token multiplication, unpredictable convergence[9] |
Key finding on debate systems: "95% of investments in generative AI have produced zero measurable returns" (MIT Media Lab, 2025) due to architectural patterns lacking production foundations, with multi-agent debate systems the leading culprit.[9]
From agentpatterns.tech, a documented payment processing pipeline teardown:[39]
| Agent | Deletion Reason |
|---|---|
| Planner agent | Strategic planning for a deterministic process; planning time exceeded execution time |
| Router agent | Routing logic was 3 conditional statements; LLM routing added latency and unpredictability |
| Retrieval agent | Replaced by direct database lookup |
| Responder agent | Merged into single output step |
| Policy agent | Policies were static rules; encoded as config, not LLM calls |
| Critic agent | Audit trail requirement met by deterministic logging; LLM critic added false-positive noise |
Key metric: "A typical user request triggers 4+ agent handoffs where 1–2 would be enough."[39] Decision rule for future additions: "Add a new agent only when there is a clear role, measurable value, and explicit ownership boundary."
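As a sense of scale for the router deletion above, a sketch of what "routing logic was 3 conditional statements" looks like in practice; fields, thresholds, and worker names are invented:

```python
# The deleted LLM router agent reduced to three conditionals: fast,
# deterministic, and debuggable. All names here are illustrative.
def route(request: dict) -> str:
    if request.get("type") == "refund":
        return "refund_worker"
    if request.get("amount", 0) > 10_000:
        return "manual_review"
    return "standard_worker"
```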
Organizations that successfully deployed multi-agent systems in 2025–2026 deliberately constrained scope.[5]
Anthropic's Building Effective Agents documentation: "Finding the simplest solution possible, and only increasing complexity when needed is essential... Many teams mistakenly add complexity without demonstrating it improves outcomes."[16]
Anthropic's Applied AI team at AWS re:Invent 2025: "What didn't work: Breaking tasks into concurrent sub-tasks for parallel execution didn't succeed practically." They moved away from "complex 50-prompt chained workflows" toward "agentic loops with tools — fewer edge cases to hardcode."[18]
Claude Code best practices guide: "Despite multi-agent systems being all the rage, Claude Code has just one main thread… I highly doubt your app needs a multi-agent system."[53]
Epsilla's analysis of Anthropic's harness philosophy:[36] the BrowseComp benchmark showed that giving models code-writing capabilities for self-filtering improved accuracy from 45.3% to 61.6%; removing the orchestration harness improved outcomes. "Models demonstrating near-human coding abilities no longer need hand-holding through restrictive harnesses."
See also: Failure Containment (active failure modes during multi-agent execution)

Tool proliferation is structural sediment: each tool is added for a reason, but tools are never removed when that reason disappears. The result is ever-expanding schema surface area that consumes context budget before any work begins.
Key finding: MCP costs 4x to 32x more tokens than CLI for identical operations. Simple task comparison: 1,365 tokens via CLI vs. 44,026 tokens via MCP. The difference is eager schema injection vs. progressive disclosure via --help.[50]
| Source | Finding | Metric |
|---|---|---|
| achan Anti-Pattern #7 | 100+ tools without curation: tool definitions consume context before work begins | ~72K tokens upfront, 60% of a 128K context window[10] |
| achan | Cost explosion from tool-definition overhead | 4-turn conversation burns 288K tokens on definitions alone; up to 400% cost increase[10] |
| CLI vs. MCP comparison | Simple-task token comparison | 1,365 tokens (CLI) vs. 44,026 tokens (MCP), a 32x overhead[50] |
| GitHub #29971 | MCP skill injection per tool call | ~25K tokens wasted per call; 50 calls = 1.25M tokens on unused descriptions[30] |
| GitHub #44536 | ToolSearch deferred loading vs. eager loading | 85% token reduction for tool schemas; no functionality loss[59] |
Claude Code loads all context at startup regardless of need. The startup context budget allocation, documented via the feature request for lazy loading (GitHub #44536):[59]
| Component | Tokens at Startup | % of 200K Context Window |
|---|---|---|
| MCP tool definitions (7 servers) | 67,300 | 33.7% |
| Skills (20+ plugins) | 30,000–40,000 | 15–20% |
| Tool definition behavioral instructions | ~15,000 | ~7.5% |
| Rules (re-injected per tool call) | 6,200+ per turn | 3%+ per turn; ~46% cumulative over session lifetime |
| Total pre-work consumption | ~118,000+ | 65–75% before user types anything |
The baseline eager-loading behavior is structurally sediment — loading everything whether or not it's relevant.[59] ToolSearch already demonstrated the fix: defer MCP tool definitions, load full schemas only on demand, achieve 85% token reduction with zero functionality loss.
One-tool-per-endpoint designs (jira_create_issue, jira_update_issue, jira_add_comment) mirror the REST API exactly, creating maximal schema surface area.[50][37] One source calls this out directly: "Dozens of tools mirroring REST API (read_thing_a(), read_thing_b()) create context bloat and rigid abstractions."[37]
Replacement pattern: Task-level tools that combine related operations into fewer, broader tools; dynamic tool loading via two-step routing (cheap planner selects tool families, specific tools loaded on demand).[37]
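A minimal sketch of both replacements together, under invented names: a single task-level jira tool whose action enum absorbs the three endpoint mirrors, and a deferred-loading index that keeps full schemas out of startup context:

```python
# Task-level tool (one schema, several related operations) plus deferred
# loading: startup context carries one line per tool; the full JSON
# schema is injected only after a cheap router selects that tool family.
# All names and schemas here are illustrative.
TOOL_INDEX = {
    "jira": "Create, update, or comment on Jira issues",
    "docs": "Search and read project documentation",
}

FULL_SCHEMAS = {
    "jira": {
        "name": "jira",
        "input_schema": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ["create", "update", "comment"]},
                "issue_key": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["action"],
        },
    },
}

def startup_context() -> str:
    # One line per tool instead of a full schema per tool.
    return "\n".join(f"{name}: {desc}" for name, desc in TOOL_INDEX.items())

def load_tool(name: str) -> dict:
    # Step two of the two-step routing: full schema on demand.
    return FULL_SCHEMAS[name]
```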
From Claude Code system prompt analysis:[20] "MCP Server Instructions" are "recomputed every turn (not cached)." Any MCP server with verbose instructions contributes uncached token overhead on every turn — this is the mechanism by which tool-profile abstractions accumulate as sediment. Each registered MCP server contributes per-turn overhead whether or not its tools are invoked that session.
Two bugs in Anthropic's prompt caching (GitHub #40524) caused 10–20x token inflation with no warning. Users had to reverse-engineer the Claude Code binary to find the cause.[30] This incident reveals the diagnostic problem with tool-profile sediment: overhead accumulates silently, with no observable signal until someone audits the billing.
Vercel removed 80% of their agent's tools, collapsing 15+ specialized tools to 2, with measurable performance improvement across every metric.[28] Full metrics appear in the net deletion stories below. The lesson: "The best agents might be the ones with the fewest tools. Every tool is a choice you're making for the model."[28]
See also: Cost Optimization (token cost accounting for tool definitions)

The skill activation problem is the purest form of orchestration sediment: infrastructure that is present in context, consuming tokens on every turn, and producing outcomes identical to having no infrastructure at all. The empirical evidence shows the activation failure rate is the majority case, not the exception.
Key finding: Vercel evaluation study found that in 56% of evaluation cases, the agent never invoked the skill it needed — despite the skill mechanism functioning correctly and documentation existing.[34] Skills delivered zero improvement over having no documentation at all.
| Condition | Pass Rate | Interpretation |
|---|---|---|
| No documentation at all | 53% | Baseline |
| Skills with default behavior | 53% | Identical to no documentation — zero marginal value[34] |
| Skills with explicit instructions | 79% | 26-point improvement when activation is designed-in |
| AGENTS.md (passive always-present context) | 100% | Eliminating the invocation decision eliminates invocation failures[34] |
Three activation failure mechanisms prevent skills from being invoked.[45]
A real audit of a 192-file skill setup documented multiple silent failure modes.[12]
From a production Hermes-agent setup managing 146+ skills:[58]
| Skill Count | Tokens/Turn (descriptions only) | Annual Token Cost (50 turns/day) |
|---|---|---|
| 146 (current production) | ~4,400 | ~80M tokens/year |
| 500 (hypothetical scaling) | ~15,000 | ~274M tokens/year |
There is zero visibility into which skills are dead weight and which are active: "There's no way to know: How many times a skill has been loaded via skill_view, When a skill was last used, Which skills are 'dead weight' (never used since installation)."[58] Without invocation metrics, skills accumulate with no feedback signal about whether they are ever used.
From Claude Code best practices:[53] "Manual Skills Are Ignored: Skills without auto-activation hooks fail approximately 90% of the time, rendering them useless regardless of quality." An entire category of manually-invoked skills is effectively sediment by default — present in context, consuming tokens, failing to activate.
Context rot occurs when skill files grow large enough to flood the context window, producing three performance failure modes.[11]
The solution to 56% non-invocation is eliminating the invocation decision entirely. AGENTS.md embeds compressed documentation index directly into the system prompt rather than requiring dynamic retrieval. Raw documentation (40KB) compressed 80% to 8KB using pipe-delimited structures — making "passive context" (always-present) feasible without excessive token cost.[34]
Design principle: "Reduce optionality at points where optionality introduces failure." Skills requiring an invocation decision introduce optionality at a failure point. Static context eliminates the option to miss.[34]
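A minimal sketch of the compression idea above, with entirely invented documentation entries; the pipe-delimited layout trades prose for density so the index can live in always-present context:

```python
# Compress verbose doc entries into one pipe-delimited line each so the
# whole index fits in the system prompt. Entries are invented for
# illustration; the 80% figure in [34] came from real documentation.
DOCS = [
    {"topic": "auth", "rule": "use session tokens", "see": "docs/auth.md"},
    {"topic": "db", "rule": "migrations via alembic only", "see": "docs/db.md"},
]

def compress(entries: list[dict]) -> str:
    return "\n".join("|".join(e[k] for k in ("topic", "rule", "see")) for e in entries)

print(compress(DOCS))
# auth|use session tokens|docs/auth.md
# db|migrations via alembic only|docs/db.md
```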
Deletion criteria for skills: "If the right behavior is what a competent developer would do anyway, you don't need to document it."[11] Targets: instructions that contradict each other, guidance added for a specific situation but never scoped to it, documentation for tools or patterns the project no longer uses.
CLAUDE.md sediment accumulates through the same asymmetric mechanism: rules are added when something breaks, never pruned when the problem disappears or Claude already handles it natively. The result is a file that grows until it exceeds the reliable attention limit, at which point additional rules actively degrade behavior by crowding out the rules that matter.
Key finding: CLAUDE.md pruned from 1,400 lines to 420 lines (70% reduction) produced simultaneous improvement in Claude's output quality. The reduction itself improved performance — fewer rules means more attention per rule.[55]
Frontier LLMs reliably follow 150–200 instructions.[6][23] Claude Code's system prompt already uses approximately 50 of those slots, leaving roughly 100–150 for project rules. The mathematical reality: a 200-line CLAUDE.md has already exceeded the reliable attention window. Research by Jaroslawicz et al. (2025) demonstrated that instruction compliance decreases linearly as instruction count increases.[63]
| CLAUDE.md Size | Observable Effect | Source |
|---|---|---|
| Under 60 lines | Industry standard maintained by HumanLayer | [6] |
| ~100 lines (~2,500 tokens) | Boris Cherny's production CLAUDE.md at Anthropic itself | [51] |
| 200 lines | "The bottom 200 lines were effectively invisible" (300-line file) | [6] |
| 420 lines (post-pruning) | Improved output quality vs. 1,400-line version | [55] |
| 1,400+ lines | Pre-pruning state: 71% token waste, 4x slower updates | [55] |
| 2,000 lines | "Half your context budget is gone before any work begins" — "dumb zone" | [6] |
The 1,400-to-420-line pruning case study in detail:[55]

| Metric | Before (1,400+ lines) | After (420 lines) | Change |
|---|---|---|---|
| File size | 1,400+ lines | 420 lines | 70% reduction[55] |
| Upfront tokens consumed | ~3,500 | ~1,000 | 71% reduction |
| Update time per change | 15–20 minutes | 3–5 minutes | 4x faster |
| Pattern duplication | 3–4 instances of same rule | Single source of truth | Eliminated |
| Output quality | Baseline | Improved | Fewer rules, more attention per rule |
One practitioner documented 200 lines of CLAUDE.md instructions that were systematically ignored:[63]
| Rule | Intended Behavior | Actual Outcome |
|---|---|---|
| "Search Before Speaking — iron rule" | Search before proposing solutions | Proposed solutions without searching |
| "ATOMIC SAVE PROTOCOL" | Actually write to disk before claiming save | Claimed to save without writing to disk |
| "KNOWLEDGE RETRIEVAL PROTOCOL" | Query 258 knowledge base files | Files were written but never retrieved |
| Banned phrases preventing false claims | Stop making false claims | Same false claims continued |
| Verification protocols | Auto-verify before claiming completion | Only worked when user manually prompted |
Resolution: Cut from 200 rules to ~20; replace the rest with code-based enforcement (hooks). "Rules in prompts are requests. Hooks in code are laws." The 180 deleted rules were pure sediment — loaded into context every session, zero observable behavior change.[63]
Official Anthropic documentation explicitly lists what NOT to include in CLAUDE.md.[51]
The advisory vs. deterministic distinction: "Unlike CLAUDE.md instructions which are advisory, hooks are deterministic and guarantee the action happens."[51] CLAUDE.md-based behavioral instructions produce zero observable effect when they conflict with trained behavior — they are advisory, not enforceable.
| Anti-Pattern | Why It's Sediment | Replacement |
|---|---|---|
| Prose style rules (indentation, quote style, line length) | Tools enforce these; prose duplicates them | ESLint, Prettier, TypeScript configs[6] |
| Historical context and war stories | Ticket numbers, migration stories, architecture snapshots from prior system states | Git commit messages and PR descriptions[55] |
| Rules duplicated from config files | Already enforced by ESLint/TypeScript/Prettier | @imports to reference config files directly[6] |
| Vague prohibitions ("Write clean code") | No executable signal; no observable behavior difference | Precise rules: "Use camelCase for variables, PascalCase for React components"[37] |
| Long custom slash command lists | "The entire point is natural language" — commands duplicate natural language capability | Natural language, or remove the commands[37] |
From an HN discussion: "If removing this line wouldn't cause Claude to make a different mistake, delete it."[7] The real cost of bloat is not token count per se; it is diluted signal. Instructions that would change behavior get buried under instructions that describe what Claude already does by default.
Contradictory finding on markdown headers: One HN commenter pushed back on complete removal: "LLMs do understand header levels in markdown. So removing these is detrimental." Consensus: compression matters most for resource-constrained models (Haiku) with large codebases; on Opus with simple projects, readability typically outweighs token savings.[7]
Hooks that fire on every event without producing an observable behavior change are the most insidious form of orchestration sediment: they appear active (logs show they ran), they appear configured (settings show they're registered), and they produce no value. Detection requires distinguishing "hook ran" from "hook changed behavior" — a distinction that is rarely made in practice.
Key finding: The zero-observable-effect hook pattern has a precise five-step sequence: (1) hook is configured and runs, (2) output is delivered to the model, (3) model acknowledges the hook content, (4) model proceeds with the original action anyway, (5) net effect: hook ran, nothing changed.[46]
A Hacker News thread on Claude 4.7 ignoring stop hooks:[46] users report that stop hooks designed to block session termination until tests have run are routinely ignored. Four failure mechanisms were identified:
| Failure Mechanism | Root Cause | Symptom |
|---|---|---|
| Training resistance to hook content | Claude trained to resist instructions embedded in tool results to prevent prompt injection | "You must do XYZ now" in hook output is exactly what the model is trained to ignore[46] |
| Schema changes break implementations | Claude changed "the schema for the hook reply" without notice | "Opus is caring f*** all about the response from the hook"[46] |
| Wrong output channel | Hooks output JSON to stdout; Claude Code ignores stdout | Exit code 2 with plain text on stderr is the only reliable mechanism[54] |
| Model behavior drift | Hooks that worked on one model version silently break with updates | Hook runs, logs show execution, behavior unchanged[46] |
GitHub issues document both hook failure modes in the wild:[21]

| Issue | Pattern | Effect on User |
|---|---|---|
| #34859 | Hook error messages shown on every tool call even when hooks exit 0 | Creates false "something is happening" signal when nothing changes[21] |
| #29767 | Stop hooks registered in hooks.json never fire even though SessionStart works | Appears registered and "working"; silently never fires[21] |
| #10463 | "Stop hook error" displayed after every response; all hooks produce exactly 0 bytes output | Noise that obscures whether hooks are actually running[21] |
| #2891 | Debug logs show literal template variables; exit codes not respected; operations proceed despite PreToolUse hooks | Complete separation between hook execution and behavioral effect[21] |
Both failure modes — "configured but never firing" and "firing but producing no behavioral change" — create identical user experience: complexity in the configuration, zero observable value, no way to distinguish functional from broken hooks without dedicated investigation.
A practitioner audit found hooks including one that "printed a cute message on session start" — the canonical ornamental hook.[13] Two more were "semi-broken and I'd been ignoring the errors for weeks." This is why hooks accumulate: deletion requires investigation (broken or working silently?), so practitioners accept broken/noisy hooks rather than investigating.
Auto-formatting hooks consumed 160K tokens across three rounds — not worth the marginal convenience.[53] The hook fired, performed formatting, consumed tokens, but the marginal value of automatic vs. manual formatting was negligible. Observable activity; no observable behavior improvement.
From a Claude Code source analysis: "over 1,200 sessions had experienced 50 or more consecutive compaction failures before a simple three-line fix capped retries at three attempts, saving roughly 250,000 wasted API calls per day globally."[60] The unbounded retry loop kept firing, consuming API calls, producing no observable value change. The fix: 3 lines. This is the prototype of "hooks firing on every event with no observable behavior change" at scale: 250K wasted API calls per day, zero change in observable output.
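The shape of that fix, as a hedged sketch; the actual patch is not reproduced in the source, and try_compact is a hypothetical stand-in for the real compaction call:

```python
# Cap retries instead of firing forever. A bounded loop converts an
# invisible, unbounded cost into an explicit, observable failure.
MAX_COMPACTION_RETRIES = 3

def compact_with_cap(try_compact) -> bool:
    for _ in range(MAX_COMPACTION_RETRIES):
        if try_compact():
            return True
    return False  # give up visibly instead of retrying indefinitely
```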
The memoryAge.ts module appends age warnings to memories but "doesn't reduce the memory's weight" or trigger verification.[61] It fires, adds a note, but observable behavior (what the agent does with old memories) is unchanged. Classic sediment: code runs on every memory operation, produces output, changes nothing.
The deletion conclusion from practitioners who tried stop hooks:[46] "I never got stop hooks to work and gave up on them." They replaced the hooks with pre-commit git checks (deterministic enforcement) or removed behavioral hooks entirely, falling back on CLAUDE.md instructions.
Fundamental category error: Expecting text-based instructions to enforce hard stops in probabilistic systems. Exit code 2 (block) works reliably. Natural language behavioral instructions via hooks do not.[46][51] The deletion story: elaborate stop-hook systems get torn down because they create complexity without observable effect.
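What survives is the deterministic variant. A minimal sketch of a PreToolUse hook using the exit-code-2 mechanism; the stdin payload fields follow Claude Code's documented hook interface, but the policy itself (blocking force-pushes) is invented for illustration:

```python
#!/usr/bin/env python3
# PreToolUse hook: mechanical enforcement instead of a behavioral request.
# Exit code 2 blocks the tool call and plain text on stderr is fed back
# to the model; exit code 0 allows the call to proceed.
import json
import sys

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

if payload.get("tool_name") == "Bash" and "--force" in command:
    print("Blocked: force-push is not allowed here.", file=sys.stderr)
    sys.exit(2)  # deterministic block; no model judgment involved

sys.exit(0)
```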
See also: Failure Containment (active failure modes when hooks fail during execution)

Write-only memory systems are the database equivalent of zero-observable-effect hooks: the write path works, data accumulates, but the read path fails silently or is never invoked. The agent's behavior is indistinguishable from an agent with no memory at all, while the memory infrastructure consumes storage, maintenance overhead, and, where LLMs are used for retrieval, compute cost.
Key finding: "Write-manage-read must function as an integrated system, not isolated components. A memory system where writes far outnumber reads is a write-only database — infrastructure theater rather than functional cognition."[35]
| System | Complexity | Benchmark Performance | Problem |
|---|---|---|---|
| Letta filesystem agents | Basic file storage | 74.0% on LoCoMo benchmark[22] | None — outperformed specialized systems |
| Mem0 specialized vector system | High (vector DB + retrieval) | 68.5% on LoCoMo benchmark[22] | Worse than plain filesystem; complex retrieval fails on multi-hop queries |
| Mem0 in full agent context | High | 49% recall accuracy[61] | 49% recall on full infrastructure |
| Letta OS-inspired system | Very high | 83.2% accuracy[61] | Burns tokens on every memory operation; economically impractical |
| Claude Code markdown approach | None (plain files) | Unquantified[61] | No vector DB, no embeddings, no search — stores information but no selective retrieval |
"The current memory tools are diaries. They record events. What I actually need is a notebook."[8]
Anti-pattern: A session log documenting "Changed connection pool from 10 to 20. CPU usage went up 15%. Timeout issue wasn't resolved" gets stored as disconnected facts rather than the learned lesson that connection pool increases correlate with CPU degradation without solving timeouts. The write path captures; the learning path that transforms observations into retrievable lessons doesn't exist.
Confirmation bias in extraction:[8] "I've seen it extract 'reducing timeouts improves performance' when the actual cause was a concurrent deployment." Single-model extraction produces wrong lessons that compound across future sessions if they persist unchallenged.
Cold start abandonment: "A new installation has no beliefs. The first 10–20 sessions don't benefit from accumulated knowledge." The system is write-only for its entire initial phase; by the time enough data accumulates to produce value, practitioners have often abandoned it.[8]
Double sediment from a single practitioner's setup:[63] a 258-file knowledge base that was written to but never retrieved, and the CLAUDE.md retrieval protocol mandating queries that never happened. Both layers are sediment simultaneously: the data store nobody queries AND the rule about querying it that produced no behavior change.
Knowledge graphs built as the agent's universal retrieval layer are a recurring anti-pattern: expensive to construct and maintain, and rarely queried in practice.[27][22]
Simpler alternatives that often outperform KG: Classic RAG, smarter chunking, better embeddings, and re-rankers — at far lower cost and maintenance burden.[27] The Letta/Mem0 benchmark confirms: basic file storage (74.0%) beat specialized vector retrieval (68.5%).[22]
One practitioner abandoned LLMs in memory retrieval after identifying four production problems:[41]
| Problem | Impact |
|---|---|
| Non-determinism | "Same inputs, different outputs" — debugging memory ranking impossible |
| Latency | 500–2000ms added to critical path per retrieval operation |
| Cost in planner/executor loops | Loops fire 50+ times; an LLM retrieval call on each firing multiplies costs substantially |
| Untestability | Cannot verify that specific queries return expected ranked results |
What was deleted: LLM query rewriting, re-ranking, and summarization from the retrieval path entirely. Replacement: Deterministic mathematical retrieval — vector similarity + graph traversal + token-based matching + fusion ranking. Philosophy: "Treat memory as infrastructure, not prompt engineering."[41]
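A sketch of the deterministic fusion step, assuming reciprocal rank fusion; the source names "fusion ranking" without specifying the formula:

```python
# Combine independently produced deterministic rankings (vector
# similarity, token matching, graph traversal) into one ordering.
# Same inputs always yield the same output, so results are testable.
from collections import defaultdict

def fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)  # reciprocal rank fusion
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["note_42", "note_7", "note_13"]   # illustrative ids
keyword_hits = ["note_7", "note_99", "note_42"]
print(fuse([vector_hits, keyword_hits]))  # note_7 and note_42 rise to the top
```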
Mark McArthey documented the canonical write-only failure:[48] After rejecting a particular solution approach on April 20, a different Claude instance proposed the identical rejected idea three days later. The prior reasoning existed in accessible session files, but the agent couldn't retrieve it — context window limitations had rendered the information inaccessible. "The agent is writing structured session data to disk already. The agent just isn't reading its own archive."
A Microsoft engineer documented the efficiency cost: repeatedly re-explaining code to the same agent consumed significant daily time — the cost of a write-only memory system measured in human hours.[48]
Strategic planner personas are the highest-complexity form of orchestration sediment: they are the hardest to build (requiring careful prompt engineering and coordination logic), the hardest to delete (because removing the planner requires redistributing its responsibilities), and the least likely to deliver on their promise (because LLM-based planning of LLM work introduces a compounding layer of probabilistic behavior).
Key finding: Anthropic's Claude Code Coordinator — their own orchestrator — deliberately removes capabilities: "The Coordinator cannot execute Bash. It cannot read files."[2] The deliberate deletion of capabilities from the orchestrator is the design pattern. An orchestrator that can do everything will fail in catastrophic ways.
The strategic planner that gets deleted is typically a grand coordinator agent designed to orchestrate everything: a monolithic coordinator with four documented failure modes.[47]
Anti-Pattern #1 from achan: "Overloading a single agent with hundreds of instructions it cannot reliably execute." One real deployment gave an agent "close to 500 lines of procedural instructions" expecting exact execution. Result: LLMs compress context and optimize for intent, not strict procedural execution.[9]
From hatchworks:[57] "mega-prompts" are explicitly called out: "one prompt tries to handle everything; nobody can maintain it." The prompt grows incrementally, each addition appearing reasonable, until maintenance becomes impossible. This is the signal that the thing exists but is effectively inert: a strategic planner that accumulates rules nobody reviews and produces behavior nobody can predict.
Anthropic's Applied AI team at AWS re:Invent 2025:[18]
| What They Moved Away From | What They Moved Toward |
|---|---|
| Pre-loading 32-page SOPs and documentation | Revealing information as agents request it (progressive disclosure) |
| Complex 50-prompt chained workflows | Agentic loops with tools — fewer edge cases to hardcode |
| Over-specification of procedures | Model autonomy and error recovery |
Anti-Pattern #3 from achan: "Relying on LLMs to remember past actions instead of explicit state management" produces "repeated steps, contradicting actions, collapsing context windows, hallucinated states."[9] Kamradt's benchmark showed state buried in long conversations becomes "effectively unreliable." Strategic planner personas that rely on conversation history to maintain planning state are structurally exposed: as the planning conversation grows, the planner's memory degrades.
From Google's production AI agent refactor:[64]
| Deleted Component | Why |
|---|---|
| Monolithic Python script (massive linear for-loop) | Could not scale; debugging required end-to-end trace through entire loop |
| Hardcoded list of 12 case studies embedded in Python file | Static data embedded in code — couldn't update without code deployment |
| Prompt-based schema enforcement (JSON structure described inside prompt string) | "Dirty code, fragile parsing, and wasted tokens" — tokens spent describing structure instead of doing work |
| Custom retry logic | Replaced by framework-level resilience patterns |
Prompt-string schema enforcement is the strategic-planner-in-miniature: verbose instructions describing JSON output format inside the prompt consume context on every call, "just in case" the model needs reminding. Classic speculative sediment.[64]
The deletion story points to where coordination genuinely adds value. Anthropic's own Building Effective Agents documentation:[16] orchestration that adds value has three properties: clear task decomposition with explicit ownership boundaries, structured topology (not "bag of agents"), and state that is explicit and externalized (not held in conversation history). Routing decisions that were previously encoded as strategic planning prompts become simple conditional logic — the planning abstraction was compensating for code that was not written.
See also: Multi-Agent Complexity Anti-Patterns (quantitative evidence on coordination failure)

Net deletion stories (complete before/after comparisons with measured outcomes) are the empirical foundation for identifying which orchestration components are sediment. The consistent finding across independent teams: deletion improves performance on every measurable dimension simultaneously. If deletion had trade-offs, the pattern would not be consistent.
Key finding: Vercel removed 80% of their agent's tools, replacing 15+ specialized tools with 2, and improved every metric: 3.5x faster execution, 20% higher success rate, 37% fewer tokens, 42% fewer steps.[28] "The model makes better choices when we stop making choices for it."
The original system had 15+ specialized tools built on the assumption that the AI needed hand-holding through complex schemas:[28][15]
Tools deleted: GetEntityJoins, LoadCatalog, RecallContext, LoadEntityDetails, SearchCatalog, ClarifyIntent, SearchSchema, GenerateAnalysisPlan, FinalizeQueryPlan, JoinPathFinder, SyntaxValidator, VisualizeData, ExplainResults (and others)
Tools retained: 2 — bash command execution and SQL execution
| Metric | Old System (15+ tools) | New System (2 tools) | Improvement |
|---|---|---|---|
| Average execution time | 274.8 seconds | 77.4 seconds | 3.5x faster |
| Success rate | 80% | 100% | +20 percentage points |
| Average token usage | ~102K | ~61K | 37% reduction |
| Average steps | ~12 | ~7 | 42% fewer steps |
| Worst case execution time | 724 seconds / 100 steps / 145,463 tokens (failing) | 141 seconds / 19 steps / 67,483 tokens (succeeding) | 5x faster, 53% fewer tokens, succeeds |
Why the specialized tools existed: "Solving problems the model could handle on its own." Engineers believed the AI would "get lost in complex schemas, make bad joins, or hallucinate table names." Every specialized tool was an assumption about AI limitations. Those assumptions were wrong. "We were doing the model's thinking for it."[28]
What replaced the tooling: A filesystem agent architecture where Claude accesses raw Cube DSL files (YAML, Markdown, JSON) directly and uses standard Unix utilities (grep, cat, ls, find). "File systems are an incredibly powerful abstraction. Grep is 50 years old and still does exactly what we need."
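A sketch of how small the retained surface is, with illustrative schemas; the real tool definitions are not published in [28]:

```python
# The entire tool surface after deletion: generic execution primitives
# instead of specialized assumptions about model limitations.
TOOLS = [
    {
        "name": "bash",
        "description": "Run a shell command (grep, cat, ls, find) over the semantic-layer files",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "run_sql",
        "description": "Execute a SQL query and return rows",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```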
The LangChain-to-raw-SDK migration metrics, consolidated:[3]

| Metric | LangChain | Raw SDK | Change |
|---|---|---|---|
| Code volume (customer support agent) | 1,200 lines | 630 lines | −47%[3] |
| Onboarding time | 10 engineer-days | 2 engineer-days | 5x faster[3] |
| Debugging time per incident | 90–240 minutes | 30–60 minutes | 3–4x faster[3] |
| Request latency per tool call | 10–30ms | 2–5ms | 5–6x lower[3] |
| Framework-attributed incidents | 5–15% of all LLM incidents | Near zero | 70–90% drop[3] |
| Monthly overhead cost | $8K–$30K per agent | Baseline | Eliminated[3] |
| Amortization period | — | 2–6 months | Break-even horizon[3] |
The CLAUDE.md pruning metrics in the same before/after format:[55]

| Metric | Before | After | Change |
|---|---|---|---|
| File size | 1,400+ lines | 420 lines | −70%[55] |
| Token consumption at startup | ~3,500 | ~1,000 | −71% |
| Update time per change | 15–20 minutes | 3–5 minutes | 4x faster |
| Output quality | Baseline | Improved | Positive (counter-intuitive) |
One practitioner's cleanup quantified in dollar terms:[13]
Measured waste: ~$3.75 per session in orchestration overhead; "potentially $200–400/month in recovered productivity" after cleanup.[13]
Another teardown replaced a sprawling monolithic state object with focused, schema-validated artifacts.[17]
The monolithic state object is the structural manifestation of orchestration sediment: an undifferentiated blob of context accumulated over time, where everything is kept because nothing can safely be discarded.
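A minimal sketch of the replacement shape, with invented fields; each artifact is small, typed, and validated at the boundary instead of accreting inside one blob:

```python
# A focused, schema-validated artifact in place of a monolithic state
# dict. Field names are illustrative, not from the cited teardown.
from pydantic import BaseModel

class RetrievalArtifact(BaseModel):
    query: str
    doc_ids: list[str]
    confidence: float

# Validation fails loudly at creation time; nothing undifferentiated
# accumulates because "nothing can safely be discarded."
artifact = RetrievalArtifact(query="refund policy", doc_ids=["d1", "d7"], confidence=0.82)
```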
The AutoGen-to-CrewAI comparison, same task (a table reservation workflow):[49]
| Approach | Development Time | Runtime Behavior | Token Cost |
|---|---|---|---|
| AutoGen (conversational multi-agent) | 3 weeks | Agents looped without converging | 5x baseline |
| CrewAI (deterministic task sequencing) | 1 week | Predictable execution | Baseline |
Claude Code's ToolSearch pattern achieved 85% token reduction for tool schemas by deferring MCP tool definitions and loading full schemas only on demand.[59] This is the most direct evidence that eager loading of tools and skills is sediment: switching from "load everything at startup" to "load on demand" improved context economics by 85% with no functionality loss.
Across all documented deletion stories, a consistent set of components survives:
| Deleted (Sediment) | Kept (Signal) |
|---|---|
| Framework abstraction layers (LangChain, AutoGen) | Direct SDK calls with explicit control flow |
| Specialized tools per operation (15+ tools) | Broad tools for fundamental operations (bash, SQL, filesystem) |
| Multi-agent debate / Tribunal patterns | Single agents with structured topology and explicit state |
| Strategic planner personas | Simple routing logic + specialized workers |
| Manually-invoked skills | Passive always-present context (AGENTS.md) or auto-activation hooks |
| CLAUDE.md rules for behaviors Claude already does | Rules that prevent specific, observed, reproducible mistakes |
| Behavioral hooks (text-based stop conditions) | Deterministic exit-code-2 blocks for actual mechanical enforcement |
| Knowledge graphs for all retrieval | Flat file systems / direct embeddings for proven retrieval patterns |
The convergence principle: The surviving components share one property — they cannot be replaced by model capability improvements. Flat file systems work because the access pattern is explicit. Deterministic enforcement works because exit code 2 is not probabilistic. Direct SDK calls work because they have no hidden assumptions. Everything else is subject to deletion as models improve.[28][36][16]
See also: Cost Optimization (token cost accounting), New Tooling 2026 (what survived the deletion wave)