
Orchestration Sediment Anti-Patterns

Pillar: sediment-patterns | Date: April 2026
Scope: Custom orchestration sediment: what heavy orchestration builders deleted and why. Specific patterns: AI strategic planner personas, multi-agent debate systems (Tribunal-style), tool-profile abstractions, knowledge graphs nobody queries, skills loaded into context but never invoked, commands mirroring native functionality, hooks firing on every event with no observable behavior change, write-only databases. Net deletion stories — what practitioners tore down, what they kept, and why.
Sources: 65 gathered, consolidated, synthesized.

Table of Contents

  1. The Sediment Accumulation Mechanism
  2. Framework Abstraction Sediment: LangChain / LangGraph Teardowns
  3. Multi-Agent Complexity Anti-Patterns
  4. Tool Proliferation and MCP / Tool-Profile Bloat
  5. Skills Loaded But Never Invoked
  6. System Prompt and CLAUDE.md Sediment
  7. Zero-Observable-Effect Hooks
  8. Write-Only Databases and Memory Systems
  9. AI Strategic Planner Personas and Meta-Orchestrators
  10. Net Deletion Stories: Metrics and Outcomes

Section 1: The Sediment Accumulation Mechanism

Orchestration sediment is the residue of solutions whose problems have been solved. The defining characteristic is asymmetric growth: additions are triggered by incidents and failures; removals have no trigger at all. Every bad output causes an instruction, guardrail, or wrapper to be added. Good output generates no feedback loop. The result is unbounded accumulation toward zero signal-to-noise ratio.[11]

Key finding: "If your skill files only ever grow — if you add new guidance whenever something goes wrong but never audit what's already there — context rot is accumulating by default."[11] The asymmetry is structural: the addition trigger is automatic; the removal trigger does not exist.

The Four Formation Pathways

| Pathway | Trigger | What Gets Added | Why It Persists | Source |
| --- | --- | --- | --- | --- |
| Failed-promise residue | Autonomous AI promised, autonomy failed anyway | Strategic planner personas, multi-agent debate, complex tool profiles | Built to chase a goal that never arrived; teardown requires deliberate effort | [19] |
| Platform absorption | Platform ships feature natively | Custom RAG pipelines, context management scripts, external artifact directories | Custom layer remains even after native feature covers identical ground | [1][62] |
| Speculative abstractions | "Just in case" / imagined future requirements | Dynamic variables, abstraction layers, error handlers for scenarios that never occur | Removal seems risky; nobody can confirm the scenario won't occur | [43][65] |
| Framework expiration | Underlying problem space stabilizes | Abstraction layers built to compensate for early-era model/API inconsistency | Framework predates convergence; nobody re-evaluates whether it still solves anything | [3] |

Platform Absorption: Three Deleted Categories

Between 2024 and 2026, three previously-custom orchestration categories became sediment when their underlying platform shipped native equivalents:[62]

  1. Custom RAG pipelines — absorbed by native document context in Claude Projects and ChatGPT Projects
  2. Web search orchestration — absorbed by native search in most LLM services
  3. Custom MCP orchestration layers — described as having "a meteoric rise and then fizzled out"

Similarly, developers who built manual context-clearing workflows, external artifact directories (thoughts/ directories), elaborate verbose prompts, and custom session segmentation found all of these absorbed by automatic compaction, native plan mode, and platform session management.[1]

The Karpathy Principle: Nothing Speculative

The Karpathy principle[65] provides the core deletion heuristic: speculative abstractions (for imagined futures), speculative features (for potential use cases), and speculative error handlers (for scenarios that have not occurred) are all sediment. They were added for imagined futures that never came; now they consume context without producing value. Deletion criteria: if the scenario the code handles has never occurred in production, the code is speculative sediment.[43]

See also: Cost Optimization (token costs of loaded sediment), New Tooling 2026 (which new tools survived this cycle)

Section 2: Framework Abstraction Sediment: LangChain / LangGraph Teardowns

The LangChain exit is the most thoroughly documented net-deletion story in AI orchestration. Multiple teams, across different application domains and without coordinating, arrived at the same conclusion independently. Each team built on LangChain when it solved a real problem (inconsistent vendor APIs, 2022–2023), then tore it down when that problem disappeared (vendor API convergence, 2025).

Key finding: "The abstraction LangChain provided in 2022 was solving a temporary problem. By 2025, the model providers solved most of it themselves."[3] Every framework has an expiration date tied to the stability of its underlying problem space.

Documented Deletion Cases

Octomind (raw_31.md)

Octomind removed LangChain from an AI agent system that had run in production for 12+ months, using multiple LLMs to automatically create and fix end-to-end tests in Playwright.[31] The deletion was driven by three specific capability gaps:

Abstraction stacking anti-pattern: The framework layers "abstractions on top of other abstractions," forcing developers to navigate "nested abstractions" and debug framework internals. Octomind's team spent "as much time understanding and debugging LangChain as it did building features."[31] What replaced it: direct LLM client libraries, carefully selected external packages, and simple, comprehensible low-level code.

Nathan Cole's Deletion (raw_44.md)

The turning point: "a LangChain agent called a DELETE endpoint I had not intended to expose."[44] The framework passed LLM tool selection directly to execution without an intervening validation layer. Replacement: 80 lines of Python. The root anti-pattern: LangChain treats LLM outputs as program output; the 2026 pattern treats them as untrusted user input requiring validation.
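
As a rough illustration of that pattern (not Cole's actual 80 lines), the sketch below treats the model's tool selection as untrusted input and routes it through an explicit allowlist before anything executes; all tool names are hypothetical:

```python
from typing import Any, Callable

# Only read-only operations are registered; a DELETE endpoint simply has no entry here.
ALLOWED_TOOLS: dict[str, Callable[..., Any]] = {
    "get_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "list_orders": lambda customer_id: [],
}

def execute_tool_call(tool_name: str, arguments: dict[str, Any]) -> Any:
    """Validation gate: reject anything the model asked for that was not explicitly exposed."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"Model requested unregistered tool: {tool_name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("Tool arguments must be a JSON object")
    return ALLOWED_TOOLS[tool_name](**arguments)

# A hallucinated "delete_order" call fails at the gate instead of reaching a live endpoint.
```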

Ravoid Case Studies (raw_3.md)

Production teams migrating to raw SDKs reported consistent, measurable gains:[3]

| Metric | LangChain (Before) | Raw SDK (After) | Improvement |
| --- | --- | --- | --- |
| Customer support agent code volume | 1,200 lines | 630 lines | 47% reduction |
| Debugging time per incident | 90–240 minutes | 30–60 minutes | 3–4x faster |
| Developer onboarding | 1–2 weeks (10 engineer-days) | 1–2 days (2 engineer-days) | 5x faster |
| Request latency per tool call | 10–30ms | 2–5ms | 5–6x lower |
| Framework-attributed incidents | 5–15% of all LLM incidents | Near zero | 70–90% drop |
| Annual version migration cost | 1–3 engineer-weeks | 0 | Eliminated |
| Total monthly overhead | $8K–$30K per agent | Baseline | Eliminated |

Post-migration outcomes: a 70–90% drop in framework incidents and a 40–60% code reduction, with a 2–6 month amortization period.[3] Observable sediment indicator: stack traces spanning "15 to 40 frames of internal framework code," burying the debugging signal under abstraction layers that add no signal of their own.

AutoGen Deletion (raw_49.md)

An insurance company built a multi-agent conversation system for a training simulator using AutoGen, then deleted the entire system; the before/after comparison appears in Section 10.[49]

Hacker News Developer Consensus

Community feedback on LangChain converged on the same structural complaint about over-abstraction.[4]

LangChain CEO Harrison Chase acknowledged the structural cause: "The initial version of LangChain was pretty high level and absolutely abstracted away too much."[4]

What Gets Deleted vs. What Gets Kept

| Deleted Component | Why Deleted | What Replaced It |
| --- | --- | --- |
| AgentExecutor | Hid execution flow; obscured validation gap | Explicit loop with validation gate |
| ConversationBufferMemory | Unnecessary once vendor APIs converged | Direct conversation list management |
| Custom output parsers | JSON mode / structured output now native | Pydantic / Instructor |
| Tool class hierarchies | Added abstraction without hiding complexity | Plain Python functions with typed signatures |
| Chain composition primitives | Python function composition already does this | Regular Python functions |

Preferred replacement stack (2026): Vanilla Python + direct SDK + LlamaIndex for retrieval + Pydantic/Instructor for structured output.[56]
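
As a minimal sketch of the "custom output parsers to Pydantic" replacement in the table above (assuming Pydantic v2; the schema and raw response are invented for illustration):

```python
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    category: str
    priority: int
    needs_human: bool

# In a real agent this string would come from the LLM response.
raw_response = '{"category": "billing", "priority": 2, "needs_human": false}'

try:
    triage = TicketTriage.model_validate_json(raw_response)
except ValidationError as exc:
    # Malformed output fails loudly here, not deep inside a chain abstraction.
    raise RuntimeError(f"Model returned an invalid structure: {exc}") from exc

print(triage.priority)  # 2
```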

See also: Failure Containment (active failure modes when abstraction layers fail)

Section 3: Multi-Agent Complexity Anti-Patterns

Multi-agent systems are the single largest source of orchestration sediment because they are the hardest to delete: each agent appears to have a purpose, and decomposing the system requires understanding how agents coordinate. The empirical evidence shows the coordination overhead routinely exceeds the capability benefit.

Key finding: A Google DeepMind study found unstructured multi-agent networks amplify errors "up to 17.2 times compared to single-agent baselines."[14] A system with 95% per-step success drops to 59.9% reliability across 10 sequential steps and 35.8% across 20 steps.
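
Those sequential-reliability figures follow directly from compounding per-step success, as a two-line check shows:

```python
# End-to-end reliability of a pipeline with independent steps is p ** n.
per_step = 0.95
for steps in (10, 20):
    print(steps, round(per_step ** steps, 3))  # 10 -> 0.599, 20 -> 0.358
```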

Quantitative Evidence on Coordination Failure

| Study / Source | Finding | Metric |
| --- | --- | --- |
| Google DeepMind (via raw_14.md) | Error amplification in unstructured multi-agent networks | 17.2x vs. single-agent baseline[14] |
| MAST study (via raw_14.md) | Failure rate across 7 frameworks | 41%–86.7% failure; 36.9% from coordination breakdowns[14] |
| Coordination saturation threshold (raw_40.md) | Gains plateau beyond 4 agents | 10+ agent systems typically fail; overhead exceeds gains[14] |
| Capability saturation (raw_5.md) | When single agent >45% accuracy, adding agents provides negative returns | 35% performance drop in Minecraft planning tests from coordination overhead[5] |
| Budget utilization (raw_24.md) | Agents given 100-tool budgets used only 14.24 searches on average | Doubling budget improved accuracy by 0.2 percentage points[5] |
| Token multiplication (via raw_40.md) | Multi-agent conversation systems | 3.5x–5x baseline token consumption[14][49] |

The Coordination Tax Formula

The net performance calculation for multi-agent systems:[5]

Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)

Three failure mechanisms materialize when the subtraction side exceeds the addition side:

  1. Coordination Tax: Agents spend cognitive resources on tool management rather than problem-solving — measured as 35% performance drop in controlled tests[5]
  2. Capability Saturation: When a single agent can already solve a problem with >45% accuracy, adding more agents delivers diminishing or negative returns[5]
  3. Error Amplification: The 17.2x factor — errors propagate and compound across agent handoffs rather than self-correcting[14]

Tribunal / Debate System Anti-Patterns

The "Tribunal" pattern — advocate agent + attacker agent + radical agent — is structurally identical to the "Bag of Agents" anti-pattern that produces the 17.2x error amplification: agents interact without enforced contracts or structured topology.[14]

Specific exotic topologies documented as sediment-generating:[9]

| Pattern | Why It Gets Built | Why It Gets Deleted |
| --- | --- | --- |
| Swarm / handoff topology | "Autonomous agent coordination" promise | Amplifies invisible state problems instead of fixing them; "distributes" state rather than managing it[9] |
| LLM-as-Judge | Quality assurance without human review | Adds biases, doubles latency and cost; "Agents critiquing each other: nearly impossible to reproduce or audit"[9] |
| CodeAct (agents emitting Python) | Flexible execution | Dramatically increases hallucination blast radius[9] |
| CUGA (Computer Use Generalist Agent) | UI automation without custom tooling | Reliability and security challenges at production scale[9] |
| Multi-planner debate | Consensus through argument | Circular disagreements, token multiplication, unpredictable convergence[9] |

Key finding on debate systems: "95% of investments in generative AI have produced zero measurable returns" (MIT Media Lab 2025, cited in raw_9.md) due to architectural patterns lacking production foundations — multi-agent debate systems are the leading culprit.[9]

A Payment Pipeline Deletion Story

From agentpatterns.tech, a documented payment processing pipeline teardown:[39]

| Agent | Deletion Reason |
| --- | --- |
| Planner agent | Strategic planning for a deterministic process; planning time exceeded execution time |
| Router agent | Routing logic was 3 conditional statements; LLM routing added latency and unpredictability |
| Retrieval agent | Replaced by direct database lookup |
| Responder agent | Merged into single output step |
| Policy agent | Policies were static rules; encoded as config, not LLM calls |
| Critic agent | Audit trail requirement met by deterministic logging; LLM critic added false-positive noise |

Key metric: "A typical user request triggers 4+ agent handoffs where 1–2 would be enough."[39] Decision rule for future additions: "Add a new agent only when there is a clear role, measurable value, and explicit ownership boundary."

What Production Systems Actually Look Like

Organizations that successfully deployed multi-agent systems in 2025–2026 deliberately constrained scope.[5]

Anthropic's Position: Complexity Without Demonstrated Value

Anthropic's Building Effective Agents documentation (raw_16.md): "Finding the simplest solution possible, and only increasing complexity when needed is essential... Many teams mistakenly add complexity without demonstrating it improves outcomes."[16]

Anthropic's Applied AI team at AWS re:Invent 2025: "What didn't work: Breaking tasks into concurrent sub-tasks for parallel execution didn't succeed practically." They moved away from "complex 50-prompt chained workflows" toward "agentic loops with tools — fewer edge cases to hardcode."[18]

Claude Code best practices guide: "Despite multi-agent systems being all the rage, Claude Code has just one main thread… I highly doubt your app needs a multi-agent system."[53]

Epsilla analysis of Anthropic's harness philosophy:[36] BrowseComp benchmark showed that giving models code-writing capabilities for self-filtering improved accuracy from 45.3% to 61.6% — removing the orchestration harness improved outcomes. "Models demonstrating near-human coding abilities no longer need hand-holding through restrictive harnesses."

See also: Failure Containment (active failure modes during multi-agent execution)

Section 4: Tool Proliferation and MCP / Tool-Profile Bloat

Tool proliferation is structural sediment: each tool is added for a reason, but tools are never removed when that reason disappears. The result is ever-expanding schema surface area that consumes context budget before any work begins.

Key finding: MCP costs 4x to 32x more tokens than CLI for identical operations. Simple task comparison: 1,365 tokens via CLI vs. 44,026 tokens via MCP. The difference is eager schema injection vs. progressive disclosure via --help.[50]

The Tool Soup Problem: Context Cost Quantified

| Source | Finding | Metric |
| --- | --- | --- |
| raw_10.md (achan Anti-Pattern #7) | 100+ tools without curation: tool definitions consume context before work begins | ~72K tokens upfront — 60% of 128K context window[10] |
| raw_10.md | Cost explosion from tool definition overhead | 4-turn conversation burns 288K tokens on definitions alone; up to 400% cost increase[10] |
| raw_50.md | Simple task: CLI vs. MCP token comparison | 1,365 tokens (CLI) vs. 44,026 tokens (MCP) = 32x overhead[50] |
| raw_30.md (GitHub #29971) | MCP skill injection per tool call | ~25K tokens wasted per call; 50 calls = 1.25M tokens on unused descriptions[30] |
| raw_59.md (GitHub #44536) | ToolSearch deferred loading vs. eager loading | 85% token reduction for tool schemas; no functionality loss[59] |

Eager Loading as Structural Sediment

Claude Code loads all context at startup regardless of need. The startup context budget allocation, documented via the feature request for lazy loading (GitHub #44536):[59]

| Component | Tokens at Startup | % of 200K Context Window |
| --- | --- | --- |
| MCP tool definitions (7 servers) | 67,300 | 33.7% |
| Skills (20+ plugins) | 30,000–40,000 | 15–20% |
| Tool definition behavioral instructions | ~15,000 | ~7.5% |
| Rules (re-injected per tool call) | 6,200+ per turn | 3%+ per turn; ~46% cumulative over session lifetime |
| Total pre-work consumption | ~118,000+ | 65–75% before user types anything |

The baseline eager-loading behavior is structurally sediment — loading everything whether or not it's relevant.[59] ToolSearch already demonstrated the fix: defer MCP tool definitions, load full schemas only on demand, achieve 85% token reduction with zero functionality loss.
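
A rough sketch of the deferred-loading idea, with hypothetical registry contents and helpers rather than Claude Code internals: keep one-line tool summaries in context and fetch the verbose schema only when a tool family is actually selected.

```python
TOOL_SUMMARIES = {
    "jira": "Create, update, and search Jira issues",
    "github": "Read repositories, pull requests, and CI status",
}

_FULL_SCHEMAS: dict[str, dict] = {}  # populated lazily, one tool family at a time

def fetch_schema_from_server(tool_family: str) -> dict:
    # Stand-in for an MCP round trip; a real implementation would query the server.
    return {"family": tool_family, "tools": []}

def load_schema(tool_family: str) -> dict:
    """Load the verbose schema on demand instead of injecting it at startup."""
    if tool_family not in _FULL_SCHEMAS:
        _FULL_SCHEMAS[tool_family] = fetch_schema_from_server(tool_family)
    return _FULL_SCHEMAS[tool_family]

def startup_context() -> str:
    # Only these short summaries are paid for on every turn.
    return "\n".join(f"{name}: {desc}" for name, desc in TOOL_SUMMARIES.items())
```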

The API-Mirroring Anti-Pattern

One-tool-per-endpoint designs — jira_create_issue, jira_update_issue, jira_add_comment — mirror the REST API exactly, creating maximal schema surface area.[50][37] Specific call-out from raw_37.md: "Dozens of tools mirroring REST API (read_thing_a(), read_thing_b()) create context bloat and rigid abstractions."

Replacement pattern: Task-level tools that combine related operations into fewer, broader tools; dynamic tool loading via two-step routing (cheap planner selects tool families, specific tools loaded on demand).[37]
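
A hedged sketch of what a task-level replacement might look like for the Jira example above; the schema fields and dispatcher are illustrative, not a published MCP server design:

```python
# One broad, task-level tool definition replaces three endpoint mirrors in the context window.
JIRA_TASK_TOOL = {
    "name": "manage_jira_issue",
    "description": "Create or update a Jira issue, optionally adding a comment.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "comment"]},
            "issue_key": {"type": "string"},
            "summary": {"type": "string"},
            "comment": {"type": "string"},
        },
        "required": ["action"],
    },
}

def handle_jira_call(args: dict) -> str:
    # The branching lives in code, not in three separate schemas the model must choose between.
    action = args["action"]
    if action == "create":
        return f"created issue: {args.get('summary', '')}"
    if action == "update":
        return f"updated {args.get('issue_key')}"
    return f"commented on {args.get('issue_key')}"
```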

Per-Turn MCP Overhead: The Uncached Mechanism

From Claude Code system prompt analysis:[20] "MCP Server Instructions" are "recomputed every turn (not cached)." Any MCP server with verbose instructions contributes uncached token overhead on every turn — this is the mechanism by which tool-profile abstractions accumulate as sediment. Each registered MCP server contributes per-turn overhead whether or not its tools are invoked that session.

The March 2026 Caching Incident

Two bugs in Anthropic's prompt caching (GitHub #40524) caused 10–20x token inflation with no warning. Users had to reverse-engineer the Claude Code binary to find the cause.[30] This incident reveals the diagnostic problem with tool-profile sediment: overhead accumulates silently, with no observable signal until someone audits the billing.

Vercel's Tool Reduction: The Canonical Case

Vercel removed 80% of their agent's tools — 15+ specialized tools collapsed to 2 — with measurable performance improvement across every metric.[28] Full metrics in Section 10 (Net Deletion Stories). The lesson: "The best agents might be the ones with the fewest tools. Every tool is a choice you're making for the model."[28]

See also: Cost Optimization (token cost accounting for tool definitions)

Section 5: Skills Loaded But Never Invoked

The skill activation problem is the purest form of orchestration sediment: infrastructure that is present in context, consuming tokens on every turn, and producing outcomes identical to having no infrastructure at all. The empirical evidence shows the activation failure rate is the majority case, not the exception.

Key finding: A Vercel evaluation study found that in 56% of cases, the agent never invoked the skill it needed, despite the skill mechanism functioning correctly and documentation existing.[34] Skills delivered zero improvement over having no documentation at all.

The Invocation Failure Benchmark

| Condition | Pass Rate | Interpretation |
| --- | --- | --- |
| No documentation at all | 53% | Baseline |
| Skills with default behavior | 53% | Identical to no documentation — zero marginal value[34] |
| Skills with explicit instructions | 79% | 26-point improvement when activation is designed-in |
| AGENTS.md (passive always-present context) | 100% | Eliminating the invocation decision eliminates invocation failures[34] |

Three activation failure mechanisms prevent skills from being invoked:[45]

  1. Recognition gap: Agent must recognize it needs information before requesting it — fails when the agent doesn't know what it doesn't know
  2. Sequencing failures: Agents are unsure whether to explore first or read documentation first; they often choose exploration and never return to the documentation
  3. Access variability: Skills only available when agent decides to retrieve them — the decision itself is the failure point

Production Audit: The Scale of Non-Invocation

A real audit of a 192-file skill setup documented non-invocation at scale, along with several silent failure modes.[12]

Token Cost of Skill Accumulation at Scale

From a production Hermes-agent setup managing 146+ skills:[58]

| Skill Count | Tokens/Turn (descriptions only) | Annual Token Cost (50 turns/day) |
| --- | --- | --- |
| 146 (current production) | ~4,400 | ~80M tokens/year |
| 500 (hypothetical scaling) | ~15,000 | ~274M tokens/year |

Zero visibility into which skills are dead weight vs. active: "There's no way to know: How many times a skill has been loaded via skill_view, When a skill was last used, Which skills are 'dead weight' (never used since installation)."[58] Without invocation metrics, skills accumulate with no feedback signal about whether they are being used.
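
The missing feedback signal is cheap to approximate. A minimal sketch, assuming a simple JSONL log written wherever skills are loaded (paths and call sites are hypothetical):

```python
import json, time
from pathlib import Path

LOG = Path("skill_usage.jsonl")

def record_skill_load(skill_name: str) -> None:
    # Call this wherever the agent actually loads a skill file.
    with LOG.open("a") as f:
        f.write(json.dumps({"skill": skill_name, "ts": time.time()}) + "\n")

def dead_weight(all_skills: list[str], days: int = 30) -> list[str]:
    """Return skills with no recorded load in the last `days` days: deletion candidates."""
    cutoff = time.time() - days * 86400
    recently_used = set()
    if LOG.exists():
        for line in LOG.read_text().splitlines():
            entry = json.loads(line)
            if entry["ts"] >= cutoff:
                recently_used.add(entry["skill"])
    return [s for s in all_skills if s not in recently_used]
```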

Manual Skills Fail 90% of the Time

From Claude Code best practices:[53] "Manual Skills Are Ignored: Skills without auto-activation hooks fail approximately 90% of the time, rendering them useless regardless of quality." An entire category of manually-invoked skills is effectively sediment by default — present in context, consuming tokens, failing to activate.

Context Rot: The Accumulation Mechanism

Context rot occurs when skill files grow large enough to flood the context window.[11] Three performance failure modes:

  1. Instruction conflicts: Contradictions accumulate as skill files grow — agent resolves conflicts unpredictably, appearing as inconsistent behavior
  2. Token budget compression: "If your skill files consume 20,000 tokens out of a 200,000-token context, that's 10% of total capacity gone before a single line of code is read"[11]
  3. Relevance mismatch: "A file that began as documentation for three API methods eventually becomes a catch-all for every decision, preference, and edge case"[11]

The Fix: Passive Context over Active Skills

The solution to 56% non-invocation is eliminating the invocation decision entirely. AGENTS.md embeds a compressed documentation index directly into the system prompt rather than requiring dynamic retrieval. Raw documentation (40KB) was compressed 80% to 8KB using pipe-delimited structures, making "passive context" (always present) feasible without excessive token cost.[34]
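
The source does not publish the exact index format, but a pipe-delimited compression pass might look like the following sketch, with invented documentation entries:

```python
# Each API method becomes one terse, always-present index line instead of a retrievable skill.
docs = [
    {"method": "createDeployment", "args": "projectId, gitRef", "returns": "deploymentId",
     "notes": "fails if project has no linked repo"},
    {"method": "getLogs", "args": "deploymentId, since", "returns": "log lines",
     "notes": "paginated, max 1000 lines per call"},
]

index_lines = [f"{d['method']}|{d['args']}|{d['returns']}|{d['notes']}" for d in docs]
print("\n".join(index_lines))
# createDeployment|projectId, gitRef|deploymentId|fails if project has no linked repo
# getLogs|deploymentId, since|log lines|paginated, max 1000 lines per call
```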

Design principle: "Reduce optionality at points where optionality introduces failure." Skills requiring an invocation decision introduce optionality at a failure point. Static context eliminates the option to miss.[34]

Deletion criteria for skills: "If the right behavior is what a competent developer would do anyway, you don't need to document it."[11] Targets: instructions that contradict each other, guidance added for a specific situation but never scoped to it, documentation for tools or patterns the project no longer uses.


Section 6: System Prompt and CLAUDE.md Sediment

CLAUDE.md sediment accumulates through the same asymmetric mechanism: rules are added when something breaks, never pruned when the problem disappears or Claude already handles it natively. The result is a file that grows until it exceeds the reliable attention limit, at which point additional rules actively degrade behavior by crowding out the rules that matter.

Key finding: CLAUDE.md pruned from 1,400 lines to 420 lines (70% reduction) produced simultaneous improvement in Claude's output quality. The reduction itself improved performance — fewer rules means more attention per rule.[55]

The Attention Limit: Hard Ceiling on Rule Effectiveness

Frontier LLMs reliably follow roughly 150–200 instructions.[6][23] Claude Code's system prompt already uses approximately 50 of those slots, leaving roughly 100–150 for everything else. The mathematical reality: a 200-line CLAUDE.md has already exceeded the reliable attention window. Research by Jaroslawicz et al. (2025) demonstrated that instruction compliance decreases linearly as instruction count increases.[63]

| CLAUDE.md Size | Observable Effect | Source |
| --- | --- | --- |
| Under 60 lines | Industry standard maintained by HumanLayer | [6] |
| ~100 lines (~2,500 tokens) | Boris Cherny's production CLAUDE.md at Anthropic itself | [51] |
| 200 lines | "The bottom 200 lines were effectively invisible" (300-line file) | [6] |
| 420 lines (post-pruning) | Improved output quality vs. 1,400-line version | [55] |
| 1,400+ lines | Pre-pruning state: 71% token waste, 4x slower updates | [55] |
| 2,000 lines | "Half your context budget is gone before any work begins" — "dumb zone" | [6] |

Measured Deletion Impact (TonyRobbins.com Project)

| Metric | Before (1,400+ lines) | After (420 lines) | Change |
| --- | --- | --- | --- |
| File size | 1,400+ lines | 420 lines | 70% reduction[55] |
| Upfront tokens consumed | ~3,500 | ~1,000 | 71% reduction |
| Update time per change | 15–20 minutes | 3–5 minutes | 4x faster |
| Pattern duplication | 3–4 instances of same rule | Single source of truth | Eliminated |
| Output quality | Baseline | Improved | Fewer rules, more attention per rule |

200 Rules That Produced Zero Behavior Change

One practitioner documented 200 lines of CLAUDE.md instructions that were systematically ignored:[63]

| Rule | Intended Behavior | Actual Outcome |
| --- | --- | --- |
| "Search Before Speaking — iron rule" | Search before proposing solutions | Proposed solutions without searching |
| "ATOMIC SAVE PROTOCOL" | Actually write to disk before claiming save | Claimed to save without writing to disk |
| "KNOWLEDGE RETRIEVAL PROTOCOL" | Query 258 knowledge base files | Files were written but never retrieved |
| Banned phrases preventing false claims | Stop making false claims | Same false claims continued |
| Verification protocols | Auto-verify before claiming completion | Only worked when user manually prompted |

Resolution: Cut from 200 rules to ~20; replace the rest with code-based enforcement (hooks). "Rules in prompts are requests. Hooks in code are laws." The 180 deleted rules were pure sediment — loaded into context every session, zero observable behavior change.[63]

Official Anthropic Exclusion List

Official Anthropic documentation explicitly lists what not to include in CLAUDE.md.[51]

The advisory vs. deterministic distinction: "Unlike CLAUDE.md instructions which are advisory, hooks are deterministic and guarantee the action happens."[51] CLAUDE.md-based behavioral instructions produce zero observable effect when they conflict with trained behavior — they are advisory, not enforceable.

Specific Anti-Pattern Categories

| Anti-Pattern | Why It's Sediment | Replacement |
| --- | --- | --- |
| Prose style rules (indentation, quote style, line length) | Tools enforce these; prose duplicates them | ESLint, Prettier, TypeScript configs[6] |
| Historical context and war stories | Ticket numbers, migration stories, architecture snapshots from prior system states | Git commit messages and PR descriptions[55] |
| Rules duplicated from config files | Already enforced by ESLint/TypeScript/Prettier | @imports to reference config files directly[6] |
| Vague prohibitions ("Write clean code") | No executable signal; no observable behavior difference | Precise rules: "Use camelCase for variables, PascalCase for React components"[37] |
| Long custom slash command lists | "The entire point is natural language" — commands duplicate natural language capability | Natural language, or remove the commands[37] |

The Core Pruning Test

From HN discussion: "If removing this line wouldn't cause Claude to make a different mistake, delete it."[7] The real cost of bloat is not token count per se — it is diluted signal. Instructions that would change behavior get buried under instructions that describe what Claude already does by default.

Contradictory finding on markdown headers: One HN commenter pushed back on complete removal: "LLMs do understand header levels in markdown. So removing these is detrimental." Consensus: compression matters most for resource-constrained models (Haiku) with large codebases; on Opus with simple projects, readability typically outweighs token savings.[7]


Section 7: Zero-Observable-Effect Hooks

Hooks that fire on every event without producing an observable behavior change are the most insidious form of orchestration sediment: they appear active (logs show they ran), they appear configured (settings show they're registered), and they produce no value. Detection requires distinguishing "hook ran" from "hook changed behavior" — a distinction that is rarely made in practice.

Key finding: The zero-observable-effect hook pattern has a precise five-step sequence: (1) hook is configured and runs, (2) output is delivered to the model, (3) model acknowledges the hook content, (4) model proceeds with the original action anyway, (5) net effect: hook ran, nothing changed.[46]

The Stop Hook Failure Pattern

Hacker News thread on Claude 4.7 ignoring stop hooks (raw_46.md):[46] Users report that stop hooks designed to prevent session termination before tests have run are routinely ignored. Four failure mechanisms were identified:

| Failure Mechanism | Root Cause | Symptom |
| --- | --- | --- |
| Training resistance to hook content | Claude trained to resist instructions embedded in tool results to prevent prompt injection | "You must do XYZ now" in hook output is exactly what the model is trained to ignore[46] |
| Schema changes break implementations | Claude changed "the schema for the hook reply" without notice | "Opus is caring f*** all about the response from the hook"[46] |
| Wrong output channel | Hooks output JSON to stdout; Claude Code ignores stdout | Exit code 2 with plain text on stderr is the only reliable mechanism[54] |
| Model behavior drift | Hooks that worked on one model version silently break with updates | Hook runs, logs show execution, behavior unchanged[46] |

Documented GitHub Bug Reports: Hooks That Look Active But Aren't

| Issue | Pattern | Effect on User |
| --- | --- | --- |
| #34859 | Hook error messages shown on every tool call even when hooks exit 0 | Creates false "something is happening" signal when nothing changes[21] |
| #29767 | Stop hooks registered in hooks.json never fire even though SessionStart works | Appears registered and "working"; silently never fires[21] |
| #10463 | "Stop hook error" displayed after every response; all hooks produce exactly 0 bytes output | Noise that obscures whether hooks are actually running[21] |
| #2891 | Debug logs show literal template variables; exit codes not respected; operations proceed despite PreToolUse hooks | Complete separation between hook execution and behavioral effect[21] |

Both failure modes — "configured but never firing" and "firing but producing no behavioral change" — create identical user experience: complexity in the configuration, zero observable value, no way to distinguish functional from broken hooks without dedicated investigation.

Specific Hook Sediment Cases

The Ornamental Hook (raw_13.md)

A practitioner audit found hooks including one that "printed a cute message on session start" — the canonical ornamental hook.[13] Two more were "semi-broken and I'd been ignoring the errors for weeks." This is why hooks accumulate: deletion requires investigation (broken or working silently?), so practitioners accept broken/noisy hooks rather than investigating.

Auto-Formatting Hook Overhead (raw_53.md)

Auto-formatting hooks consumed 160K tokens across three rounds — not worth the marginal convenience.[53] The hook fired, performed formatting, consumed tokens, but the marginal value of automatic vs. manual formatting was negligible. Observable activity; no observable behavior improvement.

The Retry Loop at Scale (raw_60.md)

From the Claude Code source analysis: "over 1,200 sessions had experienced 50 or more consecutive compaction failures before a simple three-line fix capped retries at three attempts, saving roughly 250,000 wasted API calls per day globally."[60] The unbounded retry loop kept firing, consuming API calls, producing no observable value change. Fix: 3 lines. This is the prototype of "hooks firing on every event with no observable behavior change" at scale: 250K wasted API calls per day, zero observable output change.
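
The published details stop at "capped retries at three attempts," but the shape of such a fix is simple, as in this hypothetical sketch:

```python
MAX_COMPACTION_RETRIES = 3

def compact_with_retries(compact, max_retries: int = MAX_COMPACTION_RETRIES):
    """Bound the retry loop so a persistent failure costs three attempts, not fifty-plus."""
    last_error = None
    for _ in range(max_retries):
        try:
            return compact()
        except Exception as exc:  # bounded: surface the error after the final attempt
            last_error = exc
    raise RuntimeError(f"Compaction failed after {max_retries} attempts") from last_error
```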

memoryAge.ts Zero-Effect Pattern (raw_61.md)

The memoryAge.ts module appends age warnings to memories but "doesn't reduce the memory's weight" or trigger verification.[61] It fires, adds a note, but observable behavior (what the agent does with old memories) is unchanged. Classic sediment: code runs on every memory operation, produces output, changes nothing.

The Fundamental Category Error

The deletion conclusion from practitioners who tried stop hooks:[46] "I never got stop hooks to work and gave up on them." Replaced with pre-commit git checks (deterministic enforcement) or removed behavioral hooks entirely, relying on CLAUDE.md instructions.

Fundamental category error: Expecting text-based instructions to enforce hard stops in probabilistic systems. Exit code 2 (block) works reliably. Natural language behavioral instructions via hooks do not.[46][51] The deletion story: elaborate stop-hook systems get torn down because they create complexity without observable effect.
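
A minimal sketch of the deterministic mechanism that survives, written as a PreToolUse-style hook script; the stdin field names follow Claude Code's documented hook payload but should be verified against the current version:

```python
#!/usr/bin/env python3
import json
import sys

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

if payload.get("tool_name") == "Bash" and "rm -rf" in command:
    # Exit code 2 blocks the tool call deterministically; stderr text is fed back to the model.
    print("Blocked: destructive command not allowed.", file=sys.stderr)
    sys.exit(2)

sys.exit(0)  # anything else proceeds normally
```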

See also: Failure Containment (active failure modes when hooks fail during execution)

Section 8: Write-Only Databases and Memory Systems

Write-only memory systems are the database equivalent of zero-observable-effect hooks: the write path works, data accumulates, but the read path fails silently or is never invoked. The agent's behavior is indistinguishable from an agent with no memory at all — while the memory infrastructure consumes storage, maintenance overhead, and, where LLMs are used for retrieval, compute cost.

Key finding: "Write-manage-read must function as an integrated system, not isolated components. A memory system where writes far outnumber reads is a write-only database — infrastructure theater rather than functional cognition."[35]

Memory System Benchmark Failures

| System | Complexity | Benchmark Performance | Problem |
| --- | --- | --- | --- |
| Letta filesystem agents | Basic file storage | 74.0% on LoCoMo benchmark[22] | None — outperformed specialized systems |
| Mem0 specialized vector system | High (vector DB + retrieval) | 68.5% on LoCoMo benchmark[22] | Worse than plain filesystem; complex retrieval fails on multi-hop queries |
| Mem0 in full agent context | High | 49% recall accuracy[61] | 49% recall on full infrastructure |
| Letta OS-inspired system | Very high | 83.2% accuracy[61] | Burns tokens on every memory operation; economically impractical |
| Claude Code markdown approach | None (plain files) | Unquantified[61] | No vector DB, no embeddings, no search — stores information but no selective retrieval |

The Diary vs. Notebook Failure

"The current memory tools are diaries. They record events. What I actually need is a notebook."[8]

Anti-pattern: A session log documenting "Changed connection pool from 10 to 20. CPU usage went up 15%. Timeout issue wasn't resolved" gets stored as disconnected facts rather than the learned lesson that connection pool increases correlate with CPU degradation without solving timeouts. The write path captures; the learning path that transforms observations into retrievable lessons doesn't exist.

Confirmation bias in extraction:[8] "I've seen it extract 'reducing timeouts improves performance' when the actual cause was a concurrent deployment." Single-model extraction produces wrong lessons that compound across future sessions if they persist unchallenged.

Cold start abandonment: "A new installation has no beliefs. The first 10–20 sessions don't benefit from accumulated knowledge." The system is write-only for its entire initial phase; by the time enough data accumulates to produce value, practitioners have often abandoned it.[8]

The 258 Knowledge Base Files Nobody Retrieved

Double sediment from a single practitioner's setup:[63]

  1. 258 knowledge base files were written to but never retrieved — write-only database
  2. "KNOWLEDGE RETRIEVAL PROTOCOL" — a 200-line CLAUDE.md rule about how to query the knowledge base — had zero observable effect

Both layers are sediment simultaneously: the data store nobody queries AND the rule about querying it that produced no behavior change.

Knowledge Graphs: Write-Only Infrastructure Pattern

Knowledge graphs in AI agents follow the same write-only infrastructure pattern: expensive to build and maintain, and rarely queried once populated.[27][22]

Simpler alternatives that often outperform KG: Classic RAG, smarter chunking, better embeddings, and re-rankers — at far lower cost and maintenance burden.[27] The Letta/Mem0 benchmark confirms: basic file storage (74.0%) beat specialized vector retrieval (68.5%).[22]

LLMs in Memory Retrieval Path: Deletion Story

One practitioner abandoned LLMs in memory retrieval after identifying four production problems:[41]

| Problem | Impact |
| --- | --- |
| Non-determinism | "Same inputs, different outputs" — debugging memory ranking impossible |
| Latency | 500–2000ms added to critical path per retrieval operation |
| Cost in planner/executor loops | 50+ firing loops; LLM retrieval calls multiply costs substantially |
| Untestability | Cannot verify that specific queries return expected ranked results |

What was deleted: LLM query rewriting, re-ranking, and summarization from the retrieval path entirely. Replacement: Deterministic mathematical retrieval — vector similarity + graph traversal + token-based matching + fusion ranking. Philosophy: "Treat memory as infrastructure, not prompt engineering."[41]
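
One common deterministic fusion technique is reciprocal rank fusion; the sketch below (with invented memory IDs) shows how two rankings combine without any LLM in the path, which is what makes retrieval reproducible and testable:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several rankings by summing 1 / (k + rank) for each document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["mem_42", "mem_07", "mem_19"]   # e.g. from cosine similarity
keyword_ranking = ["mem_07", "mem_19", "mem_42"]  # e.g. from token-based matching

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
# Same inputs always produce the same order, so the retrieval layer can be unit tested.
```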

The Specific Write-Only Incident Pattern

Mark McArthey documented the canonical write-only failure:[48] after a particular solution approach was rejected on April 20, a different Claude instance proposed the identical rejected idea three days later. The prior reasoning existed in accessible session files, but the agent couldn't retrieve it; context window limitations had rendered the information inaccessible. "The agent is writing structured session data to disk already. The agent just isn't reading its own archive."

A Microsoft engineer documented the efficiency cost: repeatedly re-explaining code to the same agent consumed significant daily time — the cost of a write-only memory system measured in human hours.[48]


Section 9: AI Strategic Planner Personas and Meta-Orchestrators

Strategic planner personas are the highest-complexity form of orchestration sediment: they are the hardest to build (requiring careful prompt engineering and coordination logic), the hardest to delete (because removing the planner requires redistributing its responsibilities), and the least likely to deliver on their promise (because LLM-based planning of LLM work introduces a compounding layer of probabilistic behavior).

Key finding: Anthropic's Claude Code Coordinator — their own orchestrator — deliberately removes capabilities: "The Coordinator cannot execute Bash. It cannot read files."[2] The deliberate deletion of capabilities from the orchestrator is the design pattern. An orchestrator that can do everything will fail in catastrophic ways.

The "Super AI" Failure Pattern

The strategic planner that gets deleted is typically a grand coordinator agent designed to orchestrate everything.[47]

Four failure modes of monolithic coordinators:[47]

  1. Brittleness: "A minor bug in one small, seemingly unrelated module can create cascading failures that bring down the entire system"
  2. Inefficient scaling: Upgrading one function forces scaling the whole system
  3. Innovation lock-in: Prevents swapping in better specialized models
  4. Black-box risk: "Context blindness" when optimizing narrowly

The Mega-Prompt as Planner Sediment

Anti-Pattern #1 from achan (raw_9.md): "Overloading a single agent with hundreds of instructions it cannot reliably execute." One real deployment gave an agent "close to 500 lines of procedural instructions" expecting exact execution. Result: LLMs compress context and optimize for intent, not strict procedural execution.[9]

From hatchworks:[57] "mega-prompts" are explicitly called out as the case where "one prompt tries to handle everything; nobody can maintain it." The prompt grows incrementally, each addition appearing reasonable, until maintenance becomes impossible. That is the signal that the planner still exists but is effectively inert: a strategic planner that accumulates rules nobody reviews and produces behavior nobody can predict.

Anthropic's Prompt Chain Deletion

Anthropic's Applied AI team at AWS re:Invent 2025:[18]

| What They Moved Away From | What They Moved Toward |
| --- | --- |
| Pre-loading 32-page SOPs and documentation | Revealing information as agents request it (progressive disclosure) |
| Complex 50-prompt chained workflows | Agentic loops with tools — fewer edge cases to hardcode |
| Over-specification of procedures | Model autonomy and error recovery |

The Invisible State Anti-Pattern in Multi-Planner Systems

Anti-Pattern #3 from achan: "Relying on LLMs to remember past actions instead of explicit state management" produces "repeated steps, contradicting actions, collapsing context windows, hallucinated states."[9] Kamradt's benchmark showed state buried in long conversations becomes "effectively unreliable." Strategic planner personas that rely on conversation history to maintain planning state are structurally exposed: as the planning conversation grows, the planner's memory degrades.

Google's Titanium Agent Deletion

From Google's production AI agent refactor (raw_64.md):[64]

| Deleted Component | Why |
| --- | --- |
| Monolithic Python script (massive linear for-loop) | Could not scale; debugging required end-to-end trace through entire loop |
| Hardcoded list of 12 case studies embedded in Python file | Static data embedded in code — couldn't update without code deployment |
| Prompt-based schema enforcement (JSON structure described inside prompt string) | "Dirty code, fragile parsing, and wasted tokens" — tokens spent describing structure instead of doing work |
| Custom retry logic | Replaced by framework-level resilience patterns |

Prompt-string schema enforcement is the strategic-planner-in-miniature: verbose instructions describing JSON output format inside the prompt consume context on every call, "just in case" the model needs reminding. Classic speculative sediment.[64]

When Strategic Coordination Is Actually Needed

The deletion story points to where coordination genuinely adds value. Anthropic's own Building Effective Agents documentation:[16] orchestration that adds value has three properties: clear task decomposition with explicit ownership boundaries, structured topology (not "bag of agents"), and state that is explicit and externalized (not held in conversation history). Routing decisions that were previously encoded as strategic planning prompts become simple conditional logic — the planning abstraction was compensating for code that was not written.

See also: Multi-Agent Complexity Anti-Patterns (quantitative evidence on coordination failure)

Section 10: Net Deletion Stories: Metrics and Outcomes

Net deletion stories — complete before/after comparisons with measured outcomes — are the empirical foundation for identifying which orchestration components are sediment. The consistent finding across independent teams: deletion improves performance on every measurable dimension simultaneously. If deletion had trade-offs, the pattern would not be consistent.

Key finding: Vercel removed 80% of their agent's tools, replacing 15+ specialized tools with 2, and improved every metric: 3.5x faster execution, 20% higher success rate, 37% fewer tokens, 42% fewer steps.[28] "The model makes better choices when we stop making choices for it."

Vercel d0: The Canonical Tool Deletion

The original system had 15+ specialized tools built on the assumption that the AI needed hand-holding through complex schemas:[28][15]

Tools deleted: GetEntityJoins, LoadCatalog, RecallContext, LoadEntityDetails, SearchCatalog, ClarifyIntent, SearchSchema, GenerateAnalysisPlan, FinalizeQueryPlan, JoinPathFinder, SyntaxValidator, VisualizeData, ExplainResults (and others)

Tools retained: 2 — bash command execution and SQL execution

| Metric | Old System (15+ tools) | New System (2 tools) | Improvement |
| --- | --- | --- | --- |
| Average execution time | 274.8 seconds | 77.4 seconds | 3.5x faster |
| Success rate | 80% | 100% | +20 percentage points |
| Average token usage | ~102K | ~61K | 37% reduction |
| Average steps | ~12 | ~7 | 42% fewer steps |
| Worst case execution | 724 seconds / 100 steps / 145,463 tokens (failing) | 141 seconds / 19 steps / 67,483 tokens (succeeding) | 5x faster, 53% fewer tokens, succeeds |

Why the specialized tools existed: "Solving problems the model could handle on its own." Engineers believed the AI would "get lost in complex schemas, make bad joins, or hallucinate table names." Every specialized tool was an assumption about AI limitations. Those assumptions were wrong. "We were doing the model's thinking for it."[28]

What replaced the tooling: A filesystem agent architecture where Claude accesses raw Cube DSL files (YAML, Markdown, JSON) directly and uses standard Unix utilities (grep, cat, ls, find). "File systems are an incredibly powerful abstraction. Grep is 50 years old and still does exactly what we need."

LangChain → Raw SDK: Full Economics

| Metric | LangChain | Raw SDK | Change |
| --- | --- | --- | --- |
| Code volume (customer support agent) | 1,200 lines | 630 lines | −47%[3] |
| Onboarding time | 10 engineer-days | 2 engineer-days | 5x faster[3] |
| Debugging time per incident | 90–240 minutes | 30–60 minutes | 3–4x faster[3] |
| Request latency per tool call | 10–30ms | 2–5ms | 5–6x lower[3] |
| Framework-attributed incidents | 5–15% of all LLM incidents | Near zero | 70–90% drop[3] |
| Monthly overhead cost | $8K–$30K per agent | Baseline | Eliminated[3] |
| Amortization period | — | — | 2–6 months (break-even horizon)[3] |

CLAUDE.md 70% Reduction → Improved Quality

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| File size | 1,400+ lines | 420 lines | −70%[55] |
| Token consumption at startup | ~3,500 | ~1,000 | −71% |
| Update time per change | 15–20 minutes | 3–5 minutes | 4x faster |
| Output quality | Baseline | Improved | Positive (counter-intuitive) |

Personal Session Waste Audit

One practitioner's cleanup quantified in dollar terms:[13]

Measured waste: ~$3.75 per session in orchestration overhead; "potentially $200–400/month in recovered productivity" after cleanup.[13]

mabl Agent: 4.5x Clean Session Rate

Replacing a sprawling monolithic state object with focused, schema-validated artifacts produced the 4.5x improvement in clean session rate.[17]

The monolithic state object is the structural manifestation of orchestration sediment: an undifferentiated blob of context accumulated over time, where everything is kept because nothing can safely be discarded.

AutoGen → Deterministic Sequencing: 3x Development Speed

Same task (table reservation workflow):[49]

| Approach | Development Time | Runtime Behavior | Token Cost |
| --- | --- | --- | --- |
| AutoGen (conversational multi-agent) | 3 weeks | Agents looped without converging | 5x baseline |
| CrewAI (deterministic task sequencing) | 1 week | Predictable execution | Baseline |

ToolSearch Deferred Loading: 85% Token Reduction

Claude Code's ToolSearch pattern achieved 85% token reduction for tool schemas by deferring MCP tool definitions and loading full schemas only on demand.[59] This is the most direct evidence that eager loading of tools and skills is sediment: switching from "load everything at startup" to "load on demand" improved context economics by 85% with no functionality loss.

Cross-Deletion Pattern: What Gets Kept

Across all documented deletion stories, a consistent set of components survives:

| Deleted (Sediment) | Kept (Signal) |
| --- | --- |
| Framework abstraction layers (LangChain, AutoGen) | Direct SDK calls with explicit control flow |
| Specialized tools per operation (15+ tools) | Broad tools for fundamental operations (bash, SQL, filesystem) |
| Multi-agent debate / Tribunal patterns | Single agents with structured topology and explicit state |
| Strategic planner personas | Simple routing logic + specialized workers |
| Manually-invoked skills | Passive always-present context (AGENTS.md) or auto-activation hooks |
| CLAUDE.md rules for behaviors Claude already does | Rules that prevent specific, observed, reproducible mistakes |
| Behavioral hooks (text-based stop conditions) | Deterministic exit-code-2 blocks for actual mechanical enforcement |
| Knowledge graphs for all retrieval | Flat file systems / direct embeddings for proven retrieval patterns |

The convergence principle: The surviving components share one property — they cannot be replaced by model capability improvements. Flat file systems work because the access pattern is explicit. Deterministic enforcement works because exit code 2 is not probabilistic. Direct SDK calls work because they have no hidden assumptions. Everything else is subject to deletion as models improve.[28][36][16]

See also: Cost Optimization (token cost accounting), New Tooling 2026 (what survived the deletion wave)

Sources

  1. Effective Claude Code Workflows in 2026: What Changed and What Works Now (retrieved 2026-04-27)
  2. Claude Code Architecture Analysis (retrieved 2026-04-27)
  3. The LangChain Exit: Why Production Teams Are Quietly Rewriting to Raw SDKs in 2026 (retrieved 2026-04-27)
  4. Why we no longer use LangChain for building our AI agents (Hacker News discussion) (retrieved 2026-04-27)
  5. Why Your Multi-Agent AI System Is Probably Making Things Worse (retrieved 2026-04-27)
  6. Stop Bloating Your CLAUDE.md: Progressive Disclosure for AI Coding Tools (retrieved 2026-04-27)
  7. Compress Your Claude.md: Cut 60-70% of System Prompt Bloat in Claude Code (Hacker News) (retrieved 2026-04-27)
  8. The Memory Problem in AI Agents Is Half Solved. Here's the Other Half. (retrieved 2026-04-27)
  9. AI Agent Anti-Patterns (Part 1): Architectural Pitfalls That Break Enterprise Agents (retrieved 2026-04-27)
  10. AI Agent Anti-Patterns (Part 2): Tooling, Observability, and Scale Traps in Enterprise Agents (retrieved 2026-04-27)
  11. What Is Context Rot in Claude Code Skills? How Bloated Skill Files Degrade Agent Performance (retrieved 2026-04-27)
  12. Stop Adding New Claude Skills — Fix the Broken Ones First (retrieved 2026-04-27)
  13. Stop Wasting Tokens: A Developer's Guide to Claude Code Cleanup (retrieved 2026-04-27)
  14. The Multi-Agent Trap | Towards Data Science (retrieved 2026-04-27)
  15. More Agents, More Tools, Worse Results: The 2026 Evidence for Radical Simplification (retrieved 2026-04-27)
  16. Building Effective AI Agents — Anthropic Research (retrieved 2026-04-27)
  17. Rebuilding an AI Agent the Right Way: Measurement, Not Guesswork (retrieved 2026-04-27)
  18. AWS re:Invent 2025 - What Anthropic Learned Building AI Agents in 2025 (AIM277) (retrieved 2026-04-27)
  19. 2025 Overpromised AI Agents. 2026 Demands Agentic Engineering. (retrieved 2026-04-27)
  20. How Claude Code Builds a System Prompt (retrieved 2026-04-27)
  21. [BUG] Hook error messages shown on every tool call even when hooks exit 0 · Issue #34859 (retrieved 2026-04-27)
  22. Context Engineering - LLM Memory and Retrieval for AI Agents (retrieved 2026-04-27)
  23. Designing CLAUDE.md correctly: The 2026 architecture that finally makes Claude code work (retrieved 2026-04-27)
  24. Why Your Multi-Agent AI System Is Probably Making Things Worse (retrieved 2026-04-27)
  25. AI Agent Anti-Patterns (Part 1): Architectural Pitfalls That Break Enterprise Agents (retrieved 2026-04-27)
  26. AI Agent Anti-Patterns (Part 2): Tooling, Observability, and Scale Traps in Enterprise Agents (retrieved 2026-04-27)
  27. I never understood the fuss over using Knowledge Graphs with RAG (retrieved 2026-04-27)
  28. We Removed 80% of Our Agent's Tools — Vercel Blog (retrieved 2026-04-27)
  29. My Experience with Claude Code 2.0 and Getting Better at Using Coding Agents (retrieved 2026-04-27)
  30. 10 Tips to Stop Burning Your Tokens in Claude Code (retrieved 2026-04-27)
  31. Why we no longer use LangChain for building our AI agents (Octomind) (retrieved 2026-04-27)
  32. Stop Bloating Your CLAUDE.md: Progressive Disclosure for AI Coding Tools (retrieved 2026-04-27)
  33. Compress Your Claude.md: Cut 60-70% of System Prompt Bloat in Claude Code (HN Discussion) (retrieved 2026-04-27)
  34. Passive Context Wins: Why AGENTS.md Outperforms Skills in AI Agent Evals (retrieved 2026-04-27)
  35. A Practical Guide to Memory for Autonomous LLM Agents (retrieved 2026-04-27)
  36. The Art of Subtraction: Why Anthropic is Telling Us to Delete Our Agent Harnesses (retrieved 2026-04-27)
  37. Designing CLAUDE.md correctly: The 2026 architecture that finally makes Claude code work (retrieved 2026-04-27)
  38. Claude Code Architecture Analysis (retrieved 2026-04-27)
  39. Anti-Pattern Multi-Agent Overkill: Too Many Agents in the System | Agent Patterns (retrieved 2026-04-27)
  40. The Multi-Agent Trap | Towards Data Science (retrieved 2026-04-27)
  41. Why I stopped putting LLMs in my agent memory retrieval path - DEV Community (retrieved 2026-04-27)
  42. Stop Bloating Your CLAUDE.md: Progressive Disclosure for AI Coding Tools | alexop.dev (retrieved 2026-04-27)
  43. The most valuable skill in 2026 isn't writing code. It is deleting it. - DEV Community (retrieved 2026-04-27)
  44. How I Built an AI Orchestration Engine Without LangChain in 2026 + OpenAI Community Discussion (retrieved 2026-04-27)
  45. Passive Context Wins: Why AGENTS.md Outperforms Skills in AI Agent Evals | Victorino Group (retrieved 2026-04-27)
  46. Tell HN: Claude 4.7 is ignoring stop hooks | Hacker News (retrieved 2026-04-27)
  47. AI Agent Orchestration 101: Stop Trying to Build a Super AI | MorelandConnect (retrieved 2026-04-27)
  48. Your AI agent already writes every session to disk. Why isn't it reading its own archive? - DEV Community (retrieved 2026-04-27)
  49. AI Agents in Production: Frameworks, Protocols, and What Actually Works in 2026 (retrieved 2026-04-27)
  50. MCP Context Window Problem: Why Tool Bloat Hurts AI Agents (retrieved 2026-04-27)
  51. Best Practices for Claude Code - Official Anthropic Documentation (retrieved 2026-04-27)
  52. Designing CLAUDE.md correctly: The 2026 architecture that finally makes Claude code work (retrieved 2026-04-27)
  53. Claude Code Best Practices (retrieved 2026-04-27)
  54. Tell HN: Claude 4.7 is ignoring stop hooks (retrieved 2026-04-27)
  55. Optimizing CLAUDE.md for AI Code Assistants (retrieved 2026-04-27)
  56. Is LangChain becoming too complex/bloated for simple RAG applications in 2025? (retrieved 2026-04-27)
  57. Orchestrating AI Agents in Production: The Patterns That Actually Work (retrieved 2026-04-27)
  58. Feature: Skill Management — Usage Tracking, Conflict Detection, and Pre-Creation Validation (retrieved 2026-04-27)
  59. "[FEATURE] Lazy context loading: extend the ToolSearch pattern to all context components" (retrieved 2026-04-27)
  60. The Claude Code Leak Just Gave Every Developer a Masterclass in AI Agent Orchestration (retrieved 2026-04-27)
  61. Memory Is the Unsolved Problem of AI Agents — Here's Why Everyone's Getting It Wrong (retrieved 2026-04-27)
  62. We need to re-learn what AI agent development tools are in 2026 (retrieved 2026-04-27)
  63. I Wrote 200 Lines of Rules for Claude Code. It Ignored Them All. (retrieved 2026-04-27)
  64. Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith (Google Developers Blog) (retrieved 2026-04-27)
  65. Karpathy's CLAUDE.md Skills File: The Complete Guide (retrieved 2026-04-27)
