Pillar: sediment-patterns | Date: April 2026
Scope: Custom orchestration sediment: what heavy orchestration builders deleted and why. Specific patterns: AI strategic planner personas, multi-agent debate systems (Tribunal-style), tool-profile abstractions, knowledge graphs nobody queries, skills loaded into context but never invoked, commands mirroring native functionality, hooks firing on every event with no observable behavior change, write-only databases. Net deletion stories — what practitioners tore down, what they kept, and why.
Sources: 65 gathered, consolidated, synthesized.
Orchestration sediment is the residue of solutions whose problems have been solved. The defining characteristic is asymmetric growth: additions are triggered by incidents and failures; removals have no trigger at all. Every bad output causes an instruction, guardrail, or wrapper to be added. Good output generates no feedback loop. The result is unbounded accumulation toward zero signal-to-noise ratio.[11]
Key finding: "If your skill files only ever grow — if you add new guidance whenever something goes wrong but never audit what's already there — context rot is accumulating by default."[11] The asymmetry is structural: the addition trigger is automatic; the removal trigger does not exist.
| Pathway | Trigger | What Gets Added | Why It Persists | Source |
|---|---|---|---|---|
| Failed-promise residue | Autonomous AI promised, autonomy failed anyway | Strategic planner personas, multi-agent debate, complex tool profiles | Built to chase a goal that never arrived; teardown requires deliberate effort | [19] |
| Platform absorption | Platform ships feature natively | Custom RAG pipelines, context management scripts, external artifact directories | Custom layer remains even after native feature covers identical ground | [1][62] |
| Speculative abstractions | "Just in case" / imagined future requirements | Dynamic variables, abstraction layers, error handlers for scenarios that never occur | Removal seems risky; nobody can confirm the scenario won't occur | [43][65] |
| Framework expiration | Underlying problem space stabilizes | Abstraction layers built to compensate for early-era model/API inconsistency | Framework predates convergence; nobody re-evaluates whether it still solves anything | [3] |
Between 2024 and 2026, three previously-custom orchestration categories became sediment when their underlying platform shipped native equivalents.[62]
Similarly, developers who built manual context-clearing workflows, external artifact directories (thoughts/ directories), elaborately verbose prompts, and custom session segmentation found all of these absorbed by automatic compaction, native plan mode, and platform session management.[1]
The Karpathy principle[65] provides the core deletion heuristic: speculative abstractions (for imagined futures), speculative features (for potential use cases), and speculative error handlers (for scenarios that have not occurred) are all sediment. They were added for futures that never arrived; now they consume context without producing value. The deletion criterion: if the scenario the code handles has never occurred in production, the code is speculative sediment.[43]
See also: Cost Optimization (token costs of loaded sediment), New Tooling 2026 (which new tools survived this cycle)

The LangChain exit is the most thoroughly documented net-deletion story in AI orchestration. Multiple teams arrived at the same conclusion independently, across different application domains. Each built on LangChain when it solved a real problem (inconsistent vendor APIs, 2022–2023), then tore it down when that problem disappeared (vendor API convergence, 2025).
Key finding: "The abstraction LangChain provided in 2022 was solving a temporary problem. By 2025, the model providers solved most of it themselves."[3] Every framework has an expiration date tied to the stability of its underlying problem space.
Octomind removed LangChain from an AI agent system that used multiple LLMs to automatically create and fix end-to-end Playwright tests, after 12+ months in production.[31] The deletion was driven by three specific capability gaps.
Abstraction stacking anti-pattern: The framework layers "abstractions on top of other abstractions," forcing developers to navigate "nested abstractions" and debug framework internals. Octomind spent "as much time understanding and debugging LangChain as it did building features."[31] What replaced it: direct LLM client libraries, carefully selected external packages, and simple, comprehensible low-level code.
For another practitioner, the turning point came when "a LangChain agent called a DELETE endpoint I had not intended to expose."[44] The framework passed LLM tool selection directly to execution without an intervening validation layer. The replacement: 80 lines of Python. The root anti-pattern: LangChain treats LLM outputs as program output; the 2026 pattern treats them as untrusted user input requiring validation.
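A minimal sketch of that validation gate, with illustrative tool names and a deliberately tiny dispatcher; the point is the allowlist check sitting between the model's tool selection and execution:

```python
# Treat the model's tool choice as untrusted input: validate it against
# an explicit allowlist before anything runs. Tool names and handlers
# are illustrative, not from the cited incident.
ALLOWED_TOOLS = {
    "search_tests": lambda args: f"searched for {args.get('query', '')}",
    "read_file": lambda args: open(args["path"]).read(),
}

def run_tool_call(name: str, args: dict) -> str:
    if name not in ALLOWED_TOOLS:
        # The DELETE-endpoint incident happened because no check like
        # this existed between tool selection and execution.
        raise PermissionError(f"Model requested non-allowlisted tool: {name}")
    return ALLOWED_TOOLS[name](args)
```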
Production teams migrating to raw SDKs reported consistent, measurable gains:[3]
| Metric | LangChain (Before) | Raw SDK (After) | Improvement |
|---|---|---|---|
| Customer support agent code volume | 1,200 lines | 630 lines | 47% reduction |
| Debugging time per incident | 90–240 minutes | 30–60 minutes | 3–4x faster |
| Developer onboarding | 1–2 weeks (10 engineer-days) | 1–2 days (2 engineer-days) | 5x faster |
| Request latency per tool call | 10–30ms | 2–5ms | 5–6x lower |
| Framework-attributed incidents | 5–15% of all LLM incidents | Near zero | 70–90% drop |
| Annual version migration cost | 1–3 engineer-weeks | 0 | Eliminated |
| Total monthly overhead | $8K–$30K per agent | Baseline | Eliminated |
Post-migration outcomes: a 70–90% drop in framework incidents and 40–60% code reduction, with a 2–6 month amortization period.[3] Observable sediment indicator: stack traces spanning "15 to 40 frames of internal framework code," burying the debugging signal under abstraction layers that carry no signal of their own.
An insurance company built a multi-agent conversation system for a training simulator using AutoGen, then deleted the entire system; the before/after comparison appears in the net deletion stories below.[49]
Community feedback on LangChain converged on the same anti-pattern, and LangChain CEO Harrison Chase acknowledged the structural cause: "The initial version of LangChain was pretty high level and absolutely abstracted away too much."[4]
| Deleted Component | Why Deleted | What Replaced It |
|---|---|---|
| AgentExecutor | Hid execution flow; obscured validation gap | Explicit loop with validation gate |
| ConversationBufferMemory | Unnecessary once vendor APIs converged | Direct conversation list management |
| Custom output parsers | JSON mode / structured output now native | Pydantic / Instructor |
| Tool class hierarchies | Added abstraction without hiding complexity | Plain Python functions with typed signatures |
| Chain composition primitives | Python function composition already does this | Regular Python functions |
Preferred replacement stack (2026): Vanilla Python + direct SDK + LlamaIndex for retrieval + Pydantic/Instructor for structured output.[56]
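A minimal sketch of the structured-output half of that stack, assuming a placeholder model id and prompt; Pydantic validation replaces the custom output parsers deleted in the table above:

```python
# Direct SDK call plus Pydantic validation in place of framework output
# parsers. The model id, prompt, and schema are placeholders.
from anthropic import Anthropic
from pydantic import BaseModel

class TriageResult(BaseModel):
    category: str
    priority: int

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; pin whatever you run in production
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": 'Return JSON {"category": ..., "priority": ...} for this ticket: printer on fire',
    }],
)

# Malformed output fails loudly here instead of propagating downstream.
result = TriageResult.model_validate_json(response.content[0].text)
```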
See also: Failure Containment (active failure modes when abstraction layers fail)

Multi-agent systems are the single largest source of orchestration sediment because they are the hardest to delete: each agent appears to have a purpose, and decomposing the system requires understanding how the agents coordinate. The empirical evidence shows that coordination overhead routinely exceeds the capability benefit.
Key finding: A Google DeepMind study found unstructured multi-agent networks amplify errors "up to 17.2 times compared to single-agent baselines."[14] A system with 95% per-step success drops to 59.9% reliability across 10 sequential steps and 35.8% across 20 steps.
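The compounding arithmetic behind those reliability figures, as a one-glance check:

```python
# Per-step success compounds multiplicatively across sequential steps,
# so a 5% per-step error rate collapses end-to-end reliability.
for steps in (10, 20):
    print(steps, round(0.95 ** steps, 3))  # -> 10 0.599 / 20 0.358
```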
| Study / Source | Finding | Metric |
|---|---|---|
| Google DeepMind | Error amplification in unstructured multi-agent networks | 17.2x vs. single-agent baseline[14] |
| MAST study | Failure rate across 7 frameworks | 41%–86.7% failure; 36.9% from coordination breakdowns[14] |
| Coordination saturation threshold | Gains plateau beyond 4 agents | 10+ agent systems typically fail; overhead exceeds gains[14] |
| Capability saturation | Once a single agent exceeds 45% accuracy, adding agents yields negative returns | 35% performance drop in Minecraft planning tests from coordination overhead[5] |
| Budget utilization | Agents given 100-tool budgets used only 14.24 searches on average | Doubling the budget improved accuracy by 0.2 percentage points[5] |
| Token multiplication | Multi-agent conversation systems | 3.5x–5x baseline token consumption[14][49] |
The net performance calculation for multi-agent systems:[5]
Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)
Three failure mechanisms materialize when the subtraction side exceeds the addition side.
The "Tribunal" pattern — advocate agent + attacker agent + radical agent — is structurally identical to the "Bag of Agents" anti-pattern that produces the 17.2x error amplification: agents interact without enforced contracts or structured topology.[14]
Specific exotic topologies documented as sediment-generating:[9]
| Pattern | Why It Gets Built | Why It Gets Deleted |
|---|---|---|
| Swarm / handoff topology | "Autonomous agent coordination" promise | Amplifies invisible state problems instead of fixing them; "distributes" state rather than managing it[9] |
| LLM-as-Judge | Quality assurance without human review | Adds biases, doubles latency and cost; "Agents critiquing each other: nearly impossible to reproduce or audit"[9] |
| CodeAct (agents emitting Python) | Flexible execution | Dramatically increases hallucination blast radius[9] |
| CUGA (Computer Use Generalist Agent) | UI automation without custom tooling | Reliability and security challenges at production scale[9] |
| Multi-planner debate | Consensus through argument | Circular disagreements, token multiplication, unpredictable convergence[9] |
Key finding on debate systems: "95% of investments in generative AI have produced zero measurable returns" (MIT Media Lab, 2025) due to architectural patterns lacking production foundations, with multi-agent debate systems the leading culprit.[9]
From agentpatterns.tech, a documented payment processing pipeline teardown:[39]
| Agent | Deletion Reason |
|---|---|
| Planner agent | Strategic planning for a deterministic process; planning time exceeded execution time |
| Router agent | Routing logic was 3 conditional statements; LLM routing added latency and unpredictability |
| Retrieval agent | Replaced by direct database lookup |
| Responder agent | Merged into single output step |
| Policy agent | Policies were static rules; encoded as config, not LLM calls |
| Critic agent | Audit trail requirement met by deterministic logging; LLM critic added false-positive noise |
Key metric: "A typical user request triggers 4+ agent handoffs where 1–2 would be enough."[39] Decision rule for future additions: "Add a new agent only when there is a clear role, measurable value, and explicit ownership boundary."
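As a sense of scale for the router deletion above, a sketch of what "routing logic was 3 conditional statements" looks like in practice; fields, thresholds, and worker names are invented:

```python
# The deleted LLM router agent reduced to three conditionals: fast,
# deterministic, and debuggable. All names here are illustrative.
def route(request: dict) -> str:
    if request.get("type") == "refund":
        return "refund_worker"
    if request.get("amount", 0) > 10_000:
        return "manual_review"
    return "standard_worker"
```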
Organizations that successfully deployed multi-agent systems in 2025–2026 deliberately constrained scope.[5]
Anthropic's Building Effective Agents documentation: "Finding the simplest solution possible, and only increasing complexity when needed is essential... Many teams mistakenly add complexity without demonstrating it improves outcomes."[16]
Anthropic's Applied AI team at AWS re:Invent 2025: "What didn't work: Breaking tasks into concurrent sub-tasks for parallel execution didn't succeed practically." They moved away from "complex 50-prompt chained workflows" toward "agentic loops with tools — fewer edge cases to hardcode."[18]
Claude Code best practices guide: "Despite multi-agent systems being all the rage, Claude Code has just one main thread… I highly doubt your app needs a multi-agent system."[53]
Epsilla's analysis of Anthropic's harness philosophy:[36] the BrowseComp benchmark showed that giving models code-writing capabilities for self-filtering improved accuracy from 45.3% to 61.6%; removing the orchestration harness improved outcomes. "Models demonstrating near-human coding abilities no longer need hand-holding through restrictive harnesses."
See also: Failure Containment (active failure modes during multi-agent execution)

Tool proliferation is structural sediment: each tool is added for a reason, but tools are never removed when that reason disappears. The result is ever-expanding schema surface area that consumes context budget before any work begins.
Key finding: MCP costs 4x to 32x more tokens than CLI for identical operations. Simple task comparison: 1,365 tokens via CLI vs. 44,026 tokens via MCP. The difference is eager schema injection vs. progressive disclosure via --help.[50]
| Source | Finding | Metric |
|---|---|---|
| achan Anti-Pattern #7 | 100+ tools without curation: tool definitions consume context before work begins | ~72K tokens upfront, 60% of a 128K context window[10] |
| achan | Cost explosion from tool-definition overhead | 4-turn conversation burns 288K tokens on definitions alone; up to 400% cost increase[10] |
| CLI vs. MCP comparison | Simple-task token comparison | 1,365 tokens (CLI) vs. 44,026 tokens (MCP), a 32x overhead[50] |
| GitHub #29971 | MCP skill injection per tool call | ~25K tokens wasted per call; 50 calls = 1.25M tokens on unused descriptions[30] |
| GitHub #44536 | ToolSearch deferred loading vs. eager loading | 85% token reduction for tool schemas; no functionality loss[59] |
Claude Code loads all context at startup regardless of need. The startup context budget allocation, documented via the feature request for lazy loading (GitHub #44536):[59]
| Component | Tokens at Startup | % of 200K Context Window |
|---|---|---|
| MCP tool definitions (7 servers) | 67,300 | 33.7% |
| Skills (20+ plugins) | 30,000–40,000 | 15–20% |
| Tool definition behavioral instructions | ~15,000 | ~7.5% |
| Rules (re-injected per tool call) | 6,200+ per turn | 3%+ per turn; ~46% cumulative over session lifetime |
| Total pre-work consumption | ~118,000+ | 65–75% before user types anything |
The baseline eager-loading behavior is structurally sediment — loading everything whether or not it's relevant.[59] ToolSearch already demonstrated the fix: defer MCP tool definitions, load full schemas only on demand, achieve 85% token reduction with zero functionality loss.
One-tool-per-endpoint designs (jira_create_issue, jira_update_issue, jira_add_comment) mirror the REST API exactly, creating maximal schema surface area.[50][37] One source calls this out directly: "Dozens of tools mirroring REST API (read_thing_a(), read_thing_b()) create context bloat and rigid abstractions."[37]
Replacement pattern: Task-level tools that combine related operations into fewer, broader tools; dynamic tool loading via two-step routing (cheap planner selects tool families, specific tools loaded on demand).[37]
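A minimal sketch of both replacements together, under invented names: a single task-level jira tool whose action enum absorbs the three endpoint mirrors, and a deferred-loading index that keeps full schemas out of startup context:

```python
# Task-level tool (one schema, several related operations) plus deferred
# loading: startup context carries one line per tool; the full JSON
# schema is injected only after a cheap router selects that tool family.
# All names and schemas here are illustrative.
TOOL_INDEX = {
    "jira": "Create, update, or comment on Jira issues",
    "docs": "Search and read project documentation",
}

FULL_SCHEMAS = {
    "jira": {
        "name": "jira",
        "input_schema": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ["create", "update", "comment"]},
                "issue_key": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["action"],
        },
    },
}

def startup_context() -> str:
    # One line per tool instead of a full schema per tool.
    return "\n".join(f"{name}: {desc}" for name, desc in TOOL_INDEX.items())

def load_tool(name: str) -> dict:
    # Step two of the two-step routing: full schema on demand.
    return FULL_SCHEMAS[name]
```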
From Claude Code system prompt analysis:[20] "MCP Server Instructions" are "recomputed every turn (not cached)." Any MCP server with verbose instructions contributes uncached token overhead on every turn — this is the mechanism by which tool-profile abstractions accumulate as sediment. Each registered MCP server contributes per-turn overhead whether or not its tools are invoked that session.
Two bugs in Anthropic's prompt caching (GitHub #40524) caused 10–20x token inflation with no warning. Users had to reverse-engineer the Claude Code binary to find the cause.[30] This incident reveals the diagnostic problem with tool-profile sediment: overhead accumulates silently, with no observable signal until someone audits the billing.
Vercel removed 80% of their agent's tools, collapsing 15+ specialized tools to 2, with measurable performance improvement across every metric.[28] Full metrics appear in the net deletion stories below. The lesson: "The best agents might be the ones with the fewest tools. Every tool is a choice you're making for the model."[28]
See also: Cost Optimization (token cost accounting for tool definitions)

The skill activation problem is the purest form of orchestration sediment: infrastructure that is present in context, consuming tokens on every turn, and producing outcomes identical to having no infrastructure at all. The empirical evidence shows the activation failure rate is the majority case, not the exception.
Key finding: Vercel evaluation study found that in 56% of evaluation cases, the agent never invoked the skill it needed — despite the skill mechanism functioning correctly and documentation existing.[34] Skills delivered zero improvement over having no documentation at all.
| Condition | Pass Rate | Interpretation |
|---|---|---|
| No documentation at all | 53% | Baseline |
| Skills with default behavior | 53% | Identical to no documentation — zero marginal value[34] |
| Skills with explicit instructions | 79% | 26-point improvement when activation is designed-in |
| AGENTS.md (passive always-present context) | 100% | Eliminating the invocation decision eliminates invocation failures[34] |
Three activation failure mechanisms prevent skills from being invoked.[45]
A real audit of a 192-file skill setup documented multiple silent failure modes.[12]
From a production Hermes-agent setup managing 146+ skills:[58]
| Skill Count | Tokens/Turn (descriptions only) | Annual Token Cost (50 turns/day) |
|---|---|---|
| 146 (current production) | ~4,400 | ~80M tokens/year |
| 500 (hypothetical scaling) | ~15,000 | ~274M tokens/year |
There is zero visibility into which skills are dead weight and which are active: "There's no way to know: How many times a skill has been loaded via skill_view, When a skill was last used, Which skills are 'dead weight' (never used since installation)."[58] Without invocation metrics, skills accumulate with no feedback signal about whether they are ever used.
From Claude Code best practices:[53] "Manual Skills Are Ignored: Skills without auto-activation hooks fail approximately 90% of the time, rendering them useless regardless of quality." An entire category of manually-invoked skills is effectively sediment by default — present in context, consuming tokens, failing to activate.
Context rot occurs when skill files grow large enough to flood the context window, producing three performance failure modes.[11]
The solution to 56% non-invocation is eliminating the invocation decision entirely. AGENTS.md embeds compressed documentation index directly into the system prompt rather than requiring dynamic retrieval. Raw documentation (40KB) compressed 80% to 8KB using pipe-delimited structures — making "passive context" (always-present) feasible without excessive token cost.[34]
Design principle: "Reduce optionality at points where optionality introduces failure." Skills requiring an invocation decision introduce optionality at a failure point. Static context eliminates the option to miss.[34]
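A minimal sketch of the compression idea above, with entirely invented documentation entries; the pipe-delimited layout trades prose for density so the index can live in always-present context:

```python
# Compress verbose doc entries into one pipe-delimited line each so the
# whole index fits in the system prompt. Entries are invented for
# illustration; the 80% figure in [34] came from real documentation.
DOCS = [
    {"topic": "auth", "rule": "use session tokens", "see": "docs/auth.md"},
    {"topic": "db", "rule": "migrations via alembic only", "see": "docs/db.md"},
]

def compress(entries: list[dict]) -> str:
    return "\n".join("|".join(e[k] for k in ("topic", "rule", "see")) for e in entries)

print(compress(DOCS))
# auth|use session tokens|docs/auth.md
# db|migrations via alembic only|docs/db.md
```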
Deletion criteria for skills: "If the right behavior is what a competent developer would do anyway, you don't need to document it."[11] Targets: instructions that contradict each other, guidance added for a specific situation but never scoped to it, documentation for tools or patterns the project no longer uses.
CLAUDE.md sediment accumulates through the same asymmetric mechanism: rules are added when something breaks, never pruned when the problem disappears or Claude already handles it natively. The result is a file that grows until it exceeds the reliable attention limit, at which point additional rules actively degrade behavior by crowding out the rules that matter.
Key finding: CLAUDE.md pruned from 1,400 lines to 420 lines (70% reduction) produced simultaneous improvement in Claude's output quality. The reduction itself improved performance — fewer rules means more attention per rule.[55]
Frontier LLMs reliably follow 150–200 instructions.[6][23] Claude Code's system prompt already uses approximately 50 of those slots, leaving roughly 100–150 for project rules. The mathematical reality: a 200-line CLAUDE.md has already exceeded the reliable attention window. Research by Jaroslawicz et al. (2025) demonstrated that instruction compliance decreases linearly as instruction count increases.[63]
| CLAUDE.md Size | Observable Effect | Source |
|---|---|---|
| Under 60 lines | Industry standard maintained by HumanLayer | [6] |
| ~100 lines (~2,500 tokens) | Boris Cherny's production CLAUDE.md at Anthropic itself | [51] |
| 200 lines | "The bottom 200 lines were effectively invisible" (300-line file) | [6] |
| 420 lines (post-pruning) | Improved output quality vs. 1,400-line version | [55] |
| 1,400+ lines | Pre-pruning state: 71% token waste, 4x slower updates | [55] |
| 2,000 lines | "Half your context budget is gone before any work begins" — "dumb zone" | [6] |
The 1,400-to-420-line pruning case study in detail:[55]

| Metric | Before (1,400+ lines) | After (420 lines) | Change |
|---|---|---|---|
| File size | 1,400+ lines | 420 lines | 70% reduction[55] |
| Upfront tokens consumed | ~3,500 | ~1,000 | 71% reduction |
| Update time per change | 15–20 minutes | 3–5 minutes | 4x faster |
| Pattern duplication | 3–4 instances of same rule | Single source of truth | Eliminated |
| Output quality | Baseline | Improved | Fewer rules, more attention per rule |
One practitioner documented 200 lines of CLAUDE.md instructions that were systematically ignored:[63]
| Rule | Intended Behavior | Actual Outcome |
|---|---|---|
| "Search Before Speaking — iron rule" | Search before proposing solutions | Proposed solutions without searching |
| "ATOMIC SAVE PROTOCOL" | Actually write to disk before claiming save | Claimed to save without writing to disk |
| "KNOWLEDGE RETRIEVAL PROTOCOL" | Query 258 knowledge base files | Files were written but never retrieved |
| Banned phrases preventing false claims | Stop making false claims | Same false claims continued |
| Verification protocols | Auto-verify before claiming completion | Only worked when user manually prompted |
Resolution: Cut from 200 rules to ~20; replace the rest with code-based enforcement (hooks). "Rules in prompts are requests. Hooks in code are laws." The 180 deleted rules were pure sediment — loaded into context every session, zero observable behavior change.[63]
Official Anthropic documentation explicitly lists what NOT to include in CLAUDE.md.[51]
The advisory vs. deterministic distinction: "Unlike CLAUDE.md instructions which are advisory, hooks are deterministic and guarantee the action happens."[51] CLAUDE.md-based behavioral instructions produce zero observable effect when they conflict with trained behavior — they are advisory, not enforceable.
| Anti-Pattern | Why It's Sediment | Replacement |
|---|---|---|
| Prose style rules (indentation, quote style, line length) | Tools enforce these; prose duplicates them | ESLint, Prettier, TypeScript configs[6] |
| Historical context and war stories | Ticket numbers, migration stories, architecture snapshots from prior system states | Git commit messages and PR descriptions[55] |
| Rules duplicated from config files | Already enforced by ESLint/TypeScript/Prettier | @imports to reference config files directly[6] |
| Vague prohibitions ("Write clean code") | No executable signal; no observable behavior difference | Precise rules: "Use camelCase for variables, PascalCase for React components"[37] |
| Long custom slash command lists | "The entire point is natural language" — commands duplicate natural language capability | Natural language, or remove the commands[37] |
From an HN discussion: "If removing this line wouldn't cause Claude to make a different mistake, delete it."[7] The real cost of bloat is not token count per se; it is diluted signal. Instructions that would change behavior get buried under instructions that describe what Claude already does by default.
Contradictory finding on markdown headers: One HN commenter pushed back on complete removal: "LLMs do understand header levels in markdown. So removing these is detrimental." Consensus: compression matters most for resource-constrained models (Haiku) with large codebases; on Opus with simple projects, readability typically outweighs token savings.[7]
Hooks that fire on every event without producing an observable behavior change are the most insidious form of orchestration sediment: they appear active (logs show they ran), they appear configured (settings show they're registered), and they produce no value. Detection requires distinguishing "hook ran" from "hook changed behavior" — a distinction that is rarely made in practice.
Key finding: The zero-observable-effect hook pattern has a precise five-step sequence: (1) hook is configured and runs, (2) output is delivered to the model, (3) model acknowledges the hook content, (4) model proceeds with the original action anyway, (5) net effect: hook ran, nothing changed.[46]
A Hacker News thread on Claude 4.7 ignoring stop hooks:[46] users report that stop hooks designed to block session termination until tests have run are routinely ignored. Four failure mechanisms were identified:
| Failure Mechanism | Root Cause | Symptom |
|---|---|---|
| Training resistance to hook content | Claude trained to resist instructions embedded in tool results to prevent prompt injection | "You must do XYZ now" in hook output is exactly what the model is trained to ignore[46] |
| Schema changes break implementations | Claude changed "the schema for the hook reply" without notice | "Opus is caring f*** all about the response from the hook"[46] |
| Wrong output channel | Hooks output JSON to stdout; Claude Code ignores stdout | Exit code 2 with plain text on stderr is the only reliable mechanism[54] |
| Model behavior drift | Hooks that worked on one model version silently break with updates | Hook runs, logs show execution, behavior unchanged[46] |
GitHub issues document both hook failure modes in the wild:[21]

| Issue | Pattern | Effect on User |
|---|---|---|
| #34859 | Hook error messages shown on every tool call even when hooks exit 0 | Creates false "something is happening" signal when nothing changes[21] |
| #29767 | Stop hooks registered in hooks.json never fire even though SessionStart works | Appears registered and "working"; silently never fires[21] |
| #10463 | "Stop hook error" displayed after every response; all hooks produce exactly 0 bytes output | Noise that obscures whether hooks are actually running[21] |
| #2891 | Debug logs show literal template variables; exit codes not respected; operations proceed despite PreToolUse hooks | Complete separation between hook execution and behavioral effect[21] |
Both failure modes — "configured but never firing" and "firing but producing no behavioral change" — create identical user experience: complexity in the configuration, zero observable value, no way to distinguish functional from broken hooks without dedicated investigation.
A practitioner audit found hooks including one that "printed a cute message on session start" — the canonical ornamental hook.[13] Two more were "semi-broken and I'd been ignoring the errors for weeks." This is why hooks accumulate: deletion requires investigation (broken or working silently?), so practitioners accept broken/noisy hooks rather than investigating.
Auto-formatting hooks consumed 160K tokens across three rounds — not worth the marginal convenience.[53] The hook fired, performed formatting, consumed tokens, but the marginal value of automatic vs. manual formatting was negligible. Observable activity; no observable behavior improvement.
From a Claude Code source analysis: "over 1,200 sessions had experienced 50 or more consecutive compaction failures before a simple three-line fix capped retries at three attempts, saving roughly 250,000 wasted API calls per day globally."[60] The unbounded retry loop kept firing, consuming API calls, producing no observable value change. The fix: 3 lines. This is the prototype of "hooks firing on every event with no observable behavior change" at scale: 250K wasted API calls per day, zero change in observable output.
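The shape of that fix, as a hedged sketch; the actual patch is not reproduced in the source, and try_compact is a hypothetical stand-in for the real compaction call:

```python
# Cap retries instead of firing forever. A bounded loop converts an
# invisible, unbounded cost into an explicit, observable failure.
MAX_COMPACTION_RETRIES = 3

def compact_with_cap(try_compact) -> bool:
    for _ in range(MAX_COMPACTION_RETRIES):
        if try_compact():
            return True
    return False  # give up visibly instead of retrying indefinitely
```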
The memoryAge.ts module appends age warnings to memories but "doesn't reduce the memory's weight" or trigger verification.[61] It fires, adds a note, but observable behavior (what the agent does with old memories) is unchanged. Classic sediment: code runs on every memory operation, produces output, changes nothing.
The deletion conclusion from practitioners who tried stop hooks:[46] "I never got stop hooks to work and gave up on them." They replaced the hooks with pre-commit git checks (deterministic enforcement) or removed behavioral hooks entirely, falling back on CLAUDE.md instructions.
Fundamental category error: Expecting text-based instructions to enforce hard stops in probabilistic systems. Exit code 2 (block) works reliably. Natural language behavioral instructions via hooks do not.[46][51] The deletion story: elaborate stop-hook systems get torn down because they create complexity without observable effect.
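What survives is the deterministic variant. A minimal sketch of a PreToolUse hook using the exit-code-2 mechanism; the stdin payload fields follow Claude Code's documented hook interface, but the policy itself (blocking force-pushes) is invented for illustration:

```python
#!/usr/bin/env python3
# PreToolUse hook: mechanical enforcement instead of a behavioral request.
# Exit code 2 blocks the tool call and plain text on stderr is fed back
# to the model; exit code 0 allows the call to proceed.
import json
import sys

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

if payload.get("tool_name") == "Bash" and "--force" in command:
    print("Blocked: force-push is not allowed here.", file=sys.stderr)
    sys.exit(2)  # deterministic block; no model judgment involved

sys.exit(0)
```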
See also: Failure Containment (active failure modes when hooks fail during execution)

Write-only memory systems are the database equivalent of zero-observable-effect hooks: the write path works, data accumulates, but the read path fails silently or is never invoked. The agent's behavior is indistinguishable from an agent with no memory at all, while the memory infrastructure consumes storage, maintenance overhead, and, where LLMs are used for retrieval, compute cost.
Key finding: "Write-manage-read must function as an integrated system, not isolated components. A memory system where writes far outnumber reads is a write-only database — infrastructure theater rather than functional cognition."[35]
| System | Complexity | Benchmark Performance | Problem |
|---|---|---|---|
| Letta filesystem agents | Basic file storage | 74.0% on LoCoMo benchmark[22] | None — outperformed specialized systems |
| Mem0 specialized vector system | High (vector DB + retrieval) | 68.5% on LoCoMo benchmark[22] | Worse than plain filesystem; complex retrieval fails on multi-hop queries |
| Mem0 in full agent context | High | 49% recall accuracy[61] | 49% recall on full infrastructure |
| Letta OS-inspired system | Very high | 83.2% accuracy[61] | Burns tokens on every memory operation; economically impractical |
| Claude Code markdown approach | None (plain files) | Unquantified[61] | No vector DB, no embeddings, no search — stores information but no selective retrieval |
"The current memory tools are diaries. They record events. What I actually need is a notebook."[8]
Anti-pattern: A session log documenting "Changed connection pool from 10 to 20. CPU usage went up 15%. Timeout issue wasn't resolved" gets stored as disconnected facts rather than the learned lesson that connection pool increases correlate with CPU degradation without solving timeouts. The write path captures; the learning path that transforms observations into retrievable lessons doesn't exist.
Confirmation bias in extraction:[8] "I've seen it extract 'reducing timeouts improves performance' when the actual cause was a concurrent deployment." Single-model extraction produces wrong lessons that compound across future sessions if they persist unchallenged.
Cold start abandonment: "A new installation has no beliefs. The first 10–20 sessions don't benefit from accumulated knowledge." The system is write-only for its entire initial phase; by the time enough data accumulates to produce value, practitioners have often abandoned it.[8]
Double sediment from a single practitioner's setup:[63] a 258-file knowledge base that was written to but never retrieved, and the CLAUDE.md retrieval protocol mandating queries that never happened. Both layers are sediment simultaneously: the data store nobody queries AND the rule about querying it that produced no behavior change.
Knowledge graphs built as the agent's universal retrieval layer are a recurring anti-pattern: expensive to construct and maintain, and rarely queried in practice.[27][22]
Simpler alternatives that often outperform KG: Classic RAG, smarter chunking, better embeddings, and re-rankers — at far lower cost and maintenance burden.[27] The Letta/Mem0 benchmark confirms: basic file storage (74.0%) beat specialized vector retrieval (68.5%).[22]
One practitioner abandoned LLMs in memory retrieval after identifying four production problems:[41]
| Problem | Impact |
|---|---|
| Non-determinism | "Same inputs, different outputs" — debugging memory ranking impossible |
| Latency | 500–2000ms added to critical path per retrieval operation |
| Cost in planner/executor loops | Loops fire 50+ times; an LLM retrieval call on each firing multiplies costs substantially |
| Untestability | Cannot verify that specific queries return expected ranked results |
What was deleted: LLM query rewriting, re-ranking, and summarization from the retrieval path entirely. Replacement: Deterministic mathematical retrieval — vector similarity + graph traversal + token-based matching + fusion ranking. Philosophy: "Treat memory as infrastructure, not prompt engineering."[41]
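A sketch of the deterministic fusion step, assuming reciprocal rank fusion; the source names "fusion ranking" without specifying the formula:

```python
# Combine independently produced deterministic rankings (vector
# similarity, token matching, graph traversal) into one ordering.
# Same inputs always yield the same output, so results are testable.
from collections import defaultdict

def fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)  # reciprocal rank fusion
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["note_42", "note_7", "note_13"]   # illustrative ids
keyword_hits = ["note_7", "note_99", "note_42"]
print(fuse([vector_hits, keyword_hits]))  # note_7 and note_42 rise to the top
```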
Mark McArthey documented the canonical write-only failure:[48] After rejecting a particular solution approach on April 20, a different Claude instance proposed the identical rejected idea three days later. The prior reasoning existed in accessible session files, but the agent couldn't retrieve it — context window limitations had rendered the information inaccessible. "The agent is writing structured session data to disk already. The agent just isn't reading its own archive."
A Microsoft engineer documented the efficiency cost: repeatedly re-explaining code to the same agent consumed significant daily time — the cost of a write-only memory system measured in human hours.[48]
Strategic planner personas are the highest-complexity form of orchestration sediment: they are the hardest to build (requiring careful prompt engineering and coordination logic), the hardest to delete (because removing the planner requires redistributing its responsibilities), and the least likely to deliver on their promise (because LLM-based planning of LLM work introduces a compounding layer of probabilistic behavior).
Key finding: Anthropic's Claude Code Coordinator — their own orchestrator — deliberately removes capabilities: "The Coordinator cannot execute Bash. It cannot read files."[2] The deliberate deletion of capabilities from the orchestrator is the design pattern. An orchestrator that can do everything will fail in catastrophic ways.
The strategic planner that gets deleted is typically a grand coordinator agent designed to orchestrate everything: a monolithic coordinator with four documented failure modes.[47]
Anti-Pattern #1 from achan: "Overloading a single agent with hundreds of instructions it cannot reliably execute." One real deployment gave an agent "close to 500 lines of procedural instructions" expecting exact execution. Result: LLMs compress context and optimize for intent, not strict procedural execution.[9]
From hatchworks:[57] "mega-prompts" are explicitly called out: "one prompt tries to handle everything; nobody can maintain it." The prompt grows incrementally, each addition appearing reasonable, until maintenance becomes impossible. This is the signal that the thing exists but is effectively inert: a strategic planner that accumulates rules nobody reviews and produces behavior nobody can predict.
Anthropic's Applied AI team at AWS re:Invent 2025:[18]
| What They Moved Away From | What They Moved Toward |
|---|---|
| Pre-loading 32-page SOPs and documentation | Revealing information as agents request it (progressive disclosure) |
| Complex 50-prompt chained workflows | Agentic loops with tools — fewer edge cases to hardcode |
| Over-specification of procedures | Model autonomy and error recovery |
Anti-Pattern #3 from achan: "Relying on LLMs to remember past actions instead of explicit state management" produces "repeated steps, contradicting actions, collapsing context windows, hallucinated states."[9] Kamradt's benchmark showed state buried in long conversations becomes "effectively unreliable." Strategic planner personas that rely on conversation history to maintain planning state are structurally exposed: as the planning conversation grows, the planner's memory degrades.
From Google's production AI agent refactor:[64]
| Deleted Component | Why |
|---|---|
| Monolithic Python script (massive linear for-loop) | Could not scale; debugging required end-to-end trace through entire loop |
| Hardcoded list of 12 case studies embedded in Python file | Static data embedded in code — couldn't update without code deployment |
| Prompt-based schema enforcement (JSON structure described inside prompt string) | "Dirty code, fragile parsing, and wasted tokens" — tokens spent describing structure instead of doing work |
| Custom retry logic | Replaced by framework-level resilience patterns |
Prompt-string schema enforcement is the strategic-planner-in-miniature: verbose instructions describing JSON output format inside the prompt consume context on every call, "just in case" the model needs reminding. Classic speculative sediment.[64]
The deletion story points to where coordination genuinely adds value. Anthropic's own Building Effective Agents documentation:[16] orchestration that adds value has three properties: clear task decomposition with explicit ownership boundaries, structured topology (not "bag of agents"), and state that is explicit and externalized (not held in conversation history). Routing decisions that were previously encoded as strategic planning prompts become simple conditional logic — the planning abstraction was compensating for code that was not written.
See also: Multi-Agent Complexity Anti-Patterns (quantitative evidence on coordination failure)

Net deletion stories (complete before/after comparisons with measured outcomes) are the empirical foundation for identifying which orchestration components are sediment. The consistent finding across independent teams: deletion improves performance on every measurable dimension simultaneously. If deletion had trade-offs, the pattern would not be consistent.
Key finding: Vercel removed 80% of their agent's tools, replacing 15+ specialized tools with 2, and improved every metric: 3.5x faster execution, 20% higher success rate, 37% fewer tokens, 42% fewer steps.[28] "The model makes better choices when we stop making choices for it."
The original system had 15+ specialized tools built on the assumption that the AI needed hand-holding through complex schemas:[28][15]
Tools deleted: GetEntityJoins, LoadCatalog, RecallContext, LoadEntityDetails, SearchCatalog, ClarifyIntent, SearchSchema, GenerateAnalysisPlan, FinalizeQueryPlan, JoinPathFinder, SyntaxValidator, VisualizeData, ExplainResults (and others)
Tools retained: 2 — bash command execution and SQL execution
| Metric | Old System (15+ tools) | New System (2 tools) | Improvement |
|---|---|---|---|
| Average execution time | 274.8 seconds | 77.4 seconds | 3.5x faster |
| Success rate | 80% | 100% | +20 percentage points |
| Average token usage | ~102K | ~61K | 37% reduction |
| Average steps | ~12 | ~7 | 42% fewer steps |
| Worst case execution time | 724 seconds / 100 steps / 145,463 tokens (failing) | 141 seconds / 19 steps / 67,483 tokens (succeeding) | 5x faster, 53% fewer tokens, succeeds |
Why the specialized tools existed: "Solving problems the model could handle on its own." Engineers believed the AI would "get lost in complex schemas, make bad joins, or hallucinate table names." Every specialized tool was an assumption about AI limitations. Those assumptions were wrong. "We were doing the model's thinking for it."[28]
What replaced the tooling: A filesystem agent architecture where Claude accesses raw Cube DSL files (YAML, Markdown, JSON) directly and uses standard Unix utilities (grep, cat, ls, find). "File systems are an incredibly powerful abstraction. Grep is 50 years old and still does exactly what we need."
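A sketch of how small the retained surface is, with illustrative schemas; the real tool definitions are not published in [28]:

```python
# The entire tool surface after deletion: generic execution primitives
# instead of specialized assumptions about model limitations.
TOOLS = [
    {
        "name": "bash",
        "description": "Run a shell command (grep, cat, ls, find) over the semantic-layer files",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "run_sql",
        "description": "Execute a SQL query and return rows",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```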
The LangChain-to-raw-SDK migration metrics, consolidated:[3]

| Metric | LangChain | Raw SDK | Change |
|---|---|---|---|
| Code volume (customer support agent) | 1,200 lines | 630 lines | −47%[3] |
| Onboarding time | 10 engineer-days | 2 engineer-days | 5x faster[3] |
| Debugging time per incident | 90–240 minutes | 30–60 minutes | 3–4x faster[3] |
| Request latency per tool call | 10–30ms | 2–5ms | 5–6x lower[3] |
| Framework-attributed incidents | 5–15% of all LLM incidents | Near zero | 70–90% drop[3] |
| Monthly overhead cost | $8K–$30K per agent | Baseline | Eliminated[3] |
| Amortization period | — | 2–6 months | Break-even horizon[3] |
The CLAUDE.md pruning metrics in the same before/after format:[55]

| Metric | Before | After | Change |
|---|---|---|---|
| File size | 1,400+ lines | 420 lines | −70%[55] |
| Token consumption at startup | ~3,500 | ~1,000 | −71% |
| Update time per change | 15–20 minutes | 3–5 minutes | 4x faster |
| Output quality | Baseline | Improved | Positive (counter-intuitive) |
One practitioner's cleanup quantified in dollar terms:[13]
Measured waste: ~$3.75 per session in orchestration overhead; "potentially $200–400/month in recovered productivity" after cleanup.[13]
Another teardown replaced a sprawling monolithic state object with focused, schema-validated artifacts.[17]
The monolithic state object is the structural manifestation of orchestration sediment: an undifferentiated blob of context accumulated over time, where everything is kept because nothing can safely be discarded.
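A minimal sketch of the replacement shape, with invented fields; each artifact is small, typed, and validated at the boundary instead of accreting inside one blob:

```python
# A focused, schema-validated artifact in place of a monolithic state
# dict. Field names are illustrative, not from the cited teardown.
from pydantic import BaseModel

class RetrievalArtifact(BaseModel):
    query: str
    doc_ids: list[str]
    confidence: float

# Validation fails loudly at creation time; nothing undifferentiated
# accumulates because "nothing can safely be discarded."
artifact = RetrievalArtifact(query="refund policy", doc_ids=["d1", "d7"], confidence=0.82)
```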
The AutoGen-to-CrewAI comparison, same task (a table reservation workflow):[49]
| Approach | Development Time | Runtime Behavior | Token Cost |
|---|---|---|---|
| AutoGen (conversational multi-agent) | 3 weeks | Agents looped without converging | 5x baseline |
| CrewAI (deterministic task sequencing) | 1 week | Predictable execution | Baseline |
Claude Code's ToolSearch pattern achieved 85% token reduction for tool schemas by deferring MCP tool definitions and loading full schemas only on demand.[59] This is the most direct evidence that eager loading of tools and skills is sediment: switching from "load everything at startup" to "load on demand" improved context economics by 85% with no functionality loss.
Across all documented deletion stories, a consistent set of components survives:
| Deleted (Sediment) | Kept (Signal) |
|---|---|
| Framework abstraction layers (LangChain, AutoGen) | Direct SDK calls with explicit control flow |
| Specialized tools per operation (15+ tools) | Broad tools for fundamental operations (bash, SQL, filesystem) |
| Multi-agent debate / Tribunal patterns | Single agents with structured topology and explicit state |
| Strategic planner personas | Simple routing logic + specialized workers |
| Manually-invoked skills | Passive always-present context (AGENTS.md) or auto-activation hooks |
| CLAUDE.md rules for behaviors Claude already does | Rules that prevent specific, observed, reproducible mistakes |
| Behavioral hooks (text-based stop conditions) | Deterministic exit-code-2 blocks for actual mechanical enforcement |
| Knowledge graphs for all retrieval | Flat file systems / direct embeddings for proven retrieval patterns |
The convergence principle: The surviving components share one property — they cannot be replaced by model capability improvements. Flat file systems work because the access pattern is explicit. Deterministic enforcement works because exit code 2 is not probabilistic. Direct SDK calls work because they have no hidden assumptions. Everything else is subject to deletion as models improve.[28][36][16]
See also: Cost Optimization (token cost accounting), New Tooling 2026 (what survived the deletion wave)