Autonomous Build-Test-Ship Loop

Executive Summary

Critical finding: AI-generated code produces 15–25% higher mutation survival rates than human-written code at equivalent coverage levels — meaning a fully autonomous agent can achieve 100% test coverage while its tests detect zero behavioral regressions. Standard coverage metrics are a necessary but insufficient quality gate for any build loop that writes its own tests.^[6]

The most consequential design decision in any CC 2026 autonomous loop is hook placement: PostToolUse fires after every tool call, Stop fires once per response. Routing a full test suite through PostToolUse adds that suite's runtime to every single file write — a 40-second TypeScript suite across 12 monorepo packages runs dozens of times per session instead of once. The correct split is format and lint on PostToolUse (near-zero cost, immediate feedback), test suites and mutation scoring on Stop (once per response, worth the latency). Teams that implement this split correctly recover measurable developer hours weekly.^[11] Stop hooks that exit code 2 trigger another Claude response, creating a re-entrancy risk: without a stop_hook_active guard, a failing test suite produces an infinite token-burning remediation loop. Every production Stop hook must check this field before executing.^[11]

CC 2026 provides three scheduling primitives with distinct trade-off profiles. Cloud Routines (Anthropic-hosted) require no local machine and survive reboots, but have a 1-hour minimum interval and no local file access — correct for scheduled reporting tasks, wrong for commit-loop automation. Desktop Scheduled Tasks (machine-local cron) have a 1-minute minimum interval with full file access but require the machine to be on. The in-session /loop command can be self-paced: when no interval is specified, Claude picks delay dynamically — shorter while a build is finishing, longer when nothing is pending — and the 7-day automatic expiry bounds forgotten runaway loops. Maximum 50 scheduled tasks per session via CronCreate.^[2] All three mechanisms support 5-field cron expressions; extended syntax (L, W, name aliases) is not supported. Recurring tasks fire up to 10% late, capped at 15 minutes of jitter.^[2]

Among autonomous agent tools, Claude Code is the only system combining native hook infrastructure with three loop primitives (/loop, CronCreate, self-pacing). Aider provides no built-in loop mechanism — external shell wrappers required — and has no hook system at all. Goose (Apache 2.0, 29,400+ GitHub stars, 25+ LLM providers via MCP) is the only option that runs fully locally with Ollama (48GB+ RAM for 70B models), making it the correct choice for GDPR/DSGVO environments with data residency requirements.^[5]^[24] Cursor Composer 2 has no headless mode at all — it requires interactive sessions, though it deploys improved model checkpoints as often as every 5 hours via real-time RL training on production interaction data and is used by 50%+ of Fortune 500 companies as of 2026.^[29] Devin (Cognition AI) costs $500/month — break-even requires saving 12–20 developer hours monthly at $75/hour — and improved its PR merge rate from 34% to 67% and security vulnerability resolution time from 30 minutes to 1.5 minutes (a 20× gain) over 18 months of production operation. However, success rates collapse on vague or ambiguous tasks: 82% on test writing, 35% on vague bug fixes, 25% on ambiguous features, with the documented "last 30% problem" (core logic delivered, edge cases and integration missing).^[4]^[22]

Mutation testing is the only quality signal that catches the coverage-without-correctness failure mode characteristic of AI-generated tests. The standard recommendation for autonomous loops: 80% minimum mutation score for newly created AI code, 90% for critical domains (authentication, data integrity), and no decrease allowed when AI modifies existing code. Stryker's parallel execution cut CI time from 45 to 18 minutes in a generative AI pipeline, and the Stryker .NET benchmark showed 8× speedup on a 32-core cluster. A critical architectural gap: Stryker does not support Vitest browser mode because its instrumentation assumes Node.js execution — manual agent-driven mutation testing is the fallback for browser-mode stacks.^[15]^[25] The emerging production pattern is a secondary agent loop: surviving mutants trigger a "Test Generator" agent that writes tests to kill them, before human review.^[6]

Anthropic's published research (NeurIPS 2025, arXiv 2510.09907) on autonomous property-based testing across 100 popular Python packages produced 984 bug reports with a 56% overall validity rate and a $5.56 cost per bug report (~$9.93 per valid bug). Top-ranked reports reached 86% validity. Confirmed bugs merged into production included a catastrophic cancellation error in numpy.random.wald (500M+ monthly downloads), a silent iterator failure in AWS Lambda Powertools slice_dictionary(), and a hash collision in CloudFormation CLI's item_hash().^[16] The self-reflection loop (step 4 of 5 in the agent workflow) is mandatory — without it, domain-specific invariants generate false positives that block legitimate PRs. Model generation matters: Opus 4.1 and Sonnet 4.5 showed significantly better self-reflection than Sonnet 4, directly reducing false positive rates.^[16] Hypothesis CI profiles should run 200 examples for moderate CI budgets and 500 for critical paths; the dev profile at 25 examples is too thin for catching edge cases on AI-generated code.^[7]

Security scanning in autonomous loops has a structural gap: Semgrep's cross-file analysis does not run on diff-aware PR scans — it only runs on full repository scans. An agent loop running only PR-triggered Semgrep scans will miss cross-file SQL injection and XSS chains entirely. The correct architecture: Semgrep for fast per-PR scanning (10–30 seconds) plus a nightly CronCreate task running a full cross-file scan. Semgrep Multimodal (announced March 2026) claims 8× more true positives with 50% fewer false positives versus rule-only Semgrep.^[8] Snyk's free tier covers 200 tests/month; commercial plans start at $50/dev/month versus Checkmarx at $15,000+/year for enterprise compliance requirements.^[27] Production-grade teams combine both: Snyk for SCA and container scanning (handles cross-file analysis server-side), Semgrep for fast CI SAST on every PR. Common vulnerabilities in AI-generated code that these scanners catch: exposed API keys, missing input validation, insecure default configurations, overly permissive CORS headers — Devin specifically has documented security blindness for SQL injection and XSS risks.^[4]^[8]

The integrated loop architecture layers quality signals by cost and frequency. Format and per-file lint run on PostToolUse (near-zero cost, no latency impact). Security scans via MCP tool hooks run on PostToolUse against Write/Edit. Unit tests run on Stop with the mandatory re-entrancy guard. Mutation testing and property-based tests run on Stop or CI PR trigger. Full Semgrep cross-file analysis runs on a nightly CronCreate. The --bare flag in claude -p headless mode skips all auto-discovery (hooks, skills, MCP, CLAUDE.md) for reproducible CI behavior — Anthropic has signaled it will become the default for -p in a future release — and requires explicit ANTHROPIC_API_KEY rather than OAuth or keychain.^[9]^[12]

Practitioners should wire quality layers before optimizing loop speed: a loop that ships fast but skips mutation testing on AI-generated code accumulates semantic debt that coverage metrics won't surface. The concrete implementation path is: (1) PostToolUse hook for format/lint/per-file security scan, (2) Stop hook with re-entrancy guard for test suite, (3) CI PR gate for mutation score threshold (80% floor, 90% for auth/data paths), (4) property-based tests on Stop or CI for critical invariants, (5) nightly CronCreate for full Semgrep cross-file scan. This ordering installs the quality signals at the correct granularity without burning tokens on per-file test runs or missing cross-file vulnerabilities in PR-only scan configurations. Tool selection matters less than the discipline to run these checks: "success depends on workflow discipline — writing clear specifications, maintaining review gates, and implementing verification systems — rather than tool selection alone."^[29]

Section 1: CC 2026 Hook System — PostToolUse, PreToolUse, Stop, SessionStart

Claude Code hooks are shell commands (or HTTP calls, MCP tool calls, prompt evaluations, or subagent invocations) triggered at specific session lifecycle points, enforcing code standards deterministically without relying on model memory.^[11] "Hooks shift responsibility from the model to the runtime."^[11] Multiple hooks matching the same event all execute — PostToolUse hooks run in parallel; Stop hooks run sequentially.^[10] Hooks run synchronously in the agent loop, meaning slow hooks add latency to every matched tool call.^[10]^[19]

Hook Lifecycle Events

Five Hook Types

Exit Code Semantics

Event	When It Fires	Can Block? (exit 2)	Autonomous Loop Use
`SessionStart`	Session begins or resumes	No	Load context, restore state, print lane status
`UserPromptSubmit`	User submits prompt, before Claude processes	Yes	Inject project context, validate prompt safety
`PreToolUse`	Before a tool call executes	Yes	Block dangerous commands (rm -rf /, force push), rewrite tool input
`PostToolUse`	After a tool call succeeds	No (provides feedback)	Per-file lint/format, security scan on write, inject lint errors as context
`PostToolUseFailure`	After a tool call fails	No	Log failures, trigger fallback paths
`PostToolBatch`	After a full batch of parallel tool calls resolves	—	Aggregate batch results before next turn
`Stop`	When Claude finishes responding	Yes (exit 2)	Full test suite, mutation scoring, commit staging, baton write
`SessionEnd`	Session terminates	No	Flush logs, release lane claims
`Notification`	Claude sends a notification	—	Forward to Slack/PagerDuty, route alerts

Type	Config Key	Default Timeout	Best For
Command	`"type": "command"`	600s	Shell scripts, linters, test runners — receives JSON on stdin
HTTP	`"type": "http"`	30s	Webhooks, external monitoring APIs
MCP Tool	`"type": "mcp_tool"`	30s	Security scanners, Snyk, custom analysis tools via MCP
Prompt	`"type": "prompt"`	30s	Yes/no policy evaluation by Claude (e.g., "is this change safe?")
Agent	`"type": "agent"`	60s	Spawn subagents for complex verification tasks

Exit Code	Effect	Notes
0	Success; JSON stdout parsed	Normal path
2	Blocking error (for events that support blocking)	stderr shown to Claude; Claude asked to fix before continuing
Other	Non-blocking error	First line of stderr shown in transcript; execution continues

^[1] PostToolUse decision: "block" does NOT prevent tool execution (tool already ran); it instructs Claude to fix the flagged issue before its next action.^[10]

Hook Configuration Locations (Precedence Order)

File	Scope	Git-tracked?
`~/.claude/settings.json`	User-wide (all projects)	No
`.claude/settings.json`	Team repo	Yes (committable)
`.claude/settings.local.json`	Personal overrides	No (gitignored)
Plugin `hooks/hooks.json`	While plugin is enabled	Plugin-managed
Skill/agent frontmatter	While component is active	Skill-managed

^[1]^[11] Hooks merge additively across all files — a user-level stop hook fires alongside a project-level stop hook.

Matcher Pattern Rules

PostToolUse JSON Input Schema

Key Hook Fields

Critical Anti-Pattern: The Infinite Stop Loop

Pattern	Evaluated As	Example
`"*"`, `""`, or omitted	Match all tools	`"matcher": "*"`
Letters/digits/underscores/pipe only	Exact string or pipe-separated list	`"Edit\|Write\|MultiEdit"`
Contains other characters	JavaScript regex	`"mcp__memory__.*"`

Field	Description
`timeout`	Seconds before canceling. Command default: 600s
`statusMessage`	Custom spinner text shown while hook runs
`once`	If `true`, runs once per session (skill frontmatter only)
`async`	Background execution without blocking agent loop (command hooks only)
`asyncRewake`	Background execution; wake session if exits with code 2 (command hooks only)
`if`	Permission rule: `"Bash(git )"`, `"Edit(.ts)"`

Stop hooks that exit code 2 trigger a new Claude response. Without a guard, failing tests → remediation pass → Stop fires again → infinite token consumption. The mandatory guard:^[11]

The stop_hook_active field is injected automatically when a Stop hook is the cause of the current response, giving hooks a reliable re-entrancy check.^[11]

Common Pitfalls

Pitfall	Consequence	Fix
Missing `stop_hook_active` guard	Runaway token consumption	Add re-entrancy check (see above)
Full test suite in PostToolUse	Latency on every file write	Move tests to Stop hook
exit 2 for style warnings	Wasted turns on minor issues	Reserve exit 2 for blocking failures only
No timeout set	Hung hook stalls agent indefinitely	Always set explicit `timeout`
Committing `settings.local.json`	Personal hooks leak to team	Keep team hooks in `settings.json`
Extra text before JSON output	JSON validation failure; hook ignored	Print only valid JSON to stdout
Hard-coded absolute paths	Portability failure across machines	Use `$CLAUDE_PROJECT_DIR`
Non-login shell assumptions	nvm/pyenv not loaded → wrong runtime	Explicitly source version managers in hook script

Real-world impact: A team with a 40-second TypeScript test suite across 12 monorepo packages eliminated manual cleanup passes using Stop hooks, recovering several hours of developer time weekly.^[11]

Section 2: Scheduled Execution — CronCreate, /loop, Self-Pacing

Claude Code v2.1.72+ supports three scheduling mechanisms: Cloud Routines (Anthropic-hosted, no local machine required), Desktop Scheduled Tasks (machine-local cron), and the in-session /loop command.^[2]^[20] All three use 5-field cron expressions internally but differ on persistence, local file access, and minimum interval.

Scheduling Options Comparison

/loop Command: Four Invocation Modes

Dimension	Cloud Routines	Desktop Scheduled Tasks	/loop
Runs on	Anthropic cloud	Local machine	Local machine
Machine must be on	No	Yes	Yes
Open session required	No	No	Yes
Persistent across restarts	Yes	Yes	Restored on `--resume` if unexpired
Local file access	No (fresh clone)	Yes	Yes
MCP servers	Connectors configured per task	Config files + connectors	Inherits from session
Permission prompts	No (autonomous)	Configurable per task	Inherits from session
Minimum interval	1 hour	1 minute	1 minute
Auto-expiry	Configurable	Configurable	7 days

Syntax	Behavior
`/loop 5m check the deploy`	Fixed cron schedule, specified interval
`/loop check the deploy`	Claude picks interval dynamically after each iteration
`/loop`	Runs built-in maintenance prompt
`/loop 20m /review-pr 1234`	Re-runs a packaged workflow slash command every 20 minutes

^[2] Supported interval units: s, m, h, d. Non-standard intervals (e.g., 7m, 90m) round to the nearest valid step; Claude reports the chosen interval.^[2]

Dynamic Self-Pacing

When no interval is specified, Claude selects delay dynamically after each iteration — shorter while a build is finishing or a PR has activity, longer when nothing is pending. It may use the Monitor tool for event-driven triggering instead of polling.^[2] On Bedrock, Vertex AI, and Microsoft Foundry, no-interval loops fall back to a fixed 10-minute schedule.^[2]

Built-in Maintenance Prompt (Bare /loop)

Running /loop with no arguments activates a built-in protocol that executes in order:^[2]^[20]

Irreversible actions (pushing, deleting) proceed only when continuing already-authorized work. Irreversible new initiatives require human initiation.^[2]

loop.md Customization

Path	Scope
`.claude/loop.md`	Project-level (takes precedence)
`~/.claude/loop.md`	User-level (any project without own loop.md)

Edits take effect on the next iteration. Maximum size: 25,000 bytes. Disable scheduling entirely: CLAUDE_CODE_DISABLE_CRON=1.^[2]^[20]

Underlying Tools: CronCreate / CronList / CronDelete

Execution Model and Jitter

CronCreate Expression Reference

Tool	Purpose
`CronCreate`	Schedule a new task. Accepts 5-field cron expression, prompt, recur flag
`CronList`	List all tasks with IDs, schedules, prompts
`CronDelete`	Cancel a task by ID

Expression	Meaning
`/5 * * *`	Every 5 minutes
`0 * * * *`	Every hour on the hour
`0 9 * * *`	Every day at 9am local
`0 9 * * 1-5`	Weekdays at 9am local
`30 14 15 3 *`	March 15 at 2:30pm local

Extended syntax (L, W, ?, name aliases like MON/JAN) not supported.^[2]

Section 3: Headless Mode — --print / -p / --bare

The -p / --print flag runs Claude non-interactively, outputs to stdout, and exits. Previously called "headless mode"; the umbrella term is now "Agent SDK CLI."^[9]^[12] The --bare flag skips auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md — ensuring reproducible behavior in CI. Anthropic has signaled --bare will become the default for -p in a future release.^[9]

Output Formats

Permission Modes

--bare Context Loading Flags

Format Flag	Output	Best For
`--output-format text`	Plain text (default)	Human-readable logs
`--output-format json`	JSON with result + session_id + metadata	Pipeline chaining, CI artifacts
`--output-format json` + `--json-schema`	Schema-enforced structured JSON	Typed extraction tasks
`--output-format stream-json`	Real-time streaming JSON events	Live monitoring, retry event handling

Mode	Flag	What It Permits
Default	(none)	Interactive prompts for most writes and shell commands
dontAsk	`--permission-mode dontAsk`	Denies anything not in `permissions.allow` or read-only set
acceptEdits	`--permission-mode acceptEdits`	File writes without prompts; auto-approves mkdir/touch/mv/cp; shell/network still need explicit allow
bypassPermissions	`--dangerously-skip-permissions`	All interactive confirmations bypassed — isolated environments only

To Load	Flag
System prompt additions	`--append-system-prompt`, `--append-system-prompt-file`
Settings	`--settings <file-or-json>`
MCP servers	`--mcp-config <file-or-json>`
Custom agents	`--agents <json>`
Plugin directory	`--plugin-dir <path>`

^[9] Authentication in --bare mode must come from ANTHROPIC_API_KEY (OAuth and keychain are skipped).^[9]

Multi-Turn Session Continuation

Retry Event Handling (stream-json)

When an API request fails with a retryable error, a system/api_retry event is emitted containing:^[9]^[12]

Multi-Agent Output Chaining Pattern

Cron Integration Requirements

Section 4: Worktree Isolation — Task Tool, EnterWorktree, Parallel Agents

Field	Description
`attempt`	Current attempt number, starting at 1
`max_retries`	Total retries permitted
`retry_delay_ms`	Milliseconds until next attempt
`error_status`	HTTP status code
`error`	Category: rate_limit, server_error, authentication_failed, billing_error, etc.

Without isolation, concurrent agents writing to the same repository produce uncommitted change collisions — Agent A's edits to config.py become mixed with Agent B's edits, making clean rollbacks impossible.^[28] Git worktrees solve this by giving each agent its own working directory with an independent branch, while sharing the same object store.

Task Tool with isolation: "worktree"

Each subagent gets its own context window, its own tool set, and its own git worktree with no shared file state. The worktree is automatically cleaned up if the agent makes no changes; otherwise the path and branch are returned to the caller.^[28]

Worktree Architecture (MindStudio Pattern, 2026)

Plane	Directory	Purpose
Control Plane	`.tasks/`	Task metadata (status, worktree binding)
Execution Plane	`.worktrees/`	Isolated directories with task-specific git branches
Event Log	`.worktrees/events.jsonl`	Lifecycle events for audit and recovery
Index	`.worktrees/index.json`	Active worktree registry (crash-durable)

^[28] File-based state in .tasks/ and .worktrees/index.json provides durability across crashes. Conversation memory remains volatile — worktree state does not.

Worktree Lifecycle Operations

Parallel Agent Pattern (Production 2026)

Operation	Effect	State Transition
Creation	Task persisted to `.tasks/`	→ `pending`
Binding	Worktree created, connected to task	→ `in_progress`
Isolation	Commands execute within worktree cwd	(active)
Cleanup (keep)	`worktree_keep()` — preserves branch	→ `complete`
Cleanup (remove)	`worktree_remove(complete_task=True)`	→ `complete`

"One agent refactoring authentication while another builds the payment API, both producing separate, mergeable PRs by morning."^[28]

Internal Agentic Behavior Loop

Section 5: Autonomous Agent Tool Comparison

5.1 Aider: Scripting and Autonomous Loop

Aider supports non-interactive operation via --message / --yes flags. Git integration is on by default: auto-commits each edit set with Conventional Commits format using a weak model, prepending "(aider)" to author/committer names. Pre-existing uncommitted changes are committed first (user/AI separation).^[3]^[13]

Key Automation Flags

Production Autonomous Loop Patterns (2026)

Flag	Default	Description
`--message` / `-m`	—	Single instruction; process and exit
`--message-file` / `-f`	—	Load instructions from file
`--yes` / `--yes-always`	False	Auto-confirm all prompts (headless)
`--auto-commits`	True	Auto-commit LLM changes
`--dirty-commits`	True	Commit dirty files before edits
`--no-git`	False	Disable git integration
`--dry-run`	False	Execute without modifying files
`--commit`	—	Commit all pending changes, then exit

Aider limitations: No built-in loop/cron mechanism (must wrap in external shell/cron). No native retry logic on LLM failures. Python API may break between releases.^[21]

5.2 Devin (Cognition AI): Fully Autonomous Cloud Agent

Devin operates in a sandboxed cloud environment with a code editor, terminal, web browser, and planning tools. Users describe tasks; Devin executes without constant oversight and submits PRs when complete. Cost: $500/month — break-even requires saving 12-20 developer hours monthly at $75/hour.^[4]

Performance Data: 5 Real-World Codebases

18-Month Production Improvements (Cognition Review, 2025)

Documented Weaknesses

Task Type	Success Rate	Quality Notes
Test writing	82%	Good
Clear bug fixes	78%	Good
Well-defined features	65%	Good
SWE-bench Verified	~44%	Benchmark
Vague bug fixes	35%	Mixed quality
Ambiguous features	25%	Poor quality

Metric	Change
Problem-solving speed	4× faster
Resource consumption	2× more efficient
PR merge rate	34% → 67%
Security vulnerability resolution	20× efficiency gain (30 min → 1.5 min)
Java migration speed	14× faster than humans
Test coverage uplift	50-60% → 80-90%

Data conflict: raw_4.md states a real-world fail rate of approximately 85% on tasks outside SWE-bench conditions; raw_22.md shows a 67% PR merge rate (implying 33% fail rate). Discrepancy likely reflects different task difficulty distributions and measurement methodologies.^[4]^[22]

5.3 Goose (Block/AAIF): Open-Source Local Agent

Goose is a Rust-based autonomous agent under the Agentic AI Foundation (AAIF) at the Linux Foundation (formerly block/goose). Apache 2.0 license. 29,400+ GitHub stars, 368 contributors, 2,600+ forks.^[5]^[14]^[24]

Goose Architecture (Three Layers)

"At its core, goose is very much about pushing autonomy, so we let the agent loop run as far as it can, so if it stumbles, if it hits obstacles, it'll back up and it'll try another approach."^[14]

Recipes: Portable YAML Workflow Units

Recipes support composable sub-recipes for multi-stage pipelines. Real-world reported: 45-minute status prep reduced to under 5 minutes using a recipe combining GitHub, Linear, and Notion extensions.^[5]^[24]

Local/Privacy-First Operation

5.4 OpenAI Codex CLI: Agent Loop Architecture

Configuration	Detail
Local LLM backend	Ollama at localhost:11434 (llama3:70b or similar)
Data transmission	Zero — no prompts, code, or outputs transmitted externally
Compliance	GDPR/DSGVO compliant (confirmed no external data transfer)
Hardware requirement	48GB+ RAM for 70B models

Codex CLI sends HTTP requests to OpenAI's Responses API in a continuous cycle: user input → API request → model generates output or requests tool calls → Codex executes tools locally → resubmits expanded context → loops until model stops emitting tool calls.^[17]

Prompt caching: Cache hits occur when the new prompt contains an exact prefix match with the previous inference call, enabling linear-time sampling despite growing context sizes. Critical for long-horizon coding tasks.^[17]

Approval Modes

Mode	Behavior
`suggest`	Proposes changes; needs human approval
`auto-edit`	Edits files without prompting
`full-auto`	Runs commands and edits without any prompts

Model timeline: GPT-5.2-Codex (December 2025) described as "the moment that people began to believe that using autonomous coding agents could be reliable." GPT-5.3-Codex and GPT-5.4 added in 2026.^[17]

5.5 Cursor Composer 2: IDE-Native Autonomous Agent

Cursor Composer 2 executes multi-file coding tasks autonomously across entire codebases. Shifts from isolated function generation to complete feature implementation. As of 2026, 50%+ of Fortune 500 companies use Cursor; SOC 2 Type II, GDPR, HIPAA compliant.^[29]

Three-Phase Execution Model

Phase	What Happens
Explore	Examines relevant codebase portions
Plan	Formulates plan; optional human approval gate
Execute	Runs autonomously with real-time feedback stream; self-corrects on failures

^[29] Real-time RL training: Cursor deploys improved model checkpoints as often as every 5 hours using real user interaction data.^[29]

Five Production Patterns (Cursor Composer 2)

5.6 GitHub Agentic Workflows + Sweep: Issue-to-PR Automation

GitHub Agentic Workflows (Technical Preview, 2026): Markdown files combining YAML config with natural language task descriptions, built on GitHub Actions infrastructure. Safety-first design: isolated sandboxes, read-only by default, write actions require review approval.^[23]

Sweep AI (open-source): Triggered by adding a sweep label to any GitHub issue or prefixing the title with "Sweep:". Automated flow: reads codebase → posts comment outlining plan → developer replies to adjust → creates branch → makes incremental commits (test-first) → submits PR.^[23]

Sweep: Effective vs. Failure Use Cases

Tool Comparison Matrix

Section 6: Mutation Testing — Mutmut, Stryker, PIT

Effective	Failure Modes
Typo fixes, config updates	Ambiguous requirements → loops without useful code
Simple API endpoints	Complex business logic → surface-level implementation
Documentation generation	Regulatory implications → missed without domain context
Well-described bug fixes	Migrations requiring deep domain context → breakdown

Dimension	Claude Code	Aider	Devin	Goose	Codex CLI	Cursor Composer 2
Auto-commit	Requires hook or explicit instruction	Default on after each change	Automatic (PR on completion)	Via recipe steps	Requires `full-auto` mode	Explicit review before commit
File targeting	Autonomous discovery	Requires explicit file list	Autonomous	Autonomous	Autonomous (sandboxed paths)	Autonomous (full codebase)
Loop primitives	`/loop` + CronCreate + self-pacing	No built-in loop; wrap in shell	Built-in cloud agent loop	Built-in indefinite agent loop	Continuous until done	Single session; manual re-trigger
Hook system	PostToolUse, Stop, PreToolUse, 5 types	None	None (cloud-managed)	None	None (externally via CI)	None (verification script pattern)
Cost	$20–$200/month	Free (API costs only)	$500/month	Free (Apache 2.0; API costs only)	Free CLI (API costs only)	$20–$40/month
Data privacy	Cloud API required	Cloud API required	Cloud (Cognition hosted)	Can run fully local (Ollama)	Cloud API required	Cloud; SOC 2 Type II
Non-interactive flag	`-p` / `--print`	`--yes`	Task submission UI/API	`--auto`	`exec` subcommand	No headless mode

Mutation testing introduces deliberate code modifications and verifies whether tests detect them. Score formula: killed mutants / total mutants. "Code coverage tells you what ran. Mutation testing tells you what your tests would actually catch if the code were wrong."^[6]

Why AI Code Specifically Needs Mutation Testing

Survival rates on AI-generated code are 15-25% higher than on human-written code at equivalent coverage levels.^[6] AI produces "structurally correct but semantically drifted" code lacking domain understanding. Documented incident: 92% coverage + passing tests failed to catch a deduplication bug using reference equality instead of business key comparison.^[6]

Tool Landscape by Language

Stryker (JavaScript/TypeScript) Performance

Language	Tool	Maturity	Key Limitation
Java	PIT (pitest.org)	Most mature	JVM-only
Python	mutmut, cosmic-ray	Good	Slower than compiled-language tools
JavaScript/TypeScript	Stryker	Most mature JS option	Does not support Vitest browser mode
.NET	Stryker.NET	Mature	—
Rust	cargo mutants	Early stage	Limited mutation operators
Go	go-mutesting	Early stage	Limited mutation operators

Per-test coverage analysis runs only relevant tests per mutant. With incremental mode and parallel execution: "Run mutation testing on every PR in 1-5 minutes for most codebases." In generative AI pipelines, parallel execution cut CI time from 45 to 18 minutes.^[25]

Critical Stryker limitation: Does NOT support Vitest's browser mode because "their instrumentation assumes Node.js execution, but browser mode runs tests in actual Chromium via Playwright." Agent-based manual mutation testing is the fallback for browser-mode stacks.^[15]

Stryker .NET Performance Benchmark (2025 IEEE Software)

PIT (Java) Configuration

Agent Skill Tool Detection Logic

Metric	Result
Mutant generation speed vs. Python tools	2× faster
Precision for computer vision apps	12% higher
Speedup on 32-core cluster (GitLab/Azure DevOps)	8× via parallelization

An autonomous agent can select the right mutation tool by inspecting project files:^[25]

File Present	Tool Invoked
`Cargo.toml`	`cargo mutants`
`package.json`	`npx stryker run`
`pyproject.toml` / `setup.py`	`mutmut run`

Quality gate: 100% kill rate required (all mutants must be killed) in the agent skill pattern — below 100% blocks the PR and lists survivors with specific test recommendations. Scope: changed files only (not entire codebase).^[25]

Recommended CI Thresholds

Autonomous Mutation-Testing Loop Vision

Context	Threshold
Newly created AI code	80% minimum
Critical domains (authentication, data integrity)	90%
AI modifying existing code	No decrease allowed

"A living mutant will spawn a secondary 'Test Generator' agent to create a test case to kill it, before a PR is even reviewed by a human."^[6]

AI Agent Manual Mutation Testing (When Tools Don't Work)

Claude Code can execute mutation testing manually when automated tools don't support the stack:^[15]^[25]

Real result: 38% mutation score (8 surviving mutants of 13 total) on a settings feature. Gaps found: volume boundary constraints, DOM class assignments, error handling paths.^[15]

Section 7: Property-Based Testing — Hypothesis, fast-check

Property-based testing specifies a general invariant and relies on the framework to generate thousands of inputs that might violate it, instead of verifying specific hand-picked examples. The framework shrinks failures to the smallest reproducing input.^[7]^[16]^[26]

Anthropic Research: Autonomous Property-Based Testing Agent

Published paper: "Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem" (NeurIPS 2025 Deep Learning for Code Workshop). arXiv reference: 2510.09907.^[7]^[16]

Agent Workflow (5 Steps)

Step	Action
1. Code analysis	Target code, documentation, codebase relationships
2. Property inference	Derived from type hints, docstrings, function names
3. Test generation	Hypothesis tests written for discovered properties
4. Execution + self-reflection	Distinguish real bug vs. false positive
5. Bug reporting	Formatted reports for confirmed issues

^[7]^[16]^[26] Agent uses a to-do list for long-range reasoning across multiple files. Self-reflection loop minimizes false positives.^[7]

Large-Scale Evaluation Results (100 Popular Python Packages)

Confirmed Bugs Merged into Production Codebases

Hypothesis CI Profile Configuration

Metric	Value
Total bug reports generated	984
Overall validity rate	56%
Valid AND reportable	32%
Top-ranked validity	86%
Top-ranked reportable	81%
Cost per bug report	$5.56
Cost per valid bug	~$9.93

Package	Bug Found	Impact
NumPy (500M+ monthly downloads)	`numpy.random.wald` returning negative values — catastrophic cancellation error	Fix achieved "nearly ten orders of magnitude lower relative error"
AWS Lambda Powertools	`slice_dictionary()` iterator not incrementing — returned duplicate chunks	Silent data integrity failure
Hugging Face Tokenizers	Missing parenthesis in CSS output	Rendering failure
CloudFormation CLI	`item_hash()` produced identical hashes — `.sort()` returning None	Hash collision in infrastructure provisioning
SciPy, Pandas	Additional bugs found	Confirmed and merged

Profile	`max_examples`	Context
dev	25	Fast feedback during development
ci (moderate)	200	Standard CI budget^[7]
ci (thorough)	500	Extended CI budget for critical paths^[16]

Hypothesis is safe to run in parallel with pytest-xdist; each worker gets an independent random seed.^[7]

fast-check (TypeScript/JavaScript)

CI integration with FC_NUM_RUNS=1000 increases iteration count for CI runs:^[16]^[26]

Hegel (2026): Cross-Language Property Testing

A new family of PBT libraries bringing Hypothesis-quality property testing to every language. Hegel for Rust released 2026, designed for autonomous CI integration.^[26]

Combining with Mutation Testing

Properties that survive mutations are vacuous — they don't actually detect real bugs. Running mutation testing after property-based test generation identifies which properties are checking real invariants vs. which are effectively no-ops.^[26]

Known Limitations

Section 8: Security Scanning — Semgrep, Snyk

Security scanning in autonomous agent loops operates at "Level 2 Continuous AI" — AI handles routine security work (detection, triage, PR generation) while developers focus on complex problems.^[8] Common vulnerabilities in AI-generated code include exposed API keys, missing input validation, insecure default configurations, overly permissive CORS headers, SQL injection, and XSS risks.^[8]^[4]

Snyk vs. Semgrep: Technical Comparison (2026)

Critical Semgrep Architectural Limitation

Dimension	Semgrep	Snyk	Checkmarx
Primary strength	SAST, custom rules, fast PR scanning	Best-in-class SCA, IDE integration, AI fixes	Enterprise compliance (SOC2/HIPAA/PCI-DSS), deep dataflow
PR scan speed	10-30 seconds	Cloud latency (slower)	Slower
Cross-file analysis in PR scan	No — diff-aware scans only; full scans required	Yes — server-side processing	Yes
Languages	30+	30+	30+
Cost	Free CLI (commercial use)	Free tier: 200 tests/month; $50+/dev/mo	$15,000+/year
Finding suppression	Inline `// nosemgrep`	`.snyk` policy files (web UI)	Policy-based
AI fix generation	Semgrep Multimodal (March 2026)	Snyk Mission Control (autonomous PR)	Limited
Memory limit	5GB fallback; 3-hour timeout	Cloud-managed	Cloud-managed

"Semgrep's cross-file analysis does NOT run on diff-aware PR scans. It only runs on full repository scans." Vulnerabilities requiring multi-file taint tracking won't be detected until a subsequent full scan — an autonomous loop that runs only PR-triggered scans will miss cross-file injection chains.^[27]

Snyk Mission Control: Autonomous Vulnerability Loop

Semgrep Multimodal (Announced March 2026)

Combines AI reasoning with deterministic rule-based analysis. Claimed performance improvements: 8× more true positives with 50% fewer false positives compared to rule-only Semgrep.^[8]

Semgrep Scaling Constraints

Semgrep Managed Scans

Constraint	Value
Memory fallback threshold	5GB
Maximum scan timeout	3 hours
Scaling limit	Issues beyond 1,000 files
Recommended RAM per core	4-8GB

Used by 40%+ of Semgrep customers — Semgrep clones and scans repos on their own infrastructure, with no CI runner required. Eliminates the overhead of maintaining CI scan jobs for each repository.^[27]

Pre-Commit Security Gate Policies

Section 9: Integrated Loop Architecture

A production autonomous build-test-ship loop combines CC hooks, scheduled execution, headless invocations, and quality signal layers into a single pipeline. The key design principle: each quality layer fires at the appropriate loop frequency to maximize signal without burning tokens.

Quality Signal Firing Frequency

Autonomous Commit Workflow Pattern

Signal	Trigger Point	Frequency	Rationale
Format (Prettier, Ruff)	PostToolUse on Edit/Write	Per file write	Near-zero cost; immediate feedback^[11]
Lint with auto-fix (ESLint, Ruff check)	PostToolUse on Edit/Write	Per file write	Catches style violations before accumulation^[11]
Security scan (Semgrep per file)	PostToolUse on Write/Edit via MCP	Per file write	Immediate catch of new vulnerabilities^[19]
Unit test suite	Stop hook (exit 2 on failure)	Per response	Cost too high for PostToolUse frequency; Stop fires once per response^[11]
Mutation testing (changed files)	Stop hook or CI PR trigger	Per response or per PR	Cost too high for per-file; acceptable at response level^[6]^[25]
Property-based tests	Stop hook or CI PR trigger	Per response or per PR	Generated once per implementation block^[7]^[16]
SCA (Snyk Open Source)	CI PR trigger	Per PR	Dependency changes are infrequent^[27]
Full Semgrep cross-file scan	Nightly CronCreate task	Daily	Cross-file taint analysis requires full repo scan^[27]

Stage files on edit (PostToolUse), commit at Stop — stages incrementally (reversible) and commits only post-validation. Pattern sequence:^[11]

Iterative Orchestration: Python Control Loop Pattern

Agentic Testing Architecture Trends (QualiZeal Research, 2025)

Section 10: Production Patterns and Real Configurations

PostToolUse: Per-File Lint and Format Hook

^[11] Use || true guards to prevent linter errors from blocking; --no-install fails fast if deps are missing; $CLAUDE_PROJECT_DIR ensures portability. Production note: Hooks run in non-login shells — runtimes like nvm (Node.js) and pyenv (Python) are not automatically loaded. Add explicit sourcing before the case block, e.g. [ -s "$HOME/.nvm/nvm.sh" ] && source "$HOME/.nvm/nvm.sh", otherwise tools resolve to system-default versions rather than the project's pinned runtime.^[11]