Pillar: autonomous-build-loop | Date: April 2026
Scope: CC 2026 automation primitives: PostToolUse hooks for format/lint/test, CronCreate for scheduled agents, /loop self-pacing, --print headless mode, Task with isolation:worktree, EnterWorktree, plan mode. Tool comparison for autonomous agents: Aider auto-commit, Cursor Composer, Devin headless, Sweep PR-from-issue, Goose loops, Codex CLI, Cline/RooCode. Continuous quality patterns in agent loops: mutation testing (Mutmut, Stryker), property-based testing (Hypothesis, fast-check), security scanning (Semgrep, Snyk), perf regression watchers. Real public configs.
Sources: 29 gathered, consolidated, synthesized.
Critical finding: AI-generated code produces 15–25% higher mutation survival rates than human-written code at equivalent coverage levels — meaning a fully autonomous agent can achieve 100% test coverage while its tests detect zero behavioral regressions. Standard coverage metrics are a necessary but insufficient quality gate for any build loop that writes its own tests.[6]
The most consequential design decision in any CC 2026 autonomous loop is hook placement: PostToolUse fires after every tool call, Stop fires once per response. Routing a full test suite through PostToolUse adds that suite's runtime to every single file write — a 40-second TypeScript suite across 12 monorepo packages runs dozens of times per session instead of once. The correct split is format and lint on PostToolUse (near-zero cost, immediate feedback), test suites and mutation scoring on Stop (once per response, worth the latency). Teams that implement this split correctly recover measurable developer hours weekly.[11] Stop hooks that exit code 2 trigger another Claude response, creating a re-entrancy risk: without a stop_hook_active guard, a failing test suite produces an infinite token-burning remediation loop. Every production Stop hook must check this field before executing.[11]
CC 2026 provides three scheduling primitives with distinct trade-off profiles. Cloud Routines (Anthropic-hosted) require no local machine and survive reboots, but have a 1-hour minimum interval and no local file access — correct for scheduled reporting tasks, wrong for commit-loop automation. Desktop Scheduled Tasks (machine-local cron) have a 1-minute minimum interval with full file access but require the machine to be on. The in-session /loop command can be self-paced: when no interval is specified, Claude picks delay dynamically — shorter while a build is finishing, longer when nothing is pending — and the 7-day automatic expiry bounds forgotten runaway loops. Maximum 50 scheduled tasks per session via CronCreate.[2] All three mechanisms support 5-field cron expressions; extended syntax (L, W, name aliases) is not supported. Recurring tasks fire up to 10% late, capped at 15 minutes of jitter.[2]
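The jitter bound above is easy to misread, so here is a small worked calculation. It is a sketch only: the function name `max_jitter_seconds` is hypothetical, and it assumes the "10% late, capped at 15 minutes" rule applies to the interval length in seconds.

```shell
# Worst-case lateness (seconds) for a recurring task:
# 10% of the interval, capped at 15 minutes (900 s).
max_jitter_seconds() {
  local interval="$1"
  local jitter=$(( interval / 10 ))
  if (( jitter > 900 )); then
    jitter=900
  fi
  echo "$jitter"
}
```

Under that reading, an hourly task may fire up to 6 minutes late, while anything scheduled 2.5 hours apart or more hits the 15-minute cap.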
Among autonomous agent tools, Claude Code is the only system combining native hook infrastructure with three loop primitives (/loop, CronCreate, self-pacing). Aider provides no built-in loop mechanism — external shell wrappers required — and has no hook system at all. Goose (Apache 2.0, 29,400+ GitHub stars, 25+ LLM providers via MCP) is the only option that runs fully locally with Ollama (48GB+ RAM for 70B models), making it the correct choice for GDPR/DSGVO environments with data residency requirements.[5][24] Cursor Composer 2 has no headless mode at all — it requires interactive sessions, though it deploys improved model checkpoints as often as every 5 hours via real-time RL training on production interaction data and is used by 50%+ of Fortune 500 companies as of 2026.[29] Devin (Cognition AI) costs $500/month — break-even requires saving 12–20 developer hours monthly at $75/hour — and improved its PR merge rate from 34% to 67% and security vulnerability resolution time from 30 minutes to 1.5 minutes (a 20× gain) over 18 months of production operation. However, success rates collapse on vague or ambiguous tasks: 82% on test writing, 35% on vague bug fixes, 25% on ambiguous features, with the documented "last 30% problem" (core logic delivered, edge cases and integration missing).[4][22]
Mutation testing is the only quality signal that catches the coverage-without-correctness failure mode characteristic of AI-generated tests. The standard recommendation for autonomous loops: 80% minimum mutation score for newly created AI code, 90% for critical domains (authentication, data integrity), and no decrease allowed when AI modifies existing code. Stryker's parallel execution cut CI time from 45 to 18 minutes in a generative AI pipeline, and the Stryker .NET benchmark showed 8× speedup on a 32-core cluster. A critical architectural gap: Stryker does not support Vitest browser mode because its instrumentation assumes Node.js execution — manual agent-driven mutation testing is the fallback for browser-mode stacks.[15][25] The emerging production pattern is a secondary agent loop: surviving mutants trigger a "Test Generator" agent that writes tests to kill them, before human review.[6]
Anthropic's published research (NeurIPS 2025, arXiv 2510.09907) on autonomous property-based testing across 100 popular Python packages produced 984 bug reports with a 56% overall validity rate and a $5.56 cost per bug report (~$9.93 per valid bug). Top-ranked reports reached 86% validity. Confirmed bugs merged into production included a catastrophic cancellation error in numpy.random.wald (500M+ monthly downloads), a silent iterator failure in AWS Lambda Powertools slice_dictionary(), and a hash collision in CloudFormation CLI's item_hash().[16] The self-reflection loop (step 4 of 5 in the agent workflow) is mandatory — without it, domain-specific invariants generate false positives that block legitimate PRs. Model generation matters: Opus 4.1 and Sonnet 4.5 showed significantly better self-reflection than Sonnet 4, directly reducing false positive rates.[16] Hypothesis CI profiles should run 200 examples for moderate CI budgets and 500 for critical paths; the dev profile at 25 examples is too thin for catching edge cases on AI-generated code.[7]
Security scanning in autonomous loops has a structural gap: Semgrep's cross-file analysis does not run on diff-aware PR scans — it only runs on full repository scans. An agent loop running only PR-triggered Semgrep scans will miss cross-file SQL injection and XSS chains entirely. The correct architecture: Semgrep for fast per-PR scanning (10–30 seconds) plus a nightly CronCreate task running a full cross-file scan. Semgrep Multimodal (announced March 2026) claims 8× more true positives with 50% fewer false positives versus rule-only Semgrep.[8] Snyk's free tier covers 200 tests/month; commercial plans start at $50/dev/month versus Checkmarx at $15,000+/year for enterprise compliance requirements.[27] Production-grade teams combine both: Snyk for SCA and container scanning (handles cross-file analysis server-side), Semgrep for fast CI SAST on every PR. Common vulnerabilities in AI-generated code that these scanners catch: exposed API keys, missing input validation, insecure default configurations, overly permissive CORS headers — Devin specifically has documented security blindness for SQL injection and XSS risks.[4][8]
The integrated loop architecture layers quality signals by cost and frequency. Format and per-file lint run on PostToolUse (near-zero cost, no latency impact). Security scans via MCP tool hooks run on PostToolUse against Write/Edit. Unit tests run on Stop with the mandatory re-entrancy guard. Mutation testing and property-based tests run on Stop or CI PR trigger. Full Semgrep cross-file analysis runs on a nightly CronCreate. The --bare flag in claude -p headless mode skips all auto-discovery (hooks, skills, MCP, CLAUDE.md) for reproducible CI behavior — Anthropic has signaled it will become the default for -p in a future release — and requires explicit ANTHROPIC_API_KEY rather than OAuth or keychain.[9][12]
Practitioners should wire quality layers before optimizing loop speed: a loop that ships fast but skips mutation testing on AI-generated code accumulates semantic debt that coverage metrics won't surface. The concrete implementation path is: (1) PostToolUse hook for format/lint/per-file security scan, (2) Stop hook with re-entrancy guard for test suite, (3) CI PR gate for mutation score threshold (80% floor, 90% for auth/data paths), (4) property-based tests on Stop or CI for critical invariants, (5) nightly CronCreate for full Semgrep cross-file scan. This ordering installs the quality signals at the correct granularity without burning tokens on per-file test runs or missing cross-file vulnerabilities in PR-only scan configurations. Tool selection matters less than the discipline to run these checks: "success depends on workflow discipline — writing clear specifications, maintaining review gates, and implementing verification systems — rather than tool selection alone."[29]
Claude Code hooks are shell commands (or HTTP calls, MCP tool calls, prompt evaluations, or subagent invocations) triggered at specific session lifecycle points, enforcing code standards deterministically without relying on model memory.[11] "Hooks shift responsibility from the model to the runtime."[11] Multiple hooks matching the same event all execute — PostToolUse hooks run in parallel; Stop hooks run sequentially.[10] Hooks run synchronously in the agent loop, meaning slow hooks add latency to every matched tool call.[10][19]
Key finding: The distinction between PostToolUse and Stop hooks determines autonomous loop behavior. PostToolUse fires after every tool call — correct for per-file lint/format; too expensive for full test suites. Stop fires once per response — ideal for test suites, commit staging, and mutation scoring. Conflating the two is the most common autonomous-loop performance mistake.[11]
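The PostToolUse/Stop split could be wired up roughly as follows. This is a minimal sketch, not a reference configuration: the function name `write_hook_settings`, the script names under `.claude/hooks/`, and the timeout values are all placeholders, and the sketch assumes the settings layout in which each event maps to a list of matcher groups that contain a `hooks` array.

```shell
# Sketch: write a project-level hook config that keeps cheap checks on
# PostToolUse and the expensive test run on Stop.
# Assumptions: script names and timeouts are placeholders.
# Note: this overwrites any existing .claude/settings.json in the target dir.
write_hook_settings() {
  local project_dir="$1"
  mkdir -p "$project_dir/.claude"
  cat > "$project_dir/.claude/settings.json" <<'JSON'
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write|MultiEdit",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/format-and-lint.sh",
            "timeout": 30
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/run-tests.sh",
            "timeout": 300
          }
        ]
      }
    ]
  }
}
JSON
}
```

Calling `write_hook_settings .` from a repository root would produce a committable team config; the short PostToolUse timeout reflects that it runs on every matched write, the longer Stop timeout that it runs once per response.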
| Event | When It Fires | Can Block? (exit 2) | Autonomous Loop Use |
|---|---|---|---|
| SessionStart | Session begins or resumes | No | Load context, restore state, print lane status |
| UserPromptSubmit | User submits prompt, before Claude processes | Yes | Inject project context, validate prompt safety |
| PreToolUse | Before a tool call executes | Yes | Block dangerous commands (rm -rf /, force push), rewrite tool input |
| PostToolUse | After a tool call succeeds | No (provides feedback) | Per-file lint/format, security scan on write, inject lint errors as context |
| PostToolUseFailure | After a tool call fails | No | Log failures, trigger fallback paths |
| PostToolBatch | After a full batch of parallel tool calls resolves | — | Aggregate batch results before next turn |
| Stop | When Claude finishes responding | Yes (exit 2) | Full test suite, mutation scoring, commit staging, baton write |
| SessionEnd | Session terminates | No | Flush logs, release lane claims |
| Notification | Claude sends a notification | — | Forward to Slack/PagerDuty, route alerts |
| Type | Config Key | Default Timeout | Best For |
|---|---|---|---|
| Command | "type": "command" | 600s | Shell scripts, linters, test runners; receives JSON on stdin |
| HTTP | "type": "http" | 30s | Webhooks, external monitoring APIs |
| MCP Tool | "type": "mcp_tool" | 30s | Security scanners, Snyk, custom analysis tools via MCP |
| Prompt | "type": "prompt" | 30s | Yes/no policy evaluation by Claude (e.g., "is this change safe?") |
| Agent | "type": "agent" | 60s | Spawn subagents for complex verification tasks |
| Exit Code | Effect | Notes |
|---|---|---|
| 0 | Success; JSON stdout parsed | Normal path |
| 2 | Blocking error (for events that support blocking) | stderr shown to Claude; Claude asked to fix before continuing |
| Other | Non-blocking error | First line of stderr shown in transcript; execution continues |
[1] PostToolUse decision: "block" does NOT prevent tool execution (tool already ran); it instructs Claude to fix the flagged issue before its next action.[10]
| File | Scope | Git-tracked? |
|---|---|---|
| ~/.claude/settings.json | User-wide (all projects) | No |
| .claude/settings.json | Team repo | Yes (committable) |
| .claude/settings.local.json | Personal overrides | No (gitignored) |
| Plugin hooks/hooks.json | While plugin is enabled | Plugin-managed |
| Skill/agent frontmatter | While component is active | Skill-managed |
[1][11] Hooks merge additively across all files — a user-level stop hook fires alongside a project-level stop hook.
| Pattern | Evaluated As | Example |
|---|---|---|
| "*", "", or omitted | Match all tools | "matcher": "*" |
| Letters/digits/underscores/pipe only | Exact string or pipe-separated list | "Edit\|Write\|MultiEdit" |
| Contains other characters | JavaScript regex | "mcp__memory__.*" |
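The three matcher rules in the table can be expressed as a small classifier. This is an illustrative sketch of the stated rules only; `classify_matcher` is a hypothetical helper, not part of Claude Code, and the real matcher may treat edge cases differently.

```shell
# Classify a hook matcher pattern following the table's three rules.
# Prints: all | exact | regex
classify_matcher() {
  local pattern="${1-}"
  # Keep the regex in a variable so bash does not reparse the '|' character.
  local simple_re='^[A-Za-z0-9_|]+$'
  if [[ -z "$pattern" || "$pattern" == "*" ]]; then
    echo "all"      # "*", "", or omitted
  elif [[ "$pattern" =~ $simple_re ]]; then
    echo "exact"    # exact tool name or pipe-separated list
  else
    echo "regex"    # anything else is evaluated as a regular expression
  fi
}
```

For example, `classify_matcher 'Edit|Write|MultiEdit'` prints `exact`, while `classify_matcher 'mcp__memory__.*'` prints `regex` because of the `.` and `*` characters.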
Every hook receives a JSON payload on stdin. PostToolUse includes:[1][10]
{
"session_id": "abc123",
"transcript_path": "/Users/.../.claude/projects/.../transcript.jsonl",
"cwd": "/Users/...",
"permission_mode": "default",
"hook_event_name": "PostToolUse",
"tool_name": "Write",
"tool_input": { "file_path": "/path/to/file.txt", "content": "file content" },
"tool_response": { "filePath": "/path/to/file.txt", "success": true },
"tool_use_id": "toolu_01ABC123...",
"duration_ms": 12
}
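A PostToolUse command hook typically pulls `tool_input.file_path` out of this payload and acts on it. The sketch below assumes `jq` is installed and that `npx eslint` is the project's linter; the function names and the choice to lint only TypeScript/JavaScript files are illustrative, not taken from the source.

```shell
# Sketch of a PostToolUse command hook body.
# Assumptions: jq is available; eslint is the linter (swap in your own).

# Extract tool_input.file_path from a payload passed as a JSON string.
payload_file_path() {
  jq -r '.tool_input.file_path // empty' <<<"$1"
}

# Decide whether a path is worth linting.
is_lintable() {
  case "$1" in
    *.ts|*.tsx|*.js|*.jsx) return 0 ;;
    *) return 1 ;;
  esac
}

# Main logic: returning 2 sends stderr back to Claude as feedback.
lint_hook_main() {
  local payload file
  payload=$(cat)                      # payload arrives on stdin
  file=$(payload_file_path "$payload")
  is_lintable "$file" || return 0     # skip files we do not lint
  if ! npx eslint "$file" >&2; then   # lint output goes to stderr
    echo "Lint failed for $file; fix the reported errors." >&2
    return 2
  fi
}
```

In an actual hook script the last line would be a call to `lint_hook_main`, so the script's exit status is the function's return value. eslint exits non-zero only on errors by default, which keeps style warnings from triggering blocking feedback.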
| Field | Description |
|---|---|
| timeout | Seconds before canceling. Command default: 600s |
| statusMessage | Custom spinner text shown while hook runs |
| once | If true, runs once per session (skill frontmatter only) |
| async | Background execution without blocking agent loop (command hooks only) |
| asyncRewake | Background execution; wake session if exits with code 2 (command hooks only) |
| if | Permission rule: "Bash(git *)", "Edit(*.ts)" |
Stop hooks that exit with code 2 trigger a new Claude response. Without a guard, the cycle is: failing tests → remediation pass → Stop fires again → unbounded token consumption. The mandatory guard:[11]
INPUT=$(cat)  # the hook payload arrives as JSON on stdin
ALREADY_LOOPING=$(echo "$INPUT" | jq -r '.stop_hook_active // false')
if [[ "$ALREADY_LOOPING" == "true" ]]; then exit 0; fi
The stop_hook_active field is injected automatically when a Stop hook is the cause of the current response, giving hooks a reliable re-entrancy check.[11]
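Putting the guard together with a test run, a complete Stop hook might look like the sketch below. It assumes `jq` is installed; `TEST_CMD` is a hypothetical override variable (defaulting to `npm test`), and the function names are illustrative.

```shell
# Decide the Stop hook's exit code from two inputs:
#   $1 = value of .stop_hook_active ("true" or "false")
#   $2 = exit status of the test command
stop_hook_exit_code() {
  local already_looping="$1" test_status="$2"
  if [[ "$already_looping" == "true" ]]; then
    echo 0   # guard: never trigger a second remediation pass
  elif (( test_status != 0 )); then
    echo 2   # block: stderr goes back to Claude for one fix attempt
  else
    echo 0   # tests passed
  fi
}

# Hook entry point: reads the payload from stdin, runs tests once.
stop_hook_main() {
  local input active status=0
  input=$(cat)
  active=$(jq -r '.stop_hook_active // false' <<<"$input")
  if [[ "$active" != "true" ]]; then
    # TEST_CMD is a hypothetical override; test output goes to stderr.
    ${TEST_CMD:-npm test} >&2 || status=$?
  fi
  return "$(stop_hook_exit_code "$active" "$status")"
}
```

Separating the decision from the I/O keeps the re-entrancy rule testable on its own: a failing suite yields exit code 2 exactly once, and the follow-up response (where `stop_hook_active` is true) always exits 0.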
| Pitfall | Consequence | Fix |
|---|---|---|
| Missing stop_hook_active guard | Runaway token consumption | Add re-entrancy check (see above) |
| Full test suite in PostToolUse | Latency on every file write | Move tests to Stop hook |
| exit 2 for style warnings | Wasted turns on minor issues | Reserve exit 2 for blocking failures only |
| No timeout set | Hung hook stalls agent indefinitely | Always set explicit timeout |
| Committing settings.local.json | Personal hooks leak to team | Keep team hooks in settings.json |
| Extra text before JSON output | JSON validation failure; hook ignored | Print only valid JSON to stdout |
| Hard-coded absolute paths | Portability failure across machines | Use $CLAUDE_PROJECT_DIR |
| Non-login shell assumptions | nvm/pyenv not loaded → wrong runtime | Explicitly source version managers in hook script |
Real-world impact: A team with a 40-second TypeScript test suite across 12 monorepo packages eliminated manual cleanup passes using Stop hooks, recovering several hours of developer time weekly.[11]
See also: Failure Containment (runaway loops and failure modes)
Claude Code v2.1.72+ supports three scheduling mechanisms: Cloud Routines (Anthropic-hosted, no local machine required), Desktop Scheduled Tasks (machine-local cron), and the in-session /loop command.[2][20] All three use 5-field cron expressions internally but differ on persistence, local file access, and minimum interval.
| Dimension | Cloud Routines | Desktop Scheduled Tasks | /loop |
|---|---|---|---|
| Runs on | Anthropic cloud | Local machine | Local machine |
| Machine must be on | No | Yes | Yes |
| Open session required | No | No | Yes |
| Persistent across restarts | Yes | Yes | Restored on --resume if unexpired |
| Local file access | No (fresh clone) | Yes | Yes |
| MCP servers | Connectors configured per task | Config files + connectors | Inherits from session |
| Permission prompts | No (autonomous) | Configurable per task | Inherits from session |
| Minimum interval | 1 hour | 1 minute | 1 minute |
| Auto-expiry | Configurable | Configurable | 7 days |
| Syntax | Behavior |
|---|---|
| /loop 5m check the deploy | Fixed cron schedule, specified interval |
| /loop check the deploy | Claude picks interval dynamically after each iteration |
| /loop | Runs built-in maintenance prompt |
| /loop 20m /review-pr 1234 | Re-runs a packaged workflow slash command every 20 minutes |
[2] Supported interval units: s, m, h, d. Non-standard intervals (e.g., 7m, 90m) round to the nearest valid step; Claude reports the chosen interval.[2]
When no interval is specified, Claude selects delay dynamically after each iteration — shorter while a build is finishing or a PR has activity, longer when nothing is pending. It may use the Monitor tool for event-driven triggering instead of polling.[2] On Bedrock, Vertex AI, and Microsoft Foundry, no-interval loops fall back to a fixed 10-minute schedule.[2]
Key finding: Dynamic self-pacing reduces unnecessary turns in quiet periods without requiring the user to guess a polling interval. The 7-day automatic expiry bounds forgotten runaway loops — a session that stalls recovers naturally rather than burning tokens indefinitely.[2]
Running /loop with no arguments activates a built-in protocol that executes in order:[2][20]
Irreversible actions (pushing, deleting) proceed only when continuing already-authorized work. Irreversible new initiatives require human initiation.[2]
| Path | Scope |
|---|---|
| .claude/loop.md | Project-level (takes precedence) |
| ~/.claude/loop.md | User-level (any project without own loop.md) |
Edits take effect on the next iteration. Maximum size: 25,000 bytes. Disable scheduling entirely: CLAUDE_CODE_DISABLE_CRON=1.[2][20]
| Tool | Purpose |
|---|---|
| CronCreate | Schedule a new task. Accepts 5-field cron expression, prompt, recur flag |
| CronList | List all tasks with IDs, schedules, prompts |
| CronDelete | Cancel a task by ID |
Maximum 50 scheduled tasks per session.[2]
| Expression | Meaning |
|---|---|
| */5 * * * * | Every 5 minutes |
| 0 * * * * | Every hour on the hour |
| 0 9 * * * | Every day at 9am local |
| 0 9 * * 1-5 | Weekdays at 9am local |
| 30 14 15 3 * | March 15 at 2:30pm local |
Extended syntax (L, W, ?, name aliases like MON/JAN) is not supported.[2]
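Because unsupported syntax is easy to slip into a schedule, a pre-flight check before calling CronCreate can help. The sketch below is a hypothetical helper, not part of Claude Code: it only checks the shape of the expression (five fields built from digits, `*`, `,`, `-`, `/`) and does not range-check values.

```shell
# Returns 0 if the expression has exactly five fields made up only of
# digits, '*', ',', '-' and '/'. This rejects L, W, ? and name aliases.
# It does not validate ranges (e.g. minute 75 would still pass).
is_supported_cron() {
  local expr="$1" field
  local -a fields
  local field_re='^[0-9*,/-]+$'
  read -r -a fields <<<"$expr"        # split on whitespace, no globbing
  (( ${#fields[@]} == 5 )) || return 1
  for field in "${fields[@]}"; do
    [[ "$field" =~ $field_re ]] || return 1
  done
  return 0
}
```

With this check, `0 9 * * 1-5` passes while `0 9 * * MON-FRI` and `0 0 L * *` are rejected before they reach the scheduler.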
The -p / --print flag runs Claude non-interactively, outputs to stdout, and exits. Previously called "headless mode"; the umbrella term is now "Agent SDK CLI."[9][12] The --bare flag skips auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md — ensuring reproducible behavior in CI. Anthropic has signaled --bare will become the default for -p in a future release.[9]
| Format Flag | Output | Best For |
|---|---|---|
| --output-format text | Plain text (default) | Human-readable logs |
| --output-format json | JSON with result + session_id + metadata | Pipeline chaining, CI artifacts |
| --output-format json + --json-schema | Schema-enforced structured JSON | Typed extraction tasks |
| --output-format stream-json | Real-time streaming JSON events | Live monitoring, retry event handling |
| Mode | Flag | What It Permits |
|---|---|---|
| Default | (none) | Interactive prompts for most writes and shell commands |
| dontAsk | --permission-mode dontAsk | Denies anything not in permissions.allow or read-only set |
| acceptEdits | --permission-mode acceptEdits | File writes without prompts; auto-approves mkdir/touch/mv/cp; shell/network still need explicit allow |
| bypassPermissions | --dangerously-skip-permissions | All interactive confirmations bypassed; isolated environments only |
| To Load | Flag |
|---|---|
| System prompt additions | --append-system-prompt, --append-system-prompt-file |
| Settings | --settings <file-or-json> |
| MCP servers | --mcp-config <file-or-json> |
| Custom agents | --agents <json> |
| Plugin directory | --plugin-dir <path> |
[9] Authentication in --bare mode must come from ANTHROPIC_API_KEY (OAuth and keychain are skipped).[9]
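A CI wrapper can combine these flags and fail fast when the API key is missing. This is a sketch under stated assumptions: `jq` is installed, `ci-settings.json` is a hypothetical settings file carrying the `permissions.allow` list, and the function names are illustrative.

```shell
# Fail fast when the key is missing, since --bare skips OAuth and keychain.
require_api_key() {
  if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
    echo "ANTHROPIC_API_KEY must be set for --bare runs" >&2
    return 1
  fi
}

# Run one headless task and print only the result text.
# ci-settings.json is a placeholder for a settings file with an allowlist.
ci_run() {
  require_api_key || return 1
  claude --bare -p "$1" \
    --settings ci-settings.json \
    --permission-mode dontAsk \
    --output-format json | jq -r '.result'
}
```

A pipeline step would then call something like `ci_run "Summarize the failing tests in this repo"`. Pairing `--bare` with `dontAsk` means nothing outside the explicit allowlist can run, which is the reproducibility property the flag is meant to provide.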
# First request
claude -p "Review this codebase for performance issues"
# Continue the most recent conversation
claude -p "Now focus on the database queries" --continue
claude -p "Generate a summary of all issues found" --continue
# Capture session ID for precise resume
session_id=$(claude -p "Start a review" --output-format json | jq -r '.session_id')
claude -p "Continue that review" --resume "$session_id"
When an API request fails with a retryable error, a system/api_retry event is emitted containing:[9][12]
| Field | Description |
|---|---|
| attempt | Current attempt number, starting at 1 |
| max_retries | Total retries permitted |
| retry_delay_ms | Milliseconds until next attempt |
| error_status | HTTP status code |
| error | Category: rate_limit, server_error, authentication_failed, billing_error, etc. |
# Agent A analyzes → Agent B fixes
# --output-format json wraps the reply in an envelope; the answer text is in .result
claude -p "Analyze failing tests and output JSON findings" --output-format json \
| jq -r '.result' \
| claude -p "Fix the failing tests based on these findings"
When integrating claude -p into system cron:[28]
Key finding: Interactive-mode skills (e.g., /commit) are unavailable in -p mode. For CI/CD pipelines, use --bare with explicit tool allowlists and describe the task directly rather than referencing slash commands.[9]
See also: Cost Optimization (token cost of headless runs), Context Continuity (session state management)
Without isolation, concurrent agents writing to the same repository produce uncommitted change collisions — Agent A's edits to config.py become mixed with Agent B's edits, making clean rollbacks impossible.[28] Git worktrees solve this by giving each agent its own working directory with an independent branch, while sharing the same object store.
isolation: "worktree"
Agent frontmatter supports:[28]
isolation: worktree # each invocation gets its own git copy
maxTurns: 50 # caps autonomous execution depth
Each subagent gets its own context window, its own tool set, and its own git worktree with no shared file state. The worktree is automatically cleaned up if the agent makes no changes; otherwise the path and branch are returned to the caller.[28]
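The same isolation can be reproduced by hand with plain `git worktree`, which is useful for seeing what each agent receives. The sketch below is not how Claude Code implements it internally; the `.worktrees/` location, the `agent/` branch prefix, and both function names are assumptions for illustration. The repository must already have at least one commit.

```shell
# Turn a free-form task label into a branch-safe slug.
agent_slug() {
  local slug
  slug=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-')
  slug="${slug#-}"   # trim a leading dash
  slug="${slug%-}"   # trim a trailing dash
  echo "$slug"
}

# Create .worktrees/<slug> on a new branch agent/<slug>; print the path.
create_agent_worktree() {
  local repo="$1" slug
  slug=$(agent_slug "$2")
  git -C "$repo" worktree add -q -b "agent/$slug" ".worktrees/$slug" >&2 || return 1
  echo "$repo/.worktrees/$slug"
}
```

Each call yields a separate working directory and branch backed by the shared object store, so two agents can edit the same file without their uncommitted changes colliding; `git worktree remove` cleans up afterwards.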
| Plane | Directory | Purpose |
|---|---|---|
| Control Plane | .tasks/ | Task metadata (status, worktree binding) |
| Execution Plane | .worktrees/ | Isolated directories with task-specific git branches |
| Event Log | .worktrees/events.jsonl | Lifecycle events for audit and recovery |
| Index | .worktrees/index.json | Active worktree registry (crash-durable) |
[28] File-based state in .tasks/ and .worktrees/index.json provides durability across crashes. Conversation memory remains volatile — worktree state does not.
| Operation | Effect | State Transition |
|---|---|---|
| Creation | Task persisted to .tasks/ | → pending |
| Binding | Worktree created, connected to task | → in_progress |
| Isolation | Commands execute within worktree cwd | (active) |
| Cleanup (keep) | worktree_keep(), preserves branch | → complete |
| Cleanup (remove) | worktree_remove(complete_task=True) | → complete |
# Spawn parallel feature agents
claude -p "Refactor authentication module" -w auth-refactor &
claude -p "Build payment API endpoints" -w payment-api &
claude -p "Fix all TypeScript type errors" -w ts-fixes &
# Wait for all to complete
wait
# Each produces a separate, mergeable PR
gh pr list
"One agent refactoring authentication while another builds the payment API, both producing separate, mergeable PRs by morning."[28]
Key finding: EnterWorktree (mid-session tool) migrates the current CLI into a git worktree when hooks block main-tree operations, eliminating the need to restart a session for branch isolation. Task with isolation: "worktree" handles isolation for subagents; EnterWorktree handles it for the parent session itself.
Claude Code's internal execution cycle within a worktree:[28]
Aider supports non-interactive operation via --message / --yes flags. Git integration is on by default: auto-commits each edit set with Conventional Commits format using a weak model, prepending "(aider)" to author/committer names. Pre-existing uncommitted changes are committed first (user/AI separation).[3][13]
| Flag | Default | Description |
|---|---|---|
| --message / -m | — | Single instruction; process and exit |
| --message-file / -f | — | Load instructions from file |
| --yes / --yes-always | False | Auto-confirm all prompts (headless) |
| --auto-commits | True | Auto-commit LLM changes |
| --dirty-commits | True | Commit dirty files before edits |
| --no-git | False | Disable git integration |
| --dry-run | False | Execute without modifying files |
| --commit | — | Commit all pending changes, then exit |
- --no-auto-commits when the prompt handles commits (avoids conflict)
- --yes for fully unattended operation
- … progress.txt, prd.json)
- --timeout as a safety net for unexpected interactive prompts[21]
Aider limitations: No built-in loop/cron mechanism (must wrap in external shell/cron). No native retry logic on LLM failures. Python API may break between releases.[21]
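Since Aider has no native retry logic, the external wrapper has to supply it. The following is a minimal sketch of such a wrapper: `retry` and `aider_pass` are hypothetical helper names, and the attempt count and delay are arbitrary placeholders.

```shell
# Retry a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# One unattended Aider pass driven by a prompt file (path passed as $1).
aider_pass() {
  retry 3 30 aider --yes --message-file "$1"
}
```

A cron entry or outer `while` loop can then call `aider_pass prompt.md` repeatedly; keeping retry separate from the loop means a transient LLM failure costs one delayed attempt rather than a skipped iteration.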
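Devin operates in a sandboxed cloud environment with a code editor, terminal, web browser, and planning tools. Users describe tasks; Devin executes without constant oversight and submits PRs when complete. Cost: $500/month. The source puts break-even at 12-20 developer hours saved monthly at $75/hour; the subscription fee alone equals roughly 6.7 hours at that rate ($500 / $75), which suggests the higher figure includes costs beyond the subscription.[4]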
| Task Type | Success Rate | Quality Notes |
|---|---|---|
| Test writing | 82% | Good |
| Clear bug fixes | 78% | Good |
| Well-defined features | 65% | Good |
| SWE-bench Verified | ~44% | Benchmark |
| Vague bug fixes | 35% | Mixed quality |
| Ambiguous features | 25% | Poor quality |
| Metric | Change |
|---|---|
| Problem-solving speed | 4× faster |
| Resource consumption | 2× more efficient |
| PR merge rate | 34% → 67% |
| Security vulnerability resolution | 20× efficiency gain (30 min → 1.5 min) |
| Java migration speed | 14× faster than humans |
| Test coverage uplift | 50-60% → 80-90% |
Data conflict: source [4] states a real-world fail rate of approximately 85% on tasks outside SWE-bench conditions, while source [22] shows a 67% PR merge rate (implying a 33% fail rate). The discrepancy likely reflects different task difficulty distributions and measurement methodologies.[4][22]
Goose is a Rust-based autonomous agent under the Agentic AI Foundation (AAIF) at the Linux Foundation (formerly block/goose). Apache 2.0 license. 29,400+ GitHub stars, 368 contributors, 2,600+ forks.[5][14][24]
"At its core, goose is very much about pushing autonomy, so we let the agent loop run as far as it can, so if it stumbles, if it hits obstacles, it'll back up and it'll try another approach."[14]
Recipes support composable sub-recipes for multi-stage pipelines. Real-world reported: 45-minute status prep reduced to under 5 minutes using a recipe combining GitHub, Linear, and Notion extensions.[5][24]
| Configuration | Detail |
|---|---|
| Local LLM backend | Ollama at localhost:11434 (llama3:70b or similar) |
| Data transmission | Zero — no prompts, code, or outputs transmitted externally |
| Compliance | GDPR/DSGVO compliant (confirmed no external data transfer) |
| Hardware requirement | 48GB+ RAM for 70B models |
Codex CLI sends HTTP requests to OpenAI's Responses API in a continuous cycle: user input → API request → model generates output or requests tool calls → Codex executes tools locally → resubmits expanded context → loops until model stops emitting tool calls.[17]
Prompt caching: Cache hits occur when the new prompt contains an exact prefix match with the previous inference call, enabling linear-time sampling despite growing context sizes. Critical for long-horizon coding tasks.[17]
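The exact-prefix condition is why agent loops append to context rather than rewrite it. The toy check below illustrates the rule on plain strings; it is a simplification (real caching operates on tokens on the provider side), and `is_cache_prefix_hit` is a hypothetical name.

```shell
# True when the new prompt begins with the previous prompt verbatim,
# which is the condition described for a cache hit.
is_cache_prefix_hit() {
  local previous="$1" next="$2"
  [[ "$next" == "$previous"* ]]
}
```

Appending a tool result keeps the earlier text as an exact prefix, so `is_cache_prefix_hit "sys|user1" "sys|user1|tool1"` succeeds; editing anything earlier in the context (for example the system prompt) breaks the prefix and forfeits the hit.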
| Mode | Behavior |
|---|---|
| suggest | Proposes changes; needs human approval |
| auto-edit | Edits files without prompting |
| full-auto | Runs commands and edits without any prompts |
Model timeline: GPT-5.2-Codex (December 2025) described as "the moment that people began to believe that using autonomous coding agents could be reliable." GPT-5.3-Codex and GPT-5.4 added in 2026.[17]
Cursor Composer 2 executes multi-file coding tasks autonomously across entire codebases, shifting the unit of work from isolated function generation to complete feature implementation. As of 2026, 50%+ of Fortune 500 companies use Cursor; it is SOC 2 Type II, GDPR, and HIPAA compliant.[29]
| Phase | What Happens |
|---|---|
| Explore | Examines relevant codebase portions |
| Plan | Formulates plan; optional human approval gate |
| Execute | Runs autonomously with real-time feedback stream; self-corrects on failures |
[29] Real-time RL training: Cursor deploys improved model checkpoints as often as every 5 hours using real user interaction data.[29]
- .cursorignore and @file references; don't flood context with irrelevant code
- tsc --noEmit + eslint + tests + build + semgrep after every Composer session
Key finding: "Success depends on workflow discipline — writing clear specifications, maintaining review gates, and implementing verification systems — rather than tool selection alone."[29]
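The post-session verification step (tsc --noEmit, eslint, tests, build, semgrep) can be captured in one script that stops at the first failure. This is a sketch: the exact commands in `verify_session` are assumptions about a typical TypeScript project and should be adjusted, and both function names are illustrative.

```shell
# Run each check in order; stop at the first failure and name it.
run_checks() {
  local check
  for check in "$@"; do
    echo "==> $check" >&2
    if ! bash -c "$check"; then
      echo "FAILED: $check" >&2
      return 1
    fi
  done
}

# Hypothetical check list mirroring the verification step in the text.
verify_session() {
  run_checks \
    "npx tsc --noEmit" \
    "npx eslint ." \
    "npm test" \
    "npm run build" \
    "semgrep scan --config auto --error"
}
```

Ordering the checks from cheapest to most expensive means a type error is reported in seconds rather than after a full build, and the single non-zero exit status makes the script usable as a CI gate.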
GitHub Agentic Workflows (Technical Preview, 2026): Markdown files combining YAML config with natural language task descriptions, built on GitHub Actions infrastructure. Safety-first design: isolated sandboxes, read-only by default, write actions require review approval.[23]
Sweep AI (open-source): Triggered by adding a sweep label to any GitHub issue or prefixing the title with "Sweep:". Automated flow: reads codebase → posts comment outlining plan → developer replies to adjust → creates branch → makes incremental commits (test-first) → submits PR.[23]
| Effective | Failure Modes |
|---|---|
| Typo fixes, config updates | Ambiguous requirements → loops without useful code |
| Simple API endpoints | Complex business logic → surface-level implementation |
| Documentation generation | Regulatory implications → missed without domain context |
| Well-described bug fixes | Migrations requiring deep domain context → breakdown |
| Dimension | Claude Code | Aider | Devin | Goose | Codex CLI | Cursor Composer 2 |
|---|---|---|---|---|---|---|
| Auto-commit | Requires hook or explicit instruction | Default on after each change | Automatic (PR on completion) | Via recipe steps | Requires full-auto mode | Explicit review before commit |
| File targeting | Autonomous discovery | Requires explicit file list | Autonomous | Autonomous | Autonomous (sandboxed paths) | Autonomous (full codebase) |
| Loop primitives | /loop + CronCreate + self-pacing | No built-in loop; wrap in shell | Built-in cloud agent loop | Built-in indefinite agent loop | Continuous until done | Single session; manual re-trigger |
| Hook system | PostToolUse, Stop, PreToolUse, 5 types | None | None (cloud-managed) | None | None (externally via CI) | None (verification script pattern) |
| Cost | $20–$200/month | Free (API costs only) | $500/month | Free (Apache 2.0; API costs only) | Free CLI (API costs only) | $20–$40/month |
| Data privacy | Cloud API required | Cloud API required | Cloud (Cognition hosted) | Can run fully local (Ollama) | Cloud API required | Cloud; SOC 2 Type II |
| Non-interactive flag | -p / --print | --yes | Task submission UI/API | --auto | exec subcommand | No headless mode |
Mutation testing introduces deliberate code modifications and verifies whether tests detect them. Score formula: killed mutants / total mutants. "Code coverage tells you what ran. Mutation testing tells you what your tests would actually catch if the code were wrong."[6]
Survival rates on AI-generated code are 15-25% higher than on human-written code at equivalent coverage levels.[6] AI produces "structurally correct but semantically drifted" code lacking domain understanding. Documented incident: 92% coverage + passing tests failed to catch a deduplication bug using reference equality instead of business key comparison.[6]
Key finding: "An agent that writes code and tests can achieve 100% coverage with tests that don't actually verify behavior." Mutation testing is the only quality signal that detects this class of failure — coverage metrics are insufficient for autonomous agent output.[15]
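The score formula and the thresholds recommended in this report (80% floor, 90% for critical domains) translate directly into a CI gate. The sketch below uses integer arithmetic and hypothetical function names; how the killed and total counts are obtained depends on the mutation tool's report format and is left out.

```shell
# Integer mutation score in percent: killed / total * 100, rounded down.
mutation_score() {
  local killed="$1" total="$2"
  if (( total <= 0 )); then
    echo 0
    return
  fi
  echo $(( killed * 100 / total ))
}

# Gate: 90% floor when $3 is "true" (critical path), 80% otherwise.
passes_mutation_gate() {
  local killed="$1" total="$2" critical="${3:-false}" floor=80
  if [[ "$critical" == "true" ]]; then
    floor=90
  fi
  (( $(mutation_score "$killed" "$total") >= floor ))
}
```

For instance, 41 killed out of 50 mutants scores 82%, which passes the general gate but fails it for an authentication module flagged as critical.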
| Language | Tool | Maturity | Key Limitation |
|---|---|---|---|
| Java | PIT (pitest.org) | Most mature | JVM-only |
| Python | mutmut, cosmic-ray | Good | Slower than compiled-language tools |
| JavaScript/TypeScript | Stryker | Most mature JS option | Does not support Vitest browser mode |
| .NET | Stryker.NET | Mature | — |
| Rust | cargo mutants | Early stage | Limited mutation operators |
| Go | go-mutesting | Early stage | Limited mutation operators |
Per-test coverage analysis runs only relevant tests per mutant. With incremental mode and parallel execution: "Run mutation testing on every PR in 1-5 minutes for most codebases." In generative AI pipelines, parallel execution cut CI time from 45 to 18 minutes.[25]
Critical Stryker limitation: Does NOT support Vitest's browser mode because "their instrumentation assumes Node.js execution, but browser mode runs tests in actual Chromium via Playwright." Agent-based manual mutation testing is the fallback for browser-mode stacks.[15]
| Metric | Result |
|---|---|
| Mutant generation speed vs. Python tools | 2× faster |
| Precision for computer vision apps | 12% higher |
| Speedup on 32-core cluster (GitLab/Azure DevOps) | 8× via parallelization |
<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.15.3</version>
  <configuration>
    <mutators>
      <mutator>DEFAULTS</mutator>
      <mutator>REMOVE_CONDITIONALS</mutator>
      <mutator>RETURN_VALS</mutator>
    </mutators>
    <outputFormats><param>HTML</param><param>XML</param></outputFormats>
  </configuration>
</plugin>
Differential mutation testing (changed files only): 2-5 minutes runtime.[6]
An autonomous agent can select the right mutation tool by inspecting project files:[25]
| File Present | Tool Invoked |
|---|---|
| Cargo.toml | cargo mutants |
| package.json | npx stryker run |
| pyproject.toml / setup.py | mutmut run |
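The file-to-tool mapping above could be sketched as a small lookup. The helper name `select_mutation_command` and the first-match-wins ordering are assumptions made for illustration; the source only gives the mapping itself.

```python
from pathlib import Path
import tempfile

# Marker files -> command, in the order the table lists them.
MUTATION_TOOLS = [
    (("Cargo.toml",), ["cargo", "mutants"]),
    (("package.json",), ["npx", "stryker", "run"]),
    (("pyproject.toml", "setup.py"), ["mutmut", "run"]),
]

def select_mutation_command(project_root):
    """Return the first matching command, or None if no marker file is present."""
    root = Path(project_root)
    for markers, command in MUTATION_TOOLS:
        if any((root / name).is_file() for name in markers):
            return command
    return None

# Usage: a throwaway directory containing only package.json selects Stryker.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "package.json").write_text("{}")
    print(select_mutation_command(tmp))  # ['npx', 'stryker', 'run']
```

In a polyglot repository more than one marker file may exist; this sketch returns only the first match, so a real agent would probably need to run one tool per detected language.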
Quality gate: 100% kill rate required (all mutants must be killed) in the agent skill pattern — below 100% blocks the PR and lists survivors with specific test recommendations. Scope: changed files only (not entire codebase).[25]
| Context | Threshold |
|---|---|
| Newly created AI code | 80% minimum |
| Critical domains (authentication, data integrity) | 90% |
| AI modifying existing code | No decrease allowed |
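One way to encode the three thresholds above as a gate is sketched below. The context keys (`new_ai_code`, `critical_domain`, `ai_modifying_existing`) and the function name are hypothetical labels, and scores are assumed to be percentages from 0 to 100.

```python
def mutation_gate(context, score, baseline=None):
    """Return True if `score` (0-100) satisfies the threshold for `context`."""
    if context == "new_ai_code":
        return score >= 80.0
    if context == "critical_domain":
        return score >= 90.0
    if context == "ai_modifying_existing":
        # "No decrease allowed": compare against the score before the change.
        if baseline is None:
            raise ValueError("baseline score required to check for a decrease")
        return score >= baseline
    raise ValueError(f"unknown context: {context}")

print(mutation_gate("new_ai_code", 84.0))                           # True
print(mutation_gate("critical_domain", 84.0))                       # False
print(mutation_gate("ai_modifying_existing", 71.5, baseline=72.0))  # False
```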
"A living mutant will spawn a secondary 'Test Generator' agent to create a test case to kill it, before a PR is even reviewed by a human."[6]
Claude Code can execute mutation testing manually when automated tools don't support the stack.[15][25]
Real result: 38% mutation score (8 surviving mutants of 13 total) on a settings feature. Gaps found: volume boundary constraints, DOM class assignments, error handling paths.[15]
See also: UI Feedback Loop (browser-mode testing stacks)
Property-based testing specifies a general invariant and relies on the framework to generate thousands of inputs that might violate it, instead of verifying specific hand-picked examples. The framework shrinks failures to the smallest reproducing input.[7][16][26]
Published paper: "Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem" (NeurIPS 2025 Deep Learning for Code Workshop). arXiv reference: 2510.09907.[7][16]
| Step | Action |
|---|---|
| 1. Code analysis | Target code, documentation, codebase relationships |
| 2. Property inference | Derived from type hints, docstrings, function names |
| 3. Test generation | Hypothesis tests written for discovered properties |
| 4. Execution + self-reflection | Distinguish real bug vs. false positive |
| 5. Bug reporting | Formatted reports for confirmed issues |
The agent uses a to-do list for long-range reasoning across multiple files; a self-reflection loop minimizes false positives.[7][16][26]
| Metric | Value |
|---|---|
| Total bug reports generated | 984 |
| Overall validity rate | 56% |
| Valid AND reportable | 32% |
| Top-ranked validity | 86% |
| Top-ranked reportable | 81% |
| Cost per bug report | $5.56 |
| Cost per valid bug | ~$9.93 |
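The last two rows are linked by simple arithmetic: dividing the cost per report by the overall validity rate gives the cost per valid bug. A short check, using only the figures in the table:

```python
def cost_per_valid_bug(cost_per_report, validity_rate):
    """Average spend needed to obtain one valid report."""
    return cost_per_report / validity_rate

# $5.56 per report at a 56% validity rate
print(round(cost_per_valid_bug(5.56, 0.56), 2))  # 9.93
```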
| Package | Bug Found | Impact |
|---|---|---|
| NumPy (500M+ monthly downloads) | numpy.random.wald returning negative values (catastrophic cancellation error) | Fix achieved "nearly ten orders of magnitude lower relative error" |
| AWS Lambda Powertools | slice_dictionary() iterator not incrementing; returned duplicate chunks | Silent data integrity failure |
| Hugging Face Tokenizers | Missing parenthesis in CSS output | Rendering failure |
| CloudFormation CLI | item_hash() produced identical hashes because .sort() returns None | Hash collision in infrastructure provisioning |
| SciPy, Pandas | Additional bugs found | Confirmed and merged |
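The `.sort()` row is a good illustration of why a property finds what example tests miss. The sketch below is a hypothetical reconstruction of that bug class, not the actual CloudFormation CLI code, and it uses a plain `random` loop as a stdlib stand-in for the input generation that Hypothesis would perform (without shrinking).

```python
import hashlib
import json
import random

def item_hash_buggy(items):
    """Bug class from the table: list.sort() sorts in place and returns None."""
    canonical = items.sort()  # always None
    return hashlib.sha256(json.dumps(canonical).encode()).hexdigest()

def item_hash_fixed(items):
    """sorted() returns the new list, so the hash depends on the contents."""
    canonical = sorted(items)
    return hashlib.sha256(json.dumps(canonical).encode()).hexdigest()

def find_counterexample(hash_fn, trials=200, seed=0):
    """Property: lists with different contents hash differently.
    Returns a violating pair, or None if the property held on every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(0, 9) for _ in range(rng.randint(1, 5))]
        b = [rng.randint(0, 9) for _ in range(rng.randint(1, 5))]
        if sorted(a) != sorted(b) and hash_fn(list(a)) == hash_fn(list(b)):
            return a, b
    return None

print(find_counterexample(item_hash_buggy) is not None)  # True: property violated
print(find_counterexample(item_hash_fixed))              # None: property holds
```

A single example-based test that only checks "the hash is a hex string" would pass on both versions; the property fails on the buggy one as soon as two different inputs are compared.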
Key finding: Model improvements are measurable in PBT performance: "Opus 4.1 and Sonnet 4.5, compared to Sonnet 4" showed significantly better self-reflection capabilities — directly reducing the false positive rate that would otherwise block autonomous PR flows.[16]
| Profile | max_examples | Context |
|---|---|---|
| dev | 25 | Fast feedback during development |
| ci (moderate) | 200 | Standard CI budget[7] |
| ci (thorough) | 500 | Extended CI budget for critical paths[16] |
Hypothesis is safe to run in parallel with pytest-xdist; each worker gets an independent random seed.[7]
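A `conftest.py` along the following lines is one way to wire the profiles above to an environment variable. This is a sketch under assumptions: the profile name `ci-thorough` and the helper names are hypothetical, and using a `HYPOTHESIS_PROFILE` variable is a project convention that this file implements rather than something the table specifies. `settings.register_profile` and `settings.load_profile` are the Hypothesis calls used.

```python
import os

# Profile name -> max_examples, mirroring the table.
# "ci-thorough" is a hypothetical name for the 500-example budget.
PROFILES = {"dev": 25, "ci": 200, "ci-thorough": 500}

def resolve_profile(env=None):
    """Pick the profile named by HYPOTHESIS_PROFILE, falling back to 'dev'."""
    env = os.environ if env is None else env
    name = env.get("HYPOTHESIS_PROFILE", "dev")
    return name if name in PROFILES else "dev"

def register_profiles():
    """Call once from conftest.py. Hypothesis is imported lazily so that
    resolve_profile() stays usable where Hypothesis is not installed."""
    from hypothesis import settings
    for name, max_examples in PROFILES.items():
        settings.register_profile(name, max_examples=max_examples)
    settings.load_profile(resolve_profile())

print(resolve_profile({"HYPOTHESIS_PROFILE": "ci"}))  # ci
print(resolve_profile({}))                            # dev
```

Falling back to `dev` on an unknown name keeps a typo in CI configuration from failing the run outright; a stricter setup might prefer to raise instead.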
CI integration with FC_NUM_RUNS=1000 increases the iteration count for CI runs.[16][26]
Hegel is a new family of PBT libraries that aims to bring Hypothesis-quality property testing to every language. Hegel for Rust was released in 2026 and is designed for autonomous CI integration.[26]
A property that still passes when the code under test is mutated is vacuous: it does not detect real bugs. Running mutation testing after property-based test generation separates the properties that check real invariants from those that are effectively no-ops.[26]
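The interaction between the two techniques can be shown with a hand-made mutant. In this sketch (all names hypothetical, stdlib `random` standing in for a PBT framework's generator), one property holds for both the original and the mutant and is therefore vacuous, while the other fails on the mutant.

```python
import random

def dedupe(xs):
    """Original: remove duplicates, keep first-seen order."""
    return list(dict.fromkeys(xs))

def dedupe_mutant(xs):
    """Mutant: deduplication removed entirely."""
    return list(xs)

def prop_subset(fn, xs):
    """Every output element came from the input (true even if fn does nothing)."""
    return all(x in xs for x in fn(xs))

def prop_no_duplicates(fn, xs):
    """Real invariant: the output contains no repeated element."""
    out = fn(xs)
    return len(out) == len(set(out))

def property_kills(prop, mutant, trials=100, seed=0):
    """True if some generated input makes the property fail on the mutant."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 3) for _ in range(rng.randint(0, 6))]
        if not prop(mutant, xs):
            return True
    return False

print(property_kills(prop_subset, dedupe_mutant))         # False: vacuous property
print(property_kills(prop_no_duplicates, dedupe_mutant))  # True: mutant killed
```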
Security scanning in autonomous agent loops operates at "Level 2 Continuous AI" — AI handles routine security work (detection, triage, PR generation) while developers focus on complex problems.[8] Common vulnerabilities in AI-generated code include exposed API keys, missing input validation, insecure default configurations, overly permissive CORS headers, SQL injection, and XSS risks.[8][4]
| Dimension | Semgrep | Snyk | Checkmarx |
|---|---|---|---|
| Primary strength | SAST, custom rules, fast PR scanning | Best-in-class SCA, IDE integration, AI fixes | Enterprise compliance (SOC2/HIPAA/PCI-DSS), deep dataflow |
| PR scan speed | 10-30 seconds | Cloud latency (slower) | Slower |
| Cross-file analysis in PR scan | No: diff-aware PR scans skip it; a full scan is required | Yes (server-side processing) | Yes |
| Languages | 30+ | 30+ | 30+ |
| Cost | Free CLI (commercial use) | Free tier: 200 tests/month; $50+/dev/mo | $15,000+/year |
| Finding suppression | Inline // nosemgrep | .snyk policy files (web UI) | Policy-based |
| AI fix generation | Semgrep Multimodal (March 2026) | Snyk Mission Control (autonomous PR) | Limited |
| Memory limit | 5GB fallback; 3-hour timeout | Cloud-managed | Cloud-managed |
"Semgrep's cross-file analysis does NOT run on diff-aware PR scans. It only runs on full repository scans." Vulnerabilities requiring multi-file taint tracking won't be detected until a subsequent full scan — an autonomous loop that runs only PR-triggered scans will miss cross-file injection chains.[27]
Semgrep's AI-assisted analysis combines AI reasoning with deterministic rule-based analysis. Claimed performance improvements: 8× more true positives with 50% fewer false positives compared to rule-only Semgrep.[8]
| Constraint | Value |
|---|---|
| Memory fallback threshold | 5GB |
| Maximum scan timeout | 3 hours |
| Scaling limit | Issues beyond 1,000 files |
| Recommended RAM per core | 4-8GB |
Semgrep's managed scanning is used by 40%+ of Semgrep customers: Semgrep clones and scans repos on its own infrastructure, with no CI runner required. This eliminates the overhead of maintaining CI scan jobs for each repository.[27]
Key finding: Production-grade security-mature teams combine both tools: "Snyk for SCA, container scanning, and IaC, paired with Semgrep for custom SAST rules and fast CI scanning." This combination covers Semgrep's cross-file gap (Snyk's server-side processing handles cross-file analysis) while using Semgrep's speed advantage for every PR.[27]
A production autonomous build-test-ship loop combines CC hooks, scheduled execution, headless invocations, and quality signal layers into a single pipeline. The key design principle: each quality layer fires at the appropriate loop frequency to maximize signal without burning tokens.
| Signal | Trigger Point | Frequency | Rationale |
|---|---|---|---|
| Format (Prettier, Ruff) | PostToolUse on Edit/Write | Per file write | Near-zero cost; immediate feedback[11] |
| Lint with auto-fix (ESLint, Ruff check) | PostToolUse on Edit/Write | Per file write | Catches style violations before accumulation[11] |
| Security scan (Semgrep per file) | PostToolUse on Write/Edit via MCP | Per file write | Immediate catch of new vulnerabilities[19] |
| Unit test suite | Stop hook (exit 2 on failure) | Per response | Cost too high for PostToolUse frequency; Stop fires once per response[11] |
| Mutation testing (changed files) | Stop hook or CI PR trigger | Per response or per PR | Cost too high for per-file; acceptable at response level[6][25] |
| Property-based tests | Stop hook or CI PR trigger | Per response or per PR | Generated once per implementation block[7][16] |
| SCA (Snyk Open Source) | CI PR trigger | Per PR | Dependency changes are infrequent[27] |
| Full Semgrep cross-file scan | Nightly CronCreate task | Daily | Cross-file taint analysis requires full repo scan[27] |
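The rationale column comes down to multiplication: a check attached to PostToolUse runs once per file write, while a check attached to Stop runs once per response. The sketch below makes that explicit. The session shape (24 file writes, 3 responses) is a hypothetical example; the 40-second suite echoes the figure used earlier in this report.

```python
def added_seconds(check_seconds, file_writes, responses, trigger):
    """Wall-clock seconds a check adds to a session under a given trigger."""
    if trigger == "PostToolUse":
        return check_seconds * file_writes
    if trigger == "Stop":
        return check_seconds * responses
    raise ValueError(f"unknown trigger: {trigger}")

# A 40 s test suite in a session with 24 file writes across 3 responses:
print(added_seconds(40, 24, 3, "PostToolUse"))  # 960
print(added_seconds(40, 24, 3, "Stop"))         # 120

# A 0.2 s formatter is cheap enough to run on every write:
print(round(added_seconds(0.2, 24, 3, "PostToolUse"), 1))  # 4.8
```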
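Stage files on edit (PostToolUse), commit at Stop: changes are staged incrementally (reversible) and committed only after validation. Pattern sequence:[11]
1. PostToolUse: stage the edited file (git add)
2. Stop: run the tests; if they pass, git commit with a descriptive message; if they fail, exit 2 → Claude fixes
3. Check stop_hook_active → skip re-commit if already in a remediation pass
The same fix-until-green idea as an external driver loop around headless mode:
# Sketch: `claude -p` and `npm test` stand in for the project's own commands.
import subprocess

def run(cmd):
    p = subprocess.run(cmd, capture_output=True, text=True)
    return p.returncode, p.stdout + p.stderr

prompt = "Implement the next task."  # hypothetical starting prompt
for _ in range(5):  # bounded, so a persistent failure cannot loop forever
    run(["claude", "-p", prompt])
    code, errors = run(["npm", "test", "--silent"])
    if code == 0:
        break
    prompt = f"Fix these errors: {errors}"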
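#!/usr/bin/env bash
# PostToolUse hook: format and lint the file Claude just wrote.
# The hook payload arrives as JSON on stdin.
set -euo pipefail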
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
[[ -z "$FILE_PATH" ]] && exit 0
cd "$CLAUDE_PROJECT_DIR"
case "$FILE_PATH" in
*.ts|*.tsx|*.js|*.jsx)
npx --no-install prettier --write "$FILE_PATH" 2>&1 || true
npx --no-install eslint --fix "$FILE_PATH" 2>&1 || true
;;
*.py)
ruff format "$FILE_PATH" 2>&1 || true
ruff check --fix "$FILE_PATH" 2>&1 || true
;;
esac
Use || true guards so that linter errors do not block the hook; --no-install fails fast if dependencies are missing; $CLAUDE_PROJECT_DIR keeps the script portable.[11] Production note: hooks run in non-login shells, so version managers such as nvm (Node.js) and pyenv (Python) are not loaded automatically. Add explicit sourcing before the case block, e.g. [ -s "$HOME/.nvm/nvm.sh" ] && source "$HOME/.nvm/nvm.sh"; otherwise tools resolve to system-default versions rather than the project's pinned runtime.[11]
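#!/usr/bin/env bash
# Stop hook: run the test suite once per response; exit 2 sends the failure back to Claude.
set -euo pipefail
INPUT=$(cat)
# Re-entrancy guard: if this Stop follows an earlier exit 2, let Claude stop.
ALREADY_LOOPING=$(echo "$INPUT" | jq -r '.stop_hook_active // false')
if [[ "$ALREADY_LOOPING" == "true" ]]; then exit 0; fi
cd "$CLAUDE_PROJECT_DIR"
# Skip the suite when no source file differs from HEAD (covers staged and unstaged edits).
if git diff --quiet HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx' '*.py' 2>/dev/null; then
  exit 0
fi
if ! npm test --silent; then
  echo "Tests failed. Review output and fix before ending." >&2
  exit 2
fi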
#!/usr/bin/env bash
# PreToolUse hook for shell commands: block anything matching a small denylist.
set -euo pipefail
INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty')
if echo "$COMMAND" | grep -qE 'rm[[:space:]]+-rf[[:space:]]+/|git[[:space:]]+push.*--force|chmod[[:space:]]+777'; then
  echo '{"decision": "block", "reason": "Blocked: contains unsafe pattern. Request manual execution."}'
  exit 0
fi
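It can help to check what a denylist of this kind does and does not catch before relying on it. The sketch below ports the same three patterns to Python's `re` module (POSIX `[[:space:]]` becomes `\s`); the function name is hypothetical, and the returned dict simply mirrors the JSON the shell version prints.

```python
import re

# Same denylist as the shell guard, with [[:space:]] written as \s.
UNSAFE = re.compile(r"rm\s+-rf\s+/|git\s+push.*--force|chmod\s+777")

def pre_tool_use_decision(command):
    """Return a block decision dict, or None to let the command run."""
    if UNSAFE.search(command):
        return {
            "decision": "block",
            "reason": "Blocked: contains unsafe pattern. Request manual execution.",
        }
    return None

for cmd in ["rm -rf /", "rm -rf ./build", "git push origin main --force", "rm -fr /"]:
    verdict = "block" if pre_tool_use_decision(cmd) else "allow"
    print(f"{cmd!r}: {verdict}")
# 'rm -rf /': block
# 'rm -rf ./build': allow
# 'git push origin main --force': block
# 'rm -fr /': allow
```

The last case shows the limit of pattern matching: reordering the flags slips past the guard, so a denylist like this is a safety net, not a sandbox.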
{
"hooks": {
"PostToolUse": [{
"matcher": "Write|Edit",
"hooks": [{
"type": "mcp_tool",
"server": "my_server",
"tool": "security_scan",
"input": { "file_path": "${tool_input.file_path}" }
}]
}]
}
}
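name: Mutation Testing
on:
  pull_request:
    branches: [main]
jobs:
  mutation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Run Stryker
        run: npx stryker run --incremental
      - name: Check mutation score
        run: |
          # Assumes the "json" reporter is enabled. The report lists per-mutant
          # statuses, so the score is derived here: detected / (detected + undetected).
          score=$(jq '[.files[].mutants[].status] as $s
            | ($s | map(select(. == "Killed" or . == "Timeout")) | length) as $d
            | ($s | map(select(. == "Survived" or . == "NoCoverage")) | length) as $u
            | if $d + $u == 0 then 100 else 100 * $d / ($d + $u) end' reports/mutation/mutation.json)
          if (( $(echo "$score < 80" | bc -l) )); then
            echo "Mutation score $score% below threshold"
            exit 1
          fi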
[tool.mutmut]
paths_to_mutate = "src/"
backup = false
runner = "python -m pytest"
tests_dir = "tests/"
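- name: Property-based tests
  env:
    HYPOTHESIS_PROFILE: ci
  # Assumes conftest.py loads the profile named by HYPOTHESIS_PROFILE (e.g. via settings.load_profile).
  run: pytest --hypothesis-seed=0 tests/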
name: Semgrep
on:
  pull_request: {}
jobs:
  semgrep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Semgrep
        run: python3 -m pip install semgrep
      - name: Run Semgrep
        run: semgrep --config=auto --json --output=results.json
        env:
          SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}
      - name: Block on high-severity findings
        run: |
          high=$(jq '[.results[] | select(.extra.severity == "ERROR")] | length' results.json)
          if [ "$high" -gt 0 ]; then exit 1; fi
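name: Security Scan
on: [pull_request]
jobs:
  snyk-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          command: code test
      - name: Generate AI fix
        if: failure()
        # ai-fix-agent is not a standard CLI; treat it as a placeholder for the team's own fix generator.
        # This step also assumes the snyk CLI is on PATH and authenticated.
        run: |
          snyk code test --json | ai-fix-agent --post-as-pr-comment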
# Headless run: bounded self-repair of a failing test suite, with JSON output for the caller.
claude --bare -p "Run the test suite. If tests fail, identify the root cause and fix it.
Then re-run tests to verify. Repeat until all tests pass or you've attempted 3 fixes." \
--allowedTools "Bash,Read,Edit" \
--output-format json
# Agent recipe: alert triage loop (goal, MCP extensions, ordered steps).
goal: Monitor system health and respond to alerts
extensions: [datadog-mcp, github-mcp, slack-mcp]
steps:
- Check active alerts in Datadog
- For each critical alert: diagnose root cause from logs
- Create GitHub issue with diagnosis
- Notify on-call via Slack
- If fixable: create PR with remediation
# Aider config (.aider.conf.yml): unattended runs with automatic commits.
auto-commits: true
dirty-commits: true
yes-always: true
model: claude-sonnet-4-5
git-commit-verify: false
# Codex CLI config: full-auto approvals inside a path sandbox.
[agent]
model = "gpt-5.3-codex"
approval_mode = "full-auto"
[system]
instructions = "You are a TypeScript expert. Always write tests."
[sandbox]
max_file_size = 10485760
allow_paths = ["./src", "./tests"]
deny_paths = [".env", "~/.ssh"]
#!/bin/bash
# Verification script: run every check in order; set -e stops at the first failure.
set -e
npx tsc --noEmit
npx eslint src/ --max-warnings=0
npm test -- --passWithNoTests
npm run build
semgrep --config=auto --error
echo "All checks passed"
See also: Failure Containment, Cost Optimization, Context Continuity, Multi-CLI Coordination