Home

Autonomous Build-Test-Ship Loop

Pillar: autonomous-build-loop | Date: April 2026
Scope: CC 2026 automation primitives: PostToolUse hooks for format/lint/test, CronCreate for scheduled agents, /loop self-pacing, --print headless mode, Task with isolation:worktree, EnterWorktree, plan mode. Tool comparison for autonomous agents: Aider auto-commit, Cursor Composer, Devin headless, Sweep PR-from-issue, Goose loops, Codex CLI, Cline/RooCode. Continuous quality patterns in agent loops: mutation testing (Mutmut, Stryker), property-based testing (Hypothesis, fast-check), security scanning (Semgrep, Snyk), perf regression watchers. Real public configs.
Sources: 29 gathered, consolidated, synthesized.

Executive Summary

Critical finding: AI-generated code produces 15–25% higher mutation survival rates than human-written code at equivalent coverage levels — meaning a fully autonomous agent can achieve 100% test coverage while its tests detect zero behavioral regressions. Standard coverage metrics are a necessary but insufficient quality gate for any build loop that writes its own tests.[6]

The most consequential design decision in any CC 2026 autonomous loop is hook placement: PostToolUse fires after every tool call, Stop fires once per response. Routing a full test suite through PostToolUse adds that suite's runtime to every single file write — a 40-second TypeScript suite across 12 monorepo packages runs dozens of times per session instead of once. The correct split is format and lint on PostToolUse (near-zero cost, immediate feedback), test suites and mutation scoring on Stop (once per response, worth the latency). Teams that implement this split correctly recover measurable developer hours weekly.[11] Stop hooks that exit code 2 trigger another Claude response, creating a re-entrancy risk: without a stop_hook_active guard, a failing test suite produces an infinite token-burning remediation loop. Every production Stop hook must check this field before executing.[11]

CC 2026 provides three scheduling primitives with distinct trade-off profiles. Cloud Routines (Anthropic-hosted) require no local machine and survive reboots, but have a 1-hour minimum interval and no local file access — correct for scheduled reporting tasks, wrong for commit-loop automation. Desktop Scheduled Tasks (machine-local cron) have a 1-minute minimum interval with full file access but require the machine to be on. The in-session /loop command can be self-paced: when no interval is specified, Claude picks delay dynamically — shorter while a build is finishing, longer when nothing is pending — and the 7-day automatic expiry bounds forgotten runaway loops. Maximum 50 scheduled tasks per session via CronCreate.[2] All three mechanisms support 5-field cron expressions; extended syntax (L, W, name aliases) is not supported. Recurring tasks fire up to 10% late, capped at 15 minutes of jitter.[2]

Among autonomous agent tools, Claude Code is the only system combining native hook infrastructure with three loop primitives (/loop, CronCreate, self-pacing). Aider provides no built-in loop mechanism — external shell wrappers required — and has no hook system at all. Goose (Apache 2.0, 29,400+ GitHub stars, 25+ LLM providers via MCP) is the only option that runs fully locally with Ollama (48GB+ RAM for 70B models), making it the correct choice for GDPR/DSGVO environments with data residency requirements.[5][24] Cursor Composer 2 has no headless mode at all — it requires interactive sessions, though it deploys improved model checkpoints as often as every 5 hours via real-time RL training on production interaction data and is used by 50%+ of Fortune 500 companies as of 2026.[29] Devin (Cognition AI) costs $500/month — break-even requires saving 12–20 developer hours monthly at $75/hour — and improved its PR merge rate from 34% to 67% and security vulnerability resolution time from 30 minutes to 1.5 minutes (a 20× gain) over 18 months of production operation. However, success rates collapse on vague or ambiguous tasks: 82% on test writing, 35% on vague bug fixes, 25% on ambiguous features, with the documented "last 30% problem" (core logic delivered, edge cases and integration missing).[4][22]

Mutation testing is the only quality signal that catches the coverage-without-correctness failure mode characteristic of AI-generated tests. The standard recommendation for autonomous loops: 80% minimum mutation score for newly created AI code, 90% for critical domains (authentication, data integrity), and no decrease allowed when AI modifies existing code. Stryker's parallel execution cut CI time from 45 to 18 minutes in a generative AI pipeline, and the Stryker .NET benchmark showed 8× speedup on a 32-core cluster. A critical architectural gap: Stryker does not support Vitest browser mode because its instrumentation assumes Node.js execution — manual agent-driven mutation testing is the fallback for browser-mode stacks.[15][25] The emerging production pattern is a secondary agent loop: surviving mutants trigger a "Test Generator" agent that writes tests to kill them, before human review.[6]

Anthropic's published research (NeurIPS 2025, arXiv 2510.09907) on autonomous property-based testing across 100 popular Python packages produced 984 bug reports with a 56% overall validity rate and a $5.56 cost per bug report (~$9.93 per valid bug). Top-ranked reports reached 86% validity. Confirmed bugs merged into production included a catastrophic cancellation error in numpy.random.wald (500M+ monthly downloads), a silent iterator failure in AWS Lambda Powertools slice_dictionary(), and a hash collision in CloudFormation CLI's item_hash().[16] The self-reflection loop (step 4 of 5 in the agent workflow) is mandatory — without it, domain-specific invariants generate false positives that block legitimate PRs. Model generation matters: Opus 4.1 and Sonnet 4.5 showed significantly better self-reflection than Sonnet 4, directly reducing false positive rates.[16] Hypothesis CI profiles should run 200 examples for moderate CI budgets and 500 for critical paths; the dev profile at 25 examples is too thin for catching edge cases on AI-generated code.[7]

Security scanning in autonomous loops has a structural gap: Semgrep's cross-file analysis does not run on diff-aware PR scans — it only runs on full repository scans. An agent loop running only PR-triggered Semgrep scans will miss cross-file SQL injection and XSS chains entirely. The correct architecture: Semgrep for fast per-PR scanning (10–30 seconds) plus a nightly CronCreate task running a full cross-file scan. Semgrep Multimodal (announced March 2026) claims 8× more true positives with 50% fewer false positives versus rule-only Semgrep.[8] Snyk's free tier covers 200 tests/month; commercial plans start at $50/dev/month versus Checkmarx at $15,000+/year for enterprise compliance requirements.[27] Production-grade teams combine both: Snyk for SCA and container scanning (handles cross-file analysis server-side), Semgrep for fast CI SAST on every PR. Common vulnerabilities in AI-generated code that these scanners catch: exposed API keys, missing input validation, insecure default configurations, overly permissive CORS headers — Devin specifically has documented security blindness for SQL injection and XSS risks.[4][8]

The integrated loop architecture layers quality signals by cost and frequency. Format and per-file lint run on PostToolUse (near-zero cost, no latency impact). Security scans via MCP tool hooks run on PostToolUse against Write/Edit. Unit tests run on Stop with the mandatory re-entrancy guard. Mutation testing and property-based tests run on Stop or CI PR trigger. Full Semgrep cross-file analysis runs on a nightly CronCreate. The --bare flag in claude -p headless mode skips all auto-discovery (hooks, skills, MCP, CLAUDE.md) for reproducible CI behavior — Anthropic has signaled it will become the default for -p in a future release — and requires explicit ANTHROPIC_API_KEY rather than OAuth or keychain.[9][12]

Practitioners should wire quality layers before optimizing loop speed: a loop that ships fast but skips mutation testing on AI-generated code accumulates semantic debt that coverage metrics won't surface. The concrete implementation path is: (1) PostToolUse hook for format/lint/per-file security scan, (2) Stop hook with re-entrancy guard for test suite, (3) CI PR gate for mutation score threshold (80% floor, 90% for auth/data paths), (4) property-based tests on Stop or CI for critical invariants, (5) nightly CronCreate for full Semgrep cross-file scan. This ordering installs the quality signals at the correct granularity without burning tokens on per-file test runs or missing cross-file vulnerabilities in PR-only scan configurations. Tool selection matters less than the discipline to run these checks: "success depends on workflow discipline — writing clear specifications, maintaining review gates, and implementing verification systems — rather than tool selection alone."[29]



Table of Contents

  1. CC 2026 Hook System: PostToolUse, PreToolUse, Stop, SessionStart
  2. Scheduled Execution: CronCreate, /loop, Self-Pacing
  3. Headless Mode: --print / -p / --bare
  4. Worktree Isolation: Task tool, EnterWorktree, Parallel Agents
  5. Autonomous Agent Tool Comparison
  6. Mutation Testing: Mutmut, Stryker, PIT
  7. Property-Based Testing: Hypothesis, fast-check
  8. Security Scanning: Semgrep, Snyk
  9. Integrated Loop Architecture
  10. Production Patterns and Real Configurations

Section 1: CC 2026 Hook System — PostToolUse, PreToolUse, Stop, SessionStart

Claude Code hooks are shell commands (or HTTP calls, MCP tool calls, prompt evaluations, or subagent invocations) triggered at specific session lifecycle points, enforcing code standards deterministically without relying on model memory.[11] "Hooks shift responsibility from the model to the runtime."[11] Multiple hooks matching the same event all execute — PostToolUse hooks run in parallel; Stop hooks run sequentially.[10] Hooks run synchronously in the agent loop, meaning slow hooks add latency to every matched tool call.[10][19]

Key finding: The distinction between PostToolUse and Stop hooks determines autonomous loop behavior. PostToolUse fires after every tool call — correct for per-file lint/format; too expensive for full test suites. Stop fires once per response — ideal for test suites, commit staging, and mutation scoring. Conflating the two is the most common autonomous-loop performance mistake.[11]

Hook Lifecycle Events

Event When It Fires Can Block? (exit 2) Autonomous Loop Use
SessionStart Session begins or resumes No Load context, restore state, print lane status
UserPromptSubmit User submits prompt, before Claude processes Yes Inject project context, validate prompt safety
PreToolUse Before a tool call executes Yes Block dangerous commands (rm -rf /, force push), rewrite tool input
PostToolUse After a tool call succeeds No (provides feedback) Per-file lint/format, security scan on write, inject lint errors as context
PostToolUseFailure After a tool call fails No Log failures, trigger fallback paths
PostToolBatch After a full batch of parallel tool calls resolves Aggregate batch results before next turn
Stop When Claude finishes responding Yes (exit 2) Full test suite, mutation scoring, commit staging, baton write
SessionEnd Session terminates No Flush logs, release lane claims
Notification Claude sends a notification Forward to Slack/PagerDuty, route alerts

[1][10][19]

Five Hook Types

Type Config Key Default Timeout Best For
Command "type": "command" 600s Shell scripts, linters, test runners — receives JSON on stdin
HTTP "type": "http" 30s Webhooks, external monitoring APIs
MCP Tool "type": "mcp_tool" 30s Security scanners, Snyk, custom analysis tools via MCP
Prompt "type": "prompt" 30s Yes/no policy evaluation by Claude (e.g., "is this change safe?")
Agent "type": "agent" 60s Spawn subagents for complex verification tasks

[19]

Exit Code Semantics

Exit Code Effect Notes
0 Success; JSON stdout parsed Normal path
2 Blocking error (for events that support blocking) stderr shown to Claude; Claude asked to fix before continuing
Other Non-blocking error First line of stderr shown in transcript; execution continues

[1] PostToolUse decision: "block" does NOT prevent tool execution (tool already ran); it instructs Claude to fix the flagged issue before its next action.[10]

Hook Configuration Locations (Precedence Order)

File Scope Git-tracked?
~/.claude/settings.json User-wide (all projects) No
.claude/settings.json Team repo Yes (committable)
.claude/settings.local.json Personal overrides No (gitignored)
Plugin hooks/hooks.json While plugin is enabled Plugin-managed
Skill/agent frontmatter While component is active Skill-managed

[1][11] Hooks merge additively across all files — a user-level stop hook fires alongside a project-level stop hook.

Matcher Pattern Rules

Pattern Evaluated As Example
"*", "", or omitted Match all tools "matcher": "*"
Letters/digits/underscores/pipe only Exact string or pipe-separated list "Edit|Write|MultiEdit"
Contains other characters JavaScript regex "mcp__memory__.*"

[19]

PostToolUse JSON Input Schema

Every hook receives a JSON payload on stdin. PostToolUse includes:[1][10]

{
  "session_id": "abc123",
  "transcript_path": "/Users/.../.claude/projects/.../transcript.jsonl",
  "cwd": "/Users/...",
  "permission_mode": "default",
  "hook_event_name": "PostToolUse",
  "tool_name": "Write",
  "tool_input": { "file_path": "/path/to/file.txt", "content": "file content" },
  "tool_response": { "filePath": "/path/to/file.txt", "success": true },
  "tool_use_id": "toolu_01ABC123...",
  "duration_ms": 12
}

Key Hook Fields

Field Description
timeout Seconds before canceling. Command default: 600s
statusMessage Custom spinner text shown while hook runs
once If true, runs once per session (skill frontmatter only)
async Background execution without blocking agent loop (command hooks only)
asyncRewake Background execution; wake session if exits with code 2 (command hooks only)
if Permission rule: "Bash(git *)", "Edit(*.ts)"

[1]

Critical Anti-Pattern: The Infinite Stop Loop

Stop hooks that exit code 2 trigger a new Claude response. Without a guard, failing tests → remediation pass → Stop fires again → infinite token consumption. The mandatory guard:[11]

ALREADY_LOOPING=$(echo "$INPUT" | jq -r '.stop_hook_active // false')
if [[ "$ALREADY_LOOPING" == "true" ]]; then exit 0; fi

The stop_hook_active field is injected automatically when a Stop hook is the cause of the current response, giving hooks a reliable re-entrancy check.[11]

Common Pitfalls

Pitfall Consequence Fix
Missing stop_hook_active guard Runaway token consumption Add re-entrancy check (see above)
Full test suite in PostToolUse Latency on every file write Move tests to Stop hook
exit 2 for style warnings Wasted turns on minor issues Reserve exit 2 for blocking failures only
No timeout set Hung hook stalls agent indefinitely Always set explicit timeout
Committing settings.local.json Personal hooks leak to team Keep team hooks in settings.json
Extra text before JSON output JSON validation failure; hook ignored Print only valid JSON to stdout
Hard-coded absolute paths Portability failure across machines Use $CLAUDE_PROJECT_DIR
Non-login shell assumptions nvm/pyenv not loaded → wrong runtime Explicitly source version managers in hook script

[11]

Real-world impact: A team with a 40-second TypeScript test suite across 12 monorepo packages eliminated manual cleanup passes using Stop hooks, recovering several hours of developer time weekly.[11]

See also: Failure Containment (runaway loops and failure modes)

Section 2: Scheduled Execution — CronCreate, /loop, Self-Pacing

Claude Code v2.1.72+ supports three scheduling mechanisms: Cloud Routines (Anthropic-hosted, no local machine required), Desktop Scheduled Tasks (machine-local cron), and the in-session /loop command.[2][20] All three use 5-field cron expressions internally but differ on persistence, local file access, and minimum interval.

Scheduling Options Comparison

Dimension Cloud Routines Desktop Scheduled Tasks /loop
Runs on Anthropic cloud Local machine Local machine
Machine must be on No Yes Yes
Open session required No No Yes
Persistent across restarts Yes Yes Restored on --resume if unexpired
Local file access No (fresh clone) Yes Yes
MCP servers Connectors configured per task Config files + connectors Inherits from session
Permission prompts No (autonomous) Configurable per task Inherits from session
Minimum interval 1 hour 1 minute 1 minute
Auto-expiry Configurable Configurable 7 days

[2][20]

/loop Command: Four Invocation Modes

Syntax Behavior
/loop 5m check the deploy Fixed cron schedule, specified interval
/loop check the deploy Claude picks interval dynamically after each iteration
/loop Runs built-in maintenance prompt
/loop 20m /review-pr 1234 Re-runs a packaged workflow slash command every 20 minutes

[2] Supported interval units: s, m, h, d. Non-standard intervals (e.g., 7m, 90m) round to the nearest valid step; Claude reports the chosen interval.[2]

Dynamic Self-Pacing

When no interval is specified, Claude selects delay dynamically after each iteration — shorter while a build is finishing or a PR has activity, longer when nothing is pending. It may use the Monitor tool for event-driven triggering instead of polling.[2] On Bedrock, Vertex AI, and Microsoft Foundry, no-interval loops fall back to a fixed 10-minute schedule.[2]

Key finding: Dynamic self-pacing reduces unnecessary turns in quiet periods without requiring the user to guess a polling interval. The 7-day automatic expiry bounds forgotten runaway loops — a session that stalls recovers naturally rather than burning tokens indefinitely.[2]

Built-in Maintenance Prompt (Bare /loop)

Running /loop with no arguments activates a built-in protocol that executes in order:[2][20]

  1. Continue any unfinished work from the conversation
  2. Tend to the current branch's PR: review comments, failed CI, merge conflicts
  3. Run cleanup passes (bug hunts, simplification) when nothing else pending

Irreversible actions (pushing, deleting) proceed only when continuing already-authorized work. Irreversible new initiatives require human initiation.[2]

loop.md Customization

Path Scope
.claude/loop.md Project-level (takes precedence)
~/.claude/loop.md User-level (any project without own loop.md)

Edits take effect on the next iteration. Maximum size: 25,000 bytes. Disable scheduling entirely: CLAUDE_CODE_DISABLE_CRON=1.[2][20]

Underlying Tools: CronCreate / CronList / CronDelete

Tool Purpose
CronCreate Schedule a new task. Accepts 5-field cron expression, prompt, recur flag
CronList List all tasks with IDs, schedules, prompts
CronDelete Cancel a task by ID

Maximum 50 scheduled tasks per session.[2]

Execution Model and Jitter

CronCreate Expression Reference

Expression Meaning
*/5 * * * * Every 5 minutes
0 * * * * Every hour on the hour
0 9 * * * Every day at 9am local
0 9 * * 1-5 Weekdays at 9am local
30 14 15 3 * March 15 at 2:30pm local

Extended syntax (L, W, ?, name aliases like MON/JAN) not supported.[2]

See also: Context Continuity (context state across loop iterations), Cost Optimization (token cost of running loops)

Section 3: Headless Mode — --print / -p / --bare

The -p / --print flag runs Claude non-interactively, outputs to stdout, and exits. Previously called "headless mode"; the umbrella term is now "Agent SDK CLI."[9][12] The --bare flag skips auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md — ensuring reproducible behavior in CI. Anthropic has signaled --bare will become the default for -p in a future release.[9]

Output Formats

Format Flag Output Best For
--output-format text Plain text (default) Human-readable logs
--output-format json JSON with result + session_id + metadata Pipeline chaining, CI artifacts
--output-format json + --json-schema Schema-enforced structured JSON Typed extraction tasks
--output-format stream-json Real-time streaming JSON events Live monitoring, retry event handling

[9][12][28]

Permission Modes

Mode Flag What It Permits
Default (none) Interactive prompts for most writes and shell commands
dontAsk --permission-mode dontAsk Denies anything not in permissions.allow or read-only set
acceptEdits --permission-mode acceptEdits File writes without prompts; auto-approves mkdir/touch/mv/cp; shell/network still need explicit allow
bypassPermissions --dangerously-skip-permissions All interactive confirmations bypassed — isolated environments only

[9][12][28]

--bare Context Loading Flags

To Load Flag
System prompt additions --append-system-prompt, --append-system-prompt-file
Settings --settings <file-or-json>
MCP servers --mcp-config <file-or-json>
Custom agents --agents <json>
Plugin directory --plugin-dir <path>

[9] Authentication in --bare mode must come from ANTHROPIC_API_KEY (OAuth and keychain are skipped).[9]

Multi-Turn Session Continuation

# First request
claude -p "Review this codebase for performance issues"

# Continue the most recent conversation
claude -p "Now focus on the database queries" --continue
claude -p "Generate a summary of all issues found" --continue

# Capture session ID for precise resume
session_id=$(claude -p "Start a review" --output-format json | jq -r '.session_id')
claude -p "Continue that review" --resume "$session_id"

[9][12]

Retry Event Handling (stream-json)

When an API request fails with a retryable error, a system/api_retry event is emitted containing:[9][12]

Field Description
attempt Current attempt number, starting at 1
max_retries Total retries permitted
retry_delay_ms Milliseconds until next attempt
error_status HTTP status code
error Category: rate_limit, server_error, authentication_failed, billing_error, etc.

Multi-Agent Output Chaining Pattern

# Agent A analyzes → Agent B fixes
claude -p "Analyze failing tests and output JSON findings" --output-format json \
  | jq '.findings' \
  | claude -p "Fix the failing tests based on these findings" --stdin

[28]

Cron Integration Requirements

When integrating claude -p into system cron:[28]

Key finding: Interactive-mode skills (e.g., /commit) are unavailable in -p mode. For CI/CD pipelines, use --bare with explicit tool allowlists and describe the task directly rather than referencing slash commands.[9]
See also: Cost Optimization (token cost of headless runs), Context Continuity (session state management)

Section 4: Worktree Isolation — Task Tool, EnterWorktree, Parallel Agents

Without isolation, concurrent agents writing to the same repository produce uncommitted change collisions — Agent A's edits to config.py become mixed with Agent B's edits, making clean rollbacks impossible.[28] Git worktrees solve this by giving each agent its own working directory with an independent branch, while sharing the same object store.

Task Tool with isolation: "worktree"

Agent frontmatter supports:[28]

isolation: worktree  # each invocation gets its own git copy
maxTurns: 50         # caps autonomous execution depth

Each subagent gets its own context window, its own tool set, and its own git worktree with no shared file state. The worktree is automatically cleaned up if the agent makes no changes; otherwise the path and branch are returned to the caller.[28]

Worktree Architecture (MindStudio Pattern, 2026)

Plane Directory Purpose
Control Plane .tasks/ Task metadata (status, worktree binding)
Execution Plane .worktrees/ Isolated directories with task-specific git branches
Event Log .worktrees/events.jsonl Lifecycle events for audit and recovery
Index .worktrees/index.json Active worktree registry (crash-durable)

[28] File-based state in .tasks/ and .worktrees/index.json provides durability across crashes. Conversation memory remains volatile — worktree state does not.

Worktree Lifecycle Operations

Operation Effect State Transition
Creation Task persisted to .tasks/ pending
Binding Worktree created, connected to task in_progress
Isolation Commands execute within worktree cwd (active)
Cleanup (keep) worktree_keep() — preserves branch complete
Cleanup (remove) worktree_remove(complete_task=True) complete

[28]

Parallel Agent Pattern (Production 2026)

# Spawn parallel feature agents
claude -p "Refactor authentication module" -w auth-refactor &
claude -p "Build payment API endpoints" -w payment-api &
claude -p "Fix all TypeScript type errors" -w ts-fixes &

# Wait for all to complete
wait

# Each produces a separate, mergeable PR
gh pr list

"One agent refactoring authentication while another builds the payment API, both producing separate, mergeable PRs by morning."[28]

Key finding: EnterWorktree (mid-session tool) migrates the current CLI into a git worktree when hooks block main-tree operations, eliminating the need to restart a session for branch isolation. Task with isolation: "worktree" handles isolation for subagents; EnterWorktree handles it for the parent session itself.

Internal Agentic Behavior Loop

Claude Code's internal execution cycle within a worktree:[28]

  1. Observe — Available context (file tree, git history, test results)
  2. Reason — Form plan spanning multiple file edits and shell commands
  3. Act — Invoke tools
  4. Verify — Check for errors or test failures
  5. Iterate — Loop if needed
See also: Multi-CLI Coordination (coordination between parallel loops)

Section 5: Autonomous Agent Tool Comparison

5.1 Aider: Scripting and Autonomous Loop

Aider supports non-interactive operation via --message / --yes flags. Git integration is on by default: auto-commits each edit set with Conventional Commits format using a weak model, prepending "(aider)" to author/committer names. Pre-existing uncommitted changes are committed first (user/AI separation).[3][13]

Key Automation Flags

Flag Default Description
--message / -m Single instruction; process and exit
--message-file / -f Load instructions from file
--yes / --yes-always False Auto-confirm all prompts (headless)
--auto-commits True Auto-commit LLM changes
--dirty-commits True Commit dirty files before edits
--no-git False Disable git integration
--dry-run False Execute without modifying files
--commit Commit all pending changes, then exit

[3][13][21]

Production Autonomous Loop Patterns (2026)

  1. Run with --no-auto-commits when the prompt handles commits (avoids conflict)
  2. Use --yes for fully unattended operation
  3. Each iteration runs with a fresh instance to prevent context window bloat
  4. Memory persists through git history and progress files (progress.txt, prd.json)
  5. Use --timeout as a safety net for unexpected interactive prompts[21]

Aider limitations: No built-in loop/cron mechanism (must wrap in external shell/cron). No native retry logic on LLM failures. Python API may break between releases.[21]

5.2 Devin (Cognition AI): Fully Autonomous Cloud Agent

Devin operates in a sandboxed cloud environment with a code editor, terminal, web browser, and planning tools. Users describe tasks; Devin executes without constant oversight and submits PRs when complete. Cost: $500/month — break-even requires saving 12-20 developer hours monthly at $75/hour.[4]

Performance Data: 5 Real-World Codebases

Task Type Success Rate Quality Notes
Test writing 82% Good
Clear bug fixes 78% Good
Well-defined features 65% Good
SWE-bench Verified ~44% Benchmark
Vague bug fixes 35% Mixed quality
Ambiguous features 25% Poor quality

[4]

18-Month Production Improvements (Cognition Review, 2025)

Metric Change
Problem-solving speed 4× faster
Resource consumption 2× more efficient
PR merge rate 34% → 67%
Security vulnerability resolution 20× efficiency gain (30 min → 1.5 min)
Java migration speed 14× faster than humans
Test coverage uplift 50-60% → 80-90%

[22]

Documented Weaknesses

Data conflict: raw_4.md states a real-world fail rate of approximately 85% on tasks outside SWE-bench conditions; raw_22.md shows a 67% PR merge rate (implying 33% fail rate). Discrepancy likely reflects different task difficulty distributions and measurement methodologies.[4][22]

5.3 Goose (Block/AAIF): Open-Source Local Agent

Goose is a Rust-based autonomous agent under the Agentic AI Foundation (AAIF) at the Linux Foundation (formerly block/goose). Apache 2.0 license. 29,400+ GitHub stars, 368 contributors, 2,600+ forks.[5][14][24]

Goose Architecture (Three Layers)

  1. Central agent loop — plan-execute-evaluate cycle runs until complete or unresolvable blocker
  2. Provider abstraction layer — 25+ LLM providers (Anthropic, OpenAI, Google, Mistral, xAI Grok, AWS Bedrock, GCP Vertex, Azure, Databricks, Ollama, Docker Model Runner, Ramalama)
  3. Extension system — built on MCP; 3,000+ available MCP servers[24]

"At its core, goose is very much about pushing autonomy, so we let the agent loop run as far as it can, so if it stumbles, if it hits obstacles, it'll back up and it'll try another approach."[14]

Recipes: Portable YAML Workflow Units

Recipes support composable sub-recipes for multi-stage pipelines. Real-world reported: 45-minute status prep reduced to under 5 minutes using a recipe combining GitHub, Linear, and Notion extensions.[5][24]

Local/Privacy-First Operation

Configuration Detail
Local LLM backend Ollama at localhost:11434 (llama3:70b or similar)
Data transmission Zero — no prompts, code, or outputs transmitted externally
Compliance GDPR/DSGVO compliant (confirmed no external data transfer)
Hardware requirement 48GB+ RAM for 70B models

[5][24]

5.4 OpenAI Codex CLI: Agent Loop Architecture

Codex CLI sends HTTP requests to OpenAI's Responses API in a continuous cycle: user input → API request → model generates output or requests tool calls → Codex executes tools locally → resubmits expanded context → loops until model stops emitting tool calls.[17]

Prompt caching: Cache hits occur when the new prompt contains an exact prefix match with the previous inference call, enabling linear-time sampling despite growing context sizes. Critical for long-horizon coding tasks.[17]

Approval Modes

Mode Behavior
suggest Proposes changes; needs human approval
auto-edit Edits files without prompting
full-auto Runs commands and edits without any prompts

[17]

Model timeline: GPT-5.2-Codex (December 2025) described as "the moment that people began to believe that using autonomous coding agents could be reliable." GPT-5.3-Codex and GPT-5.4 added in 2026.[17]

5.5 Cursor Composer 2: IDE-Native Autonomous Agent

Cursor Composer 2 executes multi-file coding tasks autonomously across entire codebases. Shifts from isolated function generation to complete feature implementation. As of 2026, 50%+ of Fortune 500 companies use Cursor; SOC 2 Type II, GDPR, HIPAA compliant.[29]

Three-Phase Execution Model

Phase What Happens
Explore Examines relevant codebase portions
Plan Formulates plan; optional human approval gate
Execute Runs autonomously with real-time feedback stream; self-corrects on failures

[29] Real-time RL training: Cursor deploys improved model checkpoints as often as every 5 hours using real user interaction data.[29]

Five Production Patterns (Cursor Composer 2)

  1. Spec-first: Provide detailed task specifications before execution (reduces hallucination significantly)
  2. Sandbox-Review-Merge (mandatory for production): Always use feature branches; review diff before merging; run tests + typecheck before PR
  3. Context management: Use .cursorignore and @file references; don't flood context with irrelevant code
  4. Verification loops (non-negotiable): tsc --noEmit + eslint + tests + build + semgrep after every Composer session
  5. Open-source alternative: MiroThinker 72B + Continue.dev for data residency requirements

[29]

Key finding: "Success depends on workflow discipline — writing clear specifications, maintaining review gates, and implementing verification systems — rather than tool selection alone."[29]

5.6 GitHub Agentic Workflows + Sweep: Issue-to-PR Automation

GitHub Agentic Workflows (Technical Preview, 2026): Markdown files combining YAML config with natural language task descriptions, built on GitHub Actions infrastructure. Safety-first design: isolated sandboxes, read-only by default, write actions require review approval.[23]

Sweep AI (open-source): Triggered by adding a sweep label to any GitHub issue or prefixing the title with "Sweep:". Automated flow: reads codebase → posts comment outlining plan → developer replies to adjust → creates branch → makes incremental commits (test-first) → submits PR.[23]

Sweep: Effective vs. Failure Use Cases

Effective Failure Modes
Typo fixes, config updates Ambiguous requirements → loops without useful code
Simple API endpoints Complex business logic → surface-level implementation
Documentation generation Regulatory implications → missed without domain context
Well-described bug fixes Migrations requiring deep domain context → breakdown

[23]

Tool Comparison Matrix

Dimension Claude Code Aider Devin Goose Codex CLI Cursor Composer 2
Auto-commit Requires hook or explicit instruction Default on after each change Automatic (PR on completion) Via recipe steps Requires full-auto mode Explicit review before commit
File targeting Autonomous discovery Requires explicit file list Autonomous Autonomous Autonomous (sandboxed paths) Autonomous (full codebase)
Loop primitives /loop + CronCreate + self-pacing No built-in loop; wrap in shell Built-in cloud agent loop Built-in indefinite agent loop Continuous until done Single session; manual re-trigger
Hook system PostToolUse, Stop, PreToolUse, 5 types None None (cloud-managed) None None (externally via CI) None (verification script pattern)
Cost $20–$200/month Free (API costs only) $500/month Free (Apache 2.0; API costs only) Free CLI (API costs only) $20–$40/month
Data privacy Cloud API required Cloud API required Cloud (Cognition hosted) Can run fully local (Ollama) Cloud API required Cloud; SOC 2 Type II
Non-interactive flag -p / --print --yes Task submission UI/API --auto exec subcommand No headless mode

[3][4][5][13][14][17][22][29]


Section 6: Mutation Testing — Mutmut, Stryker, PIT

Mutation testing introduces deliberate code modifications and verifies whether tests detect them. Score formula: killed mutants / total mutants. "Code coverage tells you what ran. Mutation testing tells you what your tests would actually catch if the code were wrong."[6]

Why AI Code Specifically Needs Mutation Testing

Survival rates on AI-generated code are 15-25% higher than on human-written code at equivalent coverage levels.[6] AI produces "structurally correct but semantically drifted" code lacking domain understanding. Documented incident: 92% coverage + passing tests failed to catch a deduplication bug using reference equality instead of business key comparison.[6]

Key finding: "An agent that writes code and tests can achieve 100% coverage with tests that don't actually verify behavior." Mutation testing is the only quality signal that detects this class of failure — coverage metrics are insufficient for autonomous agent output.[15]

Tool Landscape by Language

Language Tool Maturity Key Limitation
Java PIT (pitest.org) Most mature JVM-only
Python mutmut, cosmic-ray Good Slower than compiled-language tools
JavaScript/TypeScript Stryker Most mature JS option Does not support Vitest browser mode
.NET Stryker.NET Mature
Rust cargo mutants Early stage Limited mutation operators
Go go-mutesting Early stage Limited mutation operators

[6][15][25]

Stryker (JavaScript/TypeScript) Performance

Per-test coverage analysis runs only relevant tests per mutant. With incremental mode and parallel execution: "Run mutation testing on every PR in 1-5 minutes for most codebases." In generative AI pipelines, parallel execution cut CI time from 45 to 18 minutes.[25]

Critical Stryker limitation: Does NOT support Vitest's browser mode because "their instrumentation assumes Node.js execution, but browser mode runs tests in actual Chromium via Playwright." Agent-based manual mutation testing is the fallback for browser-mode stacks.[15]

Stryker .NET Performance Benchmark (2025 IEEE Software)

Metric Result
Mutant generation speed vs. Python tools 2× faster
Precision for computer vision apps 12% higher
Speedup on 32-core cluster (GitLab/Azure DevOps) 8× via parallelization

[15]

PIT (Java) Configuration

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.15.3</version>
  <configuration>
    <mutators>
      <mutator>DEFAULTS</mutator>
      <mutator>REMOVE_CONDITIONALS</mutator>
      <mutator>RETURN_VALS</mutator>
    </mutators>
    <outputFormats><param>HTML</param><param>XML</param></outputFormats>
  </configuration>
</plugin>

Differential mutation testing (changed files only): 2-5 minutes runtime.[6]

Agent Skill Tool Detection Logic

An autonomous agent can select the right mutation tool by inspecting project files:[25]

File Present Tool Invoked
Cargo.toml cargo mutants
package.json npx stryker run
pyproject.toml / setup.py mutmut run

Quality gate: 100% kill rate required (all mutants must be killed) in the agent skill pattern — below 100% blocks the PR and lists survivors with specific test recommendations. Scope: changed files only (not entire codebase).[25]

Recommended CI Thresholds

Context Threshold
Newly created AI code 80% minimum
Critical domains (authentication, data integrity) 90%
AI modifying existing code No decrease allowed

[6]

Autonomous Mutation-Testing Loop Vision

"A living mutant will spawn a secondary 'Test Generator' agent to create a test case to kill it, before a PR is even reviewed by a human."[6]

  1. AI generates code
  2. Mutation testing identifies surviving mutants
  3. Secondary agent generates tests to kill them
  4. Loop continues until score threshold met
  5. Human reviewer only concerned with intent, not coverage

AI Agent Manual Mutation Testing (When Tools Don't Work)

Claude Code can execute mutation testing manually when automated tools don't support the stack:[15][25]

  1. Read source code
  2. Apply one mutation at a time (boundaries, boolean logic, return values, statement removal)
  3. Run test suite
  4. Record pass/fail results
  5. Restore original code
  6. Report surviving mutants with fix suggestions

Real result: 38% mutation score (8 surviving mutants of 13 total) on a settings feature. Gaps found: volume boundary constraints, DOM class assignments, error handling paths.[15]

See also: UI Feedback Loop (browser-mode testing stacks)

Section 7: Property-Based Testing — Hypothesis, fast-check

Property-based testing specifies a general invariant and relies on the framework to generate thousands of inputs that might violate it, instead of verifying specific hand-picked examples. The framework shrinks failures to the smallest reproducing input.[7][16][26]

Anthropic Research: Autonomous Property-Based Testing Agent

Published paper: "Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem" (NeurIPS 2025 Deep Learning for Code Workshop). arXiv reference: 2510.09907.[7][16]

Agent Workflow (5 Steps)

Step Action
1. Code analysis Target code, documentation, codebase relationships
2. Property inference Derived from type hints, docstrings, function names
3. Test generation Hypothesis tests written for discovered properties
4. Execution + self-reflection Distinguish real bug vs. false positive
5. Bug reporting Formatted reports for confirmed issues

[7][16][26] Agent uses a to-do list for long-range reasoning across multiple files. Self-reflection loop minimizes false positives.[7]

Large-Scale Evaluation Results (100 Popular Python Packages)

Metric Value
Total bug reports generated 984
Overall validity rate 56%
Valid AND reportable 32%
Top-ranked validity 86%
Top-ranked reportable 81%
Cost per bug report $5.56
Cost per valid bug ~$9.93

[7][16][26]

Confirmed Bugs Merged into Production Codebases

Package Bug Found Impact
NumPy (500M+ monthly downloads) numpy.random.wald returning negative values — catastrophic cancellation error Fix achieved "nearly ten orders of magnitude lower relative error"
AWS Lambda Powertools slice_dictionary() iterator not incrementing — returned duplicate chunks Silent data integrity failure
Hugging Face Tokenizers Missing parenthesis in CSS output Rendering failure
CloudFormation CLI item_hash() produced identical hashes — .sort() returning None Hash collision in infrastructure provisioning
SciPy, Pandas Additional bugs found Confirmed and merged

[16]

Key finding: Model improvements are measurable in PBT performance: "Opus 4.1 and Sonnet 4.5, compared to Sonnet 4" showed significantly better self-reflection capabilities — directly reducing the false positive rate that would otherwise block autonomous PR flows.[16]

Hypothesis CI Profile Configuration

Profile max_examples Context
dev 25 Fast feedback during development
ci (moderate) 200 Standard CI budget[7]
ci (thorough) 500 Extended CI budget for critical paths[16]

Hypothesis is safe to run in parallel with pytest-xdist; each worker gets an independent random seed.[7]

fast-check (TypeScript/JavaScript)

CI integration with FC_NUM_RUNS=1000 increases iteration count for CI runs:[16][26]

Hegel (2026): Cross-Language Property Testing

A new family of PBT libraries bringing Hypothesis-quality property testing to every language. Hegel for Rust released 2026, designed for autonomous CI integration.[26]

Combining with Mutation Testing

Properties that survive mutations are vacuous — they don't actually detect real bugs. Running mutation testing after property-based test generation identifies which properties are checking real invariants vs. which are effectively no-ops.[26]

Known Limitations


Section 8: Security Scanning — Semgrep, Snyk

Security scanning in autonomous agent loops operates at "Level 2 Continuous AI" — AI handles routine security work (detection, triage, PR generation) while developers focus on complex problems.[8] Common vulnerabilities in AI-generated code include exposed API keys, missing input validation, insecure default configurations, overly permissive CORS headers, SQL injection, and XSS risks.[8][4]

Snyk vs. Semgrep: Technical Comparison (2026)

Dimension Semgrep Snyk Checkmarx
Primary strength SAST, custom rules, fast PR scanning Best-in-class SCA, IDE integration, AI fixes Enterprise compliance (SOC2/HIPAA/PCI-DSS), deep dataflow
PR scan speed 10-30 seconds Cloud latency (slower) Slower
Cross-file analysis in PR scan No — diff-aware scans only; full scans required Yes — server-side processing Yes
Languages 30+ 30+ 30+
Cost Free CLI (commercial use) Free tier: 200 tests/month; $50+/dev/mo $15,000+/year
Finding suppression Inline // nosemgrep .snyk policy files (web UI) Policy-based
AI fix generation Semgrep Multimodal (March 2026) Snyk Mission Control (autonomous PR) Limited
Memory limit 5GB fallback; 3-hour timeout Cloud-managed Cloud-managed

[27][8]

Critical Semgrep Architectural Limitation

"Semgrep's cross-file analysis does NOT run on diff-aware PR scans. It only runs on full repository scans." Vulnerabilities requiring multi-file taint tracking won't be detected until a subsequent full scan — an autonomous loop that runs only PR-triggered scans will miss cross-file injection chains.[27]

Snyk Mission Control: Autonomous Vulnerability Loop

  1. Snyk detects vulnerability
  2. AI agent automatically generates fix
  3. Agent creates PR
  4. Agent validates solution
  5. No manual intervention required[8][27]

Semgrep Multimodal (Announced March 2026)

Combines AI reasoning with deterministic rule-based analysis. Claimed performance improvements: 8× more true positives with 50% fewer false positives compared to rule-only Semgrep.[8]

Semgrep Scaling Constraints

Constraint Value
Memory fallback threshold 5GB
Maximum scan timeout 3 hours
Scaling limit Issues beyond 1,000 files
Recommended RAM per core 4-8GB

[27]

Semgrep Managed Scans

Used by 40%+ of Semgrep customers — Semgrep clones and scans repos on their own infrastructure, with no CI runner required. Eliminates the overhead of maintaining CI scan jobs for each repository.[27]

Key finding: Production-grade security-mature teams combine both tools: "Snyk for SCA, container scanning, and IaC, paired with Semgrep for custom SAST rules and fast CI scanning." This combination covers Semgrep's cross-file gap (Snyk's server-side processing handles cross-file analysis) while using Semgrep's speed advantage for every PR.[27]

Pre-Commit Security Gate Policies

See also: Failure Containment (blast radius of security misses in autonomous loops)

Section 9: Integrated Loop Architecture

A production autonomous build-test-ship loop combines CC hooks, scheduled execution, headless invocations, and quality signal layers into a single pipeline. The key design principle: each quality layer fires at the appropriate loop frequency to maximize signal without burning tokens.

Quality Signal Firing Frequency

Signal Trigger Point Frequency Rationale
Format (Prettier, Ruff) PostToolUse on Edit/Write Per file write Near-zero cost; immediate feedback[11]
Lint with auto-fix (ESLint, Ruff check) PostToolUse on Edit/Write Per file write Catches style violations before accumulation[11]
Security scan (Semgrep per file) PostToolUse on Write/Edit via MCP Per file write Immediate catch of new vulnerabilities[19]
Unit test suite Stop hook (exit 2 on failure) Per response Cost too high for PostToolUse frequency; Stop fires once per response[11]
Mutation testing (changed files) Stop hook or CI PR trigger Per response or per PR Cost too high for per-file; acceptable at response level[6][25]
Property-based tests Stop hook or CI PR trigger Per response or per PR Generated once per implementation block[7][16]
SCA (Snyk Open Source) CI PR trigger Per PR Dependency changes are infrequent[27]
Full Semgrep cross-file scan Nightly CronCreate task Daily Cross-file taint analysis requires full repo scan[27]

Autonomous Commit Workflow Pattern

Stage files on edit (PostToolUse), commit at Stop — stages incrementally (reversible) and commits only post-validation. Pattern sequence:[11]

  1. PostToolUse (Edit/Write) → lint + format + stage file (git add)
  2. Stop hook → run tests → if pass: git commit with descriptive message; if fail: exit 2 → Claude fixes
  3. Next Stop (after fix) → check stop_hook_active → skip re-commit if already in remediation pass

Iterative Orchestration: Python Control Loop Pattern

while not done:
    result = run_claude_headless(prompt)
    errors = run_tests_and_lint()
    if not errors:
        done = True
    else:
        prompt = f"Fix these errors: {errors}"

[28]

Agentic Testing Architecture Trends (QualiZeal Research, 2025)


Section 10: Production Patterns and Real Configurations

PostToolUse: Per-File Lint and Format Hook

#!/usr/bin/env bash
set -euo pipefail
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
[[ -z "$FILE_PATH" ]] && exit 0
cd "$CLAUDE_PROJECT_DIR"
case "$FILE_PATH" in
  *.ts|*.tsx|*.js|*.jsx)
    npx --no-install prettier --write "$FILE_PATH" 2>&1 || true
    npx --no-install eslint --fix "$FILE_PATH" 2>&1 || true
    ;;
  *.py)
    ruff format "$FILE_PATH" 2>&1 || true
    ruff check --fix "$FILE_PATH" 2>&1 || true
    ;;
esac

[11] Use || true guards to prevent linter errors from blocking; --no-install fails fast if deps are missing; $CLAUDE_PROJECT_DIR ensures portability. Production note: Hooks run in non-login shells — runtimes like nvm (Node.js) and pyenv (Python) are not automatically loaded. Add explicit sourcing before the case block, e.g. [ -s "$HOME/.nvm/nvm.sh" ] && source "$HOME/.nvm/nvm.sh", otherwise tools resolve to system-default versions rather than the project's pinned runtime.[11]

Stop Hook: Test Suite with Anti-Loop Guard

#!/usr/bin/env bash
set -euo pipefail
INPUT=$(cat)
ALREADY_LOOPING=$(echo "$INPUT" | jq -r '.stop_hook_active // false')
if [[ "$ALREADY_LOOPING" == "true" ]]; then exit 0; fi
cd "$CLAUDE_PROJECT_DIR"
if git diff --quiet --name-only -- '*.ts' '*.tsx' '*.js' '*.py' 2>/dev/null; then
  exit 0
fi
if ! npm test --silent; then
  echo "Tests failed. Review output and fix before ending." >&2
  exit 2
fi

[11]

PreToolUse: Block Dangerous Commands

COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty')
if echo "$COMMAND" | grep -qE 'rm[[:space:]]+-rf[[:space:]]+/|git[[:space:]]+push.*--force|chmod[[:space:]]+777'; then
  echo '{"decision": "block", "reason": "Blocked: contains unsafe pattern. Request manual execution."}'
  exit 0
fi

[11]

MCP Security Scan Hook (PostToolUse on Write)

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{
        "type": "mcp_tool",
        "server": "my_server",
        "tool": "security_scan",
        "input": { "file_path": "${tool_input.file_path}" }
      }]
    }]
  }
}

[1][19]

Stryker GitHub Actions CI Gate

name: Mutation Testing
on:
  pull_request:
    branches: [main]
jobs:
  mutation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Stryker
        run: npx stryker run --incremental
      - name: Check mutation score
        run: |
          score=$(cat reports/mutation/mutation.json | jq '.metrics.mutationScore')
          if (( $(echo "$score < 80" | bc -l) )); then
            echo "Mutation score $score% below threshold"
            exit 1
          fi

[25]

Mutmut Configuration (pyproject.toml)

[tool.mutmut]
paths_to_mutate = "src/"
backup = false
runner = "python -m pytest"
tests_dir = "tests/"

[15][25]

Hypothesis CI Configuration (GitHub Actions)

- name: Property-based tests
  env:
    HYPOTHESIS_PROFILE: ci
  run: pytest --hypothesis-seed=0 tests/

[7]

Semgrep PR Gate (GitHub Actions)

name: Semgrep
on:
  pull_request: {}
jobs:
  semgrep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Semgrep
        run: semgrep --config=auto --json --output=results.json
        env:
          SEMGREP_APP_TOKEN: ${{ secrets.SEMGREP_APP_TOKEN }}
      - name: Block on high-severity findings
        run: |
          high=$(jq '[.results[] | select(.extra.severity == "ERROR")] | length' results.json)
          if [ "$high" -gt 0 ]; then exit 1; fi

[27]

Snyk CI/CD Integration (GitHub Actions)

name: Security Scan
on: [pull_request]
jobs:
  snyk-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          command: code test
      - name: Generate AI fix
        if: failure()
        run: |
          snyk code test --json | ai-fix-agent --post-as-pr-comment

[27]

Headless CI Test-Fix Loop Pattern

claude --bare -p "Run the test suite. If tests fail, identify the root cause and fix it.
Then re-run tests to verify. Repeat until all tests pass or you've attempted 3 fixes." \
  --allowedTools "Bash,Read,Edit" \
  --output-format json

[9]

Goose SRE Agent Recipe

goal: Monitor system health and respond to alerts
extensions: [datadog-mcp, github-mcp, slack-mcp]
steps:
  - Check active alerts in Datadog
  - For each critical alert: diagnose root cause from logs
  - Create GitHub issue with diagnosis
  - Notify on-call via Slack
  - If fixable: create PR with remediation

[24]

Aider YAML Config for Autonomous Loops

auto-commits: true
dirty-commits: true
yes-always: true
model: claude-sonnet-4-5
git-commit-verify: false

[13]

Codex CLI Non-Interactive Configuration

[agent]
model = "gpt-5.3-codex"
approval_mode = "full-auto"

[system]
instructions = "You are a TypeScript expert. Always write tests."

[sandbox]
max_file_size = 10485760
allow_paths = ["./src", "./tests"]
deny_paths = [".env", "~/.ssh"]

[17]

Cursor Composer 2 Post-Session Verification Script

#!/bin/bash
set -e
npx tsc --noEmit
npx eslint src/ --max-warnings=0
npm test -- --passWithNoTests
npm run build
semgrep --config=auto --error
echo "All checks passed"

[29]

See also: Failure Containment, Cost Optimization, Context Continuity, Multi-CLI Coordination

Sources

  1. Claude Code Hooks Reference — Official Documentation (retrieved 2026-04-27)
  2. Claude Code — Run Prompts on a Schedule (Official Docs) (retrieved 2026-04-27)
  3. Aider — Scripting and Autonomous Workflow Documentation (retrieved 2026-04-27)
  4. Devin AI Engineer — Review, Testing & Limitations in 2026 (retrieved 2026-04-27)
  5. Goose by Block — Open-Source AI Agent: MCP Extensions, Recipes & Setup Guide 2026 (retrieved 2026-04-27)
  6. Mutation Testing: The Missing Safety Net for AI-Generated Code (DEV Community) (retrieved 2026-04-27)
  7. Property-Based Testing with Claude — Anthropic Research (2026) (retrieved 2026-04-27)
  8. Automated Security Scanning with Snyk MCP and Continue (Continue Docs) (retrieved 2026-04-27)
  9. Claude Code — Run Programmatically / Headless Mode (Official Docs) (retrieved 2026-04-27)
  10. Hooks Reference - Claude Code Official Documentation (retrieved 2026-04-27)
  11. Claude Code Hooks: Automate Linting, Tests, and Commits (retrieved 2026-04-27)
  12. Run Claude Code Programmatically — Official Claude Code Docs (Agent SDK / --print mode) (retrieved 2026-04-27)
  13. Scripting Aider — Autonomous Loop and Non-Interactive Mode (retrieved 2026-04-27)
  14. Goose AI Agent by Block — Open-Source Local Coding Agent Review, MCP Extensions, Recipes, Comparison with Claude Code 2026 (retrieved 2026-04-27)
  15. Mutation Testing with AI Agents When Stryker Doesn't Work (Vitest Browser Mode) (retrieved 2026-04-27)
  16. Property-Based Testing with Claude — Anthropic Red Team Research 2026 (retrieved 2026-04-27)
  17. OpenAI Codex Agent Loop Architecture — How It Works (2026) (retrieved 2026-04-27)
  18. Claude Code Hooks: Automate Linting, Tests, and Commits - TeachMeIDEA (retrieved 2026-04-27)
  19. Hooks reference - Claude Code Official Documentation (retrieved 2026-04-27)
  20. Run prompts on a schedule - Claude Code Official Documentation (retrieved 2026-04-27)
  21. Scripting Aider — Autonomous Execution Documentation (retrieved 2026-04-27)
  22. Devin's 2025 Performance Review: Learnings From 18 Months of Agents At Work — Cognition AI (retrieved 2026-04-27)
  23. GitHub Agentic Workflows Unleash AI-Driven Repository Automation - InfoQ (retrieved 2026-04-27)
  24. Goose AI Agent by Block — Open-Source AI Agent Review 2026 (retrieved 2026-04-27)
  25. Mutation Testing with AI Agents When Stryker Doesn't Work — alexop.dev (retrieved 2026-04-27)
  26. Property-Based Testing with Claude — Anthropic Research (retrieved 2026-04-27)
  27. Snyk vs Semgrep (2026): Technical Comparison for Security Teams — Konvu (retrieved 2026-04-27)
  28. What Is Claude Code Headless Mode? How to Run AI Agents Without a Terminal — MindStudio (retrieved 2026-04-27)
  29. Cursor Composer 2 Just Dropped — Here's How to Actually Use Autonomous Coding Agents in Production — DEV Community (retrieved 2026-04-27)

Home