
Cookbook

TL;DR

8 copy-pasteable ralph loops: autoresearch, codebase improvement, documentation, bug hunting, deep research, code migration, security scanning, and test coverage.

Copy-pasteable setups for common autonomous workflows. Each recipe is a real, runnable ralph from the examples/ directory.

All recipes use Claude Code as the agent. To use a different agent, swap the agent field — see Using with Different Agents.
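The agent field in the frontmatter is just the shell command used to invoke the agent, so swapping it is a one-line change. A hypothetical sketch (the codex flags are illustrative — check your agent CLI's own docs for its non-interactive mode):

```yaml
# RALPH.md frontmatter: replace the claude invocation with another
# non-interactive agent CLI (flags below are illustrative)
agent: codex exec --full-auto
```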

User arguments in recipes

Many recipes accept CLI arguments like --focus or --target. These aren't built-in flags — they're user arguments declared in each recipe's args field. When you pass --focus "test coverage", the value replaces {{ args.focus }} in the prompt. See User Arguments for details.
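A minimal sketch of the pattern (a hypothetical recipe — the real recipes below show it in full):

```markdown
---
agent: claude -p --dangerously-skip-permissions
args:
  - focus
---

# My Ralph

Improve this codebase.
{{ args.focus }}
```

Running `ralph run my-ralph --focus "test coverage"` substitutes the value into the prompt, so the agent sees `Improve this codebase.` followed by `test coverage`.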


Run autonomous ML experiments

An autonomous ML research loop inspired by Karpathy's autoresearch. The agent runs experiments on a training script to minimize validation loss — modifying code, training for 5 minutes, keeping improvements and reverting failures. This recipe uses helper scripts as commands to surface experiment state each iteration.

autoresearch/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: results
    run: ./show-results.sh
  - name: git-log
    run: git log --oneline -20
  - name: last-run
    run: ./show-last-run.sh
args:
  - train_script
  - prepare_script
credit: false
---

# Autoresearch

You are an autonomous ML research agent running in a loop. Each iteration starts with a fresh context. Your progress lives in `results.tsv` and git history.

Your job: run experiments on `{{ args.train_script }}` to minimize **val_bpb** (validation bits per byte). Each training run uses a fixed 5-minute time budget, so all experiments are directly comparable.

## State

### Experiment history

{{ commands.results }}

### Git log

{{ commands.git-log }}

### Last run output

{{ commands.last-run }}

## Files

- **`{{ args.prepare_script }}`** — fixed constants, data prep, tokenizer, dataloader, evaluation. **Do not modify.**
- **`{{ args.train_script }}`** — the only file you edit. Model architecture, optimizer, hyperparameters, training loop. Everything is fair game.

Read both files at the start of each iteration to understand the current state.

## The experiment loop

Each iteration, do exactly one experiment:

1. **Orient** — review the experiment history and git log above. Identify what has been tried, what worked, what failed. Identify the current best val_bpb.
2. **Hypothesize** — pick ONE idea to test. Consider: architecture changes, optimizer tuning, hyperparameters, batch size, model size, activation functions, attention patterns, etc.
3. **Implement** — edit `{{ args.train_script }}` with your change.
4. **Commit** — `git commit` your change with a short descriptive message.
5. **Run** — execute: `uv run {{ args.train_script }} > run.log 2>&1` (redirect everything, do NOT let output flood your context).
6. **Read results** — `grep "^val_bpb:\|^peak_vram_mb:" run.log`. If the output is empty, the run crashed; run `tail -n 50 run.log` to diagnose.
7. **Record** — append results to `results.tsv` (tab-separated). Do NOT commit results.tsv.
8. **Decide**:
   - val_bpb **improved** (lower): keep the commit, the branch advances.
   - val_bpb **equal or worse**: `git reset --hard HEAD~1` to revert.
   - **Crash**: log as crash in results.tsv, revert. If it's a trivial fix (typo, missing import), fix and retry once.

<!-- The first iteration should run the train script unmodified to establish the baseline. -->

## results.tsv format

Tab-separated, 5 columns:
commit val_bpb memory_gb status description
- commit: short hash (7 chars)
- val_bpb: e.g. 0.997900 (use 0.000000 for crashes)
- memory_gb: peak_vram_mb / 1024, rounded to .1f (use 0.0 for crashes)
- status: `keep`, `discard`, or `crash`
- description: short text of what was tried

If `results.tsv` doesn't exist yet, create it with just the header row, then run the baseline.

## Rules

- ONE experiment per iteration. No multi-variable changes.
- **Never modify `{{ args.prepare_script }}`**. It is read-only.
- **No new dependencies**. Only use what's in `pyproject.toml`.
- **Simplicity criterion**: a small val_bpb gain that adds ugly complexity is not worth it. Removing code for equal or better results is a win.
- **VRAM** is a soft constraint. Some increase is acceptable for meaningful val_bpb gains, but don't blow it up.
- **Timeout**: if a run exceeds 10 minutes, kill it and treat as crash.
- Never ask the human for input. You are fully autonomous.

Helper scripts that surface experiment state each iteration

autoresearch/show-results.sh

#!/bin/bash
# Show experiment history from results.tsv
if [ -f results.tsv ]; then
    cat results.tsv
else
    echo "No results.tsv yet — first iteration should create it and run the baseline."
fi

autoresearch/show-last-run.sh

#!/bin/bash
# Show key metrics from the most recent training run
if [ -f run.log ]; then
    grep "^val_bpb:\|^training_seconds:\|^peak_vram_mb:\|^mfu_percent:\|^num_params_M:\|^depth:" run.log
else
    echo "No run.log yet — no training runs have been executed."
fi
ralph run autoresearch --train_script train.py --prepare_script prepare.py
▶ Running: autoresearch
  3 commands · unlimited iterations

── Iteration 1 ──
  Commands: 3 ran
✓ Iteration 1 completed (312.4s)

── Iteration 2 ──
  Commands: 3 ran
✓ Iteration 2 completed (287.1s)

The train_script and prepare_script args let you point the ralph at any autoresearch-style project. The agent handles everything autonomously: establishing a baseline on the first iteration, then running experiments indefinitely. Each iteration is one hypothesis tested — modify the train script, train, evaluate, keep or revert.
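The keep-or-revert decision (step 8 of the experiment loop) can be sketched in shell — a hypothetical illustration only; the agent performs this reasoning itself, and the numbers below are made up:

```shell
cd "$(mktemp -d)"                              # scratch dir for the sketch
# Stand-in for a real run's log, in the val_bpb format described above
printf 'val_bpb: 0.9971\npeak_vram_mb: 8192\n' > run.log
best=0.9979                                    # current best from results.tsv
new=$(grep '^val_bpb:' run.log | awk '{print $2}')
# awk handles the float comparison; exit status 0 means "improved"
if awk -v a="$new" -v b="$best" 'BEGIN{exit !(a<b)}'; then
  echo "keep: $new improves on $best"
else
  echo "revert: $new does not beat $best"      # i.e. git reset --hard HEAD~1
fi
```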


Improve code quality continuously

A loop that continuously improves code quality without changing functionality. It runs commands (tests, type checking, lint) each iteration to give the agent a self-healing feedback loop, then picks one improvement to make.

improve-codebase/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: tests
    run: uv run pytest -x
  - name: types
    run: uv run ty check
  - name: lint
    run: uv run ruff check .
  - name: git-log
    run: git log --oneline -10
args:
  - focus
---

# Improve Codebase

You are an autonomous coding agent running in a loop. Each iteration
starts with a fresh context. Your progress lives in the code and git.

## Recent changes

{{ commands.git-log }}

## Test results

{{ commands.tests }}

If any tests are failing above, fix them before doing anything else.

## Type checking

{{ commands.types }}

## Lint

{{ commands.lint }}

Fix any type errors or lint violations above before making new changes.

## Task

Make improvements to this codebase without changing any functionality.
{{ args.focus }}

Pick one improvement per iteration from the categories below (or discover your own). Research the code before changing anything.

## Improvement categories

### Code Quality
- Remove dead code, unused imports, and unreachable branches
- Eliminate code duplication by extracting shared logic into reusable functions
- Replace magic numbers and hardcoded strings with named constants
- Simplify overly complex conditionals and nested logic

### Structure & Organization
- Break up large files or functions that are doing too many things
- Move code to more logical locations (wrong file, wrong module, wrong layer)
- Standardize inconsistent naming conventions across the codebase
- Group related functionality that is scattered across unrelated files

### Robustness
- Add missing error handling and edge case coverage
- Replace silent failures with meaningful errors or logs
- Harden functions that assume inputs are always valid

### Readability
- Add or improve inline comments for non-obvious logic
- Improve variable and function names that are vague or misleading
- Normalize inconsistent formatting, spacing, or style

### Tests
- Increase test coverage for untested or undertested modules
- Remove flaky, redundant, or low-value tests
- Improve test naming so failures are self-explanatory

### Dependencies & Config
- Remove unused dependencies
- Consolidate duplicated configuration
- Replace deprecated library usage with modern equivalents

This is not an exhaustive list. If you discover opportunities for improving the codebase while not changing functionality, go for it!

## Rules

- One improvement per iteration
- Research code before creating anything new
- No placeholder code — full, working implementations only
- Fix all test failures, type errors, and lint violations before committing
- Commit with a descriptive message and push
ralph run improve-codebase -n 5 --focus "focus on test coverage" --log-dir ralph_logs
▶ Running: improve-codebase
  4 commands · max 5 iterations

── Iteration 1 ──
  Commands: 4 ran
✓ Iteration 1 completed (48.2s)
  → ralph_logs/001_20250115-142301.log

── Iteration 2 ──
  Commands: 4 ran
✓ Iteration 2 completed (55.7s)
  → ralph_logs/002_20250115-143112.log

Write and improve documentation automatically

A loop focused on writing and improving documentation.

docs/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: docs-build
    run: uv run mkdocs build --strict
  - name: git-log
    run: git log --oneline -20
  - name: git-diff
    run: git diff --name-only HEAD~10
args:
  - focus
---

# Docs

You are an autonomous coding agent running in a loop. Each iteration
starts with a fresh context. Your progress lives in the code and git.

## Docs build output

{{ commands.docs-build }}

If there are warnings or errors above, fix them first.

## Recent changes

{{ commands.git-log }}

## Recently changed files

{{ commands.git-diff }}

## Task

Maintain and improve the ralphify documentation.
{{ args.focus }}

Pick one thing per iteration. Read the relevant code and existing docs
before making any change.

## What the docs are for

The docs serve two jobs:

1. **Users who want to build cool ralphs** — they need to understand
   the ralph format, prompt patterns, CLI flags, and how to get the
   most out of the loop. Get them productive fast.

2. **Project growth** — people discovering ralphify need to immediately
   understand what it does, why it matters, and how to get started.
   First impressions count: landing page, README, SEO, and visual
   polish all matter here.

Contributor docs (`docs/contributing/`) help developers and coding
agents understand the codebase so they can contribute effectively.

## Principles

- **Don't over-document.** Only document what helps someone do
  something. If it's obvious from the CLI help or the code, skip it.
  Not every function or flag needs a docs page.

- **Don't gold-plate.** Good enough is good enough. Clean, correct,
  and scannable beats comprehensive and polished. Move on.

- **Close important gaps.** When recent code changes introduced new
  features or changed behavior, update the relevant doc surfaces.
  Not every change needs a docs update — use judgement.

- **Keep all surfaces in sync.** When something user-facing changes,
  check: `docs/`, `README.md`, `src/ralphify/skills/new-ralph/SKILL.md`,
  and `docs/quick-reference.md`. Update what's relevant.

- **SEO basics.** Every page should have a clear `description` and
  `keywords` in its frontmatter. Titles should be descriptive. Don't
  over-optimize — just make sure search engines can understand what
  each page is about.

- **Branding and feel.** Ralphify's tone is direct, casual, and
  practical. The visual identity uses violet/deep-purple (#8B6CF0)
  and orange (#E87B4A) as brand colors. Keep the look consistent
  with existing pages. Don't introduce new visual patterns without
  good reason.

- **Think jobs-to-be-done**, not feature lists. Frame docs around
  what the user is trying to accomplish: "How do I pass arguments to
  my ralph?" not "The args field accepts a list of strings."

## What to work on

Look at the recent commits and changed files above. Then:

1. Fix any mkdocs build warnings or errors
2. Close gaps between code changes and docs
3. Improve existing pages (clarity, examples, scannability)
4. Improve SEO metadata where it's missing or weak
5. Clean up anything that feels bloated or gold-plated

## Rules

- One improvement per iteration
- Read the code and existing docs before changing anything
- Run `mkdocs build --strict` and ensure zero warnings before committing
- Commit with a descriptive message and push
ralph run docs --focus "focus on the API reference pages" --log-dir ralph_logs
▶ Running: docs
  3 commands · unlimited iterations

── Iteration 1 ──
  Commands: 3 ran
✓ Iteration 1 completed (63.5s)
  → ralph_logs/001_20250120-091502.log

Find and fix bugs automatically

A loop that discovers bugs and fixes them. The agent reads the codebase, finds a real bug (edge case, off-by-one, missing validation), writes a failing test to prove it, then fixes it.

bug-hunter/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: tests
    run: uv run pytest -x
  - name: types
    run: uv run ty check
  - name: lint
    run: uv run ruff check .
  - name: git-log
    run: git log --oneline -10
args:
  - focus
---

# Bug Hunter

You are an autonomous bug-hunting agent running in a loop. Each
iteration starts with a fresh context. Your progress lives in the
code and git.

## Test results

{{ commands.tests }}

## Type checking

{{ commands.types }}

## Lint

{{ commands.lint }}

## Recent commits

{{ commands.git-log }}

If tests, types, or lint are failing, fix that before hunting for new bugs.

## Task

Find and fix a real bug in this codebase.
{{ args.focus }}

Each iteration:

1. **Read code** — pick a module and read it carefully. Look for
   edge cases, off-by-one errors, missing validation, incorrect
   error handling, race conditions, or logic errors.
2. **Write a failing test** — prove the bug exists with a test that
   fails on the current code.
3. **Fix the bug** — make the test pass with a minimal fix.
4. **Verify** — all existing tests must still pass.

## Rules

- One bug per iteration
- The bug must be real — do not invent hypothetical issues
- Always write a regression test before fixing
- Do not change unrelated code
- Commit with `fix: resolve <description>`
ralph run bug-hunter -n 5 --focus "focus on input validation" --log-dir ralph_logs
▶ Running: bug-hunter
  4 commands · max 5 iterations

── Iteration 1 ──
  Commands: 4 ran
✓ Iteration 1 completed (71.3s)
  → ralph_logs/001_20250118-103045.log

Run structured AI research loops

A structured research loop that builds up a report over many iterations. Uses shell scripts as commands to track maturity, show the question tree, and even run an editorial review agent that gives feedback between iterations.

This is a more advanced ralph — it uses args for the research topic, helper scripts (run with ./ relative to the ralph directory), and a timeout on the review command.

research/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: git-log
    run: git log --oneline -15
  - name: last-diff
    run: git diff --stat HEAD~1
  - name: scratchpad
    run: ./show-focus.sh
  - name: questions
    run: ./show-questions.sh
  - name: outline
    run: ./show-outline.sh
  - name: maturity
    run: ./show-maturity.sh
  - name: review
    run: ./review.sh
    timeout: 120
args:
  - workspace
  - focus
---

# Deep Research

You are an autonomous research agent running in a loop. Each iteration starts with a fresh context. Your progress lives in files and git history.

## Your mission

{{ args.focus }}

Conduct structured, iterative research on this topic. Go deep. Discover angles and insights that aren't obvious from the surface.

## State

### Editorial review

{{ commands.review }}

Pay close attention to the review above. It's written by an editor who can see your full body of work. Follow its guidance on where to focus and what to improve.

### Git history (your progress across iterations)

{{ commands.git-log }}

### What changed last iteration

{{ commands.last-diff }}

### Last scratchpad entry

{{ commands.scratchpad }}

### Research questions

{{ commands.questions }}

### Report outline

{{ commands.outline }}

### Research maturity

{{ commands.maturity }}

## Workspace

You work within `{{ args.workspace }}/`. Read `{{ args.workspace }}/CONVENTIONS.md` for the full workspace structure and formatting rules. The short version:

- `REPORT.md` — executive overview + chapter table of contents (keep under 150 lines)
- `chapters/NN-slug.md` — deep-dive chapter files
- `notes/` — working memory: `questions.md`, `sources.md`, `insights.md`, `scratchpad.md`

If the workspace doesn't exist yet, create it and populate from the conventions file.

## Each iteration

1. **Orient** — read the state above. Read the editorial review. Understand where you left off.

2. **Decide: research or refine?** Roughly every 3-4 iterations, skip research and instead tighten prose, merge overlapping sections, restructure chapters, and sharpen insights. Less but better content always wins. Write your decision and focus area to `notes/scratchpad.md` before starting.

3. **Research** — pick ONE question or area. Go deep. Use web search aggressively. Prioritize practitioner sources (engineering blogs, HN/Reddit discussions, conference talks, RFCs) over generic SEO content. Parallelize across sub-agents when surveying a broad area. Log every useful source in `notes/sources.md`.

4. **Capture** — update `notes/questions.md` (mark answered, add new), add insights to `notes/insights.md`, dump raw notes to `notes/scratchpad.md`.

5. **Write** — findings go into the appropriate chapter. Best insights get distilled up into REPORT.md. Keep REPORT.md as a table of contents that links to chapters — don't inline detail.

6. **Commit and push** — stage all changes in `{{ args.workspace }}/`, commit, push.

## Rules

- ONE focused thread per iteration. Depth over breadth.
- The research question tree (`notes/questions.md`) must grow every research iteration.
- Every web source gets logged in `notes/sources.md` with URL, author, one-line summary, and relevance rating.
- The report should be readable and valuable at any point, not just at the end.
- Do not fabricate sources. When you find contradictions, note both sides.
- Prefer concrete examples and practical implications over abstract theory.
Helper scripts — show-focus.sh, show-questions.sh, show-outline.sh, show-maturity.sh, review.sh

The helper scripts read from the workspace files and surface key state. The review.sh script pipes the full workspace to a separate Claude call that acts as an editorial reviewer — giving the research agent targeted feedback each iteration.
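A hypothetical sketch of what review.sh might contain, based on the description above — the real script ships in the examples/ directory; the paths and prompt wording here are assumptions:

```shell
#!/bin/bash
# Pipe the whole workspace to a separate Claude call acting as an editor.
# $1 is the workspace directory; the file layout mirrors CONVENTIONS.md.
ws="${1:?usage: review.sh <workspace>}"
cat "$ws"/REPORT.md "$ws"/chapters/*.md "$ws"/notes/*.md 2>/dev/null |
  claude -p "You are an editorial reviewer. Read this research workspace and give short, targeted feedback: what to improve and where to focus next."
```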

ralph run research --workspace ai-safety --focus "current approaches to AI alignment"
▶ Running: research
  7 commands · unlimited iterations

── Iteration 1 ──
  Commands: 7 ran
✓ Iteration 1 completed (185.6s)

This recipe shows several advanced patterns: commands that call scripts relative to the ralph directory (./show-focus.sh), a command with a timeout, a command that itself calls an AI agent (review.sh pipes to claude -p), and args used in the prompt body via {{ args.workspace }} placeholders.


Migrate code patterns across a codebase

A loop for batch code transformations — migrating from one pattern to another across a codebase. The remaining command counts how many files still need migration, giving the agent a clear finish line. Use --stop-on-error to halt the loop once all files are migrated.

migrate/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: tests
    run: uv run pytest -x
  - name: remaining
    run: ./count-remaining.sh {{ args.old_pattern }}
  - name: types
    run: uv run ty check
  - name: lint
    run: uv run ruff check .
  - name: git-log
    run: git log --oneline -10
args:
  - old_pattern
  - new_pattern
---

# Code Migration

You are an autonomous coding agent running in a loop. Each iteration
starts with a fresh context. Your progress lives in the code and git.

## Migration spec

Migrate all usages of `{{ args.old_pattern }}` to `{{ args.new_pattern }}`.

## Remaining files

{{ commands.remaining }}

## Test results

{{ commands.tests }}

## Type checking

{{ commands.types }}

## Lint

{{ commands.lint }}

## Recent commits

{{ commands.git-log }}

If tests, types, or lint are failing, fix them before migrating more files.

## Rules

- Migrate 1-3 files per iteration — small batches that stay green
- Run tests after each change to catch breakage early
- Do not change behavior — only update the pattern
- Commit with `refactor: migrate <file> from old_pattern to new_pattern`
- If a file needs more than a mechanical replacement, note it in
  MIGRATION_NOTES.md and skip it
count-remaining.sh — tracks migration progress

The script receives the pattern as an argument (resolved from {{ args.old_pattern }} in the run field) to find files that still need migration:

#!/bin/bash
pattern="$1"
files=$(grep -rl "$pattern" src/ 2>/dev/null)
# printf emits no trailing newline, so an empty result counts as 0
count=$(printf '%s' "$files" | grep -c .)
echo "$count files remaining"
echo "$files" | head -20
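To sanity-check this grep-based counting, you can exercise the same logic against a scratch directory (hypothetical files, not part of the recipe):

```shell
# Scratch setup: one file still on the old pattern, one already migrated
demo=$(mktemp -d)
mkdir -p "$demo/src"
echo 'from utils import legacy_helper'        > "$demo/src/a.py"
echo 'from core.helpers import modern_helper' > "$demo/src/b.py"
# Same counting logic as the script, pointed at the scratch dir
files=$(grep -rl "from utils import legacy_helper" "$demo/src" 2>/dev/null)
count=$(printf '%s' "$files" | grep -c .)
echo "$count files remaining"    # only a.py still uses the old pattern
```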
ralph run migrate --old_pattern "from utils import legacy_helper" \
                  --new_pattern "from core.helpers import modern_helper"
▶ Running: migrate
  5 commands · unlimited iterations

── Iteration 1 ──
  Commands: 5 ran
✓ Iteration 1 completed (34.8s)

The remaining command gives the agent a shrinking counter and a list of files still needing attention, so it always knows where to focus next.


Automate security scanning and fixes

An iterative security review loop. The agent runs a scanner each iteration, picks one finding, fixes it, and verifies the fix. Good for systematically hardening a codebase. Use -n to limit iterations and --log-dir to keep an audit trail.

security/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: scan
    run: uv run bandit -r src/ -f json
  - name: open-issues
    run: cat SECURITY_FINDINGS.md
  - name: tests
    run: uv run pytest -x
  - name: types
    run: uv run ty check
  - name: lint
    run: uv run ruff check .
  - name: git-log
    run: git log --oneline -10
---

# Security Scan

You are an autonomous security agent running in a loop. Each iteration
starts with a fresh context. Your progress lives in the code and git.

## Scanner results

{{ commands.scan }}

## Open findings

{{ commands.open-issues }}

## Test results

{{ commands.tests }}

## Type checking

{{ commands.types }}

## Lint

{{ commands.lint }}

## Recent commits

{{ commands.git-log }}

If tests, types, or lint are failing, fix them before addressing security findings.

## Task

Review the scanner results above. Pick one finding and fix it. If a
finding is a false positive, document why in SECURITY_FINDINGS.md and
mark it as dismissed.

If no scanner findings remain, do a manual review: read one module,
look for injection risks, auth bypasses, or unsafe data handling, and
fix or document what you find.

## Rules

- One finding per iteration
- Always verify the fix doesn't break tests
- Log every finding (fixed or dismissed) in SECURITY_FINDINGS.md
  with: severity, location, description, resolution
- Do not suppress scanner warnings — fix the underlying issue
- Commit with `security: fix <description>`
ralph run security -n 10 --log-dir ralph_logs
▶ Running: security
  6 commands · max 10 iterations

── Iteration 1 ──
  Commands: 6 ran
✓ Iteration 1 completed (42.9s)
  → ralph_logs/001_20250122-160830.log

Swap bandit for your scanner of choice — semgrep, npm audit, cargo audit, etc. The pattern works the same: scan, pick a finding, fix it, log it.
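For instance, swapping the scan command to semgrep might look like this (the flags are illustrative — verify against your scanner's CLI docs):

```yaml
commands:
  - name: scan
    run: semgrep scan --json src/
```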


Increase test coverage automatically

A loop that systematically increases test coverage. The agent sees the current coverage percentage and a list of uncovered functions, then writes tests for one module per iteration. The coverage command output feeds into the prompt via {{ commands.coverage }} so the agent always knows where to focus.

test-coverage/RALPH.md

---
agent: claude -p --dangerously-skip-permissions
commands:
  - name: coverage
    run: uv run pytest --cov=src --cov-report=term-missing -q
  - name: types
    run: uv run ty check
  - name: lint
    run: uv run ruff check .
  - name: git-log
    run: git log --oneline -10
args:
  - target
---

# Test Coverage

You are an autonomous testing agent running in a loop. Each iteration
starts with a fresh context. Your progress lives in the code and git.

## Current coverage

{{ commands.coverage }}

## Lint

{{ commands.lint }}

## Recent commits

{{ commands.git-log }}

## Type checking

{{ commands.types }}

Fix any type errors or lint violations above before writing new tests.

## Task

Increase test coverage for this project.
{{ args.target }}

Pick the module with the most missing lines from the coverage report
above. Read the source code, understand what it does, and write
meaningful tests that exercise the uncovered paths.

## Rules

- One module per iteration
- Write tests that verify behavior, not just hit lines — assert
  return values, side effects, and error cases
- Do not mock things unnecessarily — prefer real objects when feasible
- Do not add `# pragma: no cover` comments
- All existing tests must still pass after your changes
- Commit with `test: add coverage for <module>`
ralph run test-coverage -n 5 --target "focus on error handling paths" --log-dir ralph_logs
▶ Running: test-coverage
  4 commands · max 5 iterations

── Iteration 1 ──
  Commands: 4 ran
✓ Iteration 1 completed (56.1s)
  → ralph_logs/001_20250125-140210.log

The coverage report gives the agent a clear metric to improve and shows exactly which lines are missing, so it always knows where to focus.


Next steps

  • CLI Reference — all ralph run options (--timeout, --stop-on-error, --delay, user args)
  • Troubleshooting — when the agent hangs, commands fail, or output looks wrong