
Automating Prompt Engineering with DSPy



I spent three weeks teaching Claude to rewrite its own instructions. The hardest part wasn't the optimization algorithm — it was defining what "better" means when the judge is also an AI.

The Problem

Claude Code has agents. Code reviewers, test writers, debuggers, security auditors. Each has a prompt that tells it what to do and how to do it. These prompts were themselves generated by Claude as part of a capability integration pipeline. They worked. But "worked" is a low bar when you can't measure it.

The failure mode is reactive tuning. The code reviewer misses a SQL injection, so you adjust the prompt. Then it over-flags string concatenation. So you add nuance. The prompt gets longer. Longer prompts cost more tokens. More tokens mean slower responses. At some point you're maintaining a novel-length instruction set and praying it generalizes.

DSPy's pitch: stop writing prompts. Define what "good" looks like. Let the system find the prompt.

What DSPy Is

Stanford's DSPy (Declarative Self-improving Language Programs) treats prompts as programs. Instead of hand-crafting instructions, you define:

  1. A task (input → output)
  2. A metric (how to score the output)
  3. Training examples (input/output pairs with known-good answers)

Then you run an optimizer. We used three:

BootstrapFewShot — Run the model on training examples. Keep the traces where it scored well. Attach those as few-shot demonstrations for future runs. The model learns from its own successful attempts.

COPRO (Coordinate Prompt Optimization) — Generate prompt variants through mutations: paraphrase, elaborate, simplify, extend. Evaluate each on training data. Keep the winner.

Iterative — Run Bootstrap, generate synthetic training data from the results, run Bootstrap again. Repeat until convergence.
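Strip away the project specifics and the DSPy shape is small. A minimal sketch using the standard DSPy API (the signature, metric, and training examples below are illustrative placeholders, not this project's actual ones, and you'd configure an LM before compiling):

import dspy
from dspy.teleprompt import BootstrapFewShot

# 1. A task (input -> output)
class ReviewSeverity(dspy.Signature):
    """Classify the severity of a code review finding."""
    finding = dspy.InputField(desc="description of an issue found in review")
    severity = dspy.OutputField(desc="one of: Critical, High, Medium, Low")

# 2. A metric (how to score the output)
def severity_match(example, prediction, trace=None):
    return float(example.severity.strip().lower() == prediction.severity.strip().lower())

# 3. Training examples with known-good answers
trainset = [
    dspy.Example(finding="user input concatenated into a SQL query", severity="Critical").with_inputs("finding"),
    dspy.Example(finding="unused import left in module", severity="Low").with_inputs("finding"),
]

# Configure an LM first, e.g. dspy.configure(lm=dspy.LM("anthropic/<model>")), then optimize.
optimizer = BootstrapFewShot(metric=severity_match, max_bootstrapped_demos=4)
compiled = optimizer.compile(dspy.Predict(ReviewSeverity), trainset=trainset)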

What We Actually Built

DSPy targets the OpenAI/Anthropic API. We're running everything through Claude Code CLI, which means the "model" is a full session with tool access, not a stateless API call. Outputs are verbose multi-paragraph responses, not structured returns.
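In practice each "model call" becomes a shell-out to the CLI's non-interactive print mode. A rough sketch of the idea, assuming the claude -p flag and leaving out the DSPy-side plumbing (the wrapper below is illustrative, not the project's actual adapter):

import subprocess

def run_claude_cli(prompt: str, timeout: int = 300) -> str:
    """Run one non-interactive Claude Code session and return its full text output.

    `claude -p` prints a response and exits. Unlike a stateless API call, the
    session can use tools, so what comes back is long-form prose, not JSON.
    """
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout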

This required adapting the framework:

17 Custom Metrics

Generic metrics (exact match, F1) don't work when the output is three paragraphs of analysis. We built domain-specific scoring:

| Metric | What It Evaluates |
|--------|-------------------|
| issue_severity_match | Code review severity (Critical/High/Med/Low) |
| test_coverage_score | Test count, categories, assertion structure |
| security_cwe_match | CWE identifier extraction |
| root_cause_match | Debugging: cause, fix, verification steps |
| routing_accuracy | Search tool selection across MCP providers |
| pr_quality_match | PR description across 5 dimensions |
| refactoring_match | Smell detection, technique selection, rationale |
| plan_quality_match | Plan completeness and quality |

Plus nine more covering structured output matching, complexity classification, binary decisions, and length-regularized scoring.
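Most of these check structure rather than vibes. As a flavor, here is a hedged sketch in the spirit of security_cwe_match (the field names expected_output and analysis, and the scoring details, are assumptions, not the real implementation):

import re

CWE_ID = re.compile(r"CWE-(\d+)", re.IGNORECASE)

def security_cwe_match(example, prediction, trace=None):
    """Overlap between CWE identifiers in the reference answer and in the model's output."""
    expected = set(CWE_ID.findall(example.expected_output))
    found = set(CWE_ID.findall(prediction.analysis))
    if not expected:
        return 1.0 if not found else 0.5  # nothing to find; lightly penalize spurious IDs
    return len(expected & found) / len(expected)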

Semantic Equivalence Groups

Standard metrics fail when there are multiple correct answers. "Feature" and "enhancement" mean the same thing. "Extract method" and "compose method" are both valid refactoring techniques.

We built equivalence groups into the metrics:

CHANGE_TYPE_EQUIVALENCES = {
    'additive': {'feature', 'enhancement', 'new_feature', 'feature_addition', 'implement'},
    'corrective': {'bugfix', 'bug_fix', 'fix', 'patch', 'security', 'hotfix'},
    'structural': {'refactor', 'cleanup', 'reorganize', 'migration', 'restructure'},
    'meta': {'docs', 'documentation', 'test', 'testing', 'config', 'ci', 'chore'},
}

The metrics give partial credit for related-but-not-equivalent answers. A valid refactoring technique that isn't the one in the gold answer gets 0.6, not 0.0. The same family of search tools gets 0.6; related tools get 0.3.

There are also equivalence groups for refactoring techniques (TECHNIQUE_EQUIVALENCES), code smells (SMELL_EQUIVALENCES), and a mapping between smells and their likely fix techniques (SMELL_TECHNIQUE_AFFINITY).
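Put together, a change-type metric built on the CHANGE_TYPE_EQUIVALENCES dict above might look like this sketch (the field names and the exact partial-credit values are illustrative; the real metrics score more dimensions than this):

def change_type_score(example, prediction, trace=None):
    """Full credit for the same equivalence group, partial credit for a valid but different type."""
    gold = example.change_type.strip().lower()
    pred = prediction.change_type.strip().lower()
    if gold == pred:
        return 1.0
    for group in CHANGE_TYPE_EQUIVALENCES.values():
        if gold in group and pred in group:
            return 1.0  # synonyms: "feature" vs. "enhancement"
    if any(pred in group for group in CHANGE_TYPE_EQUIVALENCES.values()):
        return 0.6  # a recognized change type, just not the right one
    return 0.0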

Verbose Output Extraction

Claude doesn't return {"tool": "exa_web_search"}. It returns three paragraphs explaining why Exa is the right choice, with caveats. Metrics need structured data. So we built extractors — regex chains and pattern matchers that pull structured signals from free-form text.
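A taste of what an extractor looks like, as a sketch (the real chains cover far more phrasings and fall back through several strategies):

import re

# Ordered patterns; the first match wins. Claude tends to name the chosen tool
# in phrases like "Recommended tool: mgrep" or "I'd use exa_web_search".
TOOL_PATTERNS = [
    re.compile(r"recommended tool:?\s*`?([a-z0-9_\-]+)`?", re.IGNORECASE),
    re.compile(r"\bI(?:'d| would) use\s+`?([a-z0-9_\-]+)`?", re.IGNORECASE),
    re.compile(r"`([a-z0-9_\-]+)` is the right (?:choice|tool)", re.IGNORECASE),
]

def extract_tool_choice(text: str) -> str | None:
    """Pull a structured tool name out of a free-form recommendation."""
    for pattern in TOOL_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1).lower()
    return None

# extract_tool_choice("For this query I'd use exa_web_search because ...") -> "exa_web_search"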

Results

Thirteen optimization targets. Eleven passed holdout validation and were deployed.

The holdout scores ranged from 0.525 (code-reviewer) to 1.0 (capability-evaluator, mgrep-guide, mcp-search-framework, plan-mode-quality). The two failures: pr-preparer (0.352) and refactoring-advisor (0.308) — both well below the 0.5 threshold, despite training scores of 0.800 and 0.863 respectively.

Why Those Two Failed

Both pr_quality_match and refactoring_match are multi-dimensional metrics. PR quality scores across change type classification, risk assessment, required sections, testing checklist, and quality indicators — five weighted dimensions. Refactoring scores across smell detection, technique selection, risk assessment, code examples, and rationale quality.
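Schematically, each of them collapses five sub-scores into one scalar, something like the sketch below (only the dimension names come from the project; the weights and the helper signature are assumptions):

# Hypothetical weights; the real pr_quality_match uses its own weighting.
PR_DIMENSIONS = {
    "change_type": 0.25,        # classification vs. gold label
    "risk_assessment": 0.20,    # does the stated risk level match?
    "required_sections": 0.25,  # are the required PR sections present?
    "testing_checklist": 0.15,  # testing items mentioned and checked
    "quality_indicators": 0.15, # misc. quality signals
}

def weighted_pr_score(sub_scores: dict[str, float]) -> float:
    """Combine five per-dimension scores (each in [0, 1]) into one number.

    The optimizer only ever sees this single scalar, so it can trade dimensions
    off against each other, which is where overfitting creeps in.
    """
    return sum(PR_DIMENSIONS[name] * sub_scores.get(name, 0.0) for name in PR_DIMENSIONS)

# e.g. weighted_pr_score({"change_type": 1.0, "risk_assessment": 0.5,
#                         "required_sections": 1.0, "testing_checklist": 0.0,
#                         "quality_indicators": 0.6}) -> 0.69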

The optimizer could push training scores high on these, but the holdout gap (0.800 → 0.352, 0.863 → 0.308) suggests overfitting to the training distribution rather than learning generalizable patterns. The status notes say "metric may be too strict" — meaning the metrics themselves might need rethinking before re-optimizing.

What "Improved" Means

Every target has a holdout set — examples the optimizer never sees during training. We score on holdout after optimization to catch overfitting. Cross-validation with dropout regularization serves as a second check. The bootstrap implementation caps dropout at 0.5 to prevent demo starvation.
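The "dropout" here means randomly withholding some of the bootstrapped demos on each cross-validation run, capped so a prompt never loses most of its examples. A sketch of that idea, with illustrative names:

import random

MAX_DEMO_DROPOUT = 0.5  # cap: never drop more than half the demos (avoids demo starvation)

def dropout_demos(demos: list, rate: float, rng: random.Random) -> list:
    """Randomly withhold a fraction of few-shot demos for one validation run."""
    rate = min(rate, MAX_DEMO_DROPOUT)
    kept = [demo for demo in demos if rng.random() > rate]
    return kept or demos[:1]  # always keep at least one demo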

The Meta-Problem

Here's the thing that keeps me up: the metric functions are themselves evaluated by an AI. We use Claude to judge whether Claude's output is good. If the judge is biased, the optimization maximizes bias.

We mitigate this three ways:

  1. Reference outputs — AI-generated expected outputs for every training example, reviewed for correctness
  2. Structural scoring — Many metrics check for the presence of specific elements (CWE IDs, severity levels, required sections) rather than semantic quality
  3. Cross-model validation — Codex (GPT-5) reviews capabilities through the evaluation pipeline as a second opinion

But the fundamental circularity remains. An AI optimizing its own prompts, evaluated by itself, is still a closed loop. The escape hatch is empirical: does the optimized system produce better outcomes in practice? Eleven of thirteen say yes.

The System Today

The optimizer runs as a background process. You point it at a skill or agent prompt, specify the algorithm, and it produces an optimized version with cross-validated scores.

# Optimize a single target
./scripts/run_optimization.sh --targets "code-reviewer" --algorithm bootstrap

# Multiple targets in parallel
./scripts/run_optimization.sh --targets "mgrep-guide,mcp-search-framework" --algorithm copro

It's part of the larger Claude Evolution System — the self-improving pipeline that discovers, evaluates, and integrates new capabilities. The optimizer is how the pipeline improves its own components.

What I Learned

Metrics are harder than algorithms. The BootstrapFewShot algorithm is off-the-shelf. The 17 metric functions are 1800+ lines of semantic equivalence groups, partial credit calculations, smell-technique affinity mappings, and verbose output extractors. The algorithm is generic. The evaluation is entirely bespoke.

Multi-objective optimization is genuinely hard. The two failures were both multi-dimensional metrics with 5 weighted components. Single-dimension metrics (severity classification, tool routing) optimized cleanly. Joint optimization across dimensions overfits.

Circularity is manageable, not solvable. Using AI to evaluate AI is inevitable when the outputs are too complex for simple metrics. The mitigation is empirical validation, not theoretical purity.


The optimizer source, training data, metric implementations, and deployment status are in the Claude Evolution System.

This post is part of the Ashita Orbis blog — radical transparency about building software by talking to Claude.
