I Measured 5,119 AI Outputs. Here’s What I Actually Learned.
For ten weeks, I measured every file Claude wrote for me. Not a sample – all of them. 5,119 datapoints across 35 projects, captured automatically by a hook after every session. The metric: edit distance between Claude’s original output and what ended up in the repository.
The hypothesis: if I formalize corrections as rules and inject them into the prompt, the correction rate drops over time. Convergence. Measurable learning without fine-tuning.
The hypothesis holds. Partially. The rest is more interesting than the result.
Methodology
EPO (Evolutionary Prompt Optimization) is a Python system that runs as a SessionEnd hook in Claude Code. After every session, it automatically:
- Scans git log for commits tagged `Co-Authored-By: Claude`
- For each file in the Claude commit: computes edit distance to the next human commit (Levenshtein ratio, normalized to [0, 1])
- Stores the score (SQLite, local)
- Injects learned rules into `CLAUDE.md` – the file Claude Code loads into its context at the start of every session
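The first step of that list can be sketched roughly as follows. This is a minimal illustration, not EPO's actual code: the function names, the separator bytes, and the choice of `git log --format=%H%x00%B%x1e` are my assumptions for the sketch.

```python
import subprocess

TRAILER = "co-authored-by: claude"  # matched case-insensitively
RECORD_SEP = "\x1e"  # emitted by %x1e, separates commits
FIELD_SEP = "\x00"   # emitted by %x00, separates hash from message


def parse_claude_commits(log_text: str) -> list[str]:
    """Return hashes of commits whose message carries the Claude co-author trailer."""
    hashes = []
    for record in log_text.split(RECORD_SEP):
        record = record.strip()
        if not record or FIELD_SEP not in record:
            continue
        commit_hash, message = record.split(FIELD_SEP, 1)
        lines = (line.strip().lower() for line in message.splitlines())
        if any(line.startswith(TRAILER) for line in lines):
            hashes.append(commit_hash.strip())
    return hashes


def read_git_log(repo_path: str) -> str:
    """Fetch hash and raw message of every commit (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H%x00%B%x1e"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Keeping the parser separate from the `git` call means it can be exercised on captured log text without a repository.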
Edit score 0.0 = file accepted unchanged. Edit score 1.0 = fully rewritten. The comparison point is always the next commit, not HEAD – otherwise, later development would count as corrections.
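A minimal sketch of such a score, assuming the edit distance is divided by the length of the longer version. The exact normalization EPO uses may differ, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def edit_score(original: str, committed: str) -> float:
    """0.0 = accepted unchanged, 1.0 = fully rewritten (assumed normalization)."""
    longest = max(len(original), len(committed))
    if longest == 0:
        return 0.0  # two empty files count as unchanged
    return levenshtein(original, committed) / longest
```

The pure-Python version is quadratic in file length, so it is only meant to show the definition; for thousands of files an optimized library would be the practical choice.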
Raw Data
| Week | n | μ (Mean) | Median | P95 | Zero Rate |
|---|---|---|---|---|---|
| W11 | 66 | 3.68% | 0.00% | 35.5% | 81.8% |
| W12 | 371 | 3.52% | 0.00% | 21.6% | 67.1% |
| W13 | 2,409 | 2.47% | 0.00% | 12.0% | 82.0% |
| W15 | 600 | 1.02% | 0.00% | 3.1% | 87.2% |
| W16 | 263 | 0.01% | 0.00% | 0.0% | 98.5% |
| W17 | 391 | 0.74% | 0.00% | 0.1% | 91.8% |
| W18 | 259 | 1.69% | 0.00% | 7.0% | 91.9% |
| W19 | 303 | 0.53% | 0.00% | 1.1% | 94.1% |
| W20 | 101 | 2.99% | 0.00% | 19.1% | 77.2% |
| W21 | 337 | 1.25% | 0.00% | 6.8% | 88.7% |
Three things that stand out immediately:
1. W13 dominates the dataset. 2,409 of 5,119 datapoints (47%) come from a single week. One large project skews the overall average. Any aggregate statement about the dataset is essentially a statement about that one week.
2. The mean is useless. The median is 0.00% in every single week – the distribution is extremely right-skewed: 85% of files are accepted unchanged, a few outliers drive the mean up. The 95th percentile (P95) is the better measure: it drops from 35.5% in W11 to under 7% in W21. The trend lives in the tails, not in the average.
3. W20 breaks the trend – or does it? After five measured weeks with means below 2%, W20 jumps to 3.0%. At n=101 with this variance, this can’t be cleanly separated from noise – the P95 of 19.1% shows that a few outliers are distorting the mean. Without confidence intervals or a Mann-Whitney U test against preceding weeks, it’s unclear whether W20 is a trend break or a sampling artifact.
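For reference, the weekly columns and the missing confidence interval could be computed along these lines. This is a sketch under assumptions: nearest-rank P95, a percentile bootstrap for the mean, and function names that are not taken from EPO.

```python
import math
import random
import statistics


def summarize(scores: list[float]) -> dict:
    """Mean, median, nearest-rank P95 and share of untouched files for one week."""
    ordered = sorted(scores)
    n = len(ordered)
    p95_index = math.ceil(n * 95 / 100) - 1  # nearest-rank percentile
    return {
        "n": n,
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "zero_rate": sum(1 for s in ordered if s == 0.0) / n,
    }


def bootstrap_mean_ci(scores, resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a skewed sample."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(resamples)
    )
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper
```

Running `bootstrap_mean_ci` on the W20 scores and on the preceding weeks would at least show how wide the intervals are at n=101. Overlapping intervals would not prove the weeks are equal, but they would make the noise explanation harder to dismiss.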
Confounds
Four factors I didn’t control for:
Project effect. Projects aren’t equally distributed. The lowest-scoring project sits at 0.08%, the highest at 8.5%. If I spend a week mainly on low-edit projects, the weekly average drops – without EPO contributing anything.
| Project | n | μ Edit Score |
|---|---|---|
| Blog (Content) | 443 | 0.08% |
| Website (Frontend) | 828 | 0.75% |
| Platform (Full-Stack) | 924 | 1.19% |
| Home Automation | 263 | 4.16% |
| Client Project A | 149 | 7.70% |
| Dashboard Prototype | 172 | 8.48% |
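A small worked example shows how strongly the project mix alone can move a weekly mean. The per-project means come from the table above; the two weekly file mixes are hypothetical and chosen only to make the effect visible.

```python
# Mean edit score per project in percent, taken from the table above.
PROJECT_MEANS = {
    "Blog (Content)": 0.08,
    "Website (Frontend)": 0.75,
    "Platform (Full-Stack)": 1.19,
    "Home Automation": 4.16,
    "Client Project A": 7.70,
    "Dashboard Prototype": 8.48,
}


def weekly_mean(file_counts: dict[str, int]) -> float:
    """Weighted mean edit score for a week with the given files per project."""
    total = sum(file_counts.values())
    weighted = sum(PROJECT_MEANS[name] * count for name, count in file_counts.items())
    return weighted / total


# Hypothetical mixes: same projects, same per-project quality, different workload.
blog_heavy_week = {"Blog (Content)": 90, "Dashboard Prototype": 10}
dashboard_heavy_week = {"Blog (Content)": 10, "Dashboard Prototype": 90}

print(f"blog-heavy week:      {weekly_mean(blog_heavy_week):.2f}%")
print(f"dashboard-heavy week: {weekly_mean(dashboard_heavy_week):.2f}%")
```

Neither project changes in this example, yet the weekly mean differs by a large factor. That is the effect any aggregate curve has to be checked against.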
Operator effect. I’m getting better at prompting. Over ten weeks, you learn how to ask Claude. To separate this from the system effect, I compared edit scores within individual projects over time – this eliminates the project effect. Instead of cherry-picking first vs. last week, I computed a Spearman rank correlation of edit score against time across all measured weeks per project:
| Project | Weeks | n | ρ (Spearman) |
|---|---|---|---|
| Blog (Content) | 6 | 443 | −0.83 |
| Home Automation | 9 | 265 | −0.68 |
| Backend Project A | 8 | 318 | −0.64 |
| Backend Project B | 8 | 157 | −0.62 |
| Website (Frontend) | 8 | 828 | −0.41 |
| Ops/Tooling | 8 | 222 | −0.16 |
| IoT Project | 6 | 268 | +0.26 |
10 of 12 projects with ≥5 measured weeks show a negative correlation – declining edit scores over time. 5 of those with ρ < −0.6 (strong monotonic relationship). This is a more robust signal than the aggregate weekly curve because the project effect is controlled for.
But: operator effect and system effect are still not separable. I’m learning the project and EPO is collecting project-specific rules. Co-evolution, not a one-way street.
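The per-project trend check can be reproduced with a few lines of standard-library Python. This sketch assumes the correlation is taken between week number and the project's mean edit score for that week, with tied values sharing their average rank; the helper names are illustrative.

```python
import math
import statistics


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho = Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(x), average_ranks(y))
```

With only six to nine weekly points per project, each ρ carries wide uncertainty, so the sign pattern across projects is probably more informative than any single value.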
There’s a second dataset that supports the operator effect. I track my Claude Code sessions in parallel: cost, tokens, tool calls. The metric “tools per message” measures how many tool calls Claude executes per user message – a proxy for how autonomously Claude can operate. Between week 10 and week 22:
- **Week 10:** 1.5 tools/message
- **Week 22:** 7.4 tools/message
That’s a 5× increase. I’m giving more complex tasks, structuring prompts so Claude does more independently. This isn’t an EPO effect – it’s operator learning. Edit scores may be dropping not because the system got better, but because I did.
File type confound. The highest edit scores come from Excel files, HTML presentations, and JSON dumps – file types where a single changed byte inflates the distance. These aren’t quality signals; they’re measurement artifacts.
Missing baseline. How much does a comparable Claude Code user accept unchanged, without EPO? Without a control group, a 96% acceptance rate is a number, not a result.
What EPO Learns Automatically
EPO extracts conventions from edit data. After 5,119 datapoints, the system has 45 active rules. The automatically generated ones:
“Outputs are rarely edited. Maintain current style and level of detail.”
This is a tautology. The data shows low edit scores, so the system generates a rule to maintain the style. The rule carries no information beyond what the metric already says.
The fundamental problem: `epo learn` correlates aggregate scores with prefabricated rule templates. There’s no signal for what was wrong with an output – only that something was changed. Edit distance is a scalar signal. Useful rules would require semantic understanding of the correction.
What Actually Works
The most valuable conventions were entered manually. Not extracted, not learned – written after Claude made the same mistake repeatedly:
Fabrication guard. “Never invent quotes or put words in real people’s mouths, even if it fits the narrative perfectly.” – Created after Claude attributed a fabricated quote to a real person. In a blog post. No edit score would have caught this. The file’s score was low because only one sentence was wrong.
Epistemic modesty. “Run a reality check before escalating compliance arguments.” – Claude reflexively amplifies data protection and security concerns. The rule forces verification: what data actually flows? What scenario triggers the damage?
Context decay warning. “Proactively flag warning signs in long sessions – topic jumps, >80k context, repetitions.” – LLMs get fuzzier after long context but won’t say so on their own.
These rules measurably change outputs. Not because an algorithm extracted them from metrics, but because a human turned a concrete failure into an explicit instruction. The mechanism isn’t machine learning. It’s a text file.
Related Work
CIPHER (Gao et al., NeurIPS 2024) uses the same core mechanism: edit distance as implicit feedback, no fine-tuning, rule injection into the prompt. Microsoft Research and Cornell, published April 2024 – nearly two years before I started building EPO.
The difference: CIPHER derives one natural-language preference description per user. EPO maintains a scored database of individual rules with scope-aware injection (global/personal/project). Whether that makes a practical difference, I can’t answer with my data – that would require a controlled comparison.
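To make “scope-aware injection” concrete, here is a rough sketch of the idea. The `Rule` fields, the selection logic, and the rendering format are assumptions made for illustration, not EPO's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    text: str
    scope: str                     # "global", "personal", or "project"
    score: float                   # higher = more useful (assumed meaning)
    project: Optional[str] = None  # only set for project-scoped rules


def select_rules(rules: list[Rule], project: str, limit: int = 20) -> list[Rule]:
    """Global and personal rules always apply; project rules only in their project."""
    applicable = [
        rule for rule in rules
        if rule.scope in ("global", "personal")
        or (rule.scope == "project" and rule.project == project)
    ]
    applicable.sort(key=lambda rule: rule.score, reverse=True)
    return applicable[:limit]


def render_rules(rules: list[Rule]) -> str:
    """Format the selected rules as a bullet list for the prompt file."""
    return "\n".join(f"- {rule.text}" for rule in rules)
```

A single preference description per user, as in CIPHER, would replace the whole list with one paragraph. Whether the finer granularity pays off is exactly the comparison my data can’t make.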
Two independent implementations converging on the same approach suggests the idea is obvious. Neither of us having a clean causal proof suggests the problem is harder than the tooling.
Three Results
1. Memory matters more than learning.
LLMs forget everything between sessions. A text file with 20 rules, loaded at every startup, has more effect than any optimization algorithm. The most technically sophisticated component of EPO – the edit distance calculation, the convergence metrics, the template genome – accomplished less than `CLAUDE.md` with manually written rules.
And there’s a second memory that’s even more reliable: git. The commit history is the most robust long-term memory an AI workflow can have – every correction, every decision, every revert, dated and immutable. EPO uses exactly this as its data source. But the insight is: the history itself is more valuable than what EPO extracts from it.
2. Implicit feedback is too thin for semantic learning.
Edit distance says: something was changed. Not: what was wrong. A fabricated quote has an edit score of 0.02 (one sentence in a long file). A reformatted Excel table has 0.98. The signal-to-noise ratio is too low for automatic rule extraction. CIPHER addresses this by having an LLM interpret the edits – a promising approach I didn’t implement.
3. Measuring creates discipline, not proof.
The real value of EPO isn’t the convergence curve. It’s the habit of formalizing corrections. When I change something and it bothers me enough, I write a rule. Without the system, I wouldn’t have done that – not for 45 corrections, and not synchronized across 35 projects. The measurement system created the discipline that makes the measurement system unnecessary.
—
EPO is open source under AGPL-3.0: github.com/tkoerting/epo-framework
Part 3 of a series on AI quality measurement. Previously: What If Your AI Learned From Every Edit You Make? · KI-Tools vergessen was sie lernen