I Measured 5,119 AI Outputs. Here's What I Actually Learned.

For ten weeks, I measured every file Claude wrote for me. Not a sample – all of them. 5,119 datapoints across 35 projects, captured automatically via git hook after every session. Edit distance between Claude’s original output and what ended up in the repository.

The hypothesis: if I formalize corrections as rules and inject them into the prompt, the correction rate drops over time. Convergence. Measurable learning without fine-tuning.

The hypothesis holds. Partially. The rest is more interesting than the result.

Methodology

EPO (Evolutionary Prompt Optimization) is a Python system that runs as a SessionEnd hook in Claude Code. After every session, it automatically:

1. Scans git log for commits tagged `Co-Authored-By: Claude`
2. For each file in the Claude commit: computes edit distance to the next human commit (Levenshtein ratio, normalized to [0, 1])
3. Stores the score (SQLite, local)
4. Injects learned rules into `CLAUDE.md` – the file Claude reads as system prompt at the start of every session
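
The first step, finding the Claude-tagged commits, can be sketched as a small parser over `git log` output. This is a minimal sketch and not EPO's actual code: the function names `parse_log`, `claude_commits`, and `read_git_log` and the choice of separators are assumptions made for illustration. Only the `Co-Authored-By: Claude` trailer comes from the description above.

```python
import subprocess

# Separator bytes that should not occur in normal commit messages (an assumption).
FIELD_SEP = "\x00"
RECORD_SEP = "\x1e"
# %H = commit hash, %B = raw commit message, %x00 / %x1e = literal separator bytes.
LOG_FORMAT = "%H%x00%B%x1e"


def parse_log(raw):
    """Split raw `git log` output into (commit_hash, message) pairs."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip()
        if not record:
            continue
        commit_hash, _, message = record.partition(FIELD_SEP)
        commits.append((commit_hash, message))
    return commits


def claude_commits(raw):
    """Return hashes of commits whose message carries the Claude co-author trailer."""
    tag = "co-authored-by: claude"
    return [
        commit_hash
        for commit_hash, message in parse_log(raw)
        if any(line.strip().lower().startswith(tag) for line in message.splitlines())
    ]


def read_git_log(repo_path):
    """Run `git log` in repo_path and return its raw output (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--format={LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Keeping the parsing separate from the `git` call means the tagging logic can be checked against a hand-written string without touching a real repository.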

Edit score 0.0 = file accepted unchanged. Edit score 1.0 = fully rewritten. The comparison point is always the next commit, not HEAD – otherwise, later development would count as corrections.
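
To make the score concrete, here is a minimal sketch of one way to compute it. The post only says "Levenshtein ratio, normalized to [0, 1]"; dividing the distance by the length of the longer version is an assumption, and the names `levenshtein` and `edit_score` are illustrative rather than taken from EPO.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def edit_score(claude_version, next_commit_version):
    """0.0 = accepted unchanged, 1.0 = fully rewritten (assumed normalization)."""
    longest = max(len(claude_version), len(next_commit_version))
    if longest == 0:
        return 0.0  # two empty files are identical
    return levenshtein(claude_version, next_commit_version) / longest


unchanged = edit_score("print('hi')", "print('hi')")
rewritten = edit_score("abc", "xyz")
```

The pure-Python version is quadratic in file length, so a real pipeline over thousands of files would likely want a compiled implementation. The scoring logic stays the same.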

Raw Data

Week	n	μ (Mean)	Median	P95	Zero Rate
W11	66	3.68%	0.00%	35.5%	81.8%
W12	371	3.52%	0.00%	21.6%	67.1%
W13	2,409	2.47%	0.00%	12.0%	82.0%
W15	600	1.02%	0.00%	3.1%	87.2%
W16	263	0.01%	0.00%	0.0%	98.5%
W17	391	0.74%	0.00%	0.1%	91.8%
W18	259	1.69%	0.00%	7.0%	91.9%
W19	303	0.53%	0.00%	1.1%	94.1%
W20	101	2.99%	0.00%	19.1%	77.2%
W21	337	1.25%	0.00%	6.8%	88.7%

Three things that stand out immediately:

1. W13 dominates the dataset. 2,409 of 5,119 datapoints (47%) come from a single week. One large project skews the overall average. Any aggregate statement about the dataset is essentially a statement about that one week.

2. The mean is useless. The median is 0.00% in every single week – the distribution is extremely right-skewed: 85% of files are accepted unchanged, a few outliers drive the mean up. The 95th percentile (P95) is the better measure: it drops from 35.5% in W11 to under 7% in W21. The trend lives in the tails, not in the average.

3. W20 breaks the trend – or does it? After five measured weeks with means below 2%, W20 jumps to 3.0%. At n=101 with this variance, this can’t be cleanly separated from noise – the P95 of 19.1% shows that a few outliers are driving the mean. Without confidence intervals or a Mann-Whitney U test against the preceding weeks, it’s unclear whether W20 is a trend break or a sampling artifact.
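
The weekly columns and the missing confidence interval can both be sketched in a few lines. This is an illustration under stated assumptions, not the analysis behind the table: the percentile uses the nearest-rank definition, the interval is a percentile bootstrap (a stand-in chosen here, not a Mann-Whitney U test), and the sample week is invented to be small and skewed like W20.

```python
import math
import random
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(scores):
    """Mean, median, P95 and share of unchanged files for one week of edit scores."""
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "p95": percentile(scores, 95),
        "zero_rate": sum(1 for s in scores if s == 0.0) / len(scores),
    }


def bootstrap_mean_ci(scores, resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * resamples)]
    upper = means[min(resamples - 1, int((1.0 - tail) * resamples))]
    return lower, upper


# Invented stand-in for a small, skewed week (n = 101; the values are made up).
week = [0.0] * 78 + [0.02] * 15 + [0.35] * 8
summary = summarize(week)
low, high = bootstrap_mean_ci(week)
```

If the interval for one week overlaps the intervals of the weeks before it, the jump in the mean cannot be told apart from sampling noise with this data.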

Confounds

Four factors I didn’t control for:

Project effect. Projects aren’t equally distributed. The lowest-scoring project sits at 0.08%, the highest at 8.5%. If I spend a week mainly on low-edit projects, the weekly average drops – without EPO contributing anything.

Project	n	μ Edit Score
Blog (Content)	443	0.08%
Website (Frontend)	828	0.75%
Platform (Full-Stack)	924	1.19%
Home Automation	263	4.16%
Client Project A	149	7.70%
Dashboard Prototype	172	8.48%
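
The project effect is a pure mix shift, which a few lines of arithmetic make concrete. A minimal sketch: the two per-project means are taken from the table above, while the weekly file counts and the name `weekly_mean` are invented for illustration.

```python
# Per-project mean edit scores in percent, taken from the table above.
PROJECT_MEAN = {"Blog (Content)": 0.08, "Dashboard Prototype": 8.48}


def weekly_mean(file_counts):
    """Weighted mean edit score for a week, given files touched per project."""
    total_files = sum(file_counts.values())
    weighted = sum(PROJECT_MEAN[name] * n for name, n in file_counts.items())
    return weighted / total_files


# Two invented weeks: identical per-project quality, different project mix.
dashboard_heavy = weekly_mean({"Blog (Content)": 10, "Dashboard Prototype": 90})
blog_heavy = weekly_mean({"Blog (Content)": 90, "Dashboard Prototype": 10})
```

With these made-up counts the weekly mean falls from about 7.6% to about 0.9% although no single project improved.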

Operator effect. I’m getting better at prompting. Over ten weeks, you learn how to ask Claude. To separate this from the system effect, I compared edit scores within individual projects over time – this eliminates the project effect. Instead of cherry-picking first vs. last week, I computed a Spearman rank correlation across all measured weeks per project:

Project	Weeks	n	ρ (Spearman)
Blog (Content)	6	443	−0.83
Home Automation	9	265	−0.68
Backend Project A	8	318	−0.64
Backend Project B	8	157	−0.62
Website (Frontend)	8	828	−0.41
Ops/Tooling	8	222	−0.16
IoT Project	6	268	+0.26

10 of 12 projects with ≥5 measured weeks show a negative correlation – declining edit scores over time. Five of those have ρ < −0.6 (a strong monotonic relationship). This is a more robust signal than the aggregate weekly curve because the project effect is controlled for.
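
For readers who want to reproduce the per-project check, here is a minimal sketch of Spearman's ρ in plain Python. It assumes one observation per measured week (for example, the weekly mean edit score). The post does not state the exact unit of observation, and the sample numbers below are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / (spread_x * spread_y)


# Invented weekly means for one project (percent), week index as the other variable.
weeks = [1, 2, 3, 4, 5, 6]
weekly_scores = [4.0, 3.1, 3.3, 1.2, 0.9, 0.4]
rho = spearman_rho(weeks, weekly_scores)
```

A negative ρ here only says the scores tend to fall as the weeks go by. It says nothing about why they fall.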

But: operator effect and system effect are still not separable. I’m learning the project and EPO is collecting project-specific rules. Co-evolution, not a one-way street.

There’s a second dataset that supports the operator effect. I track my Claude Code sessions in parallel: cost, tokens, tool calls. The metric “tools per message” measures how many tool calls Claude executes per user message – a proxy for how autonomously Claude can operate. Over 20 weeks:

**Week 10:** 1.5 tools/message
**Week 22:** 7.4 tools/message

That’s roughly a 5× increase (7.4 / 1.5 ≈ 4.9). I’m giving more complex tasks, structuring prompts so Claude does more independently. This isn’t an EPO effect – it’s operator learning. Edit scores may be dropping not because the system got better, but because I did.

File type confound. The highest edit scores come from Excel files, HTML presentations, and JSON dumps – file types where a single changed byte inflates the distance. These aren’t quality signals; they’re measurement artifacts.
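
One way to make this confound visible is to break the scores down by file extension before reading anything into them. A minimal sketch with invented records: the function name `mean_score_by_extension` and the sample paths are assumptions, not part of EPO.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from statistics import fmean


def mean_score_by_extension(records):
    """Group (path, edit_score) pairs by file extension and average each group."""
    groups = defaultdict(list)
    for path, score in records:
        suffix = PurePosixPath(path).suffix.lower() or "(none)"
        groups[suffix].append(score)
    return {suffix: fmean(scores) for suffix, scores in groups.items()}


# Invented records: source files barely change, a re-saved spreadsheet "changes" almost fully.
records = [
    ("src/app.py", 0.00),
    ("src/util.py", 0.04),
    ("reports/q1.xlsx", 0.98),
    ("slides/deck.html", 0.61),
]
by_extension = mean_score_by_extension(records)
```

If a handful of extensions account for most of the high scores, they can be reported separately instead of being folded into the quality signal.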

Missing baseline. How much does a comparable Claude Code user accept unchanged, without EPO? Without a control group, a 96% acceptance rate is a number, not a result.

What EPO Learns Automatically

EPO extracts conventions from edit data. After 5,119 datapoints, the system has 45 active rules. The automatically generated ones:

“Outputs are rarely edited. Maintain current style and level of detail.”

This is a tautology. The data shows low edit scores, so the system generates a rule to maintain the style. The rule carries no information beyond what the metric already says.

The fundamental problem: `epo learn` correlates aggregate scores with prefabricated rule templates. There’s no signal for what was wrong with an output – only that something was changed. Edit distance is a scalar signal. Useful rules would require semantic understanding of the correction.

What Actually Works

The most valuable conventions were entered manually. Not extracted, not learned – written after Claude made the same mistake repeatedly:

Fabrication guard. “Never invent quotes or put words in real people’s mouths, even if it fits the narrative perfectly.” – Created after Claude attributed a fabricated quote to a real person. In a blog post. No edit score would have caught this. The file’s score was low because only one sentence was wrong.

Epistemic modesty. “Run a reality check before escalating compliance arguments.” – Claude reflexively amplifies data protection and security concerns. The rule forces verification: what data actually flows? What scenario triggers the damage?

Context decay warning. “Proactively flag warning signs in long sessions – topic jumps, >80k context, repetitions.” – LLMs get fuzzier after long context but won’t say so on their own.

These rules measurably change outputs. Not because an algorithm extracted them from metrics, but because a human turned a concrete failure into an explicit instruction. The mechanism isn’t machine learning. It’s a text file.

Related Work

CIPHER (Gao et al., NeurIPS 2024) uses the same core mechanism: edit distance as implicit feedback, no fine-tuning, rule injection into the prompt. Microsoft Research and Cornell, published April 2024 – nearly two years before I started building EPO.

The difference: CIPHER derives one natural-language preference description per user. EPO maintains a scored database of individual rules with scope-aware injection (global/personal/project). Whether that makes a practical difference, I can’t answer with my data – that would require a controlled comparison.
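
To illustrate what scope-aware injection of scored rules could look like, here is a minimal sketch. Everything in it is hypothetical: the `Rule` fields, the scoring scale, the `limit` default, and the project-scoped rule texts are invented for this example and do not describe EPO's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    text: str
    scope: str                     # "global", "personal", or "project"
    score: float                   # hypothetical usefulness score; higher is better
    project: Optional[str] = None  # only set for project-scoped rules


def select_rules(rules, project, limit=20):
    """Pick the rules that apply to `project`, best-scored first."""
    applicable = [
        r for r in rules
        if r.scope in ("global", "personal")
        or (r.scope == "project" and r.project == project)
    ]
    applicable.sort(key=lambda r: r.score, reverse=True)
    return applicable[:limit]


def render_rules(rules):
    """Format the selected rules as a plain bullet list for the prompt file."""
    return "\n".join(f"- {r.text}" for r in rules)


rules = [
    Rule("Never invent quotes.", "global", 0.9),
    Rule("Flag warning signs in long sessions.", "personal", 0.7),
    Rule("Use the blog's existing front-matter fields.", "project", 0.8, project="blog"),
    Rule("Keep dashboard charts in one module.", "project", 0.6, project="dashboard"),
]
prompt_section = render_rules(select_rules(rules, project="blog"))
```

In this sketch a session in the blog project would see the global and personal rules plus its own project rule, while the dashboard rule stays out of the prompt.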

Two independent implementations converging on the same approach suggests the idea is obvious. Neither of us having a clean causal proof suggests the problem is harder than the tooling.

Three Results

1. Memory matters more than learning.

LLMs forget everything between sessions. A text file with 20 rules, loaded at every startup, has more effect than any optimization algorithm. The most technically sophisticated components of EPO – the edit distance calculation, the convergence metrics, the template genome – accomplished less than CLAUDE.md with manually written rules.

And there’s a second memory that’s even more reliable: git. The commit history is the most robust long-term memory an AI workflow can have – every correction, every decision, every revert, dated and immutable. EPO uses exactly this as its data source. But the insight is: the history itself is more valuable than what EPO extracts from it.

2. Implicit feedback is too thin for semantic learning.

Edit distance says: something was changed. Not: what was wrong. A fabricated quote has an edit score of 0.02 (one sentence in a long file). A reformatted Excel table has 0.98. The signal-to-noise ratio is too low for automatic rule extraction. CIPHER addresses this by having an LLM interpret the edits – a promising approach I didn’t implement.

3. Measuring creates discipline, not proof.

The real value of EPO isn’t the convergence curve. It’s the habit of formalizing corrections. When I change something and it bothers me enough, I write a rule. Without the system, I wouldn’t have done that – not for 45 corrections, and not synchronized across 35 projects. The measurement system created the discipline that makes the measurement system unnecessary.

—

EPO is open source under AGPL-3.0: github.com/tkoerting/epo-framework

Part 3 of a series on AI quality measurement. Previously: What If Your AI Learned From Every Edit You Make? · KI-Tools vergessen was sie lernen
