
A senior engineer at one of our clients spent an hour last month fixing a subtle off-by-one in a payment-reconciliation path she had written in November — 400 lines, passing tests, in production for five months. It took an hour because she no longer remembered why the function was shaped the way it was. Her own git blame looked like a stranger's code. She had written it with Claude Code in an afternoon.
That is the cheapest possible illustration of what the Anthropic skill-formation study measured: 52 engineers learning the Trio async library, both groups finishing in roughly the same time, but the AI-assisted group scored 50% on the follow-up quiz versus the control group's 67%, a 17-percentage-point gap that was largest in debugging. Our existing post on review discipline and the 23.5% incidents-per-PR number from the Cortex 2026 benchmark covers a different liability. Incident rate is measurable in week one; comprehension debt is not measurable until year two, when someone leaves, an incident hits at 03:00, or a regulator asks about a subsystem nobody on the current team wrote. For Mittelstand clients running software with 10- to 20-year lifetimes, that second category is the expensive one.
Comprehension debt is a liability on humans, not artefacts
The clearest framing is Addy Osmani's: comprehension debt is "the growing gap between how much code exists in your system and how much of it any human being genuinely understands." Technical debt lives in the code and surfaces through friction you can grep for; comprehension debt lives in the heads of the people who maintain the code and surfaces only under stress: a reviewer waving through a PR because the code looks fine, an on-call engineer spending forty minutes at 02:30 reconstructing why a function swallows ValidationError. None of it shows up in the repo.
The arXiv study "Comprehension Debt in GenAI-Assisted Software Engineering Projects" (207 students, 621 reflective diaries over eight weeks) puts it plainly: comprehension debt "resides in the collective cognition of development teams rather than in the codebase itself." You cannot refactor it out. The thing that has decayed is the team's mental model, and mental models do not respond to a Jira epic.
The four accumulation patterns, translated from the classroom to your team
The arXiv paper's patterns were observed in undergraduates, but they map cleanly onto professional teams — and are harder to spot because senior engineers have the fluency to disguise them.
Black-box acceptance. A senior accepts a Claude-written migration script because tests pass, without reconstructing why it locks tables in the order it does. "Understand" gets quietly redefined as "can predict the output," which is much weaker than "can debug under pressure."
Context-mismatch debt. The model writes plausible code that assumes a different architecture, different invariants. The code works; it does not fit. Over months you accumulate modules written against an imagined version of your system, and the gap is where subtle bugs live.
Dependency-induced atrophy. Continuous AI use measurably reduces independent comprehension effort. In the Anthropic RCT, participants using AI for conceptual inquiry scored above 65%; those who delegated code generation scored below 40%. Same tool, a gap of at least 25 percentage points.
Verification bypass. If the engineer is learning the domain through the model, they are systematically under-equipped to verify what it produced — the mechanism behind most of the 45% security-failure rate in Veracode's 2025 data.
The paper's one mitigating pattern: students who used GenAI as a comprehension scaffold — asking for explanations, rewriting generated code, verifying against documentation — retained understanding. The tool is not the problem. The interaction style is.
Why Mittelstand systems pay this bill first
Hyperscalers can absorb comprehension debt against a much larger revenue base. Mittelstand engineering economics are inverted on every relevant axis.
System lifetimes are long. Industry data on enterprise legacy systems puts 20-year operational lifespans within the normal range, with 41% of German and Italian industrial firms still running custom ERP built before 2005. Code shipped today will still be in production in 2040, and, going by the 17-point quiz gap above, the author's grasp of it is already thinner than it would have been.
Teams are small. A Mittelstand team is often five to fifteen people. Research on open-source projects found 16% faced total departure of key engineers, and only 41% recovered momentum afterwards. When one senior leaves, the gap is their knowledge plus the comprehension debt from two years of LLM-assisted work that nobody else ever loaded into their head.
Turnover is low, which is usually good and occasionally a trap. Mittelstand firms average 3% annual turnover. The "we'll just ask Jane" anti-pattern feels safe until Jane goes on parental leave in 2029 and her replacement opens a 600-line file Jane shipped with Cursor in 2026, with no comments and no living memory of why any of the conditionals are there.
QA depth is limited. Few Mittelstand clients have the staff for the property-testing and mutation-testing disciplines we recommend in our review post, so the verification-bypass failure mode is more likely to reach production and survive there.
The Mittelstand profile is exactly the one most exposed: long-lived systems, small teams, low turnover that delays the reckoning, thin QA. The productivity dashboard will look excellent for two years. The maintenance budget will tell you the truth in year three.
What we changed in our own practice
We ship AI-assisted code across ZahlFlow, BookMe, Commersio, FlexiLearn, Ordrino, and iCAS every week. The disciplines below all hurt to adopt.
Reading time is a first-class metric. We track per-engineer per-sprint hours spent reading code the engineer did not write in the last 30 days. Target: 15% minimum, reported alongside PRs merged in retros. Highest-leverage change on the list.
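A minimal sketch of how the reading-time share could be computed, assuming the 15% target is measured against total logged sprint hours (the text does not say what the denominator is). The `SprintLog` record and its field names are hypothetical; the source of the hours, for example a time tracker, is left open.

```python
from dataclasses import dataclass

READING_TARGET = 0.15  # the 15% minimum; the denominator is an assumption


@dataclass
class SprintLog:
    """Hypothetical per-engineer time log for one sprint."""
    engineer: str
    total_hours: float
    # Hours spent reading code the engineer did not write in the last 30 days.
    reading_hours: float


def reading_share(log: SprintLog) -> float:
    """Fraction of sprint hours spent reading unfamiliar code."""
    if log.total_hours <= 0:
        return 0.0
    return log.reading_hours / log.total_hours


def below_target(logs: list[SprintLog], target: float = READING_TARGET) -> list[str]:
    """Names of engineers whose reading share is under the target."""
    return [log.engineer for log in logs if reading_share(log) < target]
```

Reporting `below_target` next to PRs merged keeps both numbers in the same retro view, which is the pairing the practice depends on.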
"Explain the diff" as a merge gate on AI-touched PRs. Before merge, the author writes two to four sentences explaining what changed and why, in their own words. If they cannot, the PR does not merge. Several engineers quietly dropped Copilot-style inline completions for Claude Code conversations after a month. The conversational mode teaches while completion mode does not, which is consistent with the Anthropic RCT.
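The mechanical half of such a gate can be sketched as a small check on the PR description. The section heading `## Explain the diff` is a hypothetical PR-template convention, and the sentence count is a rough regex heuristic. A script like this can only confirm that an explanation of the right length exists; whether it is in the author's own words still needs a human reviewer.

```python
import re

SECTION_HEADING = "## Explain the diff"  # hypothetical PR-template heading


def extract_explanation(pr_body: str) -> str:
    """Return the text under SECTION_HEADING, up to the next '## ' heading."""
    collected: list[str] = []
    in_section = False
    for line in pr_body.splitlines():
        if line.strip() == SECTION_HEADING:
            in_section = True
            continue
        if in_section and line.startswith("## "):
            break
        if in_section:
            collected.append(line.strip())
    return " ".join(part for part in collected if part)


def count_sentences(text: str) -> int:
    """Rough count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def passes_gate(pr_body: str, minimum: int = 2, maximum: int = 4) -> bool:
    """True when the explanation has between minimum and maximum sentences."""
    return minimum <= count_sentences(extract_explanation(pr_body)) <= maximum
```

In a CI job, `passes_gate` would be called with the PR description fetched from the hosting platform; that integration is not shown here.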
Deliberate LLM-free maintenance rotations. Every senior spends one day per fortnight maintaining older modules with no AI assistance — just the editor, docs, and git blame. The only practice we have found that reliably restores comprehension of code shipped six months ago.
Code reading clubs on AI-heavy modules. Once a month, for any module where more than 40% of lines were AI-authored last quarter, four engineers read it together for an hour. The module loads into four more heads, and ambient comprehension debt shows up when someone asks "why is this catch block here?" and nobody can answer.
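Picking the modules for a reading club could look like the sketch below. It assumes the 40% threshold is measured against lines changed in the last quarter, and that AI-authored lines are already counted somehow (for example through a commit trailer). Both are assumptions, and `ModuleStats` is a hypothetical record.

```python
from dataclasses import dataclass

AI_SHARE_THRESHOLD = 0.40  # "more than 40% of lines" from the text


@dataclass
class ModuleStats:
    """Hypothetical per-module counts for the last quarter."""
    path: str
    lines_changed: int      # lines added or modified in the quarter
    ai_authored_lines: int  # subset of those flagged as AI-authored


def ai_share(module: ModuleStats) -> float:
    """Fraction of the quarter's changed lines that were AI-authored."""
    if module.lines_changed <= 0:
        return 0.0
    return module.ai_authored_lines / module.lines_changed


def reading_club_candidates(
    modules: list[ModuleStats], threshold: float = AI_SHARE_THRESHOLD
) -> list[str]:
    """Paths of modules above the threshold, most AI-heavy first."""
    heavy = [m for m in modules if ai_share(m) > threshold]
    heavy.sort(key=ai_share, reverse=True)
    return [m.path for m in heavy]
```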
Comprehension debt on the engineering balance sheet. A quarterly report alongside the technical-debt inventory: percentage of code AI-assisted, modules with bus factor < 2, average age of authors' last-touch on critical paths, explain-back sample rate. Outputs a budget line for reading, rotation, and pair-programming time — the first model we have had that surfaces the liability before an incident does.
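One line of that report, modules with bus factor below 2, can be approximated from blame data. The sketch below uses a deliberately simple, assumed definition: the bus factor of a module is the number of current team members who last touched at least 10% of its lines. The input format (one author name per line of the file) and the 10% cut-off are illustrative choices, not taken from the text.

```python
from collections import Counter


def bus_factor(
    blame_authors: list[str], current_team: set[str], min_share: float = 0.10
) -> int:
    """Count current team members who own at least min_share of the lines."""
    if not blame_authors:
        return 0
    total = len(blame_authors)
    counts = Counter(blame_authors)
    return sum(
        1
        for author, lines in counts.items()
        if author in current_team and lines / total >= min_share
    )


def at_risk_modules(
    blame_by_module: dict[str, list[str]], current_team: set[str]
) -> list[str]:
    """Sorted paths of modules whose bus factor is below 2."""
    return sorted(
        path
        for path, authors in blame_by_module.items()
        if bus_factor(authors, current_team) < 2
    )
```

Line ownership is only a proxy for understanding, which is the point of the article: a module can have two blame owners and still nobody who can explain it, so this list is a lower bound on the real exposure.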
What we tried and dropped
Banning AI for "important" modules. Engineers routed around the label and labels got stale. The dangerous case is unexamined AI on any module, not AI on critical ones.
Requiring engineers to retype AI-generated code by hand. Output was the original AI code with keystroke delays. Retyping does not build comprehension; explaining does.
Measuring AI assistance directly. Tracking percentage of accepted AI characters is gameable, noisy, and feels like surveillance. We track comprehension outputs now — reading time, explain-back quality, incident response time on the engineer's own code — not AI inputs.
Pretending "AI-native" juniors would absorb comprehension on the job. They do not. Juniors who learned with an assistant at their side have thinner debugging instincts. The mentorship burden on seniors is higher, not lower, in an AI-heavy team. Budget for it.
The bill comes due in year two
The AI-productivity dashboard and the comprehension-debt dashboard move in opposite directions for the first 12 to 18 months. Throughput looks great; reviewers feel productive; leadership has the numbers they were promised. The bill comes due in year two, when someone leaves, an incident requires reading code from a year ago, or a regulator asks about an architectural choice nobody on the current team actually made because an LLM made it. On Mittelstand timescales the signal is slow.
The answer is not to ship less AI-assisted code; Anthropic's 2026 agentic coding trends report makes the productivity case clearly, and we believe it. The answer is that the engineering budget needs a new line item. Writing code is the cheap part now. Understanding it, at the level where you can walk through its failure modes at 03:00, is the expensive part. Go as fast as you want on generation, and spend the throughput dividend on comprehension. The teams doing this will still be shipping confidently in 2030. The teams banking the full productivity gain will not.
Further reading
- Addy Osmani, Comprehension Debt — the hidden cost of AI generated code
- Comprehension Debt in GenAI-Assisted Software Engineering Projects (arXiv 2604.13277)
- Anthropic, How AI assistance impacts the formation of coding skills
- Cortex, Engineering in the Age of AI: 2026 Benchmark Report
- Anthropic, 2026 Agentic Coding Trends Report
- GitClear, AI Copilot Code Quality: 2025 Research