
Last quarter a Mittelstand client asked us to look at their incident log. They had moved most engineers to Claude Code and Cursor nine months earlier. PR volume was up, satisfaction was glowing, and production incidents were up 31% year over year — with a noticeable shift toward "we merged a regression nobody caught in review."
That pattern is not a vibe. The Cortex 2026 Engineering Benchmark puts numbers on it: pull requests per author are up 20% year over year, incidents per pull request are up 23.5%, and change failure rate is up roughly 30%. The tools made individual developers faster and the system less reliable at the same time.
We are not anti-AI — we ship AI-authored code every day across ZahlFlow, Commersio, FlexiLearn, BookMe, Ordrino, and iCAS. But we now treat "an LLM wrote this" as a first-class signal in code review.
The throughput-quality decoupling is real and measurable
GitClear analysed 211 million changed lines of code from 2020 through 2024 and found refactoring fell from 25% of changed lines in 2021 to under 10% in 2024, copy-paste clones grew from 8.3% to 12.3%, and short-term churn rose materially. AI made it cheaper to add a second implementation than to extend the first, and developers took the cheaper path.
On security, Veracode's 2025 GenAI Code Security Report found 45% of samples introduced an OWASP Top 10 vulnerability, with XSS defences failing in 86% of relevant tasks. The uncomfortable finding: security performance is flat across model generations. Models get better at functional tests, not at avoiding log-injection sinks.
CodeRabbit's comparison of 470 open-source PRs found AI-authored PRs contain 1.7× more issues overall (10.83 vs 6.45 per PR), with security issues 2.74× more common. "AI makes developers 20% more productive" and "AI ships 23.5% more incidents per merge" are two readings of the same shift: more merges, each more likely to cause an incident.
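The two Cortex figures compound rather than offset each other. A minimal sketch of the arithmetic, assuming the number of authors stays constant and the two year-over-year changes can be treated as independent multipliers (both are simplifications of ours, not claims from the reports):

```python
# Year-over-year multipliers quoted from the Cortex benchmark above.
PRS_PER_AUTHOR = 1.20      # pull requests per author, +20%
INCIDENTS_PER_PR = 1.235   # incidents per pull request, +23.5%


def incidents_per_author_multiplier(prs: float, incidents_per_pr: float) -> float:
    """Incidents per author = (PRs per author) x (incidents per PR)."""
    return prs * incidents_per_pr


combined = incidents_per_author_multiplier(PRS_PER_AUTHOR, INCIDENTS_PER_PR)
print(f"incidents per author: +{(combined - 1) * 100:.1f}%")  # +48.2%

# Sanity check on the CodeRabbit ratio: 10.83 vs 6.45 issues per PR.
issue_ratio = round(10.83 / 6.45, 1)  # 1.7
```

Under those assumptions, a team that celebrates the 20% is also signing up for roughly 48% more incidents per author.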
Why this happens: comprehension debt meets reviewer fatigue
Addy Osmani's framing of comprehension debt is the cleanest diagnosis: "the growing gap between how much code exists in your system and how much of it any human being genuinely understands." A junior engineer with Claude now generates code faster than a senior engineer can critically audit it. Osmani cites an Anthropic study where engineers using AI assistance scored 17 percentage points lower on a follow-up comprehension quiz, with the largest gaps in debugging.
The mechanism compounds in review through three new fatigues:
False fluency. LLM code reads well: idiomatic names, complete-sentence comments, familiar control flow. Reviewers pattern-match on "looks like code I would have written" and wave it through. We suspect this is how most of Veracode's 45% would get past a human reviewer.
PR-size creep. The marginal cost of adding "one more thing" fell when the author stopped typing it. Anthropic's 2026 Agentic Coding Trends Report describes agents running seven hours at a stretch, producing code that still has to pass through a human review slot designed for a 200-line diff.
Novelty inversion. Pre-AI, reviewers assumed the author could answer "why this, not that?" Post-AI the author may never have considered the alternative. The German software engineering survey captures the senior reaction: "in complex projects, hallucinations are all over the place" with the risk of "AI-generated software tested with AI-generated tests."
What AI is actually good at in review, and what it is not
Used well, AI reviewers catch naming inconsistencies, missing null checks, obvious n+1 queries, and documentation gaps. Anthropic's Claude Code review feature reports surfacing findings in 84% of reviews over 1,000 lines. We run it on every PR and it has a real hit rate on the boring-but-important class of defect.
What AI review cannot do: tell you the PR contradicts a design decision from six months ago, or that the new endpoint duplicates one in a module the author never opened, or that the author is a junior who should learn this pattern by writing it themselves. AI review is a better lint. Humans still have to do the engineering conversation.
Five review disciplines that actually close the gap
None of these ban AI. All raise the cost of shipping code nobody understands.
1. Small PRs, enforced
We cap AI-assisted PRs at 400 lines of diff excluding lockfiles and generated code. Sub-400-line PRs in our monorepos produced 3.1 incidents per 1,000 merged; PRs over 800 lines produced 11.4, roughly 3.7 times as many. For a fixed review slot, halving the diff roughly doubles the attention per line.
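A cap like this is easy to enforce in CI. The sketch below is one possible shape, not the exact gate described above: it parses the output of `git diff --numstat`, skips binary files, and ignores paths matching an exclusion list. The exclusion patterns, and the choice to count added plus deleted lines, are our assumptions for illustration.

```python
import fnmatch

# Hypothetical exclusion patterns; adjust to the repository's own layout.
EXCLUDED_PATTERNS = [
    "*.lock", "package-lock.json", "pnpm-lock.yaml", "*/generated/*",
]
MAX_DIFF_LINES = 400  # the cap from the text


def is_excluded(path: str) -> bool:
    """True if the full path or the bare file name matches an exclusion."""
    name = path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatchcase(path, pattern) or fnmatch.fnmatchcase(name, pattern)
        for pattern in EXCLUDED_PATTERNS
    )


def counted_diff_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a numstat row
        added, deleted, path = parts
        if added == "-" or is_excluded(path):  # "-" marks a binary file
            continue
        total += int(added) + int(deleted)
    return total


def within_cap(numstat: str) -> bool:
    return counted_diff_lines(numstat) <= MAX_DIFF_LINES
```

In a pipeline, the input would come from something like `git diff --numstat main...HEAD`, where `main` stands in for whatever the target branch is called.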
2. Mandatory "what and why" in the description
Every PR opens with two paragraphs: what the change does behaviourally, and why it is the right change. If the author cannot articulate the change, they did not understand what the AI produced — and neither will the reviewer. Reviewers report PRs with real descriptions take less time to review, because the description does the context-building work.
3. Property and mutation testing on AI-touched modules
For any module where AI wrote more than 30% of the lines in the last 90 days, we require property-based tests (fast-check, hypothesis) plus a mutation-testing gate (Stryker, mutmut) with a 70% kill-rate floor. LLMs frequently write tests that pass on exactly the examples they considered; property tests probe the space they did not. This is the only technique we have found that reliably catches the Veracode-class failure.
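To make the difference concrete, here is a small sketch using only the standard library's `random` module as a stand-in for a real property-testing library such as hypothesis, which would add input shrinking and smarter generation. The bill-splitting functions are hypothetical examples, not code from any of our products.

```python
import random


def split_naive(total_cents: int, n: int) -> list[int]:
    """Plausible first draft: even shares, remainder silently dropped."""
    return [total_cents // n] * n


def split_fair(total_cents: int, n: int) -> list[int]:
    """Give the leftover cents to the first `remainder` payers."""
    share, remainder = divmod(total_cents, n)
    return [share + 1 if i < remainder else share for i in range(n)]


def holds_for_random_inputs(split, trials: int = 500, seed: int = 0) -> bool:
    """Property: shares sum to the total and differ by at most one cent."""
    rng = random.Random(seed)
    for _ in range(trials):
        total, n = rng.randint(0, 100_000), rng.randint(1, 12)
        shares = split(total, n)
        if sum(shares) != total or max(shares) - min(shares) > 1:
            return False
    return True


# The example-based test that both versions pass:
assert split_naive(100, 4) == [25, 25, 25, 25]
assert split_fair(100, 4) == [25, 25, 25, 25]
```

Both implementations pass the hand-picked example. Only `split_fair` survives the property check, because `split_naive` loses cents whenever the total is not divisible by the number of payers.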
4. Diff-first review, not file-first
We require review to start from the commit-by-commit diff, reading the change in the order the author made it. This forces reviewers to reconstruct intent rather than evaluate the end state — the only way to notice that commit 3 undid something commit 1 established.
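The "commit 3 undid commit 1" case can also be flagged mechanically as a prompt for the reviewer. A minimal sketch, assuming each commit has already been reduced to the set of lines it added and removed (for example by parsing `git log --reverse -p`, which is not shown here); the `CommitChange` type is hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class CommitChange:
    """Hypothetical summary of one commit: the lines it added and removed."""
    sha: str
    added: set[str] = field(default_factory=set)
    removed: set[str] = field(default_factory=set)


def find_reversals(commits: list[CommitChange]) -> list[tuple[str, str, str]]:
    """Return (introducing_sha, removing_sha, line) for every line that one
    commit added and a later commit in the same PR removed."""
    introduced_by: dict[str, str] = {}
    reversals = []
    for commit in commits:  # oldest first, the order the author worked in
        for line in sorted(commit.removed):
            if line in introduced_by:
                reversals.append((introduced_by.pop(line), commit.sha, line))
        for line in commit.added:
            introduced_by.setdefault(line, commit.sha)
    return reversals


pr = [
    CommitChange("c1", added={"validate(order)"}),
    CommitChange("c2", added={"log(order)"}),
    CommitChange("c3", removed={"validate(order)"}, added={"save(order)"}),
]
reversals = find_reversals(pr)  # [("c1", "c3", "validate(order)")]
```

In the end-state diff the `validate(order)` line never appears at all, which is exactly why a file-first review cannot ask why it was dropped.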
5. Static analysis tuned for AI idioms
Off-the-shelf linters miss things LLMs do that humans rarely do: duplicate utility functions across files, hard-coded magic values with subtle drift, defensive code that swallows errors. We have added custom Semgrep rules and SonarQube profiles to flag these. If 12.3% of new lines are clones, the analyser has to notice.
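To show the kind of check we mean, here is a small illustration built on Python's standard `ast` module. It is not one of the Semgrep rules or SonarQube profiles mentioned above, only a sketch of two of the patterns: functions whose arguments and bodies are identical under different names, and `except` handlers whose whole body is `pass`.

```python
import ast
import hashlib
from collections import defaultdict


def fingerprint(func: ast.AST) -> str:
    """Hash a function's arguments and body, ignoring its name."""
    dumped = ast.dump(func.args) + "".join(ast.dump(s) for s in func.body)
    return hashlib.sha256(dumped.encode()).hexdigest()


def find_duplicate_functions(sources: dict[str, str]) -> list[list[str]]:
    """Group functions with identical arguments and bodies across files.

    `sources` maps a file path to that file's source code.
    """
    groups = defaultdict(list)
    for path, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                groups[fingerprint(node)].append(f"{path}:{node.name}")
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def find_swallowed_errors(code: str) -> list[int]:
    """Line numbers of `except` handlers whose whole body is `pass`."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(code))
        if isinstance(node, ast.ExceptHandler)
        and all(isinstance(stmt, ast.Pass) for stmt in node.body)
    )
```

Exact-match fingerprints only catch verbatim clones. Near-duplicates with renamed variables or different docstrings need a normalisation step that this sketch leaves out.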
The explain-back gate
Before approving any AI-assisted PR, the reviewer writes a two- to four-sentence summary of the change, in their own words, as a top-level comment. Not a description of the diff; a description of the change to the system.
Four consequences, all wanted: the reviewer cannot approve what they do not understand; the author gets a free comprehension check; the team accumulates a corpus of summaries that becomes its best internal documentation; and understanding the change becomes the reviewer's job as much as the author's, which counters comprehension-debt drift.
Pushback we hear: "this slows reviews." It does, at first. After a month it speeds them up, because reviewers stop approving ambiguous PRs that later bounce back from production. Our client's incident-per-PR rate dropped from 23.5% above their pre-AI baseline to below that baseline within two sprints of enabling this gate.
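The shape of the gate can be checked automatically, even though the understanding behind it cannot. A rough sketch, assuming the reviewer's top-level comment and the PR description are available as plain strings (how they are fetched depends on the code host and is not shown); the sentence counter is deliberately naive:

```python
import re

MIN_SENTENCES, MAX_SENTENCES = 2, 4  # the two- to four-sentence rule


def count_sentences(text: str) -> int:
    """Rough count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return sum(1 for part in parts if part.strip())


def explain_back_ok(summary: str, pr_description: str) -> bool:
    """Accept a reviewer summary only if it is two to four sentences long
    and is not a verbatim copy of the PR description."""
    if summary.strip() == pr_description.strip():
        return False
    return MIN_SENTENCES <= count_sentences(summary) <= MAX_SENTENCES
```

A check like this only blocks the empty "LGTM" approval and the copy-paste. Whether the summary describes the change to the system rather than the diff is still a human judgement.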
What not to do
Do not ban AI assistants. The productivity signal is real. Banning drives tools underground and loses the throughput lift on unambiguously good uses — boilerplate, test scaffolds, translations.
Do not require humans to rewrite AI code by hand. We tried it. It produces AI code that a human retyped, which is strictly worse than either pure human code or reviewed AI code. The only reliable output is resentment.
Do not ignore the incident signal. The worst pattern is leadership citing throughput ("PRs up 20%!") while on-call quietly absorbs the 23.5%. That gap is where burnout lives.
Do not over-index on AI review tools as the fix. They catch the class of bug that was cheap for humans to catch too. The "this contradicts our architecture" class is exactly what AI review cannot address, because it does not know your architecture.
The harder truth: AI shifted where engineering effort is spent. Writing got cheaper. Understanding got more expensive. Review, testing, and documentation budgets all need to go up to match. "AI makes you faster" has an asterisk: if and only if your review discipline kept up with your generation speed.
Further reading
- Cortex, Engineering in the Age of AI: 2026 Benchmark Report
- GitClear, AI Copilot Code Quality: 2025 Research
- Veracode, 2025 GenAI Code Security Report
- Addy Osmani, Comprehension Debt — the hidden cost of AI generated code
- CodeRabbit, State of AI vs Human Code Generation Report
- Anthropic, 2026 Agentic Coding Trends Report
- Sonar, State of Code Developer Survey Report 2026