The 23.5% Problem: Why AI-Assisted Pull Requests Ship More Incidents

A team moves most of its engineers onto Claude Code and Cursor. Pull-request volume goes up, satisfaction is glowing, and a few quarters later the incident log has quietly changed shape — more and more of it reads "we merged a regression nobody caught in review."

That pattern is not a vibe. The Cortex 2026 Engineering Benchmark puts numbers on it: pull requests per author are up 20% year over year, incidents per pull request are up 23.5%, and change failure rate is up roughly 30%. The tools made individual developers faster and the system less reliable at the same time.
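As a back-of-the-envelope check, the first two numbers compound. The sketch below assumes the 20% and 23.5% figures can simply be multiplied, which the benchmark itself does not claim; the function name is ours.

```python
def incidents_per_author_growth(prs_per_author_growth: float,
                                incidents_per_pr_growth: float) -> float:
    """Combined growth in incidents per author, ASSUMING the two rates multiply."""
    return (1 + prs_per_author_growth) * (1 + incidents_per_pr_growth) - 1


# More PRs per author, each PR more likely to cause an incident.
growth = incidents_per_author_growth(0.20, 0.235)
print(f"{growth:.1%}")  # 48.2%
```

Under that assumption, each author is associated with roughly 48% more incidents than a year earlier, which is the number on-call actually feels.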

We are not anti-AI — we ship AI-authored code every day across ZahlFlow, Commersio, FlexiLearn, BookMe, Ordrino, and iCAS. But we now treat "an LLM wrote this" as a first-class signal in code review.

The throughput-quality decoupling is real and measurable

GitClear analysed 211 million changed lines of code from 2020 through 2024 and found refactoring fell from 25% of changed lines in 2021 to under 10% in 2024, copy-paste clones grew from 8.3% to 12.3%, and short-term churn rose materially. AI made it cheaper to add a second implementation than to extend the first, and developers took the cheaper path.

On security, Veracode's 2025 GenAI Code Security Report found 45% of samples introduced an OWASP Top 10 vulnerability, with XSS defences failing in 86% of relevant tasks. The uncomfortable finding: security performance is flat across model generations. Models get better at functional tests, not at avoiding log-injection sinks.

CodeRabbit's comparison of 470 open-source PRs found AI-authored PRs contain about 1.7× as many issues overall (10.83 vs 6.45 per PR), with security issues 2.74× more common. "AI makes developers 20% more productive" and "AI ships 23.5% more incidents per merge" describe the same change, measured once at merge and once in production.

Why this happens: comprehension debt meets reviewer fatigue

Addy Osmani's framing of comprehension debt is the cleanest diagnosis: "the growing gap between how much code exists in your system and how much of it any human being genuinely understands." A junior engineer with Claude now generates code faster than a senior engineer can critically audit it. Osmani cites an Anthropic study where engineers using AI assistance scored 17 percentage points lower on a follow-up comprehension quiz, with the largest gaps in debugging.

In review, the problem compounds through three new pressures on the reviewer:

False fluency. LLM code reads well: idiomatic names, complete-sentence comments, familiar control flow. Reviewers pattern-match on "looks like code I would have written" and wave it through. Veracode's 45% shows how often a vulnerability sits behind that fluent surface.

PR-size creep. The marginal cost of adding "one more thing" fell when the author stopped typing it. Anthropic's 2026 Agentic Coding Trends Report describes agents running seven hours at a stretch, and that output still has to pass through a human review slot designed for a 200-line diff.

Novelty inversion. Pre-AI, reviewers assumed the author could answer "why this, not that?" Post-AI, the author may never have considered the alternative. The German software engineering survey captures the senior reaction: "in complex projects, hallucinations are all over the place", with the risk of "AI-generated software tested with AI-generated tests."

What AI is actually good at in review, and what it is not

Used well, AI reviewers catch naming inconsistencies, missing null checks, obvious N+1 queries, and documentation gaps. Anthropic's Claude Code review feature reports surfacing findings in 84% of reviews of changes over 1,000 lines. That is the boring-but-important class of defect, and it is worth automating.

What AI review cannot do: tell you the PR contradicts a design decision from six months ago, or that the new endpoint duplicates one in a module the author never opened, or that the author is a junior who should learn this pattern by writing it themselves. AI review is a better lint. Humans still have to do the engineering conversation.

Five review disciplines that actually close the gap

None of these ban AI. All raise the cost of shipping code nobody understands.

1. Small PRs, enforced

Cap AI-assisted PRs at a few hundred lines of diff, excluding lockfiles and generated code. For a fixed review slot, halving the diff roughly doubles the attention per line, and a reviewer's attention is the scarce resource the whole problem turns on.
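One way to enforce the cap is a CI step that counts reviewable lines. This is a minimal sketch, assuming the input is the text produced by `git diff --numstat` (tab-separated added, deleted, path); the exclusion patterns and the 400-line cap are hypothetical placeholders, and renamed paths are not handled.

```python
import fnmatch

# Hypothetical values: tune the patterns and the cap to your own repository.
EXCLUDED_PATTERNS = ["*.lock", "package-lock.json", "*/generated/*"]
MAX_REVIEWABLE_LINES = 400


def is_excluded(path: str, patterns=EXCLUDED_PATTERNS) -> bool:
    """True for lockfiles and generated code, matched on full path or basename."""
    basename = path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatch(path, pattern) or fnmatch.fnmatch(basename, pattern)
        for pattern in patterns
    )


def reviewable_lines(numstat: str) -> int:
    """Sum added + deleted lines from numstat text, skipping excluded files."""
    total = 0
    for row in numstat.splitlines():
        fields = row.split("\t", 2)
        if len(fields) != 3:
            continue
        added, deleted, path = fields
        if added == "-" or deleted == "-":  # git prints "-" for binary files
            continue
        if is_excluded(path):
            continue
        total += int(added) + int(deleted)
    return total


SAMPLE_NUMSTAT = (
    "120\t30\tsrc/billing/invoice.py\n"
    "0\t4812\tpackage-lock.json\n"
    "-\t-\tdocs/diagram.png\n"
    "900\t0\tsrc/generated/client.py\n"
)

lines = reviewable_lines(SAMPLE_NUMSTAT)
print(lines, "OK" if lines <= MAX_REVIEWABLE_LINES else "split this PR")
```

Only the hand-written file counts here, so a PR dominated by lockfile churn is not punished for it.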

2. Mandatory "what and why" in the description

Every PR opens with two paragraphs: what the change does behaviourally, and why it is the right change. An author who cannot articulate the change has not understood what the AI produced, and the reviewer will not either. Reviewers report that PRs with real descriptions take less time to review, because the description does the context-building work.
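A bot can reject the emptiest descriptions before a human looks. The sketch below assumes a convention we are inventing for illustration: the two paragraphs start with "What:" and "Why:", and each needs at least 15 words. It checks only that the paragraphs exist, not that they are true.

```python
import re

MIN_WORDS = 15  # hypothetical floor for a "real" paragraph


def thin_sections(description: str, min_words: int = MIN_WORDS) -> list[str]:
    """Return the labels ("What", "Why") whose paragraph is missing or too short."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", description) if p.strip()]
    thin = []
    for label in ("What", "Why"):
        prefix = label.lower() + ":"
        # Take the text after the first paragraph that starts with the label.
        body = next(
            (p[len(prefix):] for p in paragraphs if p.lower().startswith(prefix)),
            "",
        )
        if len(body.split()) < min_words:
            thin.append(label)
    return thin


GOOD = (
    "What: Failed webhook deliveries are now written to a retry queue and "
    "re-sent with exponential backoff for up to one hour.\n\n"
    "Why: Partners were silently losing order events during deploys, and a "
    "retry queue fixes that without changing the public API."
)
BAD = "What: adds retries.\n\nLGTM"

print(thin_sections(GOOD))  # []
print(thin_sections(BAD))   # ['What', 'Why']
```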

3. Property and mutation testing on AI-touched modules

On modules where an LLM wrote a large share of the recent lines, require property-based tests (fast-check, Hypothesis) and a mutation-testing gate (Stryker, mutmut). LLMs frequently write tests that pass on exactly the examples they considered; property tests probe the space they did not. This combination is the most direct answer we know to the Veracode-class failure, because it attacks the assumption rather than the output.
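To make the difference concrete, here is a dependency-free sketch. The property check is hand-rolled with `random` so it runs anywhere; in practice Hypothesis would generate and shrink the inputs, and mutmut would produce the mutant. The `clamp` function and its mutant are invented for illustration.

```python
import random


def clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(value, high))


def clamp_mutant(value, low, high):
    """A mutation a tool might generate: the lower bound is silently dropped."""
    return min(value, high)


def example_tests(fn) -> bool:
    """Example-based tests that only exercise the cases the author thought of."""
    return fn(5, 0, 10) == 5 and fn(15, 0, 10) == 10


def property_holds(fn, trials: int = 1000, seed: int = 0) -> bool:
    """Check two properties on random inputs: result in range, in-range values unchanged."""
    rng = random.Random(seed)
    for _ in range(trials):
        low = rng.randint(-100, 100)
        high = rng.randint(low, low + 200)
        value = rng.randint(-500, 500)
        result = fn(value, low, high)
        if not low <= result <= high:
            return False
        if low <= value <= high and result != value:
            return False
    return True


print(example_tests(clamp), example_tests(clamp_mutant))    # True True
print(property_holds(clamp), property_holds(clamp_mutant))  # True False
```

The example tests pass on both versions, so the mutant survives them; the property check kills it. A surviving mutant is what the mutation-testing gate would report.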

4. Diff-first review, not file-first

Start review from the commit-by-commit diff, reading the change in the order the author made it. This forces reviewers to reconstruct intent rather than evaluate the end state — the only way to notice that commit 3 undid something commit 1 established.
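A small helper can point the reviewer at exactly that situation. This sketch assumes each commit has already been reduced to the sets of lines it added and removed; the `Commit` type and the sample commits are hypothetical. A real version would parse `git log -p` and skip trivial lines such as lone braces.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    sha: str
    added: set[str] = field(default_factory=set)
    removed: set[str] = field(default_factory=set)


def undone_lines(commits: list[Commit]) -> list[tuple[str, str, str]]:
    """Return (line, added_in, removed_in) for lines a later commit in the PR removed.

    Commits must be ordered oldest first.
    """
    first_added: dict[str, str] = {}
    findings = []
    for commit in commits:
        # Check removals before recording additions, so a line moved within
        # one commit is not reported against itself.
        for line in sorted(commit.removed):
            if line in first_added:
                findings.append((line, first_added.pop(line), commit.sha))
        for line in commit.added:
            first_added.setdefault(line, commit.sha)
    return findings


PR_COMMITS = [
    Commit("c1", added={"validate_currency(order)"}),
    Commit("c2", added={"log.info('order accepted')"}),
    Commit("c3", removed={"validate_currency(order)"}),
]

print(undone_lines(PR_COMMITS))
```

Here the helper reports that `c3` removed the validation `c1` introduced, which is the question the reviewer should put to the author.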

5. Static analysis tuned for AI idioms

Off-the-shelf linters miss things LLMs do that humans rarely do: duplicate utility functions across files, hard-coded magic values with subtle drift, defensive code that swallows errors. Custom Semgrep rules and SonarQube profiles can be tuned to flag them. If 12.3% of changed lines are clones, the analyser has to notice.
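As an illustration of two of those checks, here is a sketch using Python's standard `ast` module. It is a stand-in for the idea, not Semgrep or SonarQube rule syntax, and the sample sources are invented. It catches only structurally identical function bodies and `except` blocks that contain nothing but `pass`.

```python
import ast
from collections import defaultdict


def duplicate_functions(sources: dict[str, str]) -> list[list[str]]:
    """Group functions with identical arguments and body structure across files."""
    by_shape = defaultdict(list)
    for path, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.FunctionDef):
                # ast.dump omits line numbers by default, so only structure counts.
                shape = (
                    ast.dump(node.args),
                    "".join(ast.dump(stmt) for stmt in node.body),
                )
                by_shape[shape].append(f"{path}:{node.name}")
    return [sorted(names) for names in by_shape.values() if len(names) > 1]


def swallowed_errors(code: str) -> list[int]:
    """Line numbers of except handlers whose entire body is `pass`."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(code))
        if isinstance(node, ast.ExceptHandler)
        and all(isinstance(stmt, ast.Pass) for stmt in node.body)
    ]


SOURCES = {
    "billing/utils.py": "def to_cents(amount):\n    return round(amount * 100)\n",
    "orders/helpers.py": "def as_cents(amount):\n    return round(amount * 100)\n",
}
SWALLOW = "try:\n    charge()\nexcept Exception:\n    pass\n"

print(duplicate_functions(SOURCES))
print(swallowed_errors(SWALLOW))  # [3]
```

The two helpers have different names but the same shape, which is the clone pattern a name-based search would miss.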

The explain-back gate

Before approving any AI-assisted PR, the reviewer writes a two- to four-sentence summary of the change, in their own words, and posts it as a top-level comment. Not a description of the diff; a description of the change to the system.
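The mechanical part of the gate can be automated. The sketch below is one possible check under our own assumptions: sentences are split naively on terminal punctuation, and "own words" is approximated by rejecting summaries that are at least 80% similar to the PR description. Both the threshold and the function names are hypothetical, and no script can verify that the summary is actually correct.

```python
import re
from difflib import SequenceMatcher


def sentence_count(text: str) -> int:
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])


def passes_explain_back(summary: str, pr_description: str,
                        max_similarity: float = 0.8) -> bool:
    """True if the summary is 2 to 4 sentences and not a near-copy of the description."""
    if not 2 <= sentence_count(summary) <= 4:
        return False
    similarity = SequenceMatcher(None, summary.lower(), pr_description.lower()).ratio()
    return similarity < max_similarity


DESCRIPTION = "What: retries failed webhook deliveries. Why: partners were missing events."
SUMMARY = (
    "Webhook deliveries that fail are now queued and retried with backoff. "
    "Partners should stop losing order events during our deploys."
)

print(passes_explain_back(SUMMARY, DESCRIPTION))      # True
print(passes_explain_back(DESCRIPTION, DESCRIPTION))  # False
print(passes_explain_back("LGTM.", DESCRIPTION))      # False
```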

Four consequences, all wanted: the reviewer cannot approve what they do not understand; the author gets a free comprehension check; the team accumulates a corpus of summaries that becomes the best internal documentation you have; and the reviewer becomes a co-owner of the change rather than a rubber stamp, which slows comprehension-debt drift.

The obvious pushback: "this slows reviews." It does, at first. The bet is that it speeds them up later, because reviewers stop approving ambiguous PRs and then meeting them again in production.

What not to do

Do not ban AI assistants. The productivity signal is real. Banning drives tools underground and loses the throughput lift on unambiguously good uses — boilerplate, test scaffolds, translations.

Do not require humans to rewrite AI code by hand. It produces AI code that a human retyped, which is strictly worse than either pure human code or reviewed AI code. The only reliable output is resentment.

Do not ignore the incident signal. The worst pattern is leadership citing throughput ("PRs up 20%!") while on-call quietly absorbs the 23.5%. That gap is where burnout lives.

Do not over-index on AI review tools as the fix. They catch the class of bug that was cheap for humans to catch too. The "this contradicts our architecture" class is exactly what AI review cannot address, because it does not know your architecture.

The harder truth: AI shifted where engineering effort is spent. Writing got cheaper. Understanding got more expensive. Review, testing, and documentation budgets all need to go up to match. "AI makes you faster" has an asterisk: only if your review discipline keeps up with your generation speed.