Your team is shipping code faster than ever. Sprint velocity charts point up and to the right. Pull requests — bundles of code changes proposed for review — fly through approval like greased lightning. The PM credits AI coding tools. Everyone nods. Life is good.
Except bug tickets are climbing. Reverts — when you undo a change because it broke something — happen more often. The backlog grows. Nobody connects the dots because the dashboard says you're crushing it.
On March 30, 2026, a team of researchers from Singapore Management University published the largest empirical study of AI-generated code quality ever conducted. They analyzed 304,362 verified AI-authored commits (code changes confirmed to come from AI tools) across 6,275 GitHub repositories, covering five major tools: GitHub Copilot, Claude, Cursor, Gemini, and Devin. The timeframe: January 2024 through October 2025.
What they found should make your velocity celebration a bit quieter.
Across those repos, the researchers identified 484,606 distinct issues. Of those, 89% were code smells: patterns that work today but rot tomorrow. Nearly 6% were runtime bugs. And 5.1% were security vulnerabilities. Between 15% and 29% of AI commits introduced at least one problem, depending on the tool. Gemini had the highest rate at 28.7%. Copilot came in at 17.3%, which is better, but still roughly one in six commits arriving with baggage.
The kicker: 24.2% of those AI-introduced issues were still alive in the latest version of the code. Security issues had a 41.1% survival rate — the highest of any category. By February 2026, the study tracked over 110,000 surviving issues that AI put there and humans never cleaned up. The researchers put it plainly: AI assistants fix roughly as many code smells as they create, but they "create more bugs and security problems than they resolve."
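As a rough sanity check, the percentages above can be turned back into counts. This is back-of-the-envelope arithmetic, not the study's own method. It assumes each percentage applies to the full 484,606 issues and that the 41.1% survival rate applies to the whole security share. The variable names are illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# Assumption: every percentage is taken against the full issue count.
TOTAL_ISSUES = 484_606

shares = {
    "code smells": 0.89,   # "89% were code smells"
    "security": 0.051,     # "5.1% were security vulnerabilities"
}

# Convert each share into an approximate issue count.
counts = {name: round(TOTAL_ISSUES * share) for name, share in shares.items()}

# 24.2% of all issues were still present in the latest version of the code.
surviving_total = round(TOTAL_ISSUES * 0.242)

# Assumption: the 41.1% survival rate applies to the security count above.
surviving_security = round(counts["security"] * 0.411)

if __name__ == "__main__":
    for name, count in counts.items():
        print(f"{name}: ~{count:,}")
    print(f"surviving issues: ~{surviving_total:,}")
    print(f"surviving security issues: ~{surviving_security:,}")
```

Under those assumptions the surviving estimate lands above 110,000, which is consistent with the figure the study reports.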
A day earlier, on March 29, Exceeds AI published benchmark data that frames why this matters at the org level. Their analysis puts the safe AI code ratio at 25–40% of total output — the range where teams see genuine 10–15% productivity gains without drowning in rework. The current global average? 41–42%. Already past the line. Teams above 40% AI code show 20–25% higher rework rates. And here's the productivity paradox that should haunt every engineering manager: developers feel 20% faster but actually measure 19% slower when you count review overhead, debugging, and fixes.
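To see where a team sits relative to that band, a minimal sketch might look like the following. The 25–40% band comes from the benchmark figures above. Everything else is an assumption for illustration: the function names are hypothetical, and how you attribute lines to AI tools (commit trailers, tool telemetry, self-reporting) is left to whatever your team can actually measure.

```python
# Band taken from the benchmark figures quoted above (25-40% of total output).
SAFE_BAND = (0.25, 0.40)


def ai_code_ratio(ai_lines: int, total_lines: int) -> float:
    """Share of output attributed to AI tools, as a fraction between 0 and 1."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return ai_lines / total_lines


def classify_ratio(ratio: float) -> str:
    """Place a ratio below, within, or above the quoted band (inclusive ends)."""
    low, high = SAFE_BAND
    if ratio < low:
        return "below band"
    if ratio <= high:
        return "within band"
    return "above band"


if __name__ == "__main__":
    # Hypothetical numbers: 4,150 AI-attributed lines out of 10,000 shipped.
    ratio = ai_code_ratio(4_150, 10_000)
    print(f"{ratio:.1%} -> {classify_ratio(ratio)}")
```

A ratio at the quoted global average of 41–42% falls just outside the band, which is the point the benchmark is making.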
Perceived speed goes up. Actual throughput goes down. The dashboard lies.
On April 6, researcher Margaret-Anne Storey from the University of Victoria gave this problem a name in a new paper. She calls it "Cognitive Debt" — the erosion of shared understanding across a team. When AI generates code faster than developers can comprehend it, the team loses the ability to safely modify their own system. It's not just technical debt (messy code you'll fix later). It's knowledge debt — nobody fully understands what the codebase does anymore.
None of this means you should stop using AI coding tools. The productivity gains are real, and the genie isn't going back in the bottle. But the question your team should be asking isn't "how much code can AI write for us?" It's "how much AI-generated code can our review process and test coverage actually sustain without the wheels falling off?"
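One way to make that question concrete is to report velocity next to a quality signal instead of on its own. The sketch below is one possible shape, not an established metric: the `SprintStats` fields and the three helper functions are hypothetical, and it assumes your tracker can tell you how many merged pull requests were later reverted and how many bug tickets were opened in the same sprint.

```python
from dataclasses import dataclass


@dataclass
class SprintStats:
    """Hypothetical per-sprint counts pulled from your own tooling."""
    merged_prs: int     # pull requests merged during the sprint
    reverted_prs: int   # merged pull requests that were later reverted
    bug_tickets: int    # bug tickets opened during the sprint


def net_merged_prs(stats: SprintStats) -> int:
    """Merged pull requests that actually stayed in the codebase."""
    return stats.merged_prs - stats.reverted_prs


def revert_rate(stats: SprintStats) -> float:
    """Fraction of merged pull requests that were reverted."""
    return stats.reverted_prs / stats.merged_prs if stats.merged_prs else 0.0


def bugs_per_merged_pr(stats: SprintStats) -> float:
    """Bug tickets opened per merged pull request."""
    return stats.bug_tickets / stats.merged_prs if stats.merged_prs else 0.0


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    sprint = SprintStats(merged_prs=40, reverted_prs=6, bug_tickets=10)
    print(f"merged: {sprint.merged_prs}, net of reverts: {net_merged_prs(sprint)}")
    print(f"revert rate: {revert_rate(sprint):.0%}")
    print(f"bugs per merged PR: {bugs_per_merged_pr(sprint):.2f}")
```

If the merged count climbs while the revert rate and bugs per merged PR climb with it, the sprint chart is reporting activity rather than progress.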
Velocity was always a vanity metric — a number that looks impressive but doesn't tell you if you're building something solid. Now, without a quality denominator, it's a dangerous one. Your sprint chart is up and to the right. So is your bug count. Same chart. Different story.

