DeepMind Built an AGI Scorecard — And Current Models Flunk Half of It

Everyone has an AGI timeline. Sam Altman says a few years. Demis Hassabis says this decade. Your LinkedIn feed says next Tuesday. The word "AGI" — artificial general intelligence, meaning an AI that handles any intellectual task a human can — has become tech's Rorschach test. Everyone sees what they want.

Problem is, you can't measure progress toward something you refuse to define. "We're close to AGI" carries exactly as much scientific weight as "I feel lucky today." It's vibes in a press release.

On March 17, Google DeepMind did something unusually honest for a lab in the AGI arms race. They published a paper called "Measuring Progress Toward AGI: A Cognitive Framework" — defining what general intelligence actually is and admitting current models don't have it.

The framework breaks intelligence into 10 cognitive faculties — distinct mental abilities that together make up what we'd call "general." Eight are foundational: perception (processing sensory input), generation (creating content), attention (focusing on what matters), learning (picking up new skills from experience), memory (storing and retrieving information), reasoning (drawing logical conclusions), metacognition (knowing what you don't know — the voice in your head that says "wait, am I sure about this?"), and executive functions (planning, switching strategies mid-task, staying on track). Two are composite, meaning they require several faculties firing together: problem-solving and social cognition (reading other people's intentions and emotions).

The key claim isn't the list itself. It's this: a system weak in even one faculty will stumble on real-world tasks. Intelligence isn't a single leaderboard number. It's a profile across all ten dimensions. This matters because current AI benchmarks — standardized tests the industry uses to measure how smart a model is — only check narrow slices, mostly reasoning and problem-solving, then declare victory when scores tick up.

DeepMind proposes a three-stage evaluation: collect human baselines from representative populations, map AI performance against those distributions, then generate radar-chart-style cognitive profiles — think of a spider web diagram where each spoke is one faculty. No single score. No "beats humans at everything." Just an honest picture of strengths and blind spots.
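To make those three stages concrete, here is a minimal Python sketch. It is not DeepMind's code or scoring method: the function names, the choice of a percentile rank, and every number in it are assumptions made up for illustration. It scores a model against a sample of human results for each faculty and leaves a gap wherever no test exists.

```python
from bisect import bisect_right

def percentile_rank(human_scores, model_score):
    """Percent of human baseline scores at or below the model's score."""
    ordered = sorted(human_scores)
    return 100.0 * bisect_right(ordered, model_score) / len(ordered)

def cognitive_profile(human_baselines, model_scores):
    """Build one value per faculty; None means the model was never tested."""
    profile = {}
    for faculty, humans in human_baselines.items():
        score = model_scores.get(faculty)
        profile[faculty] = None if score is None else percentile_rank(humans, score)
    return profile

# Stage 1: human baselines (hypothetical numbers, not real data).
human_baselines = {
    "reasoning": [40, 55, 60, 72, 90],
    "metacognition": [30, 50, 65, 70, 85],
}

# Stage 2: model results. No metacognition entry, because no test exists.
model_scores = {"reasoning": 72}

# Stage 3: the profile, ready to be drawn as one spoke per faculty.
profile = cognitive_profile(human_baselines, model_scores)
print(profile)
```

The point of returning None instead of zero is the same point the framework makes: an untested faculty is a blank on the chart, not a failing grade.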

Here's the uncomfortable part. Current LLMs — large language models, the technology behind ChatGPT, Claude, and Gemini — score well on five faculties: perception, generation, memory, reasoning, and problem-solving. These are exactly the areas existing benchmarks already cover. The other five — learning, metacognition, attention, executive functions, social cognition — have no reliable benchmarks at all. We can't test whether AI has them because nobody built the tests.

DeepMind's fix: crowdsource it. They launched a $200,000 competition on Kaggle — a platform where data scientists compete to solve problems — running through April 16. The challenge: design evaluations for those five dark-spot faculties. Two winners per track get $10,000. Four grand prize winners take $25,000. Results land June 1.
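The prize math checks out if you assume one track per missing faculty, which is an inference from the numbers above rather than something the announcement spells out here. A quick Python sanity check:

```python
# Assumption: five tracks, one for each faculty that lacks a benchmark.
TRACKS = 5
WINNERS_PER_TRACK = 2
TRACK_PRIZE = 10_000
GRAND_PRIZE_WINNERS = 4
GRAND_PRIZE = 25_000

track_total = TRACKS * WINNERS_PER_TRACK * TRACK_PRIZE
grand_total = GRAND_PRIZE_WINNERS * GRAND_PRIZE
total = track_total + grand_total

print(f"Track prizes: ${track_total:,}")
print(f"Grand prizes: ${grand_total:,}")
print(f"Total pool:   ${total:,}")
```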

Smart move. But it also exposes how deep the hole goes. Half of what makes intelligence "general" sits in a measurement vacuum. When any AI lab says their model is "approaching AGI," they're grading on a test that covers 50% of the material. That's like calling yourself a doctor after passing five of ten board exams.

Valid criticisms exist. Cognitive science itself debates whether intelligence neatly decomposes into categories — human brains are messy, and clean taxonomies might not map to reality. Human baselines will vary across demographics and cultures. And the cynical read writes itself: Google publishes a framework spotlighting areas where nobody has data, conveniently buying time before competitors claim AGI on someone else's terms.

But for you — the person absorbing AGI headlines weekly — this framework doubles as a bullshit filter. Next time a CEO announces "we're 90% of the way to AGI," ask: 90% on which faculties? Does the model have metacognition? Can it learn from a single example the way a toddler learns "hot" by touching a stove once? Can it plan three steps ahead and scrap the plan when step one fails?

AGI used to be a philosophy question — armchair debates about consciousness, sentience, and Chinese rooms. Twelve days ago, DeepMind turned it into a measurement problem. That's not solving it. But it's the difference between arguing whether a mountain exists and pulling out a topographic map with elevation markers.

Current models score well on 5 of the 10 faculties. The remaining five are the hard part, and nobody can measure them yet. At least now there's a scorecard, and everyone's taking the same test.
