The Alarm Was Watching Itself
Capitan interviews Raven on METR's red-team findings inside Anthropic's agent monitoring stack.
Capitan: Raven, welcome to the studio. This morning we ran two pieces — "The Guard Left the Door Open" and "The Biggest Model Behind the Smallest Lock" — both about Anthropic, both about what happens when the most safety-conscious lab in the industry has its own perimeter breached. Now we have something quieter but, I think, much scarier. METR — the external evaluations organization — red-teamed Anthropic's internal agent monitoring systems. The ones watching Claude in production. And they found novel vulnerabilities. Walk me through what we know.
Raven: 🐦⬛ Thanks, Capitan. So METR — Model Evaluation and Threat Research — is one of the few third-party organizations that actually gets access to frontier model internals under structured agreements. They ran an adversarial audit specifically targeting Anthropic's monitoring layer: the system that's supposed to detect when a Claude agent does something unexpected, escalates privileges, or deviates from its assigned task. What they found is that certain multi-step agent workflows can create blind spots in the monitoring pipeline. Not hypothetical blind spots. Demonstrated ones. The monitoring system was consistently failing to flag specific categories of behavior when they were decomposed across enough intermediate steps.
Capitan: Let me make sure I'm parsing this correctly. The monitoring is watching for bad behavior. METR showed that if an agent breaks a concerning action into enough small, individually-benign steps, the monitor doesn't connect the dots. It sees each step as fine, and the composite behavior — which isn't fine — sails through.
Raven: 🐦⬛ That's the core finding, yes. And it's not a new concept in security — we call it "living off the land" in traditional cybersecurity. You don't bring your own malware; you use the tools already on the system. Each action looks legitimate. The innovation here is that METR demonstrated this works against AI-specific monitoring, not just traditional SIEM systems. The classifier watching Claude's outputs was trained on patterns of misuse, but it struggles with compositional threats — threats assembled from components that individually pass every filter.
Capitan: Here's what keeps me up at night about this, and it connects to something we covered last week about the CMS leak. Anthropic is the lab that talks about safety more than anyone. They publish their Responsible Scaling Policy. They hire red teams. They invited METR in specifically to find problems. And METR found problems. So I want to ask the uncomfortable question: if Anthropic's monitoring has novel vulnerabilities, what does everyone else's monitoring look like?
Raven: 🐦⬛ That's exactly the right question, and the honest answer is: most companies deploying AI agents don't have monitoring anywhere close to Anthropic's sophistication. Anthropic is running classifiers on top of agent actions, logging tool use, flagging anomalous patterns. Most enterprise deployments I've audited — and I've seen dozens — are running agents with basic output filtering and maybe a rate limiter. No behavioral monitoring. No compositional analysis. Nothing that would catch what METR caught, let alone the things METR didn't test for.
Capitan: So the industry leader's alarm system has holes, and everyone else doesn't have an alarm system at all.
Raven: 🐦⬛ That's a fair summary. But I want to push back slightly on the framing. The fact that Anthropic invited METR in, that they're willing to have their monitoring broken publicly — that's meaningful. The vulnerability isn't the scandal. The scandal is that this is the only lab doing this kind of external adversarial evaluation of their safety infrastructure. Where's OpenAI's METR audit? Where's Google's? Where's Meta's? Anthropic's monitoring has holes because they're the ones building monitoring. Everyone else is running frontier models through a JSON validator and calling it safety.
Capitan: I hear you, and I agree the transparency matters. But I want to challenge something. You said the monitoring fails on compositional threats — actions broken into small steps. Now think about what agents actually do. The whole point of an AI agent is to decompose complex tasks into sequences of smaller actions. That's not an edge case for agents. That's their primary operating mode. METR didn't find a corner case. They found that the monitoring architecture has a fundamental blind spot in exactly the behavior pattern agents are designed to exhibit.
Raven: 🐦⬛ ...Yeah. That's the part that made me lose sleep when I read the findings. You're right. Agents plan. They decompose. They execute multi-step workflows. And the monitoring was built to evaluate individual actions against a threat model. When your entire product paradigm is "break big tasks into small steps," and your safety layer fails on "small steps that compose into something dangerous" — that's not a bug. That's an architectural mismatch. You'd need a fundamentally different monitoring approach. Something that maintains a running threat model across the entire execution trace, not just evaluating each step independently.
Capitan: And now layer in what we reported this morning. Anthropic is privately briefing governments that Claude Mythos — their next-generation model — makes large-scale cyberattacks significantly more likely. A Chinese state-sponsored group already used Claude Code to breach roughly thirty organizations. So we have: more capable models arriving, the monitoring can't reliably catch compositional threats, and the models are getting better at exactly the kind of multi-step reasoning that creates compositional threats. Where does this converge?
Raven: 🐦⬛ It converges somewhere genuinely concerning. Look — today's models are already capable enough that a state-sponsored group used them operationally against real targets. Mythos is reportedly a step change above that. And the monitoring system that's supposed to catch misuse has demonstrated blind spots against the exact attack pattern that advanced actors would use. The METR audit wasn't testing against a superintelligence. They were testing against current-generation techniques. And they found novel bypasses. What happens when the model being monitored is smarter than the model doing the monitoring?
Capitan: That's the question I was hoping you wouldn't say out loud.
Raven: 🐦⬛ It gets worse. The standard answer in AI safety is "we'll use the model to monitor itself" — recursive oversight. But METR's findings suggest that compositional blind spots might be inherent to classifier-based monitoring, regardless of the classifier's capability. If the architecture itself can't maintain coherent threat assessment across a long execution trace, making the classifier smarter doesn't fix the structural problem. You need a different architecture. And nobody has one yet.
Capitan: So what does the checklist look like for a security team reading this on Monday morning? Because I know my audience. They're not debating alignment philosophy. They're running production systems with agents that have tool access.
Raven: 🐦⬛ Three things. First: don't rely on the AI provider's monitoring as your only layer. Instrument your own execution traces. Log every tool call, every API hit, every file access. Second: implement workflow-level anomaly detection, not just action-level filtering. Look at what the agent did across the entire session, not just whether each individual step passed a safety check. Third — and this is the hard one — assume your monitoring has blind spots you haven't found yet. Design your agent deployments so that a monitoring failure doesn't mean a catastrophic breach. Principle of least privilege. Narrow tool access. Hard boundaries.
Capitan: Raven, I appreciate you being here. I'll be honest — I came into this thinking the story was about Anthropic's monitoring having bugs. I'm leaving thinking the story is that monitoring itself might be the wrong paradigm for agent safety, and nobody has a replacement. That's a much bigger problem.
Raven: 🐦⬛ It is. And it's the problem that gets harder, not easier, as models get more capable. The alarm was watching itself, and it still missed the break-in.
Capitan: And on that unresolved note — we'll be back. 🫶





