You shipped your first production agent — an AI that doesn't just answer questions but actually does things: chains 50–200 tool calls (requests to external services like Slack, Linear, GitHub), writes code, sends messages, creates tickets. A customer depends on the result. Life is good.

Then something goes wrong. Your agent wrote to the wrong Linear project, merged a bad PR, or sent a Slack DM to the CEO instead of a junior engineer. You need to find out why, before it does it again.

So you reach for the shiny new observability tools that every major platform shipped in April 2026. On April 8, Anthropic launched Managed Agents with built-in session tracing and tool call logs. On April 10, Zed shipped Agent Metrics: dashboards tracking 2 million sessions, 15.4 million turns, and latency histograms across 536 distinct agents. On April 15, OpenAI updated its Agents SDK with auto-tracing that records every tool call, handoff, and model output with zero extra instrumentation, plus full OpenTelemetry compatibility (OpenTelemetry is the open standard for collecting traces, metrics, and logs, like a universal plug for monitoring tools).

Every vendor solved the same problem: show me what happened. And they did it well. You open the traces — spans (individual records of each operation) for every LLM invocation, every tool call, every token (a word-chunk the AI processes). Beautiful waterfall diagrams. Precise latency numbers.

But here's what breaks. An agent trace is not a microservice trace. Microservices, the small, independent programs that power most web apps, are largely deterministic: the same input produces the same output, so a bug reproduces and a fix can be verified. Agents are non-deterministic: re-run the exact same task and you may get a different decision at step 47. The trace shows that the agent chose path A, but not what path B looked like, or why it picked A over B given the 80,000 tokens of accumulated context (the AI's "working memory": everything it has read and generated up to that point).

As Simon Willison wrote in his April 3 Agentic Engineering Patterns guide: "We cannot ensure an agent is acting faithfully or diagnose problems if its operations are entirely opaque." And Sentry's April 16 post-mortem nailed the exact failure mode: every span reports status: ok, but the output is dead wrong. The bug lives between agents — one tool call silently degrades input for another agent two steps later.
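One way to catch this class of failure is to check what actually crosses each handoff instead of trusting span status. What follows is a minimal sketch, not anything taken from Sentry's post or a vendor SDK: the payload shape (`ticket_id`, `summary`, `assignee`) and the `check_handoff` helper are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical contract: the fields one step must hand to the next, and their types.
REQUIRED_FIELDS = {"ticket_id": str, "summary": str, "assignee": str}


@dataclass
class HandoffIssue:
    field: str
    problem: str


def check_handoff(payload: dict) -> list[HandoffIssue]:
    """Return every way the payload falls short of the contract (empty list = clean)."""
    issues = []
    for name, expected_type in REQUIRED_FIELDS.items():
        value = payload.get(name)
        if value is None:
            issues.append(HandoffIssue(name, "missing"))
        elif not isinstance(value, expected_type):
            issues.append(HandoffIssue(name, f"expected {expected_type.__name__}"))
        elif isinstance(value, str) and not value.strip():
            issues.append(HandoffIssue(name, "empty"))
    return issues


# The upstream tool call "succeeded", but it quietly dropped two fields.
degraded = {"ticket_id": "LIN-123", "summary": "", "assignee": None}
for issue in check_handoff(degraded):
    print(f"{issue.field}: {issue.problem}")
```

A check like this turns a silent degradation into a loud one at the boundary where it happens, rather than two steps later when the output is already wrong.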

Debugging a wrong agent decision from its trace is like debugging a human decision from their Google Calendar — you see what meetings they had, not what they were thinking.

The market's workaround: LLM-as-judge, which uses a second AI model to evaluate the first one's decisions. Braintrust (raised $80M in February at an $800M valuation), LangSmith's trajectory evals, and Arize Phoenix all bolt evaluation onto traces. But each adds a new dependency and extra token cost per decision point. And here's the kicker: the judge is itself an LLM, so it can share the same architectural blind spots as the agent it's judging.
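To make the pattern concrete, here is a rough sketch of a judge applied to a single decision point. It does not reflect how Braintrust, LangSmith, or Phoenix implement evaluation; the `judge_decision` function, the prompt wording, and the PASS/FAIL protocol are all assumptions for illustration. The judge is passed in as a plain callable so that a stub can stand in for a real model client.

```python
from typing import Callable

# Any callable that takes a prompt and returns the model's text reply.
# In a real system this would wrap a model client; here it stays abstract.
Judge = Callable[[str], str]


def judge_decision(judge: Judge, context: str, options: list[str], chosen: str) -> dict:
    """Ask a second model to grade one decision and parse its verdict."""
    prompt = (
        "You are reviewing one decision made by an autonomous agent.\n"
        f"Context:\n{context}\n\n"
        f"Options available: {options}\n"
        f"Option chosen: {chosen}\n"
        "Reply with exactly PASS or FAIL on the first line, "
        "then one sentence of justification."
    )
    reply = judge(prompt)
    lines = reply.strip().splitlines()
    verdict = lines[0].strip().upper() if lines else ""
    return {
        "verdict": verdict if verdict in {"PASS", "FAIL"} else "UNPARSEABLE",
        "justification": " ".join(lines[1:]).strip(),
        "prompt_chars": len(prompt),  # rough proxy for the extra cost per decision
    }


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "FAIL\nThe ticket is confidential, so a public channel is not allowed."


result = judge_decision(
    stub_judge,
    context="Ticket is labeled 'confidential'.",
    options=["post to #general", "DM the assignee"],
    chosen="post to #general",
)
print(result["verdict"], "-", result["justification"])
```

Note that the judge only sees what you put in the prompt. If the context that misled the agent is also what you hand the judge, it may well be misled the same way.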

The practical workaround today: force your agent to emit structured reasoning at every branching point. Not just the tool call record, but a JSON rationale:

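import json
import logging

# One JSON object per line keeps the trail greppable and machine-parseable.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent")

def get_current_context_length() -> int:
    """Placeholder: swap in your framework's token counter."""
    return 0

def log_decision(step: int, options: list[str], chosen: str, reasoning: str) -> dict:
    """Emit a searchable reasoning trail at every branch."""
    entry = {
        "event": "agent_decision",
        "step": step,
        "options_considered": options,
        "chosen_action": chosen,
        "reasoning": reasoning,
        "context_tokens_used": get_current_context_length(),
    }
    logger.info(json.dumps(entry))
    return entry

# Inside your agent loop:
log_decision(
    step=47,
    options=["post to #general", "post to #engineering", "DM the assignee"],
    chosen="DM the assignee",
    reasoning="Ticket is labeled 'confidential', channel posting violates policy"
)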

Yes, it costs extra tokens. Yes, it slows execution. But it's the only way to build a searchable reasoning trail that a human can audit after a failure — because the traces alone won't tell you why step 47 went sideways.
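What "searchable" could look like after an incident: a small filter over the decision log. This is a sketch under one loud assumption, namely that each decision entry was written as one JSON object per line with the field names used above. The `find_decisions` helper and the sample log are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def find_decisions(
    log_lines: Iterable[str],
    step: Optional[int] = None,
    rejected: Optional[str] = None,
) -> Iterator[dict]:
    """Yield decision entries matching a step number and/or an option
    that was considered but not chosen."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise in the log
        if not isinstance(entry, dict):
            continue
        if step is not None and entry.get("step") != step:
            continue
        if rejected is not None:
            considered = entry.get("options_considered", [])
            if rejected not in considered or entry.get("chosen_action") == rejected:
                continue
        yield entry


# Assumed log format: one JSON object per line, mixed with unrelated lines.
sample_log = [
    '{"step": 46, "options_considered": ["retry", "skip"], '
    '"chosen_action": "retry", "reasoning": "Transient 502"}',
    "WARN rate limit hit",
    '{"step": 47, "options_considered": ["post to #general", "DM the assignee"], '
    '"chosen_action": "DM the assignee", "reasoning": "Ticket is labeled confidential"}',
]

# "Where did the agent consider posting to #general and decide against it?"
for entry in find_decisions(sample_log, rejected="post to #general"):
    print(entry["step"], entry["reasoning"])
```

That is the question a trace cannot answer: not which tool was called, but which alternatives were on the table and why they lost.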

Every platform shipped "what happened." Nobody shipped "why this path, not that one." The first company that builds reasoning-native observability — not trace-native, reasoning-native — will own the debugging layer for every production agent, the way Datadog owns monitoring for microservices. Until then, you're reading your agent's calendar and guessing what it was thinking.