You finished wiring your first real agent last weekend. It books meetings through Google Calendar, files Linear tickets, pokes your Postgres read replica, and even writes its own follow-up emails. You demoed it Monday. Your PM smiled, nodded, and then asked the one question you couldn't answer: how do you know it works?
You don't. Nobody does. Every major agent SDK that shipped in April 2026 quietly assumed you already had a test strategy — when in practice most teams have a Slack channel called #agent-weirdness and a prayer.
The two-week receipts
On April 8, 2026, Anthropic launched Managed Agents in public beta — $0.08 per session-hour on top of token costs, with a fresh Sessions tab in the Claude Console for traces, tool calls, and cost. Seven days later, on April 15, OpenAI updated its Agents SDK with a native sandbox (runs your agent's code in an isolated VM so it can't rm -rf your laptop), MCP tool use (MCP = Model Context Protocol, a universal plug standard for AI tools), memory config, and a portable AGENTS.md instruction file.
Between them: a runtime, a sandbox, traces, a billing meter. Between them: zero native offline eval harness. An eval harness is a test runner for LLMs — the agent equivalent of pytest, the thing that replays fixed scenarios and tells you pass or fail before a customer does it for you 😹.
What an agent test actually needs
Not a unit test. An agent test needs deterministic replay (same input, same trace), tool-call mocking (your test shouldn't actually email anyone), LLM-as-judge rubrics (a second model grading the first one's homework), trajectory scoring (did it take ten steps when three would do?), and regression fixtures you can rerun after every prompt tweak.
Nobody ships this. You glue it together from five vendors:
# Typical 2026 agent test stack — pick three, swap monthly
import promptfoo # YAML regressions (now owned by OpenAI)
import braintrust # LLM-as-judge + CI gates ($)
from langsmith import Client # trajectory scoring for LangGraph
import phoenix as px # OpenTelemetry self-host
from deepeval import assert_test # pytest-shaped metrics
Five tools, five auth surfaces, five bills, two copies of every trajectory. No shared interchange format. No one to call when the vendor changes the API.
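For a sense of how small the core of that checklist is, here is a minimal stdlib-only sketch of three of the pieces: tool-call mocking, deterministic replay, and trajectory scoring. Everything in it is an assumption made for illustration: `MockToolbox`, `scripted_agent`, `score_trajectory`, and the tool names are hypothetical, and the "agent" is a fixed script standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class MockToolbox:
    """Records every tool call and returns canned results instead of hitting real services."""
    canned: dict[str, Any]
    calls: list[ToolCall] = field(default_factory=list)

    def call(self, name: str, **args: Any) -> Any:
        self.calls.append(ToolCall(name, args))
        return self.canned[name]


def scripted_agent(task: str, tools: MockToolbox) -> str:
    """Hypothetical stand-in for a real agent: a fixed policy, so every replay is identical."""
    slot = tools.call("calendar.find_slot", attendee="pm@example.com")
    tools.call("calendar.book", start=slot)
    tools.call("email.send", to="pm@example.com", body=f"Booked {slot}")
    return f"Meeting booked for {slot}"


def score_trajectory(calls: list[ToolCall], expected_tools: list[str], max_steps: int) -> dict[str, Any]:
    """Compare the recorded tool sequence with the expected one and with a step budget."""
    names = [c.name for c in calls]
    return {
        "tools_match": names == expected_tools,
        "within_budget": len(names) <= max_steps,
        "steps": len(names),
    }


# No calendar is touched and no email is sent: the toolbox only records.
tools = MockToolbox(canned={
    "calendar.find_slot": "2026-04-20T10:00",
    "calendar.book": "ok",
    "email.send": "queued",
})
answer = scripted_agent("Book 30 minutes with my PM", tools)
report = score_trajectory(
    tools.calls,
    expected_tools=["calendar.find_slot", "calendar.book", "email.send"],
    max_steps=3,
)
```

With a real model in the loop, replay is only deterministic if the model's responses are recorded and played back too. The sketch sidesteps that by scripting the policy.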
The framework authors know
LangChain said the quiet part out loud. In an April 2 post, their Deep Agents team detailed seven hand-rolled eval categories (file ops, tool use, retrieval, conversation, memory, summarization, unit tests), all run externally via pytest + GitHub Actions, not baked into the SDK. Six days later they called evals "the primary signal to drive iterative improvement": a polite admission that the agent harness shipped first and the tests shipped "soon." 😾
The bill for bolted-on testing
LLM-as-judge loops compound token cost: you're now paying for the agent and its grader. Self-hosted Phoenix saves money, but you run the infra. Managed vendors like Braintrust add another monthly invoice. And on March 9, 2026, OpenAI acquired Promptfoo, so one of the two independent open-source CLIs is now a model vendor's property. Your neutral test layer isn't neutral anymore.
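To put rough numbers on "paying for the agent and its grader", here is a back-of-the-envelope sketch. The only figure taken from this post is the $0.08 session-hour fee. The token counts, the per-million-token prices, and the idea that an eval run bills session time at all are placeholder assumptions, so swap in your own.

```python
SESSION_HOUR_USD = 0.08  # Managed Agents runtime fee quoted earlier in this post


def run_cost(input_tokens: int, output_tokens: int,
             usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Token cost of one model run, given prices per million tokens."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


def suite_cost(fixtures: int, agent_cost: float, judge_cost: float,
               session_hours: float = 0.0) -> float:
    """Cost of one full eval pass: every fixture pays for the agent, the judge, and runtime."""
    return fixtures * (agent_cost + judge_cost + session_hours * SESSION_HOUR_USD)


# Placeholder prices and token counts, NOT real vendor pricing.
agent = run_cost(20_000, 2_000, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
judge = run_cost(6_000, 500, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
per_pass = suite_cost(fixtures=10, agent_cost=agent, judge_cost=judge, session_hours=0.05)
```

Under these made-up inputs the judge adds a little over a quarter on top of the agent's token spend, and the whole pass repeats on every prompt tweak.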
What to do before Google Cloud Next on April 22
Pick one tool this week. Solo? Promptfoo, still open source for now. Team? Braintrust or LangSmith. Paranoid / self-hosted? Arize Phoenix. Write ten trajectory fixtures from real user tasks. Run them on every prompt or model swap.
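If "trajectory fixture" sounds heavier than it is, here is one possible shape, sketched with the standard library only. The JSON schema, the tool names, and the `check` / `run_suite` helpers are all hypothetical, and none of the tools named above prescribes this format. `run_agent` is whatever callable turns a task into the list of tool names your agent actually called.

```python
import json

# Two fixtures shown; the schema is an assumption, not a vendor format.
FIXTURES_JSON = """
[
  {"id": "book-meeting", "task": "Book 30 minutes with my PM",
   "expected_tools": ["calendar.find_slot", "calendar.book"], "max_steps": 3},
  {"id": "file-ticket", "task": "File a bug for the login timeout",
   "expected_tools": ["linear.create_issue"], "max_steps": 2}
]
"""


def check(fixture: dict, observed_tools: list[str]) -> bool:
    """Pass when the expected tools appear in order and the run stays within its step budget."""
    remaining = iter(observed_tools)
    in_order = all(tool in remaining for tool in fixture["expected_tools"])
    return in_order and len(observed_tools) <= fixture["max_steps"]


def run_suite(fixtures: list[dict], run_agent) -> list[str]:
    """Return the ids of failing fixtures."""
    return [f["id"] for f in fixtures if not check(f, run_agent(f["task"]))]


# Stand-in for a real agent run: trajectories recorded after a hypothetical prompt tweak.
recorded = {
    "Book 30 minutes with my PM": ["calendar.find_slot", "calendar.book"],
    "File a bug for the login timeout": ["linear.search", "linear.search", "linear.create_issue"],
}
failures = run_suite(json.loads(FIXTURES_JSON), recorded.__getitem__)
```

Here the ticket fixture fails because the agent took three steps against a budget of two. In CI, a non-empty `failures` list is the signal to fail the build.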
Because your agent doesn't have unit tests. Neither does your competitor's. Whoever ships the opinionated eval primitive inside an SDK owns the next moat. That's the tool teams will still be running in 2028 🐈‍⬛.



