You wrote fifty lines of Python. A system prompt, a tool list, a model call. Your agent books meetings, queries databases, sends emails. It works beautifully in your terminal. Ship it?
Open your test file first. Go ahead. Try writing a test for something that gives a different answer every time you ask.
The non-determinism problem nobody wants to talk about
Here's what a normal Python test looks like:
def test_add():
assert add(2, 2) == 4 # deterministic. always 4.
Here's what an agent test looks like:
def test_booking_agent():
result = agent.run("Book a room for tomorrow 2pm")
assert result == ??? # what goes here?
# The agent might say "Done! Booked room A at 2pm"
# or "I've reserved conference room A for tomorrow at 14:00"
# or "Booked! Room A, April 27, 2:00 PM"
# All correct. All different strings.
An LLM — the brain behind ChatGPT, Claude, Gemini — is non-deterministic. Same input, different output. Every time. Your assertEqual is useless. This isn't a bug. It's the fundamental property that makes agents useful. And it makes them nearly impossible to test with tools engineers have relied on for decades.
Three SDKs shipped this month. Zero test commands.
Three major agent SDKs — toolkits for building AI agents — dropped in April. Anthropic launched Managed Agents API on April 8. OpenAI updated their Agents SDK on April 15. Google shipped ADK 1.0 alongside Agent Simulation at Cloud Next on April 22. All three hand you build-and-run primitives. None include a test command.
This isn't a comparison of three vendors. It's a pattern. The entire industry — every major AI lab — decided testing is your problem.
What they offer instead
Google came closest. At Cloud Next, they announced Agent Evaluation — a cloud service that scores your agent on task completion, tool usage, and safety. Genuinely useful for system-level metrics. But it runs on Vertex AI, costs roughly $0.50–2.00 per evaluation depending on complexity, and takes minutes to complete. Run it per pull request on a busy repo and you're looking at $300+ per month in eval fees alone — before actual API usage. That's not a unit test. That's a billing event.
Anthropic's Managed Agents API provides tracing — a log of every step your agent took — but zero test utilities. No mock model, no response recording, no assertion helpers.
OpenAI offers a seed parameter for partial reproducibility, but "partial" means "sometimes the same output, sometimes not." Try building CI/CD — automated deployment pipelines — on "sometimes."
The behavioral assertion pattern
The community figured out a workaround. Instead of asserting on exact text, assert on behavior patterns — what tools the agent called, in what order, and whether it stopped when it should have:
def test_booking_agent_behavior(recorded_responses):
trace = agent.run("Book a room for tomorrow 2pm")
# Assert on tool calls, not text output
assert trace.tool_calls[0].name == "check_availability"
assert trace.tool_calls[1].name == "create_booking"
assert "room" in trace.tool_calls[1].args
assert trace.completed # didn't loop forever
Open-source projects like DeepEval and Promptfoo pioneer this approach. Neither integrates natively with any of the three SDKs. You're gluing it yourself.
Every workaround has a price tag
Mock the model? You lose the LLM reasoning that makes agents valuable — you're testing a puppet, not an agent. Snapshot tests — recording an exact response and comparing against it later? They break on every model update or prompt tweak. Cloud evaluations? Minutes and dollars per pull request. Behavioral assertions? Hard to write, harder to maintain, and you're essentially hand-rolling a testing framework from scratch for each project.
The underlying tension: engineers built testing tools for deterministic software. Agents are neither deterministic nor, strictly speaking, software in the traditional sense. They're stochastic systems wearing a software costume.
What to actually do right now
No vendor will save you. Here's what works, with the full absurdity acknowledged — you're inventing test infrastructure in 2026 because three trillion-dollar companies couldn't be bothered:
-
Record real LLM responses as golden fixtures — saved known-good responses you replay in tests instead of calling the model every time. Yes, you're hand-building test plumbing the SDK should have shipped. No, the irony is not lost.
-
Assert on tool-call sequences, not final text. Did the agent call
check_availabilitybeforecreate_booking? Good enough. YourassertEqual-trained brain will scream. Let it. -
Run smoke tests against real models on a daily cron, not per pull request. Catch model drift without burning your CI budget.
-
Treat agent tests like integration tests — slower, fewer, higher-value — not unit tests. Five good behavioral tests beat fifty brittle snapshot tests. Anyone telling you to "just write more tests" hasn't tried testing something that generates a different haiku about room bookings each run.
# Practical: record & replay pattern
import json
def record_golden_fixture(agent, prompt, path):
"""Run once against real model, save the trace."""
trace = agent.run(prompt)
with open(path, "w") as f:
json.dump(trace.to_dict(), f)
return trace
def test_from_fixture(mock_agent, fixture_path):
"""Replay saved trace, assert on structure."""
with open(fixture_path) as f:
golden = json.load(f)
mock_agent.load_trace(golden)
result = mock_agent.replay()
assert [c["name"] for c in result.tool_calls] == [
"check_availability", "create_booking"
]
The vendor that ships agent test first wins
Django captured web developers with manage.py test. Rails did it with rails test. The first agent SDK to make testing a default primitive — not a cloud upsell at $0.50 per run, not a third-party addon, but a local command that works offline — captures the developer layer above model quality and tool count.
The untestable agent is the new untested microservice. Everyone ships it. Everyone regrets it six months later. Right now, that agent test command doesn't exist in any of the three major SDKs.
Your test file is still empty.





