How to Test Your AI Agent: Tool-Call Assertions Instead of Vibes

You built an AI agent — a program that uses an LLM (large language model — the brain behind ChatGPT, Claude, Gemini) to call tools autonomously. It queries databases, checks permissions, sends emails. In your terminal, it works every time. Then you open the tests/ directory and realize: it's empty. Not because you're lazy, but because your SDK (software development kit — the toolbox a framework gives you) shipped exactly zero testing utilities.

Welcome to agent development in April 2026.

The gap (briefly)

We covered the SDK landscape in detail in the companion news piece. The short version: Anthropic ships zero testing utilities, OpenAI refuses to export their internal FakeModel, and Google ADK's evaluator has an open bug that returns 0.0 on correct matches. Nobody's giving you tools to test your agents.

The obvious next thought: "I'll just assert on the agent's text output." Don't. An LLM rephrases its answer every run. Setting temperature (a setting that controls how random the model's output is) to zero doesn't save you: the wording still shifts between model versions and from one day to the next. Text matching is testing a coin flip.

What should stay fixed? The tool-call sequence on your critical paths. When your agent handles "check if user can send email," it should call lookup_user → check_permissions → send_email every time, regardless of how it words the response. The action pattern is the contract. Test that.

This guide gives you a working test harness in ~60 lines. Mocks, golden fixtures, CI split — everything the SDKs should have shipped but didn't.

The recipe: behavioral testing in ~60 lines

We'll build a test harness — a small framework that watches your agent — that:

- Mocks the LLM so tests run in milliseconds at zero cost
- Records what tools the agent calls and in what order
- Asserts on the sequence, not the words

Step 1: Pick your mock strategy

Two paths. Mock-model tests replace the LLM with a scripted responder — instant, free, deterministic. Real-model smoke tests call the actual API and record responses — realistic but expensive and flaky. You need both. Mocks in CI (every pull request), real-model on a nightly cron.

For mocking, Pydantic AI has the best primitives: TestModel (auto-calls every tool the agent has) and FunctionModel (you script the exact response). If you're not using Pydantic AI, we'll build the same thing from scratch.
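If you go the from-scratch route, the model stand-in itself is tiny. Here's a minimal sketch of a scripted responder in the spirit of FunctionModel. The names ToolCall, ScriptedModel, and next_turn are ours for illustration, not any SDK's API, and how you plug it into your agent loop depends on your framework.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One scripted request from the fake model to run a tool."""
    name: str
    arguments: dict


@dataclass
class ScriptedModel:
    """Hypothetical LLM stand-in: replays a fixed list of turns, no network."""
    turns: list          # each turn is a ToolCall or a final-answer string
    position: int = 0

    def next_turn(self, conversation: list):
        """Return the next scripted turn. The conversation is ignored on purpose."""
        if self.position >= len(self.turns):
            raise RuntimeError("script exhausted: the agent asked for more turns than scripted")
        turn = self.turns[self.position]
        self.position += 1
        return turn


model = ScriptedModel(turns=[
    ToolCall("lookup_user", {"email": "someone@example.com"}),
    ToolCall("check_permissions", {"action": "send_email"}),
    "Report sent.",
])
first = model.next_turn(conversation=[])
print(first.name)
```

Raising when the script runs out is a deliberate choice: an agent that loops more often than you expected should fail the test loudly instead of hanging.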

Step 2: Build the interceptor

Here's a pure-pytest approach — no framework dependency, works with any SDK:

# test_agent_tools.py
from dataclasses import dataclass, field

@dataclass
class ToolCallRecord:
    """One recorded tool invocation."""
    name: str
    arguments: dict
    order: int

@dataclass
class AgentRecorder:
    """Intercepts and records every tool call."""
    calls: list[ToolCallRecord] = field(default_factory=list)

    def record(self, tool_name: str, arguments: dict):
        self.calls.append(ToolCallRecord(
            name=tool_name,
            arguments=arguments,
            order=len(self.calls),
        ))

    @property
    def sequence(self) -> list[str]:
        """Just the tool names, in order."""
        return [c.name for c in self.calls]

    def assert_sequence(self, expected: list[str]):
        assert self.sequence == expected, (
            f"Expected {expected}, got {self.sequence}"
        )

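    def assert_called_with(self, tool_name: str, **expected_args):
        """Pass if any call to tool_name used exactly these arguments."""
        seen = [c.arguments for c in self.calls if c.name == tool_name]
        assert seen, f"{tool_name} was never called"
        assert expected_args in seen, (
            f"{tool_name} called with {seen}, expected {expected_args}"
        )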

That's under 40 lines. The recorder captures every tool invocation and exposes two assertions: assert_sequence (right tools, right order?) and assert_called_with (right arguments?).

Step 3: Wire it into your agent

Wrap your tool functions so the recorder sees every call:

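import functools
import inspect

def make_tracked_tool(original_fn, recorder: AgentRecorder):
    """Wrap an async tool so each call is recorded before it executes."""
    signature = inspect.signature(original_fn)
    @functools.wraps(original_fn)  # keep the tool's name and docstring for SDK registration
    async def tracked(*args, **kwargs):
        # Map positional arguments to parameter names so they are recorded too.
        bound = signature.bind(*args, **kwargs)
        recorder.record(original_fn.__name__, dict(bound.arguments))
        return await original_fn(*args, **kwargs)
    return tracked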

faion is a CLI assistant that generates code scaffolding from structured prompts — paste the block below and it will produce the recorder and wrapper tailored to your SDK.

/faion
Generate an AgentRecorder class and make_tracked_tool wrapper for behavioral testing of an AI agent.
Requirements:
- Pure Python, no framework dependency, dataclass-based
- AgentRecorder must record tool name, arguments, and call order
- Must expose assert_sequence(expected_names) and assert_called_with(tool_name, **kwargs)
- make_tracked_tool wraps any async tool function to record calls before executing
- Detect my agent SDK (anthropic / openai-agents / google-adk / pydantic-ai) from imports and adapt the wrapper signature accordingly
- Include a pytest fixture that returns a fresh AgentRecorder

Step 4: Write your first behavioral test

import pytest

@pytest.fixture
def recorder():
    return AgentRecorder()

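# mock_agent is a fixture you supply: your agent, built with a scripted model
# and with its tools wrapped by make_tracked_tool(..., recorder).
@pytest.mark.asyncio  # needs the pytest-asyncio plugin
async def test_email_permission_flow(recorder, mock_agent):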
    """Agent MUST check permissions before sending email."""
    await mock_agent.run("Send the quarterly report to [email protected]")

    # The CONTRACT: always check before sending
    recorder.assert_sequence([
        "lookup_user",
        "check_permissions",
        "send_email",
    ])

    # Verify the agent looked up the right user
    recorder.assert_called_with(
        "lookup_user", email="[email protected]"
    )

This test doesn't care what the agent says. It cares what the agent does. The text can change every run. The permission check must not.
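The mock_agent fixture in that test is yours to write, since no SDK ships one. As a rough sketch, assuming your agent boils down to "ask the model for a tool call, run it, repeat", a stand-in could look like this. MockAgent, its script format, and the stub helper are hypothetical names for illustration; UnknownToolError is defined here because the guide's tests expect some such exception, not because any SDK exports it.

```python
import asyncio


class UnknownToolError(Exception):
    """Raised when the script names a tool the agent does not have."""


class MockAgent:
    """Hypothetical stand-in for an SDK agent: no LLM, just a scripted plan."""

    def __init__(self, tools: dict, script: list):
        self.tools = tools    # tool name -> async callable (tracked or not)
        self.script = script  # [(tool_name, kwargs), ...] in call order

    async def run(self, prompt: str) -> str:
        for tool_name, kwargs in self.script:
            if tool_name not in self.tools:
                raise UnknownToolError(tool_name)
            await self.tools[tool_name](**kwargs)
        return "done"  # the text is irrelevant to behavioral tests


# Stub tools that log their own calls, standing in for tracked tools.
calls = []

def stub(name):
    async def tool(**kwargs):
        calls.append((name, kwargs))
        return {"ok": True}
    return tool

agent = MockAgent(
    tools={n: stub(n) for n in ("lookup_user", "check_permissions", "send_email")},
    script=[
        ("lookup_user", {"email": "someone@example.com"}),
        ("check_permissions", {"action": "send_email"}),
        ("send_email", {"to": "someone@example.com"}),
    ],
)
asyncio.run(agent.run("Send the quarterly report"))
print([name for name, _ in calls])
# prints ['lookup_user', 'check_permissions', 'send_email']
```

In a real suite you would return the agent from a pytest fixture named mock_agent and pass it tools wrapped with make_tracked_tool, so the shared recorder sees every call.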

Step 5: Test error recovery paths

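The news piece covered the "happy path" thesis. Guides earn their keep on the ugly paths: what happens when a tool throws, when the model retries, when it hallucinates a tool that doesn't exist. In the tests below, configure_tool_failure, inject_fake_tool_call, and UnknownToolError are hooks on your own mock agent, not SDK APIs.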

@pytest.mark.asyncio
async def test_graceful_degradation_on_tool_failure(recorder, mock_agent):
    """When send_email fails, agent MUST NOT retry without re-checking permissions."""
    mock_agent.configure_tool_failure("send_email", error="SMTP timeout")

    await mock_agent.run("Send the quarterly report to [email protected]")

    recorder.assert_sequence([
        "lookup_user",
        "check_permissions",
        "send_email",        # fails
        "check_permissions",  # must re-verify before retry
        "send_email",        # retry
    ])

@pytest.mark.asyncio
async def test_unknown_tool_call_is_caught(recorder, mock_agent):
    """If model hallucinates a tool name, harness catches it."""
    mock_agent.inject_fake_tool_call("send_fax")  # doesn't exist

    with pytest.raises(UnknownToolError):
        await mock_agent.run("Fax the report")

    # Agent should have called nothing successfully
    assert recorder.sequence == []

Error recovery is where agents silently break in production. A mock that simulates tool failures costs nothing and catches the failures your happy-path tests never will.
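One cheap way to simulate a failing tool is a wrapper that raises on the first call and then recovers. This is a sketch under our own naming: fail_first and ToolFailure are hypothetical, and your SDK may surface tool errors to the model differently.

```python
import asyncio
import functools


class ToolFailure(Exception):
    """Hypothetical error type for an injected tool failure."""


def fail_first(original_fn, error: str, times: int = 1):
    """Wrap an async tool so its first `times` calls raise, then it recovers."""
    remaining = times

    @functools.wraps(original_fn)
    async def flaky(*args, **kwargs):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise ToolFailure(error)
        return await original_fn(*args, **kwargs)

    return flaky


async def send_email(to: str) -> str:
    return f"sent to {to}"


async def demo():
    flaky_send = fail_first(send_email, error="SMTP timeout")
    outcomes = []
    for _ in range(2):  # first attempt fails, the retry succeeds
        try:
            outcomes.append(await flaky_send(to="someone@example.com"))
        except ToolFailure as exc:
            outcomes.append(f"failed: {exc}")
    return outcomes


print(asyncio.run(demo()))
# prints ['failed: SMTP timeout', 'sent to someone@example.com']
```

Apply fail_first to the raw tool first and make_tracked_tool on top of it. The recorder then logs the failed attempt before the exception fires, which is what lets a sequence assertion include the failing send_email step.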

/faion
Scaffold a full behavioral test suite for my AI agent project.
Requirements:
- Scan my agent's registered tools and generate one test per tool
- For each tool: happy-path test + error-recovery test (tool raises exception)
- Generate a mock agent fixture that uses scripted responses instead of real LLM calls
- Include conftest.py with AgentRecorder fixture and make_tracked_tool helper
- Add golden fixture save/load utilities (JSON snapshots in tests/fixtures/)
- Mark real-model tests with @pytest.mark.smoke
- Detect my SDK from imports and adapt accordingly

Step 6: Add golden fixtures for regression detection

A golden fixture is a snapshot of expected behavior that you store in version control — your blessed baseline. Save your recorder output as JSON:

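# conftest.py
import json
from pathlib import Path

# AgentRecorder is the class from Step 2; import it from wherever you keep it.

FIXTURE_DIR = Path("tests/fixtures")  # relative to the directory pytest runs from

def save_golden(recorder: AgentRecorder, name: str):
    """Save current tool-call sequence as the blessed baseline."""
    fixture = {
        "sequence": recorder.sequence,
        "calls": [
            {"name": c.name, "arguments": c.arguments}
            for c in recorder.calls
        ],
    }
    FIXTURE_DIR.mkdir(parents=True, exist_ok=True)  # first save: the directory may not exist yet
    (FIXTURE_DIR / f"{name}.json").write_text(json.dumps(fixture, indent=2))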

def assert_matches_golden(recorder: AgentRecorder, name: str):
    """Assert current behavior matches the saved baseline."""
    fixture = json.loads((FIXTURE_DIR / f"{name}.json").read_text())
    assert recorder.sequence == fixture["sequence"], (
        f"Behavioral drift detected!\n"
        f"  Golden: {fixture['sequence']}\n"
        f"  Actual: {recorder.sequence}"
    )

When you upgrade your model from Claude Sonnet 4 to whatever ships next, run the suite. If the agent starts skipping check_permissions, you'll know before your users do.
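When a golden check fails, two lists printed side by side get hard to read once sequences grow past a handful of calls. A small sketch using the standard library's difflib can point at the exact step that drifted. The helper name describe_drift is ours, not part of any testing library.

```python
import difflib


def describe_drift(golden: list[str], actual: list[str]) -> str:
    """Return a unified diff of two tool-call sequences, or '' if they match."""
    diff = difflib.unified_diff(
        golden, actual, fromfile="golden", tofile="actual", lineterm=""
    )
    return "\n".join(diff)


golden = ["lookup_user", "check_permissions", "send_email"]
actual = ["lookup_user", "send_email"]  # the permission check went missing
print(describe_drift(golden, actual))
```

The diff marks the dropped call with a leading minus sign (here, -check_permissions), so you could put the string straight into the assertion message of assert_matches_golden.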

Step 7: CI integration

Your CI config splits into two jobs. Block Engineering's testing pyramid nails the principle: "We don't run live LLM tests in CI. It's too expensive, too slow, and too flaky."

# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
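    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt  # assumes pytest, pytest-asyncio, and your SDK are listed here
      - run: pytest tests/agents/ -m "not smoke" --tb=short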

# .github/workflows/agent-smoke.yml
name: Agent Smoke Tests (Nightly)
on:
  schedule:
    - cron: '0 6 * * *'  # 06:00 UTC = 2am EDT (1am EST); GitHub cron is always UTC
jobs:
  smoke:
    runs-on: ubuntu-latest
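    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt  # assumes pytest, pytest-asyncio, and your SDK are listed here
      - run: pytest tests/agents/ -m smoke --tb=long
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}  # assumed secret name; use your provider's key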

Mark real-model tests with @pytest.mark.smoke. They run nightly, alert on drift, never block a PR.
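One detail the -m filters depend on: pytest warns about marks it doesn't know, and with --strict-markers it fails on them. A minimal sketch of registering the smoke mark from conftest.py, using pytest's pytest_configure hook (the description text is ours):

```python
# conftest.py
def pytest_configure(config):
    """Register the custom 'smoke' mark so pytest recognizes it."""
    config.addinivalue_line(
        "markers",
        "smoke: calls a real model; runs nightly, never on pull requests",
    )
```

You could declare the same marker in pytest.ini or pyproject.toml instead. Doing it in conftest.py just keeps the harness in one place.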

/faion
Generate GitHub Actions CI config for agent behavioral tests.
Requirements:
- Two workflow files: agent-tests.yml (on PR) and agent-smoke.yml (nightly cron at 2am ET)
- PR workflow runs pytest with -m "not smoke", fails fast
- Nightly workflow runs pytest with -m smoke, uses secrets for API keys
- Add a third workflow: agent-golden-update.yml (manual trigger) that runs tests with --update-golden flag and opens a PR with fixture diffs
- Include pip caching and Python 3.12 setup

Gotchas

1. Anthropic says don't do this. Their engineering blog (January 9, 2026) explicitly warns: "Checking that agents followed very specific steps like a sequence of tool calls results in overly brittle tests." They're right — for exploratory multi-step flows where multiple valid paths exist. They're wrong for critical business paths where skipping a step means a security hole. Use sequence assertions for contracts (permission checks, data validation), outcome assertions for everything else.

2. Golden fixtures go stale. Upgrade your model and fixtures start failing, sometimes because the agent regressed and sometimes because it found a legitimately better path. Fix: review fixture diffs the way you review code. If the new sequence is valid, update the fixture. If it's not, you just caught a regression.

3. Mock tests lie to you. A FakeModel that always calls check_permissions tells you your routing logic works. It tells you nothing about whether Claude or GPT will actually invoke that tool. As a rough rule of thumb, mock tests cover about 70% of the testable surface: routing, argument parsing, error handling. The remaining 30% needs real-model smoke tests on a nightly schedule.

4. Non-determinism is structural, not a bug. Even with the same input, a model might call tools in a different order across runs. For non-critical paths, use AgentEvals with unordered trajectory mode — it checks the set of tools called, not the sequence. Save strict ordering for paths where order IS the contract.

5. Test multi-agent handoffs separately. If your agent delegates to sub-agents, each handoff is a boundary. Record the delegation call as a tool call (delegate_to_research_agent), then test the sub-agent's tool sequence independently. Crossing agent boundaries in a single assertion makes everything brittle and nothing debuggable.
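Gotchas 1 and 4 both point at assertions looser than an exact sequence. Here is a plain-Python sketch of two of them. It is not the AgentEvals API, and the helper names assert_same_tools and assert_guarded are ours. The first ignores order entirely; the second checks only the contract "every action is preceded by a fresh guard", which is the rule behind the retry test in Step 5.

```python
from collections import Counter


def assert_same_tools(actual: list[str], expected: list[str]):
    """Pass if the same tools were called the same number of times, in any order."""
    assert Counter(actual) == Counter(expected), (
        f"Expected tools {sorted(expected)}, got {sorted(actual)}"
    )


def assert_guarded(sequence: list[str], guard: str, action: str):
    """Pass if every `action` call has a `guard` call since the previous `action`."""
    guard_is_fresh = False
    for position, name in enumerate(sequence):
        if name == guard:
            guard_is_fresh = True
        elif name == action:
            assert guard_is_fresh, (
                f"{action} at step {position} ran without a fresh {guard}"
            )
            guard_is_fresh = False  # the next action needs its own check


# Order does not matter here.
assert_same_tools(["fetch_weather", "fetch_news"], ["fetch_news", "fetch_weather"])

# A retry is fine as long as permissions are re-checked first.
assert_guarded(
    ["lookup_user", "check_permissions", "send_email",
     "check_permissions", "send_email"],
    guard="check_permissions",
    action="send_email",
)
```

Note that assert_guarded passes when the action never happens at all. Pair it with a check that send_email appears in the sequence if the email actually has to go out.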

What you can do now

You have a tests/agents/ directory with behavioral tests, golden fixtures in version control, and a CI split that runs fast mocks on every PR and real-model smoke tests nightly. Your agent ships with the same confidence as the rest of your codebase — not because it became deterministic, but because you stopped testing the wrong thing. The text changes every run. The actions shouldn't.