How to Test an AI Agent: Tool-Call Assertions, Not Vibes

You built an AI agent: a program that uses an LLM (large language model, the brain behind ChatGPT, Claude, and Gemini) to run tools on its own. It queries databases, checks permissions, sends emails. It works every time in your terminal. Then you open the tests/ directory and find it empty. Not because you are lazy, but because your SDK (software development kit, the framework's toolbox) shipped exactly zero testing utilities.

Welcome to agent development in April 2026.

The Gap (in brief)

We covered the full story of the SDK landscape in the companion news piece. The short version: Anthropic ships zero testing utilities, OpenAI declined to export its internal FakeModel, and Google ADK's evaluator returns 0.0 even on correct matches because of an open bug. Nobody is handing you the tools to test agents.

The next obvious idea: "I'll just assert on the agent's text output." Don't. An LLM phrases its answer differently on every run. Setting temperature (the setting that controls how random the model's output is) to zero won't save you either: a different model version or a different day can still produce different phrasing. Matching on text is like testing a coin flip.

What doesn't change? The tool-call sequence. When your agent handles "check if user can send email", it is expected to call lookup_user → check_permissions → send_email every time, however it words the response. The action pattern is the contract. Test that.

This guide gives you a working test harness in about 60 lines: mocks, golden fixtures, and a CI split, everything the SDKs should have shipped but didn't.

Recipe: behavioral testing in ~60 lines

We'll build a test harness, a small framework that keeps an eye on your agent, which:

Mocks the LLM so tests run in milliseconds, at zero cost
Records which tools the agent called, and in what order
Asserts on the sequence, not on the words

Step 1: Choose a mock strategy

There are two routes. Mock-model tests replace the LLM with a scripted responder: instant, free, deterministic. Real-model smoke tests call the actual API and record the responses: realistic, but expensive and flaky. You need both. Mocks run in CI (on every pull request); real-model tests run on a nightly cron.

For mocking, Pydantic AI has the best primitives: TestModel (auto-calls every tool the agent has) and FunctionModel (you script the exact response). If you aren't using Pydantic AI, we'll build the same thing from scratch.
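For readers without Pydantic AI, a scripted responder can be sketched in plain Python. This is a minimal illustration, not any SDK's API: the names ToolCall, ScriptedModel, and next_step are hypothetical, and a real agent loop would need an adapter that turns each scripted step into whatever message type your SDK expects.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation the fake model 'decides' to make (hypothetical type)."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class ScriptedModel:
    """Hypothetical stand-in for an LLM: replays a fixed list of steps.

    Each step is either a ToolCall or a final text answer (a str).
    """
    script: list
    _cursor: int = 0

    def next_step(self, conversation: list):
        """Return the next scripted step, ignoring the conversation so far."""
        if self._cursor >= len(self.script):
            raise RuntimeError("script exhausted: agent asked for more steps than scripted")
        step = self.script[self._cursor]
        self._cursor += 1
        return step


# Script for the email flow: two tool calls, then a final answer.
model = ScriptedModel(script=[
    ToolCall("lookup_user", {"email": "user@example.com"}),
    ToolCall("check_permissions", {"user_id": 1, "action": "send_email"}),
    "Done.",
])
```

Because the script ignores the conversation, every run is identical, which is exactly what a PR-blocking test needs. The trade-off is that it says nothing about what a real model would choose.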

Step 2: Build the interceptor

This is a pure-pytest approach: no framework dependency, and it works with any SDK:

# test_agent_tools.py
import json
from dataclasses import dataclass, field

@dataclass
class ToolCallRecord:
    """One recorded tool invocation."""
    name: str
    arguments: dict
    order: int

@dataclass
class AgentRecorder:
    """Intercepts and records every tool call."""
    calls: list[ToolCallRecord] = field(default_factory=list)

    def record(self, tool_name: str, arguments: dict):
        self.calls.append(ToolCallRecord(
            name=tool_name,
            arguments=arguments,
            order=len(self.calls),
        ))

    @property
    def sequence(self) -> list[str]:
        """Just the tool names, in order."""
        return [c.name for c in self.calls]

    def assert_sequence(self, expected: list[str]):
        assert self.sequence == expected, (
            f"Expected {expected}, got {self.sequence}"
        )

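    def assert_called_with(self, tool_name: str, **expected_args):
        # Checks only the first recorded call to tool_name
        match = [c for c in self.calls if c.name == tool_name]
        assert match, f"{tool_name} was never called"
        assert match[0].arguments == expected_args, (
            f"{tool_name} called with {match[0].arguments}, expected {expected_args}"
        )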

That's about 35 lines. The recorder captures every tool invocation and exposes two assertions: assert_sequence (right tools, right order?) and assert_called_with (right arguments?).

Step 3: Wire it to the agent

Wrap the tool functions so the recorder sees every call:

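import functools
import inspect

def make_tracked_tool(original_fn, recorder: AgentRecorder):
    """Wrap an async tool function to record each call before executing it."""
    signature = inspect.signature(original_fn)
    @functools.wraps(original_fn)  # keep __name__ and docstring for tool registration
    async def tracked(*args, **kwargs):
        # Map positional arguments to parameter names so they are recorded too
        bound = signature.bind(*args, **kwargs)
        recorder.record(original_fn.__name__, dict(bound.arguments))
        return await original_fn(*args, **kwargs)
    return tracked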
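A short usage sketch may help here. The block repeats the wrapper so it runs on its own, in a variant that uses inspect.signature to capture positional arguments by name. ListRecorder and lookup_user are hypothetical stand-ins, not part of any SDK.

```python
import asyncio
import functools
import inspect


class ListRecorder:
    """Minimal stand-in for AgentRecorder: stores (name, arguments) pairs."""

    def __init__(self):
        self.calls = []

    def record(self, tool_name, arguments):
        self.calls.append((tool_name, arguments))


def make_tracked_tool(original_fn, recorder):
    """Wrap an async tool so each call is recorded before it executes."""
    signature = inspect.signature(original_fn)

    @functools.wraps(original_fn)
    async def tracked(*args, **kwargs):
        # Normalize positional and keyword arguments into one name -> value dict
        bound = signature.bind(*args, **kwargs)
        recorder.record(original_fn.__name__, dict(bound.arguments))
        return await original_fn(*args, **kwargs)

    return tracked


async def lookup_user(email: str) -> dict:
    """Hypothetical tool: a real one would query a user store."""
    return {"id": 1, "email": email}


async def main():
    recorder = ListRecorder()
    tracked = make_tracked_tool(lookup_user, recorder)
    user = await tracked("user@example.com")      # positional call
    await tracked(email="other@example.com")      # keyword call
    return recorder, user


recorder, user = asyncio.run(main())
```

Both calls end up recorded with an email key, so an assertion on arguments does not depend on how the SDK happens to pass them.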

faion is a CLI assistant that generates code scaffolding from structured prompts. Paste the block below and it will build the recorder and wrapper to fit your SDK.

/faion
Generate an AgentRecorder class and make_tracked_tool wrapper for behavioral testing of an AI agent.
Requirements:
- Pure Python, no framework dependency, dataclass-based
- AgentRecorder must record tool name, arguments, and call order
- Must expose assert_sequence(expected_names) and assert_called_with(tool_name, **kwargs)
- make_tracked_tool wraps any async tool function to record calls before executing
- Detect my agent SDK (anthropic / openai-agents / google-adk / pydantic-ai) from imports and adapt the wrapper signature accordingly
- Include a pytest fixture that returns a fresh AgentRecorder

Step 4: Write the first behavioral test

import pytest

@pytest.fixture
def recorder():
    return AgentRecorder()

@pytest.mark.asyncio
async def test_email_permission_flow(recorder, mock_agent):
    """Agent MUST check permissions before sending email."""
    await mock_agent.run("Send the quarterly report to [email protected]")

    # The CONTRACT: always check before sending
    recorder.assert_sequence([
        "lookup_user",
        "check_permissions",
        "send_email",
    ])

    # Verify the agent looked up the right user
    recorder.assert_called_with(
        "lookup_user", email="[email protected]"
    )

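This test doesn't care what the agent says. It cares what the agent does. The text can change on every run. The permission check must not. The mock_agent fixture is assumed here: an agent wired to a scripted model, with its tools wrapped by make_tracked_tool so that calls land in recorder.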
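The test above relies on a mock_agent fixture that the guide does not define. One way it could look is sketched below, under stated assumptions: MockAgent, build_mock_agent, and the stub tools are hypothetical names, and the agent simply replays a scripted list of tool calls instead of asking a model.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class MockAgent:
    """Hypothetical agent double: replays scripted tool calls, no LLM involved."""
    tools: dict            # tool name -> async callable
    script: list           # [(tool_name, kwargs), ...] the fake model "chooses"
    final_text: str = "ok"

    async def run(self, prompt: str) -> str:
        for tool_name, kwargs in self.script:
            await self.tools[tool_name](**kwargs)
        return self.final_text


def build_mock_agent(record):
    """Wire three stub tools so each call is reported to record(name, kwargs)."""
    def stub(name, result):
        async def tool(**kwargs):
            record(name, kwargs)
            return result
        return tool

    tools = {
        "lookup_user": stub("lookup_user", {"id": 1}),
        "check_permissions": stub("check_permissions", True),
        "send_email": stub("send_email", "sent"),
    }
    script = [
        ("lookup_user", {"email": "user@example.com"}),
        ("check_permissions", {"user_id": 1, "action": "send_email"}),
        ("send_email", {"user_id": 1, "subject": "Quarterly report"}),
    ]
    return MockAgent(tools=tools, script=script)


# Stand-alone demo: collect tool names in a plain list.
seen = []
agent = build_mock_agent(lambda name, kwargs: seen.append(name))
asyncio.run(agent.run("Send the quarterly report"))
```

In a pytest suite, a fixture named mock_agent could return build_mock_agent(recorder.record), since AgentRecorder.record takes the same two arguments. Note that this double exercises your wiring, not the model's judgment.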

Step 5: Test the error recovery paths

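The news piece covered the "happy path" thesis. A guide earns its keep on the ugly paths: when a tool throws, when the model retries, when it hallucinates a tool that doesn't exist. The tests below assume the mock_agent fixture exposes two hooks, configure_tool_failure and inject_fake_tool_call, and that your harness defines an UnknownToolError exception; none of these come from an SDK.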

@pytest.mark.asyncio
async def test_graceful_degradation_on_tool_failure(recorder, mock_agent):
    """When send_email fails, agent MUST NOT retry without re-checking permissions."""
    mock_agent.configure_tool_failure("send_email", error="SMTP timeout")

    await mock_agent.run("Send the quarterly report to [email protected]")

    recorder.assert_sequence([
        "lookup_user",
        "check_permissions",
        "send_email",        # fails
        "check_permissions",  # must re-verify before retry
        "send_email",        # retry
    ])

@pytest.mark.asyncio
async def test_unknown_tool_call_is_caught(recorder, mock_agent):
    """If model hallucinates a tool name, harness catches it."""
    mock_agent.inject_fake_tool_call("send_fax")  # doesn't exist

    with pytest.raises(UnknownToolError):
        await mock_agent.run("Fax the report")

    # Agent should have called nothing successfully
    assert recorder.sequence == []

Error recovery is where agents quietly break in production. A mock that simulates tool failures is free, and it catches bugs your happy-path tests never will.
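The two hooks and the exception used in these tests are not provided by any SDK, so here is one hedged sketch of how a mock agent might implement them. Everything in it is an assumption made for illustration: the class name FailureInjectingAgent, the guards mapping, and the policy of re-running a guard tool before a single retry.

```python
import asyncio


class UnknownToolError(Exception):
    """Raised when the (fake) model asks for a tool that is not registered."""


class FailureInjectingAgent:
    """Hypothetical mock agent with failure injection and a guarded retry."""

    def __init__(self, tools, script, record, guards=None):
        self.tools = tools              # name -> async callable
        self.script = list(script)      # [(tool_name, kwargs), ...]
        self.record = record            # callback: record(name, kwargs)
        self.guards = guards or {}      # tool -> (guard_tool, guard_kwargs)
        self.failures = {}              # tool -> error message for its next call

    def configure_tool_failure(self, tool_name, error):
        """Make the next call to tool_name raise RuntimeError(error), once."""
        self.failures[tool_name] = error

    def inject_fake_tool_call(self, tool_name, **kwargs):
        """Put a call to a possibly unregistered tool at the front of the script."""
        self.script.insert(0, (tool_name, kwargs))

    async def _call(self, name, kwargs):
        if name not in self.tools:
            raise UnknownToolError(name)       # rejected before anything is recorded
        self.record(name, kwargs)              # record the attempt, even if it fails
        if name in self.failures:
            raise RuntimeError(self.failures.pop(name))
        return await self.tools[name](**kwargs)

    async def run(self, prompt):
        for name, kwargs in self.script:
            try:
                await self._call(name, kwargs)
            except RuntimeError:
                guard = self.guards.get(name)
                if guard is not None:
                    await self._call(*guard)   # re-verify before retrying
                await self._call(name, kwargs) # single retry


async def ok(**kwargs):
    return "ok"


TOOLS = {"lookup_user": ok, "check_permissions": ok, "send_email": ok}
SCRIPT = [("lookup_user", {}), ("check_permissions", {}), ("send_email", {})]

seen = []
agent = FailureInjectingAgent(
    TOOLS, SCRIPT, record=lambda n, k: seen.append(n),
    guards={"send_email": ("check_permissions", {})},
)
agent.configure_tool_failure("send_email", error="SMTP timeout")
asyncio.run(agent.run("Send the quarterly report"))
```

The useful property is that the retry rule lives in orchestration code you own, so the test pins down that rule rather than a model's mood. If your retries are decided by the model itself, this belongs in the nightly smoke suite instead.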

/faion
Scaffold a full behavioral test suite for my AI agent project.
Requirements:
- Scan my agent's registered tools and generate one test per tool
- For each tool: happy-path test + error-recovery test (tool raises exception)
- Generate a mock agent fixture that uses scripted responses instead of real LLM calls
- Include conftest.py with AgentRecorder fixture and make_tracked_tool helper
- Add golden fixture save/load utilities (JSON snapshots in tests/fixtures/)
- Mark real-model tests with @pytest.mark.smoke
- Detect my SDK from imports and adapt accordingly

Step 6: Golden fixtures for regression detection

A golden fixture is a snapshot of expected behavior that you store in version control: your blessed baseline. Save the recorder's output as JSON:

# conftest.py
import json
from pathlib import Path

FIXTURE_DIR = Path("tests/fixtures")

def save_golden(recorder: AgentRecorder, name: str):
    """Save current tool-call sequence as the blessed baseline."""
    fixture = {
        "sequence": recorder.sequence,
        "calls": [
            {"name": c.name, "arguments": c.arguments}
            for c in recorder.calls
        ],
    }
    FIXTURE_DIR.mkdir(parents=True, exist_ok=True)  # create tests/fixtures/ on first save
    (FIXTURE_DIR / f"{name}.json").write_text(json.dumps(fixture, indent=2))

def assert_matches_golden(recorder: AgentRecorder, name: str):
    """Assert current behavior matches the saved baseline."""
    fixture = json.loads((FIXTURE_DIR / f"{name}.json").read_text())
    assert recorder.sequence == fixture["sequence"], (
        f"Behavioral drift detected!\n"
        f"  Golden: {fixture['sequence']}\n"
        f"  Actual: {recorder.sequence}"
    )

When you upgrade the model from Claude Sonnet 4 to the next version, run the suite. If the agent starts skipping check_permissions, you'll find out before your users do.
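When a golden comparison fails, a line-by-line diff is easier to review than two printed lists. The sketch below uses only the standard library; save_sequence and drift_report are hypothetical helper names, and a temporary directory stands in for tests/fixtures/.

```python
import difflib
import json
import tempfile
from pathlib import Path


def save_sequence(path: Path, sequence: list[str]) -> None:
    """Write a tool-call sequence as a JSON golden file, creating parent dirs."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"sequence": sequence}, indent=2))


def drift_report(path: Path, actual: list[str]) -> str:
    """Return a unified diff of golden vs. actual tool names ('' if identical)."""
    golden = json.loads(path.read_text())["sequence"]
    diff = difflib.unified_diff(
        golden, actual, fromfile="golden", tofile="actual", lineterm=""
    )
    return "\n".join(diff)


with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "fixtures" / "email_flow.json"
    save_sequence(golden_path, ["lookup_user", "check_permissions", "send_email"])

    unchanged = drift_report(
        golden_path, ["lookup_user", "check_permissions", "send_email"]
    )
    drifted = drift_report(golden_path, ["lookup_user", "send_email"])
```

Here drifted contains a line reading -check_permissions, which is the kind of signal a reviewer should never wave through without a reason.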

Step 7: CI integration

Your CI config splits into two jobs. Block Engineering's testing pyramid principle puts it precisely: "We don't run live LLM tests in CI. Too expensive, too slow, and too flaky."

# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/agents/ -m "not smoke" --tb=short

# .github/workflows/agent-smoke.yml
name: Agent Smoke Tests (Nightly)
on:
  schedule:
    - cron: '0 6 * * *'  # 06:00 UTC: 2am ET during daylight time, 1am ET in winter
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/agents/ -m smoke --tb=long

Mark real-model tests with @pytest.mark.smoke. They run nightly, alert on drift, and never block a PR.
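Custom markers work best when they are registered, otherwise pytest warns about an unknown mark (and fails under --strict-markers). A minimal conftest.py sketch follows. pytest_configure and config.addinivalue_line are standard pytest hooks; the has_live_credentials helper and the LLM_API_KEY variable name are assumptions, so substitute whatever your provider uses.

```python
# conftest.py (sketch)
import os


def pytest_configure(config):
    """Register the custom 'smoke' marker with pytest."""
    config.addinivalue_line(
        "markers",
        "smoke: real-model test that calls a live LLM API (nightly only)",
    )


def has_live_credentials(env=os.environ) -> bool:
    """True when the hypothetical LLM_API_KEY variable is set and non-empty."""
    return bool(env.get("LLM_API_KEY"))
```

A smoke test can then combine @pytest.mark.smoke with pytest.mark.skipif(not has_live_credentials(), reason="no API key"), so a contributor running the full suite locally without keys gets skips rather than failures.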

/faion
Generate GitHub Actions CI config for agent behavioral tests.
Requirements:
- Two workflow files: agent-tests.yml (on PR) and agent-smoke.yml (nightly cron at 2am ET)
- PR workflow runs pytest with -m "not smoke", fails fast
- Nightly workflow runs pytest with -m smoke, uses secrets for API keys
- Add a third workflow: agent-golden-update.yml (manual trigger) that runs tests with --update-golden flag and opens a PR with fixture diffs
- Include pip caching and Python 3.12 setup

Gotchas

1. Anthropic itself says don't do this. Their engineering blog (9 January 2026) warns plainly: "Checking that agents followed very specific steps, like a sequence of tool calls, makes for very brittle tests." They are right about exploratory multi-step flows, where several valid paths exist. They are wrong about critical business paths, where skipping one step means a security hole. Use sequence assertions for contracts (permission checks, data validation) and outcome assertions for everything else.

2. Golden fixtures go stale. Upgrade the model and every fixture can break, not because the agent is wrong but because it legitimately found a better path. The fix: review fixture diffs the way you review code. If the new sequence is valid, update the fixture. If it isn't, you have just caught a regression.

3. Mock tests lie to you. A FakeModel that always calls check_permissions tells you that your routing logic works. It tells you nothing about whether Claude or GPT will actually invoke that tool. By a rough estimate, mock tests cover about 70% of the testable surface: routing, argument parsing, error handling. The remaining 30% needs real-model smoke tests on a nightly schedule.

4. Non-determinism is structural, not a bug. Even on the same input, a model can call tools in a different order from run to run. For non-critical paths, use AgentEvals' unordered trajectory mode, which checks the set of tools rather than the sequence. Keep strict ordering only for paths where the order is the contract.

5. Test multi-agent handoffs separately. If your agent delegates to sub-agents, every handoff is a boundary. Record the delegation call like a tool call (delegate_to_research_agent), then test the sub-agent's tool sequence independently. Crossing agent boundaries in a single assertion makes everything brittle and nothing debuggable.
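Gotchas 1 and 4 both point toward assertions that are looser than an exact sequence match. Two such checks are sketched below in plain Python. The function names are hypothetical and this is not the AgentEvals API, only the same idea expressed over a list of tool names such as recorder.sequence.

```python
from collections import Counter


def assert_called_before(sequence: list[str], first: str, then: str) -> None:
    """Every call to `then` must come after at least one call to `first`.

    Passes vacuously if `then` is never called. Other tools may appear
    anywhere in between, so harmless extra steps do not break the test.
    """
    seen_first = False
    for name in sequence:
        if name == first:
            seen_first = True
        elif name == then:
            assert seen_first, f"{then} called before any {first}: {sequence}"


def assert_same_tools_unordered(sequence: list[str], expected: list[str]) -> None:
    """Same tools, same number of calls each, in any order."""
    assert Counter(sequence) == Counter(expected), (
        f"Tool multiset differs. Expected {sorted(expected)}, got {sorted(sequence)}"
    )


# A run with one extra, harmless tool call still honors the contract.
run = ["lookup_user", "fetch_calendar", "check_permissions", "send_email"]
assert_called_before(run, first="check_permissions", then="send_email")

# Order does not matter for this non-critical research path.
assert_same_tools_unordered(
    ["search_docs", "summarize", "search_web"],
    ["search_web", "search_docs", "summarize"],
)
```

The first check protects the security-relevant ordering without freezing the whole trajectory; the second suits flows where any order is acceptable as long as nothing is skipped or duplicated.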

What now

You now have a tests/agents/ directory with behavioral tests, golden fixtures in version control, and a CI split that runs fast mocks on every PR and real-model smoke tests nightly. Your agent ships with the same confidence as the rest of your codebase, not because it became deterministic but because you stopped testing the wrong thing. The text changes on every run. The actions must not.