AI Agents Can Now Fix Incidents, Provided Your Runbooks Aren't Folklore

At 3 a.m. your phone screams. You SSH in (a remote connection to the server's terminal) and run the same three commands you ran last month. You fix the same problem you fixed last quarter. Your fingers remember the fix before your brain wakes up.

The real exhaustion is the repetition. Not the incidents themselves, but the fact that you know the answer before you even open your laptop, and nobody has turned that answer into a script yet.

Q1 2026 made the case for automation stronger than ever. Three major platforms shipped AI agents that target that muscle memory directly. On March 12, PagerDuty announced its SRE Agent: an AI that remembers past incidents, dependencies, and conversation history, then works in four phases (detect, diagnose, remediate, learn). It arrived alongside 30+ AI partners, including Claude Code and Cursor integrations. In early March, Datadog shipped Bits AI SRE v2, roughly twice as fast as the previous version. It completes investigations in 3–4 minutes, builds an investigation plan, evaluates competing root-cause hypotheses, and refines them in real time. Grafana Labs has been rolling out its Assistant Investigations since late 2025: a multi-agent architecture (several AI agents working together, each with its own specialty) in which a lead investigator plans the work and specialized agents for Prometheus, Loki, Tempo, and Pyroscope, the monitoring backends in Grafana's stack, gather evidence in parallel.

Three companies, one core loop: ingest runbooks (step-by-step fix instructions written by humans), match patterns against incoming alerts, execute pre-approved remediation steps, and escalate only when confidence drops below a threshold. PagerDuty's agent generates updated runbooks after every incident. Datadog's new Agent Trace View gives full transparency into every investigation step, every tool call, every query. Grafana's agents produce findings and hypotheses, then hand you actionable recommendations. The machinery is real. During testing, tens of thousands of investigations ran through Datadog's system across 2,000+ customer environments.
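The loop is easier to reason about in code. What follows is a minimal sketch, not any vendor's implementation: the `Runbook` and `Decision` types, the `APPROVED_ACTIONS` allow-list, the 0.75 threshold, and the use of plain string similarity as a stand-in for "pattern matching" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Runbook:
    name: str
    symptom: str              # text an incoming alert is compared against
    steps: tuple[str, ...]    # ordered remediation action names


@dataclass(frozen=True)
class Decision:
    action: str               # "execute" or "escalate"
    steps: tuple[str, ...]
    reason: str


# Hypothetical allow-list of low-risk, reversible actions.
APPROVED_ACTIONS = frozenset(
    {"restart_service", "clear_cache", "scale_up", "toggle_feature_flag"}
)
CONFIDENCE_THRESHOLD = 0.75   # assumed value; a real team would tune this


def match_confidence(alert_text: str, runbook: Runbook) -> float:
    """Crude stand-in for pattern matching: string similarity in [0, 1]."""
    return SequenceMatcher(None, alert_text.lower(), runbook.symptom.lower()).ratio()


def handle_alert(alert_text: str, runbooks: list[Runbook]) -> Decision:
    """Match the alert to a runbook; execute only approved steps, else escalate."""
    if not runbooks:
        return Decision("escalate", (), "no runbooks ingested")
    scored = [(match_confidence(alert_text, rb), rb) for rb in runbooks]
    confidence, best = max(scored, key=lambda pair: pair[0])
    if confidence < CONFIDENCE_THRESHOLD:
        return Decision("escalate", (), f"best match {best.name!r} scored {confidence:.2f}")
    unapproved = [step for step in best.steps if step not in APPROVED_ACTIONS]
    if unapproved:
        return Decision("escalate", (), f"unapproved steps: {unapproved}")
    return Decision("execute", best.steps, f"matched {best.name!r}")


if __name__ == "__main__":
    runbooks = [
        Runbook("redis-memory", "redis cache memory above 90 percent",
                ("clear_cache", "restart_service")),
        Runbook("db-failover", "primary database unreachable", ("promote_replica",)),
    ]
    print(handle_alert("Redis cache memory above 90 percent", runbooks))
```

Note the two separate reasons to escalate: the agent is not confident it has the right runbook, or it is confident but the runbook asks for something nobody pre-approved. The second case is the one a bad runbook turns into an outage.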

The early numbers look solid, within a specific band. PagerDuty claims its agent resolves incidents up to 50% faster. Datadog cites MTTR reductions of up to 70% among early customers (MTTR is mean time to resolution: the time from something breaking to it being fixed), and its press materials mention 95% in the best cases. Strip out the vendor optimism and the honest range sits at around a 40–60% improvement, but only for well-documented, repeatable failures. That means low-risk, reversible actions: scaling servers up, restarts, cache clearing, feature flag toggles. Everything your muscle memory already handles at 3 a.m.

This is where the conventional wisdom breaks. The industry conversation focuses on AI capability: can the agent diagnose correctly, can it remediate safely, can it learn from past incidents. But as Rootly's AI SRE analysis puts it: "Incident resolution depends on tribal knowledge scattered across Slack, tickets, runbooks, code comments, and past postmortems." Most runbooks are not documentation. They are folklore with formatting applied. New engineers take 12–18 months to resolve incidents confidently, not because the incidents are complex, but because the knowledge lives in people's heads. Give a machine root access and restart permissions along with a bad runbook, and you get bad automated remediation at machine speed. The trust problem is not about AI capability. It is about a level of documentation quality that most teams have never needed to build.

High-risk flows (payments, identity, trading systems) still require human approval gates. Every vendor acknowledges this. The maturity path runs from read-only to advised to approval-based to fully autonomous. Most organizations sit somewhere in the first two stages.
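One way to picture the maturity path is as a gate that every proposed action passes through. This is a sketch under stated assumptions: the `AutonomyLevel` enum, the `HIGH_RISK_DOMAINS` set, and the `decide` function are hypothetical names, not part of any product mentioned here.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    READ_ONLY = 1          # agent may only observe and report
    ADVISED = 2            # agent recommends, a human executes
    APPROVAL_BASED = 3     # agent executes after explicit human approval
    FULLY_AUTONOMOUS = 4   # agent executes on its own


# Hypothetical list of domains that always keep a human approval gate.
HIGH_RISK_DOMAINS = frozenset({"payments", "identity", "trading"})


def decide(level: AutonomyLevel, domain: str, human_approved: bool = False) -> str:
    """Return what the agent is allowed to do with a proposed remediation."""
    if level == AutonomyLevel.READ_ONLY:
        return "observe"
    if level == AutonomyLevel.ADVISED:
        return "recommend"
    # High-risk domains need approval even at the fully autonomous level.
    needs_approval = level == AutonomyLevel.APPROVAL_BASED or domain in HIGH_RISK_DOMAINS
    if needs_approval and not human_approved:
        return "wait_for_approval"
    return "execute"


if __name__ == "__main__":
    print(decide(AutonomyLevel.FULLY_AUTONOMOUS, "cache"))
    print(decide(AutonomyLevel.FULLY_AUTONOMOUS, "payments"))
```

The design choice worth noticing is that the risk check overrides the autonomy level. Raising the level for the whole organization never silently removes the gate on payments.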

AI SRE agents do not replace on-call engineers. They replace the repetitive, soul-crushing 80% of on-call: the part that causes burnout, the part that makes good people quit. Industry analyses report that organizations adopting AI-driven incident ops see 30–50% fewer customer-visible outages. Not because the AI is smarter than you. Because it doesn't need a cup of chai to restart a pod at 3 a.m.

The ops role is shifting. Not from person-who-fixes-things to person-replaced-by-machine, but to person-who-decides-what-is-safe-to-automate. And that second job demands far better documentation than the first. Your runbooks are no longer just notes for the next on-call. They are instructions for a machine that has root access. Write them accordingly.
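What "write them accordingly" could mean in practice: a runbook a machine can act on needs an explicit trigger, and every step needs a command, a way to verify it worked, and a way to undo it. The linter below is a rough sketch of that idea. The field names (`trigger`, `steps`, `command`, `verify`, `rollback`), the list of vague phrases, and the `checkout-api` service are all made up for this example; no vendor's runbook format is implied.

```python
REQUIRED_STEP_FIELDS = ("command", "verify", "rollback")

# Phrases that signal the real instructions live in someone's head (assumed list).
VAGUE_PHRASES = ("usually", "probably", "ask ", "you know", "as needed")


def lint_runbook(runbook: dict) -> list[str]:
    """Return a list of reasons this runbook is not safe to hand to an agent."""
    problems = []
    if not runbook.get("trigger"):
        problems.append("missing trigger: which alert does this runbook answer?")
    steps = runbook.get("steps") or []
    if not steps:
        problems.append("no steps defined")
    for number, step in enumerate(steps, start=1):
        for field in REQUIRED_STEP_FIELDS:
            if not step.get(field):
                problems.append(f"step {number}: missing {field}")
        text = " ".join(str(value) for value in step.values()).lower()
        for phrase in VAGUE_PHRASES:
            if phrase in text:
                problems.append(f"step {number}: vague wording {phrase.strip()!r}")
    return problems


if __name__ == "__main__":
    folklore = {
        "trigger": "",
        "steps": [{"command": "restart the thing, ask the DBA if it fails"}],
    }
    machine_ready = {
        "trigger": "checkout-api p99 latency above 2s for 5 minutes",
        "steps": [{
            "command": "systemctl restart checkout-api",
            "verify": "curl -fsS http://localhost:8080/healthz",
            "rollback": "no-op (restart is idempotent)",
        }],
    }
    for name, runbook in (("folklore", folklore), ("machine_ready", machine_ready)):
        print(name, lint_runbook(runbook))
```

A check like this will not make a runbook correct, but it does catch the folklore cases: steps that only work if you already know what "the thing" is and who to ask.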
