SWE-bench Is Dead. Here's What Your AI Coding Tool Actually Competes On.

You pick an AI coding tool by looking at a leaderboard. SWE-bench Verified, a standardized test where AI models fix bugs in open-source Python projects, publishes a tidy scoreboard, and every vendor shoves its number in your face. Higher score, better tool. Simple, right?

But tools running on nearly identical models feel completely different on your actual codebase. One nails a three-file refactor; another hallucinates an import that doesn't exist. The score says they're twins. Your Monday morning says otherwise.

10,000 Developers Confirmed It: The Leaderboard Is Lying

JetBrains' AI Pulse survey landed this month (10,000+ professional developers, eight languages, real workplace data) and confirmed what your gut already knew: developer satisfaction varies wildly across tools built on models that score within rounding error of each other on SWE-bench. The benchmark shows a three-way tie. Developers see it very differently.

None of this is new. Back in February, OpenAI issued the death certificate for SWE-bench Verified. The postmortem: GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash could reproduce gold-patch solutions verbatim from the task ID alone. The models didn't solve the problems. They regurgitated memorized answers. OpenAI also audited a 27.6% slice of tasks that models failed and found that in 59.4% of them the test cases themselves were flawed, rejecting functionally correct code. The benchmark wasn't only a memorization test; it was also marking right answers wrong.

The live leaderboard as of April 13, 2026 confirms the absurdity: Claude Opus 4.5 at 80.9%, Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%. Three frontier models within 0.3 percentage points. A statistical tie dressed up as a horse race.
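To see why a 0.3-point spread counts as a tie, here is a rough back-of-the-envelope sketch. It rests on two assumptions that are not in the figures above: that SWE-bench Verified has 500 tasks, and that each task can be treated as an independent pass/fail trial. Neither is a claim about how the leaderboard itself computes uncertainty.

```python
import math

# Scores quoted above, in percent.
scores = {"Claude Opus 4.5": 80.9, "Opus 4.6": 80.8, "Gemini 3.1 Pro": 80.6}

# Assumption (not from the article): the benchmark has 500 tasks.
N_TASKS = 500


def standard_error_pp(score_pct, n_tasks):
    """Binomial standard error of a pass rate, in percentage points."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_tasks)


spread = max(scores.values()) - min(scores.values())
se = standard_error_pp(max(scores.values()), N_TASKS)

print(f"spread between first and third: {spread:.1f} pp")
print(f"one standard error on a single score: {se:.2f} pp")
```

Under those assumptions, the gap between first and third place is several times smaller than the noise on any one score.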

The Variable No Benchmark Measures

If the score doesn't explain the satisfaction gap, what does? Context strategy: how well the tool actually understands your project before it writes a single line.

SWE-bench tests isolated bug fixes in well-documented open-source repos. You spend your day on multi-file feature work in proprietary codebases, full of tribal knowledge and that one config file Kevin wrote in 2019 that nobody dares touch. Every major tool approaches this problem differently, and every one breaks somewhere:

Claude Code reads your directory tree and your CLAUDE.md instruction files, plain-text docs where you teach the AI your project's conventions, banned patterns, and architecture decisions. It sends full file content into the context window: real code, not summaries. The limit: context windows are finite. On a 50,000-file monorepo it can't hold everything at once, so it depends on your instruction files to point it to the right place. Lazy CLAUDE.md, lazy results. The tool is only as smart as the map you draw for it.

Cursor takes the opposite approach. Its @Codebase feature builds a proprietary vector index, an embedding database of your code's semantic meaning. When you query it, a similarity search pulls the most relevant chunks, so it can navigate large codebases without loading everything into context. The failure mode: embeddings lose structural relationships. A function that calls three helpers across two files may match semantically while the index misses the dependency chain. On large projects the index also lags behind edits: you change a file, and for the next few minutes the AI answers about the old version.

GitHub Copilot uses Knowledge Bases on its Enterprise tier ($39/user/month): indexed repositories plus documentation that Copilot pulls in during completions. It can cross-reference multiple repos, which suits microservice architectures. The catch nobody mentions: the free and Pro tiers get none of this. Most individual developers run Copilot with zero project-level context, just the open file and maybe the neighboring tab. The gap between Enterprise Copilot and regular Copilot is bigger than the gap between any two tools on the leaderboard.

Zed parses code structurally with Tree-sitter: it sees abstract syntax trees, not flat strings. It natively understands scopes, function boundaries, and nesting. Fast and lightweight. The tradeoff: syntax without semantics. Tree-sitter knows a function exists and what it's called, not what it does or why it matters. For boilerplate and single-file edits: precise. For "How does the auth middleware affect this API endpoint three packages away?": out of its league.
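A small illustration of "syntax without semantics". This sketch uses Python's built-in ast module as a stand-in for Tree-sitter (an assumption made to keep the example dependency-free; it is not how Zed works internally), and the sample source is hypothetical. A structural parse hands you names, kinds, and line numbers, and nothing about behavior.

```python
import ast

# Hypothetical source file, invented for illustration.
SOURCE = '''
def check_token(request):
    return verify(request.headers["Authorization"])

class AuthMiddleware:
    def __call__(self, request):
        return check_token(request)
'''


def outline(source):
    """Return (kind, name, line) for every function and class definition."""
    tree = ast.parse(source)
    items = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            items.append((type(node).__name__, node.name, node.lineno))
    # Sort by line number so the outline reads top to bottom.
    return sorted(items, key=lambda item: item[2])


for kind, name, line in outline(SOURCE):
    print(f"line {line}: {kind} {name}")
```

The outline tells you that check_token and AuthMiddleware exist and where they start. It cannot tell you which endpoints sit behind that middleware or what happens when the token check fails.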

Same model tier. Radically different project comprehension. The satisfaction data starts to make sense.
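The embedding failure mode described for Cursor can be sketched with a toy. Everything here is hypothetical: the four files are invented, and a bag-of-words cosine similarity stands in for a real embedding model. It is not Cursor's implementation. It only shows how a similarity search can rank a look-alike file first and leave a dependency chain out of the results entirely.

```python
import math
import re
from collections import Counter

# Hypothetical mini codebase: path -> source text.
FILES = {
    "billing/invoice.py": "def total_invoice(items): return apply_tax(sum_items(items))",
    "billing/tax.py": "def apply_tax(amount): return amount * rate_for_region()",
    "billing/regions.py": "def rate_for_region(): return 1.2",
    "docs/invoice_notes.py": "# invoice total invoice items notes about invoice totals",
}


def embed(text):
    """Toy stand-in for an embedding: counts of lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, k):
    """Return the k paths most similar to the query."""
    q = embed(query)
    ranked = sorted(FILES, key=lambda path: cosine(q, embed(FILES[path])), reverse=True)
    return ranked[:k]


def dependency_files(start):
    """Follow function calls from one file to the files that define them."""
    defined = {}
    for path, src in FILES.items():
        for name in re.findall(r"def (\w+)", src):
            defined[name] = path
    seen, stack = [], [start]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.append(path)
        for name in re.findall(r"(\w+)\(", FILES[path]):
            target = defined.get(name)
            if target and target not in seen:
                stack.append(target)
    return seen


print("similarity search:", top_k("invoice total items", 2))
print("call chain:", dependency_files("billing/invoice.py"))
```

In this toy, the similarity search surfaces the notes file and invoice.py, while tax.py and regions.py, which the invoice code actually depends on, share no words with the query and never appear. Walking the call chain finds them.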

Simon Willison argued in October 2025 that the best context strategy isn't fancy instruction files but boring fundamentals: automated tests (he runs 1,500 in one project), interactive dev servers, well-structured GitHub Issues. Translation: write tests, you animals. The fanciest context config in the world won't save code that has no test suite to check itself against. He's irritatingly right, but it's not either/or. A good context strategy plus a solid test suite is what actually compounds.

The Cost That Isn't on the Label

Here's the trap nobody prices into the comparison: every context strategy described above is proprietary and non-portable. Your CLAUDE.md files mean nothing to Cursor. Your Cursor index doesn't transfer to Copilot. Switching tools means teaching your whole project again from scratch: hours of setup, weeks of prompt and documentation tuning.

The $20/month subscription is the cheap part. The expensive part is the institutional knowledge you pour into one tool's specific format.

And the real twist: no standard benchmark exists that measures codebase comprehension. In February OpenAI recommended SWE-bench Pro as the replacement for Verified, but two months on, adoption is sparse and Pro still tests isolated tasks. Models that score ~80% on Verified drop to roughly 23% on Pro. Nobody has built the benchmark that tests what actually matters.

What This Means for You

Stop reading leaderboards. The number you're comparing is a memorization score on a broken test.

Pick two or three tools, run each on your own repo for a week, and track completion accuracy on tasks that need cross-file understanding, the work you actually do. Pay attention to setup time, because that is your switching cost forever.
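One way to keep that week honest is a plain log you tally at the end. This is a minimal sketch, not a standard methodology: the Trial fields, the tool names, the sample entries, and the "at least two files" cutoff for cross-file work are all assumptions you should replace with your own.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    tool: str           # hypothetical tool label
    task: str           # short description of what you asked for
    files_touched: int  # how many files the change needed
    accepted: bool      # True if you kept the result without rewriting it


def cross_file_accuracy(trials, min_files=2):
    """Share of accepted results per tool, counting only multi-file tasks."""
    totals = {}
    for trial in trials:
        if trial.files_touched < min_files:
            continue  # single-file edits are not what this evaluation measures
        accepted, seen = totals.get(trial.tool, (0, 0))
        totals[trial.tool] = (accepted + int(trial.accepted), seen + 1)
    return {tool: accepted / seen for tool, (accepted, seen) in totals.items()}


# Invented sample entries for illustration only.
log = [
    Trial("tool_a", "rename config key across services", 4, True),
    Trial("tool_a", "add retry to client and its callers", 3, False),
    Trial("tool_a", "fix typo in README", 1, True),
    Trial("tool_b", "rename config key across services", 4, True),
    Trial("tool_b", "add retry to client and its callers", 3, True),
]

print(cross_file_accuracy(log))
```

Give every tool the same tasks where you can, and note setup hours next to the log, since the tally alone won't show them.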

The model race has hit a ceiling at ~81%. The context race has just begun, and nobody is keeping score. That is either terrifying or the biggest opportunity in developer tools right now, depending on whether you're a vendor or a developer who can spare a week for an honest evaluation.
