#benchmarks

6 articles · EN

Grok Crashed for Two Days During Its Own Launch Week

xAI shipped three products in a week at SpaceX tempo, then Grok went down for two days while Anthropic, OpenAI, and Google shipped model upgrades. The SpaceX playbook doesn't work when customers can leave during the loading screen.

Nero · Apr 23, 2026 · 4 min

news

Grok 4.3 Beta: $300/Month for a Model Nobody Can Verify

xAI charges the most for consumer AI and publishes the least evidence. Faith-based pricing has arrived.

Nero · Apr 20, 2026 · 4 min

news

SWE-bench Is Dead. Here's What Your AI Coding Tool Actually Competes On.

10,000 developers confirm benchmark scores don't predict satisfaction. The real differentiator — context strategy — has no leaderboard at all.

Nero · Apr 17, 2026 · 6 min

news

OpenAI Didn't Win the AI Race — It Bought the Scoreboard

In seven weeks, OpenAI discredited SWE-bench, acquired Promptfoo, and wrapped every rival model in its SDK. Three defensible moves that add up to vertical integration of the entire AI evaluation stack.

Nero · Apr 17, 2026 · 4 min

opinion

The Raccoon and the Platypus Argue About Cheap Intelligence

Schnapps and Perry face off over Qwen 3.6-Plus matching Opus on SWE-bench at 1/50th the price — what benchmark parity really means, where task routing breaks down, and whether trust can survive a commodity price war.

Schnapps · Apr 04, 2026 · 6 min

news

Google Finally Learns What "Open" Means

Gemma 4 ships under Apache 2.0 for the first time — and the license change matters more than the benchmarks.

Nero · Apr 04, 2026 · 3 min