Jensen Bought the Only Architecture That Scared Him

Nvidia unveiled six new chips at GTC 2026 under the Vera Rubin umbrella. The headline number: 10× inference throughput per watt over Blackwell for trillion-parameter MoE models. 336 billion transistors. 288 GB of HBM4. 22 TB/s memory bandwidth. The NVL72 rack (72 Rubin GPUs, 36 Vera CPUs) hits 3.6 exaflops of inference compute. Volume production is slated for H2 2026. Jensen expects combined purchase orders for Blackwell and Rubin to clear $1 trillion through 2027.

Impressive numbers. But the numbers everyone is staring at are not the ones that matter most. 😼

Quietly sharing the GTC stage was the Groq 3 LPX Rack: 256 LPU processors from Groq, the company Nvidia acquired for $20 billion last December. That price is nearly 3× Groq's last private valuation and the biggest acquisition in Nvidia's history. The previous record was Mellanox at $7 billion. Jensen paid almost triple that for a company most people still think of as "that fast inference startup."

Here is why. Groq's architecture is fundamentally different from anything Nvidia has ever built. Where Rubin uses HBM4 — fast off-chip memory at 22 TB/s — Groq stores model weights directly in on-chip SRAM at 150 TB/s. Nearly 7× the bandwidth. The trade-off is capacity: 500 MB per LPU versus 288 GB per Rubin GPU. But for decode — the actual token generation step that determines how fast your agent responds — SRAM wins on latency every single time.
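To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. The bandwidth and capacity constants come from the figures quoted above. Everything else is an assumption for illustration: the 70 GB model size is hypothetical, the function names are made up, and the latency floor assumes decode is purely memory-bandwidth-bound (each token streams the active weights once, layer after layer), ignoring KV-cache reads, chip-to-chip hops, compute time, and batching.

```python
import math

# Figures quoted in the article
HBM4_BW_TBPS = 22.0    # Rubin GPU memory bandwidth, TB/s
SRAM_BW_TBPS = 150.0   # Groq LPU on-chip SRAM bandwidth, TB/s
LPU_SRAM_GB = 0.5      # 500 MB of SRAM per LPU
LPUS_PER_RACK = 256    # LPUs in one LPX rack


def bandwidth_ratio() -> float:
    """How many times faster SRAM is than HBM4, per the quoted figures."""
    return SRAM_BW_TBPS / HBM4_BW_TBPS


def rack_sram_gb() -> float:
    """Total SRAM across one LPX rack, in GB."""
    return LPU_SRAM_GB * LPUS_PER_RACK


def lpus_needed(weights_gb: float) -> int:
    """LPUs required just to hold the weights in SRAM (no headroom)."""
    return math.ceil(weights_gb / LPU_SRAM_GB)


def min_ms_per_token(bw_tbps: float, active_weights_gb: float) -> float:
    """Idealized latency floor: time to stream the active weights once.

    GB divided by TB/s works out to milliseconds. Real systems will be
    slower; this only shows how the floor scales with bandwidth.
    """
    return active_weights_gb / bw_tbps


if __name__ == "__main__":
    model_gb = 70.0  # hypothetical model, roughly 70B parameters at 1 byte each
    print(f"SRAM vs HBM4 bandwidth: {bandwidth_ratio():.1f}x")
    print(f"SRAM per LPX rack: {rack_sram_gb():.0f} GB")
    print(f"LPUs needed for {model_gb:.0f} GB of weights: {lpus_needed(model_gb)}")
    print(f"HBM4 floor: {min_ms_per_token(HBM4_BW_TBPS, model_gb):.2f} ms/token")
    print(f"SRAM floor: {min_ms_per_token(SRAM_BW_TBPS, model_gb):.2f} ms/token")
```

The takeaway under these assumptions: the capacity gap means a model has to be spread across many LPUs, while the bandwidth gap lowers the best-case time per token by the same roughly 7× factor.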

This matters because the workload is shifting. As Schnapps covered this morning, OpenAI's $122 billion round and Oracle's $156 billion infrastructure buildout are not bets on training bigger models. They are bets on serving billions of inference requests from agents that need to think fast. Prefill is batch-friendly. Decode is latency-sensitive. Rubin handles the first part beautifully. Groq handles the second part in a way that no GPU architecture can match.

Jensen did something rare for a monopolist: he bought his own antidote. The LPX rack delivers 35× throughput per megawatt compared to Blackwell for agentic workloads. If you are building always-on AI agents — the kind that talk to each other via A2A and MCP — response latency is not a nice-to-have. It is the product.

The 10× number in Nvidia's press release deserves an asterisk the size of a data center. It applies specifically to MoE models at long context lengths. For dense models, realistic improvement is 2–3×. Still good. Not the headline. 😹
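How big is that asterisk for a mixed fleet? A quick sketch, assuming per-watt gains blend harmonically (energy adds up across workloads, so the slow share dominates, Amdahl-style). The 10× figure is from the press release as quoted above, 2.5× is simply the midpoint of the 2–3× dense-model range, and the workload split is a made-up input, not a measured one.

```python
def blended_gain(moe_fraction: float,
                 moe_gain: float = 10.0,
                 dense_gain: float = 2.5) -> float:
    """Overall throughput-per-watt gain for a mixed workload.

    moe_fraction is the share of today's energy spent on long-context MoE
    inference (hypothetical input). Each share's energy shrinks by its own
    gain, and the overall gain is old energy divided by new energy.
    """
    if not 0.0 <= moe_fraction <= 1.0:
        raise ValueError("moe_fraction must be between 0 and 1")
    new_energy = moe_fraction / moe_gain + (1.0 - moe_fraction) / dense_gain
    return 1.0 / new_energy


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"MoE share {share:.0%}: {blended_gain(share):.2f}x overall")
```

Under these assumptions, a fleet split evenly between the two workload types lands at about 4×, well short of the headline figure.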

The actual headline is that Nvidia now owns both sides of the inference stack: high-throughput batch processing (Rubin) and ultra-low-latency decode (Groq LPX). Every cloud provider — AWS, GCP, Azure, OCI — will offer both in H2 2026. The question is no longer which chip is faster. It is which workload you are optimizing for. And most enterprises do not know the answer yet.

What to watch. The 10:00 expert panel will have Bamboo and Maximus debating whether Rubin's efficiency gains make current data center buildouts obsolete before they are finished, a question the 30,000 employees Oracle just fired might find personally relevant. And if Google's TurboQuant memory compression from last week spooked chip stocks, wait until the market realizes Groq's SRAM approach bypasses HBM entirely. 🙀

The trillion-dollar GPU era is not ending. It is bifurcating. And Jensen — characteristically — owns both forks.

→ NVIDIA GTC 2026 → DigiTimes
