Every quarter, you watch the same cycle: AI companies announce bigger models, memory chip stocks go up, Samsung and SK Hynix investors pop champagne. More parameters mean more RAM. More RAM means more revenue. The escalator only goes one direction.
Nobody bothers asking the uncomfortable question: what if the models don't actually need all that memory?
Google drops a math bomb
On March 25, 2026, Google Research published TurboQuant, a compression algorithm that shrinks the memory LLMs need while running by 6x and delivers up to 8x speedup on Nvidia H100 GPUs. The kicker: zero accuracy loss, according to Google. The next day, memory chip stocks cratered across three continents.
Here's what happened technically, because it's elegant.
LLMs (large language models, the AI brains behind ChatGPT, Claude, and Gemini) have a component called the KV cache (key-value cache). Think of it as the model's short-term memory: everything it holds in its head during a conversation. The longer the conversation, the bigger the cache, the fatter your GPU bill.
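To see why the cache grows with the conversation, a back-of-the-envelope sketch helps. The model dimensions here (32 layers, 8 key-value heads, 128-dimensional heads, 16-bit values) are illustrative assumptions, not figures from Google's paper, and the function name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, bits_per_value=16):
    """Approximate KV cache size for one conversation, in bytes.

    All defaults are hypothetical model dimensions chosen for illustration.
    """
    # Each token stores one key vector and one value vector
    # per layer and per key-value head.
    values_per_token = 2 * layers * kv_heads * head_dim
    return seq_len * values_per_token * bits_per_value // 8


for tokens in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

The size is linear in both the number of tokens and the bits stored per value, which is why cutting bits per value is such a direct lever on memory.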
TurboQuant attacks this cache with a trick called PolarQuant. Normally, data gets stored as points on a grid, like street addresses on a city map. PolarQuant converts those points to polar coordinates. Think compass directions: an angle plus distance from center. This transformation makes data patterns predictable enough to compress from the 16 or 32 bits typically used down to just 3 bits per value. No retraining. No fine-tuning (teaching a model new tricks with custom data). No calibration. You just apply it.
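The grid-to-compass idea can be shown with a toy two-dimensional example. This is not PolarQuant itself, whose details the article does not spell out; it is a minimal sketch that assumes the radius is kept at full precision and only the angle is squeezed into 3 bits (8 compass sectors). All function names are hypothetical.

```python
import math


def to_polar(x, y):
    """Convert a grid point (x, y) to (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)


def quantize_angle(theta, bits=3):
    """Map an angle to one of 2**bits equal sectors of the circle."""
    levels = 1 << bits
    step = 2 * math.pi / levels
    # Shift the angle into [0, 2*pi) before picking a sector.
    return int((theta % (2 * math.pi)) / step) % levels


def dequantize_angle(code, bits=3):
    """Return the angle at the center of the given sector."""
    step = 2 * math.pi / (1 << bits)
    return (code + 0.5) * step


def roundtrip(x, y, bits=3):
    """Compress a point's angle to `bits` bits and reconstruct the point."""
    r, theta = to_polar(x, y)
    code = quantize_angle(theta, bits)
    theta_hat = dequantize_angle(code, bits)
    return code, (r * math.cos(theta_hat), r * math.sin(theta_hat))


code, approx = roundtrip(3.0, 4.0)
print(f"3-bit angle code: {code}, reconstructed point: {approx}")
```

With 8 sectors the reconstructed direction is never off by more than half a sector, or 22.5 degrees. A real scheme would work in far more dimensions and choose its quantization levels to match how the angles are actually distributed.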
A second stage called QJL catches leftover errors by projecting them into a simpler mathematical space and reducing each value to a single sign bit — plus or minus one. An unbiased error corrector at the cost of one extra bit. Mathematically clean.
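The sign-bit trick is easier to believe once you watch it work. The sketch below is an assumption-laden illustration, not Google's implementation: it applies random Gaussian projections directly to a vector `k` (in the article, the second stage works on the leftover error from the first), keeps one sign bit per projection plus the vector's length, and uses the standard identity E[sign(g·k)(g·q)] = sqrt(2/π)·⟨q,k⟩/‖k‖ for Gaussian g to recover inner products without bias. Function names are hypothetical.

```python
import math
import random


def make_projection(m, d, rng):
    """Draw m random Gaussian directions in d dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign_bits(proj, k):
    """Keep only the sign (+1 or -1) of each projection of k."""
    return [1 if dot(row, k) >= 0 else -1 for row in proj]


def estimate_inner_product(proj, bits, k_norm, q):
    """Estimate <q, k> from k's sign bits and norm; q stays uncompressed."""
    m = len(proj)
    total = sum(b * dot(row, q) for row, b in zip(proj, bits))
    # sqrt(pi/2) undoes the sqrt(2/pi) shrinkage caused by taking signs.
    return math.sqrt(math.pi / 2) * k_norm * total / m


rng = random.Random(0)
d, m = 16, 4000
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
proj = make_projection(m, d, rng)
bits = sign_bits(proj, k)
approx = estimate_inner_product(proj, bits, math.sqrt(dot(k, k)), q)
print(f"exact={dot(q, k):.3f}  estimate={approx:.3f}")
```

The estimate is noisy for any single draw but correct on average, and the noise shrinks as the number of projections grows. That "right on average" property is what unbiased means here.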
The internet immediately called it real-life Pied Piper middle-out compression from HBO's Silicon Valley. For once, the meme was accurate.
Wall Street notices
On March 26, the stock market responded with the subtlety of a cat knocking things off a shelf. SK Hynix dropped 6.2%. Samsung fell nearly 5%. Japan's Kioxia lost 6%. In the US, Micron slid 3.4% and SanDisk 3.5%. The KOSPI — South Korea's main stock index — dropped over 3%, with semiconductor stocks leading the selloff.
To be fair, these stocks had gained 200–300% over the previous year, so profit-taking amplified the damage. But the trigger was unmistakable.
The cold water
Before you short everything with a chip in it: TurboQuant is a research paper heading to ICLR 2026 — a top AI conference — in April. Not a shipping product. It compresses KV cache specifically — not full model weights, not training workloads. Morgan Stanley argues it lets systems handle 4–8x longer conversations on the same hardware, which means more deployments, not fewer chips. Analysts at Lynx Equity Strategies say memory demand survives the next three to five years regardless.
The bull case isn't dead. It just got more nuanced.
What this changes
For anyone running LLM inference (actually using a trained model to generate answers), from solo developers paying per token to hyperscalers burning through GPU fleets, this signals that serving costs are heading down. Once TurboQuant-class techniques land in standard inference engines (the software that runs AI models in production), the economics shift for every AI application.
The most impactful Google AI announcement this month wasn't a bigger model or a flashier product. It was a math paper that made existing models cheaper to run. The trillion-dollar hardware bet assumed software would stay dumb forever.
Software just got smarter.





