The inference-engine wave: MTP, DFlash, PAGED MoE
Single-card 600 tok/s. 397B on a 64GB Mac. 85 tok/s at 524K context. Three weeks of runtime breakthroughs reset what 'local' means.
This issue is unusual: almost no new weights matter. What matters is how this month's weights run on hardware you already own. Three runtime-level shifts landed together in May, and the cost curve looks different now.
What shipped (May 9 – May 19)
| Item | Type | Highlight |
|---|---|---|
| MTP self-speculation on DeepSeek V4-Flash | Inference optimization | 85 tok/s @ 524K ctx on 2× RTX PRO 6000 Max-Q |
| TurboQuant + MTP on Qwen 3.6 27B | Inference optimization | 80+ tok/s @ 262K on a single RTX 4090 |
| DFlash speculative decoding for Gemma 4 26B | Inference optimization | 600 tok/s on a single RTX 5090 (vLLM) |
| ExLlamaV3 + DFlash integration | Engine update | 2.51× agentic benchmark vs pre-DFlash |
| PAGED MoE engine | Engine release | 397B-param model in ~14GB resident on M1 Ultra 64GB |
| llama.cpp MTP merged to main | Engine update | 2.44× throughput on Strix Halo, 2.17× on RTX 3090 |
| MiniCPM 4.6 | Model release | 1.2B dense edge model |
| GPT-Realtime-2 / Translate / Whisper | Cloud model | OpenAI's real-time voice trio |
| HoloMotion-1 (Horizon) | Robotics model | 400M open whole-body humanoid control |
The list of model releases is short on purpose. The interesting story is in the inference-optimization and engine rows.
1. MTP self-speculation grafts onto big weights
Multi-Token Prediction started as a Qwen 3.6 27B training-time trick. In May it became a generic runtime accelerator: take a base model with MTP heads (or graft them on via fine-tuning), let the heads draft several tokens ahead, have the base model verify the drafts in one pass, and skip the separate draft model entirely.
Numbers from this window:
- DeepSeek V4-Flash with W4A16 + FP8 + MTP self-speculation: 85 tok/s at 524K context on 2× RTX PRO 6000 Max-Q
- Qwen 3.6 27B with TurboQuant + MTP: 80+ tok/s at 262K context on a single RTX 4090
- Qwen 3.6 27B with MTP graft + llama.cpp PR: 50 tok/s on a single RTX 3090
- llama.cpp MTP support merged to main (May 19): 2.44× throughput on Strix Halo, 2.17× on RTX 3090
For agentic coding the implication is direct: a 27B-class model holding 262K of context now sustains throughput that previously required a server. The "local agentic dev loop" feasibility threshold just moved.
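The draft-then-verify loop behind self-speculation can be sketched in a few lines. This is a toy illustration with greedy acceptance, not the llama.cpp or vLLM implementation: `target_next` and `draft` are hypothetical callables standing in for the full model and its MTP heads, and a real engine verifies all drafted positions in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interfaces (assumptions for illustration only):
NextToken = Callable[[List[Token]], Token]               # full model: greedy next token
DraftTokens = Callable[[List[Token], int], List[Token]]  # MTP heads: guess k tokens ahead


def speculative_step(ctx: List[Token], target_next: NextToken,
                     draft: DraftTokens, k: int) -> List[Token]:
    """One draft-then-verify step; returns the tokens to append to ctx."""
    proposed = draft(ctx, k)
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(ctx + accepted)
        if tok != expected:
            # First mismatch: keep the target model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft matched, so the verification also yields one bonus token.
    accepted.append(target_next(ctx + accepted))
    return accepted


def generate(prompt: List[Token], target_next: NextToken, draft: DraftTokens,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, target_next, draft, k))
        steps += 1
    return out[: len(prompt) + max_new], steps


if __name__ == "__main__":
    # Toy "model": the next token is always last + 1 (mod 10).
    target = lambda ctx: (ctx[-1] + 1) % 10
    perfect_draft = lambda ctx, k: [(ctx[-1] + i + 1) % 10 for i in range(k)]
    tokens, steps = generate([0], target, perfect_draft, k=3, max_new=8)
    print(tokens, steps)
```

With a drafter that is always right and `k=3`, each step emits four tokens. With a drafter that is always wrong, the loop falls back to one token per step, and the output is still identical to plain greedy decoding. That property is why the speedup comes without a quality trade-off under greedy acceptance.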
2. DFlash on consumer Blackwell
DFlash speculative decoding landed in both vLLM and ExLlamaV3 (turboderp's stack). Under vLLM on an RTX 5090, Gemma 4 26B sustains 600 tok/s single-card. In ExLlamaV3, agentic benchmarks improved 2.51× vs the pre-DFlash baseline.
That's not "a faster prompt" — that's "single-card local inference matches last year's API tier."
Pair this with NVIDIA's Gemma-4-26B-A4B-NVFP4 quant (50K ctx at 80% of a 5090's 32GB VRAM, shipped earlier in May) and the 5090 is now the most leveraged single piece of consumer hardware for local LLMs since the original 3090.
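A rough way to sanity-check speculative-decoding multipliers like the ones above: if each drafted token is accepted independently with probability `alpha` (a simplifying assumption, not a measured property of DFlash), the expected number of tokens emitted per verification pass with `k` drafted tokens is (1 - alpha^(k+1)) / (1 - alpha). The function name below is made up for this sketch.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus the bonus token from verification.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

This is an upper bound on throughput gain, since it treats drafting as free. Real speedups land below it once draft cost and verification overhead are counted.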
3. PAGED MoE on a 64GB Mac
The wildest entry of the month. An open-source PAGED MoE engine ran a 397B-parameter model on an M1 Ultra 64GB Mac Studio at 1.59 tok/s — using ~14GB of resident memory at any moment, paging experts in and out.
1.59 tok/s isn't a daily driver. But "consumer Mac ≈ 70B ceiling" is no longer a load-bearing assumption. For the right workload (latency-tolerant batch inference, agent-with-memory architectures, async assistants), sparse-MoE models far larger than the box's RAM are now in scope.
A companion data point: an enthusiast got a 1T-parameter model running on Intel Optane Persistent Memory at 4 tok/s. Reference value > practical value, but the architectural envelope is being mapped.
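The idea of paging experts in and out under a fixed resident budget can be sketched as a small cache. This is an illustration only: the eviction policy here is least-recently-used, which is an assumption and not something confirmed about the PAGED MoE engine, and `ExpertCache`, `load_expert`, and `size_of` are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class ExpertCache:
    """Keep at most budget_bytes of expert weights resident, evicting LRU first."""

    def __init__(self, budget_bytes: int,
                 load_expert: Callable[[Hashable], Any],
                 size_of: Callable[[Any], int]) -> None:
        self.budget = budget_bytes
        self._load = load_expert      # e.g. read one expert's weights from disk
        self._size_of = size_of       # bytes occupied by a loaded expert
        self._resident: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> Any:
        if expert_id in self._resident:
            # Mark as most recently used.
            self._resident.move_to_end(expert_id)
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self._load(expert_id)
        size = self._size_of(weights)
        if size > self.budget:
            raise ValueError("a single expert exceeds the resident budget")
        # Evict least-recently-used experts until the new one fits.
        while self.used_bytes + size > self.budget:
            _, evicted = self._resident.popitem(last=False)
            self.used_bytes -= self._size_of(evicted)
        self._resident[expert_id] = weights
        self.used_bytes += size
        return weights

    def resident_ids(self) -> List[Hashable]:
        return list(self._resident)


if __name__ == "__main__":
    # Stand-in loader: each "expert" is 10 bytes; the budget fits two of them.
    cache = ExpertCache(20, load_expert=lambda i: bytes(10), size_of=len)
    for expert_id in (0, 1, 0, 2):
        cache.get(expert_id)
    print(cache.resident_ids(), cache.hits, cache.misses)
```

Every miss costs a read from storage, which is why throughput in a scheme like this depends heavily on how often consecutive tokens route to experts that are already resident.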
Small but useful
- MiniCPM 4.6 — 1.2B dense edge model. Targets phones and laptop CPUs. Useful as a small classifier / router in a multi-model pipeline.
- Qwen 3.6 35B-A3B — community consensus settled: faster and stronger on code than Gemma 4 26B-A4B. Distill variants (14B / 9B) still pending; will be the roleplay/fine-tune base of choice when they land.
- Gemma 4 WebGPU + Transformers.js — full offline Gemma 4 in a browser tab, no native runtime. Companion / embedded scenarios just got a new deployment shape.
- AMD Strix Halo ROCm tutorial — first cleanly end-to-end ROCm fine-tune path published this year. AMD as a price-performance option moves from "in theory" to "demonstrated."
Cloud and adjacent
- OpenAI shipped a real-time voice trio: GPT-Realtime-2 (GPT-5-tier reasoning in real-time conversation), GPT-Realtime-Translate, GPT-Realtime-Whisper. The voice category is being aggressively consolidated under OpenAI; local TTS/ASR projects need a positioning answer.
- Horizon open-sourced HoloMotion-1, a 400M "small-brain" model for whole-body humanoid control. Not a chat LLM, but a signal: robotics open-source is heating up, and the parameter count required for embodied control is small enough that consumer hardware is genuinely the target.
What to actually do this week
- RTX 5090 owner: install ExLlamaV3 with DFlash and re-benchmark your daily driver (the 600 tok/s figure above was measured under vLLM). If you're not seeing >300 tok/s on Gemma 4 26B, something is misconfigured.
- RTX 4090 / 3090 owner: pull the latest llama.cpp main with MTP support. Re-run Qwen 3.6 27B and expect throughput to roughly double (2.17× was measured on the RTX 3090).
- Mac 64GB owner: try the PAGED MoE engine on a sparse-MoE model you previously assumed was out of reach. Plug it into an async-agent or memory-augmented loop where 1.5 tok/s is acceptable.
- AMD ROCm user: walk through the Strix Halo fine-tune tutorial end-to-end at least once. Even if you don't fine-tune today, the path matters.
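For the re-benchmarking steps above, a minimal timing harness is enough to compare before and after an engine upgrade. This sketch is engine-agnostic: `stream_tokens` is a hypothetical callable you would wrap around whatever streaming interface your runtime exposes, and the measurement is end-to-end, so it includes prompt processing time.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(stream_tokens: Callable[[str], Iterable[object]],
                      prompt: str,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Count streamed tokens and divide by wall-clock time (prefill included)."""
    start = clock()
    count = 0
    for _ in stream_tokens(prompt):
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed


def speedup(before_tps: float, after_tps: float) -> float:
    """Throughput multiplier, comparable to figures like 2.17x."""
    if before_tps <= 0:
        raise ValueError("before_tps must be positive")
    return after_tps / before_tps


if __name__ == "__main__":
    # Stand-in generator that "streams" 200 tokens with a small delay each.
    def fake_stream(prompt: str):
        for i in range(200):
            time.sleep(0.001)
            yield i

    print(f"{tokens_per_second(fake_stream, 'hello'):.1f} tok/s")
```

Use the same prompt, context length, and output length for both runs; throughput at 4K context says little about throughput at 262K.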
The headlines this month favor inference engineers. That's a healthier place for the local-AI ecosystem to be than the weight-release treadmill — and it's the wave that closes the gap with GPT-5.5 Pro on throughput-per-dollar even as the raw-intelligence gap widens.
— runlocal