Issue #6
May 19, 2026

The inference-engine wave: MTP, DFlash, PAGED MoE

Single-card 600 tok/s. 397B on a 64GB Mac. 85 tok/s at 524k context. Three weeks of runtime breakthroughs reset what 'local' means.

This issue is unusual: almost no new weights matter. What matters is how this month's weights run on hardware you already own. Three runtime-level shifts landed together in May, and the cost curve looks different now.

What shipped (May 9 – May 19)

Item	Type	Highlight
MTP self-speculation on DeepSeek V4-Flash	Inference optimization	85 tok/s @ 524k ctx on 2× RTX PRO 6000 Max-Q
TurboQuant + MTP on Qwen 3.6 27B	Inference optimization	80+ tok/s @ 262k ctx on a single RTX 4090
DFlash speculative decoding for Gemma 4 26B	Inference optimization	600 tok/s on a single RTX 5090 (vLLM)
ExLlamaV3 + DFlash integration	Engine update	2.51× agentic benchmark vs pre-DFlash
PAGED MoE engine	Engine release	397B-param model in ~14GB resident on M1 Ultra 64GB
llama.cpp MTP merged to main	Engine update	2.44× throughput on Strix Halo, 2.17× on RTX 3090
MiniCPM 4.6	Model release	1.2B dense edge model
GPT-Realtime-2 / Translate / Whisper	Cloud model	OpenAI's real-time voice trio
HoloMotion-1 (Horizon)	Robotics model	400M open whole-body humanoid control

The model column is short on purpose. The interesting story is the inference column.

1. MTP self-speculation grafts onto big weights

Multi-Token Prediction started as a Qwen 3.6 27B training-time trick. In May it became a generic runtime accelerator: take a base model with MTP heads (or graft one on via fine-tune), use the heads as a self-speculator, and skip the separate draft-model dance.
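
To make the draft-then-verify idea concrete, here is a minimal sketch of one greedy self-speculation loop. It is a toy under stated assumptions, not any engine's implementation: `draft_next` stands in for the cheap MTP heads, `target_next` for the full model, and the toy functions and all names are hypothetical. A real engine verifies the whole draft in one batched forward pass; the loop below checks tokens one at a time only for readability.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int = 4) -> List[Token]:
    """Run one draft-then-verify step and return the accepted tokens (>= 1)."""
    # 1. Draft k tokens with the cheap predictor (stand-in for MTP heads).
    drafted: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify the drafts against the full model, left to right.
    accepted: List[Token] = []
    ctx = list(context)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the full model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft matched, so the verify pass yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(context: List[Token], draft_next: NextFn, target_next: NextFn,
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens; also return how many verify steps were needed."""
    out = list(context)
    steps = 0
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[len(context):len(context) + n_tokens], steps


# Toy stand-ins (hypothetical): the "full model" counts upward, and the
# "draft head" agrees with it except after every token where last % 4 == 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 100


def toy_draft(ctx: List[Token]) -> Token:
    last = ctx[-1]
    return (last + 2) % 100 if last % 4 == 3 else (last + 1) % 100


if __name__ == "__main__":
    tokens, steps = generate([0], toy_draft, toy_target, n_tokens=12, k=4)
    print(tokens, steps)
```

The output is identical to plain greedy decoding with the full model; the only thing that changes is how many full-model steps it takes to get there, which is where the throughput gain comes from.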

Numbers from this window:

DeepSeek V4-Flash with W4A16 + FP8 + MTP self-speculation: 85 tok/s at 524k context on 2× RTX PRO 6000 Max-Q
Qwen 3.6 27B with TurboQuant + MTP: 80+ tok/s at 262k context on a single RTX 4090
Qwen 3.6 27B with MTP graft + llama.cpp PR: 50 tok/s on a single RTX 3090
llama.cpp MTP support merged to main (May 19): 2.44× throughput on Strix Halo, 2.17× on RTX 3090
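
How large a multiplier like these can be depends mostly on how often drafted tokens are accepted. A rough sketch using the standard speculative-decoding estimate, assuming each drafted token is accepted independently with probability `alpha`, a draft length of `k`, and a draft cost of `c` relative to one full-model pass. All three symbols and the example values are illustrative assumptions, not measurements from the runs above.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per full-model pass with k drafted tokens.

    Assumes independent per-token acceptance probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Estimated throughput multiplier when each draft token costs c
    full-model passes (c is small for lightweight draft heads)."""
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * c)


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=3, c=0.05), 2))
```

The estimate ignores memory-bandwidth and batching effects, so treat it as a way to reason about acceptance rate versus draft length rather than a predictor of any specific benchmark.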

For agentic coding the implication is direct: a 27B-class model holding 262k of context now sustains throughput that previously required a server. The "local agentic dev loop" feasibility threshold just moved.

2. DFlash on consumer Blackwell

DFlash speculative decoding landed in two engines this window. In vLLM, Gemma 4 26B sustains 600 tok/s on a single RTX 5090. In ExLlamaV3 (turboderp's stack), the DFlash integration improved agentic benchmarks 2.51× over the pre-DFlash baseline.

That's not "a faster prompt" — that's "single-card local inference matches last year's API tier."

Pair this with NVIDIA's Gemma-4-26B-A4B-NVFP4 quant (50k ctx at 80% of a 5090's 32GB VRAM, shipped earlier in May) and the 5090 is now the most leveraged single piece of consumer hardware for local LLMs since the original 3090.

3. PAGED MoE on a 64GB Mac

The wildest entry of the month. An open-source PAGED MoE engine ran a 397B-parameter model on an M1 Ultra 64GB Mac Studio at 1.59 tok/s — using ~14GB of resident memory at any moment, paging experts in and out.

1.59 tok/s isn't a daily driver. But "consumer Mac ≈ 70B ceiling" is no longer a load-bearing assumption. For the right workload (latency-tolerant batch inference, agent-with-memory architectures, async assistants), sparse-MoE models far larger than the box's RAM are now in scope.

A companion data point: an enthusiast got a 1T-parameter model running on Intel Optane Persistent Memory at 4 tok/s. Reference value > practical value, but the architectural envelope is being mapped.
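
The expert-paging idea behind the 397B run can be sketched as a small cache that keeps only a few experts resident and loads the rest on demand. This is a minimal illustration under assumptions: the engine's actual eviction policy is not described here, so least-recently-used is a guess, and `ExpertCache`, `load_expert`, and the string "weights" are hypothetical stand-ins for real tensors read from disk.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keep at most `capacity` experts resident; evict the least recently used."""

    def __init__(self, capacity: int, load_expert: Callable[[int], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical loader, e.g. a read from SSD
        self.resident: "OrderedDict[int, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> object:
        if expert_id in self.resident:
            # Cache hit: mark this expert as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]

        # Cache miss: page the expert in, then evict the oldest if over budget.
        self.misses += 1
        weights = self.load_expert(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights


if __name__ == "__main__":
    cache = ExpertCache(capacity=2, load_expert=lambda i: f"weights-{i}")
    for expert_id in [0, 1, 0, 2, 1]:  # expert ids a router might pick per token
        cache.get(expert_id)
    print(cache.hits, cache.misses, list(cache.resident))
```

Every miss is a disk read, so throughput in a design like this is bounded by how often the router picks a non-resident expert, which is consistent with a tok/s figure in the low single digits.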

Small but useful

MiniCPM 4.6 — 1.2B dense edge model. Targets phones and laptop CPUs. Useful as a small classifier / router in a multi-model pipeline.
Qwen 3.6 35B-A3B — community consensus settled: faster and stronger on code than Gemma 4 26B-A4B. Distill variants (14B / 9B) still pending; will be the roleplay/fine-tune base of choice when they land.
Gemma 4 WebGPU + Transformers.js — full offline Gemma 4 in a browser tab, no native runtime. Companion / embedded scenarios just got a new deployment shape.
AMD Strix Halo ROCm tutorial — first clean end-to-end ROCm fine-tune path published this year. AMD as a price-performance option moves from "in theory" to "demonstrated."

Cloud and adjacent

OpenAI shipped a real-time voice trio: GPT-Realtime-2 (GPT-5-tier reasoning in real-time conversation), GPT-Realtime-Translate, GPT-Realtime-Whisper. The voice category is being aggressively consolidated under OpenAI; local TTS/ASR projects need a positioning answer.
Horizon open-sourced HoloMotion-1, a 400M "small-brain" model for whole-body humanoid control. Not a chat LLM, but a signal: robotics open-source is heating up, and the parameter count required for embodied control is small enough that consumer hardware is genuinely the target.

What to actually do this week

RTX 5090 owner: install ExLlamaV3 with DFlash. Re-benchmark your daily driver. If you're not seeing >300 tok/s on Gemma 4 26B, something is mis-configured.
RTX 4090 / 3090 owner: pull the latest llama.cpp main with MTP support. Re-run Qwen 3.6 27B and your throughput should jump ~2×.
Mac 64GB owner: try the PAGED MoE engine on a sparse-MoE model you previously assumed was out of reach. Plug it into an async-agent or memory-augmented loop where ~1.6 tok/s is acceptable.
AMD ROCm user: walk through the Strix Halo fine-tune tutorial end-to-end at least once. Even if you don't fine-tune today, the path matters.
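
For the re-benchmarking steps above, a minimal engine-agnostic timing harness can help keep before/after numbers comparable. It is a sketch under assumptions: `generate` is a hypothetical callable you wrap around whatever engine you use, and it must return the number of tokens it actually produced. `fake_generate` only simulates work so the example runs on its own.

```python
import time
from typing import Callable, List


def measure_tok_per_s(generate: Callable[[str, int], int], prompt: str,
                      max_new_tokens: int, runs: int = 3, warmup: int = 1) -> float:
    """Return the median tokens/second over `runs` timed generations."""
    for _ in range(warmup):
        generate(prompt, max_new_tokens)  # untimed, lets caches and kernels settle

    rates: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)

    rates.sort()
    return rates[len(rates) // 2]


def speedup(before_tok_per_s: float, after_tok_per_s: float) -> float:
    """Throughput multiplier between two measurements."""
    return after_tok_per_s / before_tok_per_s


def fake_generate(prompt: str, max_new_tokens: int) -> int:
    """Stand-in for a real engine call: sleeps, then reports token count."""
    time.sleep(0.01)
    return max_new_tokens


if __name__ == "__main__":
    rate = measure_tok_per_s(fake_generate, "hello", max_new_tokens=10)
    print(f"{rate:.1f} tok/s")
```

Use the same prompt, context length, and `max_new_tokens` before and after an upgrade; speculative methods are sensitive to the workload, so a multiplier measured on one prompt may not carry over to another.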

The headlines this month favor inference engineers. That's a healthier place for the local-AI ecosystem to be than the weight-release treadmill — and it's the wave that closes the gap with GPT-5.5 Pro on throughput-per-dollar even as the raw-intelligence gap widens.

— runlocal
