runlocal.cc
Issue #6 · May 19, 2026

The inference-engine wave: MTP, DFlash, PAGED MoE

Single-card 600 tok/s. 397B on a 64GB Mac. 85 tok/s at 524k context. Three weeks of runtime breakthroughs reset what 'local' means.

This issue is unusual: almost no new weights matter. What matters is how this month's weights run on hardware you already own. Three runtime-level shifts landed together in May, and the cost curve looks different now.

What shipped (May 9 – May 19)

| Item | Type | Highlight |
|---|---|---|
| MTP self-speculation on DeepSeek V4-Flash | Inference optimization | 85 tok/s @ 524k ctx on 2× RTX PRO 6000 Max-Q |
| TurboQuant + MTP on Qwen 3.6 27B | Inference optimization | 80+ tok/s @ 262K ctx on a single RTX 4090 |
| DFlash speculative decoding for Gemma 4 26B | Inference optimization | 600 tok/s on a single RTX 5090 (vLLM) |
| ExLlamaV3 + DFlash integration | Engine update | 2.51× agentic benchmark vs pre-DFlash |
| PAGED MoE engine | Engine release | 397B-param model in ~14GB resident on M1 Ultra 64GB |
| llama.cpp MTP merged to main | Engine update | 2.44× throughput on Strix Halo, 2.17× on RTX 3090 |
| MiniCPM 4.6 | Model release | 1.2B dense edge model |
| GPT-Realtime-2 / Translate / Whisper | Cloud model | OpenAI's real-time voice trio |
| HoloMotion-1 (Horizon) | Robotics model | 400M open whole-body humanoid control |

The model column is short on purpose. The interesting story is the inference column.

1. MTP self-speculation grafts onto big weights

Multi-Token Prediction started as a Qwen 3.6 27B training-time trick. In May it became a generic runtime accelerator: take a base model that ships with MTP heads (or graft heads on via fine-tune), let those heads draft the next few tokens, have the full model verify the draft, and skip the separate draft-model dance.
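To make the draft-then-verify loop concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not any engine's actual code: `toy_target` stands in for the full model, `toy_draft` stands in for the MTP heads, and both names (along with `speculative_step` and `generate`) are hypothetical. It covers greedy decoding only; real engines may also support sampling-based acceptance.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]              # full model: greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]  # cheap heads: k guessed tokens


def speculative_step(ctx: List[Token], target: NextFn, draft: DraftFn, k: int) -> List[Token]:
    """One draft-then-verify step; returns the tokens accepted in this step."""
    accepted: List[Token] = []
    for guess in draft(ctx, k):
        # A real engine checks all k positions in one batched forward pass.
        # This toy calls the target per position to keep the logic visible.
        truth = target(ctx + accepted)
        if guess != truth:
            accepted.append(truth)  # replace the first wrong guess, then stop
            return accepted
        accepted.append(guess)
    # Every guess matched, so the verify pass also yields one bonus token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate(prompt: List[Token], target: NextFn, draft: DraftFn,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Decode max_new tokens; returns (tokens, number of verify steps used)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, target, draft, k))
        steps += 1
    return out[: len(prompt) + max_new], steps


# Hypothetical toy models: the target counts upward modulo 10, while the
# draft forgets the wrap-around and is therefore wrong after every 9.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    return [ctx[-1] + 1 + i for i in range(k)]


tokens, steps = generate([0], toy_target, toy_draft, k=3, max_new=12)
```

The property worth noticing is that the output is identical to plain greedy decoding with the target alone. Speculation changes only how many full-model passes are needed, which is why a bad draft costs speed but not quality.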

Numbers from this window:

  • DeepSeek V4-Flash with W4A16 + FP8 + MTP self-speculation: 85 tok/s at 524k context on 2× RTX PRO 6000 Max-Q
  • Qwen 3.6 27B with TurboQuant + MTP: 80+ tok/s at 262K context on a single RTX 4090
  • Qwen 3.6 27B with MTP graft + llama.cpp PR: 50 tok/s on a single RTX 3090
  • llama.cpp MTP support merged to main (May 19): 2.44× throughput on Strix Halo, 2.17× on RTX 3090

For agentic coding the implication is direct: a 27B-class model holding 262K of context now sustains throughput that previously required a server. The "local agentic dev loop" feasibility threshold just moved.
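For a rough feel of where multipliers like the 2.17× and 2.44× above can come from, the standard speculative-decoding estimate is easy to compute. The sketch below assumes each drafted token is accepted independently with probability `alpha`, and that drafting one token costs a fraction `draft_cost` of a full-model pass. Both are simplifying assumptions, the function names are hypothetical, and the inputs in the example are made up rather than measured on any model named in this issue.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass with k drafted tokens.

    Assumes independent per-token acceptance probability alpha; the sum
    1 + alpha + ... + alpha**k has the closed form used below.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding, ignoring memory and scheduling effects."""
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)


# Illustrative inputs only: 80% acceptance, 3 drafted tokens,
# each draft token costing 5% of a full forward pass.
example = estimated_speedup(alpha=0.8, k=3, draft_cost=0.05)
```

The estimate makes the trade-off visible: self-speculation keeps `draft_cost` small because the heads reuse the base model's hidden state, so most of the gain depends on how often the heads guess right.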

2. DFlash on consumer Blackwell

DFlash speculative decoding landed in two engines this window: vLLM and ExLlamaV3 (turboderp's stack). Under vLLM, Gemma 4 26B sustains 600 tok/s on a single RTX 5090. With the ExLlamaV3 integration, agentic benchmarks improved 2.51× over the pre-DFlash baseline.

That's not "a faster prompt" — that's "single-card local inference matches last year's API tier."

Pair this with NVIDIA's Gemma-4-26B-A4B-NVFP4 quant (50k ctx at 80% of a 5090's 32GB VRAM, shipped earlier in May) and the 5090 is now the most leveraged single piece of consumer hardware for local LLMs since the original 3090.

3. PAGED MoE on a 64GB Mac

The wildest entry of the month. An open-source PAGED MoE engine ran a 397B-parameter model on an M1 Ultra 64GB Mac Studio at 1.59 tok/s — using ~14GB of resident memory at any moment, paging experts in and out.

1.59 tok/s isn't a daily driver. But "consumer Mac ≈ 70B ceiling" is no longer a load-bearing assumption. For the right workload (latency-tolerant batch inference, agent-with-memory architectures, async assistants), sparse-MoE models far larger than the box's RAM are now in scope.
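The core idea of paging experts in and out under a fixed memory budget can be sketched in a few lines. This is an illustration, not the PAGED MoE engine's code: the issue does not say which eviction policy the engine uses, so least-recently-used is an assumption here, and `ExpertCache` and its `load` callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keeps at most budget_bytes of expert weights resident (LRU eviction)."""

    def __init__(self, budget_bytes: int, load: Callable[[int], bytes]) -> None:
        self.budget = budget_bytes
        self.load = load  # hypothetical: reads one expert's weights from disk
        self.resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.used = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1
        weights = self.load(expert_id)
        if len(weights) > self.budget:
            raise ValueError("a single expert exceeds the memory budget")
        # Evict least recently used experts until the new one fits.
        while self.used + len(weights) > self.budget:
            _, evicted = self.resident.popitem(last=False)
            self.used -= len(evicted)
        self.resident[expert_id] = weights
        self.used += len(weights)
        return weights


# Toy run: room for three 100-byte experts, router asks for 0,1,2,0,3,1.
cache = ExpertCache(budget_bytes=300, load=lambda expert_id: bytes(100))
for expert_id in [0, 1, 2, 0, 3, 1]:
    cache.get(expert_id)
```

In this toy run, five of the six requests go to storage. That miss rate is the lever behind the throughput figure: every miss is a read from disk, so speed depends on how often the router revisits experts that are still resident.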

A companion data point: an enthusiast got a 1T-parameter model running on Intel Optane Persistent Memory at 4 tok/s. Reference value > practical value, but the architectural envelope is being mapped.

Small but useful

  • MiniCPM 4.6 — 1.2B dense edge model. Targets phones and laptop CPUs. Useful as a small classifier / router in a multi-model pipeline.
  • Qwen 3.6 35B-A3B — community consensus settled: faster and stronger on code than Gemma 4 26B-A4B. Distill variants (14B / 9B) still pending; will be the roleplay/fine-tune base of choice when they land.
  • Gemma 4 WebGPU + Transformers.js — full offline Gemma 4 in a browser tab, no native runtime. Companion / embedded scenarios just got a new deployment shape.
  • AMD Strix Halo ROCm tutorial — first clean end-to-end ROCm fine-tune path published this year. AMD as a price-performance option moves from "in theory" to "demonstrated."

Cloud and adjacent

  • OpenAI shipped a real-time voice trio: GPT-Realtime-2 (GPT-5-tier reasoning in real-time conversation), GPT-Realtime-Translate, GPT-Realtime-Whisper. The voice category is being aggressively consolidated under OpenAI; local TTS/ASR projects need a positioning answer.
  • Horizon open-sourced HoloMotion-1, a 400M "small-brain" model for whole-body humanoid control. Not a chat LLM, but a signal: robotics open-source is heating up, and the parameter count required for embodied control is small enough that consumer hardware is genuinely the target.

What to actually do this week

  • RTX 5090 owner: install ExLlamaV3 with DFlash. Re-benchmark your daily driver. If you're not seeing >300 tok/s on Gemma 4 26B, something is mis-configured.
  • RTX 4090 / 3090 owner: pull the latest llama.cpp main with MTP support. Re-run Qwen 3.6 27B and your throughput should jump ~2×.
  • Mac 64GB owner: try the PAGED MoE engine on a sparse-MoE model you previously assumed was out of reach. Plug it into an async-agent or memory-augmented loop where roughly 1.6 tok/s is acceptable.
  • AMD ROCm user: walk through the Strix Halo fine-tune tutorial end-to-end at least once. Even if you don't fine-tune today, the path matters.
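For the re-benchmarking steps above, a small engine-agnostic timer is enough to get a comparable decode number. The sketch below assumes you can wrap your engine's streaming output as a Python iterable that yields one item per generated token. That wrapper is yours to write and is not shown, and `tokens_per_second` and `meets_floor` are hypothetical helper names.

```python
import time
from typing import Iterable


def tokens_per_second(stream: Iterable[object], skip: int = 1) -> float:
    """Measure decode throughput of a token stream in tokens per second.

    The first `skip` tokens are consumed untimed so that prompt processing
    and time-to-first-token do not dilute the steady-state decode rate.
    """
    it = iter(stream)
    for _ in range(skip):
        next(it, None)
    start = time.perf_counter()
    count = sum(1 for _ in it)  # drain the stream, counting timed tokens
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def meets_floor(measured: float, floor: float = 300.0) -> bool:
    """Compare a measurement against a floor such as the 300 tok/s above."""
    return measured >= floor
```

Run it on the same prompt and context length before and after the upgrade. A ratio from your own workload is more informative than an absolute number copied from someone else's setup.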

The headlines this month favor inference engineers. That's a healthier place for the local-AI ecosystem to be than the weight-release treadmill — and it's the wave that closes the gap with GPT-5.5 Pro on throughput-per-dollar even as the raw-intelligence gap widens.
