Issue #9 · May 21, 2026

llama.cpp + MTP is here: Qwen3.6 27B hits 2.17× on an RTX 3090 — should you upgrade tonight?

Multi-Token Prediction landed in mainline llama.cpp this week. The real numbers across RTX 5090 / 3090 / Strix Halo / 8GB cards, and a one-table answer to whether MTP is worth your weekend.

What changed this week

PR #22673 merged into mainline llama.cpp on May 16, bringing Multi-Token Prediction (MTP) speculative decoding out of community forks and into the default build. The r/LocalLLaMA front page lit up with three back-to-back posts crossing 600 upvotes within 24 hours.

Four days later (May 20), LM Studio shipped MTP support. Google AI Edge Gallery (v1.0.13/14) added Gemma 4 MTP on Pixel TPU. Lemonade graduated macOS support out of beta. Much of the local-inference stack converged on the same primitive in a single week.

Timeline:

May 06  Gemma 4 MTP drafters released by Google
May 13  Community fork: MTP for Qwen on llama.cpp + TurboQuant
May 16  PR #22673 merges into mainline llama.cpp
May 17  First independent benchmarks (RTX 5090, Strix Halo)
May 19  Follow-up PR #23269: faster prompt processing for MTP
May 20  LM Studio ships MTP UI

What MTP actually is, in 60 seconds

MTP is a flavor of speculative decoding. Instead of relying on a separate small "draft" model loaded alongside it, the main model carries extra prediction heads that propose the next N tokens in one forward pass. The main model then verifies those proposals, keeps the ones that match what it would have generated anyway, and rolls back any that miss.

The win: no second model in VRAM, no version-skew between draft and target, and the speedup is "free" if your card has the headroom to run the slightly larger weights.

The catch: those extra heads cost VRAM and KV cache. On cards under ~10GB the overhead eats the gains.
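For the mechanically minded, the propose, verify, roll-back loop can be sketched in a few lines of Python. This is a toy under loud assumptions: `toy_target` and `toy_draft` are made-up stand-ins for the main model and its MTP heads, verification is greedy (exact match only), and the target is called once per token here, whereas a real engine checks the whole draft in one batched forward pass. It is not llama.cpp's implementation.

```python
from typing import Callable, List, Tuple

Token = int


def speculative_step(
    context: List[Token],
    draft: Callable[[List[Token], int], List[Token]],
    target: Callable[[List[Token]], Token],
    n_draft: int,
) -> List[Token]:
    """Run one greedy draft-and-verify step and return the tokens it commits."""
    proposed = draft(context, n_draft)
    committed: List[Token] = []
    for tok in proposed:
        expected = target(context + committed)
        if tok != expected:
            # First miss: drop the rest of the draft, keep the target's own token.
            committed.append(expected)
            return committed
        committed.append(tok)
    # Every drafted token matched, so the verify pass yields one extra token.
    committed.append(target(context + committed))
    return committed


# Hypothetical toy models: the target simply counts upward, and the draft
# gets every multiple of 5 wrong (it proposes 0 instead).
def toy_target(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


def toy_draft(ctx: List[Token], n: int) -> List[Token]:
    start = ctx[-1]
    return [t if t % 5 else 0 for t in range(start + 1, start + 1 + n)]


def generate(prompt: List[Token], n_new: int, n_draft: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; return the sequence and the number of verify steps."""
    ctx = list(prompt)
    steps = 0
    while len(ctx) - len(prompt) < n_new:
        ctx += speculative_step(ctx, toy_draft, toy_target, n_draft)
        steps += 1
    return ctx[: len(prompt) + n_new], steps


tokens, steps = generate([1], n_new=12, n_draft=3)
print(tokens, steps)
```

With a draft that misses on every multiple of 5, the toy commits 12 new tokens in 4 verify steps, and the output is identical to what the target alone would have produced. That last property is the point: a missed draft costs speed, not correctness.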

Real-world numbers

Aggregated from the past week's benchmark posts on r/LocalLLaMA:

Hardware	Model	Token generation (speedup or t/s)	Notes
RTX 5090 (32GB)	Qwen3.6 27B	~2.0×	Linux, llama.cpp 4f13cb7
RTX 3090 (24GB)	Qwen3.6 27B	2.17×	Headless, single-card
AMD Strix Halo	Qwen3.6 27B	2.44×	Best result reported
AMD Strix Halo	Qwen3.6 35B-A3B	Mixed	MoE gains less
2× RTX 3090	Qwen3.6 27B	+40% over MTP-off	`--split-mode tensor` quirks
MI50 (32GB)	Qwen3.6 27B	52.8 t/s	MTP not yet supported here
GTX 1080 (8GB)	Qwen3.6 35B-A3B	24+ t/s	MoE offload, no MTP
Laptop 6GB VRAM	Qwen3.6 35B-A3B	Not worth it	Overhead eats the win

One-line takeaway: if you have 24GB+ of VRAM and run a dense Qwen3.6 27B, MTP is a free 2× and you should rebuild llama.cpp tonight. Below 10GB, skip it.
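Where can a number like 2× come from? A common back-of-the-envelope model, not something measured in the posts above, assumes each drafted token is accepted independently with probability p. With k drafted tokens per step, the expected number of tokens committed per verify step is the geometric sum 1 + p + ... + p^k. Dividing by how much more expensive an MTP step is than a plain decode step gives a rough speedup. The values of `p`, `k`, and `step_cost` below are illustrative placeholders, not benchmark results.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens committed per verify step: 1 + p + p^2 + ... + p^k.

    Assumes each drafted token is accepted independently with probability p,
    which is a simplification of how real drafts behave.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def rough_speedup(p: float, k: int, step_cost: float) -> float:
    """Tokens per step divided by the relative cost of an MTP step.

    step_cost is how much slower one MTP step is than one plain decode
    step (1.0 means no overhead). It is an assumed input, not a measurement.
    """
    return expected_tokens_per_step(p, k) / step_cost


# Placeholder inputs: 70% acceptance, 2 drafted tokens, 10% per-step overhead.
print(round(rough_speedup(p=0.7, k=2, step_cost=1.1), 2))
```

With those placeholder inputs the estimate works out to roughly 2×, which is at least in the same neighborhood as the single-card rows above. Treat it as intuition for why acceptance rate matters, not as a prediction for your hardware.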

The tooling stack, as of today

Tool	MTP status	Notes
llama.cpp	Mainline	Update past commit 4f13cb7; PR #23198 + #23269 fix prompt-processing (PP) speed
LM Studio	Shipped	UI toggle, May 20
Lemonade (AMD/Mac)	Shipped	macOS out of beta, ROCm 7.13
Google AI Edge	Shipped	Gemma 4 MTP on Pixel TPU
Ollama	Pending	No official announcement
MLX (Apple)	Not yet	Still no MTP path
vLLM	Partial	Qwen3 MTP works, FP8 path reports no gain

PSA: if you tested MTP between May 16 and May 19 and saw bad prompt-processing speed, that's been fixed. Rebuild from current main.

GGUF + MTP vs MLX: the Mac question

The headline question for Apple Silicon users this week: does GGUF with MTP finally beat MLX?

Community consensus from r/LocalLLaMA and Hacker News:

Dense models ~27B and up: GGUF + MTP now matches or beats MLX on M-series chips. The 2× generation speedup closes the gap MLX held on raw matmul throughput.
Smaller models (≤14B dense): MLX still leads. The MTP overhead isn't worth it.
MoE models (Qwen3.6 35B-A3B): mixed — MoE routing limits how many draft tokens get verified per forward pass.

If you're on an M3/M4 Max with 36GB+ unified memory, this is the week to A/B test your daily-driver model.

What to do this weekend

Check your VRAM headroom — open the runlocal.dev calculator and confirm your card has room for Qwen3.6 27B at Q4/Q5 plus the MTP overhead (roughly +1.5GB at Q4).
24GB cards (3090/4090/5090/7900 XTX): rebuild llama.cpp from current main, grab a Qwen3.6 27B MTP GGUF, and try `-ctk q8_0 -ctv q8_0` (8-bit quantized K and V cache) to claw back KV cache VRAM.
Mac M3/M4 with 36GB+: pull the same GGUF + MTP in LM Studio and benchmark against your current MLX setup. Report back if your numbers contradict the consensus above.
8GB or less: don't bother. Stick to Qwen3.6 8B or Gemma 4 E4B. Your speedup story is in quantization, not speculation.
AMD users: Strix Halo is currently the best price/perf MTP target. Lemonade v10.5.1 with ROCm 7.13 is the supported stack.
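If you would rather do the headroom check by hand, it is simple arithmetic: weights, plus MTP overhead, plus KV cache. The sketch below is a rough estimate only. The ~1.5GB MTP overhead is the figure quoted above; everything else (4.5 bits per weight for a Q4 quant, the layer and head counts, the 32K context) is a placeholder chosen for illustration and is not the real Qwen3.6 27B config. Compute buffers and runtime overhead are ignored, so leave yourself some margin.

```python
GB = 1e9  # decimal gigabytes

F16_BYTES = 2.0        # bytes per cached value at f16
Q8_0_BYTES = 34 / 32   # q8_0 packs 32 values into 34 bytes (int8 values + fp16 scale)


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_value: float) -> float:
    """KV cache size for standard attention: K and V for every layer and position."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return values * bytes_per_value / GB


def total_gb(bytes_per_value: float, mtp_overhead_gb: float = 1.5) -> float:
    # Placeholder model shape: NOT the actual Qwen3.6 27B architecture.
    w = weights_gb(params_billion=27, bits_per_weight=4.5)
    kv = kv_cache_gb(n_layers=64, n_kv_heads=8, head_dim=128,
                     ctx_len=32768, bytes_per_value=bytes_per_value)
    return w + mtp_overhead_gb + kv


f16_total = total_gb(F16_BYTES)
q8_total = total_gb(Q8_0_BYTES)
print(f"f16 KV cache:  {f16_total:.1f} GB")
print(f"q8_0 KV cache: {q8_total:.1f} GB")
```

With these placeholder numbers, an f16 KV cache lands around 25GB and overflows a 24GB card, while a q8_0 cache lands around 21GB and fits. Swap in the real layer and head counts from your model's metadata before trusting the result.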

Why this matters past this week

MTP is the first speculative-decoding variant that's:

Built into the model weights (no separate draft model to fetch and version-match)
Already standardized across 3+ inference engines in under a month
Working at consumer-VRAM scale (24GB delivers a real 2× on a model people actually use)

The agentic-coding throughput floor just moved. A Qwen3.6 27B on a 3090 generating twice the tokens per second is the difference between "local LLM as a curiosity" and "local LLM as a viable Cursor backend." Expect Qwen3.7, DeepSeek next-gen, and Llama 5 to ship with MTP heads as a baseline feature, not an option.

The things to watch over the next 60 days: when Ollama and MLX catch up, and whether anyone publishes a clean MTP fine-tuning recipe so community models (not just first-party releases) can ship MTP weights.

Next issue: the post-MTP inference-engine landscape — where Ollama, MLX, and vLLM go from here.
