llama.cpp + MTP is here: Qwen3.6 27B hits 2.17× on an RTX 3090 — should you upgrade tonight?
Multi-Token Prediction landed in mainline llama.cpp this week. The real numbers across RTX 5090 / 3090 / Strix Halo / 8GB cards, and a one-table answer to whether MTP is worth your weekend.
What changed this week
PR #22673 merged into mainline llama.cpp on May 16, bringing Multi-Token Prediction (MTP) speculative decoding out of community forks and into the default build. The r/LocalLLaMA front page lit up with three back-to-back posts crossing 600+ upvotes within 24 hours.
Four days later (May 20), LM Studio shipped MTP support. Google AI Edge Gallery (v1.0.13/14) added Gemma 4 MTP on Pixel TPU. Lemonade graduated macOS support out of beta. The full local-inference stack converged on the same primitive in a single week.
Timeline:
May 06 Gemma 4 MTP drafters released by Google
May 13 Community fork: MTP for Qwen on llama.cpp + TurboQuant
May 16 PR #22673 merges into mainline llama.cpp
May 17 First independent benchmarks (RTX 5090, Strix Halo)
May 19 Follow-up PR #23269: faster prompt processing for MTP
May 20 LM Studio ships MTP UI
What MTP actually is, in 60 seconds
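MTP is a flavor of speculative decoding. Instead of needing a separate small "draft" model running in parallel, the main model has extra prediction heads baked in that propose the next N tokens in one forward pass. The model then verifies those proposals, keeps the ones that match what it would have generated anyway, and discards everything from the first miss onward, so the output is unchanged and only the speed differs.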
The win: no second model in VRAM, no version-skew between draft and target, and the speedup is "free" if your card has the headroom to run the slightly larger weights.
The catch: those extra heads cost VRAM and KV cache. On cards under ~10GB the overhead eats the gains.
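The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the main model and its MTP heads, and a real engine verifies all drafted positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextFn,
                     target_next: NextFn, n_draft: int) -> List[Token]:
    """Run one draft-then-verify step and return the tokens actually emitted."""
    # 1. Draft: cheaply propose n_draft tokens.
    proposed: List[Token] = []
    for _ in range(n_draft):
        proposed.append(draft_next(context + proposed))

    # 2. Verify: compare each proposal with what the target would emit.
    emitted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + emitted)
        if tok != expected:
            # Miss: drop this draft and all later ones, keep the target's token.
            emitted.append(expected)
            return emitted
        emitted.append(tok)

    # Every draft matched, so the verification pass yields one extra token.
    emitted.append(target_next(context + emitted))
    return emitted


def generate(context: List[Token], draft_next: NextFn, target_next: NextFn,
             n_draft: int, n_tokens: int) -> List[Token]:
    """Generate n_tokens new tokens by repeating speculative steps."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, n_draft))
    return out[len(context):len(context) + n_tokens]


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after a 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, 3))  # [1, 2, 3, 4]
    print(speculative_step([2], toy_draft, toy_target, 3))  # [3, 4]
```

In the first call all three drafts are accepted and one step emits four tokens. In the second, the draft misses at the second position, so the step emits only two. Either way the emitted tokens are exactly what the target alone would have produced.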
Real-world numbers
Aggregated from the past week's benchmark posts on r/LocalLLaMA:
| Hardware | Model | TG (token generation) speedup or speed | Notes |
|---|---|---|---|
| RTX 5090 (32GB) | Qwen3.6 27B | ~2.0× | Linux, llama.cpp 4f13cb7 |
| RTX 3090 (24GB) | Qwen3.6 27B | 2.17× | Headless, single-card |
| AMD Strix Halo | Qwen3.6 27B | 2.44× | Best result reported |
| AMD Strix Halo | Qwen3.6 35B-A3B | Mixed | MoE gains less |
| 2× RTX 3090 | Qwen3.6 27B | +40% over MTP-off | --split-mode tensor quirks |
| MI50 (32GB) | Qwen3.6 27B | 52.8 t/s | MTP not yet supported here |
| GTX 1080 (8GB) | Qwen3.6 35B-A3B | 24+ t/s | MoE offload, no MTP |
| Laptop 6GB VRAM | Qwen3.6 35B-A3B | Not worth it | Overhead eats the win |
One-line takeaway: if you have 24GB+ of VRAM and run a dense Qwen3.6 27B, MTP is a free 2× and you should rebuild llama.cpp tonight. Below 10GB, skip it.
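For a rough feel of where a figure around 2× can come from, here is a back-of-envelope model in Python. It rests on two simplifying assumptions that are not from the benchmark posts: each drafted token is accepted independently with the same probability, and one draft-plus-verify pass costs a fixed multiple of a plain single-token pass. The function names and the example values are illustrative only.

```python
def expected_tokens_per_pass(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with accept_prob.
    The result is the geometric sum 1 + p + p^2 + ... + p^n_draft.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if accept_prob == 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, n_draft: int,
                      pass_cost_ratio: float) -> float:
    """Estimated generation speedup over plain decoding.

    pass_cost_ratio is the assumed cost of one draft-plus-verify pass
    relative to one ordinary single-token pass (1.0 means no overhead).
    """
    if pass_cost_ratio <= 0.0:
        raise ValueError("pass_cost_ratio must be positive")
    return expected_tokens_per_pass(accept_prob, n_draft) / pass_cost_ratio


if __name__ == "__main__":
    # Illustrative inputs, not measured values.
    for p in (0.5, 0.7, 0.8, 0.9):
        print(p, round(estimated_speedup(p, n_draft=3, pass_cost_ratio=1.3), 2))
```

The model makes the trade-off visible: the gain grows with the acceptance rate and shrinks as the per-pass overhead rises, which is consistent with small-VRAM setups seeing the overhead eat the win.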
The tooling stack, as of today
| Tool | MTP status | Notes |
|---|---|---|
| llama.cpp | Mainline | Update past commit 4f13cb7; PR #23198 + #23269 fix prompt processing (PP) |
| LM Studio | Shipped | UI toggle, May 20 |
| Lemonade (AMD/Mac) | Shipped | macOS out of beta, ROCm 7.13 |
| Google AI Edge | Shipped | Gemma 4 MTP on Pixel TPU |
| Ollama | Pending | No official announcement |
| MLX (Apple) | Not yet | Still no MTP path |
| vLLM | Partial | Qwen3 MTP works, FP8 path reports no gain |
PSA: if you tested MTP between May 16 and May 19 and saw bad prompt-processing speed, that's been fixed. Rebuild from current main.
GGUF + MTP vs MLX: the Mac question
The headline question for Apple Silicon users this week: does GGUF with MTP finally beat MLX?
Community consensus from r/LocalLLaMA and Hacker News:
- Dense models ~27B and up: GGUF + MTP now matches or beats MLX on M-series chips. The 2× generation speedup closes the gap MLX held on raw matmul throughput.
- Smaller models (≤14B dense): MLX still leads. The MTP overhead isn't worth it.
- MoE models (Qwen3.6 35B-A3B): mixed — MoE routing limits how many draft tokens get verified per forward pass.
If you're on an M3/M4 Max with 36GB+ unified memory, this is the week to A/B test your daily-driver model.
What to do this weekend
- Check your VRAM headroom — open the runlocal.dev calculator and confirm your card has room for Qwen3.6 27B at Q4/Q5 plus the MTP overhead (roughly +1.5GB at Q4).
- 24GB cards (3090/4090/5090/7900 XTX): rebuild llama.cpp from current main, grab a Qwen3.6 27B MTP GGUF, and try -ctk q8_0 -ctv q8_0 to claw back KV cache VRAM.
- Mac M3/M4 with 36GB+: pull the same GGUF + MTP in LM Studio and benchmark against your current MLX setup. Report back if your numbers contradict the consensus above.
- 8GB or less: don't bother. Stick to Qwen3.6 8B or Gemma 4 E4B. Your speedup story is in quantization, not speculation.
- AMD users: Strix Halo is currently the best price/perf MTP target. Lemonade v10.5.1 with ROCm 7.13 is the supported stack.
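The headroom check in the list above comes down to simple arithmetic, sketched here in Python. The layer count, KV head count, head dimension, context length, weight size, and 1 GiB reserve are all hypothetical placeholders, not Qwen3.6 27B's real figures; read the actual values from your GGUF metadata. Only the roughly +1.5GB MTP overhead at Q4 comes from the numbers above. The formula assumes a plain full-attention KV cache, so models with sliding-window or hybrid layers will differ.

```python
# Bytes per cached element. f16 uses 2 bytes; q8_0 stores 32 int8 values
# plus one fp16 scale per block, so 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34.0 / 32.0}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, cache_type: str = "f16") -> float:
    """Approximate KV cache size in GiB for a full-attention model."""
    # Factor 2 covers both the K and the V tensors.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


def fits(vram_gib: float, weights_gib: float, mtp_overhead_gib: float,
         kv_gib: float, reserve_gib: float = 1.0) -> bool:
    """True if weights, MTP overhead, KV cache and a safety reserve fit."""
    return weights_gib + mtp_overhead_gib + kv_gib + reserve_gib <= vram_gib


if __name__ == "__main__":
    # Hypothetical model shape and sizes, for illustration only.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=32768)
    for cache_type in ("f16", "q8_0"):
        kv = kv_cache_gib(cache_type=cache_type, **shape)
        ok = fits(vram_gib=24.0, weights_gib=16.0,
                  mtp_overhead_gib=1.5, kv_gib=kv)
        print(f"{cache_type}: KV cache {kv:.2f} GiB, fits in 24 GiB: {ok}")
```

With these placeholder numbers the f16 cache does not fit on a 24GB card but the q8_0 cache does, which is the kind of margin the -ctk q8_0 -ctv q8_0 flags are meant to recover.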
Why this matters past this week
MTP is the first speculative-decoding variant that's:
- Built into the model weights (no separate draft model to fetch and version-match)
- Already standardized across 3+ inference engines in under a month
- Working at consumer-VRAM scale (24GB delivers a real 2× on a model people actually use)
The agentic-coding throughput floor just moved. A Qwen3.6 27B on a 3090 generating twice as many tokens per second is the difference between "local LLM as a curiosity" and "local LLM as a viable Cursor backend." Expect Qwen3.7, DeepSeek next-gen, and Llama 5 to ship with MTP heads as a baseline feature, not an option.
Two things to watch over the next 60 days: when Ollama and MLX catch up, and whether anyone publishes a clean MTP fine-tuning recipe so that community models (not just first-party releases) can ship MTP weights.
Next issue: the post-MTP inference-engine landscape — where Ollama, MLX, and vLLM go from here.