The local-AI map redrawn in 7 days
Qwen 3.6 27B beats a 397B predecessor. Gemma 4 26B-A4B lands with 22 quants. Kimi K2.6 hits Opus parity at 1T params.
Seven days. Fourteen model releases. Here's what actually changed for the people running inference locally.
What shipped (April 18–24)
| Model | Provider | Params | Local? | Highlight |
|---|---|---|---|---|
| Qwen 3.6 27B | Alibaba | 27B dense | ✓ | SWE-bench 77.2, multimodal |
| Gemma 4 26B-A4B | Google | 26B/4B MoE | ✓ | 22 Unsloth quants |
| micro-kiki v3 | L'Électron Rare | 35B/3B MoE + LoRA | ✓ | Domain expert routing |
| Kimi K2.6 | Moonshot AI | 1T/32B MoE | Cloud | ~Opus 4.6 benchmark parity |
| DeepSeek V4 Pro | DeepSeek | 1.6T/862B MoE | Cloud | FP4 QAT at scale |
| DeepSeek V4 Flash | DeepSeek | 292B/158B MoE | Cloud | Near-Pro at 1/12 cost |
| HunYuan Hy3 | Tencent | 295B/21B MoE | Cloud | +40% efficiency, open weights |
| Ling 2.6 1T | Ant Group | 1T/50B MoE | Cloud | AA Index #2, 262K ctx |
The local column is the shorter list: three models, two of which are genuinely good.
Two that change your local stack
Qwen 3.6 27B — a 397B model in a 27B shell
Qwen 3.6 27B is a dense 27B model that outperforms Qwen 3.5 397B on SWE-bench Verified, scoring 77.2 to edge out the prior-gen flagship. For context: the 397B model requires multi-GPU infrastructure. The 27B runs on a single 24GB GPU at Q4_K_M (17GB VRAM), on a 16GB card at Q2_K (11GB), or on any M-series Mac with 24GB+.
It's also natively multimodal — image, video, and text in a single model. No adapter, no pipeline switch. Compatible with Claude Code and Qwen Code tooling out of the box.
ollama pull qwen3.6:27b
→ Use the calculator to confirm it fits your GPU.
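To exercise the multimodal path, Ollama's generate endpoint accepts base64-encoded images alongside the prompt. A minimal sketch, assuming a local Ollama server on the default port and a photo.jpg on disk:

```bash
# Send an image plus a text prompt to the model pulled above.
# `base64 -w0` is GNU coreutils; on macOS use `base64 -i photo.jpg`.
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"qwen3.6:27b\",
  \"prompt\": \"Describe this image in one sentence.\",
  \"images\": [\"$(base64 -w0 photo.jpg)\"],
  \"stream\": false
}"
```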
Gemma 4 26B-A4B — the best-quantized model in its class
Google quietly released a 26B-total / 4B-active MoE variant of Gemma 4. The model itself isn't the story. Unsloth's quantization coverage is. They published best-in-class GGUFs across 22 quant levels — the broadest coverage in the Gemma 4 family. If you've been waiting for a Gemma 4 MoE that runs cleanly at Q2_K on a 12GB GPU, this is it.
The model page has the full quant ladder: Q8_0 at 29GB VRAM down to Q2_K at 10.5GB.
ollama pull gemma4:26b-a4b
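If you want a specific rung of the quant ladder rather than Ollama's default, the GGUFs can be pulled straight from Hugging Face. A sketch, assuming the repo follows Unsloth's usual naming (check the model page for the exact repo and filenames):

```bash
# Fetch only the Q2_K files from the assumed Unsloth GGUF repo,
# then point llama.cpp (or an Ollama Modelfile) at the result.
huggingface-cli download unsloth/gemma4-26b-a4b-GGUF \
  --include "*Q2_K*" --local-dir ./gemma4-26b-a4b
```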
Why the cloud tier matters anyway
The cloud releases this week serve as the capability ceiling — the "if you had infinite VRAM" benchmark that tells you what's eventually coming to consumer hardware.
Kimi K2.6 (1T total, 32B active, 256K context) matches Claude Opus 4.6 on SWE-bench and supports 300 parallel sub-agents out of the box. The community verdict: 85% of Opus 4.6 tasks are replaceable today. It's already on OpenRouter and Cloudflare Workers AI.
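If you want to poke at the ceiling yourself, OpenRouter exposes it through the usual OpenAI-compatible endpoint. A minimal sketch; the model slug here is an assumption, so check OpenRouter's listing for the published identifier:

```bash
# Hypothetical slug "moonshotai/kimi-k2.6"; substitute the real one.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/kimi-k2.6",
    "messages": [{"role": "user", "content": "Summarize this diff."}]
  }'
```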
DeepSeek V4 Pro introduces two architectural innovations worth watching: Hybrid Attention (CSA+HCA layers replace standard full-attention), and Manifold-Constrained Hyper-Connections that replace residual connections entirely. If those ideas distill down to 7B and 14B models the way prior DeepSeek advances did, your local stack benefits in 6–12 months.
Ling 2.6 1T from Ant Group is the dark horse: AA Intelligence Index #2 globally (score 34 vs the 13-point mean), open weights incoming, Apache 2.0. Price is aggressive at $0.30/$2.50 per million tokens. Flag this one for local quantization when the weights land.
Hardware that makes sense now
Three hardware additions this week close gaps that have been open since early 2025:
RTX 5070 Ti (16GB, 896 GB/s) — same VRAM class as the RTX 4080 SUPER but at $749 vs $999. The bandwidth jump (896 vs 736 GB/s) directly improves token generation throughput for quantized models; see the back-of-envelope math after this list. Qwen 3.6 27B fits here at Q2_K (11GB), while Q4_K_M (17GB) still wants a 24GB card.
RX 9070 XT (16GB, 644 GB/s, RDNA 4, $599) — AMD's first consumer GPU with credible llama.cpp ROCm support in this generation. Gemma 4 26B-A4B runs comfortably at Q2_K (10.5GB); Q4_K_M (16.5GB) is just over the card's 16GB, workable only with a few layers offloaded to CPU.
Apple M4 Ultra 64GB makes the 31B–72B dense class local. Every model up to ~70B fits at Q4_K_M, with no need to drop to a lower-quality quant. At 1092 GB/s unified bandwidth, it delivers the fastest token generation of any hardware in this roundup.
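Why bandwidth is the headline spec: single-stream generation is memory-bound, so the theoretical ceiling is roughly bandwidth divided by bytes read per token (about the quantized model size for a dense model; only the active-expert slice for a MoE). A rough sketch of that arithmetic:

```bash
# Decode ceiling ~= bandwidth (GB/s) / bytes read per token (GB).
# Real throughput lands meaningfully below these numbers.
awk 'BEGIN {
  printf "RTX 5070 Ti, Qwen 27B Q4_K_M: ~%.0f tok/s ceiling\n", 896 / 17
  printf "M4 Ultra,    Qwen 27B Q4_K_M: ~%.0f tok/s ceiling\n", 1092 / 17
}'
```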
Run it
Minimum VRAM for the locally-runnable models this week:
| Model | Best quant for 16GB GPU | VRAM | Fits 24GB GPU (Q4_K_M) |
|---|---|---|---|
| Qwen 3.6 27B | Q2_K | 11GB | ✓ (17GB) |
| Gemma 4 26B-A4B | Q2_K | 10.5GB | ✓ (16.5GB) |
| Qwen 3.6 35B-A3B | Q2_K | 13GB | ✓ (21GB) |
Use the calculator to find the best quant for your exact GPU — Q4_K_M is the quality sweet spot once you have 24GB+.
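If you'd rather sanity-check by hand, the estimate behind the calculator is simple: weight memory is parameter count times average bits per weight, divided by 8, plus a gigabyte or two for KV cache and runtime overhead. A sketch using assumed bits-per-weight averages for the common quants:

```bash
# Rough GGUF averages (assumptions): Q4_K_M ~4.85 bpw, Q2_K ~3.25 bpw.
# 27B at Q4_K_M gives ~16.4 GB of weights, lining up with the ~17 GB
# figure above once KV cache and overhead are added.
awk 'BEGIN {
  printf "Qwen 3.6 27B @ Q4_K_M: %.1f GB weights\n", 27 * 4.85 / 8
  printf "Qwen 3.6 27B @ Q2_K:   %.1f GB weights\n", 27 * 3.25 / 8
}'
```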
RunLocal covers the Ollama, OpenCode, and local inference ecosystem weekly.
runlocal.dev · @RunLocalcc