The local-AI map redrawn in 7 days
Qwen 3.6 27B beats a 397B predecessor. Gemma 4 26B-A4B lands with 22 quants. Kimi K2.6 hits Opus parity at 1T params.
Seven days. Fourteen model releases. Here's what actually changed for the people running inference locally.
What shipped (April 18–24)
| Model | Provider | Params | Local? | Highlight |
|---|---|---|---|---|
| Qwen 3.6 27B | Alibaba | 27B dense | ✓ | SWE-bench 77.2, multimodal |
| Gemma 4 26B-A4B | Google | 26B/4B MoE | ✓ | 22 Unsloth quants |
| micro-kiki v3 | L'Électron Rare | 35B/3B MoE + LoRA | ✓ | Domain expert routing |
| Kimi K2.6 | Moonshot AI | 1T/32B MoE | Cloud | ~Opus 4.6 benchmark parity |
| DeepSeek V4 Pro | DeepSeek | 1.6T/862B MoE | Cloud | FP4 QAT at scale |
| DeepSeek V4 Flash | DeepSeek | 292B/158B MoE | Cloud | Near-Pro at 1/12 cost |
| HunYuan Hy3 | Tencent | 295B/21B MoE | Cloud | +40% efficiency, open weights |
| Ling 2.6 1T | Ant Group | 1T/50B MoE | Cloud | AA Index #2, 262K ctx |
The local column is the shorter list: three models, two of which are genuinely good.
Two that change your local stack
Qwen 3.6 27B — a 397B model in a 27B shell
Qwen 3.6 27B is a dense 27B model that outperforms Qwen 3.5 397B on SWE-bench Verified, scoring 77.2 to edge out the prior-gen flagship. For context: the 397B model requires multi-GPU infrastructure. The 27B runs on a single 24GB GPU at Q4_K_M (17GB VRAM), on a 16GB card at Q2_K (11GB), or on any M-series Mac with 24GB+.
It's also natively multimodal — image, video, and text in a single model. No adapter, no pipeline switch. Compatible with Claude Code and Qwen Code tooling out of the box.
ollama pull qwen3.6:27b
→ Use the calculator to confirm it fits your GPU.
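To exercise the multimodal path, Ollama's generate endpoint accepts base64-encoded images alongside the prompt. A minimal sketch, assuming a local Ollama server on the default port and a photo.jpg on disk:

```bash
# Send an image plus a text prompt to the model pulled above.
# `base64 -w0` is GNU coreutils; on macOS use `base64 -i photo.jpg`.
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"qwen3.6:27b\",
  \"prompt\": \"Describe this image in one sentence.\",
  \"images\": [\"$(base64 -w0 photo.jpg)\"],
  \"stream\": false
}"
```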
Gemma 4 26B-A4B — the best-quantized model in its class
Google quietly released a 26B-total / 4B-active MoE variant of Gemma 4. The model itself isn't the story. Unsloth's quantization coverage is. They published best-in-class GGUFs across 22 quant levels — the broadest coverage in the Gemma 4 family. If you've been waiting for a Gemma 4 MoE that runs cleanly at Q2_K on a 12GB GPU, this is it.
The model page has the full quant ladder: Q8_0 at 29GB VRAM down to Q2_K at 10.5GB.
ollama pull gemma4:26b-a4b
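If you want a specific rung of the quant ladder rather than Ollama's default, the GGUFs can be pulled straight from Hugging Face. A sketch, assuming the repo follows Unsloth's usual naming (check the model page for the exact repo and filenames):

```bash
# Fetch only the Q2_K files from the assumed Unsloth GGUF repo,
# then point llama.cpp (or an Ollama Modelfile) at the result.
huggingface-cli download unsloth/gemma4-26b-a4b-GGUF \
  --include "*Q2_K*" --local-dir ./gemma4-26b-a4b
```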
Why the cloud tier matters anyway
The cloud releases this week serve as the capability ceiling — the "if you had infinite VRAM" benchmark that tells you what's eventually coming to consumer hardware.
Kimi K2.6 (1T total, 32B active, 256K context) matches Claude Opus 4.6 on SWE-bench and supports 300 parallel sub-agents out of the box. The community verdict: 85% of Opus 4.6 tasks are replaceable today. It's already on OpenRouter and Cloudflare Workers AI.
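If you want to poke at the ceiling yourself, OpenRouter exposes it through the usual OpenAI-compatible endpoint. A minimal sketch; the model slug here is an assumption, so check OpenRouter's listing for the published identifier:

```bash
# Hypothetical slug "moonshotai/kimi-k2.6"; substitute the real one.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/kimi-k2.6",
    "messages": [{"role": "user", "content": "Summarize this diff."}]
  }'
```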
DeepSeek V4 Pro introduces two architectural innovations worth watching: Hybrid Attention (CSA+HCA layers replace standard full-attention), and Manifold-Constrained Hyper-Connections that replace residual connections entirely. If those ideas distill down to 7B and 14B models the way prior DeepSeek advances did, your local stack benefits in 6–12 months.
Ling 2.6 1T from Ant Group is the dark horse: AA Intelligence Index #2 globally (score 34 vs the 13-point mean), open weights incoming, Apache 2.0. Price is aggressive at $0.30/$2.50 per million tokens. Flag this one for local quantization when the weights land.
Hardware that makes sense now
Three hardware additions this week close gaps that have been open since early 2025:
RTX 5070 Ti (16GB, 896 GB/s) — same VRAM class as the RTX 4080 SUPER but at $749 vs $999. The bandwidth jump (896 vs 736 GB/s) directly improves token generation throughput for quantized models; see the back-of-envelope math after this list. Qwen 3.6 27B fits here at Q2_K (11GB), while Q4_K_M (17GB) still wants a 24GB card.
RX 9070 XT (16GB, 644 GB/s, RDNA 4, $599) — AMD's first consumer GPU with credible llama.cpp ROCm support in this generation. Gemma 4 26B-A4B runs comfortably at Q2_K (10.5GB); Q4_K_M (16.5GB) is just over the card's 16GB, workable only with a few layers offloaded to CPU.
Apple M4 Ultra 64GB makes the 31B–72B dense class local. Every model up to ~70B fits at Q4_K_M, with no need to drop to a lower-quality quant. At 1092 GB/s unified bandwidth, it delivers the fastest token generation of any hardware in this roundup.
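Why bandwidth is the headline spec: single-stream generation is memory-bound, so the theoretical ceiling is roughly bandwidth divided by bytes read per token (about the quantized model size for a dense model; only the active-expert slice for a MoE). A rough sketch of that arithmetic:

```bash
# Decode ceiling ~= bandwidth (GB/s) / bytes read per token (GB).
# Real throughput lands meaningfully below these numbers.
awk 'BEGIN {
  printf "RTX 5070 Ti, Qwen 27B Q4_K_M: ~%.0f tok/s ceiling\n", 896 / 17
  printf "M4 Ultra,    Qwen 27B Q4_K_M: ~%.0f tok/s ceiling\n", 1092 / 17
}'
```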
Run it
Minimum VRAM for the locally-runnable models this week:
| Model | Best quant for 16GB GPU | VRAM | Fits 24GB GPU (Q4_K_M) |
|---|---|---|---|
| Qwen 3.6 27B | Q2_K | 11GB | ✓ (17GB) |
| Gemma 4 26B-A4B | Q2_K | 10.5GB | ✓ (16.5GB) |
| Qwen 3.6 35B-A3B | Q2_K | 13GB | ✓ (21GB) |
Use the calculator to find the best quant for your exact GPU — Q4_K_M is the quality sweet spot once you have 24GB+.
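If you'd rather sanity-check by hand, the estimate behind the calculator is simple: weight memory is parameter count times average bits per weight, divided by 8, plus a gigabyte or two for KV cache and runtime overhead. A sketch using assumed bits-per-weight averages for the common quants:

```bash
# Rough GGUF averages (assumptions): Q4_K_M ~4.85 bpw, Q2_K ~3.25 bpw.
# 27B at Q4_K_M gives ~16.4 GB of weights, lining up with the ~17 GB
# figure above once KV cache and overhead are added.
awk 'BEGIN {
  printf "Qwen 3.6 27B @ Q4_K_M: %.1f GB weights\n", 27 * 4.85 / 8
  printf "Qwen 3.6 27B @ Q2_K:   %.1f GB weights\n", 27 * 3.25 / 8
}'
```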
RunLocal covers the Ollama, OpenCode, and local inference ecosystem weekly.
runlocal.dev · @RunLocalcc