Blog
Condensed weekly issues — model releases, benchmarks, Ollama workflows. Also published on Substack.
A FLUX-class 4B image model, squeezed to 1.21 GB — and yes, it runs in your browser
PrismML's Bonsai Image quantizes a FLUX.2-derived diffusion transformer down to sub-2-bit weights, with MLX, CUDA, WebGPU and iPhone builds. The footprint and speed are real; the quality cost is real too. Here's the honest tradeoff — official numbers, our own benchmarks pending.
Claude shipped 'Agent Skills'. r/LocalLLaMA already converged on the canonical 4 — copy these into your local agent today.
Plan-first, test-first, refactor-with-constraint, debug-loop. Four skill files that should ship with every local-LLM coding agent — sourced from the threads that actually got Qwen3.6 27B to daily-driver quality.
Stop trying to use Cline locally. r/LocalLLaMA's real answer for daily-driving Qwen3.6 27B + MTP.
Cloud agents fall apart on local models. Three scaffold-first tools the community is actually shipping with — SmallCode, PI Coding Agent, and little-coder — plus a decision matrix by VRAM.
llama.cpp + MTP is here: Qwen3.6 27B hits 2.17× on an RTX 3090 — should you upgrade tonight?
Multi-Token Prediction landed in mainline llama.cpp this week. The real numbers across RTX 5090 / 3090 / Strix Halo / 8GB cards, and a one-table answer to whether MTP is worth your weekend.
AI coding agents compared (2026): Claude Code vs Cursor vs OpenCode vs OpenClaw vs Gemini CLI vs Cluely vs z.ai — and the best local models on Ollama
Seven cloud agents, four local models, one decision matrix. Pricing, underlying models, benchmark numbers, and the cases where each one is the right pick.
OpenClaw + Ollama: the complete setup after Anthropic's third-party clampdown
Search interest in `openclaw ollama` is up 507,000% in 30 days. Here's the full migration path off Claude Pro — install, model pick, common errors, and what you actually give up.
The inference-engine wave: MTP, DFlash, PAGED MoE
Single-card 600 tok/s. 397B on a 64GB Mac. 85 tok/s at 524k context. Three weeks of runtime breakthroughs reset what 'local' means.
Omni-modal locals land, GPT-5.5 resets the frontier
Nemotron-3 Nano Omni puts four modalities on a 24GB card. OpenAI ships GPT-5.5 to everyone. Xiaomi enters the open-weight flagship tier.
The local-AI map redrawn in 7 days
Qwen3.6 27B beats a 397B predecessor. Gemma 4 26B-A4B lands with 22 quants. Kimi K2.6 hits Opus parity at 1T params.
Persistent AI memory on a Raspberry Pi 5
Local embeddings + ChromaDB + Ollama in ~150 lines. ~$100 of hardware. No tokens.
Gemma 4 changes local LLMs, and the first killer use case is Claude Code
88% accuracy at 175 tok/s, 17GB VRAM, and how to cut your Claude Code bill with one env var
Your local AI stack is already being scanned
113K requests, a Raspberry Pi honeypot, and the attack surface you didn't know you had