Gemma 4 changes local LLMs — and the first killer use case is Claude Code
88% accuracy at 175 tok/s, 17GB VRAM, and how to cut your Claude Code bill with one env var
Gemma 4 26B MoE hits 88% accuracy at 175 tok/s in 17GB of VRAM on a financial classification benchmark — outperforming Qwen 3.5 72B by 17 points. If you have 18–20GB VRAM (RTX 4090, RTX 5080, or a 24GB Apple Silicon config), this is the new default.
The benchmark
500 real corporate disclosures, 5-category stock direction prediction, published on Zenn this week:
| Model | Accuracy | Speed | VRAM |
|---|---|---|---|
| Gemma 4 26B MoE | 88% | 175.7 tok/s | 17 GB |
| Gemma 4 31B Dense | 88% | 61.5 tok/s | 19 GB |
| Qwen 3.5 72B | 71% | 146.5 tok/s | 24 GB |
The Dense and MoE variants produced identical outputs on all 50 test cases despite different architectures. MoE is 2.9× faster for 2GB less VRAM.
The 17-point gap against Qwen 3.5 came from one failure mode: Qwen generated 19 false signals versus 4 for Gemma 4 on routine disclosures. Gemma 4 distinguishes material information from noise. Qwen 3.5 doesn't — at least not reliably.
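The scoring behind those numbers can be sketched in a few lines. A hedged sketch, not the benchmark's actual code: the category names and the definition of a "false signal" (a directional prediction on a routine, neutral-labeled disclosure) are assumptions for illustration.

```python
# Hypothetical 5-category labels; the real benchmark's names are not published here.
CATEGORIES = ["strong_up", "up", "neutral", "down", "strong_down"]

def score(preds, labels):
    """Return (accuracy, false_signal_count).

    A false signal (assumed definition): the disclosure is routine
    ("neutral") but the model predicted a directional move anyway.
    """
    correct = sum(p == l for p, l in zip(preds, labels))
    false_signals = sum(
        1 for p, l in zip(preds, labels) if l == "neutral" and p != "neutral"
    )
    return correct / len(labels), false_signals

# Toy data: one routine disclosure misread as a strong buy signal.
labels = ["neutral", "up", "neutral", "down", "neutral"]
preds  = ["neutral", "up", "strong_up", "down", "neutral"]
acc, fs = score(preds, labels)
print(acc, fs)  # 0.8 1
```

Counting false signals separately from raw accuracy is what exposes the Qwen failure mode: both metrics matter when "do nothing" is the correct call most of the time.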
The killer use case
Point Claude Code at Ollama. One environment variable.
Unlike LM Studio (OpenAI-format only), Ollama natively supports Anthropic's Messages API at `/v1/messages` — no proxy, no LiteLLM, no conversion layer.
```sh
# Pull the models
ollama pull gemma4:e4b   # ~5GB VRAM — commit msgs, summaries
ollama pull gemma4:26b   # ~17GB VRAM — code review, PR bodies

# Point Claude Code at local Ollama
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_AUTH_TOKEN=ollama \
claude

# Shortcut — same thing
ollama launch claude --model gemma4:26b
```
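What travels over that connection is an Anthropic-style Messages request. A minimal sketch of the payload shape (field names follow Anthropic's Messages API; the prompt text and the auth-header handling are illustrative assumptions):

```python
import json

# The same request body Claude Code would POST to
# http://localhost:11434/v1/messages on a local Ollama.
payload = {
    "model": "gemma4:26b",
    "max_tokens": 512,
    "messages": [
        {
            "role": "user",
            "content": "Write a commit message for: fix off-by-one in pager",
        }
    ],
}

print(json.dumps(payload, indent=2))
```

With Ollama running, the equivalent `curl -d @payload.json http://localhost:11434/v1/messages` returns an Anthropic-format response, which is why Claude Code needs nothing but the base URL swap.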
The split
- Stays on Claude: architecture decisions, debugging, novel problem-solving.
- Goes to local Gemma 4: commit message generation, PR bodies, code review, session summarization, translation.
A Japanese developer built this into seven shell commands (`ai-commit-msg`, `ai-summarize`, `ai-review`, `ai-pr`, etc.) wired through Claude Code's rules file. PreToolUse hooks intercept git commits for auto-generated messages. Stop hooks run a Gemma-based safety check before session close.
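The hook wiring can be sketched in Claude Code's `settings.json`. The `hooks` / `PreToolUse` / `matcher` structure follows Claude Code's hooks schema; the script paths are hypothetical (named after the article's commands), and each script would read the tool input from stdin, detect a `git commit`, and call the local Gemma 4 model:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/ai-commit-msg.sh" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/ai-safety-check.sh" }
        ]
      }
    ]
  }
}
```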
Two implementation lessons
- Don't mix model families. Switching between Qwen3 and Gemma4 causes full model reloads. Staying Gemma4-only — E4B for quick tasks, 26B for code review — keeps models warm in VRAM.
- Cut unused Claude Code plugins. One dev dropped from 16 to 8 enabled plugins and eliminated "thousands of tokens" of unused skill descriptions per session.
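The warm-VRAM point can be reinforced on the Ollama side. `OLLAMA_KEEP_ALIVE` is a real Ollama server setting controlling how long an idle model stays loaded (the default is 5 minutes); the one-hour value here is just an example:

```sh
# Keep idle models resident for an hour instead of the 5-minute default,
# so hopping between gemma4:e4b and gemma4:26b doesn't cold-reload each time.
OLLAMA_KEEP_ALIVE=1h ollama serve
```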
What to update
- Ollama v0.20.6-rc1 in testing; v0.20.5 stable includes the Flash Attention fix for the bug that was silently corrupting Gemma 4 output on pre-Ampere GPUs.
- OpenClaw 2026.4.11 — stable + same-day beta. 5–7 releases per week at 343K+ stars.
- OpenCode v0.0.55 — one release every ~3 days. Go + Bubble Tea TUI, worth reading for the LSP-feedback loop architecture.
Based on RunLocal Issue #2 · Full newsletter version on Substack →