runlocal.dev
Issue #2 · Apr 12, 2026

Gemma 4 changes the local LLM game — and the first killer use case is Claude Code

88% accuracy at 175 tok/s, 17GB VRAM, and how to cut your Claude Code bill with one env var

Gemma 4 26B MoE hits 88% accuracy at 175 tok/s in 17GB of VRAM on a financial classification benchmark — outperforming Qwen 3.5 72B by 17 points. If you have 18–20GB VRAM (RTX 4090, RTX 5080, or a 24GB Apple Silicon config), this is the new default.

The benchmark

500 real corporate disclosures, 5-category stock direction prediction, published on Zenn this week:

Model               Accuracy   Speed         VRAM
Gemma 4 26B MoE     88%        175.7 tok/s   17 GB
Gemma 4 31B Dense   88%        61.5 tok/s    19 GB
Qwen 3.5 72B        71%        146.5 tok/s   24 GB

The Dense and MoE variants produced identical outputs on all 50 test cases despite different architectures. MoE is 2.9× faster for 2GB less VRAM.

The 17-point gap against Qwen 3.5 came from one failure mode: Qwen generated 19 false signals versus 4 for Gemma 4 on routine disclosures. Gemma 4 distinguishes material information from noise. Qwen 3.5 doesn't — at least not reliably.

The killer use case

Point Claude Code at Ollama. One environment variable (plus a dummy auth token).

Unlike LM Studio (OpenAI-format only), Ollama natively supports Anthropic's Messages API at /v1/messages — no proxy, no LiteLLM, no conversion layer.

# Pull the models
ollama pull gemma4:e4b    # ~5GB VRAM — commit msgs, summaries
ollama pull gemma4:26b    # ~17GB VRAM — code review, PR bodies

# Point Claude Code at local Ollama
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_AUTH_TOKEN=ollama \
claude

# Shortcut — same thing
ollama launch claude --model gemma4:26b
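Anything else that speaks the Messages API can hit the same endpoint without Claude Code in the loop. A minimal sketch, assuming a local `ollama serve` on the default port — the payload fields follow Anthropic's Messages API shape, and the prompt text is illustrative:

```shell
# Anthropic-style Messages payload (field names per Anthropic's
# Messages API; model tag from the pulls above).
BODY='{"model":"gemma4:26b","max_tokens":256,"messages":[{"role":"user","content":"Write a one-line summary of this diff: ..."}]}'
echo "$BODY"
# Send it to the local endpoint (assumes ollama serve is running):
#   curl -s http://localhost:11434/v1/messages \
#     -H 'content-type: application/json' -d "$BODY"
```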

The split

  • Stays on Claude: architecture decisions, debugging, novel problem-solving.
  • Goes to local Gemma 4: commit message generation, PR bodies, code review, session summarization, translation.
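As a sketch of what the local side of that split can look like — a hypothetical helper (the function name, prompt wording, and stdin-piping pattern are illustrative, not from the newsletter) that drafts a commit message with the small Gemma variant:

```shell
# Hypothetical helper: pipe the staged diff into the small local model.
# `ollama run MODEL "PROMPT"` reads stdin and appends it to the prompt.
ai_commit_msg() {
  git diff --cached | ollama run gemma4:e4b \
    "Write a one-line conventional commit message for this diff:"
}
```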

A Japanese developer built this into seven shell commands (ai-commit-msg, ai-summarize, ai-review, ai-pr, etc.) wired through Claude Code's rules file. PreToolUse hooks intercept git commits for auto-generated messages. Stop hooks run a Gemma-based safety check before session close.
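A hedged sketch of what the commit interception could look like in Claude Code's settings — the `PreToolUse` event and `matcher`/`command` fields follow Claude Code's hooks schema, while `ai-commit-msg` is the developer's own script and its path here is an assumption:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "~/.claude/scripts/ai-commit-msg" }
        ]
      }
    ]
  }
}
```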

Two implementation lessons

  1. Don't mix model families. Switching between Qwen3 and Gemma4 causes full model reloads. Staying Gemma4-only — E4B for quick tasks, 26B for code review — keeps models warm in VRAM.
  2. Cut unused Claude Code plugins. One dev dropped from 16 to 8 enabled plugins and eliminated "thousands of tokens" of unused skill descriptions per session.
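On the first lesson: beyond staying in one family, Ollama's keep-alive setting controls how long a loaded model stays resident. A sketch — `OLLAMA_KEEP_ALIVE` is a standard Ollama server variable, and `-1` means never unload:

```shell
# Keep loaded models resident so E4B <-> 26B switching never
# triggers a cold reload. -1 = never unload; a duration string
# like "24h" also works. Set before starting `ollama serve`.
export OLLAMA_KEEP_ALIVE=-1
echo "keep_alive=$OLLAMA_KEEP_ALIVE"   # → keep_alive=-1
```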

What to update

  • Ollama v0.20.6-rc1 in testing; v0.20.5 stable has the Flash Attention fix that was silently corrupting Gemma 4 output on pre-Ampere GPUs.
  • OpenClaw 2026.4.11 — stable + same-day beta. 5–7 releases per week at 343K+ stars.
  • OpenCode v0.0.55 — one release every ~3 days. Go + Bubble Tea TUI, worth reading for the LSP-feedback loop architecture.

Based on RunLocal Issue #2 · Full newsletter version on Substack →