Issue #10 · May 21, 2026

Stop trying to use Cline locally. r/LocalLLaMA's real answer for daily-driving Qwen3.6 27B + MTP.

Cloud agents fall apart on local models. Three scaffold-first tools the community is actually shipping with — SmallCode, PI Coding Agent, and little-coder — plus a decision matrix by VRAM.

The myth I had wrong

I almost wrote a different post this week. Google Trends made it look obvious: average interest over the past 12 months was Cline 62, Aider 27, Continue.dev 3. Clear ranking, clear comparison piece.

Then I went to look at what r/LocalLLaMA was actually using on Qwen3.6 27B over the past 30 days:

Cline: 1 post, score 2
Aider: 1 post, score 21
Continue.dev: 0 posts
SmallCode: 1 post, score 792
PI Coding Agent: 1 post, score 256
little-coder: 1 post, score 21 (deep technical anchor)

Trends lied. The cloud-coding-agent audience and the audience actually running local models barely overlap. Last week's MTP (multi-token prediction) landing in llama.cpp gave us 2× the generation speed. The next question is what you actually run on top of it, and the answer is different from what the SEO results suggest.

Why cloud agents fail locally

The 792-upvote thread that reframed this whole piece opens with a sentence that says it all:

"I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart."

— r/LocalLLaMA 1tgecrq

The thread identifies four failure modes, and the comments back them up:

Tool-call chains break after 3+ sequential calls. Small models lose coherence.
Context overflows because cloud agents dump whole files into the prompt.
Multi-step tasks collapse when one step's output doesn't fit the next step's input contract.
No recovery on error — cloud agents assume the model is smart enough to fix itself.

MTP makes Qwen3.6 27B fast. It doesn't make it Claude Opus. The local-coding problem isn't speed anymore; it's scaffold-paradigm mismatch. The fix is tools designed bottom-up for small models — not Cline retrofitted for them.

The three scaffold-first tools

1. SmallCode — the viral newcomer

Anchor: r/LocalLLaMA 1tgecrq — 792 upvotes, 350 comments, May 18
Claim: 87/100 benchmark pass with Gemma 4 4B (active params), vs OpenCode's ~75% with 14B models
The tricks:
- Compound tools — one tool does find+read+edit+verify in one call, so the model doesn't have to chain 4 sequential calls
- Improvement loop — every code generation gets compiled/linted instantly and errors fed back
- Decompose on failure — second retry breaks the task into smaller pieces instead of repeating
- Auto-escalation — drops to Claude/OpenAI for the one task that needs it, stays local 95% of the time
- Code graph — symbol-level index, walks the graph instead of grep-snippet-dumping
Install: npm install -g smallcode → point at LM Studio/Ollama
Gaps: no LSP, no multi-session, no desktop app
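
To make the improvement loop and decompose-on-failure tricks concrete, here is a minimal sketch. It is not SmallCode's actual code: generate and decompose are hypothetical callables standing in for the model and a task splitter, and Python's built-in compile() stands in for whatever compiler or linter the tool really runs.

```python
from typing import Callable, List, Optional


def check_python(source: str) -> Optional[str]:
    """Return None if the source compiles, else a short error message."""
    try:
        compile(source, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def attempt(task: str, generate: Callable[[str], str], retries: int) -> Optional[str]:
    """Try a task, feeding each check error back into the next prompt."""
    prompt = task
    for _ in range(retries + 1):
        code = generate(prompt)  # hypothetical model call: prompt -> code
        error = check_python(code)
        if error is None:
            return code
        prompt = f"{task}\n\nThe previous attempt failed: {error}\nReturn corrected code."
    return None


def run_task(
    task: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],  # hypothetical splitter: task -> subtasks
) -> List[str]:
    """First try with one feedback retry; if that fails, split the task."""
    code = attempt(task, generate, retries=1)
    if code is not None:
        return [code]
    results = []
    for sub in decompose(task):
        sub_code = attempt(sub, generate, retries=1)
        if sub_code is None:
            raise RuntimeError(f"subtask still failing: {sub}")
        results.append(sub_code)
    return results
```

The point of the shape: the model never sees a bare "try again". It either gets the checker's error message or a smaller task.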

2. PI Coding Agent + plan-first skill

Anchor: r/LocalLLaMA 1stjwg5 — 256 upvotes, 106 comments
Claim: Qwen3.6 35B-A3B Q4_K_XL on real production work, "held up"
The unlock: a single plan-first skill file that forces the model into a 5-phase loop:
1. Analyze the project silently
2. Ask up to 5 clarifying questions in one round
3. Write TODO.md with concrete, dependency-ordered tasks
4. Revision loop until the user approves
5. Execute one task at a time, mark [x] as done
Why this matters: small models can code; they can't hold a 200-line plan in their head. Skill files externalize the planning so the model only has to handle one task at a time.
The skill file is in the thread. Copy-paste it; it works.
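
Phase 5 is the part that is easy to mechanize. A rough sketch of "execute one task at a time, mark [x] as done", assuming TODO.md uses the common "- [ ] task" checkbox format (the thread's skill file may differ). The execute callable is a hypothetical stand-in for handing a single task to the agent.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed task line format: "- [ ] description" or "- [x] description".
TASK_RE = re.compile(r"^(\s*[-*] \[)([ xX])(\] )(.*)$")


def next_open_task(todo_text: str) -> Optional[Tuple[int, str]]:
    """Return (line_index, description) of the first unchecked task, or None."""
    for i, line in enumerate(todo_text.splitlines()):
        m = TASK_RE.match(line)
        if m and m.group(2) == " ":
            return i, m.group(4)
    return None


def mark_done(todo_text: str, line_index: int) -> str:
    """Return the TODO text with the task on line_index checked off."""
    lines = todo_text.splitlines()
    m = TASK_RE.match(lines[line_index])
    if m is None:
        raise ValueError(f"line {line_index} is not a task line")
    lines[line_index] = f"{m.group(1)}x{m.group(3)}{m.group(4)}"
    return "\n".join(lines) + "\n"


def run_plan(todo_text: str, execute: Callable[[str], None]) -> str:
    """Hand the agent exactly one open task at a time until none remain."""
    while (task := next_open_task(todo_text)) is not None:
        index, description = task
        execute(description)  # hypothetical: run the agent on this one task
        todo_text = mark_done(todo_text, index)
    return todo_text
```

The model only ever sees the current task description; the plan itself lives in the file.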

3. little-coder + task-shape routing

Anchor: r/LocalLLaMA 1st4cqq — 21 upvotes but the deepest technical post in the set
Setup: RTX 5090 (Frodo) + RTX Pro 6000 96GB (Gandalf), Qwen3.6 35B-A3B on Frodo, Qwen3-Coder-Next 80B on Gandalf via vLLM
Claim: 9/10 pass on a real 10-task Go eval, $0 cost, 1489s wall-clock
The shift: the author started with an "Aider-style harness" and got 3/10. Switching to little-coder got 8/10 with a single model and 9/10 with routing.
Routing policy:

General Go module work        → Qwen3.6 + little-coder
SQL/store/migration work      → Qwen3.6 + little-coder
Narrow compile/import failure → local Gandalf (Qwen3-Coder-Next) repair
Timer/ticker/concurrency bug  → frontier escalation or specialized playbook

Deterministic fixups outside the model: goimports, gofmt, go mod tidy, go test -timeout. Don't make the model do them.
The thesis line: "The right abstraction is not 'pick the best model.' The right abstraction is 'route by task shape and failure mode.'"
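
A sketch of what "route by task shape and failure mode" could look like in code. The route table mirrors the policy above; the classify function and its keyword lists are illustrative assumptions, not the author's implementation, which the post does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    harness: str


# Mirrors the routing policy in the post.
ROUTES = {
    "go_module": Route("Qwen3.6", "little-coder"),
    "sql_store_migration": Route("Qwen3.6", "little-coder"),
    "compile_import_failure": Route("Qwen3-Coder-Next", "repair"),
    "concurrency_bug": Route("frontier", "escalation or playbook"),
}

# Illustrative keyword heuristics (assumed, not from the post).
CONCURRENCY_HINTS = ("ticker", "timer", "goroutine", "data race", "deadlock")
COMPILE_HINTS = ("undefined:", "imported and not used")
SQL_HINTS = ("sql", "migration", "store")


def classify(task: str, last_error: str = "") -> str:
    """Pick a task shape from the task text and the most recent failure."""
    text = f"{task} {last_error}".lower()
    if any(hint in text for hint in CONCURRENCY_HINTS):
        return "concurrency_bug"
    if any(hint in last_error.lower() for hint in COMPILE_HINTS):
        return "compile_import_failure"
    if any(hint in text for hint in SQL_HINTS):
        return "sql_store_migration"
    return "go_module"


def route(task: str, last_error: str = "") -> Route:
    """Return the model and harness for a task."""
    return ROUTES[classify(task, last_error)]
```

Note that the failure output is an input to routing, not just the task description. A task that started as general Go work gets rerouted to the repair model once it produces a narrow compile error.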

Decision matrix by hardware

Your hardware	Recommended stack
8–12GB VRAM	SmallCode + Gemma 4 4B (Q4)
24GB VRAM (3090/4090)	PI Coding Agent + Qwen3.6 27B + MTP + plan-first skill
32GB+ single (5090)	PI Coding Agent + Qwen3.6 35B-A3B + MTP, or SmallCode for speed
Dual GPU (24+24, 32+96)	little-coder + routing (Qwen3.6 35B-A3B + Qwen3-Coder-Next 80B)
Mac M3/M4 with 36GB+	SmallCode or PI Coding Agent (GGUF + MTP path in LM Studio)
6GB or less	Don't run an agent locally. Run inline autocomplete with a 1–3B model.

Cross-check your card against the runlocal.dev calculator before committing.
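
If you want the single-GPU rows of the matrix as a lookup, a minimal sketch follows. The dual-GPU and Mac rows are left out, and the handling of sizes the table does not list (for example 16GB, or anything between 6GB and 8GB) is an assumption: they fall through to the next tier down.

```python
# (minimum VRAM in GB, recommended stack), highest tier first.
# Only the single-GPU rows of the matrix are encoded.
TIERS = [
    (32, "PI Coding Agent + Qwen3.6 35B-A3B + MTP, or SmallCode for speed"),
    (24, "PI Coding Agent + Qwen3.6 27B + MTP + plan-first skill"),
    (8, "SmallCode + Gemma 4 4B (Q4)"),
]

FALLBACK = "No local agent: inline autocomplete with a 1-3B model"


def recommend_stack(vram_gb: float) -> str:
    """Return the matrix row for a single card with vram_gb of VRAM.

    Assumption: sizes between listed tiers get the next tier down.
    """
    for minimum, stack in TIERS:
        if vram_gb >= minimum:
            return stack
    return FALLBACK
```

For example, recommend_stack(24) returns the PI Coding Agent + Qwen3.6 27B row, and recommend_stack(6) returns the autocomplete fallback.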

The old guard's last stand

Aider: Still the most-mentioned name on the broader internet, but on r/LocalLLaMA in May 2026 it surfaces as the baseline that lost. From 1st4cqq: "my old Aider-style harness got 3/10 on the same tasks." It's not gone; it's now the thing scaffold-first tools benchmark against.
Cline: High search volume, low local-LLM mindshare. The community using Cline runs Claude Sonnet / GPT-5.4 behind it. Don't fight that current.
Continue.dev: The inline-autocomplete extension is still fine. The agent mode is not where local-LLM users are spending time. Trends shows it declining; r/LocalLLaMA reflects that.

Japan corner: Kiro + Hermes

The Japanese-language local-coding scene is converging on a parallel-but-different stack: Kiro CLI + Hermes Agent + Ollama, with Hermes handling the "which model fires for which task" routing problem. Same underlying insight (scaffold-first beats cloud-agent retrofit), different building blocks. We'll cover the Hermes routing pattern as its own piece — it's worth a full issue, not a sidebar.

What to do this weekend

24GB+ card: install PI Coding Agent + Qwen3.6 27B (MTP GGUF), paste the plan-first skill file from the thread, run it on a real ticket from your backlog.
8GB card: run npm install -g smallcode, load Gemma 4 4B in Ollama, point SmallCode at it, and give it a small refactor.
Dual-GPU: clone little-coder, run the author's 10-task eval pattern on a copy of your own repo.
Verify VRAM headroom: runlocal.dev calculator → pick your card → confirm the quant fits with MTP overhead.
Post your daily driver: drop it into 1ti2ga0 (the 48GB daily-driver thread, 153 comments and counting).

Why this matters past this week

The story of 2026's first half wasn't "which cloud agent wins." It was the cloud-agent paradigm losing on small models. MTP made local fast. Scaffold-first tools made local usable. The two together are now what r/LocalLLaMA's daily-driver crowd actually runs.

The next 60 days to watch:

Does SmallCode add LSP and multi-session? That gates whether 8GB users can drop OpenCode entirely.
Does someone publish a canonical skill library (plan-first + test-first + refactor + debug) so PI Coding Agent users don't reinvent each one?
Does little-coder's routing-policy idea get extracted into a standalone library so other agents can adopt it?

If you only do one thing from this issue: copy the plan-first skill file and put it in your local agent today. It's the highest-leverage change you can make in 10 minutes.

Next issue: the canonical skill library — what every local-LLM coding agent should ship with, and what to write yourself.
