Stop trying to use Cline locally. r/LocalLLaMA's real answer for daily-driving Qwen3.6 27B + MTP.
Cloud agents fall apart on local models. Three scaffold-first tools the community is actually shipping with — SmallCode, PI Coding Agent, and little-coder — plus a decision matrix by VRAM.
The myth I had wrong
I almost wrote a different post this week. Google Trends made it look obvious: Cline avg 62, Aider 27, Continue.dev 3 over the past 12 months. Clear ranking, clear comparison piece.
Then I went to look at what r/LocalLLaMA was actually using on Qwen3.6 27B over the past 30 days:
- Cline: 1 post, score 2
- Aider: 1 post, score 21
- Continue.dev: 0 posts
- SmallCode: 1 post, score 792
- PI Coding Agent: 1 post, score 256
- little-coder: 1 post, score 21 (deep technical anchor)
Trends lied. The cloud-coding-agent audience and the actual-running-local-models audience barely overlap. Last week's MTP (multi-token prediction) landing in llama.cpp gave us 2× the generation speed. The next question is what you actually run on top of it, and the answer is not the one the SEO results suggest.
Why cloud agents fail locally
The 792-upvote thread that reframed this whole piece opens with a sentence that says it all:
"I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart."
The thread identifies four failure modes, which the commenters confirmed:
- Tool-call chains break after 3+ sequential calls. Small models lose coherence.
- Context overflows because cloud agents dump whole files into the prompt.
- Multi-step tasks collapse when one step's output doesn't fit the next step's input contract.
- No recovery on error — cloud agents assume the model is smart enough to fix itself.
MTP makes Qwen3.6 27B fast. It doesn't make it Claude Opus. The local-coding problem isn't speed anymore; it's scaffold-paradigm mismatch. The fix is tools designed bottom-up for small models — not Cline retrofitted for them.
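To make the context-overflow failure mode concrete, here is a minimal sketch of the alternative to dumping a whole file into the prompt: index the file by symbol and hand the model only the definition it asked for. It uses Python's standard `ast` module on Python source purely for illustration; `index_symbols` is a hypothetical helper name, not something from the thread or from any of the tools below.

```python
import ast
from typing import Dict


def index_symbols(source: str) -> Dict[str, str]:
    """Map each top-level function or class name to its own source text."""
    tree = ast.parse(source)
    index: Dict[str, str] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns only the text spanned by this node.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                index[node.name] = segment
    return index


if __name__ == "__main__":
    src = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return os.getcwd()\n"
    symbols = index_symbols(src)
    # The prompt gets one definition instead of the whole file.
    print(symbols["b"])
```

The point is only the shape of the lookup: the prompt grows with the size of one symbol, not with the size of the file.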
The three scaffold-first tools
1. SmallCode — the viral newcomer
- Anchor: r/LocalLLaMA `1tgecrq` (792 upvotes, 350 comments, May 18)
- Claim: 87/100 benchmark pass with Gemma 4 4B (active params), vs OpenCode's ~75% with 14B models
- The tricks:
  - Compound tools: one tool does find+read+edit+verify in one call, so the model doesn't have to chain 4 sequential calls
  - Improvement loop: every code generation gets compiled/linted instantly and errors fed back
  - Decompose on failure: second retry breaks the task into smaller pieces instead of repeating
  - Auto-escalation: drops to Claude/OpenAI for the one task that needs it, stays local 95% of the time
  - Code graph: symbol-level index, walks the graph instead of grep-snippet-dumping
- Install: `npm install -g smallcode` → point at LM Studio/Ollama
- Gaps: no LSP, no multi-session, no desktop app
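The improvement loop and decompose-on-failure tricks are easier to see in code. What follows is a minimal sketch of the control flow as the thread describes it, not SmallCode's actual implementation: `generate` and `decompose` are hypothetical callbacks standing in for model calls, and Python's built-in `compile()` stands in for the compile/lint step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical signatures: generate(task, feedback) returns code,
# decompose(task) returns smaller subtasks. Neither is SmallCode's API.
Generate = Callable[[str, Optional[str]], str]
Decompose = Callable[[str], List[str]]


@dataclass
class Attempt:
    task: str
    code: str
    error: Optional[str]  # None means the check passed


def check(code: str) -> Optional[str]:
    """Stand-in for a compile/lint step: Python's built-in compile()."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg} (line {exc.lineno})"


def run_task(task: str, generate: Generate, decompose: Decompose,
             depth: int = 0) -> List[Attempt]:
    history: List[Attempt] = []
    feedback: Optional[str] = None
    # First try, then one retry with the checker's error fed back.
    for _ in range(2):
        code = generate(task, feedback)
        feedback = check(code)
        history.append(Attempt(task, code, feedback))
        if feedback is None:
            return history
    # Second retry: split the task instead of repeating it.
    # Subtasks are not split again, so the recursion is bounded.
    if depth == 0:
        for sub in decompose(task):
            history.extend(run_task(sub, generate, decompose, depth + 1))
    return history
```

If a subtask still fails after its two attempts, the history ends with a non-`None` error. In this sketch, that is where an auto-escalation step would hook in.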
2. PI Coding Agent + plan-first skill
- Anchor: r/LocalLLaMA `1stjwg5` (256 upvotes, 106 comments)
- Claim: Qwen3.6 35B-A3B Q4_K_XL on real production work, "held up"
- The unlock: a single `plan-first` skill file that forces the model into a 5-phase loop:
  - Analyze the project silently
  - Ask up to 5 clarifying questions in one round
  - Write `TODO.md` with concrete, dependency-ordered tasks
  - Revision loop until the user approves
  - Execute one task at a time, mark `[x]` as done
- Why this matters: small models can code; they can't hold a 200-line plan in their head. Skill files externalize the planning so the model only has to handle one task at a time.
- The skill file is in the thread. Copy-paste it; it works.
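To show what "externalizing the plan" buys you, here is a minimal sketch of the execute phase: find the first unchecked task in `TODO.md`, hand only that task to the model, then mark it done. The `- [ ]` / `- [x]` checkbox layout is an assumption on my part (the thread only says tasks are marked `[x]`), and `next_open_task` and `mark_done` are hypothetical helper names, not part of PI Coding Agent.

```python
import re
from typing import Optional, Tuple

# Assumed layout: one markdown checkbox per task, e.g. "- [ ] write migration".
TASK_RE = re.compile(r"^(\s*[-*] \[)( |x)(\] )(.*)$")


def next_open_task(todo_text: str) -> Optional[Tuple[int, str]]:
    """Return (line index, task text) of the first unchecked task, or None."""
    for i, line in enumerate(todo_text.splitlines()):
        m = TASK_RE.match(line)
        if m and m.group(2) == " ":
            return i, m.group(4).strip()
    return None


def mark_done(todo_text: str, line_no: int) -> str:
    """Return the TODO text with the task on line_no checked off."""
    lines = todo_text.splitlines()
    m = TASK_RE.match(lines[line_no])
    if m is None:
        raise ValueError(f"line {line_no} is not a task line")
    lines[line_no] = f"{m.group(1)}x{m.group(3)}{m.group(4)}"
    return "\n".join(lines)


if __name__ == "__main__":
    todo = "# TODO\n- [x] add schema\n- [ ] write migration\n- [ ] add tests"
    while (found := next_open_task(todo)) is not None:
        line_no, task = found
        print("prompt the model with only:", task)
        todo = mark_done(todo, line_no)
```

The model never sees the whole plan at once; the file holds the state, and each prompt carries a single task.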
3. little-coder + task-shape routing
- Anchor: r/LocalLLaMA `1st4cqq` (21 upvotes, but the deepest technical post in the set)
- Setup: RTX 5090 (Frodo) + RTX Pro 6000 96GB (Gandalf), Qwen3.6 35B-A3B on Frodo, Qwen3-Coder-Next 80B on Gandalf via vLLM
- Claim: 9/10 pass on a real 10-task Go eval, $0 cost, 1489s wall-clock
- The shift: the author started with an "Aider-style harness" and got 3/10. Switched to little-coder + routing and got 8/10 single-model, 9/10 routed.
- Routing policy:
  - General Go module work → Qwen3.6 + little-coder
  - SQL/store/migration work → Qwen3.6 + little-coder
  - Narrow compile/import failure → local Gandalf (Qwen3-Coder-Next) repair
  - Timer/ticker/concurrency bug → frontier escalation or specialized playbook
- Deterministic fixups outside the model: `goimports`, `gofmt`, `go mod tidy`, `go test -timeout`. Don't make the model do them.
- The thesis line: "The right abstraction is not 'pick the best model.' The right abstraction is 'route by task shape and failure mode.'"
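The routing policy and the fixup list translate almost directly into a lookup table plus a command list. This is a sketch under stated assumptions, not the author's code: the route table comes from the post, but the keyword heuristics in `classify`, the `60s` timeout value, and the `./...` package pattern are all hypothetical choices of mine.

```python
import subprocess
from enum import Enum
from typing import List


class TaskShape(Enum):
    GO_MODULE = "general Go module work"
    SQL_STORE = "SQL/store/migration work"
    COMPILE_IMPORT = "narrow compile/import failure"
    CONCURRENCY = "timer/ticker/concurrency bug"


# Route table copied from the policy above.
ROUTES = {
    TaskShape.GO_MODULE: "Qwen3.6 + little-coder",
    TaskShape.SQL_STORE: "Qwen3.6 + little-coder",
    TaskShape.COMPILE_IMPORT: "local Gandalf (Qwen3-Coder-Next) repair",
    TaskShape.CONCURRENCY: "frontier escalation or specialized playbook",
}

# Hypothetical keyword lists; a real classifier would need tuning.
CONCURRENCY_WORDS = ("timer", "ticker", "goroutine", "deadlock", "data race")
COMPILE_ERRORS = ("undefined:", "imported and not used", "cannot find package")
SQL_WORDS = ("sql", "migration", "store")


def classify(description: str, last_error: str = "") -> TaskShape:
    """Guess the task shape from the ticket text and the last tool error."""
    text = description.lower()
    err = last_error.lower()
    if any(w in text or w in err for w in CONCURRENCY_WORDS):
        return TaskShape.CONCURRENCY
    if any(e in err for e in COMPILE_ERRORS):
        return TaskShape.COMPILE_IMPORT
    if any(w in text for w in SQL_WORDS):
        return TaskShape.SQL_STORE
    return TaskShape.GO_MODULE


def route(description: str, last_error: str = "") -> str:
    return ROUTES[classify(description, last_error)]


# Deterministic fixups run outside the model. The 60s timeout and the
# ./... package pattern are assumptions.
FIXUPS: List[List[str]] = [
    ["goimports", "-w", "."],
    ["gofmt", "-w", "."],
    ["go", "mod", "tidy"],
    ["go", "test", "-timeout", "60s", "./..."],
]


def run_fixups(repo_dir: str) -> bool:
    """Run each fixup in order; stop at the first non-zero exit."""
    for cmd in FIXUPS:
        if subprocess.run(cmd, cwd=repo_dir).returncode != 0:
            return False
    return True
```

Checking concurrency first is deliberate in this sketch: it is the one shape the policy sends away from the local models, so a false negative there costs the most.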
Decision matrix by hardware
| Your hardware | Recommended stack |
|---|---|
| 8–12GB VRAM | SmallCode + Gemma 4 4B (Q4) |
| 24GB VRAM (3090/4090) | PI Coding Agent + Qwen3.6 27B + MTP + plan-first skill |
| 32GB+ single (5090) | PI Coding Agent + Qwen3.6 35B-A3B + MTP, or SmallCode for speed |
| Dual GPU (24+24, 32+96) | little-coder + routing (Qwen3.6 35B-A3B + Qwen3-Coder-Next 80B) |
| Mac M3/M4 with 36GB+ | SmallCode or PI Coding Agent (GGUF + MTP path in LM Studio) |
| 6GB or less | Don't run an agent locally. Run inline autocomplete with a 1–3B model. |
Cross-check your card against the runlocal.dev calculator before committing.
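For a rough sanity check before you open the calculator, the weight footprint is just parameters times bits per weight, divided by 8. The sketch below is back-of-envelope arithmetic, not a replacement for the calculator: the 4.5 bits per weight and the 4 GB overhead for KV cache, MTP, and runtime buffers are illustrative assumptions, and real quants and context lengths will move both numbers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (decimal): params * bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(vram_gb: float, params_billion: float,
         bits_per_weight: float, overhead_gb: float) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Assumed values: 4.5 bits/weight for a Q4-class quant, 4 GB overhead.
    print(f"27B weights: {weights_gb(27, 4.5):.1f} GB")
    print("fits in 24 GB:", fits(24, 27, 4.5, 4.0))
    print("fits in 12 GB:", fits(12, 27, 4.5, 4.0))
```

Under those assumptions a 27B model comes to about 15.2 GB of weights, which is why it sits in the 24GB row rather than the 8–12GB row.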
The old guard's last stand
- Aider: Still the most-mentioned name in the broader internet, but on r/LocalLLaMA in May 2026 it surfaces as the baseline that lost. From `1st4cqq`: "my old Aider-style harness got 3/10 on the same tasks." It's not gone; it's now the thing scaffold-first tools benchmark against.
- Cline: High search volume, low local-LLM mindshare. The community using Cline runs Claude Sonnet / GPT-5.4 behind it. Don't fight that current.
- Continue.dev: The inline-autocomplete extension is still fine. The agent mode is not where local-LLM users are spending time. Trends shows it declining; r/LocalLLaMA reflects that.
Japan corner: Kiro + Hermes
The Japanese-language local-coding scene is converging on a parallel-but-different stack: Kiro CLI + Hermes Agent + Ollama, with Hermes handling the "which model fires for which task" routing problem. Same underlying insight (scaffold-first beats cloud-agent retrofit), different building blocks. We'll cover the Hermes routing pattern as its own piece — it's worth a full issue, not a sidebar.
What to do this weekend
- 24GB+ card: install PI Coding Agent + Qwen3.6 27B (MTP GGUF), paste the `plan-first` skill file from the thread, run it on a real ticket from your backlog.
- 8GB card: `npm install -g smallcode` + Gemma 4 4B in Ollama, point SmallCode at it, give it a small refactor.
- Dual-GPU: clone little-coder, run the author's 10-task eval pattern on a copy of your own repo.
- Verify VRAM headroom: runlocal.dev calculator → pick your card → confirm the quant fits with MTP overhead.
- Post your daily driver: drop it into `1ti2ga0` (the 48GB daily-driver thread, 153 comments and counting).
Why this matters past this week
The story of 2026's first half wasn't "which cloud agent wins." It was the cloud-agent paradigm losing on small models. MTP made local fast. Scaffold-first tools made local usable. The two together are now what r/LocalLLaMA's daily-driver crowd actually runs.
The next 60 days to watch:
- Does SmallCode add LSP and multi-session? That gates whether 8GB users can drop OpenCode entirely.
- Does someone publish a canonical skill library (plan-first + test-first + refactor + debug) so PI Coding Agent users don't reinvent each one?
- Does little-coder's routing-policy idea get extracted into a standalone library so other agents can adopt it?
If you only do one thing from this issue: copy the plan-first skill file and put it in your local agent today. It's the highest-leverage change you can make in 10 minutes.
Next issue: the canonical skill library — what every local-LLM coding agent should ship with, and what to write yourself.