runlocal.cc
Issue #5 · May 8, 2026

Omni-modal locals land, GPT-5.5 resets the frontier

Nemotron-3 Nano Omni puts four modalities on a 24GB card. OpenAI ships GPT-5.5 to everyone. Xiaomi enters the open-weight flagship tier.

Two weeks since Issue #4. The cloud frontier moved (GPT-5.5, Grok 4.3) and, quietly more important, the local frontier picked up native four-modality support at a 30B-class footprint.

What shipped (April 25 – May 8)

| Model | Provider | Params (total/active) | Local? | Highlight |
|---|---|---|---|---|
| Nemotron-3 Nano Omni 30B-A3B | NVIDIA | 30B/3.5B Mamba-MoE | | Native text/image/video/audio, 256K–1M ctx |
| Poolside Laguna XS.2 | Poolside AI | 33B/3B MoE | | Apache-2.0, agentic coding |
| Zyphra ZAYA1-8B | Zyphra AI | 8B MoE | | Trained on AMD silicon, 12GB-friendly |
| Mimo v2.5 | Xiaomi | 310B/15B MoE | Cloud | Xiaomi's first open flagship, llama.cpp PR day-one |
| GPT-5.5 / Pro | OpenAI | | Cloud | New frontier anchor, Gödel-grade reasoning |
| Grok 4.3 | SpaceXAI | | Cloud | Shipped alongside the xAI → SpaceXAI rebrand |

Also relevant but not in our model directory: NVIDIA's Gemma-4-26B-A4B-NVFP4 quant (runs on a single RTX 5090, 50k ctx at 80% VRAM); Google's Gemma 4 MTP draft models for speculative decoding (supported in Ollama v0.23.1 day-one); and OpenAI's Privacy Filter, a 150M Apache-2.0 PII scrubber.

The local pick: Nemotron-3 Nano Omni 30B-A3B

NVIDIA shipped a 30B-total / 3.5B-active hybrid (Mamba-2 + MoE + Attention) that handles text, image, video, and audio natively, with 256K context (up to 1M). It tops MMLongBench-Doc, OCRBenchV2, and VoiceBench in its weight class. Mamba layers give ~4× compute efficiency vs equivalent-size attention.

If you have a 24GB card, this is the new "interesting model to actually put on it." Three properties that matter:

  1. Active-parameter footprint — 3.5B active means token generation costs land in 9B-dense territory, not 30B-dense.
  2. Modality coverage — most omni models so far were closed (Gemini, GPT-4o) or partial (vision-only). This is genuinely four-modality, open-weight, commercial-friendly.
  3. Distribution — BF16 on HuggingFace, Unsloth GGUF up day-one, free OpenRouter tier for eval before pulling weights.

Companion picks for the same window:

  • Poolside Laguna XS.2 — Apache-2.0 33B/A3B, agentic coding focused. Roughly matches Qwen 3.5 35B-A3B on benchmarks while staying fully open-weight.
  • Zyphra ZAYA1-8B — compact reasoning-tuned MoE trained on AMD GPUs. Optimized for intelligence-per-parameter; comfortable at Q4 on a 12GB card.
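
A quick way to sanity-check the card-size claims above is to estimate weight size from total parameters and bits per weight. The sketch below assumes roughly 4.85 bits per weight for Q4_K_M and a flat 2 GB reserve for cache and buffers; both are ballpark assumptions on our part, not published figures, and the helper names are purely illustrative. Note that memory scales with total parameters (all experts need to be resident unless you offload some to CPU), while the per-token speed benefit comes from the active count.

```python
# Rough VRAM fit check. All constants here are assumptions for illustration.
Q4_K_M_BPW = 4.85   # assumed average bits per weight for a Q4_K_M GGUF
BF16_BPW = 16.0     # unquantized BF16 weights


def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1e9 bytes) from params in billions."""
    return total_params_b * bits_per_weight / 8


def fits(total_params_b: float, vram_gb: float,
         bits_per_weight: float = Q4_K_M_BPW, reserve_gb: float = 2.0) -> bool:
    """True if weights plus a fixed reserve (cache, buffers) fit in VRAM."""
    return weight_gb(total_params_b, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    # (name, total params in billions, card VRAM in GB) taken from this issue
    for name, params_b, vram in [
        ("Nemotron-3 Nano Omni 30B-A3B", 30, 24),
        ("ZAYA1-8B", 8, 12),
    ]:
        size = weight_gb(params_b, Q4_K_M_BPW)
        print(f"{name}: ~{size:.1f} GB at Q4_K_M, fits {vram}GB card: "
              f"{fits(params_b, vram)}")
```

Under these assumptions the 30B model leaves a few GB of headroom on a 24GB card at Q4_K_M, while the BF16 weights (about 60 GB) do not fit. Long contexts and multimodal inputs will eat into that headroom, so treat the reserve as a floor.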

Cloud side: GPT-5.5 era starts

OpenAI rolled out GPT-5.5 (May 6), pushed GPT-5.5 Instant to free-tier ChatGPT (May 7, with a reported 52.5% reduction in hallucinations on high-stakes prompts vs GPT-5.3 Instant), and shipped GPT-5.5 Pro as the new deep-reasoning anchor. xAI rebranded to SpaceXAI and released Grok 4.3 the same week. Anthropic and SpaceXAI signed a Colossus1 compute deal in the background.

This is worth noting in a local-AI newsletter because the target moves. The capability gap between local-runnable and frontier models widened on raw intelligence, but the throughput-per-dollar gap continues to narrow, which we'll cover in Issue #6.

Xiaomi's quiet entry

Mimo v2.5 is the open-source story buried under the GPT-5.5 headlines. 310B Sparse MoE with 15B active, fully multimodal, llama.cpp PR #22493 merged day-one. Local-runnable only on a cluster (200GB+ aggregate) for now, but the precedent matters: a major consumer-electronics player just published a full-stack open multimodal flagship without a paywall.
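
The 200GB+ figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is a lower bound only: it assumes about 4.85 bits per weight for a 4-bit GGUF-style quant and a 10% overhead for cache and buffers (both our assumptions), and it ignores how layers actually split across devices. The function name is hypothetical.

```python
import math


def min_devices(total_params_b: float, bits_per_weight: float,
                device_gb: float, overhead_frac: float = 0.10) -> int:
    """Lower bound on device count needed to hold the weights plus overhead.

    total_params_b is in billions; sizes are in GB (1e9 bytes).
    """
    weights_gb = total_params_b * bits_per_weight / 8
    needed_gb = weights_gb * (1 + overhead_frac)
    return math.ceil(needed_gb / device_gb)


if __name__ == "__main__":
    # 310B total parameters, per this issue; 4.85 bpw is an assumed figure.
    for device_gb in (24, 80):
        n = min_devices(310, 4.85, device_gb)
        print(f"{device_gb}GB devices needed (lower bound): {n}")
```

With these inputs the weights alone come to about 188 GB, which works out to at least nine 24GB cards or three 80GB devices. Real deployments will likely need more once per-device fragmentation and context length are accounted for.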

Also from the China side: Doubao-Seed-2.0-lite (ByteDance, omni-modal), SenseNova 6.7 Flash-Lite (SenseTime, multimodal agent that parses dense web/document layouts directly).

What to actually do this week

  • 24GB card owner: pull Nemotron-3 Nano Omni 30B-A3B Q4_K_M. Hand it a document with embedded images and audio. See if the omni claim holds for your workload.
  • 12GB card owner: try ZAYA1-8B Q4_K_M as a daily-driver candidate and compare it against Qwen 3.6 9B distills.
  • Agentic-coding stack: drop Poolside Laguna XS.2 into your benchmark slot for Qwen 3.5 35B-A3B and re-run.
  • Cloud-comparison rig: if you maintain "best local vs best cloud" eval scripts, your cloud baseline should now be GPT-5.5 Pro, not Opus 4.7.
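
If you don't have a comparison rig yet, the core of one is small. Here is a minimal sketch, assuming each model is wrapped as a plain prompt-to-text callable so that swapping a local candidate or the cloud baseline is a one-line change. The stub models, the test cases, and the exact-match scoring are placeholders we made up for illustration; a real harness would call your inference server or cloud API inside those callables and use a scoring rule suited to your workload.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # prompt in, completion text out
Case = Tuple[str, str]         # (prompt, expected answer)


def exact_match_rate(model: Model, cases: List[Case]) -> float:
    """Fraction of cases where the output equals the expected answer,
    compared case-insensitively after trimming whitespace."""
    hits = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def compare(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Run every model over the same cases and return a score per slot."""
    return {name: exact_match_rate(fn, cases) for name, fn in models.items()}


# Hypothetical stand-ins for real model calls, so the sketch runs offline.
def local_stub(prompt: str) -> str:
    return {"2+2=": "4", "capital of France?": "Paris"}.get(prompt, "")


def cloud_stub(prompt: str) -> str:
    return {"2+2=": "4", "capital of France?": "Paris",
            "largest planet?": "Jupiter"}.get(prompt, "")


cases = [("2+2=", "4"), ("capital of France?", "paris"),
         ("largest planet?", "Jupiter")]

# Swapping a baseline means replacing one entry in this dict and re-running.
scores = compare({"local": local_stub, "cloud-baseline": cloud_stub}, cases)
print(scores)
```

Keeping the slot names stable ("local", "cloud-baseline") and changing only the callable behind them keeps results comparable from one issue's model to the next.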

Next issue: the inference-engine wave (MTP, DFlash, PAGED MoE) that's quietly rewriting the cost curve.
