Omni-modal locals land, GPT-5.5 resets the frontier
Nemotron-3 Nano Omni puts four modalities on a 24GB card. OpenAI ships GPT-5.5 to everyone. Xiaomi enters the open-weight flagship tier.
Two weeks since Issue #4. The cloud frontier moved (GPT-5.5, Grok 4.3) and, quietly more important, the local frontier picked up native four-modality support at a 30B-class footprint.
What shipped (April 25 – May 8)
| Model | Provider | Params | Local? | Highlight |
|---|---|---|---|---|
| Nemotron-3 Nano Omni 30B-A3B | NVIDIA | 30B/3.5B Mamba-MoE | ✓ | Native text/image/video/audio, 256K–1M ctx |
| Poolside Laguna XS.2 | Poolside AI | 33B/3B MoE | ✓ | Apache-2.0, agentic coding |
| Zyphra ZAYA1-8B | Zyphra AI | 8B MoE | ✓ | Trained on AMD silicon, 12GB-friendly |
| Mimo v2.5 | Xiaomi | 310B/15B MoE | Cloud | Xiaomi's first open flagship, llama.cpp PR day-one |
| GPT-5.5 / Pro | OpenAI | — | Cloud | New frontier anchor, Gödel-grade reasoning |
| Grok 4.3 | SpaceXAI | — | Cloud | Shipped alongside the xAI → SpaceXAI rebrand |
Also relevant but not in our model directory: NVIDIA's Gemma-4-26B-A4B-NVFP4 quant (single RTX 5090, 50k ctx at 80% VRAM); Google's Gemma 4 MTP draft models for speculative decoding (Ollama v0.23.1 day-one); OpenAI's Privacy Filter 150M Apache-2.0 PII scrubber.
The local pick: Nemotron-3 Nano Omni 30B-A3B
NVIDIA shipped a 30B-total / 3.5B-active hybrid (Mamba-2 + MoE + Attention) that handles text, image, video, and audio natively, with 256K context (up to 1M). It tops MMLongBench-Doc, OCRBenchV2, and VoiceBench in its weight class. Mamba layers give ~4× compute efficiency vs equivalent-size attention.
If you have a 24GB card, this is the new "interesting model to actually put on it." Three properties that matter:
- Active-parameter footprint — 3.5B active means token generation costs land in 9B-dense territory, not 30B-dense.
- Modality coverage — most omni models so far were closed (Gemini, GPT-4o) or partial (vision-only). This is genuinely four-modality, open-weight, commercial-friendly.
- Distribution — BF16 on HuggingFace, Unsloth GGUF up day-one, free OpenRouter tier for eval before pulling weights.
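To make the footprint point concrete, here is a back-of-envelope sketch. It assumes roughly 4.8 bits per weight for a 4-bit K-quant and the common rule of thumb of about 2 FLOPs per active parameter per generated token. Both are approximations of ours, not NVIDIA figures, and the function names are hypothetical. The FLOPs number is a lower bound: it ignores attention, Mamba state updates, and routing overhead, so measured decode cost will sit above it.

```python
# Back-of-envelope sizing for a sparse MoE on a single GPU.
# All constants are illustrative assumptions, not measurements.

ASSUMED_Q4_BITS_PER_WEIGHT = 4.8  # assumption: rough average for a 4-bit K-quant
BF16_BITS_PER_WEIGHT = 16


def weight_gib(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB (no KV cache, no activations)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def decode_gflops_per_token(active_params_billion: float) -> float:
    """Rule-of-thumb lower bound: ~2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params_billion


TOTAL_B, ACTIVE_B = 30.0, 3.5  # total / active parameters quoted in this issue
CARD_GIB = 24.0                # treating a "24GB card" as 24 GiB

q4 = weight_gib(TOTAL_B, ASSUMED_Q4_BITS_PER_WEIGHT)
bf16 = weight_gib(TOTAL_B, BF16_BITS_PER_WEIGHT)

print(f"Q4 weights:   {q4:5.1f} GiB (fits in {CARD_GIB:.0f} GiB: {q4 < CARD_GIB})")
print(f"BF16 weights: {bf16:5.1f} GiB (fits in {CARD_GIB:.0f} GiB: {bf16 < CARD_GIB})")
print(f"Decode floor, MoE (active only): {decode_gflops_per_token(ACTIVE_B):.0f} GFLOPs/token")
print(f"Decode floor, 30B dense:         {decode_gflops_per_token(TOTAL_B):.0f} GFLOPs/token")
```

The split is the point: memory is set by total parameters, per-token compute by active parameters. Leave headroom beyond the weight figure for KV cache and activations, especially at long context.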
Companion picks for the same window:
- Poolside Laguna XS.2 — Apache-2.0 33B/A3B, agentic coding focused. Roughly matches Qwen 3.5 35B-A3B on benchmarks while staying fully open-weight.
- Zyphra ZAYA1-8B — compact reasoning-tuned MoE trained on AMD GPUs. Optimized for intelligence-per-parameter; comfortable at Q4 on a 12GB card.
Cloud side: GPT-5.5 era starts
OpenAI rolled out GPT-5.5 (May 6), pushed GPT-5.5 Instant to free-tier ChatGPT (May 7, reportedly 52.5% fewer hallucinations on high-stakes prompts than GPT-5.3 Instant), and shipped GPT-5.5 Pro as the new deep-reasoning anchor. xAI rebranded to SpaceXAI and released Grok 4.3 the same week. In the background, Anthropic and SpaceXAI signed a Colossus1 compute deal.
This is worth noting in a local-AI newsletter because the target that local models are measured against has moved. The capability gap between local-runnable and frontier models widened on raw intelligence, but the throughput-per-dollar gap continues to narrow, which we'll cover in Issue #6.
Xiaomi's quiet entry
Mimo v2.5 is the open-source story buried under the GPT-5.5 headlines. It is a 310B sparse MoE with 15B active parameters, fully multimodal, with llama.cpp PR #22493 merged on day one. For now it is local-runnable only on a cluster (200GB+ aggregate memory), but the precedent matters: a major consumer-electronics player just published a full-stack open multimodal flagship without a paywall.
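The 200GB+ figure is roughly what simple weight arithmetic suggests. This sketch assumes a 4-bit quant at about 4.8 bits per weight and 90% of each device's memory usable for weights. Both numbers are our assumptions, and the helper names are hypothetical. KV cache and runtime overhead come on top of the weight total, which is how you end up past 200GB.

```python
import math

ASSUMED_BITS_PER_WEIGHT = 4.8   # assumption: rough average for a 4-bit quant
ASSUMED_USABLE_FRACTION = 0.9   # assumption: share of device memory left for weights


def weights_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Weights only, in decimal GB: billions of parameters times bytes per parameter."""
    return total_params_billion * bits_per_weight / 8


def min_devices(total_gb: float, device_gb: float,
                usable_fraction: float = ASSUMED_USABLE_FRACTION) -> int:
    """Smallest device count whose usable memory covers the weights."""
    return math.ceil(total_gb / (device_gb * usable_fraction))


mimo_q4_gb = weights_gb(310, ASSUMED_BITS_PER_WEIGHT)
print(f"Weights at ~4.8 bpw: {mimo_q4_gb:.0f} GB")
for device_gb in (24, 80):
    print(f"  {device_gb} GB devices needed: {min_devices(mimo_q4_gb, device_gb)}")
```

This counts memory only. Interconnect bandwidth between devices will shape real throughput, and this arithmetic says nothing about it.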
Also from the China side: Doubao-Seed-2.0-lite (ByteDance, omni-modal), SenseNova 6.7 Flash-Lite (SenseTime, multimodal agent that parses dense web/document layouts directly).
What to actually do this week
- 24GB card owner: pull Nemotron-3 Nano Omni 30B-A3B Q4_K_M. Hand it a document with embedded images and audio. See if the omni claim holds for your workload.
- 12GB card owner: try ZAYA1-8B Q4_K_M as your daily-driver candidate and compare it against Qwen 3.6 9B distills.
- Agentic-coding stack: swap Poolside Laguna XS.2 into the benchmark slot currently held by Qwen 3.5 35B-A3B and re-run.
- Cloud-comparison rig: if you maintain "best local vs best cloud" eval scripts, your cloud baseline should now be GPT-5.5 Pro, not Opus 4.7.
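If you don't have an eval script yet, a minimal side-by-side harness could look like the sketch below. Everything here is hypothetical: the stub functions stand in for whatever client calls your local runtime and cloud provider actually expose, and exact-match scoring is only a placeholder for a task-appropriate metric.

```python
from typing import Callable, Dict, List, Tuple

Generate = Callable[[str], str]  # prompt in, completion out


def exact_match(prediction: str, reference: str) -> float:
    """Placeholder metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def run_eval(models: Dict[str, Generate],
             cases: List[Tuple[str, str]],
             score: Callable[[str, str], float] = exact_match) -> Dict[str, float]:
    """Run every model over the same (prompt, reference) cases and return mean scores."""
    if not cases:
        raise ValueError("need at least one eval case")
    results = {}
    for name, generate in models.items():
        total = sum(score(generate(prompt), ref) for prompt, ref in cases)
        results[name] = total / len(cases)
    return results


# Hypothetical stubs; replace with real calls to your local and cloud endpoints.
def stub_local(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def stub_cloud(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


cases = [("What is 2 + 2?", "4"), ("Capital of France?", "paris")]
scores = run_eval({"local-candidate": stub_local, "cloud-baseline": stub_cloud}, cases)
print(scores)
```

Keeping models behind a plain prompt-to-text callable means swapping a baseline changes one dictionary entry. The cases and the scorer stay fixed, so old and new runs remain comparable.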
Next issue: the inference-engine wave (MTP, DFlash, PAGED MoE) that's quietly rewriting the cost curve.
— runlocal