Issue #12 · May 27, 2026

A FLUX-class 4B image model, squeezed to 1.21 GB — and yes, it runs in your browser

PrismML's Bonsai Image quantizes a FLUX.2-derived diffusion transformer down to sub-2-bit weights, with MLX, CUDA, WebGPU and iPhone builds. The footprint and speed are real; the quality cost is real too. Here's the honest tradeoff — official numbers, our own benchmarks pending.

Editor's note. Unless noted otherwise, every speed and size figure below is PrismML's official number, measured on an M4 Pro with 48 GB. We cross-checked each one against the live Hugging Face model card but have not re-measured any of them on our own Apple Silicon. Read them as the vendor's claims, confirmed against its published card, not as our benchmarks.

1-bit text-to-image stopped being a paper

Ternary and 1-bit neural networks are not new. BitNet b1.58 (arXiv 2402.17764) made the {−1, 0, +1} weight a serious idea in early 2024, and 1-bit diffusion showed up the same year with BinaryDM (2404.05662). So the headline is not "first 1-bit diffusion model." It isn't.

What's actually new is the productization: a modern, FLUX-class 4B text-to-image model shipped as ternary/binary weights, with runtime packs for MLX, CUDA (gemlite), WebGPU and iPhone — all under Apache-2.0. That's PrismML's Bonsai Image (prism-ml on Hugging Face), and there's a zero-install browser demo you can hit right now.

What it actually is

Bonsai Image is FLUX.2 Klein 4B (an MMDiT diffusion transformer) with the architecture left intact and only the weight representation changed to ternary {−1, 0, +1} or binary {−1, +1}:

~4.0B trunk, 25 MMDiT blocks, all 100 matmul-heavy linear layers quantized.
Text encoder: Qwen3-4B compressed to 4-bit, offloaded the moment encoding finishes.
VAE: Flux2 32-channel latent, tiled decode (128px tiles).
Native 1024×1024; also 512×512 and any multiple of 32.
Ternary uses FP16 group-wise scaling (group size 128) — an effective ~1.71 bits per weight, not a clean 2.
Sampler: FlowMatchEuler-discrete, 4 steps, guidance 1.0 (no CFG), shift 3.0.
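
One way to arrive at the ~1.71 figure, as a minimal sketch: a ternary value carries log2(3) bits, and one FP16 scale shared by 128 weights adds 16/128 bits on top. The quantizer below is an absmean rule in the spirit of BitNet b1.58; PrismML's actual quantization recipe is not described here, so treat `quantize_group_ternary` and `effective_bits` as hypothetical illustrations, not their method.

```python
import math

GROUP_SIZE = 128   # from the spec list: group-wise scaling, group size 128
SCALE_BITS = 16    # one FP16 scale per group

def effective_bits(levels, group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """Ideal bits per weight: information in one code plus the amortized scale."""
    return math.log2(levels) + scale_bits / group_size

def quantize_group_ternary(weights):
    """Absmean ternary quantization of one group (an assumption, BitNet-style)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into {-1, 0, +1}.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Reconstruct approximate weights from codes and the shared scale."""
    return [c * scale for c in codes]

ternary_bpw = effective_bits(3)  # log2(3) + 16/128
binary_bpw = effective_bits(2)   # 1 + 16/128
print(f"ternary: {ternary_bpw:.2f} bits/weight, binary: {binary_bpw:.3f} bits/weight")
```

The ternary result rounds to 1.71, matching the quoted figure. Real packed formats usually store ternary codes in 2 bits each, so on-disk size sits above this information-theoretic count.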

The "4 steps, no CFG" part matters as much as the bit-width: with guidance at 1.0 there is a single forward pass per step, and only four steps in total, which accounts for a big chunk of the wall-clock numbers below.
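
To make the sampler settings concrete, here is a rough sketch of a 4-step flow-matching Euler loop with shift 3.0. The shift formula is the static time-shift used by common flow-matching Euler schedulers; whether Bonsai's runtimes apply it exactly this way is an assumption. `shifted_sigmas`, `euler_sample` and `toy_velocity` are hypothetical names, and the toy velocity function stands in for the quantized transformer.

```python
def shifted_sigmas(num_steps=4, shift=3.0):
    """Noise levels from 1.0 down to 0.0, warped toward the noisy end by `shift`."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    warped = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return warped + [0.0]

def euler_sample(velocity_fn, x, num_steps=4, shift=3.0):
    """Plain Euler integration; returns the sample and the number of model calls."""
    sigmas = shifted_sigmas(num_steps, shift)
    calls = 0
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        # Guidance 1.0 means one conditional pass per step, with no second
        # unconditional pass as classifier-free guidance would need.
        v = velocity_fn(x, s_cur)
        calls += 1
        x = [xi + (s_next - s_cur) * vi for xi, vi in zip(x, v)]
    return x, calls

def toy_velocity(target):
    """Exact velocity field when every sample should end at `target` (demo only)."""
    def velocity(x, sigma):
        return [(xi - ti) / sigma for xi, ti in zip(x, target)]
    return velocity

sample, model_calls = euler_sample(toy_velocity([0.25, -0.5]), [1.0, 2.0])
print(shifted_sigmas(), model_calls)
```

With these settings the schedule is roughly 1.0, 0.9, 0.75, 0.5, 0.0, and the model is called four times per image. A CFG sampler at the same step count would call it eight times.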

The footprint and speed (PrismML's numbers)

Variant	Transformer size	Total payload	512² time	1024² time	vs FP16 MFLUX
Ternary MLX 2-bit	1.21 GB	3.88 GB	5.78s	24.26s	3.15× / 5.56×
Binary MLX 1-bit	0.93 GB	3.42 GB	6.01s	24.07s	—

End-to-end active memory at 1024² (ternary) is 2.38 GB, about one-sixth of the FP16 baseline. The binary variant is the curious case: it's marginally slower at 512² (6.01s vs 5.78s) and essentially tied at 1024² (24.07s vs 24.26s), so the smaller 0.93 GB transformer buys footprint, not speed.
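
The binary-versus-ternary comparison is easy to check from the table. A small sketch using only PrismML's published figures (the dictionary keys are our own labels):

```python
# PrismML's published figures for the two MLX variants.
TABLE = {
    "ternary": {"transformer_gb": 1.21, "s_512": 5.78, "s_1024": 24.26},
    "binary": {"transformer_gb": 0.93, "s_512": 6.01, "s_1024": 24.07},
}

def pct_change(new, old):
    """Signed percentage change from `old` to `new`."""
    return 100.0 * (new - old) / old

# Binary relative to ternary, per column.
deltas = {
    key: pct_change(TABLE["binary"][key], TABLE["ternary"][key])
    for key in TABLE["ternary"]
}

for key, value in deltas.items():
    print(f"{key}: {value:+.1f}%")
# transformer_gb: -23.1%
# s_512: +4.0%
# s_1024: -0.8%
```

So the binary build is about 23% smaller on disk while its timings move by 4% or less in either direction.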

This is the same size↔speed tradeoff we've been tracking on the text side — see issue 9 on MTP and llama.cpp. The lever here is quantization rather than speculative decoding, but the question is identical: how much can you shrink before quality moves?

The honest part: two quality numbers, two readings

Bonsai Image reports two benchmarks against its FP16 parent (FLUX.2 Klein 4B). They say different things, and collapsing them into one "X% worse" number would mislead you:

GenEval: 0.723 (Bonsai) vs 0.819 (FP16 parent). This is a real hit on prompt compliance: object counts, attributes, spatial relations. If your prompts are compositional ("a red cube on top of three blue spheres"), expect more misses than with the full model.
DPG-Bench: 0.851 (Bonsai) vs 0.853 (FP16 parent). Near parity on dense descriptive prompts. For rich scene descriptions, quality holds up.

Read together: the compression costs you more precision on structured, compositional prompts than on descriptive richness. At 1.21 GB versus 7.75 GB for the FP16 parent, that's a trade many people will happily take for ideation, drafts, and on-device generation, and won't take when final compositional accuracy is the whole point.
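
Put as relative numbers, the two benchmarks diverge sharply. A quick sketch, assuming the first score in each pair is Bonsai and the second is the FP16 parent, and that 7.75 GB is the FP16 counterpart of the 1.21 GB transformer:

```python
# (Bonsai, FP16 parent) score pairs as quoted above.
SCORES = {
    "GenEval": (0.723, 0.819),
    "DPG-Bench": (0.851, 0.853),
}

def relative_drop(quantized, full):
    """Percentage of the full-precision score lost after quantization."""
    return 100.0 * (full - quantized) / full

drops = {name: relative_drop(q, full) for name, (q, full) in SCORES.items()}
size_ratio = 7.75 / 1.21  # FP16 size over ternary size, both in GB

for name, drop in drops.items():
    print(f"{name}: {drop:.1f}% relative drop")
print(f"size reduction: {size_ratio:.1f}x")
# GenEval: 11.7% relative drop
# DPG-Bench: 0.2% relative drop
# size reduction: 6.4x
```

Roughly a 6.4x smaller model for an 11.7% relative loss on GenEval and almost nothing on DPG-Bench, which is why a single blended "X% worse" figure would hide the real shape of the trade.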

Try it with zero install

The fastest path is the browser. There's a live WebGPU demo:

huggingface.co/spaces/webml-community/bonsai-image-webgpu

(Hosted under the webml-community namespace, not PrismML's own. You need a WebGPU-capable browser — latest Chrome or Edge.)

Run it on your own Mac

Access status (verified 2026-05-27): the weights are public — the Hugging Face API reports gated: false. But there's a trap: the official demo repo (PrismML-Eng/Bonsai-Image-Demo) README is stale and still tells you to set BONSAI_TOKEN "until the public launch." Ignore that. Pull the weights directly, no token:

pip install "huggingface_hub[hf_xet]"
huggingface-cli download \
  --local-dir bonsai-image-ternary-4B-mlx-2bit \
  prism-ml/bonsai-image-ternary-4B-mlx-2bit

The ternary-4B-mlx-2bit repo is the Apple Silicon pick. If you want the full Bonsai Studio UI (FastAPI + Next.js), the demo repo's serve.sh is the documented path; just skip the BONSAI_TOKEN step its stale README still lists.
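
If you'd rather re-check the access status than take our word for it, the Hub's public model endpoint exposes the same `gated` field. A minimal sketch using only the standard library; `fetch_model_json` and `is_gated` are hypothetical helper names, and the sketch assumes the endpoint keeps returning `gated` as either false or a string such as "auto" or "manual":

```python
import json
import urllib.request

REPO_ID = "prism-ml/bonsai-image-ternary-4B-mlx-2bit"  # repo named above

def fetch_model_json(repo_id=REPO_ID, timeout=10.0):
    """Fetch public model metadata from the Hugging Face Hub (needs network)."""
    url = f"https://huggingface.co/api/models/{repo_id}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)

def is_gated(model_json):
    """True when the metadata marks the repo as gated; a missing field counts as open."""
    return bool(model_json.get("gated", False))
```

Calling `is_gated(fetch_model_json())` should return False for as long as the weights stay public. If it ever returns True, the token-free download above will stop working.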

Memory: 16 GB+ unified is comfortable. An 8 GB Mac may manage 512², given that the reported active footprint is only 2.38 GB even at 1024², but that's our inference: PrismML doesn't claim it and we haven't tested it. (Need a quick sanity check on whether your machine has the headroom at all? The runlocal.dev calculator is built for LLM VRAM math, not diffusion, but it'll tell you your unified-memory ceiling.)

It also runs on an iPhone

PrismML reports the ternary model on an A19 Pro (iPhone 17 Pro Max) at 9.4s for 512² and 34.0s for 1024², via its on-device MLX Swift iOS build. Fully on-device, no cloud round-trip. That an entire FLUX-derived text-to-image stack fits and runs on a phone is the part that would have read as science fiction a year ago.

What to watch

Does sub-2-bit become the default packaging for small diffusion models, or stay a one-off? PrismML already shipped a ternary text model family in April (8B/4B/1.7B, BitNet-style 1.58-bit) — this is the same lineage applied to diffusion, which suggests a deliberate strategy rather than a demo.
The binary 1-bit variant's quality. Its speed is already published (and oddly flat versus ternary — see the table above), but PrismML hasn't released GenEval/DPG-Bench scores for the binary build. The real unknown is what the extra compression from ternary to binary costs you in image quality.
Whether 5.78s at 512² holds on consumer Apple Silicon, and — more importantly — what that GenEval gap actually feels like on real prompts rather than on a benchmark. We may publish our own M-series numbers in a later issue if the model gets traction; for now we're going on PrismML's verified figures.

The most honest test is your own eyes: open the WebGPU demo, throw five of your real prompts at it, and judge the quality tradeoff for yourself.
