Bonsai 8B Gguf

Bonsai 8B Gguf by prism-ml, a text-generation model. Understand and compare features, benchmarks, and capabilities.

Comparison

| Feature | Bonsai 8B Gguf | Interfaze |
| --- | --- | --- |
| Input Modalities | text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | 65.5K | 1M |
| Tool Calling | No | Yes, plus built-in browser, code execution, and web search |

Scaling

| Feature | Bonsai 8B Gguf | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted / Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)

14.2x smaller than FP16 | 6.2x faster on RTX 4090 | 4-5x lower energy/token

Highlights

  • 1.15 GB parameter memory (down from 16.38 GB FP16) — fits on virtually any device with a GPU
  • End-to-end 1-bit weights across embeddings, attention projections, MLP projections, and LM head
  • GGUF Q1_0 (g128) format with inline dequantization kernels — no FP16 materialization
  • Cross-platform: CUDA (RTX/datacenter), Metal (Mac), Android, CPU
  • Competitive benchmarks: 70.5 avg score across 6 categories, matching full-precision 8B models at 1/14th the size
  • MLX companion: also available as MLX 1-bit g128 for native Apple Silicon inference

Resources

  • Google Colab — try Bonsai in your browser, no setup required
  • Whitepaper — for more details on Bonsai, check out our whitepaper
  • Demo repo — comprehensive examples for serving, benchmarking, and integrating Bonsai
  • Discord — join the community for support, discussion, and updates
  • 1-bit kernels: llama.cpp fork (CUDA + Metal) · MLX fork (Apple Silicon) · mlx-swift fork (iOS/macOS)
  • Locally AI — we have partnered with Locally AI for iPhone support

Model Overview

| Item | Specification |
| --- | --- |
| Parameters | 8.19B (~6.95B non-embedding) |
| Architecture | Qwen3-8B dense: GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm |
| Layers | 36 Transformer decoder blocks |
| Context length | 65,536 tokens |
| Vocab size | 151,936 |
| Weight format | GGUF Q1_0 |
| Deployed size | 1.15 GB (14.2x smaller than FP16) |
| 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |

Quantization Format: Q1_0

Each weight is a single bit: 0 maps to −scale, 1 maps to +scale. Every group of 128 weights shares one FP16 scale factor.

Effective bits per weight: 1.125 (1 sign bit + 16-bit scale amortized over 128 weights).
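The scheme can be sketched in plain Python. This is a simplified illustration of the mapping described above, not the actual GGUF kernel; the function names are hypothetical:

```python
import random
import struct

GROUP = 128  # weights per shared FP16 scale (g128)

def quantize_q1(w):
    """One group of 128 weights -> 128 sign bits + one FP16 scale."""
    scale = sum(abs(x) for x in w) / len(w)                   # per-group scale
    scale = struct.unpack('<e', struct.pack('<e', scale))[0]  # round to FP16
    bits = [1 if x >= 0 else 0 for x in w]                    # 1 -> +scale, 0 -> -scale
    return bits, scale

def dequantize_q1(bits, scale):
    """Reconstruct the group: bit 1 maps to +scale, bit 0 to -scale."""
    return [scale if b else -scale for b in bits]

# Effective storage: 128 sign bits plus one 16-bit scale per group
bits_per_weight = (GROUP + 16) / GROUP  # = 1.125

random.seed(0)
w = [random.gauss(0, 1) for _ in range(GROUP)]
bits, scale = quantize_q1(w)
w_hat = dequantize_q1(bits, scale)
```

Every reconstructed weight is exactly ±scale; only the sign survives quantization, which is why the inline dequantization kernels never need to materialize FP16 tensors.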

Memory Requirement

Parameter memory only (weights and scales loaded into memory):

| Format | Size | Reduction | Ratio |
| --- | --- | --- | --- |
| FP16 | 16.38 GB | — | 1.0x |
| GGUF Q1_0 | 1.15 GB | 93.0% | 14.2x |
| MLX 1-bit g128 | 1.28 GB | 92.2% | 12.8x |

The GGUF file on disk is 1.16 GB (~6.6 MB larger) because the format embeds the tokenizer, chat template, and model metadata alongside the weights.
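As a sanity check, the table's sizes follow directly from the parameter count and the effective bits per weight (back-of-envelope arithmetic in decimal GB):

```python
params = 8.19e9  # total parameters

fp16_gb = params * 16 / 8 / 1e9     # 2 bytes per weight
q1_gb   = params * 1.125 / 8 / 1e9  # 1.125 effective bits per weight

print(f"FP16:      {fp16_gb:.2f} GB")        # 16.38 GB
print(f"GGUF Q1_0: {q1_gb:.2f} GB")          # 1.15 GB
print(f"Ratio:     {fp16_gb / q1_gb:.1f}x")  # 14.2x
```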

Best Practices

Generation Parameters

| Parameter | Default | Suggested range |
| --- | --- | --- |
| Temperature | 0.5 | 0.5 – 0.7 |
| Top-k | 20 | 20 – 40 |
| Top-p | 0.9 | 0.85 – 0.95 |
| Repetition penalty | 1.0 | — |
| Presence penalty | 0.0 | — |

System Prompt

You can use a simple system prompt such as:

You are a helpful assistant

Quickstart

llama.cpp (CUDA)


# Clone the llama.cpp fork with the Q1_0 1-bit kernels
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with the CUDA backend
cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Generate with the suggested sampling parameters; -ngl 99 offloads all layers to the GPU
./build/bin/llama-cli \
    -m Bonsai-8B.gguf \
    -p "Explain quantum computing in simple terms." \
    -n 256 \
    --temp 0.5 \
    --top-p 0.85 \
    --top-k 20 \
    -ngl 99

llama.cpp (Metal / macOS)


# Clone the llama.cpp fork with the Q1_0 1-bit kernels
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build (the Metal backend is enabled by default on macOS)
cmake -B build && cmake --build build -j

# Generate with the suggested sampling parameters; -ngl 99 offloads all layers to the GPU
./build/bin/llama-cli \
    -m Bonsai-8B.gguf \
    -p "Explain quantum computing in simple terms." \
    -n 256 \
    --temp 0.5 \
    --top-p 0.85 \
    --top-k 20 \
    -ngl 99

llama.cpp Server

# Serve Bonsai over HTTP with all layers offloaded to the GPU
./build/bin/llama-server \
    -m Bonsai-8B.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99

Open the web UI at http://127.0.0.1:8080, or see our llama.cpp fork for more examples.
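llama-server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so the suggested generation parameters can be sent from any HTTP client. A minimal sketch using only the Python standard library, assuming the server above is running on port 8080 (the `chat` helper is hypothetical):

```python
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    # Suggested generation parameters from the Best Practices section
    "temperature": 0.5,
    "top_k": 20,
    "top_p": 0.9,
    "max_tokens": 256,
}

def chat(url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```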

Cross-Platform Throughput

| Platform | Backend | TG128 (tok/s) | FP16 TG (tok/s) | TG vs FP16 | PP512 (tok/s) | FP16 PP512 (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 4090 | llama.cpp CUDA | 368 | 59 | 6.2x | 11,809 | 10,453 |
| RTX L40S | llama.cpp CUDA | 327 | 52 | 6.3x | 9,592 | 8,325 |
| RTX 3060 Laptop | llama.cpp CUDA | 81 | 3.5¹ | 23x¹ | 1,871 | 94¹ |
| M4 Pro 48 GB | llama.cpp Metal | 85 | 16 | 5.4x | 498 | 490 |
| Samsung S25 Ultra | llama.cpp OpenCL | 19.6 | — | — | 30.4 | — |

¹ FP16 only partially fits in the laptop GPU's 6 GB of VRAM; the 1-bit model fits entirely in VRAM.

Energy Efficiency

| Platform | Bonsai E_tg (mWh/tok) | Baseline E_tg (mWh/tok) | Advantage |
| --- | --- | --- | --- |
| RTX 4090 (CUDA) | 0.276 | 1.134 (FP16) | 4.1x |
| Mac M4 Pro (Metal) | 0.091 | 0.471 (FP16) | 5.1x |

Benchmarks

Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100 under identical infrastructure, generation parameters, and scoring. All models are in the 6B–9B parameter range.

| Model | Company | Size | Avg | MMLU-R | MuSR | GSM8K | HE+ | IFEval | BFCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3 8B | Alibaba | 16 GB | 79.3 | 83 | 55 | 93 | 82.3 | 84.2 | 81 |
| RNJ 8B | EssentialAI | 16 GB | 73.1 | 75.5 | 50.4 | 93.7 | 84.2 | 73.8 | 61.1 |
| Mistral3 8B | Mistral | 16 GB | 71.0 | 73.9 | 53.8 | 87.2 | 67.4 | 75.4 | 45.4 |
| Olmo 3 7B | Allen Inst | 14 GB | 70.9 | 72 | 56.1 | 92.5 | 79.3 | 37.1 | 38.4 |
| 1-bit Bonsai 8B | PrismML | 1.15 GB | 70.5 | 65.7 | 50 | 88 | 73.8 | 79.8 | 65.7 |
| LFM2 8B | LiquidAI | 16 GB | 69.6 | 72.7 | 49.5 | 90.1 | 81 | 82.2 | 62.0 |
| Llama 3.1 8B | Meta | 16 GB | 67.1 | 72.9 | 51.3 | 87.9 | 75 | 51.5 | — |
| GLM v6 9B | ZhipuAI | 16 GB | 65.7 | 61.9 | 43.2 | 93.4 | 78.7 | 69.3 | 21.9 |
| Hermes 8B | Nous Research | 16 GB | 65.4 | 67.4 | 52.2 | 82.9 | 51.2 | 65 | 73.5 |
| Trinity Nano 6B | Arcee | 12 GB | 61.2 | 68.8 | 52.6 | 81.1 | 54 | 50 | 62.5 |
| Marin 8B | Stanford CRFM | 16 GB | 56.6 | 64.8 | 42.6 | 86.4 | 51 | 50 | — |
| R1-D 7B | DeepSeek | 14 GB | 55.1 | 62.5 | 29.1 | 92.7 | 81.7 | 48.8 | 15.4 |

Despite being 1/14th the size, 1-bit Bonsai 8B is competitive with leading full-precision 8B instruct models.

Intelligence Density

Intelligence density captures the ratio of a model's capability to its deployed size:

alpha = -ln(1 - score/100) / size_GB

| Model | Size | Intelligence Density (1/GB) |
| --- | --- | --- |
| 1-bit Bonsai 8B | 1.15 GB | 1.062 |
| Qwen 3 8B | 16 GB | 0.098 |
| Llama 3.1 8B | 16 GB | 0.074 |
| Mistral3 8B | 16 GB | 0.077 |

Bonsai 8B achieves 10.8x higher intelligence density than full-precision Qwen 3 8B.
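The table's values can be reproduced directly from the formula above (a quick verification using the benchmark averages and deployed sizes):

```python
import math

def alpha(score, size_gb):
    """Intelligence density: -ln(1 - score/100) per deployed GB."""
    return -math.log(1 - score / 100) / size_gb

bonsai = alpha(70.5, 1.15)  # ~1.062
qwen   = alpha(79.3, 16)    # ~0.098
print(f"{bonsai:.3f}, {qwen:.3f}, ratio {bonsai / qwen:.1f}x")
```

The logarithm rewards points gained near the top of the scale, so a small model that stays close to full-precision accuracy scores disproportionately well per gigabyte.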

Use Cases

  • On-device assistants: interactive AI on laptops and phones with low latency
  • Mobile deployment: runs on a wide variety of phones due to low memory footprint
  • Edge robotics and autonomy: compact deployment on devices with thermal, memory, or connectivity constraints
  • Cost-sensitive GPU serving: higher throughput and lower energy per token on RTX-class and datacenter GPUs
  • Enterprise and private inference: local or controlled-environment inference for data residency requirements

Limitations

  • No native 1-bit hardware exists yet — current gains are software-kernel optimizations on general-purpose hardware
  • Mobile power measurement is estimated rather than hardware-metered
  • The full-precision benchmark frontier continues to advance; the 1-bit methodology is architecture-agnostic and will be applied to newer bases

Citation

If you use 1-bit Bonsai 8B, please cite:

@techreport{bonsai8b,
    title   = {1-bit Bonsai 8B: End-to-End 1-bit Language Model Deployment
               Across Apple, GPU, and Mobile Runtimes},
    author  = {Prism ML},
    year    = {2026},
    month   = {March},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
