
Gemma 4 31B It NVFP4 Turbo

Gemma 4 31B It NVFP4 Turbo by LilaRest, a text-generation model. Understand and compare features, benchmarks, and capabilities.

Comparison

| Feature | Gemma 4 31B It NVFP4 Turbo | Interfaze |
| --- | --- | --- |
| Input Modalities | text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | unknown | 1M |
| Tool Calling | No | Tool calling supported, plus built-in browser, code execution, and web search |

Scaling

| Feature | Gemma 4 31B It NVFP4 Turbo | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% loss). Fits on a single RTX 5090 (🎉).

It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for ~2× higher concurrent throughput than other quants like prithivMLmods/gemma-4-31B-it-NVFP4 or cyankiwi/gemma-4-31B-it-AWQ-4bit.

This variant is text-only: the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.

Benchmark

Benchmark chart

> [!NOTE]
> RTX PRO 6000, vllm bench @ 1K input / 200 output tokens. See bench.sh.

Note: We also ran the ⚡Turbo benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.

| | Base model | NVIDIA quant | ⚡ Turbo (this model) |
| --- | --- | --- | --- |
| GPU memory | 58.9 GiB | 31 GiB | 18.5 GiB (-68% base, -40% nvidia) |
| GPQA Diamond | 75.71% | 75.46% | 72.73% (-2.98% base, -2.73% nvidia) |
| MMLU Pro | 85.25% | 84.94% | 83.93% (-1.32% base, -1.01% nvidia) |
| Prefill | 6352 tok/s | 11069 tok/s | 15359 tok/s (+142% base, +39% nvidia) |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | 51 tok/s (+112% base, +30% nvidia) |
| Decode (batched) | 494 tok/s | 913 tok/s | 1244 tok/s (+152% base, +36% nvidia) |
| Concurrency | 2.47 req/s | 4.56 req/s | 6.22 req/s (+152% base, +36% nvidia) |
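
The relative deltas quoted above follow directly from the raw numbers; as a quick sanity check (illustrative helper, not part of the benchmark scripts):

```python
# Recompute the percent deltas quoted in the benchmark table.
def delta(turbo, ref):
    """Percent change of the Turbo value relative to a reference value."""
    return round((turbo - ref) / ref * 100)

# GPU memory: 18.5 GiB vs 58.9 GiB (base) and 31 GiB (NVIDIA quant)
print(delta(18.5, 58.9))  # -69 (the card rounds this to -68%)
print(delta(18.5, 31.0))  # -40
# Prefill throughput: 15359 tok/s vs 6352 (base) and 11069 (NVIDIA quant)
print(delta(15359, 6352))   # +142
print(delta(15359, 11069))  # +39
```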

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

| | prithivMLmods NVFP4 | cyankiwi AWQ | ⚡ Turbo (this model) |
| --- | --- | --- | --- |
| GPU memory | 19.6 GiB | 19.6 GiB | 18.5 GiB |
| Prefill | 6647 tok/s | 6626 tok/s | 15359 tok/s |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | 51 tok/s |
| Decode (batched) | 757 tok/s | 757 tok/s | 1244 tok/s |
| Concurrency | 3.79 req/s | 3.78 req/s | 6.22 req/s |

Usage

Requirements:

  • A Blackwell GPU (see Compatibility)
  • transformers >= 5.5.0
  • vllm >= 0.19 with CUDA 13.0

    Note: pip install vllm installs CUDA 12, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

We recommend using the vllm/vllm-openai:cu130-nightly Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```shell
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.

pip (CUDA 13.0 wheel)

```shell
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
```

```shell
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```
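
With either launch method, the server exposes an OpenAI-compatible API on port 8000. A minimal request payload looks like the following sketch (`build_chat_request` is an illustrative helper, not part of vLLM):

```python
import json

# Build an OpenAI-compatible /v1/chat/completions payload for the server
# started above (assumed to be listening on localhost:8000). The model name
# must match the model passed to vLLM.
def build_chat_request(prompt, max_tokens=200):
    return {
        "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # cap output length per request
    }

body = json.dumps(build_chat_request("Summarize NVFP4 in one sentence."))
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json, e.g. via curl or the `openai` client.
```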

Key flags

  • --quantization modelopt — required, activates NVIDIA's optimized CUTLASS kernels
  • --kv-cache-dtype fp8 — halves KV cache memory on Blackwell
  • --max-model-len 16384 — maximum context length per request. See Compatibility for max value per GPU.

Tuning

The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

  • High-throughput classification / short output — Reduce --max-model-len and limit output tokens (max_tokens in the API request). Less KV cache pressure means more concurrent requests. Expect 14+ req/s on RTX 5090 for classification workloads (~1K input, ~10 output tokens).
  • Long context — Increase --max-model-len (up to ~25K on RTX 5090, ~180K on PRO 6000). Trade concurrent capacity for longer sequences.
  • Latency-sensitive — Keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70ms — fast enough for interactive use.
  • Batch processing — Push --max-num-seqs higher and use --request-rate inf with --max-concurrency to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at 1K/200 workload.
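
The context-length vs. concurrency trade-off above is just KV-cache arithmetic. A back-of-the-envelope estimate (the layer/head counts below are illustrative placeholders, not the real Gemma 4 31B config):

```python
# Rough KV-cache capacity: how many full-length sequences fit in a budget.
# Architecture numbers here are ILLUSTRATIVE ONLY, not the real model config.
def max_concurrent_seqs(kv_budget_gib, max_model_len,
                        n_layers=48, n_kv_heads=8, head_dim=128,
                        bytes_per_elem=1):  # 1 byte with --kv-cache-dtype fp8
    # K and V each store n_kv_heads * head_dim values per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    budget = kv_budget_gib * 1024**3
    return int(budget // (max_model_len * bytes_per_token))

# Halving --max-model-len roughly doubles concurrent capacity:
print(max_concurrent_seqs(10, 16384))  # 6
print(max_concurrent_seqs(10, 8192))   # 13
```

It also shows why `--kv-cache-dtype fp8` matters: with `bytes_per_elem=2` (fp16 cache), the same budget holds about half as many sequences.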

Compatibility

Blackwell (SM 12.0+) — full FP4 tensor core support:

| GPU | VRAM | Works? | Max context | Notes |
| --- | --- | --- | --- | --- |
| RTX 5090 | 32 GB | ✅ | ~25K | Primary target |
| RTX PRO 6000 | 96 GB | ✅ | ~180K | Ideal for high-concurrency or long-context workloads |
| B200 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| B100 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌ | n/a | Not enough VRAM |

Older GPUs (H100, A100, RTX 4090, etc.) may work without `--quantization modelopt`, but they lack FP4 tensor cores, so you lose the optimized kernel path and performance will be significantly worse.
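
The compatibility rule above reduces to a simple gate (sketch only; on a live system you would read these values from `torch.cuda.get_device_capability()` and `torch.cuda.get_device_properties()`):

```python
# Gate for the FP4 tensor-core path: Blackwell (SM 12.0+) with >= 20 GB VRAM,
# per the compatibility table above.
def supports_fp4_path(sm_major, sm_minor, vram_gb):
    return (sm_major, sm_minor) >= (12, 0) and vram_gb >= 20

print(supports_fp4_path(12, 0, 32))  # RTX 5090 -> True
print(supports_fp4_path(9, 0, 80))   # H100     -> False (no FP4 tensor cores)
print(supports_fp4_path(12, 0, 16))  # RTX 5080 -> False (not enough VRAM)
```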

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching modelopt NVFP4 format)
  2. Updated architecture to Gemma4ForCausalLM and quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched — MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, all norms preserved, so we retain all the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
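
A minimal sketch of RTN onto the FP4 (E2M1) value grid with per-group scaling, group_size=16, can make this concrete. It is simplified: real NVFP4 also quantizes the per-group scales to FP8, whereas here they stay in full precision:

```python
# Round-to-nearest (RTN) quantization onto the FP4 (E2M1) grid with
# per-group scaling. Simplified sketch: scales are kept in full precision.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values

def rtn_fp4(weights, group_size=16):
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Scale so the largest magnitude in the group maps to 6.0 (FP4 max).
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 6.0
        for w in group:
            # Snap |w|/scale to the nearest representable FP4 magnitude.
            mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
            out.append((mag if w >= 0 else -mag) * scale)
    return out

w = [0.01, -0.3, 0.12, 0.6]  # toy group
print(rtn_fp4(w, group_size=4))  # approximately [0.0, -0.3, 0.1, 0.6]
```

Note how values near zero (0.01) collapse to 0 while mid-range values snap to the nearest grid point, which is why near-zero, normally distributed weights survive RTN well.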

License

Apache 2.0 — same as the base model.

Credits
