
Gemma 4 31B It NVFP4 Turbo

Gemma 4 31B It NVFP4 Turbo by LilaRest, a text-generation model. Understand and compare features, benchmarks, and capabilities.

Comparison

| Feature | Gemma 4 31B It NVFP4 Turbo | Interfaze |
| --- | --- | --- |
| Input Modalities | text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | unknown | 1M |
| Tool Calling | No | Tool calling supported, plus built-in browser, code execution, and web search |

Scaling

| Feature | Gemma 4 31B It NVFP4 Turbo | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% loss). Fits on a single RTX 5090 (🎉).

It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for ~2× higher concurrent throughput than other quants like prithivMLmods/gemma-4-31B-it-NVFP4 or cyankiwi/gemma-4-31B-it-AWQ-4bit.

This variant is text-only: the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.

Benchmark

Benchmark chart

> [!NOTE]
> RTX PRO 6000, vllm bench @ 1K input / 200 output tokens. See bench.sh.

Note: We also ran the ⚡Turbo benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.

| | Base model | NVIDIA quant | ⚡ Turbo (this model) |
| --- | --- | --- | --- |
| GPU memory | 58.9 GiB | 31 GiB | 18.5 GiB (-68% base, -40% nvidia) |
| GPQA Diamond | 75.71% | 75.46% | 72.73% (-2.98% base, -2.73% nvidia) |
| MMLU Pro | 85.25% | 84.94% | 83.93% (-1.32% base, -1.01% nvidia) |
| Prefill | 6352 tok/s | 11069 tok/s | 15359 tok/s (+142% base, +39% nvidia) |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | 51 tok/s (+112% base, +30% nvidia) |
| Decode (batched) | 494 tok/s | 913 tok/s | 1244 tok/s (+152% base, +36% nvidia) |
| Concurrency | 2.47 req/s | 4.56 req/s | 6.22 req/s (+152% base, +36% nvidia) |
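
The relative deltas quoted above follow directly from the raw numbers; as a quick sanity check (illustrative helper, not part of the benchmark scripts):

```python
# Recompute the percent deltas quoted in the benchmark table.
def delta(turbo, ref):
    """Percent change of the Turbo value relative to a reference value."""
    return round((turbo - ref) / ref * 100)

# GPU memory: 18.5 GiB vs 58.9 GiB (base) and 31 GiB (NVIDIA quant)
print(delta(18.5, 58.9))  # -69 (the card rounds this to -68%)
print(delta(18.5, 31.0))  # -40
# Prefill throughput: 15359 tok/s vs 6352 (base) and 11069 (NVIDIA quant)
print(delta(15359, 6352))   # +142
print(delta(15359, 11069))  # +39
```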

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

| | prithivMLmods NVFP4 | cyankiwi AWQ | ⚡ Turbo (this model) |
| --- | --- | --- | --- |
| GPU memory | 19.6 GiB | 19.6 GiB | 18.5 GiB |
| Prefill | 6647 tok/s | 6626 tok/s | 15359 tok/s |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | 51 tok/s |
| Decode (batched) | 757 tok/s | 757 tok/s | 1244 tok/s |
| Concurrency | 3.79 req/s | 3.78 req/s | 6.22 req/s |

Usage

Requirements:

  • A Blackwell GPU (see Compatibility)
  • transformers >= 5.5.0
  • vllm >= 0.19 with CUDA 13.0

    Note: pip install vllm installs CUDA 12, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

We recommend using the vllm/vllm-openai:cu130-nightly Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```shell
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.

pip (CUDA 13.0 wheel)

```shell
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
```

```shell
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```
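
With either launch method, the server exposes an OpenAI-compatible API on port 8000. A minimal request payload looks like the following sketch (`build_chat_request` is an illustrative helper, not part of vLLM):

```python
import json

# Build an OpenAI-compatible /v1/chat/completions payload for the server
# started above (assumed to be listening on localhost:8000). The model name
# must match the model passed to vLLM.
def build_chat_request(prompt, max_tokens=200):
    return {
        "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # cap output length per request
    }

body = json.dumps(build_chat_request("Summarize NVFP4 in one sentence."))
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json, e.g. via curl or the `openai` client.
```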

Key flags

  • --quantization modelopt — required, activates NVIDIA's optimized CUTLASS kernels
  • --kv-cache-dtype fp8 — halves KV cache memory on Blackwell
  • --max-model-len 16384 — maximum context length per request. See Compatibility for max value per GPU.

Tuning

The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

  • High-throughput classification / short output — Reduce --max-model-len and limit output tokens (max_tokens in the API request). Less KV cache pressure means more concurrent requests. Expect 14+ req/s on RTX 5090 for classification workloads (~1K input, ~10 output tokens).
  • Long context — Increase --max-model-len (up to ~25K on RTX 5090, ~180K on PRO 6000). Trade concurrent capacity for longer sequences.
  • Latency-sensitive — Keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70ms — fast enough for interactive use.
  • Batch processing — Push --max-num-seqs higher and use --request-rate inf with --max-concurrency to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at 1K/200 workload.
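
The context-length vs. concurrency trade-off above is just KV-cache arithmetic. A back-of-the-envelope estimate (the layer/head counts below are illustrative placeholders, not the real Gemma 4 31B config):

```python
# Rough KV-cache capacity: how many full-length sequences fit in a budget.
# Architecture numbers here are ILLUSTRATIVE ONLY, not the real model config.
def max_concurrent_seqs(kv_budget_gib, max_model_len,
                        n_layers=48, n_kv_heads=8, head_dim=128,
                        bytes_per_elem=1):  # 1 byte with --kv-cache-dtype fp8
    # K and V each store n_kv_heads * head_dim values per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    budget = kv_budget_gib * 1024**3
    return int(budget // (max_model_len * bytes_per_token))

# Halving --max-model-len roughly doubles concurrent capacity:
print(max_concurrent_seqs(10, 16384))  # 6
print(max_concurrent_seqs(10, 8192))   # 13
```

It also shows why `--kv-cache-dtype fp8` matters: with `bytes_per_elem=2` (fp16 cache), the same budget holds about half as many sequences.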

Compatibility

Blackwell (SM 12.0+) — full FP4 tensor core support:

| GPU | VRAM | Works? | Max context | Notes |
| --- | --- | --- | --- | --- |
| RTX 5090 | 32 GB | ✅ | ~25K | Primary target |
| RTX PRO 6000 | 96 GB | ✅ | ~180K | Ideal for high-concurrency or long-context workloads |
| B200 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| B100 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌ | n/a | Not enough VRAM |

Older GPUs (H100, A100, RTX 4090, etc.) may work without `--quantization modelopt`, but they lack FP4 tensor cores, so you lose the optimized kernel path and performance will be significantly worse.
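
The compatibility rule above reduces to a simple gate (sketch only; on a live system you would read these values from `torch.cuda.get_device_capability()` and `torch.cuda.get_device_properties()`):

```python
# Gate for the FP4 tensor-core path: Blackwell (SM 12.0+) with >= 20 GB VRAM,
# per the compatibility table above.
def supports_fp4_path(sm_major, sm_minor, vram_gb):
    return (sm_major, sm_minor) >= (12, 0) and vram_gb >= 20

print(supports_fp4_path(12, 0, 32))  # RTX 5090 -> True
print(supports_fp4_path(9, 0, 80))   # H100     -> False (no FP4 tensor cores)
print(supports_fp4_path(12, 0, 16))  # RTX 5080 -> False (not enough VRAM)
```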

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching modelopt NVFP4 format)
  2. Updated architecture to Gemma4ForCausalLM and quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched — MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, all norms preserved, so we retain all the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
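
A minimal sketch of RTN onto the FP4 (E2M1) value grid with per-group scaling, group_size=16, can make this concrete. It is simplified: real NVFP4 also quantizes the per-group scales to FP8, whereas here they stay in full precision:

```python
# Round-to-nearest (RTN) quantization onto the FP4 (E2M1) grid with
# per-group scaling. Simplified sketch: scales are kept in full precision.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values

def rtn_fp4(weights, group_size=16):
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Scale so the largest magnitude in the group maps to 6.0 (FP4 max).
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 6.0
        for w in group:
            # Snap |w|/scale to the nearest representable FP4 magnitude.
            mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
            out.append((mag if w >= 0 else -mag) * scale)
    return out

w = [0.01, -0.3, 0.12, 0.6]  # toy group
print(rtn_fp4(w, group_size=4))  # approximately [0.0, -0.3, 0.1, 0.6]
```

Note how values near zero (0.01) collapse to 0 while mid-range values snap to the nearest grid point, which is why near-zero, normally distributed weights survive RTN well.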

License

Apache 2.0 — same as the base model.

Credits
