Gemma 4 31B It DFlash
Gemma 4 31B It DFlash by z-lab is a text-generation model with multimodal capabilities. The tables below compare its modalities, capabilities, and scaling options.
Comparison
| Feature | Gemma 4 31B It DFlash | Interfaze |
|---|---|---|
| Input Modalities | text, image, video | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | 140 (partial) | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | Yes | Yes |
| Context Input Size | 256K | 1M |
| Tool Calling | Yes | Yes, plus built-in browser, code execution, and web search |
Scaling
| Feature | Gemma 4 31B It DFlash | Interfaze |
|---|---|---|
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |
DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel. This is the drafter model, which must be paired with google/gemma-4-31B-it.
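To make the idea concrete, here is a minimal conceptual sketch of one speculative-decoding step with a block drafter. The `drafter.draft_block` and `target.forward` calls are hypothetical stand-ins for illustration, not the actual DFlash or vLLM internals:

```python
# Conceptual sketch of one speculative-decoding step with a block drafter.
# `drafter.draft_block` and `target.forward` are hypothetical stand-ins,
# not the real DFlash / vLLM APIs.
def speculative_decode_step(target, drafter, context, block_size=16):
    # 1. The lightweight diffusion drafter proposes a whole block of
    #    tokens in parallel, instead of one token at a time.
    draft = drafter.draft_block(context, block_size)  # list of token ids

    # 2. The target model verifies the entire block in a single forward pass.
    logits = target.forward(context + draft)

    # 3. Keep the longest draft prefix the target agrees with
    #    (greedy acceptance shown; sampling uses a rejection-sampling rule).
    accepted = []
    for i, token in enumerate(draft):
        if logits[len(context) + i - 1].argmax() != token:
            break
        accepted.append(token)

    # 4. Emit one token from the target itself, so progress is guaranteed
    #    even when the first draft token is rejected.
    bonus = logits[len(context) + len(accepted) - 1].argmax()
    return accepted + [bonus]
```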
Quick Start
Installation
vLLM: until Gemma 4 DFlash support is merged, install vLLM from PR #41703:

```bash
uv pip install -U --torch-backend=auto \
    "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/41703/head"
```

SGLang:
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/23000/head#subdirectory=python"Launch Server
Launch Server

vLLM:
```bash
vllm serve google/gemma-4-31B-it \
    --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-31B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
    --attention-backend triton_attn \
    --max-num-batched-tokens 32768 \
    --trust-remote-code
```

SGLang:
```bash
python -m sglang.launch_server \
    --model-path google/gemma-4-31B-it \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/gemma-4-31B-it-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend triton \
    --speculative-draft-attention-backend fa4 \
    --trust-remote-code
```

Usage
For vLLM, use port 8000. For SGLang, use port 30000.
```python
from openai import OpenAI

# Point the client at the local server (vLLM shown; use port 30000 for SGLang).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
    temperature=0.0,  # greedy decoding
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
```
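Because speculative decoding mainly cuts per-token latency, streaming makes the speedup easy to see interactively. A minimal streaming variant of the request above, reusing the same client (this is the standard OpenAI streaming interface, not anything DFlash-specific):

```python
# Stream tokens as they arrive; works against both vLLM and SGLang servers.
stream = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
    temperature=0.0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```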
Benchmark Results

Setup: single NVIDIA B300 GPU per server/run, vLLM, thinking enabled, max output length 4096, greedy decoding.
Throughput and Speedup
DFlash achieves up to 5.8x speedup at concurrency 1.
Generated tokens/sec (speedup vs. autoregressive baseline)
Block Size = 16
| Task | Concurrency | AR | DFlash |
|---|---|---|---|
| Math500 | 1 | 77 | 447 (5.8x) |
| Math500 | 8 | 511 | 2650 (5.2x) |
| Math500 | 32 | 1308 | 4962 (3.8x) |
| GSM8K | 1 | 78 | 408 (5.3x) |
| GSM8K | 8 | 520 | 2321 (4.5x) |
| GSM8K | 32 | 1382 | 4447 (3.2x) |
| HumanEval | 1 | 76 | 420 (5.6x) |
| HumanEval | 8 | 494 | 2389 (4.8x) |
| HumanEval | 32 | 1145 | 4139 (3.6x) |
| MBPP | 1 | 79 | 343 (4.4x) |
| MBPP | 8 | 535 | 2036 (3.8x) |
| MBPP | 32 | 1389 | 3636 (2.6x) |
| MT-Bench | 1 | 79 | 236 (3.0x) |
| MT-Bench | 8 | 503 | 1334 (2.7x) |
| MT-Bench | 32 | 1177 | 2257 (1.9x) |
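These figures come from the authors' benchmark harness. As a rough sanity check against your own server, you can estimate single-request throughput from the OpenAI-compatible `usage` field; numbers will vary with hardware, prompts, and batching:

```python
# Rough single-request throughput measurement against a running server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
    temperature=0.0,
)
elapsed = time.perf_counter() - start

# completion_tokens counts generated tokens only, not the prompt.
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.0f} tok/s")
```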
Acceptance Length
| Task | Concurrency 1 | Concurrency 8 | Concurrency 32 |
|---|---|---|---|
| Math500 | 8.59 | 8.59 | 8.62 |
| GSM8K | 7.53 | 7.50 | 7.52 |
| HumanEval | 8.00 | 7.89 | 7.96 |
| MBPP | 6.13 | 6.13 | 6.14 |
| MT-Bench | 4.23 | 4.19 | 4.19 |
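Acceptance length is the average number of draft tokens the target accepts per verification step, so it roughly caps the attainable speedup: each target forward pass yields about that many tokens instead of one, minus drafting and verification overhead. A quick check against the Math500 row at concurrency 1, using the numbers from the tables above:

```python
# Numbers taken from the tables above (Math500, concurrency 1).
acceptance_length = 8.59
ar_tps, dflash_tps = 77, 447

print(f"measured speedup:        {dflash_tps / ar_tps:.1f}x")  # ~5.8x
print(f"acceptance-length bound: {acceptance_length:.1f}x")    # ~8.6x
```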
Acknowledgements
Special thanks to David Wang for his outstanding engineering support on this project. We are also grateful to Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.
Citation
If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: DFlash Feedback.
```bibtex
@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```