# DeepSeek V4 GGUF
These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, except for the MTP model, which requires a dedicated loader).
https://github.com/antirez/ds4
## Files

| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | same as above |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |
Use q2 on 128 GB Mac machines and q4 on machines with ≥ 256 GB of RAM; pair either with the MTP file for optional speculative decoding.
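If you want to script the choice, here is a minimal sketch that reads total physical RAM via POSIX sysconf (available on macOS and Linux). The 256 GB threshold just mirrors the guidance above and is not part of ds4:

```python
import os

# Total physical memory in GiB via POSIX sysconf (works on macOS and Linux).
total_gib = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30

# Thresholds mirror the guidance above; leave headroom for the OS and KV cache.
variant = "q4" if total_gib >= 256 else "q2"
print(f"{total_gib:.0f} GiB RAM -> ./download_model.sh {variant}")
```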
## Quantization recipe
The filename is the spec. In detail, for the q2 file:
| Tensor class | Quant | Notes |
|---|---|---|
| blk.*.ffn_gate_exps, blk.*.ffn_up_exps | IQ2_XXS | routed-expert up/gate |
| blk.*.ffn_down_exps | Q2_K | routed-expert down (K-quant for quality) |
| blk.*.ffn_{gate,up,down}_shexp | Q8_0 | shared experts |
| blk.*.attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b | Q8_0 | all attention projections (MLA + low-rank output) |
| output.weight | Q8_0 | output head |
| token_embd.weight | F16 | input embedding |
| blk.*.ffn_gate_inp (router) | F16 | learned router |
| blk.*.exp_probs_b (router bias), blk.*.attn_sinks, all *_norm.weight | F32 | |
| blk.*.ffn_gate_tid2eid | I32 | hash-routing tables (first 3 layers only) |
| blk.*.attn_compressor_*, blk.*.indexer_*, blk.*.hc_*, blk.*.output_hc_* | F16 / F32 | DSv4-specific auxiliary blocks |
For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
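To check that a downloaded file actually matches this recipe, you can dump per-tensor quantization types with the gguf Python package that ships alongside llama.cpp (pip install gguf). A minimal sketch, assuming you run it next to the ds4flash.gguf link that download_model.sh creates; the reader only parses GGUF structure, so the DSv4-specific tensors should not bother it:

```python
from collections import Counter

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("ds4flash.gguf")  # the link download_model.sh points at the variant

# Aggregate bytes per quantization type; the result should mirror the recipe table.
bytes_by_type = Counter()
for tensor in reader.tensors:
    bytes_by_type[tensor.tensor_type.name] += int(tensor.n_bytes)

for qtype, nbytes in bytes_by_type.most_common():
    print(f"{qtype:>8}  {nbytes / 2**30:8.2f} GiB")

# Spot-check one class: routed-expert down tensors should be Q2_K (q2) or Q4_K (q4).
for tensor in reader.tensors:
    if tensor.name.endswith("ffn_down_exps.weight"):
        print(tensor.name, "->", tensor.tensor_type.name)
        break
```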
The motivation behind the asymmetry: the routed experts are the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, attention projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts buys the size reduction.
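A back-of-envelope check on that claim, using nothing but the two file sizes above and llama.cpp's nominal bits-per-weight for the formats involved (IQ2_XXS ≈ 2.06, Q2_K ≈ 2.56, Q4_K = 4.5). Treating gate/up/down as equal element counts and ignoring metadata overhead are assumptions of the sketch:

```python
GIB = 2**30

q2_size = 80.8 * GIB   # q2 file size, from the Files table
q4_size = 153.3 * GIB  # q4 file size

# Nominal bits per weight in llama.cpp for the formats involved.
BPW_IQ2_XXS, BPW_Q2_K, BPW_Q4_K = 2.0625, 2.5625, 4.5

# Assumption: gate, up and down expert tensors hold equal element counts,
# so the q2 experts average (2 * IQ2_XXS + Q2_K) / 3 bits per weight.
bpw_q2 = (2 * BPW_IQ2_XXS + BPW_Q2_K) / 3

# Only the routed experts change between the two files, so the size delta
# pins down the total number of expert weights.
expert_weights = (q4_size - q2_size) * 8 / (BPW_Q4_K - bpw_q2)

expert_bytes_q2 = expert_weights * bpw_q2 / 8
print(f"routed-expert weights ~= {expert_weights / 1e9:.0f}B")
print(f"experts in q2 file    ~= {expert_bytes_q2 / GIB:.0f} GiB "
      f"({expert_bytes_q2 / q2_size:.0%} of the file)")
```

Under those assumptions the routed experts come out to roughly 270B weights and about 88% of the q2 file, which is why they are the only lever worth pulling for size.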
## Usage
```
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2   # 128 GB RAM machines
./download_model.sh q4   # >= 256 GB RAM machines
./download_model.sh mtp  # optional MTP / speculative decoding
make
./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```

The download_model.sh script fetches from this repository, resumes partial downloads, and points ./ds4flash.gguf at the selected variant.
## License
MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.