# DeepSeek V4 GGUF
These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, except for the MTP model, which requires a dedicated loader).
https://github.com/antirez/ds4
## Files

| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | same as above |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |
Use q2 on 128 GB Mac machines and q4 on machines with ≥ 256 GB of RAM; pair either with the MTP file for optional speculative decoding.
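If you want to script the choice, here is a minimal sketch that reads total physical RAM via POSIX sysconf (available on macOS and Linux). The 256 GB threshold just mirrors the guidance above and is not part of ds4:

```python
import os

# Total physical memory in GiB via POSIX sysconf (works on macOS and Linux).
total_gib = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30

# Thresholds mirror the guidance above; leave headroom for the OS and KV cache.
variant = "q4" if total_gib >= 256 else "q2"
print(f"{total_gib:.0f} GiB RAM -> ./download_model.sh {variant}")
```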
## Quantization recipe
The filename is the spec. In detail, for the q2 file:
| Tensor class | Quant | Notes |
|---|---|---|
| blk.*.ffn_gate_exps, blk.*.ffn_up_exps | IQ2_XXS | routed-expert up/gate |
| blk.*.ffn_down_exps | Q2_K | routed-expert down (K-quant for quality) |
| blk.*.ffn_{gate,up,down}_shexp | Q8_0 | shared experts |
| blk.*.attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b | Q8_0 | all attention projections (MLA + low-rank output) |
| output.weight | Q8_0 | output head |
| token_embd.weight | F16 | input embedding |
| blk.*.ffn_gate_inp (router) | F16 | learned router |
| blk.*.exp_probs_b (router bias), blk.*.attn_sinks, all *_norm.weight | F32 | |
| blk.*.ffn_gate_tid2eid | I32 | hash-routing tables (first 3 layers only) |
| blk.*.attn_compressor_*, blk.*.indexer_*, blk.*.hc_*, blk.*.output_hc_* | F16 / F32 | DSv4-specific auxiliary blocks |
For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
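To check that a downloaded file actually matches this recipe, you can dump per-tensor quantization types with the gguf Python package that ships alongside llama.cpp (pip install gguf). A minimal sketch, assuming you run it next to the ds4flash.gguf link that download_model.sh creates; the reader only parses GGUF structure, so the DSv4-specific tensors should not bother it:

```python
from collections import Counter

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("ds4flash.gguf")  # the link download_model.sh points at the variant

# Aggregate bytes per quantization type; the result should mirror the recipe table.
bytes_by_type = Counter()
for tensor in reader.tensors:
    bytes_by_type[tensor.tensor_type.name] += int(tensor.n_bytes)

for qtype, nbytes in bytes_by_type.most_common():
    print(f"{qtype:>8}  {nbytes / 2**30:8.2f} GiB")

# Spot-check one class: routed-expert down tensors should be Q2_K (q2) or Q4_K (q4).
for tensor in reader.tensors:
    if tensor.name.endswith("ffn_down_exps.weight"):
        print(tensor.name, "->", tensor.tensor_type.name)
        break
```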
The motivation behind the asymmetry: the routed experts are the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, attention projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts buys the size reduction.
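A back-of-envelope check on that claim, using nothing but the two file sizes above and llama.cpp's nominal bits-per-weight for the formats involved (IQ2_XXS ≈ 2.06, Q2_K ≈ 2.56, Q4_K = 4.5). Treating gate/up/down as equal element counts and ignoring metadata overhead are assumptions of the sketch:

```python
GIB = 2**30

q2_size = 80.8 * GIB   # q2 file size, from the Files table
q4_size = 153.3 * GIB  # q4 file size

# Nominal bits per weight in llama.cpp for the formats involved.
BPW_IQ2_XXS, BPW_Q2_K, BPW_Q4_K = 2.0625, 2.5625, 4.5

# Assumption: gate, up and down expert tensors hold equal element counts,
# so the q2 experts average (2 * IQ2_XXS + Q2_K) / 3 bits per weight.
bpw_q2 = (2 * BPW_IQ2_XXS + BPW_Q2_K) / 3

# Only the routed experts change between the two files, so the size delta
# pins down the total number of expert weights.
expert_weights = (q4_size - q2_size) * 8 / (BPW_Q4_K - bpw_q2)

expert_bytes_q2 = expert_weights * bpw_q2 / 8
print(f"routed-expert weights ~= {expert_weights / 1e9:.0f}B")
print(f"experts in q2 file    ~= {expert_bytes_q2 / GIB:.0f} GiB "
      f"({expert_bytes_q2 / q2_size:.0%} of the file)")
```

Under those assumptions the routed experts come out to roughly 270B weights and about 88% of the q2 file, which is why they are the only lever worth pulling for size.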
## Usage
```
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2   # 128 GB RAM machines
./download_model.sh q4   # >= 256 GB RAM machines
./download_model.sh mtp  # optional MTP / speculative decoding
make
./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```

The download_model.sh script fetches from this repository, resumes partial downloads, and points ./ds4flash.gguf at the selected variant.
## License
MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.