DeepSeek V4 GGUF

DeepSeek V4 GGUF by antirez, a text-generation model. Understand and compare features, benchmarks, and capabilities.

Comparison

| Feature | DeepSeek V4 GGUF | Interfaze |
|---|---|---|
| Input Modalities | text | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | unknown | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | Yes | Yes |
| Context Input Size | 1M | 1M |
| Tool Calling | Yes | Tool calling supported + built-in browser, code execution, and web search |

Scaling

| Feature | DeepSeek V4 GGUF | Interfaze |
|---|---|---|
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |

View model card on Hugging Face

These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, except for the MTP model, which requires a specific loader).

https://github.com/antirez/ds4

Files

| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | same as above |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |

Use q2 on 128 GB Mac machines and q4 on machines with ≥ 256 GB of RAM; pair either with the MTP file for optional speculative decoding.
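A quick way to check which variant fits is sketched below. This is a minimal, illustrative script and not part of the ds4 tooling; the thresholds are the ones above, and the memory-detection commands are standard macOS/Linux ones.

```sh
#!/bin/sh
# Pick a quant variant from total RAM. Illustrative only, not part of ds4.
if [ "$(uname)" = "Darwin" ]; then
    ram_gb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
else
    ram_gb=$(( $(grep MemTotal /proc/meminfo | awk '{print $2}') / 1024 / 1024 ))
fi

if [ "$ram_gb" -ge 256 ]; then
    echo "use: ./download_model.sh q4"
elif [ "$ram_gb" -ge 128 ]; then
    echo "use: ./download_model.sh q2"
else
    echo "neither variant fits comfortably in ${ram_gb} GB of RAM"
fi
```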

Quantization recipe

The filename is the spec. In detail, for the q2 file:

| Tensor class | Quant | Notes |
|---|---|---|
| blk.*.ffn_gate_exps, blk.*.ffn_up_exps | IQ2_XXS | routed-expert up/gate |
| blk.*.ffn_down_exps | Q2_K | routed-expert down (K-quant for quality) |
| blk.*.ffn_{gate,up,down}_shexp | Q8_0 | shared experts |
| blk.*.attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b | Q8_0 | all attention projections (MLA + low-rank output) |
| output.weight | Q8_0 | output head |
| token_embd.weight | F16 | input embedding |
| blk.*.ffn_gate_inp (router) | F16 | learned router |
| blk.*.exp_probs_b (router bias), blk.*.attn_sinks, all *_norm.weight | F32 | |
| blk.*.ffn_gate_tid2eid | I32 | hash-routing tables (first 3 layers only) |
| blk.*.attn_compressor_*, blk.*.indexer_*, blk.*.hc_*, blk.*.output_hc_* | F16 / F32 | DSv4-specific auxiliary blocks |

For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
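If you want to verify the recipe (or the q2/q4 claim) yourself, the per-tensor quant types are readable from GGUF metadata. A minimal sketch using the gguf-dump tool that ships with the gguf Python package, which is generic llama.cpp-ecosystem tooling rather than part of ds4:

```sh
pip install gguf

# List every tensor with its quant type, then pull out the routed experts:
gguf-dump DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
    | grep -E 'ffn_(gate|up|down)_exps'
```

Running the same grep against the q4 file should show Q4_K on all three classes, while every other tensor line matches the q2 dump.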

The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts buys the size.
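A back-of-envelope illustration of "the majority of the parameter count": only the routed-expert tensors differ between the two files, so the 72.5 GiB size delta pins down roughly how many parameters live in them. The bits-per-weight figures below are the usual llama.cpp ones (~2.06 for IQ2_XXS, ~2.56 for Q2_K, ~4.5 for Q4_K), and equal-sized gate/up/down tensors are an assumption:

```sh
awk 'BEGIN {
    gib  = 2^30
    bits = (153.3 - 80.8) * gib * 8   # size delta between q4 and q2, in bits
    q2   = (2 * 2.06 + 2.5625) / 3    # mean expert bpw in the q2 file
    dbpw = 4.5 - q2                   # extra bits per expert weight in q4
    printf "routed-expert params: ~%.0fB\n", bits / dbpw / 1e9
}'
```

Under those assumptions the experts come out around 270B parameters, against roughly 10 GiB of everything-else in the q2 file, which is why crushing the experts dominates the total size while the Q8_0 components barely move it.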

Usage

```sh
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2     # 128 GB RAM machines
./download_model.sh q4     # >= 256 GB RAM machines
./download_model.sh mtp    # optional MTP / speculative decoding
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```
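How clients talk to ds4-server is not documented in this card. Purely as a hypothetical sketch, assuming it exposes an OpenAI-style /v1/chat/completions endpoint on port 8080 (both the path and the port are assumptions; check the repository README for the real interface):

```sh
# Hypothetical client call: endpoint, port, and payload shape are assumed,
# not taken from the ds4 documentation.
curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "Explain Redis streams in one paragraph."}]}'
```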

The download_model.sh script fetches from this repository, resumes partial downloads, and points ./ds4flash.gguf at the selected variant.

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.
