Interfaze

logo

Beta

pricing

help

docs

blog

sign in

Step 3.7 Flash GGUF

Step 3.7 Flash GGUF by stepfun-ai, a image-text-to-text model with multimodal capabilities. Understand and compare multimodal features, benchmarks, and capabilities.

Comparison

FeatureStep 3.7 Flash GGUFInterfaze
Input Modalities

text, image, video

image, text, audio, video, document

Native OCRNoYes
Long Document ProcessingNoYes
Language Support

unknown

162+

Native Speech-to-TextNoYes
Native Object DetectionNoYes
Guardrail ControlsNoYes
Context Input Size

256K

1M

Tool CallingYes

Tool calling supported + built in browser, code execution and web search

Scaling

FeatureStep 3.7 Flash GGUFInterfaze
Scaling

Self-hosted/Provider-hosted with quantization

Unlimited

View model card on Hugging Face

[ModelPage]: https://static.stepfun.com/blog/step-3.7-flash/

1. Introduction

GGUF quantizations of stepfun-ai/Step-3.7-Flash.

Step-3.7-Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun-ai, activating ~11B parameters per token for up to 400 t/s throughput. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding, supports a 256K context window, and offers three selectable reasoning levels (low / medium / high) to balance speed, cost, and depth. Built for agentic workloads — tool calling, multi-step reasoning, code, and math — with native multilingual coverage.

A separate mmproj projector ships alongside the language quants for multimodal inference. With 128 GB of unified memory (Mac Studio, DGX Spark, Ryzen AI Max+ 395, etc.), you can privately host Step-3.7-Flash: Q4 quants and below run at full 256K context with high precision.

2. Files

FileQuantSizeNotes
Step-3.7-flash-BF16.ggufBF16394 GBFull-precision reference.
Step-3.7-flash-Q8_0.ggufQ8_0209 GBNear-lossless. Does not use imatrix.
Step-3.7-flash-Q4_K_S.ggufQ4_K_S112 GBimatrix-calibrated. Balanced quality / size.
Step-3.7-flash-IQ4_XS.ggufIQ4_XS105 GBimatrix-calibrated. Slightly smaller than Q4_K_S, comparable quality.
Step-3.7-flash-Q3_K_L.ggufQ3_K_L103 GBimatrix-calibrated. Aggressive size reduction.
Step-3.7-flash-Q3_K_M.ggufQ3_K_M94 GBimatrix-calibrated. Use when you need to fit on a single 64-96 GB device; expect modest quality loss at low bit-widths.
Step-3.7-flash-IQ3_XXS.ggufIQ3_XXS76 GBimatrix-calibrated. Recommended only when memory is the primary constraint; offers the smallest footprint among the provided quantizations.
mmproj-Step-3.7-flash-f16.ggufF164 GBVision projector. Pair with any of the language quants above for image input.

3. Quickstart

Build llama.cpp and run:

git clone https://github.com/stepfun-ai/llama.cpp.git
cd llama.cpp
git checkout -b step3.7 origin/step3.7
cmake -B build -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j$(nproc)


./build/bin/llama-batched-bench \
  -m Step-3.7-flash-Q4_K_S.gguf \
  -c 32768 -b 2048 -ub 2048 \
  -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1


./build/bin/llama-cli \
  -m Step-3.7-flash-Q4_K_S.gguf \
  -c 32768 -ngl 99 -fa on \
  -p "Write a Python function to compute the n-th Fibonacci number."


./build/bin/llama-mtmd-cli \
  -m Step-3.7-flash-Q4_K_S.gguf \
  --mmproj mmproj-Step-3.7-flash-f16.gguf \
  -c 32768 -ngl 99 -fa on \
  --image path/to/image.jpg \
  -p "Describe this image."


./build/bin/llama-server \
  -m Step-3.7-flash-Q4_K_S.gguf \
  --mmproj mmproj-Step-3.7-flash-f16.gguf \
  -c 32768 -ngl 99 -fa on \
  --host 0.0.0.0 --port 8080

For full CLI / server options, see the llama.cpp README.

4. Performance

Apple Mac Studio (M4 max, 128 GB unified memory)

Step-3.7-flash-Q4_K_S

./llama-batched-bench -m Step-3.7-flash-Q4_K_S.gguf -c 262150 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.002.50051.202.50051.20
2048128121764.873420.282.63948.517.512289.68
81921281832020.292403.702.75746.4323.049360.97
1638412811651242.854382.322.92443.7745.779360.69
3276812813289695.168344.323.22339.7298.391334.34
65536128165664233.885280.213.90932.74237.794276.14
1310721281131200635.499206.255.75922.23641.258204.60
26214412812622722362.488110.9613.1889.712375.677110.40

Step-3.7-flash-IQ4_XS

./llama-batched-bench -m Step-3.7-flash-IQ4_XS.gguf -c 262150 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.002.58249.582.58249.58
2048128121764.835423.562.67947.787.514289.60
81921281832019.954410.552.80345.6622.757365.60
1638412811651242.142388.782.95743.2945.098366.13
3276812813289693.489350.503.28838.9396.777339.91
65536128165664227.088288.593.94532.44231.033284.22
1310721281131200635.047206.405.79122.10640.838204.73
26214412812622722170.271120.7913.0709.792183.342120.12

Step-3.7-flash-Q3_K_L

./llama-batched-bench -m Step-3.7-flash-Q3_K_L.gguf -c 262272 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGBN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.003.59035.663.59035.66
2048128121765.263389.153.70234.578.965242.72
81921281832021.789375.973.81733.5325.606324.92
1638412811651245.819357.583.97732.1849.796331.59
32768128132896100.827324.994.30829.71105.135312.89
65536128165664242.172270.624.97725.72247.149265.69
1310721281131200659.645198.706.76418.92666.409196.88
26214412812622722200.370119.1414.0089.142214.378118.44

NVIDIA DGX Spark (GB10, 128 GB unified memory)

Step-3.7-flash-Q4_K_S

./llama-batched-bench -m Step-3.7-flash-Q4_K_S.gguf -c 131300 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072 -ntg 128 -npl 1
PPTGBN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.005.15724.825.15724.82
2048128121768.021255.334.90726.0812.929168.31
81921281832010.866753.895.16924.7616.035518.86
1638412811651229.389557.496.21520.6035.603463.78
3276812813289652.501624.146.93118.4759.432553.50
65536128165664112.321583.477.76916.48120.090546.79
1310721281131200281.479465.669.83413.02291.313450.37

Step-3.7-flash-IQ4_XS

./llama-batched-bench -m Step-3.7-flash-IQ4_XS.gguf -c 262272 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.005.36823.855.36823.85
2048128121764.250481.875.31124.109.561227.58
81921281832012.531653.735.81722.0118.348453.46
1638412811651224.474669.445.91521.6430.389543.35
3276812813289651.976630.446.53119.6058.508562.25
65536128165664116.305563.487.93416.13124.239528.53
1310721281131200298.746438.7410.26312.47309.009424.58
2621441281262272924.872283.4414.8628.61939.734279.09

Step-3.7-flash-Q3_K_L

./llama-batched-bench -m Step-3.7-flash-Q3_K_L.gguf -c 262272 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.005.94721.525.94721.52
2048128121764.145494.085.62322.769.768222.77
81921281832014.889550.205.79922.0720.688402.17
1638412811651229.374557.786.14020.8535.513464.95
3276812813289654.957596.256.74418.9861.702533.15
65536128165664129.827504.798.34715.33138.174475.23
1310721281131200315.402415.5710.78011.87326.182402.23
2621441281262272910.215288.0015.5688.22925.783283.30

AMD Ryzen AI Max+ 395 (Strix Halo, 128 GB unified memory)

Step-3.7-flash-Q4_K_S

llama-batched-bench.exe -m Step-3.7-flash-Q4_K_S.gguf -c 65664 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536 -ntg 128 -npl 1
PPTGBN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.004.87826.244.87826.24
2048128121769.367218.635.13424.9314.501150.06
81921281832043.540188.155.50823.2449.048169.63
16384128116512111.814146.535.94721.53117.761140.22
32768128132896357.81991.586.77918.88364.59890.23
655361281656641342.50148.828.49515.071350.99648.60

Step-3.7-flash-IQ4_XS

./llama-batched-bench -m Step-3.7-flash-IQ4_XS.gguf -c 65664 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536 -ntg 128 -npl 1
PPTGBN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.005.93121.585.93121.58
2048128121768.143251.506.19420.6714.337151.78
81921281832039.899205.326.52119.6346.420179.23
16384128116512105.098155.896.89118.57111.989147.44
32768128132896338.64596.767.79316.42346.43994.95
655361281656641310.82050.009.48913.491320.30949.73

Step-3.7-flash-Q3_K_L

./llama-batched-bench -m Step-3.7-flash-Q3_K_L.gguf -c 262272 -b 2048 -ub 1024 -ctk q8_0 -ctv q8_0 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGBN_KVT_PP sS_PP t/sT_TG sS_TG t/sT sS t/s
012811280.0000.005.01525.535.01525.53
20481281217610.246199.885.07325.2315.319142.04
81921281832037.229220.055.34123.9642.570195.44
1638412811651279.234206.785.48923.3284.723194.89
32768128132896179.697182.355.81022.03185.507177.33
65536128165664436.593150.116.57719.46443.169148.17
13107212811312001262.377103.839.12414.031271.501103.19
26214412812622723487.92175.1611.39111.243499.31274.95

5. Acknowledgments

This release stands on the work of the following authors and communities:

  • bartowski — for calibration_datav5, the community-standard imatrix calibration anchor used by countless GGUF releases. Used for calibration purposes only; no license has been verified for this resource.
  • eaddario — for the imatrix-calibration dataset (MIT), providing multilingual / code / math splits that form the backbone of this release's domain balance
  • NousResearch — for hermes-function-calling-v1 (Apache-2.0), used for agent / tool-call calibration coverage
  • ggml-org / llama.cpp — for the entire quantization and inference toolchain (MIT)

6. License

The GGUF quantization files in this repository are derivative works of stepfun-ai/Step-3.7-Flash and are released under the same Apache 2.0 license.

ComponentLicense
Base model weights (stepfun-ai/Step-3.7-Flash)Apache-2.0
Calibration dataset (eaddario/imatrix-calibration)MIT
Calibration dataset (NousResearch/hermes-function-calling-v1)Apache-2.0
Quantization toolchain (llama.cpp)MIT

All calibration datasets retain their original licenses and are used strictly for quantization calibration purposes only.

Want more deterministic results?