Gemma 4 21b A4b It REAP

Gemma 4 21b A4b It REAP by 0xSero is a text-generation model with multimodal capabilities. This page compares its multimodal features, benchmarks, and capabilities with Interfaze.

Comparison

| Feature | Gemma 4 21b A4b It REAP | Interfaze |
| --- | --- | --- |
| Input Modalities | text, image | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | 140 (partial) | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | Yes | Yes |
| Context Input Size | 262.1K | 1M |
| Tool Calling | Yes | Tool calling supported, plus built-in browser, code execution, and web search |

Scaling

| Feature | Gemma 4 21b A4b It REAP | Interfaze |
| --- | --- | --- |
| Scaling | Self-hosted/provider-hosted with quantization | Unlimited |

View model card on Hugging Face

20% expert-pruned version of google/gemma-4-26b-a4b-it using Cerebras REAP (Router-weighted Expert Activation Pruning).

| | Original | This Model (0.20) | 0.30 variant |
| --- | --- | --- | --- |
| Total params | ~26B | 21.34B | 19.02B |
| Experts per layer | 128 | 103 | 90 |
| Active params/tok | ~4B | ~4B | ~4B |
| Experts/tok | 8 | 8 | 8 |
| Format | BF16 | BF16 | BF16 |
| Disk size | ~52 GB | ~43 GB | ~36 GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool. This yields an ~18% reduction in total disk/memory footprint.
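
The arithmetic behind these numbers can be sanity-checked in a few lines (illustrative only; all figures are taken from the table above, not read from the checkpoint):

```python
# Expert count: dropping the lowest-scoring 20% of 128 experts per layer
original_experts = 128
removed = int(original_experts * 0.20)   # 25 experts removed per layer
remaining = original_experts - removed
print(remaining)                          # 103 experts left per layer

# Disk footprint: ~52 GB -> ~43 GB
reduction = 1 - 43 / 52
print(f"{reduction:.0%} smaller")         # roughly the ~18% quoted above
```

The active compute per token is untouched because the router still picks 8 experts from the 103 that remain.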

How This Model Was Made

Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns. The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.

Calibration dataset: 22,000 samples drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:

| Category | Samples | Source Dataset |
| --- | --- | --- |
| Coding (general) | 1,000 | theblackcat102/evol-codealpaca-v1 |
| Coding (additional) | 1,636 | theblackcat102/evol-codealpaca-v1 |
| Reasoning -- code | 3,480 | open-r1/Mixture-of-Thoughts [code] |
| Reasoning -- math | 3,578 | open-r1/Mixture-of-Thoughts [math] |
| Reasoning -- science | 3,576 | open-r1/Mixture-of-Thoughts [science] |
| Tool calling | 1,000 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 1,000 | SWE-bench/SWE-smith-trajectories |
| Biomedical QA | 800 | qiaojin/PubMedQA [pqa_labeled] |
| Science QA | 800 | derek-thomas/ScienceQA |
| Grade-school math | 4,466 | openai/gsm8k [main] |
| Competition math | 500 | HuggingFaceH4/MATH-500 |
| Code correctness | 164 | evalplus/humanevalplus |
| Total | 22,000 | |

Step 2: REAP Pruning

Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
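
The scoring step can be sketched roughly as follows. This is a toy reimplementation for illustration, not the Cerebras code; the statistics arrays (`gate_values`, `expert_norms`), their shapes, and the exact weighting are assumptions based on the description above:

```python
import numpy as np

def reap_saliency(gate_values, expert_norms):
    """Illustrative REAP-style expert saliency for one MoE layer.

    gate_values:  (tokens, experts) router gate weights, 0 where the expert was not routed
    expert_norms: (tokens, experts) norm of each expert's output per token
    """
    routed = gate_values > 0
    freq = routed.mean(axis=0)  # how often each expert fires on calibration tokens
    # average gate-weighted activation norm over the tokens that used the expert
    weighted = (gate_values * expert_norms).sum(axis=0) / np.maximum(routed.sum(axis=0), 1)
    return freq * weighted      # frequency-weighted saliency

def prune(saliency, ratio=0.20):
    k = int(len(saliency) * ratio)      # e.g. 25 of 128 experts
    return np.argsort(saliency)[:k]     # indices of the lowest-scoring experts to drop

# Synthetic calibration statistics: ~8 of 128 experts routed per token
rng = np.random.default_rng(42)
gates = rng.random((1000, 128)) * (rng.random((1000, 128)) < 0.0625)
norms = rng.random((1000, 128))
dropped = prune(reap_saliency(gates, norms))
print(len(dropped))  # 25
```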

Pruning Configuration

| Parameter | Value |
| --- | --- |
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |

Benchmark Results

Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with --apply_chat_template and think_end_token=<channel|> to properly handle Gemma 4's thinking mode. Scores extracted from model responses using regex matching.

| Task | Original | REAP 0.20 | REAP 0.30 |
| --- | --- | --- | --- |
| Elementary Math | 92% | 90% | 88% |
| Philosophy | 92% | 88% | 74% |
| World Religions | 90% | 64% | 48% |
| College CS | 56% | 76% | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | 86% | 84% | -- |

* Tasks with significant extraction failures (model outputs equations rather than single letters). Real accuracy likely higher for all models.

Notes:

  • Gemma 4 is a thinking model -- it reasons internally before answering. Standard loglikelihood-based benchmarks give misleading results because the model emits reasoning tokens before committing to a final answer.
  • GSM8K uses flexible-extract which handles thinking output well.
  • College CS and math tasks show REAP sometimes outperforming the original, likely due to sampling variance at n=50.

Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress with proper chat template formatting.

| Domain | N | Orig Avg Words | REAP Avg Words | Orig Loop | REAP Loop | Orig Collapse | REAP Collapse |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |

12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms). The REAP 0.20 model is essentially indistinguishable from the original on generation quality.

Architecture

Gemma 4 uses a hybrid sliding/full attention MoE architecture:

  • 30 transformer layers
  • Sliding attention (window=1024) for 25 layers, full attention every 6th layer
  • MoE FFN with 103 remaining experts per layer (originally 128), 8 active per token
  • Thinking model -- uses <|channel>thought / <|channel>response channels
  • Multimodal -- supports text and vision inputs
  • Context window: 262,144 tokens
  • Vocab size: 262,144

Usage

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

vLLM

```shell
# Quote the specifiers so the shell does not treat '>' as a redirect
pip install "vllm>=0.19" "transformers>=5.0"
```

```shell
vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 16384 \
    --trust-remote-code
```
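
Once the server is up it exposes an OpenAI-compatible API (by default on port 8000). A minimal chat-completions request body looks like this; the actual POST is left commented out so the sketch stays self-contained:

```python
import json

payload = {
    "model": "0xSero/gemma-4-21b-a4b-it-REAP",
    "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
    "max_tokens": 4096,
    "temperature": 0.7,
}
print(json.dumps(payload, indent=2))

# With the server running, send it with any HTTP client, e.g.:
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```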

Citation

@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
