Qwen3.6 35B A3B Claude 4.7 Opus Reasoning Distilled

Qwen3.6 35B A3B Claude 4.7 Opus Reasoning Distilled by lordx64, a text-generation model with multimodal capabilities. Understand and compare multimodal features, benchmarks, and capabilities.

Comparison

Feature | Qwen3.6 35B A3B Claude 4.7 Opus Reasoning Distilled | Interfaze
Input Modalities | text, image | image, text, audio, video, document
Native OCR | No | Yes
Long Document Processing | No | Yes
Language Support | unknown | 162+
Native Speech-to-Text | No | Yes
Native Object Detection | No | Yes
Guardrail Controls | No | Yes
Context Input Size | 262.1K | 1M
Tool Calling | Yes | Tool calling supported + built-in browser, code execution and web search

Scaling

Feature | Qwen3.6 35B A3B Claude 4.7 Opus Reasoning Distilled | Interfaze
Scaling | Self-hosted/Provider-hosted with quantization | Unlimited

View model card on Hugging Face

A reasoning-distilled variant of Qwen3.6-35B-A3B taught to imitate the chain-of-thought style of Claude Opus 4.7, the frontier reasoning model from Anthropic. The goal: port Claude-grade reasoning behavior into a permissively-licensed Mixture-of-Experts model that an individual can actually run.

Why this model

  • Claude-style reasoning, open weights. Claude Opus 4.7 is one of the strongest reasoning models available, but only via a proprietary API. This model has been fine-tuned on ~8k high-quality reasoning traces produced by Opus 4.7, teaching the base to think before answering — with explicit <think>…</think> blocks — in Claude's structure and cadence.
  • Sparse activation, dense knowledge. The base is a 35B-parameter MoE with 256 experts, 8 routed + 1 shared, of which only about 3B parameters are active per token. You get the capacity of a 35B model at the inference cost of a small dense model. Full-quality bf16 inference runs on a single 80GB A100 or H100.
  • Long thinking supported. 64k token context. The model routinely emits 5–30k tokens of <think> reasoning on hard problems before giving the final answer — which is the whole point of reasoning models, and why this one was specifically trained end-to-end with an upstream teacher that also reasons explicitly.
  • Clean base to build on. LoRA adapter is also published separately (…-adapter), so you can apply the distillation to other checkpoints of the same base, or stack further fine-tunes.

Intended use

Built for hard reasoning: graduate-level STEM, competition math (AIME / MATH), code reasoning with explicit walk-through, multi-step logic puzzles, and agentic planning where explicit <think> helps correctness.

For latency-sensitive, short-turn conversational workloads the thinking budget can be large; cap max_new_tokens, or post-process to strip <think>…</think> blocks if you only want final answers in production.
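A minimal post-processing helper along these lines (a sketch; `strip_think` is a hypothetical name, and it assumes the model closes every <think> block it opens):

```python
import re

def strip_think(text: str) -> str:
    """Drop <think>...</think> reasoning blocks, keeping only the final answer."""
    # DOTALL so reasoning may span many lines; non-greedy so multiple
    # blocks are each removed individually rather than merged into one match.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

In a serving path this would run on the decoded output just before returning it to the caller.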

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Leave generous headroom: the model may emit tens of thousands of <think> tokens.
out = model.generate(inputs, max_new_tokens=32768, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

Recommended backend: vLLM for serving — the MoE routing + KV cache benefit significantly from continuous batching.

vllm serve lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled \
  --dtype bfloat16 --max-model-len 65536 --gpu-memory-utilization 0.9

GGUF (LM Studio / llama.cpp)

Quantized GGUF weights are available for llama.cpp and LM Studio:

  • IQ4_XS (18.9 GB) — smallest, default pick for LM Studio
  • Q5_K_M (~25 GB) — balanced quality / size
  • Q8_0 (~35 GB) — near-lossless
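As a rough sanity check, the file sizes above imply the following bits-per-weight figures for the 35.1B-parameter model (back-of-envelope only: it treats the sizes as decimal GB and ignores GGUF metadata overhead):

```python
# Implied bits per weight for each GGUF quant, given 35.1e9 parameters.
params = 35.1e9
for name, gb in [("IQ4_XS", 18.9), ("Q5_K_M", 25.0), ("Q8_0", 35.0)]:
    bpw = gb * 1e9 * 8 / params
    print(f"{name}: {bpw:.2f} bits/weight")
# → IQ4_XS: 4.31, Q5_K_M: 5.70, Q8_0: 7.98
```

These land close to the nominal bit widths of each quant type, which suggests the listed sizes are plausible.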

Search lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled inside LM Studio's model browser once HF has indexed the GGUF repos (usually within an hour of publication).

Training

Base model | Qwen/Qwen3.6-35B-A3B (loaded via unsloth/Qwen3.6-35B-A3B for faster finetuning)
Teacher | Claude Opus 4.7 (Anthropic)
Training dataset | lordx64/reasoning-distill-opus-4-7-max-sft — reasoning traces from Claude Opus 4.7 reformatted into SFT conversations
Source dataset | lordx64/reasoning-distill-claude-opus-4-7-max — raw teacher traces (pre-SFT formatting)
Dataset size | ~7,800 full conversations, assistant side trained including <think>…</think>
Method | SFT with Unsloth + TRL SFTTrainer + train_on_responses_only (loss only on assistant tokens)
LoRA config | r=16, alpha=16, dropout=0.0, targets=["q_proj","k_proj","v_proj","o_proj"] (attention-only)
Hyperparameters | lr=2e-5, cosine schedule, warmup_ratio=0.03, weight_decay=0.01, optimizer adamw_8bit
Batch | per_device=1, grad_accum=16, effective=16, 2 epochs = 978 steps
Sequence | 4096 tokens during training (64k usable at inference — base supports it natively)
Precision | bf16 on 1× H200 141GB (HF Inference Endpoint, custom container)
Trainable | 3.44M params out of 35.1B (0.01%)
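The trainable fraction in the last row checks out arithmetically:

```python
# 3.44M trainable LoRA params over 35.1B total parameters.
trainable, total = 3.44e6, 35.1e9
pct = 100 * trainable / total
print(f"{pct:.4f}%")  # → 0.0098%, i.e. the ~0.01% reported above
```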

Why attention-only LoRA on a MoE

The initial plan was full LoRA including the MoE expert FFNs (gate_proj/up_proj/down_proj). In the course of this project I filed and upstreamed a shape-mismatch fix to unsloth-zoo's MoE+LoRA grouped-mm path (unslothai/unsloth-zoo#601); without it, the expert-LoRA forward crashes on Qwen3.6's 256-expert layout. Even with that fix, single-GPU memory made expert-LoRA impractical for this run. Attention-only LoRA captures most of the signal in a style distillation (the point of this model) while leaving the expert FFNs' learned knowledge intact. A v2 training run with expert LoRA on multi-GPU hardware is a natural next step if the style-only signal proves insufficient.
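For concreteness, the two target sets can be written out as peft-style module lists (illustrative only; the names match the projection modules in the Qwen3.6 architecture):

```python
# Attention-only LoRA target set used for this run.
ATTENTION_ONLY_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Expert-FFN projections deliberately left frozen in this run;
# a v2 expert-LoRA run would add these to the target list.
EXPERT_FFN_TARGETS = ["gate_proj", "up_proj", "down_proj"]
```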

Evaluation

Evaluated via lm-evaluation-harness (v0.4.9) with vLLM backend at 64k context, bf16. Custom eval path strips <think>…</think> from generations before the filter pipeline, uses per-task conventional fewshot counts, and runs with fewshot_as_multiturn=True so few-shot examples are proper chat turns rather than concatenated prompt text. Raw results JSON is public: lordx64/qwen3-6-distill-evals.

Benchmark | Setup | Score
GSM8K CoT | 8-shot multiturn, limit 300 | 84.3% (flexible-extract) / 76.7% (strict-match)
MMLU-Pro | 5-shot multiturn, limit 500 | 74.9%
AIME 2024 | 0-shot, full (30) | extraction fix in progress — model generates answers but not in a format the AIME extractor recognizes (\boxed{} vs plain prose)
AIME 2025 | 0-shot, full (30) | same — pending
GPQA Diamond | 0-shot CoT, full (198) | same — pending
MATH-500 | 0-shot, limit 100 | rerun pending (missing sympy / math_verify dep in the first run)
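The AIME extraction mismatch can be illustrated with a small sketch (`extract_answer` is a hypothetical helper; the real harness uses its own filter pipeline):

```python
import re

def extract_answer(text: str):
    """Try the \\boxed{...} convention first, then fall back to plain prose."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    if m:
        return m.group(1)
    # Fallback for prose-style answers like "The answer is 227."
    m = re.search(r"answer is\s*(-?\d+)", text, flags=re.IGNORECASE)
    return m.group(1) if m else None
```

A strict filter that accepts only the \boxed{} form scores the prose answer as no-match, which is consistent with the behavior described above.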

MMLU-Pro subject breakdown

Standard reasoning-model profile: strong on STEM, weaker on law/engineering. All subjects evaluated at limit 500, 5-shot multiturn.

Subject | Acc
Biology | 86.0%
Math | 83.6%
Psychology | 83.4%
Economics | 83.0%
Physics | 81.0%
Computer Science | 79.0%
Chemistry | 78.8%
Business | 74.4%
Health | 73.8%
Other | 72.6%
Philosophy | 71.3%
History | 70.9%
Law | 55.6%
Engineering | 54.8%
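The unweighted macro average of the per-subject accuracies lands close to the headline MMLU-Pro score, a quick consistency check (the harness aggregates over all questions, so exact agreement is not expected):

```python
# Per-subject MMLU-Pro accuracies from the breakdown above.
subject_acc = {
    "Biology": 86.0, "Math": 83.6, "Psychology": 83.4, "Economics": 83.0,
    "Physics": 81.0, "Computer Science": 79.0, "Chemistry": 78.8,
    "Business": 74.4, "Health": 73.8, "Other": 72.6, "Philosophy": 71.3,
    "History": 70.9, "Law": 55.6, "Engineering": 54.8,
}
macro = sum(subject_acc.values()) / len(subject_acc)
print(f"{macro:.1f}%")  # → 74.9%, matching the headline score
```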

Full per-task JSON with stderr, filter configs, and timings lives in the evals dataset. The remaining tasks will be added to this table after a diagnostic rerun identifies why AIME/GPQA extraction is returning no-match on generated outputs.

Limitations

  • Reasoning ≠ knowledge. Distillation transfers how to reason, not new facts. Anything the base Qwen3.6-35B-A3B doesn't already know, this model still doesn't know.
  • Attention-only LoRA. Expert FFNs are untouched from the base — domains where Claude and Qwen3.6 diverge in factual priors may see uneven improvement.
  • Long generations. The model will genuinely use tens of thousands of tokens on hard problems. Budget your max_new_tokens accordingly, and provide max_model_len ≥ 32k at inference.
  • Distillation provenance. Training data was generated with Anthropic's Claude Opus 4.7 via API. Downstream users should confirm compliance with Anthropic's usage policies for their specific use case.

Citation

If you use this model, please cite the base and the distillation:

@misc{qwen36_a3b_2026,
  title  = {Qwen3.6-35B-A3B},
  author = {Qwen Team},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3.6-35B-A3B}},
}

@misc{lordx64_qwen36_distill_2026,
  title  = {Qwen3.6-35B-A3B distilled from Claude Opus 4.7 reasoning},
  author = {lordx64},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled}},
}

Acknowledgements

  • Unsloth — 2× faster training of large MoE LoRA; the bug we hit and fixed was in their unsloth-zoo patches (credit for rapid review of PR #601).
  • Anthropic — for the teacher model.
  • Qwen team — for releasing Qwen3.6 with a permissive Apache-2.0 license, enabling work like this.
  • lm-evaluation-harness (EleutherAI) — evaluation methodology.
