Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF

Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF by yuxinlu1, a text-generation model with multimodal capabilities. Understand and compare multimodal features, benchmarks, and capabilities.

Comparison

Feature	Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF	Interfaze
Input Modalities	text, image, audio, video	image, text, audio, video, document
Native OCR	No	Yes
Long Document Processing	No	Yes
Language Support	140 partial	162+
Native Speech-to-Text	No	Yes
Native Object Detection	No	Yes
Guardrail Controls	No	Yes
Context Input Size	262.1K	1M
Tool Calling	Yes	Tool calling supported + built in browser, code execution and web search

Scaling

Feature	Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF	Interfaze
Scaling	Self-hosted/Provider-hosted with quantization	Unlimited

View model card on Hugging Face

🐣 Tiny footprint, big brain — a local coding & tool-using agent for everyone

No matter your GPU. No matter your RAM. With ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding agent right now. 🚀 v2 is the big agentic upgrade — it reads, reasons, uses tools, and works through multi-step technical tasks before it acts. 🧠🛠️ All local, all yours, no API, no cloud.

📊 The headline — it works as an agent (tau2-bench)

v2 is built for coding + agentic work — writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is tau2-bench telecom, an agentic tool-use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work:

tau2-bench telecom · 20 tasks · local, same harness, all Q8_0	score
official `gemma-4-12B-it` (base)	~15%
🟢 Gemma4-12B v2 (this model)	~55%

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯 Want the full story — why telecom, how the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? It's all broken down further below. 👇

🚀 Announcements

📌 Hitting a problem? Please check my pinned discussion first. ~99% of issues are a client/sampler config, not the weights — and they have a quick fix there. For example: garbled or repeating 0000… output almost always means no repetition penalty (set rep_pen 1.1, temp 1.0); and leaked <|tool_call> / <|channel> tokens mean your front-end isn't parsing Gemma 4's native tool format (use llama.cpp --jinja). If your question isn't covered, don't hesitate to open a discussion — I read them and reply as fast as I can. 💬

📦 No Q2_K this release. I finished a Q2_K (imatrix) build, but it didn't hold up under real stress-testing, so I'm holding it back — I only ship a quant once I'm confident it's genuinely good. Smallest reliable option is Q3_K_M; Q4_K_M is the recommended sweet spot. 🙏

🔮 v3 is already on the way. Honestly? Even I didn't expect the post-training jump to be this large — so I'm pushing further. v3 keeps the coding + agentic focus and aims higher still. Stay tuned! 🎉

🐘 And a bigger sibling is coming — Qwen3.6-27B. I've also started fine-tuning Qwen3.6-27B with the same coding + agentic recipe, for those of you who do have the headroom and want more raw capability. But I haven't forgotten what this project is about: a 27B may be too heavy for some of your GPUs / RAM. So this is not a replacement — I'm pushing v3 (this 12B line) in parallel, at the same time, and it will only get stronger. 💪 No matter your hardware, you'll have a model that fits. 💚

💚 A personal note — thank you, and a few honest words (please read)

First, a huge thank-you for all the data and help you've shared. 🙏 The bittersweet part: none of us saw it coming that Fable 5 would be retired — and only my own dataset holds Fable 5's genuine, self-authored chain-of-thought. So for every dataset the community contributed, I rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path — and the improvement turned out really, really huge (it nearly launched me out of my chair 😄). The benchmark numbers are right above. 👆

Second — I've tried to reply to every community comment, and I've openly owned v1's training problems. Truly, thank you: your feedback is what lets me improve. 💚

Because v1 hit #1 trending, it also attracted some bad words / trolling. I'll say this gently but firmly: real criticism is always welcome here — pure insults are not. This is a local model that lets anyone run a capable AI on tiny RAM/VRAM, at zero API cost and fully private; I even open-sourced the full safetensors master to study and build on. If something's off, open a discussion about the actual problem — I genuinely want to hear it and I'll act on it. But comments that are only insults help no one, and I'll remove them without hesitation. 🙏

Please remember: I'm one person — not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on my own time and my own money: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a dynamic context-window pass to keep the agent's read-before-act steps intact), reading the latest papers, then training → evaluating → training → evaluating. It burned through an entire Claude Max 20× plan (I keep a separate Pro for my own work), and v2 alone cost 40+ hours — even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. 🐾

🔬 The benchmarks, in detail (tau2-bench)

I evaluated v2 on tau2-bench (an agentic tool-use benchmark). I did not run the whole suite — it's very time-consuming — so I focused on the single domain that best matches what v2 is for.

Why tau2-bench telecom? Telecom troubleshooting makes the agent diagnose with read/inspect tools → pinpoint the issue → apply a fix → verify it — structurally the same loop as real terminal/debugging work (check state → diagnose → fix → confirm). That's exactly what this model is meant to be good at, which makes it the right yardstick for v2 (much more so than a shopping/customer-service domain).

tau2-bench telecom · 20 tasks · local, same harness, all Q8_0	score
official `gemma-4-12B-it` (base)	~15%
🟢 Gemma4-12B v2 (this model)	~55%

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯

Grounded, not made-up. Independently, a coding/terminal fabrication probe (tasks that deliberately tempt the model to invent file paths / function signatures / values) found v2 grounds before it acts just like the base — it grep/read/ls first, and doesn't make things up (0% fabrication, on par with the base model).

The interesting part — how they fail. The base model gives up early: on this run it bailed to a human agent 10 times (transfer_to_human) instead of finishing the fix. v2 keeps going — it stays in the loop and works the problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still flails a little sometimes (over-trying, retrying). And some of the remaining misses are actually a bug in the benchmark's own APN tool (it throws on inputs it should handle gracefully), not the model. To be clear: I will not patch the benchmark's tools or leak its test questions just to inflate my score — I'd rather report an honest number and improve the model itself. More training is coming in v3. 🔧

About retail (customer-service shopping): on tau2-bench retail, the base model scores a bit higher than v2. This is fully expected and by design. Retail is pure customer-service (look up a user, process an order) — not what this model is for. v2 is specialized for coding / terminal / technical-agentic work, and on those (telecom) it dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a local coding/agentic model? It is. 💚

Let's keep it honest about scale. Today's frontier models — think mimo-v2.5-pro or Opus 4.8 — all land 90%+ on this telecom benchmark. They're also enormous. For a 12B model, my rough guess is that v3 might top out somewhere around 60–70% (emphasis on guess — I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind — this is a 12B model running on your own machine, and narrowing that gap as much as possible at this size is the whole point. 💪

And the trade-off — there's no free lunch. I also ran a general-knowledge benchmark (MMLU-Pro), and v2 lands a little below the base model there. That's completely normal and expected for a focused fine-tune: when you push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own general-purpose Claude Opus 4.6/4.8 distillation — or the original google/gemma-4-12B-it base. Need a local coding/agentic worker? That's what v2 is tuned for.

🔬 Methodology, honestly: these are local, same-harness, relative numbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They are not directly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) — local self-eval runs systematically lower than published scores. Read them as "v2 vs the base model under identical conditions", which is the comparison that actually matters here.

📚 What's new in v2 (training)

v2 continues from the v1 coder and adds a big agentic push — the piece v1 was missing:

🛠️ Agentic / terminal — real multi-step tool-use trajectories (read → reason → act → verify), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
💻 Coding — verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
📚 General — a curated slice of reasoning/instruction data to keep broad competence.

All reasoning is distilled CoT (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).

📦 Pick your size (GGUF quants)

Quant	Size	Vibe
🟡 Q3_K_M	5.7 GB	great for 8 GB VRAM
🔵 Q4_K_M	6.87 GB	the sweet spot 👌 (recommended)
🟣 Q6_K	9.11 GB	near-lossless
⚪ Q8_0	11.8 GB	basically full quality

ℹ️ No Q2_K this release — it didn't pass stress-testing yet (see Announcements). Smallest reliable quant = Q3_K_M.

🚀 How to run it

Option A — llama.cpp (recommended) 🦙

⚠️ Needs a recent llama.cpp (this is the gemma4_unified architecture — older builds won't load it).

@echo off
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\gemma4-v2-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap -fa on ^
  --jinja ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080
pause

🛠️ Agentic use: pass your tools via the OpenAI tools field (works with --jinja). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify).
🖱️ One-click apps: LM Studio / Jan / Ollama — import the GGUF, pick a quant, go.

🧠 Thinking mode

v2 thinks in Gemma's native thought channel before answering (keep enable_thinking=true, the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0).

⚠️ Good to know

Specialized for coding / terminal / agentic. General-knowledge facts/numbers should still be double-checked.
Reduced refusals: task-focused training, not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
English-centric.

📚 Base & License

License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too — free to use, modify, and redistribute. 🎉
Base model: google/gemma-4-12B-it.
Personal/hobby project — shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! 🐾✨

⚡ Speculative decoding (MTP draft) — verified build

The MTP/ folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official gemma-4-12B-it-assistant) for speculative decoding. Gemma 4 MTP is in llama.cpp mainline (PR #23398) — no fork needed — but the gemma4-assistant loader is build-sensitive right now, so please use the exact build below:

✅ Verified working: llama.cpp b9553 (commit 9e3b928fd). I reproduced it with gemma4-v2-Q8_0 + the MTP-Q8_0 draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking). Lossless either way.
⚠️ Newer builds (e.g. b9702 / b9717) currently crash while loading the draft with invalid vector subscript. This is an upstream regression in the gemma4-assistant loader path, not a problem with these GGUFs — the same files load fine on b9553. Stick with b9553 until it's fixed upstream.

Working command on b9553 (note the older flag names — --model-draft, not --spec-draft-model):

llama-server -m gemma4-v2-Q8_0.gguf ^
  --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
  --spec-type draft-mtp --spec-draft-n-max 4 ^
  -ngl 99 -ngld 99 -fa on --jinja

ℹ️ The Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting) line is harmless. The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give — still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to Q6_K/Q4_K_M or a smaller --ctx-size if you hit OOM.