Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF
Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF by yuxinlu1, a text-generation model with multimodal capabilities. Understand and compare multimodal features, benchmarks, and capabilities.
Comparison
| Feature | Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF | Interfaze |
|---|---|---|
| Input Modalities | text, image, audio, video | image, text, audio, video, document |
| Native OCR | No | Yes |
| Long Document Processing | No | Yes |
| Language Support | 140 partial | 162+ |
| Native Speech-to-Text | No | Yes |
| Native Object Detection | No | Yes |
| Guardrail Controls | No | Yes |
| Context Input Size | 262.1K | 1M |
| Tool Calling | Yes | Tool calling supported + built in browser, code execution and web search |
Scaling
| Feature | Gemma 4 12B Agentic Fable5 Composer2.5 V2 3.5x Tau2 GGUF | Interfaze |
|---|---|---|
| Scaling | Self-hosted/Provider-hosted with quantization | Unlimited |
View model card on Hugging Face
๐ฃ Tiny footprint, big brain โ a local coding & tool-using agent for everyone
No matter your GPU. No matter your RAM. With ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding agent right now. ๐ v2 is the big agentic upgrade โ it reads, reasons, uses tools, and works through multi-step technical tasks before it acts. ๐ง ๐ ๏ธ All local, all yours, no API, no cloud.
๐ The headline โ it works as an agent (tau2-bench)
v2 is built for coding + agentic work โ writing code, running commands, using tools, debugging, multi-step
technical tasks. The clearest signal is tau2-bench telecom, an agentic tool-use benchmark whose
diagnose โ fix โ verify loop mirrors real terminal/debugging work:
| tau2-bench telecom ยท 20 tasks ยท local, same harness, all Q8_0 | score |
|---|---|
official gemma-4-12B-it (base) | ~15% |
| ๐ข Gemma4-12B v2 (this model) | ~55% |
โ Roughly 3.5ร higher than the base model on technical-agentic tasks. ๐ฏ Want the full story โ why telecom, how the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? It's all broken down further below. ๐
๐ Announcements
๐ Hitting a problem? Please check my pinned discussion first. ~99% of issues are a client/sampler config, not
the weights โ and they have a quick fix there. For example: garbled or repeating 0000โฆ output almost always
means no repetition penalty (set rep_pen 1.1, temp 1.0); and leaked <|tool_call> / <|channel> tokens mean
your front-end isn't parsing Gemma 4's native tool format (use llama.cpp --jinja). If your question isn't covered,
don't hesitate to open a discussion โ I read them and reply as fast as I can. ๐ฌ
๐ฆ No Q2_K this release. I finished a Q2_K (imatrix) build, but it didn't hold up under real stress-testing, so I'm holding it back โ I only ship a quant once I'm confident it's genuinely good. Smallest reliable option is Q3_K_M; Q4_K_M is the recommended sweet spot. ๐
๐ฎ v3 is already on the way. Honestly? Even I didn't expect the post-training jump to be this large โ so I'm pushing further. v3 keeps the coding + agentic focus and aims higher still. Stay tuned! ๐
๐ And a bigger sibling is coming โ Qwen3.6-27B. I've also started fine-tuning Qwen3.6-27B with the same coding + agentic recipe, for those of you who do have the headroom and want more raw capability. But I haven't forgotten what this project is about: a 27B may be too heavy for some of your GPUs / RAM. So this is not a replacement โ I'm pushing v3 (this 12B line) in parallel, at the same time, and it will only get stronger. ๐ช No matter your hardware, you'll have a model that fits. ๐
๐ A personal note โ thank you, and a few honest words (please read)
First, a huge thank-you for all the data and help you've shared. ๐ The bittersweet part: none of us saw it coming that Fable 5 would be retired โ and only my own dataset holds Fable 5's genuine, self-authored chain-of-thought. So for every dataset the community contributed, I rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path โ and the improvement turned out really, really huge (it nearly launched me out of my chair ๐). The benchmark numbers are right above. ๐
Second โ I've tried to reply to every community comment, and I've openly owned v1's training problems. Truly, thank you: your feedback is what lets me improve. ๐
Because v1 hit #1 trending, it also attracted some bad words / trolling. I'll say this gently but firmly: real criticism is always welcome here โ pure insults are not. This is a local model that lets anyone run a capable AI on tiny RAM/VRAM, at zero API cost and fully private; I even open-sourced the full safetensors master to study and build on. If something's off, open a discussion about the actual problem โ I genuinely want to hear it and I'll act on it. But comments that are only insults help no one, and I'll remove them without hesitation. ๐
Please remember: I'm one person โ not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on my own time and my own money: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a dynamic context-window pass to keep the agent's read-before-act steps intact), reading the latest papers, then training โ evaluating โ training โ evaluating. It burned through an entire Claude Max 20ร plan (I keep a separate Pro for my own work), and v2 alone cost 40+ hours โ even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. ๐พ
๐ฌ The benchmarks, in detail (tau2-bench)
I evaluated v2 on tau2-bench (an agentic tool-use benchmark). I did not run the whole suite โ it's very time-consuming โ so I focused on the single domain that best matches what v2 is for.
Why tau2-bench telecom? Telecom troubleshooting makes the agent diagnose with read/inspect tools โ pinpoint the
issue โ apply a fix โ verify it โ structurally the same loop as real terminal/debugging work
(check state โ diagnose โ fix โ confirm). That's exactly what this model is meant to be good at, which makes it the
right yardstick for v2 (much more so than a shopping/customer-service domain).
| tau2-bench telecom ยท 20 tasks ยท local, same harness, all Q8_0 | score |
|---|---|
official gemma-4-12B-it (base) | ~15% |
| ๐ข Gemma4-12B v2 (this model) | ~55% |
โ Roughly 3.5ร higher than the base model on technical-agentic tasks. ๐ฏ
Grounded, not made-up. Independently, a coding/terminal fabrication probe (tasks that deliberately tempt the
model to invent file paths / function signatures / values) found v2 grounds before it acts just like the base โ
it grep/read/ls first, and doesn't make things up (0% fabrication, on par with the base model).
The interesting part โ how they fail. The base model gives up early: on this run it bailed to a human agent
10 times (transfer_to_human) instead of finishing the fix. v2 keeps going โ it stays in the loop and works the
problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still
flails a little sometimes (over-trying, retrying). And some of the remaining misses are actually a bug in the
benchmark's own APN tool (it throws on inputs it should handle gracefully), not the model. To be clear: I will not
patch the benchmark's tools or leak its test questions just to inflate my score โ I'd rather report an honest number
and improve the model itself. More training is coming in v3. ๐ง
About retail (customer-service shopping): on tau2-bench retail, the base model scores a bit higher than v2. This
is fully expected and by design. Retail is pure customer-service (look up a user, process an order) โ not what this
model is for. v2 is specialized for coding / terminal / technical-agentic work, and on those (telecom) it
dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a local coding/agentic model?
It is. ๐
Let's keep it honest about scale. Today's frontier models โ think mimo-v2.5-pro or Opus 4.8 โ all land 90%+ on this telecom benchmark. They're also enormous. For a 12B model, my rough guess is that v3 might top out somewhere around 60โ70% (emphasis on guess โ I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind โ this is a 12B model running on your own machine, and narrowing that gap as much as possible at this size is the whole point. ๐ช
And the trade-off โ there's no free lunch. I also ran a general-knowledge benchmark (MMLU-Pro), and v2 lands
a little below the base model there. That's completely normal and expected for a focused fine-tune: when you
push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own
general-purpose Claude Opus 4.6/4.8 distillation
โ or the original google/gemma-4-12B-it base. Need a local coding/agentic worker? That's what v2 is tuned for.
๐ฌ Methodology, honestly: these are local, same-harness, relative numbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They are not directly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) โ local self-eval runs systematically lower than published scores. Read them as "v2 vs the base model under identical conditions", which is the comparison that actually matters here.
๐ What's new in v2 (training)
v2 continues from the v1 coder and adds a big agentic push โ the piece v1 was missing:
- ๐ ๏ธ Agentic / terminal โ real multi-step tool-use trajectories (read โ reason โ act โ verify), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
- ๐ป Coding โ verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
- ๐ General โ a curated slice of reasoning/instruction data to keep broad competence.
All reasoning is distilled CoT (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).
๐ฆ Pick your size (GGUF quants)
| Quant | Size | Vibe |
|---|---|---|
| ๐ก Q3_K_M | 5.7 GB | great for 8 GB VRAM |
| ๐ต Q4_K_M | 6.87 GB | the sweet spot ๐ (recommended) |
| ๐ฃ Q6_K | 9.11 GB | near-lossless |
| โช Q8_0 | 11.8 GB | basically full quality |
โน๏ธ No Q2_K this release โ it didn't pass stress-testing yet (see Announcements). Smallest reliable quant = Q3_K_M.
๐ How to run it
Option A โ llama.cpp (recommended) ๐ฆ
โ ๏ธ Needs a recent llama.cpp (this is the
gemma4_unifiedarchitecture โ older builds won't load it).
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-v2-Q4_K_M.gguf ^
--ctx-size 16384 ^
--n-gpu-layers 99 ^
--no-mmap -fa on ^
--jinja ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause- ๐ ๏ธ Agentic use: pass your tools via the OpenAI
toolsfield (works with--jinja). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify). - ๐ฑ๏ธ One-click apps: LM Studio / Jan / Ollama โ import the GGUF, pick a quant, go.
๐ง Thinking mode
v2 thinks in Gemma's native thought channel before answering (keep enable_thinking=true, the default chat template
handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0).
โ ๏ธ Good to know
- Specialized for coding / terminal / agentic. General-knowledge facts/numbers should still be double-checked.
- Reduced refusals: task-focused training, not safety-aligned โ add your own guardrails for production. Use responsibly. ๐
- English-centric.
๐ Base & License
- License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too โ free to use, modify, and redistribute. ๐
- Base model:
google/gemma-4-12B-it. - Personal/hobby project โ shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! ๐พโจ
โก Speculative decoding (MTP draft) โ verified build
The MTP/ folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official
gemma-4-12B-it-assistant) for speculative decoding. Gemma 4 MTP is in llama.cpp mainline (PR #23398) โ no fork
needed โ but the gemma4-assistant loader is build-sensitive right now, so please use the exact build below:
- โ
Verified working: llama.cpp
b9553(commit9e3b928fd). I reproduced it withgemma4-v2-Q8_0+ theMTP-Q8_0draft: loads cleanly and accelerates generation (~88 โ ~180 tok/s on a simple deterministic prompt; expect ~1.2โ1.3ร on real coding/thinking). Lossless either way. - โ ๏ธ Newer builds (e.g. b9702 / b9717) currently crash while loading the draft with
invalid vector subscript. This is an upstream regression in thegemma4-assistantloader path, not a problem with these GGUFs โ the same files load fine on b9553. Stick with b9553 until it's fixed upstream.
Working command on b9553 (note the older flag names โ --model-draft, not --spec-draft-model):
llama-server -m gemma4-v2-Q8_0.gguf ^
--model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
--spec-type draft-mtp --spec-draft-n-max 4 ^
-ngl 99 -ngld 99 -fa on --jinjaโน๏ธ The
Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)line is harmless. The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give โ still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to Q6_K/Q4_K_M or a smaller--ctx-sizeif you hit OOM.