Gemma 4 — The Open-Weight Local LLM Worth Committing To
Google DeepMind released Gemma 4 under Apache 2.0 — four sizes, frontier reasoning, and day-0 tooling support. Here's which workload to migrate first.
I’ve been skeptical of the “open-weight challenger” narrative every time a new model drops. Gemma 4 is different in kind, not just degree. Google DeepMind released four model sizes on April 2, 2026 — all under Apache 2.0, all with day-0 support in llama.cpp, Ollama, and MLX, and the 26B MoE variant ranking #6 on Arena AI while activating just 3.8B parameters per forward pass. The smallest variant runs in under 1.5 GB of RAM. The question isn’t whether to evaluate it. The question is which workload you’re migrating off a paid API first.
TL;DR
- What: Four Gemma 4 model variants (E2B, E4B, 26B MoE, 31B Dense) released April 2, 2026 under Apache 2.0 — the first fully permissive commercial license in Gemma family history
- Performance: 31B Dense scores 89.2% on AIME 2026 and ranks #3 among open models on Arena AI; 26B MoE ranks #6 with only 3.8B active parameters
- Hardware: E2B runs in under 1.5 GB RAM; 26B MoE fits a single 24GB GPU with Q4 quantization; even the Raspberry Pi 5 hits 7.6 decode tokens/s on CPU
- Action: Evaluate the 26B MoE for any structured output or agentic workload currently running on a paid API — the benchmark-per-parameter story is real
What Happened
Google DeepMind dropped Gemma 4 on April 2, 2026 in four sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE, with 3.8B active parameters per forward pass), and 31B Dense. All four ship under Apache 2.0: no usage caps, no royalties, no custom terms with surprise clauses, just the license's standard attribution and notice requirements. This is the first time the Gemma family has used a fully permissive commercial license. Previous releases shipped under Google's own restricted terms, which kept enterprise adoption a hobbyist conversation. That conversation is now real.
The model weights are available on Hugging Face, Kaggle, and Ollama immediately. Framework support landed day-0 across llama.cpp, MLX, Ollama, vLLM, LM Studio, Hugging Face Transformers (v5.5.0+), and a dozen others. If you have an existing local LLM stack, Gemma 4 slots in without waiting.
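If you want a quick smoke test before committing to anything, the Ollama REST API is the shortest path. Here is a minimal sketch in Python, assuming Ollama is running locally on its default port and that the release ships under something like a `gemma4` tag; check the Ollama library for the actual model name before running it:

```python
import requests

# Smoke test against a local Ollama server (default port 11434).
# The model tag below is an assumption; substitute the published Gemma 4 tag.
MODEL = "gemma4:26b"

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": MODEL,
        "prompt": "In one sentence, what license is Gemma 4 released under?",
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```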
Why This Matters
The Benchmark-Per-Parameter Story Is Not Marketing
The 26B MoE activates 3.8B parameters per forward pass. That is the number that drives per-token compute and latency: the full 26B still has to sit in memory, but only 3.8B parameters do work on each token, which is what keeps inference cheap. It ranks #6 on Arena AI's text leaderboard. The 31B Dense ranks #3 among all open models on the same leaderboard, scores 85.2% on MMLU Pro, and hits 89.2% on AIME 2026. For context, Gemma 3 27B scored 20.8% on AIME 2026. That is not incremental improvement; it is a qualitative shift in what the model can reason through.
The Codeforces ELO jump tells the same story more bluntly: from 110 on Gemma 3 (barely functional at competitive programming) to 2150 on Gemma 4 31B (expert-tier). If you’ve been running coding assistance through a paid API because local models weren’t good enough, that calculus has changed.
Compare directly against the obvious alternatives: Llama 4 Scout and Qwen 3.5 27B operate in the same weight class. Llama 4 ships under Meta’s community license, which includes MAU caps that matter the moment you scale. Qwen 3.5 uses Apache 2.0 as well, and it’s a real competitor on benchmarks — but Gemma 4 has day-0 Ollama support, Android integration, and Google’s backing for long-term model availability. That ecosystem depth is a real advantage for teams who’ve been burned by models that disappear from active maintenance.
Hardware Requirements Are Finally Honest
The E2B uses Per-Layer Embeddings to carry the representational depth of a 5.1B parameter model while fitting in under 1.5 GB of memory with 2-bit and 4-bit quantization. That is not a compromised edge model — it is a genuinely capable one that runs on hardware your users already own.
On a Raspberry Pi 5 running purely on CPU, the E2B hits 133 prefill tokens/s and 7.6 decode tokens/s. With NPU acceleration on the Qualcomm Dragonwing IQ8, those numbers jump to 3,700 prefill and 31 decode tokens/s. If you’re building anything that needs to run inference locally on mobile or embedded hardware, these are the first figures I’ve seen that suggest a production path without a cloud fallback.
The 26B MoE fits on a single 24GB GPU with Q4 quantization (an RTX 3090 or 4090), hardware that a significant portion of the developer community already owns. The 31B Dense, run unquantized, needs a single 80GB H100, which is a different budget conversation. But the 26B MoE hitting #6 on Arena AI with 3.8B active parameters means most teams never need to provision the 31B for inference-heavy workloads.
If you’re evaluating Gemma 4 for structured output or function-calling workloads, start with the 26B MoE via Ollama. It fits on consumer hardware, has native JSON output support, and its Arena AI ranking reflects real-world instruction following — not just benchmark overfitting. Run it alongside your current paid API for two weeks and compare output quality on your actual prompts.
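As a concrete starting point, here is a sketch of that structured-output path through Ollama's chat endpoint. The `/api/chat` route and the `format: "json"` constraint are standard Ollama API surface; the model tag is an assumption, so substitute whatever tag the release actually ships under:

```python
import json
import requests

# Structured-output sketch: ask the local model to extract fields as JSON.
# The model tag is hypothetical; use the published Gemma 4 26B MoE tag.
MODEL = "gemma4:26b"

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": MODEL,
        "format": "json",  # constrain the response to valid JSON
        "stream": False,
        "messages": [
            {
                "role": "system",
                "content": "Extract invoice_number, total, and currency as JSON.",
            },
            {
                "role": "user",
                "content": "Invoice INV-2041, total due 1,280.50 EUR, net 30.",
            },
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(json.loads(resp.json()["message"]["content"]))
```

Point the same prompts at your current paid API and diff the parsed fields; that comparison, not the leaderboard, is what decides the migration.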
Multimodal and Agentic Defaults Change the Category
All four Gemma 4 sizes handle text and images natively. The E2B and E4B additionally accept audio input (speech recognition and speech-to-translated-text), which is unusual in this weight class. The 26B and 31B support a 256K-token context window; the edge models support 128K. All four offer native function calling, structured JSON output, and system instructions, with coverage of over 140 languages.
This is not a text-only model that bolted on vision as an afterthought. Document parsing, chart recognition, and handwriting OCR are listed as explicit capabilities. For teams building document processing pipelines or agentic workflows that currently depend on GPT-4o or Claude for multimodal steps, Gemma 4 is a credible local alternative worth benchmarking on your actual inputs.
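For the multimodal claim specifically, the lowest-friction test is pushing an image through the same Ollama chat endpoint, which accepts base64-encoded images on a message. A sketch, again with a hypothetical model tag and a placeholder file name for whatever document you want parsed:

```python
import base64
import requests

MODEL = "gemma4:26b"              # hypothetical tag; substitute the real one
IMAGE_PATH = "scanned_receipt.png"  # any local document image you want to test

with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": MODEL,
        "stream": False,
        "messages": [
            {
                "role": "user",
                "content": "List the vendor, date, and total from this receipt.",
                "images": [image_b64],  # Ollama accepts base64 images per message
            }
        ],
    },
    timeout=180,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```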
Hugging Face Transformers support requires v5.5.0 or later — this is a hard dependency, not a recommendation. If your existing stack pins an older transformers version, factor in that upgrade before you plan a rollout timeline. The quantized Ollama path has no such requirement and may be the lower-friction starting point for evaluation.
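If you do go the Transformers route, a cheap guard at startup saves a confusing failure later. A minimal sketch, assuming the `packaging` library is available in your environment:

```python
from importlib.metadata import version
from packaging.version import Version

# Gemma 4 support in Hugging Face Transformers requires v5.5.0 or later.
installed = Version(version("transformers"))
if installed < Version("5.5.0"):
    raise RuntimeError(
        f"transformers {installed} is too old for Gemma 4; "
        "upgrade with: pip install -U 'transformers>=5.5.0'"
    )
```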
The Apache 2.0 Shift Is the Real News
Every technical capability in Gemma 4 would be interesting but limited without the licensing change. Apache 2.0 means no monthly active user caps, no acceptable-use policy enforcement by Google as a condition of use, and full freedom for sovereign or commercial deployments. You can build a product on top of Gemma 4, ship it, and scale it without renegotiating terms or worrying about a clause you missed.
This puts Gemma 4 in the same licensing tier as Qwen 3.5 and makes it meaningfully more open than Llama 4’s community license. For regulated industries where legal review of model terms is a real bottleneck — healthcare, finance, government — Apache 2.0 removes an objection that has blocked local LLM adoption at the enterprise level. The model capabilities open the door; the license is what lets legal through it.
Android developers can prototype agentic flows against Gemma 4 in the AICore Developer Preview now. Code written today will run on production Android devices powered by Gemini Nano 4 later in 2026. If you’re building anything for Android that needs on-device inference, this is the forward-compatibility path Google is signaling.
The Take
The benchmark-per-parameter story is real, the licensing is finally clean, and the tooling is there on day-0. I’ve watched every “GPT killer” cycle produce models that were good enough to demo and not good enough to replace a paid API in production. Gemma 4’s 26B MoE is the first open-weight model where the honest answer to “should I evaluate this for production?” is yes, with a concrete path to doing so on hardware most teams already own.
The specific bet I’d make: structured output and function-calling workloads are the lowest-risk migration target. The native JSON support is production-grade, the context windows are large enough for real documents, and a 24GB GPU running Q4 quantization is cheaper per month than the API cost for any significant volume. Start there. Run it in parallel for two weeks. If the output quality holds on your actual prompts, you have a clear decision.
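The parallel run doesn't need infrastructure; a harness that replays your real prompts against both backends and logs the pairs for review is enough. A sketch, with the paid-API call left as a stub you wire to whatever you run today, and the Ollama call and model tag carrying the same assumptions as the earlier examples (the prompts file name is also just a placeholder):

```python
import json
import requests

MODEL = "gemma4:26b"  # hypothetical Ollama tag


def local_response(prompt: str) -> str:
    """Run the prompt against the local Gemma 4 model via Ollama."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]


def paid_api_response(prompt: str) -> str:
    """Stub: wire this to whatever paid API your workload uses today."""
    raise NotImplementedError("connect your current provider here")


# One prompt per line, taken from real production traffic.
prompts = [line.strip() for line in open("real_prompts.txt") if line.strip()]

with open("comparison_log.jsonl", "a") as log:
    for prompt in prompts:
        record = {
            "prompt": prompt,
            "local": local_response(prompt),
            "paid": paid_api_response(prompt),
        }
        log.write(json.dumps(record) + "\n")
```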
The harder question — which I don’t have a clean answer to yet — is long-context reasoning under production load conditions. The 256K context window is real, but long-context quality at the tail end of the window has been a weak point across this model generation. If your workload depends on reliable retrieval from 200K+ token contexts, test that specifically before committing.
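A crude but useful probe is to plant one fact deep in synthetic filler and check whether the model retrieves it at your target depth. A sketch via Ollama, assuming your local build actually exposes a large `num_ctx` (that depends on your VRAM and Modelfile settings, not just the advertised 256K window), with the usual hypothetical model tag:

```python
import requests

MODEL = "gemma4:26b"  # hypothetical Ollama tag; substitute the real one
NEEDLE = "The vault access code is 7391."

# Roughly 200K tokens of synthetic filler (very approximate: ~4 chars per token),
# with the needle planted near the start so the question at the end has to reach
# across almost the entire window.
filler = "The quarterly report contained no anomalies. " * 18000
document = filler[:500] + NEEDLE + " " + filler[500:]

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": MODEL,
        "prompt": document + "\n\nWhat is the vault access code?",
        "stream": False,
        "options": {"num_ctx": 262144},  # request the full window; may be clamped
    },
    timeout=1800,
)
resp.raise_for_status()
answer = resp.json()["response"]
print("PASS" if "7391" in answer else "FAIL", answer[:200])
```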
The license is what lets legal through the door the capabilities opened.
Related
- How to Run Your First Local LLM — if Gemma 4 is your entry point into local inference, start here
- Open Source vs Proprietary AI Models — the licensing argument in full, including when Apache 2.0 actually matters for your use case
- Local LLM API: Ollama, LM Studio, llama.cpp — practical setup for the three deployment paths Gemma 4 supports on day-0