April 29, 2026 · 3 min read

Model Spotlight: Gemma 4 Effective

Courier · Gemma 4 Effectives · Self-Hosting · Agent · Omni

Overview

This week's model spotlight covers the Gemma 4 Effective models, E4B and E2B, and I will start by saying I am blown away by both of them. I am grouping the two variants together this week because the only key difference between them is their size. These models pack a punch.

Specs

Both models carry the 'Effective' label, so we have the Effective 4B and Effective 2B variants (E4B, E2B). These models are incredibly small, small enough to run on mobile devices, and yet they are surprisingly performant.

They can both be hosted at full precision in less than 10GB of memory, and at 8-bit quantization in roughly half that.
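As a rough sanity check on those numbers, weight memory is just parameter count times bytes per parameter. This is a simplification of my own, not an official sizing formula, and it ignores activations, KV cache, and runtime overhead:

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    # Raw weight storage only; real deployments add activation and
    # KV-cache memory on top of this figure.
    return num_params * bytes_per_param / 1e9

# E4B at 16-bit "full" precision vs. 8-bit quantization (illustrative).
print(weight_memory_gb(4e9, 2))  # 8.0 GB
print(weight_memory_gb(4e9, 1))  # 4.0 GB
```

Those back-of-the-envelope figures line up with the under-10GB full-precision claim and the roughly-half savings at 8-bit.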

Strengths

Their size alone means they can easily run on almost any consumer hardware. Not only that, they are omni models, meaning they can process audio, image, video, and text inputs. To top it all off, they are also agentic: these models are trained to perform tool calls and can handle lightweight agentic tasks.
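To make the tool-calling idea concrete, here is a minimal dispatch loop of the kind you'd wrap around any tool-calling model. The JSON schema and the `get_weather` tool are made up for illustration; they are not Gemma 4's actual tool-call format:

```python
import json

# Hypothetical tool registry. In a real agent these would be functions
# with side effects (search, file I/O, API calls), not stubs.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(model_output: str) -> str:
    """Parse a model-emitted JSON tool call and run the matching tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Austin"}}'))
# prints: Sunny in Austin
```

The model's job is only to emit a well-formed call; the host loop executes it and feeds the result back as the next turn.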

Why They're So Special

Because of Google's unique architecture, these models punch at the level of 30B models on benchmarks despite being less than one-tenth the size. They are incredibly capable, and the fact that they are agentic is crazy. I've been impressed by their tool-calling abilities.

These are omni models, which is also very impressive. Combining their capabilities and performance, they can be used for a variety of tasks reliably and affordably.

To top it off, they natively support a 256k token context window, which is incredibly impressive for models this small. In my experience they don't perform fantastically with very large contexts, but it's a huge step forward that they support them at all.

Brief Architecture Overview

Because their architecture is so unique, I'm dedicating a small section of this issue to the highlights.

  • PLE (Per Layer Embeddings) — This allows each layer in the model to essentially decide what information it needs per token instead of storing everything on every token. This is how an Effective 2B model gets the representational depth of ~30B models and outperforms its cousin Gemma 3 27B, which is over 13x bigger.
  • Shared KV Cache — Allows later layers to reuse computations from earlier layers for massive efficiency gains: increased inference speed and lower memory overhead.
  • Hybrid Attention — This dynamically alternates between short-range and long-range context so the model can handle long conversations without blowing up your hardware. NOTE: I tested some large prompts and barely noticed any memory increases which is amazing to see.
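The hybrid attention point is easiest to see in terms of attention masks: local layers use a sliding window, and only the occasional global layer attends over the full context. The 3:1 local-to-global ratio and window size below are my own illustrative choices, not Gemma 4's published configuration:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Full causal attention: each token attends to itself and all earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    # Local attention: each token attends only to the last `window` tokens,
    # so per-layer KV memory stays flat as the sequence grows.
    idx = np.arange(n)
    return causal_mask(n) & (idx[None, :] > idx[:, None] - window)

def layer_masks(num_layers: int, n: int, window: int, global_every: int = 4):
    # Alternate local and global layers, e.g. every 4th layer is global.
    return [causal_mask(n) if (i + 1) % global_every == 0
            else sliding_window_mask(n, window)
            for i in range(num_layers)]
```

Since only the sparse global layers grow their KV cache with the full context, long prompts barely move total memory, which matches what I observed in testing.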

Final Thoughts

These models are truly impressive and a massive opportunity for anyone who wants to host their own AIs or start experimenting. They are tiny, yet incredibly performant and capable, with huge memory optimizations, including TurboQuant, which we'll cover in a later issue. This is one of the most significant AI releases to date, and I do not say that lightly.

This is a release to pay attention to, for sure, and I for one am excited to have performant American models available.

Author's Note:

Google removed their custom license, making the whole Gemma 4 family Apache 2.0 which is also huge. It's a very exciting release.

Author: Jackson Oaks
