Google DeepMind's DiffusionGemma Generates Text 4x Faster on Local Hardware

Google DeepMind has released a new member of its open-weight Gemma model family that takes a fundamentally different approach to text generation. Called DiffusionGemma, the model doesn't generate text left-to-right like most large language models. Instead, it produces entire blocks of text in parallel, borrowing techniques from image generation, which makes it up to four times faster than similarly sized autoregressive models.

A New Architecture for Text Generation

Most AI language models today are autoregressive: they predict the next token one at a time, building sentences sequentially from left to right. DiffusionGemma breaks this mold entirely. The model starts with a field of placeholder tokens and iteratively "denoises" the canvas over multiple passes, progressively refining its predictions until all tokens are finalized in one large block. This approach, inspired by diffusion models used in image generation (like Stable Diffusion), shifts the bottleneck from memory bandwidth to compute.
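To make the denoising loop concrete, here is a toy sketch in Python. It is not DeepMind's implementation: `toy_predict` is a hypothetical stand-in for the model that simply looks up the right token and attaches a random confidence, and the schedule (commit the most confident predictions on each pass) is an assumption borrowed from masked-diffusion decoding in general. Unlike the real model, this toy never revises a token once it is committed. It only shows the shape of the process: start from placeholders, fill in several positions per pass, and finish in a fixed number of passes.

```python
import random

MASK = "<mask>"  # placeholder token used to fill the initial canvas


def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for a model.

    Returns {position: (token, confidence)} for every masked position.
    A real model would predict tokens; this toy just reads them from `target`.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def denoise(target, steps=4, seed=0):
    """Fill a fully masked canvas over at most `steps` parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]  # snapshot of the canvas after each pass
    for step in range(steps):
        preds = toy_predict(canvas, target, rng)
        if not preds:
            break  # nothing left to fill in
        remaining_steps = steps - step
        # Commit enough positions per pass to finish on schedule (ceil division).
        k = -(-len(preds) // remaining_steps)
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            canvas[i] = preds[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, history = denoise(words, steps=4)
    for n, snapshot in enumerate(history):
        print(n, " ".join(snapshot))
```

With a nine-word target and `steps=4`, the canvas fills in 3, 2, 2, and 2 positions over four passes, rather than taking nine sequential steps as a token-by-token decoder would.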

In practical terms, DiffusionGemma can generate up to 256 tokens in parallel. Testing on an NVIDIA RTX 5090 GPU yielded roughly 700 tokens per second. On a single NVIDIA H100 AI accelerator, that figure exceeds 1,000 tokens per second — about four times the output of comparably sized autoregressive Gemma models.
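A quick back-of-the-envelope calculation puts those numbers in perspective. The helper names below are illustrative, and the autoregressive baseline is only implied by the "about four times" claim, not a figure reported in the article. Because the H100 number is quoted as "exceeds 1,000", treat the results as rough estimates.

```python
def seconds_per_block(tokens_per_second, block_tokens=256):
    """Time to emit one full parallel block at a given throughput."""
    return block_tokens / tokens_per_second


def implied_baseline(tokens_per_second, speedup=4.0):
    """Autoregressive throughput implied by a stated speedup (an inference, not a measurement)."""
    return tokens_per_second / speedup


if __name__ == "__main__":
    for name, tps in [("RTX 5090", 700), ("H100", 1000)]:
        print(f"{name}: {seconds_per_block(tps):.3f} s per 256-token block")
    print(f"Implied autoregressive baseline on H100: {implied_baseline(1000):.0f} tokens/s")
```

At 1,000 tokens per second, a full 256-token block takes roughly a quarter of a second, and the implied autoregressive baseline on the same hardware is around 250 tokens per second.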

"DiffusionGemma has more in common with image generation models, which start with static and then denoise it to create the desired content." — Google DeepMind, via Ars Technica

Open and Accessible

DiffusionGemma joins the Gemma 4 open model family but stands apart architecturally. It is a Mixture of Experts (MoE) model with 26 billion total parameters, of which only 3.8 billion are activated during inference. This sparse activation means the model can run on consumer-grade hardware with 18 GB of RAM, making it accessible to developers and researchers without access to massive cloud clusters.
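The parameter counts and the 18 GB figure can be sanity-checked with simple arithmetic. The sketch below makes several assumptions that the article does not confirm: decimal gigabytes, memory dominated by weights (ignoring activations and runtime overhead), and all 26 billion parameters resident in memory, which is the usual case for MoE inference without offloading. The article does not state the weight precision, so the bits-per-parameter result is an inference only.

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weight storage in decimal GB for a given precision."""
    return params_billion * bits_per_param / 8


def implied_bits(memory_gb, params_billion):
    """Bits per parameter if `memory_gb` held nothing but weights (an assumption)."""
    return memory_gb * 8 / params_billion


TOTAL_B = 26.0   # total parameters in billions, per the article
ACTIVE_B = 3.8   # parameters active per token in billions, per the article

if __name__ == "__main__":
    print(f"Active fraction per token: {ACTIVE_B / TOTAL_B:.1%}")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: about {weight_gb(TOTAL_B, bits):.0f} GB")
    print(f"18 GB implies at most about {implied_bits(18, TOTAL_B):.1f} bits per parameter")
```

Under these assumptions, 16-bit weights would need about 52 GB and 4-bit weights about 13 GB, so the 18 GB figure is consistent with a quantized model. Sparse activation mainly reduces the compute per token (roughly 15% of the parameters are used), while the storage footprint still scales with the total parameter count.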

Google says the non-linear generation approach offers particular advantages for tasks like in-line text editing, molecular sequencing, and mathematical graphing, all domains where the ability to continuously self-correct large token sets proves valuable. The company demonstrated the model solving Sudoku puzzles, a task notoriously difficult for autoregressive models because each digit depends on digits that have not yet been placed.

The release underscores Google's ongoing commitment to open-weight AI models. Unlike proprietary flagship models such as Gemini, Gemma models are freely available for download, modification, and commercial use. DiffusionGemma represents a significant step forward in making fast, capable AI accessible outside of large-scale cloud deployments, potentially enabling new categories of local-first AI applications.

Source: Ars Technica — "Google's latest DiffusionGemma open AI model comes with a 4x speed boost" (June 10, 2026)
