NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

Today, Google DeepMind released DiffusionGemma — an experimental open model built for exceptionally fast text generation. NVIDIA has optimized DiffusionGemma to run even faster across NVIDIA GeForce RTX GPUs, the NVIDIA RTX PRO platform and NVIDIA DGX Spark systems, from local PCs to the cloud.

Rather than generating text one word at a time, DiffusionGemma generates multiple words in parallel to output whole blocks of text, opening a new, low-latency frontier for the kind of single-user workloads that developers, researchers and AI enthusiasts run every day.

Features of the new model include:

Parallel generation: DiffusionGemma denoises up to 256 tokens per step instead of predicting one at a time.
Built on Gemma 4: DiffusionGemma pairs a diffusion head with Google’s Gemma 4 architecture, a 26-billion-parameter mixture-of-experts model that activates just 3.8 billion parameters per step.
Up to 4x faster performance: Compared with an equivalent autoregressive model, the boost brings fast text generation to local hardware, where single-user generation usually stalls.
Open and local: DiffusionGemma is open weights under a permissive Apache 2.0 license and runs entirely on RTX and DGX Spark — no cloud, no per-token cost — with day-zero support in Hugging Face Transformers, vLLM and Unsloth.
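To put the parameter counts above in perspective, here is a rough back-of-the-envelope sketch of weight memory. The 26-billion total and 3.8-billion active parameter counts and the 128GB DGX Spark figure come from this article; the bytes-per-parameter values are illustrative assumptions, since the article does not state which precision the released weights use. The estimate covers weights only and ignores activations, caches and runtime overhead.

```python
# Rough weight-memory estimate. Parameter counts and the 128GB figure are from
# the article; the precisions below are assumptions, not the released format.
TOTAL_PARAMS = 26_000_000_000
ACTIVE_PARAMS = 3_800_000_000
DGX_SPARK_MEMORY_GB = 128

# Hypothetical storage precisions, in bytes per parameter.
ASSUMED_PRECISIONS = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gigabytes(num_params, bytes_per_param):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def active_fraction(active_params, total_params):
    """Share of the parameters that take part in a single step."""
    return active_params / total_params


footprints = {
    name: weight_gigabytes(TOTAL_PARAMS, size)
    for name, size in ASSUMED_PRECISIONS.items()
}

for name, gigabytes in footprints.items():
    fits = gigabytes < DGX_SPARK_MEMORY_GB
    print(f"{name}: {gigabytes:.0f} GB of weights, under 128 GB: {fits}")

share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share per step: {share:.1%}")
```

All 26 billion parameters have to be resident in memory, but only about 15% of them are exercised per step, which is the usual trade-off of a mixture-of-experts design.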

A Different Way to Generate Text

Almost every large language model (LLM) in wide use today is autoregressive, meaning it generates text one token at a time, with each new token depending on the ones before it. That sequential process is what makes interactive AI feel like it’s typing.

DiffusionGemma takes a different path. Built on the Gemma 4 26B mixture-of-experts architecture, it generates text the way diffusion models generate images: by starting from noise and refining a whole block of text at once. Each step denoises up to 256 tokens in parallel rather than emitting a single token and waiting to compute the next.

The result is a model that thinks in blocks instead of sequentially. For latency-sensitive, single-user work — such as interactive chat, agentic loops or on-device assistants that plan and act — that parallelism translates into responses fast enough to keep pace with how developers think and iterate.
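The difference between the two approaches can be illustrated with a toy loop. This is a minimal sketch, not DiffusionGemma's actual algorithm: the "denoiser" here simply reveals tokens from a known target, whereas a real model predicts them. The 256-token block size comes from the article; the number of refinement passes per block (8) and every name in the code are hypothetical.

```python
import math
import random

MASK = "<mask>"  # placeholder for a position that is still "noise"


def toy_denoise_block(target, passes, seed=0):
    """Fill a fully masked block over a fixed number of parallel passes.

    Toy stand-in for a denoiser: each pass reveals a batch of positions
    anywhere in the block, rather than the next position only.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    rng = random.Random(seed)
    block = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)  # positions are resolved in no particular order
    per_pass = math.ceil(len(target) / passes)
    used = 0
    while hidden:
        reveal, hidden = hidden[:per_pass], hidden[per_pass:]
        for pos in reveal:
            block[pos] = target[pos]
        used += 1
    return block, used


def autoregressive_passes(num_tokens):
    """An autoregressive decoder needs one forward pass per generated token."""
    return num_tokens


target = [f"tok{i}" for i in range(256)]
block, used = toy_denoise_block(target, passes=8)  # 8 passes is an assumption
print(f"block-parallel passes: {used}")
print(f"autoregressive passes: {autoregressive_passes(len(target))}")
```

Each block-parallel pass does more work than a single autoregressive pass, so the ratio of pass counts is not a speedup figure. It only shows why the sequential dependency, and the waiting it causes, shrinks.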

DiffusionGemma Flies on NVIDIA GPUs

Generating one token at a time is fundamentally a memory-bound problem — a traditional LLM spends most of its time waiting on memory bandwidth, not doing math, which leaves a lot of compute on the table.

Diffusion flips the equation. Pulling a full 256-token block through the transformer in parallel is a compute-bound workload, exactly what NVIDIA GPUs are built for. NVIDIA Tensor Cores accelerate the dense parallel math, and the CUDA software stack lets the model run efficiently from day one without bespoke tuning. In short, the model’s design plays directly to the GPU’s strengths.
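A simplified roofline-style estimate shows why the block size matters. The sketch below assumes every weight is read from memory once per forward pass and reused for all tokens in that pass, with two floating-point operations (a multiply and an add) per parameter per token. It ignores mixture-of-experts routing, attention caches and activations. The hardware figures are hypothetical round numbers chosen for illustration, not the specifications of any NVIDIA GPU.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs performed per byte of weights read in one forward pass."""
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops_per_sec, bytes_per_sec):
    """Intensity above which a workload is limited by math, not bandwidth."""
    return peak_flops_per_sec / bytes_per_sec


def limiting_resource(tokens_per_pass, peak_flops_per_sec, bytes_per_sec,
                      bytes_per_param=2.0):
    """Classify a pass as 'compute' bound or 'memory' bound."""
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    ridge = ridge_point(peak_flops_per_sec, bytes_per_sec)
    return "compute" if intensity >= ridge else "memory"


# Hypothetical accelerator: 400 TFLOP/s of math, 2 TB/s of memory bandwidth.
PEAK_FLOPS = 400e12
BANDWIDTH = 2e12

for tokens in (1, 256):
    intensity = arithmetic_intensity(tokens)
    bound = limiting_resource(tokens, PEAK_FLOPS, BANDWIDTH)
    print(f"{tokens:>3} tokens/pass: {intensity:.0f} FLOPs/byte, {bound}-bound")
```

Under these assumptions, a single-token pass performs about one operation per byte fetched and sits far below the ridge point, while a 256-token pass clears it. Real results depend on the actual hardware ratio, precision and expert routing.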

That shows up in the numbers. DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, 150 tokens/sec on NVIDIA DGX Spark and up to 2,000 tokens/sec on NVIDIA DGX Station — roughly 4x faster than an equivalent autoregressive model running in the same single-user regime.

That advantage holds across NVIDIA’s full lineup, running:

Locally on the NVIDIA DGX Spark deskside personal AI supercomputer — powered by the NVIDIA GB10 Grace Blackwell Superchip with 128GB of unified memory — with the preinstalled NVIDIA AI software stack ready for prototyping, fine-tuning and fully local agent workflows.
On NVIDIA RTX PRO 6000 workstations, providing developers, researchers and AI professionals with the headroom to run local low-latency generation and agentic loops as part of a professional workflow.
On DGX Station, delivering best-in-class local inference at up to 2,000 tokens/sec, backed by 748GB of coherent memory, for low-latency text generation and agentic loops.
On GeForce RTX GPUs, with llama.cpp support coming soon.
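To translate the throughput figures quoted above into wait time, the sketch below converts tokens per second into seconds per response. The throughput values and the roughly 4x ratio are the article's numbers; the 600-token response length is a hypothetical example, and the autoregressive baseline is simply implied by dividing by four rather than being a measured result. The DGX Station value is the quoted upper bound.

```python
# Throughput figures quoted in the article, in tokens per second.
QUOTED_TOKENS_PER_SEC = {
    "H100": 1000,
    "DGX Spark": 150,
    "DGX Station": 2000,  # quoted as "up to"
}
QUOTED_SPEEDUP = 4.0      # "roughly 4x" over an equivalent autoregressive model
RESPONSE_TOKENS = 600     # hypothetical response length


def response_seconds(num_tokens, tokens_per_sec):
    """Time to generate a response at a steady throughput."""
    return num_tokens / tokens_per_sec


def implied_baseline(tokens_per_sec, speedup=QUOTED_SPEEDUP):
    """Autoregressive throughput implied by the quoted speedup (not measured)."""
    return tokens_per_sec / speedup


for system, tps in QUOTED_TOKENS_PER_SEC.items():
    fast = response_seconds(RESPONSE_TOKENS, tps)
    slow = response_seconds(RESPONSE_TOKENS, implied_baseline(tps))
    print(f"{system}: {fast:.1f} s vs {slow:.1f} s implied baseline")
```

For interactive use, the gap is the difference between a reply that arrives in a few seconds and one that takes long enough to interrupt a developer's train of thought.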

Get Started Locally

The fastest way to start testing and prototyping the model is through Hugging Face Transformers, which runs DiffusionGemma on a GeForce RTX 5090 or DGX Spark out of the box. For higher-throughput inference, vLLM provides day-zero serving support.

For adapting the model to a specific task or domain, fine-tuning is available through Unsloth and the NVIDIA NeMo framework, with ready-made DGX Spark playbooks to get a local environment running quickly. Check out the vLLM playbooks for DGX Spark, RTX PRO and DGX Station.

Try DiffusionGemma on Hugging Face or test it for free using NVIDIA-hosted application programming interfaces at build.nvidia.com.

Go deeper on the architecture and local deployment by reading the NVIDIA technical blog and the Google DeepMind announcement.

#ICYMI: The Latest From RTX AI Garage

🎬 NVIDIA researchers released SANA-WM, an open source world model that turns a single image and a camera path into a minute-long, 720p video with precise 6-DoF control. At just 2.6 billion parameters, its distilled version generates a full 60-second clip in 34 seconds on a single NVIDIA GeForce RTX 5090 GPU using the NVFP4 format — delivering up to 36x higher throughput than comparable open models while running on one GPU. Read the paper.

🛠️ Building Windows agents just got a full toolset: NVIDIA and Microsoft rolled out turnkey agent sandboxing on native Windows (Microsoft eXecution Containers plus the NVIDIA OpenShell runtime), alongside up to 2x faster agentic inference and native Windows support for Hermes Agent.

🤖 DGX Spark goes from unboxing to a running agent in minutes: a streamlined NVIDIA NemoClaw install gets developers to a working local agent fast, with Qwen3.6-35B running up to 2.6x faster on vLLM. And the new cluster assistant in NVIDIA Sync links up to four DGX Spark units into one 512GB pool, enough for ~400-billion-parameter models.

Plug in to RTX Spark on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX Spark newsletter.