Accelerating Gemma 4: faster inference with multi-token prediction drafters
Why speculative decoding?
During autoregressive decoding, the GPU spends most of its time moving billions of parameters from VRAM to the compute units just to generate a single token. MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in "Fast Inference from Transformers via Speculative Decoding": a small drafter model cheaply proposes several tokens ahead, and the target model then verifies all of these suggested tokens in a single parallel pass, so one expensive forward pass can accept multiple tokens at once.
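The draft-then-verify loop can be sketched as follows. This is a minimal toy illustration of greedy speculative decoding, not Gemma's actual implementation: `target_next` and `draft_next` are hypothetical stand-ins for the expensive target model and the cheap drafter, operating on a tiny integer "vocabulary" so the example is self-contained.

```python
import random

def target_next(context):
    # Hypothetical target model: expensive but high quality.
    # Deterministic toy rule standing in for a greedy argmax.
    return (sum(context) * 7 + 3) % 11

def draft_next(context):
    # Hypothetical drafter: cheap, and agrees with the target most of the time.
    if random.random() < 0.8:
        return target_next(context)
    return random.randrange(11)

def speculative_decode(context, num_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, verify them in one pass."""
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1) Drafter proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target checks every drafted position; in a real system all k
        #    positions are scored in one parallel forward pass.
        accepted = 0
        for i, t in enumerate(draft):
            if target_next(out + draft[:i]) == t:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        # 3) Whether the draft was rejected early or fully accepted, the
        #    target contributes the next token itself, so each loop
        #    iteration always emits at least one token.
        out.append(target_next(out))
    return out[len(context):][:num_tokens]
```

Because accepted tokens are exactly those the target would have produced greedily, the output is identical to plain target-only decoding; the drafter only changes how many target passes are needed, which is where the speedup comes from.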
By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:
