What Is Speculative Decoding and Why It Makes AI Faster

A clever trick where a small AI guesses what a big AI will say — and the big AI usually agrees.

01 — The Concept

Two AIs, One Answer

Speculative decoding is a speed trick for AI models. Normally, a big AI writes its answer one word at a time (strictly, one token, which is a word or a piece of a word), running the whole model once for each word before moving on. With speculative decoding, a smaller, faster "helper" AI guesses the next few words first, and the big AI just checks whether those guesses are good. When the guesses are right, the big AI gets to write several words all at once.

Imagine a slow, careful writer working with a fast typist. The typist drafts a sentence quickly. The careful writer reads it, keeps the parts that are correct, and only fixes the parts that are wrong. The final answer is just as good as if the careful writer did every word themselves — but it took a fraction of the time.

02 — Why It Matters

AI That Doesn't Feel Slow

Big AI models are powerful, but they're also slow. When you ask one a question, it has to think hard about every single word it writes. That delay is what makes some chatbots feel laggy, what makes code tools freeze while you wait, and what makes running AI at scale expensive.

Speculative decoding can often make the same big model run about 2 to 3 times faster without changing the quality of a single answer. The exact gain depends on how often the drafter's guesses are accepted. That means snappier apps, lower cloud bills, and AI tools that feel instant instead of sluggish. It's one of the most useful tricks in modern AI engineering, and you can usually add it without retraining the big model.

💡 Key Insight

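The output of speculative decoding matches the big model's normal output. When the model always picks its single most likely next word (greedy decoding), the text is identical word for word. When the model samples its words at random, the accept-and-reject rule is designed so that the result follows exactly the same probability distribution as the big model's own sampling. You're not trading quality for speed; you're getting to the same answer through a shortcut that checks work in bulk instead of one word at a time.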
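For the case where the model samples its words at random, the quality guarantee rests on an accept-or-resample rule. Below is a minimal sketch of that rule for a single drafted token, assuming both models expose their probabilities as plain dictionaries. The function names and the three-word vocabulary are made up for illustration and do not come from any library.

```python
import random

def accept_or_resample(token, p_big, q_small, rng):
    """Decide the fate of one drafted token.

    p_big and q_small map each candidate token to the probability that the
    big model and the drafter give it. Returns the token that ships.
    """
    p = p_big.get(token, 0.0)
    q = q_small[token]  # positive, because the drafter just sampled it
    # Accept the guess with probability min(1, p / q)
    if rng.random() < min(1.0, p / q):
        return token
    # Rejected: resample from the part of p_big the drafter under-covered
    leftover = {t: max(0.0, p_big[t] - q_small.get(t, 0.0)) for t in p_big}
    return rng.choices(list(leftover), weights=list(leftover.values()))[0]

def sample_with_drafter(p_big, q_small, rng):
    """Draft one token from q_small, then apply the accept-or-resample rule."""
    guess = rng.choices(list(q_small), weights=list(q_small.values()))[0]
    return accept_or_resample(guess, p_big, q_small, rng)

# Hypothetical probabilities for a three-word vocabulary
p_big = {"cat": 0.6, "dog": 0.3, "fox": 0.1}
q_small = {"cat": 0.3, "dog": 0.5, "fox": 0.2}

rng = random.Random(0)
n = 20000
counts = {t: 0 for t in p_big}
for _ in range(n):
    counts[sample_with_drafter(p_big, q_small, rng)] += 1

# The observed frequencies should land close to the big model's probabilities
for t in p_big:
    print(t, round(counts[t] / n, 3), "target", p_big[t])
```

Even though the drafter in this sketch prefers "dog", the shipped tokens follow the big model's preferences, because a guess the big model likes less is sometimes rejected and replaced.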

03 — How It Works

The Draft-and-Verify Loop

Speculative decoding uses two models working together. The small "drafter" model is fast but less accurate. The big "verifier" model is slow but very smart. The drafter writes a few possible next words, then the verifier looks at all of them in a single pass. The verifier keeps the drafted words up to the first one it disagrees with, swaps in its own word at that spot, and throws out everything after it. Then the drafter tries again from the last good word.

Here's the typical flow:

The Speculative Decoding Loop

✍️ Drafter Guesses: the small AI drafts 4-5 next tokens.
→ 🔍 Verifier Checks: the big AI reviews all the guesses at once.
→ ✅ Accept Good Ones: keep the drafted tokens up to the first one the big model disagrees with.
→ 🔁 Draft Again: start a new draft from the last good token.
↺ Repeat until done.
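One round of the loop above can be sketched in a few lines for the simple greedy case, where the big model always picks its single favorite token. The function name `verify_round` and the word lists are hypothetical. A real verifier works on model probabilities, not ready-made lists.

```python
def verify_round(draft, big_choices):
    """Return the tokens kept from one draft-and-verify round.

    draft: tokens proposed by the small model.
    big_choices: the big model's own pick at each draft position, plus one
        extra pick for the position right after the draft, so it holds
        len(draft) + 1 items.
    """
    kept = []
    for guess, choice in zip(draft, big_choices):
        if guess != choice:
            # First disagreement: ship the big model's token and stop.
            # Later picks were computed after a wrong word, so they are void.
            kept.append(choice)
            return kept
        kept.append(guess)
    # Every guess matched, so the big model's extra pick is a free bonus
    kept.append(big_choices[len(draft)])
    return kept

draft = ["the", "cat", "sat", "on", "a"]
big_choices = ["the", "cat", "sat", "in", "the", "hat"]
print(verify_round(draft, big_choices))  # ['the', 'cat', 'sat', 'in']
```

In this example the first three guesses are accepted and the fourth is replaced, so a single big-model pass yields four tokens instead of one.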

If the drafter guesses right most of the time, the big model can leap forward by 3-4 words per round instead of just 1. That's where the speed-up comes from: the big model doesn't do less work, it makes fewer passes. Big models usually spend most of each pass loading their weights from memory, so checking 5 words in one pass costs roughly the same as writing 1.
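The "3-4 words per round" figure can be checked with a little arithmetic. The sketch below makes a simplifying assumption: every drafted token is accepted independently with the same probability `accept_rate`. It also counts the one token the big model always contributes per round. The function name is made up for illustration.

```python
def expected_tokens_per_round(accept_rate, k):
    """Average tokens gained per big-model pass.

    Sums 1 + a + a^2 + ... + a^k, where a is the chance that a drafted token
    is accepted and k is the number of tokens drafted per round.
    """
    if accept_rate >= 1.0:
        return float(k + 1)  # every guess accepted, plus the bonus token
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)

for rate in (0.5, 0.7, 0.8, 0.9):
    tokens = expected_tokens_per_round(rate, k=5)
    print(f"accept rate {rate:.0%}: about {tokens:.2f} tokens per round")
```

With 5 drafted tokens and an 80% acceptance rate this gives about 3.7 tokens per round. The real speed-up is somewhat lower than that number, because the drafter also takes time to run.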

04 — Practical Example

Speculative Decoding in Code

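Here's a simple Python example showing the idea. The real libraries (like vLLM, Hugging Face Transformers, and TensorRT-LLM) handle the math, and the method names below (predict_one, predict_k, verify_batch) are placeholders rather than a real API, but the pattern looks like this: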

speculative.py

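# Normal decoding: one token at a time, slow but safe
def normal_decode(prompt, big_model, max_tokens=100):
    tokens = []
    for _ in range(max_tokens):
        # One full pass of the big model for every token
        next_token = big_model.predict_one(prompt, tokens)
        tokens.append(next_token)
    return tokens

# Speculative decoding: draft k, verify all, keep the good ones
def speculative_decode(prompt, big_model, small_model, max_tokens=100, k=5):
    tokens = []
    while len(tokens) < max_tokens:
        # 1. Small fast model drafts k tokens at once
        draft = small_model.predict_k(prompt, tokens, k=k)

        # 2. Big model checks every draft token in one pass
        accepted = big_model.verify_batch(prompt, tokens, draft)

        # 3. If nothing was accepted, take one token from the big model
        #    so the loop always moves forward
        if not accepted:
            accepted = [big_model.predict_one(prompt, tokens)]

        # 4. Keep the accepted prefix, throw out the rest
        tokens.extend(accepted)
    # A round can overshoot the limit, so trim to max_tokens
    return tokens[:max_tokens]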

The verify_batch call is the magic. Even though the big model is still doing the real thinking, it gets to review 5 words at once instead of picking just 1. As long as the drafter is right most of the time, the whole pipeline finishes much faster, and with a correct verify step the answer is the same as if the big model had worked slowly the entire time.
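To watch the pattern run end to end, here is a self-contained toy version. It is a sketch under heavy assumptions: the two "models" are tiny stand-in classes that recite a fixed sentence, the drafter is deliberately wrong at every fourth position, and `verify_batch` here returns the accepted guesses plus the big model's own token for the next position. None of these classes or methods belong to a real library.

```python
WORDS = "the quick brown fox jumps over the lazy dog".split()

class ToyBigModel:
    """Stand-in for the big model: always continues the WORDS sentence."""
    def __init__(self):
        self.passes = 0  # how many times the "big model" was run

    def _next(self, tokens):
        return WORDS[len(tokens) % len(WORDS)]

    def predict_one(self, prompt, tokens):
        self.passes += 1
        return self._next(tokens)

    def verify_batch(self, prompt, tokens, draft):
        self.passes += 1  # the whole draft is checked in a single pass
        kept = []
        for guess in draft:
            choice = self._next(tokens + kept)
            kept.append(choice)  # the big model's token always ships
            if guess != choice:  # first disagreement ends the round
                return kept
        kept.append(self._next(tokens + kept))  # all matched: bonus token
        return kept

class ToySmallModel:
    """Stand-in drafter: right most of the time, wrong at every 4th position."""
    def predict_k(self, prompt, tokens, k):
        draft = []
        for _ in range(k):
            pos = len(tokens) + len(draft)
            draft.append("um" if pos % 4 == 3 else WORDS[pos % len(WORDS)])
        return draft

def normal_decode(prompt, big, max_tokens):
    tokens = []
    while len(tokens) < max_tokens:
        tokens.append(big.predict_one(prompt, tokens))
    return tokens

def speculative_decode(prompt, big, small, max_tokens, k=5):
    tokens = []
    while len(tokens) < max_tokens:
        draft = small.predict_k(prompt, tokens, k)
        tokens.extend(big.verify_batch(prompt, tokens, draft))
    return tokens[:max_tokens]

slow_big, fast_big = ToyBigModel(), ToyBigModel()
slow = normal_decode("", slow_big, 20)
fast = speculative_decode("", fast_big, ToySmallModel(), 20)
print(slow == fast)                      # True: same text either way
print(slow_big.passes, fast_big.passes)  # 20 5
```

The toy big model runs 20 times in normal mode and only 5 times in speculative mode, while the text stays the same. Pass counts are not wall-clock time: a real drafter costs something to run, and a pass that checks several tokens is slightly heavier than a pass that writes one.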

05 — Test Yourself

Knowledge Check

Test what you learned with this quick quiz.

Quick Quiz — 3 Questions

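Question 1

What role does the small "drafter" model play in speculative decoding?

Answer: It guesses the next few tokens so the big model can review them in bulk.

Question 2

Why doesn't speculative decoding lower the quality of the AI's answer?

Answer: The big model still checks and accepts every token that ships.

Question 3

Roughly how much faster can speculative decoding make a big AI model?

Answer: About 2-3x, a meaningful speedup.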