AI Development

How Context Windows Actually Work

And why bigger isn't always better when it comes to AI memory.

01 — The Concept

AI's Working Memory

Imagine a desk. The AI can only see and use what's on the desk — and the desk has a fixed size. That desk is the context window: the total amount of text the AI can read, remember, and reason about in a single conversation.

The AI doesn't actually store memories the way a human does. Every time you send a new message, the entire conversation so far (your question, its earlier answers, the system instructions) is re-fed to the model as one big chunk of text. The context window is the maximum size that chunk can be. There's no filing cabinet in the back. It's all on the desk, every turn.
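The re-feeding described above can be pictured with a minimal sketch. The `build_prompt` helper and the `role: text` layout are made up for illustration; real chat APIs use their own message formats.

```python
# Hypothetical sketch: a chat "session" is a growing list of messages
# that gets flattened into one block of text on every turn.

def build_prompt(system, history, new_message):
    """Return the full text the model would see for this turn."""
    lines = [f"system: {system}"]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"user: {new_message}")
    return "\n".join(lines)

system = "You are a helpful assistant."
history = []

# Turn 1: only the system instructions and the first question are sent.
first = build_prompt(system, history, "What is a token?")
history.append(("user", "What is a token?"))
history.append(("assistant", "A small chunk of text."))

# Turn 2: everything from turn 1 is sent again, plus the new question.
second = build_prompt(system, history, "How many fit in a window?")

print(second.startswith(first))  # → True
print(len(second) > len(first))  # → True
```

Each turn's prompt contains the previous one in full, which is why long conversations eat into the window even when each individual message is short.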

Context is measured in tokens, which are small chunks of text. A token is roughly three or four characters, or about three-quarters of a word in English. The sentence "I love pizza" is about three tokens. A full page of a novel (roughly 300 words) is around 400 tokens. The context window is the maximum number of tokens the model can hold at once.
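The rules of thumb above can be turned into a quick estimator. This is only a rough sketch: the function names are made up, and real counts depend on the model's tokenizer and on the language of the text.

```python
import math

def tokens_from_chars(text, chars_per_token=4):
    """Estimate tokens from character count (about 4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)

def tokens_from_words(word_count, words_per_token=0.75):
    """Estimate tokens from word count (about 0.75 words per token)."""
    return round(word_count / words_per_token)

print(tokens_from_chars("Context is measured in tokens."))  # → 8
print(tokens_from_words(750))                               # → 1000
```

An estimate like this is handy for quick budgeting, but a real tokenizer should be used whenever the exact number matters.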

Modern models come in wildly different sizes:

4K tokens — a short email or a couple of pages
32K tokens — a long article or a short story
128K–200K tokens — a full book or a small codebase
1M+ tokens — multiple books or a large project

And here's the catch: the input and the output share the same window. On a 100K window, if you feed the model 80,000 tokens of text, it has at most 20,000 tokens left to write a reply. Many models also cap the length of a reply separately, so the real limit can be lower.
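The shared budget is simple subtraction. A minimal sketch, with a made-up function name:

```python
def remaining_output_tokens(window, input_tokens):
    """Tokens left for the reply once the input has been placed in the window."""
    if input_tokens > window:
        raise ValueError("input does not fit in the context window")
    return window - input_tokens

print(remaining_output_tokens(100_000, 80_000))  # → 20000
```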

02 — Why It Matters

Bigger Isn't Always Better

A huge context window sounds like a superpower — and in some ways it is — but it comes with three hidden costs that most people never hear about.

💡 Key Insight

A 1M token context window isn't 10× better than a 100K one. It's often slower, more expensive, and the AI can actually answer worse on details buried in the middle. The best results usually come from giving the AI just enough context to answer well — not all the context you can find.

Why does this happen? Researchers call it the "lost in the middle" problem. When you give an AI a giant pile of facts, it pays the most attention to information at the beginning and the end of the input — and the least to whatever sits in the middle. It's a bit like trying to find a specific page in a giant open book spread across a table. You notice the cover and the back, but the middle blurs together.

The three real costs of a bigger window:

Money: The compute behind each request doesn't grow in a straight line. Doubling the context roughly quadruples the attention work, because the model has to compare every token against every other token.
Speed: A 1M-token request can take many seconds longer to answer than a 50K one, even on the same hardware.
Quality: Accuracy drops for details that sit away from the edges of the input. More text can mean a more confused model.
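To see why comparing every token against every other token gets expensive, here is a small sketch that counts the pairs. It deliberately ignores everything else a real model does, including the optimizations real systems use, so treat the ratios as an illustration of the trend rather than a price list.

```python
def attention_comparisons(n_tokens):
    """Number of token pairs when every token is compared with every token."""
    return n_tokens * n_tokens

base = attention_comparisons(100_000)
for n in (100_000, 200_000, 400_000):
    ratio = attention_comparisons(n) / base
    print(f"{n:>7,} tokens -> {ratio:.0f}x the comparisons")
# → 100,000 tokens -> 1x the comparisons
# → 200,000 tokens -> 4x the comparisons
# → 400,000 tokens -> 16x the comparisons
```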

This is why a smart user doesn't just dump a whole book into the prompt. They pull out the most relevant pages and let the AI focus on those.
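Pulling out the most relevant pages can be sketched with a very simple scoring rule: count how many words each passage shares with the question and keep the best matches. The helper names and sample passages are invented for illustration, and real tools typically use more capable search methods than word overlap.

```python
import re

def words(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def top_passages(passages, question, k=2):
    """Rank passages by how many question words they share; keep the best k."""
    q = words(question)
    ranked = sorted(passages, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:k]

passages = [
    "The ship left the harbor at dawn.",
    "Maria hid the letter under the floorboards.",
    "The storm lasted three days.",
    "The letter was addressed to her brother.",
]
question = "Where did Maria hide the letter?"

for passage in top_passages(passages, question):
    print(passage)
# → Maria hid the letter under the floorboards.
# → The letter was addressed to her brother.
```

Only the selected passages would then go into the prompt, which keeps the input short and the relevant details easy for the model to find.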

03 — How It Works

The Four Steps Behind the Curtain

Every time the AI reads your message, it goes through the same four-step process. Here's what happens, in plain language.

How a Prompt Becomes a Reply

🔪 Tokenize: Chop text into tokens and assign each a number ID
📍 Position: Tag each token with its spot in the conversation
👁️ Attention: Compare every token to every other token
✍️ Generate: Pick the next token, one token at a time
↺ Repeat until the reply is done

1. Tokenize. Your text gets sliced into tokens, and each token is turned into a number. The word "context" might become the number 6193. The model never actually sees letters — only numbers.

2. Position. Each token also gets a tag that says where it sits: first token, tenth token, halfway through. Without this, the model couldn't tell word order apart: "dog bites man" and "man bites dog" would look the same.

3. Attention. This is the magic step. The model looks at every token and decides how much it should pay attention to every other token. So when it reads the word "it," it can work out which earlier noun is the most likely referent. This is also why long contexts get expensive: every new token has to be compared against every existing one.

4. Generate. The model picks a likely next token (usually one of the most probable, sometimes with a bit of randomness), then the next, then the next, one at a time. Each new token is added to the context, and the model uses the whole growing pile to choose the token after it.
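The four steps can be sketched in a few lines of toy Python. Everything here is a simplified stand-in: the word-level vocabulary, the two-number vectors, and the `NEXT_WORD` lookup table are invented for illustration. A real model uses subword tokens, learned vectors with thousands of dimensions, and a neural network that looks at the whole context rather than a lookup on the last word.

```python
import math
import re

# Step 1: tokenize. Toy word-level vocabulary built on the fly.
def tokenize(text, vocab):
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return [vocab.setdefault(piece, len(vocab)) for piece in pieces]

# Step 2: position. Pair each token ID with its index in the sequence.
def add_positions(token_ids):
    return list(enumerate(token_ids))

# Step 3: attention. Score a query vector against every key vector,
# then turn the scores into weights that sum to 1 (softmax).
def attention_weights(query, keys):
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Step 4: generate. Append one token at a time until an end marker or a limit.
# NEXT_WORD is a made-up table standing in for the model's prediction.
NEXT_WORD = {"the": "cat", "cat": "sat", "sat": "down", "down": "<end>"}

def generate(context, max_new_tokens=10):
    context = list(context)
    for _ in range(max_new_tokens):
        candidate = NEXT_WORD.get(context[-1], "<end>")
        if candidate == "<end>":
            break
        context.append(candidate)
    return context

vocab = {}
ids = tokenize("The dog saw the cat.", vocab)
print(ids)                 # → [0, 1, 2, 0, 3, 4]
print(add_positions(ids))  # → [(0, 0), (1, 1), (2, 2), (3, 0), (4, 3), (5, 4)]

# Made-up vectors: the query points the same way as the first key.
weights = attention_weights([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(weights.index(max(weights)))  # → 0

print(generate(["the"]))   # → ['the', 'cat', 'sat', 'down']
```

Note that "the" gets the same ID (0) both times it appears, and only the position tag tells the two occurrences apart.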

04 — Practical Example

Counting Tokens in Real Text

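Here's a small Python script that counts how many tokens a piece of text uses. It relies on tiktoken, OpenAI's open-source tokenizer library. Other providers use their own tokenizers, so counts vary somewhat from model to model. Token counts like these are what usage-based billing is built on.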

token_counter.py

# Count tokens in a piece of text
# Requires: pip install tiktoken
import tiktoken

def count_tokens(text, model="gpt-4o"):
    # Look up the tokenizer that matches the given OpenAI model name
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

# The outputs below are approximate; exact counts depend on the tokenizer.

# A short email (repeated filler text)
email = "Hey, just checking in on the project. " * 5
print(f"Email uses {count_tokens(email)} tokens")
# → roughly 45 tokens

# A book chapter (~6,000 words)
chapter = "It was the best of times, it was the worst of times. " * 500
tokens = count_tokens(chapter)
print(f"Chapter uses {tokens:,} tokens")
# → roughly 7,000 tokens

# A novel-length text (~90,000 words)
novel = "The quick brown fox jumps over the lazy dog. " * 10000
total = count_tokens(novel)
print(f"Novel uses ~{total/1000:.0f}K tokens")
# → roughly 100K tokens

# How much of a 128K window is that?
window = 128000
print(f"= {total/window*100:.0f}% of a 128K window")
# → roughly 78% of a 128K window

Run this and you'll see that a novel-length text uses around 100,000 tokens. So even a 200K context window can hold only about two novels at once. And if you ask the AI about a scene in the middle of that pile, you may run into the "lost in the middle" problem: it might miss the answer even though the text is technically "in" the window.

05 — Test Yourself

Knowledge Check

Test what you learned with this quick quiz.

Quick Quiz — 3 Questions

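Question 1

What does a "context window" actually measure?

Answer: The total amount of text the AI can read and use at one time.

Question 2

Why is a 1M token context window not always better than a 200K one?

Answer: It can be slower, more expensive, and less accurate on details in the middle.

Question 3

In a 100K context window, if your input message uses 80K tokens, how many tokens does the AI have left to write its response?

Answer: 20K. Input and output share the same window.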