How Context Windows Actually Work
And why bigger isn't always better when it comes to AI memory.
AI's Working Memory
Imagine a desk. The AI can only see and use what's on the desk — and the desk has a fixed size. That desk is the context window: the total amount of text the AI can read, remember, and reason about in a single conversation.
The AI doesn't actually store memories the way a human does. Every time you send a new message, the entire conversation so far (your questions, its earlier answers, the system instructions) is re-fed to the model as one big chunk of text. The context window is the upper limit on how big that chunk can get. There's no filing cabinet in the back. It's all on the desk, every turn.
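The re-feeding described above can be sketched in a few lines. This is only an illustration, not any vendor's actual API: the message format, `build_prompt`, and the crude `rough_token_count` helper are all assumptions made up for the example.

```python
# Hypothetical sketch: how a chat app might rebuild the full prompt every turn.
# The message format and helper names are illustrative assumptions.

def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)

def build_prompt(system: str, history: list, new_message: str) -> str:
    """Concatenate system instructions, all earlier turns, and the new message."""
    lines = [f"system: {system}"]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"user: {new_message}")
    return "\n".join(lines)

system = "You are a helpful assistant."
history = []   # list of (role, text) pairs
sizes = []     # estimated prompt size at each turn

for question in ["What is a token?", "And a context window?", "Thanks!"]:
    prompt = build_prompt(system, history, question)
    sizes.append(rough_token_count(prompt))
    # Pretend the model replied; both turns join the history for next time.
    history.append(("user", question))
    history.append(("assistant", "(model reply goes here)"))

print(sizes)  # each turn's prompt is larger than the one before
```

Nothing is remembered between calls in this sketch. The prompt simply grows each turn until it would no longer fit in the window.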
Context is measured in tokens, which are small chunks of text. In English, a token is roughly four characters, or about three-quarters of a word. The sentence "I love pizza" is about three tokens, and a full page of a novel runs to roughly 300 to 400 tokens, depending on the tokenizer. The context window is the maximum number of tokens the model can hold at once.
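The rules of thumb above can be turned into a quick estimator. This is a rough sketch only: the function names are made up for illustration, and a real tokenizer is needed for exact counts.

```python
def estimate_by_chars(text: str) -> int:
    """Rule of thumb: about 4 characters per token."""
    return round(len(text) / 4)

def estimate_by_words(text: str) -> int:
    """Rule of thumb: about 0.75 words per token."""
    return round(len(text.split()) / 0.75)

sample = "I love pizza"
print(estimate_by_chars(sample), estimate_by_words(sample))  # → 3 4
```

The two rules disagree slightly even on a three-word sentence, which is a good reminder that they are estimates and not measurements.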
Modern models come in wildly different sizes:
- 4K tokens — a short email or a couple of pages
- 32K tokens — a long article or a short story
- 128K–200K tokens — a full book or a small codebase
- 1M+ tokens — multiple books or a large project
And here's the catch: the input and the output share the same window. On a 100K window, if you feed the model 80,000 tokens of text, it has only 20,000 tokens left to write a reply.
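The shared-window arithmetic is simple enough to write down. The sketch below assumes the only limit is the window itself; real services often also cap the reply length separately, which this hypothetical `output_budget` helper ignores.

```python
def output_budget(window: int, input_tokens: int) -> int:
    """Tokens left for the reply once the input is on the desk."""
    if input_tokens > window:
        raise ValueError("input alone does not fit in the context window")
    return window - input_tokens

print(output_budget(100_000, 80_000))  # → 20000
```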
Bigger Isn't Always Better
A huge context window sounds like a superpower — and in some ways it is — but it comes with three hidden costs that most people never hear about.
💡 Key Insight
A 1M token context window isn't 10× better than a 100K one. It's often slower, more expensive, and the AI can actually answer worse on details buried in the middle. The best results usually come from giving the AI just enough context to answer well — not all the context you can find.
Why does this happen? Researchers call it the "lost in the middle" problem. When you give an AI a giant pile of facts, it pays the most attention to information at the beginning and the end of the input — and the least to whatever sits in the middle. It's a bit like trying to find a specific page in a giant open book spread across a table. You notice the cover and the back, but the middle blurs together.
The three real costs of a bigger window:
- Money: Cost doesn't grow in a straight line. In a standard transformer the model compares every token against every other token, so doubling the context roughly quadruples the attention compute.
- Speed: A 1M-token request can take many seconds longer to answer than a 50K one, even on the same hardware.
- Quality: Accuracy drops for details that sit far from the beginning or end of the input. More text can mean a more confused model.
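To see why the compute grows faster than the text, count the token-to-token comparisons. This is a simplified sketch that models only the pairwise attention step: real systems have other costs and use various optimizations, and `attention_comparisons` is a made-up helper name.

```python
def attention_comparisons(num_tokens: int) -> int:
    """Every token is compared with every token: n * n pairs."""
    return num_tokens * num_tokens

baseline = attention_comparisons(50_000)
for n in (50_000, 100_000, 200_000):
    ratio = attention_comparisons(n) / baseline
    print(f"{n:>7} tokens: {ratio:.0f}x the comparisons of a 50K input")
```

Doubling the input gives 4x the comparisons, and doubling it again gives 16x.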
This is why a smart user doesn't just dump a whole book into the prompt. They pull out the most relevant pages and let the AI focus on those.
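Here is a deliberately naive sketch of "pulling out the most relevant pages". It ranks pages by how many question words they share. Production systems typically use search indexes or embeddings instead; the helper names and sample pages below are invented for illustration.

```python
def score(page: str, question: str) -> int:
    """Count distinct question words that also appear in the page (naive)."""
    page_words = set(page.lower().split())
    question_words = set(question.lower().split())
    return len(page_words & question_words)

def pick_relevant(pages: list, question: str, top_k: int = 2) -> list:
    """Return the top_k pages with the highest word overlap."""
    ranked = sorted(pages, key=lambda p: score(p, question), reverse=True)
    return ranked[:top_k]

pages = [
    "The ship left the harbor at dawn.",
    "The captain hid the treasure map under the floorboards.",
    "Dinner that night was salted fish and bread.",
    "Years later, the map was found by the captain's daughter.",
]
question = "where did the captain hide the treasure map"

for page in pick_relevant(pages, question):
    print(page)
```

Only the two pages that mention the map are sent to the model, so the answer is no longer buried among unrelated text.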
The Four Steps Behind the Curtain
Every time the AI reads your message, it goes through the same four-step process. Here's what happens, in plain language.
1. Tokenize. Your text gets sliced into tokens, and each token is turned into a number. The word "context" might become the number 6193. The model never actually sees letters — only numbers.
2. Position. Each token also gets a tag that says where it sits: first, tenth, halfway through. Without this, the model couldn't tell word order apart, and "the dog chased the cat" would look the same as "the cat chased the dog."
3. Attention. This is the magic step. The model looks at every token and decides how much it should pay attention to every other token. So when it's about to write the word "it," it knows which earlier noun is the most likely referent. This is also why long contexts get expensive: every new token has to be compared against every existing one.
4. Generate. The model scores every possible next token and picks one, usually the most likely or one sampled from the top candidates, then repeats the process one token at a time. Each new token is added to the context, and the model uses the whole growing pile to choose the next one.
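The four steps above can be mimicked with a toy example. Everything here is made up for illustration: the five-word vocabulary, the two-number vectors, and the next-token scores. A real model learns far larger versions of all of these.

```python
import math

# Toy vocabulary and made-up 2-number vectors (real models learn these).
VOCAB = {"the": 0, "dog": 1, "saw": 2, "cat": 3, "it": 4}
VECTORS = {0: [0.0, 0.1], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.8, 0.2], 4: [1.0, 0.0]}

def tokenize(text):
    """Step 1: turn each word into an integer ID (toy: whole words only)."""
    return [VOCAB[word] for word in text.lower().split()]

def with_positions(token_ids):
    """Step 2: pair each token ID with its position in the sequence."""
    return list(enumerate(token_ids))

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query_id, context_ids):
    """Step 3: score one token against every context token, then normalize."""
    query = VECTORS[query_id]
    scores = [sum(q * k for q, k in zip(query, VECTORS[t])) for t in context_ids]
    return softmax(scores)

def pick_next(next_token_scores):
    """Step 4 (greedy variant): choose the highest-scoring candidate."""
    return max(next_token_scores, key=next_token_scores.get)

ids = tokenize("the dog saw the cat")
print(with_positions(ids))

weights = attention_weights(VOCAB["it"], ids)
for (position, token_id), weight in zip(with_positions(ids), weights):
    print(position, token_id, round(weight, 3))

# Made-up scores for candidate next words.
print(pick_next({"barked": 2.0, "slept": 1.0, "flew": -1.0}))
```

With these invented vectors, "it" puts its largest attention weight on "dog". Note that the attention step touches every token in the context, which is the pairwise comparison that makes long inputs expensive.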
Counting Tokens in Real Text
Here's a small Python script that counts how many tokens a piece of text actually uses. It relies on tiktoken, OpenAI's open-source tokenizer library, which is the same kind of tool AI companies use to track your usage and bill you.
    # Count tokens in a piece of text (requires: pip install tiktoken)
    import tiktoken

    def count_tokens(text, model="gpt-4o"):
        # Get the tokenizer used by the model
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(text)
        return len(tokens)

    # A short email (one sentence repeated 5 times)
    email = "Hey, just checking in on the project. " * 5
    print(f"Email uses {count_tokens(email)} tokens")

    # A full book chapter (12 words x 417 repeats, ~5,000 words)
    chapter = "It was the best of times, it was the worst of times. " * 417
    tokens = count_tokens(chapter)
    print(f"Chapter uses {tokens:,} tokens")

    # A whole novel (9 words x 8,900 repeats, ~80,000 words)
    novel = "The quick brown fox jumps over the lazy dog. " * 8900
    total = count_tokens(novel)
    print(f"Novel uses ~{total/1000:.0f}K tokens")

    # How much of a 128K window is that?
    window = 128000
    print(f"= {total/window*100:.0f}% of a 128K window")

    # Exact counts depend on the tokenizer, so no fixed outputs are listed here.
    # Repeated sentences also tokenize more compactly than real prose.
A real novel of about 80,000 words typically comes to roughly 100,000 tokens, so even a 200K context window can hold only about two novels at once. And if you ask the AI a question about a scene in the middle of that pile, you've hit the "lost in the middle" problem: it might miss the answer even though the text is technically "in" the window.