
Tokens and Context
AGAI 102 · How Language Models Work
Learn how language models break text into tokens, why context windows matter, and how tokenization affects cost, performance, and reliability.
Learning objectives
- Define tokens and context windows in practical terms.
- Explain why tokenization affects cost, latency, and model quality.
- Describe byte-pair encoding at a conceptual level.
- Use programmatic token counting to estimate prompt size.
- Evaluate tradeoffs between small and large context windows.
Language models do not read text exactly the way humans do. Before a model can process a prompt, the text is converted into smaller units called tokens.
A token may be a whole word, part of a word, punctuation, whitespace, or a character sequence. For example, the word computer might be one token, while an uncommon technical term might be split into several tokens. The exact result depends on the tokenizer used by the model.
Tokenization matters because language models operate over tokens rather than raw characters. Tokens affect:
- Cost — many AI APIs charge based on input and output token counts.
- Context limits — each model can only process a maximum number of tokens at once.
- Performance — longer prompts require more computation.
- Quality — if important information is outside the context window, the model cannot use it.
The context window is the maximum amount of text, measured in tokens, that a model can consider at one time. This includes the system prompt, user messages, conversation history, retrieved documents, tool results, and the model’s own generated output.
If a model has a 128,000-token context window, that does not mean every response should use 128,000 tokens. It means the model can theoretically attend to that much text in a single request, subject to cost, latency, and reliability tradeoffs.
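The budgeting idea can be sketched as simple arithmetic. This is a minimal illustration, not a prescribed method: the function name, the component names, and every token count below are hypothetical values chosen for the example.

```python
def remaining_context(window: int, used: dict[str, int], reserved_output: int) -> int:
    """Return how many tokens are still free for additional input.

    A negative result means the request would not fit in the window.
    """
    total_input = sum(used.values())
    return window - total_input - reserved_output


# Hypothetical token counts for the parts of one request.
usage = {
    "system_prompt": 400,
    "conversation_history": 6_200,
    "retrieved_documents": 18_000,
    "tool_results": 2_400,
    "user_message": 150,
}

# Reserve room for the model's reply, because output also counts
# against the context window.
free = remaining_context(window=128_000, used=usage, reserved_output=4_000)
print(free)  # 96850
```

Reserving output space up front is the important design choice here: a prompt that fills the whole window leaves the model no room to answer.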
A worked tokenization example
Consider the sentence:
The quick brown fox
For many English-language tokenizers, this sentence is likely to be represented as roughly four tokens:
["The", " quick", " brown", " fox"]
The leading spaces before quick, brown, and fox are often included as part of the token. This may seem strange at first, but it lets the tokenizer store a word together with the space before it as a single token, so common word patterns are represented efficiently and no separate token is spent on each space. The model does not literally see four English words. It sees token IDs, such as:
[791, 4062, 14198, 39935]
These numbers are illustrative. Different tokenizers and model families use different vocabularies, so the exact IDs vary.
Now compare that with a less common phrase:
The hyper-specialized microarchitecture benchmark
This might become something like:
["The", " hyper", "-", "special", "ized", " micro", "architecture", " benchmark"]
The uncommon compound words are split into smaller pieces, so a four-word phrase becomes eight tokens. This is one reason technical text can sometimes use more tokens than expected.
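To make the splitting concrete, here is a toy tokenizer that reproduces the two examples above. It is only a sketch under loud assumptions: the vocabulary and its IDs are invented for this illustration, and the greedy longest-match rule it uses is a simplification, not the algorithm any particular production tokenizer runs.

```python
# Hypothetical vocabulary: token string -> token ID. Real vocabularies
# hold tens of thousands of entries and assign different IDs.
TOY_VOCAB = {
    "The": 0, " quick": 1, " brown": 2, " fox": 3,
    " hyper": 4, "-": 5, "special": 6, "ized": 7,
    " micro": 8, "architecture": 9, " benchmark": 10,
}


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[str]:
    """Split text by repeatedly taking the longest matching vocabulary entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


common = toy_tokenize("The quick brown fox", TOY_VOCAB)
rare = toy_tokenize("The hyper-specialized microarchitecture benchmark", TOY_VOCAB)

print(common)                          # ['The', ' quick', ' brown', ' fox']
print([TOY_VOCAB[t] for t in common])  # [0, 1, 2, 3]
print(len(rare))                       # 8
```

Because the toy vocabulary has no entry for "specialized" or "microarchitecture" as whole words, those words fall back to smaller pieces, which is the same effect described above.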
Byte-pair encoding, conceptually
Many modern tokenizers use a method related to byte-pair encoding, often abbreviated BPE. The full implementation details vary, but the basic idea is approachable.
Start with very small units, such as characters or bytes. Then repeatedly merge the most common adjacent pairs found in a large training corpus.
A simplified example:
Initial units:
l o w
l o w e r
n e w e r
Common pair: l + o → lo
Common pair: lo + w → low
Common pair: e + r → er
Over many merge steps, the tokenizer learns compact representations for common words, subwords, suffixes, prefixes, punctuation patterns, and whitespace patterns.
This gives tokenizers flexibility. They can represent common words efficiently while still being able to represent rare words by splitting them into smaller parts. The model does not need a vocabulary entry for every possible word, product name, code identifier, or typo.
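The merge loop can be sketched in a few lines. This is a simplified illustration rather than a production tokenizer trainer: it starts from characters instead of bytes, counts each word once instead of weighting by corpus frequency, and breaks ties by taking the pair seen first. The function names are hypothetical.

```python
from collections import Counter


def most_common_pair(words: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all words and return the most frequent."""
    counts = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    # max() returns the first maximal key, so ties go to the pair seen first.
    return max(counts, key=counts.get)


def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace every adjacent occurrence of pair with one merged symbol."""
    merged_words = []
    for symbols in words:
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words


def learn_merges(corpus: list[str], num_merges: int):
    """Run num_merges merge steps, starting from single characters."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_merges(["low", "lower", "newer"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 'r')]
print(words)   # [['low'], ['low', 'er'], ['n', 'e', 'w', 'er']]
```

With this tiny corpus several pairs are tied at each step, so a different tie-breaking rule would learn the merges in a different order. On a large corpus the frequency differences dominate.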
Why some languages are less token-efficient
Token efficiency varies across languages. English is often relatively token-efficient because many tokenizers were trained heavily on English text and because English words and whitespace patterns are well represented in common token vocabularies.
Languages such as Chinese and Japanese do not use spaces between words in the same way English does. Arabic has rich morphology, where prefixes, suffixes, and word forms can encode information that English may express with multiple separate words. Depending on the tokenizer and training corpus, these languages may require more tokens for the same amount of meaning.
For example, a short English sentence might take 10 tokens, while an equivalent sentence in another language might take 15, 20, or more. This has practical consequences:
- API usage may cost more for less token-efficient languages.
- Fewer ideas may fit inside the same context window.
- Long documents in some languages may require more aggressive chunking.
- Model quality may vary if the model had less high-quality training data in that language.
This does not mean LLMs cannot work well in non-English languages. Many do. But developers building multilingual applications should test token usage, retrieval quality, and output quality across the actual languages their users will use.
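The cost and capacity effects follow from simple arithmetic. The sketch below uses the 10, 15, and 20 tokens-per-sentence figures from the example above; the window size, the price, and the function names are hypothetical and do not reflect any provider's actual pricing.

```python
def sentences_per_window(window: int, tokens_per_sentence: int) -> int:
    """How many sentences of a given token length fit in the context window."""
    return window // tokens_per_sentence


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost of sending this many input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


WINDOW = 128_000          # hypothetical context window, in tokens
PRICE_PER_MILLION = 2.00  # hypothetical price per million input tokens
SENTENCES = 100_000       # hypothetical workload size

for tokens_per_sentence in (10, 15, 20):
    fit = sentences_per_window(WINDOW, tokens_per_sentence)
    cost = input_cost(tokens_per_sentence * SENTENCES, PRICE_PER_MILLION)
    print(tokens_per_sentence, fit, cost)
# 10 12800 2.0
# 15 8533 3.0
# 20 6400 4.0
```

Under these assumptions, a language that needs twice the tokens per sentence costs twice as much to process and fits half as much content in the same window.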
Counting tokens programmatically
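For production systems, estimate token counts in code rather than guessing. In Python, token counts for OpenAI models are often computed with the tiktoken library. The encoding must match the target model: cl100k_base, used below, is the encoding for GPT-4 and GPT-3.5 Turbo, and other model families use different tokenizers, so a count from one tokenizer is only an estimate for another.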
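import tiktoken

# Load one of tiktoken's built-in encodings by name.
encoding = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox"
# encode() returns a list of integer token IDs.
tokens = encoding.encode(text)
print(tokens)       # the token IDs
print(len(tokens))  # the token count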
A more application-oriented helper might look like this:
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens in text under the named encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

prompt = "Summarize this document in five bullet points."
print(count_tokens(prompt))
Token counting is especially useful when building retrieval-augmented generation systems, chatbots with conversation history, or agents that insert tool results into context. You need to know how much space remains before adding another document, memory record, or tool output.
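One way to apply this is to add retrieved documents only while they fit in the remaining budget. The sketch below is an assumption-laden illustration: select_documents and word_count are hypothetical names, and word_count is a stand-in that counts whitespace-separated words so the example runs without dependencies. In a real system you would pass a tokenizer-backed counter such as the count_tokens helper above.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, NOT real tokens."""
    return len(text.split())


def select_documents(
    documents: list[str],
    budget: int,
    count: Callable[[str], int] = word_count,
) -> list[str]:
    """Walk documents in order and keep each one that still fits the budget."""
    selected = []
    remaining = budget
    for doc in documents:
        cost = count(doc)
        if cost > remaining:
            # Skip a document that is too large; a smaller one may still fit.
            continue
        selected.append(doc)
        remaining -= cost
    return selected


# Hypothetical documents, assumed to be sorted most relevant first.
docs = ["alpha beta gamma", "one two three four five six", "x y"]
chosen = select_documents(docs, budget=5)
print(chosen)  # ['alpha beta gamma', 'x y']
```

Skipping oversized documents rather than stopping at the first one is a design choice. Stopping early keeps strict relevance order, while skipping uses the budget more fully.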
Large versus small context windows
A large context window is powerful because it lets the model consider more information at once. This is useful for long documents, codebases, legal contracts, research papers, multi-turn conversations, and agent traces.
But bigger is not always better.
Large contexts usually increase cost and latency. They may also make it harder for the model to focus on the most relevant details. This is sometimes called attention dilution: when too much information is included, the signal can be buried in noise.
For example, if a user asks a question about one paragraph in a 200-page manual, pasting the entire manual may be worse than retrieving the three most relevant sections. The model may technically have access to everything, but the prompt becomes more expensive and harder to reason over.
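The retrieval step in that example can be sketched with a deliberately naive ranker. This is an illustration only: it scores sections by shared words with the question, whereas real systems typically use a search index or embeddings. The function name and the sample sections are hypothetical.

```python
def top_sections(question: str, sections: list[str], k: int = 3) -> list[str]:
    """Return the k sections sharing the most distinct words with the question."""
    question_words = set(question.lower().split())

    def overlap(section: str) -> int:
        return len(question_words & set(section.lower().split()))

    # sorted() is stable, so equally scored sections keep their original order.
    return sorted(sections, key=overlap, reverse=True)[:k]


# Hypothetical manual sections.
manual = [
    "how to reset the device password",
    "warranty terms and conditions",
    "replace the battery safely",
    "reset network settings on the device",
    "shipping information",
]

best = top_sections("how do i reset the device password", manual)
print(best)
# ['how to reset the device password',
#  'reset network settings on the device',
#  'replace the battery safely']
```

Only the selected sections would then be placed in the prompt, instead of the whole manual.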
A good context strategy often combines:
- A clear system prompt
- The current user request
- Relevant conversation history
- Retrieved documents or snippets
- Tool results that directly support the task
- A concise summary of prior state when needed
The goal is not to maximize context usage. The goal is to provide the right information at the right time.
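For conversation history, one common way to stay within budget is to keep only the most recent messages that fit. The sketch below is a minimal illustration with hypothetical names. Its word_count function is a stand-in for a real token counter, and a fuller system might summarize the dropped messages instead of discarding them.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, NOT real tokens."""
    return len(text.split())


def trim_history(
    messages: list[str],
    budget: int,
    count: Callable[[str], int] = word_count,
) -> list[str]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first that does not fit.
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical conversation, oldest message first.
history = ["a b c d", "e f", "g h i", "j"]
recent = trim_history(history, budget=5)
print(recent)  # ['g h i', 'j']
```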
How context windows have evolved
Early transformer language models had much smaller context windows. GPT-2, released in 2019, used a context length of 1,024 tokens. That was enough for paragraphs or short articles, but not for long technical documents or extended conversations.
Modern models can support much larger windows, ranging from tens of thousands to hundreds of thousands of tokens, and some systems are designed for even larger document workflows. This expansion has enabled new use cases: analyzing large code files, summarizing long transcripts, comparing multiple documents, and building more capable agents.
Still, long-context capability does not eliminate the need for good information architecture. Developers must decide what to include, what to summarize, what to retrieve, and what to leave out.
Practical takeaways
Tokens are the basic unit of language model computation. Context windows define how much information the model can consider at once. Tokenization affects cost, speed, and quality.
For developers, the practical lessons are:
- Count tokens in code for serious applications.
- Test multilingual token efficiency if your users work in multiple languages.
- Do not assume a large context window automatically improves results.
- Prefer relevant context over maximum context.
- Use retrieval, summarization, and memory management to keep prompts focused.
A language model can only reason over what fits into its context. Understanding tokens and context is the first step toward building reliable LLM applications.