From Text to Tokens
Explaining tokenization without the math. Why models don't read words, and why that matters for your prompts.
Key takeaways
- Tokenization is the process of converting text into numbers (tokens) that a model can process.
- Models do not see characters or words; they see a sequence of integers.
- A “token” is not always a word. It can be part of a word, a space, or a punctuation mark.
- This abstraction explains why models struggle with spelling (“Strawberry”) and simple math.
- “Shorter” text is not always “simpler” for a model if it uses rare tokens.
Before a Large Language Model (LLM) can “think,” it must read. But computers cannot read text; they can only process numbers. The Tokenizer is the bridge component that chops human language into atomic units called tokens and assigns each one a unique integer ID.
This seems like a trivial implementation detail, but it dictates the fundamental capabilities and limitations of the model.
Act I: The fundamentals
Why Not Just Characters?
We could map a=1, b=2, c=3. This is simple, but inefficient.
- Context window: If every character is a token, the sentence “The quick brown fox” is 19 tokens (16 letters plus 3 spaces).
- Compute cost: Self-attention cost grows quadratically with sequence length, so longer sequences make models much slower and more expensive to run.
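To make the two bullets above concrete, here is a minimal sketch that counts character-level tokens and the number of token-to-token comparisons full self-attention would need. The helper name `attention_pairs` is made up for illustration, and the n-squared count is a simplification that ignores everything else a model does.

```python
sentence = "The quick brown fox"

# Character-level tokenization: every character, spaces included, is one token.
char_tokens = list(sentence)


def attention_pairs(n_tokens: int) -> int:
    """Token-to-token comparisons in full self-attention (n squared)."""
    return n_tokens * n_tokens


n = len(char_tokens)
print(n, attention_pairs(n))          # 19 361
print(2 * n, attention_pairs(2 * n))  # 38 1444
```

Doubling the sequence length quadruples the number of comparisons, which is why shorter token sequences matter so much.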
Conversely, we could map every word to a number. But English has hundreds of thousands of words, and new ones are invented daily. The vocabulary would be too large to handle efficiently, and any word missing from it could not be represented at all.
The solution is Subword Tokenization: keep common words as whole tokens ("apple") and break rare words such as "unfriendliness" into chunks ("un", "friend", "li", "ness").
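The idea can be sketched with a greedy longest-match splitter. This is a simplification chosen for readability, not the algorithm production tokenizers use, and the tiny `VOCAB` set and the `segment` function are hypothetical names invented for this example.

```python
# Hypothetical five-entry vocabulary; real vocabularies hold tens of thousands of pieces.
VOCAB = {"apple", "un", "friend", "li", "ness"}


def segment(word: str, vocab: set[str]) -> list[str]:
    """Split a word by repeatedly taking the longest vocabulary piece that fits."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


print(segment("apple", VOCAB))           # ['apple']
print(segment("unfriendliness", VOCAB))  # ['un', 'friend', 'li', 'ness']
```

The single-character fallback is what keeps unknown words representable: anything the vocabulary does not cover degrades into smaller pieces instead of failing.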
What a Token Actually Looks Like
To a human, the word “Smart” is a single concept. To a tokenizer, it might be one token (ID: 5421).
But the word “Smartification” (a made-up word) might be split into three tokens:
Smart (ID: 5421) | ifi (ID: 892) | cation (ID: 1102)
The model processes these three integers in sequence. It learns that 5421 followed by 892 usually implies a transformation of the concept “Smart”.
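As a minimal sketch of that lookup, the snippet below maps the three example pieces to the illustrative IDs used above and back again. The `TOKEN_TO_ID` table and the `encode` and `decode` helpers are hypothetical; a real vocabulary is far larger and uses different IDs.

```python
# Illustrative IDs taken from the example above, not from any real tokenizer.
TOKEN_TO_ID = {"Smart": 5421, "ifi": 892, "cation": 1102}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(pieces: list[str]) -> list[int]:
    """Replace each text piece with its integer ID."""
    return [TOKEN_TO_ID[piece] for piece in pieces]


def decode(ids: list[int]) -> str:
    """Map IDs back to pieces and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode(["Smart", "ifi", "cation"])
print(ids)          # [5421, 892, 1102]
print(decode(ids))  # Smartification
```

The list of integers is the only thing the model receives; the letters exist only in the lookup table.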
Here is a toy tokenizer you can run locally. It is not BPE, but it shows the idea of splitting text into chunks.
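import re
text = "Smartification!"
# Match a run of letters, or any single character that is neither a letter nor whitespace.
tokens = re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)
print(tokens)  # ['Smartification', '!']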
Act II: The modern paradigm
The Reality: Subword Tokenization
Real systems (such as GPT-4 or Claude) use algorithms like Byte Pair Encoding (BPE). BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token until a target vocabulary size is reached.
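The merge loop can be sketched in a few lines. This is a toy version under stated assumptions: it works on characters instead of bytes, stops after a fixed number of merges instead of a target vocabulary size, and the function name `train_bpe` and the small word list are invented for illustration.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
print(train_bpe(words, 3))  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each learned merge adds one entry to the vocabulary, so frequent fragments like "ug" become single tokens while rare strings stay split into smaller pieces.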
This explains why LLMs struggle with tasks that seem easy to humans:
- Spelling: The model sees the token ID 12345 for “Strawberry”. It does not inherently see the letters r-r. It has to memorize the spelling as a separate fact.
- Math: The number 9.11 might be tokenized as [9, ., 11] while 9.9 is [9, ., 9]. To the model, 11 is “bigger” than 9, so it might hallucinate that 9.11 is greater than 9.9.
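The math pitfall can be reproduced with a toy splitter. The sketch below assumes the split described above (digit runs and the decimal point as separate pieces); real tokenizers may chunk digits differently, and `split_number` is a hypothetical helper.

```python
import re


def split_number(text: str) -> list[str]:
    """Split a decimal string into digit runs and dots, mimicking the split above."""
    return re.findall(r"\d+|\.", text)


a = split_number("9.11")
b = split_number("9.9")
print(a, b)  # ['9', '.', '11'] ['9', '.', '9']

# Comparing the fractional pieces as standalone integers gives the misleading answer.
print(int(a[2]) > int(b[2]))         # True
# Comparing the actual numeric values gives the correct one.
print(float("9.11") > float("9.9"))  # False
```

Looking only at the pieces, 11 beats 9, which is the intuition a model can fall into when it never sees the number as a whole.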
Act III: Principles in practice
Why This Matters for Systems
When designing systems around LLMs, tokenization creates invisible constraints:
- Cost: You pay per token. Verbose prompts cost more.
- Performance: “Prompt Engineering” is often just finding words that tokenize into patterns the model recognizes more strongly.
- Security: Some “jailbreaks” work by forcing the tokenizer to split forbidden words in ways that bypass safety filters (e.g., splitting “bomb” into “b-omb”).
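To illustrate the cost point, here is a rough sketch that compares a verbose and a concise prompt. It counts tokens with the toy regex splitter from earlier, which is not a real tokenizer, and `PRICE_PER_1K_TOKENS` is a made-up figure, not any provider's actual pricing.

```python
import re

# Hypothetical price, for illustration only.
PRICE_PER_1K_TOKENS = 0.01


def count_tokens(text: str) -> int:
    """Toy count: runs of letters, plus each non-letter, non-space character."""
    return len(re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text))


def estimate_cost(text: str, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Estimated cost of sending the text once, under the toy assumptions above."""
    return count_tokens(text) / 1000 * price_per_1k


verbose = "Could you please, if it is not too much trouble, summarize this report?"
concise = "Summarize this report."

print(count_tokens(verbose), count_tokens(concise))  # 16 4
print(estimate_cost(verbose) > estimate_cost(concise))  # True
```

For real budgeting, swap `count_tokens` for the tokenizer your model provider publishes, since counts differ from model to model.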
Understanding the tokenizer means understanding the raw material of the intelligence you are orchestrating.
For related systems context, see Systems 001: Foundations and From Prompt to Production.
What this changes in practice
Treat tokenization as a design constraint: it shapes cost, reliability, and the kinds of prompts your system can handle.