From Text to Tokens

Explaining tokenization without the math. Why models don't read words, and why that matters for your prompts.

[Figure: Text being split into tokens with numeric encoding]

Key takeaways

  • Tokenization is the process of converting text into numbers (tokens) that a model can process.
  • Models do not see characters or words; they see a sequence of integers.
  • A “token” is not always a word. It can be part of a word, a space, or a punctuation mark.
  • This abstraction explains why models struggle with spelling (“Strawberry”) and simple math.
  • “Shorter” text is not always “simpler” for a model if it uses rare tokens.

[Figure: Text to Numbers Pipeline. The phrase “AI is math” passes through Tokenization into the tokens “AI”, “is”, “math”, then through Encoding into three integer IDs.]
The machine never sees the words. It only sees the IDs.

Before a Large Language Model (LLM) can “think,” it must read. But computers cannot read text; they can only process numbers. The Tokenizer is the bridge component that chops human language into atomic units called tokens and assigns each one a unique integer ID.

This seems like a trivial implementation detail, but it dictates the fundamental capabilities and limitations of the model.

Act I: The fundamentals

Why Not Just Characters?

We could map a=1, b=2, c=3. This is simple, but inefficient.

  • Context window: If every letter is a token, the sentence “The quick brown fox” is 19 tokens (one per character, spaces included) rather than 4 words.
  • Compute cost: Standard attention scales quadratically with sequence length. Longer sequences mean much slower and more expensive models.
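
The two costs above can be made concrete with a little arithmetic. This is only a rough sketch: `attention_pairs` is a hypothetical helper that squares the sequence length as a stand-in for the pairwise comparisons in standard self-attention, ignoring constants and any optimizations a real system might use.

```python
sentence = "The quick brown fox"

# Character-level: every character, including spaces, is one token.
char_tokens = list(sentence)
# Word-level: split on whitespace.
word_tokens = sentence.split()


def attention_pairs(n: int) -> int:
    """Hypothetical cost proxy: number of token pairs attention compares."""
    return n * n


print(len(char_tokens), attention_pairs(len(char_tokens)))  # 19 361
print(len(word_tokens), attention_pairs(len(word_tokens)))  # 4 16
```

Going from 4 tokens to 19 multiplies the sequence length by roughly 5 but the pair count by more than 20, which is the sense in which character-level input is expensive.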

Conversely, we could map every word to a number. But English has hundreds of thousands of words, and new ones are invented daily. The vocabulary would be too large to compute efficiently, and any word missing from it could not be represented at all.

The solution is Subword Tokenization: keep common words as whole tokens ("apple") and break rare words into chunks ("unfriendliness" becomes "un", "friend", "li", "ness").
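
One way to picture the splitting is a greedy longest-match lookup against a fixed vocabulary. The sketch below is an illustration under assumptions, not how any particular production tokenizer works: `VOCAB` is a hypothetical five-entry vocabulary and `split_subwords` is a made-up helper name.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"apple", "un", "friend", "li", "ness"}


def split_subwords(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; falls back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


print(split_subwords("apple", VOCAB))           # ['apple']
print(split_subwords("unfriendliness", VOCAB))  # ['un', 'friend', 'li', 'ness']
```

The single-character fallback is what lets a subword scheme represent words it has never seen, at the price of more tokens.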

What a Token Actually Looks Like

To a human, the word “Smart” is a single concept. To a tokenizer, it might be one token (ID: 5421).

But the word “Smartification” (a made-up word) might be split into three tokens:

  1. Smart (ID: 5421)
  2. ifi (ID: 892)
  3. cation (ID: 1102)

The model processes these three integers in sequence. It learns that 5421 followed by 892 usually implies a transformation of the concept “Smart”.
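
The token-to-ID step is essentially a table lookup. As a minimal sketch, the mapping below reuses the illustrative IDs from the example above; they are not taken from a real tokenizer, and `encode` and `decode` are hypothetical helper names.

```python
# IDs copied from the illustrative example; not from a real tokenizer.
TOKEN_TO_ID = {"Smart": 5421, "ifi": 892, "cation": 1102}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens: list[str]) -> list[int]:
    """Look up the integer ID for each token."""
    return [TOKEN_TO_ID[t] for t in tokens]


def decode(ids: list[int]) -> str:
    """Map IDs back to their tokens and join them into text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode(["Smart", "ifi", "cation"])
print(ids)          # [5421, 892, 1102]
print(decode(ids))  # Smartification
```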

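Here is a toy tokenizer you can run locally. It is not BPE, and it only separates runs of letters from punctuation, but it shows the idea of splitting text into chunks.

import re

# Runs of letters become one chunk; any other non-space character is its own chunk.
# Note the single backslash in \s: inside a raw string, \\s would exclude a literal backslash and "s" instead of whitespace.
text = "Smartification!"
tokens = re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)
print(tokens)  # ['Smartification', '!']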

Act II: The modern paradigm

The Reality: Subword Tokenization

Real systems (like GPT-4 or Claude) use algorithms like Byte Pair Encoding (BPE). BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token until a target vocabulary size is reached.
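
The merge loop can be sketched in a few dozen lines. This is a simplified teaching version under stated assumptions, not the implementation used by any real model: the `corpus` counts are invented, the helper names are hypothetical, and only two merges are run instead of continuing to a target vocabulary size.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical corpus: word -> count. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "lowest": 3, "lot": 1}
words = {tuple(w): c for w, c in corpus.items()}

merges = []
for _ in range(2):  # a real tokenizer repeats this until the vocabulary is full
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)
```

After two merges the frequent word "low" is a single symbol, while the rarer "lot" is still two pieces. That is the pattern described above: common strings become whole tokens and rare ones stay split.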

This explains why LLMs struggle with tasks that seem easy to humans:

  1. Spelling: The model might see a single token ID (say, 12345) for “Strawberry”, or a handful of subword IDs. Either way, it does not inherently see the individual letters, so it cannot simply count the r's. It has to memorize the spelling as a separate fact.
  2. Math: The number 9.11 might be tokenized as [9, ., 11] while 9.9 is [9, ., 9]. To the model, 11 looks “bigger” than 9, so it might wrongly conclude that 9.11 is greater than 9.9.
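
The decimal comparison can be reproduced in a few lines. This sketch assumes the split shown in the list above; `split_number` is a hypothetical helper that mimics it with a regular expression, and a real tokenizer may split numbers differently.

```python
import re


def split_number(text: str) -> list[str]:
    """Hypothetical split mirroring the example: digit runs and the dot."""
    return re.findall(r"\d+|\.", text)


a = split_number("9.11")
b = split_number("9.9")
print(a, b)  # ['9', '.', '11'] ['9', '.', '9']

# Comparing the fractional pieces as whole numbers gives the misleading answer.
print(int(a[2]) > int(b[2]))         # True
# Comparing the actual decimal values gives the right one.
print(float("9.11") > float("9.9"))  # False
```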

Act III: Principles in practice

Why This Matters for Systems

When designing systems around LLMs, tokenization creates invisible constraints:

  • Cost: Hosted model APIs typically charge per token, so verbose prompts cost more.
  • Performance: Part of “Prompt Engineering” is finding wording that tokenizes into patterns the model recognizes more strongly.
  • Security: Some “jailbreaks” work by forcing the tokenizer to split forbidden words in ways that bypass safety filters (e.g., splitting “bomb” into “b-omb”).
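
To make the cost point concrete, here is a back-of-the-envelope estimate. Every number in it is an assumption for illustration: the price, the token counts, and the request volume are invented, and `estimate_cost` is a hypothetical helper. In practice the token counts would come from the tokenizer of the model you are calling.

```python
# Hypothetical price; check your provider's current pricing.
PRICE_PER_1K_TOKENS_USD = 0.01


def estimate_cost(token_count: int, price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Cost in USD of sending `token_count` tokens at a flat per-token rate."""
    return token_count / 1000 * price_per_1k


# Assumed prompt sizes and traffic, for illustration only.
prompts = {"terse": 200, "verbose": 1200}
requests_per_day = 10_000

for name, token_count in prompts.items():
    daily = estimate_cost(token_count) * requests_per_day
    print(f"{name}: ${daily:.2f} per day")
```

Under these assumptions the verbose prompt costs six times as much per day as the terse one, because cost tracks token count directly.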

Understanding the tokenizer means understanding the raw material of the intelligence you are orchestrating.

For related systems context, see Systems 001: Foundations and From Prompt to Production.

What this changes in practice

Treat tokenization as a design constraint: it shapes cost, reliability, and the kinds of prompts your system can handle.

Proof Block

  • Core foundational reference for tokenization mechanics
  • Referenced in context-windows-as-working-memory.mdx

FAQ

What is tokenization?

Tokenization is the process of converting text into tokens (integers) that a language model can process. The model doesn't read characters or words; it sees a sequence of numbers representing these tokens. A token can be a full word, part of a word, a space, or punctuation.

Why do models struggle with "Strawberry"?

Tokenizers split text into subword units based on how frequently character sequences appear in their training data. Depending on the tokenizer and the surrounding text, "Strawberry" may be a single token or may be split into pieces such as "str", "aw", and "berry". In either case the model sees token IDs rather than letters, so counting the r's, especially across token boundaries, is difficult.

Does shorter text always mean fewer tokens?

No. Text length in characters doesn't correlate perfectly with token count. Dense, uncommon words can use more tokens than longer, common phrases. This is why estimating costs or context usage by character count is unreliable.