What Is Tokenization?
Tokenization is the preprocessing step that converts a raw string of text into a sequence of integer IDs — called tokens — that a large language model can ingest. Before any LLM processes your prompt, a tokenizer splits the text into subword pieces, looks each piece up in a fixed vocabulary table, and passes the resulting ID sequence to the model's embedding layer.
Both input and output follow this path: the model generates token IDs, which the tokenizer then decodes back into human-readable text.
Why Tokenization Exists
LLMs are mathematical functions that operate on numbers, not characters. Tokenization is the bridge between human-readable text and the integer sequences the model processes. The design of the tokenizer has significant downstream effects on:
- Vocabulary size — typically 32,000 to 128,000 unique tokens.
- Context efficiency — how much meaningful text fits in the context window.
- Language coverage — how well non-English and multilingual text is represented.
- Arithmetic ability — numbers tokenized as individual digits vs. multi-digit chunks affect mathematical reasoning.
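The arithmetic point above can be made concrete. Digit-level splitting gives the model one token per digit, while chunked splitting packs several digits into one token, which changes what "carrying" and place value look like from the model's perspective. A minimal sketch; the 3-digit chunk size is an illustrative assumption, not any specific model's rule:

```python
# Two ways the same number can reach a model as tokens.
# Digit-level: one token per digit (used by some tokenizers).
# Chunked: groups of digits as single tokens (chunk size 3 is
# an illustrative assumption here, not an exact model rule).
def digit_tokens(n: str) -> list[str]:
    return list(n)

def chunked_tokens(n: str, size: int = 3) -> list[str]:
    return [n[i:i + size] for i in range(0, len(n), size)]

print(digit_tokens("123456"))    # → ['1', '2', '3', '4', '5', '6']
print(chunked_tokens("123456"))  # → ['123', '456']
```

With digit-level tokens, column-by-column arithmetic maps more directly onto the token sequence; with chunked tokens, the model must learn the internal structure of each chunk.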
Byte Pair Encoding (BPE)
Byte Pair Encoding is the dominant tokenization algorithm used by GPT-series models, Llama, Mistral, and many others. The algorithm works as follows:
- Start with a vocabulary of individual bytes (256 entries).
- Count all adjacent byte-pair frequencies in the training corpus.
- Merge the most frequent pair into a new token.
- Repeat until the target vocabulary size is reached.
The result is a vocabulary where common words and subwords appear as single tokens, while rare sequences are decomposed into smaller pieces.
```
Training text:  "low lower lowest"
Pair counts:    l-o (3), o-w (3), w-e (2), e-r (1), e-s (1), s-t (1)
Merge 1:        l + o  → new token "lo"
Merge 2:        lo + w → new token "low"
```
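The merge loop above can be sketched in a few lines of Python. This is a minimal trainer for illustration only; production tokenizers add byte-level fallback, pre-tokenization rules, and heavy optimization:

```python
# Minimal BPE trainer: repeatedly merge the most frequent adjacent
# pair of symbols. Words start as tuples of characters.
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn `num_merges` merge rules from `text`."""
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe("low lower lowest", num_merges=3)
print(merges)
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The first two learned merges reproduce the "lo" and "low" steps from the worked example above.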
WordPiece and SentencePiece
BPE is not the only approach; several model families use related algorithms:
- WordPiece (BERT, DistilBERT) — Similar to BPE but uses a different merge criterion based on likelihood rather than raw frequency.
- SentencePiece (T5, Gemma, many multilingual models) — Operates on raw Unicode characters without requiring pre-tokenization whitespace splitting, making it language-agnostic.
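The WordPiece merge criterion can be sketched numerically. Where BPE merges the pair with the highest raw count, WordPiece scores a candidate pair as freq(ab) / (freq(a) * freq(b)), favoring pairs whose parts rarely occur apart. The counts below are invented purely for illustration:

```python
# BPE vs. WordPiece merge criteria on hypothetical corpus counts.
# "th"+"e" is the more frequent pair, but "un"+"##able" has parts
# that rarely appear separately, so WordPiece prefers it.
unit_freq = {"th": 900, "e": 5000, "un": 300, "##able": 320}
pair_freq = {("th", "e"): 800, ("un", "##able"): 250}

def bpe_choice(pairs):
    # BPE: merge the most frequent pair.
    return max(pairs, key=pairs.get)

def wordpiece_choice(pairs, units):
    # WordPiece: merge the pair with the highest likelihood score.
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))

print(bpe_choice(pair_freq))                  # → ('th', 'e')
print(wordpiece_choice(pair_freq, unit_freq)) # → ('un', '##able')
```

The two criteria pick different merges from the same counts, which is the practical difference between the algorithms.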
Visualizing Tokenization
```python
import tiktoken

# Load the tokenizer used by gpt-4o (the o200k_base encoding).
enc = tiktoken.encoding_for_model("gpt-4o")

text = "KnowledgeSDK extracts structured data from any URL."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# → Token count: 9

# Decode each ID individually to see the subword boundaries.
decoded = [enc.decode([t]) for t in tokens]
print(decoded)
# → ['Knowledge', 'SDK', ' extracts', ' structured', ' data', ' from', ' any', ' URL', '.']
```

Note the leading spaces inside tokens like `' extracts'` — most BPE tokenizers fold whitespace into the following token. Exact splits and counts vary between tokenizer versions and models.
Tokenization Efficiency Across Languages
English is the most tokenization-efficient language in models trained primarily on English data. As a rule of thumb:
| Language | Tokens per word (approx.) |
|---|---|
| English | 1.0 – 1.3 |
| Spanish / French | 1.2 – 1.5 |
| Arabic / Hebrew | 1.5 – 2.0 |
| Chinese / Japanese | 1.5 – 2.5 |
| Code (Python/JS) | 1.5 – 3.0 |
Tokenization and Token Counting in Practice
When building pipelines that feed web content into LLMs, raw HTML is extremely token-inefficient: navigation menus, script tags, inline styles, and repeated boilerplate all consume tokens without adding useful information.
KnowledgeSDK's /v1/scrape and /v1/extract endpoints convert raw HTML into clean markdown before any LLM processing. This typically reduces token counts by 50–80% compared to passing raw HTML, which directly lowers API costs and allows more meaningful content to fit within the context window.
```javascript
// Raw HTML scrape: ~8,000 tokens for a typical product page
// KnowledgeSDK clean markdown: ~1,200 tokens — same information, fraction of the cost
// (`sdk` is an initialized KnowledgeSDK client)
const { content } = await sdk.scrape("https://example.com/product");
```
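The overhead is easy to see even with a crude proxy for tokens. The sketch below uses regex word pieces instead of a real tokenizer, and the HTML snippet is invented for illustration; actual counts require the target model's tokenizer:

```python
# Rough illustration of HTML boilerplate overhead. Regex word/punct
# pieces stand in for tokens; real counts need the model's tokenizer.
import re

raw_html = (
    '<div class="nav"><ul><li><a href="/home">Home</a></li>'
    '<li><a href="/shop">Shop</a></li></ul></div>'
    '<script src="/analytics.js"></script>'
    "<p>Widget X costs $19.99 and ships worldwide.</p>"
)
clean_md = "Widget X costs $19.99 and ships worldwide."

def rough_tokens(text: str) -> int:
    # Split into word runs and single punctuation marks as a cheap proxy.
    return len(re.findall(r"\w+|[^\w\s]", text))

html_n, md_n = rough_tokens(raw_html), rough_tokens(clean_md)
print(html_n, md_n)
print(f"reduction: {1 - md_n / html_n:.0%}")
```

Even in this toy case, the markup consumes several times more pieces than the content it wraps, which is the effect the percentages above describe at page scale.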
Understanding tokenization helps you write more cost-efficient prompts and design better context management strategies for your LLM applications.