What Is Tokenization?
Tokenization is the preprocessing step that converts a raw string of text into a sequence of integer IDs — called tokens — that a large language model can ingest. Before any LLM processes your prompt, a tokenizer splits the text into subword pieces, looks each piece up in a fixed vocabulary table, and passes the resulting ID sequence to the model's embedding layer.
Both input and output follow this path: the model generates token IDs, which the tokenizer then decodes back into human-readable text.
Why Tokenization Exists
LLMs are mathematical functions that operate on numbers, not characters. Tokenization is the bridge between human-readable text and the integer sequences the model processes. The design of the tokenizer has significant downstream effects on:
- Vocabulary size — typically 32,000 to 128,000 unique tokens.
- Context efficiency — how much meaningful text fits in the context window.
- Language coverage — how well non-English and multilingual text is represented.
- Arithmetic ability — numbers tokenized as individual digits vs. multi-digit chunks affect mathematical reasoning.
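The arithmetic point above can be made concrete. Digit-level splitting gives the model one token per digit, while chunked splitting packs several digits into one token, which changes what "carrying" and place value look like from the model's perspective. A minimal sketch; the 3-digit chunk size is an illustrative assumption, not any specific model's rule:

```python
# Two ways the same number can reach a model as tokens.
# Digit-level: one token per digit (used by some tokenizers).
# Chunked: groups of digits as single tokens (chunk size 3 is
# an illustrative assumption here, not an exact model rule).
def digit_tokens(n: str) -> list[str]:
    return list(n)

def chunked_tokens(n: str, size: int = 3) -> list[str]:
    return [n[i:i + size] for i in range(0, len(n), size)]

print(digit_tokens("123456"))    # → ['1', '2', '3', '4', '5', '6']
print(chunked_tokens("123456"))  # → ['123', '456']
```

With digit-level tokens, column-by-column arithmetic maps more directly onto the token sequence; with chunked tokens, the model must learn the internal structure of each chunk.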
Byte Pair Encoding (BPE)
Byte Pair Encoding is the dominant tokenization algorithm used by GPT-series models, Llama, Mistral, and many others. The algorithm works as follows:
- Start with a vocabulary of individual bytes (256 entries).
- Count all adjacent byte-pair frequencies in the training corpus.
- Merge the most frequent pair into a new token.
- Repeat until the target vocabulary size is reached.
The result is a vocabulary where common words and subwords appear as single tokens, while rare sequences are decomposed into smaller pieces.
```
Training text:  "low lower lowest"
Pair counts:    l-o (3), o-w (3), w-e (2), e-r (1), e-s (1), s-t (1)
Merge 1:        l + o  → new token "lo"
Merge 2:        lo + w → new token "low"
```
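The merge loop above can be sketched in a few lines of Python. This is a minimal trainer for illustration only; production tokenizers add byte-level fallback, pre-tokenization rules, and heavy optimization:

```python
# Minimal BPE trainer: repeatedly merge the most frequent adjacent
# pair of symbols. Words start as tuples of characters.
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn `num_merges` merge rules from `text`."""
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe("low lower lowest", num_merges=3)
print(merges)
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The first two learned merges reproduce the "lo" and "low" steps from the worked example above.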
WordPiece and SentencePiece
BPE is not the only approach; several model families use related algorithms:
- WordPiece (BERT, DistilBERT) — Similar to BPE but uses a different merge criterion based on likelihood rather than raw frequency.
- SentencePiece (T5, Gemma, many multilingual models) — Operates on raw Unicode characters without requiring pre-tokenization whitespace splitting, making it language-agnostic.
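The WordPiece merge criterion can be sketched numerically. Where BPE merges the pair with the highest raw count, WordPiece scores a candidate pair as freq(ab) / (freq(a) * freq(b)), favoring pairs whose parts rarely occur apart. The counts below are invented purely for illustration:

```python
# BPE vs. WordPiece merge criteria on hypothetical corpus counts.
# "th"+"e" is the more frequent pair, but "un"+"##able" has parts
# that rarely appear separately, so WordPiece prefers it.
unit_freq = {"th": 900, "e": 5000, "un": 300, "##able": 320}
pair_freq = {("th", "e"): 800, ("un", "##able"): 250}

def bpe_choice(pairs):
    # BPE: merge the most frequent pair.
    return max(pairs, key=pairs.get)

def wordpiece_choice(pairs, units):
    # WordPiece: merge the pair with the highest likelihood score.
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))

print(bpe_choice(pair_freq))                  # → ('th', 'e')
print(wordpiece_choice(pair_freq, unit_freq)) # → ('un', '##able')
```

The two criteria pick different merges from the same counts, which is the practical difference between the algorithms.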
Visualizing Tokenization
```python
import tiktoken

# Load the tokenizer used by gpt-4o (the o200k_base encoding).
enc = tiktoken.encoding_for_model("gpt-4o")

text = "KnowledgeSDK extracts structured data from any URL."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# → Token count: 9

# Decode each ID individually to see the subword boundaries.
decoded = [enc.decode([t]) for t in tokens]
print(decoded)
# → ['Knowledge', 'SDK', ' extracts', ' structured', ' data', ' from', ' any', ' URL', '.']
```

Note the leading spaces inside tokens like `' extracts'` — most BPE tokenizers fold whitespace into the following token. Exact splits and counts vary between tokenizer versions and models.
Tokenization Efficiency Across Languages
English is the most tokenization-efficient language in models trained primarily on English data. As a rule of thumb:
| Language | Tokens per word (approx.) |
|---|---|
| English | 1.0 – 1.3 |
| Spanish / French | 1.2 – 1.5 |
| Arabic / Hebrew | 1.5 – 2.0 |
| Chinese / Japanese | 1.5 – 2.5 |
| Code (Python/JS) | 1.5 – 3.0 |
Tokenization and Token Counting in Practice
When building pipelines that feed web content into LLMs, raw HTML is extremely token-inefficient: navigation menus, script tags, inline styles, and repeated boilerplate all consume tokens without adding useful information.
KnowledgeSDK's /v1/scrape and /v1/extract endpoints convert raw HTML into clean markdown before any LLM processing. This typically reduces token counts by 50–80% compared to passing raw HTML, which directly lowers API costs and allows more meaningful content to fit within the context window.
```javascript
// Raw HTML scrape: ~8,000 tokens for a typical product page
// KnowledgeSDK clean markdown: ~1,200 tokens — same information, fraction of the cost
// (`sdk` is an initialized KnowledgeSDK client)
const { content } = await sdk.scrape("https://example.com/product");
```
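The overhead is easy to see even with a crude proxy for tokens. The sketch below uses regex word pieces instead of a real tokenizer, and the HTML snippet is invented for illustration; actual counts require the target model's tokenizer:

```python
# Rough illustration of HTML boilerplate overhead. Regex word/punct
# pieces stand in for tokens; real counts need the model's tokenizer.
import re

raw_html = (
    '<div class="nav"><ul><li><a href="/home">Home</a></li>'
    '<li><a href="/shop">Shop</a></li></ul></div>'
    '<script src="/analytics.js"></script>'
    "<p>Widget X costs $19.99 and ships worldwide.</p>"
)
clean_md = "Widget X costs $19.99 and ships worldwide."

def rough_tokens(text: str) -> int:
    # Split into word runs and single punctuation marks as a cheap proxy.
    return len(re.findall(r"\w+|[^\w\s]", text))

html_n, md_n = rough_tokens(raw_html), rough_tokens(clean_md)
print(html_n, md_n)
print(f"reduction: {1 - md_n / html_n:.0%}")
```

Even in this toy case, the markup consumes several times more pieces than the content it wraps, which is the effect the percentages above describe at page scale.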
Understanding tokenization helps you write more cost-efficient prompts and design better context management strategies for your LLM applications.