LLMs · beginner

Also known as: text tokenization

Tokenization

The process of converting raw text into a sequence of tokens that an LLM can process using a vocabulary-based algorithm like BPE.

What Is Tokenization?

Tokenization is the preprocessing step that converts a raw string of text into a sequence of integer IDs — called tokens — that a large language model can ingest. Before any LLM processes your prompt, a tokenizer splits the text into subword pieces, looks each piece up in a fixed vocabulary table, and passes the resulting ID sequence to the model's embedding layer.

Both input and output follow this path: the model generates token IDs, which the tokenizer then decodes back into human-readable text.
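This round trip can be sketched with a toy vocabulary. The pieces and IDs below are invented for illustration (a real vocabulary holds tens of thousands of entries, and real tokenizers use trained merge rules rather than simple longest-match):

```python
# Toy encode/decode round trip. The vocabulary and IDs are made up;
# a production tokenizer's vocabulary has 32k-128k entries.
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3}
inverse = {i: piece for piece, i in vocab.items()}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedily match the longest vocabulary piece at each position."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

def decode(ids: list[int], inverse: dict[int, str]) -> str:
    """Map IDs back to their text pieces and concatenate."""
    return "".join(inverse[i] for i in ids)

ids = encode("Hello, world!", vocab)
print(ids)                    # → [0, 1, 2, 3]
print(decode(ids, inverse))   # → Hello, world!
```

Everything the model sees is the ID sequence; the text itself never enters the network.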

Why Tokenization Exists

LLMs are mathematical functions that operate on numbers, not characters. Tokenization is the bridge between human-readable text and the integer sequences the model processes. The design of the tokenizer has significant downstream effects on:

  • Vocabulary size — typically 32,000 to 128,000 unique tokens.
  • Context efficiency — how much meaningful text fits in the context window.
  • Language coverage — how well non-English and multilingual text is represented.
  • Arithmetic ability — numbers tokenized as individual digits vs. multi-digit chunks affect mathematical reasoning.

Byte Pair Encoding (BPE)

Byte Pair Encoding is the dominant tokenization algorithm used by GPT-series models, Llama, Mistral, and many others. The algorithm works as follows:

  1. Start with a vocabulary of individual bytes (256 entries).
  2. Count all adjacent byte-pair frequencies in the training corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat until the target vocabulary size is reached.

The result is a vocabulary where common words and subwords appear as single tokens, while rare sequences are decomposed into smaller pieces.

Training text: "low lower lowest"
Initial pairs: l-o, o-w, w-e, e-r, ...
Most frequent: "l" + "o" → merge → new token "lo"
Next:          "lo" + "w" → merge → new token "low"
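The merge loop above can be written out in a few lines of Python. This is a simplified character-level sketch (production BPE operates on raw bytes and handles word boundaries and pre-tokenization, which are omitted here):

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words (character-level, simplified)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

merges = learn_bpe_merges(["low", "lower", "lowest"], num_merges=2)
print(merges)  # → [('l', 'o'), ('lo', 'w')]
```

Running it on the example corpus reproduces the two merges shown above: "lo" first, then "low".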

WordPiece and SentencePiece

Other model families use related subword algorithms:

  • WordPiece (BERT, DistilBERT) — Similar to BPE but uses a different merge criterion based on likelihood rather than raw frequency.
  • SentencePiece (T5, Gemma, many multilingual models) — Operates on raw Unicode characters without requiring pre-tokenization whitespace splitting, making it language-agnostic.

Visualizing Tokenization

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "KnowledgeSDK extracts structured data from any URL."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# → Token count: 9

decoded = [enc.decode([t]) for t in tokens]
print(decoded)
# → ['Knowledge', 'SDK', ' extracts', ' structured', ' data', ' from', ' any', ' URL', '.']

Tokenization Efficiency Across Languages

English is the most tokenization-efficient language in models trained primarily on English data. As a rule of thumb:

Language              Tokens per word (approx.)
English               1.0 – 1.3
Spanish / French      1.2 – 1.5
Arabic / Hebrew       1.5 – 2.0
Chinese / Japanese    1.5 – 2.5
Code (Python / JS)    1.5 – 3.0
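These ratios make it possible to ballpark a token budget from a word count. A sketch using the upper bounds from the table above (the ratios are rough heuristics, not tokenizer guarantees; actual counts depend on the model's tokenizer and the specific text):

```python
# Rough tokens-per-word upper bounds, taken from the table above.
# Heuristics only; real counts vary by tokenizer and text.
TOKENS_PER_WORD = {
    "english": 1.3,
    "spanish": 1.5,
    "french": 1.5,
    "arabic": 2.0,
    "hebrew": 2.0,
    "chinese": 2.5,
    "japanese": 2.5,
    "code": 3.0,
}

def estimate_tokens(word_count: int, language: str) -> int:
    """Upper-bound token estimate for a given word count and language."""
    return int(word_count * TOKENS_PER_WORD[language.lower()])

# A 1,000-word English article should land at or under roughly 1,300 tokens.
print(estimate_tokens(1000, "English"))  # → 1300
```

Estimates like this are useful for pre-flight checks (will this document fit in the context window?) before paying for an actual tokenization pass.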

Tokenization and Token Counting in Practice

When building pipelines that feed web content into LLMs, raw HTML is extremely token-inefficient: navigation menus, script tags, inline styles, and repeated boilerplate all consume tokens without adding useful information.

KnowledgeSDK's /v1/scrape and /v1/extract endpoints convert raw HTML into clean markdown before any LLM processing. This typically reduces token counts by 50–80% compared to passing raw HTML, which directly lowers API costs and allows more meaningful content to fit within the context window.

// Raw HTML scrape: ~8,000 tokens for a typical product page
// KnowledgeSDK clean markdown: ~1,200 tokens — same information, fraction of the cost
// (assumes `sdk` is an initialized KnowledgeSDK client)
const { content } = await sdk.scrape("https://example.com/product");
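Those token savings translate directly into API cost. A quick sketch of the arithmetic, using the per-page counts from the comment above and an assumed input price of $2.50 per million tokens (pricing varies by model and provider):

```python
# Illustrative cost comparison. The price is an assumption for the sketch;
# check your provider's current input-token pricing.
PRICE_PER_MILLION_TOKENS = 2.50  # USD per million input tokens (assumed)

def cost(tokens_per_page: int, pages: int) -> float:
    """Input cost in USD for processing `pages` pages of `tokens_per_page` each."""
    return tokens_per_page * pages * PRICE_PER_MILLION_TOKENS / 1_000_000

raw_html = cost(8_000, pages=10_000)   # raw HTML pipeline
clean_md = cost(1_200, pages=10_000)   # clean markdown pipeline
print(f"raw HTML: ${raw_html:.2f}")                  # → raw HTML: $200.00
print(f"markdown: ${clean_md:.2f}")                  # → markdown: $30.00
print(f"savings:  {1 - clean_md / raw_html:.0%}")    # → savings:  85%
```

At scale the difference compounds: every token spent on boilerplate is a token unavailable for content in the context window.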

Understanding tokenization helps you write more cost-efficient prompts and design better context management strategies for your LLM applications.

Related Terms

LLMs · beginner
Token
The basic unit of text processed by an LLM — roughly 3/4 of a word in English — that models use to read and generate language.
LLMs · beginner
Large Language Model
A neural network trained on vast text corpora that can generate, summarize, translate, and reason about language.
RAG & Retrieval · beginner
Context Window
The maximum number of tokens an LLM can process in a single inference call, including both input and output.
