
How Does Tokenization Work in LLMs
Tokenization is the critical first step in the processing pipeline of Large Language Models (LLMs). It transforms raw text into numerical representations that models can understand and process. While often overlooked, tokenization significantly impacts model performance, efficiency, and even cost. This deep dive explores how tokenization works in modern LLMs, with practical examples and insights for engineers.
The Fundamentals of Tokenization
At its core, tokenization is the process of breaking text into smaller units called tokens. These tokens are then converted to numerical IDs used by the model for processing. The choice of tokenization strategy directly affects:
- Model performance and capabilities
- Context window limitations
- Processing efficiency
- Handling of multilingual content
- Cost of API usage (typically billed per token)
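As a minimal illustration of that two-step mapping (the vocabulary below is an invented toy, not any real model's), a tokenizer first produces token strings and then looks up each one's integer ID:

# Hypothetical toy vocabulary: token string -> integer ID
toy_vocab = {"Hello": 17, ",": 3, " world": 905, "!": 0}

def toy_encode(token_strings):
    # Look up each token's integer ID; the model only ever sees these IDs
    return [toy_vocab[t] for t in token_strings]

print(toy_encode(["Hello", ",", " world", "!"]))
# [17, 3, 905, 0]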
Evolution of Tokenization Approaches
Character-Level Tokenization
The simplest approach tokenizes each character individually:
def char_tokenize(text):
    return list(text)

tokens = char_tokenize("Hello")
# ['H', 'e', 'l', 'l', 'o']
Advantages:
- Simple implementation
- No out-of-vocabulary issues
- Fixed vocabulary size
Disadvantages:
- Very inefficient (long sequences)
- Loses word-level semantics
- Poor handling of spaces and formatting
Word-Level Tokenization
Word tokenization splits text at whitespace (more elaborate versions also split off punctuation):
def word_tokenize(text):
    return text.split()

tokens = word_tokenize("Hello, world!")
# ['Hello,', 'world!']
Advantages:
- Intuitive alignment with linguistic units
- Shorter sequences than character tokenization
Disadvantages:
- Large vocabulary requirements
- Out-of-vocabulary words
- Poor handling of morphologically rich languages
Subword Tokenization (Modern Approach)
Most modern LLMs use subword tokenization, which balances efficiency and flexibility. The three dominant algorithms are:
1. Byte Pair Encoding (BPE)
Used by GPT models, BPE starts with individual characters and iteratively merges the most frequent pairs:
import collections

# Simplified BPE training process
def train_bpe(corpus, vocab_size):
    # corpus: an iterable of words (or whitespace-split texts)
    # Start with character vocabulary
    vocab = set(char for word in corpus for char in word)
    # Initialize each word as a sequence of characters
    splits = {word: list(word) for word in corpus}
    # Merge most frequent pairs until vocab_size is reached
    while len(vocab) < vocab_size:
        # Count pair frequencies
        pairs = collections.Counter()
        for word_splits in splits.values():
            for i in range(len(word_splits) - 1):
                pair = (word_splits[i], word_splits[i + 1])
                pairs[pair] += 1
        # Find most frequent pair
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        # Create new token from the pair
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
        # Update splits to use the new token
        for word, word_splits in splits.items():
            i = 0
            while i < len(word_splits) - 1:
                if (word_splits[i], word_splits[i + 1]) == best_pair:
                    word_splits[i:i+2] = [new_token]
                else:
                    i += 1
    return vocab
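As a rough usage sketch of the function above (the corpus and target size are arbitrary), you can watch merged tokens emerge from frequent character pairs:

corpus = ["low", "lower", "lowest", "newer", "wider"]
vocab = train_bpe(corpus, vocab_size=20)
print(sorted(vocab, key=len, reverse=True))
# Multi-character entries such as 'low' and 'er' typically appear among the learned tokens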
2. WordPiece
Used by BERT-like models, WordPiece selects merges that most improve the likelihood of the training data:
# Conceptual WordPiece algorithm
def train_wordpiece(corpus, vocab_size):
    # Start with base vocabulary (e.g., characters)
    vocab = set(char for text in corpus for char in text)
    while len(vocab) < vocab_size:
        best_score = -float('inf')
        best_pair = None
        # Evaluate all possible pairs
        for pair in generate_candidate_pairs(corpus, vocab):
            # Calculate likelihood improvement if this pair is merged
            score = calculate_likelihood_improvement(corpus, vocab, pair)
            if score > best_score:
                best_score = score
                best_pair = pair
        if best_pair is None:
            break
        # Merge the best pair
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
    return vocab
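The two helper functions above are conceptual placeholders. In practice, the WordPiece merge score is usually described as the pair's frequency divided by the product of its parts' frequencies, so merges are favored between symbols that rarely occur apart. A minimal sketch of that scoring (the function name and data layout are mine, not from a specific library):

import collections

def wordpiece_pair_scores(splits, word_freqs):
    # splits: {word: [current subword pieces]}, word_freqs: {word: count}
    token_freqs = collections.Counter()
    pair_freqs = collections.Counter()
    for word, pieces in splits.items():
        freq = word_freqs[word]
        for piece in pieces:
            token_freqs[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_freqs[(a, b)] += freq
    # WordPiece-style score: freq(pair) / (freq(first) * freq(second))
    return {
        pair: count / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, count in pair_freqs.items()
    }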
3. SentencePiece / Unigram LM
Used by T5 and many multilingual models, Unigram starts with a large vocabulary and iteratively prunes the tokens whose removal hurts the training-data likelihood the least:
# Conceptual Unigram algorithm
def train_unigram(corpus, vocab_size, initial_vocab_multiplier=10):
    # Start with a large vocabulary
    # (create_initial_vocab and calculate_loss_if_removed are conceptual placeholders)
    initial_vocab = create_initial_vocab(corpus, vocab_size * initial_vocab_multiplier)
    current_vocab = initial_vocab
    while len(current_vocab) > vocab_size:
        # Calculate loss contribution of each token
        token_scores = {}
        for token in current_vocab:
            # Measure how much the likelihood would decrease if we removed this token
            score = calculate_loss_if_removed(corpus, current_vocab, token)
            token_scores[token] = score
        # Sort tokens by their score (lower = less important)
        sorted_tokens = sorted(token_scores.items(), key=lambda x: x[1])
        # Remove the worst-performing tokens (e.g., 20% each iteration)
        num_to_remove = max(1, int(len(current_vocab) * 0.2))
        tokens_to_remove = [t[0] for t in sorted_tokens[:num_to_remove]]
        # Update vocabulary
        current_vocab = current_vocab - set(tokens_to_remove)
    return current_vocab
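The sketch above only covers vocabulary pruning. At inference time, a Unigram tokenizer segments text by picking the split with the highest total token log-probability, typically with a Viterbi-style dynamic program. A minimal sketch, assuming you already have a dictionary of token log-probabilities (the values below are made up):

import math

def unigram_segment(text, token_logprobs):
    # best[i] = (best log-prob of text[:i], start index of the last token in that split)
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in token_logprobs and best[start][0] > -math.inf:
                score = best[start][0] + token_logprobs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack to recover the segmentation
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return list(reversed(tokens))

# Made-up log-probabilities: whole subwords are much more likely than single characters
logp = {"token": -2.0, "ization": -3.0, "t": -5.0, "o": -5.0, "k": -5.0,
        "e": -5.0, "n": -5.0, "i": -5.0, "z": -5.0, "a": -5.0}
print(unigram_segment("tokenization", logp))
# ['token', 'ization']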
Practical Tokenization Examples
Let's see how different modern tokenizers handle the same text:
Example Input: "The tokenization process transforms text into numerical tokens."
GPT-2/3/4 Tokenizer (BPE)
['The', 'Ġtokenization', 'Ġprocess', 'Ġtransforms', 'Ġtext', 'Ġinto', 'Ġnumerical', 'Ġtokens', '.']
# 9 tokens (Ġ represents space)
BERT Tokenizer (WordPiece)
['the', 'token', '##ization', 'process', 'transform', '##s', 'text', 'into', 'numerical', 'token', '##s', '.']
# 12 tokens (## represents subword continuation)
T5 Tokenizer (SentencePiece)
['▁The', '▁token', 'ization', '▁process', '▁transforms', '▁text', '▁into', '▁numerical', '▁tokens', '.']
# 10 tokens (▁ represents space)
Tokenization of Special Cases
1. Handling Numbers
text = "The price is $1,234.56"
# GPT-4 Tokenizer
['The', 'Ġprice', 'Ġis', 'Ġ$', '1', ',', '234', '.', '56']
# 9 tokens
# Claude Tokenizer
['The', ' price', ' is', ' $', '1', ',', '234', '.', '56']
# 9 tokens
2. Handling Emojis
text = "I love coding! 👨💻"
# GPT-4 Tokenizer
['I', 'Ġlove', 'Ġcoding', '!', 'Ġ', '<0xF0>', '<0x9F>', '<0x91>', '<0xA8>', '<0xE2>', '<0x80>', '<0x8D>', '<0xF0>', '<0x9F>', '<0x92>', '<0xBB>']
# 16 tokens (UTF-8 bytes)
# Claude Tokenizer
['I', ' love', ' coding', '!', ' ', '👨', '', '💻']
# 8 tokens
3. Handling Code
code = "def hello_world():\n print('Hello, world!')"
# GPT-4 Tokenizer
['def', 'Ġhello', '_', 'world', '():', '\n', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '(', "'", 'Hello', ',', 'Ġworld', '!', "'", ')']
# 18 tokens
# CodeLlama Tokenizer (optimized for code)
['def', ' hello_world', '():', '\n ', 'print', '(', "'", 'Hello', ',', ' world', '!', "'", ')']
# 13 tokens
Byte-Level Tokenization
Some modern models like GPT-4 use byte-level BPE, which ensures every possible string can be encoded:
# Every string can be represented as bytes
def bytes_tokenize(text):
    return list(text.encode('utf-8'))

tokens = bytes_tokenize("Hello 🌍")
# [72, 101, 108, 108, 111, 32, 240, 159, 140, 141]
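Because the "token IDs" in this naive scheme are literally UTF-8 byte values, decoding is just the reverse operation (real byte-level BPE then merges frequent byte sequences on top of this base):

def bytes_detokenize(token_ids):
    # Reassemble the byte values and decode them back into a string
    return bytes(token_ids).decode('utf-8')

print(bytes_detokenize([72, 101, 108, 108, 111, 32, 240, 159, 140, 141]))
# Hello 🌍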
Implementing a Simple BPE Tokenizer
Here's a simplified but functional BPE tokenizer implementation:
import collections

class SimpleBPETokenizer:
    def __init__(self):
        self.vocab = {}   # token -> id
        self.merges = {}  # pair -> new token
        self.id_to_token = {}

    def train(self, texts, vocab_size=1000, min_frequency=2):
        # Count word frequencies (texts are joined with spaces, then split)
        word_freqs = collections.Counter(' '.join(texts).split())
        chars = set(''.join(word_freqs.keys()))
        # Initialize vocabulary with an unknown token plus the characters
        self.vocab = {"<unk>": 0}
        for i, c in enumerate(sorted(chars), start=1):
            self.vocab[c] = i
        next_id = len(self.vocab)
        # Initialize splits of each word
        splits = {word: list(word) for word in word_freqs.keys()}
        # Perform merges
        num_merges = vocab_size - len(self.vocab)
        for _ in range(num_merges):
            # Count pair frequencies
            pair_freqs = collections.Counter()
            for word, freq in word_freqs.items():
                word_split = splits[word]
                for i in range(len(word_split) - 1):
                    pair = (word_split[i], word_split[i + 1])
                    pair_freqs[pair] += freq
            # Find most frequent pair
            if not pair_freqs:
                break
            best_pair = max(pair_freqs, key=pair_freqs.get)
            if pair_freqs[best_pair] < min_frequency:
                break
            # Create new token from pair
            new_token = best_pair[0] + best_pair[1]
            self.vocab[new_token] = next_id
            next_id += 1
            # Record the merge
            self.merges[best_pair] = new_token
            # Update splits for all words
            for word in splits:
                split = splits[word]
                i = 0
                while i < len(split) - 1:
                    if (split[i], split[i + 1]) == best_pair:
                        split[i:i+2] = [new_token]
                    else:
                        i += 1
        # Create reverse vocabulary for decoding
        self.id_to_token = {id: token for token, id in self.vocab.items()}

    def encode(self, text):
        words = text.split()
        result = []
        for word in words:
            # Start with characters
            subwords = list(word)
            # Apply merges (greedy: the first applicable pair in the word,
            # not necessarily the order in which the merges were learned)
            while True:
                pairs = [(subwords[i], subwords[i + 1]) for i in range(len(subwords) - 1)]
                if not pairs:
                    break
                # Find applicable merges
                applicable_pairs = [p for p in pairs if p in self.merges]
                if not applicable_pairs:
                    break
                # Apply the first applicable merge
                pair = applicable_pairs[0]
                new_token = self.merges[pair]
                # Update subwords
                new_subwords = []
                i = 0
                while i < len(subwords):
                    if i < len(subwords) - 1 and (subwords[i], subwords[i + 1]) == pair:
                        new_subwords.append(new_token)
                        i += 2
                    else:
                        new_subwords.append(subwords[i])
                        i += 1
                subwords = new_subwords
            # Convert subwords to token IDs (unknown subwords map to <unk>)
            result.extend(self.vocab.get(subword, self.vocab["<unk>"]) for subword in subwords)
        return result

    def decode(self, token_ids):
        # Note: whitespace is not restored, because encode() splits on it
        return ''.join(self.id_to_token.get(id, "<unk>") for id in token_ids)
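A quick usage sketch of the class above (the training corpus is arbitrary, and the exact merges depend on it):

tokenizer = SimpleBPETokenizer()
tokenizer.train(["low lower lowest", "new newer newest"], vocab_size=40)

ids = tokenizer.encode("lowest newest")
print(ids)                    # token IDs, depending on the learned merges
print(tokenizer.decode(ids))  # 'lowestnewest' -- this simple decode drops spaces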
Tokenization Impact on Model Context Windows
Tokenization efficiency directly affects how much content fits in a model's context window:
| Content Type | Characters | GPT-4 Tokens | Claude Tokens | Ratio (Chars/Tokens) |
|---|---|---|---|---|
| English Text | 1000 | ~200 | ~210 | ~5.0 |
| Chinese Text | 1000 | ~500 | ~510 | ~2.0 |
| Python Code | 1000 | ~260 | ~280 | ~3.7 |
| JSON Data | 1000 | ~230 | ~240 | ~4.3 |
| Base64 Data | 1000 | ~330 | ~340 | ~3.0 |
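Ratios like these are rough averages. You can measure the characters-per-token ratio of your own content directly; a small sketch using tiktoken's cl100k_base encoding (other tokenizers will give different numbers):

import tiktoken

def chars_per_token(text, encoding_name="cl100k_base"):
    # Ratio of characters to tokens for a given tiktoken encoding
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(text))
    return len(text) / max(n_tokens, 1)

sample = "Tokenization efficiency varies between English prose, code, and CJK text."
print(round(chars_per_token(sample), 2))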
Tokenization Challenges
1. Language Bias
Most LLMs were initially trained with tokenizers optimized for English, leading to inefficiencies in other languages:
English: "Understanding" → 1 token
German: "Verständnis" → 3 tokens (Ver/ständ/nis)
Japanese: "理解" → 2 tokens (理/解)
2. Out-of-Distribution Text
Strings that rarely appear in the training data, such as long runs of repeated characters, tokenize unpredictably and often inefficiently:
Normal text: "Hello world" → 2 tokens
Repeated characters: "aaaaaaaaaaaa" → 4-6 tokens
3. Code and Technical Content
Special characters and syntax in code can result in suboptimal tokenization:
code = "result = list(map(lambda x: x**2, range(10)))"
# Tokenized inefficiently:
['result', 'Ġ=', 'Ġlist', '(', 'map', '(', 'lambda', 'Ġx', ':', 'Ġx', '**', '2', ',', 'Ġrange', '(', '10', ')', ')', ')']
# 19 tokens
Best Practices for Token Efficiency
1. Batch Processing
Group related requests to minimize repetitive system prompts:
# Inefficient: 2 separate calls
response1 = llm("Translate 'hello' to French") # ~10 tokens
response2 = llm("Translate 'goodbye' to French") # ~10 tokens
# Efficient: 1 batch call
response = llm("Translate these words to French:\n1. hello\n2. goodbye") # ~15 tokens
2. Prompt Engineering for Token Efficiency
# Inefficient prompt (~44 tokens)
inefficient = """Please analyze the sentiment of the following customer review.
The sentiment should be classified as positive, negative, or neutral.
Here is the review: 'Great product, fast shipping!'"""
# Efficient prompt (~21 tokens)
efficient = """Sentiment (positive/negative/neutral):
Review: 'Great product, fast shipping!'"""
3. Content Preprocessing
import re

def preprocess_for_token_efficiency(text):
    # Remove redundant whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Replace verbose phrases with concise alternatives
    replacements = {
        "in order to": "to",
        "a large number of": "many",
        "due to the fact that": "because",
        "at this point in time": "now"
    }
    for verbose, concise in replacements.items():
        text = re.sub(r'\b' + verbose + r'\b', concise, text, flags=re.IGNORECASE)
    return text
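A quick before/after with the helper above:

before = "In order to   reduce cost, remove a large number of filler words."
print(preprocess_for_token_efficiency(before))
# to reduce cost, remove many filler words.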
Comparing Tokenizer Implementations
Most LLM providers expose their tokenizers as libraries:
# OpenAI (GPT-4)
from tiktoken import get_encoding
tokenizer = get_encoding("cl100k_base")  # GPT-4 tokenizer
tokens = tokenizer.encode("Hello, world!")
print(len(tokens))  # 4

# Hugging Face Transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hello, world!")["input_ids"]
print(len(tokens))  # 4

# SentencePiece
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load("t5.model")  # path to a trained SentencePiece model file
tokens = sp.encode("Hello, world!", out_type=str)
print(len(tokens))  # 5
Building Token-Aware Applications
Token Estimation Function
def estimate_tokens(text, model="gpt-4"):
    if model.startswith(("gpt-3", "gpt-4")):
        # OpenAI models: ~4 chars per token for English text
        return len(text) // 4 + 1
    elif model.startswith("claude"):
        # Claude models: ~3.5 chars per token for English
        return int(len(text) / 3.5) + 1
    elif model.startswith(("llama", "mistral")):
        # Open-source models: ~4.2 chars per token
        return int(len(text) / 4.2) + 1
    else:
        # Conservative default
        return len(text) // 3 + 1
Chunking Long Documents for Processing
import re

def chunk_text_by_tokens(text, max_tokens=1000, overlap=100):
    # Estimate token counts from character counts (simplified heuristic)
    avg_chars_per_token = 4
    # Split into sentences first for cleaner breaks
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_tokens = len(sentence) // avg_chars_per_token + 1
        # If adding this sentence would exceed max tokens
        if current_length + sentence_tokens > max_tokens and current_chunk:
            # Save current chunk
            chunks.append(' '.join(current_chunk))
            # Start new chunk with overlap (approximate sentence-level overlap)
            overlap_sent = current_chunk[-(overlap // 10):]
            current_chunk = overlap_sent + [sentence]
            current_length = sum(len(s) // avg_chars_per_token + 1 for s in current_chunk)
        else:
            # Add sentence to current chunk
            current_chunk.append(sentence)
            current_length += sentence_tokens
    # Add the last chunk if not empty
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
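Usage is straightforward: pass in a long document and you get back a list of overlapping chunks to process one at a time (the text and limits below are just illustrative):

long_text = "This is a sentence. " * 500  # stand-in for a long document
chunks = chunk_text_by_tokens(long_text, max_tokens=200, overlap=20)
print(len(chunks), "chunks")
print(chunks[0][:80], "...")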
Impact on API Costs
Different tokenization approaches directly impact API usage costs:
| Provider | Input Price | Output Price | Tokenization Approach |
|---|---|---|---|
| OpenAI (GPT-4) | $30 / 1M tokens | $60 / 1M tokens | BPE (cl100k_base) |
| Anthropic (Claude) | $8 / 1M tokens | $24 / 1M tokens | Proprietary (BPE-based) |
| Cohere | $1 / 1M tokens | $2 / 1M tokens | Proprietary (BPE-based) |
| AI21 Labs | $10 / 1M tokens | $15 / 1M tokens | Proprietary (BPE-based) |
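Combining a token count with the table above gives a simple cost estimator. The prices are the ones listed above and will drift over time, so treat this as a sketch rather than a billing tool:

# Prices in USD per 1M tokens, taken from the table above (subject to change)
PRICING = {
    "gpt-4":  {"input": 30.0, "output": 60.0},
    "claude": {"input": 8.0,  "output": 24.0},
    "cohere": {"input": 1.0,  "output": 2.0},
    "ai21":   {"input": 10.0, "output": 15.0},
}

def estimate_cost(input_tokens, output_tokens, provider="gpt-4"):
    # Linear cost model: tokens / 1M * price per 1M
    prices = PRICING[provider]
    return (input_tokens / 1e6) * prices["input"] + (output_tokens / 1e6) * prices["output"]

print(f"${estimate_cost(50_000, 10_000, 'gpt-4'):.2f}")  # $2.10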
Learn more about how Matter AI helps improve code quality across multiple languages in pull requests: https://docs.matterai.dev/product/code-quality
Looking for a way to improve your code review process? Learn how Matter AI helps teams solve code review challenges with AI: https://matterai.so