LLM Tokenization: Fundamentals and How It Works

Vatsal Bajpai

How Does Tokenization Work in LLMs?

Tokenization is the critical first step in the processing pipeline of Large Language Models (LLMs). It transforms raw text into numerical representations that models can understand and process. While often overlooked, tokenization significantly impacts model performance, efficiency, and even cost. This deep dive explores how tokenization works in modern LLMs, with practical examples and insights for engineers.

The Fundamentals of Tokenization

At its core, tokenization is the process of breaking text into smaller units called tokens. These tokens are then converted to numerical IDs used by the model for processing. The choice of tokenization strategy directly affects:

  • Model performance and capabilities
  • Context window limitations
  • Processing efficiency
  • Handling of multilingual content
  • Cost of API usage (typically billed per token)
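
Concretely, the text → tokens → IDs round trip looks like this (a minimal sketch using OpenAI's tiktoken library; any tokenizer exposes the same encode/decode operations):

import tiktoken

# Load the GPT-4 vocabulary (cl100k_base) and round-trip a string
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization turns text into IDs.")
print(ids)              # a list of integer token IDs
print(enc.decode(ids))  # "Tokenization turns text into IDs."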

Evolution of Tokenization Approaches

Character-Level Tokenization

The simplest approach tokenizes each character individually:

def char_tokenize(text):
    return list(text)

tokens = char_tokenize("Hello")
# ['H', 'e', 'l', 'l', 'o']

Advantages:

  • Simple implementation
  • No out-of-vocabulary issues
  • Small, fixed vocabulary size

Disadvantages:

  • Very inefficient (long sequences)
  • Loses word-level semantics
  • Poor handling of spaces and formatting

Word-Level Tokenization

Word tokenization splits text at whitespace (more sophisticated variants also separate punctuation):

def word_tokenize(text):
    return text.split()

tokens = word_tokenize("Hello, world!")
# ['Hello,', 'world!']

Advantages:

  • Intuitive alignment with linguistic units
  • Shorter sequences than character tokenization

Disadvantages:

  • Large vocabulary requirements
  • Out-of-vocabulary words (illustrated in the sketch below)
  • Poor handling of morphologically rich languages
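
A minimal sketch of the out-of-vocabulary problem (the tiny vocabulary is illustrative): any word not seen during training has no ID and collapses to a catch-all <unk> token, losing information.

# Illustrative word-level vocabulary built from training data
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def word_ids(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(word_ids("the cat sat"))   # [1, 2, 3]
print(word_ids("the cats sat"))  # [1, 0, 3] -- "cats" collapses to <unk>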

Subword Tokenization (Modern Approach)

Most modern LLMs use subword tokenization, which balances efficiency and flexibility. The three dominant algorithms are:

1. Byte Pair Encoding (BPE)

Used by GPT models, BPE starts with individual characters and iteratively merges the most frequent pairs:

# Simplified BPE training process
import collections

def train_bpe(corpus, vocab_size):
    # corpus: an iterable of words (text that has already been split)
    # Start with the character vocabulary
    vocab = set(char for word in corpus for char in word)
    
    # Initialize each word as a sequence of characters
    splits = {word: list(word) for word in set(corpus)}
    
    # Merge the most frequent pair until vocab_size is reached
    while len(vocab) < vocab_size:
        # Count pair frequencies across all words
        pairs = collections.Counter()
        for word_splits in splits.values():
            for i in range(len(word_splits) - 1):
                pair = (word_splits[i], word_splits[i + 1])
                pairs[pair] += 1
        
        # Find the most frequent pair
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        
        # Create a new token from the pair
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
        
        # Update splits to use the new token
        for word_splits in splits.values():
            i = 0
            while i < len(word_splits) - 1:
                if (word_splits[i], word_splits[i + 1]) == best_pair:
                    word_splits[i:i+2] = [new_token]
                else:
                    i += 1
    
    return vocab
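
A toy run of the trainer above (the word list is illustrative; production implementations also weight each pair by how often its word occurs):

corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
vocab = train_bpe(corpus, vocab_size=15)
print(sorted(vocab))
# single characters plus merged units such as 'lo' and 'low'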

2. WordPiece

Used by BERT-like models, WordPiece scores candidate merges by how much they improve the likelihood of the training data rather than by raw pair frequency:

# Conceptual WordPiece algorithm
def train_wordpiece(corpus, vocab_size):
    # Start with base vocabulary (e.g., characters)
    vocab = set(char for text in corpus for char in text)
    
    while len(vocab) < vocab_size:
        best_score = -float('inf')
        best_pair = None
        
        # Evaluate all candidate pairs (helper functions are conceptual placeholders)
        for pair in generate_candidate_pairs(corpus, vocab):
            # Likelihood improvement if this pair is merged
            # (in practice approximated as freq(pair) / (freq(left) * freq(right)))
            score = calculate_likelihood_improvement(corpus, vocab, pair)
            
            if score > best_score:
                best_score = score
                best_pair = pair
        
        if best_pair is None:
            break
            
        # Merge the best pair
        new_token = best_pair[0] + best_pair[1]
        vocab.add(new_token)
    
    return vocab
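
The "likelihood improvement" is commonly approximated by scoring a candidate pair as freq(pair) / (freq(left) × freq(right)), so merges are preferred when two units co-occur far more often than chance. A minimal sketch of such a scoring helper (the function name, data layout, and toy corpus are illustrative):

import collections

def score_pairs(splits, word_freqs):
    # splits: {word: [current subword units]}; word_freqs: {word: count in corpus}
    unit_freqs = collections.Counter()
    pair_freqs = collections.Counter()
    for word, units in splits.items():
        freq = word_freqs[word]
        for unit in units:
            unit_freqs[unit] += freq
        for left, right in zip(units, units[1:]):
            pair_freqs[(left, right)] += freq
    # Pairs that co-occur far more often than their parts would suggest score highest
    return {
        (left, right): count / (unit_freqs[left] * unit_freqs[right])
        for (left, right), count in pair_freqs.items()
    }

splits = {"hugging": list("hugging"), "hugs": list("hugs")}
freqs = {"hugging": 10, "hugs": 5}
scores = score_pairs(splits, freqs)
print(max(scores, key=scores.get))  # ('i', 'n') -- rare units that always co-occur win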

3. SentencePiece / Unigram LM

Used by T5 and many multilingual models, Unigram starts with a large candidate vocabulary and iteratively prunes the tokens whose removal hurts the training-data likelihood the least:

# Conceptual Unigram algorithm (helper functions are conceptual placeholders)
def train_unigram(corpus, vocab_size, initial_vocab_multiplier=10):
    # Start with a large seed vocabulary (e.g., all frequent substrings in the corpus)
    initial_vocab = create_initial_vocab(corpus, vocab_size * initial_vocab_multiplier)
    current_vocab = initial_vocab
    
    while len(current_vocab) > vocab_size:
        # Calculate loss contribution of each token
        token_scores = {}
        for token in current_vocab:
            # Measure how much the likelihood would decrease if we remove this token
            score = calculate_loss_if_removed(corpus, current_vocab, token)
            token_scores[token] = score
        
        # Sort tokens by their score (lower = less important)
        sorted_tokens = sorted(token_scores.items(), key=lambda x: x[1])
        
        # Remove worst performing tokens (e.g., 20% each iteration)
        num_to_remove = max(1, int(len(current_vocab) * 0.2))
        tokens_to_remove = [t[0] for t in sorted_tokens[:num_to_remove]]
        
        # Update vocabulary
        current_vocab = current_vocab - set(tokens_to_remove)
    
    return current_vocab
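
At inference time, a trained Unigram model picks the segmentation with the highest total log-probability under the learned token distribution. A minimal dynamic-programming sketch (the toy vocabulary and probabilities are illustrative):

import math

def unigram_segment(text, token_logprobs):
    # best[i] holds (score, segmentation) for the prefix text[:i]
    best = [(0.0, [])] + [(-math.inf, []) for _ in text]
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in token_logprobs:
                score = best[start][0] + token_logprobs[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

# Toy vocabulary with illustrative log-probabilities
logp = {"un": -2.0, "related": -3.0, "rel": -4.0, "ated": -4.0, "u": -6.0, "n": -6.0}
print(unigram_segment("unrelated", logp))  # ['un', 'related']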

Practical Tokenization Examples

Let's see how different modern tokenizers handle the same text (the splits shown are illustrative and vary by tokenizer version):

Example Input: "The tokenization process transforms text into numerical tokens."

GPT-2/3/4 Tokenizer (BPE)

['The', 'Ġtokenization', 'Ġprocess', 'Ġtransforms', 'Ġtext', 'Ġinto', 'Ġnumerical', 'Ġtokens', '.']
# 9 tokens (Ġ represents space)

BERT Tokenizer (WordPiece)

['the', 'token', '##ization', 'process', 'transform', '##s', 'text', 'into', 'numerical', 'token', '##s', '.']
# 12 tokens (## represents subword continuation)

T5 Tokenizer (SentencePiece)

['▁The', '▁token', 'ization', '▁process', '▁transforms', '▁text', '▁into', '▁numerical', '▁tokens', '.']
# 10 tokens (▁ represents space)
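
You can inspect segmentations like these yourself with the Hugging Face transformers library (exact splits and counts vary by model and tokenizer version):

from transformers import AutoTokenizer

text = "The tokenization process transforms text into numerical tokens."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(text))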

Tokenization of Special Cases

1. Handling Numbers

text = "The price is $1,234.56"

# GPT-4 Tokenizer
['The', 'Ġprice', 'Ġis', 'Ġ$', '1', ',', '234', '.', '56']
# 9 tokens

# Claude Tokenizer
['The', ' price', ' is', ' $', '1', ',', '234', '.', '56']
# 9 tokens

2. Handling Emojis

text = "I love coding! 👨‍💻"

# GPT-4 Tokenizer
['I', 'Ġlove', 'Ġcoding', '!', 'Ġ', '<0xF0>', '<0x9F>', '<0x91>', '<0xA8>', '<0xE2>', '<0x80>', '<0x8D>', '<0xF0>', '<0x9F>', '<0x92>', '<0xBB>']
# 16 tokens (UTF-8 bytes)

# Claude Tokenizer
['I', ' love', ' coding', '!', ' ', '👨', '‍', '💻']
# 8 tokens

3. Handling Code

code = "def hello_world():\n    print('Hello, world!')"

# GPT-4 Tokenizer
['def', 'Ġhello', '_', 'world', '():', '\n', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '(', "'", 'Hello', ',', 'Ġworld', '!', "'", ')']
# 18 tokens

# CodeLlama Tokenizer (optimized for code)
['def', ' hello_world', '():', '\n    ', 'print', '(', "'", 'Hello', ',', ' world', '!', "'", ')']
# 13 tokens

Byte-Level Tokenization

Some modern models like GPT-4 use byte-level BPE, which ensures every possible string can be encoded:

# Every string can be represented as bytes
def bytes_tokenize(text):
    return list(text.encode('utf-8'))

tokens = bytes_tokenize("Hello 🌍")
# [72, 101, 108, 108, 111, 32, 240, 159, 140, 141]
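
Because each token is simply a byte value (0-255), the mapping is lossless and trivially reversible:

def bytes_detokenize(byte_tokens):
    # Reassemble the raw bytes and decode them as UTF-8
    return bytes(byte_tokens).decode('utf-8')

print(bytes_detokenize([72, 101, 108, 108, 111, 32, 240, 159, 140, 141]))
# "Hello 🌍"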

Implementing a Simple BPE Tokenizer

Here's a simplified but functional BPE tokenizer implementation:

import collections

class SimpleBPETokenizer:
    def __init__(self):
        self.vocab = {}  # token -> id
        self.merges = {}  # pair -> new token
        
    def train(self, texts, vocab_size=1000, min_frequency=2):
        # Count word frequencies (join texts with a space so words don't run together)
        word_freqs = collections.Counter(' '.join(texts).split())
        chars = set(''.join(word_freqs.keys()))
        
        # Initialize vocabulary with an unknown token plus all characters
        self.vocab = {"<unk>": 0}
        self.vocab.update({c: i + 1 for i, c in enumerate(sorted(chars))})
        next_id = len(self.vocab)
        
        # Initialize splits of each word
        splits = {word: list(word) for word in word_freqs.keys()}
        
        # Perform merges
        num_merges = vocab_size - len(self.vocab)
        for _ in range(num_merges):
            # Count pair frequencies
            pair_freqs = collections.Counter()
            for word, freq in word_freqs.items():
                word_split = splits[word]
                for i in range(len(word_split) - 1):
                    pair = (word_split[i], word_split[i + 1])
                    pair_freqs[pair] += freq
            
            # Find most frequent pair
            if not pair_freqs:
                break
                
            best_pair = max(pair_freqs, key=pair_freqs.get)
            if pair_freqs[best_pair] < min_frequency:
                break
                
            # Create new token from pair
            new_token = best_pair[0] + best_pair[1]
            self.vocab[new_token] = next_id
            next_id += 1
            
            # Record the merge
            self.merges[best_pair] = new_token
            
            # Update splits for all words
            for word in splits:
                split = splits[word]
                i = 0
                while i < len(split) - 1:
                    if (split[i], split[i + 1]) == best_pair:
                        split[i:i+2] = [new_token]
                    else:
                        i += 1
        
        # Create reverse vocabulary for decoding
        self.id_to_token = {id: token for token, id in self.vocab.items()}
        
    def encode(self, text):
        words = text.split()
        result = []
        
        for word in words:
            # Start with characters
            subwords = list(word)
            
            # Apply merges
            while True:
                pairs = [(subwords[i], subwords[i+1]) for i in range(len(subwords)-1)]
                if not pairs:
                    break
                
                # Find applicable merge
                applicable_pairs = [p for p in pairs if p in self.merges]
                if not applicable_pairs:
                    break
                    
                # Apply the earliest-learned applicable merge (BPE merge priority)
                pair = min(applicable_pairs, key=list(self.merges).index)
                new_token = self.merges[pair]
                
                # Update subwords
                new_subwords = []
                i = 0
                while i < len(subwords):
                    if i < len(subwords) - 1 and (subwords[i], subwords[i+1]) == pair:
                        new_subwords.append(new_token)
                        i += 2
                    else:
                        new_subwords.append(subwords[i])
                        i += 1
                subwords = new_subwords
            
            # Convert subwords to token IDs
            result.extend([self.vocab.get(subword, self.vocab.get("<unk>", 0)) for subword in subwords])
        
        return result
    
    def decode(self, token_ids):
        # Note: whitespace is discarded during encoding, so this toy decoder cannot restore it
        return ''.join(self.id_to_token.get(id, "<unk>") for id in token_ids)
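
A quick smoke test of the class above on a tiny corpus (the exact IDs depend on the learned merges):

texts = ["low lower lowest", "new newer newest", "low low new new"]
tokenizer = SimpleBPETokenizer()
tokenizer.train(texts, vocab_size=40, min_frequency=2)

ids = tokenizer.encode("lowest newest")
print(ids)                    # IDs of the learned subwords
print(tokenizer.decode(ids))  # "lowestnewest" -- this toy version drops whitespace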

Tokenization Impact on Model Context Windows

Tokenization efficiency directly affects how much content fits in a model's context window:

Content Type    Characters    GPT-4 Tokens    Claude Tokens    Ratio (Chars/Token)
English Text    1000          ~200            ~210             ~5.0
Chinese Text    1000          ~500            ~510             ~2.0
Python Code     1000          ~260            ~280             ~3.7
JSON Data       1000          ~230            ~240             ~4.3
Base64 Data     1000          ~330            ~340             ~3.0
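
These ratios are rough averages; you can measure them for your own content with any tokenizer, for example with tiktoken (a minimal sketch):

import tiktoken

def chars_per_token(text, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    return len(text) / max(1, len(enc.encode(text)))

sample = "Tokenization efficiency varies widely across content types."
print(round(chars_per_token(sample), 1))  # higher values mean more characters packed into each token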

Tokenization Challenges

1. Language Bias

Most LLMs were initially trained with tokenizers optimized for English, leading to inefficiencies in other languages:

English: "Understanding" → 1 token
German: "Verständnis" → 3 tokens (Ver/ständ/nis)
Japanese: "理解" → 2 tokens (理/解)

2. Out-of-Distribution Text

Text that rarely appears in the tokenizer's training data fragments into more tokens than its length suggests, because few merges were learned for it:

Normal text: "Hello world" → 2 tokens
Repeated characters: "aaaaaaaaaaaa" → 4-6 tokens
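
You can observe this directly by counting tokens for common versus unusual strings (a quick check with tiktoken; exact counts depend on the vocabulary, so none are hard-coded here):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["Hello world", "aaaaaaaaaaaa", "zxqvjkplmwrt"]:
    print(repr(s), "->", len(enc.encode(s)), "tokens")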

3. Code and Technical Content

Special characters and syntax in code can result in suboptimal tokenization:

code = "result = list(map(lambda x: x**2, range(10)))"

# Tokenized inefficiently:
['result', 'Ġ=', 'Ġlist', '(', 'map', '(', 'lambda', 'Ġx', ':', 'Ġx', '**', '2', ',', 'Ġrange', '(', '10', ')', ')', ')']
# 19 tokens

Best Practices for Token Efficiency

1. Batch Processing

Group related requests to minimize repetitive system prompts:

# Inefficient: 2 separate calls
response1 = llm("Translate 'hello' to French")  # ~10 tokens
response2 = llm("Translate 'goodbye' to French")  # ~10 tokens

# Efficient: 1 batch call
response = llm("Translate these words to French:\n1. hello\n2. goodbye")  # ~15 tokens

2. Prompt Engineering for Token Efficiency

# Inefficient prompt (~44 tokens)
inefficient = """Please analyze the sentiment of the following customer review. 
The sentiment should be classified as positive, negative, or neutral. 
Here is the review: 'Great product, fast shipping!'"""

# Efficient prompt (~21 tokens)
efficient = """Sentiment (positive/negative/neutral):
Review: 'Great product, fast shipping!'"""

3. Content Preprocessing

import re

def preprocess_for_token_efficiency(text):
    # Remove redundant whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Replace verbose phrases with concise alternatives
    replacements = {
        "in order to": "to",
        "a large number of": "many",
        "due to the fact that": "because",
        "at this point in time": "now"
    }
    
    for verbose, concise in replacements.items():
        text = re.sub(r'\b' + verbose + r'\b', concise, text, flags=re.IGNORECASE)
    
    return text
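
For example, applying the helper above (output assumes the replacement table shown):

raw = "We   did this   in order to reduce latency, due to the fact that tokens cost money."
print(preprocess_for_token_efficiency(raw))
# "We did this to reduce latency, because tokens cost money."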

Comparing Tokenizer Implementations

Many LLM providers and frameworks expose their tokenizers as libraries:

# OpenAI (GPT-4)
from tiktoken import get_encoding
tokenizer = get_encoding("cl100k_base")  # GPT-4 tokenizer
tokens = tokenizer.encode("Hello, world!")
print(len(tokens))  # 4

# Hugging Face Transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hello, world!")["input_ids"]
print(len(tokens))  # 4 (no special tokens are added by default)

# SentencePiece
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load("t5.model")
tokens = sp.encode("Hello, world!", out_type=str)
print(len(tokens))  # 5

Building Token-Aware Applications

Token Estimation Function

def estimate_tokens(text, model="gpt-4"):
    if model.startswith(("gpt-3", "gpt-4")):
        # OpenAI models: ~4 chars per token for English text
        return len(text) // 4 + 1
    elif model.startswith("claude"):
        # Claude models: ~3.5 chars per token for English
        return int(len(text) / 3.5) + 1
    elif model.startswith(("llama", "mistral")):
        # Open-source models: ~4.2 chars per token
        return int(len(text) / 4.2) + 1
    else:
        # Conservative default
        return len(text) // 3 + 1
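
For example, with the heuristics above:

prompt = "Summarize the following meeting notes in three bullet points."
print(estimate_tokens(prompt, model="gpt-4"))   # 16, using the ~4-chars-per-token heuristic
print(estimate_tokens(prompt, model="claude"))  # a slightly higher estimate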

Chunking Long Documents for Processing

import re

def chunk_text_by_tokens(text, max_tokens=1000, overlap=100):
    # Rough heuristic: ~4 characters per token for English text
    avg_chars_per_token = 4
    
    # Split into sentences first for cleaner breaks
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    chunks = []
    current_chunk = []
    current_length = 0
    
    for sentence in sentences:
        sentence_tokens = len(sentence) // avg_chars_per_token + 1
        
        # If adding this sentence would exceed max tokens
        if current_length + sentence_tokens > max_tokens and current_chunk:
            # Save current chunk
            chunks.append(' '.join(current_chunk))
            
            # Start new chunk with overlap
            overlap_sent = current_chunk[-(overlap//10):]  # Approximate sentence overlap
            current_chunk = overlap_sent + [sentence]
            current_length = sum(len(s) // avg_chars_per_token + 1 for s in current_chunk)
        else:
            # Add sentence to current chunk
            current_chunk.append(sentence)
            current_length += sentence_tokens
    
    # Add the last chunk if not empty
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks
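
For example, on a synthetic document (the sentence template is illustrative):

doc = " ".join("This is sentence number %d of a long report." % i for i in range(200))
chunks = chunk_text_by_tokens(doc, max_tokens=300, overlap=30)
print(len(chunks), "chunks")
print(chunks[0][:80])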

Impact on API Costs

Different tokenization approaches directly impact API usage costs:

Provider             Input Price       Output Price      Tokenization Approach
OpenAI (GPT-4)       $30 / 1M tokens   $60 / 1M tokens   BPE (cl100k_base)
Anthropic (Claude)   $8 / 1M tokens    $24 / 1M tokens   Proprietary (BPE-based)
Cohere               $1 / 1M tokens    $2 / 1M tokens    Proprietary (BPE-based)
AI21 Labs            $10 / 1M tokens   $15 / 1M tokens   Proprietary (BPE-based)
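
A small helper turns these per-million-token rates into a per-request estimate (the prices are taken from the table above and change frequently, so treat them as placeholders):

def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    # Prices are quoted per 1M tokens
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Example: a 2,000-token prompt and a 500-token reply at the GPT-4 rates above
print(estimate_cost(2_000, 500, 30, 60))  # 0.09 (dollars)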

Learn more about how Matter AI helps improve code quality across multiple languages in Pull Requests: https://docs.matterai.dev/product/code-quality

Are you looking for a way to improve your code review process? Learn more about how Matter AI helps teams solve code review challenges with AI: https://matterai.so
