The Hidden Layer: What LLMs Actually “See”
When you send this to an LLM:
"Hello, world! How are you?"The model never sees those characters. It sees this:
[1, 9906, 11, 1917, 0, 2650, 527, 499, 7673]Tokenizers are the translators. They convert human text into numbers that neural networks can process.
Human Text → Tokenizer → Numbers → LLM → Numbers → Tokenizer → TextWithout tokenizers, LLMs couldn’t work. Without understanding tokenizers, you can’t truly understand how LLMs work.
This post is about what happens in that first arrow.
Part 1: Why Tokenizers Exist
The Fundamental Problem
LLMs have a finite vocabulary. GPT-4 has ~100,000 tokens. Claude has ~100,000 tokens. These numbers are fixed at training time.
The number of possible texts in the world is infinite.
How do you represent infinite text with finite vocabulary?
Why Not Character-Level Tokenization?
Naive approach: Tokenize every character.
"hello" → [h, e, l, l, o] → 5 tokensProblem: This is inefficient.
- Explosion of tokens: A 100-page book becomes millions of tokens. More tokens = longer sequences = slower training and inference
- Loss of meaning: The model sees characters, not words or subwords. It has to learn that “h-e-l-l-o” means greeting
- Performance degradation: Transformer attention compares every token with every other token, so compute grows quadratically with sequence length. Doubling the tokens means roughly 4x more attention computation
Real comparison:
```
Character-level: "Hello, world!" = 13 tokens
Word-level:      "Hello, world!" = 3 tokens
Subword-level:   "Hello, world!" = 4 tokens
```

Character-level loses. Every space and every punctuation mark becomes its own token. Wasteful.
Why Not Word-Level Tokenization?
Better approach: One token per word.
"hello world" → ["hello", "world"] → 2 tokensProblem: Vocabulary explosion.
English has ~170,000 words. Add slang, names, typos, abbreviations, numbers, special characters, and you need 500,000+ tokens. Training becomes expensive. Storage becomes huge.
Also, what about “unhappily”? Is it one token or should it be split into [“un”, “happy”, “ly”] to capture structure?
The Goldilocks Solution: Subword Tokenization
Balance between character and word level.
"unhappily" → ["un", "happy", "##ly"] → 3 tokens"running" → ["run", "##ning"] → 2 tokens"hello" → ["hello"] → 1 token (common word)Subwords:
- Keep vocabulary manageable (~50,000 tokens)
- Reduce sequence length
- Preserve linguistic structure
- Handle unknown words
This is where Byte Pair Encoding (BPE) comes in.
Note
The Trade-off: BPE doesn’t magically solve the problem. It’s a compromise. Different languages, domains, and use cases need different tokenizers. GPT uses BPE. BERT uses WordPiece (similar idea). Others use SentencePiece.
Part 2: Byte Pair Encoding (BPE) Explained
The Core Idea
BPE is simple: Iteratively find the most frequent pair of tokens and merge them.
Example:
Initial text: “low w low w w w” (broken into characters; spaces dropped for simplicity)

```
Iteration 1: "l o w w l o w w w w"
Most frequent pair: "w w" (appears 3 times)
Replace with new token "X": "l o X l o X X"

Iteration 2: "l o X l o X X"
Most frequent pair: "o X" (appears 2 times)
Replace with new token "Y": "l Y l Y X"

Iteration 3: "l Y l Y X"
Most frequent pair: "l Y" (appears 2 times)
Replace with new token "Z": "Z Z X"
```

You repeat this process, building a merge dictionary:

```
Merge 1: "w w" → token 256
Merge 2: "o (token 256)" → token 257
Merge 3: "l (token 257)" → token 258
```

After N merges, you have a vocabulary of 256 (base bytes) + N (merges). If N = 50,000, vocabulary = 50,256.
Why BPE Works
- Frequency-based: Common sequences get their own tokens. “the”, “ing”, “ed” become single tokens
- Rare sequences stay split: Uncommon combinations remain as character sequences
- Handles unknown words: Any word, no matter how rare, can be tokenized by its subwords
- Data-driven: The merge dictionary is learned from training data, capturing language-specific patterns
Part 3: Implementing a Tokenizer
Step 1: Start With UTF-8 Bytes
const str = "Hello, world!";const bytes = [...Buffer.from(str, 'utf-8')];// bytes = [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]// (H, e, l, l, o, , , w, o, r, l, d, !)Every character is now a number (its UTF-8 byte value).
Step 2: Find Most Frequent Pair (getPairStats)
```ts
function getPairStats(data: number[]) {
  const stats: Record<string, number | undefined> = {};

  // Count all adjacent pairs
  for (let i = 0; i < data.length - 1; i++) {
    const pair = `${data[i]}-${data[i + 1]}`;
    stats[pair] = (stats[pair] ?? 0) + 1;
  }

  // Return sorted by frequency (highest first)
  return [...Object.entries(stats)]
    .map(([pair, count]) => [count, pair.split('-').map(Number)] as [number, number[]])
    .sort((a, b) => b[0] - a[0]);
}
```

This counts adjacent byte pairs and ranks them by frequency.
Example:
```
bytes = [72, 101, 108, 108, 111, ...]

Pairs:
  72-101:  1 time (H-e)
  101-108: 1 time (e-l)
  108-108: 1 time (l-l)
  108-111: 1 time (l-o)
  ...
```

If a pair appears multiple times, it gets a higher count.
Step 3: Merge the Most Frequent Pair (performTokenSwapping)
```ts
function performTokenSwapping({ tokens, mergePair, newTokenId }) {
  let result = [...tokens];

  // Find and replace all instances of mergePair
  for (let i = 0; i < result.length - 1; i++) {
    if (result[i] === mergePair[0] && result[i + 1] === mergePair[1]) {
      result[i] = newTokenId;  // Replace with new ID
      result[i + 1] = null;    // Mark for deletion
    }
  }

  // Remove nulls
  return result.filter(t => t != null);
}
```

If the most frequent pair is (108, 108) — “ll” — you replace it with token ID 256.
```
Before: [108, 108, 111]  (l, l, o)
After:  [256, 111]       (ll, o)
```

Step 4: Repeat Until Target Vocabulary Size
```ts
const sizeOfVocab = 300;                      // Target vocabulary size
const iterationsRequired = sizeOfVocab - 256; // 44 iterations

// Start with 256 base tokens (0-255 are UTF-8 bytes)
// Perform 44 merges to reach 300 tokens total
for (let i = 0; i < iterationsRequired; i++) {
  const mostFrequentPair = getPairStats(tokensToOperateOn)[0][1];
  const newTokenId = 256 + i;

  tokensToOperateOn = performTokenSwapping({
    tokens: tokensToOperateOn,
    mergePair: mostFrequentPair,
    newTokenId
  });

  mergeDictOrdered.push([`${mostFrequentPair[0]}-${mostFrequentPair[1]}`, newTokenId]);
}
```

After 44 iterations, you have 300 unique tokens and a merge dictionary.
Step 5: Encoding (Text → Tokens)
```ts
function encode(str: string) {
  let bytes: (number | null)[] = [...Buffer.from(str, 'utf-8')];

  // Apply merges in the order they were learned
  for (const [pair, newTokenId] of mergeDictOrdered) {
    const [b1, b2] = pair.split('-').map(Number);

    // Replace all occurrences of this pair
    for (let i = 0; i < bytes.length - 1; i++) {
      if (bytes[i] === b1 && bytes[i + 1] === b2) {
        bytes[i] = newTokenId;
        bytes[i + 1] = null; // Mark for deletion
        i++;                 // Skip the slot we just nulled
      }
    }

    // Drop the nulls before applying the next merge,
    // so later merges see their tokens as adjacent
    bytes = bytes.filter(t => t != null);
  }

  return bytes as number[];
}
```

When you encode new text, you apply the merge dictionary in order of priority.
Example:
Text: "hello"Bytes: [104, 101, 108, 108, 111]
If merge 1 was (108, 108) → 256:After merge 1: [104, 101, 256, 111]
If merge 2 was (256, 111) → 257:After merge 2: [104, 101, 257]
If merge 3 was (101, 257) → 258:After merge 3: [104, 258]
Final tokens: [104, 258]Step 6: Decoding (Tokens → Text)
```ts
function decode(tokens: number[]) {
  const bytes = [...tokens];

  // Build reverse dictionary: merged token ID → the pair it replaced
  const reverseDict: Record<number, { n1: number; n2: number }> = {};
  for (const [pair, newTokenId] of mergeDictOrdered) {
    const [n1, n2] = pair.split('-').map(Number);
    reverseDict[newTokenId] = { n1, n2 };
  }

  // Expand merged tokens back into their components
  // (re-checking each position, since a component may itself be a merged token)
  for (let i = 0; i < bytes.length; i++) {
    const lookup = reverseDict[bytes[i]];
    if (lookup) {
      bytes[i] = lookup.n1;
      bytes.splice(i + 1, 0, lookup.n2);
      i--; // Re-check this position
    }
  }

  // Convert back to UTF-8 string
  return Buffer.from(bytes).toString('utf-8');
}
```

To decode, reverse the merges. Split each merged token back into its components.
```
Tokens: [104, 258]

If merge 3 was (101, 257) → 258:
After reverse: [104, 101, 257]

If merge 2 was (256, 111) → 257:
After reverse: [104, 101, 256, 111]

If merge 1 was (108, 108) → 256:
After reverse: [104, 101, 108, 108, 111]

Convert to string: "hello"
```

Note
Key insight: The merge dictionary is deterministic and the same across encoding and decoding. Without it, you couldn’t decode tokens back to text. This is why tokenizers are shipped with models—each model has its own dictionary.
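A quick way to verify that property is a round-trip test: encode a string, decode the result, and check you get the original back. A minimal sketch using the encode and decode functions above, assuming mergeDictOrdered has already been trained:

```ts
import { strict as assert } from 'node:assert';

// Round-trip sanity check: decode(encode(x)) should reproduce x exactly,
// including multi-byte UTF-8 characters.
const samples = ["Hello, world!", "José", "世界", "👋🚀"];

for (const s of samples) {
  const tokens = encode(s);
  assert.equal(decode(tokens), s);
  console.log(`${JSON.stringify(s)} → ${tokens.length} tokens → round-trips OK`);
}
```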
Part 4: Real-World Complexity
Above implementation is clean and educational. Real tokenizers add complexity:
1. Special Tokens
Real tokenizers have special tokens:
- <PAD>: Padding (fills sequence to fixed length)
- <UNK>: Unknown (rare tokens not in vocabulary)
- <BOS>: Beginning of sequence
- <EOS>: End of sequence
- <MASK>: For masked language modeling (BERT)
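One common pattern is to reserve IDs just past the learned vocabulary for these and wrap every encoded sequence with them. A minimal sketch; the IDs and the prepareForModel helper are illustrative, not taken from any particular model:

```ts
// Reserve IDs just past the learned vocabulary for special tokens.
// These particular IDs are illustrative; real models define their own.
const SPECIAL = {
  PAD: 300,
  BOS: 301,
  EOS: 302,
} as const;

// Wrap an encoded sequence with <BOS>/<EOS> and pad it to a fixed length.
function prepareForModel(tokens: number[], maxLen: number): number[] {
  const withMarkers = [SPECIAL.BOS, ...tokens, SPECIAL.EOS];
  const padding = Array(Math.max(0, maxLen - withMarkers.length)).fill(SPECIAL.PAD);
  // (a real pipeline would truncate before appending <EOS>)
  return [...withMarkers, ...padding].slice(0, maxLen);
}

// Using the example merges from earlier, encode("hello") = [104, 258]:
// prepareForModel(encode("hello"), 8) → [301, 104, 258, 302, 300, 300, 300, 300]
```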
2. Normalization
Before tokenizing, text is normalized:
- Lowercasing (optional)
- Accent removal: “café” → “cafe”
- Whitespace handling: Multiple spaces → one space
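A minimal normalization pass might look like the sketch below. It is a generic example rather than any specific tokenizer's rules; real tokenizers make each of these steps configurable:

```ts
// A simple, generic normalizer: lowercase, strip accents, collapse whitespace.
function normalize(text: string): string {
  return text
    .toLowerCase()                    // "Hello" → "hello" (optional in practice)
    .normalize('NFD')                 // decompose "é" into "e" + combining accent
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining accent marks
    .replace(/\s+/g, ' ')             // collapse runs of whitespace
    .trim();
}

console.log(normalize("  Café   CRÈME  ")); // "cafe creme"
```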
3. Pre-tokenization
Many tokenizers split on whitespace first:
"Hello, world!" → ["Hello", ",", "world", "!"]Then tokenize each part separately. This prevents merging across word boundaries (usually desired).
4. UTF-8 Encoding Issues
This code uses UTF-8 bytes directly. Real tokenizers handle edge cases:
- Multi-byte UTF-8 sequences (emojis, CJK characters)
- Invalid UTF-8 sequences
- Different byte order marks
A test string with multi-byte characters exercises these cases:

```
"José"  (é is 2 bytes in UTF-8)
"世界"   (each character is 3 bytes)
"👋🚀"   (each emoji is 4 bytes)
```

Because it operates on raw bytes, the tokenizer processes all of this correctly.
5. Vocabulary Size Trade-offs
```
Smaller vocabulary (e.g. 256):  longer sequences, smaller embedding and output layers
Larger vocabulary (e.g. 100k):  shorter sequences, larger embedding and output layers
```

Real models balance this. GPT-3 uses 50,257 tokens, which works well for English but is less efficient for CJK languages.
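To see why vocabulary size is partly a memory decision, look at the embedding table alone: it has vocab_size × d_model parameters. A back-of-the-envelope sketch, where d_model = 4096 is just an assumed example value:

```ts
// Rough embedding-table size for different vocabulary sizes.
// d_model = 4096 is an assumed example, not a specific model's value.
const dModel = 4096;

for (const vocabSize of [256, 50_257, 100_000]) {
  const params = vocabSize * dModel;
  console.log(`${vocabSize} tokens → ${(params / 1e6).toFixed(1)}M embedding parameters`);
}
// 256 tokens    → 1.0M embedding parameters
// 50257 tokens  → 205.9M embedding parameters
// 100000 tokens → 409.6M embedding parameters
```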
Note
Important: A token ID that isn’t in the vocabulary is meaningless to the model, which is why OOV (out-of-vocabulary) handling matters. Word-level tokenizers map unknown words to <UNK>; subword tokenizers break them into smaller pieces; byte-level BPE (like the implementation above) can always fall back to raw bytes, so nothing is truly out of vocabulary.
Part 5: Why Tokenizers Matter for Performance
Token Count Explosion
The compression from your tokenizer is dramatic:
```ts
console.log('Original:', bytes.length);          // 500 bytes
console.log('Final:', tokensToOperateOn.length); // Maybe 150 tokens
```

Why this matters:

```
Original bytes: 500 characters = 500 tokens (character-level)
BPE tokens:     150 tokens (subword-level)
Compression:    3.3x fewer tokens
```

Impact:
- Memory: longer sequences need more GPU memory to process
- Speed: Attention is O(n²), so 500 tokens = 250,000 attention operations while 150 tokens = 22,500
- Inference: Fewer tokens = faster inference

This is why subword tokenization is crucial for LLM efficiency.
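A tiny sketch of the arithmetic behind those numbers (500 and 150 are the illustrative figures from above, not measurements):

```ts
// Illustrative figures from the comparison above, not measurements.
const charLevelTokens = 500;
const bpeTokens = 150;

console.log(`Compression: ${(charLevelTokens / bpeTokens).toFixed(1)}x fewer tokens`); // 3.3x

// Self-attention compares every token with every other token: O(n²)
console.log(`Attention ops (char-level): ${charLevelTokens ** 2}`); // 250000
console.log(`Attention ops (BPE):        ${bpeTokens ** 2}`);       // 22500
console.log(`Attention savings: ${(charLevelTokens ** 2 / bpeTokens ** 2).toFixed(1)}x`); // 11.1x
```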
Context Window Trade-off
An LLM has a context window (max tokens it processes):
- GPT-3: 4,096 tokens
- GPT-4: 128,000 tokens
- Claude 3: 200,000 tokens
With character-level tokenization, a 4K context window only fits 4,000 characters (1-2 pages). With BPE, it fits ~12,000 characters (4-5 pages).
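The same arithmetic applied to the context windows listed above, treating roughly 3 characters per token as a rough average for BPE on English text:

```ts
// Rough characters-per-window estimates. ~3 chars/token is an approximation
// for BPE on English text; character-level is 1 by definition.
const windows: [string, number][] = [
  ['GPT-3', 4_096],
  ['GPT-4', 128_000],
  ['Claude 3', 200_000],
];

for (const [model, tokens] of windows) {
  console.log(`${model}: ${tokens} chars (char-level) vs ~${tokens * 3} chars (BPE)`);
}
// GPT-3:    4096 chars   vs ~12288 chars
// GPT-4:    128000 chars vs ~384000 chars
// Claude 3: 200000 chars vs ~600000 chars
```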
Part 6: Comparison: BPE vs. Other Tokenization Methods
WordPiece (BERT, DistilBERT)
Similar to BPE but merges based on likelihood, not frequency.
```
BPE:       merge the most frequent pair
WordPiece: merge the pair that maximizes likelihood
```

Slightly better quality, slightly slower to train.
SentencePiece (ALBERT, mT5)
Language-agnostic tokenization. Trains directly on raw text without language-specific pre-tokenization (treating whitespace as just another symbol), so it handles any language the same way.
Better for multilingual models but slightly less interpretable.
Tiktoken (OpenAI)
GPT’s tokenizer. Optimized specifically for GPT models.
- Regex-based pre-tokenization
- Better handling of whitespace and special characters
- Built in-house by OpenAI and released as open source (the tiktoken library)
Part 7: Limitations and Considerations
1. Tokenizer-Model Mismatch
If you train a model with one tokenizer and deploy with another, outputs break.
```
Training tokenizer:   "hello" → [123, 456]
Deployment tokenizer: "hello" → [789, 101]
Result: Model reads wrong tokens, produces garbage
```

This is why tokenizers are versioned with models.
2. Tokenization Artifacts
Some text gets tokenized weirdly:
"123" might tokenize as [1, 2, 3] (digit-by-digit) depending on training dataThis hurts the model's ability to understand numbersGPT learned to handle this by training on code (which has numbers), but other models struggle.
3. Non-Reversible Tokenization
Some tokenizers lose information:
Original: "HELLO"Lowercased: "hello"Tokenized: [123]Decoded: "hello" (case lost)Above implementation is fully reversible, but real tokenizers often aren’t (by design, for normalization).
4. Inefficient Tokenization for Certain Languages
CJK languages (Chinese, Japanese, Korean) tokenize poorly with BPE:
English: "hello world" = 2 tokensChinese: "你好世界" = might be 4+ tokens (each character separate)This is why multilingual models need better tokenizers.
Part 8: How Real LLMs Use Tokenizers
Training Phase
```
Raw text → Tokenizer → Token IDs → Embedding Layer → Transformer
```

The tokenizer sits at the front of the data pipeline: every training example passes through it, whether text is tokenized on the fly or pre-tokenized and cached.
Inference Phase
```
User input → Tokenizer → Token IDs → Model → Token IDs → Detokenizer → Text
```

The same tokenizer used during training is used during inference. This is why tokenizer version matters.
Streaming (Like ChatGPT Web Interface)
```
User input → Tokenize → Stream through model → Stream token outputs → Detokenize to text
```

Tokens are decoded one-by-one as they’re generated, creating the “typing” effect.
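Here is a rough sketch of that loop on the consumer side. generateTokens is a placeholder for whatever yields token IDs one at a time (not a real API), and TextDecoder in streaming mode holds back incomplete multi-byte characters so an emoji split across tokens isn't printed half-finished:

```ts
// Sketch of streaming detokenization. `generateTokens` is a placeholder for
// whatever yields token IDs one at a time (e.g. a model server).
async function streamToText(
  generateTokens: AsyncIterable<number>,
  decodeToBytes: (tokens: number[]) => Buffer // e.g. decode() above, minus the final toString()
) {
  // TextDecoder with stream:true buffers incomplete multi-byte sequences
  const textDecoder = new TextDecoder('utf-8');

  for await (const tokenId of generateTokens) {
    const bytes = decodeToBytes([tokenId]);
    process.stdout.write(textDecoder.decode(bytes, { stream: true })); // the "typing" effect
  }
  process.stdout.write(textDecoder.decode()); // flush any trailing bytes
}
```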
Part 9: Conclusion: Tokenizers Are Bridges, Not Details
Tokenizers are often overlooked. They’re not “interesting” like transformer attention or reinforcement learning.
But they’re fundamental. They’re the bridge between human language and machine numbers.
Understanding tokenizers helps you understand:
- Why LLM context windows are limited (tokens, not characters)
- Why multilingual models struggle (unbalanced tokenization across languages)
- Why some tasks (spelling, rhyming, arithmetic) are surprisingly hard (the model sees tokens, not characters)
- Why different models behave differently (each has its own tokenizer)
- How to debug model outputs (understanding tokenization artifacts)
This is what powers GPT, Claude, Gemini. The core algorithm is simple. The engineering is elegant.
The next time you interact with an LLM, remember: your text is being converted to numbers by a tokenizer. Understanding that process is understanding the foundation of how LLMs work.