Byte Pair Encoding (BPE): the tokenizer that made GPTs practical

Introduction

Byte Pair Encoding (BPE) is a subword tokenization scheme that combines the strengths of word-level and character-level tokenizers: a compact vocabulary (far smaller than a full word list), the ability to represent any unseen word (by falling back to subwords/characters), and meaningful shared pieces (roots, suffixes) that help models generalize.

GPT-2 used a BPE tokenizer with a vocabulary of 50,257 tokens, and OpenAI’s tiktoken is a fast, Rust-backed implementation you can use today. Below I explain the why and the how (intuition + algorithm), then walk through a short hands-on demo using tiktoken.


Why we even care about tokenizers

When we feed text to an LLM we must map text → integer IDs. How we split text into tokens has big consequences:

  • Word-level tokenizers: each distinct word is a token. Pros: short token sequences. Cons: huge vocabularies and out-of-vocabulary (OOV) problems — any unseen word breaks things.
  • Character-level tokenizers: each character is a token. Pros: tiny vocab, no OOV. Cons: very long token sequences and loss of meaningful multi-character chunks (roots/suffixes).
  • Subword tokenizers (BPE): the hybrid. Keep frequent words intact, break rare words into subwords or characters. This lets the tokenizer:
    • avoid OOVs,
    • keep useful common words as single tokens, and
    • represent morphological similarity (e.g., token, tokenize, tokenization share pieces).

That combination is why GPT-2/GPT-3 adopted BPE.
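
To get a feel for the trade-off, here is a tiny, illustrative sketch comparing the three granularities on one sentence. The sentence is made up, and the whitespace split is only a stand-in for a real word-level tokenizer:

# pip install tiktoken
import tiktoken

sentence = "Tokenizers decide how unseen words like someunknownPlace are handled."

# Word-level (approximated by a naive whitespace split): short sequences, but OOV risk.
word_tokens = sentence.split()

# Character-level: no OOV, but very long sequences.
char_tokens = list(sentence)

# Subword-level: GPT-2's BPE via tiktoken.
bpe_tokens = tiktoken.get_encoding("gpt2").encode(sentence)

print(len(word_tokens), len(bpe_tokens), len(char_tokens))
# For typical English text the BPE count sits between the word count and the character count.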


BPE at a glance — intuition

BPE began as a simple data-compression algorithm (Gage, 1994): repeatedly find the most frequent adjacent pair of bytes and merge it into a new symbol. Applied to language, we:

  1. Start with characters (plus a special end-of-word marker).
  2. Count adjacent-symbol pairs across the corpus.
  3. Find the most frequent pair → merge it into a new subword token.
  4. Repeat until you’ve created the desired number of tokens (or no pair repeats enough).

This produces tokens that are neither full words nor single characters, but subwords — e.g., old, -est, ing, s. The algorithm automatically discovers frequent roots/suffixes (like est in finest, lowest) and preserves frequent full words (like the, and) as single tokens.
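
In code, one round of that loop reduces to two small operations: count adjacent pairs, then merge the winning pair everywhere. Here is a minimal sketch in plain Python; it ignores the byte-level details, pre-tokenization, and tie-breaking rules that production tokenizers use:

from collections import Counter

def count_pairs(corpus):
    # corpus maps each word (a tuple of symbols) to its frequency in the training text
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(corpus, pair):
    # rewrite every word, replacing each occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

Training is then just: repeat count_pairs → take the most frequent pair → merge_pair until you reach the target vocabulary size. The next section runs exactly that loop on a toy corpus.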


Step-by-step example

Imagine a tiny corpus: old, older, finest, lowest.

  1. Add an end-of-word marker to each word (I’ll use </w> conceptually).
  2. Split every word into characters, with </w> appended.
  3. Count adjacent pairs: frequent ones like e+s, then es+t, then est+</w> emerge and get merged.
  4. The algorithm progressively builds tokens like old, est</w>, etc.

The final vocabulary includes the base characters, useful subwords, and a manageable total number of tokens. That’s how BPE captures morphology and keeps the vocabulary far smaller than a full word-level one.
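
Here is a small, self-contained script that runs the merge loop on exactly that four-word corpus (each word with frequency 1 and an end-of-word marker). It mirrors the helpers sketched in the previous section; the merges and final symbols are printed at runtime rather than hard-coded:

from collections import Counter

words = ["old", "older", "finest", "lowest"]

# Represent each word as a tuple of characters plus an end-of-word marker.
corpus = Counter(tuple(w) + ("</w>",) for w in words)

for step in range(1, 11):  # a handful of merges is plenty for this toy corpus
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    # Stop once no pair occurs at least twice.
    if not pairs or pairs.most_common(1)[0][1] < 2:
        break
    best = pairs.most_common(1)[0][0]
    print(f"merge {step}: {best}")
    # Apply the merge to every word in the corpus.
    new_corpus = Counter()
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[tuple(out)] += freq
    corpus = new_corpus

print(dict(corpus))
# Subwords like "old" and "est</w>" emerge from the merges, matching the description above.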


The practical implementation — tiktoken (OpenAI’s Rust-backed BPE)

Tiktoken is a performant implementation of the same class of BPE tokenizers used for OpenAI models. You can install and use it in seconds. Below is a short snippet and its sample output.

# pip install tiktoken
import importlib.metadata  # `import importlib` alone doesn't guarantee the metadata submodule is loaded
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))
# tiktoken version: 0.7.0

tokenizer = tiktoken.get_encoding("gpt2")
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)
# [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]

strings = tokenizer.decode(integers)

print(strings)
# Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.

Notes on that snippet:

  • 50256 is the token id of the special <|endoftext|> token in the GPT-2 encoding. GPT-2’s BPE vocabulary has 50,257 tokens (IDs 0..50256 inclusive), so <|endoftext|> is the very last id.
  • The unknown-looking word someunknownPlace gets broken into subword pieces rather than causing an error — that’s BPE in action (a quick check follows these notes).
  • tiktoken is typically much faster than a pure-Python BPE implementation because the core is implemented in Rust.
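
To see that splitting for yourself, decode each id of the unknown-looking word on its own. I print the pieces rather than hard-coding them, since they depend on the GPT-2 merge table:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode("someunknownPlace")

# Decoding one id at a time reveals the subword pieces BPE chose.
print([tokenizer.decode([i]) for i in ids])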

Key takeaways

  • BPE = subword tokenizer: preserves frequent whole words, splits rare words into subwords/characters.
  • Solves OOV: any input can be tokenized because it can be represented as characters/subwords.
  • Captures morphology: roots and suffixes are discovered automatically (e.g., est in finest/lowest).
  • Reasonable vocab size: GPT-2 uses ~50k tokens, much smaller than the full word list of English.
  • Fast, production-ready tooling: use tiktoken (Rust-backed) for encoding/decoding; it’s notably faster than a Python-only implementation.

Practical tips & gotchas

  • Stop criteria for learning merges: when building a vocabulary via BPE, you stop after reaching a preset vocabulary size (e.g., 30k, 50k) or when no pair occurs often enough. GPT-style models often pick ~50k tokens for a balance of sequence length vs vocab expressiveness.
  • Special tokens: preserve and register “end-of-text” or other special tokens so they’re treated atomically by the tokenizer (as shown in the example with allowed_special).
  • Token counts ≠ word counts: a single English word may span several tokens (e.g., antidisestablishmentarianism), while common short words and some punctuation sequences are single tokens (the, a, etc., depending on the tokenizer) — see the quick check after this list.
  • Consistency matters: use the exact encoding used for the model — mixing tokenizers will break embedding alignment and model inputs.
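
A quick sanity check for the token-counts-vs-word-counts point; the samples are arbitrary, and the counts are printed at runtime because they depend on the encoding you pick:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

for sample in ["the", "antidisestablishmentarianism", "New York City"]:
    ids = tokenizer.encode(sample)
    print(f"{sample!r}: {len(sample.split())} word(s) -> {len(ids)} token(s)")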


Recap

BPE is simple in idea but extremely powerful in practice. It’s the reason models like GPT-2 could be trained with manageable vocabularies while still gracefully handling arbitrary text from the wild.
