Introduction
Byte Pair Encoding (BPE) is a subword tokenization scheme that gives us the best of both worlds: compact vocabulary sizes (not the full wordlist), the ability to represent any unknown word (by falling back to subwords/characters), and meaningful shared pieces (roots, suffixes) that help models generalize.
GPT-2 used a BPE tokenizer with a vocabulary of ≈50,257 tokens, and OpenAI's `tiktoken` is a fast Rust-backed implementation you can use today. Below I explain the why, the how (intuition + algorithm), and a short hands-on demo using `tiktoken`.
Why we even care about tokenizers
When we feed text to an LLM we must map text → integer IDs. How we split text into tokens has big consequences:
- Word-level tokenizers: each distinct word is a token. Pros: short token sequences. Cons: huge vocabularies and out-of-vocabulary (OOV) problems — any unseen word breaks things.
- Character-level tokenizers: each character is a token. Pros: tiny vocab, no OOV. Cons: very long token sequences and loss of meaningful multi-character chunks (roots/suffixes).
- Subword tokenizers (BPE): the hybrid. Keep frequent words intact, break rare words into subwords or characters. This lets the tokenizer:
  - avoid OOVs,
  - keep useful common words as single tokens, and
  - represent morphological similarity (e.g., token, tokenize, tokenization share pieces).
That combination is why GPT-2/GPT-3 adopted BPE.
BPE at a glance — intuition
BPE began as a simple compression algorithm (1994): repeatedly find the most frequent adjacent pair of bytes and merge them into a new symbol. Applied to language, we:
- Start with characters (plus a special end-of-word marker).
- Count adjacent-symbol pairs across the corpus.
- Find the most frequent pair → merge it into a new subword token.
- Repeat until you’ve created the desired number of tokens (or no pair repeats enough).
This produces tokens that are not full words nor single characters, but subwords — e.g., `old`, `-est`, `ing`, `s`, etc. The algorithm automatically discovers frequent roots/suffixes (like `est` in finest, lowest) and preserves frequent full words (like `the`, `and`) as single tokens.
Step-by-step example
Imagine a tiny corpus: `old`, `older`, `finest`, `lowest`.
- Add a word-end marker (I'll use `</w>` conceptually).
- Start as characters with `</w>` appended.
- Frequent adjacent pairs like `e`+`s`, `es`+`t`, then `est</w>` emerge and get merged.
- The algorithm progressively constructs tokens like `old`, `est</w>`, etc.
Final vocabulary includes characters, useful subwords, and a manageable number of tokens. That’s how BPE captures morphology and reduces vocabulary size versus full word vocabularies.
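To make the merge loop concrete, here is a minimal pure-Python sketch of BPE vocabulary learning run on the toy corpus above. It illustrates the classic character-level algorithm, not the byte-level tokenizer GPT-2 actually ships with; the merge budget of 6 and the first-seen tie-breaking rule are arbitrary choices for the sketch.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# The toy corpus from the example: each word is a tuple of characters plus "</w>".
corpus = {"old": 1, "older": 1, "finest": 1, "lowest": 1}
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

for step in range(6):                      # the merge budget is a free parameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)       # most frequent pair; ties go to first-seen
    print(f"merge {step + 1}: {best} -> {''.join(best)} (count {pairs[best]})")
    vocab = merge_pair(best, vocab)

print(vocab)   # words are now spelled with learned subwords such as 'est</w>'
```

Running this prints the learned merges in order; with this corpus the suffix pieces (`es`, `est`, `est</w>`) show up after the first few merges, exactly the behaviour described above.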
The practical implementation — `tiktoken` (OpenAI's Rust-backed BPE)
`tiktoken` is a performant implementation of the exact class of BPE tokenizers used for OpenAI models. You can install and use it quickly. Below is a short, self-contained snippet and its sample outputs.
```python
# pip install tiktoken
import importlib.metadata

import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))
# tiktoken version: 0.7.0

# Load the GPT-2 byte-pair encoding.
tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)

# Encode text -> token IDs; the special token must be explicitly allowed.
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
# [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]

# Decode token IDs -> text (round-trips exactly).
strings = tokenizer.decode(integers)
print(strings)
# Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.
```
Notes on that snippet:
- `50256` is the token ID used for the special `<|endoftext|>` token in the GPT-2 encoding. GPT-2's BPE vocab is commonly reported as ≈50,257 tokens (IDs 0..50256 inclusive).
- The unknown-looking word `someunknownPlace` gets broken into subword pieces rather than causing an error — that's BPE in action (you can inspect the pieces with the snippet below).
- `tiktoken` is typically much faster than a pure-Python BPE implementation because the core is implemented in Rust.
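To see those pieces for yourself, decode each token ID individually. This is a small follow-up to the snippet above; the exact split is whatever the GPT-2 merge table produces, so treat your own output as the ground truth.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Encode the unfamiliar word as it appears in the sentence (with its leading
# space), then decode each ID on its own to reveal the subword pieces.
ids = tokenizer.encode(" someunknownPlace")
print(ids)                                   # a few IDs, no unknown-token error
print([tokenizer.decode([i]) for i in ids])  # the individual subword strings
```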
Key takeaways
- BPE = subword tokenizer: preserves frequent whole words, splits rare words into subwords/characters.
- Solves OOV: any input can be tokenized because it can be represented as characters/subwords.
- Captures morphology: roots and suffixes are discovered automatically (e.g., `est` in finest/lowest).
- Reasonable vocab size: GPT-2 uses ~50k tokens, much smaller than the full word list of English.
- Fast, production-ready tooling: use `tiktoken` (Rust-backed) for encoding/decoding; it's notably faster than a Python-only implementation.
Practical tips & gotchas
- Stop criteria for learning merges: when building a vocabulary via BPE, you stop after reaching a preset vocabulary size (e.g., 30k, 50k) or when no pair occurs often enough. GPT-style models often pick ~50k tokens for a balance of sequence length vs vocab expressiveness.
- Special tokens: preserve and register "end-of-text" or other special tokens so they're treated atomically by the tokenizer (as shown in the example with `allowed_special`).
- Token counts ≠ word counts: a single long word might be several tokens (e.g., `antidisestablishmentarianism`), while very common words (`the`, `a`) and even some punctuation combinations are single tokens, depending on the tokenizer; see the sketch after this list.
- Consistency matters: use the exact encoding the model was trained with — mixing tokenizers will break embedding alignment and model inputs.
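As a quick check on those points, the short sketch below (reusing the `gpt2` encoding from earlier) compares token counts for a rare long word against very common ones, and shows `allowed_special` keeping `<|endoftext|>` atomic:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# A rare, long word costs several tokens...
long_word = "antidisestablishmentarianism"
print(long_word, "->", len(enc.encode(long_word)), "tokens")

# ...while very common words are single tokens.
for w in ["the", " the", "a"]:
    print(repr(w), "->", len(enc.encode(w)), "tokens")

# Special tokens must be allowed explicitly; by default encode() rejects them.
text = "Hello <|endoftext|> world"
ids = enc.encode(text, allowed_special={"<|endoftext|>"})
print(ids)  # 50256 appears as a single ID for <|endoftext|>
```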
Resources
- `tiktoken` — fast BPE tokenizer (Rust implementation used by OpenAI). Install with `pip install tiktoken`.
- OpenAI GPT-2 encoder reference (original Python implementation): https://github.com/openai/gpt-2/blob/master/src/encoder.py
- https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb
- https://www.youtube.com/watch?v=fKd8s29e-l4
- https://learncodecamp.net/thinking-in-llm-models-2/
Recap
BPE is simple in idea but extremely powerful in practice. It’s the reason models like GPT-2 could be trained with manageable vocabularies while still gracefully handling arbitrary text from the wild.