Byte Pair Encoding (BPE): the tokenizer that made GPTs practical

Introduction

Byte Pair Encoding (BPE) is a subword tokenization scheme that combines the strengths of word-level and character-level tokenizers: a compact vocabulary (far smaller than a full word list), the ability to represent any unseen word (by falling back to subwords/characters), and meaningful shared pieces (roots, suffixes) that help models generalize.

GPT-2 used a BPE tokenizer with a vocabulary of 50,257 tokens, and OpenAI’s tiktoken is a fast, Rust-backed implementation you can use today. Below I explain the why and the how (intuition + algorithm), then walk through a short hands-on demo using tiktoken.


Why we even care about tokenizers

When we feed text to an LLM we must map text → integer IDs. How we split text into tokens has big consequences:

  • Word-level tokenizers: each distinct word is a token. Pros: short token sequences. Cons: huge vocabularies and out-of-vocabulary (OOV) problems — any unseen word breaks things.
  • Character-level tokenizers: each character is a token. Pros: tiny vocab, no OOV. Cons: very long token sequences and loss of meaningful multi-character chunks (roots/suffixes).
  • Subword tokenizers (BPE): the hybrid. Keep frequent words intact, break rare words into subwords or characters. This lets the tokenizer:
    • avoid OOVs,
    • keep useful common words as single tokens, and
    • represent morphological similarity (e.g., token, tokenize, tokenization share pieces).

That combination is why GPT-2/GPT-3 adopted BPE.
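
To get a feel for the trade-off, here is a tiny, illustrative sketch comparing the three granularities on one sentence. The sentence is made up, and the whitespace split is only a stand-in for a real word-level tokenizer:

# pip install tiktoken
import tiktoken

sentence = "Tokenizers decide how unseen words like someunknownPlace are handled."

# Word-level (approximated by a naive whitespace split): short sequences, but OOV risk.
word_tokens = sentence.split()

# Character-level: no OOV, but very long sequences.
char_tokens = list(sentence)

# Subword-level: GPT-2's BPE via tiktoken.
bpe_tokens = tiktoken.get_encoding("gpt2").encode(sentence)

print(len(word_tokens), len(bpe_tokens), len(char_tokens))
# For typical English text the BPE count sits between the word count and the character count.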


BPE at a glance — intuition

BPE began as a simple data-compression algorithm (Gage, 1994): repeatedly find the most frequent adjacent pair of bytes and merge it into a new symbol. Applied to language, we:

  1. Start with characters (plus a special end-of-word marker).
  2. Count adjacent-symbol pairs across the corpus.
  3. Find the most frequent pair → merge it into a new subword token.
  4. Repeat until you’ve created the desired number of tokens (or no pair repeats enough).

This produces tokens that are neither full words nor single characters, but subwords — e.g., old, -est, ing, s. The algorithm automatically discovers frequent roots/suffixes (like est in finest, lowest) and preserves frequent full words (like the, and) as single tokens.
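
In code, one round of that loop reduces to two small operations: count adjacent pairs, then merge the winning pair everywhere. Here is a minimal sketch in plain Python; it ignores the byte-level details, pre-tokenization, and tie-breaking rules that production tokenizers use:

from collections import Counter

def count_pairs(corpus):
    # corpus maps each word (a tuple of symbols) to its frequency in the training text
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(corpus, pair):
    # rewrite every word, replacing each occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

Training is then just: repeat count_pairs → take the most frequent pair → merge_pair until you reach the target vocabulary size. The next section runs exactly that loop on a toy corpus.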


Step-by-step example

Imagine a tiny corpus: old, older, finest, lowest.

  1. Add an end-of-word marker to each word (I’ll use </w> conceptually).
  2. Split every word into characters, with </w> appended.
  3. Count adjacent pairs: frequent ones like e+s, then es+t, then est+</w> emerge and get merged.
  4. The algorithm progressively builds tokens like old, est</w>, etc.

The final vocabulary includes the base characters, useful subwords, and a manageable total number of tokens. That’s how BPE captures morphology and keeps the vocabulary far smaller than a full word-level one.
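
Here is a small, self-contained script that runs the merge loop on exactly that four-word corpus (each word with frequency 1 and an end-of-word marker). It mirrors the helpers sketched in the previous section; the merges and final symbols are printed at runtime rather than hard-coded:

from collections import Counter

words = ["old", "older", "finest", "lowest"]

# Represent each word as a tuple of characters plus an end-of-word marker.
corpus = Counter(tuple(w) + ("</w>",) for w in words)

for step in range(1, 11):  # a handful of merges is plenty for this toy corpus
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    # Stop once no pair occurs at least twice.
    if not pairs or pairs.most_common(1)[0][1] < 2:
        break
    best = pairs.most_common(1)[0][0]
    print(f"merge {step}: {best}")
    # Apply the merge to every word in the corpus.
    new_corpus = Counter()
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[tuple(out)] += freq
    corpus = new_corpus

print(dict(corpus))
# Subwords like "old" and "est</w>" emerge from the merges, matching the description above.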


The practical implementation — tiktoken (OpenAI’s Rust-backed BPE)

Tiktoken is a performant implementation of the same class of BPE tokenizers used for OpenAI models. You can install and use it in seconds. Below is a short snippet and its sample output.

# pip install tiktoken
import importlib.metadata  # `import importlib` alone doesn't guarantee the metadata submodule is loaded
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))
# tiktoken version: 0.7.0

tokenizer = tiktoken.get_encoding("gpt2")
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)
# [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]

strings = tokenizer.decode(integers)

print(strings)
# Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.

Notes on that snippet:

  • 50256 is the token id of the special <|endoftext|> token in the GPT-2 encoding. GPT-2’s BPE vocabulary has 50,257 tokens (IDs 0..50256 inclusive), so <|endoftext|> is the very last id.
  • The unknown-looking word someunknownPlace gets broken into subword pieces rather than causing an error — that’s BPE in action (a quick check follows these notes).
  • tiktoken is typically much faster than a pure-Python BPE implementation because the core is implemented in Rust.
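
To see that splitting for yourself, decode each id of the unknown-looking word on its own. I print the pieces rather than hard-coding them, since they depend on the GPT-2 merge table:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode("someunknownPlace")

# Decoding one id at a time reveals the subword pieces BPE chose.
print([tokenizer.decode([i]) for i in ids])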

Key takeaways

  • BPE = subword tokenizer: preserves frequent whole words, splits rare words into subwords/characters.
  • Solves OOV: any input can be tokenized because it can be represented as characters/subwords.
  • Captures morphology: roots and suffixes are discovered automatically (e.g., est in finest/lowest).
  • Reasonable vocab size: GPT-2 uses ~50k tokens, much smaller than the full word list of English.
  • Fast, production-ready tooling: use tiktoken (Rust-backed) for encoding/decoding; it’s notably faster than a Python-only implementation.

Practical tips & gotchas

  • Stop criteria for learning merges: when building a vocabulary via BPE, you stop after reaching a preset vocabulary size (e.g., 30k, 50k) or when no pair occurs often enough. GPT-style models often pick ~50k tokens for a balance of sequence length vs vocab expressiveness.
  • Special tokens: preserve and register “end-of-text” or other special tokens so they’re treated atomically by the tokenizer (as shown in the example with allowed_special).
  • Token counts ≠ word counts: a single English word may span several tokens (e.g., antidisestablishmentarianism), while common short words and some punctuation sequences are single tokens (the, a, etc., depending on the tokenizer) — see the quick check after this list.
  • Consistency matters: use the exact encoding used for the model — mixing tokenizers will break embedding alignment and model inputs.
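
A quick sanity check for the token-counts-vs-word-counts point; the samples are arbitrary, and the counts are printed at runtime because they depend on the encoding you pick:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

for sample in ["the", "antidisestablishmentarianism", "New York City"]:
    ids = tokenizer.encode(sample)
    print(f"{sample!r}: {len(sample.split())} word(s) -> {len(ids)} token(s)")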


Recap

BPE is simple in idea but extremely powerful in practice. It’s the reason models like GPT-2 could be trained with manageable vocabularies while still gracefully handling arbitrary text from the wild.
