Introduction
In this blog post, we dive deep into tokenization, the very first step in preparing data for training large language models (LLMs).
Tokenization is more than just splitting sentences into words—it’s about transforming raw text into a structured format that neural networks can process. We’ll build a tokenizer, encoder, and decoder from scratch in Python, and walk through handling unknown tokens and special context markers.
By the end, you’ll not only understand how tokenization works but also have working Python code you can adapt for your own projects.
The Three Steps of Tokenization
At its core, tokenization involves three main steps:
- Splitting text into tokens (words, subwords, punctuation).
- Mapping tokens to token IDs (unique integers).
- Encoding token IDs into embeddings (vector representations).
In this post, we’ll focus on steps 1 and 2.
Preparing the Dataset
For demonstration, we’ll use The Verdict by Edith Wharton (1908), a short story that is freely available. You can download it here:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/the-verdict.txt
import re

# Load the dataset (saved locally as "the-verdict.txt")
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total characters:", len(raw_text))
print("First 100 chars:", raw_text[:100])
Output:
Total characters: 20479
First 100 chars: I had always thought Jack Gisburn...
Step 1: Tokenizing Text
We’ll use Python’s regular expression module (re) to split text into words and punctuation.
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
Output:
['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
👉 Notice how the punctuation marks (, . -- ?) become separate tokens.
Step 2: Building a Vocabulary and Assigning Token IDs
Now we build a vocabulary: a sorted list of unique tokens mapped to integers.

# Example tokens
tokens = ["the", "quick", "brown", "fox", "jumps", "the"]
# Vocabulary (unique + sorted)
vocab = sorted(set(tokens))
print("Vocabulary:", vocab)
# Map tokens to IDs
token2id = {tok: idx for idx, tok in enumerate(vocab)}
id2token = {idx: tok for tok, idx in token2id.items()}
print("Token → ID:", token2id)
print("ID → Token:", id2token)
Output:
Vocabulary: ['brown', 'fox', 'jumps', 'quick', 'the']
Token → ID: {'brown': 0, 'fox': 1, 'jumps': 2, 'quick': 3, 'the': 4}
ID → Token: {0: 'brown', 1: 'fox', 2: 'jumps', 3: 'quick', 4: 'the'}
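The toy example makes the mapping clear, but the tokenizer we build next needs a vocabulary derived from the full story. Here is a minimal sketch that applies the same split-and-sort recipe to the raw_text we loaded earlier (the exact vocabulary size depends on the text):

# Tokenize the entire story with the same regular expression
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Unique, sorted tokens mapped to integer IDs
all_words = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_words)}
print("Vocabulary size:", len(vocab))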
Implementing a Tokenizer Class
To make this reusable, let’s implement a tokenizer class with encode and decode methods.
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space before each punctuation mark
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
Example:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)
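To check the two methods against each other, we can decode the IDs right back; the result should closely resemble the original sentence, up to minor spacing differences around quotes and apostrophes:

# Round trip: decoding the IDs should reproduce the sentence
print(tokenizer.decode(ids))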
Handling Unknown Tokens and End-of-Text
Real-world text contains words that are not in the vocabulary. To handle this, we add two special tokens (written in GPT style as <|unk|> and <|endoftext|> in the code below):
- <|unk|> → for unknown words
- <|endoftext|> → for marking the end of a text
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
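As a quick sanity check (optional), the two special tokens are appended after sorting, so they receive the two highest IDs in the vocabulary:

# The special tokens sit at the very end of the vocabulary
print("Vocabulary size:", len(vocab))
print(list(vocab.items())[-2:])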
Updated Tokenizer (V2):
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens that are not in the vocabulary with <|unk|>
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space before each punctuation mark
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
Now, unseen words are mapped to the <|unk|> token.
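For example, here is a quick sketch that joins two sentences with <|endoftext|> and runs them through the new tokenizer (using the extended vocabulary built above); any word missing from the story’s vocabulary decodes back as <|unk|>:

tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))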
Special Context Tokens in LLMs
Beyond <|unk|> and <|endoftext|>, LLMs often use:
- <BOS> (Beginning of Sequence) – marks the start of a text.
- <EOS> (End of Sequence) – marks the end of a text.
- <PAD> (Padding) – ensures equal sequence length in batches (see the toy sketch below).
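Here is a toy illustration of padding (the token IDs are made up, and we assume a hypothetical <PAD> token with ID 0):

# Two tokenized sequences of different lengths (hypothetical IDs)
batch = [[5, 9, 2, 7], [3, 8]]

pad_id = 0  # assumed ID of the <PAD> token
max_len = max(len(seq) for seq in batch)

# Right-pad every sequence to the length of the longest one
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
print(padded)  # [[5, 9, 2, 7], [3, 8, 0, 0]]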
💡 GPT models (like GPT-3/4) mostly rely only on <|endoftext|> to separate documents.
Why GPT Uses Byte Pair Encoding (BPE)
Our tokenizer treats entire words as tokens. But what if a rare word appears? GPT solves this by breaking words into subwords using Byte Pair Encoding (BPE).
Example: "chased" → "ch", "ase", "d"
This way, even unknown words can be represented through known subword pieces. We’ll explore BPE in the next lecture.
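As a quick preview, here is an optional sketch using the external tiktoken library, which implements the GPT-2 BPE tokenizer (installable with pip install tiktoken). Note how an out-of-vocabulary word is split into known subword pieces rather than being replaced by an unknown token:

import tiktoken

# Load the GPT-2 byte pair encoding tokenizer
enc = tiktoken.get_encoding("gpt2")

text = "The fox chased someunknownPlace."
ids = enc.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

# Decoding reconstructs the text exactly, even for unseen words
print(enc.decode(ids))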
Recap
In this post, we learned:
✅ Tokenization = breaking text → tokens → IDs
✅ How to build a vocabulary and assign token IDs
✅ Implementing an encoder/decoder tokenizer class in Python
✅ Handling unknown tokens and using <|endoftext|> for document separation
✅ Why large-scale models prefer subword tokenization (BPE)
References: https://github.com/rasbt/LLMs-from-scratch/tree/main
For a high-level view of tokenization, read this: https://learncodecamp.net/tokenization/