Introduction
Token embeddings (aka vector embeddings) turn tokens — words, subwords, or characters — into numeric vectors that encode meaning.
They’re the essential bridge between raw text and a neural network.
In this post, we will run a few small demos (Word2Vec-style analogies, similarity checks) and walk through concrete PyTorch code that shows how an embedding layer works, including a tiny toy training loop so you can see embeddings being updated by backprop.
Why token embeddings (intuition)
- Computers only understand numbers. Assigning random integers (token IDs) or one-hot vectors to words does not capture word relationships — e.g. cat and kitten should be closer than cat and banana (see the short sketch below).
- Embeddings are dense vectors (e.g., 300-D, 768-D) where semantic relationships become geometric relationships: similar words are near each other, and vector arithmetic often works (e.g., king + woman − man ≈ queen).
- Practically, an embedding layer is just a lookup table: for each token ID you fetch a row (a vector). During LLM training those rows (the embedding weight matrix) are learned (initialized randomly, then optimized via backprop).
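To make the contrast concrete, here is a minimal sketch (using NumPy and hand-picked toy vectors, not real embeddings) showing that one-hot vectors treat every pair of words as equally unrelated, while dense vectors can place cat and kitten close together:
# Requirements:
# pip install numpy
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: every distinct pair is orthogonal, so cosine similarity is always 0
one_hot = {w: v for w, v in zip(['banana', 'cat', 'kitten'], np.eye(3))}
print(cosine(one_hot['cat'], one_hot['kitten']))  # 0.0
print(cosine(one_hot['cat'], one_hot['banana']))  # 0.0 as well: no notion of "closer"

# Hand-picked 3-D "embeddings", purely for illustration
emb = {'cat':    np.array([0.9, 0.8, 0.1]),
       'kitten': np.array([0.85, 0.75, 0.2]),
       'banana': np.array([0.1, 0.05, 0.9])}
print(cosine(emb['cat'], emb['kitten']))   # ~0.99 (close)
print(cosine(emb['cat'], emb['banana']))   # ~0.20 (far)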
Hands-on demos — code you can run
Below are runnable code examples. They demonstrate:
- Loading pre-trained word vectors (Word2Vec / Google News).
- Doing analogy arithmetic and similarity.
- A tiny toy training loop that updates embedding weights via backprop and then predicts likely next words.
# Requirements:
# pip install gensim
import gensim.downloader as api

# Load the pretrained word2vec-google-news-300 vectors via gensim's downloader
model = api.load("word2vec-google-news-300")
# Analogy: king + woman - man -> queen
result = model.most_similar(
positive=['king', 'woman'],
negative=['man'],
topn=5
)
print("king + woman - man ->", result[:5])
# Similarity examples (cosine similarity)
pairs = [('woman', 'man'), ('king', 'queen'), ('paper', 'water')]
for a, b in pairs:
    print(f"sim({a},{b}) = {model.similarity(a, b):.4f}")
# Find nearest neighbors
print("Nearest to 'tower':", model.most_similar('tower', topn=10))
# Output:
king + woman - man -> [('queen', 0.7118192911148071), ('monarch', 0.6189674735069275), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133)]
sim(woman,man) = 0.7664
sim(king,queen) = 0.6511
sim(paper,water) = 0.1141
Nearest to 'tower': [('towers', 0.8531749844551086), ('skyscraper', 0.6417425870895386), ('Tower', 0.639177143573761), ('spire', 0.5946877598762512), ('responded_Understood_Atlasjet', 0.5931612849235535), ('storey_tower', 0.5783935189247131), ('SolarReserve_molten_salt', 0.5733036398887634), ('monopole_tower', 0.566946804523468), ('bell_tower', 0.5626808404922485), ('foot_monopole', 0.5514882802963257)]
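If you want to verify what model.similarity is doing, here is a small sketch (reusing the model object loaded above) that recomputes the cosine similarity directly from the raw 300-D vectors:
import numpy as np

# Cosine similarity computed by hand from the raw word vectors
v_king, v_queen = model['king'], model['queen']   # 300-D numpy arrays
cos = float(np.dot(v_king, v_queen) / (np.linalg.norm(v_king) * np.linalg.norm(v_queen)))
print(f"manual cosine(king, queen) = {cos:.4f}")  # should match model.similarity('king', 'queen')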
This code is a minimal example of how word/token embeddings work in PyTorch:
- Vocabulary setup: a tiny vocabulary (['fox', 'house', 'in', 'is', 'quick', 'the']) is created and mapped to integer IDs (stoi). This lets the model convert words → numbers.
- Embedding layer: nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim) creates a learnable lookup table of shape (vocab_size, embed_dim). Each word ID gets mapped to a small vector (here, 3-dimensional).
- Input tokens: the input sentence fragment "in is the house" is converted into IDs [2, 3, 5, 1] using the stoi mapping.
- Embedding lookup: embedding(input_ids) fetches the corresponding row vectors from the embedding matrix, producing a tensor of shape (4, 3) (one 3-D vector for each token).
- Weights inspection: embedding.weight shows the full embedding matrix for all words in the vocab (size 6 × 3 here).
# Requirements: torch
# pip install torch
import torch
import torch.nn.functional as F
from torch import nn
# Small toy vocabulary & mapping
vocab = ['fox', 'house', 'in', 'is', 'quick', 'the'] # size=6
stoi = {w:i for i,w in enumerate(vocab)}
vocab_size = len(vocab)
embed_dim = 3 # tiny embedding dimension for clarity
# Make an embedding layer (vocab_size x embed_dim)
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
# Example input ids (tokens "in", "is", "the", "house" mapped to their IDs)
# Suppose token IDs are [2, 3, 5, 1]
input_ids = torch.tensor([stoi['in'], stoi['is'], stoi['the'], stoi['house']]) # shape: (4,)
print(input_ids)
# Single-line lookup (batch lookup)
embeds = embedding(input_ids) # shape: (4, embed_dim)
print("Embeddings shape:", embeds.shape)
print("Embeddings:\n", embeds)
# If you want the raw embedding weight matrix:
print("Embedding weight matrix (vocab_size x embed_dim):\n", embedding.weight)
The next code block trains embeddings plus a simple linear classifier to learn word-to-next-word transitions from a short essay:
- Essay text → a ~200-word paragraph on cats is used as raw training data.
- Tokenization → the text is lowercased and split into words; duplicates are removed to build a vocabulary.
- Vocab & mappings → token2id maps words → IDs, id2token maps IDs → words.
- Corpus as IDs → essay tokens are converted into a sequence of integers.
- Training pairs → (current_word → next_word) pairs are created from the sequence.
- Model → ToyModel has nn.Embedding (turns word IDs into vectors) and nn.Linear (projects vectors into logits over the vocab).
- Training loop → runs for 500 epochs: shuffle pairs, do a forward pass, compute cross-entropy loss, backprop, optimizer step. The loss decreases as embeddings and weights are updated.
- Prediction function → given a word, look up its ID, turn the logits into probabilities, and return the top-k likely next words.
- Test predictions → check what the model thinks comes after words like "cats" or "the".
import random
import torch
import torch.nn.functional as F
from torch import nn, optim
# --- 1. Essay on Cats (200 words) ---
essay = """
Cats are fascinating creatures that have been companions to humans for thousands of years.
They are known for their independence, agility, and mysterious behavior. Unlike dogs,
cats often prefer quiet corners where they can observe their surroundings without being disturbed.
Their sharp eyes and quick reflexes make them excellent hunters, even in domestic environments.
Cats communicate through subtle body language—tail movements, ear positions, and gentle purring.
Each cat has a unique personality; some are playful and energetic, while others are calm and affectionate.
Despite their reputation for independence, cats often form strong bonds with their owners,
seeking warmth and comfort in their presence. They enjoy routines and can be sensitive to changes
in their environment. Cats also have a remarkable ability to adapt, whether they live in bustling cities
or peaceful countryside homes. Their grooming habits keep them clean, and their graceful movements
make them a delight to watch. For many people, the presence of a cat brings a sense of calm and companionship.
It is no wonder that cats remain one of the most beloved pets in the world, admired for both their beauty and spirit.
"""
# --- 2. Tokenization (very simple split) ---
tokens = essay.lower().replace("\n", " ").split()
vocab = sorted(set(tokens))
print("Vocabulary size:", len(vocab))
# --- 3. Build mappings ---
token2id = {tok: idx for idx, tok in enumerate(vocab)}
id2token = {idx: tok for tok, idx in token2id.items()}
# --- 4. Convert essay into IDs ---
ids = [token2id[tok] for tok in tokens]
# --- 5. Build training pairs (current → next) ---
pairs = [(ids[i], ids[i+1]) for i in range(len(ids)-1)]
# --- 6. Define model ---
class ToyModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        e = self.embedding(x)      # (batch, embed_dim)
        logits = self.linear(e)    # (batch, vocab_size)
        return logits
vocab_size = len(vocab)
embed_dim = 16 # bigger than 3 now
model = ToyModel(vocab_size, embed_dim)
opt = optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
# --- 7. Training loop ---
for epoch in range(500):  # small for demo
    random.shuffle(pairs)
    losses = []
    for inp, target in pairs:
        inp_t = torch.tensor([inp])
        target_t = torch.tensor([target])
        logits = model(inp_t)
        loss = loss_fn(logits, target_t)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, avg loss {sum(losses)/len(losses):.4f}")
# --- 8. Prediction function ---
def predict_next(word, topk=5):
    model.eval()
    with torch.no_grad():
        if word not in token2id:
            return f"Word '{word}' not in vocab."
        inp_id = torch.tensor([token2id[word]])
        logits = model(inp_id)
        probs = F.softmax(logits, dim=-1)
        top_probs, top_ids = torch.topk(probs, k=topk, dim=-1)
        results = []
        for p, i in zip(top_probs[0], top_ids[0]):
            results.append((id2token[i.item()], float(p)))
        return results
# --- 9. Test predictions ---
for test_word in ["cats", "the", "independence", "companions"]:
    preds = predict_next(test_word)
    print(f"Given '{test_word}' → next candidates: {preds}")
# Output:
Epoch 470, avg loss 0.7142
Epoch 480, avg loss 0.7152
Epoch 490, avg loss 0.7183
Given 'cats' → next candidates: [('often', 0.3571014702320099), ('also', 0.18115095794200897), ('remain', 0.16612175107002258), ('communicate', 0.1458531767129898), ('are', 0.13305993378162384)]
Given 'the' → next candidates: [('most', 0.296306848526001), ('world,', 0.27701255679130554), ('presence', 0.2574588656425476), ('cat', 0.019772468134760857), ('sense', 0.014828374609351158)]
Given 'independence' → next candidates: Word 'independence' not in vocab.
Given 'companions' → next candidates: [('to', 1.0), ('prefer', 2.8792931580645664e-11), ('form', 2.24102333912235e-11), ('cats', 6.964049996394106e-12), ('have', 6.138758338464223e-12)]
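Beyond next-word predictions, you can also inspect the trained embedding matrix itself. The sketch below (reusing model, token2id, and id2token from above) ranks the other vocabulary words by cosine similarity to a given word's embedding; with such a tiny corpus the neighbors are noisy, but it shows the vectors have moved away from their random initialization:
# Nearest neighbors in the learned embedding space (cosine similarity)
def nearest_embeddings(word, topk=5):
    W = model.embedding.weight.detach()                     # (vocab_size, embed_dim)
    v = W[token2id[word]].unsqueeze(0)                      # (1, embed_dim)
    sims = F.cosine_similarity(v, W, dim=-1)                # similarity to every vocab row
    sims[token2id[word]] = -1.0                             # exclude the query word itself
    top = torch.topk(sims, k=topk)
    return [(id2token[i.item()], round(s.item(), 3)) for s, i in zip(top.values, top.indices)]

print("Embedding neighbors of 'cats':", nearest_embeddings("cats"))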
Embedding matrices in real LLMs
- To build the embedding matrix you need two numbers:
  - vocabulary size (number of tokens), e.g., GPT-2 uses 50,257 tokens.
  - embedding dimension (size of each token vector), e.g., 768 for GPT-2 small, 1,024 or larger for bigger models.
- The embedding matrix has shape (vocab_size, embed_dim); it is initialized randomly and learned during pretraining (see the quick shape check below).
- In practice, embeddings are trained as part of the same optimization that trains the whole model (next-token prediction, masked-token prediction, or contrastive objectives, depending on the algorithm).
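As a rough illustration of that scale (shapes and parameter count only, not GPT-2's actual trained weights), you can instantiate an embedding table of GPT-2-small's size and count its entries:
from torch import nn

# GPT-2-small-sized embedding table: 50,257 tokens x 768 dimensions
gpt2_sized_embedding = nn.Embedding(num_embeddings=50257, embedding_dim=768)
print(gpt2_sized_embedding.weight.shape)    # torch.Size([50257, 768])
print(gpt2_sized_embedding.weight.numel())  # 38,597,376 ≈ 38.6 million parameters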
Summary
Instead of assigning arbitrary IDs or using one-hot vectors, embeddings provide dense representations where geometric closeness reflects semantic similarity.
They can be static (pretrained and frozen) or dynamically updated during training, as in LLM pretraining. For scale, GPT-2’s embedding matrix with a vocabulary of 50,257 and dimension of 768 already holds about 38.6 million parameters, showing how significant embeddings are in both size and importance.
Token embeddings are surprisingly powerful: a compact vector can capture syntactic and semantic relationships, support arithmetic-style analogies, and form the foundation of language model inputs.
References: https://github.com/rasbt/LLMs-from-scratch, https://www.youtube.com/watch?v=ghCSGRgVB_o&list=PLPTV0NXA_ZSgsLAr8YCgCwhPIJNNtexWu&index=10