Unveiling the Secrets Behind ChatGPT – Part 2

For part 1 refer to this: Unveiling the Secrets Behind ChatGPT – Part 1 (learncodecamp.net)

Implementing a Bigram Language Model

When diving into the world of natural language processing (NLP) and language modeling, starting with a simple baseline model is essential. It helps establish a foundation to build upon. One of the simplest and most intuitive models for language generation is the bigram language model. This blog post will walk you through the implementation of a bigram language model using PyTorch, explaining the key concepts, steps, and code snippets along the way.

Introduction to the Bigram Language Model

A bigram language model predicts the next word in a sequence based solely on the previous word. It’s a straightforward approach that captures some of the local dependencies in the text. While it’s not as powerful as more complex models, it’s a great starting point to understand the basics of language modeling.

As we are working with characters in this example, our bigram model will look at just the previous character.
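To make the idea concrete before moving to PyTorch, here is a minimal count-based sketch of a character bigram model in plain Python. It is not the approach used in the rest of this post (which learns the table by gradient descent); the function names `bigram_counts` and `next_char_probs` and the toy string are made up for illustration.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_probs(counts, prev):
    """Normalize the counts for one previous character into probabilities.

    Assumes `prev` was seen at least once with a following character.
    """
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}

counts = bigram_counts("hello hello")
probs = next_char_probs(counts, "l")
print(probs)  # 'l' is followed by 'l' twice and by 'o' twice
```

The neural version below stores the same kind of information, one row of next-character scores per character, but learns it from data instead of counting.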

Implementing the Bigram Language Model in PyTorch

This model will include an embedding layer that maps input tokens to vectors and a forward method to compute the logits for the next token prediction.

In the constructor, we create a token embedding table of size vocab_size x vocab_size using nn.Embedding. The forward method processes input indices to produce logits, which are the scores for the next character in the sequence. If targets are provided, it also computes the cross-entropy loss.

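The code below assumes vocab_size = 65 and that xb and yb are the input and target tensors of one batch. These, along with the decode and get_batch helpers used later, come from the earlier data-preparation steps.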

import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()  # must run before submodules are assigned on an nn.Module
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            # cross_entropy expects (N, C) logits and (N,) targets, so flatten B and T
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

Evaluating the Loss

The loss function used here is the negative log-likelihood, implemented in PyTorch as cross-entropy loss. It measures the quality of the logits with respect to the targets: the more probability the model assigns to the character that actually comes next, the lower the loss.
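As a rough sanity check, the cross-entropy for a single prediction can be worked out by hand. The sketch below is plain Python, not the PyTorch implementation; the helper name `cross_entropy` and the chosen target index are illustrative. It shows that a model with no preference among 65 characters scores ln(65), about 4.17, which is a useful baseline for an untrained model.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

vocab_size = 65

# All logits equal: every character gets probability 1/65
uniform_logits = [0.0] * vocab_size
baseline = cross_entropy(uniform_logits, target=3)

# Raising the logit of the correct character lowers the loss
confident_logits = list(uniform_logits)
confident_logits[3] = 5.0
better = cross_entropy(confident_logits, target=3)

print(f"uniform: {baseline:.4f}, confident and correct: {better:.4f}")
```

If the untrained model reports a loss well above this baseline, its random initial logits are confidently wrong on some characters.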

Generating Text from the Model

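To see what the model has learned, we generate text from it. The generate function extends the input sequence one token at a time: it takes the logits at the last time step, converts them to probabilities with softmax, samples the next token with torch.multinomial, and appends it to the sequence. Before training, the output is essentially random characters.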
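The sampling loop can be tried in isolation. The following is a minimal sketch, assuming a toy vocabulary of 5 tokens and a random tensor named `table` standing in for the learned lookup table; neither comes from the post's dataset.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 5                                # toy vocabulary, assumed for illustration
table = torch.randn(vocab_size, vocab_size)   # stand-in for the learned logits table

idx = torch.zeros((1, 1), dtype=torch.long)   # start from token 0, shape (B=1, T=1)
for _ in range(10):
    logits = table[idx[:, -1]]                          # look up the last token only: (B, C)
    probs = F.softmax(logits, dim=-1)                   # scores -> probabilities
    idx_next = torch.multinomial(probs, num_samples=1)  # sample one token: (B, 1)
    idx = torch.cat((idx, idx_next), dim=1)             # sequence grows by one

print(idx.shape)  # torch.Size([1, 11])
```

Sampling, rather than always taking the most likely token, is what gives different text on each run.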

Training the Model

To make the model useful, we need to train it on a corpus of text. Here, we use the AdamW optimizer. Each step samples a batch, computes the loss, backpropagates, and updates the parameters.

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
batch_size = 32
for steps in range(100): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

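    # evaluate the loss
    logits, loss = m(xb, yb)

    # clear old gradients, backpropagate, and update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())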


Evaluating the Model

After training, we evaluate the model’s performance. Although a bigram model is quite limited, the loss should decrease as training progresses, indicating the model’s improving ability to predict the next token.
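The loss of a single batch is noisy, so one common way to evaluate is to average it over several batches with gradients disabled. The sketch below illustrates that idea and is not code from this post: `ToyBigram`, `estimate_loss`, and `random_batch` are hypothetical names, and the batches are random token ids rather than real text.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class ToyBigram(nn.Module):
    """Hypothetical stand-in that returns (logits, loss) like the model above."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        logits = self.table(idx)  # (B, T, C)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(model, get_batch, eval_iters=20):
    """Average the loss over several batches to smooth out batch-to-batch noise."""
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        xb, yb = get_batch()
        _, loss = model(xb, yb)
        losses[k] = loss.item()
    model.train()
    return losses.mean().item()

torch.manual_seed(0)
vocab_size = 65

def random_batch(batch_size=32, block_size=8):
    # Random token ids; a real run would draw from a held-out split instead
    x = torch.randint(vocab_size, (batch_size, block_size))
    y = torch.randint(vocab_size, (batch_size, block_size))
    return x, y

avg_loss = estimate_loss(ToyBigram(vocab_size), random_batch)
print(f"average loss: {avg_loss:.3f}")
```

Calling such a function every few hundred training steps gives a steadier picture of progress than printing the loss of the latest batch.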

Generating Improved Text

With the trained model, we can now generate text that should be more coherent than the initial random outputs.

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

Let’s learn more about B, T, and C

B, T, and C refer to the dimensions of the tensors that represent batches of sequences of data. Let’s break down what each of these dimensions stands for:

  1. B (Batch Size): This dimension represents the number of sequences or samples processed together in one forward/backward pass of the neural network. Using batches allows for more efficient computation and training on modern hardware. For instance, if B=32, it means that 32 sequences are being processed simultaneously.
  2. T (Time Steps or Sequence Length): This dimension corresponds to the length of each sequence in the batch. In the context of language models, this usually represents the number of tokens (e.g., characters, words) in each sequence. If T=8, it means each sequence contains 8 tokens.
  3. C (Channels or Vocabulary Size): In language models, this dimension often represents the size of the vocabulary, i.e., the number of unique tokens (characters, words, etc.) that the model can recognize. For instance, if C=65, it means the vocabulary contains 65 unique tokens.
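The B and T dimensions can be seen by building one batch by hand. This is a small sketch with assumed values (B=4, T=8) and a tensor of consecutive integers standing in for the encoded text, so that the one-step shift between inputs and targets is easy to spot.

```python
import torch

torch.manual_seed(0)
B, T = 4, 8
data = torch.arange(100)  # stand-in for the encoded text: 0, 1, 2, ...

# Pick B random starting positions, then cut T tokens from each
ix = torch.randint(len(data) - T, (B,))
xb = torch.stack([data[i:i + T] for i in ix])          # inputs:  (B, T)
yb = torch.stack([data[i + 1:i + T + 1] for i in ix])  # targets: (B, T), shifted by one

print(xb.shape, yb.shape)
print(xb[0])
print(yb[0])  # each target is the token that follows the matching input
```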

Tensor Shapes in the Bigram Language Model

Embedding Table

The embedding table is a matrix of size vocab_size x vocab_size, that is, C x C. Each row belongs to one token of the vocabulary. In this model the row is used directly as the logits for the next token, so row i holds the scores for every token that could follow token i.

Inputs and Outputs

When we input a batch of sequences into the model, it has the shape [B, T], where B is the batch size and T is the sequence length. Each element in this tensor is an integer index corresponding to a token in the vocabulary.


After processing the input through the embedding layer, we get a tensor of shape [B, T, C]:

  • B (Batch Size): Number of sequences being processed simultaneously.
  • T (Time Steps): Number of tokens in each sequence.
  • C (Channels or Vocabulary Size): For each token in the sequence, the model outputs a score (logit) for each possible token in the vocabulary.
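The shape change from [B, T] to [B, T, C] can be checked directly. The snippet below is a sketch with random token ids and assumed sizes (B=4, T=8, C=65); it also confirms that each output vector is just the table row selected by the corresponding input index.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C = 4, 8, 65
table = nn.Embedding(C, C)         # C rows, each of length C
idx = torch.randint(C, (B, T))     # random token ids, shape (B, T)

logits = table(idx)
print(idx.shape, "->", logits.shape)  # torch.Size([4, 8]) -> torch.Size([4, 8, 65])

# The vector at position (b, t) is the row of the table picked by idx[b, t]
same_row = torch.equal(logits[0, 0], table.weight[idx[0, 0]])
print(same_row)
```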

Reshaping for Loss Calculation

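When calculating the loss, the shapes of the logits and targets need to match what PyTorch's cross-entropy function expects. F.cross_entropy wants the class dimension second: logits of shape [N, C] with targets of shape [N]. The model produces logits of shape [B, T, C] and targets of shape [B, T], so both are flattened with view to [B*T, C] and [B*T] before the loss is computed.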
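Here is a short sketch of that reshaping, using random logits and targets with assumed sizes (B=4, T=8, C=65). As a cross-check, it also computes the loss by moving the class dimension to position 1, which F.cross_entropy accepts as well; both routes should give the same value.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)
targets = torch.randint(C, (B, T))

# Flatten batch and time into one dimension: (B*T, C) and (B*T,)
flat_logits = logits.view(B * T, C)
flat_targets = targets.view(B * T)
loss = F.cross_entropy(flat_logits, flat_targets)

# Alternative layout: classes in dimension 1, i.e. (B, C, T) with targets (B, T)
alt_loss = F.cross_entropy(logits.transpose(1, 2), targets)

print(flat_logits.shape, flat_targets.shape)
print(loss.item(), alt_loss.item())
```

Passing the unflattened [B, T, C] tensor directly would make PyTorch treat T as the class dimension, which is why the reshape is needed.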


The bigram language model serves as a fundamental stepping stone in language modeling. While it’s a simple approach, it forms the basis for more advanced models that consider longer contexts and dependencies. By understanding and implementing the bigram model, we lay the groundwork for exploring more sophisticated architectures like the Transformer, which can handle much more complex language tasks.

We will continue in the next part of this series.

For complete code, you can check this notebook: gpt-dev.ipynb – Colab (google.com)

For a complete video you can check this: https://www.youtube.com/watch?v=kCc8FmEb1nY
