Unveiling the Secrets Behind ChatGPT

Introduction

Hello everyone! By now, you’ve likely heard of ChatGPT, the AI system that has taken the world and the AI community by storm. It lets you hand an AI text-based tasks and get text responses back.

The Technology Behind ChatGPT: Transformers

The neural network that powers ChatGPT is based on the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need.” GPT stands for “Generative Pre-trained Transformer.” The Transformer was a landmark development that reshaped AI, most visibly natural language processing (NLP). Initially designed for machine translation, it became the backbone of numerous AI applications, including ChatGPT.

Building a Transformer-Based Language Model

While replicating ChatGPT’s capabilities is a daunting task, we can gain valuable insights by building a much smaller Transformer-based language model. We’ll focus on a character-level language model trained on the “tiny Shakespeare” dataset, a collection of Shakespeare’s works concatenated into a single file. The dataset is approximately one megabyte in size, which makes it manageable for educational purposes.

Input: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Tokenization and Training

First, we need to tokenize the input text. Tokenization converts raw text into a sequence of integers based on a predefined vocabulary. In our case, we use a character-level tokenizer, meaning each character is assigned an integer. Here’s how you can implement it in Python:

# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

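# Output (the string starts with a newline and a space, which are part of the vocabulary):
# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# 65

There are just 65 characters in our vocabulary: the letters A-Z and a-z, the digit 3, a newline, a space, and some punctuation symbols.

Let’s write a simple tokenizer for the text: an encoder that maps a string to a list of integers, and a decoder that maps the list back to a string.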

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

# Output:
# [46, 47, 47, 1, 58, 46, 43, 56, 43]
# hii there
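One property of this tokenizer is worth seeing in isolation: it only knows characters that occurred in the text used to build the vocabulary. The sketch below uses a toy vocabulary built from a short sample string (the names toy_chars, toy_stoi, toy_itos, toy_encode, and toy_decode are hypothetical and exist only for this illustration) to show the round trip and what happens with an unseen character.

```python
# Toy vocabulary built from a short sample string. This is an assumption for
# illustration; the article builds its vocabulary from the full dataset.
sample_text = "hii there"
toy_chars = sorted(set(sample_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer id (KeyError for unseen characters)."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

# Encoding followed by decoding returns the original string.
round_trip = toy_decode(toy_encode("hi there"))

# A character that never appeared in sample_text has no id.
try:
    toy_encode("hello")
    missing = None
except KeyError as err:
    missing = err.args[0]  # the first character that was not in the vocabulary

print(round_trip, repr(missing))
```

With the full tiny Shakespeare vocabulary the same applies: any character outside the 65 known ones cannot be encoded without extending the mapping.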

Let’s now encode the entire text dataset and store it in a torch.Tensor:

import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

# Output: 
# torch.Size([1115394]) torch.int64

Next, we split the dataset into training and validation sets to monitor overfitting. The training set comprises the first 90% of the data, while the remaining 10% serves as the validation set.

n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
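As a quick sanity check on the split, the arithmetic can be reproduced without PyTorch. The sketch below assumes only the dataset length printed earlier (1115394) and the same 90% rule; the variable names are illustrative.

```python
total_len = 1115394             # length of the encoded dataset, from data.shape above
n = int(0.9 * total_len)        # same split point as in the article
train_len = n                   # data[:n]
val_len = total_len - n         # data[n:]
print(train_len, val_len)

# The same slicing on a small stand-in list: the two parts are disjoint
# and together cover every element in the original order.
toy = list(range(20))
k = int(0.9 * len(toy))
toy_train, toy_val = toy[:k], toy[k:]
```

Because the split is a plain slice rather than a shuffle, the validation set is the final stretch of the text, which the model never sees during training.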

Now, let’s define the block size.

What is Block Size?

Block size is the maximum length of the text sequence that the model processes at one time, in other words its maximum context length. When training a Transformer-based language model, it defines the length of the text chunks (or blocks) sampled from the data. For example, if the block size is set to 8, the model sees at most 8 characters of context when predicting the next character.

block_size = 8
train_data[:block_size+1]

Why the “+1”?

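The “+1” is there because, during training, the model predicts the next character at every position of a chunk. A chunk of block_size + 1 characters therefore packs block_size training examples, one for each context length from 1 up to block_size. To build them, we need two sets of data:

Inputs (X): the first block_size characters of the chunk; the context at position t is x[:t+1].
Targets (Y): the same characters shifted by one position, so y[t] is the character that follows the context x[:t+1].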

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

Output of the above code:

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58
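The integers are easier to follow once decoded back into characters. The sketch below rebuilds the 65-character vocabulary from the output printed earlier (a newline, a space, the listed punctuation and digit, then A-Z and a-z; this reconstruction is an assumption based on that output) and decodes the same nine-token chunk in pure Python.

```python
import string

# Vocabulary reconstructed from the printed character list; sorted() gives the
# same ordering as sorted(list(set(text))) would for these characters.
vocab = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
itos = {i: ch for i, ch in enumerate(vocab)}

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

block_size = 8
# The inputs 18 ... 47 plus the final target 58, copied from the output above.
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]

# One (context, next character) pair per position in the chunk.
pairs = [(decode(chunk[: t + 1]), decode([chunk[t + 1]])) for t in range(block_size)]
for context, target in pairs:
    print(f"when input is {context!r} the target is {target!r}")
```

Decoded this way, the chunk is the opening of the dataset, and each of the 8 examples asks the model to extend the text by one more character.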

What is Batch Size?

Batch size is the number of independent sequences that are processed together in one training iteration. Instead of updating the model’s weights after each individual sequence, we update them after processing a whole batch, which makes training more efficient and more stable.

torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

The output of the above code is:

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

Here we sample 4 rows of 8 elements each (4 = batch size, 8 = block size). Each row of the targets is the corresponding input row shifted forward by one position. For example, in the first row, when the input is [24] the target is 43, and when the input is [24, 43] the target is 58.

We will continue in the next part of this blog series.
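To make the relationship between the two tensors concrete, the sketch below copies the printed batch into plain Python lists (no PyTorch needed) and unrolls it into individual training examples. The names examples and shift_ok are illustrative, not part of the original code.

```python
# Inputs and targets copied from the batch printed above.
xb = [
    [24, 43, 58, 5, 57, 1, 46, 43],
    [44, 53, 56, 1, 58, 46, 39, 58],
    [52, 58, 1, 58, 46, 39, 58, 1],
    [25, 17, 27, 10, 0, 21, 1, 54],
]
yb = [
    [43, 58, 5, 57, 1, 46, 43, 39],
    [53, 56, 1, 58, 46, 39, 58, 1],
    [58, 1, 58, 46, 39, 58, 1, 46],
    [17, 27, 10, 0, 21, 1, 54, 39],
]

# Every (row, position) pair is one training example:
# the context is the row up to and including position t, the target is yb[b][t].
examples = []
for b in range(len(xb)):            # batch dimension
    for t in range(len(xb[b])):     # time dimension
        examples.append((xb[b][: t + 1], yb[b][t]))

# Each target row is its input row shifted left by one position.
shift_ok = all(x_row[1:] == y_row[:-1] for x_row, y_row in zip(xb, yb))

print(len(examples), shift_ok)
```

A batch of 4 sequences with block size 8 therefore carries 4 × 8 = 32 training examples, and the rows never share context with one another.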

For the complete code, you can check this notebook: gpt-dev.ipynb (Google Colab)

For the complete video walkthrough, see: https://www.youtube.com/watch?v=kCc8FmEb1nY
