Following Karpathy's nanoGPT Video
Have you ever wondered how a computer can write poetry like Shakespeare? By exploring a simplified GPT (Generative Pre-trained Transformer) model, we can uncover the magic behind text generation. This article guides you through the process with questions to spark curiosity and understanding, using a Python script that generates Shakespearean text as our example.
What’s the Big Idea Behind GPT?
Imagine reading “To be or not to…” and guessing the next word. You’d likely say “be,” right? GPT models predict the next character or word in a sequence based on patterns in the text they’ve seen. Our script uses Shakespeare’s works to train a model to predict the next character. Why characters? They’re simpler than words, with a small vocabulary (65 characters like letters, spaces, and punctuation). What does a model need to turn raw text into predictions?
Turning Text into Numbers
Computers don’t understand letters, so how do we make text “machine-readable”? The script:
- Reads Shakespeare’s text and lists all unique characters (e.g., ‘a’, ‘b’, ‘,’).
- Creates mappings: stoi (e.g., ‘a’ → 0) and itos (e.g., 0 → ‘a’).
- Encodes text into numbers (e.g., “hello” → [7, 4, 11, 11, 14]) and decodes numbers back to text.
Why numbers? Neural networks use math, and numbers are their language. What if two characters had the same number?
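As a rough sketch, the mappings might look like the code below (the file name shakespeare.txt is an assumption, the script's actual loading code may differ, and the exact integer assigned to each character depends on the vocabulary order):

```python
# Build a character-level vocabulary and the stoi/itos mappings (minimal sketch).
with open("shakespeare.txt", "r", encoding="utf-8") as f:  # file name is assumed
    text = f.read()

chars = sorted(set(text))                       # unique characters, e.g. '\n', ' ', ',', 'a', ...
stoi = {ch: i for i, ch in enumerate(chars)}    # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}    # integer -> character

def encode(s: str) -> list[int]:
    """Turn a string into a list of integers."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Turn a list of integers back into a string."""
    return "".join(itos[i] for i in ids)

print(len(chars))                  # roughly 65 for the Shakespeare corpus
print(encode("hello"))             # a list of five integers
print(decode(encode("hello")))     # 'hello' again
```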
Feeding the Model Data
The script loads a preprocessed file (train.bin) with Shakespeare’s text as numbers. Why preprocess? It’s faster than encoding text during training. The model trains on chunks of 32 characters (e.g., “To be or not to be, t”) to predict the next chunk (e.g., “o be or not to be, th”). Why shift by one character? This teaches the model to predict what comes next, like guessing the next word in a sentence.
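Here is a sketch of how those shifted 32-character chunks could be sampled from train.bin. The block size of 32 and the batch size of 8 follow the article; the 16-bit storage format and the exact loading code are assumptions:

```python
import numpy as np
import torch

block_size = 32    # length of each training chunk
batch_size = 8     # chunks sampled per training step (see the training section)

# train.bin holds the encoded Shakespeare text as one long array of character ids
# (assumed here to be stored as 16-bit integers).
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch():
    """Sample random chunks; the targets are the same chunks shifted by one character."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```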
Building the Brain: The Model’s Architecture
The SimpleGPT model, built with PyTorch, has three key parts:
- Embedding Layer: Converts each character into a 128-dimensional vector, like giving it a “personality.” It also adds positional information to track where characters appear in a sequence. Why care about position? Without it, “dog bites man” and “man bites dog” would seem identical.
- Transformer Layers: Three layers analyze relationships between characters using:
- Self-Attention: Focuses on relevant characters (e.g., noticing “to” often follows “be”).
- Causal Mask: Ensures the model only sees past characters, mimicking how we write. Why prevent “seeing the future”?
- Feedforward Network: Refines the attention results.
- Output Layer: Produces probability scores (logits) for each of the 65 characters, predicting the next one.
How do these parts work together to understand context?
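A condensed sketch of what such a model could look like in PyTorch is shown below. The vocabulary size (65), the 128-dimensional embeddings, and the three transformer layers follow the article; the head count and the use of nn.TransformerEncoder are assumptions made here for brevity:

```python
import torch
import torch.nn as nn

class SimpleGPT(nn.Module):
    def __init__(self, vocab_size=65, n_embd=128, n_layer=3, n_head=4, block_size=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)      # each character gets a 128-d vector
        self.pos_emb = nn.Embedding(block_size, n_embd)      # positional information
        layer = nn.TransformerEncoderLayer(
            d_model=n_embd, nhead=n_head, dim_feedforward=4 * n_embd,
            batch_first=True, norm_first=True)               # self-attention + feedforward
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.head = nn.Linear(n_embd, vocab_size)            # logits over the 65 characters

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)            # (B, T, n_embd)
        # Causal mask: -inf above the diagonal, so each position only attends to the past.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                                  # (B, T, vocab_size)
```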
Training the Model
Training teaches the model to make better predictions. The script runs 50 steps, where:
- It picks eight random 32-character chunks.
- The model predicts the next character for each position.
- A loss function measures errors, and an optimizer (Adam) tweaks the model to improve.
Why only 50 steps? It’s a demo—real models train much longer. What might more training achieve?
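A minimal sketch of that loop, reusing get_batch and SimpleGPT from the sketches above (the learning rate is an assumed value, not taken from the script):

```python
import torch
import torch.nn.functional as F

model = SimpleGPT()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)   # learning rate is an assumption

for step in range(50):                  # 50 steps: enough for a demo, far too few for quality
    x, y = get_batch()                  # 8 random 32-character chunks and their shifted targets
    logits = model(x)                   # (batch, time, vocab) prediction scores
    # Cross-entropy compares each predicted distribution with the true next character.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```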
Generating Shakespearean Text
To generate text, the model:
- Starts with a prompt (e.g., “HAMLET: To be or not to be”) or a single character.
- Encodes it into numbers and predicts the next character’s probabilities.
- Uses temperature (controls creativity) and top-k sampling (limits choices to the k most likely characters) to pick the next character.
- Repeats until it generates 200 characters or hits a newline.
Why use temperature and top-k? They balance predictable and creative output. What if temperature were very high, or top-k were 1?
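Here is a sketch of that sampling loop, reusing encode, decode, itos, block_size, and the trained model from the earlier sketches (the function name generate and its default arguments are assumptions):

```python
import torch

@torch.no_grad()
def generate(model, prompt, max_new_tokens=200, temperature=0.8, top_k=20):
    """Autoregressively sample up to max_new_tokens characters, stopping at a newline."""
    idx = torch.tensor([encode(prompt)], dtype=torch.long)   # (1, prompt length)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # the model only sees the last 32 characters
        logits = model(idx_cond)[:, -1, :]           # scores for the next character
        logits = logits / temperature                # >1 flattens, <1 sharpens the distribution
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")   # keep only the k most likely characters
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample one character id
        if itos[next_id.item()] == "\n":             # stop early at a newline
            break
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())
```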
What Makes It Shakespearean?
The model learns Shakespeare’s patterns—like “thou” or dramatic phrasing—during training. The script shows outputs with different settings:
- Conservative (temperature=0.5, top_k=10): Mimics common patterns.
- Balanced (temperature=0.8, top_k=20): Mixes predictability and creativity.
- Creative (temperature=1.2, top_k=30): Takes risks, possibly less coherent.
Which setting would you choose for a Shakespearean play?
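Using the hypothetical generate sketch above, trying the three settings side by side might look like this:

```python
prompt = "HAMLET: To be or not to be"
print(generate(model, prompt, temperature=0.5, top_k=10))   # conservative
print(generate(model, prompt, temperature=0.8, top_k=20))   # balanced
print(generate(model, prompt, temperature=1.2, top_k=30))   # creative
```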
Key Takeaways
This simple GPT shows how larger models like ChatGPT work:
- Data: Encodes text into numbers.
- Architecture: Uses embeddings, attention, and masks to process context.
- Training: Optimizes predictions via loss and updates.
- Generation: Samples from probabilities to create text.
What are the model’s limits? With brief training and a small size, it’s basic. How could you make it better? More training, larger layers, or more data could help.
Try running the script yourself! Tinker with temperature or top-k to see how the text changes. What kind of text would you want to generate?
Note from 立委: Given the importance of GPT large language models, this popular-science series was compiled based on AI guru Karpathy's nanoGPT lecture. It comprises five articles. One contains no code or mathematical formulas and is the most accessible introduction; the other four, including one in English, each come with verifiable Python code and detailed explanations from different angles, aimed at readers with some engineering background.