The Transformer architecture and its attention mechanism form the foundation of mainstream GPT large language models, making them extraordinarily important. Despite the abundance of explanations and popular science articles on this topic, many friends tell me they still find it bewildering or only partially understand it. So I've decided to write a couple of posts to share my own understanding.
As someone curious about mainstream AI, you've likely heard of the renowned Transformer framework and its "attention mechanism" that powers large language models, perhaps considering them mysterious concepts. You may have read the classic paper "Attention is All You Need," but still found it confusing or difficult to decode. Don't worry—this is completely normal, and most of us have gone through this stage! While the paper may be a bit mind-bending, its core logic isn't actually that complex.
To understand the Transformer architecture in AI large language models (LLMs), we need to break down its workflow. First, we should understand how large language models work and how they're trained. Base large language models gain knowledge from data through "self-supervised learning" using multi-layer neural networks. Self-supervised learning is a special type of machine learning that uses "masking" to generate supervision signals. While supervised learning typically uses human-annotated data with output targets, self-supervised learning requires no human annotation. Instead, it masks certain data points and trains the system to predict them (like "filling blanks" or "continuing sequences"), using the masked data as the correct answer and supervision signal. Mainstream GPT models mask the next word, training the system to predict it based solely on previous context (called "next token prediction")—this is the current paradigm for generative AI.
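To make "next token prediction" concrete, here is a minimal sketch in Python (the sentence and the loop are purely illustrative, not a real training pipeline): the supervision signal is simply the next token itself, obtained by shifting the sequence by one position.

```python
# Toy illustration of how next-token-prediction training pairs are built.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Each training example: given the prefix (context), predict the "masked" next token.
for i in range(1, len(tokens)):
    context = tokens[:i]   # what the model sees
    target = tokens[i]     # the hidden next token = the supervision signal
    print(f"input: {context} -> predict: {target}")
```

No human annotation is needed anywhere: the text provides its own answers.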
The Complete Process from Input to Output
1. Starting with "Dictionary Lookup": Tokenization and Embedding
To understand an input text for next token prediction, the model first needs to break it down into basic units. This step, called tokenization, converts text into a sequence of tokens (the smallest units of text). These tokens might be complete words (like "work") or subwords (like "un+believ+able").
Tokens are symbols, and computers struggle with direct symbol manipulation—they only work well with numbers. So we need to convert tokens into numbers.
Each token is converted into a numerical representation (a multi-dimensional vector) by looking it up in an embedding dictionary. These vectors typically have several hundred to a few thousand dimensions (imagine building a feature representation for each word across many conceptual dimensions, such as noun, singular, organization, finance, etc.). Embedding gives words positions in a semantic space, so relationships between meanings become computable.
This multi-dimensional vector space acts like a "meaning space" in which each token's vector defines its position. The distances between token vectors reflect their semantic similarities and differences. This aligns with our intuition: a word's meaning becomes apparent through comparison with other words.
These vectors aren't randomly generated but are numerically encoded representations trained on massive natural text corpora, providing the basic semantic information of tokens—their position in meaning space. For example, the vector for "bank" naturally sits closer to "money" and far from "trees." Similarly, the vector for "apple" might contain information about both "fruit" and "technology company."
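As a toy illustration of "distance in meaning space": the vectors below are invented for demonstration (a real model would read them from its trained embedding table), and cosine similarity is one common way to measure how close two token vectors are.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: close to 1.0 means same direction in meaning space.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented 4-dimensional vectors, purely for illustration.
bank  = np.array([0.9, 0.8, 0.1, 0.0])
money = np.array([0.8, 0.9, 0.0, 0.1])
trees = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine(bank, money))  # high: "bank" sits near "money"
print(cosine(bank, trees))  # low: "bank" sits far from "trees"
```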
Imagine trying to help a computer understand the sentence: "The cat sat on the mat."
Step one: Tokenization breaks this sentence into individual tokens: The+cat+sat+on+the+mat.
Step two: Dictionary lookup (Embedding) finds a numerical representation—a multi-dimensional vector—for each token.
"cat" -> [0.1, 0.5, -0.2, ...]
"sat" -> [-0.3, 0.8, 0.1, ...]
...
Simply put:
Tokenization breaks text into the smallest units (tokens) that computers can easily process and analyze.
Embedding converts these tokens into vectors that computers can easily calculate and combine.
Key point: The vectors obtained from the embedding dictionary are only the "initial meaning representations" of tokens, without considering their specific context. Decoding contextual meaning from vector representations is the task of the next steps, using the multi-layer neural networks + attention mechanism in the Transformer architecture.
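Here is a minimal sketch of the two steps above, assuming a tiny hand-made vocabulary and a randomly initialized embedding table (real models use subword tokenizers such as BPE and learn the table during training):

```python
import numpy as np

sentence = "The cat sat on the mat"

# Step 1: tokenization (whitespace splitting here; real tokenizers use subwords).
tokens = sentence.lower().split()   # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Step 2: embedding lookup in a (vocab_size x dim) table.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
dim = 8                             # real models use hundreds to thousands of dimensions
embedding_table = np.random.randn(len(vocab), dim)

vectors = np.stack([embedding_table[vocab[tok]] for tok in tokens])
print(vectors.shape)                # (6, 8): one vector per token, still context-free
```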
The core modules of a Transformer block can be broken down into two parts:
1. Attention mechanism: calculates the relevance between tokens and dynamically updates each token's representation.
2. Feed-forward neural network: further transforms each token's representation (applied to each position independently).
The entire Transformer stacks many such blocks, and each layer's attention recalculates the token representations, deepening the understanding step by step.
2. Attention Takes the Stage: Updating Word Meanings Based on Context
Now we have a sequence of vectors, each representing the "initial meaning" of a token. But here's the problem: the same word can have different meanings in different contexts! For instance, "bank" can mean a financial institution or a riverbank.
The core of the Transformer architecture is the attention mechanism (self-attention), which serves to dynamically adjust the representation of each token based on context, reflecting its relationships with other tokens.
For example: In the sentence "I like to eat apples," "apple" and "eat" are highly correlated, so the model will rely more on the word "eat" to update the meaning of "apple," determining that "apple" here refers to fruit rather than a company.
How is this done?
The model calculates attention weights between each token and other tokens through QKV attention:
- Query: querying vector of the current token (e.g., "he")
- Key: key vectors of contextual tokens (e.g., "police," "witness")
- Value: the content vectors of the tokens, which are weighted and combined to form the updated representation
For example, through matrix operations, the model discovers that "he" is most strongly associated with "witness," so it updates the vector for "he" to carry information from "witness."
Calculating "relevance": For each token, we calculate its "relevance" with all other tokens in the sentence, assigning different "attention weights" (attention scores) to different tokens. This "relevance" can be understood as: how important are other tokens when understanding the meaning of the current token.
* For example, when understanding the word "sat," "cat" and "mat" are obviously more important than "the."
Weighted average: Based on the calculated "relevance" (i.e., token weights), we take a weighted average of the V vectors from all tokens in the context to obtain a new vector representation for this token. This new vector is the meaning representation of the current token in this specific sentence.
For instance, the new vector for "sat" will be more influenced by the vectors of "cat" and "mat," and less by the vector of "the."
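These two steps, relevance scores followed by a weighted average of the V vectors, can be written compactly as scaled dot-product attention. Below is a minimal single-head sketch with random, untrained weights; real implementations add multiple heads, masking, and other details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, dim) token vectors; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every token to every other token
    weights = softmax(scores)                # attention weights; each row sums to 1
    return weights @ V                       # weighted average of V = updated representations

# Toy usage: 6 tokens, 8-dimensional vectors, random (untrained) weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 8): one context-aware vector per token
```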
Key point: The attention mechanism dynamically updates the meaning of each token by calculating the relevance between tokens. This update is context-based—the same token will have different representations in different sentences.
This way, each token's meaning is no longer fixed but changes dynamically based on the entire sentence's context. For example, in "I saw a bat," "bat" could refer to either a flying mammal or a sports implement, but the attention mechanism will combine the bigger context to infer its more appropriate meaning.
For details on how QKV works in the attention mechanism, please refer to the companion article "How to Understand QKV Division of Labor in Self-Attention Mechanism?"
3. The Transformer Backbone: Multi-layer Progressive Information Compression
The core building blocks of Transformer can be broken down into two parts:
Multi-head attention layer: Used to calculate relevance between tokens and dynamically update token representations.
Feed-forward neural network layer: further processes and transforms the information (compression, abstraction).
The entire Transformer consists of multiple such blocks stacked together, with each layer recalculating token representations for a deeper understanding. Depending on the number of blocks, this update process is repeated again and again. Like a human re-reading a text several times, each layer deepens the understanding; deeper layers can capture more complex semantic relationships.
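A minimal sketch of one such block, stacked a few times with toy sizes and random weights (real blocks also use multi-head attention, layer normalization, positional information, and far larger dimensions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Tokens exchange information: relevance scores, then weighted average of values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def feed_forward(X, W1, W2):
    # Position-wise FFN: transforms each token's vector independently (no token mixing).
    return np.maximum(0, X @ W1) @ W2

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    X = X + attention(X, Wq, Wk, Wv)  # attention sub-layer with residual connection
    X = X + feed_forward(X, W1, W2)   # feed-forward sub-layer with residual connection
    return X

# Stack several blocks: each layer re-reads the sequence and deepens the representation.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))           # 6 tokens, 8-dimensional vectors (toy sizes)
for _ in range(4):                    # 4 layers here; real models use dozens
    params = [rng.normal(size=s) for s in [(8, 8)] * 3 + [(8, 32), (32, 8)]]
    X = transformer_block(X, *params)
print(X.shape)                        # (6, 8): same shape, progressively refined meaning
```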
Each Transformer block iteratively upgrades understanding, for example:
- Bottom layers: Capture local grammar (such as the contrasting relationship in "not...but...")
- Middle layers: Understand "who 'he' actually refers to"
- Top layers: Grasp the main theme of the entire text
The main features of the Transformer
1. Parallel computation: token processing is decoupled from word order, allowing all tokens to be processed in parallel (in contrast to the sequential, token-by-token processing of earlier RNNs)
2. Hierarchical understanding: Progressive interpretation from literal meaning to deep intention, capturing patterns both large and small.
4. Output: The Model's Final Prediction
Transformer models can be used for various tasks. Different tasks have different forms of output.
GPT: Next Token Prediction
For mainstream GPT models, the ultimate task is to predict what comes next through "autoregressive" next token prediction: each newly predicted token is appended to the context, which is then fed back to the model, continuing the text word by word. The model decides what content should logically follow based on its deep understanding of the context. This opened the path to general AI, because sequence learning of this kind masters the "code" for converting inputs to outputs across general tasks, but that's a topic for another article.
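A hedged sketch of this autoregressive loop: the `model` below is just a stand-in function that maps a context to the most likely next token (a real model computes this with its full Transformer stack), but the loop structure is the point: predict, append, feed the longer context back in.

```python
def generate(model, prompt_tokens, max_new_tokens=20, stop_token="<eos>"):
    """Greedy autoregressive decoding: each prediction becomes part of the new context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)   # next-token prediction over the full context so far
        if next_token == stop_token:
            break
        tokens.append(next_token)    # the context grows; continuation proceeds word by word
    return tokens

# Toy stand-in "model" that simply replays a canned continuation.
canned = iter(["on", "the", "mat", "<eos>"])
print(generate(lambda ctx: next(canned), ["The", "cat", "sat"]))
# -> ['The', 'cat', 'sat', 'on', 'the', 'mat']
```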
5. Summary
Tokenization and Embedding lay the foundation for computers to understand text, similar to looking up a dictionary.
Attention mechanism calculates relevance between tokens and dynamically updates token representations.
Transformer consists of neural network layers + attention layers, optimizing token representations layer by layer, covering various relationships at different levels.
The final output depends on the task. Translation models generate target language text. GPT is responsible for predicting the next token, ultimately evolving this simple prediction mechanism into a general-purpose large model capable of unlocking various tasks.