Introduction
In the attention mechanism used by Transformer-based Large Language Models (LLMs) such as GPT, the core idea is to let the model dynamically focus on relevant parts of the input sequence when generating or understanding text.
This is achieved through a process called scaled dot-product attention, where input tokens (e.g., words or subwords) are transformed into three types of vectors: Query (Q), Key (K), and Value (V). These are not arbitrary; they're learned projections of the input embeddings via linear transformation matrices.
| Vector | Definition | Role in Attention |
|---|---|---|
| Query (Q) | A vector representing the "question" or current token's need for information. For a given token: $Q = X W_Q$ | Acts as the "search query" to determine what information to retrieve from other tokens. It's dotted with Keys to compute relevance scores. |
| Key (K) | A vector representing the "label" or summary of each token's content: $K = X W_K$ | Used to match against the Query. The similarity is measured with $\frac{Q K^\top}{\sqrt{d_k}}$ |
| Value (V) | A vector holding the actual "content" or features of each token: $V = X W_V$ | The output is a weighted sum of these: $\text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$ |
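The projections and the weighted sum above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with toy dimensions and random (untrained) weight matrices, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head attention over a sequence of token embeddings X (seq, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # learned linear projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # relevance of each token to every other
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ V                            # weighted sum of Values

# Toy example: 4 tokens, embedding size 8 (illustrative numbers, not real model sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape)  # one context-aware vector per input token
```

Each output row is a blend of the Value vectors, weighted by how strongly that token's Query matched every Key.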
- Dimensions: Typically, Q, K, and V have the same dimension (e.g., 64 or 512 in multi-head attention), and the softmax ensures weights sum to 1.
- Multi-Head Attention: In practice, LLMs use multiple “heads” (parallel attention computations with different projections), concatenating results for richer representations.
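The multi-head variant can be sketched by splitting the projected vectors into parallel heads, attending in each, and concatenating the results. Again a toy sketch with random weights and made-up dimensions, assuming the common convention that each head gets `d_model / n_heads` features:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq, d_model); each projection matrix: (d_model, d_model)."""
    seq, d_model = X.shape
    d_head = d_model // n_heads

    def project_and_split(W):
        # Project, then split features into n_heads heads: (heads, seq, d_head).
        return (X @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (project_and_split(W) for W in (W_q, W_k, W_v))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    out = softmax(scores) @ V                             # (heads, seq, d_head)
    # Concatenate heads back together, then apply the output projection.
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
seq, d_model, n_heads = 5, 16, 4
X = rng.standard_normal((seq, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(y.shape)
```

Because each head has its own projections, different heads can specialize in different relationships (e.g., syntax vs. long-range references) before their outputs are merged.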
Intuition Behind Q, K, V
Think of attention like a content-addressable memory system or a smart database lookup—it’s inspired by how humans retrieve information from memory by associating cues:
Query (Q) as “What am I looking for?”: Imagine you’re reading a sentence and processing the word “bank.” Your Query might ask, “Is this the river bank or financial bank?” It probes the sequence for clues.
Key (K) as "What do I have available?": Each word in the input (e.g., "river" or "money") has a Key that summarizes its essence. The dot product $Q \cdot K$ is like a similarity score: high if the Key matches your Query (e.g., "river" scores high for river-bank context), low otherwise.
Value (V) as “What do I retrieve?”: Once matches are found (via softmax-normalized weights), you pull the full details (Values) from those matching tokens and blend them into a context-aware representation for your current word. This lets the model “attend” to distant or relevant parts of the text without rigid sequential processing.
Without attention, recurrent models (RNNs) struggle with vanishing gradients over long sequences.
The scaling by $\sqrt{d_k}$ prevents dot products from growing too large in high dimensions, which would push the softmax into regions with vanishingly small gradients.
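The effect of the scaling is easy to demonstrate empirically: for random vectors with unit-variance components, the standard deviation of the raw dot product grows like $\sqrt{d_k}$, while dividing by $\sqrt{d_k}$ keeps it near 1 regardless of dimension. A small sketch (toy dimensions, random data):

```python
import numpy as np

rng = np.random.default_rng(2)
for d_k in (4, 64, 1024):
    # 10,000 random query/key pairs with unit-variance components.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)        # unscaled dot products: spread grows with d_k
    scaled = raw / np.sqrt(d_k)      # what scaled attention actually feeds to softmax
    print(f"d_k={d_k:5d}  raw std={raw.std():6.1f}  scaled std={scaled.std():4.1f}")
```

Without the scaling, the larger scores at high $d_k$ would saturate the softmax into near one-hot weights, making gradients tiny during training.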
In short: Q asks, K answers “how relevant?”, and V delivers the goods—turning raw sequences into contextually rich outputs!
For a more detailed walkthrough, you can watch this video: https://www.youtube.com/watch?v=UjdRN80c6p8&t=1156s