Attention and Transformers

Modern large language models are built on the transformer architecture, introduced in the 2017 paper “Attention Is All You Need.” The key idea behind transformers is attention: a mechanism that allows the model to decide which parts of the input are most relevant when processing each token.

Before transformers, many sequence models processed text step by step. Recurrent neural networks, for example, read tokens in order and maintained a hidden state. This made long-range dependencies difficult. Information from early in a sequence could fade or become compressed before the model reached later tokens.

Attention changed this. Instead of forcing information through a single sequential pathway, attention allows each token to look across other tokens and assign different levels of importance to them.

For example, in the sentence:

The developer fixed the bug because it caused the app to crash.

The word “it” refers to the bug, not the developer or the app. An attention mechanism can help the model connect “it” with the earlier phrase that gives it meaning.

This ability to connect distant parts of text is one reason transformers are so powerful for language tasks.

Self-attention in concrete terms

Self-attention means that tokens in a sequence attend to other tokens in the same sequence. Each token asks: “Which other tokens are relevant to understanding me?”

Consider this sentence:

The cat sat on the mat because it was tired.

When the model processes the token “it,” it needs to infer what the pronoun refers to. Candidates include “cat” and “mat.” Based on patterns learned during training, the model may attend strongly to “cat” because animals get tired, while mats usually do not.

A simplified attention pattern might look like this:

Token being processed: it

cat      ██████████ high relevance
mat      ██         low relevance
sat      █          low relevance
tired    █████      medium relevance (later token; visible only if the architecture can look ahead)

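In a GPT-style model, each token can attend only to itself and to earlier tokens, never to later ones. This holds during training as well as during generation. In the pattern above, that means “it” cannot see “tired.” When predicting the next token after “it was,” the model may attend back to “cat” to continue the sentence coherently.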
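The restriction to earlier tokens can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model implements it: the function names and the raw scores are made up, and the only assumption is that blocked positions receive a score of negative infinity before the scores are normalized into weights.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1; -inf scores get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """scores[i][j] is the raw relevance of token j to token i.
    Token i may attend only to positions j <= i (itself and earlier tokens)."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return weights

tokens = ["The", "cat", "sat"]
# Made-up relevance scores, purely for illustration.
scores = [
    [1.0, 2.0, 0.5],
    [0.5, 1.0, 3.0],
    [0.2, 2.5, 1.0],
]
weights = causal_attention_weights(scores)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

The first token can attend only to itself, so its entire weight lands on position 0. Each later row spreads its weight over a growing prefix of the sequence.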

Query, key, and value without heavy math

Self-attention is often explained using three terms: query, key, and value.

An intuitive analogy is a library search.

A query is what you are looking for.
A key is a label or description of what each item contains.
A value is the actual information you retrieve if the item is relevant.

For each token, the model creates a query vector, a key vector, and a value vector. These are learned numerical representations. When processing a token, the model compares that token’s query with the key of every token it is allowed to attend to, including its own. Tokens whose keys match the query well receive higher attention weights, and the weights are normalized so they sum to one. The model then combines the corresponding values into a weighted average.

Simplified:

Current token creates a query:
"I need information relevant to this token."

Other tokens provide keys:
"Here is what I contain."

Relevant tokens provide values:
"Here is the information to use."

This lets the model dynamically route information. The same word can attend to different context depending on the sentence.

For example, the word “bank” may attend to different tokens in different contexts:

She deposited money at the bank.
The boat stopped near the river bank.

The token is the same English word, but the surrounding tokens change its meaning.
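The query, key, and value comparison described in this section can be sketched for a single query. The vectors below are hypothetical two-dimensional values chosen by hand; in a real model they are learned and much larger. The sketch scores each key with a dot product scaled by the square root of the vector dimension, as in the original transformer paper, and the helper names are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query over all keys and values."""
    scale = math.sqrt(len(query))
    # Compare the query with every key.
    scores = [dot(query, k) / scale for k in keys]
    # Normalize the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Combine the values, weighted by how well each key matched.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical vectors for three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The query points along the first dimension, so the first and third keys match it equally well and the second key matches least. The output therefore leans toward the first dimension of the values.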

Multi-head attention

A transformer does not use just one attention pattern. It uses multi-head attention. Each attention head can learn to focus on different relationships.

One head might specialize in nearby syntax. Another might track subject-verb agreement. Another might connect pronouns to nouns. Another might detect code indentation or function definitions. These descriptions are simplified, and real heads are distributed and complex, but the intuition is useful.

For the sentence:

The engineer who reviewed the logs found the error.

Different heads might attend to different structures:

Head 1: engineer ↔ found
Head 2: reviewed ↔ logs
Head 3: error ↔ found
Head 4: who ↔ engineer

Multiple heads allow the model to represent several relationships at once. Language is not one-dimensional. Meaning depends on grammar, reference, topic, style, world knowledge, and task instructions simultaneously.
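A rough sketch of the multi-head idea: split each token vector into slices, run attention separately on each slice, and concatenate the results. This is a deliberate simplification. Real transformers apply learned projection matrices to produce each head's queries, keys, and values, plus a learned output projection. Here each slice serves directly as query, key, and value, and all names and numbers are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention; each vector acts as query, key, and value
    (learned projections are omitted in this sketch)."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(q))
        ])
    return outputs

def multi_head_self_attention(vectors, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    dim = len(vectors[0])
    assert dim % num_heads == 0, "dimension must divide evenly across heads"
    head_dim = dim // num_heads
    combined = [[] for _ in vectors]
    for h in range(num_heads):
        start, end = h * head_dim, (h + 1) * head_dim
        head_inputs = [v[start:end] for v in vectors]
        head_outputs = self_attention(head_inputs)
        for row, out in zip(combined, head_outputs):
            row.extend(out)
    return combined

# Hypothetical 4-dimensional vectors for three tokens.
tokens = [
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, -0.5],
    [1.0, 1.0, -0.5, 0.5],
]
result = multi_head_self_attention(tokens, num_heads=2)
for row in result:
    print([round(x, 3) for x in row])
```

Because each head sees a different slice of the vector, each head computes its own attention weights, which is what lets different heads track different relationships.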

Stacking transformer layers

A transformer layer contains attention plus additional components, including feed-forward neural networks, normalization, and residual connections. A modern model stacks many such layers.

The first layers may learn relatively local patterns: word pieces, punctuation, short phrases, and syntax. Middle layers may represent more abstract relationships: entities, references, code structures, or factual associations. Later layers may become more task-oriented, shaping the information needed to predict the next token or follow an instruction.

A simplified stack looks like this:

Input tokens
  ↓
Embedding layer
  ↓
Transformer layer 1
  ↓
Transformer layer 2
  ↓
Transformer layer 3
  ↓
...
  ↓
Final prediction layer
  ↓
Next-token probabilities

Each layer refines the representation of every token. By the time the model predicts the next token, each token representation has been transformed by many rounds of attention and computation.
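The way a stack repeatedly refines token representations can be sketched with toy stand-ins. Everything here is an assumption for illustration: the attention step is replaced by a plain average over all tokens, the feed-forward step by a scale followed by ReLU, and normalization is left out entirely. Only the overall shape, two sub-steps per layer with a residual connection around each, reflects the description above.

```python
def mix_with_neighbors(vectors):
    """Stand-in for attention: every token receives the mean of all token vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / count for d in range(dim)]
    return [list(mean) for _ in vectors]

def feed_forward(vector):
    """Stand-in for the per-token feed-forward network: scale, then ReLU."""
    return [max(0.0, 0.5 * x) for x in vector]

def transformer_layer(vectors):
    # Residual connection around the attention-like step: input + update.
    mixed = mix_with_neighbors(vectors)
    vectors = [[x + m for x, m in zip(v, u)] for v, u in zip(vectors, mixed)]
    # Residual connection around the feed-forward step.
    return [[x + f for x, f in zip(v, feed_forward(v))] for v in vectors]

def run_stack(vectors, num_layers):
    """Apply the same kind of layer repeatedly, feeding each output to the next."""
    for _ in range(num_layers):
        vectors = transformer_layer(vectors)
    return vectors

# Hypothetical 2-dimensional embeddings for three tokens.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = run_stack(embeddings, num_layers=3)
for row in hidden:
    print([round(x, 3) for x in row])
```

The output keeps one vector per token with the same dimension as the input, so layers can be stacked to any depth. In a real model each layer has its own learned weights rather than sharing one fixed function.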

Positional encoding

Attention by itself has no built-in notion of token order. The sequences “dog bites man” and “man bites dog” contain exactly the same tokens, yet they mean different things. Without some representation of position, unmasked self-attention would see the same collection of tokens in both cases and could not tell them apart.

Transformers therefore add positional information to token embeddings. This can be done with fixed positional encodings, learned positional embeddings, rotary position embeddings, or other techniques.

The core idea is simple: each token representation includes information about where it appears in the sequence.

Token: The       Position: 1
Token: dog       Position: 2
Token: bites     Position: 3
Token: the       Position: 4
Token: man       Position: 5

Position matters for grammar, code, lists, reasoning steps, and conversation history. In an agent trace, for example, the order of observations and actions is essential.
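As one concrete instance of the fixed positional encodings mentioned above, here is a sketch of the sinusoidal scheme from the original transformer paper, added to hypothetical token embeddings. The embedding values and helper names are made up, and the sketch counts positions from 0, whereas the table above counts from 1.

```python
import math

def sinusoidal_position(position, dim):
    """Fixed sinusoidal encoding: even indices use sine, odd indices use cosine,
    with wavelengths that grow across the vector."""
    encoding = []
    for i in range(dim):
        pair_index = i // 2
        angle = position / (10000 ** (2 * pair_index / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add the encoding for each position to the embedding at that position."""
    dim = len(embeddings[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(embeddings)
    ]

# Hypothetical 4-dimensional embeddings for three words.
dog = [0.2, 0.1, 0.0, 0.3]
bites = [0.5, 0.4, 0.1, 0.0]
man = [0.3, 0.3, 0.2, 0.1]

dog_bites_man = add_positions([dog, bites, man])
man_bites_dog = add_positions([man, bites, dog])

# "dog" at position 0 and "dog" at position 2 now have different representations.
print([round(x, 3) for x in dog_bites_man[0]])
print([round(x, 3) for x in man_bites_dog[2]])
```

After the addition, the same word carries a different vector depending on where it appears, which gives attention the information it needs to distinguish the two sentences.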

Encoder-decoder versus decoder-only transformers

The original transformer architecture had an encoder and a decoder. The encoder read the input sequence and produced representations. The decoder generated an output sequence while attending to the encoded input and to previously generated output tokens. This design is natural for translation:

Input: French sentence
Encoder: represent the sentence
Decoder: generate English sentence

GPT-style models use a decoder-only transformer. They predict the next token based on previous tokens. This architecture is simple and powerful for open-ended text generation.

A decoder-only model is trained on sequences like:

Input so far: The capital of France is
Next token: Paris

During generation, it repeatedly predicts the next token, appends it to the context, and predicts again.

The → capital → of → France → is → Paris
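The predict, append, and repeat loop can be sketched with a toy stand-in for the model. The lookup table below is entirely hypothetical and replaces a trained network, and the loop always picks the most probable token (greedy decoding), which is only one of several decoding strategies. Real systems often sample from the probabilities instead.

```python
# Hypothetical next-token probabilities keyed by the full context so far.
TOY_PROBABILITIES = {
    ("The",): {"capital": 0.7, "city": 0.3},
    ("The", "capital"): {"of": 0.9, "is": 0.1},
    ("The", "capital", "of"): {"France": 0.6, "Spain": 0.4},
    ("The", "capital", "of", "France"): {"is": 0.95, "was": 0.05},
    ("The", "capital", "of", "France", "is"): {"Paris": 0.9, "Lyon": 0.1},
}

def next_token_probabilities(context):
    """Stand-in for a model: map the context to token probabilities.
    Returns an empty dict when the toy table has no entry."""
    return TOY_PROBABILITIES.get(tuple(context), {})

def generate(prompt, max_new_tokens):
    """Repeatedly predict the next token, append it, and predict again."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(context)
        if not probs:
            break  # the toy table has nothing further to say
        best = max(probs, key=probs.get)  # greedy: take the most probable token
        context.append(best)
    return context

print(" → ".join(generate(["The"], max_new_tokens=10)))
# prints: The → capital → of → France → is → Paris
```

Each new token becomes part of the context for the next prediction, which is why earlier choices shape everything generated afterward.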

This next-token prediction setup turns out to support many capabilities: answering questions, writing code, summarizing, translating, planning, and using tools when the surrounding system provides the right interface.

Why transformers scale so well

Transformers scale well for several reasons.

First, transformers parallelize well during training. A recurrent model must process tokens one at a time because each step depends on the previous hidden state. A transformer can compute attention for all positions of a training sequence at once, which maps efficiently onto GPUs and specialized accelerators.

Second, the architecture is flexible. The same basic design can learn language, code, images, audio representations, or multimodal combinations when adapted appropriately.

Third, performance has improved predictably as models have grown larger and been trained on more data, within the limits of data quality, compute, and architecture. This made transformers attractive for industrial-scale training.

Finally, transformers are general sequence learners. They do not need a separate hand-built parser for every task. They learn patterns from data and can be adapted through prompting, fine-tuning, tool use, and alignment.

Practical takeaway

Attention is the mechanism that lets a model decide what context matters. Transformers organize attention into scalable layers. GPT-style models use decoder-only transformers to predict the next token, but the representations learned during that process support much richer behavior than autocomplete alone.

For agentic AI developers, this matters because every agent depends on the model’s ability to interpret context: instructions, tool results, memory, documents, code, and prior actions. The transformer is the engine that makes that contextual reasoning possible.

Learning objectives