Pretraining and Scaling Laws

Pretraining is the first major stage in building a large language model. During pretraining, the model learns to predict the next token across a massive corpus of text, code, and other data.

The training objective is simple to state:

Given the previous tokens, predict the next token.

For example:

Input: The capital of France is
Target: Paris

The model is not directly taught grammar rules, facts, programming concepts, or reasoning strategies one by one. Instead, it adjusts billions or trillions of internal parameters so that its predictions become better across enormous numbers of examples.
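The objective above can be sketched in a few lines. This is a minimal illustration, not how production training code is written: whitespace splitting stands in for a real subword tokenizer, and the two probability tables are hypothetical model outputs invented for the example. The function names are likewise illustrative.

```python
import math

def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss for one prediction: negative log-probability given to the true next token."""
    return -math.log(predicted_probs[target])

# Whitespace splitting is an assumption; real models use subword tokenizers.
tokens = "The capital of France is Paris".split()
pairs = next_token_pairs(tokens)
context, target = pairs[-1]
print(context, "->", target)

# Hypothetical predictions for the final context, before and after training.
untrained = {"Paris": 0.01, "London": 0.01, "the": 0.98}
trained = {"Paris": 0.90, "London": 0.05, "the": 0.05}
print(cross_entropy(untrained, target))  # about 4.61 (high loss)
print(cross_entropy(trained, target))    # about 0.11 (low loss)
```

Training lowers this loss averaged over every position in the corpus, which is why a single sentence yields several training examples rather than one.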

As models become larger and are trained on more data, they often develop capabilities that are not obvious from the basic next-token objective. These are sometimes called emergent capabilities. Examples include translation, summarization, code generation, few-shot learning, and multi-step reasoning.

Scaling laws study how model performance changes as we increase model size, dataset size, and compute. The Chinchilla scaling laws showed that many earlier large models were undertrained: they had more parameters than their training data could support at the compute budget used. In plain terms, compute-optimal training scales model size and training data together, in roughly equal proportion.

What training data looks like

Pretraining data is usually a mixture of many sources. Common categories include:

Public web pages
Common Crawl-style web snapshots
Books and long-form documents
Wikipedia and reference material
Scientific and technical papers
Public code repositories
Question-answer data
Forums and discussion sites
Documentation and manuals
Multilingual text

The exact mixture matters enormously. A model trained heavily on code will usually be better at programming tasks. A model trained on high-quality technical writing may explain concepts better. A model trained on noisy, duplicated, or low-quality data may learn undesirable patterns.
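One way to picture a data mixture is as a set of sampling weights over sources. The sketch below draws training documents according to such weights. The source names and proportions are hypothetical and chosen only for illustration; real mixtures are tuned empirically.

```python
import random
from collections import Counter

# Hypothetical mixture weights, invented for this example.
MIXTURE = {"web": 0.60, "code": 0.20, "books": 0.10, "papers": 0.10}

def sample_sources(mixture, num_samples, seed=0):
    """Pick the source of each training document in proportion to its weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)

counts = Counter(sample_sources(MIXTURE, 10_000))
for name in MIXTURE:
    print(f"{name}: {counts[name] / 10_000:.2%}")
```

Raising the weight on "code" in a setup like this means the model sees proportionally more code during training, which is the mechanism behind the programming example above.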

Data quantity matters, but quality matters just as much. Ten billion tokens of clean, diverse, well-filtered text may be more valuable than a much larger pile of spam, boilerplate, duplicate pages, broken markup, and low-effort content.

Pretraining teams typically perform data filtering, deduplication, language identification, quality scoring, toxicity filtering, and mixture balancing. These steps are not cosmetic. They shape the model’s capabilities, biases, and failure modes.
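Two of these steps, deduplication and quality filtering, can be sketched as follows. This is a deliberately crude illustration under stated assumptions: it removes only exact duplicates after normalization, and the quality heuristic (minimum length, limit on one repeated word) and its thresholds are invented for the example. Real pipelines typically add near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def looks_low_quality(text, min_words=5, max_repeat_ratio=0.5):
    """Crude heuristic (assumed thresholds): too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(word) for word in set(words))
    return most_common / len(words) > max_repeat_ratio

def clean_corpus(documents):
    """Drop exact duplicates (after normalization) and documents failing the quality check."""
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or looks_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Scaling laws relate loss to model size, data, and compute.",
    "scaling laws relate loss to   model size, data, and compute.",  # duplicate
    "buy buy buy buy buy buy now",                                   # spam-like
    "Click here",                                                    # too short
]
cleaned = clean_corpus(docs)
print(cleaned)  # only the first document survives
```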

Next-token prediction is deceptively powerful

At first, next-token prediction can sound like autocomplete. But predicting the next token well across the internet, books, code, mathematics, and dialogue requires learning many hidden structures.

To predict code, the model must learn syntax, naming conventions, library usage, and common algorithms. To predict a legal explanation, it must learn legal vocabulary and argument structure. To predict a math solution, it must learn patterns of symbolic manipulation and explanation.

The model does not learn these in the same way a human does, but the training pressure encourages internal representations that support broad linguistic and conceptual competence.

Emergent capabilities

An emergent capability is a behavior that appears or improves sharply when a model reaches sufficient scale, data, or training quality. The term does not mean magic. It means that performance may be poor below a certain threshold and noticeably better above it.

Examples include:

Few-shot learning: The model can infer a task from a few examples in the prompt.
Instruction following: The model can follow natural language task descriptions.
Chain-of-thought-style reasoning: The model can produce intermediate reasoning patterns that improve performance on some complex tasks.
Code synthesis: The model can generate useful programs from descriptions.
Tool-use readiness: The model can produce structured arguments for external tools.

For agentic AI, few-shot learning and tool-use readiness are especially important. They allow developers to define behavior through prompts, schemas, examples, and orchestration rather than training a new model for every workflow.
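A rough sketch of both ideas follows: a few-shot prompt is assembled from examples, and a structured tool call is checked before it is executed. Everything named here is hypothetical, including the get_weather tool, its city argument, and the hard-coded string standing in for a model's output; no real model or tool API is called.

```python
import json

def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt string."""
    lines = [instruction, ""]
    for source, label in examples:
        lines += [f"Input: {source}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

def parse_tool_call(raw, required_args):
    """Parse a JSON tool call and check each required argument has the expected type."""
    call = json.loads(raw)
    arguments = call.get("arguments", {})
    for name, expected_type in required_args.items():
        if not isinstance(arguments.get(name), expected_type):
            raise ValueError(f"missing or invalid argument: {name}")
    return call

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great battery life.", "positive"), ("Broke after a week.", "negative")],
    "I would buy it again.",
)
print(prompt)

# Hard-coded stand-in for what a model might emit for a hypothetical weather tool.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = parse_tool_call(model_output, {"city": str})
print(call["name"], call["arguments"])
```

Validating arguments before execution is a design choice worth keeping even in small systems, since a model can produce well-formed JSON that still omits a required field.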

Chinchilla scaling laws in plain terms

The Chinchilla result is often summarized as: many large models were too large for the amount of data they saw.

Imagine you have a fixed training budget. You could spend it on a very large model that sees a smaller amount of data, or a somewhat smaller model that sees much more data. Chinchilla-style scaling laws suggest that, for a fixed budget, the best performance often comes from growing parameters and training tokens together rather than maximizing parameter count.

A simplified intuition:

Bad balance:
Huge model + too little data → undertrained model

Better balance:
Appropriately sized model + much more data → better use of compute

This insight influenced later model training strategies. It also reminded the field that parameter count alone is not the whole story. A smaller model trained on better data for longer can outperform a larger model trained inefficiently.
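The trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below rests on two approximations that are not stated in this section and should be read as assumptions: training compute is estimated as C ≈ 6 × N × D (N parameters, D training tokens), and the compute-optimal ratio is taken to be roughly 20 tokens per parameter, a rule of thumb commonly drawn from the Chinchilla results. The Gopher and Chinchilla sizes are the approximate published figures.

```python
def training_flops(params, tokens):
    """Estimate training compute with the common approximation C = 6 * N * D."""
    return 6 * params * tokens

def compute_optimal_split(flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    params = (flops / (6 * tokens_per_param)) ** 0.5
    return params, tokens_per_param * params

# Approximate published sizes: Gopher (280B params, 300B tokens),
# Chinchilla (70B params, 1.4T tokens).
gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)
print(f"Gopher:     {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.1f} tokens per parameter")

# Given Chinchilla's budget, the rule of thumb recovers its actual shape.
params, tokens = compute_optimal_split(chinchilla)
print(f"Suggested: {params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Under these assumptions the two models used a similar amount of compute, but the smaller one saw about twenty tokens per parameter instead of about one, which is the "better balance" described above.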

Parameters, weights, and model size

A parameter is a learned numerical value inside the model. A weight is the most common type of parameter: an entry in one of the matrices used in neural network computations. In casual discussion, people often use “parameters” and “weights” almost interchangeably, though neural networks also include other learned values, such as biases and normalization scales.

When someone says a model has 70 billion parameters, they mean it contains roughly 70 billion learned numbers. These numbers encode the model’s learned behavior.

Physically, parameters must be stored in memory. If each parameter is stored as a 16-bit floating point number, it uses 2 bytes. A 70-billion-parameter model would require roughly:

70,000,000,000 parameters × 2 bytes ≈ 140 GB

That figure covers only the raw weights. Inference adds memory for activations, batching, and the attention cache; training adds gradients and optimizer states; and every deployment carries some system overhead.
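The same arithmetic generalizes to other numeric formats. The following sketch computes raw weight storage only, using decimal gigabytes (1 GB = 10^9 bytes). The format names and the function are illustrative, and the 4-bit entry assumes two parameters packed per byte with no extra metadata.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype="fp16"):
    """Raw weight storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
# fp32: 280 GB
# fp16: 140 GB
# int8: 70 GB
# int4: 35 GB
```

This is why lower-precision formats matter in practice: halving the bytes per parameter halves the memory needed just to hold the model.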

A trillion-parameter model is not a trillion explicit facts. It is a vast mathematical function with a trillion learned values. Those values are distributed across layers and matrices. Knowledge is not stored like rows in a database. It is encoded statistically across the network.

Why training from scratch is rarely practical

Training a frontier-scale language model from scratch requires enormous resources: data pipelines, distributed training infrastructure, specialized hardware, model architecture expertise, safety processes, evaluation systems, and ongoing maintenance.

Even training a modest model can be expensive and technically demanding. For most developers and organizations, it is far more practical to use a pretrained foundation model and adapt it through:

Prompt engineering
Retrieval-augmented generation
Tool use
Fine-tuning
Distillation
System-level constraints
Evaluation and monitoring

Foundation models are called “foundation” models because they provide a general base of language, code, reasoning patterns, and world knowledge. Developers build applications on top of them.

Practical implications for developers

You usually do not need to train a language model from scratch to build a useful AI product. Instead, your work is to connect a strong pretrained model to the right context, tools, memory, and user experience.

The key developer questions are:

Which foundation model is good enough for this task?
Does the task require fresh or private data?
Should I use retrieval instead of fine-tuning?
How much context does the model need?
What should be evaluated before deployment?
What cost and latency constraints matter?

Pretraining creates the general capability. Application design turns that capability into a useful system.
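As a small illustration of supplying private data through retrieval rather than fine-tuning, the sketch below selects the most relevant document and places it in the prompt. It is a toy under stated assumptions: word overlap stands in for the embedding search a real system would more likely use, and the documents, prompt wording, and function names are all hypothetical.

```python
import re

def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by shared words with the query (a stand-in for embedding search)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, documents, top_k=1):
    """Place the retrieved documents ahead of the question as context for the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical private documents the pretrained model has never seen.
docs = [
    "Refunds are processed within 5 business days.",
    "Support is available on weekdays from 9 to 17.",
    "Passwords must be rotated every 90 days.",
]
query = "How long do refunds take?"
print(build_prompt(query, docs))
```

The model's weights never change in this setup; only the context does, which is what makes retrieval attractive when data is fresh, private, or frequently updated.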

Summary

Pretraining teaches language models by exposing them to massive corpora and optimizing next-token prediction. Scaling laws help researchers understand how performance changes with model size, data, and compute. Larger models can be more capable, but parameter count alone does not determine quality.

For practitioners, the lesson is clear: build on foundation models. Training from scratch is rarely the right starting point. The real leverage comes from selecting the right model, giving it the right context, connecting it to reliable tools, and evaluating the resulting system carefully.

Learning objectives