The Context Window Problem

Every language model operates within a context window: the maximum amount of information, measured in tokens, that the model can process in a single request. The context window includes system instructions, user messages, conversation history, retrieved documents, tool results, examples, memory snippets, and the model’s own generated output.

A common beginner mistake is to treat the context window as memory. It is not. The context window is temporary working space. Once information is outside the active context, the model cannot directly use it unless your application retrieves, summarizes, or re-inserts it.

A useful distinction is:

Context window = what the model can see right now
Memory = what the system stores and may retrieve later

This distinction matters because agents often work across multiple steps. A coding agent may inspect files, run tests, and revise code. A customer support agent may need previous tickets. A research agent may build a knowledge base over time. If the system does not manage context, important information may disappear, irrelevant information may crowd out useful details, and the agent may repeat mistakes.
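The difference between memory and context can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions. The `MemoryStore` class and the `build_request` function are hypothetical names. The store is an in-memory dict standing in for a database. Messages use a generic role/content dict shape. The point is that memory persists between calls, while the context is rebuilt for every call from only what the application chooses to recall.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical long-term store; a dict stands in for a real database."""
    facts: dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        # Memory: persists across requests until the application deletes it.
        self.facts[key] = value

    def recall(self, keys: list[str]) -> dict[str, str]:
        # Only requested keys that exist are returned; everything else stays out of context.
        return {k: self.facts[k] for k in keys if k in self.facts}


def build_request(user_message: str, memory: MemoryStore, needed_keys: list[str]) -> list[dict]:
    # Context: rebuilt for every call; the model sees only what is placed here.
    recalled = memory.recall(needed_keys)
    context = []
    if recalled:
        notes = "; ".join(f"{k}={v}" for k, v in recalled.items())
        context.append({"role": "system", "content": f"Recalled memory: {notes}"})
    context.append({"role": "user", "content": user_message})
    return context
```

If the store holds both an order ID and an unrelated preference, a call that recalls only `["order_id"]` puts the order ID into context and leaves the preference in storage, where the model cannot see it.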

What goes into context?

In an agentic system, context usually contains several layers:

System prompt
Developer instructions
User request
Conversation history
Current task state
Tool definitions
Tool results
Retrieved memory
Retrieved documents
Output format instructions

All of these compete for space. Even with a large context window, you cannot assume that more context is always better. Long prompts cost more, increase latency, and can distract the model from the most relevant information.

For example, if a user asks a question about one paragraph in a 300-page manual, inserting the entire manual may be worse than retrieving the five most relevant sections. The model may technically have access to everything, but the important evidence is buried.
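Here is a minimal sketch of selecting the most relevant sections instead of inserting the whole manual. It assumes the manual is already split into sections and scores each one by plain word overlap with the question. `tokenize` and `top_k_sections` are hypothetical helpers. Production systems more often use BM25 or embedding similarity for this step.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercased set of words; a crude stand-in for a real retriever.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_sections(question: str, sections: list[str], k: int = 5) -> list[str]:
    q = tokenize(question)
    # Score each section by how many distinct question words it shares.
    scored = [(len(q & tokenize(s)), i, s) for i, s in enumerate(sections)]
    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    # Keep at most k sections, and drop any that share no words at all.
    return [s for score, _, s in scored[:k] if score > 0]
```

Suppose the sections include `"Warranty covers parts for two years."` and `"To reset the pump, hold the power button for ten seconds."`. The question "How do I reset the pump?" then ranks the reset section first because it shares "reset", "the", and "pump". Common words like "the" still count here, which is one reason real retrievers weight terms.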

Why large context is not enough

Modern models can support much larger context windows than early transformer models, but large context does not eliminate the memory problem.

Large context has tradeoffs:

Cost: more input tokens usually cost more.
Latency: more tokens take longer to process.
Attention dilution: relevant details may be surrounded by noise.
Staleness: old conversation history may no longer matter.
Conflicts: old instructions or facts may conflict with newer ones.
Privacy: unnecessary personal or sensitive data may be included.
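The cost tradeoff is easy to quantify with back-of-the-envelope arithmetic. The sketch below makes two assumptions. First, a token is roughly four characters, a common rule of thumb for English text rather than any tokenizer's exact behavior. Second, input costs a hypothetical $3.00 per million tokens. Real prices and tokenizers vary by provider and model.

```python
# Hypothetical price, for illustration only; real prices vary by provider and model.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(text) // 4)


def estimate_input_cost(prompt: str, calls: int = 1) -> float:
    # Assumes input cost scales linearly with input tokens and number of calls.
    return estimate_tokens(prompt) * calls * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000
```

Under these assumptions, a 300-page manual at about 2,000 characters per page is roughly 600,000 characters, or about 150,000 tokens, so answering 100 questions costs about $45.00 in input tokens. Five retrieved sections of about 4,000 characters each come to roughly 5,000 tokens, or about $1.50 for the same 100 questions.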

A strong agent does not maximize context. It selects context.

Concrete failure example

Imagine a customer support agent helping with an order issue. Early in the conversation, the user says:

My order ID is ORD-7711, and I already contacted support yesterday. They said the package was marked as delayed.

Later, after several tool calls and long policy text, the user asks:

So what should I do next?

If the order ID and the earlier support contact have fallen out of the context window, the agent may ask for the order ID again or give a generic answer. That feels unintelligent, but the underlying problem is architectural: the system failed to preserve task-relevant memory.

A better system maintains a compact task state:

{
  "customer_goal": "Resolve delayed order",
  "order_id": "ORD-7711",
  "known_status": "delayed",
  "prior_support_contact": true,
  "next_needed_step": "Check refund or replacement eligibility"
}

This state can be kept in context even when verbose messages and tool results are summarized or removed.
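The failure and the fix can be reproduced with a short sketch. It compares two policies. The first is a simple sliding window that keeps only the last few messages, a common but naive trimming policy. The second always pins the task state in front of that window. `naive_window` and `pinned_state_window` are hypothetical names, and messages use a generic role/content dict shape.

```python
import json


def naive_window(messages: list[dict], max_messages: int) -> list[dict]:
    # Keeps only the most recent turns; early facts silently fall out.
    if max_messages <= 0:
        return []  # guard: messages[-0:] would return the whole list
    return messages[-max_messages:]


def pinned_state_window(messages: list[dict], task_state: dict, max_messages: int) -> list[dict]:
    # The compact task state is always included, however long the history grows.
    state_msg = {"role": "system", "content": "Task state: " + json.dumps(task_state)}
    return [state_msg] + naive_window(messages, max_messages)
```

In this test, the order ID appears only in the first message, followed by ten policy chunks. `naive_window(history, 4)` no longer contains "ORD-7711". `pinned_state_window(history, state, 4)` still does, at the cost of one short extra message.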

Context packing

Context packing is the process of deciding what to include in the prompt for a specific model call. It is one of the most important engineering tasks in agent design.

A typical context-packing strategy might prioritize:

System and safety instructions
Current user request
Current task state
Most recent relevant conversation turns
High-confidence retrieved memories
Relevant documents or tool outputs
Output format instructions

Less important content can be summarized or omitted.
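One way to apply this prioritization is greedy packing under a token budget. In this sketch, each candidate block carries a priority number matching its position in the list above: 1 for system and safety instructions, 7 for output format instructions. The `Block` type, the `pack` function, and the four-characters-per-token estimate are assumptions for illustration. Blocks that do not fit are simply dropped. A fuller version might summarize them instead.

```python
from dataclasses import dataclass


@dataclass
class Block:
    priority: int  # lower number = more important (1 = system and safety instructions)
    name: str
    text: str


def estimate_tokens(text: str) -> int:
    # Rough ~4 characters per token heuristic, not a real tokenizer.
    return max(1, len(text) // 4)


def pack(blocks: list[Block], budget: int) -> list[Block]:
    packed, used = [], 0
    for block in sorted(blocks, key=lambda b: b.priority):
        cost = estimate_tokens(block.text)
        if used + cost <= budget:
            packed.append(block)
            used += cost
        # A block that does not fit is skipped; smaller, lower-priority blocks may still fit.
    return packed
```

The result comes back in priority order, which is not necessarily prompt order. A real system would usually reassemble the kept blocks in a fixed layout, for example with the user request near the end. Because packing is greedy, a large document block can be skipped while a short output-format block after it still fits.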

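Example pseudocode (select_recent_relevant_messages and format_retrieved_docs are placeholder helpers, not library functions):

import json

def build_context(system_prompt, user_message, task_state, recent_messages, retrieved_docs):
    # Highest-priority material first: instructions and the compact task state.
    context = [{"role": "system", "content": system_prompt}]
    # json.dumps gives the model the same JSON shape used for task state elsewhere.
    context.append({"role": "system", "content": "Task state: " + json.dumps(task_state)})
    # Placeholder helper: keep at most six recent, relevant turns.
    context.extend(select_recent_relevant_messages(recent_messages, max_messages=6))
    # Placeholder helper: retrieved documents capped at roughly 3,000 tokens.
    context.extend(format_retrieved_docs(retrieved_docs, max_tokens=3000))
    # The current user request goes last.
    context.append({"role": "user", "content": user_message})
    return context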

This is better than blindly appending every message forever.
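The `build_context` example above calls two helpers it never defines. Below is one possible, deliberately simple implementation of each, resting on three assumptions. First, "relevant" is approximated as non-empty user or assistant turns, so tool chatter is skipped. Second, retrieved documents are dicts with `id` and `text` keys, already ranked best-first. Third, a token is about four characters. These are hypothetical helpers, not part of any library.

```python
def estimate_tokens(text: str) -> int:
    # Rough ~4 characters per token heuristic, not a real tokenizer.
    return max(1, len(text) // 4)


def select_recent_relevant_messages(messages: list[dict], max_messages: int = 6) -> list[dict]:
    # "Relevant" is approximated as user/assistant turns with non-empty content.
    kept = [
        m for m in messages
        if m.get("role") in ("user", "assistant") and m.get("content", "").strip()
    ]
    # Keep only the most recent ones; guard against messages[-0:] returning everything.
    return kept[-max_messages:] if max_messages > 0 else []


def format_retrieved_docs(docs: list[dict], max_tokens: int = 3000) -> list[dict]:
    # Each doc is assumed to look like {"id": ..., "text": ...}, ranked best-first.
    formatted, used = [], 0
    for doc in docs:
        content = f"[{doc['id']}] {doc['text']}"
        cost = estimate_tokens(content)
        if used + cost > max_tokens:
            break  # stop at the first doc over budget, so ranking order is preserved
        formatted.append({"role": "system", "content": content})
        used += cost
    return formatted
```

With these definitions in scope, `build_context` runs as written. A real system would likely replace the recency filter with relevance scoring and budget with the model's own tokenizer.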

Context window versus agent state

Agent state is the structured representation of what the system knows about the current task. Context is what the model sees at a given moment.

State might live in your application database or workflow engine. A LangGraph application, for example, often defines state explicitly and passes it between nodes. Each node can then pass the model only the parts of state it needs.

Example state:

{
  "goal": "Write a report on vector databases",
  "completed_steps": ["searched candidates", "fetched pricing pages"],
  "open_questions": ["verify enterprise security features"],
  "sources": ["source_1", "source_2", "source_3"]
}

The full state does not always need to be inserted into every prompt. The system can include a summary or selected fields.
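Selecting fields can be as simple as rendering only the keys a given step needs. The `render_state` function below is a hypothetical helper. It assumes state is a flat dict whose values are strings or lists, as in the example state above.

```python
def render_state(state: dict, fields: list[str]) -> str:
    # Include only the requested fields, in the order given; missing fields are skipped.
    lines = []
    for name in fields:
        if name not in state:
            continue
        value = state[name]
        if isinstance(value, list):
            # Flatten lists into one readable line; mark empty lists explicitly.
            value = ", ".join(map(str, value)) if value else "(none)"
        lines.append(f"{name}: {value}")
    return "\n".join(lines)
```

For a step that verifies open questions, `render_state(state, ["goal", "open_questions"])` produces two short lines and leaves out the completed steps and the source list.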

Practical takeaway

The context window is scarce working memory. A strong agent architecture controls what enters that working memory. It preserves task state, retrieves relevant long-term memory, compresses old information, and avoids flooding the model with irrelevant text.

The goal is not to remember everything. The goal is to provide the model with the right information at the moment it needs to decide or respond.

Key terms

Learning objectives