Evaluating Retrieval Quality

AGAI 203 · Memory Management and Evaluation

Learn how to measure whether memory retrieval is returning the right information, and how to improve retrieval with metadata, reranking, hybrid search, and test sets.

Key terms

retrieval quality → answer quality
precision = relevant retrieved / retrieved
recall = relevant retrieved / relevant
reranking improves top-k quality

Learning objectives

Identify common retrieval failure modes.
Build a retrieval evaluation set with known relevant chunks.
Explain precision, recall, recall@k, and ranking quality.
Improve retrieval using metadata filters, reranking, and hybrid search.

A memory system is only useful if it retrieves the right information at the right time. Retrieval quality is one of the most important factors in RAG and long-term agent memory.

When an agent gives a bad answer, the problem may not be the language model. It may be retrieval. The correct document may not have been retrieved, the chunk may have been too small, the result may have been stale, or the model may have received too many irrelevant chunks.

Retrieval evaluation asks:

Did the memory system retrieve the information needed to answer the question?

Retrieval failure modes

Common retrieval failures include:

Missed retrieval: the relevant chunk was not returned.
Low ranking: the relevant chunk was returned too far down the list.
Noisy retrieval: irrelevant chunks crowded the context.
Stale retrieval: outdated information was returned.
Over-broad retrieval: chunks were too large and contained mixed topics.
Over-narrow retrieval: chunks were too small and lost context.
Permission failure: unauthorized content was retrieved.

Each failure points to a different fix.
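
Some of these failure modes can be flagged automatically once you know which chunks should have been returned. The sketch below is one possible way to label missed, low-ranked, and noisy retrievals. The function name `classify_retrieval`, the cutoff `k`, and the `max_noise` threshold are all illustrative assumptions, not part of any library.

```python
def classify_retrieval(retrieved: list[str], relevant: set[str],
                       k: int = 5, max_noise: float = 0.5) -> list[str]:
    """Label a single retrieval result with the failure modes it shows.

    retrieved: chunk IDs in ranked order (best first).
    relevant:  chunk IDs known to answer the question.
    k, max_noise: hypothetical thresholds; tune them for your system.
    """
    labels = []
    top_k = retrieved[:k]

    if not relevant & set(retrieved):
        # No relevant chunk anywhere in the results.
        labels.append("missed retrieval")
    elif not relevant & set(top_k):
        # A relevant chunk exists, but only below the top-k cutoff.
        labels.append("low ranking")

    if top_k:
        # Share of the top-k results that are not relevant.
        noise = sum(1 for chunk_id in top_k if chunk_id not in relevant) / len(top_k)
        if noise > max_noise:
            labels.append("noisy retrieval")

    return labels


# Made-up result: the relevant chunk is ranked seventh, behind six irrelevant ones.
labels = classify_retrieval(
    ["n1", "n2", "n3", "n4", "n5", "n6", "refund_policy_004"],
    {"refund_policy_004"},
)
print(labels)
```

Stale, over-broad, over-narrow, and permission failures usually need metadata or manual inspection, so this sketch does not try to detect them.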

Building a retrieval evaluation set

Start with a set of questions and known relevant documents or chunks.

Example:

[
  {
    "question": "When can a customer request a refund for late delivery?",
    "relevant_chunk_ids": ["refund_policy_004", "shipping_policy_009"]
  },
  {
    "question": "How do users rotate API keys?",
    "relevant_chunk_ids": ["auth_docs_012"]
  }
]

Run your retriever and check whether it returns the expected chunks.
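
A minimal sketch of that check, assuming your retriever can be wrapped as a function that takes a question and returns a ranked list of chunk IDs. The names `run_eval` and `fake_retrieve` are hypothetical, and the canned results exist only to make the example runnable.

```python
from typing import Callable

# Evaluation set in the same shape as the example above.
EVAL_SET = [
    {
        "question": "When can a customer request a refund for late delivery?",
        "relevant_chunk_ids": ["refund_policy_004", "shipping_policy_009"],
    },
    {
        "question": "How do users rotate API keys?",
        "relevant_chunk_ids": ["auth_docs_012"],
    },
]


def run_eval(retrieve: Callable[[str], list[str]], eval_set: list[dict]) -> list[dict]:
    """Run the retriever on every question and record found and missing chunks."""
    report = []
    for case in eval_set:
        retrieved = retrieve(case["question"])
        expected = set(case["relevant_chunk_ids"])
        found = expected & set(retrieved)
        report.append({
            "question": case["question"],
            "found": sorted(found),
            "missing": sorted(expected - found),
        })
    return report


def fake_retrieve(question: str) -> list[str]:
    """Stand-in retriever with canned results; replace with your real retriever."""
    canned = {
        "When can a customer request a refund for late delivery?": ["refund_policy_004", "faq_001"],
        "How do users rotate API keys?": ["auth_docs_012"],
    }
    return canned.get(question, [])


report = run_eval(fake_retrieve, EVAL_SET)
for row in report:
    print(row["question"], "| missing:", row["missing"])
```

Recording the missing chunk IDs, rather than only a pass or fail flag, makes it easier to inspect why a particular chunk was not returned.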

Precision and recall

Two basic retrieval metrics are precision and recall.

Precision: Of the retrieved chunks, how many were relevant?
Recall: Of the relevant chunks, how many were retrieved?

If the system retrieves five chunks and only one is relevant, precision is low. If there are three relevant chunks and the system retrieves only one, recall is low.

For RAG, both matter. Low precision wastes context and distracts the model. Low recall means the model lacks needed evidence.
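
The two definitions translate directly into code. The sketch below combines both scenarios from the text: five chunks retrieved with one relevant, and three relevant chunks in total. The function name and chunk IDs are illustrative.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one question.

    precision = relevant retrieved / retrieved
    recall    = relevant retrieved / relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Five chunks retrieved, only c1 is relevant; c8 and c9 were never returned.
p, r = precision_recall(["c1", "c2", "c3", "c4", "c5"], {"c1", "c8", "c9"})
print(f"precision={p:.2f} recall={r:.2f}")
```

Here precision is 1/5 and recall is 1/3, so this retrieval is both noisy and incomplete.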

Top-k and ranking

Most vector searches return the top k results. Choosing k is a tradeoff.

Small k:

Pros: less noise, lower token use
Cons: may miss needed context

Large k:

Pros: higher chance of including relevant evidence
Cons: more noise, higher token cost

Many systems start with k between 3 and 10, then tune it based on evaluation results.

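Ranking also matters. A relevant chunk at rank 1 is more useful than the same chunk at rank 12, because lower-ranked chunks may be cut off by k or buried under noise. Recall@k measures the share of relevant chunks that appear in the top k results, and mean reciprocal rank (MRR) averages 1/rank of the first relevant chunk across questions. Both are useful for retrieval testing.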
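
A short sketch of both ranking metrics, assuming each retrieval result is a ranked list of chunk IDs with the best match first. The function names and sample data are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Two made-up questions: the first hit is at rank 1 and rank 3 respectively.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z"}),
]
mrr = mean_reciprocal_rank(runs)
print(f"MRR={mrr:.3f}")
print("recall@2:", recall_at_k(["x", "y", "z"], {"z"}, 2))
print("recall@3:", recall_at_k(["x", "y", "z"], {"z"}, 3))
```

Computing recall@k for several values of k on the same evaluation set is one way to see how much extra evidence a larger k actually buys.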

Reranking

A reranker takes initial search results and reorders them using a more precise model. The first-stage retriever is fast and broad. The reranker is slower but more accurate.

Pipeline:

Query
→ retrieve top 30 chunks with vector search
→ rerank top 30 using cross-encoder or LLM-based scoring
→ pass top 5 to generator

Reranking can significantly improve retrieval quality, especially when many chunks are semantically similar.
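
The pipeline above can be sketched as a two-stage function. This is a minimal illustration, not a production design: `first_stage` and `score` are hypothetical callables you would back with your vector search and your cross-encoder or LLM scorer. The toy versions below use a tiny in-memory corpus and simple word overlap purely so the example runs.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    first_stage: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    candidates_k: int = 30,
    final_k: int = 5,
) -> list[str]:
    """Fetch a broad candidate set, rescore each candidate, keep the best few."""
    candidates = first_stage(query, candidates_k)
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:final_k]


# Toy corpus with made-up text, standing in for a vector store.
CORPUS = [
    "shipping times vary by region",
    "api keys are rotated in account settings",
    "late delivery refunds are handled by support",
]


def toy_first_stage(query: str, k: int) -> list[str]:
    """Stand-in for vector search: returns the first k chunks unranked."""
    return CORPUS[:k]


def toy_score(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


top = retrieve_and_rerank(
    "refund for late delivery", toy_first_stage, toy_score,
    candidates_k=3, final_k=1,
)
print(top)
```

Keeping the scorer as a parameter makes it straightforward to evaluate the same pipeline with and without reranking.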

Metadata filtering

Metadata filters improve retrieval by narrowing the search space.

Example:

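def retrieve_policy(query, department, access_level):
    """Search approved policy chunks that the caller is allowed to see."""
    # vector_db stands for your vector store client; filter syntax varies by store.
    return vector_db.search(
        query=query,
        top_k=5,
        filters={
            "department": department,                # only this department's documents
            "access_level": {"$lte": access_level},  # at or below the caller's access level
            "status": "approved",                    # only approved documents
        },
    )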

Filters can constrain results by:

Access control
Document type
Product area
Language
Date range
Version
Source quality

Do not rely only on semantic similarity when metadata can reduce ambiguity.

Hybrid search evaluation

Hybrid search combines semantic and keyword search. It is especially useful for:

Error codes
Product names
API endpoints
Legal phrases
IDs
Acronyms
Rare technical terms

Evaluate hybrid search against vector-only search. Some systems improve dramatically when exact keyword matching is added.
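
One common way to combine the two result lists is reciprocal rank fusion (RRF), which the text above does not prescribe. It is used here only as an example of a fusion step. The sketch assumes each search returns a ranked list of document IDs. The constant `k=60` is a conventional default, and the sample lists are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # made-up semantic search results
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # made-up keyword search results

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

To compare hybrid against vector-only search, score `fused` and `vector_hits` with the same metrics on the same evaluation set.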

Answer-level versus retrieval-level evaluation

Retrieval quality and answer quality are related but distinct.

A retriever can return perfect evidence, and the model can still answer poorly. A model can sometimes answer correctly even with weak retrieval, especially if it already knows the topic.

Evaluate both:

Retrieval-level: Did we retrieve the right chunks?
Answer-level: Did the final answer use them correctly?

This separation helps debugging.
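
A small triage function can make the separation concrete. The sketch below assumes you already have a correctness judgment for the final answer, for example from a human reviewer or a grading script. The function name and labels are illustrative.

```python
def diagnose(retrieved: list[str], relevant: set[str], answer_correct: bool) -> str:
    """Attribute an outcome to retrieval or generation.

    answer_correct is assumed to come from a separate answer-level check.
    """
    retrieval_ok = relevant <= set(retrieved)  # every relevant chunk was returned
    if retrieval_ok and answer_correct:
        return "pass"
    if retrieval_ok:
        return "generation failure"
    if answer_correct:
        return "correct despite weak retrieval"
    return "retrieval failure"


print(diagnose(["a", "b"], {"a"}, answer_correct=False))
print(diagnose(["b"], {"a"}, answer_correct=True))
```

The "correct despite weak retrieval" case is worth tracking separately, because the answer may stop being correct for questions the model does not already know.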

Practical improvement loop

A practical retrieval improvement loop:

1. Collect failed questions.
2. Identify whether retrieval or generation failed.
3. Inspect retrieved chunks.
4. Adjust chunking, metadata, filters, query rewriting, or reranking.
5. Re-run the evaluation set.
6. Add the failure as a regression test.

Memory systems improve through measurement, not guesswork.
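
Steps 5 and 6 of the loop can be sketched as two helpers: one adds a failed question to the evaluation set, and one re-runs the set and lists questions whose relevant chunks are still missing from the top k. All names here are hypothetical, and the stand-in retrievers return canned results.

```python
from typing import Callable


def add_regression_case(eval_set: list[dict], question: str,
                        relevant_chunk_ids: list[str]) -> None:
    """Add a failed question to the evaluation set, skipping duplicates."""
    if any(case["question"] == question for case in eval_set):
        return
    eval_set.append({"question": question, "relevant_chunk_ids": relevant_chunk_ids})


def failing_cases(retrieve: Callable[[str], list[str]],
                  eval_set: list[dict], k: int = 5) -> list[str]:
    """Return questions whose relevant chunks are not all in the top k results."""
    failures = []
    for case in eval_set:
        top_k = set(retrieve(case["question"])[:k])
        if not set(case["relevant_chunk_ids"]) <= top_k:
            failures.append(case["question"])
    return failures


eval_set: list[dict] = []
add_regression_case(eval_set, "How do users rotate API keys?", ["auth_docs_012"])

# Stand-in retrievers before and after a fix.
before = failing_cases(lambda q: ["auth_docs_003"], eval_set)
after = failing_cases(lambda q: ["auth_docs_012"], eval_set)
print("failing before fix:", before)
print("failing after fix:", after)
```

Running `failing_cases` after every change to chunking, filters, or reranking shows quickly whether a fix for one question broke another.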

Practical takeaway

Retrieval quality determines whether external memory is useful. Build retrieval test sets, measure precision and recall, inspect failures, use metadata filters, and consider reranking or hybrid search.

A RAG system without retrieval evaluation is a guessing system with extra steps.
