
Evaluating Retrieval Quality
AGAI 203 · Memory Management and Evaluation
Learn how to measure whether memory retrieval is returning the right information, and how to improve retrieval with metadata, reranking, hybrid search, and test sets.
Key terms
- retrieval quality → answer quality
- precision = relevant retrieved / retrieved
- recall = relevant retrieved / relevant
- reranking improves top-k quality
Learning objectives
- Identify common retrieval failure modes.
- Build a retrieval evaluation set with known relevant chunks.
- Explain precision, recall, recall@k, and ranking quality.
- Improve retrieval using metadata filters, reranking, and hybrid search.
A memory system is only useful if it retrieves the right information at the right time. Retrieval quality is one of the most important factors in RAG and long-term agent memory.
When an agent gives a bad answer, the problem may not be the language model. It may be retrieval. The correct document may not have been retrieved, the chunk may have been too small, the result may have been stale, or the model may have received too many irrelevant chunks.
Retrieval evaluation asks:
Did the memory system retrieve the information needed to answer the question?
Retrieval failure modes
Common retrieval failures include:
- Missed retrieval: the relevant chunk was not returned.
- Low ranking: the relevant chunk was returned too far down the list.
- Noisy retrieval: irrelevant chunks crowded the context.
- Stale retrieval: outdated information was returned.
- Over-broad retrieval: chunks were too large and contained mixed topics.
- Over-narrow retrieval: chunks were too small and lost context.
- Permission failure: unauthorized content was retrieved.
Each failure points to a different fix.
Building a retrieval evaluation set
Start with a set of questions and known relevant documents or chunks.
Example:
[
  {
    "question": "When can a customer request a refund for late delivery?",
    "relevant_chunk_ids": ["refund_policy_004", "shipping_policy_009"]
  },
  {
    "question": "How do users rotate API keys?",
    "relevant_chunk_ids": ["auth_docs_012"]
  }
]
Run your retriever and check whether it returns the expected chunks.
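A minimal sketch of that check, assuming the retriever can be wrapped as a function that takes a question and returns a list of chunk IDs. The names `run_eval` and `fake_retrieve`, and the extra chunk IDs `faq_017` and `auth_docs_003`, are hypothetical and exist only for illustration; swap in your own retriever.

```python
from typing import Callable

# Evaluation set in the same shape as the example above.
EVAL_SET = [
    {
        "question": "When can a customer request a refund for late delivery?",
        "relevant_chunk_ids": ["refund_policy_004", "shipping_policy_009"],
    },
    {
        "question": "How do users rotate API keys?",
        "relevant_chunk_ids": ["auth_docs_012"],
    },
]


def run_eval(retrieve: Callable[[str], list[str]], eval_set: list[dict]) -> list[dict]:
    """Run the retriever on each question and record which expected chunks came back."""
    report = []
    for case in eval_set:
        retrieved = retrieve(case["question"])
        expected = set(case["relevant_chunk_ids"])
        found = expected & set(retrieved)
        report.append({
            "question": case["question"],
            "found": sorted(found),
            "missing": sorted(expected - found),
        })
    return report


def fake_retrieve(question: str) -> list[str]:
    """Hypothetical stand-in retriever that returns canned chunk IDs."""
    canned = {
        "When can a customer request a refund for late delivery?": [
            "refund_policy_004",
            "faq_017",
        ],
        "How do users rotate API keys?": ["auth_docs_012", "auth_docs_003"],
    }
    return canned.get(question, [])


report = run_eval(fake_retrieve, EVAL_SET)
for row in report:
    print(row["question"], "| missing:", row["missing"])
```

Listing the missing chunk IDs per question, not just a pass/fail flag, makes it easier to see which documents the retriever keeps overlooking.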
Precision and recall
Two basic retrieval metrics are precision and recall.
Precision: Of the retrieved chunks, how many were relevant?
Recall: Of the relevant chunks, how many were retrieved?
If the system retrieves five chunks and only one is relevant, precision is low. If there are three relevant chunks and the system retrieves only one, recall is low.
For RAG, both matter. Low precision wastes context and distracts the model. Low recall means the model lacks needed evidence.
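The two definitions can be sketched in a few lines. This example assumes chunks are identified by string IDs, and the IDs `c1` through `c9` are made up for illustration. It combines the two cases described above: five chunks retrieved with one relevant, and three relevant chunks in total.

```python
def precision_recall(retrieved: list[str], relevant: list[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query, treating both lists as sets of IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against division by zero when either set is empty.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["c1", "c2", "c3", "c4", "c5"]  # five chunks returned
relevant = ["c1", "c8", "c9"]               # three chunks are actually relevant
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.20 recall=0.33
```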
Top-k and ranking
Most vector searches return the top k results. Choosing k is a tradeoff.
Small k:
Pros: less noise, lower token use
Cons: may miss needed context
Large k:
Pros: higher chance of including relevant evidence
Cons: more noise, higher token cost
Many systems start with k between 3 and 10, then tune k against the evaluation set.
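Ranking also matters. A relevant chunk at rank 1 is more useful than the same chunk at rank 12. Two metrics capture this. Recall@k is the fraction of relevant chunks that appear in the top k results. Mean reciprocal rank (MRR) averages 1/rank of the first relevant result across all test questions.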
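A short sketch of both ranking metrics, assuming each retriever run is an ordered list of chunk IDs with the best result first. The function names and the sample IDs are illustrative.

```python
def recall_at_k(ranked: list[str], relevant: list[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the first k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(ranked[:k]) & relevant_set) / len(relevant_set)


def reciprocal_rank(ranked: list[str], relevant: list[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was returned."""
    relevant_set = set(relevant)
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], list[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant IDs) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["a"]),        # first relevant result at rank 1
    (["x", "y", "z"], ["z"]),        # first relevant result at rank 3
    (["p", "q"], ["not_returned"]),  # relevant chunk never retrieved
]
print(f"MRR = {mean_reciprocal_rank(runs):.3f}")  # MRR = 0.444
print(recall_at_k(["x", "y", "z"], ["y", "z"], k=2))  # 0.5
```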
Reranking
A reranker takes initial search results and reorders them using a more precise model. The first-stage retriever is fast and broad. The reranker is slower but more accurate.
Pipeline:
Query
→ retrieve top 30 chunks with vector search
→ rerank top 30 using cross-encoder or LLM-based scoring
→ pass top 5 to generator
Reranking can significantly improve retrieval quality, especially when many chunks are semantically similar.
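The second stage of that pipeline can be sketched as a function that rescores candidates and keeps the best few. The scorer here, `overlap_score`, is a deliberately crude word-overlap stand-in so the example runs without any model; in a real system it would be replaced by a cross-encoder or LLM-based scoring call. All names and sample chunks are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[tuple[str, str]],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Rescore (chunk_id, text) candidates against the query and keep the top_n IDs."""
    scored = [(score(query, text), chunk_id) for chunk_id, text in candidates]
    # Sort by score only; ties keep their first-stage order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: share of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Pretend these came back from the first-stage vector search, in its order.
candidates = [
    ("c1", "Shipping times vary by region"),
    ("c2", "Customers can request a refund for late delivery"),
    ("c3", "Refund requests need an order number"),
]
top = rerank("refund for late delivery", candidates, overlap_score, top_n=2)
print(top)  # ['c2', 'c3']
```

Passing the scorer in as a parameter keeps the pipeline shape fixed while the scoring model is swapped and evaluated.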
Metadata filtering
Metadata filters improve retrieval by narrowing the search space.
Example:
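def retrieve_policy(query, department, access_level):
    # vector_db is assumed to be an already-configured vector store client.
    # The filter syntax below is illustrative and varies between vendors.
    return vector_db.search(
        query=query,
        top_k=5,
        filters={
            "department": department,
            # Only return documents at or below the caller's access level.
            "access_level": {"$lte": access_level},
            "status": "approved",
        },
    )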
Filters can enforce:
- Access control
- Document type
- Product area
- Language
- Date range
- Version
- Source quality
Do not rely only on semantic similarity when metadata can reduce ambiguity.
Hybrid search evaluation
Hybrid search combines semantic and keyword search. It is especially useful for:
- Error codes
- Product names
- API endpoints
- Legal phrases
- IDs
- Acronyms
- Rare technical terms
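Evaluate hybrid search against vector-only search on the same evaluation set. Exact keyword matching can recover queries that embeddings handle poorly, such as a literal error code or ID, and some systems improve dramatically once it is added.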
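One common way to combine the two result lists is reciprocal rank fusion (RRF), which scores each chunk by summing 1 / (k + rank) over every list it appears in. The sketch below assumes RRF with the conventional constant k = 60; your system may fuse results differently. The chunk IDs and both result lists are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results for a query that contains a literal error code.
vector_hits = ["guide_02", "guide_07", "err_docs_31"]  # semantic search
keyword_hits = ["err_docs_31", "guide_07"]             # exact keyword search
relevant = {"err_docs_31"}

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])

# Compare the two retrievers on the same question: is the top result relevant?
print("vector-only top-1 relevant:", vector_hits[0] in relevant)  # False
print("hybrid top-1 relevant:", fused[0] in relevant)             # True
```

Running the same comparison over the whole evaluation set shows whether hybrid search helps in general or only on a few keyword-heavy queries.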
Answer-level versus retrieval-level evaluation
Retrieval quality and answer quality are related but distinct.
A retriever can return perfect evidence, and the model can still answer poorly. A model can sometimes answer correctly even with weak retrieval, especially if it already knows the topic.
Evaluate both:
Retrieval-level: Did we retrieve the right chunks?
Answer-level: Did the final answer use them correctly?
This separation helps debugging.
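A small triage helper makes the separation concrete. This sketch assumes the answer-level judgment (`answer_correct`) comes from elsewhere, such as a human reviewer or a grading step, and it treats retrieval as successful only if every known relevant chunk was returned. The function name and labels are illustrative.

```python
def diagnose(retrieved: list[str], relevant: list[str], answer_correct: bool) -> str:
    """Classify one test case by where it failed, if it failed at all."""
    retrieval_ok = set(relevant) <= set(retrieved)
    if retrieval_ok and answer_correct:
        return "pass"
    if retrieval_ok:
        # The evidence was there, so the model misused or ignored it.
        return "generation failure"
    if answer_correct:
        # Right answer without the evidence, likely from the model's own knowledge.
        return "correct despite weak retrieval"
    return "retrieval failure"


print(diagnose(["auth_docs_012"], ["auth_docs_012"], answer_correct=False))
# generation failure
print(diagnose(["faq_017"], ["auth_docs_012"], answer_correct=False))
# retrieval failure
```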
Practical improvement loop
A practical retrieval improvement loop:
1. Collect failed questions.
2. Identify whether retrieval or generation failed.
3. Inspect retrieved chunks.
4. Adjust chunking, metadata, filters, query rewriting, or reranking.
5. Re-run the evaluation set.
6. Add the failure as a regression test.
Memory systems improve through measurement, not guesswork.
Practical takeaway
Retrieval quality determines whether external memory is useful. Build retrieval test sets, measure precision and recall, inspect failures, use metadata filters, and consider reranking or hybrid search.
A RAG system without retrieval evaluation is a guessing system with extra steps.