
Building a Basic RAG Pipeline
AGAI 203 · External Memory and RAG
Build a simple retrieval-augmented generation pipeline that chunks documents, embeds them, retrieves relevant context, and generates grounded answers.
Key terms
- RAG = retrieve + augment + generate
- Indexing prepares memory
- Querying uses memory
- Grounded generation reduces guessing
Learning objectives
- Describe the indexing and query phases of a RAG pipeline.
- Implement document chunking, embedding, storage, and retrieval.
- Generate answers using retrieved context.
- Identify common RAG failure modes and their causes.
Retrieval-augmented generation, or RAG, is a pattern where an application retrieves relevant external information and inserts it into the model context before generation.
A simple formula is:
RAG = retrieve + augment + generate
RAG is one of the most important techniques for agent memory because it lets a model answer from information that was not in its training data, or that it cannot reliably recall from its weights alone.
Use RAG when the answer depends on:
- Private company documents
- Product documentation
- Current policies
- Research papers
- User-uploaded files
- Project-specific knowledge
- Large knowledge bases that do not fit in context
RAG architecture
A basic RAG pipeline has two phases.
Indexing phase:
Load documents
→ Split into chunks
→ Create embeddings
→ Store chunks and embeddings in vector database
Query phase:
User question
→ Create query embedding
→ Retrieve relevant chunks
→ Insert chunks into prompt
→ Generate answer
→ Optionally cite sources
The indexing phase prepares memory. The query phase uses memory.
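To make the two phases concrete before introducing real libraries, here is a minimal standard-library sketch. It is a toy under stated assumptions: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, the index is a plain Python list rather than a vector database, and the three sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing phase: embed every chunk once and keep text and vector together."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def search(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query phase: embed the question and return the k most similar chunks."""
    query_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up sample chunks, used only to exercise the two phases.
index = build_index([
    "Refunds are accepted within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available on weekdays.",
])

question = "What is the API rate limit?"
top = search(index, question, k=1)

# Augment: the retrieved chunk is placed in the prompt ahead of the question.
prompt = f"Context:\n{top[0]}\n\nQuestion:\n{question}"
```

The index is built once and reused for every question, which is why the two phases are kept as separate functions.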
Step 1: load and chunk documents
Suppose you have internal documentation as plain text files.
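from pathlib import Path


def load_documents(folder: str):
    """Read every .txt file in the folder into a dict of source path and text."""
    docs = []
    for path in Path(folder).glob("*.txt"):
        docs.append({
            "source": str(path),
            "text": path.read_text(encoding="utf-8")
        })
    return docs


def chunk_document(doc, chunk_size=1000, overlap=150):
    """Split a document into fixed-size character chunks that overlap."""
    if not 0 <= overlap < chunk_size:
        # With overlap >= chunk_size, start never advances and the loop never ends.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    text = doc["text"]
    chunks = []
    start = 0
    chunk_index = 0

    while start < len(text):
        end = start + chunk_size
        chunk_text = text[start:end]
        chunks.append({
            "id": f"{doc['source']}::chunk_{chunk_index}",
            "text": chunk_text,
            "metadata": {
                "source": doc["source"],
                "chunk_index": chunk_index
            }
        })
        chunk_index += 1
        if end >= len(text):
            # This chunk reached the end of the text. Stepping back by the
            # overlap would only emit a redundant tail chunk.
            break
        start = end - overlap

    return chunks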
In production, chunking should respect document structure when possible: headings, paragraphs, Markdown sections, or semantic boundaries.
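As one possible illustration of structure-aware chunking, the sketch below packs whole paragraphs into chunks instead of cutting at fixed character offsets. It assumes paragraphs are separated by blank lines; the function name `chunk_by_paragraph` and the `max_chars` parameter are hypothetical, and real documents may need heading-aware or format-specific splitting instead.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept as its own oversized
    chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""

    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a new chunk on its own.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para

    if current:
        chunks.append(current)
    return chunks


sample = "A" * 40 + "\n\n" + "B" * 40 + "\n\n" + "C" * 40
paragraph_chunks = chunk_by_paragraph(sample, max_chars=90)
```

This version drops the overlap used by the fixed-size chunker, on the assumption that paragraph boundaries already keep related sentences together. Whether that holds depends on the documents.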
Step 2: create embeddings
from openai import OpenAI

client = OpenAI()


def embed_text(text: str):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
You can batch embeddings for efficiency, but the concept is the same: each chunk becomes a vector.
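A batching helper might look like the sketch below. It is deliberately independent of any provider: `embed_in_batches` and `fake_embed_batch` are hypothetical names, the batch size of 64 is an arbitrary placeholder, and the fake embedder returns a one-number vector only so the example runs without network access.

```python
from typing import Callable, Sequence

Vector = list[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 64,
) -> list[Vector]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            # Guard against silently misaligning chunks and vectors.
            raise ValueError("embed_batch returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors


def fake_embed_batch(batch: list[str]) -> list[Vector]:
    """Placeholder for a real embedding call: one tiny vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

With the OpenAI client, `embed_batch` could plausibly wrap a single `client.embeddings.create` call that passes the list of strings as `input` and collects `item.embedding` from `response.data`. Check the provider's current limits on batch size and tokens per request before relying on a specific `batch_size`.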
Step 3: store in a vector database
With Chroma, a local prototype might look like:
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="product_docs")


def index_chunks(chunks):
    ids = []
    documents = []
    embeddings = []
    metadatas = []

    for chunk in chunks:
        ids.append(chunk["id"])
        documents.append(chunk["text"])
        embeddings.append(embed_text(chunk["text"]))
        metadatas.append(chunk["metadata"])

    collection.add(
        ids=ids,
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas
    )
This stores text, vectors, and metadata together.
Step 4: retrieve relevant chunks
def retrieve(question: str, k: int = 5):
    query_embedding = embed_text(question)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    return results
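The retrieval result contains documents, scores or distances, IDs, and metadata; the exact fields depend on the vector store. Chroma returns a dictionary in which each field holds one list per query, so a single-question lookup reads its hits from index 0.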
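Because a Chroma-style result stores each field as parallel lists, it can be easier to work with once flattened into one record per hit. The sketch below assumes a single-query result with `ids`, `documents`, `metadatas`, and `distances` keys; `flatten_results` is a hypothetical helper, and `sample_results` is hand-written data shaped like such a result, not real output.

```python
def flatten_results(results: dict) -> list[dict]:
    """Turn a single-query, Chroma-style result into a list of hit records."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]

    hits = []
    for chunk_id, doc, metadata, distance in zip(ids, docs, metadatas, distances):
        hits.append({
            "id": chunk_id,
            "text": doc,
            # Metadata may be missing for a record, so fall back to an empty dict.
            "source": (metadata or {}).get("source", "unknown"),
            "distance": distance,
        })
    return hits


# Hand-written sample in the assumed shape: one inner list per query.
sample_results = {
    "ids": [["docs/refunds.txt::chunk_0", "docs/api.txt::chunk_3"]],
    "documents": [["Refund policy text.", "Rate limit text."]],
    "metadatas": [[{"source": "docs/refunds.txt", "chunk_index": 0}, None]],
    "distances": [[0.21, 0.58]],
}

hits = flatten_results(sample_results)
```

Whether a smaller distance means a better match depends on the metric the collection is configured with, so any threshold applied to `distance` should be checked against that setting.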
Step 5: generate a grounded answer
Now insert retrieved chunks into the prompt.
def format_context(results):
    docs = results["documents"][0]
    metadatas = results["metadatas"][0]

    sections = []
    for i, doc in enumerate(docs):
        source = metadatas[i].get("source", "unknown")
        sections.append(f"Source: {source}\n{doc}")

    return "\n\n---\n\n".join(sections)


def answer_question(question: str):
    results = retrieve(question)
    context = format_context(results)

    messages = [
        {
            "role": "system",
            "content": "You answer questions using the provided context. If the context does not contain the answer, say so. Do not invent facts."
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion:\n{question}"
        }
    ]

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages,
        temperature=0.2
    )
    return response.choices[0].message.content
This is a minimal RAG system. Production systems add better chunking, metadata filters, reranking, citations, access controls, caching, and evaluation.
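Of the production additions listed, citations are the simplest to sketch. One possible approach is to number each retrieved chunk in the context and keep a map from number to source, so the application can resolve markers such as [1] in the answer back to a document. The function name, the record fields `source` and `text`, and the sample hits below are all hypothetical.

```python
def format_context_with_citations(hits: list[dict]) -> tuple[str, dict[int, str]]:
    """Label each chunk [1], [2], ... and return the context plus a number-to-source map."""
    sections = []
    sources: dict[int, str] = {}
    for number, hit in enumerate(hits, start=1):
        sources[number] = hit["source"]
        sections.append(f"[{number}] Source: {hit['source']}\n{hit['text']}")
    return "\n\n---\n\n".join(sections), sources


# Made-up hits, used only to show the output shape.
example_hits = [
    {"source": "docs/refunds.txt", "text": "Refund policy text."},
    {"source": "docs/api.txt", "text": "Rate limit text."},
]

cited_context, citation_map = format_context_with_citations(example_hits)
```

For this to be useful, the system prompt would also need to ask the model to cite the bracketed numbers it relied on. The model may still cite incorrectly, so citations are worth spot-checking during evaluation.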
Grounding and refusal
The system prompt should tell the model how to behave when retrieved context is insufficient.
Bad behavior:
The context is weak, so the model guesses.
Better behavior:
The provided context does not specify the refund deadline. I cannot determine the deadline from these documents.
RAG reduces hallucination only when retrieval returns relevant evidence and the model is instructed to ground its answer in that evidence.
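Prompt instructions can be backed by a check in application code. The sketch below refuses before calling the model when no retrieved chunk is close enough to the question. It is an illustration under assumptions: `max_distance=0.8` is an arbitrary placeholder that would need tuning for a given embedding model and distance metric, and `fake_generate` is a hypothetical stand-in for the chat completion call.

```python
from typing import Callable

REFUSAL_MESSAGE = (
    "The provided context does not contain enough information to answer this question."
)


def select_usable_chunks(
    documents: list[str], distances: list[float], max_distance: float = 0.8
) -> list[str]:
    """Keep only chunks whose distance to the query is within the threshold."""
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


def answer_or_refuse(
    question: str,
    documents: list[str],
    distances: list[float],
    generate: Callable[[str, str], str],
    max_distance: float = 0.8,
) -> str:
    """Refuse without calling the model when no chunk passes the threshold."""
    usable = select_usable_chunks(documents, distances, max_distance)
    if not usable:
        return REFUSAL_MESSAGE
    context = "\n\n---\n\n".join(usable)
    return generate(question, context)


def fake_generate(question: str, context: str) -> str:
    """Placeholder for a real model call."""
    return f"ANSWER using {len(context)} chars"


refused = answer_or_refuse("q", ["far"], [1.4], fake_generate)
answered = answer_or_refuse("q", ["near", "far"], [0.2, 1.4], fake_generate)
```

A distance threshold is a coarse filter. It catches the case where nothing relevant was retrieved, but the system prompt is still needed for the case where chunks are on topic yet do not contain the specific answer.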
Common RAG failure modes
RAG systems can fail in several ways:
- The right document is not indexed.
- The document is indexed but chunked poorly.
- Retrieval returns irrelevant chunks.
- The correct chunk is ranked too low.
- The model ignores retrieved context.
- The retrieved document is stale.
- The system retrieves content the user is not authorized to see.
Each failure requires a different fix. Do not assume a bad answer is always the model’s fault. It may be a retrieval problem.
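One way to tell a retrieval problem from a generation problem is to measure retrieval on its own. The sketch below computes a hit rate at k: the fraction of test questions whose known-correct chunk ID appears in the top k retrieved IDs. The function name and the small labeled set are hypothetical, and building such a set for real documents requires manual labeling.

```python
def hit_rate_at_k(
    retrieved_ids_per_question: list[list[str]],
    expected_id_per_question: list[str],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk ID is in the top-k retrieved IDs."""
    if len(retrieved_ids_per_question) != len(expected_id_per_question):
        raise ValueError("each question needs exactly one expected chunk ID")
    if not expected_id_per_question:
        return 0.0

    hits = sum(
        1
        for retrieved, expected in zip(retrieved_ids_per_question, expected_id_per_question)
        if expected in retrieved[:k]
    )
    return hits / len(expected_id_per_question)


# Made-up evaluation set: three questions, ranked IDs returned for each,
# and the chunk ID a human marked as containing the answer.
retrieved = [["a", "b"], ["c", "d"], ["e", "f"]]
expected = ["b", "x", "e"]

rate_at_2 = hit_rate_at_k(retrieved, expected, k=2)
rate_at_1 = hit_rate_at_k(retrieved, expected, k=1)
```

A low hit rate points at indexing, chunking, or ranking. A high hit rate combined with wrong answers suggests the model is ignoring or misreading the context. A hit that only appears at larger k suggests the correct chunk is ranked too low.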
Practical takeaway
A RAG pipeline gives agents external semantic memory. It retrieves relevant knowledge, augments the prompt, and generates an answer grounded in that context.
The basic version is simple. The production version depends on careful document processing, metadata, permissions, retrieval tuning, and evaluation.