Building a Basic RAG Pipeline

Retrieval-augmented generation, or RAG, is a pattern where an application retrieves relevant external information and inserts it into the model context before generation.

A simple formula is:

RAG = retrieve + augment + generate

RAG is one of the most important techniques for agent memory because it lets a model answer from information that was absent from its training data, or that it cannot be trusted to recall accurately from memory alone.

Use RAG when the answer depends on:

Private company documents
Product documentation
Current policies
Research papers
User-uploaded files
Project-specific knowledge
Large knowledge bases that do not fit in context

RAG architecture

A basic RAG pipeline has two phases.

Indexing phase:

Load documents
→ Split into chunks
→ Create embeddings
→ Store chunks and embeddings in vector database

Query phase:

User question
→ Create query embedding
→ Retrieve relevant chunks
→ Insert chunks into prompt
→ Generate answer
→ Optionally cite sources

The indexing phase prepares memory. The query phase uses memory.
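To make the two phases concrete, here is a minimal sketch that runs without any external service. It assumes a toy bag-of-words "embedding" and a plain Python list in place of a vector database; `toy_embed`, `build_index`, and `search` are hypothetical names used only for illustration, and the sample sentences are made up.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing phase: embed each chunk once and store it next to its text."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def search(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query phase: embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Indexing phase (done once, ahead of time).
index = build_index([
    "Refunds are accepted within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
])

# Query phase (done per question): retrieve, then augment the prompt.
question = "What is the API rate limit?"
top = search(index, question, k=1)
prompt = f"Context:\n{top[0]}\n\nQuestion:\n{question}"
```

The steps below replace each toy piece with a real one: an embedding model instead of word counts, and a vector database instead of a list.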

Step 1: load and chunk documents

Suppose you have internal documentation as plain text files.

from pathlib import Path

def load_documents(folder: str):
    docs = []
    for path in Path(folder).glob("*.txt"):
        docs.append({
            "source": str(path),
            "text": path.read_text(encoding="utf-8")
        })
    return docs

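def chunk_document(doc, chunk_size=1000, overlap=150):
    # The window advances by chunk_size - overlap characters per step, so
    # overlap must be smaller than chunk_size or the loop never terminates.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    text = doc["text"]
    chunks = []
    start = 0
    chunk_index = 0

    while start < len(text):
        end = start + chunk_size
        chunk_text = text[start:end]
        chunks.append({
            "id": f"{doc['source']}::chunk_{chunk_index}",
            "text": chunk_text,
            "metadata": {
                "source": doc["source"],
                "chunk_index": chunk_index
            }
        })
        chunk_index += 1
        if end >= len(text):
            # This window already reached the end of the text; stepping back
            # by `overlap` would only emit a redundant tail chunk.
            break
        start = end - overlap

    return chunks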

In production, chunking should respect document structure when possible: headings, paragraphs, Markdown sections, or semantic boundaries.
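As one illustration of structure-aware chunking, the sketch below packs whole paragraphs into chunks instead of cutting at fixed character offsets. It assumes paragraphs are separated by blank lines; `chunk_by_paragraph` and `max_chars` are hypothetical names, and real documents may call for splitting on headings or sentences instead.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because boundaries fall between paragraphs, each chunk tends to hold a complete thought, which usually gives the embedding a cleaner signal than a fixed-width window.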

Step 2: create embeddings

from openai import OpenAI

client = OpenAI()

def embed_text(text: str):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

You can batch embeddings for efficiency, but the concept is the same: each chunk becomes a vector.
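A batched version might look like the sketch below. It relies on the embeddings endpoint accepting a list of strings as `input` and returning one item per string with an `index` field. The function name `embed_batch` and the `batch_size` of 100 are assumptions chosen for illustration; the client is passed in as a parameter so the function can be exercised with a stub.

```python
def embed_batch(client, texts: list[str],
                model: str = "text-embedding-3-small",
                batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches, returning one vector per input, in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Sort by the index field so each vector lines up with its input text.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

With the `client` defined above, `embed_batch(client, [c["text"] for c in chunks])` would return the vectors in the same order as `chunks`.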

Step 3: store in a vector database

With Chroma, a local prototype might look like:

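import chromadb

# chromadb.Client() keeps data in memory only; it is lost when the process exits.
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error if this code runs twice in one session.
collection = chroma_client.get_or_create_collection(name="product_docs")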

def index_chunks(chunks):
    ids = []
    documents = []
    embeddings = []
    metadatas = []

    for chunk in chunks:
        ids.append(chunk["id"])
        documents.append(chunk["text"])
        embeddings.append(embed_text(chunk["text"]))
        metadatas.append(chunk["metadata"])

    collection.add(
        ids=ids,
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas
    )

This stores text, vectors, and metadata together.

Step 4: retrieve relevant chunks

def retrieve(question: str, k: int = 5):
    query_embedding = embed_text(question)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    return results

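What the result contains depends on the vector store. Chroma returns a dictionary of parallel lists (IDs, documents, metadata, and distances), each nested one level per query, which is why the next step indexes into the result with [0].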
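For downstream code it is often easier to work with one record per retrieved chunk. The sketch below assumes a single-query result shaped like Chroma's default output, with `ids`, `documents`, `metadatas`, and `distances` keys; `flatten_results` is a hypothetical helper, and other vector stores will need a different adapter.

```python
def flatten_results(results: dict) -> list[dict]:
    """Turn a single-query result dict into a list of per-chunk hits."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        # A chunk stored without metadata may come back as None.
        {"id": i, "text": d, "metadata": m or {}, "distance": dist}
        for i, d, m, dist in zip(ids, docs, metadatas, distances)
    ]
```

Each hit then carries its own text, source metadata, and distance, which makes later filtering and logging simpler.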

Step 5: generate a grounded answer

Now insert retrieved chunks into the prompt.

def format_context(results):
    docs = results["documents"][0]
    metadatas = results["metadatas"][0]

    sections = []
    for doc, metadata in zip(docs, metadatas):
        source = metadata.get("source", "unknown")
        sections.append(f"Source: {source}\n{doc}")
    return "\n\n---\n\n".join(sections)

def answer_question(question: str):
    results = retrieve(question)
    context = format_context(results)

    messages = [
        {
            "role": "system",
            "content": "You answer questions using the provided context. If the context does not contain the answer, say so. Do not invent facts."
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion:\n{question}"
        }
    ]

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages,
        temperature=0.2
    )
    return response.choices[0].message.content

This is a minimal RAG system. Production systems add better chunking, metadata filters, reranking, citations, access controls, caching, and evaluation.
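As one example of those additions, citations can start with numbering the context so the model has something to point at. The sketch below is one possible approach rather than a standard; `format_context_with_citations` is a hypothetical helper, and it assumes each chunk is a dict with a `text` key and an optional `source` key.

```python
def format_context_with_citations(chunks: list[dict]) -> tuple[str, dict[int, str]]:
    """Number each chunk so the model can cite it as [1], [2], and so on.

    Returns the formatted context plus a number-to-source map for resolving
    citations in the model's answer afterwards.
    """
    sections = []
    sources = {}
    for number, chunk in enumerate(chunks, start=1):
        source = chunk.get("source", "unknown")
        sources[number] = source
        sections.append(f"[{number}] Source: {source}\n{chunk['text']}")
    return "\n\n---\n\n".join(sections), sources
```

The system prompt would then also need to ask the model to cite passages by their bracketed number; the returned map lets the application turn those numbers back into file paths or links.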

Grounding and refusal

The system prompt should tell the model how to behave when retrieved context is insufficient.

Bad behavior:

The context is weak, so the model guesses.

Better behavior:

The provided context does not specify the refund deadline. I cannot determine the deadline from these documents.

RAG should reduce hallucination, but only if the model is instructed to ground its answer in retrieved evidence.
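Prompt instructions can be backed up by a check in code. The sketch below refuses before calling the model when no retrieved chunk is close enough to the question. The threshold of 0.5 is an arbitrary placeholder that depends on the distance metric and embedding model, and `generate` is a hypothetical callable that wraps the model call.

```python
REFUSAL = "The provided context does not contain enough information to answer this question."

def select_evidence(hits: list[dict], max_distance: float = 0.5) -> list[dict]:
    """Keep only hits whose distance to the query is within the threshold."""
    return [hit for hit in hits if hit["distance"] <= max_distance]

def answer_or_refuse(question: str, hits: list[dict], generate,
                     max_distance: float = 0.5) -> str:
    """Call the model only when at least one hit passes the distance check."""
    evidence = select_evidence(hits, max_distance)
    if not evidence:
        # Nothing relevant was retrieved, so skip generation entirely.
        return REFUSAL
    context = "\n\n".join(hit["text"] for hit in evidence)
    return generate(question, context)
```

A guard like this does not replace the system prompt: a chunk can be close to the question and still not contain the answer, so the model must still be told to say when the context falls short.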

Common RAG failure modes

RAG systems can fail in several ways:

The right document is not indexed.
The document is indexed but chunked poorly.
Retrieval returns irrelevant chunks.
The correct chunk is ranked too low.
The model ignores retrieved context.
The retrieved document is stale.
The system retrieves content the user is not authorized to see.

Each failure requires a different fix. Do not assume a bad answer is always the model’s fault. It may be a retrieval problem.
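One way to tell retrieval problems apart from generation problems is to measure retrieval on its own. The sketch below computes a hit rate at k over a small hand-labeled set. It assumes each test case names the chunk ID that should be retrieved; `hit_rate_at_k` and `retrieve_ids` are hypothetical names, with `retrieve_ids(question, k)` expected to return a ranked list of chunk IDs.

```python
def hit_rate_at_k(cases: list[dict], retrieve_ids, k: int = 5) -> float:
    """Fraction of questions whose expected chunk ID appears in the top k results."""
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        retrieved = retrieve_ids(case["question"], k)
        if case["expected_id"] in retrieved[:k]:
            hits += 1
    return hits / len(cases)
```

If the hit rate is low, the fix probably lies in indexing, chunking, or ranking. If it is high and answers are still wrong, the prompt or the model is the more likely culprit.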

Practical takeaway

A RAG pipeline gives agents external semantic memory. It retrieves relevant knowledge, augments the prompt, and generates an answer grounded in that context.

The basic version is simple. The production version depends on careful document processing, metadata, permissions, retrieval tuning, and evaluation.
