
Embeddings and Vector Search
AGAI 203 · External Memory and RAG
Understand how embeddings represent meaning numerically and how vector search retrieves semantically similar information for agents.
Key terms
- embedding = meaning as vector
- semantic search = query vector → nearest chunks
- metadata improves retrieval control
- chunking shapes memory quality
Learning objectives
- Explain embeddings and vector similarity in practical terms.
- Describe the indexing and querying phases of vector search.
- Create embeddings programmatically for text inputs.
- Compare common vector database options and use cases.
Embeddings are one of the core technologies behind external memory for AI agents. An embedding is a numerical representation of text, code, images, or other data. For language applications, embeddings allow a system to represent meaning as a vector.
A simple mental model is:
embedding = meaning as a list of numbers
Texts with similar meanings should have vectors that are close together in vector space. This enables semantic search: finding information by meaning rather than exact keyword match.
For example, these two sentences do not share many exact words:
How do I reset my password?
I forgot my login credentials and need account access.
A keyword search may struggle. An embedding search can recognize that both are about account recovery.
How embedding search works
A vector search workflow has two phases: indexing and querying.
Indexing:
Documents → chunks → embeddings → vector database
Querying:
User question → query embedding → nearest vectors → retrieved chunks
The retrieved chunks are then inserted into the model context so the model can generate a grounded answer.
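The two phases can be sketched with a toy in-memory index. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical stand-in for a real embedding model and only counts words, so it captures word overlap rather than meaning. The names `index` and `search` and the second sample chunk are also assumptions made for this example.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing: documents -> chunks -> embeddings -> store
chunks = [
    "Users can reset passwords from the Account Settings page.",
    "Invoices are emailed on the first day of each month.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Querying: question -> query embedding -> nearest vectors -> retrieved chunks
def search(question, top_k=1):
    query_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

print(search("How do I reset my password?"))
```

Swapping `toy_embed` for a real embedding model and the list for a vector database keeps the same two-phase shape.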
Similarity
Vector databases compare embeddings using similarity measures such as cosine similarity, dot product, or Euclidean distance. You do not need to hand-code these for most applications because tools like Pinecone, Weaviate, Chroma, Qdrant, and FAISS handle vector indexing and similarity search.
Conceptually:
Higher similarity = more semantically related
Lower similarity = less semantically related
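For intuition, the three measures can be written out by hand on tiny vectors. This is a sketch for illustration only; the function names and the two-dimensional example vectors are assumptions, and real embeddings have hundreds or thousands of dimensions. Note that Euclidean distance is a distance, so for it a smaller value means the vectors are closer.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way, but longer
orthogonal = [0.0, 3.0]       # points in an unrelated direction

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
print(euclidean_distance(query, same_direction)) # 1.0
```

Cosine similarity treats `query` and `same_direction` as identical in direction even though their lengths differ, which is one reason it is a common default for text embeddings.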
Example query result:
[
  {
    "chunk_id": "doc_12_chunk_04",
    "score": 0.87,
    "text": "Users can reset passwords from the Account Settings page..."
  },
  {
    "chunk_id": "doc_08_chunk_02",
    "score": 0.81,
    "text": "If a user cannot access their account, they should use the recovery flow..."
  }
]
The score is not a guarantee of truth. It is a similarity signal.
Creating embeddings in Python
A simplified embedding workflow might look like this:
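from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

text = "Users can reset their password from the Account Settings page."

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
)

# One input string yields one embedding: a list of floats.
embedding = response.data[0].embedding
print(len(embedding))   # number of dimensions
print(embedding[:5])    # first few values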
In production, you would embed many chunks and store them with metadata.
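One way to structure that step is to keep the embedding call behind an injected function, so records can be built and tested without a network call. The sketch below is an assumption about how such a helper might look: `build_records`, `fake_embed_batch`, the `chunk_index` field, and the ID format are all hypothetical names chosen for this example.

```python
def build_records(chunks, embed_batch, source):
    """Embed chunks in one batch and pair each vector with its text and metadata.

    embed_batch is an injected function: list[str] -> list[list[float]].
    """
    vectors = embed_batch(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embedding count does not match chunk count")
    return [
        {
            "id": f"{source}_{i:03d}",
            "text": chunk,
            "embedding": vector,
            "metadata": {"source": source, "chunk_index": i},
        }
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]

def fake_embed_batch(texts):
    """Placeholder embedder for local testing; returns made-up 2-d vectors."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

records = build_records(
    ["Reset your password.", "Contact support."],
    fake_embed_batch,
    "support_docs",
)
print(records[0]["id"], records[0]["metadata"])
```

With the OpenAI client used in this lesson, `embed_batch` could wrap `client.embeddings.create(model="text-embedding-3-small", input=texts)` and return `[item.embedding for item in response.data]`, since that endpoint accepts a list of strings.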
Storing vectors with metadata
Metadata is essential. It allows filtering, ranking, freshness checks, and source attribution.
Example memory record:
{
  "id": "support_docs_password_reset_001",
  "text": "Users can reset their password from the Account Settings page.",
  "embedding": [0.012, -0.084, 0.031],
  "metadata": {
    "source": "support_docs",
    "document_title": "Account Recovery Guide",
    "updated_at": "2025-10-18",
    "product_area": "authentication"
  }
}
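Without metadata, retrieved text is difficult to trust: there is no way to check where it came from, how recent it is, or whether it applies to the topic at hand.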
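As a rough sketch of how metadata supports filtering and freshness checks, the example below drops candidates from the wrong product area or with a stale `updated_at` before ranking by score. The records, scores, and the `filter_records` helper are invented for illustration; vector databases usually apply such filters inside the query itself.

```python
from datetime import date

# Illustrative candidates; scores and the second and third records are made up.
records = [
    {"text": "Users can reset their password from the Account Settings page.",
     "score": 0.87,
     "metadata": {"product_area": "authentication", "updated_at": "2025-10-18"}},
    {"text": "Legacy reset flow via email link.",
     "score": 0.90,
     "metadata": {"product_area": "authentication", "updated_at": "2021-03-02"}},
    {"text": "Invoices are emailed monthly.",
     "score": 0.40,
     "metadata": {"product_area": "billing", "updated_at": "2025-09-01"}},
]

def filter_records(records, product_area, updated_since):
    """Keep records in the given product area updated on or after updated_since."""
    kept = [
        r for r in records
        if r["metadata"]["product_area"] == product_area
        and date.fromisoformat(r["metadata"]["updated_at"]) >= updated_since
    ]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

fresh = filter_records(records, "authentication", date(2025, 1, 1))
print([r["text"] for r in fresh])
```

Here the stale record has the highest similarity score, yet the freshness filter removes it, which is the kind of control a score alone cannot provide.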
Vector database options
Common vector storage options include:
- Pinecone: managed vector database, good for production scalability.
- Weaviate: open-source and managed options, supports hybrid search and schemas.
- Chroma: popular for local development and prototypes.
- Qdrant: open-source vector database with filtering and production-friendly features.
- FAISS: efficient vector similarity library often used locally or inside custom systems.
Choice depends on scale, hosting preferences, filtering needs, cost, and operational complexity.
For a small local prototype, Chroma may be enough. For a production SaaS application with many users and metadata filters, Pinecone, Weaviate, or Qdrant may be more appropriate.
Chunking matters
Before embedding documents, you usually split them into chunks. Chunk size affects retrieval quality.
Chunks that are too small may lose context:
"It is allowed after five days."
Allowed what? Five days after what?
Chunks that are too large may contain too many topics and reduce retrieval precision.
A common approach is to use chunks of a few hundred tokens with overlap.
Example:
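def chunk_text(text, chunk_size=800, overlap=100):
    # overlap must be smaller than chunk_size, or the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting a redundant tail
        start = end - overlap
    return chunks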
This character-based example is simplified. Production systems often chunk by tokens, headings, paragraphs, Markdown sections, or semantic boundaries.
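As one sketch of boundary-aware chunking, the function below groups whole paragraphs into chunks instead of cutting at fixed character offsets. It assumes paragraphs are separated by blank lines; `chunk_by_paragraph` and `max_chars` are hypothetical names, and the size limit is still measured in characters rather than tokens.

```python
def chunk_by_paragraph(text, max_chars=800):
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_by_paragraph(sample, max_chars=40))
```

Because chunks end at paragraph boundaries, a sentence is less likely to be separated from the context that explains it.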
Hybrid search
Embedding search is powerful, but keyword search still matters. Some queries depend on exact terms, IDs, error codes, product names, or legal phrases.
Hybrid search combines vector similarity with keyword search.
Example query:
ERR_AUTH_401 token refresh failure
Exact keyword matching for ERR_AUTH_401 may be more important than semantic similarity. Weaviate and other systems support hybrid search; many teams also combine vector results with traditional search engines.
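A simple way to picture hybrid ranking is a weighted blend of a vector score and a keyword score. The sketch below is an assumption for illustration, not how any particular database implements it: `keyword_score`, `hybrid_score`, the `alpha` weight, and the candidate texts and scores are all invented. Real systems often use stronger keyword scoring such as BM25 and other fusion methods such as reciprocal rank fusion.

```python
def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the text (case-insensitive)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms) if terms else 0.0

def hybrid_score(vector_score, kw_score, alpha=0.5):
    """Weighted blend: alpha=1.0 is pure vector search, alpha=0.0 is pure keyword."""
    return alpha * vector_score + (1 - alpha) * kw_score

query = "ERR_AUTH_401 token refresh failure"
candidates = [
    # (text, vector similarity score); the scores are invented for illustration
    ("Sessions may expire when a login cannot be renewed", 0.82),
    ("ERR_AUTH_401 is raised on token refresh failure", 0.78),
]

ranked = sorted(
    candidates,
    key=lambda c: hybrid_score(c[1], keyword_score(query, c[0])),
    reverse=True,
)
print(ranked[0][0])
```

The second candidate has the lower vector score but contains the exact error code, so the blended score ranks it first.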
Practical takeaway
Embeddings let agents retrieve information by meaning. Vector databases make that retrieval scalable. But embeddings are not magic memory. Retrieval quality depends on chunking, metadata, embedding model choice, filtering, ranking, and evaluation.
For agent memory, embeddings are best used as one component of a broader context-management system.