Embeddings and Vector Search

Embeddings are one of the core technologies behind external memory for AI agents. An embedding is a numerical representation of text, code, images, or other data. For language applications, embeddings allow a system to represent meaning as a vector.

A simple mental model is:

embedding = meaning as a list of numbers

Texts with similar meanings should have vectors that are close together in vector space. This enables semantic search: finding information by meaning rather than exact keyword match.

For example, these two sentences do not share many exact words:

How do I reset my password?
I forgot my login credentials and need account access.

A keyword search may struggle because the two sentences share almost no words. An embedding search can recognize that both are about account recovery.

How embedding search works

A vector search workflow has two phases: indexing and querying.

Indexing:

Documents → chunks → embeddings → vector database

Querying:

User question → query embedding → nearest vectors → retrieved chunks

The retrieved chunks are then inserted into the model context so the model can generate a grounded answer.
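The two phases can be sketched with a tiny in-memory index. This is a minimal illustration under loud assumptions: `toy_embed`, `VOCAB`, `search`, and the sample chunks are all hypothetical, and `toy_embed` only counts words from a fixed vocabulary, so it captures word overlap rather than meaning. A real system would call an embedding model and store vectors in a vector database.

```python
import math

# Hypothetical stand-in for a real embedding model: counts vocabulary words.
# It captures word overlap only, not meaning.
VOCAB = ["password", "reset", "account", "billing", "invoice", "refund"]

def toy_embed(text):
    words = [w.strip(".,?!").lower() for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Indexing: chunks -> embeddings -> stored (embedding, chunk) pairs.
chunks = [
    "Reset your password from the account settings page.",
    "Download an invoice from the billing page.",
    "Request a refund within 30 days of billing.",
]
index = [(toy_embed(chunk), chunk) for chunk in chunks]

# Querying: question -> query embedding -> nearest stored vectors.
def search(question, top_k=2):
    query_vec = toy_embed(question)
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

results = search("How do I reset my password?")
for score, chunk in results:
    print(round(score, 2), chunk)
```

The structure is the same at any scale: embed once at indexing time, embed the question at query time, then rank stored vectors by similarity to the query vector.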

Similarity

Vector databases compare embeddings using measures such as cosine similarity, dot product, or Euclidean distance. Cosine similarity and dot product are similarity scores (higher means closer), while Euclidean distance is a distance (lower means closer). You do not need to hand-code these for most applications, because tools like Pinecone, Weaviate, Chroma, Qdrant, and FAISS handle vector indexing and similarity search.

Conceptually:

Higher similarity = more semantically related
Lower similarity = less semantically related
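Although libraries compute these measures for you, it can help to see them written out. The sketch below implements the three measures in plain Python on made-up three-dimensional vectors; the vectors and function names are illustrative assumptions, and real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors for illustration only.
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]   # same direction as the query, larger magnitude
far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, close), cosine_similarity(query, far))
print(dot(query, close), dot(query, far))
print(euclidean_distance(query, close), euclidean_distance(query, far))
```

Cosine similarity ignores vector length, so `close` scores as a near-perfect match even though it is not the same point as `query`. Euclidean distance runs the other way: smaller values mean a closer match.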

Example query result:

[
  {
    "chunk_id": "doc_12_chunk_04",
    "score": 0.87,
    "text": "Users can reset passwords from the Account Settings page..."
  },
  {
    "chunk_id": "doc_08_chunk_02",
    "score": 0.81,
    "text": "If a user cannot access their account, they should use the recovery flow..."
  }
]

The score is not a guarantee of truth. It is a similarity signal.
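One common way to act on the score is to drop weak matches before they reach the model. The sketch below assumes results shaped like the example above; `filter_by_score`, the third result, and the 0.8 cutoff are hypothetical, and a sensible cutoff depends on the embedding model and the data, so it would need tuning.

```python
results = [
    {"chunk_id": "doc_12_chunk_04", "score": 0.87},
    {"chunk_id": "doc_08_chunk_02", "score": 0.81},
    {"chunk_id": "doc_03_chunk_09", "score": 0.42},  # hypothetical weak match
]

def filter_by_score(results, min_score):
    """Keep results at or above min_score, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

# 0.8 is an arbitrary illustrative cutoff, not a recommended value.
confident = filter_by_score(results, min_score=0.8)
print([r["chunk_id"] for r in confident])
```

A passing score still only means the chunk is similar to the query, not that its content is correct or current.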

Creating embeddings in Python

A simplified embedding workflow might look like this:

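from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

text = "Users can reset their password from the Account Settings page."

# Request an embedding for a single string.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

# One input string produces one item in response.data.
embedding = response.data[0].embedding
print(len(embedding))   # vector dimensionality
print(embedding[:5])    # first five components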

In production, you would embed many chunks and store them with metadata.
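A batch version might look like the sketch below. `build_records` and `embed_batch` are hypothetical names: `embed_batch` is any function that takes a list of strings and returns one vector per string, which in practice could wrap a call to an embedding API. The fake embedder exists only so the example runs offline and is not a real model.

```python
def build_records(chunks, embed_batch, source):
    """Embed chunks in one batch and pair each vector with its text and metadata."""
    vectors = embed_batch(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embed_batch must return one vector per chunk")
    return [
        {
            "id": f"{source}_{i:03d}",
            "text": chunk,
            "embedding": vector,
            "metadata": {"source": source, "chunk_index": i},
        }
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]

# Fake embedder used only so this sketch runs offline; it is not a real model.
def fake_embed_batch(texts):
    return [[float(len(t)), float(t.count(" "))] for t in texts]

records = build_records(
    ["Reset your password in Account Settings.", "Use the recovery flow."],
    fake_embed_batch,
    source="support_docs",
)
print(records[0]["id"], records[0]["metadata"])
```

Passing the embedder in as a parameter keeps the record-building logic independent of any one provider and makes it easy to test.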

Storing vectors with metadata

Metadata is essential. It allows filtering, ranking, freshness checks, and source attribution.

Example memory record:

{
  "id": "support_docs_password_reset_001",
  "text": "Users can reset their password from the Account Settings page.",
  "embedding": [0.012, -0.084, 0.031],
  "metadata": {
    "source": "support_docs",
    "document_title": "Account Recovery Guide",
    "updated_at": "2025-10-18",
    "product_area": "authentication"
  }
}

Without metadata, retrieved text is difficult to trust: you cannot tell where it came from or how current it is.
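Filtering and freshness checks can be illustrated in a few lines. The sketch below assumes records shaped like the example above, with `updated_at` stored as an ISO date string; `filter_records` and the sample records are hypothetical. Most vector databases offer their own metadata filters, so this only shows the idea.

```python
from datetime import date

records = [
    {"id": "a", "metadata": {"product_area": "authentication", "updated_at": "2025-10-18"}},
    {"id": "b", "metadata": {"product_area": "billing", "updated_at": "2025-09-01"}},
    {"id": "c", "metadata": {"product_area": "authentication", "updated_at": "2023-01-05"}},
]

def filter_records(records, product_area, updated_since):
    """Keep records in one product area updated on or after a cutoff date."""
    kept = []
    for record in records:
        meta = record["metadata"]
        updated = date.fromisoformat(meta["updated_at"])
        if meta["product_area"] == product_area and updated >= updated_since:
            kept.append(record)
    return kept

fresh_auth = filter_records(records, "authentication", date(2025, 1, 1))
print([r["id"] for r in fresh_auth])
```

Here record "c" is on topic but stale, and record "b" is recent but in the wrong product area, so only "a" survives.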

Vector database options

Common vector storage options include:

Pinecone: managed vector database, good for production scalability.
Weaviate: open-source and managed options, supports hybrid search and schemas.
Chroma: popular for local development and prototypes.
Qdrant: open-source vector database with filtering and production-friendly features.
FAISS: efficient vector similarity library often used locally or inside custom systems.

The right choice depends on scale, hosting preferences, filtering needs, cost, and operational complexity.

For a small local prototype, Chroma may be enough. For a production SaaS application with many users and metadata filters, Pinecone, Weaviate, or Qdrant may be more appropriate.

Chunking matters

Before embedding documents, you usually split them into chunks. Chunk size affects retrieval quality.

Chunks that are too small may lose context:

"It is allowed after five days."

Allowed what? Five days after what?

Chunks that are too large may contain too many topics and reduce retrieval precision.

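A common approach is to use chunks of a few hundred tokens, with some overlap between consecutive chunks so that text near a boundary keeps part of its surrounding context.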

Example:

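def chunk_text(text, chunk_size=800, overlap=100):
    # The window must advance on every step, so overlap has to be smaller.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid a redundant tail chunk
        start = end - overlap  # step back so consecutive chunks overlap
    return chunks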

This character-based example is simplified. Production systems often chunk by tokens, headings, paragraphs, Markdown sections, or semantic boundaries.
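As one illustration of boundary-aware chunking, the sketch below groups whole paragraphs until a size limit is reached. `chunk_by_paragraphs` and `max_chars` are hypothetical names. It assumes paragraphs are separated by blank lines and still measures size in characters rather than tokens.

```python
def chunk_by_paragraphs(text, max_chars=800):
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)   # current chunk is full; start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_by_paragraphs(sample, max_chars=40))
```

Because chunks never split a paragraph, each one is more likely to be readable on its own, at the cost of less uniform chunk sizes.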

Hybrid search

Embedding search is powerful, but keyword search still matters. Some queries depend on exact terms, IDs, error codes, product names, or legal phrases.

Hybrid search combines vector similarity with keyword search.

Example query:

ERR_AUTH_401 token refresh failure

Exact keyword matching for ERR_AUTH_401 may be more important than semantic similarity. Weaviate and other systems support hybrid search; many teams also combine vector results with traditional search engines.
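One simple way to combine the two result lists is reciprocal rank fusion (RRF), which merges rankings without needing the raw scores to be comparable. The sketch below is an illustration of that technique, not a description of how any particular database implements hybrid search. The document IDs are hypothetical, and k=60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs; each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists for the query "ERR_AUTH_401 token refresh failure".
vector_hits = ["auth_overview", "token_refresh_guide", "err_auth_401_page"]
keyword_hits = ["err_auth_401_page", "token_refresh_guide"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

In this example the page that matches the exact error code ranks last in the vector results but first in the keyword results, and the fused ranking puts it at the top.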

Practical takeaway

Embeddings let agents retrieve information by meaning. Vector databases make that retrieval scalable. But embeddings are not magic memory. Retrieval quality depends on chunking, metadata, embedding model choice, filtering, ranking, and evaluation.

For agent memory, embeddings are best used as one component of a broader context-management system.
