What components do you need to build semantic search with Claude?

You need an embedding model to convert text chunks into vectors (Voyage AI or OpenAI embeddings work well), a vector database to store and retrieve embeddings (Pinecone, Weaviate, or pgvector in PostgreSQL), and Claude to generate the final answer from retrieved chunks using the RAG pattern.

How do you chunk documents for semantic search?

Split documents into overlapping chunks of 200-400 tokens to preserve context at chunk boundaries. Use semantic boundaries like paragraph breaks rather than fixed character counts. Smaller chunks give more precise retrieval; larger chunks give more context per result. Experiment with your specific corpus to find the optimal size.
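As a minimal sketch of boundary-aware chunking, the function below packs whole paragraphs into chunks up to a character budget and repeats the last paragraph of each chunk at the start of the next one as overlap. The name `chunk_by_paragraph` and the `max_chars` parameter are hypothetical and not part of the project code. It counts characters rather than tokens to stay dependency-free; at a rough rule of thumb of four characters per English token, 1200 characters lands near the middle of the 200-400 token range.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    The last paragraph of each chunk is repeated at the start of the next
    one, so neighbouring chunks overlap by one paragraph. A single paragraph
    longer than max_chars is kept whole rather than split mid-sentence.
    """
    # Paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            current = [current[-1]]  # carry one paragraph over as overlap
            # Drop the overlap if it would not fit alongside the new paragraph
            if len("\n\n".join(current + [para])) > max_chars:
                current = []
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Treat the budget as a starting point and tune it against your own corpus.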

How does RAG with Claude improve answer quality compared to Claude alone?

Claude's training data has a cutoff date, and Claude has no access to your private data. RAG retrieves relevant passages from your document corpus at query time and passes them as context, allowing Claude to answer accurately about proprietary, recent, or specialised information it was never trained on.
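The "passes them as context" step can be sketched as plain string assembly. The function name `build_rag_prompt`, the passage shape (dicts with "text" and "source" keys), and the instruction wording are all assumptions made for illustration. The call to Claude is deliberately left out.

```python
def build_rag_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved passages.

    Each passage is assumed to be a dict with "text" and "source" keys.
    """
    # Number each passage so the answer can refer back to it
    context = "\n\n".join(
        f"[{i}] (source: {p['source']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}"
    )
```

The returned string would be sent as the user message. Keeping the source label next to each passage makes it easier to trace an answer back to a document.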

Build a Semantic Search Engine from Scratch with Python

What Is Semantic Search and How Does It Work?

Semantic search finds documents by meaning rather than keyword overlap. Text is converted into dense vector embeddings using a sentence-transformer model, then stored in a vector database. When a query arrives, it is embedded using the same model and the system returns documents whose vectors are closest in meaning - measured by cosine similarity. Queries and results do not need to share any words.
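To make "closest in meaning" concrete, here is a dependency-free sketch of cosine similarity. The three-dimensional vectors are made-up toy values; real embeddings from a sentence-transformer model have hundreds of dimensions, and the vector database computes this for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only)
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]
doc_far = [0.0, 0.1, 0.9]

print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))  # True
```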

Keyword search has a fundamental flaw: it matches words, not meaning. Your users do not search for keywords - they describe what they need. A user who types "how to cancel my account" is looking for the same article as one who types "steps to close my subscription." Keyword search misses one of those. Semantic search matches both.
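A quick way to see the gap is to check which words the two example queries actually share. The helper `shared_words` is a hypothetical illustration, not a real keyword search engine.

```python
def shared_words(a: str, b: str) -> set[str]:
    """Return the lowercase words that appear in both strings."""
    return set(a.lower().split()) & set(b.lower().split())


q1 = "how to cancel my account"
q2 = "steps to close my subscription"

print(sorted(shared_words(q1, q2)))  # ['my', 'to']
```

Only function words overlap. None of the words that carry the meaning ("cancel", "close", "account", "subscription") match, so a pure keyword matcher has nothing useful to connect the two.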

In this project you will build a complete semantic search engine from scratch. By the end you will have a working system that can ingest documents from text files, chunk and embed them, store embeddings in ChromaDB, and expose a clean Python search interface that returns results ranked by meaning - not by keyword overlap.

This is a standalone project. If you want to understand the underlying theory before diving in, read What is a Vector Database? and ChromaDB Tutorial first.

What You Will Build

A five-component semantic search system:

Document loader: reads text files or plain strings into a standard format
Text chunker: splits long documents into overlapping chunks suitable for embedding
Embedding pipeline: converts chunks to vector embeddings using a local model
Vector store: persists embeddings in ChromaDB with metadata
Search interface: accepts natural-language queries and returns ranked results

Prerequisites

bash

pip install chromadb sentence-transformers

Python 3.10 or later. No API keys required - everything runs locally.

Step 1: Document Loader

Start with a clean data model. Every document in the system has a source, content, and metadata.

python

# semantic_search/loader.py
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    content: str
    source: str
    metadata: dict = field(default_factory=dict)


class DocumentLoader:
    """Load documents from text strings or .txt files."""

    @staticmethod
    def from_texts(texts: list[str], source: str = "inline") -> list[Document]:
        return [
            Document(content=text.strip(), source=source, metadata={"index": i})
            for i, text in enumerate(texts)
            if text.strip()
        ]

    @staticmethod
    def from_file(path: str | Path) -> list[Document]:
        path = Path(path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {path}")
        content = path.read_text(encoding="utf-8")
        return [Document(content=content, source=str(path))]

    @staticmethod
    def from_directory(directory: str | Path, extension: str = ".txt") -> list[Document]:
        directory = Path(directory)
        documents = []
        for file_path in sorted(directory.glob(f"**/*{extension}")):
            content = file_path.read_text(encoding="utf-8").strip()
            if content:
                documents.append(Document(
                    content=content,
                    source=str(file_path),
                    metadata={"filename": file_path.name, "stem": file_path.stem}
                ))
        return documents

Step 2: Text Chunker

Long documents can exceed the embedding model's input limit (typically 256-512 tokens for most sentence-transformers). Text beyond that limit is truncated, so it never influences the embedding. Chunking splits documents into smaller, overlapping segments. The overlap means a sentence that straddles a chunk boundary still appears whole in one of the two chunks, provided it is shorter than the overlap.

python

# semantic_search/chunker.py
from dataclasses import dataclass, field

from .loader import Document

@dataclass
class Chunk:
    text: str
    source: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)


class TextChunker:
    """
    Split documents into overlapping fixed-size character chunks.

    chunk_size:    target chunk size in characters (~500 chars ≈ 100-150 tokens)
    chunk_overlap: overlap between consecutive chunks in characters
    """

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 100):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, document: Document) -> list[Chunk]:
        text = document.content
        chunks = []
        start = 0
        step = self.chunk_size - self.chunk_overlap

        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append(Chunk(
                    text=chunk_text,
                    source=document.source,
                    chunk_index=len(chunks),
                    total_chunks=0,   # filled in below
                    metadata={**document.metadata}
                ))
            start += step

        # Set total_chunks now that we know the count
        for chunk in chunks:
            chunk.total_chunks = len(chunks)

        return chunks

    def chunk_all(self, documents: list[Document]) -> list[Chunk]:
        all_chunks = []
        for doc in documents:
            all_chunks.extend(self.chunk(doc))
        return all_chunks

Choosing Chunk Size

For support articles or documentation (dense, factual content): 400-600 characters. For narrative text or long-form articles: 800-1000 characters. The overlap (100-150 characters) ensures context at boundaries is not lost. Smaller chunks improve retrieval precision; larger chunks give the LLM more context per result.
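To get a feel for how chunk size and overlap interact, the sketch below computes the character windows a fixed-size sliding chunker would produce for a text of a given length. `chunk_windows` is a hypothetical helper that follows the same stepping rule as the `TextChunker` above (step = chunk_size - chunk_overlap). It ignores whitespace stripping.

```python
def chunk_windows(
    text_length: int, chunk_size: int = 500, chunk_overlap: int = 100
) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for each sliding-window chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]


print(chunk_windows(1200))  # [(0, 500), (400, 900), (800, 1200)]
```

Note that the final window can be short: for a 900-character text the last window, (800, 900), lies entirely inside the previous one, (400, 900). If such redundant tail chunks clutter your results, one option is to drop or merge them.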

Step 3: Vector Store Wrapper

Wrap ChromaDB with a clean interface that knows nothing about the chunker or loader:

python

# semantic_search/store.py
import hashlib
import chromadb
from chromadb.utils import embedding_functions
from .chunker import Chunk


class VectorStore:
    """
    ChromaDB-backed vector store for document chunks.
    Handles ID generation, upserts, and structured queries.
    """

    def __init__(
        self,
        persist_path: str = "./search_index",
        collection_name: str = "semantic_search",
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.client = chromadb.PersistentClient(path=persist_path)
        self.ef = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=model_name
        )
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=self.ef,
            metadata={"hnsw:space": "cosine"},
        )

    @staticmethod
    def _make_chunk_id(chunk: Chunk) -> str:
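        """Stable ID: hash of (source + document index + chunk_index)."""
        # from_texts() gives every text the same source, so include the
        # per-document "index" metadata to keep IDs unique within a batch.
        doc_index = chunk.metadata.get("index", 0)
        key = f"{chunk.source}::{doc_index}::{chunk.chunk_index}"
        return hashlib.sha256(key.encode()).hexdigest()[:20]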

    def add_chunks(self, chunks: list[Chunk]) -> int:
        """Add or update chunks in the vector store. Returns number added."""
        if not chunks:
            return 0

        ids = [self._make_chunk_id(c) for c in chunks]
        texts = [c.text for c in chunks]
        metadatas = [
            {
                "source": c.source,
                "chunk_index": c.chunk_index,
                "total_chunks": c.total_chunks,
                **c.metadata,
            }
            for c in chunks
        ]

        self.collection.upsert(ids=ids, documents=texts, metadatas=metadatas)
        return len(chunks)

    def count(self) -> int:
        return self.collection.count()

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None,
        min_similarity: float = 0.0,
    ) -> list[dict]:
        """
        Run semantic search.
        Returns results as dicts with keys: text, similarity, source, metadata.
        Optionally filter by minimum similarity threshold.
        """
        actual_n = min(n_results * 2, self.collection.count())
        if actual_n == 0:
            return []

        kwargs = {
            "query_texts": [query],
            "n_results": actual_n,
            "include": ["documents", "distances", "metadatas"],
        }
        if filters:
            kwargs["where"] = filters

        raw = self.collection.query(**kwargs)

        results = []
        for text, dist, meta in zip(
            raw["documents"][0],
            raw["distances"][0],
            raw["metadatas"][0],
        ):
            similarity = round(1 - dist, 4)
            if similarity >= min_similarity:
                results.append({
                    "text": text,
                    "similarity": similarity,
                    "source": meta.get("source", ""),
                    "chunk_index": meta.get("chunk_index", 0),
                    "metadata": meta,
                })

        # Sort by similarity and trim to requested n
        results.sort(key=lambda r: r["similarity"], reverse=True)
        return results[:n_results]

    def delete_source(self, source: str) -> None:
        """Remove all chunks from a specific source document."""
        self.collection.delete(where={"source": source})

Step 4: Search Engine - Putting It All Together

python

# semantic_search/engine.py
from .loader import Document, DocumentLoader
from .chunker import TextChunker
from .store import VectorStore


class SemanticSearchEngine:
    """
    Complete semantic search engine: ingest -> chunk -> embed -> search.
    """

    def __init__(
        self,
        persist_path: str = "./search_index",
        collection_name: str = "semantic_search",
        chunk_size: int = 500,
        chunk_overlap: int = 100,
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.loader = DocumentLoader()
        self.chunker = TextChunker(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        self.store = VectorStore(
            persist_path=persist_path,
            collection_name=collection_name,
            model_name=model_name,
        )

    def ingest_texts(self, texts: list[str], source: str = "inline") -> dict:
        docs = DocumentLoader.from_texts(texts, source=source)
        return self._ingest(docs)

    def ingest_file(self, path: str) -> dict:
        docs = DocumentLoader.from_file(path)
        return self._ingest(docs)

    def ingest_directory(self, directory: str, extension: str = ".txt") -> dict:
        docs = DocumentLoader.from_directory(directory, extension=extension)
        return self._ingest(docs)

    def _ingest(self, documents: list[Document]) -> dict:
        if not documents:
            return {"documents": 0, "chunks": 0}
        chunks = self.chunker.chunk_all(documents)
        added = self.store.add_chunks(chunks)
        return {"documents": len(documents), "chunks": added}

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None,
        min_similarity: float = 0.3,
    ) -> list[dict]:
        return self.store.search(
            query=query,
            n_results=n_results,
            filters=filters,
            min_similarity=min_similarity,
        )

    def stats(self) -> dict:
        return {"total_chunks": self.store.count()}

Step 5: Run It - Full Working Demo

python

# demo.py
from semantic_search.engine import SemanticSearchEngine

engine = SemanticSearchEngine(persist_path="./search_index")

# -- Ingest documents ----------------------------------------------
result = engine.ingest_texts(
    texts=[
        """Password Reset Guide
        If you are locked out of your account, go to the login page and click
        'Forgot Password'. Enter your email address and you will receive a reset
        link within 5 minutes. The link expires after 24 hours. If you do not
        receive the email, check your spam folder or contact support.""",

        """Device Boot Failure After Update
        Some users have reported that their device fails to boot following a
        firmware update. To resolve this: hold the power button for 10 seconds
        to force shutdown, then restart normally. If the device still fails to
        boot, enter recovery mode by holding Power + Volume Down for 15 seconds.""",

        """Subscription Cancellation Policy
        You may cancel your subscription at any time from the Account Settings
        page. Cancellation takes effect at the end of the current billing period.
        You will not be charged for the next period. Refunds are not available
        for partial billing periods.""",

        """Invoice and Billing History
        Your full billing history is available under Account > Billing > Invoices.
        Each invoice can be downloaded as a PDF. For team and enterprise accounts,
        invoices are automatically emailed to the billing contact on record.""",

        """Battery Life Optimisation
        If your battery drains faster than expected after a software update,
        try the following: disable background app refresh, reduce screen
        brightness, and toggle aeroplane mode briefly to reset radio connections.
        A calibration cycle - full discharge followed by a full charge - often
        resolves post-update battery drain.""",
    ],
    source="support_kb"
)
print(f"Ingested: {result}")

# -- Search --------------------------------------------------------
queries = [
    "my laptop won't start after the update",
    "how do I get my password back",
    "will I get charged if I stop subscribing",
    "battery stops working quickly",
]

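for query in queries:
    print(f"\nQuery: '{query}'")
    results = engine.search(query, n_results=2, min_similarity=0.3)
    for r in results:
        # Collapse newlines and indentation so each preview fits on one line
        preview = " ".join(r["text"].split())[:90]
        print(f"  [{r['similarity']:.4f}] {preview}...")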

Expected output (similarity scores are illustrative; exact values depend on the model and library versions):

text

Ingested: {'documents': 5, 'chunks': 5}

Query: 'my laptop won't start after the update'
  [0.8123] Device Boot Failure After Update Some users have reported that their device fails...
  [0.5891] Battery Life Optimisation If your battery drains faster than expected after a so...

Query: 'how do I get my password back'
  [0.8741] Password Reset Guide If you are locked out of your account, go to the login page...
  [0.4102] Subscription Cancellation Policy You may cancel your subscription at any time fr...

Query: 'will I get charged if I stop subscribing'
  [0.8934] Subscription Cancellation Policy You may cancel your subscription at any time fr...
  [0.4211] Invoice and Billing History Your full billing history is available under Account...

Query: 'battery stops working quickly'
  [0.8567] Battery Life Optimisation If your battery drains faster than expected after a so...
  [0.5213] Device Boot Failure After Update Some users have reported that their device fail...

Results are ranked by meaning, not keyword overlap. The query "will I get charged if I stop subscribing" never mentions "cancel", "cancellation", or "billing period", and the only words it shares with the matching article are "will" and "charged". Semantic similarity does the rest.

Step 6: Chunking Long Documents

To see chunking in action, try a longer document:

python

long_article = """
The Python programming language was created by Guido van Rossum and first
released in 1991. Python's design philosophy emphasises code readability,
and its syntax allows programmers to express concepts in fewer lines of code
than would be possible in languages such as C++ or Java.

Python supports multiple programming paradigms, including structured,
object-oriented, and functional programming. It features a dynamic type
system and automatic memory management.

Python is widely used in web development, data science, artificial intelligence,
scientific computing, and automation. It is consistently ranked as one of the
most popular programming languages in the world.

The Python Package Index (PyPI) hosts thousands of third-party modules for
Python. Both the Python standard library and the community-contributed modules
allow for endless possibilities for developers.
"""

engine.ingest_texts([long_article], source="python_intro")
results = engine.search("who made Python and when", n_results=2)
for r in results:
    print(f"[{r['similarity']:.3f}] [{r['source']}] (chunk {r['chunk_index']}) {r['text'][:100]}...")

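The chunker splits this roughly 880-character document into three overlapping chunks: with the default 500-character size and 100-character overlap, chunks start at offsets 0, 400 and 800. All three are indexed. The query "who made Python and when" should retrieve the first chunk, which contains the creation details, with the highest similarity.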

Step 7: Re-indexing and Document Updates

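The engine handles re-ingestion cleanly because chunk IDs are deterministic: each ID is a hash of the source and the chunk's position, so re-ingesting under the same source overwrites existing chunks instead of duplicating them. To publish a revision under a new source label, ingest it and then delete the old source: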

python

# Publish an updated document under a new source label
engine.ingest_texts(
    texts=[
        """Password Reset Guide (Updated April 2026)
        To reset your password: visit login page, click 'Forgot Password',
        enter your registered email. Reset link valid for 48 hours (extended from 24).
        SMS verification required for accounts with 2FA enabled."""
    ],
    source="support_kb_v2"
)

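# Chunks under the old source still exist - delete them explicitly if needed.
# Note: this removes every chunk ingested under "support_kb", not only the password guide.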
engine.store.delete_source("support_kb")

In a production ingestion pipeline, track a document's source identifier and call delete_source() before re-ingesting when the source content changes.
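The delete-then-ingest pattern described above can be wrapped in a small helper. This is a sketch under assumptions: `reingest_texts` is a hypothetical function, and `engine` is expected to behave like the `SemanticSearchEngine` from Step 4 (exposing `store.delete_source()` and `ingest_texts()`).

```python
def reingest_texts(engine, texts: list[str], source: str) -> dict:
    """Replace everything stored under `source` with freshly chunked `texts`.

    `engine` is assumed to expose store.delete_source() and ingest_texts(),
    as the SemanticSearchEngine in this project does.
    """
    # Remove old chunks first so a shorter new version leaves no stale tail
    engine.store.delete_source(source)
    return engine.ingest_texts(texts, source=source)
```

Deleting first matters when the new version produces fewer chunks than the old one: an upsert alone would overwrite the leading chunks and leave the old trailing chunks searchable.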

Adding a Simple REST API with FastAPI

To expose your search engine as an HTTP endpoint:

bash

pip install fastapi uvicorn

python

# api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from semantic_search.engine import SemanticSearchEngine

app = FastAPI(title="Semantic Search API")
engine = SemanticSearchEngine(persist_path="./search_index")


class SearchRequest(BaseModel):
    query: str
    n_results: int = 5
    min_similarity: float = 0.3


class IngestRequest(BaseModel):
    texts: list[str]
    source: str = "api"


@app.post("/search")
def search(request: SearchRequest):
    if not request.query.strip():
        raise HTTPException(status_code=400, detail="Query cannot be empty")
    results = engine.search(
        request.query,
        n_results=request.n_results,
        min_similarity=request.min_similarity,
    )
    return {"query": request.query, "results": results}


@app.post("/ingest")
def ingest(request: IngestRequest):
    if not request.texts:
        raise HTTPException(status_code=400, detail="texts list cannot be empty")
    result = engine.ingest_texts(request.texts, source=request.source)
    return result


@app.get("/stats")
def stats():
    return engine.stats()

bash

uvicorn api:app --reload

Test it:

bash

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "reset my password", "n_results": 3}'

Project File Structure

text

semantic_search/
+-- __init__.py
+-- loader.py       ← Document loading
+-- chunker.py      ← Text splitting
+-- store.py        ← ChromaDB wrapper
+-- engine.py       ← Main orchestrator

demo.py             ← Console demo
api.py              ← FastAPI REST endpoint
search_index/       ← ChromaDB persistent storage (auto-created)

Key Takeaways

Semantic search finds documents by meaning, not keywords - queries and documents do not need to share words
Chunking is essential for long documents - split with overlap to avoid losing context at boundaries
Deterministic IDs (a hash of source and chunk position) make re-ingestion idempotent: the same chunk is overwritten rather than duplicated
Setting a min_similarity threshold (0.3-0.4) filters out irrelevant low-confidence results
The same engine can be extended to support PDF ingestion (add pypdf, the maintained successor to PyPDF2), web scraping (add BeautifulSoup), or multiple languages (swap to a multilingual sentence-transformers model)
Wrapping the engine with FastAPI gives you a working semantic search microservice in roughly 40 lines
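The min_similarity takeaway rests on one conversion: with cosine distance, similarity is 1 minus the distance. The sketch below isolates that step. `filter_by_similarity` is a hypothetical helper that mirrors the filtering inside `VectorStore.search`, and the distances are made-up values.

```python
def filter_by_similarity(
    distances: list[float], texts: list[str], min_similarity: float = 0.3
) -> list[tuple[float, str]]:
    """Convert cosine distances to similarities and drop weak matches."""
    scored = [(round(1 - d, 4), t) for d, t in zip(distances, texts)]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    # Highest similarity first
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Made-up distances, for illustration only
print(filter_by_similarity([0.15, 0.62, 0.81], ["a", "b", "c"]))
# [(0.85, 'a'), (0.38, 'b')]
```

A threshold that works for one model or corpus may be too strict or too loose for another, so check it against a handful of real queries before fixing a value.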

What's Next in the Vector Database Series

Next post: Vector Database Optimisation for Production - Chunking Strategies, Index Tuning, and Scaling

This post is part of the Vector Database Series. Previous post: ChromaDB vs Pinecone vs pgvector: Which Vector Database Should You Use?.

To add a Claude-powered Q&A layer on top of this search engine, see Claude RAG: Retrieval Augmented Generation. For a comparison of vector database options for scaling this project, see What Is a Vector Database? and ChromaDB Tutorial for Beginners.

External Resources

Sentence-Transformers documentation - the official site for the embedding models used in this project, with model comparison and benchmarks.
ChromaDB documentation - the official guide for the vector store used for persistence and similarity search.
all-MiniLM-L6-v2 on Hugging Face - the specific embedding model used in this project; fast, accurate, and free for commercial use.