Build a Semantic Search Engine from Scratch with Python (2026)

Keyword search has a fundamental flaw: it matches words, not meaning. Your users do not search for keywords — they describe what they need. A user who types "how to cancel my account" is looking for the same article as one who types "steps to close my subscription." Keyword search misses one of those. Semantic search matches both.
In this project you will build a complete semantic search engine from scratch. By the end you will have a working system that can ingest documents from text files, chunk and embed them, store embeddings in ChromaDB, and expose a clean Python search interface that returns results ranked by meaning — not by keyword overlap.
This is a standalone project. If you want to understand the underlying theory before diving in, read What is a Vector Database? and ChromaDB Tutorial first.
What You Will Build
A five-component semantic search system:
- Document loader: reads text files or plain strings into a standard format
- Text chunker: splits long documents into overlapping chunks suitable for embedding
- Embedding pipeline: converts chunks to vector embeddings using a local model
- Vector store: persists embeddings in ChromaDB with metadata
- Search interface: accepts natural-language queries and returns ranked results
Prerequisites
pip install chromadb sentence-transformers

Python 3.10 or later. No API keys required — everything runs locally.
Step 1: Document Loader
Start with a clean data model. Every document in the system has a source, content, and metadata.
# semantic_search/loader.py
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    content: str
    source: str
    metadata: dict = field(default_factory=dict)


class DocumentLoader:
    """Load documents from text strings or .txt files."""

    @staticmethod
    def from_texts(texts: list[str], source: str = "inline") -> list[Document]:
        return [
            Document(content=text.strip(), source=source, metadata={"index": i})
            for i, text in enumerate(texts)
            if text.strip()
        ]

    @staticmethod
    def from_file(path: str | Path) -> list[Document]:
        path = Path(path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {path}")
        content = path.read_text(encoding="utf-8")
        return [Document(content=content, source=str(path))]

    @staticmethod
    def from_directory(directory: str | Path, extension: str = ".txt") -> list[Document]:
        directory = Path(directory)
        documents = []
        for file_path in sorted(directory.glob(f"**/*{extension}")):
            content = file_path.read_text(encoding="utf-8").strip()
            if content:
                documents.append(Document(
                    content=content,
                    source=str(file_path),
                    metadata={"filename": file_path.name, "stem": file_path.stem}
                ))
        return documents

Step 2: Text Chunker
Long documents can exceed embedding model context limits (typically 256–512 tokens for most sentence-transformers). Chunking splits documents into smaller, overlapping segments. The overlap ensures that sentences split across a chunk boundary are represented in at least one complete chunk.
# semantic_search/chunker.py
from dataclasses import dataclass, field

from .loader import Document


@dataclass
class Chunk:
    text: str
    source: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)


class TextChunker:
    """
    Split documents into overlapping fixed-size character chunks.

    chunk_size: target chunk size in characters (~500 chars ≈ 100–150 tokens)
    chunk_overlap: overlap between consecutive chunks in characters
    """

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 100):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, document: Document) -> list[Chunk]:
        text = document.content
        chunks = []
        start = 0
        step = self.chunk_size - self.chunk_overlap

        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append(Chunk(
                    text=chunk_text,
                    source=document.source,
                    chunk_index=len(chunks),
                    total_chunks=0,  # filled in below
                    metadata={**document.metadata}
                ))
            start += step

        # Set total_chunks now that we know the count
        for chunk in chunks:
            chunk.total_chunks = len(chunks)

        return chunks

    def chunk_all(self, documents: list[Document]) -> list[Chunk]:
        all_chunks = []
        for doc in documents:
            all_chunks.extend(self.chunk(doc))
        return all_chunks

Choosing Chunk Size
For support articles or documentation (dense, factual content): 400–600 characters. For narrative text or long-form articles: 800–1000 characters. The overlap (100–150 characters) ensures context at boundaries is not lost. Smaller chunks improve retrieval precision; larger chunks give the LLM more context per result.
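To get a feel for how these settings interact, here is a minimal standalone sketch of the chunk-count arithmetic. It re-implements only the chunker's stepping logic (`step = chunk_size - chunk_overlap`); the 3,000-character article length is an illustrative assumption, not a number from the project.

```python
# Standalone sketch: how chunk_size and chunk_overlap trade off chunk count.
# Mirrors TextChunker's stepping logic: the window advances by size - overlap.
def count_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    step = chunk_size - chunk_overlap
    count, start = 0, 0
    while start < text_len:
        count += 1
        start += step
    return count

# A hypothetical 3,000-character article under three configurations:
for size, overlap in [(400, 100), (600, 100), (1000, 150)]:
    print(f"size={size} overlap={overlap} -> {count_chunks(3000, size, overlap)} chunks")
```

Smaller chunks mean more embeddings to store and search, but each one is more focused; the overlap adds a modest amount of duplication in exchange for boundary safety.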
Step 3: Vector Store Wrapper
Wrap ChromaDB with a clean interface that knows nothing about the chunker or loader:
# semantic_search/store.py
import hashlib

import chromadb
from chromadb.utils import embedding_functions

from .chunker import Chunk


class VectorStore:
    """
    ChromaDB-backed vector store for document chunks.
    Handles ID generation, upserts, and structured queries.
    """

    def __init__(
        self,
        persist_path: str = "./search_index",
        collection_name: str = "semantic_search",
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.client = chromadb.PersistentClient(path=persist_path)
        self.ef = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=model_name
        )
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=self.ef,
            metadata={"hnsw:space": "cosine"},
        )

    @staticmethod
    def _make_chunk_id(chunk: Chunk) -> str:
        """Stable ID: hash of (source + chunk_index)."""
        key = f"{chunk.source}::{chunk.chunk_index}"
        return hashlib.sha256(key.encode()).hexdigest()[:20]

    def add_chunks(self, chunks: list[Chunk]) -> int:
        """Add or update chunks in the vector store. Returns number added."""
        if not chunks:
            return 0

        ids = [self._make_chunk_id(c) for c in chunks]
        texts = [c.text for c in chunks]
        metadatas = [
            {
                "source": c.source,
                "chunk_index": c.chunk_index,
                "total_chunks": c.total_chunks,
                **c.metadata,
            }
            for c in chunks
        ]

        self.collection.upsert(ids=ids, documents=texts, metadatas=metadatas)
        return len(chunks)

    def count(self) -> int:
        return self.collection.count()

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None,
        min_similarity: float = 0.0,
    ) -> list[dict]:
        """
        Run semantic search.
        Returns results as dicts with keys: text, similarity, source, metadata.
        Optionally filter by minimum similarity threshold.
        """
        # Over-fetch so the similarity filter still leaves enough results
        actual_n = min(n_results * 2, self.collection.count())
        if actual_n == 0:
            return []

        kwargs = {
            "query_texts": [query],
            "n_results": actual_n,
            "include": ["documents", "distances", "metadatas"],
        }
        if filters:
            kwargs["where"] = filters

        raw = self.collection.query(**kwargs)

        results = []
        for text, dist, meta in zip(
            raw["documents"][0],
            raw["distances"][0],
            raw["metadatas"][0],
        ):
            # ChromaDB returns cosine distance; convert to similarity
            similarity = round(1 - dist, 4)
            if similarity >= min_similarity:
                results.append({
                    "text": text,
                    "similarity": similarity,
                    "source": meta.get("source", ""),
                    "chunk_index": meta.get("chunk_index", 0),
                    "metadata": meta,
                })

        # Sort by similarity and trim to requested n
        results.sort(key=lambda r: r["similarity"], reverse=True)
        return results[:n_results]

    def delete_source(self, source: str) -> None:
        """Remove all chunks from a specific source document."""
        self.collection.delete(where={"source": source})

Step 4: Search Engine — Putting It All Together
# semantic_search/engine.py
from .loader import Document, DocumentLoader
from .chunker import TextChunker
from .store import VectorStore


class SemanticSearchEngine:
    """
    Complete semantic search engine: ingest → chunk → embed → search.
    """

    def __init__(
        self,
        persist_path: str = "./search_index",
        collection_name: str = "semantic_search",
        chunk_size: int = 500,
        chunk_overlap: int = 100,
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.loader = DocumentLoader()
        self.chunker = TextChunker(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        self.store = VectorStore(
            persist_path=persist_path,
            collection_name=collection_name,
            model_name=model_name,
        )

    def ingest_texts(self, texts: list[str], source: str = "inline") -> dict:
        docs = DocumentLoader.from_texts(texts, source=source)
        return self._ingest(docs)

    def ingest_file(self, path: str) -> dict:
        docs = DocumentLoader.from_file(path)
        return self._ingest(docs)

    def ingest_directory(self, directory: str, extension: str = ".txt") -> dict:
        docs = DocumentLoader.from_directory(directory, extension=extension)
        return self._ingest(docs)

    def _ingest(self, documents: list[Document]) -> dict:
        if not documents:
            return {"documents": 0, "chunks": 0}
        chunks = self.chunker.chunk_all(documents)
        added = self.store.add_chunks(chunks)
        return {"documents": len(documents), "chunks": added}

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None,
        min_similarity: float = 0.3,
    ) -> list[dict]:
        return self.store.search(
            query=query,
            n_results=n_results,
            filters=filters,
            min_similarity=min_similarity,
        )

    def stats(self) -> dict:
        return {"total_chunks": self.store.count()}

Step 5: Run It — Full Working Demo
# demo.py
from semantic_search.engine import SemanticSearchEngine

engine = SemanticSearchEngine(persist_path="./search_index")

# ── Ingest documents ──────────────────────────────────────────────
result = engine.ingest_texts(
    texts=[
        """Password Reset Guide
        If you are locked out of your account, go to the login page and click
        'Forgot Password'. Enter your email address and you will receive a reset
        link within 5 minutes. The link expires after 24 hours. If you do not
        receive the email, check your spam folder or contact support.""",

        """Device Boot Failure After Update
        Some users have reported that their device fails to boot following a
        firmware update. To resolve this: hold the power button for 10 seconds
        to force shutdown, then restart normally. If the device still fails to
        boot, enter recovery mode by holding Power + Volume Down for 15 seconds.""",

        """Subscription Cancellation Policy
        You may cancel your subscription at any time from the Account Settings
        page. Cancellation takes effect at the end of the current billing period.
        You will not be charged for the next period. Refunds are not available
        for partial billing periods.""",

        """Invoice and Billing History
        Your full billing history is available under Account > Billing > Invoices.
        Each invoice can be downloaded as a PDF. For team and enterprise accounts,
        invoices are automatically emailed to the billing contact on record.""",

        """Battery Life Optimisation
        If your battery drains faster than expected after a software update,
        try the following: disable background app refresh, reduce screen
        brightness, and toggle aeroplane mode briefly to reset radio connections.
        A calibration cycle — full discharge followed by a full charge — often
        resolves post-update battery drain.""",
    ],
    source="support_kb"
)
print(f"Ingested: {result}")

# ── Search ────────────────────────────────────────────────────────
queries = [
    "my laptop won't start after the update",
    "how do I get my password back",
    "will I get charged if I stop subscribing",
    "battery stops working quickly",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    results = engine.search(query, n_results=2, min_similarity=0.3)
    for r in results:
        print(f"  [{r['similarity']:.4f}] {r['text'][:90]}...")

Expected output:
Ingested: {'documents': 5, 'chunks': 5}
Query: 'my laptop won't start after the update'
[0.8123] Device Boot Failure After Update Some users have reported that their device fails...
[0.5891] Battery Life Optimisation If your battery drains faster than expected after a so...
Query: 'how do I get my password back'
[0.8741] Password Reset Guide If you are locked out of your account, go to the login page...
[0.4102] Subscription Cancellation Policy You may cancel your subscription at any time fr...
Query: 'will I get charged if I stop subscribing'
[0.8934] Subscription Cancellation Policy You may cancel your subscription at any time fr...
[0.4211] Invoice and Billing History Your full billing history is available under Account...
Query: 'battery stops working quickly'
[0.8567] Battery Life Optimisation If your battery drains faster than expected after a so...
[0.5213] Device Boot Failure After Update Some users have reported that their device fail...
Every result is found through meaning, not keyword overlap. The query "will I get charged if I stop subscribing" never uses the words "cancel", "cancellation", or "subscription" (to a keyword matcher, "subscribing" and "subscription" are different strings), yet semantic similarity surfaces the cancellation policy first.
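The scores in the output above are cosine similarities. With `hnsw:space` set to `"cosine"`, ChromaDB returns cosine *distance*, which the store converts back via `similarity = 1 - distance`. Here is the underlying arithmetic on two made-up 3-dimensional vectors (illustrative only; all-MiniLM-L6-v2 actually produces 384-dimensional embeddings):

```python
import math

# Cosine similarity: dot product divided by the product of vector norms.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.2, 0.8, 0.1]    # made-up "query" embedding
doc_vec = [0.25, 0.7, 0.05]    # made-up "document" embedding
similarity = cosine_similarity(query_vec, doc_vec)
print(round(similarity, 4), round(1 - similarity, 4))  # similarity, then distance
```

A score of 1.0 means identical direction, 0.0 means unrelated; the `min_similarity` threshold in the engine simply cuts off results below a chosen value on this scale.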
Step 6: Chunking Long Documents
To see chunking in action, try a longer document:
long_article = """
The Python programming language was created by Guido van Rossum and first
released in 1991. Python's design philosophy emphasises code readability,
and its syntax allows programmers to express concepts in fewer lines of code
than would be possible in languages such as C++ or Java.

Python supports multiple programming paradigms, including structured,
object-oriented, and functional programming. It features a dynamic type
system and automatic memory management.

Python is widely used in web development, data science, artificial intelligence,
scientific computing, and automation. It is consistently ranked as one of the
most popular programming languages in the world.

The Python Package Index (PyPI) hosts thousands of third-party modules for
Python. Both the Python standard library and the community-contributed modules
allow for endless possibilities for developers.
"""

engine.ingest_texts([long_article], source="python_intro")
results = engine.search("who made Python and when", n_results=2)
for r in results:
    print(f"[{r['similarity']:.4f}] [{r['source']}] (chunk {r['chunk_index']}) {r['text'][:100]}...")

The chunker splits this roughly 900-character document into three overlapping chunks, and all of them are indexed. The query "who made Python and when" retrieves the chunk containing the creation details with high similarity.
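The overlap is what keeps text near a chunk boundary intact. A minimal sketch with deliberately tiny numbers (the project uses 500/100): with `chunk_size=20` and `chunk_overlap=8`, the window advances by 12 characters, so the last 8 characters of each chunk reappear at the start of the next.

```python
# Standalone re-implementation of the chunker's windowing, small enough
# to inspect by eye. Each chunk repeats the tail of the previous one.
def char_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The quick brown fox jumps over the lazy dog"
for c in char_chunks(text, chunk_size=20, chunk_overlap=8):
    print(repr(c))
```

Notice that "jumps" never fits whole into the first 20-character window, but the overlap carries enough of the surrounding context into the second chunk that the phrase "fox jumps over" is embedded as one piece.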
Step 7: Re-indexing and Document Updates
The engine handles content updates cleanly because chunk IDs are stable hashes of the source name and chunk index, so re-ingesting a source upserts its chunks in place rather than duplicating them:
# Update a document — re-ingest with the same source
engine.ingest_texts(
    texts=[
        """Password Reset Guide (Updated April 2026)
        To reset your password: visit login page, click 'Forgot Password',
        enter your registered email. Reset link valid for 48 hours (extended from 24).
        SMS verification required for accounts with 2FA enabled."""
    ],
    source="support_kb_v2"
)

# The old source still exists — delete it explicitly if needed
engine.store.delete_source("support_kb")

In a production ingestion pipeline, track a document's source identifier and call delete_source() before re-ingesting when the source content changes.
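The reason delete-before-reingest matters is that a bare upsert only overwrites IDs that still exist in the new version: if the updated document produces fewer chunks, the leftover chunk IDs keep their stale text. A sketch of the pattern, where `FakeStore` and `refresh_source` are illustrative names and a plain dict stands in for ChromaDB, but the IDs are built the same way as the article's `_make_chunk_id`:

```python
import hashlib

# In-memory stand-in for VectorStore, keyed by the same source::index hash.
class FakeStore:
    def __init__(self):
        self.chunks: dict[str, tuple[str, str]] = {}  # id -> (source, text)

    def upsert(self, source: str, texts: list[str]) -> None:
        for i, text in enumerate(texts):
            chunk_id = hashlib.sha256(f"{source}::{i}".encode()).hexdigest()[:20]
            self.chunks[chunk_id] = (source, text)

    def delete_source(self, source: str) -> None:
        self.chunks = {k: v for k, v in self.chunks.items() if v[0] != source}

def refresh_source(store: FakeStore, source: str, texts: list[str]) -> None:
    store.delete_source(source)  # drop stale chunks first...
    store.upsert(source, texts)  # ...then index the new content

store = FakeStore()
store.upsert("kb", ["v1 chunk 0", "v1 chunk 1", "v1 chunk 2"])
# A bare upsert of the shorter new version would only overwrite chunk 0,
# leaving "v1 chunk 1" and "v1 chunk 2" behind; refresh_source avoids that.
refresh_source(store, "kb", ["v2 chunk 0"])
print(len(store.chunks))  # 1
```

The same `refresh_source` shape works against the real engine: call `engine.store.delete_source(source)` followed by `engine.ingest_texts(texts, source=source)`.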
Adding a Simple REST API with FastAPI
To expose your search engine as an HTTP endpoint:
pip install fastapi uvicorn

# api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from semantic_search.engine import SemanticSearchEngine

app = FastAPI(title="Semantic Search API")
engine = SemanticSearchEngine(persist_path="./search_index")


class SearchRequest(BaseModel):
    query: str
    n_results: int = 5
    min_similarity: float = 0.3


class IngestRequest(BaseModel):
    texts: list[str]
    source: str = "api"


@app.post("/search")
def search(request: SearchRequest):
    if not request.query.strip():
        raise HTTPException(status_code=400, detail="Query cannot be empty")
    results = engine.search(
        request.query,
        n_results=request.n_results,
        min_similarity=request.min_similarity,
    )
    return {"query": request.query, "results": results}


@app.post("/ingest")
def ingest(request: IngestRequest):
    if not request.texts:
        raise HTTPException(status_code=400, detail="texts list cannot be empty")
    result = engine.ingest_texts(request.texts, source=request.source)
    return result


@app.get("/stats")
def stats():
    return engine.stats()

Run it:

uvicorn api:app --reload

Test it:
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "reset my password", "n_results": 3}'

Project File Structure
semantic_search/
├── __init__.py
├── loader.py ← Document loading
├── chunker.py ← Text splitting
├── store.py ← ChromaDB wrapper
└── engine.py ← Main orchestrator
demo.py ← Console demo
api.py ← FastAPI REST endpoint
search_index/ ← ChromaDB persistent storage (auto-created)
Key Takeaways
- Semantic search finds documents by meaning, not keywords — queries and documents do not need to share words
- Chunking is essential for long documents — split with overlap to avoid losing context at boundaries
- Stable chunk IDs (a hash of source + chunk index) make your ingestion pipeline safely idempotent: re-ingesting the same source upserts rather than duplicates
- Setting a min_similarity threshold (0.3–0.4) filters out irrelevant low-confidence results
- The same engine can be extended to support PDF ingestion (add pypdf), web scraping (add BeautifulSoup), or multiple languages (swap to a multilingual sentence-transformers model)
- Wrapping the engine with FastAPI turns it into a semantic search microservice in roughly 40 lines of code
What's Next in the Vector Database Series
- Next post: Vector Database Optimisation for Production — Chunking Strategies, Index Tuning, and Scaling
This post is part of the Vector Database Series. Previous post: ChromaDB vs Pinecone vs pgvector: Which Vector Database Should You Use?.
