
Build a Semantic Search Engine from Scratch with Python (2026)

TopicTrick

Keyword search has a fundamental flaw: it matches words, not meaning. Your users do not search for keywords — they describe what they need. A user who types "how to cancel my account" is looking for the same article as one who types "steps to close my subscription." Keyword search misses one of those. Semantic search matches both.

In this project you will build a complete semantic search engine from scratch. By the end you will have a working system that can ingest documents from text files, chunk and embed them, store embeddings in ChromaDB, and expose a clean Python search interface that returns results ranked by meaning — not by keyword overlap.

This is a standalone project. If you want to understand the underlying theory before diving in, read What is a Vector Database? and ChromaDB Tutorial first.


What You Will Build

A five-component semantic search system:

  1. Document loader: reads text files or plain strings into a standard format
  2. Text chunker: splits long documents into overlapping chunks suitable for embedding
  3. Embedding pipeline: converts chunks to vector embeddings using a local model
  4. Vector store: persists embeddings in ChromaDB with metadata
  5. Search interface: accepts natural-language queries and returns ranked results

Prerequisites

bash
pip install chromadb sentence-transformers

Python 3.10 or later. No API keys required — everything runs locally.


Step 1: Document Loader

Start with a clean data model. Every document in the system has a source, content, and metadata.

python
# semantic_search/loader.py
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    content: str
    source: str
    metadata: dict = field(default_factory=dict)


class DocumentLoader:
    """Load documents from text strings or .txt files."""

    @staticmethod
    def from_texts(texts: list[str], source: str = "inline") -> list[Document]:
        return [
            Document(content=text.strip(), source=source, metadata={"index": i})
            for i, text in enumerate(texts)
            if text.strip()
        ]

    @staticmethod
    def from_file(path: str | Path) -> list[Document]:
        path = Path(path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {path}")
        content = path.read_text(encoding="utf-8")
        return [Document(content=content, source=str(path))]

    @staticmethod
    def from_directory(directory: str | Path, extension: str = ".txt") -> list[Document]:
        directory = Path(directory)
        documents = []
        for file_path in sorted(directory.glob(f"**/*{extension}")):
            content = file_path.read_text(encoding="utf-8").strip()
            if content:
                documents.append(Document(
                    content=content,
                    source=str(file_path),
                    metadata={"filename": file_path.name, "stem": file_path.stem}
                ))
        return documents

Step 2: Text Chunker

Long documents can exceed embedding model context limits (typically 256–512 tokens for most sentence-transformers). Chunking splits documents into smaller, overlapping segments. The overlap ensures that sentences split across a chunk boundary are represented in at least one complete chunk.
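The mechanics are easy to see on a toy string before writing the real chunker. A standalone sketch (`sliding_chunks` is an illustrative helper, not part of the project code):

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size windows that advance by (size - overlap) characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "abcdefghij" * 3  # 30 characters
chunks = sliding_chunks(text, size=12, overlap=4)
# Four windows; each consecutive pair shares its 4-character overlap:
print(len(chunks))                      # 4
print(chunks[0][-4:] == chunks[1][:4])  # True
```

Any substring shorter than the overlap is guaranteed to appear unbroken in at least one window, which is exactly what protects sentences that straddle a chunk boundary.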

python
# semantic_search/chunker.py
from dataclasses import dataclass, field

from .loader import Document


@dataclass
class Chunk:
    text: str
    source: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)


class TextChunker:
    """
    Split documents into overlapping fixed-size character chunks.

    chunk_size: target chunk size in characters (~500 chars ≈ 100–150 tokens)
    chunk_overlap: overlap between consecutive chunks in characters
    """

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 100):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, document: Document) -> list[Chunk]:
        text = document.content
        chunks = []
        start = 0
        step = self.chunk_size - self.chunk_overlap

        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append(Chunk(
                    text=chunk_text,
                    source=document.source,
                    chunk_index=len(chunks),
                    total_chunks=0,  # filled in below
                    metadata={**document.metadata}
                ))
            start += step

        # Set total_chunks now that we know the count
        for chunk in chunks:
            chunk.total_chunks = len(chunks)

        return chunks

    def chunk_all(self, documents: list[Document]) -> list[Chunk]:
        all_chunks = []
        for doc in documents:
            all_chunks.extend(self.chunk(doc))
        return all_chunks

Choosing Chunk Size

For support articles or documentation (dense, factual content): 400–600 characters. For narrative text or long-form articles: 800–1000 characters. The overlap (100–150 characters) ensures context at boundaries is not lost. Smaller chunks improve retrieval precision; larger chunks give the LLM more context per result.
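The sliding window advances by chunk_size − chunk_overlap characters per step, so a document of length L produces roughly ceil(L / step) chunks. A quick standalone sketch of that arithmetic (`chunk_count` is an illustrative helper, not part of the project code):

```python
import math

def chunk_count(doc_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """How many chunks a sliding window with this size/overlap produces."""
    step = chunk_size - chunk_overlap
    return math.ceil(doc_len / step)

# A 2,000-character support article under the two presets discussed above:
print(chunk_count(2000, 500, 100))    # 5 chunks (step = 400)
print(chunk_count(2000, 1000, 150))   # 3 chunks (step = 850)
```

Halving the chunk size roughly doubles the number of vectors you store and search, which is the storage cost of the extra retrieval precision.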


Step 3: Vector Store Wrapper

Wrap ChromaDB with a clean interface that knows nothing about the chunker or loader:

python
# semantic_search/store.py
import hashlib

import chromadb
from chromadb.utils import embedding_functions

from .chunker import Chunk


class VectorStore:
    """
    ChromaDB-backed vector store for document chunks.
    Handles ID generation, upserts, and structured queries.
    """

    def __init__(
        self,
        persist_path: str = "./search_index",
        collection_name: str = "semantic_search",
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.client = chromadb.PersistentClient(path=persist_path)
        self.ef = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=model_name
        )
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=self.ef,
            metadata={"hnsw:space": "cosine"},
        )

    @staticmethod
    def _make_chunk_id(chunk: Chunk) -> str:
        """Stable ID: hash of (source + chunk_index)."""
        key = f"{chunk.source}::{chunk.chunk_index}"
        return hashlib.sha256(key.encode()).hexdigest()[:20]

    def add_chunks(self, chunks: list[Chunk]) -> int:
        """Add or update chunks in the vector store. Returns number added."""
        if not chunks:
            return 0

        ids = [self._make_chunk_id(c) for c in chunks]
        texts = [c.text for c in chunks]
        metadatas = [
            {
                "source": c.source,
                "chunk_index": c.chunk_index,
                "total_chunks": c.total_chunks,
                **c.metadata,
            }
            for c in chunks
        ]

        self.collection.upsert(ids=ids, documents=texts, metadatas=metadatas)
        return len(chunks)

    def count(self) -> int:
        return self.collection.count()

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None,
        min_similarity: float = 0.0,
    ) -> list[dict]:
        """
        Run semantic search.
        Returns results as dicts with keys: text, similarity, source, metadata.
        Optionally filter by minimum similarity threshold.
        """
        actual_n = min(n_results * 2, self.collection.count())
        if actual_n == 0:
            return []

        kwargs = {
            "query_texts": [query],
            "n_results": actual_n,
            "include": ["documents", "distances", "metadatas"],
        }
        if filters:
            kwargs["where"] = filters

        raw = self.collection.query(**kwargs)

        results = []
        for text, dist, meta in zip(
            raw["documents"][0],
            raw["distances"][0],
            raw["metadatas"][0],
        ):
            similarity = round(1 - dist, 4)
            if similarity >= min_similarity:
                results.append({
                    "text": text,
                    "similarity": similarity,
                    "source": meta.get("source", ""),
                    "chunk_index": meta.get("chunk_index", 0),
                    "metadata": meta,
                })

        # Sort by similarity and trim to requested n
        results.sort(key=lambda r: r["similarity"], reverse=True)
        return results[:n_results]

    def delete_source(self, source: str) -> None:
        """Remove all chunks from a specific source document."""
        self.collection.delete(where={"source": source})

Step 4: Search Engine — Putting It All Together

python
# semantic_search/engine.py
from .loader import Document, DocumentLoader
from .chunker import TextChunker
from .store import VectorStore


class SemanticSearchEngine:
    """
    Complete semantic search engine: ingest → chunk → embed → search.
    """

    def __init__(
        self,
        persist_path: str = "./search_index",
        collection_name: str = "semantic_search",
        chunk_size: int = 500,
        chunk_overlap: int = 100,
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.loader = DocumentLoader()
        self.chunker = TextChunker(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        self.store = VectorStore(
            persist_path=persist_path,
            collection_name=collection_name,
            model_name=model_name,
        )

    def ingest_texts(self, texts: list[str], source: str = "inline") -> dict:
        docs = DocumentLoader.from_texts(texts, source=source)
        return self._ingest(docs)

    def ingest_file(self, path: str) -> dict:
        docs = DocumentLoader.from_file(path)
        return self._ingest(docs)

    def ingest_directory(self, directory: str, extension: str = ".txt") -> dict:
        docs = DocumentLoader.from_directory(directory, extension=extension)
        return self._ingest(docs)

    def _ingest(self, documents: list[Document]) -> dict:
        if not documents:
            return {"documents": 0, "chunks": 0}
        chunks = self.chunker.chunk_all(documents)
        added = self.store.add_chunks(chunks)
        return {"documents": len(documents), "chunks": added}

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None,
        min_similarity: float = 0.3,
    ) -> list[dict]:
        return self.store.search(
            query=query,
            n_results=n_results,
            filters=filters,
            min_similarity=min_similarity,
        )

    def stats(self) -> dict:
        return {"total_chunks": self.store.count()}

Step 5: Run It — Full Working Demo

python
# demo.py
from semantic_search.engine import SemanticSearchEngine

engine = SemanticSearchEngine(persist_path="./search_index")

# ── Ingest documents ──────────────────────────────────────────────
result = engine.ingest_texts(
    texts=[
        """Password Reset Guide
        If you are locked out of your account, go to the login page and click
        'Forgot Password'. Enter your email address and you will receive a reset
        link within 5 minutes. The link expires after 24 hours. If you do not
        receive the email, check your spam folder or contact support.""",

        """Device Boot Failure After Update
        Some users have reported that their device fails to boot following a
        firmware update. To resolve this: hold the power button for 10 seconds
        to force shutdown, then restart normally. If the device still fails to
        boot, enter recovery mode by holding Power + Volume Down for 15 seconds.""",

        """Subscription Cancellation Policy
        You may cancel your subscription at any time from the Account Settings
        page. Cancellation takes effect at the end of the current billing period.
        You will not be charged for the next period. Refunds are not available
        for partial billing periods.""",

        """Invoice and Billing History
        Your full billing history is available under Account > Billing > Invoices.
        Each invoice can be downloaded as a PDF. For team and enterprise accounts,
        invoices are automatically emailed to the billing contact on record.""",

        """Battery Life Optimisation
        If your battery drains faster than expected after a software update,
        try the following: disable background app refresh, reduce screen
        brightness, and toggle aeroplane mode briefly to reset radio connections.
        A calibration cycle — full discharge followed by a full charge — often
        resolves post-update battery drain.""",
    ],
    source="support_kb"
)
print(f"Ingested: {result}")

# ── Search ────────────────────────────────────────────────────────
queries = [
    "my laptop won't start after the update",
    "how do I get my password back",
    "will I get charged if I stop subscribing",
    "battery stops working quickly",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    results = engine.search(query, n_results=2, min_similarity=0.3)
    for r in results:
        print(f"  [{r['similarity']:.3f}] {r['text'][:90]}...")

Expected output (exact scores vary slightly with the model version):

Ingested: {'documents': 5, 'chunks': 5}

Query: 'my laptop won't start after the update'
  [0.812] Device Boot Failure After Update Some users have reported that their device fails...
  [0.589] Battery Life Optimisation If your battery drains faster than expected after a so...

Query: 'how do I get my password back'
  [0.874] Password Reset Guide If you are locked out of your account, go to the login page...
  [0.410] Subscription Cancellation Policy You may cancel your subscription at any time fr...

Query: 'will I get charged if I stop subscribing'
  [0.893] Subscription Cancellation Policy You may cancel your subscription at any time fr...
  [0.421] Invoice and Billing History Your full billing history is available under Account...

Query: 'battery stops working quickly'
  [0.857] Battery Life Optimisation If your battery drains faster than expected after a so...
  [0.521] Device Boot Failure After Update Some users have reported that their device fail...

Every result is found through meaning, not keywords. The query "will I get charged if I stop subscribing" shares almost none of its wording with the matching document — "stop subscribing" appears nowhere in it — yet semantic similarity ranks it first.
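You can check the vocabulary gap yourself with a naive keyword matcher. This is a rough lowercase word split with an ad-hoc stop-word list, far cruder than anything the embedding model does, but it approximates what pure keyword search would see:

```python
def keyword_overlap(query: str, document: str) -> set[str]:
    """Shared non-trivial words — roughly what a keyword matcher could hit."""
    stop = {"i", "if", "the", "my", "a", "at", "you", "will", "be", "get", "your"}
    q = set(query.lower().split()) - stop
    d = set(document.lower().replace(".", " ").replace(",", " ").split()) - stop
    return q & d

doc = ("You may cancel your subscription at any time from the Account Settings "
       "page. Cancellation takes effect at the end of the current billing period.")
print(keyword_overlap("will I get charged if I stop subscribing", doc))  # set()
```

Even "subscribing" fails to match "subscription" without stemming, which is exactly the gap the embedding model closes.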


Step 6: Chunking Long Documents

To see chunking in action, try a longer document:

python
long_article = """
The Python programming language was created by Guido van Rossum and first
released in 1991. Python's design philosophy emphasises code readability,
and its syntax allows programmers to express concepts in fewer lines of code
than would be possible in languages such as C++ or Java.

Python supports multiple programming paradigms, including structured,
object-oriented, and functional programming. It features a dynamic type
system and automatic memory management.

Python is widely used in web development, data science, artificial intelligence,
scientific computing, and automation. It is consistently ranked as one of the
most popular programming languages in the world.

The Python Package Index (PyPI) hosts thousands of third-party modules for
Python. Both the Python standard library and the community-contributed modules
allow for endless possibilities for developers.
"""

engine.ingest_texts([long_article], source="python_intro")
results = engine.search("who made Python and when", n_results=2)
for r in results:
    print(f"[{r['similarity']:.3f}] [{r['source']}] (chunk {r['chunk_index']}) {r['text'][:100]}...")

The chunker splits this roughly 900-character document into multiple overlapping chunks, all of which are indexed. The query "who made Python and when" retrieves the chunk containing the creation details with high similarity.


Step 7: Re-indexing and Document Updates

Chunk IDs are derived from the source name and chunk index, so re-ingesting under the same source upserts over the existing chunks. Content ingested under a new source name gets fresh IDs, and the old chunks stay in the index until you delete them:

python
# Ingest the updated article under a new source name
engine.ingest_texts(
    texts=[
        """Password Reset Guide (Updated April 2026)
        To reset your password: visit login page, click 'Forgot Password',
        enter your registered email. Reset link valid for 48 hours (extended from 24).
        SMS verification required for accounts with 2FA enabled."""
    ],
    source="support_kb_v2"
)

# The old source still exists — delete it explicitly
engine.store.delete_source("support_kb")

In a production ingestion pipeline, track a document's source identifier and call delete_source() before re-ingesting when the source content changes.
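The deletion step matters because the IDs track the source name. A standalone sketch of the same hashing scheme as VectorStore._make_chunk_id shows why a renamed source leaves orphaned chunks behind:

```python
import hashlib

def make_chunk_id(source: str, chunk_index: int) -> str:
    """Same scheme as VectorStore._make_chunk_id: sha256 of 'source::index'."""
    key = f"{source}::{chunk_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:20]

# Same source + index → same ID, so upsert overwrites in place:
print(make_chunk_id("support_kb", 0) == make_chunk_id("support_kb", 0))     # True
# A new source name → different IDs, so the old chunks survive until deleted:
print(make_chunk_id("support_kb", 0) == make_chunk_id("support_kb_v2", 0))  # False
```

Deterministic IDs are what make re-ingestion idempotent: running the same ingest twice cannot duplicate a chunk.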


Adding a Simple REST API with FastAPI

To expose your search engine as an HTTP endpoint:

bash
pip install fastapi uvicorn

python
# api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from semantic_search.engine import SemanticSearchEngine

app = FastAPI(title="Semantic Search API")
engine = SemanticSearchEngine(persist_path="./search_index")


class SearchRequest(BaseModel):
    query: str
    n_results: int = 5
    min_similarity: float = 0.3


class IngestRequest(BaseModel):
    texts: list[str]
    source: str = "api"


@app.post("/search")
def search(request: SearchRequest):
    if not request.query.strip():
        raise HTTPException(status_code=400, detail="Query cannot be empty")
    results = engine.search(
        request.query,
        n_results=request.n_results,
        min_similarity=request.min_similarity,
    )
    return {"query": request.query, "results": results}


@app.post("/ingest")
def ingest(request: IngestRequest):
    if not request.texts:
        raise HTTPException(status_code=400, detail="texts list cannot be empty")
    result = engine.ingest_texts(request.texts, source=request.source)
    return result


@app.get("/stats")
def stats():
    return engine.stats()

bash
uvicorn api:app --reload

Test it:

bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "reset my password", "n_results": 3}'

Project File Structure

semantic_search/
├── __init__.py
├── loader.py        ← Document loading
├── chunker.py       ← Text splitting
├── store.py         ← ChromaDB wrapper
└── engine.py        ← Main orchestrator
demo.py              ← Console demo
api.py               ← FastAPI REST endpoint
search_index/        ← ChromaDB persistent storage (auto-created)

Key Takeaways

• Semantic search finds documents by meaning, not keywords — queries and documents do not need to share words
• Chunking is essential for long documents — split with overlap to avoid losing context at boundaries
• Stable IDs (hashed source + chunk index) make your ingestion pipeline safely idempotent
• Setting a min_similarity threshold (0.3–0.4) filters out irrelevant low-confidence results
• The same engine can be extended to support PDF ingestion (add pypdf), web scraping (add BeautifulSoup), or multiple languages (swap to a multilingual sentence-transformers model)
• Wrapping the engine with FastAPI creates a production-ready semantic search microservice in a few dozen lines

What's Next in the Vector Database Series

This post is part of the Vector Database Series. Previous post: ChromaDB vs Pinecone vs pgvector: Which Vector Database Should You Use?.