
Vector Database Optimisation for Production: Chunking, Indexing, and Scaling

TopicTrick

A vector database that works well in a demo often behaves completely differently in production. You added 1,000 chunks during development. In production you have 500,000 — and queries that took 20 ms now take 800 ms. Your RAG pipeline was returning great answers in testing. In production, users complain the answers feel generic. You think it is the LLM. It is actually the retrieval.

This post covers the techniques that separate a demo-grade vector search system from a production-grade one: advanced chunking strategies, embedding model selection, HNSW index tuning, hybrid search, query caching, monitoring, and scaling patterns.

This is the advanced post in the series. You should already be comfortable with the basics from the earlier posts: What is a Vector Database?, ChromaDB Tutorial, and Build a Semantic Search Engine from Scratch.


1. Chunking Strategy — The Most Impactful Decision

The single biggest factor in retrieval quality is not your vector database or your embedding model — it is how you chunk your documents. Poor chunking causes your system to retrieve irrelevant or incomplete context regardless of how good everything else is.

Fixed-Size Chunking (Baseline)

Split by character count with overlap. Simple, predictable, and good enough for many use cases.

python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

Problem: A 500-character boundary might land in the middle of a sentence or split a code example in half. The resulting chunk loses semantic coherence.

Sentence-Aware Chunking (Better)

Use NLTK or spaCy to split at sentence boundaries, then group sentences into target-sized chunks:

python
import nltk
nltk.download("punkt_tab", quiet=True)

def sentence_chunks(text: str, max_chars: int = 600, overlap_sentences: int = 1) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current = []
    current_len = 0

    for sentence in sentences:
        sentence_len = len(sentence)
        if current_len + sentence_len > max_chars and current:
            chunks.append(" ".join(current))
            # Overlap: keep last N sentences for next chunk
            current = current[-overlap_sentences:] if overlap_sentences else []
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += sentence_len

    if current:
        chunks.append(" ".join(current))

    return chunks

This respects sentence boundaries, dramatically improving chunk coherence.

Semantic Chunking (Best for Long Documents)

Split based on topic shifts — when the semantic similarity between consecutive sentences drops below a threshold, start a new chunk. This keeps topically coherent content together.

python
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(
    text: str,
    model: SentenceTransformer,
    similarity_threshold: float = 0.75,
    max_chunk_chars: int = 1000,
) -> list[str]:
    """
    Split text at points where the topic changes (cosine similarity < threshold).
    Falls back to the character limit if no natural break is found.
    """
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= 2:
        return [text]

    embeddings = model.encode(sentences, batch_size=32, show_progress_bar=False)

    # Cosine similarity between consecutive sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]) /
              (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])))
        for i in range(len(embeddings) - 1)
    ]

    # Split at topic boundaries
    chunks = []
    current = [sentences[0]]
    current_len = len(sentences[0])

    for sentence, sim in zip(sentences[1:], similarities):
        topic_break = sim < similarity_threshold
        too_long = current_len + len(sentence) > max_chunk_chars

        if topic_break or too_long:
            chunks.append(" ".join(current))
            current = [sentence]
            current_len = len(sentence)
        else:
            current.append(sentence)
            current_len += len(sentence)

    if current:
        chunks.append(" ".join(current))

    return chunks

Which Chunking Strategy to Use

• Fixed-size: fastest; adequate for homogeneous content.
• Sentence-aware: best balance of quality and speed for most RAG applications.
• Semantic: highest quality for long mixed-topic documents (e.g., legal contracts, research papers) — but the extra embedding pass doubles ingestion cost.
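If it helps to make the rule of thumb concrete, here is a tiny heuristic dispatcher (a sketch: the function name, trait parameters, and thresholds are illustrative defaults, not benchmarked values):

```python
def pick_chunking_strategy(doc_chars: int, mixed_topics: bool, ingest_budget: str = "normal") -> str:
    """Map rough document traits to one of the three strategies above."""
    if mixed_topics and doc_chars > 5_000 and ingest_budget != "tight":
        return "semantic"        # topic-shift splitting; costs an extra embedding pass
    if doc_chars > 1_000:
        return "sentence-aware"  # best quality/speed balance for most RAG corpora
    return "fixed-size"          # short or homogeneous content
```

A long legal contract would land on "semantic", a typical knowledge-base article on "sentence-aware", and a short changelog entry on "fixed-size".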


2. Embedding Model Selection

Not all embedding models produce equally useful vectors for your task. The wrong model can degrade retrieval quality by 20–40%.

Matching Model to Task

• General semantic search: all-MiniLM-L6-v2 (384 dimensions)
• High-accuracy semantic search: all-mpnet-base-v2 (768 dimensions)
• Question → document retrieval: multi-qa-MiniLM-L6-cos-v1 (384 dimensions)
• Code search: krlvi/sentence-msmarco-bert-base-dot-v5 (768 dimensions)
• Multilingual: paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions)
• Highest quality (paid): text-embedding-3-large (OpenAI, 3072 dimensions)

For RAG applications — where you embed queries and retrieve document chunks — use a model designed for asymmetric retrieval (short question → long document). multi-qa-MiniLM-L6-cos-v1 and multi-qa-mpnet-base-dot-v1 are specifically trained for this.

Benchmarking Your Embedding Model

Never assume a model will work well for your domain. Build a small evaluation set and measure retrieval quality:

python
from sentence_transformers import SentenceTransformer
import numpy as np

def evaluate_retrieval(model_name: str, test_cases: list[dict]) -> dict:
    """
    Evaluate retrieval quality.
    test_cases: list of {'query': str, 'relevant_doc': str, 'all_docs': list[str]}
    Returns hit@1, hit@3, mean_rank metrics.
    """
    model = SentenceTransformer(model_name)
    hits_at_1 = 0
    hits_at_3 = 0
    ranks = []

    for case in test_cases:
        all_docs = case["all_docs"]
        doc_embeddings = model.encode(all_docs, normalize_embeddings=True)
        query_embedding = model.encode(case["query"], normalize_embeddings=True)

        scores = np.dot(doc_embeddings, query_embedding)
        ranked_indices = np.argsort(scores)[::-1]

        relevant_idx = all_docs.index(case["relevant_doc"])
        rank = list(ranked_indices).index(relevant_idx) + 1
        ranks.append(rank)

        if rank == 1:
            hits_at_1 += 1
        if rank <= 3:
            hits_at_3 += 1

    n = len(test_cases)
    return {
        "model": model_name,
        "hit@1": hits_at_1 / n,
        "hit@3": hits_at_3 / n,
        "mean_rank": sum(ranks) / n,
    }


# Compare two models on your domain
test_cases = [
    {
        "query": "how to reset password",
        "relevant_doc": "Go to Account Settings and click Reset Password to change your credentials.",
        "all_docs": [
            "Go to Account Settings and click Reset Password to change your credentials.",
            "Invoice history is available in the billing section of your account.",
            "Contact support if your device fails to start after the update.",
        ]
    },
    # ... add more test cases
]

for model in ["all-MiniLM-L6-v2", "multi-qa-MiniLM-L6-cos-v1"]:
    print(evaluate_retrieval(model, test_cases))

Building even 20–30 test cases and running this evaluation before committing to a model can save hours of debugging poor retrieval later.


3. HNSW Index Tuning

HNSW (Hierarchical Navigable Small World) is the index algorithm used by ChromaDB, Qdrant, and Weaviate. Three parameters control the quality/speed trade-off:

M — the number of bi-directional links per node. Higher M improves recall but increases memory use and indexing time. Default: 16. For high-recall applications: 32–64.

ef_construction — the size of the dynamic candidate list during index build. Higher values improve recall but slow down ingestion. Default: 100. For high-recall: 200–400.

ef_search (or hnsw:search_ef in ChromaDB) — the candidate list size at query time. Higher values improve recall but slow queries. This is the most useful runtime trade-off parameter.

python
# ChromaDB HNSW tuning
collection = client.get_or_create_collection(
    name="production_docs",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:M": 32,                 # double the default for better recall
        "hnsw:construction_ef": 200,  # better index quality
        "hnsw:search_ef": 100,        # better query recall (default: 10 — far too low)
    }
)

ChromaDB's Default search_ef is 10

ChromaDB's default hnsw:search_ef is 10, which is very conservative. For production RAG applications, set it to at least 50–100. A search_ef of 10 on a large collection can miss highly relevant results and is the most common cause of poor retrieval quality in ChromaDB deployments.

pgvector HNSW tuning:

sql
-- Set ef_search at query time (per session or globally)
SET hnsw.ef_search = 100;

-- Build the index with stronger construction-time parameters
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);

4. Hybrid Search: Combining Vector and Keyword

Pure vector search is excellent for semantic queries but can miss exact matches that keyword search would catch trivially (product codes, technical identifiers, proper nouns). Hybrid search combines both.

Hybrid Search with pgvector + Full-Text Search

Postgres has native full-text search (tsvector). Combine it with pgvector in a single query using Reciprocal Rank Fusion (RRF):

sql
-- Add a generated full-text search column
ALTER TABLE documents ADD COLUMN fts_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON documents USING gin(fts_vector);

-- Hybrid query using RRF scoring
WITH semantic AS (
    SELECT id, 1 - (embedding <=> %(query_vector)s::vector) AS score,
           ROW_NUMBER() OVER (ORDER BY embedding <=> %(query_vector)s::vector) AS rank
    FROM documents
    ORDER BY embedding <=> %(query_vector)s::vector
    LIMIT 20
),
keyword AS (
    SELECT id,
           ts_rank(fts_vector, plainto_tsquery('english', %(query_text)s)) AS score,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(fts_vector, plainto_tsquery('english', %(query_text)s)) DESC
           ) AS rank
    FROM documents
    WHERE fts_vector @@ plainto_tsquery('english', %(query_text)s)
    ORDER BY score DESC
    LIMIT 20
),
rrf AS (
    SELECT
        COALESCE(s.id, k.id) AS id,
        COALESCE(1.0 / (60 + s.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
    FROM semantic s
    FULL JOIN keyword k ON s.id = k.id
)
SELECT d.id, d.content, r.rrf_score
FROM rrf r
JOIN documents d ON d.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 5;

Reciprocal Rank Fusion (the 1.0 / (60 + rank) formula) normalises scores from different systems into a comparable range and combines them without needing to calibrate weights. The constant 60 damps the contribution of the top-ranked items, so no single system can dominate the fused ranking.
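For engines that return ranked lists rather than SQL rows, the same formula is a few lines of plain Python (a minimal sketch; rrf_fuse is an illustrative helper, not a library function):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine several best-first ranked lists with Reciprocal Rank Fusion.

    A document absent from one list simply contributes nothing for that list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "b" appears near the top of both lists, so it outranks "a",
# which only one system ranked first.
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
```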

Hybrid Search with ChromaDB (Manual)

ChromaDB does not have built-in keyword search, but you can implement hybrid search manually:

python
from rank_bm25 import BM25Okapi  # pip install rank-bm25
import numpy as np

class HybridSearchEngine:
    def __init__(self, collection, documents: list[str]):
        self.collection = collection
        self.documents = documents
        # Build BM25 index
        tokenised = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenised)

    def search(self, query: str, n_results: int = 5, alpha: float = 0.5) -> list[dict]:
        """
        alpha: weight of the semantic score (1 - alpha = keyword weight)
        """
        # Vector search
        vector_results = self.collection.query(
            query_texts=[query], n_results=n_results * 2,
            include=["documents", "distances", "metadatas"]
        )
        vector_scores = {
            doc: 1 - dist
            for doc, dist in zip(
                vector_results["documents"][0],
                vector_results["distances"][0]
            )
        }

        # BM25 keyword search
        bm25_scores = self.bm25.get_scores(query.lower().split())
        max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
        bm25_normalised = {
            self.documents[i]: score / max_bm25
            for i, score in enumerate(bm25_scores)
            if score > 0
        }

        # Combine with a weighted sum of the normalised scores
        all_docs = set(vector_scores.keys()) | set(bm25_normalised.keys())
        combined = []
        for doc in all_docs:
            v_score = vector_scores.get(doc, 0)
            k_score = bm25_normalised.get(doc, 0)
            combined.append({"text": doc, "score": alpha * v_score + (1 - alpha) * k_score})

        combined.sort(key=lambda x: x["score"], reverse=True)
        return combined[:n_results]

5. Query Caching

Identical or near-identical queries are common in production (users asking the same FAQ questions repeatedly). Cache results to avoid redundant embedding and retrieval operations.

python
import hashlib
import json
from datetime import datetime, timedelta

class CachedSearchEngine:
    def __init__(self, engine, cache_ttl_seconds: int = 300):
        self.engine = engine
        self.cache: dict[str, tuple[list, datetime]] = {}
        self.ttl = timedelta(seconds=cache_ttl_seconds)

    def _cache_key(self, query: str, n_results: int, filters: dict | None) -> str:
        payload = {"q": query.lower().strip(), "n": n_results, "f": filters or {}}
        return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def search(self, query: str, n_results: int = 5, filters: dict | None = None) -> list[dict]:
        key = self._cache_key(query, n_results, filters)
        now = datetime.utcnow()

        if key in self.cache:
            results, cached_at = self.cache[key]
            if now - cached_at < self.ttl:
                return results  # cache hit

        results = self.engine.search(query, n_results=n_results, filters=filters)
        self.cache[key] = (results, now)
        return results

    def invalidate(self) -> None:
        """Clear the entire cache (e.g., after re-ingestion)."""
        self.cache.clear()

For production at scale, replace the in-memory dict with Redis:

python
import hashlib
import json

import redis

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_search(engine, query: str, n_results: int = 5, ttl: int = 300) -> list[dict]:
    cache_key = f"search:{hashlib.md5(f'{query}:{n_results}'.encode()).hexdigest()}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    results = engine.search(query, n_results=n_results)
    redis_client.setex(cache_key, ttl, json.dumps(results))
    return results

6. Re-ranking Results

Vector search returns the top-K most similar chunks, but similarity to the query vector is not always the best measure of relevance. A re-ranker (cross-encoder model) takes the query and each candidate chunk as a pair and produces a more accurate relevance score.

python
from sentence_transformers import CrossEncoder

# Cross-encoders are slower but much more accurate than bi-encoders for ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(
    collection,
    query: str,
    initial_n: int = 20,
    final_n: int = 5
) -> list[dict]:
    """
    Two-stage retrieval:
    1. Fast ANN search to get top-20 candidates
    2. Accurate cross-encoder re-ranking to select top-5
    """
    # Stage 1: fast vector retrieval (over-retrieve)
    results = collection.query(
        query_texts=[query],
        n_results=initial_n,
        include=["documents", "metadatas"]
    )
    candidates = results["documents"][0]
    metas = results["metadatas"][0]

    # Stage 2: cross-encoder re-ranking
    pairs = [(query, doc) for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidates, rerank_scores, metas),
        key=lambda x: x[1],
        reverse=True
    )

    return [
        {"text": doc, "rerank_score": float(score), "metadata": meta}
        for doc, score, meta in ranked[:final_n]
    ]

Re-ranking adds 50–200 ms of latency but can dramatically improve retrieval quality, especially for complex queries. It is the technique used by Cohere Rerank, Pinecone's rerank API, and Voyage AI.


7. Monitoring and Observability

You cannot optimise what you cannot measure. Track these metrics in production:

python
from dataclasses import dataclass, field

@dataclass
class SearchMetrics:
    total_queries: int = 0
    cache_hits: int = 0
    latencies_ms: list[float] = field(default_factory=list)
    zero_result_queries: list[str] = field(default_factory=list)
    low_similarity_queries: list[str] = field(default_factory=list)

    def record_query(
        self,
        query: str,
        latency_ms: float,
        results: list[dict],
        cache_hit: bool,
        low_similarity_threshold: float = 0.4
    ) -> None:
        self.total_queries += 1
        self.latencies_ms.append(latency_ms)
        if cache_hit:
            self.cache_hits += 1
        if not results:
            self.zero_result_queries.append(query)
        elif results[0]["similarity"] < low_similarity_threshold:
            self.low_similarity_queries.append(query)

    def report(self) -> dict:
        lats = self.latencies_ms
        return {
            "total_queries": self.total_queries,
            "cache_hit_rate": self.cache_hits / max(self.total_queries, 1),
            "p50_latency_ms": sorted(lats)[len(lats) // 2] if lats else 0,
            "p95_latency_ms": sorted(lats)[int(len(lats) * 0.95)] if lats else 0,
            "zero_result_rate": len(self.zero_result_queries) / max(self.total_queries, 1),
            "low_similarity_rate": len(self.low_similarity_queries) / max(self.total_queries, 1),
            "recent_zero_result_queries": self.zero_result_queries[-10:],
            "recent_low_similarity_queries": self.low_similarity_queries[-10:],
        }

Key metrics to alert on:

• p95 query latency > 500 ms: signals HNSW tuning or hardware is needed
• Zero-result rate > 5%: suggests gaps in your document corpus
• Low similarity rate > 20%: suggests chunking, embedding model, or corpus coverage issues
• Cache hit rate < 10%: high query diversity — caching may not help much, focus on index tuning
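The first three thresholds can be checked directly against the report() dict (a sketch; ALERT_RULES and check_alerts are hypothetical helpers that simply restate the bullets above in code):

```python
# Threshold and hint per metric, mirroring the alert list above.
ALERT_RULES = {
    "p95_latency_ms": (500, "tune HNSW parameters or upgrade hardware"),
    "zero_result_rate": (0.05, "likely gaps in the document corpus"),
    "low_similarity_rate": (0.20, "revisit chunking, embedding model, or corpus coverage"),
}

def check_alerts(report: dict) -> list[str]:
    """Return a human-readable alert for every threshold the report exceeds."""
    return [
        f"{metric}={report[metric]:.3g} exceeds {limit}: {hint}"
        for metric, (limit, hint) in ALERT_RULES.items()
        if report.get(metric, 0) > limit
    ]
```

Run it on each report and ship the resulting strings to whatever alerting channel you already use.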

8. Scaling Patterns

When ChromaDB Starts Slowing Down

ChromaDB runs on a single machine. When queries start taking more than 200 ms at your target collection size, consider:

1. Tune HNSW first — increase hnsw:M and hnsw:construction_ef before re-indexing
2. Add a Redis cache for repeated queries
3. Partition by metadata — split one large collection into multiple smaller ones by category or date range, and route queries to the appropriate partition
4. Migrate to Qdrant or pgvector when the single-machine ceiling is reached
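Pattern 3, metadata partitioning, can be sketched as a thin router in front of several ChromaDB collections (the category names and the PartitionedSearch class are hypothetical; only the collection calls are ChromaDB's API):

```python
# Hypothetical partitioning scheme: one collection per document category.
CATEGORIES = ["billing", "technical", "account"]

class PartitionedSearch:
    def __init__(self, client, base_name: str = "docs"):
        # One smaller collection per category instead of a single huge one
        self.collections = {
            cat: client.get_or_create_collection(name=f"{base_name}_{cat}")
            for cat in CATEGORIES
        }

    def add(self, category: str, ids, documents, metadatas=None):
        self.collections[category].add(ids=ids, documents=documents, metadatas=metadatas)

    def search(self, query: str, category: str, n_results: int = 5):
        # Route the query to a single small partition rather than scanning everything
        return self.collections[category].query(query_texts=[query], n_results=n_results)
```

The routing key can come from a UI filter, user metadata, or a cheap classifier; the payoff is that each HNSW index stays small enough to keep query latency flat.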

Migrating from ChromaDB to Qdrant

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient(host="localhost", port=6333)

# Create Qdrant collection
qdrant.create_collection(
    collection_name="production_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Export from ChromaDB
all_data = chroma_collection.get(include=["embeddings", "documents", "metadatas"])

# Import to Qdrant in batches
batch_size = 1000
points = [
    PointStruct(
        id=i,
        vector=embedding,
        payload={"text": doc, **meta}
    )
    for i, (embedding, doc, meta) in enumerate(zip(
        all_data["embeddings"],
        all_data["documents"],
        all_data["metadatas"]
    ))
]

for i in range(0, len(points), batch_size):
    qdrant.upsert(
        collection_name="production_docs",
        points=points[i:i + batch_size]
    )

print(f"Migrated {len(points)} vectors to Qdrant")

Production Optimisation Checklist

• ☐ Use sentence-aware or semantic chunking instead of naive fixed-size splits
• ☐ Benchmark your embedding model on a domain-specific evaluation set before committing
• ☐ Set hnsw:search_ef to at least 50–100 in ChromaDB (the default of 10 is too low)
• ☐ Over-retrieve (top-20), then re-rank (select top-5) for high-quality RAG applications
• ☐ Add Redis-backed query caching for common/repeated queries
• ☐ Implement hybrid search (vector + BM25/FTS) if your corpus contains exact-match-critical content
• ☐ Monitor p95 query latency, zero-result rate, and low-similarity rate
• ☐ Review low-similarity queries weekly to identify corpus gaps
• ☐ Plan migration to Qdrant or Pinecone before you hit ChromaDB's single-machine ceiling

Key Takeaways

• Chunking strategy is the highest-impact optimisation — sentence-aware or semantic chunking consistently outperforms fixed-size splits
• Embedding model selection matters — use asymmetric retrieval models for question-to-document RAG
• ChromaDB's default search_ef = 10 is the most common source of poor recall in production — set it to 50–100
• Two-stage retrieval (ANN + cross-encoder re-ranking) gives the best quality at acceptable latency
• Hybrid search (vectors + keyword) handles both semantic and exact-match queries correctly
• Monitor the metrics that matter: zero-result rate and low-similarity rate tell you about corpus quality; p95 latency tells you about infrastructure

Vector Database Series — Complete

You have now completed the full Vector Database Series:

1. What is a Vector Database? The Complete Beginner's Guide
2. ChromaDB Tutorial: The Complete Beginner's Guide
3. ChromaDB vs Pinecone vs pgvector: Which Should You Use?
4. Build a Semantic Search Engine from Scratch
5. Vector Database Optimisation for Production ← you are here

For the next natural step — connect your vector database to an LLM and build a complete RAG pipeline — see Project: Build a RAG App with Claude.