Vector Database Optimisation for Production: Chunking, Indexing, and Scaling

A vector database that works well in a demo often behaves completely differently in production. You added 1,000 chunks during development. In production you have 500,000 — and queries that took 20 ms now take 800 ms. Your RAG pipeline was returning great answers in testing. In production, users complain the answers feel generic. You think it is the LLM. It is actually the retrieval.
This post covers the techniques that separate a demo-grade vector search system from a production-grade one: advanced chunking strategies, embedding model selection, HNSW index tuning, hybrid search, query caching, monitoring, and scaling patterns.
This is the advanced post in the series. You should already be comfortable with the basics from the earlier posts: What is a Vector Database?, ChromaDB Tutorial, and Build a Semantic Search Engine from Scratch.
1. Chunking Strategy — The Most Impactful Decision
The single biggest factor in retrieval quality is not your vector database or your embedding model — it is how you chunk your documents. Poor chunking causes your system to retrieve irrelevant or incomplete context regardless of how good everything else is.
Fixed-Size Chunking (Baseline)
Split by character count with overlap. Simple, predictable, and good enough for many use cases.
```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks
```

Problem: a 500-character boundary might land in the middle of a sentence or split a code example in half. The resulting chunk loses semantic coherence.
Sentence-Aware Chunking (Better)
Use NLTK or spaCy to split at sentence boundaries, then group sentences into target-sized chunks:
```python
import nltk

nltk.download("punkt_tab", quiet=True)

def sentence_chunks(text: str, max_chars: int = 600, overlap_sentences: int = 1) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current = []
    current_len = 0

    for sentence in sentences:
        sentence_len = len(sentence)
        if current_len + sentence_len > max_chars and current:
            chunks.append(" ".join(current))
            # Overlap: keep last N sentences for next chunk
            current = current[-overlap_sentences:] if overlap_sentences else []
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += sentence_len

    if current:
        chunks.append(" ".join(current))

    return chunks
```

This respects sentence boundaries, dramatically improving chunk coherence.
Semantic Chunking (Best for Long Documents)
Split based on topic shifts — when the semantic similarity between consecutive sentences drops below a threshold, start a new chunk. This keeps topically coherent content together.
```python
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(
    text: str,
    model: SentenceTransformer,
    similarity_threshold: float = 0.75,
    max_chunk_chars: int = 1000,
) -> list[str]:
    """
    Split text at points where the topic changes (cosine similarity < threshold).
    Falls back to the character limit if no natural break is found.
    """
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= 2:
        return [text]

    embeddings = model.encode(sentences, batch_size=32, show_progress_bar=False)

    # Cosine similarity between consecutive sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]) /
              (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])))
        for i in range(len(embeddings) - 1)
    ]

    # Split at topic boundaries
    chunks = []
    current = [sentences[0]]
    current_len = len(sentences[0])

    for sentence, sim in zip(sentences[1:], similarities):
        topic_break = sim < similarity_threshold
        too_long = current_len + len(sentence) > max_chunk_chars

        if topic_break or too_long:
            chunks.append(" ".join(current))
            current = [sentence]
            current_len = len(sentence)
        else:
            current.append(sentence)
            current_len += len(sentence)

    if current:
        chunks.append(" ".join(current))

    return chunks
```

Which Chunking Strategy to Use
- Fixed-size: fastest, adequate for homogeneous content.
- Sentence-aware: best balance of quality and speed for most RAG applications.
- Semantic chunking: highest quality for long mixed-topic documents (e.g., legal contracts, research papers) — but the extra embedding pass doubles ingestion cost.
2. Embedding Model Selection
Not all embedding models produce equally useful vectors for your task. The wrong model can degrade retrieval quality by 20–40%.
Matching Model to Task
| Task | Recommended Model | Dimensions |
|---|---|---|
| General semantic search | all-MiniLM-L6-v2 | 384 |
| High accuracy semantic search | all-mpnet-base-v2 | 768 |
| Question → document retrieval | multi-qa-MiniLM-L6-cos-v1 | 384 |
| Code search | krlvi/sentence-msmarco-bert-base-dot-v5 | 768 |
| Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | 384 |
| Highest quality (paid) | text-embedding-3-large (OpenAI) | 3072 |
For RAG applications — where you embed queries and retrieve document chunks — use a model designed for asymmetric retrieval (short question → long document). multi-qa-MiniLM-L6-cos-v1 and multi-qa-mpnet-base-dot-v1 are specifically trained for this.
Benchmarking Your Embedding Model
Never assume a model will work well for your domain. Build a small evaluation set and measure retrieval quality:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def evaluate_retrieval(model_name: str, test_cases: list[dict]) -> dict:
    """
    Evaluate retrieval quality.
    test_cases: list of {'query': str, 'relevant_doc': str, 'all_docs': list[str]}
    Returns hit@1, hit@3, and mean_rank metrics.
    """
    model = SentenceTransformer(model_name)
    hits_at_1 = 0
    hits_at_3 = 0
    ranks = []

    for case in test_cases:
        all_docs = case["all_docs"]
        doc_embeddings = model.encode(all_docs, normalize_embeddings=True)
        query_embedding = model.encode(case["query"], normalize_embeddings=True)

        scores = np.dot(doc_embeddings, query_embedding)
        ranked_indices = np.argsort(scores)[::-1]

        relevant_idx = all_docs.index(case["relevant_doc"])
        rank = list(ranked_indices).index(relevant_idx) + 1
        ranks.append(rank)

        if rank == 1:
            hits_at_1 += 1
        if rank <= 3:
            hits_at_3 += 1

    n = len(test_cases)
    return {
        "model": model_name,
        "hit@1": hits_at_1 / n,
        "hit@3": hits_at_3 / n,
        "mean_rank": sum(ranks) / n,
    }


# Compare two models on your domain
test_cases = [
    {
        "query": "how to reset password",
        "relevant_doc": "Go to Account Settings and click Reset Password to change your credentials.",
        "all_docs": [
            "Go to Account Settings and click Reset Password to change your credentials.",
            "Invoice history is available in the billing section of your account.",
            "Contact support if your device fails to start after the update.",
        ],
    },
    # ... add more test cases
]

for model in ["all-MiniLM-L6-v2", "multi-qa-MiniLM-L6-cos-v1"]:
    print(evaluate_retrieval(model, test_cases))
```

Building even 20–30 test cases and running this evaluation before committing to a model can save hours of debugging poor retrieval later.
3. HNSW Index Tuning
HNSW (Hierarchical Navigable Small World) is the index algorithm used by ChromaDB, Qdrant, and Weaviate. Three parameters control the quality/speed trade-off:
M — number of bi-directional links per node. Higher M improves recall but increases memory and indexing time. Default: 16. For high-recall applications: 32–64.
ef_construction — the size of the dynamic candidate list during index build. Higher values improve recall but slow down ingestion. Default: 100. For high-recall: 200–400.
ef_search (or hnsw:search_ef in ChromaDB) — the candidate list size at query time. Higher values improve recall and slow queries. This is the most useful runtime trade-off parameter.
```python
# ChromaDB HNSW tuning
collection = client.get_or_create_collection(
    name="production_docs",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:M": 32,                 # double the default for better recall
        "hnsw:construction_ef": 200,  # better index quality
        "hnsw:search_ef": 100,        # better query recall (default: 10 — far too low)
    }
)
```

ChromaDB's Default search_ef is 10
ChromaDB's default hnsw:search_ef is 10, which is very conservative. For production RAG applications, set it to at least 50–100. A search_ef of 10 on a large collection can miss highly relevant results and is the most common cause of poor retrieval quality in ChromaDB deployments.
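Whether a higher `search_ef` is worth its latency cost is an empirical question: compare the ANN results against a brute-force scan of the same collection and compute recall@k. The sketch below is not a ChromaDB API — `ann_ids` is whatever ID list your index returns at a given `search_ef`, and `exact_top_k` is the ground truth over normalised embeddings you export yourself:

```python
import numpy as np

def recall_at_k(ann_ids: list[int], exact_ids: list[int], k: int = 10) -> float:
    """Fraction of the true top-k neighbours that the ANN search returned."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

def exact_top_k(query: np.ndarray, embeddings: np.ndarray, k: int = 10) -> list[int]:
    """Brute-force ground truth: dot product over L2-normalised vectors."""
    scores = embeddings @ query
    return np.argsort(scores)[::-1][:k].tolist()
```

Run this over a sample of real queries at `search_ef` values of 10, 50, and 100; raise `search_ef` until recall@10 stops improving, then keep the smallest value that gets you there.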
pgvector HNSW tuning:

```sql
-- Set ef_search at query time (per session or globally)
SET hnsw.ef_search = 100;

-- Build the index with stronger construction-time parameters
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);
```

4. Hybrid Search: Combining Vector and Keyword
Pure vector search is excellent for semantic queries but can miss exact matches that keyword search would catch trivially (product codes, technical identifiers, proper nouns). Hybrid search combines both.
Hybrid Search with pgvector + Full-Text Search
Postgres has native full-text search (tsvector). Combine it with pgvector in a single query using Reciprocal Rank Fusion (RRF):
```sql
-- Add a full-text search column
ALTER TABLE documents ADD COLUMN fts_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON documents USING gin(fts_vector);

-- Hybrid query using RRF scoring
WITH semantic AS (
    SELECT id, 1 - (embedding <=> %(query_vector)s::vector) AS score,
           ROW_NUMBER() OVER (ORDER BY embedding <=> %(query_vector)s::vector) AS rank
    FROM documents
    ORDER BY embedding <=> %(query_vector)s::vector
    LIMIT 20
),
keyword AS (
    SELECT id,
           ts_rank(fts_vector, plainto_tsquery('english', %(query_text)s)) AS score,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(fts_vector, plainto_tsquery('english', %(query_text)s)) DESC
           ) AS rank
    FROM documents
    WHERE fts_vector @@ plainto_tsquery('english', %(query_text)s)
    ORDER BY score DESC
    LIMIT 20
),
rrf AS (
    SELECT
        COALESCE(s.id, k.id) AS id,
        COALESCE(1.0 / (60 + s.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
    FROM semantic s
    FULL JOIN keyword k ON s.id = k.id
)
SELECT d.id, d.content, r.rrf_score
FROM rrf r
JOIN documents d ON d.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 5;
```

Note the explicit `ORDER BY` before each `LIMIT 20`: without it, Postgres is free to return an arbitrary 20 rows before ranking. Reciprocal Rank Fusion (the `1.0 / (60 + rank)` formula) normalises scores from different systems into a comparable range and combines them without needing to calibrate weights. The constant 60 dampens the dominance of the very top-ranked results.
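The same fusion is easy to reuse outside SQL. A minimal pure-Python version — the ranked ID lists are an assumption; pass whatever your two retrievers return, best match first:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each item scores the sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Documents ranked well by both retrievers float to the top
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
```

Because only ranks matter, the vector distances and BM25 scores never need to be put on a common scale — which is exactly why RRF is the usual default for fusing heterogeneous retrievers.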
Hybrid Search with ChromaDB (Manual)
ChromaDB does not have built-in keyword search, but you can implement hybrid search manually:
```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

class HybridSearchEngine:
    def __init__(self, collection, documents: list[str]):
        self.collection = collection
        self.documents = documents
        # Build the BM25 index
        tokenised = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenised)

    def search(self, query: str, n_results: int = 5, alpha: float = 0.5) -> list[dict]:
        """
        alpha: weight of the semantic score (1 - alpha = keyword weight)
        """
        # Vector search
        vector_results = self.collection.query(
            query_texts=[query], n_results=n_results * 2,
            include=["documents", "distances", "metadatas"]
        )
        vector_scores = {
            doc: 1 - dist
            for doc, dist in zip(
                vector_results["documents"][0],
                vector_results["distances"][0]
            )
        }

        # BM25 keyword search
        bm25_scores = self.bm25.get_scores(query.lower().split())
        max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
        bm25_normalised = {
            self.documents[i]: score / max_bm25
            for i, score in enumerate(bm25_scores)
            if score > 0
        }

        # Combine with a weighted blend of the normalised scores
        all_docs = set(vector_scores.keys()) | set(bm25_normalised.keys())
        combined = []
        for doc in all_docs:
            v_score = vector_scores.get(doc, 0)
            k_score = bm25_normalised.get(doc, 0)
            combined_score = alpha * v_score + (1 - alpha) * k_score
            combined.append({"text": doc, "score": combined_score})

        combined.sort(key=lambda x: x["score"], reverse=True)
        return combined[:n_results]
```

5. Query Caching
Identical or near-identical queries are common in production (users asking the same FAQ questions repeatedly). Cache results to avoid redundant embedding and retrieval operations.
```python
import hashlib
import json
from datetime import datetime, timedelta

class CachedSearchEngine:
    def __init__(self, engine, cache_ttl_seconds: int = 300):
        self.engine = engine
        self.cache: dict[str, tuple[list, datetime]] = {}
        self.ttl = timedelta(seconds=cache_ttl_seconds)

    def _cache_key(self, query: str, n_results: int, filters: dict | None) -> str:
        payload = {"q": query.lower().strip(), "n": n_results, "f": filters or {}}
        return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def search(self, query: str, n_results: int = 5, filters: dict | None = None) -> list[dict]:
        key = self._cache_key(query, n_results, filters)
        now = datetime.utcnow()

        if key in self.cache:
            results, cached_at = self.cache[key]
            if now - cached_at < self.ttl:
                return results  # cache hit

        results = self.engine.search(query, n_results=n_results, filters=filters)
        self.cache[key] = (results, now)
        return results

    def invalidate(self) -> None:
        """Clear the entire cache (e.g., after re-ingestion)."""
        self.cache.clear()
```

For production at scale, replace the in-memory dict with Redis:
```python
import hashlib
import json

import redis

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_search(engine, query: str, n_results: int = 5, ttl: int = 300) -> list[dict]:
    cache_key = f"search:{hashlib.md5(f'{query}:{n_results}'.encode()).hexdigest()}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    results = engine.search(query, n_results=n_results)
    redis_client.setex(cache_key, ttl, json.dumps(results))
    return results
```

6. Re-ranking Results
Vector search returns the top-K most similar chunks, but similarity to the query vector is not always the best measure of relevance. A re-ranker (cross-encoder model) takes the query and each candidate chunk as a pair and produces a more accurate relevance score.
```python
from sentence_transformers import CrossEncoder

# Cross-encoders are slower but much more accurate than bi-encoders for ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(
    collection,
    query: str,
    initial_n: int = 20,
    final_n: int = 5
) -> list[dict]:
    """
    Two-stage retrieval:
    1. Fast ANN search to get the top-20 candidates
    2. Accurate cross-encoder re-ranking to select the top-5
    """
    # Stage 1: fast vector retrieval (over-retrieve)
    results = collection.query(
        query_texts=[query],
        n_results=initial_n,
        include=["documents", "metadatas"]
    )
    candidates = results["documents"][0]
    metas = results["metadatas"][0]

    # Stage 2: cross-encoder re-ranking
    pairs = [(query, doc) for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidates, rerank_scores, metas),
        key=lambda x: x[1],
        reverse=True
    )

    return [
        {"text": doc, "rerank_score": float(score), "metadata": meta}
        for doc, score, meta in ranked[:final_n]
    ]
```

Re-ranking adds 50–200 ms of latency but can dramatically improve retrieval quality, especially for complex queries. It is the technique used by Cohere Rerank, Pinecone's rerank API, and Voyage AI.
7. Monitoring and Observability
You cannot optimise what you cannot measure. Track these metrics in production:
```python
from dataclasses import dataclass, field

@dataclass
class SearchMetrics:
    total_queries: int = 0
    cache_hits: int = 0
    latencies_ms: list[float] = field(default_factory=list)
    zero_result_queries: list[str] = field(default_factory=list)
    low_similarity_queries: list[str] = field(default_factory=list)

    def record_query(
        self,
        query: str,
        latency_ms: float,
        results: list[dict],
        cache_hit: bool,
        low_similarity_threshold: float = 0.4
    ) -> None:
        self.total_queries += 1
        self.latencies_ms.append(latency_ms)
        if cache_hit:
            self.cache_hits += 1
        if not results:
            self.zero_result_queries.append(query)
        elif results[0]["similarity"] < low_similarity_threshold:
            self.low_similarity_queries.append(query)

    def report(self) -> dict:
        lats = sorted(self.latencies_ms)
        return {
            "total_queries": self.total_queries,
            "cache_hit_rate": self.cache_hits / max(self.total_queries, 1),
            "p50_latency_ms": lats[len(lats) // 2] if lats else 0,
            "p95_latency_ms": lats[int(len(lats) * 0.95)] if lats else 0,
            "zero_result_rate": len(self.zero_result_queries) / max(self.total_queries, 1),
            "low_similarity_rate": len(self.low_similarity_queries) / max(self.total_queries, 1),
            "recent_zero_result_queries": self.zero_result_queries[-10:],
            "recent_low_similarity_queries": self.low_similarity_queries[-10:],
        }
```

Key metrics to alert on:
- p95 query latency > 500 ms: signals HNSW tuning or hardware is needed
- Zero-result rate > 5%: suggests gaps in your document corpus
- Low similarity rate > 20%: suggests chunking, embedding model, or corpus coverage issues
- Cache hit rate < 10%: high query diversity — caching may not help much, focus on index tuning
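The thresholds above can be wired directly into the `report()` output. A minimal sketch — the threshold values mirror the bullets, and the alert strings are illustrative:

```python
def check_alerts(report: dict) -> list[str]:
    """Compare a SearchMetrics report against the alerting thresholds."""
    alerts = []
    if report["p95_latency_ms"] > 500:
        alerts.append("p95 latency > 500 ms: tune HNSW or scale hardware")
    if report["zero_result_rate"] > 0.05:
        alerts.append("zero-result rate > 5%: likely gaps in the corpus")
    if report["low_similarity_rate"] > 0.20:
        alerts.append("low-similarity rate > 20%: check chunking or embedding model")
    if report["cache_hit_rate"] < 0.10:
        alerts.append("cache hit rate < 10%: caching may not help this workload")
    return alerts
```

Run it on a schedule (or per report) and forward any non-empty result to whatever alerting channel you already use.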
8. Scaling Patterns
When ChromaDB Starts Slowing Down
ChromaDB runs on a single machine. When queries start taking more than 200 ms at your target collection size, consider:
- Tune HNSW first — increase `hnsw:M` and `hnsw:construction_ef`, then re-index
- Add a Redis cache for repeated queries
- Partition by metadata — split one large collection into multiple smaller ones by category or date range, and route queries to the appropriate partition
- Migrate to Qdrant or pgvector when the single-machine ceiling is reached
Migrating from ChromaDB to Qdrant
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient(host="localhost", port=6333)

# Create the Qdrant collection
qdrant.create_collection(
    collection_name="production_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Export from ChromaDB
all_data = chroma_collection.get(include=["embeddings", "documents", "metadatas"])

# Import to Qdrant in batches
batch_size = 1000
points = [
    PointStruct(
        id=i,
        vector=embedding,
        payload={"text": doc, **meta}
    )
    for i, (embedding, doc, meta) in enumerate(zip(
        all_data["embeddings"],
        all_data["documents"],
        all_data["metadatas"]
    ))
]

for i in range(0, len(points), batch_size):
    qdrant.upsert(
        collection_name="production_docs",
        points=points[i:i + batch_size]
    )

print(f"Migrated {len(points)} vectors to Qdrant")
```

Production Optimisation Checklist
- ☐ Use sentence-aware or semantic chunking instead of naive fixed-size splits
- ☐ Benchmark your embedding model on a domain-specific evaluation set before committing
- ☐ Set `hnsw:search_ef` to at least 50–100 in ChromaDB (the default of 10 is too low)
- ☐ Over-retrieve (top-20) then re-rank (select top-5) for high-quality RAG applications
- ☐ Add Redis-backed query caching for common/repeated queries
- ☐ Implement hybrid search (vector + BM25/FTS) if your corpus contains exact-match-critical content
- ☐ Monitor p95 query latency, zero-result rate, and low-similarity rate
- ☐ Review low-similarity queries weekly to identify corpus gaps
- ☐ Plan migration to Qdrant or Pinecone before you hit ChromaDB's single-machine ceiling
Key Takeaways
- Chunking strategy is the highest-impact optimisation — sentence-aware or semantic chunking consistently outperforms fixed-size splits
- Embedding model selection matters — use asymmetric retrieval models for question-to-document RAG
- ChromaDB's default `search_ef = 10` is the most common source of poor recall in production — set it to 50–100
- Two-stage retrieval (ANN + cross-encoder re-ranking) gives the best quality at acceptable latency
- Hybrid search (vectors + keyword) handles both semantic and exact-match queries correctly
- Monitor the metrics that matter: zero-result rate and low-similarity rate tell you about corpus quality; p95 latency tells you about infrastructure
Vector Database Series — Complete
You have now completed the full Vector Database Series:
- What is a Vector Database? The Complete Beginner's Guide
- ChromaDB Tutorial: The Complete Beginner's Guide
- ChromaDB vs Pinecone vs pgvector: Which Should You Use?
- Build a Semantic Search Engine from Scratch
- Vector Database Optimisation for Production ← you are here
For the next natural step — connect your vector database to an LLM and build a complete RAG pipeline — see Project: Build a RAG App with Claude.
