
ChromaDB Tutorial: The Complete Beginner's Guide (2026)

TopicTrick

ChromaDB is the fastest way to add vector search to a Python application. It runs entirely in-process — no Docker container, no cloud account, no server to manage. You pip install chromadb, write five lines of Python, and you have a working vector database.

That simplicity makes ChromaDB the standard starting point for learning vector databases and the go-to choice for RAG prototypes and small-to-medium production deployments. This tutorial covers everything you need: collections, embedding functions, metadata filtering, persistence, updates, deletions, and building a real document search pipeline.

If you are new to vector databases, read What is a Vector Database? first — this tutorial assumes you understand what embeddings and similarity search are.


Installing ChromaDB

```bash
pip install chromadb
```

For sentence-transformers (the free local embedding model used in most examples below):

```bash
pip install chromadb sentence-transformers
```

ChromaDB requires Python 3.8 or later. All examples in this tutorial were tested on Python 3.11.


ChromaDB Client Modes

ChromaDB has three operating modes:

Ephemeral (in-memory) — data exists only for the lifetime of your Python process. Fastest, but nothing is saved to disk. Useful for unit tests and quick experiments.

```python
import chromadb

client = chromadb.EphemeralClient()
```

Persistent (local disk) — data is saved to a directory on disk and survives process restarts. The standard mode for development and small deployments.

```python
client = chromadb.PersistentClient(path="./chroma_data")
```

HTTP Client — connects to a separately running ChromaDB server (started with chroma run --path ./chroma_data). Used when multiple processes or services need to share the same vector database.

```python
client = chromadb.HttpClient(host="localhost", port=8000)
```

For this tutorial, use PersistentClient — it behaves identically to a production setup but requires no separate service.


Collections: ChromaDB's Core Concept

In ChromaDB, a collection is the equivalent of a table in SQL. It stores:

  • Documents: the raw text (or other content) you want to search
  • Embeddings: the vector representations of those documents
  • Metadata: structured key-value pairs associated with each document
  • IDs: a unique string identifier for each item

You can have multiple collections in a single client, each with its own embedding function and distance metric.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")

# Create a collection
collection = client.create_collection(name="my_documents")

# Get an existing collection (raises error if it doesn't exist)
collection = client.get_collection(name="my_documents")

# Get or create — the safest option for application code
collection = client.get_or_create_collection(name="my_documents")

# List all collections
print(client.list_collections())

# Delete a collection
client.delete_collection(name="my_documents")
```

Embedding Functions

An embedding function tells ChromaDB how to convert your documents into vector embeddings. ChromaDB ships with built-in support for several providers.

Option 1: Sentence Transformers (Free, Local)

The default and most convenient option for most use cases. The model downloads once (~80 MB) and then runs entirely offline.

```python
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # 384 dimensions, fast, good quality
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)
```

Popular sentence-transformer models for different use cases:

| Model | Dimensions | Speed | Use Case |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very fast | General text, best starting point |
| all-mpnet-base-v2 | 768 | Moderate | Higher quality general text |
| multi-qa-MiniLM-L6-cos-v1 | 384 | Very fast | Question/answer retrieval |
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | Fast | Multilingual support |

Option 2: OpenAI Embeddings

Higher quality embeddings, especially for domain-specific content, at a cost per token.

```python
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-openai-api-key",
    model_name="text-embedding-3-small"  # 1536 dimensions
)
```

Option 3: Custom Embedding Function

Wrap any model — Cohere, Anthropic, a local GGUF model — by subclassing EmbeddingFunction:

```python
from chromadb import EmbeddingFunction, Documents, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Call your model, return list of float lists
        return my_model.encode(input).tolist()

collection = client.get_or_create_collection(
    name="custom",
    embedding_function=MyEmbeddingFunction()
)
```

Lock Your Embedding Function

Once you add documents to a collection, you must always use the same embedding function to query it. Changing the model invalidates all existing embeddings. If you need to switch models, create a new collection and re-embed all documents.


Distance Metrics

Set the distance metric at collection creation time via the metadata parameter:

```python
# Cosine similarity (recommended for text; note ChromaDB's default is l2)
collection = client.get_or_create_collection(
    name="text_docs",
    metadata={"hnsw:space": "cosine"}
)

# L2 (Euclidean distance — ChromaDB's default, common for images)
collection = client.get_or_create_collection(
    name="image_docs",
    metadata={"hnsw:space": "l2"}
)

# Inner product / dot product (for normalised vectors)
collection = client.get_or_create_collection(
    name="ip_docs",
    metadata={"hnsw:space": "ip"}
)
```

For text: always use cosine. For image embeddings from models like CLIP: l2 or ip depending on whether the model normalises its outputs.


Adding Documents

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_data")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="support_kb",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents — ChromaDB embeds them automatically
collection.add(
    ids=["doc_001", "doc_002", "doc_003", "doc_004", "doc_005"],
    documents=[
        "How to reset your password if you are locked out",
        "Device fails to boot following a software patch",
        "Cancel your subscription before the renewal date",
        "How to export your invoice history as a PDF",
        "Battery drains faster than expected after the update",
    ],
    metadatas=[
        {"category": "account", "priority": "high"},
        {"category": "hardware", "priority": "high"},
        {"category": "billing", "priority": "medium"},
        {"category": "billing", "priority": "low"},
        {"category": "hardware", "priority": "medium"},
    ]
)

print(f"Added {collection.count()} documents")
```

IDs must be unique strings. If you call add() with an ID that already exists, ChromaDB raises an error. Use upsert() instead when you are not sure whether a document already exists.

Providing Pre-computed Embeddings

If you have already computed embeddings outside ChromaDB (e.g., in a batch preprocessing job), pass them directly:

```python
import numpy as np

# Pre-computed embeddings — shape (n_docs, embedding_dim)
embeddings = np.random.rand(3, 384).tolist()  # placeholder

collection.add(
    ids=["pre_001", "pre_002", "pre_003"],
    embeddings=embeddings,
    documents=["raw text 1", "raw text 2", "raw text 3"]
)
```

When you provide embeddings directly, ChromaDB does not call the collection's embedding function for those documents.


Querying: The Core Operation

```python
results = collection.query(
    query_texts=["my laptop won't start after the Windows update"],
    n_results=3,
    include=["documents", "distances", "metadatas"]
)

for doc, dist, meta in zip(
    results["documents"][0],
    results["distances"][0],
    results["metadatas"][0]
):
    similarity = 1 - dist  # cosine distance → similarity
    print(f"[{similarity:.3f}] [{meta['category']}] {doc}")
```

Output:

[0.851] [hardware] Device fails to boot following a software patch
[0.734] [hardware] Battery drains faster than expected after the update
[0.581] [account] How to reset your password if you are locked out

The results dictionary always has the same shape: results["documents"] is a list of lists — one inner list per query. If you pass multiple queries at once:

```python
results = collection.query(
    query_texts=[
        "can't log in to my account",
        "charge not lasting as long as before"
    ],
    n_results=2
)
# results["documents"][0] → top 2 for first query
# results["documents"][1] → top 2 for second query
```

What can you include in results?

The include parameter controls what ChromaDB returns:

  • "documents" — original text
  • "embeddings" — raw vector arrays
  • "distances" — raw distance scores (smaller = closer)
  • "metadatas" — metadata dictionaries
  • "uris" — URIs if you stored them
  • "data" — raw data if multi-modal

Distance vs Similarity

ChromaDB returns distances, not similarity scores. For cosine distance: similarity = 1 - distance. A distance of 0.0 means identical; 2.0 means opposite. For L2 distance, smaller is more similar — there is no simple normalisation to a 0–1 similarity score.
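If you standardise on cosine, a tiny helper keeps the conversion in one place (plain Python, nothing ChromaDB-specific):

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Map ChromaDB's cosine distance (0.0 to 2.0) to cosine similarity (1.0 to -1.0)."""
    return 1.0 - distance

print(cosine_distance_to_similarity(0.0))  # 1.0  -> identical direction
print(cosine_distance_to_similarity(1.0))  # 0.0  -> orthogonal
print(cosine_distance_to_similarity(2.0))  # -1.0 -> opposite direction
```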


Metadata Filtering

Metadata filters let you combine vector similarity search with structured constraints. This is one of ChromaDB's most powerful features.

Basic Filters

```python
# Only search within the billing category
results = collection.query(
    query_texts=["will I be charged if I don't cancel?"],
    n_results=2,
    where={"category": "billing"}
)
```

Operators

ChromaDB supports a rich set of filter operators via the $ prefix:

```python
# Equality
where={"category": {"$eq": "billing"}}

# Not equal
where={"category": {"$ne": "hardware"}}

# Greater than / less than
where={"priority_score": {"$gt": 5}}
where={"priority_score": {"$lte": 10}}

# In a list of values
where={"category": {"$in": ["billing", "account"]}}

# Not in a list
where={"category": {"$nin": ["hardware"]}}
```

Boolean Logic: $and, $or

```python
# Must be billing category AND high priority
results = collection.query(
    query_texts=["charge for next year"],
    n_results=3,
    where={
        "$and": [
            {"category": {"$eq": "billing"}},
            {"priority": {"$eq": "high"}}
        ]
    }
)

# Either billing OR account
results = collection.query(
    query_texts=["login and payment issue"],
    n_results=3,
    where={
        "$or": [
            {"category": {"$eq": "billing"}},
            {"category": {"$eq": "account"}}
        ]
    }
)
```

Full-Text Filter: $contains

```python
# Filter on document content (keyword match, not semantic)
results = collection.query(
    query_texts=["account access"],
    n_results=3,
    where_document={"$contains": "password"}
)
```

where_document filters on the actual document text, while where filters on metadata. Both can be used together in the same query.


CRUD Operations: Update and Delete

Upsert (Insert or Update)

upsert() inserts new documents or updates existing ones — the safest choice for most ingestion pipelines:

```python
collection.upsert(
    ids=["doc_001"],
    documents=["How to reset your password, PIN, or security questions"],
    metadatas=[{"category": "account", "priority": "high", "version": 2}]
)
```

Update

update() modifies existing documents; unlike upsert(), it will not create an ID that does not already exist:

```python
collection.update(
    ids=["doc_003"],
    documents=["Cancel your subscription at any time from the billing dashboard"],
    metadatas=[{"category": "billing", "priority": "medium", "updated": "2026-04"}]
)
```

You can update documents, metadatas, or embeddings independently — you do not need to provide all three.

Get by ID

```python
# Fetch specific documents by ID
result = collection.get(
    ids=["doc_001", "doc_002"],
    include=["documents", "metadatas"]
)
print(result["documents"])
```

Delete

```python
# Delete by ID
collection.delete(ids=["doc_004"])

# Delete by metadata filter
collection.delete(where={"category": "hardware"})

# Delete by document content
collection.delete(where_document={"$contains": "password"})
```

Deletions are Permanent

ChromaDB does not have a recycle bin or soft-delete. Deleted items are gone immediately. For production systems where auditability matters, consider soft-deleting by updating a metadata field (e.g., status: 'deleted') and filtering it out in queries, while keeping the actual record.


Inspecting and Managing Collections

```python
# Count documents
print(collection.count())  # 4

# Get all documents (use with care on large collections)
all_docs = collection.get(include=["documents", "metadatas"])
for id_, doc, meta in zip(
    all_docs["ids"],
    all_docs["documents"],
    all_docs["metadatas"]
):
    print(f"{id_}: [{meta['category']}] {doc[:60]}")

# Peek at first 5 documents
print(collection.peek(5))
```

Building a Document Search Pipeline

Here is a complete, production-style document ingestion and query pipeline — the kind you would use in a RAG application.

```python
import chromadb
from chromadb.utils import embedding_functions
import hashlib

class DocumentSearchPipeline:
    """
    A simple document ingestion and semantic search pipeline built on ChromaDB.
    Uses SHA-256 of the document text as a stable ID to support idempotent upserts.
    """

    def __init__(self, persist_path: str = "./search_db"):
        self.client = chromadb.PersistentClient(path=persist_path)
        self.ef = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="all-MiniLM-L6-v2"
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            embedding_function=self.ef,
            metadata={"hnsw:space": "cosine"}
        )

    def _make_id(self, text: str) -> str:
        """Generate a stable ID from document content."""
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def ingest(self, documents: list[dict]) -> int:
        """
        Ingest a list of {'text': str, 'metadata': dict} dicts.
        Uses upsert so re-ingestion is safe.
        """
        ids = [self._make_id(doc["text"]) for doc in documents]
        texts = [doc["text"] for doc in documents]
        metas = [doc.get("metadata", {}) for doc in documents]

        self.collection.upsert(ids=ids, documents=texts, metadatas=metas)
        return len(documents)

    def search(
        self,
        query: str,
        n_results: int = 5,
        filters: dict | None = None
    ) -> list[dict]:
        """
        Run a semantic search.
        Returns a list of results sorted by similarity (highest first).
        """
        kwargs = {
            "query_texts": [query],
            "n_results": min(n_results, self.collection.count()),
            "include": ["documents", "distances", "metadatas"],
        }
        if filters:
            kwargs["where"] = filters

        results = self.collection.query(**kwargs)

        return [
            {
                "text": doc,
                "similarity": round(1 - dist, 4),
                "metadata": meta,
            }
            for doc, dist, meta in zip(
                results["documents"][0],
                results["distances"][0],
                results["metadatas"][0],
            )
        ]

    def stats(self) -> dict:
        return {"total_documents": self.collection.count()}


# --- Usage ---

pipeline = DocumentSearchPipeline()

# Ingest documents
pipeline.ingest([
    {"text": "How to reset your password if locked out", "metadata": {"source": "kb", "category": "account"}},
    {"text": "Device fails to boot after a firmware update", "metadata": {"source": "kb", "category": "hardware"}},
    {"text": "Cancel your subscription before the renewal date", "metadata": {"source": "kb", "category": "billing"}},
    {"text": "Export your invoice history as a CSV or PDF", "metadata": {"source": "kb", "category": "billing"}},
    {"text": "Battery drains faster after the latest OS update", "metadata": {"source": "kb", "category": "hardware"}},
    {"text": "How to transfer your licence to a new device", "metadata": {"source": "kb", "category": "account"}},
])

print(pipeline.stats())

# Semantic search — no metadata filter
print("\n--- General search ---")
for r in pipeline.search("my laptop stopped working after an update", n_results=3):
    print(f"[{r['similarity']}] [{r['metadata']['category']}] {r['text']}")

# Semantic search scoped to billing only
print("\n--- Billing search ---")
for r in pipeline.search("charges on my account", n_results=3, filters={"category": "billing"}):
    print(f"[{r['similarity']}] {r['text']}")
```

ChromaDB with LangChain and LlamaIndex

ChromaDB has first-class integrations with both major RAG frameworks.

LangChain:

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vectorstore = Chroma(
    collection_name="langchain_docs",
    embedding_function=embeddings,
    persist_directory="./langchain_chroma"
)

# Add documents
vectorstore.add_texts(
    texts=["Doc 1 content", "Doc 2 content"],
    metadatas=[{"source": "file1.pdf"}, {"source": "file2.pdf"}]
)

# Similarity search
docs = vectorstore.similarity_search("query text", k=3)
```

LlamaIndex:

```python
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

chroma_client = chromadb.PersistentClient(path="./llamaindex_chroma")
chroma_collection = chroma_client.get_or_create_collection("llamaindex_docs")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# `documents` is a list of llama_index Document objects loaded earlier,
# e.g. via SimpleDirectoryReader
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about X?")
```

HNSW Index Tuning

ChromaDB uses HNSW internally. For most use cases the defaults are fine. For large collections or when you need to optimise recall vs speed, tune the HNSW parameters at collection creation:

```python
collection = client.get_or_create_collection(
    name="large_collection",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:construction_ef": 200,  # higher = better recall, slower indexing (default: 100)
        "hnsw:search_ef": 100,        # higher = better recall, slower queries (default: 10)
        "hnsw:M": 16,                 # connections per node, affects memory (default: 16)
        "hnsw:batch_size": 1000,      # docs per batch during indexing
        "hnsw:sync_threshold": 1000,  # batch size before index sync to disk
    }
)
```

When to Tune HNSW

The defaults work well for collections under 100,000 documents. If you notice queries returning obviously poor results (low recall), increase hnsw:search_ef first — it costs query time but improves result quality. Increase hnsw:construction_ef only if recall is still poor even with a high search_ef: it improves the quality of the index itself, but since it only affects indexing, you must re-create the collection for the change to take effect.


Key Takeaways

  • ChromaDB runs entirely in-process — no server, no Docker, no cloud account needed for development
  • Use PersistentClient for development and small production deployments; HttpClient when multiple services share one database
  • The embedding function must stay consistent — adding and querying must use the same model
  • Use cosine distance for all text-based collections
  • Combine vector search with metadata filters using the where parameter for scoped queries
  • Use upsert() rather than add() in production ingestion pipelines to handle re-runs safely
  • HNSW parameters can be tuned for recall vs speed trade-offs on large collections

What's Next in the Vector Database Series

This post is part of the Vector Database Series. Previous post: What is a Vector Database? The Complete Beginner's Guide.