
Claude RAG: Build a Retrieval-Augmented Generation App

TopicTrick Team

Claude's context window is large — up to 200,000 tokens — but it is not infinite. And it has a training cutoff. Your company's internal documentation, your proprietary research, your customer knowledge base — Claude does not know any of it. You could paste all of it into every prompt, but at scale that is impractical and expensive. Review the Anthropic API getting started guide to understand how to structure requests efficiently before building a RAG pipeline.

Retrieval-Augmented Generation, or RAG, is the solution. Instead of loading all your documents into Claude's context on every request, you index them in a vector database. When a question arrives, you retrieve only the most relevant chunks and include those in Claude's prompt. Claude generates an answer grounded in those retrieved passages, not in its general training knowledge.

What is RAG and How Does It Work with Claude?

RAG (Retrieval-Augmented Generation) is a three-stage architecture: split your documents into chunks and store them as vector embeddings in a database; when a question arrives, embed it and retrieve the most semantically similar chunks; pass those chunks to Claude as context and instruct Claude to answer only from the provided information. The result is accurate, grounded answers from private documents with citations — without hallucination from training knowledge.

This project builds a complete, functional RAG system: document ingestion, chunking, embedding, vector search, and grounded answer generation with Claude.


How the RAG Pipeline Works

RAG architecture has three phases:

  1. Indexing: Split documents into chunks, convert each chunk to a vector embedding, and store in a vector database
  2. Retrieval: When a question arrives, embed the question, find the most semantically similar document chunks using vector similarity search
  3. Generation: Pass the retrieved chunks to Claude as context and ask Claude to answer the question based on that context

The result: Claude can answer questions accurately from private, current documents without hallucinating facts it does not know. For details on how Claude processes large contexts, see the Anthropic documentation.
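The retrieval phase is easiest to see in miniature. The toy sketch below uses pure Python and made-up three-dimensional vectors (a real pipeline gets its embeddings from a model such as sentence-transformers); the chunk names and query are hypothetical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings; in the real pipeline these come from an embedding model
chunk_vectors = {
    "refund policy chunk":  [0.9, 0.1, 0.0],
    "shipping times chunk": [0.1, 0.9, 0.1],
    "warranty terms chunk": [0.5, 0.4, 0.3],
}
query_vector = [0.8, 0.15, 0.05]  # embedding of "How do refunds work?"

# Retrieval = rank every chunk by similarity to the query, keep the top k
ranked = sorted(
    chunk_vectors.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
print(ranked[0][0])  # the most relevant chunk, fed to Claude as context
```

A vector database does exactly this ranking, but over millions of chunks with an approximate-nearest-neighbour index instead of a brute-force loop.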


Prerequisites

  • Python 3.9 or later
  • pip install anthropic chromadb sentence-transformers pypdf
  • An Anthropic API key set as ANTHROPIC_API_KEY

ChromaDB is an open-source vector database that runs locally with no external service required. sentence-transformers provides the local embedding model.


Complete RAG Implementation

python
import anthropic
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import pypdf
import hashlib
import re
from typing import Optional

# Initialise clients
anthropic_client = anthropic.Anthropic()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # Fast, accurate local model


# ─── Document Ingestion ───────────────────────────────────────────────────────

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file."""
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n\n"
    return text


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split text into overlapping chunks.
    
    Args:
        text: Full document text
        chunk_size: Approximate words per chunk
        overlap: Words of overlap between consecutive chunks
    """
    # Clean and normalise whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    words = text.split()
    
    chunks = []
    start = 0
    
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        
        # Try to end at a sentence boundary. rfind returns a character index,
        # so compare against the chunk's character length, not the word count.
        if end < len(words):
            last_period = chunk.rfind('. ')
            if last_period > len(chunk) * 0.6:  # Only if the chunk stays reasonably long
                chunk = chunk[:last_period + 1]
        
        chunks.append(chunk)
        start += chunk_size - overlap
    
    return chunks


def ingest_document(
    file_path: str,
    collection_name: str,
    document_metadata: Optional[dict] = None
) -> int:
    """
    Ingest a document into the vector database.
    
    Returns the number of chunks added.
    """
    path = Path(file_path)
    metadata = document_metadata or {}
    
    # Extract text
    if path.suffix.lower() == ".pdf":
        text = extract_text_from_pdf(file_path)
    elif path.suffix.lower() in [".txt", ".md"]:
        text = path.read_text(encoding="utf-8")
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    
    # Split into chunks
    chunks = chunk_text(text)
    print(f"  {path.name}: {len(chunks)} chunks from {len(text)} characters")
    
    # Get or create collection
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    
    # Generate embeddings
    embeddings = embedder.encode(chunks, batch_size=32, show_progress_bar=False)
    
    # Prepare metadata for each chunk
    doc_hash = hashlib.md5(path.name.encode()).hexdigest()[:8]
    
    chunk_ids = [f"{doc_hash}_{i}" for i in range(len(chunks))]
    chunk_metadata = [
        {
            **metadata,
            "source": path.name,
            "chunk_index": i,
            "total_chunks": len(chunks)
        }
        for i in range(len(chunks))
    ]
    
    # Store in ChromaDB
    collection.add(
        ids=chunk_ids,
        embeddings=embeddings.tolist(),
        documents=chunks,
        metadatas=chunk_metadata
    )
    
    return len(chunks)


def ingest_directory(
    directory: str,
    collection_name: str,
    extensions: Optional[list[str]] = None
) -> int:
    """Ingest all documents in a directory."""
    exts = extensions or [".pdf", ".txt", ".md"]
    total_chunks = 0
    
    for path in Path(directory).iterdir():
        if path.suffix.lower() in exts:
            print(f"Ingesting: {path.name}")
            chunks = ingest_document(str(path), collection_name)
            total_chunks += chunks
    
    return total_chunks


# ─── Retrieval ────────────────────────────────────────────────────────────────

def retrieve_relevant_chunks(
    query: str,
    collection_name: str,
    n_results: int = 5,
    min_relevance_score: float = 0.3
) -> list[dict]:
    """
    Find the most relevant document chunks for a query.
    
    Returns list of dicts with content, source, and relevance score.
    """
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Must match the distance metric used at ingestion
    )
    
    if collection.count() == 0:
        return []
    
    # Embed query
    query_embedding = embedder.encode([query])[0]
    
    # Search
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=min(n_results, collection.count()),
        include=["documents", "metadatas", "distances"]
    )
    
    # Convert distances to similarity scores (cosine distance to similarity)
    chunks = []
    for doc, metadata, distance in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        similarity = 1 - distance  # Cosine similarity
        
        if similarity >= min_relevance_score:
            chunks.append({
                "content": doc,
                "source": metadata.get("source", "Unknown"),
                "chunk_index": metadata.get("chunk_index", 0),
                "similarity": round(similarity, 3)
            })
    
    # Sort by similarity (highest first)
    chunks.sort(key=lambda x: x["similarity"], reverse=True)
    return chunks


# ─── Generation (RAG Answer) ─────────────────────────────────────────────────

def answer_question(
    question: str,
    collection_name: str,
    n_chunks: int = 5,
    model: str = "claude-sonnet-4-6"
) -> dict:
    """
    Answer a question using RAG.
    
    Returns the answer and the source documents used.
    """
    # Retrieve relevant chunks
    chunks = retrieve_relevant_chunks(question, collection_name, n_results=n_chunks)
    
    if not chunks:
        return {
            "answer": "I could not find relevant information in the document library to answer this question.",
            "sources": [],
            "chunks_used": 0
        }
    
    # Build context block
    context_sections = []
    for chunk in chunks:
        context_sections.append(
            f"[Document: {chunk['source']}, Relevance: {chunk['similarity']}]\n{chunk['content']}"
        )
    
    context = "\n\n---\n\n".join(context_sections)
    
    # Generate answer with Claude
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=2048,
        system="""You are a helpful assistant that answers questions based strictly on provided document excerpts.

Rules:
1. Answer ONLY from the provided context below. Do not use knowledge from your training.
2. If the context does not contain enough information to answer the question, say so clearly.
3. Always cite which document(s) you used in your answer using [Source: filename] notation.
4. If information from multiple documents is relevant, synthesise it clearly.
""",
        messages=[
            {
                "role": "user",
                "content": f"""RETRIEVED DOCUMENT CONTEXT:
{context}

---

QUESTION: {question}

Answer the question based solely on the context above."""
            }
        ]
    )
    
    answer = response.content[0].text
    unique_sources = list(set(chunk["source"] for chunk in chunks))
    
    return {
        "answer": answer,
        "sources": unique_sources,
        "chunks_used": len(chunks),
        "retrieved_chunks": chunks
    }


# ─── Interactive RAG Interface ────────────────────────────────────────────────

class RAGApp:
    """Simple interactive RAG application."""
    
    def __init__(self, collection_name: str):
        self.collection_name = collection_name
    
    def ingest(self, path: str):
        """Ingest a file or directory."""
        p = Path(path)
        if p.is_dir():
            count = ingest_directory(str(p), self.collection_name)
        else:
            count = ingest_document(str(p), self.collection_name)
        print(f"Ingested {count} chunks total.")
    
    def ask(self, question: str) -> str:
        """Ask a question and get a grounded answer."""
        result = answer_question(question, self.collection_name)
        
        print(f"\nAnswer: {result['answer']}")
        print(f"\nSources used: {', '.join(result['sources'])}")
        print(f"Chunks retrieved: {result['chunks_used']}")
        
        return result["answer"]
    
    def run_interactive(self):
        """Run an interactive Q&A session."""
        print(f"\nRAG System ready. Collection: {self.collection_name}")
        print("Type your question or 'quit' to exit.\n")
        
        while True:
            question = input("Question: ").strip()
            if question.lower() in ("quit", "exit"):
                break
            if question:
                self.ask(question)
                print()


# ─── Example Usage ────────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Create RAG app for company documentation
    app = RAGApp(collection_name="company_docs")
    
    # Ingest documents
    app.ingest("./documents/")  # Point to your document folder
    
    # Run interactive Q&A
    app.run_interactive()

Choose Chunk Size Based on Your Content

The chunk_size parameter (words per chunk) significantly affects RAG quality. For dense technical documentation, 300-400 words per chunk with 50-word overlap works well. For narrative text or long-form reports, 600-800 words per chunk may be more appropriate to preserve context. Too small, and chunks lack sufficient context for Claude to give complete answers. Too large, and retrieval precision drops because chunks contain too many topics.
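The trade-off is easy to quantify. The sketch below uses a simplified word-window chunker (like chunk_text above, minus the sentence-boundary adjustment) on a synthetic 2,000-word document to show how chunk_size changes the number of chunks you index and search:

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Simplified chunker: fixed word windows with overlap,
    no sentence-boundary adjustment."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Synthetic 2,000-word document
document = " ".join(f"word{i}" for i in range(2000))

for size in (300, 500, 800):
    chunks = chunk_words(document, size, overlap=50)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Fewer, larger chunks mean each retrieval hit carries more context but mixes more topics; more, smaller chunks retrieve precisely but may truncate the answer mid-explanation.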


Extending to Production

  • Replace ChromaDB with a managed vector database like Pinecone, Weaviate, or pgvector for production scale and persistence
  • Replace sentence-transformers with a higher-quality embedding model; Anthropic does not offer its own embeddings API and points users to third-party providers such as Voyage AI
  • Add re-ranking: After vector retrieval, use a cross-encoder model to re-rank chunks by relevance before passing to Claude — improves answer quality significantly
  • Implement hybrid search: Combine vector similarity search with keyword BM25 search — hybrid search consistently outperforms either approach alone
  • Add document versioning: Track document versions and re-ingest when documents are updated, removing old chunks and adding new ones
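To illustrate the hybrid-search idea, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking; the document IDs and the two result lists are made up:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) across the lists it appears in;
    k = 60 is the constant proposed in the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
bm25_ranking = ["doc_a", "doc_c", "doc_b"]    # keyword search order
vector_ranking = ["doc_b", "doc_a", "doc_d"]  # embedding search order

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that rank well in both lists (here doc_a) float to the top, while documents seen by only one retriever are kept but demoted, which is why fusion tends to beat either retriever alone.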

Summary

RAG is the most important architectural pattern for giving Claude accurate knowledge from private or current information. The three-stage pipeline — index, retrieve, generate — is straightforward to implement and scales from a personal knowledge base to enterprise document search.

  • Chunk with overlap to preserve context at boundaries
  • Use a local embedding model for cost-effective indexing — Claude itself is not used to compute embeddings
  • Constrain Claude via the system prompt to answer only from provided context — prevents hallucination
  • Cite sources in every answer — users need to know where information came from

Next IT pro project: Build an AI-Powered IT Incident Report Generator.

For the ChromaDB fundamentals used in this RAG pipeline, the ChromaDB beginner tutorial covers collections, metadata filtering, and HNSW tuning in detail. When you are ready to scale beyond ChromaDB, the vector database comparison guide helps you choose between Pinecone, ChromaDB, and pgvector for production.

The ChromaDB documentation covers production deployment options including the HTTP server mode and Docker containers. For embedding model selection, the Sentence Transformers pretrained models list is the best reference for finding a model suited to your domain and language.


This post is part of the Anthropic AI Tutorial Series. Previous post: Project: Build a Multi-Language Translator App with Claude.
