
Claude RAG: Build a Retrieval-Augmented Generation App

TopicTrick Team

Claude's context window is large — up to 200,000 tokens — but it is not infinite. And it has a training cutoff. Your company's internal documentation, your proprietary research, your customer knowledge base — Claude does not know any of it. You could paste all of it into every prompt, but at scale that is impractical and expensive. Review the Anthropic API getting started guide to understand how to structure requests efficiently before building a RAG pipeline.

Retrieval-Augmented Generation, or RAG, is the solution. Instead of loading all your documents into Claude's context on every request, you index them in a vector database. When a question arrives, you retrieve only the most relevant chunks and include those in Claude's prompt. Claude generates an answer grounded in those retrieved passages, not in its general training knowledge.

What is RAG and How Does It Work with Claude?

RAG (Retrieval-Augmented Generation) is a three-stage architecture: split your documents into chunks and store them as vector embeddings in a database; when a question arrives, embed it and retrieve the most semantically similar chunks; pass those chunks to Claude as context and instruct Claude to answer only from the provided information. The result is accurate, grounded answers from private documents with citations — without hallucination from training knowledge.

This project builds a complete, functional RAG system: document ingestion, chunking, embedding, vector search, and grounded answer generation with Claude.


How the RAG Pipeline Works

RAG architecture has three phases:

  1. Indexing: Split documents into chunks, convert each chunk to a vector embedding, and store in a vector database
  2. Retrieval: When a question arrives, embed the question, find the most semantically similar document chunks using vector similarity search
  3. Generation: Pass the retrieved chunks to Claude as context and ask Claude to answer the question based on that context

The result: Claude can answer questions accurately from private, current documents without hallucinating facts it does not know. For details on how Claude processes large contexts, see the Anthropic documentation.
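The retrieval phase is easiest to see in miniature. The toy sketch below uses pure Python and made-up three-dimensional vectors (a real pipeline gets its embeddings from a model such as sentence-transformers); the chunk names and query are hypothetical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings; in the real pipeline these come from an embedding model
chunk_vectors = {
    "refund policy chunk":  [0.9, 0.1, 0.0],
    "shipping times chunk": [0.1, 0.9, 0.1],
    "warranty terms chunk": [0.5, 0.4, 0.3],
}
query_vector = [0.8, 0.15, 0.05]  # embedding of "How do refunds work?"

# Retrieval = rank every chunk by similarity to the query, keep the top k
ranked = sorted(
    chunk_vectors.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
print(ranked[0][0])  # the most relevant chunk, fed to Claude as context
```

A vector database does exactly this ranking, but over millions of chunks with an approximate-nearest-neighbour index instead of a brute-force loop.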


Prerequisites

  • Python 3.9 or later
  • pip install anthropic chromadb sentence-transformers pypdf
  • An Anthropic API key set as ANTHROPIC_API_KEY

ChromaDB is an open-source vector database that runs locally with no external service required. sentence-transformers provides the local embedding model.


Complete RAG Implementation

python
import anthropic
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import pypdf
import hashlib
import re
from typing import Optional

# Initialise clients
anthropic_client = anthropic.Anthropic()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # Fast, accurate local model


# ─── Document Ingestion ───────────────────────────────────────────────────────

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file."""
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n\n"
    return text


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split text into overlapping chunks.
    
    Args:
        text: Full document text
        chunk_size: Approximate words per chunk
        overlap: Words of overlap between consecutive chunks
    """
    # Clean and normalise whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    words = text.split()
    
    chunks = []
    start = 0
    
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        
        # Try to end at a sentence boundary. rfind returns a character index,
        # so compare against the chunk's character length, not the word count.
        if end < len(words):
            last_period = chunk.rfind('. ')
            if last_period > len(chunk) * 0.6:  # Only if the chunk stays reasonably long
                chunk = chunk[:last_period + 1]
        
        chunks.append(chunk)
        start += chunk_size - overlap
    
    return chunks


def ingest_document(
    file_path: str,
    collection_name: str,
    document_metadata: Optional[dict] = None
) -> int:
    """
    Ingest a document into the vector database.
    
    Returns the number of chunks added.
    """
    path = Path(file_path)
    metadata = document_metadata or {}
    
    # Extract text
    if path.suffix.lower() == ".pdf":
        text = extract_text_from_pdf(file_path)
    elif path.suffix.lower() in [".txt", ".md"]:
        text = path.read_text(encoding="utf-8")
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    
    # Split into chunks
    chunks = chunk_text(text)
    print(f"  {path.name}: {len(chunks)} chunks from {len(text)} characters")
    
    # Get or create collection
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    
    # Generate embeddings
    embeddings = embedder.encode(chunks, batch_size=32, show_progress_bar=False)
    
    # Prepare metadata for each chunk
    doc_hash = hashlib.md5(path.name.encode()).hexdigest()[:8]
    
    chunk_ids = [f"{doc_hash}_{i}" for i in range(len(chunks))]
    chunk_metadata = [
        {
            **metadata,
            "source": path.name,
            "chunk_index": i,
            "total_chunks": len(chunks)
        }
        for i in range(len(chunks))
    ]
    
    # Store in ChromaDB
    collection.add(
        ids=chunk_ids,
        embeddings=embeddings.tolist(),
        documents=chunks,
        metadatas=chunk_metadata
    )
    
    return len(chunks)


def ingest_directory(
    directory: str,
    collection_name: str,
    extensions: Optional[list[str]] = None
) -> int:
    """Ingest all documents in a directory."""
    exts = extensions or [".pdf", ".txt", ".md"]
    total_chunks = 0
    
    for path in Path(directory).iterdir():
        if path.suffix.lower() in exts:
            print(f"Ingesting: {path.name}")
            chunks = ingest_document(str(path), collection_name)
            total_chunks += chunks
    
    return total_chunks


# ─── Retrieval ────────────────────────────────────────────────────────────────

def retrieve_relevant_chunks(
    query: str,
    collection_name: str,
    n_results: int = 5,
    min_relevance_score: float = 0.3
) -> list[dict]:
    """
    Find the most relevant document chunks for a query.
    
    Returns list of dicts with content, source, and relevance score.
    """
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Must match the distance metric used at ingestion
    )
    
    if collection.count() == 0:
        return []
    
    # Embed query
    query_embedding = embedder.encode([query])[0]
    
    # Search
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=min(n_results, collection.count()),
        include=["documents", "metadatas", "distances"]
    )
    
    # Convert distances to similarity scores (cosine distance to similarity)
    chunks = []
    for doc, metadata, distance in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        similarity = 1 - distance  # Cosine similarity
        
        if similarity >= min_relevance_score:
            chunks.append({
                "content": doc,
                "source": metadata.get("source", "Unknown"),
                "chunk_index": metadata.get("chunk_index", 0),
                "similarity": round(similarity, 3)
            })
    
    # Sort by similarity (highest first)
    chunks.sort(key=lambda x: x["similarity"], reverse=True)
    return chunks


# ─── Generation (RAG Answer) ─────────────────────────────────────────────────

def answer_question(
    question: str,
    collection_name: str,
    n_chunks: int = 5,
    model: str = "claude-sonnet-4-6"
) -> dict:
    """
    Answer a question using RAG.
    
    Returns the answer and the source documents used.
    """
    # Retrieve relevant chunks
    chunks = retrieve_relevant_chunks(question, collection_name, n_results=n_chunks)
    
    if not chunks:
        return {
            "answer": "I could not find relevant information in the document library to answer this question.",
            "sources": [],
            "chunks_used": 0
        }
    
    # Build context block
    context_sections = []
    for chunk in chunks:
        context_sections.append(
            f"[Document: {chunk['source']}, Relevance: {chunk['similarity']}]\n{chunk['content']}"
        )
    
    context = "\n\n---\n\n".join(context_sections)
    
    # Generate answer with Claude
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=2048,
        system="""You are a helpful assistant that answers questions based strictly on provided document excerpts.

Rules:
1. Answer ONLY from the provided context below. Do not use knowledge from your training.
2. If the context does not contain enough information to answer the question, say so clearly.
3. Always cite which document(s) you used in your answer using [Source: filename] notation.
4. If information from multiple documents is relevant, synthesise it clearly.
""",
        messages=[
            {
                "role": "user",
                "content": f"""RETRIEVED DOCUMENT CONTEXT:
{context}

---

QUESTION: {question}

Answer the question based solely on the context above."""
            }
        ]
    )
    
    answer = response.content[0].text
    unique_sources = list(set(chunk["source"] for chunk in chunks))
    
    return {
        "answer": answer,
        "sources": unique_sources,
        "chunks_used": len(chunks),
        "retrieved_chunks": chunks
    }


# ─── Interactive RAG Interface ────────────────────────────────────────────────

class RAGApp:
    """Simple interactive RAG application."""
    
    def __init__(self, collection_name: str):
        self.collection_name = collection_name
    
    def ingest(self, path: str):
        """Ingest a file or directory."""
        p = Path(path)
        if p.is_dir():
            count = ingest_directory(str(p), self.collection_name)
        else:
            count = ingest_document(str(p), self.collection_name)
        print(f"Ingested {count} chunks total.")
    
    def ask(self, question: str) -> str:
        """Ask a question and get a grounded answer."""
        result = answer_question(question, self.collection_name)
        
        print(f"\nAnswer: {result['answer']}")
        print(f"\nSources used: {', '.join(result['sources'])}")
        print(f"Chunks retrieved: {result['chunks_used']}")
        
        return result["answer"]
    
    def run_interactive(self):
        """Run an interactive Q&A session."""
        print(f"\nRAG System ready. Collection: {self.collection_name}")
        print("Type your question or 'quit' to exit.\n")
        
        while True:
            question = input("Question: ").strip()
            if question.lower() in ("quit", "exit"):
                break
            if question:
                self.ask(question)
                print()


# ─── Example Usage ────────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Create RAG app for company documentation
    app = RAGApp(collection_name="company_docs")
    
    # Ingest documents
    app.ingest("./documents/")  # Point to your document folder
    
    # Run interactive Q&A
    app.run_interactive()

Choose Chunk Size Based on Your Content

The chunk_size parameter (words per chunk) significantly affects RAG quality. For dense technical documentation, 300-400 words per chunk with 50-word overlap works well. For narrative text or long-form reports, 600-800 words per chunk may be more appropriate to preserve context. Too small, and chunks lack sufficient context for Claude to give complete answers. Too large, and retrieval precision drops because chunks contain too many topics.
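The trade-off is easy to quantify. The sketch below uses a simplified word-window chunker (like chunk_text above, minus the sentence-boundary adjustment) on a synthetic 2,000-word document to show how chunk_size changes the number of chunks you index and search:

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Simplified chunker: fixed word windows with overlap,
    no sentence-boundary adjustment."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

# Synthetic 2,000-word document
document = " ".join(f"word{i}" for i in range(2000))

for size in (300, 500, 800):
    chunks = chunk_words(document, size, overlap=50)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Fewer, larger chunks mean each retrieval hit carries more context but mixes more topics; more, smaller chunks retrieve precisely but may truncate the answer mid-explanation.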


Extending to Production

  • Replace ChromaDB with a managed vector database like Pinecone, Weaviate, or pgvector for production scale and persistence
  • Replace sentence-transformers with a higher-quality embedding model; Anthropic does not offer its own embeddings API and points users to third-party providers such as Voyage AI
  • Add re-ranking: After vector retrieval, use a cross-encoder model to re-rank chunks by relevance before passing to Claude — improves answer quality significantly
  • Implement hybrid search: Combine vector similarity search with keyword BM25 search — hybrid search consistently outperforms either approach alone
  • Add document versioning: Track document versions and re-ingest when documents are updated, removing old chunks and adding new ones
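To illustrate the hybrid-search idea, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking; the document IDs and the two result lists are made up:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) across the lists it appears in;
    k = 60 is the constant proposed in the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
bm25_ranking = ["doc_a", "doc_c", "doc_b"]    # keyword search order
vector_ranking = ["doc_b", "doc_a", "doc_d"]  # embedding search order

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that rank well in both lists (here doc_a) float to the top, while documents seen by only one retriever are kept but demoted, which is why fusion tends to beat either retriever alone.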

Summary

RAG is the most important architectural pattern for giving Claude accurate knowledge from private or current information. The three-stage pipeline — index, retrieve, generate — is straightforward to implement and scales from a personal knowledge base to enterprise document search.

  • Chunk with overlap to preserve context at boundaries
  • Use a local embedding model for cost-effective indexing — Claude itself is not used to compute embeddings
  • Constrain Claude via the system prompt to answer only from provided context — prevents hallucination
  • Cite sources in every answer — users need to know where information came from

Next IT pro project: Build an AI-Powered IT Incident Report Generator.

For the ChromaDB fundamentals used in this RAG pipeline, the ChromaDB beginner tutorial covers collections, metadata filtering, and HNSW tuning in detail. When you are ready to scale beyond ChromaDB, the vector database comparison guide helps you choose between Pinecone, ChromaDB, and pgvector for production.

The ChromaDB documentation covers production deployment options including the HTTP server mode and Docker containers. For embedding model selection, the Sentence Transformers pretrained models list is the best reference for finding a model suited to your domain and language.


This post is part of the Anthropic AI Tutorial Series. Previous post: Project: Build a Multi-Language Translator App with Claude.
