
RAG Architecture Patterns: The Complete Enterprise Implementation Guide

TopicTrick Team


Why RAG: The LLM Knowledge Gap

LLMs have four limitations that RAG addresses:

| Problem | Description | RAG Solution |
|---|---|---|
| Training cutoff | Model doesn't know about recent events | Retrieve from an up-to-date knowledge base |
| Private knowledge | Model never saw your internal docs | Retrieve from your document store |
| Hallucination | Model invents plausible-sounding facts | Ground answers in retrieved context |
| Context length | Cannot fit entire knowledge base in prompt | Retrieve only the K most relevant chunks |

RAG vs Fine-tuning rule of thumb: Use RAG for facts (what), fine-tuning for style (how). If you want the model to know your company's Q3 2026 financials, use RAG. If you want the model to write in your brand voice, use fine-tuning.


The Three RAG Pipeline Phases


Phase 1: Ingestion — Chunking Strategies That Matter

Chunking is the most underestimated component of RAG quality. Wrong chunk size = wrong context retrieved.

| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens, 20% overlap | Quick prototyping only |
| Sentence | Split on sentence boundaries | General text, Q&A |
| Recursive | Split on paragraphs → sentences → words | Structured documents |
| Semantic | Group semantically related sentences | Complex documents, research papers |
| Document-specific | Parse PDFs preserving section structure | Technical manuals, legal docs |
| Parent-child | Store large chunks, retrieve small ones | Maximum-precision retrieval |
```python
# Parent-child chunking (retrieve small chunks, return parent context).
# parent_store (a key-value store) and vector_store are assumed to be
# initialised elsewhere.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small child chunks for precision retrieval:
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Larger parent chunks for context-rich generation:
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

def ingest_with_parent_child(document: str, doc_id: str):
    parent_chunks = parent_splitter.split_text(document)

    for i, parent in enumerate(parent_chunks):
        parent_id = f"{doc_id}_parent_{i}"
        parent_store.set(parent_id, parent)  # Store full parent in a regular DB

        # Create child chunks — each points to its parent:
        children = child_splitter.split_text(parent)
        for child in children:
            vector_store.add(
                text=child,
                metadata={"parent_id": parent_id, "doc_id": doc_id}
            )

# On retrieval: search child chunks, return parent context:
def retrieve(query: str, k: int = 5) -> list[str]:
    child_results = vector_store.similarity_search(query, k=k*3)  # over-fetch
    parent_ids = {r.metadata["parent_id"] for r in child_results}
    # Return full parent texts (much richer context than child chunks):
    return [parent_store.get(pid) for pid in list(parent_ids)[:k]]
```

Phase 1 (continued): Embedding Models — Choosing the Right One

| Model | Dimensions | Cost | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | High | General purpose, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Highest | Complex queries requiring precision |
| embed-english-v3.0 (Cohere) | 1024 | $0.10/1M tokens | High | Enterprise; multilingual variant available |
| nomic-embed-text (local) | 768 | Free (self-hosted) | Good | Privacy-sensitive, cost-constrained |
| bge-large-en (HuggingFace) | 1024 | Free (self-hosted) | High | Open source, strong MTEB leaderboard scores |

Critical rule: Use the same embedding model for ingestion and retrieval. Mixing models produces meaningless similarity scores.
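The failure mode is easy to see in the similarity computation itself. A minimal sketch (plain cosine similarity, no specific library assumed): vectors from models with different dimensions cannot be compared at all, and vectors from different models that happen to share a dimension live in unrelated spaces, so the score they produce is noise.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    if len(a) != len(b):
        # Different models usually differ in dimension; the comparison fails here.
        # Same-dimension vectors from different models still yield a meaningless score.
        raise ValueError("vectors have different dimensions (different embedding models?)")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```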


Phase 2: Retrieval — Naive vs Hybrid Search

Naive (pure vector) search only captures semantic similarity. It misses exact terminology:

```python
# Naive: misses "JWT" if the user asks about "JSON Web Tokens", and vice versa
results = vector_store.similarity_search("JWT authentication", k=5)

# Hybrid: combine semantic + keyword (BM25) search:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5

# Ensemble fuses the two ranked lists (Reciprocal Rank Fusion):
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # 60% semantic, 40% keyword
)
# Now "JWT" is matched by BM25 even if embeddings don't capture acronyms well
```
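Under the hood, Reciprocal Rank Fusion scores each document by summing reciprocal ranks across the result lists. A minimal sketch of the idea in plain Python, with hypothetical doc IDs (`k=60` is the constant from the original RRF paper, not related to top-k):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one ranking."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results):
            # Higher-ranked docs contribute larger reciprocal-rank scores:
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from the two retrievers:
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

A document ranked well by both retrievers (like `doc_b` here) wins over one ranked first by only one of them.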

Advanced RAG: Query Rewriting and HyDE

Query Rewriting: Users rarely write retrieval-optimal queries. Rewrite before searching:

```python
def rewrite_query(user_query: str) -> list[str]:
    """Generate multiple search queries to improve recall."""
    response = llm.invoke(f"""
    Generate 3 different search queries to find information relevant to:
    "{user_query}"

    Return only the 3 queries, one per line.
    """)
    queries = response.content.strip().split('\n')
    return queries[:3]

# Search with all rewritten queries plus the original, then merge results:
all_results = []
for query in rewrite_query(user_query) + [user_query]:
    all_results.extend(vector_store.similarity_search(query, k=3))
# Deduplicate and re-rank

HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, embed it, use its embedding to search:

```python
def hyde_retrieve(query: str) -> list[str]:
    # Generate a hypothetical ideal answer:
    hypothetical_answer = llm.invoke(
        f"Write a detailed answer to: {query}\n"
        "Base it on knowledge you have. Do not say 'I don't know'."
    ).content

    # Embed the hypothetical answer (not the query):
    hypothesis_embedding = embed(hypothetical_answer)

    # Search the vector store using the hypothetical answer's embedding.
    # This finds documents similar to the "ideal answer" rather than the question.
    return vector_store.similarity_search_by_vector(hypothesis_embedding, k=5)
```

Advanced RAG: Cross-Encoder Re-Ranking

The initial vector search retrieves good candidates but doesn't perfectly rank them. A cross-encoder re-ranks by considering query and document together:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query: str, k: int = 5) -> list[str]:
    # Step 1: Retrieve 20 candidates (over-fetch):
    candidates = vector_store.similarity_search(query, k=20)

    # Step 2: Re-rank using the cross-encoder (scores query+document jointly):
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Step 3: Return top-k after re-ranking:
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc.page_content for doc, _ in ranked[:k]]
```

Re-ranking typically improves answer accuracy by 10-15% with a small latency cost (~50-100ms for cross-encoder inference).


Evaluating RAG Quality with RAGAS

Never ship RAG without automated evaluation. RAGAS provides key metrics:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Build the evaluation dataset:
eval_data = {
    "question": ["What is our refund policy?", "How do I reset my password?"],
    "answer": [generated_answers[0], generated_answers[1]],
    "contexts": [retrieved_contexts[0], retrieved_contexts[1]],
    "ground_truth": ["Refunds within 30 days...", "Click 'Forgot Password'..."],
}

results = evaluate(Dataset.from_dict(eval_data), metrics=[
    faithfulness,        # Is the answer grounded in the retrieved context?
    answer_relevancy,    # Is the answer relevant to the question?
    context_precision,   # Is the retrieved context useful?
    context_recall,      # Was the relevant context actually retrieved?
])
# Targets: faithfulness > 0.9, answer_relevancy > 0.85, context_recall > 0.8
```
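Those targets only help if a regression can fail the build. A minimal sketch of such a CI gate, assuming the per-metric scores have already been extracted from the RAGAS result into a plain dict:

```python
# Floors taken from the targets above:
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_recall": 0.80}

def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Metrics below their floor; an empty list means the pipeline change may ship."""
    return [name for name, floor in THRESHOLDS.items() if scores.get(name, 0.0) < floor]
```

Wire this into the test suite so every chunking or retrieval change is evaluated against the same golden set.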

Frequently Asked Questions

How large should my chunks be? No single answer — it depends on your documents and queries. Rule of thumb: if queries are short and factual ("What is X?"), use smaller chunks (200-400 tokens). If queries require synthesising information across paragraphs, use larger chunks (800-1200 tokens) with parent-child retrieval. Always evaluate chunk size changes with your RAGAS metrics rather than intuition.
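One way to make that rule of thumb checkable is a rough token estimate using the common ~4-characters-per-token heuristic for English. This is a sketch with an illustrative helper, not a real tokenizer; for exact counts use your embedding model's own tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: English text averages ~4 characters per token.
    return max(1, round(len(text) / 4))

def out_of_range(chunks: list[str], lo: int = 200, hi: int = 400) -> list[int]:
    """Indices of chunks whose estimated token count misses the target window."""
    return [i for i, c in enumerate(chunks) if not lo <= approx_tokens(c) <= hi]
```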

Is RAG secure for handling sensitive enterprise data? RAG can be secure with the right architecture: (1) Namespace isolation in the vector database ensures one tenant cannot retrieve another's documents. (2) Document-level access control on retrieved chunks (filter by user permissions before returning them to the LLM). (3) On-premise or private cloud deployment of both the vector database and LLM for regulatory compliance (HIPAA, GDPR). Open-source models (Llama, Mistral) deployed privately are common for highly sensitive use cases.
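The second point can be sketched as a post-retrieval filter applied before anything reaches the prompt. The `Chunk` shape and the role model here are illustrative assumptions, not a specific vector-database API:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    tenant: str                                   # namespace the chunk belongs to
    allowed_roles: set[str] = field(default_factory=set)

def secure_retrieve(hits: list[Chunk], tenant: str, user_roles: set[str]) -> list[str]:
    """Keep only chunks in the caller's tenant that the caller's roles may read."""
    return [
        c.text for c in hits
        if c.tenant == tenant and (c.allowed_roles & user_roles)
    ]
```

In production the tenant filter belongs in the vector query itself (most vector databases support metadata filters), with this check kept as a defence-in-depth layer.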


Key Takeaway

Naive RAG is easy to build but disappointing in production. Advanced RAG — with parent-child chunking, hybrid search, query rewriting, and cross-encoder re-ranking — delivers the 90%+ accuracy that enterprise use cases require. The evaluation discipline (RAGAS metrics, golden test sets, regression testing on every chunking or retrieval change) is what distinguishes a reliable production RAG system from an impressive demo. Invest in the evaluation infrastructure early — it pays dividends with every improvement you make.

Read next: Agentic AI Architecture: Memory, Tools and Control Loops →


Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.