
RAG Architecture Patterns: The Complete Enterprise Implementation Guide

TopicTrick Team


Why RAG: The LLM Knowledge Gap

LLMs have four limitations that RAG addresses:

| Problem | Description | RAG Solution |
|---|---|---|
| Training cutoff | Model doesn't know about recent events | Retrieve from an up-to-date knowledge base |
| Private knowledge | Model never saw your internal docs | Retrieve from your document store |
| Hallucination | Model invents plausible-sounding facts | Ground answers in retrieved context |
| Context length | Cannot fit entire knowledge base in prompt | Retrieve only the K most relevant chunks |

RAG vs Fine-tuning rule of thumb: Use RAG for facts (what), fine-tuning for style (how). If you want the model to know your company's Q3 2026 financials, use RAG. If you want the model to write in your brand voice, use fine-tuning.


The Three RAG Pipeline Phases


Phase 1: Ingestion — Chunking Strategies That Matter

Chunking is the most underestimated component of RAG quality. Wrong chunk size = wrong context retrieved.

| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens, 20% overlap | Quick prototyping only |
| Sentence | Split on sentence boundaries | General text, Q&A |
| Recursive | Split on paragraphs → sentences → words | Structured documents |
| Semantic | Group semantically related sentences | Complex documents, research papers |
| Document-specific | Parse PDFs preserving section structure | Technical manuals, legal docs |
| Parent-child | Store large chunks, retrieve small ones | Maximum-precision retrieval |
```python
# Parent-child chunking (retrieve small chunks, return parent context).
# parent_store (a key-value store) and vector_store are assumed to be
# initialised elsewhere.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small child chunks for precision retrieval:
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Larger parent chunks for context-rich generation:
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

def ingest_with_parent_child(document: str, doc_id: str):
    parent_chunks = parent_splitter.split_text(document)

    for i, parent in enumerate(parent_chunks):
        parent_id = f"{doc_id}_parent_{i}"
        parent_store.set(parent_id, parent)  # Store full parent in a regular DB

        # Create child chunks — each points to its parent:
        children = child_splitter.split_text(parent)
        for child in children:
            vector_store.add(
                text=child,
                metadata={"parent_id": parent_id, "doc_id": doc_id}
            )

# On retrieval: search child chunks, return parent context:
def retrieve(query: str, k: int = 5) -> list[str]:
    child_results = vector_store.similarity_search(query, k=k*3)  # over-fetch
    parent_ids = {r.metadata["parent_id"] for r in child_results}
    # Return full parent texts (much richer context than child chunks):
    return [parent_store.get(pid) for pid in list(parent_ids)[:k]]
```

Phase 1 (continued): Embedding Models — Choosing the Right One

| Model | Dimensions | Cost | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | High | General purpose, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Highest | Complex queries requiring precision |
| embed-english-v3.0 (Cohere) | 1024 | $0.10/1M tokens | High | Enterprise; multilingual variant available |
| nomic-embed-text (local) | 768 | Free (self-hosted) | Good | Privacy-sensitive, cost-constrained |
| bge-large-en (HuggingFace) | 1024 | Free (self-hosted) | High | Open source, strong MTEB leaderboard scores |

Critical rule: Use the same embedding model for ingestion and retrieval. Mixing models produces meaningless similarity scores.
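The failure mode is easy to see in the similarity computation itself. A minimal sketch (plain cosine similarity, no specific library assumed): vectors from models with different dimensions cannot be compared at all, and vectors from different models that happen to share a dimension live in unrelated spaces, so the score they produce is noise.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    if len(a) != len(b):
        # Different models usually differ in dimension; the comparison fails here.
        # Same-dimension vectors from different models still yield a meaningless score.
        raise ValueError("vectors have different dimensions (different embedding models?)")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```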


Phase 2: Retrieval — Naive vs Hybrid Search

Naive (pure vector) search only captures semantic similarity. It misses exact terminology:

```python
# Naive: misses "JWT" if the user asks about "JSON Web Tokens", and vice versa
results = vector_store.similarity_search("JWT authentication", k=5)

# Hybrid: combine semantic + keyword (BM25) search:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5

# Ensemble fuses the two ranked lists (Reciprocal Rank Fusion):
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # 60% semantic, 40% keyword
)
# Now "JWT" is matched by BM25 even if embeddings don't capture acronyms well
```
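Under the hood, Reciprocal Rank Fusion scores each document by summing reciprocal ranks across the result lists. A minimal sketch of the idea in plain Python, with hypothetical doc IDs (`k=60` is the constant from the original RRF paper, not related to top-k):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one ranking."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results):
            # Higher-ranked docs contribute larger reciprocal-rank scores:
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from the two retrievers:
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

A document ranked well by both retrievers (like `doc_b` here) wins over one ranked first by only one of them.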

Advanced RAG: Query Rewriting and HyDE

Query Rewriting: Users rarely write retrieval-optimal queries. Rewrite before searching:

```python
def rewrite_query(user_query: str) -> list[str]:
    """Generate multiple search queries to improve recall."""
    response = llm.invoke(f"""
    Generate 3 different search queries to find information relevant to:
    "{user_query}"

    Return only the 3 queries, one per line.
    """)
    queries = response.content.strip().split('\n')
    return queries[:3]

# Search with all rewritten queries plus the original, then merge results:
all_results = []
for query in rewrite_query(user_query) + [user_query]:
    all_results.extend(vector_store.similarity_search(query, k=3))
# Deduplicate and re-rank

HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, embed it, use its embedding to search:

```python
def hyde_retrieve(query: str) -> list[str]:
    # Generate a hypothetical ideal answer:
    hypothetical_answer = llm.invoke(
        f"Write a detailed answer to: {query}\n"
        "Base it on knowledge you have. Do not say 'I don't know'."
    ).content

    # Embed the hypothetical answer (not the query):
    hypothesis_embedding = embed(hypothetical_answer)

    # Search the vector store using the hypothetical answer's embedding.
    # This finds documents similar to the "ideal answer" rather than the question.
    return vector_store.similarity_search_by_vector(hypothesis_embedding, k=5)
```

Advanced RAG: Cross-Encoder Re-Ranking

The initial vector search retrieves good candidates but doesn't perfectly rank them. A cross-encoder re-ranks by considering query and document together:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query: str, k: int = 5) -> list[str]:
    # Step 1: Retrieve 20 candidates (over-fetch):
    candidates = vector_store.similarity_search(query, k=20)

    # Step 2: Re-rank using the cross-encoder (scores query+document jointly):
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Step 3: Return top-k after re-ranking:
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc.page_content for doc, _ in ranked[:k]]
```

Re-ranking typically improves answer accuracy by 10-15% with a small latency cost (~50-100ms for cross-encoder inference).


Evaluating RAG Quality with RAGAS

Never ship RAG without automated evaluation. RAGAS provides key metrics:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Build the evaluation dataset:
eval_data = {
    "question": ["What is our refund policy?", "How do I reset my password?"],
    "answer": [generated_answers[0], generated_answers[1]],
    "contexts": [retrieved_contexts[0], retrieved_contexts[1]],
    "ground_truth": ["Refunds within 30 days...", "Click 'Forgot Password'..."],
}

results = evaluate(Dataset.from_dict(eval_data), metrics=[
    faithfulness,        # Is the answer grounded in the retrieved context?
    answer_relevancy,    # Is the answer relevant to the question?
    context_precision,   # Is the retrieved context useful?
    context_recall,      # Was the relevant context actually retrieved?
])
# Targets: faithfulness > 0.9, answer_relevancy > 0.85, context_recall > 0.8
```
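Those targets only help if a regression can fail the build. A minimal sketch of such a CI gate, assuming the per-metric scores have already been extracted from the RAGAS result into a plain dict:

```python
# Floors taken from the targets above:
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_recall": 0.80}

def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Metrics below their floor; an empty list means the pipeline change may ship."""
    return [name for name, floor in THRESHOLDS.items() if scores.get(name, 0.0) < floor]
```

Wire this into the test suite so every chunking or retrieval change is evaluated against the same golden set.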

Frequently Asked Questions

How large should my chunks be? No single answer — it depends on your documents and queries. Rule of thumb: if queries are short and factual ("What is X?"), use smaller chunks (200-400 tokens). If queries require synthesising information across paragraphs, use larger chunks (800-1200 tokens) with parent-child retrieval. Always evaluate chunk size changes with your RAGAS metrics rather than intuition.
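One way to make that rule of thumb checkable is a rough token estimate using the common ~4-characters-per-token heuristic for English. This is a sketch with an illustrative helper, not a real tokenizer; for exact counts use your embedding model's own tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: English text averages ~4 characters per token.
    return max(1, round(len(text) / 4))

def out_of_range(chunks: list[str], lo: int = 200, hi: int = 400) -> list[int]:
    """Indices of chunks whose estimated token count misses the target window."""
    return [i for i, c in enumerate(chunks) if not lo <= approx_tokens(c) <= hi]
```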

Is RAG secure for handling sensitive enterprise data? RAG can be secure with the right architecture: (1) Namespace isolation in the vector database ensures one tenant cannot retrieve another's documents. (2) Document-level access control on retrieved chunks (filter by user permissions before returning them to the LLM). (3) On-premise or private cloud deployment of both the vector database and LLM for regulatory compliance (HIPAA, GDPR). Open-source models (Llama, Mistral) deployed privately are common for highly sensitive use cases.
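The second point can be sketched as a post-retrieval filter applied before anything reaches the prompt. The `Chunk` shape and the role model here are illustrative assumptions, not a specific vector-database API:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    tenant: str                                   # namespace the chunk belongs to
    allowed_roles: set[str] = field(default_factory=set)

def secure_retrieve(hits: list[Chunk], tenant: str, user_roles: set[str]) -> list[str]:
    """Keep only chunks in the caller's tenant that the caller's roles may read."""
    return [
        c.text for c in hits
        if c.tenant == tenant and (c.allowed_roles & user_roles)
    ]
```

In production the tenant filter belongs in the vector query itself (most vector databases support metadata filters), with this check kept as a defence-in-depth layer.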


Key Takeaway

Naive RAG is easy to build but disappointing in production. Advanced RAG — with parent-child chunking, hybrid search, query rewriting, and cross-encoder re-ranking — delivers the 90%+ accuracy that enterprise use cases require. The evaluation discipline (RAGAS metrics, golden test sets, regression testing on every chunking or retrieval change) is what distinguishes a reliable production RAG system from an impressive demo. Invest in the evaluation infrastructure early — it pays dividends with every improvement you make.

Read next: Agentic AI Architecture: Memory, Tools and Control Loops →


Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.