RAG Architecture Patterns: The Complete Enterprise Implementation Guide

Table of Contents
- Why RAG: The LLM Knowledge Gap
- The Three RAG Pipeline Phases
- Phase 1: Ingestion — Chunking Strategies That Matter
- Phase 1: Embedding Models — Choosing the Right One
- Phase 2: Retrieval — Naive vs Hybrid Search
- Advanced RAG: Query Rewriting and HyDE
- Advanced RAG: Cross-Encoder Re-Ranking
- Phase 3: Generation — Prompt Engineering for RAG
- Multi-Tenant RAG with Namespace Isolation
- Evaluating RAG Quality with RAGAS
- The Production RAG Stack
- Frequently Asked Questions
- Key Takeaway
Why RAG: The LLM Knowledge Gap
LLMs have four limitations that RAG addresses:
| Problem | Description | RAG Solution |
|---|---|---|
| Training cutoff | Model doesn't know about recent events | Retrieve from up-to-date knowledge base |
| Private knowledge | Model never saw your internal docs | Retrieve from your document store |
| Hallucination | Model invents plausible-sounding facts | Grounded answers from retrieved context |
| Context length | Cannot fit entire knowledge base in prompt | Retrieve only the relevant K chunks |
RAG vs Fine-tuning rule of thumb: Use RAG for facts (what), fine-tuning for style (how). If you want the model to know your company's Q3 2026 financials, use RAG. If you want the model to write in your brand voice, use fine-tuning.
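The retrieve-augment-generate loop that the table above motivates can be sketched end to end. The `retrieve` and `llm_generate` callables here are hypothetical stand-ins for your retriever and model client:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Augment the user question with retrieved context (grounding)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def answer(question: str, retrieve, llm_generate, k: int = 4) -> str:
    chunks = retrieve(question, k=k)             # Phase 2: retrieval
    prompt = build_rag_prompt(question, chunks)  # augmentation
    return llm_generate(prompt)                  # Phase 3: generation
```

Everything that follows in this guide refines one of these three calls.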
The Three RAG Pipeline Phases
Phase 1: Ingestion — Chunking Strategies That Matter
Chunking is the most underestimated component of RAG quality. Wrong chunk size = wrong context retrieved.
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens, overlap 20% | Quick prototyping only |
| Sentence | Split on sentence boundaries | General text, Q&A |
| Recursive | Split on paragraphs → sentences → words | Structured documents |
| Semantic | Group semantically related sentences | Complex documents, research papers |
| Document-specific | Parse PDFs preserving section structure | Technical manuals, legal docs |
| Parent-child | Store large chunks, retrieve small ones | Maximum precision retrieval |
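The simplest row in the table, fixed-size splitting with 20% overlap, can be sketched in plain Python. Tokens are approximated here by whitespace-split words, an assumption for brevity; real splitters count model tokens:

```python
def fixed_size_chunks(text: str, chunk_size: int = 200,
                      overlap_pct: float = 0.2) -> list[str]:
    """Fixed-size chunking: every `chunk_size` tokens, overlapping by
    `overlap_pct` so sentences cut at a boundary survive in the next chunk.
    Tokens are approximated by whitespace-split words."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_pct)))  # advance 80% per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap is exactly what protects against boundary cuts, and exactly why fixed-size splitting wastes index space, which is why the table relegates it to prototyping.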
```python
# Parent-child chunking (retrieves small chunks, returns parent context):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small child chunks for precision retrieval:
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
# Larger parent chunks for context-rich generation:
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

def ingest_with_parent_child(document: str, doc_id: str):
    parent_chunks = parent_splitter.split_text(document)
    for i, parent in enumerate(parent_chunks):
        parent_id = f"{doc_id}_parent_{i}"
        parent_store.set(parent_id, parent)  # Store full parent in regular DB
        # Create child chunks — each points to its parent:
        for child in child_splitter.split_text(parent):
            vector_store.add(
                text=child,
                metadata={"parent_id": parent_id, "doc_id": doc_id},
            )

# On retrieval: search child chunks, return parent context:
def retrieve(query: str, k: int = 5) -> list[str]:
    child_results = vector_store.similarity_search(query, k=k * 3)  # over-fetch
    parent_ids = {r.metadata["parent_id"] for r in child_results}
    # Return full parent texts (much richer context than child chunks):
    return [parent_store.get(pid) for pid in list(parent_ids)[:k]]
```

Phase 1: Embedding Models — Choosing the Right One
| Model | Dimensions | Cost | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/1M tokens | High | General purpose, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/1M tokens | Highest | Complex queries requiring precision |
| embed-english-v3.0 (Cohere) | 1024 | $0.10/1M tokens | High | Enterprise, multilingual variant |
| nomic-embed-text (local) | 768 | Free (self-hosted) | Good | Privacy-sensitive, cost-constrained |
| bge-large-en (HuggingFace) | 1024 | Free (self-hosted) | High | Open source, MTEB leaderboard |
Critical rule: Use the same embedding model for ingestion and retrieval. Mixing models produces meaningless similarity scores.
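Vector search ranks chunks by cosine similarity between embeddings. A minimal plain-Python sketch, which also shows the obvious failure mode of mixing models, a 1536-dimension index queried with 3072-dimension vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the score behind vector search."""
    if len(a) != len(b):
        # Embeddings from different models often don't even share a dimension
        raise ValueError("embedding dimensions differ; same model for both?")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that the dimension check only catches the blatant case: even when dimensions happen to match, vectors from two different models live in unrelated spaces, so the scores are still noise.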
Phase 2: Retrieval — Naive vs Hybrid Search
Naive (pure vector) search only captures semantic similarity. It misses exact terminology:
```python
# Naive: pure vector search can miss the exact term "JWT" when the user
# writes "JSON Web Tokens" (or vice versa)
results = vector_store.similarity_search("JWT authentication", k=5)

# Hybrid: combines semantic (vector) and keyword (BM25) search:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(docs, k=5)

# Ensemble with Reciprocal Rank Fusion:
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # 60% semantic, 40% keyword
)
# Now "JWT" is matched by BM25 even if embeddings don't capture acronyms
```

Advanced RAG: Query Rewriting and HyDE
Query Rewriting: Users rarely write retrieval-optimal queries. Rewrite before searching:
```python
def rewrite_query(user_query: str) -> list[str]:
    """Generate multiple search queries to improve recall."""
    response = llm.invoke(f"""
Generate 3 different search queries to find information relevant to:
"{user_query}"
Return only the 3 queries, one per line.
""")
    queries = response.content.strip().split("\n")
    return queries[:3]

# Search with all rewritten queries, then merge the results:
all_results = []
for query in rewrite_query(user_query) + [user_query]:
    all_results.extend(vector_store.similarity_search(query, k=3))
# Deduplicate and re-rank the merged results before generation
```

HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, embed it, and use its embedding to search:
```python
def hyde_retrieve(query: str) -> list[str]:
    # Generate a hypothetical ideal answer:
    hypothetical_answer = llm.invoke(
        f"Write a detailed answer to: {query}\n"
        "Base it on knowledge you have. Do not say 'I don't know'."
    ).content
    # Embed the hypothetical answer (not the query):
    hypothesis_embedding = embed(hypothetical_answer)
    # Search the vector store using the hypothetical answer's embedding.
    # This finds documents similar to the "ideal answer" rather than the question.
    return vector_store.similarity_search_by_vector(hypothesis_embedding, k=5)
```

Advanced RAG: Cross-Encoder Re-Ranking
The initial vector search retrieves good candidates but doesn't perfectly rank them. A cross-encoder re-ranks by considering query and document together:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, k: int = 5) -> list[str]:
    # Step 1: Retrieve 20 candidates (over-fetch):
    candidates = vector_store.similarity_search(query, k=20)
    # Step 2: Re-rank with the cross-encoder, which scores query+document jointly:
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)
    # Step 3: Return the top-k after re-ranking:
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc.page_content for doc, _ in ranked[:k]]
```

Re-ranking typically improves answer accuracy by 10-15% at a small latency cost (roughly 50-100 ms of cross-encoder inference).
Evaluating RAG Quality with RAGAS
Never ship RAG without automated evaluation. RAGAS provides key metrics:
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Build the evaluation dataset:
eval_data = {
    "question": ["What is our refund policy?", "How do I reset my password?"],
    "answer": [generated_answers[0], generated_answers[1]],
    "contexts": [retrieved_contexts[0], retrieved_contexts[1]],
    "ground_truth": ["Refunds within 30 days...", "Click 'Forgot Password'..."],
}
results = evaluate(Dataset.from_dict(eval_data), metrics=[
    faithfulness,        # Is the answer grounded in the retrieved context?
    answer_relevancy,    # Is the answer relevant to the question?
    context_precision,   # Is the retrieved context useful?
    context_recall,      # Was the relevant context actually retrieved?
])
# Target: faithfulness > 0.9, answer_relevancy > 0.85, context_recall > 0.8
```

Frequently Asked Questions
How large should my chunks be? No single answer — it depends on your documents and queries. Rule of thumb: if queries are short and factual ("What is X?"), use smaller chunks (200-400 tokens). If queries require synthesising information across paragraphs, use larger chunks (800-1200 tokens) with parent-child retrieval. Always evaluate chunk size changes with your RAGAS metrics rather than intuition.
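The sizes above are in tokens, not characters. For a quick estimate without loading a tokenizer, a common heuristic (an assumption, roughly valid for English prose) is about four characters per token; use your embedding model's actual tokenizer for exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose.
    Use the embedding model's real tokenizer for exact counts."""
    return max(1, round(len(text) / 4))
```

By this heuristic, a 1,600-character paragraph is roughly 400 tokens, at the top of the "small chunk" range.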
Is RAG secure for handling sensitive enterprise data? RAG can be secure with the right architecture: (1) Namespace isolation in the vector database ensures one tenant cannot retrieve another's documents. (2) Document-level access control on retrieved chunks (filter by user permissions before returning them). (3) On-premise or private-cloud deployment of both the vector database and the LLM for regulatory compliance (HIPAA, GDPR). Open-source models (Llama, Mistral) deployed privately are common for highly sensitive use cases.
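The isolation and permission checks described in (1) and (2) can be sketched as a filter on retrieved chunks. The `Chunk` type and metadata fields here are hypothetical; in production, apply the tenant filter inside the vector-database query itself (namespace or metadata filter) so cross-tenant data never leaves the store, and treat this post-hoc check as defense in depth:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    tenant_id: str
    allowed_roles: set[str] = field(default_factory=set)  # empty = no restriction

def secure_filter(results: list[Chunk], tenant_id: str,
                  user_roles: set[str]) -> list[str]:
    """Enforce namespace isolation and permission filtering before returning."""
    visible = []
    for chunk in results:
        if chunk.tenant_id != tenant_id:  # namespace isolation
            continue
        if chunk.allowed_roles and not (chunk.allowed_roles & user_roles):
            continue                      # user lacks every required role
        visible.append(chunk.text)
    return visible
```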
Key Takeaway
Naive RAG is easy to build but disappointing in production. Advanced RAG — with parent-child chunking, hybrid search, query rewriting, and cross-encoder re-ranking — delivers the 90%+ accuracy that enterprise use cases require. The evaluation discipline (RAGAS metrics, golden test sets, regression testing on every chunking or retrieval change) is what distinguishes a reliable production RAG system from an impressive demo. Invest in the evaluation infrastructure early — it pays dividends with every improvement you make.
Read next: Agentic AI Architecture: Memory, Tools and Control Loops →
Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.
