Building Production RAG Systems: Lessons from the Trenches
After deploying RAG systems across multiple production environments, I've learned that the gap between a working prototype and a reliable production system is vast. This post shares the battle-tested patterns that actually work.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by first retrieving relevant context from a knowledge base, then using that context to generate more accurate, grounded answers. Think of it as giving the LLM a reference library to consult before answering questions.
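The retrieve-then-generate loop is small enough to sketch in full. This is an illustrative skeleton, not a real implementation: `searchKnowledgeBase` and `callLLM` are hypothetical stand-ins for your vector store and model client.

```typescript
// Minimal retrieve-then-generate loop. `searchKnowledgeBase` and `callLLM`
// are hypothetical stand-ins for a real vector store and LLM client.
type Doc = { id: string; content: string };

async function searchKnowledgeBase(query: string, topK: number): Promise<Doc[]> {
  // In production: embed the query and run a vector similarity search.
  return [{ id: "doc-1", content: "RAG retrieves context before generating." }];
}

async function callLLM(prompt: string): Promise<string> {
  // In production: call your model provider with the assembled prompt.
  return `Answer based on: ${prompt.slice(0, 40)}...`;
}

async function answerWithRAG(query: string): Promise<string> {
  // 1. Retrieve relevant context from the knowledge base
  const docs = await searchKnowledgeBase(query, 5);
  // 2. Assemble a grounded prompt
  const context = docs.map((d) => d.content).join("\n\n");
  const prompt = `Context:\n${context}\n\nQuestion: ${query}`;
  // 3. Generate an answer conditioned on that context
  return callLLM(prompt);
}
```

Everything in the rest of this post is refinement of these three steps.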
Why RAG Matters for Production AI
LLMs have a fundamental limitation: their knowledge is frozen at training time. RAG solves this by:
- Grounding responses in your actual data
- Reducing hallucinations by providing factual context
- Enabling updates without retraining the model
- Supporting citations to verify answers
Production Architecture
A production RAG system is more than just "embed and retrieve." Here's the architecture I use:
```typescript
interface RAGSystemConfig {
  // Document Processing
  chunking: {
    strategy: "semantic" | "fixed" | "hybrid";
    maxTokens: number;
    overlap: number;
  };

  // Vector Storage
  vectorStore: {
    provider: "pinecone" | "weaviate" | "qdrant" | "pgvector";
    dimensions: number;
    metric: "cosine" | "dotProduct" | "euclidean";
  };

  // Retrieval
  retrieval: {
    topK: number;
    minScore: number;
    reranker?: {
      model: string;
      topN: number;
    };
  };

  // Generation
  generation: {
    model: string;
    maxTokens: number;
    temperature: number;
    systemPrompt: string;
  };
}

// Example production config
const productionConfig: RAGSystemConfig = {
  chunking: {
    strategy: "semantic",
    maxTokens: 512,
    overlap: 50,
  },
  vectorStore: {
    provider: "pgvector",
    dimensions: 1536, // matches text-embedding-3-small
    metric: "cosine",
  },
  retrieval: {
    topK: 20,
    minScore: 0.7,
    reranker: {
      model: "rerank-english-v3.0",
      topN: 5,
    },
  },
  generation: {
    model: "claude-3-5-sonnet-20241022",
    maxTokens: 2048,
    temperature: 0.3,
    systemPrompt: "Answer based only on the provided context.",
  },
};
```

Semantic Chunking: The Foundation
Fixed-size chunking breaks documents at arbitrary points, often splitting related content. Semantic chunking preserves meaning:
```typescript
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

interface SemanticChunk {
  content: string;
  embedding: number[];
  metadata: {
    source: string;
    section: string;
    pageNumber?: number;
    startOffset: number;
    endOffset: number;
  };
}

type SourceMetadata = Omit<SemanticChunk["metadata"], "startOffset" | "endOffset">;

async function semanticChunk(document: string, metadata: SourceMetadata): Promise<SemanticChunk[]> {
  const embeddings = new OpenAIEmbeddings({
    modelName: "text-embedding-3-small",
  });

  // First pass: split by semantic boundaries
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
    separators: [
      "\n## ", // H2 headers
      "\n### ", // H3 headers
      "\n\n", // Paragraphs
      "\n", // Lines
      ". ", // Sentences
      " ", // Words
    ],
  });

  const chunks = await splitter.splitText(document);

  // Second pass: merge small chunks, split large ones
  const optimizedChunks = await optimizeChunkSizes(chunks, {
    minTokens: 100,
    maxTokens: 512,
    targetTokens: 400,
  });

  // Embed all chunks in one batched call rather than one request per chunk
  const vectors = await embeddings.embedDocuments(optimizedChunks);

  return optimizedChunks.map((content, i) => {
    const startOffset = calculateOffset(document, content);
    return {
      content,
      embedding: vectors[i],
      metadata: {
        ...metadata,
        startOffset,
        endOffset: startOffset + content.length,
      },
    };
  });
}

function calculateOffset(document: string, chunk: string): number {
  // Merged chunks are joined with "\n\n" and may no longer appear verbatim
  // in the source, so fall back to locating the chunk's first line.
  const direct = document.indexOf(chunk);
  if (direct !== -1) return direct;
  const firstLine = chunk.split("\n")[0];
  return Math.max(0, document.indexOf(firstLine));
}

async function optimizeChunkSizes(
  chunks: string[],
  config: { minTokens: number; maxTokens: number; targetTokens: number }
): Promise<string[]> {
  const optimized: string[] = [];
  let buffer = "";

  for (const chunk of chunks) {
    const tokenCount = estimateTokens(chunk);

    if (tokenCount > config.maxTokens) {
      // Split oversized chunks
      if (buffer) {
        optimized.push(buffer);
        buffer = "";
      }
      const subChunks = splitLargeChunk(chunk, config.maxTokens);
      optimized.push(...subChunks);
    } else if (estimateTokens(buffer + chunk) <= config.targetTokens) {
      // Merge small chunks toward the target size
      buffer = buffer ? `${buffer}\n\n${chunk}` : chunk;
    } else {
      if (buffer) optimized.push(buffer);
      buffer = chunk;
    }
  }

  if (buffer) optimized.push(buffer);
  return optimized;
}

function estimateTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English
  return Math.ceil(text.length / 4);
}

function splitLargeChunk(chunk: string, maxTokens: number): string[] {
  const sentences = chunk.split(/(?<=[.!?])\s+/);
  const result: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (estimateTokens(current + sentence) <= maxTokens) {
      current = current ? `${current} ${sentence}` : sentence;
    } else {
      if (current) result.push(current);
      current = sentence;
    }
  }

  if (current) result.push(current);
  return result;
}
```

Pro Tip: Context Windows
When chunking, consider your retrieval topK and the LLM's context window. If you retrieve 10 chunks of 512 tokens
each, that's 5,120 tokens just for context. Leave room for the system prompt, user query, and generated response.
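The arithmetic is worth making explicit. A quick budget check, assuming an illustrative 8,192-token context window (the window size and function name here are assumptions, not part of any API):

```typescript
// Rough token budget for a single RAG request.
interface TokenBudget {
  contextTokens: number;
  remainingForPromptAndAnswer: number;
}

function tokenBudget(topK: number, chunkTokens: number, contextWindow: number): TokenBudget {
  const contextTokens = topK * chunkTokens;
  return {
    contextTokens,
    remainingForPromptAndAnswer: contextWindow - contextTokens,
  };
}

// 10 retrieved chunks of 512 tokens in an assumed 8,192-token window:
const budget = tokenBudget(10, 512, 8192);
// contextTokens = 5120, leaving 3072 tokens for the system prompt,
// user query, and generated response combined
```

If that remainder is tighter than your typical response length, reduce topK or chunk size before touching anything else.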
Reranking: The Secret Weapon
Vector similarity retrieval is fast but imprecise. Reranking uses a cross-encoder to score query-document pairs more accurately:
```typescript
import { CohereClient } from "cohere-ai";

interface RetrievedDocument {
  id: string;
  content: string;
  score: number;
  metadata: Record<string, unknown>;
}

interface RerankResult {
  document: RetrievedDocument;
  relevanceScore: number;
}

// Minimal shape of the vector store used below
interface VectorStore {
  similaritySearch(query: string, topK: number): Promise<RetrievedDocument[]>;
}

async function rerankDocuments(
  query: string,
  documents: RetrievedDocument[],
  topN: number = 5,
  minRelevance: number = 0.5
): Promise<RerankResult[]> {
  const cohere = new CohereClient({
    token: process.env.COHERE_API_KEY!,
  });

  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: documents.map((doc) => doc.content),
    topN,
    returnDocuments: false,
  });

  // Map rerank results back to original documents
  const reranked = response.results.map((result) => ({
    document: documents[result.index],
    relevanceScore: result.relevanceScore,
  }));

  // Drop results below the minimum relevance threshold
  return reranked.filter((result) => result.relevanceScore >= minRelevance);
}

// Complete retrieval pipeline
async function retrieveWithReranking(
  query: string,
  vectorStore: VectorStore,
  config: {
    initialTopK: number;
    finalTopN: number;
    minScore: number;
  }
): Promise<RerankResult[]> {
  // Step 1: Fast vector similarity search over a wide candidate set
  const candidates = await vectorStore.similaritySearch(query, config.initialTopK);

  // Step 2: Precise cross-encoder reranking, filtered to config.minScore
  return rerankDocuments(query, candidates, config.finalTopN, config.minScore);
}
```

Reranking Latency
Reranking adds 100-300ms latency. For real-time applications, consider caching frequently asked queries or using async reranking with streaming responses.
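One way to absorb that latency is a small in-memory cache keyed on the normalized query. A minimal sketch, assuming a plain `Map` with a TTL is sufficient for your traffic (for multi-instance deployments you would swap in Redis or similar):

```typescript
// TTL cache for rerank (or full retrieval) results, keyed on normalized query.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class QueryCache<T> {
  private store = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  // Normalize so trivially different phrasings hit the same entry
  private key(query: string): string {
    return query.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(query: string): T | undefined {
    const k = this.key(query);
    const entry = this.store.get(k);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(k); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(query: string, value: T): void {
    this.store.set(this.key(query), {
      value,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

Check the cache before the vector search, and invalidate on document updates so stale context never outlives your index.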
Evaluation: Measuring What Matters
You can't improve what you don't measure. Here are the metrics I track:
| Metric | Description | Target |
|---|---|---|
| Retrieval Precision@K | % of retrieved docs that are relevant | > 80% |
| Retrieval Recall@K | % of relevant docs that are retrieved | > 90% |
| Answer Faithfulness | Is the answer supported by context? | > 95% |
| Answer Relevance | Does the answer address the query? | > 90% |
| Latency P50 | Median end-to-end response time | < 2s |
| Latency P99 | 99th percentile response time | < 5s |
| Cost per Query | Embedding + LLM + infrastructure | < $0.01 |
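The two retrieval metrics above are simple to compute once you have a labeled test set. A sketch, assuming document IDs are the unit of relevance judgment:

```typescript
// Precision@K and Recall@K for one query, given a labeled set of relevant IDs.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Example: 3 of the top 5 retrieved docs are relevant, out of 4 relevant total
const relevant = new Set(["a", "b", "c", "d"]);
const retrieved = ["a", "x", "b", "y", "c"];
// precisionAtK(retrieved, relevant, 5) === 0.6
// recallAtK(retrieved, relevant, 5) === 0.75
```

Average these across your test set per release, and any chunking or retrieval regression shows up before users see it.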
Key Lessons Learned
After countless production deployments, here's what I've learned:
- Start with evaluation - Build your test set before optimizing anything
- Chunk boundaries matter - Poor chunking causes more failures than any other component
- Reranking is almost always worth it - The latency cost pays for itself in accuracy
- Monitor everything - Retrieval scores, answer lengths, user feedback
- Plan for failure - Have fallbacks when retrieval returns nothing relevant
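That last lesson deserves code: when nothing clears the relevance threshold, return an explicit "I don't know" instead of letting the model answer from nothing. A minimal sketch (the threshold value and fallback message are assumptions to adapt to your product):

```typescript
interface Retrieved {
  content: string;
  score: number;
}

const NO_CONTEXT_MESSAGE =
  "I couldn't find anything relevant in the knowledge base for that question.";

type PromptOrFallback =
  | { kind: "prompt"; prompt: string }
  | { kind: "fallback"; message: string };

function buildPromptOrFallback(
  query: string,
  docs: Retrieved[],
  minScore: number
): PromptOrFallback {
  const usable = docs.filter((d) => d.score >= minScore);
  if (usable.length === 0) {
    // Fail honestly rather than generating an ungrounded answer
    return { kind: "fallback", message: NO_CONTEXT_MESSAGE };
  }
  const context = usable.map((d) => d.content).join("\n\n");
  return { kind: "prompt", prompt: `Context:\n${context}\n\nQuestion: ${query}` };
}
```

Logging how often the fallback fires is itself a useful signal: a rising rate usually means a coverage gap in your corpus, not a retrieval bug.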
Further Reading
For a deeper dive into advanced RAG techniques, check out Anthropic's research on contextual retrieval:
Introducing Contextual Retrieval (www.anthropic.com): Anthropic's research on cutting retrieval failure rates by 49% using contextual embeddings combined with BM25 hybrid search.
Conclusion
Building production RAG systems requires attention to every component: chunking, embedding, retrieval, reranking, and generation. The patterns in this post have been battle-tested across multiple deployments.
Start simple, measure everything, and iterate based on data. The best RAG system is the one that solves your users' actual problems reliably.
Have questions or want to share your own RAG experiences? Reach out on LinkedIn or Twitter.