Building Production RAG Systems: Lessons from the Trenches
After deploying RAG systems across multiple production environments, I've learned that the gap between a working prototype and a reliable production system is vast. This post shares the battle-tested patterns that actually work.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by first retrieving relevant context from a knowledge base, then using that context to generate more accurate, grounded answers. Think of it as giving the LLM a reference library to consult before answering questions.
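The retrieve-then-generate loop is small enough to sketch in full. This is an illustrative skeleton, not a real implementation: `searchKnowledgeBase` and `callLLM` are hypothetical stand-ins for your vector store and model client.

```typescript
// Minimal retrieve-then-generate loop. `searchKnowledgeBase` and `callLLM`
// are hypothetical stand-ins for a real vector store and LLM client.
type Doc = { id: string; content: string };

async function searchKnowledgeBase(query: string, topK: number): Promise<Doc[]> {
  // In production: embed the query and run a vector similarity search.
  return [{ id: "doc-1", content: "RAG retrieves context before generating." }];
}

async function callLLM(prompt: string): Promise<string> {
  // In production: call your model provider with the assembled prompt.
  return `Answer based on: ${prompt.slice(0, 40)}...`;
}

async function answerWithRAG(query: string): Promise<string> {
  // 1. Retrieve relevant context from the knowledge base
  const docs = await searchKnowledgeBase(query, 5);
  // 2. Assemble a grounded prompt
  const context = docs.map((d) => d.content).join("\n\n");
  const prompt = `Context:\n${context}\n\nQuestion: ${query}`;
  // 3. Generate an answer conditioned on that context
  return callLLM(prompt);
}
```

Everything in the rest of this post is refinement of these three steps.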
Why RAG Matters for Production AI
LLMs have a fundamental limitation: their knowledge is frozen at training time. RAG solves this by:
- Grounding responses in your actual data
- Reducing hallucinations by providing factual context
- Enabling updates without retraining the model
- Supporting citations to verify answers
Production Architecture
A production RAG system is more than just "embed and retrieve." Here's the architecture I use:
```typescript
interface RAGSystemConfig {
  // Document Processing
  chunking: {
    strategy: "semantic" | "fixed" | "hybrid";
    maxTokens: number;
    overlap: number;
  };

  // Vector Storage
  vectorStore: {
    provider: "pinecone" | "weaviate" | "qdrant" | "pgvector";
    dimensions: number;
    metric: "cosine" | "dotProduct" | "euclidean";
  };

  // Retrieval
  retrieval: {
    topK: number;
    minScore: number;
    reranker?: {
      model: string;
      topN: number;
    };
  };

  // Generation
  generation: {
    model: string;
    maxTokens: number;
    temperature: number;
    systemPrompt: string;
  };
}

// Example production config
const productionConfig: RAGSystemConfig = {
  chunking: {
    strategy: "semantic",
    maxTokens: 512,
    overlap: 50,
  },
  vectorStore: {
    provider: "pgvector",
    dimensions: 1536, // matches text-embedding-3-small
    metric: "cosine",
  },
  retrieval: {
    topK: 20,
    minScore: 0.7,
    reranker: {
      model: "rerank-english-v3.0",
      topN: 5,
    },
  },
  generation: {
    model: "claude-3-5-sonnet-20241022",
    maxTokens: 2048,
    temperature: 0.3,
    systemPrompt: "Answer based only on the provided context.",
  },
};
```

Semantic Chunking: The Foundation
Fixed-size chunking breaks documents at arbitrary points, often splitting related content. Semantic chunking preserves meaning:
```typescript
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

interface SemanticChunk {
  content: string;
  embedding: number[];
  metadata: {
    source: string;
    section: string;
    pageNumber?: number;
    startOffset: number;
    endOffset: number;
  };
}

type SourceMetadata = Omit<SemanticChunk["metadata"], "startOffset" | "endOffset">;

async function semanticChunk(document: string, metadata: SourceMetadata): Promise<SemanticChunk[]> {
  const embeddings = new OpenAIEmbeddings({
    modelName: "text-embedding-3-small",
  });

  // First pass: split by semantic boundaries
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
    separators: [
      "\n## ", // H2 headers
      "\n### ", // H3 headers
      "\n\n", // Paragraphs
      "\n", // Lines
      ". ", // Sentences
      " ", // Words
    ],
  });

  const chunks = await splitter.splitText(document);

  // Second pass: merge small chunks, split large ones
  const optimizedChunks = await optimizeChunkSizes(chunks, {
    minTokens: 100,
    maxTokens: 512,
    targetTokens: 400,
  });

  // Embed all chunks in one batched call rather than one request per chunk
  const vectors = await embeddings.embedDocuments(optimizedChunks);

  return optimizedChunks.map((content, i) => {
    const startOffset = calculateOffset(document, content);
    return {
      content,
      embedding: vectors[i],
      metadata: {
        ...metadata,
        startOffset,
        endOffset: startOffset + content.length,
      },
    };
  });
}

function calculateOffset(document: string, chunk: string): number {
  // Merged chunks are joined with "\n\n" and may no longer appear verbatim
  // in the source, so fall back to locating the chunk's first line.
  const direct = document.indexOf(chunk);
  if (direct !== -1) return direct;
  const firstLine = chunk.split("\n")[0];
  return Math.max(0, document.indexOf(firstLine));
}

async function optimizeChunkSizes(
  chunks: string[],
  config: { minTokens: number; maxTokens: number; targetTokens: number }
): Promise<string[]> {
  const optimized: string[] = [];
  let buffer = "";

  for (const chunk of chunks) {
    const tokenCount = estimateTokens(chunk);

    if (tokenCount > config.maxTokens) {
      // Split oversized chunks
      if (buffer) {
        optimized.push(buffer);
        buffer = "";
      }
      const subChunks = splitLargeChunk(chunk, config.maxTokens);
      optimized.push(...subChunks);
    } else if (estimateTokens(buffer + chunk) <= config.targetTokens) {
      // Merge small chunks toward the target size
      buffer = buffer ? `${buffer}\n\n${chunk}` : chunk;
    } else {
      if (buffer) optimized.push(buffer);
      buffer = chunk;
    }
  }

  if (buffer) optimized.push(buffer);
  return optimized;
}

function estimateTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English
  return Math.ceil(text.length / 4);
}

function splitLargeChunk(chunk: string, maxTokens: number): string[] {
  const sentences = chunk.split(/(?<=[.!?])\s+/);
  const result: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (estimateTokens(current + sentence) <= maxTokens) {
      current = current ? `${current} ${sentence}` : sentence;
    } else {
      if (current) result.push(current);
      current = sentence;
    }
  }

  if (current) result.push(current);
  return result;
}
```

Pro Tip: Context Windows
When chunking, consider your retrieval topK and the LLM's context window. If you retrieve 10 chunks of 512 tokens
each, that's 5,120 tokens just for context. Leave room for the system prompt, user query, and generated response.
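The arithmetic is worth making explicit. A quick budget check, assuming an illustrative 8,192-token context window (the window size and function name here are assumptions, not part of any API):

```typescript
// Rough token budget for a single RAG request.
interface TokenBudget {
  contextTokens: number;
  remainingForPromptAndAnswer: number;
}

function tokenBudget(topK: number, chunkTokens: number, contextWindow: number): TokenBudget {
  const contextTokens = topK * chunkTokens;
  return {
    contextTokens,
    remainingForPromptAndAnswer: contextWindow - contextTokens,
  };
}

// 10 retrieved chunks of 512 tokens in an assumed 8,192-token window:
const budget = tokenBudget(10, 512, 8192);
// contextTokens = 5120, leaving 3072 tokens for the system prompt,
// user query, and generated response combined
```

If that remainder is tighter than your typical response length, reduce topK or chunk size before touching anything else.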
Reranking: The Secret Weapon
Vector similarity retrieval is fast but imprecise. Reranking uses a cross-encoder to score query-document pairs more accurately:
```typescript
import { CohereClient } from "cohere-ai";

interface RetrievedDocument {
  id: string;
  content: string;
  score: number;
  metadata: Record<string, unknown>;
}

interface RerankResult {
  document: RetrievedDocument;
  relevanceScore: number;
}

// Minimal shape of the vector store used below
interface VectorStore {
  similaritySearch(query: string, topK: number): Promise<RetrievedDocument[]>;
}

async function rerankDocuments(
  query: string,
  documents: RetrievedDocument[],
  topN: number = 5,
  minRelevance: number = 0.5
): Promise<RerankResult[]> {
  const cohere = new CohereClient({
    token: process.env.COHERE_API_KEY!,
  });

  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: documents.map((doc) => doc.content),
    topN,
    returnDocuments: false,
  });

  // Map rerank results back to original documents
  const reranked = response.results.map((result) => ({
    document: documents[result.index],
    relevanceScore: result.relevanceScore,
  }));

  // Drop results below the minimum relevance threshold
  return reranked.filter((result) => result.relevanceScore >= minRelevance);
}

// Complete retrieval pipeline
async function retrieveWithReranking(
  query: string,
  vectorStore: VectorStore,
  config: {
    initialTopK: number;
    finalTopN: number;
    minScore: number;
  }
): Promise<RerankResult[]> {
  // Step 1: Fast vector similarity search over a wide candidate set
  const candidates = await vectorStore.similaritySearch(query, config.initialTopK);

  // Step 2: Precise cross-encoder reranking, filtered to config.minScore
  return rerankDocuments(query, candidates, config.finalTopN, config.minScore);
}
```

Reranking Latency
Reranking adds 100-300ms latency. For real-time applications, consider caching frequently asked queries or using async reranking with streaming responses.
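One way to absorb that latency is a small in-memory cache keyed on the normalized query. A minimal sketch, assuming a plain `Map` with a TTL is sufficient for your traffic (for multi-instance deployments you would swap in Redis or similar):

```typescript
// TTL cache for rerank (or full retrieval) results, keyed on normalized query.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class QueryCache<T> {
  private store = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  // Normalize so trivially different phrasings hit the same entry
  private key(query: string): string {
    return query.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(query: string): T | undefined {
    const k = this.key(query);
    const entry = this.store.get(k);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(k); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(query: string, value: T): void {
    this.store.set(this.key(query), {
      value,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

Check the cache before the vector search, and invalidate on document updates so stale context never outlives your index.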
Evaluation: Measuring What Matters
You can't improve what you don't measure. Here are the metrics I track:
| Metric | Description | Target |
|---|---|---|
| Retrieval Precision@K | % of retrieved docs that are relevant | > 80% |
| Retrieval Recall@K | % of relevant docs that are retrieved | > 90% |
| Answer Faithfulness | Is the answer supported by context? | > 95% |
| Answer Relevance | Does the answer address the query? | > 90% |
| Latency P50 | Median end-to-end response time | < 2s |
| Latency P99 | 99th percentile response time | < 5s |
| Cost per Query | Embedding + LLM + infrastructure | < $0.01 |
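The two retrieval metrics above are simple to compute once you have a labeled test set. A sketch, assuming document IDs are the unit of relevance judgment:

```typescript
// Precision@K and Recall@K for one query, given a labeled set of relevant IDs.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Example: 3 of the top 5 retrieved docs are relevant, out of 4 relevant total
const relevant = new Set(["a", "b", "c", "d"]);
const retrieved = ["a", "x", "b", "y", "c"];
// precisionAtK(retrieved, relevant, 5) === 0.6
// recallAtK(retrieved, relevant, 5) === 0.75
```

Average these across your test set per release, and any chunking or retrieval regression shows up before users see it.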
Key Lessons Learned
After countless production deployments, here's what I've learned:
- Start with evaluation - Build your test set before optimizing anything
- Chunk boundaries matter - Poor chunking causes more failures than any other component
- Reranking is almost always worth it - The latency cost pays for itself in accuracy
- Monitor everything - Retrieval scores, answer lengths, user feedback
- Plan for failure - Have fallbacks when retrieval returns nothing relevant
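That last lesson deserves code: when nothing clears the relevance threshold, return an explicit "I don't know" instead of letting the model answer from nothing. A minimal sketch (the threshold value and fallback message are assumptions to adapt to your product):

```typescript
interface Retrieved {
  content: string;
  score: number;
}

const NO_CONTEXT_MESSAGE =
  "I couldn't find anything relevant in the knowledge base for that question.";

type PromptOrFallback =
  | { kind: "prompt"; prompt: string }
  | { kind: "fallback"; message: string };

function buildPromptOrFallback(
  query: string,
  docs: Retrieved[],
  minScore: number
): PromptOrFallback {
  const usable = docs.filter((d) => d.score >= minScore);
  if (usable.length === 0) {
    // Fail honestly rather than generating an ungrounded answer
    return { kind: "fallback", message: NO_CONTEXT_MESSAGE };
  }
  const context = usable.map((d) => d.content).join("\n\n");
  return { kind: "prompt", prompt: `Context:\n${context}\n\nQuestion: ${query}` };
}
```

Logging how often the fallback fires is itself a useful signal: a rising rate usually means a coverage gap in your corpus, not a retrieval bug.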
Further Reading
For a deeper dive into advanced RAG techniques, check out Anthropic's research on contextual retrieval:
Introducing Contextual Retrieval (www.anthropic.com): Anthropic's research on cutting retrieval failure rates by 49% using contextual embeddings combined with BM25 hybrid search.
Conclusion
Building production RAG systems requires attention to every component: chunking, embedding, retrieval, reranking, and generation. The patterns in this post have been battle-tested across multiple deployments.
Start simple, measure everything, and iterate based on data. The best RAG system is the one that solves your users' actual problems reliably.
Have questions or want to share your own RAG experiences? Reach out on LinkedIn or Twitter.