AI · RAG · LLM · Production · Tutorial

Building Production RAG Systems: Lessons from the Trenches

January 28, 2026 · 7 min read

After deploying RAG systems across multiple production environments, I've learned that the gap between a working prototype and a reliable production system is vast. This post shares the battle-tested patterns that actually work.

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by first retrieving relevant context from a knowledge base, then using that context to generate more accurate, grounded answers. Think of it as giving the LLM a reference library to consult before answering questions.
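The core loop can be sketched in a few lines. `Retriever` and `Generator` below are illustrative stand-ins for your vector store and LLM client, not real library APIs:

```typescript
// Minimal sketch of the RAG loop: retrieve context, then generate.
type Retriever = (query: string, topK: number) => Promise<string[]>;
type Generator = (prompt: string) => Promise<string>;

// Assemble a prompt that grounds the model in the retrieved context
function buildPrompt(query: string, chunks: string[]): string {
  return [
    "Answer based only on the provided context.",
    "Context:",
    ...chunks.map((chunk, i) => `[${i + 1}] ${chunk}`),
    `Question: ${query}`,
  ].join("\n");
}

async function answerWithRAG(
  query: string,
  retrieve: Retriever,
  generate: Generator
): Promise<string> {
  const chunks = await retrieve(query, 5); // 1. fetch relevant chunks
  return generate(buildPrompt(query, chunks)); // 2. answer from that context
}
```

Everything that follows in this post is refinement of these two steps.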

Why RAG Matters for Production AI

LLMs have a fundamental limitation: their knowledge is frozen at training time. RAG solves this by:

  • Grounding responses in your actual data
  • Reducing hallucinations by providing factual context
  • Enabling updates without retraining the model
  • Supporting citations to verify answers

Production Architecture

A production RAG system is more than just "embed and retrieve." Here's the architecture I use:

```typescript
interface RAGSystemConfig {
  // Document Processing
  chunking: {
    strategy: "semantic" | "fixed" | "hybrid";
    maxTokens: number;
    overlap: number;
  };

  // Vector Storage
  vectorStore: {
    provider: "pinecone" | "weaviate" | "qdrant" | "pgvector";
    dimensions: number;
    metric: "cosine" | "dotProduct" | "euclidean";
  };

  // Retrieval
  retrieval: {
    topK: number;
    minScore: number;
    reranker?: {
      model: string;
      topN: number;
    };
  };

  // Generation
  generation: {
    model: string;
    maxTokens: number;
    temperature: number;
    systemPrompt: string;
  };
}

// Example production config
const productionConfig: RAGSystemConfig = {
  chunking: {
    strategy: "semantic",
    maxTokens: 512,
    overlap: 50,
  },
  vectorStore: {
    provider: "pgvector",
    dimensions: 1536,
    metric: "cosine",
  },
  retrieval: {
    topK: 20,
    minScore: 0.7,
    reranker: {
      model: "cohere-rerank-v3",
      topN: 5,
    },
  },
  generation: {
    model: "claude-3-5-sonnet-20241022",
    maxTokens: 2048,
    temperature: 0.3,
    systemPrompt: "Answer based only on the provided context.",
  },
};
```

Semantic Chunking: The Foundation

Fixed-size chunking breaks documents at arbitrary points, often splitting related content. Semantic chunking preserves meaning:

```typescript
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

interface SemanticChunk {
  content: string;
  embedding: number[];
  metadata: {
    source: string;
    section: string;
    pageNumber?: number;
    startOffset: number;
    endOffset: number;
  };
}

async function semanticChunk(
  document: string,
  metadata: Omit<SemanticChunk["metadata"], "startOffset" | "endOffset">
): Promise<SemanticChunk[]> {
  const embeddings = new OpenAIEmbeddings({
    modelName: "text-embedding-3-small",
  });

  // First pass: split by semantic boundaries
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
    separators: [
      "\n## ", // H2 headers
      "\n### ", // H3 headers
      "\n\n", // Paragraphs
      "\n", // Lines
      ". ", // Sentences
      " ", // Words
    ],
  });

  const chunks = await splitter.splitText(document);

  // Second pass: merge small chunks, split large ones
  const optimizedChunks = await optimizeChunkSizes(chunks, {
    minTokens: 100,
    maxTokens: 512,
    targetTokens: 400,
  });

  // Generate embeddings for each chunk
  return Promise.all(
    optimizedChunks.map(async (content) => {
      const startOffset = calculateOffset(document, content);
      return {
        content,
        embedding: await embeddings.embedQuery(content),
        metadata: {
          ...metadata,
          startOffset,
          endOffset: startOffset + content.length,
        },
      };
    })
  );
}

// Caveat: merged chunks are joined with "\n\n" and may not appear
// verbatim in the source document, in which case indexOf returns -1.
function calculateOffset(document: string, chunk: string): number {
  return document.indexOf(chunk);
}

async function optimizeChunkSizes(
  chunks: string[],
  config: { minTokens: number; maxTokens: number; targetTokens: number }
): Promise<string[]> {
  const optimized: string[] = [];
  let buffer = "";

  for (const chunk of chunks) {
    const tokenCount = estimateTokens(chunk);

    if (tokenCount > config.maxTokens) {
      // Split oversized chunks, flushing any buffered content first
      if (buffer) {
        optimized.push(buffer);
        buffer = "";
      }
      optimized.push(...splitLargeChunk(chunk, config.maxTokens));
    } else if (estimateTokens(buffer + chunk) <= config.targetTokens) {
      // Merge small chunks
      buffer = buffer ? `${buffer}\n\n${chunk}` : chunk;
    } else {
      if (buffer) optimized.push(buffer);
      buffer = chunk;
    }
  }

  if (buffer) optimized.push(buffer);
  return optimized;
}

function estimateTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English
  return Math.ceil(text.length / 4);
}

function splitLargeChunk(chunk: string, maxTokens: number): string[] {
  const sentences = chunk.split(/(?<=[.!?])\s+/);
  const result: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (estimateTokens(current + sentence) <= maxTokens) {
      current = current ? `${current} ${sentence}` : sentence;
    } else {
      if (current) result.push(current);
      current = sentence;
    }
  }

  if (current) result.push(current);
  return result;
}
```

Pro Tip: Context Windows

When chunking, consider your retrieval topK and the LLM's context window. If you retrieve 10 chunks of 512 tokens each, that's 5,120 tokens just for context. Leave room for the system prompt, user query, and generated response.
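A quick budget check makes this concrete. The numbers used in the usage note below (an 8,192-token window) are illustrative; substitute your model's actual limits:

```typescript
// Work out how many retrieved chunks fit once fixed costs are reserved.
interface TokenBudget {
  contextWindow: number; // model's total window
  maxOutputTokens: number; // reserved for the generated response
  systemPromptTokens: number;
  queryTokens: number;
}

// Tokens left over for retrieved context
function maxRetrievedTokens(budget: TokenBudget): number {
  return (
    budget.contextWindow -
    budget.maxOutputTokens -
    budget.systemPromptTokens -
    budget.queryTokens
  );
}

// How many chunks of a given size fit in that remainder
function maxChunks(budget: TokenBudget, chunkTokens: number): number {
  return Math.floor(maxRetrievedTokens(budget) / chunkTokens);
}
```

With an 8,192-token window, 2,048 reserved for output, and ~250 tokens of system prompt and query, you have 5,894 tokens left: room for 11 chunks of 512 tokens, not 20.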

Reranking: The Secret Weapon

Vector similarity retrieval is fast but imprecise. Reranking uses a cross-encoder to score query-document pairs more accurately:

```typescript
import { CohereClient } from "cohere-ai";

interface RetrievedDocument {
  id: string;
  content: string;
  score: number;
  metadata: Record<string, unknown>;
}

interface RerankResult {
  document: RetrievedDocument;
  relevanceScore: number;
}

// Minimal vector store interface for this example
interface VectorStore {
  similaritySearch(query: string, k: number): Promise<RetrievedDocument[]>;
}

async function rerankDocuments(
  query: string,
  documents: RetrievedDocument[],
  topN: number = 5
): Promise<RerankResult[]> {
  const cohere = new CohereClient({
    token: process.env.COHERE_API_KEY!,
  });

  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: documents.map((doc) => doc.content),
    topN,
    returnDocuments: false,
  });

  // Map rerank results back to original documents
  const reranked = response.results.map((result) => ({
    document: documents[result.index],
    relevanceScore: result.relevanceScore,
  }));

  // Coarse floor; the pipeline below applies the configurable threshold
  return reranked.filter((result) => result.relevanceScore > 0.5);
}

// Complete retrieval pipeline
async function retrieveWithReranking(
  query: string,
  vectorStore: VectorStore,
  config: {
    initialTopK: number;
    finalTopN: number;
    minScore: number;
  }
): Promise<RerankResult[]> {
  // Step 1: Fast vector similarity search
  const candidates = await vectorStore.similaritySearch(query, config.initialTopK);

  // Step 2: Precise reranking
  const reranked = await rerankDocuments(query, candidates, config.finalTopN);

  // Step 3: Filter by relevance threshold
  return reranked.filter((result) => result.relevanceScore >= config.minScore);
}
```

Reranking Latency

Reranking adds 100-300ms latency. For real-time applications, consider caching frequently asked queries or using async reranking with streaming responses.
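One way to cache frequent queries is a small in-memory map with a TTL, keyed by normalized query text. This is a sketch only; a production deployment would more likely use Redis or another shared store with proper eviction:

```typescript
// TTL cache for rerank results, keyed by normalized query text.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class QueryCache<T> {
  private store = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  // Normalize so trivially different phrasings share a cache slot
  private key(query: string): string {
    return query.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(query: string): T | undefined {
    const k = this.key(query);
    const entry = this.store.get(k);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(k); // lazy eviction of expired entries
      return undefined;
    }
    return entry.value;
  }

  set(query: string, value: T): void {
    this.store.set(this.key(query), {
      value,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

Check the cache before the vector search, and populate it after reranking; even a short TTL removes the rerank round-trip for repeat questions.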

Evaluation: Measuring What Matters

You can't improve what you don't measure. Here are the metrics I track:

| Metric | Description | Target |
| --- | --- | --- |
| Retrieval Precision@K | % of retrieved docs that are relevant | > 80% |
| Retrieval Recall@K | % of relevant docs that are retrieved | > 90% |
| Answer Faithfulness | Is the answer supported by the context? | > 95% |
| Answer Relevance | Does the answer address the query? | > 90% |
| Latency P50 | Median end-to-end response time | < 2s |
| Latency P99 | 99th percentile response time | < 5s |
| Cost per Query | Embedding + LLM + infrastructure | < $0.01 |
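The two retrieval metrics are straightforward to compute once you have a labeled test set mapping each query to its relevant document IDs. A minimal sketch:

```typescript
// Precision@K: of the top-K retrieved docs, what fraction are relevant?
function precisionAtK(
  retrieved: string[],
  relevant: Set<string>,
  k: number
): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / topK.length;
}

// Recall@K: of all relevant docs, what fraction appear in the top K?
function recallAtK(
  retrieved: string[],
  relevant: Set<string>,
  k: number
): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}
```

Faithfulness and relevance typically need an LLM-as-judge or human review; the latency and cost numbers come straight from your observability stack.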

Key Lessons Learned

After countless production deployments, here's what I've learned:

  1. Start with evaluation - Build your test set before optimizing anything
  2. Chunk boundaries matter - Poor chunking causes more failures than any other component
  3. Reranking is almost always worth it - The latency cost pays for itself in accuracy
  4. Monitor everything - Retrieval scores, answer lengths, user feedback
  5. Plan for failure - Have fallbacks when retrieval returns nothing relevant
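The last lesson can be as simple as a guard that refuses to answer when nothing clears the relevance threshold. A sketch, with an illustrative fallback message and document shapes:

```typescript
// Refuse to answer rather than let the model improvise without context.
interface ScoredDoc {
  content: string;
  relevanceScore: number;
}

interface RAGResponse {
  answered: boolean;
  context: string[];
  message?: string;
}

function buildContextOrFallback(
  results: ScoredDoc[],
  minScore: number
): RAGResponse {
  const usable = results.filter((r) => r.relevanceScore >= minScore);
  if (usable.length === 0) {
    // Nothing relevant retrieved: surface that honestly instead of hallucinating
    return {
      answered: false,
      context: [],
      message: "I couldn't find relevant information for that question.",
    };
  }
  return { answered: true, context: usable.map((r) => r.content) };
}
```

Other reasonable fallbacks include widening the search (higher topK, lower threshold) or escalating to a human, but an explicit "I don't know" beats a confident fabrication.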

Further Reading

For a deeper dive into advanced RAG techniques, check out Anthropic's research on contextual retrieval, which reports a 49% reduction in retrieval failure rate from combining contextual embeddings with BM25 hybrid search.
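One common way to merge the vector and keyword (BM25) rankings in a hybrid setup is reciprocal rank fusion. This is a generic sketch of that technique, not Anthropic's specific method:

```typescript
// Reciprocal rank fusion (RRF): combine several ranked lists of doc IDs
// into one, rewarding docs that rank highly in any list.
function reciprocalRankFusion(
  rankings: string[][], // each inner array: doc IDs in ranked order
  k: number = 60 // damping constant; 60 is the conventional default
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // Rank is 0-based here, so the best doc contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Return entries sorted by fused score, best first
  return new Map([...scores.entries()].sort((a, b) => b[1] - a[1]));
}
```

Feed it the vector ranking and the BM25 ranking for the same query, then rerank the fused top candidates as before.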


Conclusion

Building production RAG systems requires attention to every component: chunking, embedding, retrieval, reranking, and generation. The patterns in this post have been battle-tested across multiple deployments.

Start simple, measure everything, and iterate based on data. The best RAG system is the one that solves your users' actual problems reliably.


Have questions or want to share your own RAG experiences? Reach out on LinkedIn or Twitter.