Why RAG Still Matters in 2026
The question comes up every quarter: "With 1M-token context windows, do we still need RAG?" The answer is a clear yes, for three reasons:
- Cost: Stuffing 500,000 tokens of context into every API call costs 10-50x more than retrieving the 5 relevant chunks you actually need.
- Accuracy: Counterintuitively, models often perform worse with more context. The well-documented "lost in the middle" effect means facts buried deep in a long prompt are frequently missed, and irrelevant passages distract the model. Targeted retrieval (5-10 highly relevant chunks) consistently produces more accurate answers than dumping an entire knowledge base into context.
- Latency: Processing 1M tokens takes 10-30 seconds. Retrieving 5 chunks and generating an answer takes 1-3 seconds. Users won't wait.
RAG done right remains the gold standard for knowledge-grounded AI applications — customer support bots, internal knowledge bases, document Q&A, legal research, medical reference systems, and any application where the AI needs to cite sources accurately.
The Complete RAG Pipeline
User Query
↓
[1] Query Understanding & Rewriting
↓
[2] Embedding Generation (text-embedding-3-large / voyage-3)
↓
[3] Hybrid Search: Vector (semantic) + BM25 (keyword)
↓
[4] Reranking (Cohere Rerank 3.5 / cross-encoder)
↓
[5] Context Assembly & Prompt Construction
↓
[6] LLM Generation (Claude Sonnet / GPT-4o)
↓
[7] Citation Extraction & Response Formatting
↓
Response with Sources
Most tutorials skip steps 1, 3, 4, and 7. These are the steps that separate a demo from a production system.
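To make the flow concrete, here is a minimal orchestration sketch. Every helper here is a hypothetical stand-in for one of the stages detailed below, not a specific library's API:

// Hypothetical helpers, one per pipeline stage (signatures illustrative):
declare function rewriteQuery(q: string): Promise<string>;
declare function embedQuery(q: string): Promise<number[]>;
declare function hybridSearch(v: number[], q: string, opts: { topK: number }): Promise<string[]>;
declare function rerank(q: string, docs: string[], opts: { topN: number }): Promise<string[]>;
declare function buildPrompt(chunks: string[], q: string): string;
declare function generate(prompt: string): Promise<string>;
declare function extractCitations(answer: string): string[];

// End-to-end pipeline: each step maps to one numbered stage above.
async function answerQuery(userQuery: string): Promise<{ answer: string; sources: string[] }> {
  const rewritten = await rewriteQuery(userQuery);                         // [1]
  const queryEmbedding = await embedQuery(rewritten);                      // [2]
  const candidates = await hybridSearch(queryEmbedding, rewritten, { topK: 20 }); // [3] over-fetch
  const topChunks = await rerank(rewritten, candidates, { topN: 5 });      // [4] trim to the best
  const prompt = buildPrompt(topChunks, userQuery);                        // [5]
  const answer = await generate(prompt);                                   // [6]
  return { answer, sources: extractCitations(answer) };                    // [7]
}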
Step 1: Document Ingestion & Chunking
How you split documents is the single most impactful variable in your entire RAG pipeline. Get chunking wrong and nothing downstream can compensate.
Fixed-Size Chunking (Baseline)
Split every N tokens with overlap. Simple, predictable, and often sufficient for homogeneous content:
// Fixed-size with overlap
const chunks = splitText(document, {
chunkSize: 512, // tokens per chunk
chunkOverlap: 64, // overlap between adjacent chunks
separator: "\n\n", // prefer splitting at paragraph boundaries
});
When to use: Uniform content like product descriptions, FAQ entries, or standardized reports.
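The splitText call above is illustrative rather than a specific library's API. A minimal sketch of the same idea, approximating tokens with whitespace-delimited words and omitting the paragraph-boundary preference for brevity:

// Hypothetical fixed-size splitter: greedy windows over the word stream,
// with word count as a rough proxy for token count.
function splitFixed(text: string, chunkSize = 512, overlap = 64): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window reached
  }
  return chunks;
}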
Semantic Chunking (Recommended for Most Cases)
Split at semantic boundaries — topic shifts, section headers, paragraph breaks — rather than arbitrary token counts. This ensures each chunk is a coherent unit of meaning:
// Semantic chunking with LlamaIndex
const parser = new SemanticSplitterNodeParser({
embedModel: new OpenAIEmbedding({ model: "text-embedding-3-small" }),
breakpointPercentileThreshold: 85, // sensitivity to topic shifts
bufferSize: 1, // sentences of overlap
});
When to use: Long-form content, technical documentation, research papers, legal documents.
Hierarchical Chunking (Best for Complex Documents)
Store both a summary chunk (parent) and detail chunks (children). Retrieve summaries first for broad matching, then drill into children for specific answers:
// Parent: "Section 3 covers authentication, including OAuth2, SAML, and API keys..."
// Child 1: "OAuth2 implementation requires registering a client ID..."
// Child 2: "SAML integration uses the following XML configuration..."
// Child 3: "API key rotation should happen every 90 days..."
This dramatically reduces noise — the parent catches the broad query, and children provide precise answers.
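A sketch of how retrieval over such a hierarchy might work, assuming each stored chunk carries a parentId and a level field (both illustrative, not a particular vector store's schema):

interface Chunk {
  id: string;
  parentId?: string;
  level: "parent" | "child";
  text: string;
}

// Hypothetical search helper; swap in your vector store's query call.
declare function vectorSearch(
  embedding: number[],
  opts: { filter: Record<string, unknown>; topK: number }
): Promise<Chunk[]>;

// Two-stage hierarchical retrieval: match broad parent summaries first,
// then search only the children of the matched parents.
async function hierarchicalRetrieve(
  queryEmbedding: number[],
  topParents = 3,
  topChildren = 5
): Promise<Chunk[]> {
  const parents = await vectorSearch(queryEmbedding, {
    filter: { level: "parent" },
    topK: topParents,
  });
  return vectorSearch(queryEmbedding, {
    filter: { level: "child", parentId: parents.map((p) => p.id) },
    topK: topChildren,
  });
}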
Chunking Anti-Patterns to Avoid
- Splitting mid-sentence: Destroys meaning. Always split at sentence or paragraph boundaries.
- Splitting tables and code blocks: A table row without its header is useless. Keep structured content together.
- No metadata: Every chunk should carry its source document title, section header, page number, and any relevant metadata (date, author, category). This enables filtering; a concrete example follows this list.
- Ignoring document structure: A PDF with headers, subheaders, and bullets has natural boundaries. Use them.
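For instance, a chunk record might look like this (field names are illustrative; adapt them to your store's schema):

// Every chunk travels with enough metadata to filter and cite.
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
  metadata: {
    title: string;        // source document title
    section: string;      // nearest section header
    page?: number;        // page number, when the source is paginated
    author?: string;
    category: string;     // powers metadata filtering at query time
    publishedAt: number;  // unix timestamp; enables recency filters
  };
}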
Step 2: Embedding Generation
In 2026, the top embedding models for RAG are:
- OpenAI text-embedding-3-large (3072 dims) — best general-purpose, good price/performance
- Voyage AI voyage-3 (1024 dims) — excellent for code and technical content
- Cohere embed-v4 — strong multilingual support
Important: embed queries and documents differently when your model supports it. The convention varies by provider: some open models (Nomic and E5-style) use literal text prefixes, Cohere and Voyage take an input_type parameter instead, and OpenAI's models need neither:
// Document embedding (at ingestion time).
// The "search_document:" / "search_query:" prefixes shown here follow
// the Nomic-style convention; check your embedding model's docs.
const docEmbedding = await embed("search_document: " + chunkText);
// Query embedding (at search time)
const queryEmbedding = await embed("search_query: " + userQuestion);
Step 3: Hybrid Search
Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. Always combine both for production systems:
// Pinecone hybrid search. Pinecone's query API has no alpha parameter;
// you apply the semantic/keyword balance yourself by scaling the dense
// and sparse vectors before querying (a convex combination).
const alpha = 0.7; // 70% semantic, 30% keyword
const results = await index.query({
  vector: queryEmbedding.map((v) => v * alpha), // semantic similarity
  sparseVector: {
    indices: bm25SparseVector.indices,
    values: bm25SparseVector.values.map((v) => v * (1 - alpha)), // keyword matching
  },
  topK: 20, // retrieve more than you need; the reranker trims later
  filter: {
    category: { $eq: userCategory }, // metadata filtering
    publishedAt: { $gte: 1735689600 }, // recency bias: unix timestamp for 2025-01-01 ($gte needs a number)
  },
});
The alpha weighting controls the balance. For technical queries where exact terms matter (error codes, API names), shift toward keyword (alpha = 0.4). For conceptual questions, shift toward semantic (alpha = 0.8).
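One way to act on this is a small heuristic that inspects the query before search. The regexes below are an illustrative starting point, not a tuned classifier:

// Rough heuristic: exact-match-looking queries lean keyword,
// everything else leans semantic.
function pickAlpha(query: string): number {
  const looksExact =
    /[A-Z]{2,}-\d+/.test(query) ||  // error codes like "ERR-1042"
    /\w+\(\)/.test(query) ||        // function names like "connect()"
    /"[^"]+"/.test(query);          // quoted phrases
  return looksExact ? 0.4 : 0.8;
}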
Step 4: Reranking — The Secret Weapon
This is the single most impactful improvement you can make to an existing RAG system. Retrieve 20 chunks with hybrid search, then rerank to the top 5 using a cross-encoder model:
import { CohereClient } from "cohere-ai";
const cohere = new CohereClient({ token: COHERE_API_KEY });
// Rerank the initial retrieval results
const reranked = await cohere.rerank({
model: "rerank-v3.5",
query: userQuestion,
documents: initialResults.map(r => r.text),
topN: 5, // keep only the top 5 after reranking
});
// Use reranked results for context
const context = reranked.results
.map(r => initialResults[r.index].text)
.join("\n\n");
Why reranking works: vector similarity is a rough approximation of relevance. Cross-encoders process the query and document together, understanding the actual relationship between them. In benchmarks, reranking consistently improves answer accuracy by 15-30%.
Step 5: Context Assembly
How you present retrieved chunks to the LLM matters more than most people realize:
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const systemPrompt = `You are a helpful assistant that answers questions based on the provided context.
RULES:
- Only answer based on the provided context. If the context doesn't contain the answer, say "I don't have information about that."
- Cite your sources using [Source: document_name] format.
- If multiple sources agree, synthesize them. If they conflict, acknowledge the discrepancy.
- Never make up information not present in the context.
CONTEXT:
${chunks.map((chunk) => `
--- Source: ${chunk.metadata.title} (Page ${chunk.metadata.page}) ---
${chunk.text}
`).join("\n")}`;
const result = await generateText({
model: anthropic("claude-sonnet-4-6"),
system: systemPrompt,
prompt: userQuestion,
});
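The citation format requested in the system prompt makes step 7 mechanical. A minimal sketch for pulling the [Source: ...] markers out of the response (extractCitations is our own helper, not a library call):

// Extract and deduplicate the [Source: document_name] citations
// the model was instructed to emit.
function extractCitations(answer: string): string[] {
  const matches = answer.matchAll(/\[Source:\s*([^\]]+)\]/g);
  return [...new Set([...matches].map((m) => m[1].trim()))];
}

const sources = extractCitations(result.text); // generateText returns { text, ... }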
Evaluation Framework
You can't improve what you don't measure. Track these metrics using RAGAS or custom evaluation pipelines:
| Metric | What It Measures | Target |
|---|---|---|
| Faithfulness | Does the answer stick to the retrieved context? | > 0.95 |
| Answer Relevancy | Does the answer address the actual question? | > 0.90 |
| Context Recall | Did retrieval find the right documents? | > 0.85 |
| Context Precision | Are retrieved documents relevant (no noise)? | > 0.80 |
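RAGAS itself is a Python library. In a TypeScript stack like the one in this article, you can approximate its faithfulness metric with an LLM-as-judge check; the prompt wording and grading scheme below are assumptions, not the RAGAS implementation:

import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// LLM-as-judge faithfulness check: what fraction of the answer's
// claims are directly supported by the retrieved context?
async function faithfulnessScore(answer: string, context: string): Promise<number> {
  const { text } = await generateText({
    model: anthropic("claude-sonnet-4-6"),
    prompt:
      `Context:\n${context}\n\nAnswer:\n${answer}\n\n` +
      `What fraction of the factual claims in the Answer are directly ` +
      `supported by the Context? Reply with a single number between 0 and 1.`,
  });
  return parseFloat(text.trim());
}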
Common Failure Modes and Fixes
- "The AI says it doesn't know, but the answer is in the docs" — your chunking broke the relevant paragraph, or your embedding model doesn't understand the query's intent. Fix: try semantic chunking and add query rewriting.
- "The AI hallucinates despite having context" — not enough relevant context retrieved, or the context is ambiguous. Fix: increase retrieval count, add reranking, and strengthen the "only answer from context" instruction.
- "Results are irrelevant" — missing metadata filters. A query about "Python" is matching documents about the snake because there's no category filter. Fix: add structured metadata to every chunk.
- "Answers are outdated" — stale embeddings. Fix: implement a re-ingestion pipeline that updates embeddings when source documents change.
- "It's too slow" — embedding + search + reranking + generation adds up. Fix: cache embeddings for repeated queries, use async reranking, and stream the LLM response.
We build AI-powered knowledge systems for businesses — from document Q&A to customer support agents. Book a free consultation →