December 9, 2025
After running RAG in production across a dozen enterprise deployments, the pattern is clear: most teams get retrieval wrong before they ever touch the LLM. The chunking strategy, embedding model choice, and reranking layer together determine whether your p95 retrieval latency stays under 400ms or blows past 1.2 seconds — and whether the context you're stuffing into an 8k window is actually the right 8k tokens.
Fixed-size chunking at 512-token windows with 10% overlap is the default everywhere. It's also wrong for most corpora, and we're tired of watching teams ship it to production without a second thought. Technical documentation, legal contracts, and support transcripts have completely different semantic structure — treating them identically is how you end up with garbage retrieval on day one.
For code documentation, function-level chunking — splitting on class and method boundaries rather than token counts — drops retrieval precision failures by roughly 30% compared to naive fixed-size splits. For long-form narrative content, semantic chunking using sentence-transformer similarity scores to detect topic shifts produces chunks that actually correspond to coherent ideas. The LlamaIndex SemanticSplitterNodeParser implements this reasonably well out of the box.
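To make the mechanism concrete, here's a minimal hand-rolled version of similarity-based splitting — sentence segmentation, the 0.55 threshold, and the MiniLM model choice are all illustrative assumptions; the LlamaIndex parser handles these details for you:

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.55) -> list[str]:
    # Embed each sentence, then start a new chunk wherever adjacent-sentence
    # similarity drops below the threshold (a rough proxy for a topic shift).
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(emb[i - 1], emb[i]))  # cosine similarity; vectors are unit-normalized
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks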
The uglier production problem is chunk size variance. If you're using pgvector with cosine similarity, wildly different chunk sizes create ranking artifacts where short, dense chunks consistently outscore longer, informationally richer ones. A practical fix: store chunk character length as metadata and apply a logarithmic length normalization factor at query time. Not elegant. Works.
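The exact shape of that factor is a judgment call; one possibility, with the log form and the 2,000-character target both being assumptions to tune against your own corpus:

import math

def length_normalized_score(cosine_score: float, chunk_chars: int, target_chars: int = 2000) -> float:
    # Dampen the ranking advantage of very short, dense chunks: scale the raw
    # cosine score by a log factor of chunk length relative to a target size,
    # capped at 1.0 so long chunks are never boosted above their raw score.
    factor = math.log1p(chunk_chars) / math.log1p(target_chars)
    return cosine_score * min(factor, 1.0)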
OpenAI's text-embedding-3-large at 3072 dimensions is not the right default for an enterprise deployment processing 50 million documents. At $0.00013 per 1k tokens, embedding even a 10M-document slice of that corpus at an average of 400 tokens per document costs roughly $520 just to build the index — and that's before you re-embed on model upgrades, which you will do. For most retrieval tasks, a fine-tuned bge-large-en-v1.5 running on-prem via a Ray Serve endpoint matches or exceeds OpenAI embedding quality on domain-specific content at 1/15th the ongoing cost. A platform team we worked with last quarter made this switch mid-project and cut their annual embedding spend by more than their entire GPU rental bill.
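The Ray Serve side doesn't need much. A sketch along these lines, where the replica count, GPU allocation, and request shape are assumptions to adapt:

from ray import serve
from sentence_transformers import SentenceTransformer
from starlette.requests import Request

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class BGEEmbedder:
    def __init__(self):
        # bge-large-en-v1.5 via sentence-transformers; swap in your fine-tuned checkpoint
        self.model = SentenceTransformer("BAAI/bge-large-en-v1.5")

    async def __call__(self, request: Request) -> dict:
        body = await request.json()  # expects {"texts": [...]}
        vectors = self.model.encode(body["texts"], normalize_embeddings=True)
        return {"embeddings": vectors.tolist()}

serve.run(BGEEmbedder.bind(), route_prefix="/embed")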
Dimensions also matter for storage and ANN search latency in ways that are easy to underestimate. 1536-dimensional vectors in Pinecone with 10M records give you roughly 40-60ms p95 query latency under moderate load. Moving to 3072 dimensions doesn't double retrieval quality; it roughly doubles your index memory footprint and adds 15-25ms to query time. Profile your actual recall@10 before assuming bigger embeddings win. They usually don't.
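The memory math is worth doing explicitly before you pick a dimension. Raw float32 storage only, before the HNSW graph and metadata overhead:

def raw_vector_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    # float32 vectors only; the ANN graph and metadata add meaningfully on top of this
    return num_vectors * dims * bytes_per_dim / 1e9

print(raw_vector_gb(10_000_000, 1536))  # ~61 GB
print(raw_vector_gb(10_000_000, 3072))  # ~123 GB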
Pure dense retrieval misses exact-match queries. A user asking for "ISO 27001 section 6.1.2" doesn't need semantic similarity — they need BM25. Pure sparse retrieval misses semantic paraphrase. The production answer is hybrid search with a tunable fusion layer, and we'd argue most teams that skip this are leaving measurable quality on the table.
Weaviate's hybrid search and Elasticsearch's reciprocal rank fusion (RRF) both implement this. The key parameter is the k constant in RRF scoring:
POST /docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": { "query": { "match": { "content": "ISO 27001 section 6.1.2" } } } },
        { "knn": { "field": "embedding", "query_vector": [...], "k": 20, "num_candidates": 100 } }
      ],
      "rank_constant": 60,
      "rank_window_size": 100
    }
  }
}
The rank_constant of 60 is the Elasticsearch default and a reasonable starting point. For corpora where keyword precision matters more — regulatory documents, internal wikis with precise terminology — dropping it to 20-30 biases RRF toward exact-match results. Run offline evaluation against a labeled query set before changing this in production; the impact on recall@10 is non-trivial and direction isn't always obvious from intuition alone.
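That offline evaluation needs nothing beyond a labeled set of queries and the documents each should surface; the metric itself is a few lines:

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    # Fraction of the labeled relevant documents that appear in the top-k results
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

Average it over the full query set once per candidate rank_constant value and compare before touching the production setting.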
Retrieval gets you a candidate set of 20-100 chunks. Reranking determines which 3-5 actually go into the context window. Cross-encoder rerankers like Cohere Rerank or a locally-hosted ms-marco-MiniLM-L-6-v2 add 80-150ms to your pipeline but can lift answer quality scores by 15-25% on enterprise knowledge bases.
The math usually works out. A reranker that prevents one hallucinated answer per hundred queries is worth 120ms when you're serving regulated-industry workflows where a 0.5% hallucination baseline is unacceptable. The practical architecture is a two-stage pipeline: fast ANN retrieval over top-100 candidates in under 60ms, then a reranker over the top-20 that runs in parallel with any other query preprocessing. Don't rerank all 100 — that's where latency budgets collapse. Feed reranker scores back into your observability layer via OpenTelemetry spans so you can track score distribution drift over time, which is an early signal that your embedding model is becoming stale relative to the document corpus.
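A sketch of that second stage with the locally-hosted MiniLM cross-encoder, where the candidate shape and field names are placeholders for whatever your retrieval layer returns:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # candidates: the top-20 ANN hits, each with a "text" field; score every
    # (query, passage) pair with the cross-encoder and keep the best top_n
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    return sorted(candidates, key=lambda c: c["rerank_score"], reverse=True)[:top_n]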
Getting the right chunks is necessary but not sufficient. Stuffing 8k tokens of context into GPT-4o when 3k would cover the relevant material wastes money and degrades generation quality — the "lost in the middle" phenomenon is real and measurable, with recall on material sitting 2,000-5,000 tokens into the context dropping noticeably on multi-document synthesis tasks.
Enterprise deployments cannot treat document-level access control as an afterthought. The naive implementation — filtering post-retrieval — leaks document existence to the reranker and potentially to log systems. The correct approach is pre-filtering at the ANN query level. Pinecone metadata filters, Weaviate's where clause, and pgvector with a permissions JOIN all support this. The performance cost of a pre-filter depends heavily on selectivity: a filter passing 80% of documents has minimal ANN recall impact, but a filter passing 5% of documents can degrade recall@10 by 20-40% because the HNSW graph becomes effectively sparse for that user's view. Weaviate's ACORN algorithm partially addresses this; for pgvector, you're better off partitioning vectors by access tier than relying on a filter over a unified index.
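For pgvector, the pre-filter is just a JOIN inside the ANN query itself. A sketch with illustrative table and column names (chunks, acl_entries, indexed metadata are all assumptions):

import psycopg

ACL_FILTERED_ANN = """
    SELECT c.id, c.content, c.embedding <=> %(query_vec)s::vector AS distance
    FROM chunks c
    JOIN acl_entries a
      ON a.document_id = c.document_id AND a.user_id = %(user_id)s
    ORDER BY distance
    LIMIT 20;
"""

def retrieve_for_user(conn: psycopg.Connection, query_vec: list[float], user_id: str):
    # The permissions JOIN runs inside the ANN query, so unauthorized documents
    # never reach the reranker or the logs
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(ACL_FILTERED_ANN, {"query_vec": vec_literal, "user_id": user_id})
        return cur.fetchall()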
Freshness is the other dimension teams consistently underinvest in. A Kafka-based ingestion pipeline feeding into a streaming embedding job via a Ray actor pool can keep index lag under 60 seconds for high-change document stores. For daily-refresh corpora, Airflow-orchestrated re-embedding of changed documents with a Redis bloom filter to skip unchanged chunks is sufficient and far cheaper. Instrument your index staleness as a Prometheus metric and set an alert when p95 document age in retrieved results exceeds your SLA. If you're not measuring it, you don't know what users are actually retrieving against.
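Instrumenting that staleness metric takes a few lines with the Prometheus client; the metric name, bucket boundaries, and the indexed_at field are assumptions to fit your own pipeline:

import time
from prometheus_client import Histogram

RETRIEVED_DOC_AGE = Histogram(
    "rag_retrieved_document_age_seconds",
    "Time since last index update for each document returned to the generator",
    buckets=(60, 300, 3600, 21600, 86400, 604800),
)

def record_staleness(results: list[dict]) -> None:
    # Each result is assumed to carry an indexed_at unix timestamp in its metadata
    now = time.time()
    for r in results:
        RETRIEVED_DOC_AGE.observe(now - r["indexed_at"])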