Technical · 2024-03-08

The Hidden Cost of Context: Why RAG Isn't Enough

The RAG Promise

Retrieval-Augmented Generation (RAG) promises to solve a fundamental limitation of large language models: their knowledge cutoff. By retrieving relevant documents and including them in the prompt, RAG systems can answer questions about proprietary data, recent events, and domain-specific information that wasn't in the training data.

In practice, RAG systems face significant challenges that limit their effectiveness in production environments.

The Context Window Problem

Modern LLMs have large context windows—32k, 128k, even 200k tokens. This seems like plenty of space for retrieved documents. But context windows fill up quickly:

  • System prompts and instructions: 500-1000 tokens
  • Conversation history: 1000-5000 tokens
  • Retrieved documents: 10,000-50,000 tokens
  • Output space: 2000-4000 tokens

A 32k context window becomes constrained quickly, especially in multi-turn conversations. Organizations must choose between conversation history and retrieved context: both are valuable, and neither can easily be discarded.
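The arithmetic above can be sketched as a simple budget check. The token counts below are the illustrative mid-range figures from the list, not measurements from any particular system.

```python
# Hedged sketch: check whether a request's components fit a fixed
# context window. All token counts are illustrative assumptions.

def fits_context(window: int, system: int, history: int,
                 retrieved: int, output_reserve: int) -> bool:
    """Return True if every component fits within the context window."""
    return system + history + retrieved + output_reserve <= window

# Mid-range budgets from the list above fit a 32k window...
assert fits_context(32_000, 750, 3_000, 20_000, 3_000)
# ...but pushing retrieval toward its upper bound overflows the same window.
assert not fits_context(32_000, 750, 3_000, 30_000, 3_000)
```

The same check is the natural place to decide what to trade away: when it fails, either history is truncated or fewer documents are retrieved.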

Retrieval Quality Challenges

RAG systems depend on retrieval quality. Poor retrieval leads to poor outputs, regardless of model sophistication:

Semantic Search Limitations

Vector similarity doesn't always capture relevance. Documents that are semantically similar may not answer the user's question. Conversely, relevant documents may not be semantically similar to the query.

Ranking and Reranking

Retrieving 100 potentially relevant documents is easy. Identifying the 5 most relevant documents is hard. Simple vector similarity often fails; production systems need sophisticated reranking using cross-encoders or LLM-based relevance scoring.
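The two-stage shape described here can be sketched as follows. The scoring functions are trivial word-overlap stand-ins; a production system would use vector similarity for the first pass and a cross-encoder or LLM judge for the second.

```python
# Hedged sketch of two-stage retrieval: a cheap first pass over many
# candidates, then an expensive reranker over the survivors. Both
# scorers below are deliberate simplifications, not real models.

def cheap_score(query: str, doc: str) -> float:
    # Stand-in for vector similarity: fraction of query words in the doc.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

def expensive_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: reward exact-phrase matches heavily.
    return cheap_score(query, doc) + (1.0 if query.lower() in doc.lower() else 0.0)

def retrieve_and_rerank(query, docs, first_pass=100, final=5):
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:first_pass]
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:final]
```

The design point is the asymmetry: the cheap scorer runs over the whole candidate pool, while the expensive scorer only ever sees the first-pass survivors.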

Context Compression

Including full documents wastes context space. Production systems need intelligent compression: extracting relevant passages, summarizing background information, and removing redundant content.
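An extractive version of this idea can be sketched in a few lines. Word overlap stands in for whatever relevance signal the system actually uses (embeddings or an LLM summarizer in practice), and the budget is counted in words rather than real tokens.

```python
# Hedged sketch of extractive compression: keep only sentences that
# share vocabulary with the query, up to a word budget. Overlap-based
# matching is an illustrative simplification.

def compress(query: str, document: str, budget: int) -> str:
    q = set(query.lower().split())
    kept, used = [], 0
    for sentence in document.split(". "):
        words = sentence.split()
        if q & set(w.lower().strip(".") for w in words):
            if used + len(words) > budget:
                break
            kept.append(sentence)
            used += len(words)
    return ". ".join(kept)

# Illustrative document, not a real policy.
policy = ("Refunds are issued within 30 days. "
          "The office is closed on Sundays. "
          "Email support for other questions")
```

Summarization-based compression is lossier but denser; extractive compression like this preserves exact wording, which matters when answers must quote source text.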

Advanced Context Engineering

Production RAG systems go beyond simple retrieval:

Hybrid Search

Combine vector similarity with keyword search, metadata filtering, and recency weighting. Different queries benefit from different retrieval strategies.
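One widely used way to combine retrievers is reciprocal rank fusion (RRF), which merges ranked lists without having to make their raw scores comparable. Here the two input rankings are assumed to come from a vector index and a keyword index, supplied as plain lists of document ids.

```python
# Hedged sketch of hybrid search via reciprocal rank fusion (RRF):
# each retriever contributes 1/(k + rank) per document, so documents
# ranked well by multiple retrievers rise to the top.

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; earlier in the result = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]    # illustrative vector-index ranking
keyword_hits = ["d1", "d9", "d3"]   # illustrative keyword-index ranking
fused = rrf([vector_hits, keyword_hits])
```

The constant k damps the influence of top ranks so that a single retriever's first pick cannot dominate the fusion on its own.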

Query Expansion

Rewrite user queries to improve retrieval: expand abbreviations, add synonyms, generate multiple query variations, and use LLMs to reformulate questions.
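The table-driven flavor of this can be sketched as below. The expansion table is a hand-maintained illustrative assumption; production systems often have an LLM generate the reformulations instead.

```python
# Hedged sketch of query expansion: produce query variants from a small
# synonym/abbreviation table. The table contents are illustrative.

EXPANSIONS = {
    "k8s": ["kubernetes"],
    "db": ["database"],
    "perf": ["performance"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per expandable term."""
    variants = [query]
    for word in query.lower().split():
        for full in EXPANSIONS.get(word, []):
            variants.append(query.lower().replace(word, full))
    return variants
```

Each variant is retrieved independently; the result sets are then merged (for example with the rank-fusion approach used for hybrid search).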

Hierarchical Retrieval

Retrieve at multiple granularities: sections, paragraphs, sentences. Start with coarse-grained retrieval to identify relevant documents, then retrieve fine-grained passages from those documents.
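The coarse-then-fine pattern can be sketched as two sorted passes over a small in-memory corpus. The word-overlap scorer is a stand-in for the system's real similarity function.

```python
# Hedged sketch of hierarchical retrieval: rank whole documents first,
# then rank paragraphs only within the surviving documents.

def overlap(query: str, text: str) -> int:
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, corpus, top_docs=2, top_paras=2):
    """corpus: {doc_id: [paragraph, ...]} (illustrative structure)."""
    # Coarse pass: score each document as the concatenation of its paragraphs.
    ranked_docs = sorted(corpus,
                         key=lambda d: overlap(query, " ".join(corpus[d])),
                         reverse=True)[:top_docs]
    # Fine pass: score individual paragraphs from the surviving documents.
    paras = [p for d in ranked_docs for p in corpus[d]]
    return sorted(paras, key=lambda p: overlap(query, p),
                  reverse=True)[:top_paras]
```

The payoff is that the expensive fine-grained pass runs over a few documents' paragraphs rather than the whole corpus.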

Dynamic Context Allocation

Allocate context space dynamically based on query type. Simple questions need less context; complex questions need more. Adjust retrieval depth based on available context space.
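A minimal version of this allocation can be sketched as follows. The complexity heuristic and its threshold are illustrative assumptions; a real system might classify the query with an LLM instead.

```python
# Hedged sketch of dynamic context allocation: size the retrieval budget
# from what remains after fixed overheads, scaled by a crude query
# complexity heuristic. All thresholds are illustrative.

def retrieval_budget(window: int, system: int, history: int,
                     output_reserve: int, query: str) -> int:
    free = window - system - history - output_reserve
    # Crude heuristic: longer questions get the full remaining budget,
    # short ones get half, leaving headroom for follow-up turns.
    share = 0.5 if len(query.split()) < 8 else 1.0
    return max(int(free * share), 0)
```

The budget then drives retrieval depth directly: fewer documents (or tighter compression) when the number is small, deeper retrieval when it is large.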

The Cost of Context

Context isn't free. Every token in the context window costs money and adds latency:

  • Monetary cost: Larger contexts cost more per request
  • Latency cost: Processing large contexts takes time
  • Quality cost: Irrelevant context confuses models and degrades output quality

Production systems optimize context usage: retrieving only what's needed, compressing aggressively, and caching expensive operations.
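The monetary side of this trade-off is easy to make concrete. The per-token prices below are placeholder assumptions, not any provider's real rates.

```python
# Hedged sketch of a per-request cost estimate. Prices are placeholders.

def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_1k_in: float = 0.003,
                 usd_per_1k_out: float = 0.015) -> float:
    return (input_tokens / 1000) * usd_per_1k_in \
         + (output_tokens / 1000) * usd_per_1k_out

# Trimming retrieved context from 30k to 10k input tokens, for the same
# answer length, cuts the input-side spend by two thirds.
full = request_cost(30_000, 1_000)
trimmed = request_cost(10_000, 1_000)
```

At high request volumes, this kind of arithmetic is what justifies investing in compression and caching in the first place.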

Beyond RAG: Hybrid Architectures

The most sophisticated systems combine multiple approaches:

  • Fine-tuning for domain-specific knowledge that doesn't change frequently
  • RAG for dynamic information that changes regularly
  • Function calling for structured data that's better accessed via APIs
  • Prompt engineering for task-specific instructions and formatting

No single technique solves all problems. Production systems use the right tool for each task.
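The "right tool for each task" idea implies a router in front of these techniques. A keyword-rule version is sketched below; the rules are illustrative, and a production router might be an LLM classifier.

```python
# Hedged sketch of routing queries to techniques. Keyword rules are
# illustrative stand-ins for a learned or LLM-based classifier.

def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("price", "inventory", "order status")):
        return "function_call"  # structured data behind an API
    if any(w in q for w in ("latest", "recent", "today")):
        return "rag"            # dynamic information: retrieve fresh docs
    return "base_model"         # stable domain knowledge (fine-tuned)
```

Routing keeps each technique in its lane: the retrieval pipeline is never paid for when an API call or the model's own weights can answer directly.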

Conclusion

RAG is a powerful technique, but it's not a silver bullet. Production systems require sophisticated context engineering: hybrid search, intelligent reranking, dynamic compression, and cost optimization. Organizations that invest in context engineering—treating it as a critical system component rather than an afterthought—build RAG systems that deliver reliable value at scale.

The future of enterprise AI depends not just on better models, but on better context engineering. The organizations that master this discipline will capture the most value from AI investments.
