Long Context vs RAG

Who Dives Deeper for Insights?

Context windows for large language models (LLMs) are rapidly increasing. For example, the original Llama model had a context window of 2,048 tokens, whereas the latest Llama 3.1 model can handle up to 128,000 tokens. Meanwhile, Google's Gemini 1.5 Pro offers a context window of 2 million tokens.

In Retrieval-Augmented Generation (RAG), long documents are split into smaller chunks. However, these smaller chunks sometimes lack enough context for the LLM to fully address a query.
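To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. The chunk size and overlap values are illustrative assumptions, not parameters taken from the demo or any paper.

```python
# A minimal chunking sketch. chunk_size and overlap are illustrative
# assumptions, not values used by the demo.
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]
```

The overlap keeps sentences that straddle a chunk boundary from losing their surrounding text entirely, which partially mitigates the missing-context problem described above.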

According to benchmarks presented in this paper, 60% of queries on standard datasets yield identical results from both approaches. Long Context models aren't intended to replace RAG; rather, they can enhance existing RAG workflows by supplying additional context when retrieval alone falls short. Moreover, RAG is generally more cost-effective than using Long Context models, since it sends far fewer tokens to the model per query.
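One way to combine the two is a simple router, sketched below under stated assumptions: answer cheaply from retrieved chunks first, and fall back to the full document only when retrieval appears insufficient. The `generate` placeholder and the "I don't know" trigger phrase are assumptions for illustration, not the paper's exact method.

```python
# A hedged sketch of a RAG-first router with a Long Context fallback.
# `generate` is a hypothetical stand-in for a real LLM call; the
# "I don't know" check is an illustrative heuristic.
def generate(query: str, context: str) -> str:
    # Placeholder: wire this to your LLM of choice (e.g., Gemini 1.5 Flash).
    raise NotImplementedError

def hybrid_answer(query: str, top_chunks: list[str], full_document: str) -> str:
    # Cheap path: answer from the retrieved chunks alone.
    draft = generate(query, "\n\n".join(top_chunks))
    # Expensive path: pay for the full long-context prompt only when the
    # chunk-based answer signals that it lacks sufficient context.
    if "i don't know" in draft.lower():
        return generate(query, full_document)
    return draft
```

Because most queries are resolved on the cheap path, a router like this keeps the cost profile close to plain RAG while recovering the queries that chunks alone can't answer.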

You can test both Long Context and RAG pipelines side-by-side using the Gemini 1.5 Flash model, with a RAG index built over quarterly earnings call transcripts from a select group of companies.
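If you want to reproduce the comparison outside the demo, the sketch below calls Gemini 1.5 Flash twice: once with a full transcript and once with a crude stand-in for retrieved chunks. It assumes the `google-generativeai` Python SDK, an API key in `GOOGLE_API_KEY`, and an illustrative transcript file name; the demo's actual index and retriever are not shown here.

```python
# A side-by-side sketch, assuming the google-generativeai SDK
# (pip install google-generativeai) and a GOOGLE_API_KEY env var.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

query = "What did management say about margins this quarter?"
# Illustrative file name; substitute any earnings call transcript.
transcript = open("earnings_call_q3.txt", encoding="utf-8").read()

# Long Context: the entire transcript goes into the prompt.
long_context = model.generate_content(
    f"Answer using this transcript:\n{transcript}\n\nQuestion: {query}"
)

# Baseline RAG: only a few excerpts go into the prompt. A naive
# paragraph split stands in for a real retriever here.
excerpts = "\n\n".join(transcript.split("\n\n")[:5])
baseline_rag = model.generate_content(
    f"Answer using these excerpts:\n{excerpts}\n\nQuestion: {query}"
)

print("Long Context:", long_context.text)
print("Baseline RAG:", baseline_rag.text)
```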
