What is RAG? Retrieval Augmented Generation explained

What is RAG?

When I started playing with LLMs and learning about the technology, I kept reading about “doing RAG” or building a “RAG app” and, honestly, for a long time I didn’t even understand what RAG was.

RAG - or Retrieval Augmented Generation - is a technique that allows large language models (LLMs) to generate responses grounded in very long, extensive source material by intelligently retrieving and providing only the most relevant information to the model.

That’s the TLDR, but if you’re still scratching your head then read on. This isn’t a tutorial, but I hope to share my high-level understanding of RAG, and where, why and how you should use it.

Context length: the short-term memory of an LLM

All LLMs have a “context window” - which you can think of as its short-term memory. It represents the maximum length of input context that the model can keep track of when generating a response. When the length of the input exceeds that maximum length, the model effectively just forgets parts of the context, which can lead to incoherent and incorrect responses.

All models have a context length. For GPT-4 it’s currently 128k, GPT-3.5 Turbo tops out at 16k, and most local models are trained with between 4k and 32k of context. The number represents tokens, not words, and a common rule of thumb is that one token corresponds to approximately 0.75 words (in English). So we can say GPT-4 is able to keep track of around 96,000 words of input when generating a response.
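If you want to see the tokens-vs-words relationship for yourself, here’s a minimal sketch using OpenAI’s tiktoken library - the 0.75 figure is only a rule of thumb, and the exact ratio depends on the text:

```python
import tiktoken

# Get the tokenizer used by GPT-4 and encode a sample sentence.
encoding = tiktoken.encoding_for_model("gpt-4")

text = "The quick brown fox jumps over the lazy dog."
tokens = encoding.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
```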

And remember, when you chat with an LLM, the entire chat history from both sides gets replayed as new input for each new message - so both the input and the output accumulate with each subsequent message.
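To make that concrete, here’s a rough sketch using the OpenAI Python client (the model name and example questions are just placeholders) - notice that every request re-sends the whole conversation so far:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Start the conversation with a single user message.
messages = [{"role": "user", "content": "Summarise chapter one for me."}]
reply = client.chat.completions.create(model="gpt-4", messages=messages)

# To continue the chat, the assistant's reply and the follow-up question are
# appended, and the *entire* list is sent again - so the token count grows
# with every turn.
messages.append({"role": "assistant", "content": reply.choices[0].message.content})
messages.append({"role": "user", "content": "And what happens in chapter two?"})
reply = client.chat.completions.create(model="gpt-4", messages=messages)
```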

Context length quickly becomes a barrier for use cases where you provide the LLM lengthy research papers, code bases, or even entire books, and then ask the model specific questions about the extensive context.

When context falls out of the window

My daughter is a Harry Potter fan. Her favourite book in the series is The Order of the Phoenix, which comes in at a pretty hefty 257,000 words. In truth, she doesn’t need AI because she already knows everything that is possible to know about Harry Potter (I’m not kidding), but let’s just imagine she copy-and-pastes the entire book into GPT-4 to ask a question. Because the number of words (and therefore tokens) far exceeds the model’s context window, much of that context effectively goes in one side and straight out the other: the model forgets the earlier chapters. So if you ask a question - for example, about the Dementors attacking Harry and Dudley - it’s likely GPT-4 will fail to answer it well.

[Figure: Without RAG, a language model loses focus when provided too much context.]

Also, remember these models charge per token. Feeding the entire contents of Order of the Phoenix into GPT-4 would cost the bill payer (that’s me, BTW, my daughter is 10) around $2.50.

RAG: smaller, more targeted, more relevant context

RAG is a technique that involves pre-processing the user input - a kind of middleware that manipulates the user prompt before sending it to the LLM.

Instead of providing the entire contents of the book as context, a RAG app processes the book, finds the parts that are relevant to the user’s question, and feeds just those relevant chunks of text in as context. This way, the prompt is kept well within the model’s context length, and everything is cheaper and faster.

The process looks like this:

1. Text splitting

When the book is uploaded, the server splits the text into chunks. There are different strategies for splitting the text, and this alone can be a complex topic, but for argument’s sake, let’s just imagine every paragraph becomes a separate chunk.
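As a sketch, a naive paragraph-based splitter might look something like this (the filename is just a placeholder, and real apps often use token-aware chunking with overlap between chunks):

```python
def split_into_chunks(text: str) -> list[str]:
    # One blank-line-separated paragraph per chunk, skipping empty
    # leftovers from extra whitespace.
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if p]

with open("order_of_the_phoenix.txt", encoding="utf-8") as f:
    chunks = split_into_chunks(f.read())

print(f"Split the book into {len(chunks)} chunks")
```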

2. Calculate embeddings

Each chunk of text is passed to an ML model tasked with returning an “embedding” for that chunk. An embedding is a vector of numerical values that captures the semantic value - the meaning - of the text.

Embeddings and vectors are a foundational part of how large language models work, and understanding them quickly takes us into the realm of linear algebra and higher-level maths - and, well, this isn’t the blog for that. For now, just know that you can generate embeddings through OpenAI’s API, or do so locally using Ollama and the nomic-embed-text model.
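As a rough illustration, carrying on from the splitter sketch above, generating the embeddings might look like this. Option A uses OpenAI’s embeddings API (text-embedding-3-small is one of their embedding models); option B, commented out, swaps in Ollama’s local API with nomic-embed-text:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # Returns the embedding vector for a piece of text.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# Option B: generate embeddings locally via Ollama instead.
# import requests
# def embed(text: str) -> list[float]:
#     resp = requests.post(
#         "http://localhost:11434/api/embeddings",
#         json={"model": "nomic-embed-text", "prompt": text},
#     )
#     return resp.json()["embedding"]

# One embedding per chunk from the splitter sketch above.
chunk_embeddings = [embed(chunk) for chunk in chunks]
```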

3. Store embeddings in a vector database

Now that you have an embedding for every chunk of text, you can store them in a vector database. As the name suggests, this is a database optimised for storing and querying vectors.

Vector databases have a particularly useful trick up their sleeve. Because embeddings capture the semantic value of a piece of text, it’s possible to query the database to find the closest semantic matches to some other embedding.
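Continuing the sketch, the chunks and their embeddings might be stored like this - Chroma is used here purely as an example of a vector database, and the collection name is just a placeholder:

```python
import chromadb

# An in-memory Chroma instance is enough for a sketch; a real app would
# use a persistent store.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="order_of_the_phoenix")

# Store each chunk alongside its embedding; the vector database indexes
# the embeddings so it can later find the nearest neighbours quickly.
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    embeddings=chunk_embeddings,
    documents=chunks,
)
```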

4. Query the vector database

When the user asks a question, you can generate an embedding for that question and then query the vector database to find the semantically closest chunks of text - essentially the most relevant paragraphs of text in relation to the question.
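Continuing the sketch, retrieval might look like this (the question and the number of results are just placeholders):

```python
question = "What happened when the Dementors attacked Harry and Dudley?"

# Embed the question with the same model used for the chunks, then ask
# the vector database for the closest matches.
results = collection.query(
    query_embeddings=[embed(question)],
    n_results=5,  # how many chunks to retrieve is a tuning knob
)
relevant_chunks = results["documents"][0]  # the five closest paragraphs
```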

5. Prompt the LLM

With the relevant chunks of text in hand, you can now engineer a prompt for the LLM that includes that context and asks the user’s question. This gives the model all the relevant information it needs to answer the question, in a much smaller, faster and cheaper prompt.
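And the final step of the sketch - assemble a prompt from the retrieved chunks and send it to the chat model (the exact prompt wording is just one way of phrasing it):

```python
# Only the retrieved paragraphs go into the prompt, not the whole book.
context = "\n\n".join(relevant_chunks)

completion = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Answer the question using only the context provided."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)
```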

[Figure: With RAG, a smaller, more focused and more relevant prompt is engineered.]

Whilst there are a few moving parts here, it’s not so complicated to implement, and it’s surprisingly performant as embeddings are very quick to generate and query against. Instead of costing dollars, the entire prompt is likely sub-cent.

Larger context length

Newer models are emerging that promise even lengthier context windows. For example, Claude 3 claims a context length of up to 1 million tokens (this is for partners who know the magic handshake - us ordinary folk only get 200k, which is still pretty good actually). Google’s very new Gemini 1.5 apparently ships with a 1M context length as standard, and their special enterprise partners can apparently take advantage of an incredible 10M tokens!

This does change the picture somewhat. If you’re taking advantage of the latest state-of-the-art models from the big boys, then the technical case for RAG is weaker. Certainly, there are use cases where dumping a million tokens into one mega-prompt will be simpler and produce better responses than RAG. But the financial argument still stands, so RAG will continue to play a role. And if you’re running models at home or on your own infrastructure, RAG remains an essential part of your developer toolkit.

Conclusion

RAG, or Retrieval Augmented Generation, is a powerful technique that allows large language models to process and generate responses from extensive context, far beyond their inherent context length limitations. By intelligently retrieving and providing only the most relevant chunks of information to the model, RAG enables natural language interactions with vast datasets, long-form content, or even entire books.

In a future post, I plan to dive deeper into the topic and create a RAG tutorial with code examples. But hopefully this high-level overview helps you understand what RAG is and the role it plays, so that, unlike me, you don’t spend the next few months scratching your head every time you hear someone ragging about RAG.