Retrieval-Augmented Generation

In this article, we will explore Retrieval-Augmented Generation (RAG), an AI framework designed to enhance the quality of responses generated by large language models (LLMs). By grounding an LLM in external sources of knowledge that supplement its internal representation of information, RAG helps keep outputs relevant, accurate, and useful across various contexts. We will cover the key terms related to RAG, explain the RAG process in detail, and highlight the benefits of this approach.
Understanding Key Concepts:
- Vector: When users ask an LLM a question, the AI model sends the query to another model that converts it into a numeric format so machines can read it. The numeric version of the query is sometimes called an embedding or a vector.
- Vector databases: The embedding model then compares these numeric values to vectors in a machine-readable index of an available knowledge base. When it finds one or more matches, it retrieves the related data in its original human-readable form and passes it back to the LLM. The LLM then combines the retrieved text with its own response to the query into a final answer it presents to the user, potentially citing the sources that were found. In the background, these machine-readable indices, called vector databases, are continuously created and updated as new or updated knowledge bases become available.
- Indexing: In a RAG pipeline, the knowledge-base documents are prepared by first splitting them into chunks, generating embeddings for each chunk, and storing those embeddings in a vector store. At inference time, the user query is embedded in the same way (a minimal indexing sketch follows this list).
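To make the indexing step concrete, here is a minimal sketch in Python. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the vector store is just an in-memory list of (embedding, chunk) pairs; both are simplifying assumptions for illustration, not a production design.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding: a normalized bag-of-words vector over a fixed vocabulary.
    A real RAG system would call a trained embedding model here instead."""
    counts = Counter(text.lower().split())
    vec = [counts[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(document, chunk_size=20):
    """Split a document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Build a tiny in-memory "vector store": a list of (embedding, chunk_text) pairs.
documents = [
    "Employees receive 20 days of annual leave per calendar year.",
    "Expense reports must be submitted within 30 days of travel.",
]
vocab = sorted({w for doc in documents for w in doc.lower().split()})
vector_store = [(embed(c, vocab), c) for doc in documents for c in chunk(doc)]
print(f"Indexed {len(vector_store)} chunks.")
```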
Explaining Retrieve, Augment, Generate:
- Retrieve: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded with an embedding model into the same vector space as the additional context in the vector database. This makes it possible to perform a similarity search, and the top-k closest data objects from the vector database are returned.
- Augment: The user query and the retrieved additional context are stuffed into a prompt template.
- Generate: Finally, the retrieval-augmented prompt is fed to the LLM, which generates the final response (a minimal end-to-end sketch follows this list).
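Here is a minimal end-to-end sketch of the three steps, continuing the toy index built in the indexing sketch above. The prompt template wording is illustrative, and `call_llm` is a hypothetical placeholder for whichever model API is actually used.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (norm_a * norm_b)

def retrieve(query, vector_store, vocab, k=2):
    """Retrieve: embed the query and return the top-k most similar chunks."""
    query_vec = embed(query, vocab)
    ranked = sorted(vector_store, key=lambda item: cosine_similarity(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def augment(query, context_chunks):
    """Augment: stuff the query and the retrieved context into a prompt template."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def generate(prompt):
    """Generate: feed the retrieval-augmented prompt to the LLM."""
    return call_llm(prompt)  # hypothetical LLM client call, not a real API

query = "How much annual leave do I have?"
augmented_prompt = augment(query, retrieve(query, vector_store, vocab))
print(augmented_prompt)
```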
How RAG Works:
Create external data: The new data outside of the LLM’s original training data set is called external data. It can come from multiple data sources, such as APIs, databases, or document repositories, and may exist in various formats such as files, database records, or long-form text. An embedding language model converts this data into numerical representations (embeddings) and stores them in a vector database. This process creates a knowledge library that generative AI models can understand.
Retrieve relevant information: The next step is to perform a relevancy search. The user query is converted to a vector representation and matched against the vector database. For example, consider a smart chatbot that can answer human resources questions for an organization. If an employee asks, “How much annual leave do I have?”, the system will retrieve the annual leave policy documents alongside that employee’s past leave record. These specific documents are returned because they are highly relevant to the employee’s query; relevancy is calculated from the vector representations, typically with a measure such as cosine similarity (a hand-worked toy example follows).
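As a hand-worked illustration of that calculation, the query vector can be compared against each candidate document vector with cosine similarity. The three-dimensional vectors below are invented toy values, not real embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query_vec        = [0.9, 0.1, 0.0]  # "How much annual leave do I have?"
leave_policy_vec = [0.8, 0.2, 0.1]  # annual leave policy document
leave_record_vec = [0.7, 0.3, 0.0]  # this employee's past leave record
expenses_vec     = [0.1, 0.0, 0.9]  # unrelated expenses policy

for name, vec in [("leave policy", leave_policy_vec),
                  ("leave record", leave_record_vec),
                  ("expenses policy", expenses_vec)]:
    print(f"{name}: {cosine(query_vec, vec):.2f}")
# Prints roughly 0.98, 0.96 and 0.11: the two leave-related documents score close to 1,
# so they are returned as context, while the unrelated policy is ignored.
```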
Augment the LLM prompt: Next, the RAG model augments the user input (or prompt) by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the LLM, and the augmented prompt allows the large language model to generate an accurate answer to the user’s query (an illustrative augmented prompt is shown below).
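For the HR chatbot example, the augmented prompt handed to the LLM might look roughly like the following; the template wording and the retrieved snippets are illustrative, not taken from a real system.

```text
Answer the question using only the context below.

Context:
- Annual leave policy: full-time employees receive 20 days of annual leave per calendar year.
- Leave record (employee 12345): 10 days of annual leave taken so far this year.

Question: How much annual leave do I have?
Answer:
```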
Update external data: To keep the information available for retrieval current, asynchronously update the documents and their embedding representations. This can be done through automated real-time processes or periodic batch processing. Keeping derived data in sync with its sources is a common challenge in data analytics, and different data-science approaches to change management can be used (a minimal batch-refresh sketch follows).
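One simple way to keep the index current is a periodic batch job that re-chunks and re-embeds only the documents that changed since the last run. The sketch below assumes each document record carries a `modified_at` timestamp and reuses the toy `chunk` and `embed` helpers from the indexing sketch; real systems often use event-driven or streaming pipelines instead.

```python
import time

def refresh_index(documents, index, vocab, last_run):
    """Periodic batch update: re-embed documents modified since the last refresh.

    `documents` is a list of dicts like {"id": ..., "text": ..., "modified_at": ...};
    `index` maps a document id to its list of (embedding, chunk_text) pairs.
    """
    for doc in documents:
        if doc["modified_at"] > last_run:
            index[doc["id"]] = [(embed(c, vocab), c) for c in chunk(doc["text"])]
    return time.time()  # becomes last_run for the next scheduled run
```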
Walking through RAG with an example: a smart chatbot designed to answer human resources questions for an organization
- Query: “How much annual leave do I have?”
- Query converted to vector representation
- Embedding model searches the vector database, which contains vectors representing the organization’s annual leave policy documents and the specific employee’s past leave records.
- Retrieves the relevant documents: the general annual leave policy and the employee’s leave record.
- User’s query is combined with the retrieved documents into an augmented prompt
- Augmented prompt is structured using prompt engineering techniques to ensure clarity and context for the LLM, and is then fed to the LLM.
- LLM generates a response that combines its general knowledge about annual leave policies with specific details from the retrieved documents.
- Final Response: “You have 10 days of annual leave remaining according to the company’s policy and your past leave records.”
Benefits of RAG:
- Enhanced Contextual Understanding
- Diverse and Relevant Outputs
- Flexibility in Knowledge Integration
The Retrieval-Augmented Generation (RAG) model significantly enhances the capabilities of large language models by converting user queries into vector representations, retrieving relevant information, and augmenting prompts with this data. It helps ensure that LLM-generated responses are relevant, contextually rich, and up to date. This approach provides flexibility and adaptability in integrating new knowledge, making it a cost-effective and powerful solution for applications such as chatbots, education tools, language translation, and many more.