How to Make a RAG: Retrieval-Augmented Generation in Python

Cristian Perafan


A robot searching for information in a library, representing how RAG retrieves specific data to generate personalized responses.

Large Language Models (LLMs) often provide generic responses that are not grounded in a specific context. This is where the Retrieval-Augmented Generation (RAG) technique becomes crucial: it lets us combine the base knowledge of an LLM with external knowledge sources.

Imagine a company that has a collection of internal documents, such as policies or training manuals. It wants to leverage the power of LLMs to provide information to users or employees, but the LLMs were never trained on these documents. With RAG, the company can build a system that retrieves relevant information from the documents and uses an LLM to generate responses tailored to the specific questions of employees or customers.

How Does It Work?

Workflow of a RAG System
  1. User query: The RAG system receives a question or task.
  2. Look for information: The system retrieves relevant information from a data source, which could be an API, a database, a collection of documents, or the web.
  3. Give content to the LLM: The retrieved information is passed to the LLM as context alongside the original question.
  4. Generate output: This is where the magic happens; the LLM generates a personalized response grounded in that context. The sketch after this list illustrates the flow.
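Before we dive into the implementation, here is a minimal Python sketch of that loop. The helpers retrieve, build_prompt, and llm are hypothetical placeholders for the concrete components we build in the rest of the tutorial.

# Minimal sketch of the RAG loop; retrieve, build_prompt, and llm
# are hypothetical placeholders, not real library functions.
def answer(question: str) -> str:
    # 1. Receive the user query
    # 2. Look up relevant context in an external data source
    context_chunks = retrieve(question, k=3)
    # 3. Give the retrieved content to the LLM alongside the question
    prompt = build_prompt(context="\n\n".join(context_chunks), question=question)
    # 4. Generate a personalized, grounded response
    return llm(prompt)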

Benefits of RAG

  • Reduces model hallucination: The dataset a model was trained on is rarely updated and may be outdated, which can lead the model to generate partially false responses. Grounding answers in retrieved documents mitigates this.
  • Cost savings: Retraining or fine-tuning model parameters is time-intensive and costly. RAG lets you add new knowledge without retraining, which can significantly reduce the cost of running LLMs in an enterprise context.
  • Increased user confidence: RAG allows responses to include attribution to the source information, which increases confidence in a solution powered by generative AI.

Coding

In this tutorial, we will use:

  • LangChain: a framework that helps developers build applications on top of large language models (LLMs).
  • Chroma: an open-source vector database used to store and efficiently retrieve vector embeddings.

Prerequisite

  1. In this example, I use a PDF document titled “Home Gardeners Guide” from Purdue University.
  2. Set up your OpenAI API key.

Text Preprocessing

In this example, I used a document titled “Home Gardeners Guide,” but you can use one or more PDFs of your choice.

  1. Define the name of the folder where the PDFs are located.
  2. Load the PDF documents.
  3. Chunk the text. Chunking is the process of breaking a large document down into smaller pieces of information called “chunks.” Splitting the document into smaller pieces allows the retrieval system to index, search, and return specific pieces of information rather than the entire document.

src/text_processor.py

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Path to the directory containing the PDF files
DOCUMENTS_PATH = 'documents'


def chunk_pdfs() -> list[Document]:
    # Initialize the document loader and load the documents
    document_loader = PyPDFDirectoryLoader(DOCUMENTS_PATH)
    documents = document_loader.load()

    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,         # Size of each chunk in characters
        chunk_overlap=100,      # Overlap between chunks in characters
        length_function=len,    # Function to calculate the length of the text
        add_start_index=True,   # Add the start index to each chunk's metadata
    )

    # Split the documents into chunks
    chunks = text_splitter.split_documents(documents)

    return chunks
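As a quick sanity check, you can call the function and inspect the resulting chunks. The snippet below is a minimal usage example, assuming your PDFs are in the documents/ folder; the exact metadata keys come from the PDF loader and the add_start_index option.

from src.text_processor import chunk_pdfs

chunks = chunk_pdfs()
print(f"Created {len(chunks)} chunks")

# Each chunk is a LangChain Document carrying the text and its source metadata
print(chunks[0].page_content[:200])
print(chunks[0].metadata)  # e.g. {'source': 'documents/guide.pdf', 'page': 0, 'start_index': 0}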

Loading Data

Once the documents have been broken down into smaller pieces, we save them to the vector database Chroma. LangChain provides seamless integration with Chroma for this purpose.

Workflow of an embedding model.

When working with a vector database, we need an embedding model. An embedding model translates text (or other types of data) into numerical vector representations that the vector database can store and compare. In this example, we will use an OpenAI embedding model for this purpose.
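To make this concrete, here is a small standalone example of the embedding step on its own; embed_query turns a string into a list of floats, and the vector length depends on the chosen model.

from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

# Turn a piece of text into a vector of floats that Chroma can index and compare
vector = embedding_model.embed_query("How often should I water tomato seedlings?")
print(len(vector))  # dimensionality of the embedding
print(vector[:5])   # first few components of the vector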

src/chroma_db.py

import os
import shutil
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

CHROMA_PATH = 'chroma'


def save_to_chroma_db(chunks: list[Document], embedding_model) -> Chroma:
    # Remove the existing Chroma database
    if os.path.exists(CHROMA_PATH):
        try:
            shutil.rmtree(CHROMA_PATH)
        except Exception as e:
            print(f"Error removing Chroma database: {e}")

    # Initialize the Chroma database and persist it to disk
    db = Chroma.from_documents(
        chunks,
        persist_directory=CHROMA_PATH,
        embedding=embedding_model
    )
    print(f"Saved chunks to {CHROMA_PATH}")
    return db
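Note that this function wipes and rebuilds the database on every run. If you only want to query an index you have already built, you can load the persisted store directly; a minimal sketch, assuming the same CHROMA_PATH and embedding model as above:

from langchain_community.vectorstores import Chroma

# Load a previously persisted Chroma index instead of re-embedding every document
db = Chroma(
    persist_directory=CHROMA_PATH,
    embedding_function=embedding_model,
)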

Retrieving Data

Once we have implemented the methods to preprocess our PDF documents, we can begin integrating the entire workflow into a main file:

  1. Preprocess the text and generate document chunks to store in the vector database.
  2. Import the OpenAI embedding model.
  3. Save the processed documents to the vector database.
  4. Retrieve the three most similar chunks for a given query and merge them into a single string. The query we will use is: “What are the recommended steps for fertilizing a vegetable garden?”

main.py

import os
from src.text_processor import chunk_pdfs
from src.chroma_db import save_to_chroma_db
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# Process the documents into chunks
processed_documents = chunk_pdfs()

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

# Save the processed documents to the vector database
db = save_to_chroma_db(processed_documents, embedding_model)

query = "What are the recommended steps for fertilizing a vegetable garden?"

# Retrieve the three most similar chunks along with their similarity scores
docs = db.similarity_search_with_score(query, k=3)

# Merge the retrieved chunks into a single context string
context = "\n\n---\n\n".join([doc.page_content for doc, _score in docs])
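If you want to see exactly what was retrieved before building the prompt, you can print each chunk's source and score. Chroma returns a distance-style score here, where lower values generally indicate closer matches.

# Optional: inspect the retrieved chunks and their similarity scores
for doc, score in docs:
    print(f"score={score:.4f}  source={doc.metadata.get('source')}  page={doc.metadata.get('page')}")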

Generate Response With the LLM

Prompt templates let us structure interactions with LLMs effectively by creating reusable, parameterized prompts. They help the model understand the context and generate relevant responses.

  1. Define the prompt template.
  2. Use the prompt to generate a response with the OpenAI LLM.
# Define the prompt template
PROMPT_TEMPLATE = """
You have to answer the following question based on the given context:
{context}
Answer the following question: {question}
Provide a detailed answer.
Don't include non-relevant information.
"""

# Generate the prompt
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context, question=query)

# Initialize the OpenAI chat model and generate the response
model = ChatOpenAI()
response = model.invoke(prompt)

print(response.content)
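As a side note, recent LangChain versions encourage composing these steps as a single chain with the pipe operator. The following is an equivalent sketch, assuming the same prompt_template, model, context, and query defined above.

from langchain_core.output_parsers import StrOutputParser

# Compose prompt -> model -> string output into one runnable chain
chain = prompt_template | model | StrOutputParser()
response_text = chain.invoke({"context": context, "question": query})
print(response_text)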

Generated output by the RAG System:

The recommended steps for fertilizing a vegetable garden are as follows:

1. Establish the basic fertility level by applying the right kind and amount of fertilizer to your garden soil. This can be determined through a soil test, which will indicate the specific fertilizer needed.

2. Apply fertilizer to maintain the basic fertility level each year after it has been established. This ensures that your soil remains at the optimal fertility level for growing healthy plants.

3. If the soil test recommends "no basic application" due to adequate fertility levels, then skip this step and monitor the soil for any excess elements that may need to be addressed.

4. Plow or spade the soil after applying half of the recommended fertilizer to distribute it evenly throughout the top 7 inches of soil. This helps ensure that the plants receive the necessary nutrients for growth.

By following these steps, you can effectively fertilize your vegetable garden and promote healthy plant growth throughout the growing season.

Congratulations! 🎉 You have successfully created a basic RAG system that fetches relevant documents and uses them as context to produce insightful responses.

You can find the full code on GitHub.
