Building an intelligent retrieval system with decision-making capabilities
Agentic Retrieval-Augmented Generation (RAG) enhances traditional RAG systems by adding decision-making capabilities. While standard RAG passively retrieves information from a predefined database, Agentic RAG can reason about each query, choose among multiple information sources, and fall back to an alternative source when the first one comes up short.
Figure 1: Basic Agentic RAG Architecture
This tutorial builds a basic Agentic RAG system that can retrieve information from two sources: a local vector store built from a PDF document, and live web search via the Tavily API.
The agent will intelligently decide which source to query based on the nature of the information needed, making it more versatile than traditional RAG systems.
Our Agentic RAG implementation follows these key steps: load and chunk a PDF document, embed the chunks into a FAISS vector store, add a web search tool, wrap both retrieval paths as agent tools, and drive the agent with a prompt that tells it which source to prefer.
Note: The key difference from standard RAG is the agent's ability to choose between information sources based on the query and available information.
We start by loading a PDF document (Tesla's Q3 report) and preparing it for indexing:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/tesla_q3.pdf")
documents = loader.load()
# split documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
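To see what chunk_size and chunk_overlap control, here is a naive fixed-size splitter; it is a simplification, since the real RecursiveCharacterTextSplitter also tries to break on paragraph, line, and word boundaries before falling back to raw characters:

```python
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Naive fixed-size splitter illustrating chunk_size / chunk_overlap.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("a" * 1200, chunk_size=500, chunk_overlap=0)
print([len(c) for c in chunks])  # → [500, 500, 200]
```

With chunk_overlap=0, as in this tutorial, chunks are simply disjoint 500-character windows; a nonzero overlap would repeat the tail of each chunk at the head of the next, which helps when an answer straddles a boundary.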
For efficient retrieval, we need to convert these document chunks into vector embeddings:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
We're using the BGE-small embedding model from BAAI, which offers a good balance between performance and efficiency for document retrieval tasks.
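The normalize_embeddings=True option scales every embedding to unit length, which makes cosine similarity reduce to a plain dot product. A toy illustration of that property (pure Python, not the actual embedding model):

```python
import math

def normalize(v):
    """Scale a vector to unit length, as normalize_embeddings=True does."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
# For unit vectors, the dot product IS the cosine similarity.
print(round(dot(a, a), 6))  # → 1.0
print(round(dot(a, b), 6))  # → 0.96
```

This is why normalized embeddings pair well with similarity indexes: the index only needs fast dot products to rank documents by cosine similarity.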
Next, we create a vector database to store our document embeddings:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(documents, embeddings)
# create retriever
retriever = vectorstore.as_retriever()
We're using FAISS (Facebook AI Similarity Search), which is optimized for efficient similarity search in high-dimensional spaces. The retriever provides a simple interface for querying the vector store.
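Conceptually, the retriever embeds the query and returns the stored chunks whose vectors score highest against it. A minimal sketch of that top-k search (pure Python; FAISS does the same ranking with approximate-nearest-neighbor indexes at scale):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k highest dot-product matches — the operation FAISS
    accelerates over large collections of stored embeddings."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], docs))  # → [0, 2]
```

In LangChain you can control how many chunks come back with, for example, vectorstore.as_retriever(search_kwargs={"k": 4}).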
If needed, you can save your vector store locally:
# saving the vectorstore (commented out in the notebook)
# vectorstore.save_local("vectorstore.db")
To handle queries that can't be answered using our vector store, we integrate a web search capability:
from langchain_community.tools.tavily_search import TavilySearchResults
web_search_tool = TavilySearchResults(k=10)
The Tavily search API provides web search functionality; the k=10 parameter asks for the top 10 results per query.
We can test the web search functionality directly:
# Sample search (commented out in the notebook)
# web_search_tool.run("Tesla stock market summary for Q3?")
For our agent's reasoning and response generation, we need a powerful language model:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp")
We're using Google's Gemini model, specifically the "gemini-2.0-flash-exp" variant, which offers a good balance between speed and capability for agent-based systems.
Now we define the core functions our agent will use to retrieve information:
# define vector search
from langchain.chains import RetrievalQA
def vector_search(query: str):
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
    return qa_chain.run(query)
# define web search
def web_search(query: str):
    return web_search_tool.run(query)
These functions encapsulate two different retrieval strategies: vector_search answers from the locally indexed Tesla report via a RetrievalQA chain, while web_search fetches live results from the web.
To make these functions accessible to our agent, we need to define them as tools:
from langchain.tools import tool
@tool
def vector_search_tool(query: str) -> str:
    """Tool for searching the vector store."""
    return vector_search(query)

@tool
def web_search_tool_func(query: str) -> str:
    """Tool for performing web search."""
    return web_search(query)
# define tools for the agent
from langchain.agents import Tool
tools = [
    Tool(
        name="VectorStoreSearch",
        func=vector_search_tool,
        description="Use this to search the vector store for information."
    ),
    Tool(
        name="WebSearch",
        func=web_search_tool_func,
        description="Use this to perform a web search for information."
    ),
]
The @tool decorator transforms our functions into LangChain tools. We then wrap these in Tool objects with names and descriptions that help the agent understand when to use each tool.
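The agent never sees Python functions directly; it sees a text rendering of the tool names, descriptions, and argument schemas injected into the {tools} slot of the system prompt. A rough sketch of what that rendering looks like (a simplified stand-in for LangChain's render_text_description_and_args):

```python
def render_tools(tools):
    """Rough sketch of the {tools} prompt text: one
    'name: description, args: ...' line per tool."""
    lines = []
    for name, description, args in tools:
        lines.append(f"{name}: {description}, args: {args}")
    return "\n".join(lines)

tools = [
    ("VectorStoreSearch", "Use this to search the vector store for information.",
     {"query": {"type": "string"}}),
    ("WebSearch", "Use this to perform a web search for information.",
     {"query": {"type": "string"}}),
]
print(render_tools(tools))
```

Because this text is all the model has to go on, the quality of the tool descriptions directly drives how well the agent picks between them.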
The agent's behavior is guided by a system prompt that defines its operational logic:
# define system prompt
system_prompt = """Respond to the human as helpfully and accurately as possible. You have access to the following tools: {tools}
Always try the \"VectorStoreSearch\" tool first. Only use \"WebSearch\" if the vector store does not contain the required information.
Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).
Valid "action" values: "Final Answer" or {tool_names}
Provide only ONE action per $JSON_BLOB, as shown:
```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Follow this format:
Question: input question to answer
Thought: consider previous and subsequent steps
Action:
```
$JSON_BLOB
```
Observation: action result
... (repeat Thought/Action/Observation N times)
Thought: I know what to respond
Action:
```
{{
  "action": "Final Answer",
  "action_input": "Final response to human"
}}
```
Begin! Reminder to ALWAYS respond with a valid json blob of a single action.
Respond directly if appropriate. Format is Action:```$JSON_BLOB```then Observation"""
# human prompt
human_prompt = """{input}
{agent_scratchpad}
(reminder to always respond in a JSON blob)"""
This system prompt is crucial as it defines the tool-priority rule (vector store first, web search only as a fallback), the JSON blob format for tool calls, and the Thought/Action/Observation loop the agent must follow.
Key Point: The instruction "Always try the VectorStoreSearch tool first" establishes a priority order for information retrieval, directing the agent to prefer local knowledge before searching externally.
Now we assemble the complete agent chain that will process queries:
# create prompt template
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", human_prompt),
    ]
)
# tool render
from langchain.tools.render import render_text_description_and_args
prompt = prompt.partial(
    tools=render_text_description_and_args(list(tools)),
    tool_names=", ".join([t.name for t in tools]),
)
# create rag chain
from langchain.schema.runnable import RunnablePassthrough
from langchain.agents.output_parsers import JSONAgentOutputParser
from langchain.agents.format_scratchpad import format_log_to_str
chain = (
    RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_log_to_str(x["intermediate_steps"]),
    )
    | prompt
    | llm
    | JSONAgentOutputParser()
)
# create agent
from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(
    agent=chain,
    tools=tools,
    handle_parsing_errors=True,
    verbose=True
)
This chain connects all the components: the scratchpad formatter injects prior tool calls and observations, the prompt combines them with the system instructions, the LLM decides the next action, and the JSON parser turns its output into a tool call or final answer.
The AgentExecutor manages the execution flow, handling the back-and-forth between thinking, tool usage, and final answer generation.
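That execution flow can be sketched as a plain Python loop. This is a toy with a mocked agent and tool, not LangChain's actual implementation, but it shows the Thought/Action/Observation cycle the AgentExecutor manages:

```python
def run_agent(agent_step, tools, query, max_steps=5):
    """Sketch of the AgentExecutor loop: ask the agent for an action,
    run the chosen tool, feed the observation back, until 'Final Answer'."""
    steps = []  # (action, action_input, observation) history
    for _ in range(max_steps):
        action, action_input = agent_step(query, steps)
        if action == "Final Answer":
            return action_input
        observation = tools[action](action_input)
        steps.append((action, action_input, observation))
    raise RuntimeError("agent did not finish within max_steps")

# Mock agent: search first, then answer from the last observation.
def mock_agent(query, steps):
    if not steps:
        return "VectorStoreSearch", query
    return "Final Answer", f"Based on the report: {steps[-1][2]}"

# Mock tool returning a placeholder instead of a real retrieved chunk.
tools = {"VectorStoreSearch": lambda q: "Total automotive revenues were $X."}
print(run_agent(mock_agent, tools, "Total automotive revenues Q3-2024"))
```

The max_steps cap mirrors AgentExecutor's iteration limit: it prevents an agent that never emits "Final Answer" from looping forever.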
Let's test our agent with a couple of example queries:
# Query about information in the vector store (Tesla Q3 report)
agent_executor.invoke({"input": "Total automotive revenues Q3-2024"})
Since this information is likely in the Tesla Q3 report, the agent should use the VectorStoreSearch tool.
# Query about information likely not in the vector store
agent_executor.invoke({"input": "Tesla stock market summary for 2024?"})
Since this information is broader and likely not in the Q3 report, the agent may need to fall back to the WebSearch tool.
Note: The verbose=True parameter lets us see the agent's reasoning process and tool selection decisions during execution.
For production use, we can create a non-verbose version of the agent and process multiple queries:
# create agent with verbose=False for production
agent_output = AgentExecutor(
    agent=chain,
    tools=tools,
    handle_parsing_errors=True,
    verbose=False
)
# Create dataset
question = [
    "What milestones did the Shanghai factory achieve in Q3 2024?",
    "Tesla stock market summary for 2024?"
]
response = []
contexts = []
# Inference
for query in question:
    vector_contexts = retriever.get_relevant_documents(query)
    if vector_contexts:
        context_texts = [doc.page_content for doc in vector_contexts]
        contexts.append(context_texts)
    else:
        print(f"[DEBUG] No relevant information in vector store for query: {query}. Falling back to web search.")
        web_results = web_search_tool.run(query)
        contexts.append([web_results])
    # Get the agent response
    result = agent_output.invoke({"input": query})
    response.append(result['output'])
# To dict
data = {
    "query": question,
    "response": response,
    "context": contexts,
}
This batch processing approach collects, for each query, both the retrieved context and the agent's final answer, keeping them aligned in parallel lists.
The resulting data dictionary could be used for evaluation, logging, or further processing of the agent's responses.
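For evaluation or logging, the parallel-list layout is often reshaped into one record per query. A small sketch of that conversion (the helper name to_records is illustrative, not part of any library):

```python
def to_records(data):
    """Turn the parallel-list dict into one record per query — a convenient
    shape for logging or feeding an evaluation framework."""
    return [
        {"query": q, "response": r, "context": c}
        for q, r, c in zip(data["query"], data["response"], data["context"])
    ]

# Placeholder values standing in for real agent outputs.
data = {
    "query": ["What milestones did the Shanghai factory achieve in Q3 2024?"],
    "response": ["(agent answer)"],
    "context": [["(retrieved chunk)"]],
}
print(to_records(data)[0]["query"])
```

Each record then carries everything needed to judge one answer: the question, the contexts the agent had available, and what it finally said.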
The Agentic RAG approach offers several advantages over traditional RAG systems: it selects the retrieval source dynamically per query, falls back to web search when the local index lacks the answer, and exposes its reasoning through the Thought/Action/Observation trace.
Advanced Consideration: This basic implementation can be extended with additional tools, better fallback strategies, and more sophisticated reasoning about the quality and relevance of retrieved information.
We've built a basic Agentic RAG system that intelligently decides between local and web-based information retrieval. This approach can be extended in several ways: adding more tools and data sources, refining the fallback strategy, and reasoning explicitly about the quality and relevance of retrieved results.
Agentic RAG represents an evolution in retrieval-augmented generation, providing more flexible and powerful information retrieval capabilities that combine the strengths of both local knowledge bases and external information sources.