Build Your Customized Chatbot with RAG and a LangChain Agent

CW Lin
8 min read · Aug 25, 2024


Since ChatGPT became popular, many companies have started wanting to create their own chatbots to serve customers. The job market has also seen positions like LLM engineer, chatbot engineer/trainer/manager, and others.


Building a chatbot with an LLM and letting it answer questions about things like internal enterprise knowledge is not just a matter of inserting information and calling an LLM API; it also needs lots of pre- and post-processing measures that require custom development. (Your source data might include lots of images, videos, or databases, which are not easy for an LLM to understand.)

However, aside from the complex preprocessing and postprocessing, a customized chatbot that can update its information in real time can essentially be achieved through RAG and an agent.

This article will introduce how to build a chatbot with retrieval and Google Search functions through a LangChain agent.

Retrieval-Augmented Generation (RAG)

When we want the LLM to learn additional knowledge (e.g. internal enterprise information, FAQs, …) but the content is too long to fit into the prompt (due to the input token limit), we need RAG to help us retrieve documents.

The following figure illustrates the RAG concept very well:

Figure from Retrieval-Augmented Generation for Large Language Models: A Survey

Let’s start by looking at the indexing section in the top right corner.
First, we prepare the documents containing the knowledge we want the LLM to have, then we break down this long text into many chunks and convert them into embeddings, which are stored in the vectorstore.

Then, when a user query comes in, we convert the query sentence into an embedding too, and calculate the vector similarity between the query embedding and the embeddings in the vectorstore to find the k most similar chunks.
Finally, we insert the original text of those chunks into the prompt so that the LLM has the relevant knowledge to answer the user query.

Although the idea is pretty intuitive, there are several ways to further improve retrieval performance. The article Retrieval-Augmented Generation for Large Language Models: A Survey describes several such methods. If basic RAG can't fit your scenario, you may refer to it.

Now, let's see how to use LangChain to create a RAG chain. The document I'm going to retrieve from is this paper: Retrieval-Augmented Generation for Large Language Models: A Survey.

Without further ado, let’s see the code:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

def get_qa_chain(pdf_path):
    # read the PDF file
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    # split the docs into text chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    texts = text_splitter.split_documents(documents)

    # embed the chunks and store them in a vectorstore (FAISS)
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(texts, embeddings)

    # create the retriever and the RAG chain
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    question_answer_chain = create_stuff_documents_chain(
        llm=ChatOpenAI(model_name='gpt-4o-mini', temperature=0),
        prompt=rag_prompt,
    )
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)

    return rag_chain

rag_chain = get_qa_chain('RAG.pdf')

Basically, I just copied and pasted from the official tutorials. You can see that LangChain has already packaged a lot of things, including loading PDFs, embedding, vector stores, vector search, and connecting to the LLM. Just thinking about coding all of this from scratch is exhausting 💦

Here, I'm using OpenAI's models (gpt-4o-mini for the RAG chain), so remember to set your API key in the environment variables.
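
A minimal setup sketch (the key value here is just a placeholder):

import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own OpenAI API key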

Now, let’s take a look at the conversation with the RAG chain:
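
For example, you can query the chain like this (the question is just an illustration):

result = rag_chain.invoke({"input": "What are the main paradigms of RAG?"})
print(result["answer"])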

Not bad. Although the LLM may already have relevant knowledge about RAG, the answer from rag_chain is based on RAG.pdf.

In addition, you can use result['context'] to see the k most similar chunks that RAG retrieved.
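
For example, assuming the result dict from the invocation above:

# inspect the retrieved chunks that grounded the answer
for doc in result["context"]:
    print(doc.metadata.get("page"), doc.page_content[:100])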

However, you'll notice that no matter what question you ask, it will perform a retrieval and generate a response, even if you just say "hello".

At this point, you might wonder: what if it could decide on its own whether a question needs retrieval, and execute different functions based on different questions? That's the concept of an agent.

Agent

Now we have a goal: let the LLM decide, for each client question, whether to retrieve or not, and execute different functions depending on the question.

We can build this as in the figure below, using two LLMs to achieve it.

The green LLM determines which tool (RAG, Google Search, or No Need) to use for the client's question, then executes that tool to retrieve information.
The retrieved information is then inserted into the prompt of the blue LLM, instructing it to use this information to generate a response for the client.

We can also achieve this using LangChain's agent. I believe there are pros and cons to each approach.

If your use case isn't too complicated, you can use prompt engineering to let the LLM decide which tool to use. Using the LangChain agent requires less effort, but it can be more complex and harder to debug, and the LangChain version is constantly updating, with commands and applications being continuously revised.
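
As a rough illustration of the prompt-engineering route, a minimal routing sketch might look like this (the routing labels and prompt wording are my own, not from LangChain):

# let the LLM classify the question before deciding which tool to call
router_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Classify the user question. Reply with exactly one word: "
     "RAG (needs the knowledge base), SEARCH (needs the web), or NONE (small talk)."),
    ("human", "{input}"),
])
router = router_prompt | ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
route = router.invoke({"input": "hello"}).content.strip()  # e.g. "NONE"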

But I'm a bit too lazy to write everything out, so let's just implement the agent using LangChain directly. 😅

Besides the rag_chain described in the previous section, I also want to give the agent the ability to do Google searches. In other words, I will provide the agent with two tools: RAG for retrieving the PDF, and Google Search.

We have already finished rag_chain, and we can implement the Google search function via a search API. Here, I use Serper.

You can go to the Serper website and register a free account to get a token, which comes with 2,500 free queries. LangChain also has GoogleSerperAPIWrapper to help us implement it in just one line of code!

from langchain_community.utilities import GoogleSerperAPIWrapper

# GoogleSerperAPIWrapper reads the SERPER_API_KEY environment variable
search = GoogleSerperAPIWrapper()

Let’s do a simple test of the search functionality:
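
For example (the query is arbitrary):

print(search.run("What is Retrieval Augmented Generation?"))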

OK, now we can use search to do a Google search and get the results back as a string.

Let's wrap rag_chain and search into tools:

from langchain.agents import Tool, AgentExecutor, create_react_agent

tools = [
    Tool(
        name="RAG",
        # create_retrieval_chain expects a dict input, so wrap it in a lambda
        func=lambda query: rag_chain.invoke({"input": query})["answer"],
        description="Useful when you're asked Retrieval Augmented Generation (RAG) related questions",
    ),
    Tool(
        name="Google Search",
        description="For answering questions that are not covered by the knowledge base, or when you don't know the answer, use Google Search to find the answer",
        func=search.run,
    ),
]

The description in each tool is what lets the LLM know which tool should be used; hence, based on your requirements, write the descriptions as clearly as possible.

Now, we design a prompt to constrain the way the agent reasons and to make it output the response we want.

character_prompt = """Answer the following questions as best you can. You have access to the following tools:
{tools}

For any questions requiring tools, you should first search the provided knowledge base. If you don't find relevant information from provided knowledge base, then use Google search to find related information.

To use a tool, you MUST use the following format:
1. Thought: Do I need to use a tool? Yes
2. Action: the action to take, should be one of [{tool_names}]
3. Action Input: the input to the action
4. Observation: the result of the action

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the following format:
1. Thought: Do I need to use a tool? No
2. Final Answer: [your response here]

It's very important to always include the 'Thought' before any 'Action' or 'Final Answer'. Ensure your output strictly follows the formats above.

Begin!

Previous conversation history:
{chat_history}

Question: {input}
Thought: {agent_scratchpad}
"""

You might wonder how anyone writes out such a long prompt. You can check LangChain's prompt hub to find some examples and then make adjustments.

from langchain import hub
hub.pull("hwchase17/react")

Next, use create_react_agent to package the tools and prompt into an agent, and then use AgentExecutor to run the agent and generate a response. I also took the opportunity to add a memory mechanism here.

from langchain.prompts.prompt import PromptTemplate
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

chat_model = ChatOpenAI(model_name='gpt-4',
                        temperature=0,
                        streaming=True,
                        verbose=True,
                        max_tokens=1024,
                        )

prompt = PromptTemplate.from_template(character_prompt)
agent = create_react_agent(chat_model, tools, prompt)

# keep the last 5 turns of the conversation as memory
memory = ConversationBufferWindowMemory(memory_key='chat_history', k=5, return_messages=True, output_key="output")
agent_chain = AgentExecutor(agent=agent,
                            tools=tools,
                            memory=memory,
                            max_iterations=5,
                            handle_parsing_errors=True,
                            verbose=True,
                            )

This agent_chain can be used for conversations. When the user inputs a question, it determines whether to use a tool and which tool to use, then executes actions to get the tool's response.
It then checks whether it has enough information. If it does, it generates a reply based on the collected information; if not, it continues executing actions until the maximum number of iterations is reached.

Note that it may run into an issue: the LLM doesn't follow the format instructions in our prompt, which results in a parsing error. You'll find that it has already collected enough information to answer the question, but it replies Sorry, I don't have enough information to answer your question...

So, here I try to use prompt engineering to alleviate that. You can see that I repeatedly emphasized "you MUST use the following format" and "It's very important to always include the 'Thought' before any 'Action' or 'Final Answer'. Ensure your output strictly follows the formats above."

I also use the more powerful gpt-4 model, because I found that larger models are better at following prompt instructions.

Now, let’s see the performance of agent_chain:
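
For example (the questions are illustrative; the first should be routed to the RAG tool and the second to Google Search):

agent_chain.invoke({"input": "What are the drawbacks of naive RAG?"})
agent_chain.invoke({"input": "What's the weather in Taipei today?"})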

It seems to respond very well! By adding some front-end development, you can interact with this chatbot on a webpage.

In the end, I organized this agent into the following flowchart.
I believe it’s a chatbot architecture that can be used in many scenarios; you just need to replace the RAG retrieval knowledge base with your own content.
