Building A Custom Chatbot To Query PDF Documents With Langchain

Introduction

Extracting answers from large volumes of text has become a routine need. Many organizations keep valuable information in PDF documents, and a chatbot that can answer questions about those documents makes that information far easier to reach. In this tutorial, we build a chatbot that answers questions based on the content of your own PDF documents using Langchain. Langchain is a framework for building applications on top of large language models; here it connects your documents to OpenAI's GPT models and embeddings. The steps below walk through the process, from loading the files to attributing answers to their sources.

Step-by-Step Process

Step 1: Importing Required Libraries

Before you can work with Langchain and your text data, you must import several libraries and packages. Here's a list of the imports and their purposes:

  • PdfReader (PyPDF2): Reads the text of an individual PDF file.
  • OpenAIEmbeddings: Langchain's wrapper around OpenAI's embedding models.
  • FAISS: The vector store used for storing and searching text embeddings.
  • CharacterTextSplitter: A tool to split large documents into smaller, more manageable chunks.
  • OpenAI: Required for integrating OpenAI's GPT model for language processing.
  • RetrievalQA: The chain behind the Question-Answer system.
  • DirectoryLoader: Used for loading documents from a specified directory.
  • magic: File-type detection (python-magic), which the document loader may rely on.
  • os: For interacting with the operating system, handling file paths, and directory operations.
  • nltk: The Natural Language Toolkit for natural language processing tasks.

# Note: these import paths follow the older Langchain releases this tutorial was written for
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
import magic
import os
import nltk

openai_api_key = os.getenv("OPENAI_API_KEY", "YourAPIKey")
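
Because the line above falls back to the placeholder string when the environment variable is unset, a missing key would only surface later as an authentication error. A minimal sketch of a fail-fast check follows; the helper name resolve_api_key is hypothetical and is not part of Langchain or the OpenAI client.

```python
import os

PLACEHOLDER = "YourAPIKey"  # the placeholder default used in this tutorial


def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or still the placeholder.

    `resolve_api_key` is a hypothetical helper written for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(name, PLACEHOLDER).strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set the {name} environment variable before running the tutorial.")
    return key
```

Calling resolve_api_key() before Step 4 stops the script early with a readable message instead of failing in the middle of an embedding request.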

Step 2: Document Loading

The first step is to load your PDF documents into Langchain. Langchain provides a tool called the "Directory Loader" that allows you to load a directory of PDF files. Alternatively, you can load individual text files if needed.

Here's how you can do it:

  • Define the directory path where your PDFs are stored.
  • Use Langchain to load your PDF documents from the specified directory.

This step ensures that your documents are accessible and ready for further processing.

# Match every PDF under the directory, including subfolders
loader = DirectoryLoader('User pdf path/', glob='**/*.pdf')
documents = loader.load()
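
If the loader returns an empty list, the usual cause is a glob pattern that matches no files. The sketch below uses only the standard library to list what a pattern matches before you hand the directory to the loader; list_matching_files is a hypothetical helper, and you should pass it the same pattern you give to DirectoryLoader.

```python
from pathlib import Path


def list_matching_files(directory, pattern="**/*.pdf"):
    """Return the sorted files under `directory` that match the glob `pattern`.

    Hypothetical helper for this sketch; `**` also searches subfolders.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory!r} is not a directory")
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

For example, len(list_matching_files('User pdf path/')) tells you how many PDFs the directory contains, which you can compare against what you expect to load.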

Step 3: Splitting Documents

PDF documents can be lengthy, making it challenging to process them effectively. To work with them more efficiently, you need to divide your PDFs into smaller, manageable chunks. These chunks are typically around 1000 characters each.

Langchain provides a "Character Splitter" for this purpose. During this step, you will:

  • Initialize the character splitter.
  • Split your loaded documents into smaller parts.

# Split on newlines into chunks of about 1000 characters so that no chunk grows too large for the model
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_documents(documents)

Splitting the documents helps Langchain handle the content more effectively, making it easier to process and query.
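
To see what the chunk size and overlap settings mean, here is a simplified sketch that cuts a string into fixed-size windows, each sharing its first characters with the end of the previous one. It is an illustration only: Langchain's splitter first splits on the separator and then merges the pieces, so its chunk boundaries will differ. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`.

    Simplified illustration; not Langchain's actual splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the default settings, a 2500-character text yields three chunks, and the last 200 characters of each chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary intact in at least one chunk.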

Step 4: Generating Embeddings

To work with text data effectively, you need to convert it into a numerical format called embeddings. Embeddings represent the content of your documents in a vector space, allowing for easy analysis and querying.

Here's how you can generate embeddings for your split documents:

  • Initialize the OpenAI embedding engine through Langchain.
  • Create a vector store from the split documents using the embedding engine.

These embeddings serve as the foundation for your chatbot's understanding of the content within your PDF documents.

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = FAISS.from_documents(texts, embeddings)
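
Conceptually, a vector store keeps each chunk next to its embedding and returns the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with a brute-force search over hand-written two-dimensional vectors. ToyVectorStore is a hypothetical class for illustration; FAISS uses optimized indexes, and real embeddings have far more dimensions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) pairs."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((tuple(vector), text))

    def search(self, query_vector, k=2):
        """Return the texts of the `k` stored vectors closest to `query_vector`."""
        ranked = sorted(self._items, key=lambda item: euclidean_distance(item[0], query_vector))
        return [text for _, text in ranked[:k]]


# Made-up vectors: the first two chunks point in a similar direction
store = ToyVectorStore()
store.add([0.9, 0.1], "Invoices are due within 30 days.")
store.add([0.8, 0.2], "Late payments incur a 2% fee.")
store.add([0.1, 0.9], "The office is closed on public holidays.")
```

Searching with a query vector near the first two entries, such as store.search([1.0, 0.0], k=2), returns the two payment-related chunks and leaves out the unrelated one.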
    

What are embeddings?

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating-point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
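
One common way to measure that relatedness is cosine similarity, which compares the direction of two vectors and ignores their length. The sketch below computes it with the standard library on small made-up vectors; the function names are chosen for this example, and real embeddings would come from the embedding model rather than being typed by hand.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors `a` and `b` (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Small values mean the vectors are closely related."""
    return 1.0 - cosine_similarity(a, b)
```

Two vectors pointing the same way have a distance close to 0, while perpendicular vectors have a distance of 1.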

Step 5: Initializing the Model

The next step involves initializing the language model that will be used for interacting with your PDFs. In this tutorial, we use OpenAI's GPT-3 model through Langchain. Initializing the model is straightforward:

  • Initialize the model using Langchain.
  • Pass your OpenAI API key so that the model can authenticate.

This step sets up the model that generates responses from prompts and retrieved facts, which is essential to answering questions.

llm = OpenAI(openai_api_key=openai_api_key)

Step 6: Setting Up the Question-Answer (QA) System

Langchain provides a convenient Question-Answer (QA) system that enables you to query your documents. Setting up this system is essential for interacting with your PDFs.

Here's what you do:

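  • Initialize the QA system using Langchain, passing it the model and a retriever built from the vector store.
  • Set the chain type, which determines how the retrieved chunks are handed to the model.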

The QA system facilitates interactions with your documents and serves as the bridge between your questions and the content in your PDFs.

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever())
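
The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The function build_stuff_prompt and the prompt wording are assumptions made for illustration; they are not Langchain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join every retrieved chunk into one context block and append the question.

    Hypothetical helper; the wording is illustrative, not Langchain's template.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every chunk goes into one prompt, this approach is simple and needs a single model call, but the combined text has to fit within the model's context limit. That is one reason the documents are split into small chunks in Step 3.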

Step 7: Asking Questions

Now comes the exciting part - querying your PDFs with custom questions. Using the Langchain QA system, you can ask questions related to the content in your documents.

Here's how to ask a question:

  • Construct a question relevant to the information in your PDFs.
  • Use the QA system to ask the question.

For example, you could ask about specific details or facts within your PDF documents, and the chatbot will retrieve answers based on the content it has processed.

query = "Question you want to ask from pdf"
print(qa.run(query))
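
To turn single queries into a conversation, you can wrap the call in a loop that keeps reading questions until the user quits. The sketch below is one possible shape, assuming any function that maps a question string to an answer string. The name chat_loop and its parameters are hypothetical, and the read and write arguments exist only so the loop can be exercised without a terminal.

```python
def chat_loop(answer_fn, read=input, write=print):
    """Answer questions until the user types 'exit' or 'quit'; return how many were answered.

    `answer_fn` is any callable that takes a question string and returns an answer string.
    """
    answered = 0
    while True:
        try:
            question = read("You: ").strip()
        except EOFError:
            break  # input stream closed, for example Ctrl-D in a terminal
        if question.lower() in {"exit", "quit"}:
            break
        if not question:
            continue  # ignore empty lines
        write(f"Bot: {answer_fn(question)}")
        answered += 1
    return answered
```

With the QA system from Step 6, the call would be chat_loop(qa.run). Note that each question is answered independently, since this loop keeps no memory of earlier turns.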

Step 8: Attributing Sources

One remarkable feature of Langchain is the ability to attribute sources to the answers. This is particularly useful when you have multiple documents and want to know which ones were referenced for a particular answer.

Here's how you can attribute sources:

  • Ask a question using the QA system and specify the option to return source documents.
  • The system will not only provide you with the answer but also specify which documents contributed to that answer.

This step allows you to trace back and understand which parts of your PDFs were used to generate the response, enhancing the transparency and reliability of the information provided.

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
query = "Question you want to ask from pdf"
result = qa({"query": query})
result['result']            # the answer text
result['source_documents']  # the chunks the answer was based on
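
The returned source documents are chunks, so several of them may come from the same file. The sketch below collects the distinct file names in the order they first appear. It assumes each document carries a metadata dictionary with a "source" key, which is how Langchain's file loaders typically label documents, so verify this against your version. FakeDocument is a stand-in defined here so the example runs without Langchain.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a Langchain document: chunk text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(source_documents):
    """Return the distinct 'source' values, in first-seen order."""
    seen = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")  # assumed metadata key
        if source not in seen:
            seen.append(source)
    return seen
```

Under that assumption, unique_sources(result['source_documents']) gives a short list of files you can show next to the answer.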

Benefits:

Enhanced Data Accessibility: Building a custom chatbot with Langchain allows you to access and interact with your PDF documents more efficiently, reducing the time and effort required to search manually and retrieve information.

Instant Insights: The chatbot retrieves the relevant passages from your PDFs and answers from them, so you can quickly obtain the information you need without extensive manual document analysis. Because answers can be traced to their source documents, you can also verify them.

Tailored Solutions: With the ability to build custom chatbots, you can create solutions that are specific to your organization's needs. Ask questions, get answers, and make data-driven decisions tailored to your unique requirements.

Improved Efficiency: Langchain streamlines the process of handling and querying PDF documents. By leveraging AI, you can boost productivity and get more done in less time.

Conclusion:

Querying your PDF using Langchain and creating a chatbot for custom questions is a powerful and versatile capability that can be applied to a wide range of use cases. Whether you're a researcher, a knowledge seeker, or someone who wants to make the most of their PDF documents, Langchain provides the tools and features to create a custom chatbot and access valuable insights from your text-based content.

This tutorial provides a detailed overview of each step in the process, ensuring that you can effectively leverage Langchain to explore the potential of your PDFs with AI-powered text

References
https://www.langchain.ca/blog/custom-chatbot-to-query-pdf-documents-using-openai-and-langchain/