This project demonstrates how to use LangChain to build a question-answering system that can process and query the content of a PDF document. It leverages OpenAI's API for language processing and embedding, and uses Apache Cassandra as a vector store for efficient similarity searching.
- Python 3.7 or higher
- Required libraries:
  - cassio
  - datasets
  - langchain
  - openai
  - tiktoken
  - PyPDF2
First, ensure you have the required Python libraries. You can install them using the following commands:

```
!pip install -q cassio datasets langchain openai tiktoken
!pip install PyPDF2
```

Ensure you have your Astra DB and OpenAI API credentials ready:
- ASTRA_DB_APPLICATION_TOKEN
- ASTRA_DB_ID
- OPENAI_API_KEY
Replace `YOUR_ASTRA_DB_TOKEN`, `YOUR_ASTRA_DB_ID`, and `YOUR_OPENAI_API_KEY` with your actual credentials within the code.
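Rather than hard-coding credentials in the script, one common option is to read them from environment variables. Here is a minimal sketch; the variable names below are assumptions, not anything required by cassio or LangChain:

```python
import os

# Read credentials from environment variables, falling back to the
# placeholders so a missing variable is easy to spot at runtime.
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN", "YOUR_ASTRA_DB_TOKEN")
ASTRA_DB_ID = os.environ.get("ASTRA_DB_ID", "YOUR_ASTRA_DB_ID")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
```

This keeps secrets out of version control; the rest of the code can use these names unchanged.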
- Import necessary libraries and set up configurations:
  - Import the relevant modules for processing and querying.

  ```python
  from langchain.vectorstores.cassandra import Cassandra
  from langchain.indexes.vectorstore import VectorStoreIndexWrapper
  from langchain.llms import OpenAI
  from langchain.embeddings import OpenAIEmbeddings
  import cassio
  from PyPDF2 import PdfReader
  ```
- Read the PDF content:
  - The `PdfReader` class is used to read 'Climate-report.pdf' and collect the text of every page.

  ```python
  pdfreader = PdfReader('Climate-report.pdf')

  raw_text = ''
  for i, page in enumerate(pdfreader.pages):
      content = page.extract_text()
      if content:
          raw_text += content
  ```
- Initialize the connection to Astra DB:
  - Set up the connection using your credentials.

  ```python
  cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)
  ```
- Set up the OpenAI LLM and embedding configuration:
  - Initialize the OpenAI LLM and embeddings with your API key.

  ```python
  llm = OpenAI(openai_api_key=OPENAI_API_KEY)
  embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
  ```
- Create and populate the vector store:
  - Set up Cassandra as the vector store for the document text.

  ```python
  astra_vector_store = Cassandra(
      embedding=embedding,
      table_name='pdfqueryproj',
      session=None,
      keyspace=None,
  )
  ```
- Process the text into chunks:
  - Split the text into manageable chunks for vector storage.

  ```python
  from langchain.text_splitter import CharacterTextSplitter

  text_splitter = CharacterTextSplitter(
      separator="\n",
      chunk_size=800,
      chunk_overlap=200,
      length_function=len,
  )
  texts = text_splitter.split_text(raw_text)
  ```
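To see what `chunk_size` and `chunk_overlap` control, here is a small pure-Python sketch of the sliding-window idea. This is an illustration only, not LangChain's actual implementation (which splits on the separator first and then merges pieces):

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=200):
    # Step forward by (chunk_size - chunk_overlap) characters, so each
    # consecutive pair of chunks shares chunk_overlap characters of context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

sample = "".join(str(i % 10) for i in range(2000))
chunks = sliding_chunks(sample)
# 2000 characters -> 3 chunks of at most 800 characters each,
# with 200 characters repeated between neighbours.
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval quality.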
- Add the text chunks to the vector store and build an index wrapper:

  ```python
  astra_vector_store.add_texts(texts[:50])

  astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)
  ```
- Query the data interactively:
  - Use an interactive loop to query the data with user input.

  ```python
  first_question = True
  while True:
      if first_question:
          query_text = input("Enter your question (type 'quit' to exit): ").strip()
      else:
          query_text = input("\nWhat is your next question? ").strip()

      if query_text.lower() == "quit":
          break
      if query_text == "":
          continue

      first_question = False

      print("\nQuestion: \"%s\"" % query_text)
      answer = astra_vector_index.query(query_text, llm=llm).strip()
      print("\nAnswer: \"%s\"\n" % answer)

      print("FIRST DOCUMENTS BY RELEVANCE:")
      for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
          print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))
  ```
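The loop's control flow (quit on "quit", skip blank input) can be exercised without a terminal by separating it from `input()`. The sketch below is hypothetical: `run_queries` and the lambda are illustrative stand-ins for the interactive loop and `astra_vector_index.query`:

```python
def run_queries(inputs, answer_fn):
    # Replays the interactive loop's logic over a list of inputs:
    # stop at "quit", skip blank lines, answer everything else.
    answers = []
    for raw in inputs:
        query_text = raw.strip()
        if query_text.lower() == "quit":
            break
        if query_text == "":
            continue
        answers.append(answer_fn(query_text))
    return answers

# The lambda stubs out astra_vector_index.query(...) for demonstration.
demo = run_queries(["What causes warming?", "  ", "quit", "ignored"],
                   lambda q: "answer to: " + q)
print(demo)  # ['answer to: What causes warming?']
```

Structuring it this way also makes the loop easy to unit-test before wiring in the real index.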
This project provides a robust solution for querying and extracting information from PDF documents by leveraging LangChain, OpenAI, and Cassandra.
- Make sure to replace placeholders for the tokens and IDs with actual values.
- Adjust the dataset and model configurations as needed for your specific use case.
- Ensure the PDF document path is correct and accessible within your environment.
