From 10abfa45f866514cd47d7968da1737cff153c8c5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 4 Apr 2026 06:54:12 +0000 Subject: [PATCH 1/2] Initial plan From 3e6a634a5f76b5a8ee0223a7cd3170fb6aeb7fce Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 4 Apr 2026 06:56:59 +0000 Subject: [PATCH 2/2] Add professional README and Python .gitignore Agent-Logs-Url: https://github.com/sammy995/PDFQuery-VectorDB/sessions/723a1fc7-7b23-4fa6-9704-a9a3cfa41c9a Co-authored-by: sammy995 <68530417+sammy995@users.noreply.github.com> --- .gitignore | 40 ++++++++++ README.md | 214 ++++++++++++++++++++++++----------------------------- 2 files changed, 135 insertions(+), 119 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..72da8b4 --- /dev/null +++ b/.gitignore @@ -0,0 +1,40 @@ +# Python +__pycache__/ +*.py[cod] +*.pyo +*.pyd +*.egg +*.egg-info/ +dist/ +build/ +.eggs/ + +# Virtual environments +venv/ +.venv/ +env/ +.env/ + +# Environment / secrets +.env +.env.* +*.env + +# Jupyter +.ipynb_checkpoints/ +*.ipynb_checkpoints + +# Uploaded PDF files (keep tracked sample PDFs explicitly) +uploads/ +*.pdf +!Climate-report.pdf + +# Vector store index files +faiss_index/ +chroma_db/ +*.pkl +*.index + +# OS +.DS_Store +Thumbs.db diff --git a/README.md b/README.md index d1919dd..138b898 100644 --- a/README.md +++ b/README.md @@ -1,139 +1,115 @@ -## PDF Query with LangChain and Cassandra Vector Store +# PDFQuery-VectorDB -### Overview -This project demonstrates how to use LangChain to build a question-answering system that can process and query the content of a PDF document. It leverages OpenAI's API for language processing and embedding, and uses Apache Cassandra as a vector store for efficient similarity searching. 
+![Python](https://img.shields.io/badge/Python-3.8%2B-blue?logo=python) +![LangChain](https://img.shields.io/badge/LangChain-RAG-green?logo=chainlink) +![OpenAI](https://img.shields.io/badge/OpenAI-Embeddings%20%26%20LLM-412991?logo=openai) +![Cassandra](https://img.shields.io/badge/AstraDB-Vector%20Store-1287B1?logo=apachecassandra) +![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-orange?logo=jupyter) -### Requirements -- Python 3.7 or higher -- Required libraries: - - `cassio` - - `datasets` - - `langchain` - - `openai` - - `tiktoken` - - `PyPDF2` +A **Retrieval-Augmented Generation (RAG)** pipeline that lets you ask natural-language questions about any PDF document. It extracts text from a PDF, stores chunk embeddings in a Cassandra (AstraDB) vector store, and answers queries using OpenAI's LLM — returning both a generated answer and the most relevant source passages. -### Installation -First, ensure you have the required Python libraries. You can install them using the following commands: +## Architecture + +![RAG Architecture](architecture.png) + +**Pipeline overview:** + +``` +PDF Document + │ + ▼ +PyPDF2 (text extraction) + │ + ▼ +LangChain CharacterTextSplitter (chunking) + │ + ▼ +OpenAI Embeddings ──► AstraDB / Cassandra Vector Store + │ +User Question ──► OpenAI Embeddings ─┘ + │ + Similarity Search + │ + ▼ + OpenAI LLM (GPT) + │ + ▼ + Answer + Source Docs +``` + +## Tech Stack + +| Component | Library / Service | +|---|---| +| PDF Parsing | `PyPDF2` | +| Text Chunking | `langchain` (`CharacterTextSplitter`) | +| Embeddings | OpenAI `text-embedding-ada-002` | +| Vector Store | Apache Cassandra via **DataStax AstraDB** (`cassio`) | +| LLM | OpenAI GPT (via `langchain`) | +| Notebook | Jupyter | + +## Prerequisites + +- Python 3.8+ +- A [DataStax AstraDB](https://astra.datastax.com/) account (free tier works) +- An [OpenAI](https://platform.openai.com/) API key + +## Installation ```bash -!pip install -q cassio datasets langchain openai tiktoken 
-!pip install pyPDF2 +pip install cassio datasets langchain openai tiktoken PyPDF2 ``` -### Configuration -Ensure you have your Astra DB and OpenAI API credentials ready: -- `ASTRA_DB_APPLICATION_TOKEN` -- `ASTRA_DB_ID` -- `OPENAI_API_KEY` - -Replace `YOUR_ASTRA_DB_TOKEN`, `YOUR_ASTRA_DB_ID`, and `YOUR_OPENAI_API_KEY` with your actual credentials within the code. - -### Usage -1. **Import necessary libraries and set up configurations:** - - Importing relevant modules for processing and querying. - ```python - from langchain.vectorstores.cassandra import Cassandra - from langchain.indexes.vectorstore import VectorStoreIndexWrapper - from langchain.llms import OpenAI - from langchain.embeddings import OpenAIEmbeddings - import cassio - - from PyPDF2 import PdfReader - ``` +## Configuration -2. **Read the PDF content:** - - The `PdfReader` module is used to read 'Climate-report.pdf'. - ```python - pdfreader = PdfReader('Climate-report.pdf') - - raw_text = '' - - for i, page in enumerate(pdfreader.pages): - content = page.extract_text() - if content: - raw_text += content - ``` +Set your credentials as environment variables (recommended) or replace the placeholders directly in the notebook: -3. **Initialize the connection to Astra DB:** - - Set up the connection using your credentials. - ```python - cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID) - ``` +```bash +export ASTRA_DB_APPLICATION_TOKEN="your_astra_db_token" +export ASTRA_DB_ID="your_astra_db_id" +export OPENAI_API_KEY="your_openai_api_key" +``` -4. **Setup OpenAI and Embedding configuration:** - - Initialize OpenAI with API key and embeddings. - ```python - llm = OpenAI(openai_api_key=OPENAI_API_KEY) - embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY) - ``` +> ⚠️ **Never commit real API keys to version control.** Add a `.env` file and load it with `python-dotenv`. -5. **Create and populate the vector store:** - - Set up Cassandra as the vector store for the dataset. 
- ```python - astra_vector_store = Cassandra( - embedding=embedding, - table_name='pdfqueryproj', - session=None, - keyspace=None - ) - ``` +## Usage -6. **Process the text into chunks:** - - Split text into manageable chunks for vector storage. - ```python - from langchain.text_splitter import CharacterTextSplitter - - text_splitter = CharacterTextSplitter( - separator="\n", - chunk_size=800, - chunk_overlap=200, - length_function=len, - ) - - texts = text_splitter.split_text(raw_text) +1. **Open the notebook:** + ```bash + jupyter notebook PDFQuery_VectorDB.ipynb ``` -7. **Add text datasets into the vector store** - ```python - astra_vector_store.add_texts(texts[:50]) - - astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store) +2. **Place your PDF** in the project root (the notebook uses `Climate-report.pdf` by default — update the filename as needed). + +3. **Run all cells.** The notebook will: + - Extract and chunk the PDF text + - Generate embeddings and populate the AstraDB vector store + - Launch an interactive Q&A loop + +4. **Ask questions interactively:** ``` + Enter your question.. (type quit to exit): What are the main causes of climate change? + + Question: "What are the main causes of climate change?" + Answer: "The main causes include greenhouse gas emissions from fossil fuels, deforestation..." -8. **Query the data interactively:** - - Use an interactive loop to query the data with user input. - ```python - first_question = True - - while True: - if first_question: - query_text = input("Enter your question..(type quit to exit): ").strip() - else: - query_text = input("\n What is your next question.. 
:").strip() - - if query_text.lower() == "quit": - break - if query_text.lower() == "": - continue - first_question = False - - print("\nQuestion: \"%s\"" % query_text) - answer = astra_vector_index.query(query_text, llm=llm).strip() - print("\nAnswer: \"%s\"\n" % answer) - - print("FIRST DOCUMENTS BY RELEVANCE: ") - for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4): - print(" [%0.4f] \"%s ...\"" % (score, doc.page_content[:84])) + FIRST DOCUMENTS BY RELEVANCE: + [0.9231] "Human activities, primarily the burning of fossil fuels and deforestation ..." + [0.9105] "Rising global temperatures are driven by increased concentrations of CO2 ..." ``` -### Conclusion -This project provides a robust solution for querying and extracting information from PDF documents by leveraging LangChain, OpenAI, and Cassandra. +## File Structure + +``` +PDFQuery-VectorDB/ +├── PDFQuery_VectorDB.ipynb # Main notebook (RAG pipeline) +├── Climate-report.pdf # Sample PDF document +├── architecture.png # Pipeline architecture diagram +└── README.md +``` -### Architecture -- ![Sample Image](architecture.png) +## Notes -### Notes -- Make sure to replace placeholders for the tokens and IDs with actual values. -- Adjust the dataset and model configurations as needed for your specific use case. -- Ensure the PDF document path is correct and accessible within your environment. +- The first run loads up to 50 text chunks into the vector store; adjust `texts[:50]` in the notebook to change this. +- The `table_name` in the Cassandra store (`pdfqueryproj`) can be changed to any valid CQL identifier. +- Swap `Climate-report.pdf` for any PDF to query different documents.
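
For reference, the "chunking" step in the pipeline diagram above can be sketched in plain Python. This is a rough, stdlib-only approximation of what LangChain's `CharacterTextSplitter` does with the notebook's settings (`separator="\n"`, `chunk_size=800`, `chunk_overlap=200`); `split_text` is a hypothetical helper for illustration, not part of the project or of LangChain.

```python
def split_text(raw_text, chunk_size=800, chunk_overlap=200, separator="\n"):
    """Greedy character-based chunking, approximating CharacterTextSplitter.

    Splits on `separator`, merges pieces into chunks of at most
    `chunk_size` characters, and carries the trailing `chunk_overlap`
    characters of each chunk into the next so context spans boundaries.
    """
    chunks, current = [], ""
    for piece in raw_text.split(separator):
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        elif current:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one.
            current = current[-chunk_overlap:] + separator + piece
        else:
            # A single piece longer than chunk_size passes through uncut
            # (the real splitter emits a warning in this case).
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

The overlap is what lets a retrieved chunk still carry the end of the previous sentence or paragraph, which noticeably improves answer quality for questions that straddle chunk boundaries.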
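
Similarly, the "Similarity Search" step ranks stored chunk embeddings by closeness to the question embedding, which is what `similarity_search_with_score(query_text, k=4)` returns in the notebook. A minimal stdlib sketch with toy two-dimensional vectors (the real pipeline uses 1536-dimensional OpenAI embeddings and AstraDB's index; `top_k` and `cosine_similarity` are illustrative names, not project code):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, stored, k=4):
    # stored: list of (chunk_text, embedding) pairs; returns the k
    # highest-scoring (score, chunk_text) pairs, best first.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, reverse=True)[:k]
```

The top-scoring chunks are then passed to the LLM as context, which is what makes the generated answer grounded in the PDF rather than in the model's general knowledge.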