sammy995 · Copilot · Apr 4, 2026 · Apr 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,40 @@
+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+*.egg
+*.egg-info/
+dist/
+build/
+.eggs/
+
+# Virtual environments
+venv/
+.venv/
+env/
+.env/
+
+# Environment / secrets
+.env
+.env.*
+*.env
+
+# Jupyter
+.ipynb_checkpoints/
+*.ipynb_checkpoints
+
+# Uploaded PDF files (keep tracked sample PDFs explicitly)
+uploads/
+*.pdf
+!Climate-report.pdf
+
+# Vector store index files
+faiss_index/
+chroma_db/
+*.pkl
+*.index
+
+# OS
+.DS_Store
+Thumbs.db
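Review note on the PDF rules in this `.gitignore`: `*.pdf` followed by `!Climate-report.pdf` relies on the last matching pattern winning. The sketch below illustrates only that ordering rule, under simplifying assumptions: it matches on the file's base name and ignores directory rules such as `uploads/`. `PDF_RULES` and `is_ignored` are hypothetical names for illustration, and this is not git's actual matcher.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The two file-level PDF rules from the .gitignore in this change.
PDF_RULES = ["*.pdf", "!Climate-report.pdf"]


def is_ignored(path, rules=PDF_RULES):
    """Simplified last-match-wins check on the file's base name.

    Assumption: only slash-free file patterns are handled; directory
    patterns and git's other matching rules are out of scope.
    """
    name = PurePosixPath(path).name
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(name, pattern):
            # A later matching rule overrides any earlier decision.
            ignored = not negated
    return ignored
```

In real git, a file under an excluded directory such as `uploads/` stays ignored even if a later `!` pattern matches its name, so `git check-ignore -v <path>` remains the authoritative check.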
diff --git a/README.md b/README.md
@@ -1,139 +1,115 @@
-## PDF Query with LangChain and Cassandra Vector Store
+# PDFQuery-VectorDB
 
-### Overview
-This project demonstrates how to use LangChain to build a question-answering system that can process and query the content of a PDF document. It leverages OpenAI's API for language processing and embedding, and uses Apache Cassandra as a vector store for efficient similarity searching.
+![Python](https://img.shields.io/badge/Python-3.8%2B-blue?logo=python)
+![LangChain](https://img.shields.io/badge/LangChain-RAG-green?logo=chainlink)
+![OpenAI](https://img.shields.io/badge/OpenAI-Embeddings%20%26%20LLM-412991?logo=openai)
+![Cassandra](https://img.shields.io/badge/AstraDB-Vector%20Store-1287B1?logo=apachecassandra)
+![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-orange?logo=jupyter)
 
-### Requirements
-- Python 3.7 or higher
-- Required libraries:
-    - `cassio`
-    - `datasets`
-    - `langchain`
-    - `openai`
-    - `tiktoken`
-    - `PyPDF2`
+A **Retrieval-Augmented Generation (RAG)** pipeline that lets you ask natural-language questions about any PDF document. It extracts text from a PDF, stores chunk embeddings in a Cassandra (AstraDB) vector store, and answers queries with an OpenAI LLM, returning both a generated answer and the most relevant source passages.
 
-### Installation
-First, ensure you have the required Python libraries. You can install them using the following commands:
+## Architecture
+
+![RAG Architecture](architecture.png)
+
+**Pipeline overview:**
+
+```
+PDF Document
+    │
+    ▼
+PyPDF2 (text extraction)
+    │
+    ▼
+LangChain CharacterTextSplitter (chunking)
+    │
+    ▼
+OpenAI Embeddings  ──►  AstraDB / Cassandra Vector Store
+                                    │
+User Question ──► OpenAI Embeddings ─┘
+                                    │
+                            Similarity Search
+                                    │
+                                    ▼
+                             OpenAI LLM (GPT)
+                                    │
+                                    ▼
+                            Answer + Source Docs
+```
+
+## Tech Stack
+
+| Component | Library / Service |
+|---|---|
+| PDF Parsing | `PyPDF2` |
+| Text Chunking | `langchain` (`CharacterTextSplitter`) |
+| Embeddings | OpenAI `text-embedding-ada-002` |
+| Vector Store | Apache Cassandra via **DataStax AstraDB** (`cassio`) |
+| LLM | OpenAI GPT (via `langchain`) |
+| Notebook | Jupyter |
+
+## Prerequisites
+
+- Python 3.8+
+- A [DataStax AstraDB](https://astra.datastax.com/) account (free tier works)
+- An [OpenAI](https://platform.openai.com/) API key
+
+## Installation
 
 ```bash
-!pip install -q cassio datasets langchain openai tiktoken
-!pip install pyPDF2
+pip install cassio datasets langchain openai tiktoken PyPDF2
 ```
 
-### Configuration
-Ensure you have your Astra DB and OpenAI API credentials ready:
-- `ASTRA_DB_APPLICATION_TOKEN`
-- `ASTRA_DB_ID`
-- `OPENAI_API_KEY`
-
-Replace `YOUR_ASTRA_DB_TOKEN`, `YOUR_ASTRA_DB_ID`, and `YOUR_OPENAI_API_KEY` with your actual credentials within the code.
-
-### Usage
-1. **Import necessary libraries and set up configurations:**
-   - Importing relevant modules for processing and querying.
-   ```python
-   from langchain.vectorstores.cassandra import Cassandra
-   from langchain.indexes.vectorstore import VectorStoreIndexWrapper
-   from langchain.llms import OpenAI
-   from langchain.embeddings import OpenAIEmbeddings
-   import cassio
-
-   from PyPDF2 import PdfReader
-   ```
+## Configuration
 
-2. **Read the PDF content:**
-   - The `PdfReader` module is used to read 'Climate-report.pdf'.
-   ```python
-   pdfreader = PdfReader('Climate-report.pdf')
-
-   raw_text = ''
-
-   for i, page in enumerate(pdfreader.pages):
-       content = page.extract_text()
-       if content:
-           raw_text += content
-   ```
+Set your credentials as environment variables (recommended) or replace the placeholders directly in the notebook:
 
-3. **Initialize the connection to Astra DB:**
-   - Set up the connection using your credentials.
-   ```python
-   cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)
-   ```
+```bash
+export ASTRA_DB_APPLICATION_TOKEN="your_astra_db_token"
+export ASTRA_DB_ID="your_astra_db_id"
+export OPENAI_API_KEY="your_openai_api_key"
+```
 
-4. **Setup OpenAI and Embedding configuration:**
-   - Initialize OpenAI with API key and embeddings.
-   ```python
-   llm = OpenAI(openai_api_key=OPENAI_API_KEY)
-   embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
-   ```
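+> ⚠️ **Never commit real API keys to version control.** Keep them in a `.env` file (ignored by this repository's `.gitignore`) and load it with `python-dotenv`, which is not part of the install command above and must be installed separately.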
 
-5. **Create and populate the vector store:**
-   - Set up Cassandra as the vector store for the dataset.
-   ```python
-   astra_vector_store = Cassandra(
-       embedding=embedding,
-       table_name='pdfqueryproj',
-       session=None,
-       keyspace=None
-   )
-   ```
+## Usage
 
-6. **Process the text into chunks:**
-   - Split text into manageable chunks for vector storage.
-   ```python
-   from langchain.text_splitter import CharacterTextSplitter
-
-   text_splitter = CharacterTextSplitter(
-       separator="\n",
-       chunk_size=800,
-       chunk_overlap=200,
-       length_function=len,
-   )
-
-   texts = text_splitter.split_text(raw_text)
+1. **Open the notebook:**
+   ```bash
+   jupyter notebook PDFQuery_VectorDB.ipynb
    ```
 
-7. **Add text datasets into the vector store**
-   ```python
-   astra_vector_store.add_texts(texts[:50])
-
-   astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)
+2. **Place your PDF** in the project root. The notebook reads `Climate-report.pdf` by default; update the filename in the notebook to use a different file.
+
+3. **Run all cells.** The notebook will:
+   - Extract and chunk the PDF text
+   - Generate embeddings and populate the AstraDB vector store
+   - Launch an interactive Q&A loop
+
+4. **Ask questions interactively:**
    ```
+   Enter your question..(type quit to exit): What are the main causes of climate change?
+
+   Question: "What are the main causes of climate change?"
+   Answer: "The main causes include greenhouse gas emissions from fossil fuels, deforestation..."
 
-8. **Query the data interactively:**
-   - Use an interactive loop to query the data with user input.
-   ```python
-   first_question = True
-
-   while True:
-       if first_question:
-           query_text = input("Enter your question..(type quit to exit): ").strip()
-       else:
-           query_text = input("\n What is your next question.. :").strip()
-
-       if query_text.lower() == "quit":
-           break
-       if query_text.lower() == "":
-           continue
-       first_question = False
-
-       print("\nQuestion: \"%s\"" % query_text)
-       answer = astra_vector_index.query(query_text, llm=llm).strip()
-       print("\nAnswer: \"%s\"\n" % answer)
-
-       print("FIRST DOCUMENTS BY RELEVANCE: ")
-       for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
-           print("  [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))
+   FIRST DOCUMENTS BY RELEVANCE:
+     [0.9231] "Human activities, primarily the burning of fossil fuels and deforestation ..."
+     [0.9105] "Rising global temperatures are driven by increased concentrations of CO2 ..."
    ```
 
-### Conclusion
-This project provides a robust solution for querying and extracting information from PDF documents by leveraging LangChain, OpenAI, and Cassandra.
+## File Structure
+
+```
+PDFQuery-VectorDB/
+├── PDFQuery_VectorDB.ipynb   # Main notebook (RAG pipeline)
+├── Climate-report.pdf        # Sample PDF document
+├── architecture.png          # Pipeline architecture diagram
+└── README.md
+```
 
-### Architecture
-- ![Sample Image](architecture.png)
+## Notes
 
-### Notes
-- Make sure to replace placeholders for the tokens and IDs with actual values.
-- Adjust the dataset and model configurations as needed for your specific use case.
-- Ensure the PDF document path is correct and accessible within your environment.
+- Each run adds at most the first 50 text chunks to the vector store; adjust the `texts[:50]` slice in the notebook to change this.
+- The `table_name` in the Cassandra store (`pdfqueryproj`) can be changed to any valid CQL identifier.
+- Swap `Climate-report.pdf` for any PDF to query different documents.
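Review note on the new Configuration section: the README now recommends environment variables, but the step-by-step code that consumed the credentials has been removed from it. A minimal sketch of how the notebook could read them, assuming the three variable names from the `export` lines; `REQUIRED_VARS` and `load_credentials` are hypothetical helper names, not part of the notebook.

```python
import os

# Variable names taken from the README's Configuration section.
REQUIRED_VARS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def load_credentials(env=None):
    """Return the three credentials as a dict, or raise if any is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Hypothetical usage in the notebook, matching the calls in the old README:
#   creds = load_credentials()
#   cassio.init(token=creds["ASTRA_DB_APPLICATION_TOKEN"],
#               database_id=creds["ASTRA_DB_ID"])
```

Failing early with the list of missing names avoids a less obvious authentication error later, when the AstraDB connection or the first OpenAI call is made.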