40 changes: 40 additions & 0 deletions .gitignore
@@ -0,0 +1,40 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.egg
*.egg-info/
dist/
build/
.eggs/

# Virtual environments
venv/
.venv/
env/
.env/

# Environment / secrets
.env
.env.*
*.env

# Jupyter
.ipynb_checkpoints/
*.ipynb_checkpoints

# Uploaded PDF files (keep tracked sample PDFs explicitly)
uploads/
*.pdf
!Climate-report.pdf

# Vector store index files
faiss_index/
chroma_db/
*.pkl
*.index

# OS
.DS_Store
Thumbs.db
214 changes: 95 additions & 119 deletions README.md
@@ -1,139 +1,115 @@
## PDF Query with LangChain and Cassandra Vector Store
# PDFQuery-VectorDB

### Overview
This project demonstrates how to use LangChain to build a question-answering system that can process and query the content of a PDF document. It leverages OpenAI's API for language processing and embedding, and uses Apache Cassandra as a vector store for efficient similarity searching.
![Python](https://img.shields.io/badge/Python-3.8%2B-blue?logo=python)
![LangChain](https://img.shields.io/badge/LangChain-RAG-green?logo=chainlink)
![OpenAI](https://img.shields.io/badge/OpenAI-Embeddings%20%26%20LLM-412991?logo=openai)
![Cassandra](https://img.shields.io/badge/AstraDB-Vector%20Store-1287B1?logo=apachecassandra)
![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-orange?logo=jupyter)

### Requirements
- Python 3.7 or higher
- Required libraries:
- `cassio`
- `datasets`
- `langchain`
- `openai`
- `tiktoken`
- `PyPDF2`
A **Retrieval-Augmented Generation (RAG)** pipeline that lets you ask natural-language questions about any PDF document. It extracts text from a PDF, stores chunk embeddings in a Cassandra (AstraDB) vector store, and answers queries using OpenAI's LLM — returning both a generated answer and the most relevant source passages.

### Installation
First, ensure you have the required Python libraries. You can install them using the following commands:
## Architecture

![RAG Architecture](architecture.png)

**Pipeline overview:**

```
PDF Document
PyPDF2 (text extraction)
LangChain CharacterTextSplitter (chunking)
OpenAI Embeddings ──► AstraDB / Cassandra Vector Store
User Question ──► OpenAI Embeddings ─┘
Similarity Search
OpenAI LLM (GPT)
Answer + Source Docs
```
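
The similarity-search stage in the overview can be illustrated with a toy example. This is only a sketch, not how AstraDB or `cassio` performs the search: it assumes hand-written three-dimensional vectors in place of real OpenAI embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, reverse=True)[:k]


# Toy "vector store": (chunk text, made-up embedding) pairs.
store = [
    ("Fossil fuels emit CO2", [0.9, 0.1, 0.0]),
    ("Deforestation reduces carbon sinks", [0.7, 0.3, 0.1]),
    ("The report was published in spring", [0.0, 0.2, 0.9]),
]

# A made-up query embedding that points in the "emissions" direction.
results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in results:
    print("[%0.4f] %s" % (score, text))
```

In the real pipeline the question is embedded with the same OpenAI model as the chunks, and the vector store returns the closest chunks, which are then passed to the LLM as context.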

## Tech Stack

| Component | Library / Service |
|---|---|
| PDF Parsing | `PyPDF2` |
| Text Chunking | `langchain` (`CharacterTextSplitter`) |
| Embeddings | OpenAI `text-embedding-ada-002` |
| Vector Store | Apache Cassandra via **DataStax AstraDB** (`cassio`) |
| LLM | OpenAI GPT (via `langchain`) |
| Notebook | Jupyter |

## Prerequisites

- Python 3.8+
- A [DataStax AstraDB](https://astra.datastax.com/) account (free tier works)
- An [OpenAI](https://platform.openai.com/) API key

## Installation

```bash
!pip install -q cassio datasets langchain openai tiktoken
!pip install pyPDF2
pip install cassio datasets langchain openai tiktoken PyPDF2
```

### Configuration
Ensure you have your Astra DB and OpenAI API credentials ready:
- `ASTRA_DB_APPLICATION_TOKEN`
- `ASTRA_DB_ID`
- `OPENAI_API_KEY`

Replace `YOUR_ASTRA_DB_TOKEN`, `YOUR_ASTRA_DB_ID`, and `YOUR_OPENAI_API_KEY` with your actual credentials within the code.

### Usage
1. **Import necessary libraries and set up configurations:**
- Importing relevant modules for processing and querying.
```python
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
import cassio

from PyPDF2 import PdfReader
```
## Configuration

2. **Read the PDF content:**
- The `PdfReader` class is used to read 'Climate-report.pdf'.
```python
pdfreader = PdfReader('Climate-report.pdf')

raw_text = ''

for i, page in enumerate(pdfreader.pages):
    content = page.extract_text()
    if content:
        raw_text += content
```
Set your credentials as environment variables (recommended) or replace the placeholders directly in the notebook:

3. **Initialize the connection to Astra DB:**
- Set up the connection using your credentials.
```python
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)
```
```bash
export ASTRA_DB_APPLICATION_TOKEN="your_astra_db_token"
export ASTRA_DB_ID="your_astra_db_id"
export OPENAI_API_KEY="your_openai_api_key"
```

4. **Set up OpenAI and embedding configuration:**
- Initialize OpenAI with API key and embeddings.
```python
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
```
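> ⚠️ **Never commit real API keys to version control.** To keep them in a file instead, put them in a `.env` file (already excluded by `.gitignore`) and load it with `python-dotenv`, which is not part of the install command above and must be installed separately.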
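
One way to read the three credentials from the environment and fail early when one is missing is sketched below. It uses only the standard library; `load_credentials` and `REQUIRED_KEYS` are hypothetical names and are not part of the notebook.

```python
import os

# The three variable names come from the Configuration section.
REQUIRED_KEYS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def load_credentials(env=os.environ):
    """Return the credentials as a dict, or raise if any is unset or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a fake environment so the sketch runs anywhere;
# in the notebook, call load_credentials() with no arguments instead.
fake_env = {
    "ASTRA_DB_APPLICATION_TOKEN": "your_astra_db_token",
    "ASTRA_DB_ID": "your_astra_db_id",
    "OPENAI_API_KEY": "your_openai_api_key",
}
creds = load_credentials(fake_env)
print(sorted(creds))
```

The returned values can then be passed to `cassio.init(...)` and the OpenAI classes in place of hard-coded placeholders.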

5. **Create and populate the vector store:**
- Set up Cassandra as the vector store for the dataset.
```python
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name='pdfqueryproj',
    session=None,
    keyspace=None
)
```
## Usage

6. **Process the text into chunks:**
- Split text into manageable chunks for vector storage.
```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.split_text(raw_text)
1. **Open the notebook:**
```bash
jupyter notebook PDFQuery_VectorDB.ipynb
```

7. **Add text datasets into the vector store**
```python
astra_vector_store.add_texts(texts[:50])

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)
2. **Place your PDF** in the project root (the notebook uses `Climate-report.pdf` by default — update the filename as needed).

3. **Run all cells.** The notebook will:
- Extract and chunk the PDF text
- Generate embeddings and populate the AstraDB vector store
- Launch an interactive Q&A loop

4. **Ask questions interactively:**
```
Enter your question.. (type quit to exit): What are the main causes of climate change?

Question: "What are the main causes of climate change?"
Answer: "The main causes include greenhouse gas emissions from fossil fuels, deforestation..."

8. **Query the data interactively:**
- Use an interactive loop to query the data with user input.
```python
first_question = True

while True:
    if first_question:
        query_text = input("Enter your question..(type quit to exit): ").strip()
    else:
        query_text = input("\n What is your next question.. :").strip()

    if query_text.lower() == "quit":
        break
    if query_text.lower() == "":
        continue
    first_question = False

    print("\nQuestion: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("\nAnswer: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE: ")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print(" [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))
FIRST DOCUMENTS BY RELEVANCE:
[0.9231] "Human activities, primarily the burning of fossil fuels and deforestation ..."
[0.9105] "Rising global temperatures are driven by increased concentrations of CO2 ..."
```

### Conclusion
This project provides a robust solution for querying and extracting information from PDF documents by leveraging LangChain, OpenAI, and Cassandra.
## File Structure

```
PDFQuery-VectorDB/
├── PDFQuery_VectorDB.ipynb # Main notebook (RAG pipeline)
├── Climate-report.pdf # Sample PDF document
├── architecture.png # Pipeline architecture diagram
└── README.md
```

### Architecture
- ![Sample Image](architecture.png)
## Notes

### Notes
- Make sure to replace placeholders for the tokens and IDs with actual values.
- Adjust the dataset and model configurations as needed for your specific use case.
- Ensure the PDF document path is correct and accessible within your environment.
- The notebook loads at most the first 50 text chunks into the vector store; adjust `texts[:50]` in the notebook to change this.
- The `table_name` in the Cassandra store (`pdfqueryproj`) can be changed to any valid CQL identifier.
- Swap `Climate-report.pdf` for any PDF to query different documents.
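
To load more than the first 50 chunks, one option is to add them in batches. The sketch below is an illustration under stated assumptions, not code from the notebook: `batched` and `load_all_chunks` are hypothetical helpers, the batch size of 50 is arbitrary, and `add_texts` stands for any callable that accepts a list of strings (in the notebook that would be `astra_vector_store.add_texts`).

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_all_chunks(texts, add_texts, batch_size=50):
    """Send every chunk to add_texts in batches; return the number of chunks sent."""
    total = 0
    for batch in batched(texts, batch_size):
        add_texts(batch)
        total += len(batch)
    return total


# A plain list stands in for the vector store so the sketch runs without credentials.
received = []
count = load_all_chunks(["chunk %d" % i for i in range(120)], received.append)
print(count, [len(batch) for batch in received])
```

Each chunk is embedded through the OpenAI API when it is added, so loading a whole document costs more API calls than loading the first 50 chunks.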