InformationExtraction — PDF RAG API

Upload any PDF, ask questions, get answers — powered by sentence-transformers, FAISS, and DeepSeek via HuggingFace Inference API.
FastAPI · FAISS · Sentence Transformers · PyMuPDF · Google Cloud Run · Docker


What This Is

A lightweight Retrieval-Augmented Generation (RAG) API that turns any PDF into a queryable knowledge base in seconds.

Upload a PDF, ask a question in natural language, and get a context-grounded answer. Because the LLM is constrained to the retrieved chunks, it answers only from the uploaded document instead of hallucinating from content that wasn't there.

Deployed on Google Cloud Run via Cloud Build CI/CD.


How It Works

User uploads PDF + question
        │
        ▼
┌─────────────────────┐
│   FastAPI endpoint  │  POST /query_pdf
│   /query_pdf        │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│   PyMuPDF (fitz)    │  Extract raw text from PDF pages
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│   Text chunker      │  Sentence-aware splitting (500 char max)
│   LocalRag          │  Normalizes whitespace, preserves sentences
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Sentence           │  all-MiniLM-L6-v2
│  Transformers       │  Batch encode chunks → normalized embeddings
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│   FAISS Index       │  IndexFlatIP (inner product on normalized
│                     │  vectors = cosine similarity)
└────────┬────────────┘
         │  Top-K retrieval (k=5)
         ▼
┌─────────────────────┐
│   HF Inference API  │  DeepSeek-V3 via InferenceClient
│   (DeepSeek-V3)     │  Grounded prompt with retrieved context
└────────┬────────────┘
         │
         ▼
      {"answer": "..."}

Tech Stack

| Layer | Technology |
|---|---|
| API framework | FastAPI |
| PDF extraction | PyMuPDF (fitz) |
| Embeddings | sentence-transformers · all-MiniLM-L6-v2 |
| Vector store | FAISS (IndexFlatIP — cosine via normalized IP) |
| LLM | DeepSeek-V3 via HuggingFace Inference API |
| Containerization | Docker |
| Cloud deployment | Google Cloud Run |
| CI/CD | Google Cloud Build (cloudbuild.yaml) |

API Endpoints

GET /

Health check — confirms the API is running.

{"message": "RAG API is running. Use /docs to interact with it."}

POST /query_pdf

Upload a PDF and ask a question about it.

Request (multipart/form-data):

| Field | Type | Description |
|---|---|---|
| file | PDF file | The document to query |
| question | string | Your natural-language question |

Response:

{
  "answer": "The contract termination clause requires 30 days written notice..."
}

Error responses:

  • 400 — Not a PDF, empty file, or no text extractable
  • 500 — Extraction or RAG pipeline failure

RAG Pipeline Design

Text chunking

Sentence-aware splitting with a 500-character hard cap per chunk. The chunker normalizes whitespace once, splits on sentence boundaries ([.!?]), and accumulates sentences until the cap is reached — avoiding mid-sentence cuts that break context.
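A minimal sketch of this chunking strategy (function and parameter names here are illustrative, not the repo's actual API):

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Sentence-aware chunking: split on sentence boundaries, then
    greedily pack whole sentences into chunks of at most max_chars."""
    # Normalize all whitespace runs to single spaces, once up front
    text = re.sub(r"\s+", " ", text).strip()
    # Split after sentence-ending punctuation ([.!?]) followed by a space
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than the cap becomes its own chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever break between sentences, no chunk ends mid-thought, which keeps each retrieved chunk self-contained.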

Embeddings

all-MiniLM-L6-v2 from sentence-transformers — a fast, lightweight model that produces 384-dimensional embeddings. Embeddings are L2-normalized before indexing so inner product search is equivalent to cosine similarity.
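Why normalization matters, in a small numpy sketch (toy 2-D vectors stand in for the real 384-dimensional embeddings):

```python
import numpy as np

# Toy "embeddings" (the real pipeline uses 384-dim vectors from all-MiniLM-L6-v2)
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# L2-normalize: divide each vector by its Euclidean norm
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# Inner product of normalized vectors...
ip = float(np.dot(a_n, b_n))
# ...equals cosine similarity of the originals
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
assert abs(ip - cos) < 1e-12
```

This is the identity that lets a plain inner-product index double as a cosine-similarity index.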

FAISS retrieval

IndexFlatIP (flat inner product index) — exact nearest-neighbor search, no approximation. Top-5 chunks are retrieved per query.

Generation

Retrieved chunks are concatenated (capped at 2000 characters to keep inference fast) and passed to DeepSeek-V3 via the HuggingFace Inference API. The prompt explicitly instructs the model to answer only from context and say "I don't know" if the answer isn't present — minimizing hallucination.
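The grounded-prompt pattern can be sketched like this (the exact wording and function name are illustrative; the repo's actual prompt may differ):

```python
def build_prompt(context_chunks: list[str],
                 question: str,
                 max_context_chars: int = 2000) -> str:
    """Build a prompt that constrains the LLM to the retrieved context."""
    # Concatenate retrieved chunks, capped to keep inference fast
    context = "\n\n".join(context_chunks)[:max_context_chars]
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is then sent to DeepSeek-V3 through the HuggingFace `InferenceClient`; the explicit "I don't know" instruction gives the model a sanctioned way out when the context lacks an answer.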

Index persistence

The FAISS index and chunk list can be saved to disk (save_index()) and reloaded (_load_index()) to avoid rebuilding on every request for large documents.


Local Development

Prerequisites

  • Python 3.11+
  • HuggingFace API key with Inference API access

Setup

# Clone the repo
git clone https://github.com/DebugJedi/InformationExtraction.git
cd InformationExtraction

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set up environment
echo "HF_API_KEY=your_huggingface_api_key" > .env

Run locally

uvicorn app:app --reload --port 8000

Open http://localhost:8000/docs for the interactive API UI.

Test with cURL

curl -X POST http://localhost:8000/query_pdf \
  -F "file=@your_document.pdf" \
  -F "question=What are the key findings in this report?"

Docker

# Build
docker build -t rag-api .

# Run
docker run -p 8000:8000 -e HF_API_KEY=your_key rag-api

Deploying to Google Cloud Run

This repo includes a full cloudbuild.yaml that builds, pushes, and deploys automatically.

One-time setup

# Authenticate
gcloud auth login
gcloud config set project YOUR_PROJECT_ID

# Enable required APIs
gcloud services enable run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com

# Create Artifact Registry repo
gcloud artifacts repositories create ocr-rag-app-repo \
  --repository-format=docker \
  --location=us-west1

Trigger a deployment

gcloud builds submit --config cloudbuild.yaml \
  --substitutions _SERVICE=my-app,_REGION=us-west1

Or connect your GitHub repo to Cloud Build for automatic deploys on every push.

Set environment variables on Cloud Run

gcloud run services update my-app \
  --region=us-west1 \
  --set-env-vars HF_API_KEY=your_key

Project Structure

InformationExtraction/
├── app.py                  ←  FastAPI app · /query_pdf endpoint
├── src/
│   ├── config.py           ←  LocalRag class · FAISS · embeddings · generation
│   └── utils/
│       └── extract_text.py ←  PyMuPDF PDF + txt extraction
├── Dockerfile
├── cloudbuild.yaml         ←  GCP Cloud Build CI/CD pipeline
├── requirements.txt
└── .gitignore

Environment Variables

| Variable | Required | Description |
|---|---|---|
| HF_API_KEY | ✅ Yes | HuggingFace API key for Inference API access |

Roadmap

- [x] PDF upload and text extraction
- [x] Sentence-aware text chunking
- [x] FAISS vector index with cosine similarity
- [x] HuggingFace Inference API generation
- [x] FAISS index persistence (save/load)
- [x] Docker + Google Cloud Run deployment
- [x] Cloud Build CI/CD pipeline
- [ ] Multi-document knowledge base (persistent store)
- [ ] Support for .txt, .docx uploads
- [ ] Streaming responses
- [ ] Auth middleware for API key protection

Author

Built and maintained by Priyank Rao — Data Scientist / ML Engineer
Portfolio · GitHub


License

MIT
