Upload any PDF, ask questions, get answers — powered by sentence-transformers, FAISS, and DeepSeek via HuggingFace Inference API.
FastAPI · FAISS · Sentence Transformers · PyMuPDF · Google Cloud Run · Docker
A lightweight Retrieval-Augmented Generation (RAG) API that turns any PDF into a queryable knowledge base in seconds.
Upload a PDF, ask a question in natural language, and get a context-grounded answer — the LLM is constrained to the retrieved chunks, so answers come from your document rather than from hallucinated outside knowledge.
Deployed on Google Cloud Run via Cloud Build CI/CD.
User uploads PDF + question
│
▼
┌─────────────────────┐
│ FastAPI endpoint │ POST /query_pdf
│ /query_pdf │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ PyMuPDF (fitz) │ Extract raw text from PDF pages
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Text chunker │ Sentence-aware splitting (500 char max)
│ LocalRag │ Normalizes whitespace, preserves sentences
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Sentence │ all-MiniLM-L6-v2
│ Transformers │ Batch encode chunks → normalized embeddings
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ FAISS Index │ IndexFlatIP (inner product on normalized
│ │ vectors = cosine similarity)
└────────┬────────────┘
│ Top-K retrieval (k=5)
▼
┌─────────────────────┐
│ HF Inference API │ DeepSeek-V3 via InferenceClient
│ (DeepSeek-V3) │ Grounded prompt with retrieved context
└────────┬────────────┘
│
▼
{"answer": "..."}
| Layer | Technology |
|---|---|
| API framework | FastAPI |
| PDF extraction | PyMuPDF (fitz) |
| Embeddings | sentence-transformers · all-MiniLM-L6-v2 |
| Vector store | FAISS (IndexFlatIP — cosine via normalized IP) |
| LLM | DeepSeek-V3 via HuggingFace Inference API |
| Containerization | Docker |
| Cloud deployment | Google Cloud Run |
| CI/CD | Google Cloud Build (cloudbuild.yaml) |
Health check — confirms the API is running.
{"message": "RAG API is running. Use /docs to interact with it."}Upload a PDF and ask a question about it.
Request (multipart/form-data):
| Field | Type | Description |
|---|---|---|
| `file` | PDF file | The document to query |
| `question` | string | Your natural language question |
Response:
```json
{
  "answer": "The contract termination clause requires 30 days written notice..."
}
```

Error responses:

- `400` — Not a PDF, empty file, or no text extractable
- `500` — Extraction or RAG pipeline failure
Sentence-aware splitting with a 500-character hard cap per chunk. The chunker normalizes whitespace once, splits on sentence boundaries ([.!?]), and accumulates sentences until the cap is reached — avoiding mid-sentence cuts that break context.
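A minimal sketch of that chunking strategy (function name and regex are illustrative, not the repo's actual `LocalRag` code):

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Sentence-aware chunking: normalize whitespace once, then pack
    whole sentences into chunks of at most max_chars characters."""
    text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace once
    sentences = re.split(r"(?<=[.!?])\s+", text)  # split after . ! ?
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)                # flush the filled chunk
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    # Note: in this sketch a single sentence longer than max_chars is kept whole
    return chunks
```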
all-MiniLM-L6-v2 from sentence-transformers — a fast, lightweight model that produces 384-dimensional embeddings. Embeddings are L2-normalized before indexing so inner product search is equivalent to cosine similarity.
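The equivalence can be checked numerically — on L2-normalized vectors, a plain dot product equals cosine similarity (random numpy vectors stand in for real MiniLM embeddings here):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    # Divide each row by its Euclidean norm, mirroring what the pipeline
    # does to embeddings before adding them to the index
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Random stand-ins for 384-dimensional MiniLM embeddings
rng = np.random.default_rng(42)
emb = rng.normal(size=(10, 384))
normed = l2_normalize(emb)

a, b = emb[0], emb[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
inner = normed[0] @ normed[1]   # inner product on normalized vectors == cosine
```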
IndexFlatIP (flat inner product index) — exact nearest-neighbor search, no approximation. Top-5 chunks are retrieved per query.
Retrieved chunks are concatenated (capped at 2000 characters to keep inference fast) and passed to DeepSeek-V3 via the HuggingFace Inference API. The prompt explicitly instructs the model to answer only from context and say "I don't know" if the answer isn't present — minimizing hallucination.
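The grounding step amounts to prompt construction — a sketch with illustrative wording and helper names (the actual generation call goes through huggingface_hub's InferenceClient):

```python
def build_grounded_prompt(chunks: list[str], question: str,
                          max_context_chars: int = 2000) -> str:
    # Concatenate retrieved chunks, capped to keep inference fast
    context = "\n\n".join(chunks)[:max_context_chars]
    # Constrain the model to the retrieved context only
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The resulting prompt is then sent to DeepSeek-V3 via the
# HuggingFace Inference API (InferenceClient) in the real pipeline.
```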
The FAISS index and chunk list can be saved to disk (save_index()) and reloaded (_load_index()) to avoid rebuilding on every request for large documents.
- Python 3.11+
- HuggingFace API key with Inference API access
```bash
# Clone the repo
git clone https://github.com/DebugJedi/InformationExtraction.git
cd InformationExtraction

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set up environment
echo "HF_API_KEY=your_huggingface_api_key" > .env
```

Run the dev server:

```bash
uvicorn app:app --reload --port 8000
```

Open http://localhost:8000/docs for the interactive API UI.
```bash
curl -X POST http://localhost:8000/query_pdf \
  -F "file=@your_document.pdf" \
  -F "question=What are the key findings in this report?"
```

```bash
# Build
docker build -t rag-api .

# Run
docker run -p 8000:8000 -e HF_API_KEY=your_key rag-api
```

This repo includes a full `cloudbuild.yaml` that builds, pushes, and deploys automatically.
```bash
# Authenticate
gcloud auth login
gcloud config set project YOUR_PROJECT_ID

# Enable required APIs
gcloud services enable run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com

# Create Artifact Registry repo
gcloud artifacts repositories create ocr-rag-app-repo \
  --repository-format=docker \
  --location=us-west1
```

Submit a build:

```bash
gcloud builds submit --config cloudbuild.yaml \
  --substitutions _SERVICE=my-app,_REGION=us-west1
```

Or connect your GitHub repo to Cloud Build for automatic deploys on every push.
```bash
gcloud run services update my-app \
  --region=us-west1 \
  --set-env-vars HF_API_KEY=your_key
```

```
InformationExtraction/
├── app.py                   ← FastAPI app · /query_pdf endpoint
├── src/
│   ├── config.py            ← LocalRag class · FAISS · embeddings · generation
│   └── utils/
│       └── extract_text.py  ← PyMuPDF PDF + txt extraction
├── Dockerfile
├── cloudbuild.yaml          ← GCP Cloud Build CI/CD pipeline
├── requirements.txt
└── .gitignore
```
| Variable | Required | Description |
|---|---|---|
| `HF_API_KEY` | ✅ Yes | HuggingFace API key for Inference API access |
- PDF upload and text extraction
- Sentence-aware text chunking
- FAISS vector index with cosine similarity
- HuggingFace Inference API generation
- FAISS index persistence (save/load)
- Docker + Google Cloud Run deployment
- Cloud Build CI/CD pipeline
- Multi-document knowledge base (persistent store)
- Support for `.txt` / `.docx` uploads
- Streaming responses
- Auth middleware for API key protection
Built and maintained by Priyank Rao — Data Scientist / ML Engineer
Portfolio · GitHub
MIT