
scraPy - Intelligent Web Scraping Platform

scraPy Logo

Production-grade web scraping platform powered by AI


Features · Architecture · Quick Start · API Docs · Deployment


Overview

scraPy is a modern, production-ready web scraping platform that combines the power of AI with traditional CSS selectors to extract data from any website. Built with a microservices architecture, it offers both a beautiful web interface and a comprehensive REST API for programmatic access.

Key Highlights

  • 🤖 AI-Powered Extraction - Natural language instructions powered by Google Gemini
  • 🎯 Traditional Selectors - CSS selector support for precise data extraction
  • 🚀 Async Processing - Background job queue with ARQ for non-blocking operations
  • 🔒 Enterprise Security - SSRF protection, rate limiting, JWT authentication
  • 📊 Real-time Monitoring - Health checks, structured logging, job tracking
  • 🌐 Dynamic Content - Playwright integration for JavaScript-heavy sites
  • 🔑 API Key Management - Secure programmatic access with HMAC webhooks
  • 📱 Modern UI - Responsive Next.js interface with real-time updates

Features

🎯 Core Scraping Capabilities

Guided Mode (CSS Selectors)

{
  "url": "https://example.com",
  "mode": "guided",
  "selectors": {
    "title": "h1",
    "description": "p.description",
    "price": ".price-tag"
  }
}
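
Under the hood, guided mode amounts to fetching the page and applying each selector to the parsed HTML. A minimal sketch with httpx and BeautifulSoup4 (the function name is illustrative, not the project's actual API):

# Sketch of guided-mode extraction: one CSS selector per output field.
import httpx
from bs4 import BeautifulSoup

def scrape_guided(url: str, selectors: dict[str, str]) -> dict[str, str | None]:
    html = httpx.get(url, follow_redirects=True, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    results: dict[str, str | None] = {}
    for field, css in selectors.items():
        el = soup.select_one(css)  # first element matching the selector
        results[field] = el.get_text(strip=True) if el else None
    return results

print(scrape_guided("https://example.com", {"title": "h1", "description": "p"}))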

Smart Mode (AI-Powered)

{
  "url": "https://example.com",
  "mode": "smart",
  "instruction": "Extract product name, price, and availability status"
}
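
Smart mode hands the page and your instruction to Gemini. A rough sketch of what such a call can look like, assuming the google-generativeai SDK (prompt wording and function name are illustrative, not the project's actual implementation):

# Sketch of smart-mode extraction via Gemini; prompt and helper are illustrative.
import json
import google.generativeai as genai

genai.configure(api_key="your_gemini_api_key")
model = genai.GenerativeModel("gemini-2.0-flash")

def extract_smart(html: str, instruction: str) -> dict:
    prompt = (
        f"{instruction}\n"
        "Return a single JSON object and nothing else.\n\n"
        f"HTML:\n{html[:20000]}"  # truncate to respect context limits
    )
    text = model.generate_content(prompt).text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json")  # strip a Markdown code fence
    return json.loads(text)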

🔐 Security Features

  • SSRF Prevention - Blocks private IPs, localhost, link-local addresses
  • Rate Limiting - Per-API-key limits stored in Redis
  • Input Validation - Pydantic validators for all inputs
  • Request Size Limits - 10MB max payload protection
  • Error Sanitization - No internal details exposed in production
  • JWT Authentication - Clerk-based user authentication
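
The SSRF prevention described above can be built almost entirely on the standard library: resolve the hostname, then reject any non-public address before fetching. A minimal sketch (the function name is illustrative):

# Sketch of an SSRF guard: refuse private, loopback, link-local,
# and reserved addresses for every IP the hostname resolves to.
import ipaddress
import socket
from urllib.parse import urlparse

def assert_public_url(url: str) -> None:
    host = urlparse(url).hostname
    if host is None:
        raise ValueError("URL has no hostname")
    for _family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise ValueError(f"Blocked non-public address: {ip}")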

📊 Monitoring & Observability

  • Health Checks - /health endpoint with DB/Redis verification
  • Structured Logging - Consistent log format with levels
  • Job Tracking - Complete lifecycle logging with duration metrics
  • API Documentation - Auto-generated Swagger UI at /docs

🔧 Advanced Features

  • Dynamic Content Support - Playwright for JavaScript-rendered pages
  • Webhook Notifications - HMAC-signed webhooks for job completion
  • Job History - Persistent storage in PostgreSQL
  • Result Caching - Redis-based caching with 1-hour TTL
  • Async Workers - ARQ-based background processing
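
For the HMAC-signed webhooks, both sides share a secret: the worker signs the raw request body and the receiver recomputes the signature and compares. A sketch in which the header name and hex encoding are illustrative conventions, not the project's confirmed wire format:

# Sketch of HMAC-SHA256 webhook signing and verification.
import hashlib
import hmac
import httpx

def sign_and_send(url: str, secret: str, body: bytes) -> None:
    signature = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    httpx.post(url, content=body, headers={"X-Webhook-Signature": signature})

def verify(secret: str, body: bytes, signature: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)  # constant-time comparison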

Architecture

System Design - Complete Overview

Note: For the interactive architecture-beta diagram, visit Mermaid Live Editor and paste the code from architecture_diagram.md

graph TB
    subgraph client[" CLIENT LAYER "]
        browser[🌐 Web Browser]
        api_consumer[📡 API Consumer]
    end

    subgraph frontend[" FRONTEND - VERCEL "]
        nextjs[⚡ Next.js App<br/>React 19 + App Router]
        clerk_frontend[🔐 Clerk SDK]
    end

    subgraph backend[" BACKEND - RENDER "]
        fastapi[🚀 FastAPI Server<br/>Port 8000]
        auth_middleware[🔒 Auth Middleware<br/>JWT Validation]
        rate_limiter[⏱️ Rate Limiter<br/>Redis-backed]
        ssrf_guard[🛡️ SSRF Protection<br/>IP Filtering]
    end

    subgraph processing[" PROCESSING LAYER - RENDER "]
        arq_worker[⚙️ ARQ Worker<br/>Background Jobs]
        scraper_engine[🕷️ Scraper Engine<br/>Guided + Smart Mode]
        playwright[🎭 Playwright Browser<br/>Chromium]
        llm_service[🤖 LLM Service<br/>Gemini Integration]
    end

    subgraph data[" DATA LAYER - RENDER "]
        postgres[(🗄️ PostgreSQL<br/>Job History + Users)]
        redis[(⚡ Redis<br/>Queue + Cache)]
    end

    subgraph external[" EXTERNAL SERVICES "]
        clerk_auth[🔐 Clerk Auth<br/>Authentication]
        gemini_ai[🧠 Google Gemini<br/>AI Model]
        target_sites[🌍 Target Websites<br/>Scraping Sources]
    end

    %% Client to Frontend
    browser -->|HTTPS| nextjs
    api_consumer -->|REST API| fastapi

    %% Frontend Flow
    nextjs -->|Auth Check| clerk_frontend
    clerk_frontend -->|Verify Token| clerk_auth
    nextjs -->|API Calls| fastapi

    %% Backend Security Layers
    fastapi -->|Validate JWT| auth_middleware
    auth_middleware -->|Verify| clerk_auth
    fastapi -->|Check Limits| rate_limiter
    rate_limiter -->|Counter| redis
    fastapi -->|Validate URL| ssrf_guard

    %% Backend to Data
    fastapi -->|Save Jobs| postgres
    fastapi -->|Enqueue| redis

    %% Processing Flow
    redis -->|Dequeue| arq_worker
    arq_worker -->|Update Status| postgres
    arq_worker -->|Execute| scraper_engine
    
    scraper_engine -->|JS Pages| playwright
    scraper_engine -->|AI Extract| llm_service
    llm_service -->|API Call| gemini_ai
    playwright -->|HTTP GET| target_sites

    %% Results Flow
    arq_worker -->|Cache Results| redis
    arq_worker -.->|Webhook| api_consumer

    %% Styling
    classDef clientStyle fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef frontendStyle fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef backendStyle fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef processingStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef dataStyle fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef externalStyle fill:#f1f8e9,stroke:#33691e,stroke-width:2px

    class browser,api_consumer clientStyle
    class nextjs,clerk_frontend frontendStyle
    class fastapi,auth_middleware,rate_limiter,ssrf_guard backendStyle
    class arq_worker,scraper_engine,playwright,llm_service processingStyle
    class postgres,redis dataStyle
    class clerk_auth,gemini_ai,target_sites externalStyle

Detailed Architecture Breakdown

1. Client Layer

  • Web Browser: End-user interface for interactive scraping
  • API Consumer: External applications integrating via REST API

2. Frontend Layer (Vercel)

  • Next.js App: React-based UI with server-side rendering
    • App Router for routing
    • TanStack Query for data fetching
    • Zustand for state management
  • Clerk SDK: Client-side authentication handling

3. Backend Layer (Render)

  • FastAPI Server: High-performance async API server
    • RESTful endpoints (/scrape, /history, /api-keys, /webhooks)
    • Auto-generated OpenAPI documentation
    • CORS middleware for cross-origin requests
  • Auth Middleware: JWT token validation via Clerk
  • Rate Limiter: Redis-backed request throttling per API key
  • SSRF Protection: Blocks private IPs and localhost requests

4. Processing Layer (Render)

  • ARQ Worker: Async task queue processor
    • Handles background scraping jobs
    • Job status tracking and updates
    • Webhook dispatch on completion
  • Scraper Engine: Core scraping logic
    • Guided mode: CSS selector extraction
    • Smart mode: AI-powered extraction
    • HTML parsing with BeautifulSoup4
  • Playwright Browser: Headless browser for JavaScript rendering
    • Chromium engine
    • Handles dynamic content
    • Screenshot capabilities
  • LLM Service: AI integration layer
    • Natural language instruction parsing
    • Intelligent data extraction
    • Context-aware field mapping

5. Data Layer (Render)

  • PostgreSQL: Primary persistent storage
    • User data and API keys
    • Job history and results
    • Webhook configurations
    • ACID compliance
  • Redis: In-memory cache
    • Job queue (ARQ)
    • Rate limit counters
    • Session storage
    • Result caching (1-hour TTL)
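
The 1-hour result cache noted above is a thin wrapper over Redis SET with an expiry. A sketch, with an assumed key scheme:

# Sketch of the 1-hour result cache; the key scheme is illustrative.
import json
import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379")

async def cache_result(job_id: str, data: dict) -> None:
    await r.set(f"result:{job_id}", json.dumps(data), ex=3600)  # 1-hour TTL

async def get_cached_result(job_id: str) -> dict | None:
    raw = await r.get(f"result:{job_id}")
    return json.loads(raw) if raw is not None else None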

6. External Services

  • Clerk Auth: Authentication and user management
  • Google Gemini: AI model for smart scraping
  • Target Websites: Scraped content sources

Request Flow - Complete Lifecycle

sequenceDiagram
    actor User
    participant Browser
    participant NextJS as Next.js<br/>(Vercel)
    participant Clerk as Clerk Auth
    participant FastAPI as FastAPI<br/>(Render)
    participant Redis
    participant ARQ as ARQ Worker<br/>(Render)
    participant Scraper
    participant LLM as Gemini AI
    participant Website as Target Site
    participant DB as PostgreSQL

    User->>Browser: Navigate to /scrape
    Browser->>NextJS: GET /scrape
    NextJS->>Clerk: Verify session
    Clerk-->>NextJS: Session valid
    NextJS-->>Browser: Render form

    User->>Browser: Submit scrape request
    Browser->>NextJS: POST form data
    NextJS->>Clerk: Get auth token
    Clerk-->>NextJS: JWT token
    
    NextJS->>FastAPI: POST /api/v1/scrape<br/>(with JWT)
    FastAPI->>Clerk: Validate JWT
    Clerk-->>FastAPI: Token valid
    
    FastAPI->>Redis: Check rate limit
    Redis-->>FastAPI: Limit OK
    
    FastAPI->>Redis: Enqueue job (ARQ)
    FastAPI->>DB: Create job record<br/>(status: pending)
    FastAPI-->>NextJS: {"job_id": "abc-123"}
    NextJS-->>Browser: Redirect to /scrape/abc-123
    
    Browser->>NextJS: GET /scrape/abc-123
    NextJS->>FastAPI: GET /api/v1/scrape/abc-123
    FastAPI->>Redis: Get job status
    Redis-->>FastAPI: Status: pending
    FastAPI-->>NextJS: Job details
    NextJS-->>Browser: Show "Processing..."
    
    Note over Redis,ARQ: Background Processing
    ARQ->>Redis: Dequeue job
    ARQ->>DB: Update status: processing
    
    alt Guided Mode
        ARQ->>Scraper: Scrape with selectors
        Scraper->>Website: HTTP GET
        Website-->>Scraper: HTML content
        Scraper->>Scraper: Extract via CSS
    else Smart Mode
        ARQ->>Scraper: Scrape with instruction
        Scraper->>Website: HTTP GET (or Playwright)
        Website-->>Scraper: HTML content
        Scraper->>LLM: Analyze content
        LLM-->>Scraper: Extracted data
    end
    
    Scraper-->>ARQ: Results
    ARQ->>DB: Save results<br/>(status: completed)
    ARQ->>Redis: Cache results (1h TTL)
    
    Note over Browser,NextJS: Polling
    Browser->>NextJS: GET /scrape/abc-123 (poll)
    NextJS->>FastAPI: GET /api/v1/scrape/abc-123
    FastAPI->>Redis: Get cached result
    Redis-->>FastAPI: Job data
    FastAPI-->>NextJS: Complete job
    NextJS-->>Browser: Show results ✅


Tech Stack

Backend

  • API Framework: FastAPI 0.115+
  • Authentication: Clerk JWT
  • Task Queue: ARQ (Async Redis Queue)
  • Database: PostgreSQL with AsyncPG
  • Cache: Redis 7.0+
  • Web Scraping:
    • httpx (static content)
    • Playwright (dynamic content)
    • BeautifulSoup4 (HTML parsing)
  • AI: Google Gemini 2.0 Flash
  • Validation: Pydantic 2.0+

Frontend

  • Framework: Next.js 16 (App Router)
  • UI Library: React 19
  • Authentication: Clerk
  • State Management:
    • TanStack Query (React Query)
    • Zustand
  • HTTP Client: Axios
  • UI Components: Radix UI + TailwindCSS
  • Forms: React Hook Form + Zod

Infrastructure

  • Deployment: Railway (Backend) + Vercel (Frontend)
  • Databases: Railway PostgreSQL + Redis
  • Monitoring: Built-in health checks + structured logs

Quick Start

Prerequisites

  • Python 3.12+
  • Node.js 18+
  • PostgreSQL 14+
  • Redis 7.0+
  • Gemini API Key (get one at https://ai.google.dev/)
  • Clerk Account (sign up at https://clerk.com/)

Backend Setup
# Clone repository
git clone https://github.com/Vrohs/scraPy_final_release.git
cd scraPy_final_release/backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Start API server
uvicorn app.main:app --reload

# Start worker (in separate terminal)
arq app.worker.WorkerSettings
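
The arq command above points at a WorkerSettings class in app/worker.py. A minimal sketch of the shape ARQ expects, with an illustrative task body rather than the project's actual code:

# Sketch of an ARQ worker module; the task body is illustrative.
from arq.connections import RedisSettings

async def scrape_job(ctx, job_id: str, url: str, mode: str) -> None:
    # ctx holds shared resources (Redis connection, etc.) managed by arq
    ...  # fetch the page, extract data, persist results, fire webhooks

class WorkerSettings:
    functions = [scrape_job]
    redis_settings = RedisSettings(host="localhost", port=6379)
    max_jobs = 10  # concurrent jobs per worker process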

Frontend Setup

cd scraPy_final_release/frontend

# Install dependencies
npm install

# Configure environment
cp .env.local.example .env.local
# Edit .env.local with your credentials

# Start development server
npm run dev

Access

  • Frontend: http://localhost:3000
  • Backend API: http://localhost:8000
  • Swagger UI: http://localhost:8000/docs

API Documentation

Authentication

All API endpoints require authentication via one of the following:

  • API Key: X-API-Key header
  • JWT Token: Authorization: Bearer <token> header

Core Endpoints

Create Scraping Job

POST /api/v1/scrape
Content-Type: application/json
X-API-Key: sk_live_xxx

{
  "url": "https://example.com",
  "mode": "guided",
  "selectors": {
    "title": "h1",
    "description": ".description"
  },
  "options": {
    "renderJs": false
  }
}

Response:

{
  "job_id": "uuid-here",
  "status": "pending"
}

Get Job Status

GET /api/v1/scrape/{job_id}
X-API-Key: sk_live_xxx

Response:

{
  "job_id": "uuid-here",
  "status": "completed",
  "url": "https://example.com",
  "mode": "guided",
  "data": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples..."
  }
}
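
Putting the two endpoints together, programmatic access is a submit-then-poll loop. A sketch using httpx (the API key value is a placeholder):

# Sketch of programmatic usage: submit a job, then poll until it finishes.
import time
import httpx

API = "http://localhost:8000/api/v1"
HEADERS = {"X-API-Key": "sk_live_xxx"}

job = httpx.post(f"{API}/scrape", headers=HEADERS, json={
    "url": "https://example.com",
    "mode": "guided",
    "selectors": {"title": "h1"},
}).json()

while True:
    status = httpx.get(f"{API}/scrape/{job['job_id']}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(2)  # jobs run in the background, so poll politely

print(status.get("data"))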

Complete API Reference

Interactive API documentation available at:

  • Swagger UI: /docs
  • ReDoc: /redoc

Configuration

Environment Variables

Backend (backend/.env)

# Database
DATABASE_URL=postgresql://user:password@localhost:5432/scrapy
# Or use individual components:
POSTGRES_SERVER=localhost
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secret
POSTGRES_DB=scrapy

# Redis
REDIS_URL=redis://localhost:6379
# Or use individual components:
REDIS_HOST=localhost
REDIS_PORT=6379

# API Keys
GEMINI_API_KEY=your_gemini_api_key

# Authentication
CLERK_ISSUER_URL=https://your-clerk-app.clerk.accounts.dev

# CORS
FRONTEND_URL=http://localhost:3000

# Application
PROJECT_NAME=scraPy API
API_V1_STR=/api/v1
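
These variables are typically loaded into a settings object at startup; core/config.py in the project tree plays that role. A sketch assuming pydantic-settings (field defaults mirror the examples above; the project's actual config may differ):

# Sketch of a settings class backed by .env via pydantic-settings.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    PROJECT_NAME: str = "scraPy API"
    API_V1_STR: str = "/api/v1"
    DATABASE_URL: str = "postgresql://user:password@localhost:5432/scrapy"
    REDIS_URL: str = "redis://localhost:6379"
    GEMINI_API_KEY: str = ""
    CLERK_ISSUER_URL: str = ""
    FRONTEND_URL: str = "http://localhost:3000"

settings = Settings()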

Frontend (frontend/.env.local)

# Backend API
NEXT_PUBLIC_API_URL=http://localhost:8000

# Clerk Authentication
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_xxx
CLERK_SECRET_KEY=sk_test_xxx

Deployment

Production Architecture

graph LR
    subgraph "Vercel"
        FE[Next.js App]
    end
    
    subgraph "Railway"
        API[FastAPI API]
        WORKER[ARQ Worker]
        PG[(PostgreSQL)]
        RD[(Redis)]
    end
    
    FE --> API
    API --> PG
    API --> RD
    WORKER --> PG
    WORKER --> RD

Railway Deployment

  1. Create a Railway project:

railway login
railway init

  2. Add database plugins:

railway add --plugin postgresql
railway add --plugin redis

  3. Deploy the API service:

cd backend
railway up

  4. Deploy the worker service:
    • Create a second service in the Railway dashboard
    • Use the same repo with a different start command: arq app.worker.WorkerSettings

Vercel Deployment

cd frontend
vercel login
vercel --prod

Detailed deployment guide: See DEPLOYMENT.md


Security

Implemented Protections

SSRF Prevention

  • Blocks private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Blocks localhost and loopback addresses
  • Blocks link-local addresses

Input Validation

  • URL format validation
  • Maximum length constraints
  • Mode validation (guided/smart only)

Rate Limiting

  • Per-API-key limits
  • Redis-backed distributed rate limiting
  • Configurable limits per key
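
One common way to implement Redis-backed per-key limits is a fixed window of INCR plus EXPIRE; a sketch with illustrative default limits (the project's actual algorithm may differ):

# Sketch of a fixed-window rate limiter keyed by API key.
import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379")

async def allow_request(api_key: str, limit: int = 60, window: int = 60) -> bool:
    key = f"ratelimit:{api_key}"
    count = await r.incr(key)        # atomic per-key counter
    if count == 1:
        await r.expire(key, window)  # start the window on the first hit
    return count <= limit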

Request Size Limits

  • 10MB maximum payload
  • Prevents memory exhaustion

Error Sanitization

  • Generic errors in production
  • Detailed errors only in development

Authentication

  • Clerk JWT verification
  • API key SHA-256 hashing
  • HMAC webhook signatures
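
API key hashing typically means the plaintext key is shown once at creation and only a digest is persisted. A sketch consistent with the sk_live_ format used elsewhere in these docs (helper names are illustrative):

# Sketch of API key issuance and verification via SHA-256 digests.
import hashlib
import secrets

def create_api_key() -> tuple[str, str]:
    plaintext = "sk_live_" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest  # show plaintext to the user once, store the digest

def verify_api_key(candidate: str, stored_digest: str) -> bool:
    return hashlib.sha256(candidate.encode()).hexdigest() == stored_digest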

Security Headers

  • CORS configured with specific origins
  • Content-Type validation
  • Request size middleware
  • Rate limiting per API key

Monitoring

Health Check Endpoint

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-01T16:30:00.123456",
  "services": {
    "api": "healthy",
    "database": "healthy",
    "redis": "healthy"
  }
}
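
A sketch of how such a handler can be wired up in FastAPI, probing the database and Redis and degrading the status if either check fails (connection URLs are the local defaults from the configuration section; this is not necessarily the project's exact implementation):

# Sketch of a /health handler matching the response shape above.
from datetime import datetime
from fastapi import FastAPI
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine
import redis.asyncio as redis

app = FastAPI()
engine = create_async_engine("postgresql+asyncpg://user:password@localhost:5432/scrapy")
redis_client = redis.from_url("redis://localhost:6379")

@app.get("/health")
async def health():
    services = {"api": "healthy"}
    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))  # cheap DB round trip
        services["database"] = "healthy"
    except Exception:
        services["database"] = "unhealthy"
    try:
        await redis_client.ping()
        services["redis"] = "healthy"
    except Exception:
        services["redis"] = "unhealthy"
    ok = all(v == "healthy" for v in services.values())
    return {
        "status": "healthy" if ok else "unhealthy",
        "timestamp": datetime.utcnow().isoformat(),
        "services": services,
    }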

Structured Logging

All events are logged in a consistent format:

2025-12-01 21:45:30 - scrapy - INFO - Starting scrape job abc-123 for https://example.com in guided mode
2025-12-01 21:45:32 - scrapy - INFO - Job completed: abc-123 | Duration: 1.85s

Log Events:

  • Job creation, processing, completion
  • API key creation/revocation
  • Webhook dispatch
  • Rate limit violations
  • SSRF attempts
  • System startup/shutdown
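
The exact logging setup may differ, but a stock logging.basicConfig call reproduces the format shown above:

# Sketch of a logging setup producing the sample log lines above.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger("scrapy")
logger.info("Starting scrape job %s for %s in %s mode",
            "abc-123", "https://example.com", "guided")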

Development

Project Structure

scraPy/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   └── v1/
│   │   │       └── endpoints/
│   │   │           ├── scrape.py       # Scraping endpoints
│   │   │           ├── api_keys.py     # API key management
│   │   │           └── webhooks.py     # Webhook management
│   │   ├── core/
│   │   │   ├── config.py              # Settings
│   │   │   ├── database.py            # DB connection
│   │   │   ├── redis.py               # Redis connection
│   │   │   ├── logging.py             # Structured logging
│   │   │   └── ratelimit.py           # Rate limiter
│   │   ├── models/
│   │   │   ├── job.py                 # Job model
│   │   │   ├── api_key.py             # API key model
│   │   │   └── webhook.py             # Webhook model
│   │   ├── services/
│   │   │   ├── scraper.py             # Scraping logic
│   │   │   └── llm.py                 # LLM integration
│   │   ├── main.py                    # FastAPI app
│   │   └── worker.py                  # ARQ worker
│   ├── tests/
│   │   ├── test_e2e_infrastructure.py
│   │   ├── test_api_key.py
│   │   └── test_production_api.py
│   └── requirements.txt
│
└── frontend/
    ├── src/
    │   ├── app/                       # Next.js App Router
    │   │   ├── page.tsx               # Dashboard
    │   │   ├── scrape/                # Scrape pages
    │   │   ├── history/               # Job history
    │   │   └── api-keys/              # API key management
    │   ├── components/
    │   │   ├── ui/                    # Radix UI components
    │   │   ├── scrape-form.tsx        # Scrape job form
    │   │   └── results/               # Results display
    │   ├── services/
    │   │   ├── scrape.ts              # Scrape API client
    │   │   ├── api-keys.ts            # API key client
    │   │   └── webhook.ts             # Webhook client
    │   └── lib/
    │       └── api.ts                 # Axios instance
    └── package.json

Running Tests

# Backend tests
cd backend
source venv/bin/activate

# E2E infrastructure test
python tests/test_e2e_infrastructure.py

# API key tests
python tests/test_api_key.py

# Production API test
python tests/test_production_api.py

# Frontend E2E tests
cd frontend
npx playwright test

Making Changes

  1. Backend Changes:
    • API server auto-reloads with the --reload flag
    • Worker needs a manual restart
    • Run linting: ruff check .
  2. Frontend Changes:
    • Next.js hot-reloads automatically
    • Check the console for errors
    • Run linting: npm run lint
  3. Database Changes:
    • Currently using Base.metadata.create_all
    • For production: implement Alembic migrations
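
For reference, the Base.metadata.create_all bootstrap mentioned in step 3 typically looks like the following with SQLAlchemy's async engine (a sketch, not the project's exact code):

# Sketch of schema bootstrap without migrations.
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()  # in the real app this lives alongside the models
engine = create_async_engine("postgresql+asyncpg://user:password@localhost:5432/scrapy")

async def init_db() -> None:
    async with engine.begin() as conn:
        # run_sync bridges the synchronous create_all onto the async connection
        await conn.run_sync(Base.metadata.create_all)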

Performance

Benchmarks

Static Page Scraping:

  • Average: ~800ms per job
  • Includes: HTTP fetch + parsing + DB save

Dynamic Page Scraping (Playwright):

  • Average: ~2.5s per job
  • Includes: Browser launch + render + parsing + DB save

Smart Mode (AI):

  • Average: ~3s per job
  • Includes: Scraping + LLM analysis (Gemini 2.0 Flash)

Optimization Tips

  1. Use Guided Mode for static content
  2. Enable renderJs only when necessary
  3. Implement result caching for frequently scraped URLs
  4. Use API keys with appropriate rate limits
  5. Deploy worker separately for better scaling

Roadmap

Completed ✅

  • Core scraping (guided + smart modes)
  • API key authentication
  • Webhook notifications
  • Security hardening (SSRF, rate limiting)
  • Structured logging
  • Health monitoring
  • API documentation
  • Production deployment

In Progress 🚧

  • Unit test coverage expansion
  • Alembic database migrations
  • Sentry error tracking integration

Planned 📋

  • Browser pooling for Playwright
  • WebSocket support for real-time job updates
  • Data export formats (CSV, JSON, Excel)
  • Scheduled scraping (cron jobs)
  • Custom JavaScript injection
  • Proxy rotation support
  • Rate limit usage dashboard

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes
  4. Run tests: pytest (backend) / npm test (frontend)
  5. Commit with conventional commits: feat: add amazing feature
  6. Push to your fork: git push origin feature/amazing-feature
  7. Open a Pull Request

Commit Convention

  • feat: New features
  • fix: Bug fixes
  • docs: Documentation changes
  • style: Code style changes (formatting)
  • refactor: Code refactoring
  • test: Test additions/changes
  • chore: Build process or auxiliary tool changes

Troubleshooting

Common Issues

Jobs not processing:

# Check worker is running
ps aux | grep "arq app.worker"

# Check Redis connection
redis-cli ping

# View worker logs
tail -f worker.log

401 Unauthorized:

# Verify Clerk issuer URL
echo $CLERK_ISSUER_URL

# Check API key format
# Should start with sk_live_ or sk_test_

Health check fails:

# Test individually
curl http://localhost:8000/health

# Check database
psql $DATABASE_URL -c "SELECT 1"

# Check Redis
redis-cli -u $REDIS_URL ping

License

MIT License - see LICENSE file for details.



Made with ❤️ by vrohs

⭐ Star this repo if you find it useful!