PDF Annotation Extractor

A Flask web app that extracts annotations (highlights, underlines, squiggly underlines, strikeouts, sticky-note comments, and free text) from uploaded PDF files and exports them as CSV and JSON.


Folder Structure

project/
├── app.py
├── templates/
│   └── index.html
├── uploads/          ← auto-created
├── outputs/          ← auto-created
└── README.md
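
The two folders marked auto-created are presumably made at startup if they are missing. A minimal sketch of that step, assuming the folder names from the tree above; the helper name `ensure_dirs` is hypothetical and not taken from app.py:

```python
from pathlib import Path


def ensure_dirs(base, names=("uploads", "outputs")):
    """Create each working folder under `base` if missing and return the paths.

    `ensure_dirs` is a hypothetical helper; app.py may do this differently.
    """
    created = []
    for name in names:
        path = Path(base) / name
        # exist_ok=True makes repeated calls harmless when the folder exists.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_dirs(".")` from the project root before the server starts would leave both folders in place for uploads and exports.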

Requirements

  • Python 3.9+
  • pip

Setup & Run

1. Create and activate a virtual environment (recommended)

python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

2. Install dependencies

pip install flask pymupdf werkzeug

3. Run the app

python app.py

4. Open in your browser

http://127.0.0.1:5000

Usage

  1. Upload any annotated PDF via the drag-and-drop zone or file browser.
  2. Click Extract Annotations.
  3. View results in the table on the page.
  4. Download as CSV or JSON using the buttons.

Output Format

CSV (outputs/annotations.csv):

page,label,text
1,Highlight,"example highlighted text"
2,Comment,"This is a sticky note"
3,Underline,"underlined sentence"

JSON (outputs/annotations.json):

[
  { "page": 1, "label": "Highlight", "text": "example highlighted text" },
  { "page": 2, "label": "Comment",   "text": "This is a sticky note" },
  { "page": 3, "label": "Underline", "text": "underlined sentence" }
]
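
Both files carry the same three fields per annotation. As a rough sketch of how such records could be written with the standard library (the function name `write_outputs` and the `FIELDS` constant are illustrative assumptions, not names from app.py):

```python
import csv
import json
from pathlib import Path

# Column order taken from the CSV header shown above.
FIELDS = ["page", "label", "text"]


def write_outputs(records, out_dir):
    """Write a list of {"page", "label", "text"} dicts as CSV and JSON.

    Hypothetical helper: returns the two paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "annotations.csv"
    json_path = out_dir / "annotations.json"

    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)

    # ensure_ascii=False keeps non-ASCII annotation text readable.
    with json_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

With the default settings the csv module quotes a field only when it contains a comma, quote, or newline, so the quoting in a generated file may differ from the sample above while parsing to the same values.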

Supported Annotation Types

Type        PyMuPDF Constant
Highlight   PDF_ANNOT_HIGHLIGHT
Underline   PDF_ANNOT_UNDERLINE
Squiggly    PDF_ANNOT_SQUIGGLY
Strikeout   PDF_ANNOT_STRIKE_OUT
Comment     PDF_ANNOT_TEXT (sticky note)
Free Text   PDF_ANNOT_FREE_TEXT
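
For orientation, here is one way the extraction could look with PyMuPDF. This is a sketch under assumptions, not the code in app.py: the names `LABELS`, `label_for`, and `extract_annotations` are hypothetical, and reading marked-up text by clipping to the annotation rectangle is an approximation that can pick up neighbouring words.

```python
# Labels keyed by the subtype name PyMuPDF reports in annot.type[1].
LABELS = {
    "Highlight": "Highlight",
    "Underline": "Underline",
    "Squiggly": "Squiggly",
    "StrikeOut": "Strikeout",
    "Text": "Comment",        # sticky note
    "FreeText": "Free Text",
}
# Markup types cover page text; the others carry their own content.
MARKUP = {"Highlight", "Underline", "Squiggly", "StrikeOut"}


def label_for(type_name):
    """Return the output label for a subtype name, or None if unsupported."""
    return LABELS.get(type_name)


def extract_annotations(pdf_path):
    """Return a list of {"page", "label", "text"} dicts for one PDF."""
    import fitz  # PyMuPDF; imported here so label_for works without it

    records = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                type_name = annot.type[1]
                label = label_for(type_name)
                if label is None:
                    continue  # skip links, stamps, and other types
                if type_name in MARKUP:
                    # Approximation: text inside the annotation's bounding box.
                    text = page.get_text("text", clip=annot.rect).strip()
                else:
                    text = annot.info.get("content", "").strip()
                # page.number is zero-based; the output uses 1-based pages.
                records.append(
                    {"page": page.number + 1, "label": label, "text": text}
                )
    return records
```

Annotation types outside the table are skipped rather than reported, which keeps the output limited to the six labels listed above.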

API Endpoints

Method  Route           Description
GET     /               Render upload page
POST    /               Upload PDF and process annotations
GET     /download/csv   Download annotations.csv
GET     /download/json  Download annotations.json
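
The download routes can also be called from a script once a PDF has been processed. The sketch below uses only the standard library and assumes the app is running at the address from step 4 and that /download/json returns the array shown under Output Format; `download_url` and `fetch_annotations` are hypothetical names.

```python
import json
from urllib.request import urlopen

# Address from the "Open in your browser" step.
BASE_URL = "http://127.0.0.1:5000"


def download_url(fmt, base=BASE_URL):
    """Build the download route for "csv" or "json"."""
    if fmt not in ("csv", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return base.rstrip("/") + f"/download/{fmt}"


def fetch_annotations(base=BASE_URL, timeout=10):
    """GET /download/json and parse the body (assumed to be a JSON array)."""
    with urlopen(download_url("json", base), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

`fetch_annotations()` needs the server to be running and will probably fail if no PDF has been uploaded yet, since the output files would not exist.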