Tools to download scanned book pages from Internet Archive (including borrow-only items), run OCR, and build an EPUB. Uses a browser + cookies for reliable downloads; validates captures to avoid saving loading spinners or blank pages.
## Quick start

**1. Install.**

```bash
pip install requests ebooklib playwright Pillow
playwright install chromium
# System OCR: sudo apt install tesseract-ocr (or brew install tesseract)
```
**2. Get cookies.** Borrow the book on Archive.org, open it in the BookReader, then export cookies in Netscape format (e.g. with the Get cookies.txt LOCALLY extension) and save them as `cookies.txt` in this folder.
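If you want to inspect or reuse the exported cookies programmatically, Python's standard library parses the Netscape format directly. A minimal sketch (the helper name is ours, not part of these scripts); the resulting list-of-dicts shape is what Playwright's `context.add_cookies()` expects:

```python
from http.cookiejar import MozillaCookieJar

def load_netscape_cookies(path):
    """Parse a Netscape-format cookies.txt into a list of dicts
    suitable for Playwright's context.add_cookies()."""
    jar = MozillaCookieJar(path)
    # Keep session cookies and ignore expiry so an active loan's
    # cookies are not silently dropped.
    jar.load(ignore_discard=True, ignore_expires=True)
    return [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
        for c in jar
    ]
```

`MozillaCookieJar` requires the standard `# Netscape HTTP Cookie File` header line, which cookie-export extensions include by default.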
**3. Download pages** (browser script, recommended; skip already-good pages with `--validate-existing`):

```bash
python3.10 extract_archive_pages_browser.py --id BOOK_ID --end TOTAL_PAGES \
  -o OUTPUT_DIR --cookies cookies.txt --headless
```
**4. Check pages** (optional; finds bad captures):

```bash
python check_pages.py --pages-dir OUTPUT_DIR --expected TOTAL_PAGES --id BOOK_ID
```

If it reports suspect pages, re-download only those (good pages are not overwritten):

```bash
python3.10 extract_archive_pages_browser.py --id BOOK_ID --start N --end M \
  -o OUTPUT_DIR --cookies cookies.txt --headless --validate-existing
```
**5. OCR.**

```bash
python ocr_pages_to_text.py --pages-dir OUTPUT_DIR
```
**6. Build EPUB.**

```bash
python build_epub.py --pages-dir OUTPUT_DIR -o Title.epub --title "Book Title" --author "Author Name"
```
## Scripts

| Script | Purpose |
|---|---|
| `extract_archive_pages_browser.py` | Download page images via browser (recommended for borrow-only books). |
| `extract_archive_pages.py` | Download via HTTP + cookies (often 403 on restricted items). |
| `check_pages.py` | Validate downloaded images (flag spinners/blank pages); suggest a re-download command. |
| `ocr_pages_to_text.py` | Run Tesseract OCR on page images → one `.txt` per page. |
| `build_epub.py` | Build an EPUB from the OCR text + first page as cover. |
## extract_archive_pages_browser.py

Uses Playwright plus your cookies so Archive.org sees a normal browser. Validates each capture (file size, brightness, variance) and retries on failure, so loading spinners and blank pages are not kept.

Requires `pip install playwright` and `playwright install chromium`. Use `python3.10` if your default `python` doesn't have Playwright.
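The capture validation described above can be approximated with Pillow. A sketch with illustrative thresholds (the function name and threshold values are ours; the script's actual heuristics may differ):

```python
from pathlib import Path

from PIL import Image

def looks_valid(path, min_bytes=20_000, min_std=10.0, dark=25, bright=245):
    """Heuristic capture check: reject missing/tiny files, near-black or
    near-white pages (blank), and flat images (e.g. a spinner on white)."""
    p = Path(path)
    if not p.exists() or p.stat().st_size < min_bytes:
        return False
    img = Image.open(p).convert("L")  # grayscale
    hist = img.histogram()            # 256 bins of pixel counts
    total = sum(hist)
    mean = sum(i * n for i, n in enumerate(hist)) / total
    var = sum(n * (i - mean) ** 2 for i, n in enumerate(hist)) / total
    if mean < dark or mean > bright:  # mostly black or mostly white
        return False
    return var ** 0.5 >= min_std      # low variance → flat capture
```

A real scanned page has dark text on a light background, giving it both a mid-range mean brightness and high pixel variance; spinners and blanks fail one or both checks.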
### Examples

```bash
# Full run (skip existing files)
python3.10 extract_archive_pages_browser.py --id standonzanzibar0000unse --end 674 \
  -o stand-on-zanzibar-pages --cookies cookies.txt --headless

# Re-download only missing or bad pages (good pages are not touched)
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless --validate-existing

# More workers and a longer content wait
python3.10 extract_archive_pages_browser.py --id BOOK_ID --end N -o out \
  --cookies cookies.txt --headless --workers 6 --content-wait 20000
```

### Options
| Option | Default | Meaning |
|---|---|---|
| `--id` | (Stand on Zanzibar) | Archive.org item identifier |
| `--start` / `--end` | 1 / 674 | Page range (inclusive) |
| `-o`, `--output` | `stand-on-zanzibar-pages` | Output directory for `page_0001.png`, … |
| `--cookies` | (required) | Path to Netscape `cookies.txt` |
| `--headless` | off | Run browser without a window |
| `--workers` | 4 | Concurrent browser pages (1–16) |
| `--timeout` | 45000 | Page load timeout (ms) |
| `--content-wait` | 15000 | Ms to wait for book content before screenshot |
| `--retries` | 2 | Retries per page on failure (3 attempts total) |
| `--force` | off | Re-download all pages in range (overwrite existing) |
| `--validate-existing` | off | Re-download only missing or invalid pages; skip good ones |
| `--no-validate` | off | Skip post-capture validation (keep every screenshot) |
## check_pages.py

Scans a folder of page images and flags files that look like spinners or blank pages (tiny file size, too dark, or too flat). Prints a suggested re-download command using `--validate-existing` so only the bad pages are re-fetched.
### Example

```bash
python check_pages.py --pages-dir love-and-limerence-pages --expected 346 --id lovelimerence00tenn
```

### Options
| Option | Default | Meaning |
|---|---|---|
| `--pages-dir` | `love-and-limerence-pages` | Directory containing `page_*.png` / `page_*.jpg` |
| `--expected` | (none) | Expected page count (warn if different) |
| `--id` | `lovelimerence00tenn` | Book ID used in the suggested re-download command |
Install Pillow for the brightness/variance checks: `pip install Pillow`.
## ocr_pages_to_text.py

Runs Tesseract on each page image and writes one `.txt` per page (same stem as the image, e.g. `page_0001.txt`).
### Example

```bash
python ocr_pages_to_text.py --pages-dir stand-on-zanzibar-pages
```

### Options
| Option | Default | Meaning |
|---|---|---|
| `--pages-dir` | `stand-on-zanzibar-pages` | Directory with page images |
| `-o`, `--output-dir` | same as `--pages-dir` | Where to write `.txt` files |
| `--lang` | `eng` | Tesseract language code |
| `--use-cli` | off | Use the `tesseract` CLI instead of pytesseract |
Requires Tesseract: `sudo apt install tesseract-ocr` (or `brew install tesseract`).
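For reference, the per-page invocation that `--use-cli` mode roughly corresponds to can be sketched as below (the helper name is ours, and the script's actual flag handling may differ). Note that `tesseract` itself appends `.txt` to the output base, which is why the base is passed without an extension:

```python
from pathlib import Path

def tesseract_cmd(image_path, out_dir=None, lang="eng"):
    """Build a tesseract CLI invocation for one page image.
    Output base shares the image's stem, so page_0001.png → page_0001.txt."""
    img = Path(image_path)
    out_base = (Path(out_dir) if out_dir else img.parent) / img.stem
    return ["tesseract", str(img), str(out_base), "-l", lang]
```

Run the result with `subprocess.run(cmd, check=True)`; the same stem convention is what lets `build_epub.py` pair each `.txt` with its page image later.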
## build_epub.py

Builds an EPUB from the OCR text files, using the first page image as the cover. By default the first page is cover-only and body text starts from page 2.
### Example

```bash
python build_epub.py --pages-dir love-and-limerence-pages -o Love_and_Limerence.epub \
  --title "Love and Limerence: The Experience of Being in Love" --author "Dorothy Tennov"
```

### Options
| Option | Default | Meaning |
|---|---|---|
| `--pages-dir` | `stand-on-zanzibar-pages` | Directory with `.txt` files (and optional images) |
| `-o`, `--output` | `Stand_on_Zanzibar.epub` | Output EPUB path |
| `--title` / `--author` | Stand on Zanzibar / John Brunner | Metadata |
| `--chapters-per-part` | 50 | One chapter every N pages; use 0 for a single chapter |
| `--first-page-is-cover` | on | First page = cover image only; body from page 2 |
| `--no-first-page-is-cover` | off | Include the first page's OCR text in the body as well |
Requires `pip install ebooklib`.
## Example: Love and Limerence

- Identifier: `lovelimerence00tenn`
- Pages: 346
- Archive.org: Love and limerence

```bash
# 1) Borrow + export cookies, then:
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless

# 2) Check for bad pages; re-download only those if needed
python check_pages.py --pages-dir love-and-limerence-pages --expected 346 --id lovelimerence00tenn
# If suspect pages are reported:
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --start 19 --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless --validate-existing

# 3) OCR + EPUB
python ocr_pages_to_text.py --pages-dir love-and-limerence-pages
python build_epub.py --pages-dir love-and-limerence-pages -o Love_and_Limerence.epub \
  --title "Love and Limerence: The Experience of Being in Love" --author "Dorothy Tennov"
```

## Exporting cookies

- In Chrome, install the Get cookies.txt LOCALLY extension.
- On Archive.org, log in, borrow the book, and open it in the BookReader.
- Export cookies in Netscape format and save them as `cookies.txt` in this folder.
- Run the download script while the loan is still active.
## Troubleshooting and notes

- **403 on direct download:** use `extract_archive_pages_browser.py` with cookies instead of `extract_archive_pages.py`.
- **Preserving good pages:** use `--validate-existing` (and the command suggested by `check_pages.py`) so only missing or bad pages are re-downloaded; good pages are never overwritten.
- **Legal:** use only for personal use, in line with Archive.org's terms and copyright.