DynamicDevices/book-translation
Book translation – Archive.org download, OCR, EPUB

Tools to download scanned book pages from Internet Archive (including borrow-only items), run OCR, and build an EPUB. Uses a browser + cookies for reliable downloads; validates captures to avoid saving loading spinners or blank pages.

Quick start (one book, full pipeline)

  1. Install

    pip install requests ebooklib playwright Pillow
    playwright install chromium
    # System OCR:  sudo apt install tesseract-ocr   (or brew install tesseract)
  2. Get cookies
    Borrow the book on Archive.org, open it in the BookReader, then export your cookies in Netscape format (e.g. with the "Get cookies.txt LOCALLY" extension) and save them as cookies.txt in this folder.

  3. Download pages (browser script – recommended; skip good pages with --validate-existing)

    python3.10 extract_archive_pages_browser.py --id BOOK_ID --end TOTAL_PAGES -o OUTPUT_DIR --cookies cookies.txt --headless
  4. Check pages (optional – find bad captures)

    python check_pages.py --pages-dir OUTPUT_DIR --expected TOTAL_PAGES --id BOOK_ID

    If it reports suspect pages, re-download only those (no good pages overwritten):

    python3.10 extract_archive_pages_browser.py --id BOOK_ID --start N --end M -o OUTPUT_DIR --cookies cookies.txt --headless --validate-existing
  5. OCR

    python ocr_pages_to_text.py --pages-dir OUTPUT_DIR
  6. Build EPUB

    python build_epub.py --pages-dir OUTPUT_DIR -o Title.epub --title "Book Title" --author "Author Name"
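Step 2's cookies.txt is the standard Netscape format (one tab-separated cookie per line). As a rough sketch of how the download script might load it into Playwright-style cookie dicts — the function name and field mapping are illustrative, not the script's actual code:

```python
def parse_netscape_cookies(text: str) -> list[dict]:
    """Parse Netscape-format cookies.txt into Playwright-style cookie dicts."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comments; "#HttpOnly_"-prefixed lines are real cookies
        if not line or (line.startswith("#") and not line.startswith("#HttpOnly_")):
            continue
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # malformed line
        domain, _flag, path, secure, expires, name, value = fields
        cookies.append({
            "name": name,
            "value": value,
            "domain": domain,
            "path": path,
            "expires": int(expires) if expires.isdigit() else -1,
            "secure": secure.upper() == "TRUE",
        })
    return cookies
```

The exported file must contain your active archive.org login/loan cookies, which is why the export has to happen while the borrow is still live.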

Scripts

| Script | Purpose |
| --- | --- |
| extract_archive_pages_browser.py | Download page images via browser (recommended for borrow-only books). |
| extract_archive_pages.py | Download via HTTP + cookies (often 403 on restricted items). |
| check_pages.py | Validate downloaded images (flag spinners/blanks); suggest a re-download command. |
| ocr_pages_to_text.py | Run Tesseract OCR on page images → one .txt per page. |
| build_epub.py | Build an EPUB from OCR text, with the first page as cover. |

1. Download: extract_archive_pages_browser.py

Uses Playwright + your cookies so Archive.org sees a normal browser. Validates each capture (size, brightness, variance) and retries on failure so you don’t keep spinners or blank pages.

Required: pip install playwright and playwright install chromium. Use python3.10 if your default python doesn’t have Playwright.
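The size/brightness/variance validation could look roughly like the sketch below, operating on a grayscale pixel buffer. The thresholds and function name are illustrative, not the script's actual values:

```python
def looks_valid(pixels: bytes, min_bytes: int = 20_000,
                min_mean: float = 16.0, min_variance: float = 100.0) -> bool:
    """Heuristic check on a grayscale pixel buffer: reject tiny, near-black,
    or near-uniform captures (typical of loading spinners and blank pages)."""
    if len(pixels) < min_bytes:         # too small to be a full page capture
        return False
    n = len(pixels)
    mean = sum(pixels) / n
    if mean < min_mean:                 # almost entirely dark
        return False
    variance = sum((p - mean) ** 2 for p in pixels) / n
    return variance >= min_variance     # flat images (blank/spinner) have low variance
```

A real page of scanned text has high contrast between ink and paper, so its variance is far above that of a spinner frame or an unloaded white page.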

Examples

# Full run (skip existing files)
python3.10 extract_archive_pages_browser.py --id standonzanzibar0000unse --end 674 \
  -o stand-on-zanzibar-pages --cookies cookies.txt --headless

# Re-download only missing or bad pages (good pages are not touched)
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless --validate-existing

# More workers and longer content wait
python3.10 extract_archive_pages_browser.py --id BOOK_ID --end N -o out \
  --cookies cookies.txt --headless --workers 6 --content-wait 20000

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --id | (Stand on Zanzibar) | Archive.org item identifier |
| --start / --end | 1 / 674 | Page range (inclusive) |
| -o, --output | stand-on-zanzibar-pages | Output directory for page_0001.png, … |
| --cookies | (required) | Path to Netscape cookies.txt |
| --headless | off | Run browser without a window |
| --workers | 4 | Concurrent browser pages (1–16) |
| --timeout | 45000 | Page load timeout (ms) |
| --content-wait | 15000 | Milliseconds to wait for book content before the screenshot |
| --retries | 2 | Retries per page on failure (3 attempts total) |
| --force | off | Re-download all pages in range (overwrite existing) |
| --validate-existing | off | Re-download only missing or invalid pages; skip good ones |
| --no-validate | off | Skip post-capture validation (keep every screenshot) |
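Note that --retries 2 means three total attempts per page. A generic sketch of that retry loop (not the script's actual code; `capture` stands in for whatever function takes the screenshot):

```python
def capture_with_retries(capture, page_num: int, retries: int = 2):
    """Call capture(page_num) up to retries + 1 times; re-raise the last
    error if every attempt fails."""
    last_err = None
    for _attempt in range(retries + 1):
        try:
            return capture(page_num)
        except Exception as err:
            last_err = err
    raise last_err
```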

2. Check downloads: check_pages.py

Scans a folder of page images and flags files that look like spinners or blank pages (tiny file size, too dark, or too flat). Prints a suggested re-download command using --validate-existing so that only the bad pages are re-downloaded.

Example

python check_pages.py --pages-dir love-and-limerence-pages --expected 346 --id lovelimerence00tenn

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --pages-dir | love-and-limerence-pages | Directory containing page_*.png / page_*.jpg |
| --expected | (none) | Expected page count (warn if it differs) |
| --id | lovelimerence00tenn | Book ID used in the suggested re-download command |

Install Pillow for brightness/variance checks: pip install Pillow.
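The suggested command spans the suspect pages with a single inclusive --start/--end range; since --validate-existing skips good pages inside that range, one span is enough. A sketch of how it might be built (function name hypothetical):

```python
def suggest_redownload(book_id: str, out_dir: str, bad_pages: list[int]) -> str:
    """Build a --validate-existing re-download command spanning the first
    to last bad page. Good pages inside the span are left untouched."""
    if not bad_pages:
        return ""
    start, end = min(bad_pages), max(bad_pages)
    return (f"python3.10 extract_archive_pages_browser.py --id {book_id} "
            f"--start {start} --end {end} -o {out_dir} "
            f"--cookies cookies.txt --headless --validate-existing")
```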


3. OCR: ocr_pages_to_text.py

Runs Tesseract on each page image and writes one .txt per page (same stem as the image, e.g. page_0001.txt).

Example

python ocr_pages_to_text.py --pages-dir stand-on-zanzibar-pages

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --pages-dir | stand-on-zanzibar-pages | Directory with page images |
| -o, --output-dir | same as --pages-dir | Where to write the .txt files |
| --lang | eng | Tesseract language code |
| --use-cli | off | Use the tesseract CLI instead of pytesseract |

Requires Tesseract: sudo apt install tesseract-ocr (or brew install tesseract).
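With --use-cli, the OCR step presumably shells out to the tesseract binary, which writes `<outputbase>.txt` itself, so passing the image's stem as the output base yields page_0001.png → page_0001.txt. A minimal sketch (the loop below is illustrative, not the script's actual code):

```python
from pathlib import Path
import subprocess

def ocr_command(image: Path, lang: str = "eng") -> list[str]:
    """Tesseract writes <outputbase>.txt, so use the image path minus its
    extension as the output base."""
    out_base = image.with_suffix("")  # pages/page_0001.png -> pages/page_0001
    return ["tesseract", str(image), str(out_base), "-l", lang]

def ocr_pages(pages_dir: str, lang: str = "eng") -> None:
    """Run Tesseract over every page image, one .txt per page."""
    for image in sorted(Path(pages_dir).glob("page_*.png")):
        subprocess.run(ocr_command(image, lang), check=True)
```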


4. Build EPUB: build_epub.py

Builds an EPUB from the OCR text files, using the first page image as the cover. By default the first page is the cover only; body text starts from page 2.

Example

python build_epub.py --pages-dir love-and-limerence-pages -o Love_and_Limerence.epub \
  --title "Love and Limerence: The Experience of Being in Love" --author "Dorothy Tennov"

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --pages-dir | stand-on-zanzibar-pages | Directory with .txt files (and optional images) |
| -o, --output | Stand_on_Zanzibar.epub | Output EPUB path |
| --title / --author | Stand on Zanzibar / John Brunner | Metadata |
| --chapters-per-part | 50 | One chapter every N pages; 0 for a single chapter |
| --first-page-is-cover | on | First page is the cover image only; body starts at page 2 |
| --no-first-page-is-cover | off | Also include the first page's OCR text in the body |

Requires: pip install ebooklib.
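The --chapters-per-part grouping amounts to dropping the cover page from the body and then slicing the remaining pages into fixed-size chapters. A sketch of that logic (function and parameter names hypothetical):

```python
def group_pages(pages: list[str], chapters_per_part: int = 50,
                first_page_is_cover: bool = True) -> list[list[str]]:
    """Split OCR page files into chapters of N pages each.
    chapters_per_part == 0 puts everything into a single chapter."""
    body = pages[1:] if first_page_is_cover else pages  # page 1 = cover only
    if chapters_per_part <= 0:
        return [body] if body else []
    return [body[i:i + chapters_per_part]
            for i in range(0, len(body), chapters_per_part)]
```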


Example: another book (Love and Limerence)

# 1) Borrow + export cookies, then:
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless

# 2) Check for bad pages; re-download only those if needed
python check_pages.py --pages-dir love-and-limerence-pages --expected 346 --id lovelimerence00tenn
# If suspect pages reported:
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --start 19 --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless --validate-existing

# 3) OCR + EPUB
python ocr_pages_to_text.py --pages-dir love-and-limerence-pages
python build_epub.py --pages-dir love-and-limerence-pages -o Love_and_Limerence.epub \
  --title "Love and Limerence: The Experience of Being in Love" --author "Dorothy Tennov"

Exporting cookies

  1. In Chrome, install the "Get cookies.txt LOCALLY" extension.
  2. On Archive.org, log in, Borrow the book, and open it in the BookReader.
  3. Export cookies in Netscape format and save as cookies.txt in this folder.
  4. Run the download script while the loan is still active.

Notes

  • 403 on direct download: Use extract_archive_pages_browser.py with cookies instead of extract_archive_pages.py.
  • Good pages: Use --validate-existing (and the command suggested by check_pages.py) so only missing or bad pages are re-downloaded; good pages are never overwritten.
  • Legal: Use only for personal use in line with Archive.org’s terms and copyright.
