DynamicDevices/book-translation
Book translation – Archive.org download, OCR, EPUB

Tools to download scanned book pages from Internet Archive (including borrow-only items), run OCR, and build an EPUB. Uses a browser + cookies for reliable downloads; validates captures to avoid saving loading spinners or blank pages.

Quick start (one book, full pipeline)

  1. Install

    pip install requests ebooklib playwright Pillow
    playwright install chromium
    # System OCR:  sudo apt install tesseract-ocr   (or brew install tesseract)
  2. Get cookies
    Borrow the book on Archive.org, open it in the BookReader, then export your cookies in Netscape format (e.g. with the "Get cookies.txt LOCALLY" extension) and save them as cookies.txt in this folder.

  3. Download pages (browser script – recommended; skip good pages with --validate-existing)

    python3.10 extract_archive_pages_browser.py --id BOOK_ID --end TOTAL_PAGES -o OUTPUT_DIR --cookies cookies.txt --headless
  4. Check pages (optional – find bad captures)

    python check_pages.py --pages-dir OUTPUT_DIR --expected TOTAL_PAGES --id BOOK_ID

    If it reports suspect pages, re-download only those (no good pages overwritten):

    python3.10 extract_archive_pages_browser.py --id BOOK_ID --start N --end M -o OUTPUT_DIR --cookies cookies.txt --headless --validate-existing
  5. OCR

    python ocr_pages_to_text.py --pages-dir OUTPUT_DIR
  6. Build EPUB

    python build_epub.py --pages-dir OUTPUT_DIR -o Title.epub --title "Book Title" --author "Author Name"
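Step 2's cookies.txt is the standard Netscape format (one tab-separated cookie per line). As a rough sketch of how the download script might load it into Playwright-style cookie dicts — the function name and field mapping are illustrative, not the script's actual code:

```python
def parse_netscape_cookies(text: str) -> list[dict]:
    """Parse Netscape-format cookies.txt into Playwright-style cookie dicts."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comments; "#HttpOnly_"-prefixed lines are real cookies
        if not line or (line.startswith("#") and not line.startswith("#HttpOnly_")):
            continue
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # malformed line
        domain, _flag, path, secure, expires, name, value = fields
        cookies.append({
            "name": name,
            "value": value,
            "domain": domain,
            "path": path,
            "expires": int(expires) if expires.isdigit() else -1,
            "secure": secure.upper() == "TRUE",
        })
    return cookies
```

The exported file must contain your active archive.org login/loan cookies, which is why the export has to happen while the borrow is still live.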

Scripts

| Script | Purpose |
| --- | --- |
| extract_archive_pages_browser.py | Download page images via browser (recommended for borrow-only books). |
| extract_archive_pages.py | Download via HTTP + cookies (often 403 on restricted items). |
| check_pages.py | Validate downloaded images (flag spinners/blanks); suggest a re-download command. |
| ocr_pages_to_text.py | Run Tesseract OCR on page images → one .txt per page. |
| build_epub.py | Build an EPUB from OCR text, with the first page as cover. |

1. Download: extract_archive_pages_browser.py

Uses Playwright + your cookies so Archive.org sees a normal browser. Validates each capture (size, brightness, variance) and retries on failure so you don’t keep spinners or blank pages.

Required: pip install playwright and playwright install chromium. Use python3.10 if your default python doesn’t have Playwright.
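The size/brightness/variance validation could look roughly like the sketch below, operating on a grayscale pixel buffer. The thresholds and function name are illustrative, not the script's actual values:

```python
def looks_valid(pixels: bytes, min_bytes: int = 20_000,
                min_mean: float = 16.0, min_variance: float = 100.0) -> bool:
    """Heuristic check on a grayscale pixel buffer: reject tiny, near-black,
    or near-uniform captures (typical of loading spinners and blank pages)."""
    if len(pixels) < min_bytes:         # too small to be a full page capture
        return False
    n = len(pixels)
    mean = sum(pixels) / n
    if mean < min_mean:                 # almost entirely dark
        return False
    variance = sum((p - mean) ** 2 for p in pixels) / n
    return variance >= min_variance     # flat images (blank/spinner) have low variance
```

A real page of scanned text has high contrast between ink and paper, so its variance is far above that of a spinner frame or an unloaded white page.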

Examples

# Full run (skip existing files)
python3.10 extract_archive_pages_browser.py --id standonzanzibar0000unse --end 674 \
  -o stand-on-zanzibar-pages --cookies cookies.txt --headless

# Re-download only missing or bad pages (good pages are not touched)
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless --validate-existing

# More workers and longer content wait
python3.10 extract_archive_pages_browser.py --id BOOK_ID --end N -o out \
  --cookies cookies.txt --headless --workers 6 --content-wait 20000

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --id | (Stand on Zanzibar) | Archive.org item identifier |
| --start / --end | 1 / 674 | Page range (inclusive) |
| -o, --output | stand-on-zanzibar-pages | Output directory for page_0001.png, … |
| --cookies | (required) | Path to Netscape cookies.txt |
| --headless | off | Run browser without a window |
| --workers | 4 | Concurrent browser pages (1–16) |
| --timeout | 45000 | Page load timeout (ms) |
| --content-wait | 15000 | Milliseconds to wait for book content before the screenshot |
| --retries | 2 | Retries per page on failure (3 attempts total) |
| --force | off | Re-download all pages in range (overwrite existing) |
| --validate-existing | off | Re-download only missing or invalid pages; skip good ones |
| --no-validate | off | Skip post-capture validation (keep every screenshot) |
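Note that --retries 2 means three total attempts per page. A generic sketch of that retry loop (not the script's actual code; `capture` stands in for whatever function takes the screenshot):

```python
def capture_with_retries(capture, page_num: int, retries: int = 2):
    """Call capture(page_num) up to retries + 1 times; re-raise the last
    error if every attempt fails."""
    last_err = None
    for _attempt in range(retries + 1):
        try:
            return capture(page_num)
        except Exception as err:
            last_err = err
    raise last_err
```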

2. Check downloads: check_pages.py

Scans a folder of page images and flags files that look like spinners or blank pages (tiny file size, too dark, or too flat). Prints a suggested re-download command using --validate-existing so that only the bad pages are re-downloaded.

Example

python check_pages.py --pages-dir love-and-limerence-pages --expected 346 --id lovelimerence00tenn

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --pages-dir | love-and-limerence-pages | Directory containing page_*.png / page_*.jpg |
| --expected | (none) | Expected page count (warn if it differs) |
| --id | lovelimerence00tenn | Book ID used in the suggested re-download command |

Install Pillow for brightness/variance checks: pip install Pillow.
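The suggested command spans the suspect pages with a single inclusive --start/--end range; since --validate-existing skips good pages inside that range, one span is enough. A sketch of how it might be built (function name hypothetical):

```python
def suggest_redownload(book_id: str, out_dir: str, bad_pages: list[int]) -> str:
    """Build a --validate-existing re-download command spanning the first
    to last bad page. Good pages inside the span are left untouched."""
    if not bad_pages:
        return ""
    start, end = min(bad_pages), max(bad_pages)
    return (f"python3.10 extract_archive_pages_browser.py --id {book_id} "
            f"--start {start} --end {end} -o {out_dir} "
            f"--cookies cookies.txt --headless --validate-existing")
```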


3. OCR: ocr_pages_to_text.py

Runs Tesseract on each page image and writes one .txt per page (same stem as the image, e.g. page_0001.txt).

Example

python ocr_pages_to_text.py --pages-dir stand-on-zanzibar-pages

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --pages-dir | stand-on-zanzibar-pages | Directory with page images |
| -o, --output-dir | same as --pages-dir | Where to write the .txt files |
| --lang | eng | Tesseract language code |
| --use-cli | off | Use the tesseract CLI instead of pytesseract |

Requires Tesseract: sudo apt install tesseract-ocr (or brew install tesseract).
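With --use-cli, the OCR step presumably shells out to the tesseract binary, which writes `<outputbase>.txt` itself, so passing the image's stem as the output base yields page_0001.png → page_0001.txt. A minimal sketch (the loop below is illustrative, not the script's actual code):

```python
from pathlib import Path
import subprocess

def ocr_command(image: Path, lang: str = "eng") -> list[str]:
    """Tesseract writes <outputbase>.txt, so use the image path minus its
    extension as the output base."""
    out_base = image.with_suffix("")  # pages/page_0001.png -> pages/page_0001
    return ["tesseract", str(image), str(out_base), "-l", lang]

def ocr_pages(pages_dir: str, lang: str = "eng") -> None:
    """Run Tesseract over every page image, one .txt per page."""
    for image in sorted(Path(pages_dir).glob("page_*.png")):
        subprocess.run(ocr_command(image, lang), check=True)
```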


4. Build EPUB: build_epub.py

Builds an EPUB from the OCR text files, using the first page image as the cover. By default the first page is the cover only; body text starts from page 2.

Example

python build_epub.py --pages-dir love-and-limerence-pages -o Love_and_Limerence.epub \
  --title "Love and Limerence: The Experience of Being in Love" --author "Dorothy Tennov"

Options

| Option | Default | Meaning |
| --- | --- | --- |
| --pages-dir | stand-on-zanzibar-pages | Directory with .txt files (and optional images) |
| -o, --output | Stand_on_Zanzibar.epub | Output EPUB path |
| --title / --author | Stand on Zanzibar / John Brunner | Metadata |
| --chapters-per-part | 50 | One chapter every N pages; 0 for a single chapter |
| --first-page-is-cover | on | First page is the cover image only; body starts at page 2 |
| --no-first-page-is-cover | off | Also include the first page's OCR text in the body |

Requires: pip install ebooklib.
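The --chapters-per-part grouping amounts to dropping the cover page from the body and then slicing the remaining pages into fixed-size chapters. A sketch of that logic (function and parameter names hypothetical):

```python
def group_pages(pages: list[str], chapters_per_part: int = 50,
                first_page_is_cover: bool = True) -> list[list[str]]:
    """Split OCR page files into chapters of N pages each.
    chapters_per_part == 0 puts everything into a single chapter."""
    body = pages[1:] if first_page_is_cover else pages  # page 1 = cover only
    if chapters_per_part <= 0:
        return [body] if body else []
    return [body[i:i + chapters_per_part]
            for i in range(0, len(body), chapters_per_part)]
```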


Example: another book (Love and Limerence)

# 1) Borrow + export cookies, then:
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless

# 2) Check for bad pages; re-download only those if needed
python check_pages.py --pages-dir love-and-limerence-pages --expected 346 --id lovelimerence00tenn
# If suspect pages reported:
python3.10 extract_archive_pages_browser.py --id lovelimerence00tenn --start 19 --end 346 \
  -o love-and-limerence-pages --cookies cookies.txt --headless --validate-existing

# 3) OCR + EPUB
python ocr_pages_to_text.py --pages-dir love-and-limerence-pages
python build_epub.py --pages-dir love-and-limerence-pages -o Love_and_Limerence.epub \
  --title "Love and Limerence: The Experience of Being in Love" --author "Dorothy Tennov"

Exporting cookies

  1. In Chrome, install the "Get cookies.txt LOCALLY" extension.
  2. On Archive.org, log in, Borrow the book, and open it in the BookReader.
  3. Export cookies in Netscape format and save as cookies.txt in this folder.
  4. Run the download script while the loan is still active.

Notes

  • 403 on direct download: Use extract_archive_pages_browser.py with cookies instead of extract_archive_pages.py.
  • Good pages: Use --validate-existing (and the command suggested by check_pages.py) so only missing or bad pages are re-downloaded; good pages are never overwritten.
  • Legal: Use only for personal use in line with Archive.org’s terms and copyright.
