A modular collection of professional Python scripts for extracting high-quality data from digital museum archives and cultural heritage collections.
This repository serves as an educational resource and a toolkit for Digital Humanities researchers, developers, and archivists. It demonstrates modern scraping patterns including:
- **Dynamic Scraping:** Using `Playwright` to handle JavaScript-heavy museum viewers.
- **IIIF Integration:** Extracting maximum-resolution images directly from IIIF servers (bypassing web thumbnails).
- **Metadata Normalization:** Converting messy museum HTML into structured JSONL datasets.
- **Async Concurrency:** Fast, non-blocking downloads using `asyncio`.
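For example, when a viewer exposes a IIIF Image API endpoint, the maximum-resolution image can be requested by rewriting the URL path directly (a minimal sketch; `full/max/0/default.jpg` follows the IIIF Image API, where `max` is the v3 size keyword and v2 servers also accept `full`):

```python
def iiif_max_res_url(service_base: str) -> str:
    """Build a maximum-resolution IIIF Image API request.

    `service_base` is the image service URL (e.g. ".../iiif/<identifier>").
    The path follows the {region}/{size}/{rotation}/{quality}.{format}
    template from the IIIF Image API specification.
    """
    return f"{service_base.rstrip('/')}/full/max/0/default.jpg"
```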
Each script is a standalone tool targeting a specific digital archive architecture.
| Institution | Script | Tech Stack | Key Features |
|---|---|---|---|
| Pitt Rivers Museum | `scrapers/run_pitt_rivers.py` | Playwright, AsyncIO | IIIF max-res extraction • Bypasses "Sensitive Content" popups • Hybrid search + scraping |
| British Museum | `scrapers/run_british_museum.py` | Pandas, Requests | CSV-driven extraction • Handles "Preview" quality access • Metadata mapping |
| MAA Cambridge | `scrapers/run_maa_cambridge.py` | Playwright | Dynamic JS navigation • Deep metadata (context, photographer) • Multi-view image linking |
| G.I. Jones Archive | `scrapers/run_gijones.py` | BeautifulSoup | Static site traversal • Gallery iteration |
| Ukpuru Blog | `scrapers/run_ukpuru.py` | BeautifulSoup | Blogspot/Blogger parsing • Unstructured text extraction |
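The async-concurrency pattern behind the AsyncIO-based scrapers can be sketched as a bounded `asyncio.gather` (the `fetch` coroutine below is a hypothetical stand-in for a real Playwright or HTTP call):

```python
import asyncio

async def fetch(url: str) -> bytes:
    # Stand-in for a real network call (Playwright page fetch, HTTP GET, ...)
    await asyncio.sleep(0)
    return url.encode()

async def download_all(urls, max_concurrent: int = 5):
    # A semaphore caps in-flight requests so the scraper stays polite
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Run it with `asyncio.run(download_all(urls))`; results come back in the same order as the input URLs.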
```bash
git clone https://github.com/Nwokike/museum-scrapers-python.git
cd museum-scrapers-python
```
This project relies on `playwright` for dynamic sites and `pandas` for data handling.

```bash
pip install -r requirements.txt
```
Required for the MAA and Pitt Rivers scrapers:

```bash
playwright install chromium
```
Each scraper is designed to be run independently.
This script navigates the search results for a specific query (e.g., "Igbo") and extracts high-res IIIF images.
```bash
python scrapers/run_pitt_rivers.py
```

**Output:** Creates a `data_pitt_rivers/` folder containing `images/` and `data.jsonl`.
Place your CSV export (`british_museum.csv`) in the folder before running.

```bash
python scrapers/run_british_museum.py
```
Please scrape responsibly.
- **Respect Rate Limits:** These scripts are powerful. Do not overwhelm museum servers. Use the `time.sleep()` intervals (included in the scripts) to be a polite bot.
- **Copyright:**
  - **The Code:** This repository is open source (MIT License). You can use the code freely.
  - **The Data:** The content you scrape (images, text) is subject to the copyright terms of the respective institutions (e.g., "© Trustees of the British Museum", CC BY-NC-ND 4.0).
- **Usage:** This tool is for educational and research purposes. Do not use scraped data for commercial products without obtaining proper licenses from the source institutions.
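The rate-limit advice above amounts to enforcing a minimum gap between consecutive requests. A minimal helper, as a sketch (the shipped scripts use their own `time.sleep()` calls; the 2-second default here is illustrative):

```python
import time

def polite_wait(last_request: float, min_interval: float = 2.0) -> float:
    """Sleep until at least `min_interval` seconds have passed since
    `last_request` (a time.monotonic() timestamp); return a new timestamp
    to pass into the next call."""
    remaining = min_interval - (time.monotonic() - last_request)
    if remaining > 0:
        time.sleep(remaining)
    return time.monotonic()
```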
We welcome contributions! If you have built a scraper for another museum (e.g., The Met, Smithsonian, Quai Branly), please submit a Pull Request.
- Fork the repo.
- Create your scraper in `scrapers/run_NEW_SOURCE.py`.
- Ensure it outputs structured JSONL and separates images into an `/images` folder.
This project is licensed under the MIT License; see the `LICENSE` file for details.