🏛️ Museum Scrapers (Python)

License: MIT · Python · Playwright

A modular collection of professional Python scripts for extracting high-quality data from digital museum archives and cultural heritage collections.

This repository serves as an educational resource and a toolkit for Digital Humanities researchers, developers, and archivists. It demonstrates modern scraping patterns including:

  • Dynamic Scraping: Using Playwright to handle JavaScript-heavy museum viewers.
  • IIIF Integration: Extracting maximum-resolution images directly from IIIF servers (bypassing web thumbnails).
  • Metadata Normalization: Converting messy museum HTML into structured JSONL datasets.
  • Async Concurrency: Fast, non-blocking downloads using asyncio (a combined IIIF + asyncio sketch follows this list).
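
The IIIF and async patterns above work together: once an item's IIIF image identifier is known, the Image API URL scheme (identifier + /full/max/0/default.jpg) returns the largest image the server allows, and asyncio can fetch many of them concurrently. The sketch below is illustrative only, not code from this repository; the httpx dependency and the example identifier are assumptions.

```python
# Illustrative sketch (not from this repo): download max-resolution IIIF images concurrently.
# Assumes the third-party httpx package is installed; the identifier below is a placeholder.
import asyncio
from pathlib import Path

import httpx


def iiif_max_url(identifier: str) -> str:
    """Build a IIIF Image API request for the largest size the server permits."""
    return f"{identifier}/full/max/0/default.jpg"


async def download(client: httpx.AsyncClient, identifier: str, out_dir: Path) -> None:
    resp = await client.get(iiif_max_url(identifier), timeout=60)
    resp.raise_for_status()
    name = identifier.rstrip("/").split("/")[-1] + ".jpg"
    (out_dir / name).write_bytes(resp.content)


async def main(identifiers: list[str]) -> None:
    out_dir = Path("images")
    out_dir.mkdir(exist_ok=True)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        await asyncio.gather(*(download(client, i, out_dir) for i in identifiers))


if __name__ == "__main__":
    # Hypothetical IIIF image identifier; replace with values from a real manifest.
    asyncio.run(main(["https://iiif.example.org/image/abc123"]))
```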

📂 Supported Institutions

Each script is a standalone tool targeting a specific digital archive architecture.

| Institution | Script | Tech Stack | Key Features |
| --- | --- | --- | --- |
| Pitt Rivers Museum | scrapers/run_pitt_rivers.py | Playwright, AsyncIO | IIIF Max-Res Extraction • Bypasses "Sensitive Content" popups • Hybrid Search + Scraping |
| British Museum | scrapers/run_british_museum.py | Pandas, Requests | CSV-driven extraction • Handles "Preview" quality access • Metadata mapping |
| MAA Cambridge | scrapers/run_maa_cambridge.py | Playwright | Dynamic JS Navigation • Deep metadata (Context, Photographer) • Multi-view image linking |
| G.I. Jones Archive | scrapers/run_gijones.py | BeautifulSoup | Static site traversal • Gallery iteration |
| Ukpuru Blog | scrapers/run_ukpuru.py | BeautifulSoup | Blogspot/Blogger parsing • Unstructured text extraction |

🚀 Installation

1. Clone the Repository

git clone https://github.com/Nwokike/museum-scrapers-python.git
cd museum-scrapers-python

2. Install Dependencies

This project relies on playwright for dynamic sites and pandas for data handling.

pip install -r requirements.txt

3. Install Browser Engines

Required for the MAA and Pitt Rivers scrapers.

playwright install chromium
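
If you want to confirm the browser engine installed correctly before running the scrapers, a short smoke test with Playwright's async API works; this snippet is illustrative and not one of the repository's scripts.

```python
# Smoke test (illustrative): launch the installed Chromium headlessly and print a page title.
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.org")
        print(await page.title())
        await browser.close()


asyncio.run(main())
```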

📖 Usage Examples

Each scraper is designed to be run independently.

Example 1: Scraping the Pitt Rivers Museum

This script navigates the search results for a specific query (e.g., "Igbo") and extracts high-res IIIF images.

python scrapers/run_pitt_rivers.py

Output: Creates a data_pitt_rivers/ folder with images/ and data.jsonl.
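
The JSONL output can then be loaded line by line (or with pandas) for analysis. The snippet below is a generic reader, not part of the scraper; inspect the file for the actual field names.

```python
# Read data_pitt_rivers/data.jsonl back into Python objects (generic JSONL reader).
import json
from pathlib import Path

records = []
with Path("data_pitt_rivers/data.jsonl").open(encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(f"Loaded {len(records)} records")
# Equivalent with pandas: pd.read_json("data_pitt_rivers/data.jsonl", lines=True)
```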

Example 2: Processing British Museum Data

Place your CSV export (british_museum.csv) in the folder before running.

python scrapers/run_british_museum.py
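
The CSV-driven pattern boils down to reading the export with pandas and fetching each object record in turn. The loop below only sketches that idea; the column name is hypothetical and the actual script's schema may differ.

```python
# Sketch of the CSV-driven loop; "object_url" is a hypothetical column name.
import time

import pandas as pd
import requests

df = pd.read_csv("british_museum.csv")

for _, row in df.iterrows():
    resp = requests.get(row["object_url"], timeout=30)  # hypothetical column
    resp.raise_for_status()
    # ... parse the record and save metadata/images here ...
    time.sleep(1)  # polite delay between requests
```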

⚖️ Ethics & Legal Disclaimer

Please scrape responsibly.

  1. Respect Rate Limits: These scripts are powerful. Do not overwhelm museum servers. Use time.sleep() intervals (included in the scripts) to be a polite bot; a minimal pattern is sketched after this list.
  2. Copyright:
  • The Code: This repository is open source (MIT License). You can use the code freely.
  • The Data: The content you scrape (images, text) is subject to the copyright terms of the respective institutions (e.g., "© Trustees of the British Museum", "CC BY-NC-ND 4.0").
  3. Usage: This tool is for educational and research purposes. Do not use scraped data for commercial products without obtaining proper licenses from the source institutions.
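
A minimal version of the polite-bot pattern looks like this (illustrative only; the repository's scripts include their own delays):

```python
# Polite-bot pattern: a small randomized pause between requests.
import random
import time


def polite_pause(min_s: float = 1.0, max_s: float = 3.0) -> None:
    time.sleep(random.uniform(min_s, max_s))

# Call polite_pause() after every request in a scraping loop.
```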

🤝 Contributing

We welcome contributions! If you have built a scraper for another museum (e.g., The Met, Smithsonian, Quai Branly), please submit a Pull Request.

  1. Fork the repo.
  2. Create your scraper in scrapers/run_NEW_SOURCE.py.
  3. Ensure it outputs structured JSONL and separates images into an /images folder (a minimal output skeleton is sketched below).
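
A minimal skeleton of that output convention might look like the following; the directory and helper names are placeholders, and only the JSONL-plus-images/ layout is the convention this repo asks for.

```python
# Output convention skeleton: one data.jsonl file plus an images/ folder.
import json
from pathlib import Path

OUT_DIR = Path("data_new_source")
IMG_DIR = OUT_DIR / "images"
IMG_DIR.mkdir(parents=True, exist_ok=True)


def write_jsonl(records) -> None:
    with (OUT_DIR / "data.jsonl").open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```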

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
