🏛️ Museum Scrapers (Python)

License: MIT · Python · Playwright

A modular collection of professional Python scripts for extracting high-quality data from digital museum archives and cultural heritage collections.

This repository serves as an educational resource and a toolkit for Digital Humanities researchers, developers, and archivists. It demonstrates modern scraping patterns including:

  • Dynamic Scraping: Using Playwright to handle JavaScript-heavy museum viewers.
  • IIIF Integration: Extracting maximum-resolution images directly from IIIF servers (bypassing web thumbnails).
  • Metadata Normalization: Converting messy museum HTML into structured JSONL datasets.
  • Async Concurrency: Fast, non-blocking downloads using asyncio (a combined IIIF + asyncio sketch follows this list).
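
The IIIF and async patterns above work together: once an item's IIIF image identifier is known, the Image API URL scheme (identifier + /full/max/0/default.jpg) returns the largest image the server allows, and asyncio can fetch many of them concurrently. The sketch below is illustrative only, not code from this repository; the httpx dependency and the example identifier are assumptions.

```python
# Illustrative sketch (not from this repo): download max-resolution IIIF images concurrently.
# Assumes the third-party httpx package is installed; the identifier below is a placeholder.
import asyncio
from pathlib import Path

import httpx


def iiif_max_url(identifier: str) -> str:
    """Build a IIIF Image API request for the largest size the server permits."""
    return f"{identifier}/full/max/0/default.jpg"


async def download(client: httpx.AsyncClient, identifier: str, out_dir: Path) -> None:
    resp = await client.get(iiif_max_url(identifier), timeout=60)
    resp.raise_for_status()
    name = identifier.rstrip("/").split("/")[-1] + ".jpg"
    (out_dir / name).write_bytes(resp.content)


async def main(identifiers: list[str]) -> None:
    out_dir = Path("images")
    out_dir.mkdir(exist_ok=True)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        await asyncio.gather(*(download(client, i, out_dir) for i in identifiers))


if __name__ == "__main__":
    # Hypothetical IIIF image identifier; replace with values from a real manifest.
    asyncio.run(main(["https://iiif.example.org/image/abc123"]))
```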

📂 Supported Institutions

Each script is a standalone tool targeting a specific digital archive architecture.

| Institution | Script | Tech Stack | Key Features |
| --- | --- | --- | --- |
| Pitt Rivers Museum | scrapers/run_pitt_rivers.py | Playwright, AsyncIO | IIIF Max-Res Extraction • Bypasses "Sensitive Content" popups • Hybrid Search + Scraping |
| British Museum | scrapers/run_british_museum.py | Pandas, Requests | CSV-driven extraction • Handles "Preview" quality access • Metadata mapping |
| MAA Cambridge | scrapers/run_maa_cambridge.py | Playwright | Dynamic JS Navigation • Deep metadata (Context, Photographer) • Multi-view image linking |
| G.I. Jones Archive | scrapers/run_gijones.py | BeautifulSoup | Static site traversal • Gallery iteration |
| Ukpuru Blog | scrapers/run_ukpuru.py | BeautifulSoup | Blogspot/Blogger parsing • Unstructured text extraction |

🚀 Installation

1. Clone the Repository

git clone https://github.com/Nwokike/museum-scrapers-python.git
cd museum-scrapers-python

2. Install Dependencies

This project relies on playwright for dynamic sites and pandas for data handling.

pip install -r requirements.txt

3. Install Browser Engines

Required for the MAA and Pitt Rivers scrapers.

playwright install chromium
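
If you want to confirm the browser engine installed correctly before running the scrapers, a short smoke test with Playwright's async API works; this snippet is illustrative and not one of the repository's scripts.

```python
# Smoke test (illustrative): launch the installed Chromium headlessly and print a page title.
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.org")
        print(await page.title())
        await browser.close()


asyncio.run(main())
```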

📖 Usage Examples

Each scraper is designed to be run independently.

Example 1: Scraping the Pitt Rivers Museum

This script navigates the search results for a specific query (e.g., "Igbo") and extracts high-res IIIF images.

python scrapers/run_pitt_rivers.py

Output: Creates a data_pitt_rivers/ folder with images/ and data.jsonl.
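
The JSONL output can then be loaded line by line (or with pandas) for analysis. The snippet below is a generic reader, not part of the scraper; inspect the file for the actual field names.

```python
# Read data_pitt_rivers/data.jsonl back into Python objects (generic JSONL reader).
import json
from pathlib import Path

records = []
with Path("data_pitt_rivers/data.jsonl").open(encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(f"Loaded {len(records)} records")
# Equivalent with pandas: pd.read_json("data_pitt_rivers/data.jsonl", lines=True)
```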

Example 2: Processing British Museum Data

Place your CSV export (british_museum.csv) in the folder before running.

python scrapers/run_british_museum.py
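
The CSV-driven pattern boils down to reading the export with pandas and fetching each object record in turn. The loop below only sketches that idea; the column name is hypothetical and the actual script's schema may differ.

```python
# Sketch of the CSV-driven loop; "object_url" is a hypothetical column name.
import time

import pandas as pd
import requests

df = pd.read_csv("british_museum.csv")

for _, row in df.iterrows():
    resp = requests.get(row["object_url"], timeout=30)  # hypothetical column
    resp.raise_for_status()
    # ... parse the record and save metadata/images here ...
    time.sleep(1)  # polite delay between requests
```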

⚖️ Ethics & Legal Disclaimer

Please scrape responsibly.

  1. Respect Rate Limits: These scripts are powerful. Do not overwhelm museum servers. Use time.sleep() intervals (included in the scripts) to be a polite bot; a minimal pattern is sketched after this list.
  2. Copyright:
  • The Code: This repository is open source (MIT License). You can use the code freely.
  • The Data: The content you scrape (images, text) is subject to the copyright terms of the respective institutions (e.g., "© Trustees of the British Museum", "CC BY-NC-ND 4.0").
  3. Usage: This tool is for educational and research purposes. Do not use scraped data for commercial products without obtaining proper licenses from the source institutions.
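
A minimal version of the polite-bot pattern looks like this (illustrative only; the repository's scripts include their own delays):

```python
# Polite-bot pattern: a small randomized pause between requests.
import random
import time


def polite_pause(min_s: float = 1.0, max_s: float = 3.0) -> None:
    time.sleep(random.uniform(min_s, max_s))

# Call polite_pause() after every request in a scraping loop.
```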

🤝 Contributing

We welcome contributions! If you have built a scraper for another museum (e.g., The Met, Smithsonian, Quai Branly), please submit a Pull Request.

  1. Fork the repo.
  2. Create your scraper in scrapers/run_NEW_SOURCE.py.
  3. Ensure it outputs structured JSONL and separates images into an /images folder (a minimal output skeleton is sketched below).
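
A minimal skeleton of that output convention might look like the following; the directory and helper names are placeholders, and only the JSONL-plus-images/ layout is the convention this repo asks for.

```python
# Output convention skeleton: one data.jsonl file plus an images/ folder.
import json
from pathlib import Path

OUT_DIR = Path("data_new_source")
IMG_DIR = OUT_DIR / "images"
IMG_DIR.mkdir(parents=True, exist_ok=True)


def write_jsonl(records) -> None:
    with (OUT_DIR / "data.jsonl").open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```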

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
