Releases: Studnicky/PathRipper
ripperoni v2.0.0 — web ingestion engine
Complete TypeScript v2 rewrite of PathRipper (2019).
Ripperoni is a web ingestion engine. Point it at a wiki, a site, or a list of URLs. It slices through everything, one page at a time, and hands you the meat.
What's in the box
- Plugin-driven pipeline: compose scraping jobs from small `async (next, state) => void` task functions; the engine runs them in sequence, each one narrowing raw page content into structured output
- Three scrape modes: HTML (`HtmlScraper` + cheerio), MediaWiki JSON API (`MediaWikiScraper`), recursive link crawler (`LinkLister`)
- MediaWiki modes: single category, `categories[]` array from config, or full-wiki enumeration via the `allpages` API
- Retry + backoff: exponential backoff with ±10% jitter, `Retry-After` header support, configurable max attempts
- Error classification: NETWORK / THROTTLED / TIMEOUT / TRANSIENT / PERMANENT / VALIDATION / RESOURCE
- ScraperCache: sharded content-addressed pointer cache with `read-write`/`read-only`/`write-only`/`off` modes and TTL
- ConfigClamp: validates and clamps all numeric config values to valid ranges
- Resume/retry: `failures.json` tracks failed page titles; `--resume-failures` re-fetches only those
- AJV-validated config: full `json-schema-to-ts` derived types; malformed files fail fast with field-path errors
- 90 unit tests: `node:test` native runner, no jest/vitest
- Matrix CI: Node 22/24 × ubuntu/macos
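To make the plugin-driven pipeline item concrete, here is a minimal sketch of how task functions with the `(next, state)` shape could be composed and run in sequence. This is not ripperoni's actual engine: `runPipeline`, `PageState`, `stripTags`, and `extractTitle` are hypothetical names invented for illustration, and only the task signature comes from the notes above.

```typescript
// Hypothetical types: the release notes only give the task shape
// `async (next, state) => void`; everything else here is assumed.
type Next = () => Promise<void>;
type Task<S> = (next: Next, state: S) => Promise<void>;

// Run tasks in order. Each task decides when (and whether) to call next().
async function runPipeline<S>(tasks: Task<S>[], state: S): Promise<S> {
  let lastIndex = -1;
  const dispatch = async (i: number): Promise<void> => {
    if (i <= lastIndex) throw new Error("next() called more than once");
    lastIndex = i;
    const task = tasks[i];
    if (!task) return; // past the last task: pipeline is done
    await task(() => dispatch(i + 1), state);
  };
  await dispatch(0);
  return state;
}

// Illustrative state: raw page content narrowed into structured fields.
interface PageState {
  raw: string;
  text?: string;
  title?: string;
}

// Replace tags with spaces, then collapse whitespace.
const stripTags: Task<PageState> = async (next, state) => {
  state.text = state.raw.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  await next();
};

// Pull the first <h1> out of the raw markup, if there is one.
const extractTitle: Task<PageState> = async (next, state) => {
  const title = /<h1>(.*?)<\/h1>/.exec(state.raw)?.[1];
  if (title !== undefined) state.title = title;
  await next();
};

runPipeline([stripTags, extractTitle], {
  raw: "<h1>Hello</h1><p>World  again</p>",
}).then((s) => console.log(s.title, "|", s.text)); // prints "Hello | Hello World again"
```

A task that never calls `next()` ends the run early, which is one plausible way a plugin could skip pages it does not want to process.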
Install
npm install
npm run build
Quick start
ripperoni scrape --target <id> --config ripperoni.config.json
ripperoni crawl --starts "https://example.com" --domain "example\.com" --target "\?id="
See the docs for full config reference and plugin authoring guide.
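For readers curious about the retry + backoff item in the feature list, the following is a rough sketch of exponential backoff with ±10% jitter and `Retry-After` handling. It does not reproduce ripperoni's internals: the function names, the `BackoffOptions` fields, and the choice to let `Retry-After` override the computed delay are all assumptions made for illustration.

```typescript
// Hypothetical option names; ripperoni's real config keys may differ.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number;  // cap on the un-jittered delay
  jitter: number; // 0.1 means ±10%
}

// Exponential delay for a zero-based attempt number, with symmetric jitter.
// `random` is injectable so the result can be made deterministic in tests.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(opts.maxMs, opts.baseMs * 2 ** attempt);
  const factor = 1 + opts.jitter * (2 * random() - 1); // [1 - jitter, 1 + jitter)
  return Math.round(exponential * factor);
}

// Retry-After is either a number of seconds or an HTTP-date.
function retryAfterMs(
  header: string | null,
  now: number = Date.now(),
): number | undefined {
  if (header === null) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}

// Assumed policy: honor the server's Retry-After when present,
// otherwise fall back to the jittered exponential delay.
function nextDelayMs(
  attempt: number,
  header: string | null,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  return retryAfterMs(header) ?? backoffDelayMs(attempt, opts, random);
}
```

With `baseMs: 100` and `jitter: 0.1`, the fourth attempt (index 3) has a nominal delay of 800 ms and lands somewhere between 720 and 880 ms.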