Releases: Studnicky/PathRipper

ripperoni v2.0.0 — web ingestion engine

30 Apr 21:38

Complete TypeScript v2 rewrite of PathRipper (2019).

Documentation


Ripperoni is a web ingestion engine. Point it at a wiki, a site, or a list of URLs. It slices through everything, one page at a time, and hands you the meat.

What's in the box

  • Plugin-driven pipeline — compose scraping jobs from small async (next, state) => void task functions; the engine runs them in sequence, each one narrowing raw page content into structured output
  • Three scrape modes — HTML (HtmlScraper + cheerio), MediaWiki JSON API (MediaWikiScraper), recursive link crawler (LinkLister)
  • MediaWiki modes — single category, categories[] array from config, or full-wiki enumeration via allpages API
  • Retry + backoff — exponential backoff with ±10% jitter, Retry-After header support, configurable max attempts
  • Error classification — NETWORK / THROTTLED / TIMEOUT / TRANSIENT / PERMANENT / VALIDATION / RESOURCE
  • ScraperCache — sharded content-addressed pointer cache with read-write / read-only / write-only / off modes and TTL
  • ConfigClamp — validates and clamps all numeric config values to valid ranges
  • Resume/retry — failures.json tracks failed page titles; --resume-failures re-fetches only those
  • AJV-validated config — full json-schema-to-ts derived types, malformed files fail fast with field-path errors
  • 90 unit tests — node:test native runner, no jest/vitest
  • Matrix CI — Node 22/24 × ubuntu/macos
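The task-function shape from the pipeline bullet can be sketched like this. Note the names here (`Task`, `PageState`, `runPipeline`) are illustrative stand-ins, not PathRipper's actual exports — see the docs for the real plugin API:

```typescript
// Hypothetical sketch of an async (next, state) => void task pipeline.
type PageState = { url: string; html?: string; output?: Record<string, unknown> };
type Next = () => Promise<void>;
type Task = (next: Next, state: PageState) => Promise<void>;

// Run tasks in sequence; each task may transform state before/after calling next().
async function runPipeline(tasks: Task[], state: PageState): Promise<void> {
  let current = -1;
  const dispatch = async (idx: number): Promise<void> => {
    if (idx <= current) throw new Error("next() called twice in one task");
    current = idx;
    const task = tasks[idx];
    if (!task) return; // past the last task: done
    await task(() => dispatch(idx + 1), state);
  };
  await dispatch(0);
}

// Two tiny tasks that narrow raw content into structured output.
const fetchStub: Task = async (next, state) => {
  state.html = "<h1>Title</h1>"; // stand-in for a real page fetch
  await next();
};
const extract: Task = async (next, state) => {
  state.output = { title: state.html?.replace(/<[^>]+>/g, "") };
  await next();
};
```

Each task narrows the state a little and hands off downstream via `next()`, which is what lets scrape jobs be composed from small reusable pieces.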
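The retry bullet describes a common pattern; a minimal sketch follows, assuming the ±10% jitter is multiplicative and that a parsed Retry-After value takes precedence. Parameter names and defaults here are guesses, not PathRipper's actual options:

```typescript
// Illustrative exponential backoff with ±10% jitter and Retry-After support.
interface RetryOptions {
  baseMs: number;        // delay before the first retry
  maxAttempts: number;   // total tries before giving up
  retryAfterMs?: number; // delay parsed from a Retry-After header, if any
}

function backoffDelay(attempt: number, opts: RetryOptions): number {
  if (opts.retryAfterMs !== undefined) return opts.retryAfterMs; // server knows best
  const exp = opts.baseMs * 2 ** attempt;         // 1x, 2x, 4x, ...
  const jitter = 1 + (Math.random() * 0.2 - 0.1); // ±10%
  return Math.round(exp * jitter);
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= opts.maxAttempts) throw err; // out of attempts
      await new Promise((r) => setTimeout(r, backoffDelay(attempt, opts)));
    }
  }
}
```

Jitter spreads retries out so a fleet of workers hitting a throttled wiki doesn't retry in lockstep.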

Install

npm install
npm run build

Quick start

ripperoni scrape --target <id> --config ripperoni.config.json
ripperoni crawl --starts "https://example.com" --domain "example\.com" --target "\?id="
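To give a feel for what `--config` points at, here is a hypothetical config fragment pieced together from the feature list above (targets keyed by id, a `categories[]` array, cache modes, TTL, retry attempts). The field names are guesses, not the actual AJV schema — consult the docs for the real reference:

```json
{
  "targets": {
    "example-wiki": {
      "mode": "mediawiki",
      "baseUrl": "https://wiki.example.com",
      "categories": ["Category:Weapons", "Category:Armor"]
    }
  },
  "retry": { "maxAttempts": 5 },
  "cache": { "mode": "read-write", "ttlSeconds": 86400 }
}
```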

See the docs for full config reference and plugin authoring guide.