Releases: Studnicky/PathRipper

ripperoni v2.0.0 — web ingestion engine

30 Apr 21:38

Complete TypeScript v2 rewrite of PathRipper (2019).

Documentation


Ripperoni is a web ingestion engine. Point it at a wiki, a site, or a list of URLs. It slices through everything, one page at a time, and hands you the meat.

What's in the box

  • Plugin-driven pipeline — compose scraping jobs from small async (next, state) => void task functions; the engine runs them in sequence, each one narrowing raw page content into structured output
  • Three scrape modes — HTML (HtmlScraper + cheerio), MediaWiki JSON API (MediaWikiScraper), recursive link crawler (LinkLister)
  • MediaWiki modes — single category, categories[] array from config, or full-wiki enumeration via allpages API
  • Retry + backoff — exponential backoff with ±10% jitter, Retry-After header support, configurable max attempts
  • Error classification — NETWORK / THROTTLED / TIMEOUT / TRANSIENT / PERMANENT / VALIDATION / RESOURCE
  • ScraperCache — sharded content-addressed pointer cache with read-write / read-only / write-only / off modes and TTL
  • ConfigClamp — validates and clamps all numeric config values to valid ranges
  • Resume/retry — failures.json tracks failed page titles; --resume-failures re-fetches only those
  • AJV-validated config — full json-schema-to-ts derived types, malformed files fail fast with field-path errors
  • 90 unit tests — node:test native runner, no jest/vitest
  • Matrix CI — Node 22/24 × ubuntu/macos
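The task-function shape from the pipeline bullet can be sketched like this. Note the names here (`Task`, `PageState`, `runPipeline`) are illustrative stand-ins, not PathRipper's actual exports — see the docs for the real plugin API:

```typescript
// Hypothetical sketch of an async (next, state) => void task pipeline.
type PageState = { url: string; html?: string; output?: Record<string, unknown> };
type Next = () => Promise<void>;
type Task = (next: Next, state: PageState) => Promise<void>;

// Run tasks in sequence; each task may transform state before/after calling next().
async function runPipeline(tasks: Task[], state: PageState): Promise<void> {
  let current = -1;
  const dispatch = async (idx: number): Promise<void> => {
    if (idx <= current) throw new Error("next() called twice in one task");
    current = idx;
    const task = tasks[idx];
    if (!task) return; // past the last task: done
    await task(() => dispatch(idx + 1), state);
  };
  await dispatch(0);
}

// Two tiny tasks that narrow raw content into structured output.
const fetchStub: Task = async (next, state) => {
  state.html = "<h1>Title</h1>"; // stand-in for a real page fetch
  await next();
};
const extract: Task = async (next, state) => {
  state.output = { title: state.html?.replace(/<[^>]+>/g, "") };
  await next();
};
```

Each task narrows the state a little and hands off downstream via `next()`, which is what lets scrape jobs be composed from small reusable pieces.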
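The retry bullet describes a common pattern; a minimal sketch follows, assuming the ±10% jitter is multiplicative and that a parsed Retry-After value takes precedence. Parameter names and defaults here are guesses, not PathRipper's actual options:

```typescript
// Illustrative exponential backoff with ±10% jitter and Retry-After support.
interface RetryOptions {
  baseMs: number;        // delay before the first retry
  maxAttempts: number;   // total tries before giving up
  retryAfterMs?: number; // delay parsed from a Retry-After header, if any
}

function backoffDelay(attempt: number, opts: RetryOptions): number {
  if (opts.retryAfterMs !== undefined) return opts.retryAfterMs; // server knows best
  const exp = opts.baseMs * 2 ** attempt;         // 1x, 2x, 4x, ...
  const jitter = 1 + (Math.random() * 0.2 - 0.1); // ±10%
  return Math.round(exp * jitter);
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= opts.maxAttempts) throw err; // out of attempts
      await new Promise((r) => setTimeout(r, backoffDelay(attempt, opts)));
    }
  }
}
```

Jitter spreads retries out so a fleet of workers hitting a throttled wiki doesn't retry in lockstep.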

Install

npm install
npm run build

Quick start

ripperoni scrape --target <id> --config ripperoni.config.json
ripperoni crawl --starts "https://example.com" --domain "example\.com" --target "\?id="
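To give a feel for what `--config` points at, here is a hypothetical config fragment pieced together from the feature list above (targets keyed by id, a `categories[]` array, cache modes, TTL, retry attempts). The field names are guesses, not the actual AJV schema — consult the docs for the real reference:

```json
{
  "targets": {
    "example-wiki": {
      "mode": "mediawiki",
      "baseUrl": "https://wiki.example.com",
      "categories": ["Category:Weapons", "Category:Armor"]
    }
  },
  "retry": { "maxAttempts": 5 },
  "cache": { "mode": "read-write", "ttlSeconds": 86400 }
}
```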

See the docs for full config reference and plugin authoring guide.