From 0c4452e90739a811f13eda7bb252de292f9f8864 Mon Sep 17 00:00:00 2001
From: Ace Data Cloud Dev
Date: Sun, 10 May 2026 23:51:02 -0700
Subject: [PATCH] docs(webextrator): rewrite README + 3 integration guides from scratch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The existing WebExtrator docs were inaccurate and incomplete: they
documented an API surface that does not match the deployed service and
missed every feature added in WebExtrator PRs #8–#11.

What was wrong

- `expected_type` was listed as `markdown / article / text / links /
  structured`. The deployed service only accepts `product / article /
  general`.
- An `instruction` field was documented. It does not exist.
- Sample responses used `author / published_at / content / summary`; the
  real response shape is `byline / publishedAt / markdown / text /
  description / structured.{schemaOrg, jsonLd, openGraph, llm, ...}`.
- The schema.org JSON-LD mapper was not mentioned at all.
- The LLM-first typed extractor (article / product / discussion / recipe /
  video / job, Zod-validated) was not mentioned at all.
- The Redis result cache (`bypass_cache`, `cache_ttl_seconds`, `cached` /
  `cacheStoredAt` response fields) was not mentioned at all.
- `callback_url`, `cookies`, `headers`, `block_resources`,
  `wait_for_selector`, `delay`, `mode` were not documented.
- Tasks API only had a stub; the discriminated union (`retrieve` vs
  `retrieve_batch`) was not explained.
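
A minimal sketch of a request that matches the corrected surface listed
above, assuming the `https://api.acedata.cloud/webextrator/extract`
endpoint and Bearer auth used in the README Quick Start. The helper names
(`build_extract_request`, `pick_article_fields`) are hypothetical, and
treating `expected_type` as optional is an assumption that this message
does not confirm.

```python
import json
import urllib.request

# Endpoint as used in the README Quick Start of this patch.
EXTRACT_URL = "https://api.acedata.cloud/webextrator/extract"

# The values the deployed service accepts, per the list above.
ALLOWED_EXPECTED_TYPES = ("product", "article", "general")


def build_extract_request(url, api_key, expected_type=None):
    """Build (but do not send) a POST request for the Extract API.

    Hypothetical helper: omitting expected_type when None is an assumption.
    """
    payload = {"url": url}
    if expected_type is not None:
        if expected_type not in ALLOWED_EXPECTED_TYPES:
            raise ValueError(f"unsupported expected_type: {expected_type!r}")
        payload["expected_type"] = expected_type
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def pick_article_fields(envelope):
    """Read the real field names (byline, publishedAt, ...) from `data`."""
    data = envelope.get("data", {})
    names = ("byline", "publishedAt", "description", "markdown", "text")
    return {name: data.get(name) for name in names}
```

Sending the request would then be a matter of passing the result to
`urllib.request.urlopen` and decoding the JSON body; rejecting the old
`markdown / text / links / structured` values client-side simply mirrors
the first item in the list above.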

What this PR ships

- README.md (197 lines, +154)
  - Clear "why this service" pitch
  - Endpoints + cost table (0.005 / 0.005 / free Credits)
  - "Pick the right endpoint" decision table
  - End-to-end Quick Start (cURL + sample response)
  - "How extraction works": three-tier pipeline overview

- docs/webextrator_render_api_integration_guide.md (275 lines, +175)
  - Full request-body table from the actual Zod schema, including
    `bypass_cache`, `cache_ttl_seconds`, `callback_url`, `cookies`,
    `headers`, `block_resources`, `wait_until`, `wait_for_selector`,
    `delay`, `timeout`, `mode`.
  - Cookie shape table
  - Sync + async response shapes
  - Error table (400 / 401 / 402 / 408 / 429 / 500)
  - cURL + Python (requests) + Node.js (fetch) examples
  - Async + callback example
  - Cache bypass example
  - Tips & gotchas

- docs/webextrator_extract_api_integration_guide.md (430 lines, +397)
  - Three-tier pipeline explainer
  - Full request-body table (inherits Render + `expected_type` +
    `enable_llm`)
  - Full top-level response field table (incl. `cached`, `cacheStoredAt`,
    `rawSignals`)
  - schema.org mapper coverage table (7 types + BreadcrumbList +
    @graph/@type-array/AggregateOffer/prefix-variant handling)
  - LLM extractor schemas table (6 kinds with required + optional fields
    and URL heuristics)
  - Top-level back-fill semantics per kind
  - Caching rules
  - 5 worked examples: Wikipedia article (schema.org), BestBuy product
    (schema.org), AllRecipes recipe (schema.org), HN discussion
    (LLM-required), Amazon product (LLM-required)
  - Python + Node.js usage examples
  - Tips & gotchas

- docs/webextrator_tasks_api_integration_guide.md (228 lines, +144)
  - Documents the discriminated-union body: `action: "retrieve"` (by id or
    trace_id) and `action: "retrieve_batch"` (with ids or trace_ids, plus
    offset / limit)
  - Single-task and batch response shapes
  - cURL + Python poll-until-done + Node.js callback-rehydrate examples
  - 7-day retention, free-tier note

Verification

All four documents cross-check directly against:

- `WebExtrator/src/routes/{render,extract,platform}.ts` (Zod schemas)
- `WebExtrator/src/lib/schema-org.ts` (mapper output types)
- `WebExtrator/src/lib/llm-extractor.ts` (Zod schemas, prompts, URL
  heuristics)
- `WebExtrator/src/lib/extract-cache.ts` (cache rules)
- `WebExtrator/src/types/jobs.ts` (response field names)

No code changes; docs-only.
---
 webextrator/README.md                         | 192 +++++++-
 ...bextrator_extract_api_integration_guide.md | 425 +++++++++++++++---
 ...ebextrator_render_api_integration_guide.md | 264 ++++++++---
 ...webextrator_tasks_api_integration_guide.md | 258 +++++++----
 4 files changed, 933 insertions(+), 206 deletions(-)

diff --git a/webextrator/README.md b/webextrator/README.md
index cb93def..bed2ca7 100644
--- a/webextrator/README.md
+++ b/webextrator/README.md
@@ -1,36 +1,68 @@
 # WebExtrator API

-WebExtrator web rendering and intelligent content extraction services.
+WebExtrator is Ace Data Cloud's web rendering and intelligent content extraction
+service. Give it a URL and get back either the fully-rendered HTML (`render`) or
+a typed structured payload (`extract`) — Article / Product / Recipe / Video /
+Discussion / Job — all behind a single `Authorization: Bearer` API key.

-![Platform](https://img.shields.io/badge/platform-Ace%20Data%20Cloud-0f766e?style=flat-square) ![API](https://img.shields.io/badge/type-AI%20API-2563eb?style=flat-square) ![Docs](https://img.shields.io/badge/docs-online-16a34a?style=flat-square)
+![Platform](https://img.shields.io/badge/platform-Ace%20Data%20Cloud-0f766e?style=flat-square)
+![API](https://img.shields.io/badge/type-AI%20API-2563eb?style=flat-square)
+![Docs](https://img.shields.io/badge/docs-online-16a34a?style=flat-square)

-API home page: [Ace Data Cloud - WebExtrator](https://platform.acedata.cloud/service/webextrator)
+API home page: [Ace Data Cloud — WebExtrator](https://platform.acedata.cloud/service/webextrator)

-Keywords: webextrator-api, web-render, web-extract, content-extraction, rest-api, ai-api, developer-tools, AI API, REST API, Developer API, Ace Data Cloud
+Keywords: web extract api, web scraping api, headless chromium, schema.org
+mapper, structured content extraction, llm extraction, content type detection,
+patchright, readability, web rendering, ai api, rest api, developer tools,
+Ace Data Cloud

-## Why Use WebExtrator on Ace Data Cloud
+---

-- Unified developer platform with one API key, billing system, and usage tracking
-- Production-ready AI API endpoints served from [https://api.acedata.cloud](https://api.acedata.cloud)
-- English integration guides, API references, and service documentation
-- Global-ready workflow for developers building chat, image, video, music, and search products
+## Why WebExtrator on Ace Data Cloud

-## Overview
+- **Three-layer extraction pipeline** — deterministic schema.org JSON-LD
+  mapper first; LLM type-aware extractor (Article / Product / Recipe / Video /
+  Discussion / Job) for sites without structured data; result cache (Redis)
+  collapses duplicate URLs to <1 ms.
+- **Real headless Chromium via Patchright** — bypasses simple bot checks out of
+  the box, supports custom UA / cookies / headers / wait conditions.
+- **Synchronous and asynchronous modes** — get a result inline in seconds, or
+  fire-and-forget with a callback URL and retrieve later via the Tasks API.
+- **Production-grade auth and billing** — one API key, one bill, usage tracked
+  per request via the standard AceDataCloud platform.
+- **Free quota for new users** — try the service before you commit.

-WebExtrator provides a two-layer API for working with web pages:
+---

-1. **Render** (`/webextrator/render`): Headless browser rendering — returns the full rendered HTML, markdown, plain text, screenshot (base64), and extracted links for any URL.
-2. **Extract** (`/webextrator/extract`): Builds on top of Render to provide structured content extraction — supports article extraction, markdown, raw text, link lists, or fully custom structured output powered by an optional LLM post-processing step.
-3. **Tasks** (`/webextrator/tasks`): Free query interface to look up historical `render` / `extract` tasks (retained for 7 days).
+## Endpoints

-## Application Process
+| Path | Purpose | Cost (Credits) | Guide |
+|---|---|---|---|
+| `POST /webextrator/render` | Headless Chromium render, returns raw HTML + text + title | 0.005 | [Render API guide](docs/webextrator_render_api_integration_guide.md) |
+| `POST /webextrator/extract` | Render + structured extraction (schema.org + LLM types) | 0.005 | [Extract API guide](docs/webextrator_extract_api_integration_guide.md) |
+| `POST /webextrator/tasks` | Look up historical render / extract tasks (7-day retention) | Free | [Tasks API guide](docs/webextrator_tasks_api_integration_guide.md) |
+
+Pricing as of May 2026. Service is metered in Credits via your AceDataCloud
+account; cache hits are still billed at the configured rate to keep cost
+predictable.
+
+---
+
+## When to use which

-To use the WebExtrator API, apply for the corresponding service on the [WebExtrator Render API](https://platform.acedata.cloud/documents/) page. After entering the page, click the "Acquire" button.
+| You want… | Use | Why |
+|---|---|---|
+| Bypass JS rendering and read the final HTML / text | `/webextrator/render` | Single Patchright navigation, no extraction work |
+| Pull article / product / video / recipe / job metadata from a real-world URL | `/webextrator/extract` | schema.org mapper covers ~60 % of the long tail with zero LLM cost; LLM fills the rest |
+| Convert a page to clean markdown for downstream LLM input | `/webextrator/extract` | Returns Turndown-converted markdown + readability text in addition to structured payload |
+| Look up a result you got via async / callback mode | `/webextrator/tasks` | Retrieves the full envelope by `task_id` or `trace_id` |

-There is a free quota available for first-time applicants, allowing you to use this API for free.
+---

 ## Quick Start

+### 1. Render a page
+
 ```bash
 curl -X POST https://api.acedata.cloud/webextrator/render \
   -H "Authorization: Bearer $API_KEY" \
   -H "Content-Type: application/json" \
@@ -42,10 +74,124 @@ curl -X POST https://api.acedata.cloud/webextrator/render \
   }'
 ```

-## APIs and Guides
+### 2. Extract typed content
+
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/extract \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://en.wikipedia.org/wiki/Diffbot",
+    "expected_type": "article"
+  }'
+```
+
+Response (trimmed) — the `structured.schemaOrg.primary` block is the typed
+payload, and `description / byline / publishedAt` are back-filled from it:
+
+```json
+{
+  "success": true,
+  "task_id": "550e8400-e29b-41d4-a716-446655440000",
+  "trace_id": "550e8400-e29b-41d4-a716-446655440001",
+  "started_at": "2026-05-02T10:30:00.123Z",
+  "finished_at": "2026-05-02T10:30:02.535Z",
+  "elapsed": 2.412,
+  "data": {
+    "kind": "extract",
+    "url": "https://en.wikipedia.org/wiki/Diffbot",
+    "finalUrl": "https://en.wikipedia.org/wiki/Diffbot",
+    "contentType": "article",
+    "title": "Diffbot",
+    "description": "American machine learning and knowledge management company",
+    "byline": "Contributors to Wikimedia projects",
+    "publishedAt": "2007-08-08T05:47:27Z",
+    "language": "en",
+    "images": ["https://en.wikipedia.org/static/images/icons/enwiki-25.svg"],
+    "links": ["https://en.wikipedia.org/wiki/Machine_learning"],
+    "markdown": "# Diffbot\n\nDiffbot is a developer of machine learning ...",
+    "text": "Diffbot is a developer of machine learning algorithms ...",
+    "structured": {
+      "schemaOrg": {
+        "primary": {
+          "kind": "article",
+          "subtype": "Article",
+          "headline": "American machine learning and knowledge management company",
+          "datePublished": "2007-08-08T05:47:27Z",
+          "dateModified": "2025-07-10T20:42:45Z",
+          "author": { "name": "Contributors to Wikimedia projects", "type": "Organization" },
+          "publisher": { "name": "Wikimedia Foundation, Inc." }
+        },
+        "breadcrumbs": [],
+        "all": [ /* ... */ ]
+      },
+      "openGraph": { /* ... */ },
+      "jsonLd": [ /* raw passthrough */ ]
+    },
+    "elapsedMs": 2412
+  }
+}
+```
+
+---
+
+## How extraction works
+
+The Extract API is a **three-tier pipeline**:
+
+1. **schema.org JSON-LD mapper** *(deterministic, zero LLM cost)*.
+   If the page ships `