From 0c4452e90739a811f13eda7bb252de292f9f8864 Mon Sep 17 00:00:00 2001
From: Ace Data Cloud Dev <dev@acedata.cloud>
Date: Sun, 10 May 2026 23:51:02 -0700
Subject: [PATCH] docs(webextrator): rewrite README + 3 integration guides from
 scratch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The existing WebExtrator docs were inaccurate and incomplete — they
documented an API surface that does not match the deployed service
and missed every feature added in WebExtrator PRs #8–#11.

What was wrong

- `expected_type` was listed as `markdown / article / text / links /
  structured`. The deployed service only accepts `product / article /
  general`.
- An `instruction` field was documented. It does not exist.
- Sample responses used `author / published_at / content / summary` —
  the real response shape is `byline / publishedAt / markdown / text /
  description / structured.{schemaOrg, jsonLd, openGraph, llm, ...}`.
- The schema.org JSON-LD mapper was not mentioned at all.
- The LLM-first typed extractor (article / product / discussion /
  recipe / video / job, Zod-validated) was not mentioned at all.
- The Redis result cache (`bypass_cache`, `cache_ttl_seconds`, `cached` /
  `cacheStoredAt` response fields) was not mentioned at all.
- `callback_url`, `cookies`, `headers`, `block_resources`,
  `wait_for_selector`, `delay`, `mode` were not documented.
- Tasks API only had a stub; the discriminated union (`retrieve` vs
  `retrieve_batch`) was not explained.
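
For orientation, a client could read the corrected response shape roughly as
follows. This is only a sketch: `pick_typed_payload` is a hypothetical helper,
not part of WebExtrator, and it assumes the `structured.schemaOrg.primary` and
`structured.llm.data` paths described in the Extract guide in this patch.

```python
# Hypothetical client-side helper, not part of the WebExtrator codebase.
# Field names follow the response shape described in this patch's Extract guide.
def pick_typed_payload(data):
    """Return (source, payload), preferring schema.org over the LLM payload."""
    structured = data.get("structured") or {}
    primary = (structured.get("schemaOrg") or {}).get("primary")
    if primary:
        return "schemaOrg", primary
    llm = structured.get("llm") or {}
    if llm.get("data"):
        return "llm", llm["data"]
    # No typed entity was produced: fall back to the top-level fields.
    keys = ("title", "description", "byline", "publishedAt")
    return "fallback", {k: data.get(k) for k in keys}
```

The order mirrors the documented pipeline: the deterministic mapper wins, the
LLM payload is consulted only when no schema.org primary exists, and the
top-level fields remain as the last resort.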

What this PR ships

- README.md (197 lines, +154)
  - Clear "why this service" pitch
  - Endpoints + cost table (0.005 / 0.005 / free Credits)
  - "Pick the right endpoint" decision table
  - End-to-end Quick Start (cURL + sample response)
  - "How extraction works" — three-tier pipeline overview

- docs/webextrator_render_api_integration_guide.md (275 lines, +175)
  - Full request-body table from the actual Zod schema, including
    `bypass_cache`, `cache_ttl_seconds`, `callback_url`, `cookies`,
    `headers`, `block_resources`, `wait_until`, `wait_for_selector`,
    `delay`, `timeout`, `mode`.
  - Cookie shape table
  - Sync + async response shapes
  - Error table (400 / 401 / 402 / 408 / 429 / 500)
  - cURL + Python (requests) + Node.js (fetch) examples
  - Async + callback example
  - Cache bypass example
  - Tips & gotchas

- docs/webextrator_extract_api_integration_guide.md (430 lines, +397)
  - Three-tier pipeline explainer
  - Full request-body table (inherits Render + `expected_type` +
    `enable_llm`)
  - Full top-level response field table (incl. `cached`, `cacheStoredAt`,
    `rawSignals`)
  - schema.org mapper coverage table (7 types + BreadcrumbList +
    @graph/@type-array/AggregateOffer/prefix-variant handling)
  - LLM extractor schemas table (6 kinds with required + optional
    fields and URL heuristics)
  - Top-level back-fill semantics per kind
  - Caching rules
  - 5 worked examples: Wikipedia article (schema.org), BestBuy product
    (schema.org), AllRecipes recipe (schema.org), HN discussion
    (LLM-required), Amazon product (LLM-required)
  - Python + Node.js usage examples
  - Tips & gotchas

- docs/webextrator_tasks_api_integration_guide.md (228 lines, +144)
  - Documents the discriminated-union body: `action: "retrieve"` (by
    id or trace_id) and `action: "retrieve_batch"` (with ids or
    trace_ids, plus offset / limit)
  - Single-task and batch response shapes
  - cURL + Python poll-until-done + Node.js callback-rehydrate examples
  - 7-day retention, free-tier note
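
The discriminated-union body for the Tasks API could be built along these
lines. This is a sketch under assumptions: the helper names are hypothetical,
the "exactly one selector" check is an assumption rather than documented
behaviour, and only the field names (`action`, `id`, `trace_id`, `ids`,
`trace_ids`, `offset`, `limit`) come from the list above.

```python
# Hypothetical helpers, not part of the WebExtrator codebase.
def retrieve_body(*, task_id=None, trace_id=None):
    """Body for a single-task lookup, selected by id or by trace_id."""
    # Assumption: exactly one selector is supplied per request.
    if (task_id is None) == (trace_id is None):
        raise ValueError("pass exactly one of task_id or trace_id")
    if task_id is not None:
        return {"action": "retrieve", "id": task_id}
    return {"action": "retrieve", "trace_id": trace_id}


def retrieve_batch_body(*, ids=None, trace_ids=None, offset=None, limit=None):
    """Body for a batch lookup; offset / limit are sent only when given."""
    if (ids is None) == (trace_ids is None):
        raise ValueError("pass exactly one of ids or trace_ids")
    key, values = ("ids", ids) if ids is not None else ("trace_ids", trace_ids)
    body = {"action": "retrieve_batch", key: list(values)}
    if offset is not None:
        body["offset"] = offset
    if limit is not None:
        body["limit"] = limit
    return body
```

The `action` value is the discriminator: it decides which of the remaining
fields are meaningful, so building each variant in its own function keeps the
two shapes from being mixed.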

Verification

All four documents were cross-checked directly against:
- `WebExtrator/src/routes/{render,extract,platform}.ts` (Zod schemas)
- `WebExtrator/src/lib/schema-org.ts` (mapper output types)
- `WebExtrator/src/lib/llm-extractor.ts` (Zod schemas, prompts,
  URL heuristics)
- `WebExtrator/src/lib/extract-cache.ts` (cache rules)
- `WebExtrator/src/types/jobs.ts` (response field names)

No code changes — docs-only.
---
 webextrator/README.md                         | 192 +++++++-
 ...bextrator_extract_api_integration_guide.md | 425 +++++++++++++++---
 ...ebextrator_render_api_integration_guide.md | 264 ++++++++---
 ...webextrator_tasks_api_integration_guide.md | 258 +++++++----
 4 files changed, 933 insertions(+), 206 deletions(-)

diff --git a/webextrator/README.md b/webextrator/README.md
index cb93def..bed2ca7 100644
--- a/webextrator/README.md
+++ b/webextrator/README.md
@@ -1,36 +1,68 @@
 # WebExtrator API
 
-WebExtrator web rendering and intelligent content extraction services.
+WebExtrator is Ace Data Cloud's web rendering and intelligent content extraction
+service. Give it a URL and get back either the fully-rendered HTML (`render`) or
+a typed structured payload (`extract`) — Article / Product / Recipe / Video /
+Discussion / Job — all behind a single `Authorization: Bearer` API key.
 
-![Platform](https://img.shields.io/badge/platform-Ace%20Data%20Cloud-0f766e?style=flat-square) ![API](https://img.shields.io/badge/type-AI%20API-2563eb?style=flat-square) ![Docs](https://img.shields.io/badge/docs-online-16a34a?style=flat-square)
+![Platform](https://img.shields.io/badge/platform-Ace%20Data%20Cloud-0f766e?style=flat-square)
+![API](https://img.shields.io/badge/type-AI%20API-2563eb?style=flat-square)
+![Docs](https://img.shields.io/badge/docs-online-16a34a?style=flat-square)
 
-API home page: [Ace Data Cloud - WebExtrator](https://platform.acedata.cloud/service/webextrator)
+API home page: [Ace Data Cloud — WebExtrator](https://platform.acedata.cloud/service/webextrator)
 
-Keywords: webextrator-api, web-render, web-extract, content-extraction, rest-api, ai-api, developer-tools, AI API, REST API, Developer API, Ace Data Cloud
+Keywords: web extract api, web scraping api, headless chromium, schema.org
+mapper, structured content extraction, llm extraction, content type detection,
+patchright, readability, web rendering, ai api, rest api, developer tools,
+Ace Data Cloud
 
-## Why Use WebExtrator on Ace Data Cloud
+---
 
-- Unified developer platform with one API key, billing system, and usage tracking
-- Production-ready AI API endpoints served from [https://api.acedata.cloud](https://api.acedata.cloud)
-- English integration guides, API references, and service documentation
-- Global-ready workflow for developers building chat, image, video, music, and search products
+## Why WebExtrator on Ace Data Cloud
 
-## Overview
+- **Three-layer extraction pipeline** — deterministic schema.org JSON-LD
+  mapper first; LLM type-aware extractor (Article / Product / Recipe / Video /
+  Discussion / Job) for sites without structured data; a Redis result cache
+  serves repeated identical requests in <1 ms.
+- **Real headless Chromium via Patchright** — bypasses simple bot checks out of
+  the box, supports custom UA / cookies / headers / wait conditions.
+- **Synchronous and asynchronous modes** — get a result inline in seconds, or
+  fire-and-forget with a callback URL and retrieve later via the Tasks API.
+- **Production-grade auth and billing** — one API key, one bill, usage tracked
+  per request via the standard AceDataCloud platform.
+- **Free quota for new users** — try the service before you commit.
 
-WebExtrator provides a two-layer API for working with web pages:
+---
 
-1. **Render** (`/webextrator/render`): Headless browser rendering — returns the full rendered HTML, markdown, plain text, screenshot (base64), and extracted links for any URL.
-2. **Extract** (`/webextrator/extract`): Builds on top of Render to provide structured content extraction — supports article extraction, markdown, raw text, link lists, or fully custom structured output powered by an optional LLM post-processing step.
-3. **Tasks** (`/webextrator/tasks`): Free query interface to look up historical `render` / `extract` tasks (retained for 7 days).
+## Endpoints
 
-## Application Process
+| Path | Purpose | Cost (Credits) | Guide |
+|---|---|---|---|
+| `POST /webextrator/render` | Headless Chromium render, returns raw HTML + text + title | 0.005 | [Render API guide](docs/webextrator_render_api_integration_guide.md) |
+| `POST /webextrator/extract` | Render + structured extraction (schema.org + LLM types) | 0.005 | [Extract API guide](docs/webextrator_extract_api_integration_guide.md) |
+| `POST /webextrator/tasks` | Look up historical render / extract tasks (7-day retention) | Free | [Tasks API guide](docs/webextrator_tasks_api_integration_guide.md) |
+
+Pricing as of May 2026. Service is metered in Credits via your AceDataCloud
+account; cache hits are still billed at the configured rate to keep cost
+predictable.
+
+---
+
+## When to use which
 
-To use the WebExtrator API, apply for the corresponding service on the [WebExtrator Render API](https://platform.acedata.cloud/documents/) page. After entering the page, click the "Acquire" button.
+| You want… | Use | Why |
+|---|---|---|
+| Render a JS-heavy page and read the final HTML / text | `/webextrator/render` | Single Patchright navigation, no extraction work |
+| Pull article / product / video / recipe / job metadata from a real-world URL | `/webextrator/extract` | schema.org mapper covers ~60 % of the long tail with zero LLM cost; LLM fills the rest |
+| Convert a page to clean markdown for downstream LLM input | `/webextrator/extract` | Returns Turndown-converted markdown + readability text in addition to structured payload |
+| Look up a result you got via async / callback mode | `/webextrator/tasks` | Retrieves the full envelope by `task_id` or `trace_id` |
 
-There is a free quota available for first-time applicants, allowing you to use this API for free.
+---
 
 ## Quick Start
 
+### 1. Render a page
+
 ```bash
 curl -X POST https://api.acedata.cloud/webextrator/render \
   -H "Authorization: Bearer $API_KEY" \
@@ -42,10 +74,124 @@ curl -X POST https://api.acedata.cloud/webextrator/render \
   }'
 ```
 
-## APIs and Guides
+### 2. Extract typed content
+
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/extract \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://en.wikipedia.org/wiki/Diffbot",
+    "expected_type": "article"
+  }'
+```
+
+Response (trimmed) — the `structured.schemaOrg.primary` block is the typed
+payload, and `description / byline / publishedAt` are back-filled from it:
+
+```json
+{
+  "success": true,
+  "task_id": "550e8400-e29b-41d4-a716-446655440000",
+  "trace_id": "550e8400-e29b-41d4-a716-446655440001",
+  "started_at": "2026-05-02T10:30:00.123Z",
+  "finished_at": "2026-05-02T10:30:02.535Z",
+  "elapsed": 2.412,
+  "data": {
+    "kind": "extract",
+    "url": "https://en.wikipedia.org/wiki/Diffbot",
+    "finalUrl": "https://en.wikipedia.org/wiki/Diffbot",
+    "contentType": "article",
+    "title": "Diffbot",
+    "description": "American machine learning and knowledge management company",
+    "byline": "Contributors to Wikimedia projects",
+    "publishedAt": "2007-08-08T05:47:27Z",
+    "language": "en",
+    "images": ["https://en.wikipedia.org/static/images/icons/enwiki-25.svg"],
+    "links": ["https://en.wikipedia.org/wiki/Machine_learning"],
+    "markdown": "# Diffbot\n\nDiffbot is a developer of machine learning ...",
+    "text": "Diffbot is a developer of machine learning algorithms ...",
+    "structured": {
+      "schemaOrg": {
+        "primary": {
+          "kind": "article",
+          "subtype": "Article",
+          "headline": "American machine learning and knowledge management company",
+          "datePublished": "2007-08-08T05:47:27Z",
+          "dateModified": "2025-07-10T20:42:45Z",
+          "author": { "name": "Contributors to Wikimedia projects", "type": "Organization" },
+          "publisher": { "name": "Wikimedia Foundation, Inc." }
+        },
+        "breadcrumbs": [],
+        "all": [ /* ... */ ]
+      },
+      "openGraph": { /* ... */ },
+      "jsonLd": [ /* raw passthrough */ ]
+    },
+    "elapsedMs": 2412
+  }
+}
+```
+
+---
+
+## How extraction works
+
+The Extract API is a **three-tier pipeline**:
+
+1. **schema.org JSON-LD mapper** *(deterministic, zero LLM cost)*.
+   If the page ships `<script type="application/ld+json">` blocks (Wikipedia,
+   BestBuy, AllRecipes, YouTube, most news sites, most product pages), the
+   mapper walks `@graph` containers and `@type` arrays and emits a typed entity
+   for `Article` / `Product` / `Recipe` / `VideoObject` / `Event` / `JobPosting`
+   / `FAQPage`, plus a `BreadcrumbList`.
+
+2. **LLM-first typed extractor** *(only when schema.org returned nothing)*.
+   Triggered by `enable_llm: true`. Detects the page kind from URL heuristics
+   (or your `expected_type` hint) and asks the model for a strict JSON payload
+   validated against a Zod schema. Schemas:
+   `article` / `product` / `discussion` / `recipe` / `video` / `job`. Failures
+   surface as `structured.llmError` and never crash the request.
+
+3. **Readability + markdown fallback** *(always runs)*.
+   Mozilla Readability for clean text, Turndown for markdown, OG / `<meta>`
+   tags for title / description / image / site_name. These populate the
+   top-level fields whenever schema.org and the LLM didn't.
+
+Before any of these tiers run, **step 0 is the Redis result cache**: identical
+requests return in <1 ms regardless of the pipeline behind them. See the Extract
+guide for `bypass_cache` and `cache_ttl_seconds`.
+
+---
+
+## Application Process
+
+To use the WebExtrator API, apply for the service on the
+[WebExtrator service page](https://platform.acedata.cloud/service/webextrator).
+After landing on the page, click the **Acquire** button to obtain credentials.
+
+If you are not logged in or registered, you will be redirected to the login
+page, where you can register or log in before continuing.
+
+A free quota is provided to first-time applicants — try the API before
+committing to paid usage.
+
+---
+
+## SDKs and Tooling
+
+- **HTTP / cURL** — examples in every guide.
+- **Python** — `requests` examples in every guide.
+- **Node.js** — `fetch` examples in every guide.
+- **Webhooks** — `callback_url` is supported on all three endpoints for async
+  job completion notifications.
+
+---
+
+## API Reference
 
-| API | Path | Integration Guidance |
-| ---- | ---- | ------------ |
-| WebExtrator Render API | `/webextrator/render` | [WebExtrator Render API Integration Guide](docs/webextrator_render_api_integration_guide.md) |
-| WebExtrator Extract API | `/webextrator/extract` | [WebExtrator Extract API Integration Guide](docs/webextrator_extract_api_integration_guide.md) |
-| WebExtrator Tasks API | `/webextrator/tasks` | [WebExtrator Tasks API Integration Guide](docs/webextrator_tasks_api_integration_guide.md) |
+| API | Path | Integration guide |
+|---|---|---|
+| WebExtrator Render API | `/webextrator/render` | [Render API integration guide](docs/webextrator_render_api_integration_guide.md) |
+| WebExtrator Extract API | `/webextrator/extract` | [Extract API integration guide](docs/webextrator_extract_api_integration_guide.md) |
+| WebExtrator Tasks API | `/webextrator/tasks` | [Tasks API integration guide](docs/webextrator_tasks_api_integration_guide.md) |
diff --git a/webextrator/docs/webextrator_extract_api_integration_guide.md b/webextrator/docs/webextrator_extract_api_integration_guide.md
index a072a7a..d08a4e1 100644
--- a/webextrator/docs/webextrator_extract_api_integration_guide.md
+++ b/webextrator/docs/webextrator_extract_api_integration_guide.md
@@ -2,104 +2,429 @@
 
 `POST https://api.acedata.cloud/webextrator/extract`
 
-This document introduces the WebExtrator Extract API. This API builds on top of the Render API to provide intelligent content extraction. It supports extracting articles, markdown, plain text, links, or custom structured data — with an optional LLM post-processing step for advanced use cases.
+The WebExtrator Extract API turns a URL into a typed, structured payload —
+Article, Product, Recipe, Video, Discussion, Job — with markdown and clean text
+on the side. It is the right endpoint to call when you want **clean structured
+data** rather than raw HTML.
+
+Under the hood Extract runs a three-tier pipeline:
+
+1. **schema.org JSON-LD mapper** — deterministic, zero LLM cost. Covers
+   Wikipedia / BestBuy / AllRecipes / YouTube / most news / most product pages.
+2. **LLM-first typed extractor** — for pages without JSON-LD (Amazon, Hacker
+   News, Greenhouse, blogs). Zod-validated typed payload per kind.
+3. **Readability + markdown fallback** — always runs; populates the basic
+   top-level fields when neither layer above filled them.
+
+Repeat URL requests hit a Redis result cache and return in <1 ms.
+
+---
 
 ## Application Process
 
-To use the WebExtrator Extract API, apply for the corresponding service on the WebExtrator service page. After entering the page, click the "Acquire" button to obtain the credentials needed for the request.
+To use the WebExtrator Extract API, apply for the service on the
+[WebExtrator service page](https://platform.acedata.cloud/service/webextrator).
+Click **Acquire** to obtain the credentials needed for the request. A free
+quota is provided to first-time applicants.
 
-There is a free quota available for first-time applicants, allowing you to use this API for free.
+---
 
-## Request Parameters
+## Authentication
+
+```
+Authorization: Bearer YOUR_API_KEY
+Content-Type:  application/json
+```
 
-The Extract API accepts all parameters from the [Render API](webextrator_render_api_integration_guide.md), plus the following additional fields:
+---
+
+## Request Body
+
+Extract accepts **everything** Render accepts (see
+[Render API guide](webextrator_render_api_integration_guide.md#request-body) —
+`url`, `user_agent`, `timeout`, `wait_until`, `delay`, `wait_for_selector`,
+`block_resources`, `headers`, `cookies`, `callback_url`, `bypass_cache`,
+`cache_ttl_seconds`, `mode`) plus the two Extract-specific fields:
 
 | Field | Type | Required | Default | Description |
-|------|------|:----:|------|------|
-| `expected_type` | string | ❌ | `markdown` | Desired extraction output: `markdown` / `article` / `text` / `links` / `structured` |
-| `enable_llm` | boolean | ❌ | false | Enable LLM post-processing (recommended for `article` / `structured`) |
-| `instruction` | string | ❌ | - | LLM extraction instruction, e.g. "Extract product title, price, and specifications" |
+|---|---|:---:|---|---|
+| `expected_type` | enum | ❌ | auto | Hint at the page kind: `product` \| `article` \| `general`. Skips the URL/text heuristic and dispatches directly. |
+| `enable_llm` | boolean | ❌ | `false` | Allow the LLM-first extractor to run when schema.org returned nothing. Required for the typed payload on Amazon / HN / Greenhouse-style pages. |
+
+> When the page already ships schema.org JSON-LD, `enable_llm` has no effect:
+> the deterministic mapper wins and no LLM call is made. The typed payload
+> then comes at no LLM cost.
 
-## Synchronous Response
+---
+
+## Response (Sync)
 
 ```json
 {
   "success": true,
-  "task_id": "550e8400-...",
-  "trace_id": "550e8400-...",
+  "task_id": "550e8400-e29b-41d4-a716-446655440000",
+  "trace_id": "550e8400-e29b-41d4-a716-446655440001",
   "started_at": "2026-05-02T10:30:00.123Z",
-  "finished_at": "2026-05-02T10:30:08.789Z",
-  "elapsed": 8.666,
+  "finished_at": "2026-05-02T10:30:02.535Z",
+  "elapsed": 2.412,
   "data": {
     "kind": "extract",
-    "expected_type": "article",
-    "url": "https://example.com/post/1",
-    "title": "Sample Article",
-    "author": "John Doe",
-    "published_at": "2026-05-01",
-    "content": "# Sample Article\n\nBody text ...",
-    "summary": "This article introduces ..."
+    "url": "https://en.wikipedia.org/wiki/Diffbot",
+    "finalUrl": "https://en.wikipedia.org/wiki/Diffbot",
+    "contentType": "article",
+    "title": "Diffbot",
+    "description": "American machine learning and knowledge management company",
+    "byline": "Contributors to Wikimedia projects",
+    "language": "en",
+    "siteName": "Wikipedia",
+    "publishedAt": "2007-08-08T05:47:27Z",
+    "images": ["https://en.wikipedia.org/static/images/icons/enwiki-25.svg"],
+    "links": ["https://en.wikipedia.org/wiki/Machine_learning"],
+    "markdown": "# Diffbot\n\nDiffbot is a developer of machine learning ...",
+    "text":     "Diffbot is a developer of machine learning algorithms ...",
+    "structured": {
+      "schemaOrg": { "primary": { /* typed entity */ }, "breadcrumbs": [], "all": [] },
+      "openGraph": { "title": "...", "description": "...", "image": "...", "type": "..." },
+      "jsonLd":   [ /* raw passthrough */ ]
+    },
+    "rawSignals": {
+      "hasJsonLd": true,
+      "title": "Diffbot - Wikipedia",
+      "metaDescription": null,
+      "pageStatus": 200,
+      "textLength": 11473
+    },
+    "elapsedMs": 2412
   }
 }
 ```
 
-The async mode, error codes, and billing rules are identical to those of the `/webextrator/render` API.
+### Top-level data fields
+
+| Field | Type | Description |
+|---|---|---|
+| `kind` | string | Always `"extract"`. |
+| `url` | string | The URL you supplied. |
+| `finalUrl` | string | URL after redirects. |
+| `contentType` | enum | `product` \| `article` \| `general`. Derived from `expected_type` if given, else from schema.org primary, else heuristic. |
+| `title` | string | Readability `<title>` or render `document.title`. |
+| `description` | string? | First non-empty of: `<meta name="description">` → `og:description` → schema.org / LLM payload → trimmed first paragraph. |
+| `byline` | string? | Author / channel / company name. Sourced from `<meta name="author">`, then schema.org / LLM payload. |
+| `language` | string? | `<html lang>` value. |
+| `siteName` | string? | `og:site_name`. |
+| `publishedAt` | string? | ISO 8601. `article:published_time` meta → `<time datetime>` → schema.org / LLM payload. |
+| `images` | string[] | Up to 50 `<img src>` values, resolved to absolute URLs, deduped, `data:` URIs dropped. |
+| `links` | string[] | Up to 100 outbound link URLs, fragment-only / `javascript:` / `mailto:` / `tel:` filtered. |
+| `markdown` | string | Turndown-rendered body markdown. |
+| `text` | string | Mozilla Readability `textContent`. |
+| `structured` | object | Full structured payload — see below. |
+| `rawSignals` | object | Diagnostic signals for debugging (e.g. `hasJsonLd`, `pageStatus`, `textLength`). |
+| `cached` | boolean? | `true` if served from cache. |
+| `cacheStoredAt` | number? | Unix-ms when the cached entry was first stored. |
+
+### `data.structured`
+
+| Sub-field | When present | Description |
+|---|---|---|
+| `schemaOrg` | always | `{ primary, breadcrumbs, all }`. `primary` is the highest-priority typed entity (see below); `null` if none found. |
+| `openGraph` | always | `{ title, description, image, type }` from `<meta property="og:*">`. |
+| `jsonLd` | always | Raw passthrough of every `<script type="application/ld+json">` block parsed into JSON. |
+| `llm` | when LLM ran & succeeded | `{ kind, data, model, promptCharCount }`. Typed Zod-validated payload — see [LLM extractor schemas](#llm-extractor-schemas). |
+| `llmError` | when LLM ran & failed | `{ kind, error, model }`. Errors never crash the request; the heuristic payload still ships. |
+| `amazon` | when URL is `amazon.*` | Legacy pre-LLM amazon scraper output. Will be deprecated. |
+
+---
+
+## schema.org mapper coverage
+
+The mapper recognises these types (priority order — first match wins as
+`structured.schemaOrg.primary`):
+
+| schema.org type | mapped kind | Surfaced fields |
+|---|---|---|
+| `Product` | product | `name, sku, gtin, model, color, brand, url, images, offer.{price,currency,availability,condition,seller}, rating.{value,count}, reviews[], properties[]` |
+| `Recipe` | recipe | `name, description, image, datePublished, author, cookTime, prepTime, totalTime, recipeYield, ingredients[], instructions[], nutrition, rating, keywords, recipeCategory, recipeCuisine` |
+| `VideoObject` | video | `name, description, thumbnailUrl, uploadDate, duration, embedUrl, contentUrl, channel, interactionCount` |
+| `JobPosting` | job | `title, description, datePosted, validThrough, hiringOrganization, jobLocation, baseSalary, employmentType` |
+| `Event` (and `*Event`) | event | `name, description, startDate, endDate, location.{name,address}, organizer, offer.{url,price,currency}` |
+| `Article` / `NewsArticle` / `BlogPosting` / `ScholarlyArticle` / `TechArticle` / `Report` / `*NewsArticle` | article | `subtype, headline, description, datePublished, dateModified, author, publisher, image[], url, sameAs[]` |
+| `FAQPage` | faq | `questions[{question, answer}]` |
+| `BreadcrumbList` | (sibling) | Always surfaced in `structured.schemaOrg.breadcrumbs[]`, never as primary. |
+
+The mapper handles:
+
+- `@graph` containers (recursively flattened).
+- `@type` arrays (e.g. `["Recipe", "NewsArticle"]` — both are recognised, the
+  higher-priority kind wins).
+- The `http://schema.org/` prefix variant.
+- Nested `Offer` and `AggregateOffer` (reads `lowPrice` on the latter).
+- Relative image URLs (resolved against `finalUrl`).
+
+---
+
+## LLM extractor schemas
+
+When `enable_llm: true` **and** schema.org returned no primary entity, the
+extractor picks one of these typed schemas based on URL heuristics (or your
+`expected_type` hint) and validates the model's JSON output against it:
+
+| Kind | URL heuristic | Required field | Optional fields |
+|---|---|---|---|
+| `article` | text ≥ 400 chars and no other match | `headline` | `description, byline, publishedAt, language, topics[], sections[{heading,summary}]` |
+| `product` | `amazon.* / ebay.* / aliexpress.* / temu.* / walmart.* / bestbuy.*` | `name` | `description, brand, sku, price, currency, availability, rating.{value,count}, bullets[], specifications[{name,value}]` |
+| `discussion` | `news.ycombinator.com / reddit.com / lobste.rs` | `title` | `author, postedAt, points, commentCount, body, url` |
+| `recipe` | `allrecipes / foodnetwork / seriouseats / epicurious / bonappetit / simplyrecipes` | `name` | `description, author, cookTime, prepTime, totalTime, recipeYield, ingredients[], instructions[], nutrition, rating, keywords[]` |
+| `video` | `youtube.com/watch / youtu.be / vimeo.com/<id> / tiktok.com/@/video` | `name` | `description, channel, uploadDate, duration, viewCount, likeCount, thumbnailUrl, transcript` |
+| `job` | `greenhouse.io / lever.co / jobs.* / careers.* / workable.com / bamboohr` | `title` | `description, company, location, remote, employmentType, datePosted, validThrough, salaryMin, salaryMax, salaryCurrency, salaryPeriod, responsibilities[], qualifications[]` |
+
+When the LLM call succeeds, the typed payload also back-fills these top-level
+fields:
 
-## Example: Extract Article (with LLM enabled)
+- `article` → `description`, `byline`, `publishedAt`, `language`
+- `product` → `description`
+- `discussion` → `description` (= body, ≤ 280 chars), `byline` (= author), `publishedAt` (= postedAt)
+- `recipe` → `description`, `byline` (= author)
+- `video` → `description`, `byline` (= channel), `publishedAt` (= uploadDate)
+- `job` → `description`, `byline` (= company), `publishedAt` (= datePosted)
+
+Back-fills only fire if the deterministic source didn't already populate that
+field — the LLM is always a last resort.
+
+---
+
+## Caching
+
+Identical requests hash to the same Redis key:
+`webextrator:cache:extract:<sha256(canonical-json)>`. Cache keys ignore `mode`,
+`bypass_cache`, and `cache_ttl_seconds` (those are operational toggles, not
+part of the response). Cookies / headers DO partition the cache.
+
+| Field | Effect |
+|---|---|
+| `bypass_cache: true` | Skip the read; still write the fresh result back so subsequent identical calls hit. |
+| `cache_ttl_seconds: 0` | Don't cache this response at all. |
+| `cache_ttl_seconds: N` | Override the 3600 s default for this entry. |
+
+Cached responses set `data.cached: true` and `data.cacheStoredAt: <unix-ms>`.
+
+---
+
+## Async mode and callbacks
+
+Set `mode: "async"` to fire-and-forget. The platform returns
+`{ "jobId": "...", "status": "queued" }` (HTTP 202) immediately and posts the
+final envelope to your `callback_url` (if provided) once the job finishes. Use
+[`/webextrator/tasks`](webextrator_tasks_api_integration_guide.md) to look up
+results by `task_id` or `trace_id` later.
+
+---
+
+## Examples
+
+### 1. Article from Wikipedia (schema.org wins; no LLM needed)
 
 ```bash
 curl -X POST https://api.acedata.cloud/webextrator/extract \
   -H "Authorization: Bearer $API_KEY" \
   -H "Content-Type: application/json" \
   -d '{
-    "url": "https://example.com/news/1",
-    "expected_type": "article",
-    "enable_llm": true
+    "url": "https://en.wikipedia.org/wiki/Diffbot",
+    "expected_type": "article"
   }'
 ```
 
-Python example:
+Key fields in `data.structured.schemaOrg.primary`:
 
-```python
-import requests
+```json
+{
+  "kind": "article",
+  "subtype": "Article",
+  "headline": "American machine learning and knowledge management company",
+  "datePublished": "2007-08-08T05:47:27Z",
+  "dateModified": "2025-07-10T20:42:45Z",
+  "author": { "name": "Contributors to Wikimedia projects", "type": "Organization" },
+  "publisher": { "name": "Wikimedia Foundation, Inc." }
+}
+```
+
+### 2. Product page (BestBuy ships JSON-LD)
+
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/extract \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://www.bestbuy.com/product/apple-airpods-pro-2nd-generation-white/JJ8ZH6TPSW",
+    "expected_type": "product"
+  }'
+```
 
-url = "https://api.acedata.cloud/webextrator/extract"
+Surfaced from schema.org:
 
-headers = {
-    "accept": "application/json",
-    "authorization": "Bearer {token}",
-    "content-type": "application/json"
+```json
+{
+  "kind": "product",
+  "name": "Apple - Refurbished Excellent - AirPods Pro (2nd generation) - White",
+  "sku": "10845412",
+  "model": "MQD83AM/A",
+  "color": "White",
+  "brand": "Apple",
+  "offer": { "price": 159.99, "currency": "USD", "availability": "https://schema.org/InStock", "seller": "Best Buy" },
+  "rating": { "value": 4.4, "count": 8 }
 }
+```
+
+### 3. Recipe page (Recipe + nutrition + steps)
 
-payload = {
-    "url": "https://example.com/news/1",
-    "expected_type": "article",
-    "enable_llm": True
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/extract \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://www.allrecipes.com/recipe/16354/easy-meatloaf/"
+  }'
+```
+
+Surfaced from schema.org:
+
+```json
+{
+  "kind": "recipe",
+  "name": "Easy Meatloaf",
+  "cookTime": "PT60M",
+  "totalTime": "PT75M",
+  "recipeYield": "8 / 1 (9x5-inch) meatloaf",
+  "ingredients": ["1 1/2 pounds ground beef", "..."],
+  "instructions": [{ "text": "Preheat oven to 350°F ..." }, "..."],
+  "rating": { "value": 4.7, "count": 9348 }
 }
+```
+
+### 4. HN discussion (no JSON-LD — LLM is required)
+
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/extract \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://news.ycombinator.com/item?id=37000000",
+    "enable_llm": true
+  }'
+```
+
+Look in `data.structured.llm.data`:
 
-response = requests.post(url, json=payload, headers=headers)
-print(response.text)
+```json
+{
+  "kind": "discussion",
+  "title": "Show HN: A new way to extract web pages",
+  "author": "alice",
+  "points": 173,
+  "commentCount": 42,
+  "body": "Hi HN, we built a self-hosted alternative to Diffbot's Analyze API ..."
+}
 ```
 
-## Example: Async + Custom Structured Extraction
+The top-level response is also back-filled: `byline = "alice"`,
+`publishedAt = "..."`.
+
+### 5. Amazon product (Amazon ships no JSON-LD — LLM is required)
 
 ```bash
 curl -X POST https://api.acedata.cloud/webextrator/extract \
   -H "Authorization: Bearer $API_KEY" \
   -H "Content-Type: application/json" \
   -d '{
-    "url": "https://shop.example.com/item/123",
-    "expected_type": "structured",
-    "enable_llm": true,
-    "instruction": "Extract product title, price, stock, and 3 main image URLs",
-    "callback_url": "https://your-domain.com/wbx-callback"
+    "url": "https://www.amazon.com/dp/B0BSHF7WHW",
+    "expected_type": "product",
+    "enable_llm": true
   }'
 ```
 
-## Error Handling
+`data.structured.llm.data` (typed `product`):
+
+```json
+{
+  "kind": "product",
+  "name": "Apple 2023 MacBook Pro M2 Pro 14-inch",
+  "brand": "Apple",
+  "price": 1799,
+  "currency": "USD",
+  "bullets": ["Apple M2 Pro chip with 10-core CPU", "..."],
+  "specifications": [{ "name": "Display size", "value": "14.2 inches" }, "..."]
+}
+```
+
+### Python (requests)
+
+```python
+import os, requests
+
+API_KEY = os.environ["ACEDATA_API_KEY"]
+
+resp = requests.post(
+    "https://api.acedata.cloud/webextrator/extract",
+    headers={
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+    },
+    json={
+        "url": "https://en.wikipedia.org/wiki/Diffbot",
+        "expected_type": "article",
+    },
+    timeout=120,
+)
+resp.raise_for_status()
+data = resp.json()["data"]
+
+primary = ((data.get("structured") or {}).get("schemaOrg") or {}).get("primary")
+print("contentType:", data["contentType"])
+print("title:      ", data["title"])
+print("byline:     ", data.get("byline"))
+print("publishedAt:", data.get("publishedAt"))
+if primary and primary.get("kind") == "article":
+    print("headline:   ", primary.get("headline"))
+    print("dateModified:", primary.get("dateModified"))
+```
+
+### Node.js (fetch)
+
+```js
+const apiKey = process.env.ACEDATA_API_KEY;
+
+const res = await fetch('https://api.acedata.cloud/webextrator/extract', {
+  method: 'POST',
+  headers: {
+    Authorization: `Bearer ${apiKey}`,
+    'Content-Type': 'application/json',
+  },
+  body: JSON.stringify({
+    url: 'https://www.allrecipes.com/recipe/16354/easy-meatloaf/',
+  }),
+});
+const { data } = await res.json();
+const recipe = data?.structured?.schemaOrg?.primary;
+console.log(recipe?.name, recipe?.cookTime, recipe?.ingredients?.length, 'ingredients');
+```
 
-When calling the API, if an error occurs, the API will return the corresponding error code and message. Error codes: `bad_request` / `forbidden` / `too_many_requests` / `not_found` / `api_error` / `timeout` / `unknown` / `busy`.
+---
 
-## Conclusion
+## Tips and Gotchas
 
-Through this document, you have learned how to use the WebExtrator Extract API to intelligently extract content from web pages, including articles, markdown, plain text, links, and custom structured data. We hope this document can help you better integrate and use this API. If you have any questions, please feel free to contact our technical support team.
+- **Always set `expected_type` when you know it.** The hint is free and skips
+  the URL/text heuristic. It is especially valuable on pages whose URL doesn't
+  match the built-in patterns.
+- **`enable_llm: true` is free when schema.org wins.** The LLM is only called
+  when no schema.org primary was found, so flipping it on by default is safe
+  for traffic dominated by sites that ship JSON-LD.
+- **Inspect `rawSignals.hasJsonLd` first when debugging.** If `true` and
+  `structured.schemaOrg.primary` is `null`, the JSON-LD present uses a `@type`
+  the mapper doesn't recognise — open an issue and we'll add it.
+- **`structured.llmError` is informational.** The request still succeeds and
+  the heuristic-only payload still ships. Check `llmError.error` for the
+  reason (timeout, JSON parse failure, Zod validation failure).
+- **Don't rely on `links[]` being clean for non-article pages.** It's
+  best-effort: we keep up to 100 outbound URLs and filter out fragment-only
+  and `javascript:` links, but apply no content-relevance ranking.
+- **Cache hits are still billed.** Caching is for *latency* (and to protect
+  the underlying browser pool), not for cost.
diff --git a/webextrator/docs/webextrator_render_api_integration_guide.md b/webextrator/docs/webextrator_render_api_integration_guide.md
index 82060ae..976f4e6 100644
--- a/webextrator/docs/webextrator_render_api_integration_guide.md
+++ b/webextrator/docs/webextrator_render_api_integration_guide.md
@@ -2,34 +2,79 @@
 
 `POST https://api.acedata.cloud/webextrator/render`
 
-This document introduces the WebExtrator Render API. This API provides headless browser rendering for any URL, returning the fully rendered HTML, markdown, plain text, screenshot, and extracted links.
+The WebExtrator Render API is a headless-Chromium rendering service. Give it a
+URL and get back the fully rendered HTML (including JavaScript-injected
+content), plain text, page title, and the final URL after redirects.
+
+Render is the lowest-level WebExtrator endpoint. If you want **structured**
+content extraction (article body, product price, recipe ingredients, …), use
+[`/webextrator/extract`](webextrator_extract_api_integration_guide.md) instead;
+it runs the same render with a typed extraction pipeline on top.
+
+---
 
 ## Application Process
 
-To use the WebExtrator Render API, apply for the corresponding service on the WebExtrator service page. After entering the page, click the "Acquire" button to obtain the credentials needed for the request.
+To use the WebExtrator Render API, apply for the service on the
+[WebExtrator service page](https://platform.acedata.cloud/service/webextrator).
+Click **Acquire** to obtain the credentials needed for the request. A free
+quota is provided to first-time applicants.
 
-There is a free quota available for first-time applicants, allowing you to use this API for free.
+---
 
 ## Authentication
 
-Add `Authorization: Bearer <your API Key>` to the request header.
+All WebExtrator endpoints use the standard AceDataCloud bearer-token scheme:
+
+```
+Authorization: Bearer YOUR_API_KEY
+Content-Type:  application/json
+```
+
+---
 
-## Request Parameters
+## Request Body
 
 | Field | Type | Required | Default | Description |
-|------|------|:----:|------|------|
-| `url` | string | ✅ | - | The URL of the page to render |
-| `user_agent` | string | ❌ | System default | Custom User-Agent |
-| `timeout` | number | ❌ | 30000 | Single render timeout in milliseconds, max 120000 |
-| `wait_until` | string | ❌ | `load` | Load completion event: `load` / `domcontentloaded` / `networkidle` |
-| `delay` | number | ❌ | 0 | Additional wait time after load completes (milliseconds), max 30000 |
-| `wait_for_selector` | string | ❌ | - | Wait until this CSS selector appears |
-| `block_resources` | string[] | ❌ | - | Block resource types: `image` / `media` / `font` / `stylesheet`, etc. |
-| `headers` | object | ❌ | - | Additional HTTP headers |
-| `cookies` | array | ❌ | - | Cookie list; each element has the form `{name, value, domain, path}` |
-| `callback_url` | string | ❌ | - | Async mode callback URL; if provided, the task ID is returned immediately and the result is delivered via POST callback |
-
-## Synchronous Response (without callback_url)
+|---|---|:---:|---|---|
+| `url` | string | ✅ | — | Page URL to render. Must be `http(s)://`. |
+| `user_agent` | string | ❌ | rotating modern Chrome UA | Override the browser User-Agent header. |
+| `timeout` | number | ❌ | 30 | Per-request navigation timeout in **seconds**. |
+| `wait_until` | enum | ❌ | `networkidle` | Page-ready condition: `load` \| `domcontentloaded` \| `networkidle` \| `commit`. |
+| `delay` | number | ❌ | 0 | Extra wait **in seconds** after `wait_until` fires (use for SPAs that re-render). |
+| `wait_for_selector` | string | ❌ | — | CSS selector to wait for before considering the page ready. Cuts down on flaky `networkidle` failures. |
+| `block_resources` | string[] | ❌ | `["image","font","media"]` (server default) | Resource types to drop. Choices: `image`, `font`, `media`, `stylesheet`, `xhr`, `fetch`. Blocking saves bandwidth and renders faster. |
+| `headers` | object | ❌ | — | Additional request headers sent to the target site (e.g. `{"Accept-Language": "en-US"}`). |
+| `cookies` | array | ❌ | — | Cookies to install before navigation. See [Cookie shape](#cookie-shape). |
+| `callback_url` | string | ❌ | — | If set with `mode=async`, the platform `POST`s the final envelope here when the job finishes. |
+| `bypass_cache` | boolean | ❌ | false | Skip the Redis result cache for this request (still writes the fresh result back). |
+| `cache_ttl_seconds` | number | ❌ | 3600 | Override the global cache TTL for this entry. `0` is allowed and means "don't cache this response". |
+| `mode` | enum | ❌ | `sync` | `sync` waits inline; `async` returns a job id immediately and (optionally) posts to `callback_url`. |
+
+> Note: parameters use **snake_case** on the platform contract. The internal
+> render service uses `camelCase`; both are documented in the OpenAPI spec but
+> external callers should always use snake_case.
+
+### Cookie shape
+
+```json
+{
+  "name":      "string",
+  "value":     "string",
+  "domain":    "string",
+  "path":      "/",
+  "expires":   1735689600,
+  "httpOnly":  false,
+  "secure":    true,
+  "sameSite":  "Lax"
+}
+```
+
+---
+
+## Response (Sync mode)
+
+The envelope is the standard AceDataCloud `success / error` shape.
 
 ```json
 {
@@ -37,54 +82,90 @@ Add `Authorization: Bearer <your API Key>` to the request header.
   "task_id": "550e8400-e29b-41d4-a716-446655440000",
   "trace_id": "550e8400-e29b-41d4-a716-446655440001",
   "started_at": "2026-05-02T10:30:00.123Z",
-  "finished_at": "2026-05-02T10:30:05.456Z",
-  "elapsed": 5.333,
+  "finished_at": "2026-05-02T10:30:01.234Z",
+  "elapsed": 1.111,
   "data": {
     "kind": "render",
     "url": "https://example.com",
+    "finalUrl": "https://example.com/",
     "title": "Example Domain",
-    "html": "<!doctype html>...",
-    "text": "Example Domain ...",
-    "markdown": "# Example Domain\n...",
-    "screenshot": "data:image/png;base64,iVBORw0K...",
-    "links": ["https://www.iana.org/domains/example"]
+    "status": 200,
+    "html": "<!DOCTYPE html><html>...</html>",
+    "text": "Example Domain\nThis domain is for use in illustrative examples...",
+    "userAgent": "Mozilla/5.0 ...",
+    "elapsedMs": 1108
   }
 }
 ```
 
-## Async Mode (with callback_url)
+| Field | Type | Description |
+|---|---|---|
+| `data.kind` | string | Always `"render"`. |
+| `data.url` | string | The URL you supplied. |
+| `data.finalUrl` | string | The URL after redirects. |
+| `data.title` | string | `document.title` of the rendered page. |
+| `data.status` | number \| null | HTTP status of the main navigation response. |
+| `data.html` | string | Full rendered HTML. |
+| `data.text` | string | `document.body.innerText` — quick plain-text snapshot (Extract API returns a cleaner Readability text). |
+| `data.userAgent` | string | UA actually used (after pool rotation, if any). |
+| `data.elapsedMs` | number | Render time in milliseconds (browser only). |
+| `data.cached` | boolean? | `true` if served from Redis cache. |
+| `data.cacheStoredAt` | number? | Unix-ms timestamp when the cached entry was first stored. |
+
+---
 
-Initial response:
+## Response (Async mode)
+
+When `mode=async`:
 
 ```json
-{
-  "success": true,
-  "task_id": "550e8400-e29b-41d4-a716-446655440000",
-  "trace_id": "550e8400-e29b-41d4-a716-446655440001",
-  "started_at": "2026-05-02T10:30:00.123Z"
-}
+{ "jobId": "550e8400-...", "status": "queued" }
 ```
 
-The response header will include `x-usage-exempt: true`, indicating that this synchronous handshake is not billed. Once the task actually completes, the platform will send a POST to the `callback_url` with the same `data` field as the synchronous response, plus the same `task_id` / `trace_id` / `started_at` / `finished_at` / `elapsed` fields.
+The HTTP status code is `202`. The result is delivered either by `POST` to your
+`callback_url` or by polling [`/webextrator/tasks`](webextrator_tasks_api_integration_guide.md).
 
-## Error Response
+### Callback shape
+
+The platform `POST`s the **same envelope** you would have received in sync mode
+to `callback_url` with `Content-Type: application/json`. Acknowledge with any
+`2xx`; the platform retries `5xx` with exponential backoff for ~5 minutes.
+
+---
+
+## Error Responses
+
+| HTTP | `error.code` | Meaning |
+|---|---|---|
+| 400 | `bad_request` | Body failed Zod validation (missing `url`, wrong types, …). |
+| 401 | `unauthorized` | Missing or invalid `Authorization: Bearer …`. |
+| 402 | (x402) | Insufficient platform balance — see x402 payment envelope. |
+| 408 | `timeout` | Navigation exceeded `timeout`. |
+| 429 | `queue_busy` | Sync queue depth too high — retry, or use `mode=async`. |
+| 500 | `internal_error` | Unhandled server-side failure (browser crash, etc.). Auto-retried by the worker once. |
+
+Errors share the standard envelope:
 
 ```json
 {
   "success": false,
-  "task_id": "550e8400-e29b-41d4-a716-446655440000",
-  "trace_id": "550e8400-e29b-41d4-a716-446655440001",
-  "started_at": "2026-05-02T10:30:00.123Z",
+  "task_id": "...",
+  "trace_id": "...",
+  "started_at": "...",
+  "finished_at": "...",
+  "elapsed": 0.012,
   "error": {
-    "code": "timeout",
-    "message": "page load timed out after 30000ms"
+    "code": "bad_request",
+    "message": "url: Invalid url"
   }
 }
 ```
 
-Error codes: `bad_request` / `forbidden` / `too_many_requests` / `not_found` / `api_error` / `timeout` / `unknown` / `busy`.
+---
+
+## Examples
 
-## Example
+### cURL
 
 ```bash
 curl -X POST https://api.acedata.cloud/webextrator/render \
@@ -97,29 +178,98 @@ curl -X POST https://api.acedata.cloud/webextrator/render \
   }'
 ```
 
-Python example:
+### Python (requests)
 
 ```python
+import os
 import requests
 
-url = "https://api.acedata.cloud/webextrator/render"
+API_KEY = os.environ["ACEDATA_API_KEY"]
 
-headers = {
-    "accept": "application/json",
-    "authorization": "Bearer {token}",
-    "content-type": "application/json"
-}
+resp = requests.post(
+    "https://api.acedata.cloud/webextrator/render",
+    headers={
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+    },
+    json={
+        "url": "https://example.com",
+        "wait_until": "networkidle",
+        "block_resources": ["image", "media", "font"],
+    },
+    timeout=60,
+)
+resp.raise_for_status()
+data = resp.json()["data"]
+print(data["title"], data["status"], len(data["html"]))
+```
+
+### Node.js (fetch)
+
+```js
+const apiKey = process.env.ACEDATA_API_KEY;
+
+const res = await fetch('https://api.acedata.cloud/webextrator/render', {
+  method: 'POST',
+  headers: {
+    Authorization: `Bearer ${apiKey}`,
+    'Content-Type': 'application/json',
+  },
+  body: JSON.stringify({
+    url: 'https://example.com',
+    wait_until: 'networkidle',
+    block_resources: ['image', 'media', 'font'],
+  }),
+});
+const { data } = await res.json();
+console.log(data.title, data.status, data.html.length);
+```
 
-payload = {
+### Async + callback
+
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/render \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
     "url": "https://example.com",
-    "wait_until": "networkidle",
-    "block_resources": ["image", "media", "font"]
-}
+    "mode": "async",
+    "callback_url": "https://your-app.example.com/hooks/webextrator"
+  }'
+```
 
-response = requests.post(url, json=payload, headers=headers)
-print(response.text)
+You will receive `{ "jobId": "...", "status": "queued" }` immediately. Your
+`callback_url` will be called once the job finishes (typically within a few
+seconds for normal pages, up to a minute for heavy SPAs).
+
+### Forcing a re-render past the cache
+
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/render \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "bypass_cache": true
+  }'
 ```
 
-## Conclusion
+---
+
+## Tips and Gotchas
 
-Through this document, you have learned how to use the WebExtrator Render API to render any web page and receive the fully rendered HTML, markdown, text, screenshot, and links. We hope this document can help you better integrate and use this API. If you have any questions, please feel free to contact our technical support team.
+- **`wait_until` choice matters.** `networkidle` is the safest but slowest;
+  `domcontentloaded` is fast but may miss late XHR-injected content; `load`
+  works well for classic static pages.
+- **Cache key ignores `mode`.** A `sync` and `async` request for the same URL
+  hit the same cache entry — flip `mode` freely without invalidating anything.
+- **Cache key ignores `bypass_cache` and `cache_ttl_seconds`.** Those are
+  operational toggles that don't change the rendered result.
+- **Cookies and headers DO partition the cache.** If you customise them per
+  request, expect cache misses on the first call per unique combination.
+- **Heavy pages can exceed the default 30 s timeout.** For SPAs that lazy-load,
+  set `timeout: 60`, `wait_until: "domcontentloaded"`, and `delay: 4`, plus a
+  `wait_for_selector` for the element you actually care about.
+- **`block_resources` is your fastest path to lower latency.** The default
+  already blocks images, fonts, and media. Add `stylesheet` if your extraction
+  doesn't need CSS-driven layout.
diff --git a/webextrator/docs/webextrator_tasks_api_integration_guide.md b/webextrator/docs/webextrator_tasks_api_integration_guide.md
index 42ef590..1c87f50 100644
--- a/webextrator/docs/webextrator_tasks_api_integration_guide.md
+++ b/webextrator/docs/webextrator_tasks_api_integration_guide.md
@@ -1,122 +1,228 @@
 # WebExtrator Tasks API Integration Guide
 
-`POST https://api.acedata.cloud/webextrator/tasks` (**Free**)
+`POST https://api.acedata.cloud/webextrator/tasks`
 
-This document introduces the WebExtrator Tasks API. This API allows you to query historical `render` / `extract` tasks (retained for 7 days).
+The WebExtrator Tasks API lets you look up historical `render` / `extract`
+job envelopes. Useful for:
 
-## Application Process
+- Retrieving the result of an **async** job, either after the platform `POST`s
+  it to your `callback_url` or by polling when you set no callback.
+- Auditing what you submitted — task records keep both the **request** body
+  and the **response** envelope side by side.
+- Batch back-fill: pull many tasks in one call by `ids` or `trace_ids`.
 
-The WebExtrator Tasks API is bundled with the existing WebExtrator service. If you already have access to the WebExtrator Render or Extract API, you can call this endpoint with the same authorization token — no additional application is required.
+Task records are retained for **7 days** in Redis.
 
-## Single Task Query
+The Tasks API is **free** (no Credit consumption per call).
 
-Query by task ID:
+---
+
+## Authentication
+
+```
+Authorization: Bearer YOUR_API_KEY
+Content-Type:  application/json
+```
+
+You can only look up tasks owned by your own AceDataCloud account.
+
+---
+
+## Request Body
+
+The body is a discriminated union on `action`. Two actions are supported:
+
+### `action: "retrieve"` — single task
+
+| Field | Type | Required | Description |
+|---|---|:---:|---|
+| `action` | const | ✅ | Must be `"retrieve"`. |
+| `id` | string | one of | Task id (returned in `task_id` of every render/extract envelope). |
+| `trace_id` | string | one of | Trace id (returned in `trace_id` of every envelope). |
+
+Exactly one of `id` or `trace_id` must be supplied.
+
+### `action: "retrieve_batch"` — many tasks
+
+| Field | Type | Required | Description |
+|---|---|:---:|---|
+| `action` | const | ✅ | Must be `"retrieve_batch"`. |
+| `ids` | string[] | one of | List of task ids. |
+| `trace_ids` | string[] | one of | List of trace ids. |
+| `offset` | number | ❌ | Pagination offset (default 0). |
+| `limit` | number | ❌ | Page size, 1–100 (default 50). |
+
+Exactly one of `ids` or `trace_ids` must be supplied.
+
+---
+
+## Response — single task
 
 ```json
 {
-  "action": "retrieve",
-  "id": "550e8400-e29b-41d4-a716-446655440000"
+  "task": {
+    "id": "550e8400-e29b-41d4-a716-446655440000",
+    "trace_id": "550e8400-e29b-41d4-a716-446655440001",
+    "type": "extract",
+    "created_at": 1730000000000,
+    "started_at": "2026-05-02T10:30:00.123Z",
+    "finished_at": "2026-05-02T10:30:02.535Z",
+    "elapsed": 2.412,
+    "request": {
+      "url": "https://en.wikipedia.org/wiki/Diffbot",
+      "expected_type": "article"
+    },
+    "response": {
+      "success": true,
+      "data": { /* the full extract envelope */ }
+    }
+  }
 }
 ```
 
-Or query by `trace_id`:
+If no task is found with the given id / trace id, returns
+`{ "task": null }` (HTTP 200).
+
+---
+
+## Response — batch
 
 ```json
 {
-  "action": "retrieve",
-  "trace_id": "550e8400-e29b-41d4-a716-446655440001"
+  "tasks": [
+    { /* same shape as single-task `.task` */ },
+    { /* ... */ }
+  ],
+  "offset": 0,
+  "limit":  50
 }
 ```
 
-Returns a single task object containing `request` / `response` / `started_at` / `finished_at`, and other fields.
+Missing ids return no error — they are simply absent from `tasks`.
+
+---
 
-### Code Example
+## Examples
 
-#### CURL
+### Retrieve a single task by id
 
 ```bash
-curl -X POST 'https://api.acedata.cloud/webextrator/tasks' \
-  -H 'accept: application/json' \
-  -H 'authorization: Bearer {token}' \
-  -H 'content-type: application/json' \
+curl -X POST https://api.acedata.cloud/webextrator/tasks \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
   -d '{
     "action": "retrieve",
     "id": "550e8400-e29b-41d4-a716-446655440000"
   }'
 ```
 
-#### Python
+### Retrieve a single task by trace id
 
-```python
-import requests
-
-url = "https://api.acedata.cloud/webextrator/tasks"
-
-headers = {
-    "accept": "application/json",
-    "authorization": "Bearer {token}",
-    "content-type": "application/json"
-}
-
-payload = {
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/tasks \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
     "action": "retrieve",
-    "id": "550e8400-e29b-41d4-a716-446655440000"
-}
-
-response = requests.post(url, json=payload, headers=headers)
-print(response.text)
+    "trace_id": "550e8400-e29b-41d4-a716-446655440001"
+  }'
 ```
 
-## Batch Query
+### Batch retrieve
 
-```json
-{
-  "action": "retrieve_batch",
-  "ids": ["...", "..."],
-  "limit": 12,
-  "offset": 0
-}
+```bash
+curl -X POST https://api.acedata.cloud/webextrator/tasks \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "action": "retrieve_batch",
+    "ids": [
+      "550e8400-e29b-41d4-a716-446655440000",
+      "550e8400-e29b-41d4-a716-446655440002"
+    ],
+    "limit": 50
+  }'
 ```
 
-You can also use `trace_ids`, or omit both to paginate through your own task history.
+### Python (requests) — poll until done
 
-Returns:
+```python
+import os, time, requests
+
+API_KEY = os.environ["ACEDATA_API_KEY"]
+BASE = "https://api.acedata.cloud"
+
+# 1) Fire an async extract.
+queue = requests.post(
+    f"{BASE}/webextrator/extract",
+    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
+    json={"url": "https://example.com", "mode": "async"}, timeout=30,
+).json()
+
+job_id = queue["jobId"]
+
+# 2) Poll the Tasks API until the task finishes; give up after ~2 minutes.
+for _ in range(60):
+    r = requests.post(
+        f"{BASE}/webextrator/tasks",
+        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
+        json={"action": "retrieve", "id": job_id},
+        timeout=30,
+    ).json()
+    task = r.get("task")
+    if task and task.get("finished_at"):
+        print("done in", task["elapsed"], "s:", (task["response"].get("data") or {}).get("title"))
+        break
+    time.sleep(2)
+```
 
-```json
-{
-  "items": [ { }, ... ],
-  "count": 2
-}
+### Node.js (fetch) — process a callback then re-fetch the full envelope
+
+```js
+// Inside your callback_url handler:
+app.post('/hooks/webextrator', async (req, res) => {
+  res.status(200).end();              // ack fast
+
+  const taskId = req.body?.task_id;
+  if (!taskId) return;
+
+  const fetchRes = await fetch('https://api.acedata.cloud/webextrator/tasks', {
+    method: 'POST',
+    headers: {
+      Authorization: `Bearer ${process.env.ACEDATA_API_KEY}`,
+      'Content-Type': 'application/json',
+    },
+    body: JSON.stringify({ action: 'retrieve', id: taskId }),
+  });
+  const { task } = await fetchRes.json();
+  console.log('full envelope:', task?.response?.data);
+});
 ```
 
-### CURL Example
+---
 
-```bash
-curl -X POST 'https://api.acedata.cloud/webextrator/tasks' \
-  -H 'accept: application/json' \
-  -H 'authorization: Bearer {token}' \
-  -H 'content-type: application/json' \
-  -d '{
-    "action": "retrieve_batch",
-    "ids": ["550e8400-e29b-41d4-a716-446655440000", "550e8400-e29b-41d4-a716-446655440002"],
-    "limit": 12,
-    "offset": 0
-  }'
-```
+## Error Responses
 
-## Field Definitions
+| HTTP | `error.code` | Meaning |
+|---|---|---|
+| 400 | `bad_request` | Validation failure (missing `action`, both `id` and `trace_id` present, …). |
+| 401 | `unauthorized` | Missing or invalid `Authorization: Bearer …`. |
 
-| Field | Description |
-|------|------|
-| `id` | Unique task ID (also returned as `task_id` in some response contexts) |
-| `trace_id` | Call chain ID (aligned with PlatformGateway / CLS) |
-| `type` | `render` or `extract` |
-| `request` | Original request body |
-| `response` | Render / extraction result (same as the `data` field in a sync response) |
-| `started_at` / `finished_at` / `elapsed` | Timestamps and duration (seconds) |
+```json
+{ "error": { "code": "bad_request", "message": "..." } }
+```
 
-> This API is free and does not count toward usage.
+---
 
-## Conclusion
+## Tips and Gotchas
 
-Through this document, you have learned how to use the WebExtrator Tasks API to query single or batch historical render and extract tasks. We hope this document can help you better integrate and use this API. If you have any questions, please feel free to contact our technical support team.
+- **Supply your own `trace_id` when you have one.** Pass `?trace_id=…` on the
+  original render/extract request to make tasks searchable by your own ids
+  (e.g. workflow run id). Otherwise the server generates a UUID for you.
+- **Retention is 7 days.** Older tasks return `task: null` — store anything
+  you need long-term on your side.
+- **Tasks API is free.** Look up as often as you want; the cost was already
+  paid when the original render/extract job ran.
+- **Prefer callbacks to polling.** When practical, set `callback_url` on the
+  original request so the platform delivers the envelope to you instead of
+  you polling every 2 s.