From a66ec69cb7d69a90ceee36c05fd7bcbf7f35dbd2 Mon Sep 17 00:00:00 2001
From: Han Ngo
Date: Tue, 12 May 2026 10:09:16 +0700
Subject: [PATCH] docs(skills): refresh annas-fetch + local-ocr for shipped CLI verbs

annas-fetch: document mirror auto-discovery (Wikipedia/blog fallback, 30-day cache, mirror-discover verb).
local-ocr: add OLLAMA_HOST_URL override, triage/verify/audit subcommands, portrait routing caveat.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 home/dot_claude/skills/annas-fetch/SKILL.md | 11 ++++++-----
 home/dot_claude/skills/local-ocr/SKILL.md   | 17 ++++++++++++++++-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/home/dot_claude/skills/annas-fetch/SKILL.md b/home/dot_claude/skills/annas-fetch/SKILL.md
index 310e85d..72d8afd 100644
--- a/home/dot_claude/skills/annas-fetch/SKILL.md
+++ b/home/dot_claude/skills/annas-fetch/SKILL.md
@@ -23,7 +23,7 @@ Search, browse, and download books from Anna's Archive via the **member fast-dow
 ## Hard rules
 
 1. **Never hardcode the secret key.** Always resolve from 1Password at runtime. The skill uses `op run --env-file ` (the `secret-guard` hook blocks raw `op read` even in nested contexts). The Python CLI reads `ANNAS_SECRET_KEY` from env; the skill is responsible for setting it for the subprocess only.
-2. **Default base URL is `https://annas-archive.gl`** (current primary as of 2026-05-10). The CLI has mirror fallback; if the user reports the primary has rotated, update `DEFAULT_BASE` in `annas_fetch.py` or pass `ANNAS_BASE_URL`.
+2. **Default base URL is `https://annas-archive.gl`** (current primary as of 2026-05-10). The CLI iterates `MIRRORS` on transport failure, then auto-discovers replacements via Wikipedia (then Anna's blog) when every mirror in the effective list dies in one invocation. Discovery result is cached at `~/.cache/annas-fetch/mirrors.json` (30-day TTL). Manual edits to `DEFAULT_BASE` / `ANNAS_BASE_URL` are still supported but rarely needed; for a forced refresh run `annas-fetch mirror-discover --force`. See Spec 08.
 3. **Default output is `~/Downloads/annas/`.** Never write into a consumer repo. Books are not repo content.
 4. **Confirm before downloading >3 files in one session.** Member tier daily quota is ~75 fast-downloads; bulk runs burn it fast.
 5. **Surface the quota line.** When the CLI prints `# quota: ...` to stderr, relay `downloads_left / downloads_per_day` to the user.
@@ -124,6 +124,7 @@ Report to the user:
 | "build the local library index" | `annas-fetch library scan --path ~/Downloads/annas` — walks recursively, hashes books, writes `~/.cache/annas-fetch/library.jsonl`. Add `--full` to rebuild from scratch. |
 | "do I already have ?" | `annas-fetch library check ` — exits 0 + path on hit, 1 on miss, 3 if no index |
 | "are the AA mirrors alive?" / "is .gl down?" | `annas-fetch mirror-check` — probes every entry in `MIRRORS` with HEAD, prints latency table. Add `--json` for structured output. |
+| "AA rotated all domains, find the new one" / "refresh the mirror list" | `annas-fetch mirror-discover [--force] [--json]` — scrapes the Anna's Archive Wikipedia page (then the blog) for current domains, writes `~/.cache/annas-fetch/mirrors.json`. Lazy auto-fallback already fires inside search/fetch on full failure; this verb is for explicit/manual refresh. |
 | "AA's HTML changed, parser tests broken" | `annas-fetch dev refresh-fixtures` — re-pulls `tests/fixtures/*.html`. Run unit tests after to surface drift. Dev-only. |
 
 None of these require the member key (only `fetch` does).
@@ -131,7 +132,7 @@ None of these require the member key (only `fetch` does).
 ### Step 6: Failure handling
 
 - **`download_url` missing across all 9 attempts** → API may have shifted field names. Inspect one raw response: `op run --env-file /tmp/annas.env -- python3 -c "import sys; sys.path.insert(0, '/Users/tieubao/workspace/tieubao/ops-toolkit/tools/annas-fetch'); from annas_fetch import fast_download_url; import json; print(json.dumps(fast_download_url(''), indent=2))"`. Patch the field name in the CLI.
-- **All mirrors fail at transport** → run `annas-fetch mirror-check` first to confirm which mirrors are alive. If `.gl` rotated out, update `DEFAULT_BASE` in `annas_fetch.py` or set `ANNAS_BASE_URL`.
+- **All mirrors fail at transport** → the CLI now auto-recovers by scraping Wikipedia (then Anna's blog) for the current domain list. If even that fails (rare: Wikipedia and the blog would both have to be unreachable), run `annas-fetch mirror-check` to confirm baseline reachability and `annas-fetch mirror-discover --force --json` to inspect the discovery payload. Last-resort overrides are still `DEFAULT_BASE` or `ANNAS_BASE_URL`.
 - **Quota exceeded** → tell the user; do not retry.
 - **Captcha / Cloudflare challenge in response** → member API should bypass these; if hit, the key may be invalid or expired. Verify the 1Password ref.
 - **Search results look like garbage / empty** → AA HTML markup may have shifted. Run `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` — if parser tests still pass against fixtures but live results are broken, refresh fixtures with `annas-fetch dev refresh-fixtures` and re-run tests to surface what changed.
@@ -147,7 +148,7 @@ None of these require the member key (only `fetch` does).
 
 - Tool source: `~/workspace/tieubao/ops-toolkit/tools/annas-fetch/`
 - Spec: `ops-toolkit/tools/annas-fetch/SPEC.md`
-- Follow-up specs: `ops-toolkit/tools/annas-fetch/specs/` (filters, hybrid browse, library dedup, quota tracker, mirror-check, fixture refresh; spec 05 parallel-probe is deferred)
-- Tests: `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` (no network, 91 cases) plus `RUN_LIVE_SMOKE=1 python3 tests/smoke_live.py` (5 cases, network, no fetch)
+- Follow-up specs: `ops-toolkit/tools/annas-fetch/specs/` (filters, hybrid browse, library dedup, quota tracker, mirror-check, fixture refresh, mirror auto-discovery; spec 05 parallel-probe is deferred)
+- Tests: `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` (no network, 119 cases) plus `RUN_LIVE_SMOKE=1 python3 tests/smoke_live.py` (6 cases, network, no fetch)
 - API: `GET /dyn/api/fast_download.json?md5=&key=&path_index=<0..2>&domain_index=<0..2>` → JSON `{download_url, account_fast_download_info, ...}` or `{error, ...}`
-- Cache: `~/.cache/annas-fetch/` holds `quota.jsonl` (per-fetch row) and `library.jsonl` (built by `library scan`).
+- Cache: `~/.cache/annas-fetch/` holds `quota.jsonl` (per-fetch row), `library.jsonl` (built by `library scan`), and `mirrors.json` (auto-discovered mirror list, 30-day TTL).
diff --git a/home/dot_claude/skills/local-ocr/SKILL.md b/home/dot_claude/skills/local-ocr/SKILL.md
index 0b80854..94be217 100644
--- a/home/dot_claude/skills/local-ocr/SKILL.md
+++ b/home/dot_claude/skills/local-ocr/SKILL.md
@@ -36,6 +36,7 @@ For **PDF in vault** (`_vault/legal/...`) the user references by path: that's fi
 
 - CLI: `~/workspace/tieubao/ops-toolkit/tools/local-ocr/local_ocr.py`
 - Endpoint: `http://100.118.23.42:11434` (the Mini's personal Tailnet identity; do NOT use `mac-mini-danang:11434` — that's the work-tagged identity and Ollama is bound to the personal one only)
+- Override with the `OLLAMA_HOST_URL` env var when you're off-Tailnet or testing against a different box. Scheme is required (e.g. `http://192.0.2.1:11434`).
 - Health check: `local-ocr health`
 
 If `local-ocr health` fails, surface the error to the user. Do not retry blindly. Common causes: Mini offline, Tailscale down, agent unloaded (`launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/mini.ollama.plist`), models pruned (`ollama list` on Mini).
@@ -98,9 +99,10 @@ The CLI emits `doc_class:` as the first field of the YAML. Route based on it. Ea
 | `imaging` | `health/.md` → `## Imaging / procedures` (one line: date, modality, body part, plain-language finding, vault path) | `_vault/legal/notes.md` (accession_number, full radiology narrative) | `_vault/legal/imaging---.pdf` |
 | `discharge` | `health/.md` → `## Encounters` (new dated block) | `_vault/legal/notes.md` (patient_id, full discharge narrative, meds at discharge) | `_vault/legal/discharge--.pdf` |
 | `vaccination` | `health/.md` → `## Vaccinations` (flip Status: due → done for each matching dose; add row if no scheduled match exists) | `_vault/legal/notes.md` (patient_id if present) | `_vault/legal/vax--.pdf` |
-| `portrait` | (no text fields to land) | (no notes; file itself is the artifact) | `_vault/legal/portrait--.jpg` |
 | `other` | (manual review only; do not auto-write) | (manual review only) | `_vault/legal/` (preserve original filename) |
 
+**Note on `portrait`**: the `structured` CLI does NOT emit `doc_class: portrait` — it's not in the `--doc-class` choices and has no schema. Portraits are recognized only by `triage` (see Other subcommands below), which routes them to `_vault/legal/portrait--.` without any field extraction. If a user drops a portrait into Mode B, the CLI will classify it as `other` and you'll fall through to the manual-review row.
+
 ### When `doc_class: other` is returned
 
 Switch to Mode A behavior automatically: show the OCR text + the model's free-form `notes`, archive the PDF to vault, and ask the user where to land the content. No auto-writes.
@@ -185,6 +187,19 @@ For DeepSeek with bounding boxes + region tags (visual audit trail):
 
 Useful when DeepSeek's output looks suspicious or the user wants a second opinion before committing field values.
 
+### Other subcommands (housekeeping)
+
+These verbs sit alongside `ocr` / `structured` / `compare` / `absorb` / `rollback`. Use them when the user's intent matches.
+
+| User intent | Subcommand | Notes |
+|---|---|---|
+| "what's in this folder?" / "triage the inbox" / "what doc class is this?" | `local-ocr triage [--format json]` | Filename + extension heuristics only. No content reads, so it's safe to run against an `_inbox/` you haven't reviewed yet. Returns one row per file with sensitivity score + suggested doc-class hint (the hint set is broader than `structured`'s `--doc-class` choices: it recognises `portrait`, `bao-hiem-y-te`, `vaccine`, etc. as routing aliases). |
+| "double-check this YAML for PII leaks" / "re-verify the extraction" | `local-ocr verify ` (or `-` for stdin) | Runs Presidio over the YAML and flags `safe_for_git` fields that look like PII. Same threshold as Mode B's blocking gate (0.7). Useful if a user hand-edited an extraction and you want a sanity pass before they commit. |
+| "show me past blind-absorb runs" / "what did I absorb last week?" | `local-ocr audit list` | Lists every run in `_vault/audit/blind-absorb/`, newest first, redacted. |
+| "show me run X's manifest" / "what files did blind-absorb touch on RUN-Y?" | `local-ocr audit show [--format json]` | Prints the manifest for one run. Redacted by default. **Never invoke with `--full`** — the CLI's own help string warns "DANGER: reads .values.json and prints extracted field values. Do NOT use from an LLM session; only in your own terminal." That's the same load-bearing invariant as Mode D's no-read rule. |
+
+None of these need the member-tier model or the 1Password key; they're local-only metadata/filename operations.
+
 ## Output format and human review
 
 The CLI emits text by default; pass `--format json` for piping. For the structured pipeline, the YAML output looks like: