From a66ec69cb7d69a90ceee36c05fd7bcbf7f35dbd2 Mon Sep 17 00:00:00 2001
From: Han Ngo <nntruonghan@gmail.com>
Date: Tue, 12 May 2026 10:09:16 +0700
Subject: [PATCH] docs(skills): refresh annas-fetch + local-ocr for shipped CLI
 verbs

annas-fetch: document mirror auto-discovery (Wikipedia/blog fallback,
30-day cache, mirror-discover verb).
local-ocr: add OLLAMA_HOST_URL override, triage/verify/audit
subcommands, portrait routing caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
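Reviewer note: the commit message above describes a mirror cache with a 30-day TTL. A minimal sketch of how such a freshness check could look is below. It is not taken from `annas_fetch.py`: the function names, the `fetched_at` / `mirrors` field names, and the "built-in list first, discoveries appended" ordering are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL, as stated in this patch


def load_cached_mirrors(cache_path, now=None):
    """Return the cached mirror list, or None if the cache is missing, unreadable, or stale.

    The layout {"fetched_at": <epoch seconds>, "mirrors": [...]} is hypothetical;
    the real mirrors.json schema may differ.
    """
    now = time.time() if now is None else now
    try:
        payload = json.loads(Path(cache_path).read_text())
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return None
    if not isinstance(payload, dict):
        return None
    fetched_at = payload.get("fetched_at")
    mirrors = payload.get("mirrors")
    if not isinstance(fetched_at, (int, float)) or not isinstance(mirrors, list):
        return None
    if now - fetched_at > CACHE_TTL_SECONDS:
        return None  # older than 30 days: treat as absent so discovery reruns
    return mirrors


def effective_mirrors(built_in, cached):
    """Merge built-in mirrors with cached discoveries, keeping order and dropping duplicates."""
    seen = set()
    ordered = []
    for url in list(built_in) + list(cached or []):
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

Treating a stale or malformed cache the same as a missing one keeps the caller simple: any `None` result means "fall back to the built-in list and rediscover on full failure".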
 home/dot_claude/skills/annas-fetch/SKILL.md | 11 ++++++-----
 home/dot_claude/skills/local-ocr/SKILL.md   | 17 ++++++++++++++++-
 2 files changed, 22 insertions(+), 6 deletions(-)
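
Reviewer note: the local-ocr change documents an `OLLAMA_HOST_URL` override whose scheme is required. The sketch below shows one way that rule could be enforced. It assumes a hypothetical helper `resolve_ollama_endpoint`; only the env var name and the default endpoint come from the skill doc, and the real `local_ocr.py` may validate differently or not at all.

```python
import os
from urllib.parse import urlsplit

# Default endpoint as named in the skill doc.
DEFAULT_ENDPOINT = "http://100.118.23.42:11434"


def resolve_ollama_endpoint(env=None):
    """Pick the Ollama base URL, preferring OLLAMA_HOST_URL when it is set.

    Hypothetical helper: rejects values without an http/https scheme,
    mirroring the "scheme is required" note in the doc.
    """
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_HOST_URL", "").strip()
    if not raw:
        return DEFAULT_ENDPOINT
    parts = urlsplit(raw)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(
            f"OLLAMA_HOST_URL must include a scheme, e.g. http://192.0.2.1:11434 (got {raw!r})"
        )
    return raw.rstrip("/")
```

Failing loudly on a bare `host:port` value avoids a confusing connection error later, since a URL without a scheme is parsed with the host in the wrong component.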
diff --git a/home/dot_claude/skills/annas-fetch/SKILL.md b/home/dot_claude/skills/annas-fetch/SKILL.md
index 310e85d..72d8afd 100644
--- a/home/dot_claude/skills/annas-fetch/SKILL.md
+++ b/home/dot_claude/skills/annas-fetch/SKILL.md
@@ -23,7 +23,7 @@ Search, browse, and download books from Anna's Archive via the **member fast-dow
 ## Hard rules
 
 1. **Never hardcode the secret key.** Always resolve from 1Password at runtime. The skill uses `op run --env-file <tempfile>` (the `secret-guard` hook blocks raw `op read` even in nested contexts). The Python CLI reads `ANNAS_SECRET_KEY` from env; the skill is responsible for setting it for the subprocess only.
-2. **Default base URL is `https://annas-archive.gl`** (current primary as of 2026-05-10). The CLI has mirror fallback; if the user reports the primary has rotated, update `DEFAULT_BASE` in `annas_fetch.py` or pass `ANNAS_BASE_URL`.
+2. **Default base URL is `https://annas-archive.gl`** (current primary as of 2026-05-10). The CLI iterates `MIRRORS` on transport failure, then auto-discovers replacements via Wikipedia (then Anna's blog) when every mirror in the effective list fails within a single invocation. The discovery result is cached at `~/.cache/annas-fetch/mirrors.json` (30-day TTL). Manual edits to `DEFAULT_BASE` / `ANNAS_BASE_URL` are still supported but rarely needed; to force a refresh, run `annas-fetch mirror-discover --force`. See Spec 08.
 3. **Default output is `~/Downloads/annas/`.** Never write into a consumer repo. Books are not repo content.
 4. **Confirm before downloading >3 files in one session.** Member tier daily quota is ~75 fast-downloads; bulk runs burn it fast.
 5. **Surface the quota line.** When the CLI prints `# quota: ...` to stderr, relay `downloads_left / downloads_per_day` to the user.
@@ -124,6 +124,7 @@ Report to the user:
 | "build the local library index" | `annas-fetch library scan --path ~/Downloads/annas` — walks recursively, hashes books, writes `~/.cache/annas-fetch/library.jsonl`. Add `--full` to rebuild from scratch. |
 | "do I already have <md5>?" | `annas-fetch library check <md5>` — exits 0 + path on hit, 1 on miss, 3 if no index |
 | "are the AA mirrors alive?" / "is .gl down?" | `annas-fetch mirror-check` — probes every entry in `MIRRORS` with HEAD, prints latency table. Add `--json` for structured output. |
+| "AA rotated all domains, find the new one" / "refresh the mirror list" | `annas-fetch mirror-discover [--force] [--json]` — scrapes the Anna's Archive Wikipedia page (then the blog) for current domains, writes `~/.cache/annas-fetch/mirrors.json`. Lazy auto-fallback already fires inside search/fetch on full failure; this verb is for explicit/manual refresh. |
 | "AA's HTML changed, parser tests broken" | `annas-fetch dev refresh-fixtures` — re-pulls `tests/fixtures/*.html`. Run unit tests after to surface drift. Dev-only. |
 
 None of these require the member key (only `fetch` does).
@@ -131,7 +132,7 @@ None of these require the member key (only `fetch` does).
 ### Step 6: Failure handling
 
 - **`download_url` missing across all 9 attempts** → API may have shifted field names. Inspect one raw response: `op run --env-file /tmp/annas.env -- python3 -c "import sys; sys.path.insert(0, '/Users/tieubao/workspace/tieubao/ops-toolkit/tools/annas-fetch'); from annas_fetch import fast_download_url; import json; print(json.dumps(fast_download_url('<md5>'), indent=2))"`. Patch the field name in the CLI.
-- **All mirrors fail at transport** → run `annas-fetch mirror-check` first to confirm which mirrors are alive. If `.gl` rotated out, update `DEFAULT_BASE` in `annas_fetch.py` or set `ANNAS_BASE_URL`.
+- **All mirrors fail at transport** → the CLI now auto-recovers by scraping Wikipedia (then Anna's blog) for the current domain list. If even that fails (rare: Wikipedia and the blog would both have to be unreachable), run `annas-fetch mirror-check` to confirm baseline reachability and `annas-fetch mirror-discover --force --json` to inspect the discovery payload. Last-resort overrides are still `DEFAULT_BASE` or `ANNAS_BASE_URL`.
 - **Quota exceeded** → tell the user; do not retry.
 - **Captcha / Cloudflare challenge in response** → member API should bypass these; if hit, the key may be invalid or expired. Verify the 1Password ref.
 - **Search results look like garbage / empty** → AA HTML markup may have shifted. Run `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` — if parser tests still pass against fixtures but live results are broken, refresh fixtures with `annas-fetch dev refresh-fixtures` and re-run tests to surface what changed.
@@ -147,7 +148,7 @@ None of these require the member key (only `fetch` does).
 
 - Tool source: `~/workspace/tieubao/ops-toolkit/tools/annas-fetch/`
 - Spec: `ops-toolkit/tools/annas-fetch/SPEC.md`
-- Follow-up specs: `ops-toolkit/tools/annas-fetch/specs/` (filters, hybrid browse, library dedup, quota tracker, mirror-check, fixture refresh; spec 05 parallel-probe is deferred)
-- Tests: `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` (no network, 91 cases) plus `RUN_LIVE_SMOKE=1 python3 tests/smoke_live.py` (5 cases, network, no fetch)
+- Follow-up specs: `ops-toolkit/tools/annas-fetch/specs/` (filters, hybrid browse, library dedup, quota tracker, mirror-check, fixture refresh, mirror auto-discovery; spec 05 parallel-probe is deferred)
+- Tests: `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` (no network, 119 cases) plus `RUN_LIVE_SMOKE=1 python3 tests/smoke_live.py` (6 cases, network, no fetch)
 - API: `GET /dyn/api/fast_download.json?md5=<hash>&key=<member_key>&path_index=<0..2>&domain_index=<0..2>` → JSON `{download_url, account_fast_download_info, ...}` or `{error, ...}`
-- Cache: `~/.cache/annas-fetch/` holds `quota.jsonl` (per-fetch row) and `library.jsonl` (built by `library scan`).
+- Cache: `~/.cache/annas-fetch/` holds `quota.jsonl` (per-fetch row), `library.jsonl` (built by `library scan`), and `mirrors.json` (auto-discovered mirror list, 30-day TTL).
diff --git a/home/dot_claude/skills/local-ocr/SKILL.md b/home/dot_claude/skills/local-ocr/SKILL.md
index 0b80854..94be217 100644
--- a/home/dot_claude/skills/local-ocr/SKILL.md
+++ b/home/dot_claude/skills/local-ocr/SKILL.md
@@ -36,6 +36,7 @@ For **PDF in vault** (`_vault/legal/...`) the user references by path: that's fi
 
 - CLI: `~/workspace/tieubao/ops-toolkit/tools/local-ocr/local_ocr.py`
 - Endpoint: `http://100.118.23.42:11434` (the Mini's personal Tailnet identity; do NOT use `mac-mini-danang:11434` — that's the work-tagged identity and Ollama is bound to the personal one only)
+- Override with the `OLLAMA_HOST_URL` env var when you're off-Tailnet or testing against a different box. Scheme is required (e.g. `http://192.0.2.1:11434`).
 - Health check: `local-ocr health`
 
 If `local-ocr health` fails, surface the error to the user. Do not retry blindly. Common causes: Mini offline, Tailscale down, agent unloaded (`launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/mini.ollama.plist`), models pruned (`ollama list` on Mini).
@@ -98,9 +99,10 @@ The CLI emits `doc_class:` as the first field of the YAML. Route based on it. Ea
 | `imaging` | `health/<person>.md` → `## Imaging / procedures` (one line: date, modality, body part, plain-language finding, vault path) | `_vault/legal/notes.md` (accession_number, full radiology narrative) | `_vault/legal/imaging-<person>-<modality>-<study-date>.pdf` |
 | `discharge` | `health/<person>.md` → `## Encounters` (new dated block) | `_vault/legal/notes.md` (patient_id, full discharge narrative, meds at discharge) | `_vault/legal/discharge-<person>-<discharge-date>.pdf` |
 | `vaccination` | `health/<person>.md` → `## Vaccinations` (flip Status: due → done for each matching dose; add row if no scheduled match exists) | `_vault/legal/notes.md` (patient_id if present) | `_vault/legal/vax-<person>-<latest-dose-date>.pdf` |
-| `portrait` | (no text fields to land) | (no notes; file itself is the artifact) | `_vault/legal/portrait-<person>-<date>.jpg` |
 | `other` | (manual review only; do not auto-write) | (manual review only) | `_vault/legal/<original-name>` (preserve original filename) |
 
+**Note on `portrait`**: the `structured` CLI does NOT emit `doc_class: portrait` — it's not in the `--doc-class` choices and has no schema. Portraits are recognized only by `triage` (see Other subcommands below), which routes them to `_vault/legal/portrait-<person>-<date>.<ext>` without any field extraction. If a user drops a portrait into Mode B, the CLI will classify it as `other` and you'll fall through to the manual-review row.
+
 ### When `doc_class: other` is returned
 
 Switch to Mode A behavior automatically: show the OCR text + the model's free-form `notes`, archive the PDF to vault, and ask the user where to land the content. No auto-writes.
@@ -185,6 +187,19 @@ For DeepSeek with bounding boxes + region tags (visual audit trail):
 
 Useful when DeepSeek's output looks suspicious or the user wants a second opinion before committing field values.
 
+### Other subcommands (housekeeping)
+
+These verbs sit alongside `ocr` / `structured` / `compare` / `absorb` / `rollback`. Use them when the user's intent matches.
+
+| User intent | Subcommand | Notes |
+|---|---|---|
+| "what's in this folder?" / "triage the inbox" / "what doc class is this?" | `local-ocr triage <dir-or-file> [--format json]` | Filename + extension heuristics only. No content reads, so it's safe to run against an `_inbox/` you haven't reviewed yet. Returns one row per file with a sensitivity score and a suggested doc-class hint (the hint set is broader than `structured`'s `--doc-class` choices: it recognizes `portrait`, `bao-hiem-y-te`, `vaccine`, etc. as routing aliases). |
+| "double-check this YAML for PII leaks" / "re-verify the extraction" | `local-ocr verify <yaml-file>` (or `-` for stdin) | Runs Presidio over the YAML and flags `safe_for_git` fields that look like PII. Same threshold as Mode B's blocking gate (0.7). Useful if a user hand-edited an extraction and you want a sanity pass before they commit. |
+| "show me past blind-absorb runs" / "what did I absorb last week?" | `local-ocr audit list` | Lists every run in `_vault/audit/blind-absorb/`, newest first, redacted. |
+| "show me run X's manifest" / "what files did blind-absorb touch on RUN-Y?" | `local-ocr audit show <run-id> [--format json]` | Prints the manifest for one run. Redacted by default. **Never invoke with `--full`** — the CLI's own help string warns "DANGER: reads .values.json and prints extracted field values. Do NOT use from an LLM session; only in your own terminal." That's the same load-bearing invariant as Mode D's no-read rule. |
+
+None of these need the member-tier model or the 1Password key; they're local-only metadata/filename operations.
+
 ## Output format and human review
 
 The CLI emits text by default; pass `--format json` for piping. For the structured pipeline, the YAML output looks like: