11 changes: 6 additions & 5 deletions home/dot_claude/skills/annas-fetch/SKILL.md
@@ -23,7 +23,7 @@ Search, browse, and download books from Anna's Archive via the **member fast-dow
## Hard rules

1. **Never hardcode the secret key.** Always resolve from 1Password at runtime. The skill uses `op run --env-file <tempfile>` (the `secret-guard` hook blocks raw `op read` even in nested contexts). The Python CLI reads `ANNAS_SECRET_KEY` from env; the skill is responsible for setting it for the subprocess only.
2. **Default base URL is `https://annas-archive.gl`** (current primary as of 2026-05-10). The CLI has mirror fallback; if the user reports the primary has rotated, update `DEFAULT_BASE` in `annas_fetch.py` or pass `ANNAS_BASE_URL`.
2. **Default base URL is `https://annas-archive.gl`** (current primary as of 2026-05-10). The CLI iterates `MIRRORS` on transport failure, then auto-discovers replacements via Wikipedia (then Anna's blog) when every mirror in the effective list dies in one invocation. Discovery result is cached at `~/.cache/annas-fetch/mirrors.json` (30-day TTL). Manual edits to `DEFAULT_BASE` / `ANNAS_BASE_URL` are still supported but rarely needed; for a forced refresh run `annas-fetch mirror-discover --force`. See Spec 08.
3. **Default output is `~/Downloads/annas/`.** Never write into a consumer repo. Books are not repo content.
4. **Confirm before downloading >3 files in one session.** Member tier daily quota is ~75 fast-downloads; bulk runs burn it fast.
5. **Surface the quota line.** When the CLI prints `# quota: ...` to stderr, relay `downloads_left / downloads_per_day` to the user.
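
The 30-day TTL in rule 2 can be pictured as a simple freshness check. This is a minimal sketch, not the CLI's actual code: `cache_is_fresh` is a hypothetical helper, and it assumes freshness is judged from the file's modification time, whereas the real CLI may store a timestamp inside `mirrors.json` instead.

```python
import time
from pathlib import Path
from typing import Optional

# Path and TTL come from rule 2; everything else here is illustrative.
CACHE_PATH = Path.home() / ".cache" / "annas-fetch" / "mirrors.json"
TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


def cache_is_fresh(path: Path = CACHE_PATH, now: Optional[float] = None) -> bool:
    """Return True when the cache file exists and is younger than the TTL.

    Assumption: age is measured from the file's mtime.
    """
    if not path.is_file():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < TTL_SECONDS
```

When a check like this returns `False`, `annas-fetch mirror-discover --force` is the documented way to refresh the list by hand.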
@@ -124,14 +124,15 @@ Report to the user:
| "build the local library index" | `annas-fetch library scan --path ~/Downloads/annas` — walks recursively, hashes books, writes `~/.cache/annas-fetch/library.jsonl`. Add `--full` to rebuild from scratch. |
| "do I already have <md5>?" | `annas-fetch library check <md5>` — exits 0 + path on hit, 1 on miss, 3 if no index |
| "are the AA mirrors alive?" / "is .gl down?" | `annas-fetch mirror-check` — probes every entry in `MIRRORS` with HEAD, prints latency table. Add `--json` for structured output. |
| "AA rotated all domains, find the new one" / "refresh the mirror list" | `annas-fetch mirror-discover [--force] [--json]` — scrapes the Anna's Archive Wikipedia page (then the blog) for current domains, writes `~/.cache/annas-fetch/mirrors.json`. Lazy auto-fallback already fires inside search/fetch on full failure; this verb is for explicit/manual refresh. |
| "AA's HTML changed, parser tests broken" | `annas-fetch dev refresh-fixtures` — re-pulls `tests/fixtures/*.html`. Run unit tests after to surface drift. Dev-only. |

None of these require the member key (only `fetch` does).
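
The exit codes listed for `library check` (0 on hit, 1 on miss, 3 if no index) lend themselves to a small dispatcher. The sketch below is hypothetical: `interpret_library_check` is not part of the CLI, and the wording of the messages is an assumption. It takes the return code and stdout of an already-finished process, so it can be tried without the tool installed.

```python
def interpret_library_check(returncode: int, stdout: str) -> str:
    """Map `annas-fetch library check <md5>` exit codes to a user message."""
    if returncode == 0:
        # On a hit the CLI prints the path of the existing file.
        return "already in library: " + stdout.strip()
    if returncode == 1:
        return "not in library"
    if returncode == 3:
        return "no index yet; run `annas-fetch library scan` first"
    # Any other code is outside the documented contract.
    return "unexpected exit code %d" % returncode
```

In practice the two arguments would come from something like `subprocess.run([...], capture_output=True, text=True)`.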

### Step 6: Failure handling

- **`download_url` missing across all 9 attempts** → API may have shifted field names. Inspect one raw response: `op run --env-file /tmp/annas.env -- python3 -c "import sys; sys.path.insert(0, '/Users/tieubao/workspace/tieubao/ops-toolkit/tools/annas-fetch'); from annas_fetch import fast_download_url; import json; print(json.dumps(fast_download_url('<md5>'), indent=2))"`. Patch the field name in the CLI.
- **All mirrors fail at transport** → run `annas-fetch mirror-check` first to confirm which mirrors are alive. If `.gl` rotated out, update `DEFAULT_BASE` in `annas_fetch.py` or set `ANNAS_BASE_URL`.
- **All mirrors fail at transport** → the CLI now auto-recovers by scraping Wikipedia (then Anna's blog) for the current domain list. If even that fails (rare: Wikipedia and the blog would both have to be unreachable), run `annas-fetch mirror-check` to confirm baseline reachability and `annas-fetch mirror-discover --force --json` to inspect the discovery payload. Last-resort overrides are still `DEFAULT_BASE` or `ANNAS_BASE_URL`.
- **Quota exceeded** → tell the user; do not retry.
- **Captcha / Cloudflare challenge in response** → member API should bypass these; if hit, the key may be invalid or expired. Verify the 1Password ref.
- **Search results look like garbage / empty** → AA HTML markup may have shifted. Run `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` — if parser tests still pass against fixtures but live results are broken, refresh fixtures with `annas-fetch dev refresh-fixtures` and re-run tests to surface what changed.
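
The recovery order described for transport failures (configured mirrors first, discovery only when all of them fail) can be sketched as below. This is an illustration under stated assumptions, not the CLI's implementation: `pick_base_url`, `probe`, and `discover` are hypothetical names, and the real tool may retry, cache, or order candidates differently.

```python
from typing import Callable, Iterable, List


def pick_base_url(
    mirrors: Iterable[str],
    probe: Callable[[str], bool],
    discover: Callable[[], List[str]],
) -> str:
    """Return the first base URL that answers, falling back to discovery."""
    tried = set()
    for base in mirrors:
        tried.add(base)
        if probe(base):
            return base
    # Every configured mirror failed in this invocation: ask discovery
    # (Wikipedia first, then the blog, per the text above).
    for base in discover():
        if base not in tried and probe(base):
            return base
    raise RuntimeError("no reachable mirror; set ANNAS_BASE_URL as a last resort")
```

Passing `probe` and `discover` as callables keeps the sketch free of network code; discovery is only invoked after the whole mirror list has failed.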
@@ -147,7 +148,7 @@ None of these require the member key (only `fetch` does).

- Tool source: `~/workspace/tieubao/ops-toolkit/tools/annas-fetch/`
- Spec: `ops-toolkit/tools/annas-fetch/SPEC.md`
- Follow-up specs: `ops-toolkit/tools/annas-fetch/specs/` (filters, hybrid browse, library dedup, quota tracker, mirror-check, fixture refresh; spec 05 parallel-probe is deferred)
- Tests: `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` (no network, 91 cases) plus `RUN_LIVE_SMOKE=1 python3 tests/smoke_live.py` (5 cases, network, no fetch)
- Follow-up specs: `ops-toolkit/tools/annas-fetch/specs/` (filters, hybrid browse, library dedup, quota tracker, mirror-check, fixture refresh, mirror auto-discovery; spec 05 parallel-probe is deferred)
- Tests: `python3 -m unittest discover ~/workspace/tieubao/ops-toolkit/tools/annas-fetch/tests -v` (no network, 119 cases) plus `RUN_LIVE_SMOKE=1 python3 tests/smoke_live.py` (6 cases, network, no fetch)
- API: `GET /dyn/api/fast_download.json?md5=<hash>&key=<member_key>&path_index=<0..2>&domain_index=<0..2>` → JSON `{download_url, account_fast_download_info, ...}` or `{error, ...}`
- Cache: `~/.cache/annas-fetch/` holds `quota.jsonl` (per-fetch row) and `library.jsonl` (built by `library scan`).
- Cache: `~/.cache/annas-fetch/` holds `quota.jsonl` (per-fetch row), `library.jsonl` (built by `library scan`), and `mirrors.json` (auto-discovered mirror list, 30-day TTL).
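
The API line above, together with the "all 9 attempts" wording in Step 6, suggests a 3 × 3 sweep over `path_index` and `domain_index`. The sketch below builds those request URLs and summarizes a response. It rests on assumptions: the helper names are hypothetical, the iteration order is a guess, and the `downloads_left` / `downloads_per_day` keys are assumed to live inside `account_fast_download_info`.

```python
from itertools import product
from typing import List, Optional, Tuple
from urllib.parse import urlencode

DEFAULT_BASE = "https://annas-archive.gl"


def fast_download_urls(md5: str, key: str, base: str = DEFAULT_BASE) -> List[str]:
    """Build one URL per (path_index, domain_index) pair, 0..2 each.

    `key` must come from the environment at runtime (never hardcoded),
    and these URLs should not be logged because they embed the key.
    """
    urls = []
    for path_index, domain_index in product(range(3), range(3)):
        query = urlencode({
            "md5": md5,
            "key": key,
            "path_index": path_index,
            "domain_index": domain_index,
        })
        urls.append(base + "/dyn/api/fast_download.json?" + query)
    return urls


def summarize_response(payload: dict) -> Tuple[str, Optional[str], Optional[int], Optional[int]]:
    """Return (status, url_or_error, downloads_left, downloads_per_day)."""
    if payload.get("download_url"):
        # Assumed location of the quota counters.
        info = payload.get("account_fast_download_info") or {}
        return ("ok", payload["download_url"],
                info.get("downloads_left"), info.get("downloads_per_day"))
    return ("error", payload.get("error"), None, None)
```

If `download_url` is absent from every response, that is the "field names may have shifted" case from Step 6 and the raw payload should be inspected.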
17 changes: 16 additions & 1 deletion home/dot_claude/skills/local-ocr/SKILL.md
@@ -36,6 +36,7 @@ For **PDF in vault** (`_vault/legal/...`) the user references by path: that's fi

- CLI: `~/workspace/tieubao/ops-toolkit/tools/local-ocr/local_ocr.py`
- Endpoint: `http://100.118.23.42:11434` (the Mini's personal Tailnet identity; do NOT use `mac-mini-danang:11434` — that's the work-tagged identity and Ollama is bound to the personal one only)
- Override with the `OLLAMA_HOST_URL` env var when you're off-Tailnet or testing against a different box. Scheme is required (e.g. `http://192.0.2.1:11434`).
- Health check: `local-ocr health`

If `local-ocr health` fails, surface the error to the user. Do not retry blindly. Common causes: Mini offline, Tailscale down, agent unloaded (`launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/mini.ollama.plist`), models pruned (`ollama list` on Mini).
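
The `OLLAMA_HOST_URL` override and its "scheme is required" rule can be sketched as follows. This is a hypothetical helper, not code from `local_ocr.py`: `resolve_endpoint` is an invented name, and rejecting the value with a `ValueError` is an assumption about how a missing scheme should be handled.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

DEFAULT_ENDPOINT = "http://100.118.23.42:11434"


def resolve_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Ollama endpoint, preferring OLLAMA_HOST_URL when set."""
    source = os.environ if env is None else env
    url = source.get("OLLAMA_HOST_URL", "").strip() or DEFAULT_ENDPOINT
    parts = urlsplit(url)
    # A bare host:port has no http/https scheme and is rejected here.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("OLLAMA_HOST_URL needs a scheme, e.g. http://192.0.2.1:11434")
    return url.rstrip("/")
```

Accepting the environment as a parameter makes the override easy to exercise without touching the real process environment.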
@@ -98,9 +99,10 @@ The CLI emits `doc_class:` as the first field of the YAML. Route based on it. Ea
| `imaging` | `health/<person>.md` → `## Imaging / procedures` (one line: date, modality, body part, plain-language finding, vault path) | `_vault/legal/notes.md` (accession_number, full radiology narrative) | `_vault/legal/imaging-<person>-<modality>-<study-date>.pdf` |
| `discharge` | `health/<person>.md` → `## Encounters` (new dated block) | `_vault/legal/notes.md` (patient_id, full discharge narrative, meds at discharge) | `_vault/legal/discharge-<person>-<discharge-date>.pdf` |
| `vaccination` | `health/<person>.md` → `## Vaccinations` (flip Status: due → done for each matching dose; add row if no scheduled match exists) | `_vault/legal/notes.md` (patient_id if present) | `_vault/legal/vax-<person>-<latest-dose-date>.pdf` |
| `portrait` | (no text fields to land) | (no notes; file itself is the artifact) | `_vault/legal/portrait-<person>-<date>.jpg` |
| `other` | (manual review only; do not auto-write) | (manual review only) | `_vault/legal/<original-name>` (preserve original filename) |

**Note on `portrait`**: the `structured` CLI does NOT emit `doc_class: portrait` — it's not in the `--doc-class` choices and has no schema. Portraits are recognized only by `triage` (see Other subcommands below), which routes them to `_vault/legal/portrait-<person>-<date>.<ext>` without any field extraction. If a user drops a portrait into Mode B, the CLI will classify it as `other` and you'll fall through to the manual-review row.
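
The archive names in the routing table follow a regular pattern, which the sketch below spells out for the rows visible in this hunk. It is illustrative only: `vault_destination` is a hypothetical helper, dates are assumed to arrive as preformatted strings, and doc classes defined outside this hunk are deliberately not guessed at.

```python
from typing import Optional

VAULT_DIR = "_vault/legal"


def vault_destination(doc_class: str, person: str = "", date: str = "",
                      modality: Optional[str] = None,
                      original_name: Optional[str] = None) -> str:
    """Return the vault path for one of the doc classes shown above."""
    if doc_class == "imaging":
        if not modality:
            raise ValueError("imaging needs a modality")
        return "%s/imaging-%s-%s-%s.pdf" % (VAULT_DIR, person, modality, date)
    if doc_class == "discharge":
        return "%s/discharge-%s-%s.pdf" % (VAULT_DIR, person, date)
    if doc_class == "vaccination":
        # `date` is the latest dose date for this class.
        return "%s/vax-%s-%s.pdf" % (VAULT_DIR, person, date)
    if doc_class == "other":
        if not original_name:
            raise ValueError("other keeps the original filename")
        return "%s/%s" % (VAULT_DIR, original_name)
    # `structured` never emits `portrait`; anything unlisted needs review.
    raise ValueError("no automatic destination for doc_class %r" % doc_class)
```

A `portrait` argument raises here on purpose, matching the note that only `triage` recognises portraits.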

### When `doc_class: other` is returned

Switch to Mode A behavior automatically: show the OCR text + the model's free-form `notes`, archive the PDF to vault, and ask the user where to land the content. No auto-writes.
@@ -185,6 +187,19 @@ For DeepSeek with bounding boxes + region tags (visual audit trail):

Useful when DeepSeek's output looks suspicious or the user wants a second opinion before committing field values.

### Other subcommands (housekeeping)

These verbs sit alongside `ocr` / `structured` / `compare` / `absorb` / `rollback`. Use them when the user's intent matches.

| User intent | Subcommand | Notes |
|---|---|---|
| "what's in this folder?" / "triage the inbox" / "what doc class is this?" | `local-ocr triage <dir-or-file> [--format json]` | Filename + extension heuristics only. No content reads, so it's safe to run against an `_inbox/` you haven't reviewed yet. Returns one row per file with sensitivity score + suggested doc-class hint (the hint set is broader than `structured`'s `--doc-class` choices: it recognises `portrait`, `bao-hiem-y-te`, `vaccine`, etc. as routing aliases). |
| "double-check this YAML for PII leaks" / "re-verify the extraction" | `local-ocr verify <yaml-file>` (or `-` for stdin) | Runs Presidio over the YAML and flags `safe_for_git` fields that look like PII. Same threshold as Mode B's blocking gate (0.7). Useful if a user hand-edited an extraction and you want a sanity pass before they commit. |
| "show me past blind-absorb runs" / "what did I absorb last week?" | `local-ocr audit list` | Lists every run in `_vault/audit/blind-absorb/`, newest first, redacted. |
| "show me run X's manifest" / "what files did blind-absorb touch on RUN-Y?" | `local-ocr audit show <run-id> [--format json]` | Prints the manifest for one run. Redacted by default. **Never invoke with `--full`** — the CLI's own help string warns "DANGER: reads .values.json and prints extracted field values. Do NOT use from an LLM session; only in your own terminal." That's the same load-bearing invariant as Mode D's no-read rule. |

None of these need the member-tier model or the 1Password key; they're local-only metadata/filename operations.
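
One way to make the "never invoke with `--full`" rule for `audit show` hard to break is to build the argument list in a single place and refuse the flag outright. The sketch below is a hypothetical wrapper, not part of `local-ocr`; both function names are invented, and raising `PermissionError` is just one possible way to signal the refusal.

```python
from typing import List


def assert_llm_safe(argv: List[str]) -> List[str]:
    """Reject any argv carrying --full, which prints extracted field values."""
    if "--full" in argv:
        raise PermissionError("--full must never be used from an LLM session")
    return argv


def audit_show_argv(run_id: str, as_json: bool = False) -> List[str]:
    """Build the argv for `local-ocr audit show <run-id> [--format json]`."""
    argv = ["local-ocr", "audit", "show", run_id]
    if as_json:
        argv += ["--format", "json"]
    # The guard also catches a run id that is itself the forbidden flag.
    return assert_llm_safe(argv)
```

The returned list is meant to be handed to `subprocess.run` unchanged, so the redacted default is the only output path.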

## Output format and human review

The CLI emits text by default; pass `--format json` for piping. For the structured pipeline, the YAML output looks like: