feat(filters): add ignore_library_caches for well-known per-library caches #133

Open
christophergeyer wants to merge 3 commits into main from feat/ignore-library-caches

Conversation

@christophergeyer
Member

What

A new default-true filter filters.ignore_library_caches that drops reads and writes under a curated set of per-library user-cache subdirectories.

Motivation

The original end-to-end nanochat run (DAG ef174b91...) registered 604 artifact relations on 8 jobs, of which ~99.7% were transient HF / library cache files: chunk-cache binaries, dataset_info.json metadata, and blob storage. They clutter the DAG and bury the real model checkpoint.

We considered ignore_hf_cache as a single-library filter but rejected it as too specialized — there are many libraries with the same per-library-cache pattern (HF, pip, uv, poetry, wandb, mlflow, pre-commit, …). A curated list is the right abstraction; HF is just one entry.

What's included in LIBRARY_CACHE_PATTERNS

/.cache/{huggingface, pip, uv, poetry, pypoetry, conda, wandb, mlflow, pre-commit, black, mypy, ruff, pylint, yarn, npm, Cypress, torch, triton, jupyter, matplotlib}/

Matched as a substring so any home-dir variant (/home/<user>/.cache/<lib>/, /root/.cache/<lib>/, etc.) is caught without enumerating every layout.
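To make the substring rule concrete, here is a minimal sketch. The tuple below is rebuilt from the list in this description and the helper name `is_library_cache` is a stand-in; the real constant and `_is_library_cache` live in `FileFilterService` and may differ in detail.

```python
# Hypothetical reconstruction of the curated list from the PR description;
# the real constant is FileFilterService.LIBRARY_CACHE_PATTERNS.
LIBRARY_CACHE_NAMES = (
    "huggingface", "pip", "uv", "poetry", "pypoetry", "conda", "wandb",
    "mlflow", "pre-commit", "black", "mypy", "ruff", "pylint", "yarn",
    "npm", "Cypress", "torch", "triton", "jupyter", "matplotlib",
)
LIBRARY_CACHE_PATTERNS = tuple(f"/.cache/{name}/" for name in LIBRARY_CACHE_NAMES)


def is_library_cache(path: str) -> bool:
    """Return True if any curated cache pattern occurs anywhere in the path."""
    return any(pattern in path for pattern in LIBRARY_CACHE_PATTERNS)
```

Because each pattern carries both a leading and a trailing slash, a sibling directory such as `~/.cache/pipx/` would not be mistaken for `~/.cache/pip/`, and the home-directory prefix does not matter.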

What stays tracked

Project-specific cache subdirs not on the curated list — e.g. ~/.cache/nanochat/ where nanochat writes its tokenizer + base/SFT checkpoints. The test_filter_files_keeps_project_specific_cache_under_dot_cache test locks that property in.

Pipeline placement

Same point in the pipeline as the existing ignore_torch_cache / ignore_system_reads checks — reconstitution-time, before artifact-list construction. The tracer still captures every syscall; this filter only affects what becomes a registered artifact. Already-registered cache artifacts on existing glaas DAGs stay until those runs are re-registered.
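A rough sketch of what filtering at reconstitution time means in practice. The function name `build_artifact_list`, the abbreviated pattern tuple, and the example paths are all illustrative assumptions, not the actual roar code: the point is only that the traced paths are left intact and the flag decides what ends up in the artifact list.

```python
from typing import Iterable

# Abbreviated, hypothetical subset of the curated patterns.
PATTERNS = ("/.cache/huggingface/", "/.cache/pip/", "/.cache/uv/")


def build_artifact_list(
    traced_paths: Iterable[str], ignore_library_caches: bool = True
) -> list[str]:
    """Filter traced paths into the artifact list; the trace itself is not modified."""
    artifacts = []
    for path in traced_paths:
        if ignore_library_caches and any(p in path for p in PATTERNS):
            continue  # dropped from artifacts only; still present in the raw trace
        artifacts.append(path)
    return artifacts


# Hypothetical trace: one library-cache file and one project output.
trace = [
    "/root/.cache/huggingface/datasets/dataset_info.json",
    "/root/.cache/nanochat/base_checkpoint.pt",
]
filtered = build_artifact_list(trace)
unfiltered = build_artifact_list(trace, ignore_library_caches=False)
```

Here `filtered` contains only the nanochat checkpoint, `unfiltered` contains both entries, and `trace` is unchanged in either case.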

Tests

  • 4 new tests in tests/unit/test_file_filter.py:
    • HF cache (dataset_info.json + hub blob + xet chunk) dropped when enabled.
    • HF cache kept when filter explicitly disabled (override).
    • Project-specific ~/.cache/nanochat/ paths preserved.
    • Other curated entries (pip/uv/wandb/mlflow/poetry) dropped.
  • All 12 file-filter tests pass; the broader filter/provenance/config suite passes apart from one pre-existing telemetry failure on main that is unrelated to this change.
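
The four cases listed above could look roughly like this. The `filter_files` function here is a simplified stand-in for the real filter, the test names are shortened, and every path is made up for illustration; the actual tests in tests/unit/test_file_filter.py exercise `FileFilterService` instead.

```python
# Stand-in for the real filter; hypothetical and abbreviated.
CURATED = ("huggingface", "pip", "uv", "poetry", "wandb", "mlflow")


def filter_files(paths, ignore_library_caches=True):
    def is_cache(path):
        return any(f"/.cache/{name}/" in path for name in CURATED)

    return [p for p in paths if not (ignore_library_caches and is_cache(p))]


# Illustrative HF cache paths: metadata, hub blob, xet chunk.
HF_PATHS = [
    "/home/u/.cache/huggingface/datasets/x/dataset_info.json",
    "/home/u/.cache/huggingface/hub/models--m/blobs/abc",
    "/home/u/.cache/huggingface/xet/chunk-cache/0001",
]


def test_hf_cache_dropped_when_enabled():
    assert filter_files(HF_PATHS) == []


def test_hf_cache_kept_when_disabled():
    assert filter_files(HF_PATHS, ignore_library_caches=False) == HF_PATHS


def test_project_specific_cache_preserved():
    paths = ["/home/u/.cache/nanochat/tokenizer.pkl"]
    assert filter_files(paths) == paths


def test_other_curated_entries_dropped():
    libs = ("pip", "uv", "wandb", "mlflow", "poetry")
    paths = [f"/home/u/.cache/{lib}/entry" for lib in libs]
    assert filter_files(paths) == []
```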

Touches

  • roar/execution/provenance/file_filter.py: LIBRARY_CACHE_PATTERNS, _is_library_cache, category, read+write wiring.
  • roar/core/models/provenance.py: add library_caches to FilterCounts + total.
  • roar/integrations/config/schema.py: add ignore_library_caches: bool = True.
  • roar/integrations/config/access.py: config-key descriptor.
  • roar/backends/ray/collector.py: thread the flag through ray's parallel filter call site.
  • roar/cli/commands/init.py: default .roarconfig template gets the new key.
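
For orientation, the model and schema changes might have roughly the following shape. This is a sketch under assumptions: only `library_caches`, `total`, and `ignore_library_caches: bool = True` come from this PR; the class name `FiltersConfig`, the other field names, and their defaults are guesses made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FilterCounts:
    # Only library_caches is confirmed by this PR; the other counters are assumed.
    system_reads: int = 0
    torch_cache: int = 0
    library_caches: int = 0

    @property
    def total(self) -> int:
        """Sum of all per-category drop counts, including library_caches."""
        return self.system_reads + self.torch_cache + self.library_caches


@dataclass
class FiltersConfig:
    # Hypothetical schema class; the defaults of the two existing flags are assumed.
    ignore_system_reads: bool = True
    ignore_torch_cache: bool = True
    ignore_library_caches: bool = True  # new key, default-true per this PR


counts = FilterCounts(torch_cache=1, library_caches=3)
```

With these values `counts.total` is 4, which shows why the new counter has to be added to the total as well as to the dataclass fields.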

Net: +190 / -2 lines.

🤖 Generated with Claude Code

chrisgeyertreqs and others added 3 commits May 29, 2026 18:08
…aches

Adds a new default-true filter `filters.ignore_library_caches` that drops
reads and writes under a curated set of per-library user-cache subdirectories
— ~/.cache/huggingface/, ~/.cache/pip/, ~/.cache/uv/, ~/.cache/poetry/,
~/.cache/wandb/, ~/.cache/mlflow/, ~/.cache/pre-commit/, ~/.cache/torch/,
~/.cache/triton/, etc. Same point in the pipeline as the existing
ignore_torch_cache / ignore_system_reads checks (reconstitution-time, not
trace-time), so adding the filter only affects future runs' artifact lists
— already-registered artifacts in glaas stay until re-registration.

Curated list approach rather than a single-library carve-out (we considered
ignore_hf_cache but it's too specialized). The full list of per-library
prefixes lives in FileFilterService.LIBRARY_CACHE_PATTERNS. Matched as a
substring so any home-dir variant (`/home/<user>/.cache/<lib>/`,
`/root/.cache/<lib>/`, etc.) is caught without enumerating each.

Project-specific cache subdirs that aren't on the curated list stay
tracked — e.g. nanochat writes outputs under ~/.cache/nanochat/ and those
must NOT be filtered as transient cache. The new
`test_filter_files_keeps_project_specific_cache_under_dot_cache` test
locks that property in.

Touches:
- roar/execution/provenance/file_filter.py: add LIBRARY_CACHE_PATTERNS,
  _is_library_cache, library_caches filter category, wire into reads + writes.
- roar/core/models/provenance.py: add library_caches to FilterCounts + total.
- roar/integrations/config/schema.py: add ignore_library_caches: bool=True.
- roar/integrations/config/access.py: add config-key descriptor.
- roar/backends/ray/collector.py: thread the flag through ray's parallel
  filter call site that mirrors file_filter.py's read/write checks.
- roar/cli/commands/init.py: add to the default .roarconfig template.
- tests/unit/test_file_filter.py: 4 new tests covering HF cache drop,
  override-to-keep, project-specific subdir preservation, and other
  curated entries (pip/uv/wandb/mlflow/poetry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The library-cache list had /.cache/triton/ but Triton's actual JIT kernel
cache defaults to ~/.triton/cache/ (outside ~/.cache/). On a GPU run,
torch.compile/inductor writes compiled .so launchers there every
invocation — they leaked into a nanochat A10 lineage (5 artifacts).
Add the /.triton/cache/ substring so they're dropped too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
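
A small illustration of why the extra substring is needed. The tuple and helper below are hypothetical stand-ins and the example path is invented; the commit itself only states that `/.triton/cache/` was added alongside the existing `/.cache/triton/` entry.

```python
# Before this commit only the first pattern was present (per the commit message).
OLD_TRITON_PATTERNS = ("/.cache/triton/",)
NEW_TRITON_PATTERNS = ("/.cache/triton/", "/.triton/cache/")


def matches(path: str, patterns: tuple[str, ...]) -> bool:
    """Substring match of a path against a tuple of cache patterns."""
    return any(p in path for p in patterns)


# Invented example of a compiled launcher under Triton's default cache location.
launcher = "/home/u/.triton/cache/abc123/launcher.so"
missed_before = not matches(launcher, OLD_TRITON_PATTERNS)
caught_after = matches(launcher, NEW_TRITON_PATTERNS)
```

Both `missed_before` and `caught_after` are true: the old pattern never matched Triton's default location, so the compiled launchers leaked into the artifact list until the second substring was added.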
2 participants