feat(filters): add ignore_library_caches for well-known per-library caches by christophergeyer · Pull Request #133 · treqs/roar

christophergeyer · 2026-05-29T18:08:56Z

What

A new default-true filter filters.ignore_library_caches that drops reads and writes under a curated set of per-library user-cache subdirectories.

Motivation

The original end-to-end nanochat run (DAG ef174b91...) registered 604 artifact relations across 8 jobs, of which ~99.7% were transient HF / library cache files: chunk-cache binaries, dataset_info.json metadata, blob storage. They clutter the DAG and bury the real model checkpoint.

We considered ignore_hf_cache as a single-library filter but rejected it as too specialized — there are many libraries with the same per-library-cache pattern (HF, pip, uv, poetry, wandb, mlflow, pre-commit, …). A curated list is the right abstraction; HF is just one entry.

What's included in `LIBRARY_CACHE_PATTERNS`

/.cache/{huggingface, pip, uv, poetry, pypoetry, conda, wandb, mlflow, pre-commit, black, mypy, ruff, pylint, yarn, npm, Cypress, torch, triton, jupyter, matplotlib}/

Matched as a substring so any home-dir variant (/home/<user>/.cache/<lib>/, /root/.cache/<lib>/, etc.) is caught without enumerating every layout.
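
A minimal sketch of the substring check described above, not the code from this PR. The tuple is a hypothetical subset of the curated list, and `is_library_cache` is an illustrative stand-in for the private `_is_library_cache` helper; only the substring-matching idea is taken from the description.

```python
# Illustrative subset only; the full list lives in
# FileFilterService.LIBRARY_CACHE_PATTERNS.
LIBRARY_CACHE_PATTERNS = (
    "/.cache/huggingface/",
    "/.cache/pip/",
    "/.cache/uv/",
    "/.cache/wandb/",
)


def is_library_cache(path: str) -> bool:
    """Return True if any curated pattern occurs anywhere in `path`."""
    return any(pattern in path for pattern in LIBRARY_CACHE_PATTERNS)


# Different home-directory layouts match the same pattern.
print(is_library_cache("/home/alice/.cache/huggingface/hub/blobs/abc"))  # True
print(is_library_cache("/root/.cache/pip/http/0/1/2"))                   # True
# A project-specific cache directory is not on the list, so it stays tracked.
print(is_library_cache("/home/alice/.cache/nanochat/tokenizer.pkl"))     # False
# The trailing slash keeps /.cache/pip/ from matching a sibling like /.cache/pipx/.
print(is_library_cache("/home/alice/.cache/pipx/venvs/tool"))            # False
```

Because each pattern carries both the leading `/.cache/` and a trailing slash, the match is tied to a whole directory name even though it is a plain substring test.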

What stays tracked

Project-specific cache subdirs not on the curated list — e.g. ~/.cache/nanochat/ where nanochat writes its tokenizer + base/SFT checkpoints. The test_filter_files_keeps_project_specific_cache_under_dot_cache test locks that property in.

Pipeline placement

Same point in the pipeline as the existing ignore_torch_cache / ignore_system_reads checks — reconstitution-time, before artifact-list construction. The tracer still captures every syscall; this filter only affects what becomes a registered artifact. Already-registered cache artifacts on existing glaas DAGs stay until those runs are re-registered.
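
The placement can be pictured as a pure function applied to already-captured paths. The sketch below is an illustration under assumptions: `FilterCounts` is a simplified stand-in for the model in roar/core/models/provenance.py, and `filter_paths`, its signature, and the two-entry pattern list are hypothetical. It shows only that the traced input is left intact and that the flag decides what reaches the artifact list.

```python
from dataclasses import dataclass

# Hypothetical two-entry subset of the curated patterns.
PATTERNS = ("/.cache/huggingface/", "/.cache/pip/")


@dataclass
class FilterCounts:
    """Simplified stand-in: how many paths each filter category dropped."""
    library_caches: int = 0

    @property
    def total(self) -> int:
        return self.library_caches


def filter_paths(traced, ignore_library_caches=True):
    """Return (kept_paths, counts) without mutating the traced input."""
    counts = FilterCounts()
    kept = []
    for path in traced:
        if ignore_library_caches and any(p in path for p in PATTERNS):
            counts.library_caches += 1
            continue
        kept.append(path)
    return kept, counts


traced = [
    "/root/.cache/huggingface/datasets/x/dataset_info.json",
    "/root/.cache/nanochat/base_checkpoints/model.pt",
]
kept, counts = filter_paths(traced)
print(kept)           # only the nanochat checkpoint remains
print(counts.total)   # 1
kept_all, _ = filter_paths(traced, ignore_library_caches=False)
print(len(kept_all))  # 2
```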

Tests

4 new tests in tests/unit/test_file_filter.py:
- HF cache (dataset_info.json + hub blob + xet chunk) dropped when enabled.
- HF cache kept when filter explicitly disabled (override).
- Project-specific ~/.cache/nanochat/ paths preserved.
- Other curated entries (pip/uv/wandb/mlflow/poetry) dropped.
All 12 file-filter tests pass; the broader filter/provenance/config suite passes apart from one pre-existing telemetry failure on main that is unrelated to this change.

Touches

roar/execution/provenance/file_filter.py — LIBRARY_CACHE_PATTERNS, _is_library_cache, category, read+write wiring.
roar/core/models/provenance.py — add library_caches to FilterCounts + total.
roar/integrations/config/schema.py — add ignore_library_caches: bool = True.
roar/integrations/config/access.py — config-key descriptor.
roar/backends/ray/collector.py — thread the flag through ray's parallel filter call site.
roar/cli/commands/init.py — default .roarconfig template gets the new key.
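
As a rough picture of the schema change, the sketch below models the new key as a boolean field defaulting to `True` that a user-supplied mapping can override. The `FiltersConfig` class, the `from_mapping` helper, and the use of a plain dataclass are assumptions for illustration; the actual schema in roar/integrations/config/schema.py may be built differently.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FiltersConfig:
    """Hypothetical stand-in for the `filters` section of the config schema."""
    ignore_library_caches: bool = True

    @classmethod
    def from_mapping(cls, raw):
        """Build a config from a dict, ignoring keys this sketch does not model."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


default_cfg = FiltersConfig()
override_cfg = FiltersConfig.from_mapping({"ignore_library_caches": False})
print(default_cfg.ignore_library_caches)   # True
print(override_cfg.ignore_library_caches)  # False
```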

Net: +190 / -2 lines.

🤖 Generated with Claude Code

…aches

Adds a new default-true filter `filters.ignore_library_caches` that drops reads and writes under a curated set of per-library user-cache subdirectories: ~/.cache/huggingface/, ~/.cache/pip/, ~/.cache/uv/, ~/.cache/poetry/, ~/.cache/wandb/, ~/.cache/mlflow/, ~/.cache/pre-commit/, ~/.cache/torch/, ~/.cache/triton/, etc.

Same point in the pipeline as the existing ignore_torch_cache / ignore_system_reads checks (reconstitution-time, not trace-time), so adding the filter only affects future runs' artifact lists; already-registered artifacts in glaas stay until re-registration.

Curated list approach rather than a single-library carve-out (we considered ignore_hf_cache but it's too specialized). The full list of per-library prefixes lives in FileFilterService.LIBRARY_CACHE_PATTERNS. Matched as a substring so any home-dir variant (`/home/<user>/.cache/<lib>/`, `/root/.cache/<lib>/`, etc.) is caught without enumerating each.

Project-specific cache subdirs that aren't on the curated list stay tracked, e.g. nanochat writes outputs under ~/.cache/nanochat/ and those must NOT be filtered as transient cache. The new `test_filter_files_keeps_project_specific_cache_under_dot_cache` test locks that property in.

Touches:
- roar/execution/provenance/file_filter.py: add LIBRARY_CACHE_PATTERNS, _is_library_cache, library_caches filter category, wire into reads + writes.
- roar/core/models/provenance.py: add library_caches to FilterCounts + total.
- roar/integrations/config/schema.py: add ignore_library_caches: bool=True.
- roar/integrations/config/access.py: add config-key descriptor.
- roar/backends/ray/collector.py: thread the flag through ray's parallel filter call site that mirrors file_filter.py's read/write checks.
- roar/cli/commands/init.py: add to the default .roarconfig template.
- tests/unit/test_file_filter.py: 4 new tests covering HF cache drop, override-to-keep, project-specific subdir preservation, and other curated entries (pip/uv/wandb/mlflow/poetry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The library-cache list had /.cache/triton/, but Triton's actual JIT kernel cache defaults to ~/.triton/cache/ (outside ~/.cache/). On a GPU run, torch.compile/inductor writes compiled .so launchers there on every invocation; 5 such artifacts leaked into a nanochat A10 lineage. Add the /.triton/cache/ substring so they're dropped too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
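
The gap this commit closes is easy to reproduce with a plain substring check. In the sketch below the launcher path is made up for illustration and `matches` is a hypothetical helper; only the two pattern strings come from the commit message.

```python
def matches(path, patterns):
    """Return True if any pattern is a substring of `path`."""
    return any(p in path for p in patterns)


# Hypothetical file under Triton's default cache location, ~/.triton/cache/.
launcher = "/home/alice/.triton/cache/abc123/launcher.so"

before = ("/.cache/triton/",)
after = before + ("/.triton/cache/",)

print(matches(launcher, before))  # False: the old entry never matches this path
print(matches(launcher, after))   # True: the added substring catches it
```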

chrisgeyertreqs and others added 3 commits May 29, 2026 18:08

style: ruff format the new library-caches tests

55a8575

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

christophergeyer mentioned this pull request May 30, 2026

fix(filters): always drop .netrc/_netrc credential files from lineage #135

Open

christophergeyer wants to merge 3 commits into main from feat/ignore-library-caches

2 participants
