Description
Problem
PR #538 changes the filenames and adds new artifacts that download_calibration_inputs expects from HuggingFace, but the HF repo still has the old artifacts. Modal workers download from HF, fail to find the expected files, and crash.
What #538 expects vs what's on HF
| File expected by #538 | On HF? | Notes |
|---|---|---|
| `calibration/source_imputed_stratified_extended_cps.h5` | No | Only `stratified_extended_cps.h5` exists (old name) |
| `calibration/calibration_weights.npy` | No | Only `w_district_calibration.npy` exists (old name) |
| `calibration/stacked_blocks.npy` | No | New artifact from unified calibration |
| `calibration/geo_labels.json` | No | New artifact from unified calibration |
| `calibration/stacked_takeup.npz` | No | New artifact from `compute_stacked_takeup()` |
The first two are required downloads — the pipeline cannot proceed without them. The last three are optional but needed for correct takeup values and geography assignment.
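A minimal sketch of a pre-flight check for the table above, assuming the file listing has already been fetched (for example with `huggingface_hub.list_repo_files`; the repo id is not given here, so the listing is passed in as a plain list). The names `REQUIRED`, `OPTIONAL`, and `check_artifacts` are hypothetical and not part of #538.

```python
# Hypothetical pre-flight check: which expected artifacts are absent from a listing?
# File names are taken from the table; the helper itself is illustrative only.

REQUIRED = (
    "calibration/source_imputed_stratified_extended_cps.h5",
    "calibration/calibration_weights.npy",
)
OPTIONAL = (
    "calibration/stacked_blocks.npy",
    "calibration/geo_labels.json",
    "calibration/stacked_takeup.npz",
)


def check_artifacts(available):
    """Return the required and optional files missing from `available`."""
    present = set(available)
    return {
        "missing_required": [f for f in REQUIRED if f not in present],
        "missing_optional": [f for f in OPTIONAL if f not in present],
    }


# Listing that mirrors the current state described above: only the old names exist.
# (The exact paths of the old files on HF are an assumption.)
on_hf = ["stratified_extended_cps.h5", "w_district_calibration.npy"]
report = check_artifacts(on_hf)

if report["missing_required"]:
    print("Cannot proceed, missing required:", report["missing_required"])
if report["missing_optional"]:
    print("Takeup/geography may be wrong, missing optional:", report["missing_optional"])
```

Running a check like this before dispatching Modal workers would turn the crash into an early, readable error.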
Root cause
#538 changes the contract between the calibration step and the Modal H5-building pipeline (new filenames, new artifacts), but the HF repo was never updated to match. This is a chicken-and-egg problem: the new artifacts are produced by the updated calibration pipeline, which needs to run first and upload results to HF before Modal can consume them.
Additional concerns
- Modal image caching. Modal aggressively caches container images. If the image was built from an older branch, workers run old code expecting old filenames, creating mismatches in either direction. The `branch` parameter does `git checkout` + `uv sync` inside the container, but stale dependencies or cached volume data can cause silent failures.
- No artifact versioning. Weights, blocks, takeup, and dataset are all downloaded independently with no guarantee they came from the same calibration run. Updating one but not the others produces silent inconsistency.
- The `skip_download` workaround. #538 added a `--skip-download` flag that lets you pre-push inputs to the Modal volume, but this requires someone to manually produce and upload the artifacts first, which is the local-only workflow described in #538's comments.
Suggested fix
- Run the full calibration pipeline locally (or on a single Modal worker) to produce the new artifacts
- Upload them to HF under the new names (`source_imputed_stratified_extended_cps.h5`, `calibration_weights.npy`, `stacked_blocks.npy`, `geo_labels.json`, `stacked_takeup.npz`)
- Consider keeping the old filenames as symlinks/copies during a transition period, or pinning artifact sets with a version/manifest so stale combinations are detected rather than silently used
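One way the manifest idea could look, as a rough sketch rather than a proposal for the actual implementation: the calibration run writes a manifest holding a run id and a SHA-256 hash per artifact, and consumers verify their downloaded files against it before use. The manifest layout and the helpers `build_manifest` and `verify_manifest` are assumptions for illustration.

```python
# Hypothetical manifest sketch: tie all artifacts to one calibration run via hashes.
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large .h5 files are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(directory, filenames, run_id):
    """Record one hash per artifact, all labelled with the same run id."""
    directory = Path(directory)
    return {
        "run_id": run_id,
        "files": {name: sha256_of(directory / name) for name in filenames},
    }


def verify_manifest(directory, manifest):
    """Return a list of problems; an empty list means the set is consistent."""
    directory = Path(directory)
    problems = []
    for name, expected in manifest["files"].items():
        path = directory / name
        if not path.exists():
            problems.append(f"missing: {name}")
        elif sha256_of(path) != expected:
            problems.append(f"hash mismatch: {name}")
    return problems
```

The manifest would be uploaded alongside the artifacts (for example serialized with `json.dump`), and a worker that finds any problem would fail loudly instead of mixing files from different runs.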