Add calibration package checkpointing, target config, and hyperparameter CLI#538
Minor comments, but generally LGTM — I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!
Small note: if I'm not mistaken, this PR addresses issue #534. #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?
A couple of questions on recent changes:

Commit 49a1f66 ("Remove redundant --puf-dataset flag, add national targets") removed the ability to run PUF cloning inside the calibration pipeline, with the rationale that PUF cloning already happens upstream. However, PR #516 specifically designed the pipeline so that PUF + QRF imputation runs after cloning and geography assignment, so that each clone gets geographically-informed imputations.

With the current flow, are we planning to bring back the post-cloning PUF imputation once the calibration pipeline is stabilized? Or has the approach changed?

In commit 02f8ad0, and then in commit 40fb389 (a "checkpoint"), the matrix builder was changed so that it now runs ~1,000-2,000 county-level simulations (one per unique county in the geography assignment) instead of 51 state-level simulations for some variables. Was this an intentional simplification, or a debugging shortcut that could be reverted to the two-tier approach?
@juaristi22 thank you for your thoughtful and excellent comments.
That will fit the model on Modal and drop a calibration_log.csv right on your local drive. I know one of the issues was about actually storing it in an archive, and maybe that should be out of scope given this PR's complexity.
…ter CLI

- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Modal calibration runner was missing --lambda-l0 passthrough. Also fix KeyError: Ellipsis when load_dataset() returns dicts instead of h5py datasets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upload a pre-built calibration package to Modal and run only the fitting phase, skipping HuggingFace download and matrix build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv, epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of creating a new Microsimulation per clone (~3 min each, 22 hours for 436 clones), precompute values for all 51 states on one sim object (~3 min total), then assemble per-clone values via numpy fancy indexing (~microseconds per clone). New methods: _build_state_values, _assemble_clone_values, _evaluate_constraints_from_values, _calculate_target_values_from_values. DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds. Takeup re-randomization deferred to future post-processing layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
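The precompute-then-assemble pattern this commit describes can be sketched with plain NumPy. This is an illustrative toy, not the actual `_build_state_values`/`_assemble_clone_values` code: shapes and data are made up, but the mechanism is the same — compute one value vector per state once, then build every clone's vector by fancy indexing into that table instead of running a new simulation per clone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_hh = 51, 4

# One expensive pass per state, done exactly once (~3 min total in the PR).
state_values = rng.random((n_states, n_hh))

# Each clone is assigned a state; here 3 clones mapped to states 0, 3, 50.
clone_state_idx = np.array([0, 3, 50])

# Fancy indexing assembles all per-clone values in microseconds,
# replacing one ~3-minute Microsimulation per clone.
clone_values = state_values[clone_state_idx]

assert clone_values.shape == (3, n_hh)
assert np.array_equal(clone_values[1], state_values[3])
```

The design trade-off is memory for time: the full `(n_states, n_hh)` table is held at once, but with 51 states that is far cheaper than 436 simulation objects.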
- Modal runner: add --package-volume flag to read calibration package from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments to prevent CUDA memory fragmentation during L0 backward pass
- docs/calibration.md: rewrite to lead with lightweight build-then-fit workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker commits to volume but coordinator's view is stale. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline takeup draw loops in unified_matrix_builder.py (both the parallel worker path and the sequential clone path) with calls to the shared compute_block_takeup_for_entities() from utils/takeup.py. Remove deprecated functions from takeup.py that are no longer used: draw_takeup_for_geo, compute_entity_takeup_for_geo, apply_takeup_draws_to_sim, apply_block_takeup_draws_to_sim, and _build_entity_to_hh_index. Also remove the now-unused rerandomize_takeup function from unified_calibration.py. Simplify compute_block_takeup_for_entities signature by deriving state FIPS from block GEOID prefix instead of requiring a separate entity_state_fips parameter. Update tests to exercise the remaining shared functions directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove dead sim-based methods: _evaluate_constraints_entity_aware,
_calculate_target_values, and calculate_spm_thresholds_for_cd
- Delete duplicate class methods _evaluate_constraints_from_values and
_calculate_target_values_from_values; update call sites to use the
existing standalone functions with variable_entity_map
- Fix count-vs-dollar classifier: replace substring heuristic in
_get_uprating_info with endswith("_count"); use exact equality in
validate_staging._classify_variable to prevent false positives
- Add optional precomputed_rates parameter to
apply_block_takeup_to_arrays to skip redundant load_take_up_rate calls
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
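The count-vs-dollar classifier fix in the commit above is easy to demonstrate: a substring check on "count" misclassifies any variable whose name merely contains the letters, while a suffix check does not. The variable names below are illustrative, not taken from the codebase.

```python
def is_count_substring(name: str) -> bool:
    # Old heuristic: substring match, prone to false positives.
    return "count" in name

def is_count_suffix(name: str) -> bool:
    # Fixed classifier: only names ending in "_count" are count targets.
    return name.endswith("_count")

# "county_tax" is a dollar amount, but contains "count" as a substring.
assert is_count_substring("county_tax")       # false positive
assert not is_count_suffix("county_tax")      # correctly a dollar variable
assert is_count_suffix("eitc_count")          # correctly a count variable
```

The same reasoning motivates exact equality in `validate_staging._classify_variable`: the tighter the match, the fewer silent misclassifications between count and dollar uprating.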
Source imputation (ACS/SIPP/SCF) was previously embedded inside unified_calibration.py, meaning make data stopped at the stratified CPS and promote/upload targets shipped the wrong file. This extracts it into create_source_imputed_cps.py as the final make data step, updates all dataset references to the source-imputed file, and defaults skip_source_impute=True in calibration since the step is now pre-built. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moves calibration_weights.npy from required to optional in download_calibration_inputs (it's a calibration output, not needed for matrix building). Updates the primary dataset reference from stratified_extended_cps.h5 to source_imputed_stratified_extended_cps.h5 and removes now-unnecessary fallback logic in local_area.py and publish_local_area.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…built H5s

The matrix builder computes takeup draws using per-clone block assignments, but the dataset builder recomputed them from stacked blocks (first-clone-wins), producing different X values. This adds compute_stacked_takeup(), which recomputes entity-level takeup draws after optimization using the final weights, producing a stacked_takeup.npz artifact. The dataset builder uses these pre-computed values instead of re-deriving from blocks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When using --package-path, build_matrix() is skipped so the builder's entity_hh_idx_map is unavailable. Reconstruct it from the simulation so compute_stacked_takeup() works in both the direct and package paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…et_builder

Comment out aca_ptc, eitc, and refundable_ctc calibration targets in target_config.yaml — these credits require filing but their formulas compute nonzero values for non-filers, inflating totals beyond IRS SOI targets. See PolicyEngine/policyengine-us#7748.

Refactor stacked_dataset_builder.py to delegate to build_states/build_districts/build_cities from publish_local_area.py instead of reimplementing loop/filter logic. Construct GeographyAssignment from calibration artifacts and pass to the shared build_h5 API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…regnant keep

- Comment out aca_ptc, eitc, refundable_ctc calibration targets — formulas don't gate on tax_unit_is_filer so non-filer values inflate totals beyond IRS SOI targets (see policyengine-us#7748)
- Refactor stacked_dataset_builder to delegate to build_states/build_districts/build_cities instead of reimplementing loop logic (single source of truth)
- Construct GeographyAssignment from calibration artifacts for new build_h5 API
- Remove is_pregnant from _KEEP_FORMULA_VARS (it's a pure input variable)
- Remaining unstaged pipeline improvements from prior work

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

- Remove baked-in branch from Modal image; setup_repo() now always clones fresh from GitHub at the requested branch so cached containers never run stale code (fixes #599)
- Add geography.npz to remote_calibration_runner: parse GEOGRAPHY_PATH from stdout, collect bytes from GPU container, save locally, and pass to upload_calibration_artifacts()
- Move geography.npz and unified_run_config.json from optional to required in download_calibration_inputs() so missing artifacts fail fast instead of silently passing
- Add geography.npz and run_config to skip_download validation in coordinate_publish()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds upload_from_hf_staging_to_gcs() to download from HF staging/ and upload to GCS, making HF staging the single source of truth. Also includes geo caching, batch constraint loading, via-districts validation, stacked builder slimming, and promote_local_h5s script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolved conflict in stacked_dataset_builder.py by taking Maria's load_geography() approach over the inline block-derivation method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
juaristi22 left a comment
PR #538 Review: Calibration Pipeline Improvements
Solid architectural improvements — checkpointing, declarative target config, and CLI hyperparameter tuning are well-designed. Approving with notes below for follow-up.
Critical (recommend fixing before or shortly after merge)
1. cd_geoid undefined in single-cd mode
stacked_dataset_builder.py:160 — calibrated_cds = sorted(set(cd_geoid)) will crash with NameError. This code path appears untested.
2. docs/build_h5.md documents wrong function signature
The documented build_h5 signature shows blocks, cds_to_calibrate, rerandomize_takeup parameters that don't exist in the actual function, which takes a geography: GeographyAssignment object instead. Anyone following this doc will hit immediate errors.
3. n_clones calculated from unique CDs, not actual clone count
unified_calibration.py:1292 — n_clones=len(sorted(set(geography_info["cd_geoid"]...))) uses the number of unique congressional districts (~435) rather than the actual clone count (430). These are close but semantically wrong and could silently affect H5 output.
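The point in item 3 can be made concrete with toy data: the number of unique districts appearing in a geography assignment is a different quantity from the number of clones that were built, even when the two happen to be close. The data below is illustrative only.

```python
# Toy geography assignment: 5 records spread over 3 unique districts,
# produced by a run that actually built 4 clones.
geography_info = {"cd_geoid": ["06-12", "48-07", "36-10", "36-10", "06-12"]}
clones = ["clone_0", "clone_1", "clone_2", "clone_3"]

# What the flagged code computes: unique CD count, not clone count.
n_from_unique_cds = len(sorted(set(geography_info["cd_geoid"])))
n_actual_clones = len(clones)

assert n_from_unique_cds == 3
assert n_actual_clones == 4
# The mismatch is silent — nothing raises, the H5 output is just wrong.
assert n_from_unique_cds != n_actual_clones
```

With ~435 districts and 430 clones the gap is small enough to go unnoticed in spot checks, which is exactly why deriving `n_clones` from the clone list itself is safer.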
4. validate-staging workflow never gates on failures
local_area_publish.yaml — The validation job posts a summary but has no step that exits non-zero on sanity failures. A run with 100% failures would show green CI.
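One way to add the missing gate, sketched in Python since the validation already runs a Python script: after the summary is posted, exit non-zero when any sanity check failed so the workflow run goes red. The `results` structure and field names here are assumptions, not the actual validate-staging output format.

```python
import sys

def gate_on_failures(results: list) -> int:
    """Return a process exit code: 0 if all checks passed, 1 otherwise."""
    n_failed = sum(1 for r in results if not r["passed"])
    if n_failed:
        print(f"{n_failed} sanity check(s) failed", file=sys.stderr)
        return 1
    return 0

assert gate_on_failures([{"passed": True}, {"passed": True}]) == 0
assert gate_on_failures([{"passed": True}, {"passed": False}]) == 1
```

The workflow step would end with something like `sys.exit(gate_on_failures(results))`, which is what actually turns a 100%-failure run from green to red.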
High
5. Target Config docs describe only exclude: but the active config uses include:
docs/calibration.md Target Config section exclusively documents exclude: format with examples, but target_config.yaml uses include: mode. Confusing for contributors.
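To illustrate the `include:` vs `exclude:` distinction item 5 raises, here is a minimal sketch of the two filter modes. The real `apply_target_config` lives in the calibration package and certainly differs in detail; the config shape and target names below are assumptions for illustration.

```python
def apply_target_config(targets: list, config: dict) -> list:
    """Sketch: include-mode keeps only listed targets; exclude-mode drops them."""
    if "include" in config:
        keep = set(config["include"])
        return [t for t in targets if t in keep]
    drop = set(config.get("exclude", []))
    return [t for t in targets if t not in drop]

targets = ["eitc", "snap", "ssi"]
# include: allowlist semantics — everything not listed is dropped.
assert apply_target_config(targets, {"include": ["snap"]}) == ["snap"]
# exclude: denylist semantics — everything not listed is kept.
assert apply_target_config(targets, {"exclude": ["eitc"]}) == ["snap", "ssi"]
```

Because the two modes invert the meaning of an unlisted target, documenting only `exclude:` while shipping an `include:` config is a real trap for contributors editing `target_config.yaml`.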
6. modal_app/README.md line 184 claims make calibrate-modal passes --prebuilt-matrices — it does not; only --push-results is passed.
7. builtins.print monkey-patch never restored when --log-freq is used
unified_calibration.py:615-746 — The finally block restoring builtins.print is only on the else branch.
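The fix item 7 asks for is the standard try/finally monkey-patch pattern, with the `finally` covering every branch rather than only the `else` path. A minimal self-contained sketch (the patching target and wrapper are illustrative, not the actual unified_calibration code):

```python
import builtins

def run_with_patched_print(fn):
    """Patch builtins.print for the duration of fn(), restoring it
    unconditionally — even if fn() raises."""
    original_print = builtins.print
    builtins.print = lambda *args, **kwargs: None  # e.g. redirect to a CSV log
    try:
        return fn()
    finally:
        builtins.print = original_print  # runs on success, error, or early return

result = run_with_patched_print(lambda: 42)
assert result == 42
assert builtins.print is original_print if False else callable(builtins.print)
```

Placing the restore in a branch-specific `else` means any exception (or a `--log-freq` code path that skips the branch) leaves `print` silenced for the rest of the process.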
8. Pickle deserialization without integrity checks
unified_calibration.py:457, remote_calibration_runner.py:373,940 — Consider checksum verification before loading packages from shared Modal volumes.
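A hedged sketch of the checksum idea in item 8: publish a SHA-256 digest alongside the package and verify the bytes before unpickling anything read from a shared Modal volume. Function names and the artifact layout are assumptions, not the runner's actual API.

```python
import hashlib
import os
import pickle
import tempfile

def sha256_of(path: str) -> str:
    """Stream the file in 1 MiB chunks so multi-GB packages don't load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def load_package(path: str, expected_sha256: str):
    """Refuse to unpickle unless the bytes match the published digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: {actual} != {expected_sha256}")
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    pickle.dump({"matrix": [1, 2, 3]}, f)
    path = f.name
pkg = load_package(path, sha256_of(path))
assert pkg == {"matrix": [1, 2, 3]}
os.unlink(path)
```

This does not make `pickle` safe against a malicious writer with volume access (any integrity scheme needs the digest stored out-of-band for that), but it does catch truncated uploads and stale or swapped packages before they corrupt a fit.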
Medium
- `_collect_outputs` (remote_calibration_runner.py:90) crashes with an unhelpful `TypeError` if the `OUTPUT_PATH:` marker is missing from subprocess output.
- Makefile `calibrate-both`/`stage-all-h5s` use `&` + `wait` — if the first backgrounded process fails but the second succeeds last, `wait` returns 0.
- `build-h5s` hardcodes `--n-clones 430` vs the 436 default elsewhere.
- The `promote` target's `$(eval VERSION ...)` silently becomes empty if the Python command errors.
- Workflows use Python 3.13 but the project targets 3.11.
Low / Code Quality
- ~350 lines of GPU function duplication in `remote_calibration_runner.py` (5x `fit_weights_*` + 5x `fit_from_package_*`, differing only in the `gpu` param). A factory function would eliminate this.
- `include` rules in `apply_target_config` have no dedicated test.
- No integration test for the `--package-path` or `--build-only` + `--package-output` flows.
- Tracked binary `w_district_calibration.npy` should be `git rm --cached`'d since it's now covered by the broader `.gitignore`.
Overall: great work on the pipeline consolidation. The items above are mostly documentation drift and edge-case hardening — nothing blocking.
Fixes #533
Fixes #534
Fixes #558
Fixes #559
Fixes #562
Fixes #599
Summary
- `--build-only` saves the expensive matrix build as a pickle; `--package-path` loads it for fast re-fitting with different hyperparameters or target sets
- Declarative target rules (`target_config.yaml`) replace hardcoded target filtering; the checked-in config reproduces the junkyard's 22 excluded groups
- `--beta`, `--lambda-l2`, `--learning-rate` are now tunable from the command line and Modal runner
- `docs/calibration.md` covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
- District numbering uses `XX-01` (conventional 1-based) instead of `XX-00`

Test plan
- `pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py` — CLI arg parsing tests
- `pytest policyengine_us_data/tests/test_calibration/test_target_config.py` — target config filtering + package round-trip tests
- `make calibrate-build` produces a package; `--package-path` loads it and fits

🤖 Generated with Claude Code