
Add calibration package checkpointing, target config, and hyperparameter CLI#538

Merged
baogorek merged 86 commits into main from calibration-pipeline-improvements
Mar 12, 2026

Conversation

@baogorek
Collaborator

@baogorek baogorek commented Feb 17, 2026

Fixes #533
Fixes #534
Fixes #558
Fixes #559
Fixes #562
Fixes #599

Summary

  • Calibration package checkpointing: --build-only saves the expensive matrix build as a pickle, --package-path loads it for fast re-fitting with different hyperparameters or target sets
  • Target config YAML: Declarative exclusion rules (target_config.yaml) replace hardcoded target filtering; checked-in config reproduces the junkyard's 22 excluded groups
  • Hyperparameter CLI flags: --beta, --lambda-l2, --learning-rate are now tunable from the command line and Modal runner
  • Modal runner improvements: Streaming subprocess output, support for new flags
  • Documentation: docs/calibration.md covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
  • At-large district naming fix: H5 filenames for at-large districts now use XX-01 (conventional 1-based) instead of XX-00
  • GCS staging fix: GCS uploads moved from staging phase to promotion phase, so both GCS and HuggingFace are updated together during promote

Note: This branch includes commits from #537 (PUF impute) since the calibration pipeline depends on that work. The calibration-specific changes are in the top commit.
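The build-then-fit split above can be pictured as a simple pickle round trip: one slow pass that persists the matrix build, and many fast passes that reload it with different hyperparameters. This is a minimal sketch of the pattern only — `save_package` / `load_package` are illustrative names, not the PR's actual API.

```python
import os
import pickle
import tempfile

import numpy as np

# Sketch of the --build-only / --package-path checkpoint pattern.
def save_package(path, matrix, targets, metadata):
    """Persist the expensive matrix build so later runs can skip it."""
    with open(path, "wb") as f:
        pickle.dump({"matrix": matrix, "targets": targets, "metadata": metadata}, f)

def load_package(path):
    """Reload a previously built package for fast re-fitting."""
    with open(path, "rb") as f:
        return pickle.load(f)

pkg_path = os.path.join(tempfile.gettempdir(), "calibration_package.pkl")
save_package(
    pkg_path,
    matrix=np.ones((4, 3)),             # stand-in for the real 5.2M-record matrix
    targets=np.array([1.0, 2.0, 3.0]),
    metadata={"n_clones": 436, "beta": 0.65},
)
reloaded = load_package(pkg_path)       # fast path: skips the matrix build entirely
```

Each re-fit then only pays the optimization cost, which is what makes hyperparameter sweeps over `--beta` / `--lambda-l2` practical.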

Test plan

  • pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — CLI arg parsing tests
  • pytest policyengine_us_data/tests/test_calibration/test_target_config.py — target config filtering + package round-trip tests
  • Manual: make calibrate-build produces package, --package-path loads it and fits

🤖 Generated with Claude Code

Collaborator

@juaristi22 juaristi22 left a comment


Minor comments, but generally LGTM. I was also able to run the calibration job in Modal (after removing the ellipsis in unified_calibration.py)!

Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?

@juaristi22 juaristi22 force-pushed the calibration-pipeline-improvements branch from 4c51b32 to 61523d8 on February 18, 2026 at 14:46
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch 2 times, most recently from 59b27a8 to 0a0f167 on February 19, 2026 at 23:07
@juaristi22
Copy link
Collaborator

juaristi22 commented Feb 25, 2026

A couple questions on recent changes

  1. Post-cloning PUF imputation removal — is this permanent?

Commit 49a1f66 ("Remove redundant --puf-dataset flag, add national targets") removed the ability to run PUF cloning inside the calibration pipeline, with the rationale that PUF cloning already happens upstream in extended_cps.py.

However, PR #516 specifically designed the pipeline so that PUF + QRF imputation runs after cloning and geography assignment, so that each clone gets geographically-informed imputations (with state_fips as a QRF predictor). As Max described it:

Each geographic clone gets geographically-appropriate PUF tax imputations instead of identical national-average ones duplicated everywhere. State is now a QRF predictor — California clones get California-like tax distributions.

With the current flow (extended_cps.py runs PUF once on base records → calibration pipeline clones 10x), all clones of the same household share identical PUF-imputed values, losing that variability benefit.

Are we planning to bring back the post-cloning PUF imputation once the calibration pipeline is stabilized? Or has the approach changed?

  2. All target variable precomputation moved to county level — is this worth the large increase in computation?

In commit 02f8ad0, _build_county_values was introduced to handle county-dependent variables (specifically aca_ptc, since marketplace premiums vary by county). It ran alongside the existing _build_state_values which handled everything else via 51 state-level simulations — a two-tier design gated by COUNTY_DEPENDENT_VARS = {"aca_ptc"}.

Then in commit 40fb389 (a "checkpoint"), COUNTY_DEPENDENT_VARS was removed and all target variable precomputation was moved to _build_county_values. _build_state_values was demoted to only computing constraint variables.

This means the matrix builder now runs ~1,000-2,000 county-level simulations (one per unique county in the geography assignment) instead of 51 state-level simulations for variables like snap, household_count, etc. that don't depend on county. This is a ~40x increase in compute cost with no accuracy benefit for those variables.

Was this an intentional simplification, or a debugging shortcut that could be reverted to the two-tier approach? Restoring COUNTY_DEPENDENT_VARS and routing only aca_ptc (and any future county-dependent vars) through county precomputation would significantly reduce matrix build time.
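The two-tier design being proposed for restoration can be sketched as a simple routing gate: variables in `COUNTY_DEPENDENT_VARS` go through the expensive county-level precomputation, everything else through the 51 state-level simulations. The `routing_plan` helper is illustrative, not code from the repo.

```python
# Sketch of the COUNTY_DEPENDENT_VARS gate described above. Only the set name
# comes from the PR discussion; routing_plan is a hypothetical helper.
COUNTY_DEPENDENT_VARS = {"aca_ptc"}

def routing_plan(target_vars):
    """Split target variables into cheap state-level and costly county-level tiers."""
    county_tier = sorted(v for v in target_vars if v in COUNTY_DEPENDENT_VARS)
    state_tier = sorted(v for v in target_vars if v not in COUNTY_DEPENDENT_VARS)
    return state_tier, county_tier

state_tier, county_tier = routing_plan({"snap", "household_count", "aca_ptc"})
# state_tier handled by 51 state simulations; county_tier by the
# ~1,000-2,000 county simulations — only where county actually matters.
```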

@baogorek
Collaborator Author

@juaristi22 thank you for your thoughtful and excellent comments

  1. By the time this PR is done, I do want the calibration_log.csv to be saved once again! Actually, I do get it in this workflow that I run after make data:
 # 1. Build matrix locally (no GPU needed)
  source ~/envs/sep/bin/activate
  python -m policyengine_us_data.calibration.unified_calibration \
      --build-only \
      --skip-source-impute \
      --target-config policyengine_us_data/calibration/target_config.yaml \
      --package-output /tmp/calibration_package.pkl

  # 2. Push package to Modal volume
  modal volume put calibration-data /tmp/calibration_package.pkl calibration_package.pkl --force

  # 3. Fit on GPU from package
  modal run modal_app/remote_calibration_runner.py \
      --branch calibration-pipeline-improvements \
      --gpu T4 \
      --package-volume \
      --epochs 1000 \
      --beta 0.65 \
      --lambda-l0 1e-7 \
      --lambda-l2 1e-8 \
      --log-freq 500 \
      --target-config policyengine_us_data/calibration/target_config.yaml

That will fit the model on modal and drop a calibration_log.csv right on your local drive. I know one of the Issues was about actually storing it in an archive, and maybe that should be out of scope given this PR's complexity.

  2. You're absolutely right about the counties, and I've now made the per-county computation optional. I know that aca_ptc's formula does involve the county, so we may have to live with not getting perfect matches from X * w to sim.calculate("aca_ptc").sum(). But I saw the perfect ratios with your code so I know we can do it.

  3. Yes, you're right that I lost a bit of the vision by not imputing new PUF values for every clone. Given the brutal difficulty of getting X * w to match sim.calculate().sum(), and the speed (which is still slow even after taking out the counties), I'm deferring it for now. From Claude:

    • Would it make X*w consistency harder? Yes, significantly. Right now the matrix builder can precompute values once per state and reuse them across all 436 clones — because every clone of household #5000 has the same underlying values regardless of which state it's assigned to. The only things that change per clone are geography inputs and takeup draws.
    • Issue created: Restore post-cloning PUF QRF re-imputation for geographic tax variation #560

baogorek and others added 10 commits March 10, 2026 11:24
…ter CLI

- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Modal calibration runner was missing --lambda-l0 passthrough.
Also fix KeyError: Ellipsis when load_dataset() returns dicts
instead of h5py datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upload a pre-built calibration package to Modal and run only the
fitting phase, skipping HuggingFace download and matrix build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv,
  epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of creating a new Microsimulation per clone (~3 min each,
22 hours for 436 clones), precompute values for all 51 states on
one sim object (~3 min total), then assemble per-clone values via
numpy fancy indexing (~microseconds per clone).

New methods: _build_state_values, _assemble_clone_values,
_evaluate_constraints_from_values, _calculate_target_values_from_values.
DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds.
Takeup re-randomization deferred to future post-processing layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
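The precompute-then-index trick from this commit can be sketched in a few lines of numpy: compute one value per (state, household) on a single sim object, then assemble each clone's values by fancy indexing on its assigned state. Array names here are illustrative, not the actual `_build_state_values` / `_assemble_clone_values` internals.

```python
import numpy as np

# Sketch: precompute per-state values once, then assemble per-clone values via
# fancy indexing instead of building a Microsimulation per clone (~3 min each).
n_states, n_households = 51, 5
state_values = np.arange(n_states * n_households, dtype=float).reshape(
    n_states, n_households
)

def assemble_clone_values(state_values, clone_state_idx):
    """Each clone is assigned a state; pull that state's row via fancy indexing."""
    return state_values[clone_state_idx]      # shape: (n_clones, n_households)

clone_state_idx = np.array([4, 4, 10, 32])    # 4 clones, two sharing a state
clone_values = assemble_clone_values(state_values, clone_state_idx)
assert clone_values.shape == (4, n_households)
```

The indexing step is microseconds per clone, so the total cost stays at one ~3-minute precompute pass regardless of clone count.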
- Modal runner: add --package-volume flag to read calibration package
  from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to prevent CUDA memory fragmentation during L0 backward pass
- docs/calibration.md: rewrite to lead with lightweight build-then-fit
  workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek and others added 11 commits March 10, 2026 11:29
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker commits to volume but coordinator's view is stale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline takeup draw loops in unified_matrix_builder.py (both
the parallel worker path and the sequential clone path) with calls
to the shared compute_block_takeup_for_entities() from utils/takeup.py.

Remove deprecated functions from takeup.py that are no longer used:
draw_takeup_for_geo, compute_entity_takeup_for_geo,
apply_takeup_draws_to_sim, apply_block_takeup_draws_to_sim, and
_build_entity_to_hh_index. Also remove the now-unused
rerandomize_takeup function from unified_calibration.py.

Simplify compute_block_takeup_for_entities signature by deriving
state FIPS from block GEOID prefix instead of requiring a separate
entity_state_fips parameter.

Update tests to exercise the remaining shared functions directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove dead sim-based methods: _evaluate_constraints_entity_aware,
  _calculate_target_values, and calculate_spm_thresholds_for_cd
- Delete duplicate class methods _evaluate_constraints_from_values and
  _calculate_target_values_from_values; update call sites to use the
  existing standalone functions with variable_entity_map
- Fix count-vs-dollar classifier: replace substring heuristic in
  _get_uprating_info with endswith("_count"); use exact equality in
  validate_staging._classify_variable to prevent false positives
- Add optional precomputed_rates parameter to
  apply_block_takeup_to_arrays to skip redundant load_take_up_rate calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Source imputation (ACS/SIPP/SCF) was previously embedded inside
unified_calibration.py, meaning make data stopped at the stratified
CPS and promote/upload targets shipped the wrong file. This extracts
it into create_source_imputed_cps.py as the final make data step,
updates all dataset references to the source-imputed file, and
defaults skip_source_impute=True in calibration since the step is
now pre-built.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moves calibration_weights.npy from required to optional in
download_calibration_inputs (it's a calibration output, not needed
for matrix building). Updates the primary dataset reference from
stratified_extended_cps.h5 to source_imputed_stratified_extended_cps.h5
and removes now-unnecessary fallback logic in local_area.py and
publish_local_area.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…built H5s

The matrix builder computes takeup draws using per-clone block assignments,
but the dataset builder recomputed them from stacked blocks (first-clone-wins),
producing different X values. This adds compute_stacked_takeup() which
recomputes entity-level takeup draws after optimization using the final
weights, producing a stacked_takeup.npz artifact. The dataset builder uses
these pre-computed values instead of re-deriving from blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When using --package-path, build_matrix() is skipped so the builder's
entity_hh_idx_map is unavailable. Reconstruct it from the simulation
so compute_stacked_takeup() works in both the direct and package paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…et_builder

Comment out aca_ptc, eitc, and refundable_ctc calibration targets in
target_config.yaml — these credits require filing but their formulas
compute nonzero values for non-filers, inflating totals beyond IRS SOI
targets. See PolicyEngine/policyengine-us#7748

Refactor stacked_dataset_builder.py to delegate to build_states/
build_districts/build_cities from publish_local_area.py instead of
reimplementing loop/filter logic. Construct GeographyAssignment from
calibration artifacts and pass to the shared build_h5 API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…regnant keep

- Comment out aca_ptc, eitc, refundable_ctc calibration targets — formulas
  don't gate on tax_unit_is_filer so non-filer values inflate totals beyond
  IRS SOI targets (see policyengine-us#7748)
- Refactor stacked_dataset_builder to delegate to build_states/build_districts/
  build_cities instead of reimplementing loop logic (single source of truth)
- Construct GeographyAssignment from calibration artifacts for new build_h5 API
- Remove is_pregnant from _KEEP_FORMULA_VARS (it's a pure input variable)
- Remaining unstaged pipeline improvements from prior work

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
juaristi22 and others added 8 commits March 11, 2026 19:38
…tion

- Remove baked-in branch from Modal image; setup_repo() now always
  clones fresh from GitHub at the requested branch so cached containers
  never run stale code (fixes #599)
- Add geography.npz to remote_calibration_runner: parse GEOGRAPHY_PATH
  from stdout, collect bytes from GPU container, save locally, and pass
  to upload_calibration_artifacts()
- Move geography.npz and unified_run_config.json from optional to
  required in download_calibration_inputs() so missing artifacts fail
  fast instead of silently passing
- Add geography.npz and run_config to skip_download validation in
  coordinate_publish()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds upload_from_hf_staging_to_gcs() to download from HF staging/
and upload to GCS, making HF staging the single source of truth.
Also includes geo caching, batch constraint loading, via-districts
validation, stacked builder slimming, and promote_local_h5s script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolved conflict in stacked_dataset_builder.py by taking Maria's
load_geography() approach over the inline block-derivation method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek baogorek marked this pull request as ready for review March 12, 2026 15:33
@baogorek baogorek requested a review from juaristi22 March 12, 2026 15:34
Collaborator

@juaristi22 juaristi22 left a comment


PR #538 Review: Calibration Pipeline Improvements

Solid architectural improvements — checkpointing, declarative target config, and CLI hyperparameter tuning are well-designed. Approving with notes below for follow-up.


Critical (recommend fixing before or shortly after merge)

1. cd_geoid undefined in single-cd mode
stacked_dataset_builder.py:160 — calibrated_cds = sorted(set(cd_geoid)) will crash with NameError. This code path appears untested.

2. docs/build_h5.md documents wrong function signature
The documented build_h5 signature shows blocks, cds_to_calibrate, rerandomize_takeup parameters that don't exist in the actual function, which takes a geography: GeographyAssignment object instead. Anyone following this doc will hit immediate errors.

3. n_clones calculated from unique CDs, not actual clone count
unified_calibration.py:1292 — n_clones=len(sorted(set(geography_info["cd_geoid"]...))) uses the number of unique congressional districts (~435) rather than the actual clone count (430). These are close but semantically wrong and could silently affect H5 output.

4. validate-staging workflow never gates on failures
local_area_publish.yaml — The validation job posts a summary but has no step that exits non-zero on sanity failures. A run with 100% failures would show green CI.


High

5. Target Config docs describe only exclude: but the active config uses include:
docs/calibration.md Target Config section exclusively documents exclude: format with examples, but target_config.yaml uses include: mode. Confusing for contributors.
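A config filter that honors both modes would resolve this doc/config drift. The sketch below is illustrative only — `apply_target_config` is named after the function the review mentions, but the logic shown is an assumption, not the actual implementation.

```python
# Hypothetical sketch: support both the documented exclude: format and the
# include: mode that the checked-in target_config.yaml actually uses.
def apply_target_config(targets, config):
    mode = "include" if "include" in config else "exclude"
    rules = set(config.get(mode, []))
    if mode == "include":
        return [t for t in targets if t["group"] in rules]
    return [t for t in targets if t["group"] not in rules]

targets = [{"group": "snap"}, {"group": "aca_ptc"}, {"group": "eitc"}]
kept = apply_target_config(targets, {"include": ["snap"]})
dropped = apply_target_config(targets, {"exclude": ["aca_ptc", "eitc"]})
# Both calls keep only the snap group here — two routes to the same result.
```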

6. modal_app/README.md line 184 claims make calibrate-modal passes --prebuilt-matrices — it does not; only --push-results is passed.

7. builtins.print monkey-patch never restored when --log-freq is used
unified_calibration.py:615-746 — The finally block restoring builtins.print is only on the else branch.
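The standard fix for this class of bug is to wrap the monkey-patch in try/finally so restoration runs on every path, including exceptions. This sketch uses illustrative names and a throttling behavior assumed from the `--log-freq` flag; it is not the PR's actual code.

```python
import builtins

# Sketch: restore builtins.print in a finally block so it survives every exit
# path, not only the else branch. run_with_throttled_print is hypothetical.
def run_with_throttled_print(fn, log_freq):
    original_print = builtins.print
    calls = {"n": 0}

    def throttled(*args, **kwargs):
        calls["n"] += 1
        if calls["n"] % log_freq == 0:       # emit every log_freq-th print
            original_print(*args, **kwargs)

    builtins.print = throttled
    try:
        return fn()
    finally:
        builtins.print = original_print      # restored even if fn() raises

result = run_with_throttled_print(lambda: "fit complete", log_freq=500)
```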

8. Pickle deserialization without integrity checks
unified_calibration.py:457, remote_calibration_runner.py:373,940 — Consider checksum verification before loading packages from shared Modal volumes.


Medium

  • _collect_outputs (remote_calibration_runner.py:90) crashes with unhelpful TypeError if OUTPUT_PATH: marker is missing from subprocess output.
  • Makefile calibrate-both / stage-all-h5s use & + wait — if the first backgrounded process fails but the second succeeds last, wait returns 0.
  • build-h5s hardcodes --n-clones 430 vs the 436 default elsewhere.
  • promote target $(eval VERSION) silently becomes empty if the Python command errors.
  • Workflows use Python 3.13 but project targets 3.11.
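One way to harden the OUTPUT_PATH: parsing flagged in the first Medium item is to scan the streamed stdout lines for the marker and fail with an explicit message when it is absent. The helper name and error text below are illustrative, not from the repo.

```python
# Sketch: parse the OUTPUT_PATH: marker from collected subprocess stdout,
# raising a clear error instead of an unhelpful TypeError when it's missing.
def find_output_path(lines, marker="OUTPUT_PATH:"):
    for line in lines:
        if line.startswith(marker):
            return line[len(marker):].strip()
    raise RuntimeError(f"subprocess finished without printing a '{marker}' line")

stdout = ["epoch 500 loss=0.12", "OUTPUT_PATH: /tmp/outputs/calibration_log.csv"]
path = find_output_path(stdout)
```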

Low / Code Quality

  • ~350 lines of GPU function duplication in remote_calibration_runner.py (5x fit_weights_* + 5x fit_from_package_* differing only in gpu param). A factory function would eliminate this.
  • include rules in apply_target_config have no dedicated test.
  • No integration test for --package-path or --build-only + --package-output flows.
  • Tracked binary w_district_calibration.npy should be git rm --cached since it's now covered by the broader .gitignore.
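The factory suggested in the first Low item could look roughly like this: one template closed over the GPU type and package flag, registered under generated names. Everything here is a sketch — only T4 and the `fit_weights_*` / `fit_from_package_*` naming come from the review; the other GPU names and the body are assumptions.

```python
# Hypothetical factory replacing the 5x + 5x duplicated GPU functions.
def make_fit_function(gpu, from_package):
    def fit(**kwargs):
        source = "package" if from_package else "matrix build"
        return f"fitting on {gpu} from {source}"
    fit.__name__ = (
        f"fit_{'from_package' if from_package else 'weights'}_{gpu.lower()}"
    )
    return fit

GPUS = ["T4", "L4", "A10G", "A100", "H100"]   # only T4 is named in the review
fitters = {
    fn.__name__: fn
    for gpu in GPUS
    for fn in (make_fit_function(gpu, False), make_fit_function(gpu, True))
}
```

Ten near-identical function bodies collapse into one template plus a registry, so adding a GPU type becomes a one-line change.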

Overall: great work on the pipeline consolidation. The items above are mostly documentation drift and edge-case hardening — nothing blocking.

@baogorek baogorek merged commit 3fae1bb into main Mar 12, 2026
6 checks passed
@baogorek baogorek deleted the calibration-pipeline-improvements branch March 12, 2026 16:39