int32 / float32 default dtypes risk silent overflow and precision loss #467

@MaxGhenis

Problem

policyengine_core/variables/config.py:16-29 sets the default dtypes for integer / float variables to int32 / float32:

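  • int32 caps at 2,147,483,647 (about 2.147e9). National-level aggregates (e.g. total SSA outlays, total payroll) exceed that ceiling, and NumPy array arithmetic wraps around silently instead of raising.
  • float32 carries ~7 decimal digits of precision (a 24-bit significand). Above 2^24 = 16,777,216 it can no longer represent every whole dollar, so incomes in the tens of millions lose dollars, and cents are lost at much smaller values (the gap between adjacent float32 values exceeds one cent from 2^17 = 131,072 upward). For example, 25_000_000 and 25_000_001 are indistinguishable in float32.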
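
Both failure modes can be reproduced with NumPy alone. The snippet below is a minimal illustration, not code from policyengine_core; the variable names are made up for the example.

```python
import numpy as np

# int32: array arithmetic wraps around without raising or warning.
a = np.array([2_000_000_000], dtype=np.int32)
wrapped = a + a  # true sum is 4_000_000_000, above the int32 max of 2_147_483_647
print(wrapped[0])  # -294967296

# float32: above 2**24 the gap between adjacent values is larger than 1.
low = np.float32(25_000_000)
high = np.float32(25_000_001)
print(low == high)       # True: both round to the same float32 value
print(np.spacing(low))   # 2.0: distance to the next representable float32

# float64 keeps the two values distinct.
print(np.float64(25_000_000) == np.float64(25_000_001))  # False
```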

Every country package inherits this precision limit. The related bug H6 (assert_near also uses float32, being fixed in a separate PR) means the test suite cannot currently catch dollar-level regressions on large values.

Why this isn't a drop-in fix

Changing the defaults to int64 / float64 would make pre-existing H5 datasets incompatible with new readers and vice versa. Specifically:

  • Holder.put_in_cache / set_input currently cast incoming arrays to the variable's dtype.
  • Dataset.save writes arrays as-is into H5 — existing datasets contain float32 / int32 arrays.
  • Simulation.calculate returns holder.default_array() which is shaped by the dtype.

A naive swap would:

  1. Read existing H5 as float32, then upcast on every read (memory overhead).
  2. Read back datasets written by the new code as float64, but tax_benefit_system formulas assume the dtype matches variable.dtype.
  3. Invalidate every on-disk cache (each variable's .npy file stores the array with its original dtype).
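
To make points 1 and 3 concrete, here is a small sketch using an in-memory .npy buffer as a stand-in for an on-disk cache file. It does not touch policyengine_core; it only shows that the stored dtype travels with the file and that upcasting on read allocates a second array twice the size.

```python
import io

import numpy as np

# Stand-in for an array written by existing code (float32 on disk).
legacy = np.arange(1_000, dtype=np.float32)

buffer = io.BytesIO()
np.save(buffer, legacy)  # the .npy header records the dtype (float32)
buffer.seek(0)
loaded = np.load(buffer)

# A reader that expects float64 has to allocate a new, larger array.
upcast = loaded.astype(np.float64)

print(loaded.dtype, loaded.nbytes)  # float32 4000
print(upcast.dtype, upcast.nbytes)  # float64 8000
```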

Proposed migration plan

  1. Add an opt-in flag on TaxBenefitSystem — e.g. use_extended_precision: bool = False — that forces all newly-built variables to int64 / float64.
  2. For one release, emit a DeprecationWarning when a country package constructs a variable with the default int32/float32 dtype.
  3. Bump that default to int64/float64 the release after.
  4. Provide a migration utility: policyengine-core data migrate-dtype <path-to-h5> that promotes arrays in-place.
  5. Update country-package CI to validate that their data still round-trips through the new defaults.
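
A rough sketch of how steps 1, 2 and 4 might fit together is shown below. Everything in it is hypothetical: `resolve_default_dtype`, `promote_array` and the two mapping dictionaries are illustrative names, not existing policyengine_core APIs, and the real migration utility would also need to rewrite the H5 file itself.

```python
import warnings

import numpy as np

# Hypothetical mappings; the real defaults live in policyengine_core/variables/config.py.
LEGACY_DTYPES = {int: np.int32, float: np.float32}
EXTENDED_DTYPES = {int: np.int64, float: np.float64}


def resolve_default_dtype(value_type, use_extended_precision=False):
    """Pick a variable's default dtype (steps 1 and 2 of the plan)."""
    if use_extended_precision:
        return EXTENDED_DTYPES[value_type]
    warnings.warn(
        "Default int32/float32 dtypes are deprecated; "
        "set use_extended_precision=True to opt in to int64/float64.",
        DeprecationWarning,
        stacklevel=2,
    )
    return LEGACY_DTYPES[value_type]


def promote_array(array):
    """Widen int32/float32 arrays to 64-bit; leave other dtypes untouched (step 4)."""
    if array.dtype == np.int32:
        return array.astype(np.int64)
    if array.dtype == np.float32:
        return array.astype(np.float64)
    return array


print(resolve_default_dtype(float, use_extended_precision=True))  # <class 'numpy.float64'>
print(promote_array(np.array([1, 2], dtype=np.int32)).dtype)      # int64
```

Note that promotion only widens the storage type. Values that were already rounded when stored as float32 stay rounded after migration, so datasets may need to be regenerated from source to recover the lost precision.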

References

Identified in the 2026-04 bug hunt (finding H5). Related: H6 (assert_near float32 downcast) is being fixed separately in a non-breaking way.
