Improve top-tail income representation in enhanced CPS #606

@MaxGhenis

Description

Summary

Improve top-tail income representation in the enhanced CPS through two complementary approaches:

Phase 1: Include PUF aggregate records (this PR)

The IRS PUF contains 4 aggregate records (MARS=0) that bundle ultra-high-income filers to protect anonymity. The PUF pipeline currently drops them (puf = puf[puf.MARS != 0]), discarding $140B+ in weighted AGI, most of it in the $10M+ bracket.
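
To make the cost of that filter concrete, here is a minimal sketch on toy data. It assumes the PUF column names E00100 (AGI) and S006 (weight stored in hundredths); the rows are invented for illustration and are not PUF values.

```python
import pandas as pd

# Toy stand-in for the PUF. All values are made up, not real PUF data.
# Assumed columns: MARS (0 = aggregate record), E00100 (AGI in dollars),
# S006 (sampling weight stored in hundredths).
puf = pd.DataFrame(
    {
        "MARS": [1, 2, 0, 0],
        "E00100": [50_000.0, 120_000.0, 40_000_000.0, 90_000_000.0],
        "S006": [150_000, 90_000, 30_000, 20_000],
    }
)


def weighted_agi(df: pd.DataFrame) -> float:
    """Sum of AGI times weight, converting S006 from hundredths."""
    return float((df["E00100"] * df["S006"] / 100).sum())


kept = puf[puf.MARS != 0]     # rows the current filter keeps
dropped = puf[puf.MARS == 0]  # aggregate rows the current filter discards

# Share of total weighted AGI lost to the filter in this toy example.
dropped_share = weighted_agi(dropped) / weighted_agi(puf)
```

Running the same comparison on the real PUF would show how much weighted AGI the MARS != 0 filter removes.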

Changes:

  1. Assign demographics to aggregate records (filing status, age, gender) instead of filtering them out
  2. Inject high-income PUF records (AGI > $1M) directly into the ExtendedCPS dataset, giving the reweighter actual high-income observations

Phase 2: Forbes 400 synthetic records (future)

Add Forbes 400 records with wealth-to-income imputation for the extreme top tail.

Problem

The CPS severely under-represents the top of the income distribution:

  • $5M-$10M AGI bracket: -98.5% calibration error
  • $10M+ AGI bracket: -95.1% calibration error

As a result, tax scoring for millionaires and billionaires is unreliable, and calibration weights get distorted as the reweighter tries to compensate.
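
The issue does not state how calibration error is defined. Assuming it is the signed relative error of the dataset's weighted total against the SOI target, a minimal sketch looks like this (the function name and inputs are hypothetical):

```python
def relative_error(estimate: float, target: float) -> float:
    """Signed relative error of an estimate against its target."""
    if target == 0:
        raise ValueError("target must be non-zero")
    return (estimate - target) / target


# Hypothetical inputs: the dataset reproduces 1.5 units of a 100-unit target.
err = relative_error(1.5, 100.0)
print(f"{err:.1%}")
```

Under that assumption, an error of -98.5% would mean the dataset captures only about 1.5% of the targeted total for the bracket.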

Key data findings

The 4 aggregate records contain:

  • ~1,233 total weighted filers
  • $140.3B weighted AGI ($152.9B in $10M+ bracket alone)
  • Large weighted capital gains ($86.7B), dividends ($20.4B), and partnership income ($11.2B)
  • Each has XTOT=1 (coded as a single filer rather than multiple bundled filers), with weights of 140-465

Approach

  1. assign_aggregate_demographics() assigns MARS, age, gender to MARS=0 records
  2. _inject_high_income_puf_records() appends PUF records with AGI > $1M to ExtendedCPS
  3. The reweighting optimizer adjusts weights to match SOI targets
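
The implementations are not shown in this issue. As a rough illustration of the shape of steps 1 and 2, here is a hypothetical sketch on toy data. The `_sketch` functions, the column names (agi, weight, age, is_male), and the random assignment rule are all assumptions, not the PR's actual code.

```python
import numpy as np
import pandas as pd

AGI_THRESHOLD = 1_000_000  # from the issue: inject records with AGI > $1M


def assign_aggregate_demographics_sketch(puf: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Fill in demographics for MARS=0 rows instead of dropping them.

    The random draws below are placeholders for whatever rule the PR uses.
    """
    rng = np.random.default_rng(seed)
    out = puf.copy()
    mask = out["MARS"] == 0
    n = int(mask.sum())
    out.loc[mask, "MARS"] = rng.choice([1, 2], size=n)      # single or joint
    out.loc[mask, "age"] = rng.integers(40, 80, size=n)     # assumed age range
    out.loc[mask, "is_male"] = rng.integers(0, 2, size=n)   # 0 or 1
    return out


def inject_high_income_puf_records_sketch(
    cps: pd.DataFrame, puf: pd.DataFrame, threshold: float = AGI_THRESHOLD
) -> pd.DataFrame:
    """Append PUF rows with AGI above the threshold to the CPS-based frame."""
    high = puf[puf["agi"] > threshold]
    return pd.concat([cps, high], ignore_index=True)


# Toy inputs; every value here is invented.
puf = pd.DataFrame(
    {
        "MARS": [1, 0, 0, 2],
        "agi": [80_000.0, 25_000_000.0, 60_000_000.0, 2_500_000.0],
        "weight": [1200.0, 300.0, 150.0, 40.0],
        "age": [35.0, np.nan, np.nan, 58.0],
        "is_male": [1.0, np.nan, np.nan, 0.0],
    }
)
cps = pd.DataFrame(
    {
        "MARS": [1, 2],
        "agi": [45_000.0, 110_000.0],
        "weight": [50_000.0, 42_000.0],
        "age": [29.0, 47.0],
        "is_male": [0.0, 1.0],
    }
)

puf_with_demographics = assign_aggregate_demographics_sketch(puf)
extended = inject_high_income_puf_records_sketch(cps, puf_with_demographics)
```

Step 3 is left out: the injected rows only supply observations, and the existing reweighting optimizer would still be responsible for setting their final weights.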

Verification needed

  • Build ExtendedCPS with aggregate records included
  • Compare calibration_log.csv before and after; errors in the $5M+ brackets should shrink
  • Run full EnhancedCPS build and verify reweighting convergence
  • Score a millionaire tax reform before/after to validate revenue estimates
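
The before/after comparison of calibration_log.csv could be scripted along these lines. The schema of that file is not shown in this issue, so the column names target_name and rel_error are assumptions, and the "after" values are placeholders rather than results.

```python
import io

import pandas as pd

# Hypothetical log contents. The two top-bracket "before" errors come from the
# issue; every other number is a placeholder, not a measured result.
before_csv = """target_name,rel_error
agi_5m_to_10m,-0.985
agi_10m_plus,-0.951
agi_100k_to_200k,0.012
"""
after_csv = """target_name,rel_error
agi_5m_to_10m,-0.20
agi_10m_plus,-0.10
agi_100k_to_200k,0.015
"""


def compare_logs(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Join two logs on target name and report the change in absolute error."""
    merged = before.merge(after, on="target_name", suffixes=("_before", "_after"))
    merged["abs_error_change"] = (
        merged["rel_error_after"].abs() - merged["rel_error_before"].abs()
    )
    return merged.sort_values("abs_error_change")


before = pd.read_csv(io.StringIO(before_csv))
after = pd.read_csv(io.StringIO(after_csv))
report = compare_logs(before, after)

# Targets whose absolute error shrank after the change.
improved = report.loc[report["abs_error_change"] < 0, "target_name"].tolist()
```

With real logs, the io.StringIO inputs would be replaced by the two file paths, and the report would also flag any targets whose error grew.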
