Reduce cross-machine differences in record weights and improve national weighting performance #400

@donboyd5

Description

@martinholmer, please see my conversation with Claude, below. I could start on this in a couple of days. If we are fortunate, this could reduce the need to keep playing around with tolerances to ensure results pass on different machines.

Reweighting Optimization Analysis

Background

The national weight optimization runs 2,000 iterations using PyTorch's Adam optimizer. Results differ very slightly across machines despite identical inputs. This analysis investigates why, using the tfevents log from the Feb 16, 2026 run on @donboyd5's machine.

Loss Trajectory

The loss drops from ~38.15 to ~8.00 in the first 100 iterations, then barely moves:

Step      Loss
   0    38.1537
 100     8.0806
 200     8.0059
 300     8.0014
 400     8.0004
 500     8.0002
 ...        ...
1900     8.0008

Three recent runs on the same machine (Feb 9, 10, 16) all converge to identical final loss values (8.000790), confirming on-machine determinism.

The Loss Floor: 8 Impossible Targets

The loss is stuck at ~8.0 because exactly 8 targets have estimates of zero against nonzero SOI targets. The tc_to_soi() function in soi_replication.py (lines 169-181) creates these columns but hard-codes them to zero because the underlying Tax-Calculator variables are not available:

df["estate_income"] = 0  # puf.E26390 Not in TC
df["estate_losses"] = 0  # puf.E26400
df["rent_and_royalty_net_income"] = 0  # puf.E25850 Not in TC
df["rent_and_royalty_net_losses"] = 0  # puf.E25860

Because the columns exist (with all-zero values), they pass the variable in df.columns check in build_loss_matrix and are included as optimization targets. But since every value is zero, no reweighting can produce nonzero estimates: the weighted sum of zeros is always zero.

Target                               SOI Target Value
estate income / total                ~$49B
estate income / count                ~625K returns
estate losses / total                ~$5.9B
estate losses / count                ~49K returns
rent & royalty net income / total    ~$125B
rent & royalty net income / count    ~6.3M returns
rent & royalty net losses / total    ~$56.8B
rent & royalty net losses / count    ~3.5M returns

Each contributes roughly 1.0 to the loss via the formula ((0 + 1)/(target + 1) - 1)^2, which is essentially 1.0 whenever the target is large; the 8 targets together account for the ~8.0 floor.

Convergence Quality (Excluding Impossible Targets)

Beyond those 8, the optimization is well-converged:

  • 498 of 558 targets have error < 0.1%
  • 550 of 558 targets have error < 1%
  • The worst non-impossible target has an error of only 0.69%
  • Median target error: 0.005%

Why Weights Differ Across Machines

Even with identical inputs and the same random seed (set at reweight.py lines 200-202), the weights come out slightly different on different machines because:

1. Floating-point non-associativity in the core computation

The critical operation (reweight.py line 268) sums over hundreds of thousands of records for each of 558 targets. In floating-point math, (a + b) + c != a + (b + c). The order of reduction in .sum() depends on the hardware — on CPU, the SIMD instruction set (SSE vs AVX2 vs AVX-512); on GPU, the parallel reduction strategy and GPU architecture. These differences are tiny (~1e-7 in float32) but compound over 2,000 Adam iterations.
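A minimal sketch of the effect (illustrative only, not code from the repo): summing the same float32 values with different reduction orders gives slightly different totals.

import torch

torch.manual_seed(0)
x = torch.rand(500_000, dtype=torch.float32) * 1e4

direct = x.sum()                                              # one reduction order
chunked = torch.stack([c.sum() for c in x.chunk(8)]).sum()    # blocked reduction, another order

print(direct.item(), chunked.item())
# The two totals typically differ in the last few float32 digits (~1e-7 relative),
# which is the kind of noise that compounds over 2,000 Adam iterations.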

2. BLAS backend differences

Different machines may use different BLAS libraries (OpenBLAS vs MKL) or different compiled versions, which implement matrix operations with different internal algorithms.

3. The loss surface is extremely flat (the key amplifier)

The loss reaches ~8.0005 by step 300 and barely moves for the remaining 1,700 iterations. On a flat surface, there's no strong gradient pulling toward a unique minimum, so many slightly different weight vectors give essentially the same loss. Tiny numerical differences in early iterations steer each machine toward a different but equally good solution.

4. The seed helps, but only on the same hardware

The code sets torch.manual_seed(65748392), and weight_multiplier starts as all-ones, which is deterministic. But torch.use_deterministic_algorithms(True) is NOT called, and even if it were, PyTorch only guarantees determinism on the same hardware, not across different CPUs.
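For reference, the stricter setting mentioned above would look like the snippet below; even with it, PyTorch documents reproducibility only for the same hardware and software stack, not across machines.

import torch

torch.manual_seed(65748392)               # already done in reweight.py (lines 200-202)
torch.use_deterministic_algorithms(True)  # not currently called; same-hardware guarantee only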

Proposed Improvements

Step 1: Drop the 8 Impossible Targets

These contribute nothing except a constant 8.0 to the loss and wasted gradient computation. Dropping them makes loss values meaningful and eliminates noise.
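One possible way to filter them, sketched here with hypothetical names (drop_impossible_targets and target_columns are not from the repo): keep only target columns with at least one nonzero estimate before building the loss matrix.

import numpy as np
import pandas as pd

def drop_impossible_targets(df: pd.DataFrame, target_columns: list) -> list:
    """Return only the target columns that have at least one nonzero estimate.

    Columns such as estate_income that tc_to_soi() hard-codes to zero can never
    match a nonzero SOI target, so they are excluded from the loss matrix.
    """
    kept, dropped = [], []
    for col in target_columns:
        if col in df.columns and np.any(df[col].to_numpy() != 0):
            kept.append(col)
        else:
            dropped.append(col)
    if dropped:
        print(f"Dropping {len(dropped)} all-zero target columns: {dropped}")
    return kept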

Step 2: Turn On the Weight Deviation Penalty (with L2 Norm)

A mechanism already exists in the code (reweight.py lines 269-274) but REWEIGHT_DEVIATION_PENALTY is set to 0.0 in imputation_assumptions.py.

Without the penalty, the problem is underdetermined near the optimum — many different weight vectors produce the same loss. Each machine wanders to a different one. Adding a penalty for deviating from original weights breaks the degeneracy by creating a unique optimum: the weight vector closest to the original weights that also satisfies the targets. A flat plateau becomes a bowl with a single bottom.

The current implementation uses absolute differences (L1 norm). For forcing a unique solution, sum of squared differences (L2 norm) is better because L2 is strictly convex everywhere, while L1 can have flat regions along coordinate axes.

The penalty magnitude needs tuning: too small and the surface is still flat; too large and target accuracy suffers.
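As a rough sketch (illustrative names, not the repo's), an L2 deviation penalty folded into the loss might look like this:

import torch

def total_loss(weight_multiplier: torch.Tensor,
               target_loss: torch.Tensor,
               penalty_weight: float) -> torch.Tensor:
    """Target loss plus an L2 penalty on deviation from the original weights.

    weight_multiplier starts at all ones (i.e., the original weights), so the
    penalty is the mean squared deviation of the multipliers from 1.0.  It is
    strictly convex, which gives the flat plateau a single lowest point.
    """
    deviation_penalty = penalty_weight * ((weight_multiplier - 1.0) ** 2).mean()
    return target_loss + deviation_penalty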

Step 3: Switch from Adam to L-BFGS

Adam is designed for stochastic deep learning training. This problem is a deterministic, smooth, moderate-dimensional optimization — exactly what L-BFGS excels at. It converges to tighter tolerances in fewer iterations.

Implementation options:

  • GPU support: scipy L-BFGS-B is CPU-only (numpy arrays); PyTorch torch.optim.LBFGS runs on CUDA tensors.
  • Bound constraints: scipy has a native bounds parameter; PyTorch has none built in, but the current torch.clamp approach works.
  • Convergence criteria: scipy has native ftol and gtol; with PyTorch they must be implemented manually in the loop.
  • Ecosystem fit: scipy requires converting tensors to/from numpy and breaks the existing TensorBoard logging flow; PyTorch stays in the PyTorch ecosystem and keeps TensorBoard logging.

GPU consideration: The current setup uses an NVIDIA GeForce RTX 5070 Ti (16GB VRAM, CUDA 12.8) and the code already runs on GPU when available (reweight.py lines 184-198). Switching to scipy L-BFGS-B would move computation off the GPU to CPU, which could be a speed regression. PyTorch's LBFGS keeps computation on the GPU.

This tilts toward PyTorch torch.optim.LBFGS, with manual convergence checks and the existing clamping for bounds. However, both options remain viable — scipy L-BFGS-B may still be faster in wall-clock time if the per-iteration overhead of GPU transfers outweighs the GPU's parallel computation benefit for this problem size.
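A minimal sketch of what that could look like; the names here (optimize_lbfgs, loss_matrix, targets) are illustrative rather than the repo's, and the loss mirrors the relative-error form described above.

import torch

def optimize_lbfgs(loss_matrix, targets, device="cpu",
                   mult_min=0.1, mult_max=10.0, max_iter=200, ftol=1e-10):
    """Sketch: minimize squared relative target errors with torch.optim.LBFGS.

    loss_matrix: (n_records, n_targets) tensor of per-record target contributions.
    targets:     (n_targets,) tensor of SOI target values.
    """
    multiplier = torch.ones(loss_matrix.shape[0], device=device, requires_grad=True)
    optimizer = torch.optim.LBFGS([multiplier], line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        estimates = loss_matrix.T @ multiplier
        loss = (((estimates + 1) / (targets + 1) - 1) ** 2).sum()
        loss.backward()
        return loss

    previous = float("inf")
    for _ in range(max_iter):
        loss = optimizer.step(closure)
        with torch.no_grad():                    # keep the existing torch.clamp bound handling
            multiplier.clamp_(mult_min, mult_max)
        if abs(previous - loss.item()) < ftol:   # manual convergence criterion
            break
        previous = loss.item()
    return multiplier.detach()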

Device portability: PyTorch LBFGS works on whatever device the tensors live on. The existing device selection logic in reweight.py (lines 183-198) already handles this:

  • Auto-detection: torch.cuda.is_available() checks for a GPU and falls back to CPU if none is found. No code changes needed for non-GPU machines.
  • User override: The use_gpu parameter (defaults to True) lets a caller force CPU execution by passing use_gpu=False, even on a machine with a GPU. This is unchanged from the current Adam-based code — switching to LBFGS doesn't affect device handling.
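The selection logic is the standard CUDA fallback pattern; a sketch (select_device is an illustrative name, use_gpu matches the parameter described above):

import torch

def select_device(use_gpu: bool = True) -> torch.device:
    """Use CUDA when requested and available; otherwise fall back to CPU."""
    if use_gpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")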

Step 4: Switch from float32 to float64

The code uses dtype=torch.float32 throughout. Float32 has ~7 decimal digits of precision; float64 has ~15. Switching reduces per-iteration floating-point noise by ~8 orders of magnitude, giving paths much less room to diverge. The memory and speed cost is modest for this problem size.
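A quick, illustrative comparison (not repo code), using the same kind of blocked reduction as the earlier sketch, of how order-dependent a large sum is in the two precisions:

import torch

torch.manual_seed(0)
x = torch.rand(500_000) * 1e4

for dtype in (torch.float32, torch.float64):
    xd = x.to(dtype)
    direct = xd.sum()                                             # one reduction order
    chunked = torch.stack([c.sum() for c in xd.chunk(8)]).sum()   # blocked reduction
    print(dtype, (direct - chunked).abs().item())
# The float64 discrepancy is typically many orders of magnitude smaller than the
# float32 one, leaving the optimization paths far less room to diverge.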

Why These Changes Are Complementary

  • Drop impossible targets -> meaningful loss values
  • Weight deviation penalty -> unique solution (most important for cross-machine agreement)
  • L-BFGS with convergence criteria -> finds that solution precisely and stops
  • float64 -> reduces the floating-point noise driving remaining divergence

Together these should make cross-machine weight differences negligible.

Key Code Locations

  • Optimization loop: tmd/utils/reweight.py lines 259-305
  • Loss function: tmd/utils/reweight.py lines 275-277
  • Weight deviation penalty: tmd/utils/reweight.py lines 269-274
  • Penalty setting: tmd/imputation_assumptions.py (REWEIGHT_DEVIATION_PENALTY = 0.0)
  • Multiplier bounds: tmd/imputation_assumptions.py (REWEIGHT_MULTIPLIER_MIN, REWEIGHT_MULTIPLIER_MAX)
  • Tensor dtype: tmd/utils/reweight.py lines 205, 208, 224, 228
  • tfevents logs: tmd/storage/output/reweighting/
