@martinholmer, please see my conversation with Claude, below. I could start on this in a couple of days. If we are fortunate, this could reduce the need to keep playing around with tolerances to ensure results pass on different machines.
Reweighting Optimization Analysis
Background
The national weight optimization runs 2,000 iterations using PyTorch's Adam optimizer. Results differ very slightly across machines despite identical inputs. This analysis investigates why, using the tfevents log from the Feb 16, 2026 run on @donboyd5's machine.
Loss Trajectory
The loss drops from ~38.15 to ~8.00 in the first 100 iterations, then barely moves:
| Step | Loss |
|------|---------|
| 0 | 38.1537 |
| 100 | 8.0806 |
| 200 | 8.0059 |
| 300 | 8.0014 |
| 400 | 8.0004 |
| 500 | 8.0002 |
| ... | ... |
| 1900 | 8.0008 |
Three recent runs on the same machine (Feb 9, 10, 16) all converge to identical final loss values (8.000790), confirming on-machine determinism.
The Loss Floor: 8 Impossible Targets
The loss is stuck at ~8.0 because exactly 8 targets have estimates of zero against nonzero SOI targets. The `tc_to_soi()` function in `soi_replication.py` (lines 169-181) creates these columns but hard-codes them to zero because the underlying Tax-Calculator variables are not available:

```python
df["estate_income"] = 0                # puf.E26390 Not in TC
df["estate_losses"] = 0                # puf.E26400
df["rent_and_royalty_net_income"] = 0  # puf.E25850 Not in TC
df["rent_and_royalty_net_losses"] = 0  # puf.E25860
```
Because the columns exist (with all-zero values), they pass the `if variable in df.columns` filter in `build_loss_matrix` and are included as optimization targets. But since every value is zero, no reweighting can produce nonzero estimates — the weighted sum of zeros is always zero:
| Target | SOI Target Value |
|--------|------------------|
| estate income/total | ~$49B |
| estate income/count | ~625K returns |
| estate losses/total | ~$5.9B |
| estate losses/count | ~49K returns |
| rent & royalty net income/total | ~$125B |
| rent & royalty net income/count | ~6.3M returns |
| rent & royalty net losses/total | ~$56.8B |
| rent & royalty net losses/count | ~3.5M returns |
Each contributes approximately 1.0 to the loss via the formula `((0+1)/(large+1) - 1)^2 ≈ 1.0`.
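To see why each of these targets pins almost exactly 1.0 of loss, here is a quick numeric check of that formula against the SOI values above (the helper name `target_loss` is just for illustration):

```python
def target_loss(estimate, target):
    """Per-target relative-error loss term: ((est + 1) / (tgt + 1) - 1)^2."""
    return ((estimate + 1) / (target + 1) - 1) ** 2

# A zero estimate against a large SOI target contributes ~1.0 to the loss.
print(target_loss(0, 49e9))     # ~1.0   (estate income total, ~$49B)
print(target_loss(0, 625_000))  # ~0.999997 (estate income count)

# Eight such targets -> a loss floor of ~8.0.
floor = 8 * target_loss(0, 49e9)
print(round(floor, 6))          # 8.0
```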
Convergence Quality (Excluding Impossible Targets)
Beyond those 8, the optimization is well-converged:
- 498 of 558 targets have error < 0.1%
- 550 of 558 targets have error < 1%
- The worst non-impossible target has error of only 0.69%
- Median target error: 0.005%
Why Weights Differ Across Machines
Even with identical inputs and the same random seed (set at `reweight.py` lines 200-202), slightly different weights result on different machines because:
1. Floating-point non-associativity in the core computation
The critical operation (`reweight.py` line 268) sums over hundreds of thousands of records for each of 558 targets. In floating-point math, `(a + b) + c != a + (b + c)`. The order of reduction in `.sum()` depends on the hardware — on CPU, the SIMD instruction set (SSE vs AVX2 vs AVX-512); on GPU, the parallel reduction strategy and GPU architecture. These differences are tiny (~1e-7 in float32) but compound over 2,000 Adam iterations.
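A minimal pure-Python illustration of the non-associativity (the specific values are arbitrary; the effect is inherent to IEEE-754 arithmetic, independent of PyTorch):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same numbers, different summation order

# The same effect at scale: summing identical data in a different order
# typically changes the last bits of the result.
xs = [1 / i for i in range(1, 10**6)]
fwd = sum(xs)
rev = sum(reversed(xs))
print(abs(fwd - rev))  # typically nonzero, in the last few bits
```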
2. BLAS backend differences
Different machines may use different BLAS libraries (OpenBLAS vs MKL) or different compiled versions, which implement matrix operations with different internal algorithms.
3. The loss surface is extremely flat (the key amplifier)
The loss reaches ~8.0005 by step 300 and barely moves for the remaining 1,700 iterations. On a flat surface, there's no strong gradient pulling toward a unique minimum, so many slightly different weight vectors give essentially the same loss. Tiny numerical differences in early iterations steer each machine toward a different but equally good solution.
4. The seed helps, but only on the same hardware
The code sets `torch.manual_seed(65748392)`, and `weight_multiplier` starts as all-ones, which is deterministic. But `torch.use_deterministic_algorithms(True)` is NOT called, and even if it were, PyTorch only guarantees determinism on the same hardware, not across different CPUs.
Proposed Improvements
Step 1: Drop the 8 Impossible Targets
These contribute nothing except a constant 8.0 to the loss and wasted gradient computation. Dropping them makes loss values meaningful and eliminates noise.
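A sketch of what the filter could look like, on toy data (the names `loss_matrix` and `targets` are illustrative stand-ins, not the actual variables in `build_loss_matrix`):

```python
import pandas as pd

# Toy stand-ins: columns are targets, rows are records.
loss_matrix = pd.DataFrame({
    "agi/total": [1.0, 2.0, 3.0],
    "estate_income/total": [0.0, 0.0, 0.0],  # hard-coded all-zero column
})
targets = pd.Series({"agi/total": 6.0, "estate_income/total": 49e9})

# A target is impossible when its column is identically zero but its SOI
# target is nonzero: no reweighting can move the weighted sum off zero.
impossible = [
    c for c in loss_matrix.columns
    if (loss_matrix[c] == 0).all() and targets[c] != 0
]
loss_matrix = loss_matrix.drop(columns=impossible)
targets = targets.drop(impossible)
print(impossible)  # ['estate_income/total']
```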
Step 2: Turn On the Weight Deviation Penalty (with L2 Norm)
A mechanism already exists in the code (`reweight.py` lines 269-274) but `REWEIGHT_DEVIATION_PENALTY` is set to 0.0 in `imputation_assumptions.py`.
Without the penalty, the problem is underdetermined near the optimum — many different weight vectors produce the same loss. Each machine wanders to a different one. Adding a penalty for deviating from original weights breaks the degeneracy by creating a unique optimum: the weight vector closest to the original weights that also satisfies the targets. A flat plateau becomes a bowl with a single bottom.
The current implementation uses absolute differences (L1 norm). For forcing a unique solution, sum of squared differences (L2 norm) is better because L2 is strictly convex everywhere, while L1 can have flat regions along coordinate axes.
The penalty magnitude needs tuning: too small and the surface is still flat; too large and target accuracy suffers.
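A toy demonstration of the degeneracy-breaking effect (this is not the actual `reweight.py` loss; the 2-variable problem and `lam` value are purely illustrative). Without the penalty, every point on the line `w1 + w2 = 2` is a perfect fit, so different starting points land at different solutions; the L2 penalty makes the minimizer unique:

```python
import numpy as np
from scipy.optimize import minimize

w0 = np.array([1.0, 1.0])  # "original weights"

def target_loss(w):
    # Underdetermined: every point on the line w1 + w2 = 2 fits perfectly.
    return (w.sum() - 2.0) ** 2

def penalized_loss(w, lam=0.1):
    # L2 deviation penalty: unique optimum is the point on the plateau
    # closest to the original weights.
    return target_loss(w) + lam * np.sum((w - w0) ** 2)

starts = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]

plain = [minimize(target_loss, s).x for s in starts]
print(plain)      # two different minimizers, both with zero target loss

penalized = [minimize(penalized_loss, s).x for s in starts]
print(penalized)  # both starts converge to the same point, [1, 1]
```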
Step 3: Switch from Adam to L-BFGS
Adam is designed for stochastic deep learning training. This problem is a deterministic, smooth, moderate-dimensional optimization — exactly what L-BFGS excels at. It converges to tighter tolerances in fewer iterations.
Implementation options:
| | scipy L-BFGS-B | PyTorch `torch.optim.LBFGS` |
|---|---|---|
| GPU support | No (CPU only, numpy arrays) | Yes (runs on CUDA tensors) |
| Bound constraints | Native `bounds` parameter | Not built in, but current `torch.clamp` approach works |
| Convergence criteria | Native `ftol`, `gtol` | Must be implemented manually in the loop |
| Ecosystem fit | Requires converting tensors to/from numpy; breaks existing TensorBoard logging flow | Stays in PyTorch ecosystem; keeps TensorBoard logging |
GPU consideration: The current setup uses an NVIDIA GeForce RTX 5070 Ti (16GB VRAM, CUDA 12.8) and the code already runs on GPU when available (`reweight.py` lines 184-198). Switching to scipy L-BFGS-B would move computation off the GPU to CPU, which could be a speed regression. PyTorch's LBFGS keeps computation on the GPU.
This tilts toward PyTorch `torch.optim.LBFGS`, with manual convergence checks and the existing clamping for bounds. However, both options remain viable — scipy L-BFGS-B may still be faster in wall-clock time if the per-iteration overhead of GPU transfers outweighs the GPU's parallel computation benefit for this problem size.
Device portability: PyTorch LBFGS works on whatever device the tensors live on. The existing device selection logic in `reweight.py` (lines 183-198) already handles this:
- Auto-detection: `torch.cuda.is_available()` checks for a GPU and falls back to CPU if none is found. No code changes needed for non-GPU machines.
- User override: The `use_gpu` parameter (defaults to `True`) lets a caller force CPU execution by passing `use_gpu=False`, even on a machine with a GPU. This is unchanged from the current Adam-based code — switching to LBFGS doesn't affect device handling.
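A minimal sketch of what the LBFGS loop could look like, on toy data (the toy problem and all names here are illustrative; the real loop would keep the existing TensorBoard logging, device selection, and bound constants). `opt.step(closure)` runs up to `max_iter` internal L-BFGS iterations per call, and the `ftol`-style stopping rule replaces scipy's native convergence criteria:

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: rows are records, columns are targets; we match
# weighted column sums to target totals.
n_records, n_targets = 1000, 5
matrix = torch.rand(n_records, n_targets, dtype=torch.float64)
base_weights = torch.rand(n_records, dtype=torch.float64) * 100
true_mult = 0.5 + torch.rand(n_records, dtype=torch.float64)  # in-bounds truth
targets = (base_weights * true_mult) @ matrix  # achievable by construction

multiplier = torch.ones(n_records, dtype=torch.float64, requires_grad=True)
opt = torch.optim.LBFGS([multiplier], lr=1.0, max_iter=20,
                        line_search_fn="strong_wolfe")

def closure():
    # LBFGS re-evaluates the loss and gradient, possibly several times
    # per step, so the computation lives inside the closure.
    opt.zero_grad()
    est = (base_weights * multiplier) @ matrix
    loss = torch.sum(((est + 1) / (targets + 1) - 1) ** 2)
    loss.backward()
    return loss

prev_loss = float("inf")
for step in range(50):
    loss = opt.step(closure).item()
    with torch.no_grad():
        multiplier.clamp_(0.1, 10.0)   # reuse the existing clamp-style bounds
    if abs(prev_loss - loss) < 1e-14:  # manual ftol-style convergence check
        break
    prev_loss = loss

print(f"stopped at step {step}, loss {loss:.3e}")
```

Note that clamping between steps technically invalidates L-BFGS's curvature history, which is why scipy's native `bounds` handling is the cleaner alternative if CPU execution is acceptable.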
Step 4: Switch from float32 to float64
The code uses `dtype=torch.float32` throughout. Float32 has ~7 decimal digits of precision; float64 has ~15. Switching reduces per-iteration floating-point noise by ~8 orders of magnitude, giving paths much less room to diverge. The memory and speed cost is modest for this problem size.
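The precision gap is easy to demonstrate deterministically (a generic IEEE-754 fact, shown here with numpy rather than torch):

```python
import numpy as np

# Machine epsilon: the relative rounding noise per operation.
print(np.finfo(np.float32).eps)  # ~1.19e-07
print(np.finfo(np.float64).eps)  # ~2.22e-16

# At 2**24 (~16.8M), float32 can no longer represent the next integer,
# so adding 1 is lost entirely -- relevant when summing large weighted
# totals across hundreds of thousands of records.
big32 = np.float32(2**24)
print(big32 + np.float32(1) == big32)  # True: the 1 vanishes

big64 = np.float64(2**24)
print(big64 + np.float64(1) == big64)  # False: float64 has ~2**53 headroom
```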
Why These Changes Are Complementary
- Drop impossible targets -> meaningful loss values
- Weight deviation penalty -> unique solution (most important for cross-machine agreement)
- L-BFGS with convergence criteria -> finds that solution precisely and stops
- float64 -> reduces the floating-point noise driving remaining divergence
Together these should make cross-machine weight differences negligible.
Key Code Locations
- Optimization loop: `tmd/utils/reweight.py` lines 259-305
- Loss function: `tmd/utils/reweight.py` lines 275-277
- Weight deviation penalty: `tmd/utils/reweight.py` lines 269-274
- Penalty setting: `tmd/imputation_assumptions.py` (`REWEIGHT_DEVIATION_PENALTY = 0.0`)
- Multiplier bounds: `tmd/imputation_assumptions.py` (`REWEIGHT_MULTIPLIER_MIN`, `REWEIGHT_MULTIPLIER_MAX`)
- Tensor dtype: `tmd/utils/reweight.py` lines 205, 208, 224, 228
- tfevents logs: `tmd/storage/output/reweighting/`