@martinholmer, please see my conversation with Claude, below. I could start on this in a couple of days. If we are fortunate, this could reduce the need to keep playing around with tolerances to ensure results pass on different machines.
Reweighting Optimization Analysis
Background
The national weight optimization runs 2,000 iterations using PyTorch's Adam optimizer. Results differ very slightly across machines despite identical inputs. This analysis investigates why, using the tfevents log from the Feb 16, 2026 run on @donboyd5's machine.
Loss Trajectory
The loss drops from ~38.15 to ~8.00 in the first 100 iterations, then barely moves:
| Step | Loss |
|------|---------|
| 0 | 38.1537 |
| 100 | 8.0806 |
| 200 | 8.0059 |
| 300 | 8.0014 |
| 400 | 8.0004 |
| 500 | 8.0002 |
| ... | ... |
| 1900 | 8.0008 |
Three recent runs on the same machine (Feb 9, 10, 16) all converge to identical final loss values (8.000790), confirming on-machine determinism.
The Loss Floor: 8 Impossible Targets
The loss is stuck at ~8.0 because exactly 8 targets have estimates of zero against nonzero SOI targets. The `tc_to_soi()` function in `soi_replication.py` (lines 169-181) creates these columns but hard-codes them to zero because the underlying Tax-Calculator variables are not available:

```python
df["estate_income"] = 0                # puf.E26390 Not in TC
df["estate_losses"] = 0                # puf.E26400
df["rent_and_royalty_net_income"] = 0  # puf.E25850 Not in TC
df["rent_and_royalty_net_losses"] = 0  # puf.E25860
```
Because the columns exist (with all-zero values), they pass the `if variable in df.columns` filter in `build_loss_matrix` and are included as optimization targets. But since every value is zero, no reweighting can produce nonzero estimates — the weighted sum of zeros is always zero:
| Target | SOI Target Value |
|--------|------------------|
| estate income/total | ~$49B |
| estate income/count | ~625K returns |
| estate losses/total | ~$5.9B |
| estate losses/count | ~49K returns |
| rent & royalty net income/total | ~$125B |
| rent & royalty net income/count | ~6.3M returns |
| rent & royalty net losses/total | ~$56.8B |
| rent & royalty net losses/count | ~3.5M returns |
Each contributes approximately 1.0 to the loss via the formula `((0+1)/(large+1) - 1)^2 ≈ 1.0`.
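To see why each of these targets pins almost exactly 1.0 of loss, here is a quick numeric check of that formula against the SOI values above (the helper name `target_loss` is just for illustration):

```python
def target_loss(estimate, target):
    """Per-target relative-error loss term: ((est + 1) / (tgt + 1) - 1)^2."""
    return ((estimate + 1) / (target + 1) - 1) ** 2

# A zero estimate against a large SOI target contributes ~1.0 to the loss.
print(target_loss(0, 49e9))     # ~1.0   (estate income total, ~$49B)
print(target_loss(0, 625_000))  # ~0.999997 (estate income count)

# Eight such targets -> a loss floor of ~8.0.
floor = 8 * target_loss(0, 49e9)
print(round(floor, 6))          # 8.0
```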
Convergence Quality (Excluding Impossible Targets)
Beyond those 8, the optimization is well-converged:
- 498 of 558 targets have error < 0.1%
- 550 of 558 targets have error < 1%
- The worst non-impossible target has error of only 0.69%
- Median target error: 0.005%
Why Weights Differ Across Machines
Even with identical inputs and the same random seed (set at `reweight.py` lines 200-202), slightly different weights result on different machines because:
1. Floating-point non-associativity in the core computation
The critical operation (`reweight.py` line 268) sums over hundreds of thousands of records for each of 558 targets. In floating-point math, `(a + b) + c != a + (b + c)`. The order of reduction in `.sum()` depends on the hardware — on CPU, the SIMD instruction set (SSE vs AVX2 vs AVX-512); on GPU, the parallel reduction strategy and GPU architecture. These differences are tiny (~1e-7 in float32) but compound over 2,000 Adam iterations.
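A minimal pure-Python illustration of the non-associativity (the specific values are arbitrary; the effect is inherent to IEEE-754 arithmetic, independent of PyTorch):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same numbers, different summation order

# The same effect at scale: summing identical data in a different order
# typically changes the last bits of the result.
xs = [1 / i for i in range(1, 10**6)]
fwd = sum(xs)
rev = sum(reversed(xs))
print(abs(fwd - rev))  # typically nonzero, in the last few bits
```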
2. BLAS backend differences
Different machines may use different BLAS libraries (OpenBLAS vs MKL) or different compiled versions, which implement matrix operations with different internal algorithms.
3. The loss surface is extremely flat (the key amplifier)
The loss reaches ~8.0005 by step 300 and barely moves for the remaining 1,700 iterations. On a flat surface, there's no strong gradient pulling toward a unique minimum, so many slightly different weight vectors give essentially the same loss. Tiny numerical differences in early iterations steer each machine toward a different but equally good solution.
4. The seed helps, but only on the same hardware
The code sets `torch.manual_seed(65748392)`, and `weight_multiplier` starts as all-ones, which is deterministic. But `torch.use_deterministic_algorithms(True)` is NOT called, and even if it were, PyTorch only guarantees determinism on the same hardware, not across different CPUs.
Proposed Improvements
Step 1: Drop the 8 Impossible Targets
These contribute nothing except a constant 8.0 to the loss and wasted gradient computation. Dropping them makes loss values meaningful and eliminates noise.
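A sketch of what the filter could look like, on toy data (the names `loss_matrix` and `targets` are illustrative stand-ins, not the actual variables in `build_loss_matrix`):

```python
import pandas as pd

# Toy stand-ins: columns are targets, rows are records.
loss_matrix = pd.DataFrame({
    "agi/total": [1.0, 2.0, 3.0],
    "estate_income/total": [0.0, 0.0, 0.0],  # hard-coded all-zero column
})
targets = pd.Series({"agi/total": 6.0, "estate_income/total": 49e9})

# A target is impossible when its column is identically zero but its SOI
# target is nonzero: no reweighting can move the weighted sum off zero.
impossible = [
    c for c in loss_matrix.columns
    if (loss_matrix[c] == 0).all() and targets[c] != 0
]
loss_matrix = loss_matrix.drop(columns=impossible)
targets = targets.drop(impossible)
print(impossible)  # ['estate_income/total']
```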
Step 2: Turn On the Weight Deviation Penalty (with L2 Norm)
A mechanism already exists in the code (`reweight.py` lines 269-274) but `REWEIGHT_DEVIATION_PENALTY` is set to 0.0 in `imputation_assumptions.py`.
Without the penalty, the problem is underdetermined near the optimum — many different weight vectors produce the same loss. Each machine wanders to a different one. Adding a penalty for deviating from original weights breaks the degeneracy by creating a unique optimum: the weight vector closest to the original weights that also satisfies the targets. A flat plateau becomes a bowl with a single bottom.
The current implementation uses absolute differences (L1 norm). For forcing a unique solution, sum of squared differences (L2 norm) is better because L2 is strictly convex everywhere, while L1 can have flat regions along coordinate axes.
The penalty magnitude needs tuning: too small and the surface is still flat; too large and target accuracy suffers.
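A toy demonstration of the degeneracy-breaking effect (this is not the actual `reweight.py` loss; the 2-variable problem and `lam` value are purely illustrative). Without the penalty, every point on the line `w1 + w2 = 2` is a perfect fit, so different starting points land at different solutions; the L2 penalty makes the minimizer unique:

```python
import numpy as np
from scipy.optimize import minimize

w0 = np.array([1.0, 1.0])  # "original weights"

def target_loss(w):
    # Underdetermined: every point on the line w1 + w2 = 2 fits perfectly.
    return (w.sum() - 2.0) ** 2

def penalized_loss(w, lam=0.1):
    # L2 deviation penalty: unique optimum is the point on the plateau
    # closest to the original weights.
    return target_loss(w) + lam * np.sum((w - w0) ** 2)

starts = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]

plain = [minimize(target_loss, s).x for s in starts]
print(plain)      # two different minimizers, both with zero target loss

penalized = [minimize(penalized_loss, s).x for s in starts]
print(penalized)  # both starts converge to the same point, [1, 1]
```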
Step 3: Switch from Adam to L-BFGS
Adam is designed for stochastic deep learning training. This problem is a deterministic, smooth, moderate-dimensional optimization — exactly what L-BFGS excels at. It converges to tighter tolerances in fewer iterations.
Implementation options:
| | scipy L-BFGS-B | PyTorch `torch.optim.LBFGS` |
|---|---|---|
| GPU support | No (CPU only, numpy arrays) | Yes (runs on CUDA tensors) |
| Bound constraints | Native `bounds` parameter | Not built in, but current `torch.clamp` approach works |
| Convergence criteria | Native `ftol`, `gtol` | Must be implemented manually in the loop |
| Ecosystem fit | Requires converting tensors to/from numpy; breaks existing TensorBoard logging flow | Stays in PyTorch ecosystem; keeps TensorBoard logging |
GPU consideration: The current setup uses an NVIDIA GeForce RTX 5070 Ti (16GB VRAM, CUDA 12.8) and the code already runs on GPU when available (`reweight.py` lines 184-198). Switching to scipy L-BFGS-B would move computation off the GPU to CPU, which could be a speed regression. PyTorch's LBFGS keeps computation on the GPU.
This tilts toward PyTorch `torch.optim.LBFGS`, with manual convergence checks and the existing clamping for bounds. However, both options remain viable — scipy L-BFGS-B may still be faster in wall-clock time if the per-iteration overhead of GPU transfers outweighs the GPU's parallel computation benefit for this problem size.
Device portability: PyTorch LBFGS works on whatever device the tensors live on. The existing device selection logic in `reweight.py` (lines 183-198) already handles this:
- Auto-detection: `torch.cuda.is_available()` checks for a GPU and falls back to CPU if none is found. No code changes needed for non-GPU machines.
- User override: The `use_gpu` parameter (defaults to `True`) lets a caller force CPU execution by passing `use_gpu=False`, even on a machine with a GPU. This is unchanged from the current Adam-based code — switching to LBFGS doesn't affect device handling.
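A minimal sketch of what the LBFGS loop could look like, on toy data (the toy problem and all names here are illustrative; the real loop would keep the existing TensorBoard logging, device selection, and bound constants). `opt.step(closure)` runs up to `max_iter` internal L-BFGS iterations per call, and the `ftol`-style stopping rule replaces scipy's native convergence criteria:

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: rows are records, columns are targets; we match
# weighted column sums to target totals.
n_records, n_targets = 1000, 5
matrix = torch.rand(n_records, n_targets, dtype=torch.float64)
base_weights = torch.rand(n_records, dtype=torch.float64) * 100
true_mult = 0.5 + torch.rand(n_records, dtype=torch.float64)  # in-bounds truth
targets = (base_weights * true_mult) @ matrix  # achievable by construction

multiplier = torch.ones(n_records, dtype=torch.float64, requires_grad=True)
opt = torch.optim.LBFGS([multiplier], lr=1.0, max_iter=20,
                        line_search_fn="strong_wolfe")

def closure():
    # LBFGS re-evaluates the loss and gradient, possibly several times
    # per step, so the computation lives inside the closure.
    opt.zero_grad()
    est = (base_weights * multiplier) @ matrix
    loss = torch.sum(((est + 1) / (targets + 1) - 1) ** 2)
    loss.backward()
    return loss

prev_loss = float("inf")
for step in range(50):
    loss = opt.step(closure).item()
    with torch.no_grad():
        multiplier.clamp_(0.1, 10.0)   # reuse the existing clamp-style bounds
    if abs(prev_loss - loss) < 1e-14:  # manual ftol-style convergence check
        break
    prev_loss = loss

print(f"stopped at step {step}, loss {loss:.3e}")
```

Note that clamping between steps technically invalidates L-BFGS's curvature history, which is why scipy's native `bounds` handling is the cleaner alternative if CPU execution is acceptable.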
Step 4: Switch from float32 to float64
The code uses `dtype=torch.float32` throughout. Float32 has ~7 decimal digits of precision; float64 has ~15. Switching reduces per-iteration floating-point noise by ~8 orders of magnitude, giving paths much less room to diverge. The memory and speed cost is modest for this problem size.
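The precision gap is easy to demonstrate deterministically (a generic IEEE-754 fact, shown here with numpy rather than torch):

```python
import numpy as np

# Machine epsilon: the relative rounding noise per operation.
print(np.finfo(np.float32).eps)  # ~1.19e-07
print(np.finfo(np.float64).eps)  # ~2.22e-16

# At 2**24 (~16.8M), float32 can no longer represent the next integer,
# so adding 1 is lost entirely -- relevant when summing large weighted
# totals across hundreds of thousands of records.
big32 = np.float32(2**24)
print(big32 + np.float32(1) == big32)  # True: the 1 vanishes

big64 = np.float64(2**24)
print(big64 + np.float64(1) == big64)  # False: float64 has ~2**53 headroom
```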
Why These Changes Are Complementary
- Drop impossible targets -> meaningful loss values
- Weight deviation penalty -> unique solution (most important for cross-machine agreement)
- L-BFGS with convergence criteria -> finds that solution precisely and stops
- float64 -> reduces the floating-point noise driving remaining divergence
Together these should make cross-machine weight differences negligible.
Key Code Locations
- Optimization loop: `tmd/utils/reweight.py` lines 259-305
- Loss function: `tmd/utils/reweight.py` lines 275-277
- Weight deviation penalty: `tmd/utils/reweight.py` lines 269-274
- Penalty setting: `tmd/imputation_assumptions.py` (`REWEIGHT_DEVIATION_PENALTY = 0.0`)
- Multiplier bounds: `tmd/imputation_assumptions.py` (`REWEIGHT_MULTIPLIER_MIN`, `REWEIGHT_MULTIPLIER_MAX`)
- Tensor dtype: `tmd/utils/reweight.py` lines 205, 208, 224, 228
- tfevents logs: `tmd/storage/output/reweighting/`