Commit 7551018

Update for disk mapping
1 parent a88e915 commit 7551018

21 files changed

Lines changed: 1758 additions & 3076 deletions

.claude/LESSONS.md

Lines changed: 0 additions & 5 deletions
@@ -10,11 +10,6 @@ Critical knowledge to avoid repeating mistakes.
 - Violates triangle inequality. MIP integrality gap bounds (assume metric D) don't formally apply.
 - In practice the gap is small. References: Marteau (2009), Jain (2018).
 
-### k-Medoids constraint matrix is NOT totally unimodular
-- TU boundary is p=3. For p≤2, the matrix IS TU.
-- Odd cycles among facilities break TU (det = (-1)^n - 1 for n-cycle).
-- With fixed medoid set, the assignment IS a transportation problem (TU) → enables Benders.
-
 ### DTW-AROW ≠ simple zero-cost DTW
 - DTW-AROW constrains each missing value to one-to-one diagonal alignment.
 - Simple zero-cost is less restrictive and underestimates distances more.

.claude/TODO.md

Lines changed: 54 additions & 30 deletions
@@ -1,36 +1,60 @@
 # DTWC++ Development TODO
 
-**Last Updated:** 2026-04-03
-
-## Remaining Work
+**Last Updated:** 2026-04-04
+
+## Active Work
+
+### Performance (Phase 0 — complete)
+- [x] `-march=native` / `/arch:AVX2` CMake option (`DTWC_ENABLE_NATIVE_ARCH`) — **+45-103% DTW throughput**
+- [x] Replaced `-ffast-math` with explicit sub-flags (omits `-ffinite-math-only`) — `std::isnan` now reliable
+- [x] Simplified `missing_utils.hpp` `is_missing()` to `std::isnan()` wrapper
+- [x] O(n) envelope computation via Lemire ring-buffer deque (was O(n×band))
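
The O(n) envelope item above refers to the sliding-window min/max technique attributed to Lemire. The following is a minimal sketch of that idea, not the code from this commit: `compute_envelope` is a hypothetical name, and `std::deque` stands in for the ring-buffer deque the TODO mentions. Each index enters and leaves each deque at most once, which is where the O(n) bound comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper (not the DTWC++ API): sliding-window min/max envelope.
// upper[i] = max(x[lo..hi]) and lower[i] = min(x[lo..hi]),
// where lo = max(0, i - band) and hi = min(n - 1, i + band).
void compute_envelope(const std::vector<double>& x, std::size_t band,
                      std::vector<double>& upper, std::vector<double>& lower)
{
    const std::size_t n = x.size();
    upper.assign(n, 0.0);
    lower.assign(n, 0.0);

    std::deque<std::size_t> maxq; // indices; values strictly decreasing front to back
    std::deque<std::size_t> minq; // indices; values strictly increasing front to back
    std::size_t next = 0;         // next index that has not entered the window yet

    for (std::size_t i = 0; i < n; ++i) {
        // Right edge of the window, written to avoid overflow for a huge band.
        const std::size_t hi = (band >= n - 1 - i) ? n - 1 : i + band;
        while (next <= hi) {
            while (!maxq.empty() && x[maxq.back()] <= x[next]) maxq.pop_back();
            maxq.push_back(next);
            while (!minq.empty() && x[minq.back()] >= x[next]) minq.pop_back();
            minq.push_back(next);
            ++next;
        }
        // Left edge of the window: drop indices that have slid out.
        const std::size_t lo = (i > band) ? i - band : 0;
        while (maxq.front() < lo) maxq.pop_front();
        while (minq.front() < lo) minq.pop_front();
        assert(!maxq.empty() && !minq.empty()); // index hi is always still present
        upper[i] = x[maxq.front()];
        lower[i] = x[minq.front()];
    }
}
```

A fixed-capacity ring buffer of size `2 * band + 1` could replace `std::deque` to avoid allocations in a hot path; the sketch keeps the standard container for brevity. NaN handling is out of scope here.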
+
+### Performance (Phase 2 — next CPU work)
+- [x] `adtwBanded`: replaced full Eigen/ScratchMatrix (O(n*m)) with rolling column vector (O(n)); added `early_abandon` parameter; pruned strategy now enabled for ADTW variant
+- [x] Early abandon for `adtwFull_L`: `early_abandon` parameter added; full ADTW pruned path now fully wired
+- [ ] OpenMP scheduling: benchmark `schedule(dynamic,1)` vs `schedule(dynamic,16)` vs `schedule(guided)` for `fillDistanceMatrix_BruteForce` (`Problem.cpp:364`)
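
The rolling-buffer and early-abandon items above can be sketched as follows. This is an illustration under stated assumptions, not the `adtwBanded` or `adtwFull_L` code from the repository: `adtw_rolling` is a hypothetical name, the pointwise cost is assumed to be the absolute difference, no band is applied, and two rolling rows are used where the TODO mentions a single rolling column. The recurrence is the usual amerced-DTW form, in which every non-diagonal step pays an additive `penalty`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sketch: amerced DTW with O(m) memory and early abandoning.
// Returns +infinity when every cell of some row exceeds `early_abandon`.
double adtw_rolling(const std::vector<double>& x, const std::vector<double>& y,
                    double penalty,
                    double early_abandon = std::numeric_limits<double>::infinity())
{
    const double inf = std::numeric_limits<double>::infinity();
    const std::size_t n = x.size(), m = y.size();
    assert(penalty >= 0.0);
    if (n == 0 || m == 0) return inf;

    std::vector<double> prev(m), curr(m);

    // First row: only horizontal steps are possible, each one penalised.
    prev[0] = std::fabs(x[0] - y[0]);
    for (std::size_t j = 1; j < m; ++j)
        prev[j] = prev[j - 1] + penalty + std::fabs(x[0] - y[j]);

    for (std::size_t i = 1; i < n; ++i) {
        // First column: only vertical steps are possible.
        curr[0] = prev[0] + penalty + std::fabs(x[i] - y[0]);
        double row_min = curr[0];
        for (std::size_t j = 1; j < m; ++j) {
            const double diagonal = prev[j - 1];
            const double off_diagonal = std::min(prev[j], curr[j - 1]) + penalty;
            curr[j] = std::min(diagonal, off_diagonal) + std::fabs(x[i] - y[j]);
            row_min = std::min(row_min, curr[j]);
        }
        // Costs are non-negative, so every warping path crosses this row
        // at a cell whose value is at least row_min.
        if (row_min > early_abandon) return inf;
        std::swap(prev, curr);
    }
    return prev[m - 1];
}
```

Passing the best distance found so far as `early_abandon` lets a nearest-medoid search skip most of the table for clearly distant pairs.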
+
+### Architecture (Phase 1 — out-of-core, critical for 5TB datasets)
+- [ ] `DataSource` interface (`dtwc/core/data_source.hpp`) — index-based, identical API for in-memory and disk-backed data; builds on existing `distByInd` pattern
+- [ ] Binary format with in-memory index (`.dtwi`/`.dtwd`) + memory-mapped reader (`MappedDataSource`)
+  - Evaluate `mio` (MIT, header-only) vs custom ~100-line `mmap`/`CreateFileMapping` wrapper for 5TB file support
+- [ ] Streaming CLARA assignment pass (`fast_clara.cpp`) — load blocks, compute k distances, discard
+- [ ] Sample size default: scale with `sqrt(N)` for large N (current `max(40+2k, 10k+100)` too small at 100M series)
+- [ ] CLARA checkpointing: save/resume assignment state (labels + best cost) for long runs
+- [ ] CPU/GPU heuristic in `DistanceMatrixStrategy::Auto`: in-memory → full matrix; disk-backed + short series → GPU K-vs-N streaming; disk-backed + long series → CPU streaming
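
The streaming assignment pass listed above ("load blocks, compute k distances, discard") could look roughly like the sketch below. All names here (`Series`, `Assignment`, `streaming_assign`, the `load_block` and `dist` callables) are hypothetical and chosen for illustration; the planned `DataSource` interface may differ. Only one block of series is resident at a time, plus the k medoids and the label vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Series = std::vector<double>;

struct Assignment {
    std::vector<std::size_t> labels; // index of the nearest medoid, per series
    double total_cost = 0.0;         // sum of nearest-medoid distances
};

// Hypothetical sketch of one assignment pass over disk-backed data.
// load_block(begin, end) returns the series with indices [begin, end);
// dist(a, b) returns a non-negative distance (for example DTW).
template <typename LoadBlock, typename Dist>
Assignment streaming_assign(std::size_t n_series, std::size_t block_size,
                            const std::vector<Series>& medoids,
                            LoadBlock load_block, Dist dist)
{
    assert(block_size > 0 && !medoids.empty());
    Assignment out;
    out.labels.reserve(n_series);

    for (std::size_t begin = 0; begin < n_series; begin += block_size) {
        const std::size_t end = std::min(n_series, begin + block_size);
        const std::vector<Series> block = load_block(begin, end);

        for (const Series& s : block) {
            std::size_t best_k = 0;
            double best_d = std::numeric_limits<double>::infinity();
            for (std::size_t k = 0; k < medoids.size(); ++k) {
                const double d = dist(s, medoids[k]);
                if (d < best_d) { best_d = d; best_k = k; }
            }
            out.labels.push_back(best_k);
            out.total_cost += best_d;
        }
        // `block` goes out of scope here, so its memory is released
        // before the next block is loaded.
    }
    return out;
}
```

The `labels` vector and `total_cost` are also the state a checkpoint would need to persist, which is why the checkpointing item sits naturally next to this pass.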
+
+### CUDA (Phase 3)
+- [ ] Architecture-aware dispatch (`DispatchProfile` by compute capability)
+- [ ] Wire `compute_dtw_k_vs_all` kernel into streaming CLARA assignment (GPU double-buffered path)
+- [ ] Wavefront kernel cleanup: document/remove dead preload branch (reachable only for L ∈ (256, 512])
+- [ ] Unify kernel dispatch logic across `launch_dtw_kernel`, `launch_dtw_one_vs_all_kernel`, `launch_dtw_k_vs_all_kernel`
+- [ ] Multi-stream pipelining for N > 5000
+
+### Bindings
+- [ ] Python: PyPI first release — CI ready (`python-wheels.yml`), needs GitHub trusted publisher setup
+- [ ] MATLAB Phase 2: MIPSettings, CUDA dispatch, checkpointing, I/O utilities in `dtwc_mex.cpp`
+- [ ] MATLAB: `compile.m` standalone build script (no CMake required)
 
 ### MIP Solver
-- [ ] Benders decomposition: verify on machine with HiGHS enabled (code exists, tests skip without HiGHS)
-- [ ] Odd-cycle cutting planes ({0,1/2}-CG cuts) as lazy constraints
-
-### CUDA Next Phase
-- [ ] Device-side pruning: stop launching DTW for LB-pruned pairs
-- [ ] Architecture-aware dispatch (DispatchProfile by compute capability)
-- [ ] Wire K-vs-N kernel into CLARA clustering loop (for sample-based, not full matrix)
-- [ ] Benchmark expansion: standalone LB, pruned matrix, 1-vs-N, K-vs-N
-
-### CUDA Medium Priority
-- [ ] Fix register-tiled kernel for banded DTW edge cases
-- [ ] Multi-stream pipelining for very large N
-- [ ] GPU early-abandon within DTW kernels
+- [ ] Odd-cycle cutting planes ({0,1/2}-CG cuts) as lazy constraints — **instrument Benders gap first** (>50 iterations needed to justify)
 
 ### Algorithms & Scale
-- [ ] Condensed distance matrix (half memory for symmetric storage)
-- [ ] Two-phase clustering for pre-categorized data (within-group + cross-group)
-- [ ] Lazy loading (FileBackedDataSource, CachedDataSource)
-- [ ] Algorithm auto-selection based on cost = N^2 * min(L, band) * ndim
-
-### Bindings Phase 2
-- [ ] MATLAB: Phase 2 parity (MIPSettings, CUDA, checkpointing, I/O)
-- [ ] MATLAB: compile.m standalone build script (no CMake required)
-- [ ] Python: PyPI first release (workflows ready, need trusted publisher setup)
-- [ ] HIPify for AMD GPU support
-
-### Technical Debt
-- [ ] Clean up wavefront kernel dead code (unreachable preload branch)
-- [ ] Unify kernel dispatch logic
+- [ ] Two-phase clustering (within-group + cross-group) — after Phase 1 streaming infra proven
+- [ ] Algorithm auto-selection: improve `DistanceMatrixStrategy::Auto` cost model (`N^2 * min(L, band) * ndim` threshold)
+
+### Platform
+- [ ] ARM Mac Studio investigation: test CPU path on Apple Silicon, evaluate Metal compute for GPU path
+- [ ] HIPify for AMD GPU — **accept community PRs only**, do not invest core developer time
+
+## Guidelines (not TODOs — current conventions)
+- Buffer > thread_local >> heap allocation: already enforced everywhere
+- No naked `new`/`delete` in core: already enforced
+- Contiguous arrays in hot paths: `Data::p_vec` as `vector<vector<data_t>>` is correct for variable-length series
+
+## Removed (completed or cut)
+- ~~NaN/-ffast-math robustification~~ **DONE**: explicit fast-math sub-flags + `std::isnan()` wrapper
+- ~~Eigen 5.0.1 exploration~~ **CUT**: Eigen used only as aligned allocator, no gap identified
+- ~~Condensed distance matrix~~ **DONE**: `DenseDistanceMatrix` already uses packed triangular N*(N+1)/2
+- ~~Unnecessary memory allocations audit~~ **CUT**: 30+ `thread_local` declarations already in place
+- ~~Device-side LB pruning (pair-level)~~ **DONE**: `compact_active_pairs` in `cuda_dtw.cu`
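
The "packed triangular N*(N+1)/2" storage mentioned in the Removed list can be illustrated with a small sketch. This is not the repository's `DenseDistanceMatrix`; `PackedSymmetric` is a hypothetical class, and the row-major lower-triangle layout (diagonal included) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: symmetric N x N matrix stored as its lower triangle,
// diagonal included, in row-major order. Uses N*(N+1)/2 doubles instead of N*N.
class PackedSymmetric {
public:
    explicit PackedSymmetric(std::size_t n) : n_(n), data_(n * (n + 1) / 2, 0.0) {}

    std::size_t size() const { return n_; }
    std::size_t storage() const { return data_.size(); }

    double get(std::size_t i, std::size_t j) const { return data_[index(i, j)]; }
    void set(std::size_t i, std::size_t j, double v) { data_[index(i, j)] = v; }

private:
    // Maps (i, j) and (j, i) to the same slot: row r starts at r*(r+1)/2.
    std::size_t index(std::size_t i, std::size_t j) const
    {
        assert(i < n_ && j < n_);
        if (i < j) std::swap(i, j);
        return i * (i + 1) / 2 + j;
    }

    std::size_t n_;
    std::vector<double> data_;
};
```

Because both orderings of a pair share one slot, a distance written by `set(i, j, d)` is visible through `get(j, i)` with no extra bookkeeping.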

.claude/UNIMODULAR.md

Lines changed: 6 additions & 0 deletions
@@ -560,8 +560,14 @@ Benders is expected to outperform compact MIP for N > 200 and becomes essential
 | How to interpret fractional values? | Fractional A[i,i] = 1/2 indicates facility i is on an odd cycle in the LP solution. Fractional A[i,j] means point j is "shared" between clusters. |
 | Practical strategy? | LP first -> branch on A[i,i] only -> full MIP fallback. |
 
+### k-Medoids constraint matrix is NOT totally unimodular
+- TU boundary is p=3. For p≤2, the matrix IS TU.
+- Odd cycles among facilities break TU (det = (-1)^n - 1 for n-cycle).
+- With fixed medoid set, the assignment IS a transportation problem (TU) → enables Benders.
+
 ---
 
+
 ## 8. References
 
 1. **Balinski, M.L.** (1965). "Integer programming: methods, uses, computation." Management Science 12(3), 253-313. *Original p-median formulation.*

.claude/reports/bright-wishing-candle.md

Lines changed: 0 additions & 127 deletions
This file was deleted.
