|
1 | 1 | # DTWC++ Development TODO |
2 | 2 |
|
3 | | -**Last Updated:** 2026-04-03 |
4 | | - |
5 | | -## Remaining Work |
| 3 | +**Last Updated:** 2026-04-04 |
| 4 | + |
| 5 | +## Active Work |
| 6 | + |
| 7 | +### Performance (Phase 0 — complete) |
| 8 | +- [x] `-march=native` / `/arch:AVX2` CMake option (`DTWC_ENABLE_NATIVE_ARCH`) — **+45-103% DTW throughput** |
| 9 | +- [x] Replaced `-ffast-math` with explicit sub-flags (omits `-ffinite-math-only`) — `std::isnan` now reliable |
| 10 | +- [x] Simplified `missing_utils.hpp` `is_missing()` to `std::isnan()` wrapper |
| 11 | +- [x] O(n) envelope computation via Lemire ring-buffer deque (was O(n×band)) |
| 12 | + |
| 13 | +### Performance (Phase 2 — next CPU work) |
| 14 | +- [x] `adtwBanded`: replaced full Eigen/ScratchMatrix (O(n*m)) with rolling column vector (O(n)); added `early_abandon` parameter; pruned strategy now enabled for ADTW variant |
| 15 | +- [x] Early abandon for `adtwFull_L` — `early_abandon` parameter added; full ADTW pruned path now fully wired |
| 16 | +- [ ] OpenMP scheduling: benchmark `schedule(dynamic,1)` vs `schedule(dynamic,16)` vs `schedule(guided)` for `fillDistanceMatrix_BruteForce` (`Problem.cpp:364`) |
| 17 | + |
| 18 | +### Architecture (Phase 1 — out-of-core, critical for 5TB datasets) |
| 19 | +- [ ] `DataSource` interface (`dtwc/core/data_source.hpp`) — index-based, identical API for in-memory and disk-backed data; builds on existing `distByInd` pattern |
| 20 | +- [ ] Binary format with in-memory index (`.dtwi`/`.dtwd`) + memory-mapped reader (`MappedDataSource`) |
| 21 | + - Evaluate `mio` (MIT, header-only) vs custom ~100-line `mmap`/`CreateFileMapping` wrapper for 5TB file support |
| 22 | +- [ ] Streaming CLARA assignment pass (`fast_clara.cpp`) — load blocks, compute k distances, discard |
| 23 | +- [ ] Sample size default: scale with `sqrt(N)` for large N (current `max(40+2k, 10k+100)` too small at 100M series) |
| 24 | +- [ ] CLARA checkpointing: save/resume assignment state (labels + best cost) for long runs |
| 25 | +- [ ] CPU/GPU heuristic in `DistanceMatrixStrategy::Auto`: in-memory → full matrix; disk-backed + short series → GPU K-vs-N streaming; disk-backed + long series → CPU streaming |
| 26 | + |
| 27 | +### CUDA (Phase 3) |
| 28 | +- [ ] Architecture-aware dispatch (`DispatchProfile` by compute capability) |
| 29 | +- [ ] Wire `compute_dtw_k_vs_all` kernel into streaming CLARA assignment (GPU double-buffered path) |
| 30 | +- [ ] Wavefront kernel cleanup: document/remove dead preload branch (reachable only for L ∈ (256, 512]) |
| 31 | +- [ ] Unify kernel dispatch logic across `launch_dtw_kernel`, `launch_dtw_one_vs_all_kernel`, `launch_dtw_k_vs_all_kernel` |
| 32 | +- [ ] Multi-stream pipelining for N > 5000 |
| 33 | + |
| 34 | +### Bindings |
| 35 | +- [ ] Python: PyPI first release — CI ready (`python-wheels.yml`), needs GitHub trusted publisher setup |
| 36 | +- [ ] MATLAB Phase 2: MIPSettings, CUDA dispatch, checkpointing, I/O utilities in `dtwc_mex.cpp` |
| 37 | +- [ ] MATLAB: `compile.m` standalone build script (no CMake required) |
6 | 38 |
|
7 | 39 | ### MIP Solver |
8 | | -- [ ] Benders decomposition: verify on machine with HiGHS enabled (code exists, tests skip without HiGHS) |
9 | | -- [ ] Odd-cycle cutting planes ({0,1/2}-CG cuts) as lazy constraints |
10 | | - |
11 | | -### CUDA Next Phase |
12 | | -- [ ] Device-side pruning: stop launching DTW for LB-pruned pairs |
13 | | -- [ ] Architecture-aware dispatch (DispatchProfile by compute capability) |
14 | | -- [ ] Wire K-vs-N kernel into CLARA clustering loop (for sample-based, not full matrix) |
15 | | -- [ ] Benchmark expansion: standalone LB, pruned matrix, 1-vs-N, K-vs-N |
16 | | - |
17 | | -### CUDA Medium Priority |
18 | | -- [ ] Fix register-tiled kernel for banded DTW edge cases |
19 | | -- [ ] Multi-stream pipelining for very large N |
20 | | -- [ ] GPU early-abandon within DTW kernels |
| 40 | +- [ ] Odd-cycle cutting planes ({0,1/2}-CG cuts) as lazy constraints — **instrument Benders gap first** (cuts are justified only if Benders needs >50 iterations to converge)
21 | 41 |
|
22 | 42 | ### Algorithms & Scale |
23 | | -- [ ] Condensed distance matrix (half memory for symmetric storage) |
24 | | -- [ ] Two-phase clustering for pre-categorized data (within-group + cross-group) |
25 | | -- [ ] Lazy loading (FileBackedDataSource, CachedDataSource) |
26 | | -- [ ] Algorithm auto-selection based on cost = N^2 * min(L, band) * ndim |
27 | | - |
28 | | -### Bindings Phase 2 |
29 | | -- [ ] MATLAB: Phase 2 parity (MIPSettings, CUDA, checkpointing, I/O) |
30 | | -- [ ] MATLAB: compile.m standalone build script (no CMake required) |
31 | | -- [ ] Python: PyPI first release (workflows ready, need trusted publisher setup) |
32 | | -- [ ] HIPify for AMD GPU support |
33 | | - |
34 | | -### Technical Debt |
35 | | -- [ ] Clean up wavefront kernel dead code (unreachable preload branch) |
36 | | -- [ ] Unify kernel dispatch logic |
| 43 | +- [ ] Two-phase clustering (within-group + cross-group) — after Phase 1 streaming infra proven |
| 44 | +- [ ] Algorithm auto-selection: improve `DistanceMatrixStrategy::Auto` cost model (`N^2 * min(L, band) * ndim` threshold) |
| 45 | + |
| 46 | +### Platform |
| 47 | +- [ ] ARM Mac Studio investigation: test CPU path on Apple Silicon, evaluate Metal compute for GPU path |
| 48 | +- [ ] HIPify for AMD GPU — **accept community PRs only**, do not invest core developer time |
| 49 | + |
| 50 | +## Guidelines (not TODOs — current conventions) |
| 51 | +- Buffer > thread_local >> heap allocation: already enforced everywhere |
| 52 | +- No naked `new`/`delete` in core: already enforced |
| 53 | +- Contiguous arrays in hot paths: `Data::p_vec` as `vector<vector<data_t>>` is correct for variable-length series |
| 54 | + |
| 55 | +## Removed (completed or cut) |
| 56 | +- ~~NaN/-ffast-math robustification~~ — **DONE**: explicit fast-math sub-flags + `std::isnan()` wrapper |
| 57 | +- ~~Eigen 5.0.1 exploration~~ — **CUT**: Eigen used only as aligned allocator, no gap identified |
| 58 | +- ~~Condensed distance matrix~~ — **DONE**: `DenseDistanceMatrix` already uses packed triangular N*(N+1)/2 |
| 59 | +- ~~Unnecessary memory allocations audit~~ — **CUT**: 30+ `thread_local` declarations already in place |
| 60 | +- ~~Device-side LB pruning (pair-level)~~ — **DONE**: `compact_active_pairs` in `cuda_dtw.cu` |
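The "Condensed distance matrix" entry above says `DenseDistanceMatrix` stores a packed triangle of N*(N+1)/2 entries. As an illustration of how such a layout can be indexed, here is a minimal sketch assuming a row-major lower triangle that keeps the diagonal. The names `packed_size`, `packed_index`, and `PackedSymmetric` are hypothetical and are not taken from the DTWC++ sources; the project's real layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: number of stored entries for an n x n symmetric
// matrix when the lower triangle (including the diagonal) is kept.
inline std::size_t packed_size(std::size_t n) { return n * (n + 1) / 2; }

// Hypothetical helper: map (i, j) to an offset in the packed lower triangle.
// Row i holds i + 1 entries, so rows 0..i-1 occupy i * (i + 1) / 2 slots.
inline std::size_t packed_index(std::size_t i, std::size_t j) {
    if (i < j) std::swap(i, j);  // symmetry: d(i, j) == d(j, i)
    return i * (i + 1) / 2 + j;
}

// Illustrative container only; not the project's DenseDistanceMatrix.
class PackedSymmetric {
public:
    explicit PackedSymmetric(std::size_t n)
        : n_(n), data_(packed_size(n), 0.0) {}

    double get(std::size_t i, std::size_t j) const {
        assert(i < n_ && j < n_);
        return data_[packed_index(i, j)];
    }

    void set(std::size_t i, std::size_t j, double value) {
        assert(i < n_ && j < n_);
        data_[packed_index(i, j)] = value;
    }

    // Number of doubles actually allocated.
    std::size_t stored() const { return data_.size(); }

private:
    std::size_t n_;
    std::vector<double> data_;
};
```

Under these assumptions, a write through `set(1, 3, v)` is visible through `get(3, 1)`, and a 4-series problem allocates 10 values instead of 16.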