Fix OOM in get_global_map_index for billion-scale sample workloads (DLRM) #11

Open
idevasena wants to merge 1 commit into main from fix/dlrm-oom-virtual-index-map


@idevasena idevasena commented Apr 11, 2026

Problem

get_global_map_index() materializes a Python dict that maps every sample index to its (filename, sample_index) tuple. For DLRM, 369 files × 4,718,592 samples/file = 1.74 billion entries, and this dict alone consumes ~350 GB. That exceeds the host memory in the reported case (247 GB, mlcommons/storage#329) and triggers the Linux OOM killer (signal 9).
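
The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 8-byte entry size is an assumption (an int64 permutation), and the per-entry dict cost is simply what the ~350 GB estimate implies, not a measurement.

```python
NUM_FILES = 369
SAMPLES_PER_FILE = 4_718_592
total_samples = NUM_FILES * SAMPLES_PER_FILE

# Assumption: the permutation array holds one 8-byte (int64) entry per sample.
permutation_bytes = total_samples * 8

# Average cost per dict entry implied by the ~350 GB estimate.
bytes_per_dict_entry = 350e9 / total_samples

print(f"{total_samples:,} samples")                         # 1,741,160,448 samples
print(f"permutation: {permutation_bytes / 1e9:.1f} GB")     # permutation: 13.9 GB
print(f"dict: ~{bytes_per_dict_entry:.0f} bytes per entry")  # dict: ~201 bytes per entry
```

Roughly 200 bytes per entry is plausible for a dict slot plus a tuple, a filename reference, and boxed integers, which is why the eager dict dwarfs the flat array.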

Fixes mlcommons/storage#329

Root Cause

The mapping global_index → (filename, sample_idx) is trivially computable as:

  file_index   = global_index // num_samples_per_file
  sample_index = global_index % num_samples_per_file

Materializing all 1.74B entries in a Python dict wastes ~350 GB when only the shuffle permutation (~14 GB numpy array) is needed.
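
As a worked example of the two formulas above, here is a minimal sketch using the DLRM per-file sample count. The helper names `locate` and `to_global` are hypothetical and are not part of the PR.

```python
SAMPLES_PER_FILE = 4_718_592  # DLRM samples per file, from the description above


def locate(global_index, num_samples_per_file=SAMPLES_PER_FILE):
    """Return (file_index, sample_index) for a global sample index."""
    return divmod(global_index, num_samples_per_file)


def to_global(file_index, sample_index, num_samples_per_file=SAMPLES_PER_FILE):
    """Inverse of locate(): rebuild the global index."""
    return file_index * num_samples_per_file + sample_index


print(locate(0))              # (0, 0)
print(locate(4_718_592))      # (1, 0)
print(locate(1_000_000_000))  # (211, 4377088)
```

Because the mapping is invertible and costs a single integer division, nothing about it needs to be stored per sample.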

Fix

Introduce VirtualIndexMap — a dict-like class that stores only the shuffled permutation array and computes file mappings on demand via __getitem__. Memory usage drops from ~362 GB to ~14 GB.
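
To illustrate the idea, here is a minimal sketch of such a dict-like class. It is not the code in this PR: the class name `VirtualIndexMapSketch`, its constructor parameters, and the choice to expose the shuffle through iteration order are all assumptions, and the standard-library `array` and `random` modules stand in for the numpy permutation array so the example has no dependencies.

```python
import random
from array import array


class VirtualIndexMapSketch:
    """Dict-like map: global_index -> (filename, sample_index), computed on demand."""

    def __init__(self, file_list, num_samples_per_file, shuffle=True, seed=0):
        self.file_list = list(file_list)
        self.num_samples_per_file = num_samples_per_file
        total = len(self.file_list) * num_samples_per_file
        # The only per-sample state: one 8-byte integer per global index.
        self.order = array("q", range(total))
        if shuffle:
            # Seeded so the same seed always yields the same visiting order.
            random.Random(seed).shuffle(self.order)

    def __len__(self):
        return len(self.order)

    def __contains__(self, global_index):
        return 0 <= global_index < len(self.order)

    def __getitem__(self, global_index):
        if global_index not in self:
            raise KeyError(global_index)
        file_index, sample_index = divmod(global_index, self.num_samples_per_file)
        return self.file_list[file_index], sample_index

    def __iter__(self):
        # Keys come back in shuffled order, like iterating the original dict.
        return iter(self.order)

    def items(self):
        return ((g, self[g]) for g in self.order)


index_map = VirtualIndexMapSketch(["a.parquet", "b.parquet", "c.parquet"], 4, seed=1)
print(index_map[5])  # ('b.parquet', 1)
```

Callers that only use `len()`, `in`, indexing, and iteration can treat the object like the original dict, while the filename tuples are never stored.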

Files Changed:

  • dlio_benchmark/utils/config.py: Add VirtualIndexMap class, modify get_global_map_index() and build_sample_map_iter()
  • tests/test_virtual_index_map.py: New unit tests

Testing

  • Unit tests for correctness, shuffle determinism, memory bounds
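
A memory-bounds check along these lines can be sketched at small scale. This is an illustrative comparison, not one of the tests in the PR: the helper names and filenames are hypothetical, and `sys.getsizeof` only counts the dict table and the value tuples, so it undercounts the eager approach.

```python
import sys
from array import array


def dict_map_bytes(num_files, samples_per_file):
    """Approximate size of an eager {global_index: (filename, sample_index)} dict."""
    names = [f"file_{i}" for i in range(num_files)]  # hypothetical filenames
    mapping = {
        g: (names[g // samples_per_file], g % samples_per_file)
        for g in range(num_files * samples_per_file)
    }
    # Dict table plus one tuple per entry; keys and boxed ints are not counted.
    return sys.getsizeof(mapping) + sum(sys.getsizeof(v) for v in mapping.values())


def array_bytes(num_files, samples_per_file):
    """Size of the flat 8-byte-per-entry index array for the same workload."""
    order = array("q", range(num_files * samples_per_file))
    return len(order) * order.itemsize


eager = dict_map_bytes(10, 10_000)
compact = array_bytes(10, 10_000)
print(f"eager dict: {eager:,} bytes, flat array: {compact:,} bytes")
```

Even with the undercount, the eager dict comes out several times larger than the flat array at 100,000 samples, and the gap scales linearly with the sample count.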

get_global_map_index() materialized a Python dict with 1.74B entries
(~350 GB) for DLRM, exceeding host memory before any parquet I/O.
VirtualIndexMap stores only the numpy permutation array (~14 GB) and
computes file mappings on demand via integer division.

96% memory reduction. 15 unit tests included.

Fixes mlcommons/storage#329

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
@idevasena idevasena requested a review from a team April 11, 2026 02:19

@FileSystemGuy FileSystemGuy left a comment


Successfully merging this pull request may close these issues.

mlpstorage training run --model=dlrm fails with job being aborted errors (signal 9)
