Fix OOM in get_global_map_index for billion-scale sample workloads (DLRM) by idevasena · Pull Request #11 · mlcommons/DLIO_local_changes

idevasena · 2026-04-11T02:19:07Z

Problem

get_global_map_index() materializes a Python dict mapping every global sample index to its (filename, sample_index) tuple. For DLRM with 369 files × 4,718,592 samples/file = 1,741,160,448 (~1.74 billion) entries, this dict alone consumes ~350 GB. That exceeds typical host memory (247 GB in the case reported in mlcommons/storage#329) and triggers the Linux OOM killer (SIGKILL, signal 9).

Fixes mlcommons/storage#329

Root Cause

The mapping global_index → (filename, sample_index) is trivially computable when every file holds the same number of samples; the filename is then looked up by file_index:

  file_index   = global_index // num_samples_per_file
  sample_index = global_index % num_samples_per_file

Materializing all 1.74B entries in a Python dict wastes ~350 GB when only the shuffle permutation (~14 GB numpy array) is needed.
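
The arithmetic above can be checked with a short sketch. The helper name `locate` is hypothetical and is not taken from the PR, and the int64 element type for the permutation array is an assumption (it is what makes the ~14 GB figure work out).

```python
import numpy as np

NUM_FILES = 369
SAMPLES_PER_FILE = 4_718_592
TOTAL = NUM_FILES * SAMPLES_PER_FILE  # 1,741,160,448 samples


def locate(global_index, num_samples_per_file=SAMPLES_PER_FILE):
    """Hypothetical helper: return (file_index, sample_index)."""
    # divmod(a, b) == (a // b, a % b), i.e. the two formulas above in one call.
    return divmod(global_index, num_samples_per_file)


# Size of the permutation array, computed without allocating it.
# Assumes one int64 (8 bytes) per sample.
perm_gb = TOTAL * np.dtype(np.int64).itemsize / 1e9  # about 13.93 GB

# Per-entry cost implied by the ~350 GB figure quoted in the description.
implied_bytes_per_entry = 350e9 / TOTAL  # about 201 bytes

print(locate(0))          # (0, 0)
print(locate(TOTAL - 1))  # (368, 4718591)
```

The first and last global indices land on the first sample of file 0 and the last sample of file 368, so the arithmetic covers the whole index range without any lookup table.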

Fix

Introduce VirtualIndexMap, a dict-like class that stores only the shuffled permutation array and computes file mappings on demand via __getitem__. Memory usage drops from ~362 GB to ~14 GB.
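
A minimal sketch of the idea follows. It is not the code from this PR: the class name `VirtualIndexMapSketch`, its constructor parameters, and the use of a seeded NumPy generator for shuffling are all assumptions made for illustration.

```python
import numpy as np


class VirtualIndexMapSketch:
    """Read-only, dict-like map: position -> (filename, sample_index).

    Illustrative only; the real VirtualIndexMap may differ in API and details.
    """

    def __init__(self, file_list, num_samples_per_file, shuffle=False, seed=0):
        self.file_list = list(file_list)
        self.num_samples_per_file = int(num_samples_per_file)
        total = len(self.file_list) * self.num_samples_per_file
        # The only per-sample state: one int64 (8 bytes) per sample.
        self.permutation = np.arange(total, dtype=np.int64)
        if shuffle:
            # Assumption: a seeded generator gives a reproducible shuffle.
            np.random.default_rng(seed).shuffle(self.permutation)

    def __len__(self):
        return len(self.permutation)

    def __getitem__(self, key):
        if not 0 <= key < len(self.permutation):
            raise KeyError(key)
        global_index = int(self.permutation[key])
        # Compute the file mapping on demand instead of storing it.
        file_index, sample_index = divmod(global_index, self.num_samples_per_file)
        return self.file_list[file_index], sample_index

    def keys(self):
        return range(len(self.permutation))

    def items(self):
        return ((k, self[k]) for k in self.keys())


# Usage with hypothetical file names.
files = ["part_0.parquet", "part_1.parquet", "part_2.parquet"]
vmap = VirtualIndexMapSketch(files, num_samples_per_file=4)
print(len(vmap))  # 12
print(vmap[5])    # ('part_1.parquet', 1)
```

Because lookups go through `__getitem__`, callers that only index the map or iterate over `items()` can keep treating it like the original dict, while the stored state shrinks to the permutation array.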

Files Changed:

dlio_benchmark/utils/config.py — Add VirtualIndexMap class, modify get_global_map_index() and build_sample_map_iter()
tests/test_virtual_index_map.py — New unit tests

Testing

Unit tests for correctness, shuffle determinism, memory bounds
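
The three test categories could look roughly like the sketch below. These are not the tests from tests/test_virtual_index_map.py; the helpers `make_permutation` and `lookup` are hypothetical stand-ins so that the example runs on its own.

```python
import numpy as np


def make_permutation(total, seed):
    """Hypothetical helper: seeded shuffle of 0..total-1 stored as int64."""
    perm = np.arange(total, dtype=np.int64)
    np.random.default_rng(seed).shuffle(perm)
    return perm


def lookup(perm, key, files, per_file):
    """Hypothetical helper: on-demand (filename, sample_index) lookup."""
    file_index, sample_index = divmod(int(perm[key]), per_file)
    return files[file_index], sample_index


def test_correctness():
    # Every (file, sample) pair must appear exactly once across all keys.
    files, per_file = ["a", "b", "c"], 4
    perm = make_permutation(len(files) * per_file, seed=7)
    seen = [lookup(perm, k, files, per_file) for k in range(len(perm))]
    expected = {(f, s) for f in files for s in range(per_file)}
    assert len(seen) == len(expected) and set(seen) == expected


def test_shuffle_determinism():
    # Same seed gives the same order; a different seed gives a different one.
    assert np.array_equal(make_permutation(1000, 42), make_permutation(1000, 42))
    assert not np.array_equal(make_permutation(1000, 42), make_permutation(1000, 43))


def test_memory_bound():
    # The stored state is exactly 8 bytes per sample.
    perm = make_permutation(100_000, seed=0)
    assert perm.nbytes == 8 * 100_000
```

Written this way, the functions are collected by pytest without any extra setup, and each one checks a single property.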

get_global_map_index() materialized a Python dict with 1.74B entries (~350 GB) for DLRM, exceeding host memory before any parquet I/O. VirtualIndexMap stores only the numpy permutation array (~14 GB) and computes file mappings on demand via integer division. 96% memory reduction. 15 unit tests included.

Fixes mlcommons/storage#329

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>

FileSystemGuy

@russfellows

idevasena requested a review from a team April 11, 2026 02:19

FileSystemGuy approved these changes Apr 11, 2026

idevasena wants to merge 1 commit into main from fix/dlrm-oom-virtual-index-map
