Fix OOM in get_global_map_index for billion-scale sample workloads (DLRM)#11
Open
get_global_map_index() materialized a Python dict with 1.74B entries (~350 GB) for DLRM, exceeding host memory before any parquet I/O. VirtualIndexMap stores only the numpy permutation array (~14 GB) and computes file mappings on demand via integer division. 96% memory reduction. 15 unit tests included.

Fixes mlcommons/storage#329

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
Problem
get_global_map_index() materializes a Python dict mapping every sample index to its (filename, sample_index) tuple. For DLRM with 369 files × 4,718,592 samples/file = 1.74 billion entries, this dict alone consumes ~350 GB, exceeding typical host memory (247 GB in the reported case, mlcommons/storage#329) and triggering the Linux OOM killer (signal 9).

Fixes mlcommons/storage#329
Root Cause
The mapping global_index → (filename, sample_idx) is trivially computable on the fly via integer division: the file index is global_index // samples_per_file and the in-file offset is global_index % samples_per_file. Materializing all 1.74B entries in a Python dict wastes ~350 GB when only the shuffle permutation (~14 GB numpy array) is needed.
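A minimal sketch of the on-demand computation described above; the function and parameter names here are illustrative assumptions, not the PR's actual code:

```python
def map_global_index(global_index, files, samples_per_file):
    """Compute (filename, sample_idx) arithmetically instead of
    looking it up in a materialized dict."""
    file_idx = global_index // samples_per_file   # which file holds the sample
    sample_idx = global_index % samples_per_file  # offset within that file
    return files[file_idx], sample_idx

files = ["part_0.parquet", "part_1.parquet", "part_2.parquet"]
print(map_global_index(5, files, samples_per_file=4))  # ('part_1.parquet', 1)
```

The dict-based approach stores this result once per sample; the arithmetic version costs two integer operations per lookup and no memory at all.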
Fix
Introduce VirtualIndexMap, a dict-like class that stores only the shuffled permutation array and computes file mappings on demand via __getitem__. Memory usage drops from ~362 GB to ~14 GB.

Files Changed:

- dlio_benchmark/utils/config.py: Add VirtualIndexMap class; modify get_global_map_index() and build_sample_map_iter()
- tests/test_virtual_index_map.py: New unit tests

Testing
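The VirtualIndexMap idea above can be sketched roughly as follows; aside from the class name and __getitem__, the constructor signature and internals are assumptions for illustration, not the PR's actual implementation:

```python
import numpy as np

class VirtualIndexMap:
    """Dict-like view over shuffled samples. Holds only the permutation
    array (~8 bytes/sample); (filename, sample_idx) pairs are computed
    lazily in __getitem__ via integer division."""

    def __init__(self, filenames, samples_per_file, seed=None):
        self.filenames = filenames
        self.samples_per_file = samples_per_file
        total = len(filenames) * samples_per_file
        # The only large allocation: one int64 permutation array.
        self.permutation = np.random.default_rng(seed).permutation(total)

    def __len__(self):
        return len(self.permutation)

    def __getitem__(self, global_index):
        shuffled = int(self.permutation[global_index])
        file_idx, sample_idx = divmod(shuffled, self.samples_per_file)
        return self.filenames[file_idx], sample_idx

m = VirtualIndexMap(["f0.parquet", "f1.parquet"], samples_per_file=3, seed=0)
print(m[0])  # some (filename, sample_idx) pair, depending on the permutation
```

Because the permutation is a bijection, iterating over all indices still visits every (file, offset) pair exactly once, which is why the dict can be replaced without changing shuffle semantics.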