Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Conversation
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp `--workers` for `preprocess_data.py` to `max(1, self._args.num_workers)` to avoid `multiprocessing.Pool(0)` failures.
- Add an inline comment documenting why preprocessing clamps workers while other components may allow `num_workers=0`.
- Megatron's preprocess_data.py appends '_text_document' to the
--output-prefix when producing the .bin/.idx files. Derive the
output-prefix from --data_prefix (stripping a trailing
'_text_document' suffix when present) so that the generated files
match the existence checks for any custom data_prefix value, instead
of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
num_workers=0 clamps to '--workers 1' with default data_prefix
('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
with custom 'custom_text_document' -> '<data_home>/custom', and a
data_prefix without the suffix used as-is.
Avoid '--workers 1' matching '--workers 10' etc.
a475b85 to ee2839e
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #800 +/- ##
==========================================
+ Coverage 85.69% 85.85% +0.15%
==========================================
  Files 103 103
  Lines 7890 7894 +4
==========================================
+ Hits 6761 6777 +16
+ Misses 1129 1117 -12
…et generate test
- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
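To illustrate the brittleness being fixed, the substring pitfall and a parsed-flag alternative can be sketched with the standard library. `get_flag_value` here is a hypothetical stand-in for the project's actual `normalize_command()`-based checks:

```python
import shlex

def get_flag_value(command: str, flag: str) -> str:
    """Return the token that follows `flag` in a shell command string.

    A plain substring check such as `'--workers 1' in command` is brittle:
    it also matches '--workers 10'. Parsing the command into tokens and
    comparing the actual flag value tests semantics, not formatting.
    """
    tokens = shlex.split(command)
    return tokens[tokens.index(flag) + 1]

cmd = 'python tools/preprocess_data.py --workers 10 --output-prefix /data/custom'
print('--workers 1' in cmd)              # True: the substring check passes wrongly
print(get_flag_value(cmd, '--workers'))  # 10
```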
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
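The `side_effect` pattern described above can be sketched with `unittest.mock`. The helper names and the `dataset_text_document` filenames below are illustrative, not the benchmark's actual code:

```python
import os
import tempfile
from unittest import mock

def make_fake_run_command(data_home):
    """Build a run_command replacement that, instead of launching Megatron
    preprocessing, creates the .bin/.idx pair an existence check would look
    for, then reports success."""
    def fake_run_command(command):
        for ext in ('.bin', '.idx'):
            path = os.path.join(data_home, 'dataset_text_document' + ext)
            open(path, 'w').close()  # create an empty placeholder file
        return 0  # exit code 0 == success
    return fake_run_command

with tempfile.TemporaryDirectory() as data_home:
    run_command = mock.Mock(side_effect=make_fake_run_command(data_home))
    run_command('python tools/preprocess_data.py ...')
    run_command.assert_called_once()
    print(os.path.exists(os.path.join(data_home, 'dataset_text_document.bin')))
```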
- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value.
- Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case.
- Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.
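A minimal sketch of the warn-and-clamp behavior described above; the function name and log wording are illustrative, not the PR's exact code:

```python
import logging

logger = logging.getLogger(__name__)

def preprocess_worker_count(num_workers: int) -> int:
    """Worker count to hand to tools/preprocess_data.py.

    The DataLoader keeps the original num_workers; only the preprocessing
    subprocess receives the clamped value, and the override is logged so
    it is visible rather than silent.
    """
    if num_workers < 1:
        logger.warning(
            'num_workers=%d is invalid for preprocess_data.py; '
            'clamping to 1 for the preprocessing step only.', num_workers)
        return 1
    return num_workers

print(preprocess_worker_count(0))  # 1
print(preprocess_worker_count(4))  # 4
```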
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
# when producing the .bin/.idx files. Derive the output-prefix from
# data_prefix (stripping the '_text_document' suffix when present) so that
# the generated files match the existence checks above for any custom
# data_prefix value.
output_prefix_basename = self._args.data_prefix
if output_prefix_basename.endswith('_text_document'):
    stripped = output_prefix_basename[:-len('_text_document')]
    # Guard against data_prefix == '_text_document' which would
    # leave an empty basename and produce a malformed --output-prefix
    # ending in '/'. Fall back to the original value in that case.
    output_prefix_basename = stripped or output_prefix_basename
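Pulled out into a self-contained helper for illustration, the suffix-strip logic behaves as follows (`derive_output_prefix` is a hypothetical name, not the benchmark's API):

```python
import os

def derive_output_prefix(data_prefix: str, data_home: str) -> str:
    """Derive the --output-prefix for preprocess_data.py, which will
    append '_text_document' itself when writing the .bin/.idx pair."""
    suffix = '_text_document'
    basename = data_prefix
    if basename.endswith(suffix):
        # An empty result means data_prefix was exactly '_text_document';
        # fall back to the original value to avoid a trailing-slash prefix.
        basename = basename[:-len(suffix)] or basename
    return os.path.join(data_home, basename)

print(derive_output_prefix('dataset_text_document', '/data'))  # /data/dataset
print(derive_output_prefix('mydata', '/data'))                 # /data/mydata
print(derive_output_prefix('_text_document', '/data'))         # /data/_text_document
```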
# Case 3: data_prefix without the '_text_document' suffix is used as-is.
_run_case(
    extra_params='--num_workers 2 --data_prefix mydata',
    expected_workers=2,
    expected_prefix_basename='mydata',
    expected_data_prefix='mydata',
)

# Case 4: edge case - data_prefix == '_text_document' should NOT strip down
# to an empty basename (which would produce '--output-prefix <data_home>/').
# Fall back to using '_text_document' as the basename.
_run_case(
    extra_params='--num_workers 1 --data_prefix _text_document',
    expected_workers=1,
    expected_prefix_basename='_text_document',
    expected_data_prefix='_text_document',
)
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (a valid value for DataLoader, where it means "load in the main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.
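A minimal standalone reproduction of the underlying failure and the clamp, using only the standard library:

```python
from multiprocessing import Pool

num_workers = 0  # valid for DataLoader, fatal for multiprocessing.Pool

# preprocess_data.py builds a Pool(workers); Pool(0) is rejected outright.
try:
    Pool(processes=num_workers)
except ValueError as exc:
    print(exc)  # Number of processes must be at least 1

# The fix: clamp only the value passed to the preprocessing subprocess,
# leaving num_workers itself untouched for DataLoader.
preprocess_workers = max(1, num_workers)
print(preprocess_workers)  # 1
```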