Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Conversation
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp `--workers` for `preprocess_data.py` to `max(1, self._args.num_workers)` to avoid `multiprocessing.Pool(0)` failures.
- Add an inline comment documenting why preprocessing clamps workers while other components may allow `num_workers=0`.
- Megatron's preprocess_data.py appends '_text_document' to the
--output-prefix when producing the .bin/.idx files. Derive the
output-prefix from --data_prefix (stripping a trailing
'_text_document' suffix when present) so that the generated files
match the existence checks for any custom data_prefix value, instead
of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
num_workers=0 clamps to '--workers 1' with default data_prefix
('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
with custom 'custom_text_document' -> '<data_home>/custom', and a
data_prefix without the suffix used as-is.
Avoid '--workers 1' matching '--workers 10' etc.
a475b85 to ee2839e
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #800 +/- ##
==========================================
+ Coverage 85.69% 85.85% +0.15%
==========================================
  Files 103 103
  Lines 7890 7894 +4
==========================================
+ Hits 6761 6777 +16
+ Misses 1129 1117 -12
…et generate test
- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
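To illustrate the brittleness being fixed, the substring pitfall and a parsed-flag alternative can be sketched with the standard library. `get_flag_value` here is a hypothetical stand-in for the project's actual `normalize_command()`-based checks:

```python
import shlex

def get_flag_value(command: str, flag: str) -> str:
    """Return the token that follows `flag` in a shell command string.

    A plain substring check such as `'--workers 1' in command` is brittle:
    it also matches '--workers 10'. Parsing the command into tokens and
    comparing the actual flag value tests semantics, not formatting.
    """
    tokens = shlex.split(command)
    return tokens[tokens.index(flag) + 1]

cmd = 'python tools/preprocess_data.py --workers 10 --output-prefix /data/custom'
print('--workers 1' in cmd)              # True: the substring check passes wrongly
print(get_flag_value(cmd, '--workers'))  # 10
```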
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
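The `side_effect` pattern described above can be sketched with `unittest.mock`. The helper names and the `dataset_text_document` filenames below are illustrative, not the benchmark's actual code:

```python
import os
import tempfile
from unittest import mock

def make_fake_run_command(data_home):
    """Build a run_command replacement that, instead of launching Megatron
    preprocessing, creates the .bin/.idx pair an existence check would look
    for, then reports success."""
    def fake_run_command(command):
        for ext in ('.bin', '.idx'):
            path = os.path.join(data_home, 'dataset_text_document' + ext)
            open(path, 'w').close()  # create an empty placeholder file
        return 0  # exit code 0 == success
    return fake_run_command

with tempfile.TemporaryDirectory() as data_home:
    run_command = mock.Mock(side_effect=make_fake_run_command(data_home))
    run_command('python tools/preprocess_data.py ...')
    run_command.assert_called_once()
    print(os.path.exists(os.path.join(data_home, 'dataset_text_document.bin')))
```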
- Warn when num_workers is silently clamped from 0 to 1 for the preprocess subprocess so the user sees the override in the log. The DataLoader still receives the original num_workers value.
- Guard the '_text_document' suffix-strip against the edge case where data_prefix == '_text_document' (which would otherwise produce a malformed --output-prefix '<data_home>/' with an empty basename). Fall back to the original data_prefix value in that case.
- Extend test_megatron_gpt_dataset_generate_command with a 4th case asserting the empty-basename fallback.
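A minimal sketch of the warn-and-clamp behavior described above; the function name and log wording are illustrative, not the PR's exact code:

```python
import logging

logger = logging.getLogger(__name__)

def preprocess_worker_count(num_workers: int) -> int:
    """Worker count to hand to tools/preprocess_data.py.

    The DataLoader keeps the original num_workers; only the preprocessing
    subprocess receives the clamped value, and the override is logged so
    it is visible rather than silent.
    """
    if num_workers < 1:
        logger.warning(
            'num_workers=%d is invalid for preprocess_data.py; '
            'clamping to 1 for the preprocessing step only.', num_workers)
        return 1
    return num_workers

print(preprocess_worker_count(0))  # 1
print(preprocess_worker_count(4))  # 4
```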
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
# when producing the .bin/.idx files. Derive the output-prefix from
# data_prefix (stripping the '_text_document' suffix when present) so that
# the generated files match the existence checks above for any custom
# data_prefix value.
output_prefix_basename = self._args.data_prefix
if output_prefix_basename.endswith('_text_document'):
    stripped = output_prefix_basename[:-len('_text_document')]
    # Guard against data_prefix == '_text_document' which would
    # leave an empty basename and produce a malformed --output-prefix
    # ending in '/'. Fall back to the original value in that case.
    output_prefix_basename = stripped or output_prefix_basename
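Pulled out into a self-contained helper for illustration, the suffix-strip logic behaves as follows (`derive_output_prefix` is a hypothetical name, not the benchmark's API):

```python
import os

def derive_output_prefix(data_prefix: str, data_home: str) -> str:
    """Derive the --output-prefix for preprocess_data.py, which will
    append '_text_document' itself when writing the .bin/.idx pair."""
    suffix = '_text_document'
    basename = data_prefix
    if basename.endswith(suffix):
        # An empty result means data_prefix was exactly '_text_document';
        # fall back to the original value to avoid a trailing-slash prefix.
        basename = basename[:-len(suffix)] or basename
    return os.path.join(data_home, basename)

print(derive_output_prefix('dataset_text_document', '/data'))  # /data/dataset
print(derive_output_prefix('mydata', '/data'))                 # /data/mydata
print(derive_output_prefix('_text_document', '/data'))         # /data/_text_document
```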
# Case 3: data_prefix without the '_text_document' suffix is used as-is.
_run_case(
    extra_params='--num_workers 2 --data_prefix mydata',
    expected_workers=2,
    expected_prefix_basename='mydata',
    expected_data_prefix='mydata',
)

# Case 4: edge case - data_prefix == '_text_document' should NOT strip down
# to an empty basename (which would produce '--output-prefix <data_home>/').
# Fall back to using '_text_document' as the basename.
_run_case(
    extra_params='--num_workers 1 --data_prefix _text_document',
    expected_workers=1,
    expected_prefix_basename='_text_document',
    expected_data_prefix='_text_document',
)
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (a valid value for DataLoader, where it means "load in the main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.
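A minimal standalone reproduction of the underlying failure and the clamp, using only the standard library:

```python
from multiprocessing import Pool

num_workers = 0  # valid for DataLoader, fatal for multiprocessing.Pool

# preprocess_data.py builds a Pool(workers); Pool(0) is rejected outright.
try:
    Pool(processes=num_workers)
except ValueError as exc:
    print(exc)  # Number of processes must be at least 1

# The fix: clamp only the value passed to the preprocessing subprocess,
# leaving num_workers itself untouched for DataLoader.
preprocess_workers = max(1, num_workers)
print(preprocess_workers)  # 1
```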