[bugfix]: fix SP deadlock in negative prompt encoding during training#1178
alexzms wants to merge 4 commits into hao-ai-lab:main from
Conversation
Summary of Changes

This pull request addresses a deadlock encountered during distributed training with sequence parallelism. The previous implementation of negative prompt encoding centralized the work on a single rank, so the other ranks waited indefinitely on a collective operation that never completed. The updated approach decentralizes the encoding: each rank processes its negative prompt independently, which resolves the deadlock and restores normal distributed training.
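A minimal sketch of the control-flow change described above. The function names and the `encode`/`broadcast` callables are hypothetical stand-ins for illustration, not the project's actual API:

```python
def encode_negative_old(rank, encode, broadcast, prompt):
    # Before: only rank 0 encodes, then the embeddings are broadcast.
    # Building the encoder pipeline on rank 0 triggered a collective
    # warmup that the other ranks never entered, hence the deadlock.
    embeds = encode(prompt) if rank == 0 else None
    return broadcast(embeds, src=0)

def encode_negative_new(rank, encode, prompt):
    # After: every rank encodes the (identical) negative prompt itself,
    # so no collective communication is needed at all.
    return encode(prompt)
```

With simple stubs (e.g. `encode=len` and a pass-through `broadcast`), both versions return the same result on rank 0; the difference is purely in which ranks must participate.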
Code Review
This pull request resolves a deadlock in distributed training with sequence parallelism by refactoring the negative prompt encoding. The previous approach of encoding on a single rank and broadcasting the results is replaced by independent encoding on each rank. This change simplifies the code and eliminates the collective communication entirely. The implementation is solid, but I have one suggestion to improve robustness: use the configured precision for the text encoder instead of a hardcoded value.
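The suggestion above can be sketched as follows. The `PRECISION_TO_TYPE` mapping and the config attribute name mirror what appears in the diff, but the helper and the string dtype values here are illustrative assumptions (in the real code the mapping would yield torch dtypes):

```python
# Hypothetical sketch: resolve the text-encoder dtype from the pipeline
# config instead of hardcoding one. In practice the values would be
# torch dtypes (torch.float32, ...); strings are used here for brevity.
PRECISION_TO_TYPE = {
    "fp32": "float32",
    "fp16": "float16",
    "bf16": "bfloat16",
}

def resolve_text_encoder_dtype(text_encoder_precisions):
    # Use the precision configured for encoder index 0 rather than a
    # hardcoded dtype, so the rebuilt encoder matches the pipeline.
    return PRECISION_TO_TYPE[text_encoder_precisions[0]]
```

This keeps the negative-prompt encoder consistent with whatever precision the rest of the pipeline was configured with.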
```python
mask_sizes = tuple(int(x) for x in mask_shape[:mask_ndim].tolist())
text_enc_dtype = PRECISION_TO_TYPE[tc.pipeline_config.text_encoder_precisions[0]]
tokenizer = AutoTokenizer.from_pretrained(os.path.join(model_path, "tokenizer"))
text_encoder = T5EncoderModel.from_pretrained(
```
Should it still be possible to just use the prompt encoding stage instead of creating a new instance? I'm pretty sure this is the wrong encoder for Wan; there's also a UMT5EncoderModel, which should be loaded by the existing prompt stage.
Summary
With `sp_size > 1`, `ensure_negative_conditioning()` created an inference pipeline on rank 0 only, but `post_init()` called `warmup_sequence_parallel_communication()`, which performs an all-to-all on the SP group. The other ranks never participated in the collective, causing a hang.
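The hang described above can be reproduced in miniature with threads standing in for SP ranks and a barrier standing in for the all-to-all warmup collective (a collective only completes when every rank in the group enters it). All names here are illustrative, not the project's API:

```python
import threading

WORLD_SIZE = 4  # simulated SP group size

def run(participating_ranks):
    # Barrier stands in for the SP warmup collective: it completes only
    # when all WORLD_SIZE ranks enter it; otherwise it times out.
    barrier = threading.Barrier(WORLD_SIZE, timeout=0.5)
    broken = []

    def rank_fn(rank):
        if rank in participating_ranks:
            try:
                barrier.wait()
            except threading.BrokenBarrierError:
                broken.append(rank)

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return broken

# Buggy pattern: only rank 0 builds the pipeline, so only rank 0 enters
# the collective and blocks (here: times out instead of hanging forever).
assert run({0}) == [0]

# If every rank entered the collective, all would make progress; the
# actual fix goes further and avoids the rank-0-only pipeline entirely.
assert run({0, 1, 2, 3}) == []
```

The timeout turns what would be an indefinite hang in real training into a `BrokenBarrierError`, making the failure mode visible in a unit test.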
`sp_size=4` on 4 and 8 GPUs. No hangs, no quality degradation.