
HG - Hardware-specific sharding test and instructions with torchrun #1479

Merged
cirquit merged 120 commits into main from rsy/hg-model-hardware-test on Mar 19, 2026


@rsyue rsyue commented Dec 10, 2025

What does this PR do? Please describe:
When more than one GPU is present, model sharding can be tested with `torchrun --nproc-per-node 8 test_hg_factory.py`.
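As a rough illustration of how a test script might decide whether it was launched under torchrun with more than one process, here is a minimal sketch. It relies only on the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that torchrun sets for each worker. The names `DistEnv`, `read_torchrun_env`, and `should_run_sharding_test` are hypothetical and are not taken from `test_hg_factory.py`.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    """Process coordinates exported by torchrun (hypothetical helper type)."""

    rank: int
    local_rank: int
    world_size: int


def read_torchrun_env(env: Optional[Mapping[str, str]] = None) -> Optional[DistEnv]:
    """Return the torchrun coordinates, or None when not launched by torchrun."""
    if env is None:
        env = os.environ
    try:
        return DistEnv(
            rank=int(env["RANK"]),
            local_rank=int(env["LOCAL_RANK"]),
            world_size=int(env["WORLD_SIZE"]),
        )
    except KeyError:
        # At least one variable is missing: assume a plain `python` launch.
        return None


def should_run_sharding_test(env: Mapping[str, str], gpu_count: int) -> bool:
    """Assumed policy: shard only with several processes and several GPUs."""
    dist = read_torchrun_env(env)
    return dist is not None and dist.world_size > 1 and gpu_count > 1
```

In a real test, `gpu_count` would typically come from `torch.cuda.device_count()`, and a `False` result would translate into skipping the sharding test rather than failing it.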

Does your PR introduce any breaking changes? If yes, please list them:
None that I am aware of.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

zyaoj and others added 24 commits September 20, 2025 11:48
…del) and changed allowed patterns to allow for the json index file
@meta-cla meta-cla bot added the CLA Signed label Dec 10, 2025
rsyue and others added 24 commits March 18, 2026 15:28
Co-authored-by: Alexander Erben <aerben@meta.com>
Co-authored-by: Alexander Erben <alex.erben@tum.de>
Restore column alignment in sharder files, structured.py, and
revert style-only change in basic.py and parquet test.
@cirquit cirquit self-requested a review March 19, 2026 00:14
@cirquit cirquit left a comment

LGTM! Nice work on addressing all of these.

@cirquit cirquit merged commit 91d24c6 into main Mar 19, 2026
19 checks passed
cirquit added a commit that referenced this pull request Mar 25, 2026
**Summary**
- Fix test and lint regressions introduced in v0.8 development
- Add backward compatibility shims for breaking API changes
- Update CHANGELOG and README for v0.8 release

**Regressions**
- Fix `test_get_shard_dims_work` device mismatch when running with `--device cuda`
- Replace deprecated `datetime.utcnow()` with `datetime.now(timezone.utc)`
- Bump `black` to `~=26.3` (CVE fix) and reformat lines that the new parser rejects
- Fix `Flash3SDPA` to support `flash-attn-3` v3.0.0 API (#1495)
- Pin `pandas~=2.2` for Python 3.12 compatibility
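
The `datetime.utcnow()` replacement named in the list above can be sketched as follows. This is a generic illustration of the standard-library change, not the code from the referenced commit, and the helper name `utc_now` is hypothetical. The practical difference is that `datetime.utcnow()` returns a naive datetime (no `tzinfo`), while `datetime.now(timezone.utc)` returns a timezone-aware one.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    # Replaces datetime.utcnow(), which is deprecated since Python 3.12
    # and returns a naive datetime with tzinfo set to None.
    return datetime.now(timezone.utc)


now = utc_now()
# The result carries an explicit UTC offset of zero.
print(now.isoformat(), now.utcoffset())
```

Aware and naive datetimes cannot be compared or subtracted from one another, so call sites that mix the two may need adjusting when making this switch.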

**Backward compatibility shims**
- Add re-export shims for `fairseq2.recipe.validator` and `fairseq2.recipe.task` (#1417)
- Add deprecated `resolve_optional()` on `DependencyResolver` (#1462)
- Add deprecated `ModelCheckpointError` alias for `CorruptModelCheckpointError` (#1475)

**Release prep**
- Update CHANGELOG with missing entries, PR references, and new features (#1479, #1496)
- Add v0.7 and v0.8 rows to README version matrix
YunchaoYang pushed a commit that referenced this pull request Mar 31, 2026
5 participants