fix: wrap dataset loading in main_process_first to prevent cache race #20

Merged
Neonkraft merged 2 commits into main from feat/main-process-first
Apr 30, 2026

Conversation

Collaborator

@Neonkraft Neonkraft commented Apr 29, 2026

Summary

On multi-node runs, all ranks try to build the HF datasets cache simultaneously on Lustre. Because Lustre lacks the local-filesystem locking semantics that the HF datasets library assumes, concurrent writes corrupt the cache or crash the job. Wrapping load_and_mix_datasets() in PartialState().main_process_first() ensures rank 0 populates the cache first; all other ranks then read from it.
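The ordering described above can be illustrated with a small standard-library simulation. This is only a sketch of the general "main process first" barrier pattern, not Accelerate's implementation and not the code in this PR: threads stand in for ranks, a dictionary stands in for the shared cache, and `WORLD_SIZE`, the `main_process_first(rank)` helper, and the body of `load_and_mix_datasets(rank)` are all hypothetical stand-ins.

```python
import threading
from contextlib import contextmanager

WORLD_SIZE = 4  # hypothetical number of ranks; threads stand in for processes
_barrier = threading.Barrier(WORLD_SIZE)


@contextmanager
def main_process_first(rank):
    """Run the body on rank 0 first; every other rank waits, then runs it."""
    if rank != 0:
        _barrier.wait()      # non-main ranks block here until rank 0 is done
    try:
        yield
    finally:
        if rank == 0:
            _barrier.wait()  # rank 0 reaches the barrier and releases the rest


cache = {}                   # stands in for the shared datasets cache directory
events = []                  # (rank, action) pairs in the order they happened
_events_lock = threading.Lock()


def load_and_mix_datasets(rank):
    """Hypothetical stand-in: build the cache if it is missing, else read it."""
    if "dataset" not in cache:
        cache["dataset"] = [1, 2, 3]   # the expensive "build" step
        action = "built"
    else:
        action = "read"
    with _events_lock:
        events.append((rank, action))
    return cache["dataset"]


def worker(rank, results):
    # Only rank 0 can enter the body before the barrier opens.
    with main_process_first(rank):
        results[rank] = load_and_mix_datasets(rank)


def run_simulation():
    results = {}
    threads = [
        threading.Thread(target=worker, args=(rank, results))
        for rank in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_simulation()
print(events)
```

In this sketch rank 0 is always the one that builds the cache, and the remaining ranks only ever read it. The order in which the other ranks read is not deterministic.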

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Performance
  • Documentation
  • Maintenance

…che races

On multi-node runs, all ranks race to build the HF datasets cache
simultaneously on Lustre, causing corruption or crashes. Wrapping
load_and_mix_datasets() in PartialState().main_process_first() lets
rank 0 populate the cache first; all other ranks then read from it.
@Neonkraft Neonkraft requested a review from KonstiNik April 29, 2026 14:00
@Neonkraft Neonkraft changed the title from "feat: wrap dataset loading in main_process_first to prevent Lustre cache races" to "feat: wrap dataset loading in main_process_first to prevent cache race" Apr 29, 2026
Collaborator

@KonstiNik KonstiNik left a comment

LGTM. Very important addition in my opinion.
I believe it's debatable whether this is a feature or a bug fix.

@Neonkraft Neonkraft changed the title from "feat: wrap dataset loading in main_process_first to prevent cache race" to "fix: wrap dataset loading in main_process_first to prevent cache race" Apr 30, 2026
Collaborator Author

> I believe it's debatable whether this is a feature or a bug fix.

Fair point, this is more of a fix.

@Neonkraft Neonkraft merged commit d7118b1 into main Apr 30, 2026
7 of 8 checks passed

2 participants