Question about fault tolerance threshold (f) and zero3

Hi @insujang,

Thank you for open-sourcing Oobleck. It's an impressive piece of work!

I noticed in the paper that there is a parameter f that controls the fault tolerance threshold. However, I couldn’t find it in the codebase. Is there a way to configure or control this parameter? Additionally, is there a default value set for f?

My second question is about ZeRO-3, which I noticed is being used. With ZeRO-3, each GPU holds a unique shard of the model parameters, gradients, and optimizer states, with no redundant copy on other GPUs. If a node fails, the shards held by its GPUs would be lost as well. How are they recovered? If my understanding is incorrect, please feel free to point it out.
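To make the concern concrete, here is a toy sketch of how I picture the sharding. This is not Oobleck or DeepSpeed code: `shard_ranges` and `lost_elements` are hypothetical helpers I wrote for illustration, and the contiguous, near-equal split is my assumption about the layout.

```python
def shard_ranges(num_elements, world_size):
    """Split [0, num_elements) into contiguous, near-equal shards, one per rank."""
    base, extra = divmod(num_elements, world_size)
    ranges = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks each take one additional element.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def lost_elements(num_elements, world_size, failed_ranks):
    """Count state elements that exist only on the failed ranks (no replication assumed)."""
    ranges = shard_ranges(num_elements, world_size)
    return sum(end - start for start, end in (ranges[r] for r in failed_ranks))


if __name__ == "__main__":
    print(shard_ranges(10, 4))        # [(0, 3), (3, 6), (6, 8), (8, 10)]
    print(lost_elements(10, 4, {1}))  # 3
```

In this picture, losing rank 1 removes elements 3 to 5 outright unless another copy exists somewhere, which is the case I am asking about.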

Looking forward to your response. Thanks again for your contributions!

Question about fault tolerance threshold (f) and zero3 #27

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Question about fault tolerance threshold (f) and zero3 #27

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions