Hi @insujang,
Thank you for open-sourcing Oobleck—it’s an impressive piece of work!
The paper describes a parameter `f` that controls the fault-tolerance threshold, but I couldn’t find it in the codebase. Is there a way to configure it, and does it have a default value?
My second question is about ZeRO-3, which I noticed is being used. With traditional ZeRO-3, each GPU holds a unique shard of the model and optimizer states, so if one node fails, the shards it held would be lost as well. How are they recovered? If my understanding is incorrect, please feel free to point it out.
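To make the concern concrete, here is a minimal sketch of how I understand the sharding. It is not taken from Oobleck or DeepSpeed: `shard_ranges` and `lost_indices` are hypothetical helpers, and the contiguous, even split is an assumption made only for illustration.

```python
def shard_ranges(num_params, world_size):
    """Split indices [0, num_params) into disjoint contiguous shards, one per rank.

    Assumption: an even contiguous split; real ZeRO-3 partitioning may differ.
    """
    base, extra = divmod(num_params, world_size)
    ranges, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional element.
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def lost_indices(num_params, world_size, failed_ranks):
    """Return the indices whose only copy lived on a failed rank."""
    surviving = set()
    for rank, shard in enumerate(shard_ranges(num_params, world_size)):
        if rank not in failed_ranks:
            surviving.update(shard)
    return sorted(set(range(num_params)) - surviving)


if __name__ == "__main__":
    # 10 state elements over 4 ranks; rank 1 fails.
    print(lost_indices(10, 4, {1}))
```

Because the shards are disjoint, every index held by a failed rank has no surviving copy in this model, which is why I am unsure how recovery works without replication or a checkpoint.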
Looking forward to your response. Thanks again for your contributions!