FSDP now has a dedicated auto-wrap policy for Transformer models. It lets FSDP build a model-aware sharding plan for how the model is split across GPUs, which can yield significant improvements in training time.
Here's the video: https://www.loom.com/share/2cc2633fa69940789f7f886cbe1fef79
The accompanying notebook is included in this dir - transformer wrapper tutorial.ipynb
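For reference, here is a minimal sketch of what the video and notebook cover, using PyTorch's `transformer_auto_wrap_policy` (available in `torch.distributed.fsdp.wrap` since PyTorch 1.12). The `Block` class below is a stand-in for your model's real transformer layer class (e.g. `GPT2Block`, `T5Block`, `BertLayer`):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class Block(nn.Module):
    """Stand-in for a real transformer layer class."""

    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(64, 64)
        self.mlp = nn.Linear(64, 64)

    def forward(self, x):
        return self.mlp(self.attn(x))


model = nn.Sequential(*(Block() for _ in range(4)))

# Tell FSDP which module class marks one shardable "unit" of the model;
# this is what makes the sharding plan model-aware.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Block},
)

# With a process group initialized (e.g. via torchrun / init_process_group),
# FSDP then wraps each Block as its own shard:
# sharded_model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
```

The commented-out `FSDP(...)` call is left out of the runnable part because it requires an initialized distributed process group; the policy itself is just a `functools.partial` that FSDP calls while traversing the module tree.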