PyTorch_FSDP_Tutorials/throughput_max_gpu at main · jafraustro/PyTorch_FSDP_Tutorials

Maximizing your training speed with FSDP and GPU memory:

Conventional wisdom says that to maximize training throughput, you should raise your batch size until you hit an out-of-memory (OOM) error, back off slightly, and voilà: optimal throughput.

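This is not quite correct, though: to get maximum speed, you also need to make sure you are not hitting cudaMalloc retries. A retry happens when PyTorch's caching allocator fails a cudaMalloc call, frees its cached blocks, and tries again, which stalls training even though no OOM error is raised.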
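Retries can be watched from inside a training loop. The following is a minimal sketch, not the repo's Memory_Maximizer: it assumes the `num_alloc_retries` counter reported by `torch.cuda.memory_stats()`, and the `RetryMonitor` class name and its `step()` method are hypothetical names chosen for illustration.

```python
from typing import Callable, Mapping


class RetryMonitor:
    """Hypothetical helper: reports new allocator retries since the last check.

    `stats_fn` is any zero-argument callable returning a mapping that may
    contain a cumulative "num_alloc_retries" counter. In a real run this
    would typically be `torch.cuda.memory_stats` (assumption: CUDA build
    of PyTorch, called on the device you are training on).
    """

    def __init__(self, stats_fn: Callable[[], Mapping[str, int]]) -> None:
        self._stats_fn = stats_fn
        self._last = self._read()

    def _read(self) -> int:
        # The counter is cumulative; a missing key is treated as zero.
        return int(self._stats_fn().get("num_alloc_retries", 0))

    def step(self) -> int:
        """Return how many retries happened since the previous call."""
        current = self._read()
        delta = current - self._last
        self._last = current
        return delta


# Possible wiring inside a training loop (requires torch and a GPU):
#
#   import torch
#   monitor = RetryMonitor(torch.cuda.memory_stats)
#   for batch in loader:
#       train_step(batch)            # hypothetical training step
#       if monitor.step() > 0:
#           print("cudaMalloc retries detected; consider a smaller batch")
```

Taking the stats source as a callable keeps the helper free of a hard dependency on a GPU, so the same class can be exercised with a stub in tests.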

This tutorial walks through tuning a 2B-parameter model and shows the improvement from avoiding retries (25% greater throughput than the conventional practice). It also offers a utility you can add to your project to automatically monitor GPU info and retry counts while optimizing.
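One way to turn this into a tuning procedure is to try candidate batch sizes from largest to smallest and keep the first one whose trial run reports no retries. This is a sketch under stated assumptions rather than the tutorial's exact method: `largest_retry_free_batch` and the `count_retries` callback are hypothetical names, and the callback is assumed to run a few training steps at the given batch size and return the number of allocator retries observed.

```python
from typing import Callable, Iterable, Optional


def largest_retry_free_batch(
    candidates: Iterable[int],
    count_retries: Callable[[int], int],
) -> Optional[int]:
    """Return the largest candidate batch size with zero allocator retries.

    `count_retries(batch_size)` is a hypothetical callback: it should run a
    short trial at that batch size and return how many cudaMalloc retries
    occurred. Returns None if every candidate triggers retries.
    """
    # Start from the largest batch size and back off until retries vanish.
    for batch_size in sorted(candidates, reverse=True):
        if count_retries(batch_size) == 0:
            return batch_size
    return None
```

The result is only a starting point; measuring actual throughput at the chosen batch size is still needed to confirm the gain on your own model and hardware.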

Video: https://www.loom.com/share/2dd1bb59468640df876578835603d0a7

Notebook: throughput_max.ipynb

The Memory_Maximizer class in the accompanying gpu_memory.py automates this monitoring, so you can apply the tutorial's best practice in your own project.
