diffmoe

A mixture of experts (MOE) that is designed for vision transformers and diffusion transformers

diffmoe uses a batch pool of tokens, where experts are selected across an entire batch of tokens. This contrasts with token-choice MOE, where each token independantly selects the activated expert. The batch pool complicates training - batch statistics influence the experts selected during the forward pass. To resolve this, diffmoe trains a capacity predictor to ensure that during evaluation, the expert selection mimics the same expert selection dynamics as training. diffmoe is a major improvement. Information across image patches is spread out thinly - some image patches contain vital signals and should recieve extra compute (Experts) while other image patches are largely uninformative. diffmoe's batch pool allows allocating more capacity to complex samples.

An example of training a DiT to generate images of MNIST is in examples/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

diffmoe

Mini DiT with a Vanilla MLP, 100 epochs, generated images, no cfg

Final validation loss 0.041

Mini DiT with a DiffMOE MLP, 16 experts, 100 epochs, generated images, no cfg

Final validation loss 0.0322

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

diffmoe

Mini DiT with a Vanilla MLP, 100 epochs, generated images, no cfg

Final validation loss 0.041

Mini DiT with a DiffMOE MLP, 16 experts, 100 epochs, generated images, no cfg

Final validation loss 0.0322