diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml index d01e3afc..c3582df0 100644 --- a/.github/workflows/build-docs.yml +++ b/.github/workflows/build-docs.yml @@ -44,10 +44,13 @@ jobs: uv pip install \ "myst-parser>=4.0.1" \ "nvidia-sphinx-theme>=0.0.8" \ - "sphinx>=8.1.3" \ - "sphinx-autobuild>=2024.10.3" \ + "sphinx>=8.2.3" \ + "sphinx-autobuild>=2025.8.25" \ "sphinx-autodoc2>=0.5.0" \ - "sphinx-copybutton>=0.5.2" + "sphinx-copybutton>=0.5.2" \ + "sphinxcontrib-mermaid>=1.0.0" \ + "sphinx-design>=0.6.1" \ + "swagger-plugin-for-sphinx>=6.0.0" - name: Build documentation run: | diff --git a/docs/INFORMATION_CHECKLIST.md b/docs/INFORMATION_CHECKLIST.md new file mode 100644 index 00000000..e341ba7e --- /dev/null +++ b/docs/INFORMATION_CHECKLIST.md @@ -0,0 +1,524 @@ +# Information Preservation Checklist + +**Purpose**: Verify all unique information from old docs is captured in new structure. + +**How to Use**: Check off each item as it's integrated into the new docs. Items can be integrated anywhere logical in the new IA. + +--- + +## 1. Performance Benchmarks (`performance-summary.md`) + +**Target Location**: `docs/reference/performance.md` (REFERENCE) + +### Nomenclature Definitions +- [ ] **GBS**: Global Batch Size +- [ ] **MBS**: Micro Batch Size +- [ ] **FSDP**: Fully Sharded Data Parallel + - [ ] FSDP = 1: use FSDP + - [ ] FSDP = 0: use DDP (Distributed Data Parallel) +- [ ] **TP**: Tensor Parallel Size +- [ ] **SP**: Sequence Parallel +- [ ] **PP**: Pipeline Parallel Size +- [ ] **CP**: Context Parallel Size +- [ ] **VP**: Virtual Pipeline Parallel Size +- [ ] **EP**: Expert Parallel Size + +### Performance Metrics +- [ ] **Tokens/sec/GPU**: Throughput per GPU (explanation) +- [ ] **Model TFLOP/sec/GPU**: Model floating-point operations per second per GPU (explanation) + +### Benchmark Tables + +#### Megatron-Core Pre-Training Performance + +**DGX-GB200**: +- [ ] WAN 2.1 14B benchmark row (32 GPUs, GBS=64, MBS=1, SeqLen=37440, FSDP=0, TP=1, SP=0, PP=1, CP=4, VP=0, EP=0, TFLOP=787.59) + +**DGX-GB300**: +- [ ] WAN 2.1 14B benchmark row (32 GPUs, GBS=64, MBS=1, SeqLen=37440, FSDP=0, TP=1, SP=0, PP=1, CP=2, VP=0, EP=0, TFLOP=1,022.26) + +**DGX-H100**: +- [ ] WAN 2.1 14B benchmark row (128 GPUs, GBS=128, MBS=1, SeqLen=37440, FSDP=0, TP=2, SP=1, PP=1, CP=4, VP=0, EP=0, TFLOP=325.77) + +#### NeMo Automodel Pre-Training Performance + +**DGX-H100**: +- [ ] WAN 2.1 14B benchmark row (8 GPUs, GBS=8, MBS=1, SeqLen=37440, FSDP=1, DP=8, TP=1, SP=1, PP=1, CP=1, VP=0, EP=0, TFLOP=175.88) +- [ ] WAN 2.1 14B benchmark row (64 GPUs, GBS=64, MBS=1, SeqLen=37440, FSDP=1, DP=64, TP=1, SP=1, PP=1, CP=1, VP=0, EP=0, TFLOP=228.85) + +### Context Information +- [ ] Note about referring to `examples/megatron/recipes/wan/conf` for updated YAML configs +- [ ] Statement about ongoing optimization + +--- + +## 2. 
Paradigm Comparison (`mcore_automodel_comparision_wan21.md`) + +**Target Location**: `docs/about/comparison.md` OR integrate into `docs/about/concepts/training-paradigms.md` (EXPLANATION) + +### Experiment Overview +- [ ] Goal: Compare two training paths for WAN 2.1 +- [ ] Path 1: Diffusers + Automodel training path (with links) +- [ ] Path 2: Megatron-Core + Megatron-Bridge training path (with links) +- [ ] Two-stage training approach explanation +- [ ] Dataset: 3,000 videos (frames extracted for Stage 1) + +### Stage 1: Text-to-Image +- [ ] Extract 40 frames per video → 120k images +- [ ] Resolution: 240 × 416 +- [ ] Each frame uses same caption as parent video +- [ ] Global batch size: 2560 images +- [ ] Learning rate: warmup 10k → 5e-5 constant +- [ ] Hardware: 10 nodes (80 GPUs) +- [ ] Megatron-Core parallelism: TP=1, PP=1, CP=1, Sequence packing (32 samples/pack) +- [ ] Automodel parallelism: FSDP, micro_batch_size = 32 +- [ ] Training curve image: `lm_loss_text2image_3kvids.png` + +### Stage 2: Text-to-Video +- [ ] Full videos → 3,000 videos +- [ ] Resolution: 240 × 416, duration 4–8 seconds +- [ ] Global batch size: 80 videos +- [ ] Learning rate: 5e-5 constant +- [ ] Hardware: 10 nodes (80 GPUs) +- [ ] Megatron-Core parallelism: TP=1, PP=1, CP=1, micro_batch_size = 1 +- [ ] Automodel parallelism: FSDP, micro_batch_size = 1 +- [ ] Training curve image: `lm_loss_text2video_3kvids.png` + +### Results Analysis +- [ ] Note: Training loss smoothed with 50 steps averaging +- [ ] Observation: Training curves have similar value ranges but don't match exactly +- [ ] Explanation: Expected due to differences in implementation and training loop setups +- [ ] **Critical Caveat**: Megatron-Core applies same diffusion timesteps to all samples in pack (not different timesteps per sample) +- [ ] **Critical Caveat**: Training loss for Megatron-Core fluctuates more than AutoModel, especially at beginning + +### Context Notes +- [ ] Note: Partial convergence test (3K videos insufficient for generalization) +- [ ] Note: Only demonstrates reconstruction ability, not novel generation + +--- + +## 3. 
Automodel Training Information (`automodel_training_doc.md`) + +**Target Location**: Integrate into `docs/get-started/automodel.md` (TUTORIAL with progressive disclosure) + +### Overview +- [ ] Currently Supported: WAN 2.1 Text-to-Video (1.3B and 14B models) + +### Docker Setup +- [ ] Build command: `docker build -f docker/Dockerfile.ci -t dfm-training .` +- [ ] Run command with all flags (--gpus, -v mounts, --ipc=host, ulimit settings) +- [ ] Inside container: Initialize submodules command + +### Data Preparation + +#### Dataset Options +- [ ] Option 1: Start with raw videos (use data-preparation scripts) +- [ ] Option 2: Bring your own `meta.json` + +#### Dataset Structure +- [ ] Folder structure example (`/` with videos and `meta.json`) +- [ ] Note about per-video `.jsonl` captions being picked up automatically + +#### meta.json Schema +- [ ] Complete JSON schema with all fields: + - [ ] `file_name` + - [ ] `width` + - [ ] `height` + - [ ] `start_frame` + - [ ] `end_frame` + - [ ] `vila_caption` +- [ ] Example with two video entries + +#### Preprocessing Modes + +**Full Video Mode (`--mode video`)**: +- [ ] What it is: Converts each source video into single `.meta` preserving full temporal sequence +- [ ] When to use: Fine-tuning text-to-video models where motion/temporal consistency matter +- [ ] Status: Recommended default for most training runs +- [ ] Command example with all flags +- [ ] Output: Creates one `.meta` file per video + +**Extract Frames Mode (`--mode frames`)**: +- [ ] What it is: Uniformly samples N frames, writes each as one-frame `.meta` sample +- [ ] When to use: Image/frame-level training, quick smoke tests, ablations +- [ ] Command example with `--num-frames` flag +- [ ] Output: Creates one `.meta` file per frame + +#### Preprocessing Key Arguments +- [ ] `--mode`: `video` or `frames` explanation +- [ ] `--num-frames`: Number of frames to extract (frames mode only) +- [ ] `--height/--width`: Target resolution +- [ ] `--center-crop`: Crop to exact size after aspect-preserving resize + +#### Preprocessing Output +- [ ] Encoded video latents (normalized) +- [ ] Text embeddings (from UMT5) +- [ ] First frame as JPEG (video mode only) +- [ ] Metadata + +### Training + +#### Single-Node Training +- [ ] Command: `uv run --group automodel --with . 
torchrun --nproc-per-node=8 ...` +- [ ] Config file: `examples/automodel/finetune/wan2_1_t2v_flow.yaml` +- [ ] Note about `UV_PROJECT_ENVIRONMENT` export + +#### Multi-Node SLURM Training +- [ ] Complete SLURM script with all SBATCH directives +- [ ] MASTER_ADDR setup from SLURM_JOB_NODELIST +- [ ] MASTER_PORT setup +- [ ] Per-rank UV cache setup to avoid conflicts +- [ ] UV_CACHE_DIR per job/rank +- [ ] torchrun command with multi-node flags +- [ ] Config file: `wan2_1_t2v_flow_multinode.yaml` + +### Validation + +#### Validation Script Details +- [ ] Purpose: Quick qualitative check of trained checkpoint +- [ ] Reads prompts from `.meta` files in `--meta_folder` +- [ ] Uses `metadata.vila_caption` (latents ignored) +- [ ] Loads `WanPipeline` +- [ ] Checkpoint loading priority: `ema_shadow.pt` → `consolidated_model.bin` → sharded FSDP `model/*.distcp` +- [ ] Generation settings: `--guidance_scale`, `--num_inference_steps`, `--height/--width`, `--num_frames`, `--fps`, `--seed` +- [ ] Output: Writes videos to `--output_dir` +- [ ] Note: Qualitative comparison only, no quantitative metrics +- [ ] Command example +- [ ] Note: `--checkpoint ./checkpoints/LATEST` automatically uses most recent checkpoint + +### Configuration + +#### Fine-tuning Config (`wan2_1_t2v_flow.yaml`) +- [ ] Complete YAML config with all sections: + - [ ] `model.pretrained_model_name_or_path` + - [ ] `step_scheduler` (global_batch_size, local_batch_size, num_epochs, ckpt_every_steps) + - [ ] `data.dataloader` (meta_folder, num_workers) + - [ ] `optim.learning_rate` + - [ ] `flow_matching` (timestep_sampling, flow_shift) + - [ ] `fsdp.dp_size` + - [ ] `checkpoint` (enabled, checkpoint_dir) +- [ ] Note about canonical files in repository + +#### Multi-Node Config Differences +- [ ] `fsdp.dp_size`: Total data-parallel replicas (2 nodes × 8 GPUs = 16) +- [ ] `fsdp.dp_replicate_size`: Number of replicated groups across nodes (2) + +#### Pretraining vs Fine-tuning Comparison Table +- [ ] `learning_rate`: Fine-tuning (5e-6) vs Pretraining (5e-5) +- [ ] `weight_decay`: Fine-tuning (0.01) vs Pretraining (0.1) +- [ ] `flow_shift`: Fine-tuning (3.0) vs Pretraining (2.5) +- [ ] `logit_std`: Fine-tuning (1.0) vs Pretraining (1.5) +- [ ] Dataset size: Fine-tuning (100s-1000s) vs Pretraining (10K+) + +### Hardware Requirements Table +- [ ] GPU: Minimum (A100 40GB) vs Recommended (A100 80GB / H100) +- [ ] GPUs: Minimum (4) vs Recommended (8+) +- [ ] RAM: Minimum (128 GB) vs Recommended (256 GB+) +- [ ] Storage: Minimum (500 GB SSD) vs Recommended (2 TB NVMe) + +### Features List +- [ ] Flow Matching: Pure flow matching training +- [ ] Distributed: FSDP2 + Tensor Parallelism +- [ ] Mixed Precision: BF16 by default +- [ ] WandB: Automatic logging +- [ ] Checkpointing: consolidated and sharded formats +- [ ] Multi-node: SLURM and torchrun support + +### Supported Models Table +- [ ] WAN 2.1 T2V 1.3B: 1.3B params, FSDP2 via Automodel + DDP, Status ✅ +- [ ] WAN 2.1 T2V 14B: 14B params, FSDP2 via Automodel + DDP, Status ✅ +- [ ] FLUX: TBD params, TBD parallelization, Status 🔄 In Progress + +### Advanced Topics + +#### Custom Parallelization +- [ ] Example YAML: `fsdp.tp_size: 2`, `fsdp.dp_size: 4` + +#### Checkpoint Cleanup +- [ ] Python function: `cleanup_old_checkpoints(checkpoint_dir, keep_last_n=3)` +- [ ] Complete code example with Path and shutil usage + +--- + +## 4. 
DiT Model Information (`megatron/models/dit/README.md`) + +**Target Location**: Integrate into `docs/get-started/megatron.md` (TUTORIAL with progressive disclosure) + +### Overview +- [ ] DiT description: Open-source implementation of Diffusion Transformers +- [ ] Purpose: Training text-to-image/video models with EDM Pipeline +- [ ] Based on: Megatron-Core and Megatron-Bridge +- [ ] Parallelism support: Tensor, sequence, and context parallelism + +### Dataset Preparation + +#### Energon Data Loader +- [ ] Uses NVIDIA's Megatron-Energon +- [ ] WebDataset-compatible format (sharded `.tar` archives) +- [ ] Supports: Large-scale distributed loading, sharding, sampling for multi-modal pairs +- [ ] Set `dataset.path` to WebDataset location or shard pattern + +#### Butterfly Dataset Example +- [ ] Dataset: `huggan/smithsonian_butterflies_subset` on Hugging Face +- [ ] Script: `prepare_energon_dataset_butterfly.py` +- [ ] Command with `--nproc-per-node` +- [ ] Optional arguments: `--t5_cache_dir`, `--tokenizer_cache_dir` + +#### Energon Prepare Workflow +- [ ] Command: `energon prepare $dataset_path` +- [ ] Interactive prompts explanation: + - [ ] Train/val/test split entry (e.g., "1,0,0") + - [ ] Sample type selection: "Crude sample (plain dict for cooking)" (option 11) +- [ ] Sample structure: keys include `json`, `pickle`, `pth` +- [ ] Sample JSON content example (`image_height`, `image_width`) +- [ ] Note: CrudeWebdataset doesn't need field map +- [ ] Note: Need to provide `Cooker` in `TaskEncoder` +- [ ] Note: Can add `subflavors` in meta dataset specification + +### Container Build +- [ ] Reference to container section in main README + +### Pretraining + +#### Sequence Packing +- [ ] Purpose: Maximize training efficiency +- [ ] How it works: Stacks multiple samples into single sequence instead of padding +- [ ] Requirement: `micro_batch_size` must be set to 1 +- [ ] Requirement: `qkv_format` should be set to `thd` (signals Transformer Engine) +- [ ] Link to NeMo sequence packing documentation + +#### Sequence Packing Parameters +- [ ] `task_encoder_seq_length`: Controls maximum sequence length passed to model +- [ ] `packing_buffer_size`: Determines number of samples processed to create buckets +- [ ] Reference to `select_samples_to_pack` and `pack_selected_samples` methods +- [ ] Link to DiffusionTaskEncoderWithSequencePacking code +- [ ] Link to Energon packing documentation + +#### Parallelism +- [ ] Multiple parallelism techniques supported (tensor, sequence, context) +- [ ] Configurable based on computational requirements + +#### Model Architecture Customization +- [ ] Parameters: `num_layers`, `num_attention_heads` +- [ ] Link to Megatron-Bridge documentation for comprehensive options + +#### WandB Notes +- [ ] If using `wandb_project` and `wandb_exp_name`, export `WANDB_API_KEY` + +#### Validation Details +- [ ] Model generates one sample per GPU at start of each validation round +- [ ] Samples saved to `validation_generation` folder within `checkpoint_dir` +- [ ] Logged to WandB if `WANDB_API_KEY` configured +- [ ] Requires access to video tokenizer used during dataset preparation +- [ ] Specify VAE artifacts location using `vae_cache_folder` argument +- [ ] Otherwise downloaded in first validation round + +#### Pretraining Script Example +- [ ] Copy config file: `cp examples/megatron/recipes/dit/conf/dit_pretrain_example.yaml ...` +- [ ] Edit instructions for `my_config.yaml`: + - [ ] `model.vae_cache_folder`: Path to VAE cache folder + - [ ] `dataset.path`: Path to dataset folder + 
- [ ] `checkpoint.save` and `checkpoint.load`: Path to checkpoint folder + - [ ] `train.global_batch_size`: Set to be divisible by NUM_GPUs + - [ ] `logger.wandb_exp_name`: Your experiment name +- [ ] Run command with `--config-file` +- [ ] CLI override example: `train.train_iters=20000`, `model.num_layers=32` + +#### Training Split Note +- [ ] If 100% data to training, pass `dataset.use_train_split_for_val=true` +- [ ] Uses subset of training data for validation +- [ ] Command example with this flag + +#### Mock Dataset +- [ ] Use `--mock` flag for performance measurement without dataset +- [ ] Command example with `--mock` flag + +### Inference + +#### Inference Script +- [ ] Script: `inference_dit_model.py` +- [ ] Requires: Trained checkpoint (`--checkpoint_path`), save path (`--video_save_path`) +- [ ] Optional: `--t5_cache_dir`, `--tokenizer_cache_dir` (avoid re-downloading) +- [ ] Command example with all parameters: + - [ ] `--t5_cache_dir` + - [ ] `--tokenizer_cache_dir` + - [ ] `--tokenizer_model Cosmos-0.1-Tokenizer-CV4x8x8` + - [ ] `--checkpoint_path` + - [ ] `--num_video_frames 10` + - [ ] `--height 240` + - [ ] `--width 416` + - [ ] `--video_save_path` + - [ ] `--prompt` + +### Parallelism Support Table +- [ ] DiT-S (330M): Data Parallel (TBD), Tensor Parallel (TBD), Sequence Parallel (TBD), Context Parallel (TBD) +- [ ] DiT-L (450M): Data Parallel (TBD), Tensor Parallel (TBD), Sequence Parallel (TBD), Context Parallel (TBD) +- [ ] DiT-XL (700M): Data Parallel (✅), Tensor Parallel (✅), Sequence Parallel (✅), Context Parallel (✅) + +--- + +## 5. WAN Recipe Information (`megatron/recipes/wan/wan2.1.md`) + +**Target Location**: `docs/get-started/megatron-wan.md` OR integrate into `docs/get-started/megatron.md` with tabs (TUTORIAL/HOW-TO) + +### Overview +- [ ] WAN 2.1 description: Open-source implementation of large-scale text-to-video/image generative models +- [ ] Built on: Megatron-Core and Megatron-Bridge +- [ ] Supports: Advanced parallelism strategies (data, tensor, sequence, context parallelism) +- [ ] Optimized kernels: Transformer Engine fused attention + +### Dataset Preparation + +#### Energon Data Loader +- [ ] Uses NVIDIA's Megatron-Energon +- [ ] WebDataset-compatible format (sharded `.tar` archives) +- [ ] Supports: Large-scale distributed loading, sharding, sampling for video-text and image-text pairs +- [ ] Set `dataset.path` to WebDataset directory or shard pattern +- [ ] Link to Megatron-Energon docs for format details, subflavors, advanced options + +#### Mock Dataset Note +- [ ] If no dataset: See "Quick Start with Mock Dataset" section + +#### WAN Dataset Preparation Example +- [ ] Input: Directory with raw `.mp4` videos and `.json` metadata files with captions +- [ ] Output: WAN-ready WebDataset shards +- [ ] Step 1: Define input/output folders (`DATASET_SRC`, `DATASET_PATH`) +- [ ] Step 2: Optional HF_TOKEN export if auth required +- [ ] Step 3: Create WAN shards with latents + text embeddings + - [ ] Script: `prepare_energon_dataset_wan.py` + - [ ] Uses WAN's VAE encoder and T5 encoder + - [ ] Extracts videos' latents and caption embeddings offline + - [ ] Arguments: `--height/--width` control resize target (832x480 supported for 1.3B and 14B) + - [ ] `--center-crop`: Run center crop to exact target size after resize + - [ ] Command example with all flags +- [ ] Step 4: Use Energon to process shards + - [ ] Command: `energon prepare "${DATASET_PATH}"` + - [ ] Interactive prompts: Enter train/val/test split (e.g., "8,1,1") + - [ ] Sample type: Choose 
"Crude sample (plain dict for cooking)" + +#### What Gets Produced +- [ ] Each shard contains: + - [ ] `pth`: WAN video latents + - [ ] `pickle`: Text embeddings + - [ ] `json`: Useful side-info (text caption, sizes, processing choices) +- [ ] Energon writes `.nv-meta` directory with dataset info +- [ ] Energon writes `dataset.yaml` (can version/control) + +#### Training Config Setup +- [ ] Point WAN config to processed data: `dataset.path=${DATASET_PATH}` + +### Container Build +- [ ] Reference to DFM container guide in main README + +### Pretraining + +#### Sequence Packing for WAN +- [ ] Purpose: Maximize throughput +- [ ] Problem: Naive batching/padding requires significant padded tokens for videos +- [ ] Solution: Sequence packing stacks multiple samples (different resolutions) into single sequence +- [ ] Benefit: No computation wasted on padded tokens +- [ ] Requirements: + - [ ] Set `train.micro_batch_size=1` and `dataset.micro_batch_size=1` + - [ ] Ensure `model.qkv_format=thd` (required with context parallelism, recommended with sequence packing) + +#### Parallelism +- [ ] Multiple parallelism techniques supported (tensor, sequence, context parallelism) +- [ ] Configurable per hardware + +#### Training Script +- [ ] Script: `examples/megatron/recipes/wan/pretrain_wan.py` +- [ ] Supports: YAML config file and CLI overrides + +#### Training Mode Presets +- [ ] `--training-mode` with `pretrain` and `finetune` presets +- [ ] Purpose: Flow-matching hyperparameters as starting point +- [ ] **Pretraining preset**: + - [ ] Uses noisier, biased sampling + - [ ] Examples: logit-normal, higher logit_std, lower flow_shift + - [ ] Purpose: Stability and broad learning +- [ ] **Finetuning preset**: + - [ ] Uses uniform, lower-noise settings + - [ ] Examples: uniform sampling, lower logit_std, higher flow_shift + - [ ] Purpose: Refine details and improve quality + +#### WandB Notes +- [ ] If using `logger.wandb_project` and `logger.wandb_exp_name`, export `WANDB_API_KEY` + +#### Pretraining Script Example +- [ ] Example configs: `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf` +- [ ] Copy and edit instructions: + - [ ] `dataset.path`: Path to WebDataset directory + - [ ] `train.global_batch_size/micro_batch_size`: Keep micro_batch_size=1 + - [ ] `model.tensor_model_parallel_size` / `model.context_parallel_size`: Based on GPUs + - [ ] `checkpoint.save` and `checkpoint.load`: Checkpoint directory +- [ ] Run command with `--training-mode pretrain` and `--config-file` +- [ ] CLI override example with all parameters: + - [ ] `dataset.path` + - [ ] `train.global_batch_size` + - [ ] `train.micro_batch_size` + - [ ] `model.tensor_model_parallel_size` + - [ ] `model.context_parallel_size` + - [ ] `checkpoint.save` + - [ ] `checkpoint.load` +- [ ] Link to Megatron-Bridge docs for argument details + +#### Mock Dataset +- [ ] Use `--mock` flag for debugging or performance measurement +- [ ] Command example with `--mock` flag +- [ ] Note: Can adjust mock shapes (`F_latents`, `H_latents`, `W_latents`) and packing behavior (`number_packed_samples`) in `WanMockDataModuleConfig` +- [ ] Reference: See `dfm/src/megatron/recipes/wan/wan.py` + +### Inference + +#### Inference Script +- [ ] Script: `examples/megatron/recipes/wan/inference_wan.py` +- [ ] `--checkpoint_step`: Use specific checkpoint for inference +- [ ] `--sizes`: Specify video shape (height, width) +- [ ] `--frame_nums`: Specify number of frames +- [ ] `--sample_steps`: Number of noise diffusion steps (default: 50) +- [ ] 
Command example with all parameters: + - [ ] `--task t2v-1.3B` + - [ ] `--frame_nums 81` + - [ ] `--sizes 480*832` + - [ ] `--checkpoint_dir` + - [ ] `--checkpoint_step 10000` + - [ ] `--prompts` (example prompt) + - [ ] `--sample_steps 50` +- [ ] **Note**: Current inference path is single-GPU. Parallel inference not yet supported. + +### Parallelism Support Table +- [ ] 1.3B model: Data Parallel (✅), Tensor Parallel (✅), Sequence Parallel (✅), Context Parallel (✅), FSDP (Coming Soon) +- [ ] 14B model: Data Parallel (✅), Tensor Parallel (✅), Sequence Parallel (✅), Context Parallel (✅), FSDP (Coming Soon) + +### References +- [ ] WAN Team citation: (2025). Wan: Open and advanced large-scale video generative models (Wan 2.1). GitHub. https://github.com/Wan-Video/Wan2.1/ + +--- + +## Verification Summary + +**Total Information Items**: ~200+ discrete pieces + +**Checklist Status**: +- [ ] All items from `performance-summary.md` captured +- [ ] All items from `mcore_automodel_comparision_wan21.md` captured +- [ ] All items from `automodel_training_doc.md` captured +- [ ] All items from `megatron/models/dit/README.md` captured +- [ ] All items from `megatron/recipes/wan/wan2.1.md` captured + +**Integration Verification**: +- [ ] Each item checked off as integrated +- [ ] Location documented (which file/section) +- [ ] Progressive disclosure applied (Layer 1/2/3/4) +- [ ] Links and references verified +- [ ] Images copied and paths updated + +--- + +## Notes + +- **Information can be integrated anywhere logical** - doesn't need to match old file structure +- **Progressive disclosure**: Layer 3/4 items can be in dropdowns/tabs/separate pages +- **Cross-references**: Related information can be linked rather than duplicated +- **Verification**: Check off items as you integrate them, note location + diff --git a/docs/MIGRATION_PLAN.md b/docs/MIGRATION_PLAN.md new file mode 100644 index 00000000..f2ff108e --- /dev/null +++ b/docs/MIGRATION_PLAN.md @@ -0,0 +1,758 @@ +# Documentation Migration Plan: Preserving All Information + +**Goal**: Capture all information from old docs in the new information architecture, organized logically using Diataxis, progressive disclosure, and MyST directives. + +**Status**: Draft Plan +**Date**: 2025-01-XX + +**Key Principle**: Preserve **information**, not file structure. Content can be merged, split, or reorganized as long as all information is captured in a well-organized manner. + +--- + +## Overview + +This plan ensures: +- ✅ **Zero information loss**: All content from old docs preserved somewhere logical +- ✅ **Mature information architecture**: Content organized by purpose and user need +- ✅ **Diataxis alignment**: Content organized by type (Tutorial, How-To, Explanation, Reference) +- ✅ **Progressive disclosure**: Advanced details in dropdowns/tabs/separate pages +- ✅ **Cognitive load reduction**: Scannable structure with clear navigation + +--- + +## Information Inventory (Not File Inventory) + +### Information Currently Missing from New Structure + +1. **Performance Benchmarks** + - **Source**: `performance-summary.md` + - **Information**: Nomenclature, metrics, benchmark tables (DGX-GB200, GB300, H100) + - **Best Location**: `docs/reference/performance.md` (REFERENCE type) + - **Status**: Missing entirely + +2. 
**Paradigm Comparison Analysis** + - **Source**: `mcore_automodel_comparision_wan21.md` + - **Information**: Experimental comparison, training curves, caveats + - **Best Location**: `docs/about/comparison.md` OR integrate into `docs/about/concepts/training-paradigms.md` + - **Status**: Missing entirely + +### Information in Orphaned Files (Needs Integration) + +1. **Detailed Automodel Training Information** + - **Source**: `automodel_training_doc.md` + - **Information**: Preprocessing modes, validation, hardware reqs, advanced config + - **Best Location**: Integrate into `get-started/automodel.md` (progressive disclosure) + - **Status**: Exists but not integrated + +2. **DiT-Specific Training Details** + - **Source**: `megatron/models/dit/README.md` + - **Information**: Sequence packing details, Energon format, validation + - **Best Location**: Integrate into `get-started/megatron.md` (progressive disclosure) + - **Status**: Exists but not integrated + +3. **WAN-Specific Training Information** + - **Source**: `megatron/recipes/wan/wan2.1.md` + - **Information**: WAN dataset prep, training modes, WAN-specific workflows + - **Best Location**: Either: + - Option A: `get-started/megatron-wan.md` (separate guide) + - Option B: Enhance `get-started/megatron.md` with WAN section (tabs) + - **Status**: Exists but not integrated + +--- + +## Information Mapping Strategy + +**Approach**: Map information to logical locations in new IA, not files to files. + +### Information Organization Principles + +1. **User Intent First**: Where would users look for this information? +2. **Diataxis Alignment**: What type of content is this? (Tutorial/How-To/Explanation/Reference) +3. **Progressive Disclosure**: What layer does this belong to? (Core/Advanced/Reference) +4. **Logical Grouping**: Related information should be together + +## Migration Strategy by Information Type + +### 1. Performance Summary (`performance-summary.md` → `docs/reference/performance.md`) + +**Diataxis Type**: REFERENCE +**Progressive Disclosure**: Use tabs for different systems, dropdowns for detailed metrics + +**Structure**: +```markdown +# Performance Benchmarks + +## Overview +[Layer 1: 30-second overview] + +## Nomenclature +[Layer 2: Core definitions - use dropdowns for detailed explanations] + +## Performance Metrics +[Layer 2: Core metrics explanation] + +## Benchmark Results +[Layer 2: Main results - use tabs for different systems] + +:::: {tab-set} +::: {tab-item} DGX-GB200 +[Results table] +::: +::: {tab-item} DGX-GB300 +[Results table] +::: +::: {tab-item} DGX-H100 +[Results table] +::: +:::: + +## Detailed Configurations +[Layer 3: Advanced details in dropdowns] +``` + +**Content to Preserve**: +- ✅ All nomenclature definitions (GBS, MBS, FSDP, TP, SP, PP, CP, VP, EP) +- ✅ Performance metrics explanation (Tokens/sec/GPU, Model TFLOP/sec/GPU) +- ✅ All benchmark tables (DGX-GB200, DGX-GB300, DGX-H100) +- ✅ Both Megatron-Core and NeMo Automodel results +- ✅ All model configurations + +**Progressive Disclosure**: +- **Layer 1**: Overview + summary table +- **Layer 2**: Core metrics + main results (tabs for systems) +- **Layer 3**: Detailed configurations (dropdowns) +- **Layer 4**: Raw data tables (if needed, separate page) + +--- + +### 2. 
Comparison Document (`mcore_automodel_comparision_wan21.md` → `docs/about/comparison.md`) + +**Diataxis Type**: EXPLANATION +**Progressive Disclosure**: Use tabs for stages, dropdowns for detailed analysis + +**Structure**: +```markdown +# Automodel vs Megatron Comparison + +## Overview +[Layer 1: What this comparison shows] + +## Experiment Overview +[Layer 2: Core experiment details] + +## Training Stages +[Layer 2: Use tabs for Stage 1 vs Stage 2] + +:::: {tab-set} +::: {tab-item} Stage 1: Text-to-Image +[Dataset, setup, results] +::: +::: {tab-item} Stage 2: Text-to-Video +[Dataset, setup, results] +::: +:::: + +## Results Analysis +[Layer 2: Training curves with images] + +:::{dropdown} Detailed Analysis +[Layer 3: Caveats and technical details] +::: + +## Key Takeaways +[Layer 2: Summary comparison] +``` + +**Content to Preserve**: +- ✅ Complete experiment overview +- ✅ Both training stages (Text→Image, Text→Video) +- ✅ Dataset details (3K videos, 120K images) +- ✅ Training setup comparison tables +- ✅ Training curve images (both stages) +- ✅ Important caveat about Megatron-Core timestep handling +- ✅ All parallelism configurations + +**Progressive Disclosure**: +- **Layer 1**: Overview + key findings +- **Layer 2**: Main comparison (tabs for stages) +- **Layer 3**: Detailed analysis (dropdowns) +- **Layer 4**: Full technical details (if needed) + +**Integration**: Also enhance `docs/about/concepts/training-paradigms.md` with link to this comparison. + +--- + +### 3. Automodel Training Doc (`automodel_training_doc.md` → Enhance `get-started/automodel.md`) + +**Diataxis Type**: TUTORIAL (enhanced) +**Progressive Disclosure**: Add missing details as dropdowns and expandable sections + +**Missing Content to Add**: + +#### A. Preprocessing Details (Add to Step 1) +```markdown +### 1. Prepare Your Dataset + +[Current content...] + +:::{dropdown} Detailed Preprocessing Modes +[Layer 3: Full explanation of video vs frames mode] + +**Full Video Mode** (`--mode video`): +- What it is: [detailed explanation] +- When to use: [use cases] +- Output: [what gets created] + +**Extract Frames Mode** (`--mode frames`): +- What it is: [detailed explanation] +- When to use: [use cases] +- Output: [what gets created] +::: + +:::{dropdown} meta.json Format Specification +[Layer 3: Complete schema] + +```json +[Full JSON schema with all fields] +``` +::: +``` + +#### B. Multi-Node Setup (Add to Step 3) +```markdown +### 3. Run Training + +[Current single-node content...] + +:::{dropdown} Multi-Node with SLURM +[Layer 3: Advanced setup] + +[Complete SLURM script from old docs] +::: +``` + +#### C. Validation (Add new section) +```markdown +### 4. Validate Training + +[New section with validation script details] + +:::{dropdown} Validation Script Details +[Layer 3: Advanced validation options] + +[Complete validation documentation] +::: +``` + +#### D. Hardware Requirements (Add as dropdown) +```markdown +:::{dropdown} Hardware Requirements +[Layer 3: System requirements] + +| Component | Minimum | Recommended | +|-----------|---------|-------------| +[Full table from old docs] +::: +``` + +#### E. 
Advanced Configuration (Add as new section) +```markdown +## Advanced Topics + +:::{dropdown} Pretraining vs Fine-tuning +[Layer 3: Comparison table] + +[Full comparison table] +::: + +:::{dropdown} Custom Parallelization +[Layer 3: Advanced parallelism] + +[Custom parallelization examples] +::: + +:::{dropdown} Checkpoint Management +[Layer 3: Advanced checkpointing] + +[Checkpoint cleanup code] +::: +``` + +**Content to Preserve**: +- ✅ All preprocessing mode details +- ✅ Complete `meta.json` schema +- ✅ Multi-node SLURM setup +- ✅ Validation script documentation +- ✅ Hardware requirements table +- ✅ Pretraining vs fine-tuning comparison +- ✅ Advanced parallelization examples +- ✅ Checkpoint cleanup utilities +- ✅ Supported models table + +**Progressive Disclosure**: +- **Layer 1**: Core tutorial steps (current) +- **Layer 2**: Essential details (expand current sections) +- **Layer 3**: Advanced topics (dropdowns) +- **Layer 4**: Complete reference (link to detailed guide) + +**Integration Strategy**: +- Keep current tutorial structure (Layer 1-2) +- Add missing information as progressive disclosure elements (Layer 3) +- **No need to preserve `automodel_training_doc.md` as separate file** - all information integrated + +--- + +### 4. DiT Model Guide (`megatron/models/dit/README.md` → Enhance `get-started/megatron.md`) + +**Diataxis Type**: TUTORIAL (enhanced) +**Progressive Disclosure**: Add DiT-specific details as expandable sections + +**Missing Content to Add**: + +#### A. Sequence Packing Details (Enhance existing section) +```markdown +### Sequence Packing + +[Current brief mention...] + +:::{dropdown} Understanding Sequence Packing +[Layer 3: Detailed explanation] + +[Complete sequence packing explanation from old docs] +- Why use it +- How it works +- Configuration requirements +- Performance impact +::: + +:::{dropdown} Sequence Packing Parameters +[Layer 3: Advanced configuration] + +**Key Parameters**: +- `task_encoder_seq_length`: [explanation] +- `packing_buffer_size`: [explanation] +- `qkv_format=thd`: [why required] +::: +``` + +#### B. Validation Details (Add new section) +```markdown +### Monitor Training + +[Current content...] + +:::{dropdown} Validation and Sample Generation +[Layer 3: Advanced monitoring] + +[Complete validation details from old docs] +- How validation works +- Sample generation +- WandB integration +- VAE cache requirements +::: +``` + +#### C. Energon Dataset Details (Enhance existing section) +```markdown +### Prepare Dataset + +[Current butterfly example...] 
+ +:::{dropdown} Understanding Energon Format +[Layer 3: Advanced data format] + +[Complete Energon explanation] +- WebDataset format +- Sample structure +- Energon prepare command details +::: +``` + +**Content to Preserve**: +- ✅ Complete sequence packing explanation +- ✅ Sequence packing parameters (`task_encoder_seq_length`, `packing_buffer_size`) +- ✅ Validation details (sample generation, WandB) +- ✅ VAE cache folder requirements +- ✅ Energon dataset format details +- ✅ Complete Energon prepare workflow +- ✅ All configuration examples + +**Progressive Disclosure**: +- **Layer 1**: Core tutorial (current) +- **Layer 2**: Essential DiT details (expand current) +- **Layer 3**: Advanced topics (dropdowns) +- **Layer 4**: Complete reference (link to `dit/README.md`) + +**Integration Strategy**: +- Enhance existing Megatron tutorial with DiT-specific details +- Use dropdowns for advanced topics +- **No need to preserve `dit/README.md` as separate file** - all information integrated + +--- + +### 5. WAN Recipe Guide (`megatron/recipes/wan/wan2.1.md` → New page or enhance tutorial) + +**Diataxis Type**: HOW-TO +**Progressive Disclosure**: Use tabs for different workflows, dropdowns for details + +**Decision**: Create separate WAN guide page OR enhance Megatron tutorial with WAN section + +**Option A: Separate WAN Guide Page** (Recommended) +``` +docs/get-started/megatron-wan.md +``` + +**Option B: Enhance Megatron Tutorial** (Alternative) +Add WAN section with tabs: `:::: {tab-set}` for DiT vs WAN + +**Recommended Structure** (Option A): +```markdown +# Megatron WAN Workflow + +## Overview +[Layer 1: What WAN is, when to use it] + +## Choose Your Model +[Layer 2: DiT vs WAN decision] + +:::: {tab-set} +::: {tab-item} DiT Model +:link: megatron +[Link to DiT tutorial] +::: +::: {tab-item} WAN Model +[WAN-specific content] +::: +:::: + +## Prepare WAN Dataset +[Layer 2: WAN-specific dataset prep] + +:::{dropdown} Understanding WAN Data Format +[Layer 3: Detailed format explanation] +::: + +## Train WAN Model +[Layer 2: WAN training] + +:::{dropdown} Training Mode Presets +[Layer 3: pretrain vs finetune modes] + +[Complete explanation of presets] +::: + +:::{dropdown} Sequence Packing for WAN +[Layer 3: WAN-specific packing] + +[WAN sequence packing details] +::: + +## Generate Videos +[Layer 2: WAN inference] + +## Parallelism Support +[Layer 2: WAN parallelism table] +``` + +**Content to Preserve**: +- ✅ Complete WAN overview +- ✅ WAN dataset preparation (Energon workflow) +- ✅ Training mode presets (pretrain vs finetune) +- ✅ Sequence packing for WAN +- ✅ WAN inference details +- ✅ Parallelism support table +- ✅ All configuration examples +- ✅ Mock dataset configuration + +**Progressive Disclosure**: +- **Layer 1**: Overview + quick start +- **Layer 2**: Core workflow steps +- **Layer 3**: Advanced topics (dropdowns) +- **Layer 4**: Complete reference (link to `wan2.1.md`) + +**Integration Strategy**: +- **Decision**: Choose Option A (separate page) OR Option B (tabs in existing tutorial) +- If Option A: Create `docs/get-started/megatron-wan.md` and integrate all WAN information +- If Option B: Add WAN section to `docs/get-started/megatron.md` using tabs +- **No need to preserve `wan2.1.md` as separate file** - all information integrated into chosen location + +--- + +## Navigation Updates + +### Update `docs/get-started/index.md` + +Add WAN option: +```markdown +:::: {grid} 1 2 2 2 +:::{grid-item-card} 2a. Automodel Tutorial +[Current content] +::: +:::{grid-item-card} 2b. 
Megatron DiT Tutorial +[Current content] +::: +:::{grid-item-card} 2c. Megatron WAN Tutorial +:link: megatron-wan +:link-type: doc +Train WAN models with Megatron for video generation. ++++ +{bdg-secondary}`wan` {bdg-secondary}`megatron` +::: +:::: +``` + +### Update `docs/about/concepts/training-paradigms.md` + +Add comparison link: +```markdown +## Learn More + +- [Automodel vs Megatron Comparison](comparison.md) - Detailed experimental comparison +- [Performance Benchmarks](../reference/performance.md) - Training performance metrics +``` + +### Update `docs/reference/index.md` + +Add performance link: +```markdown +## Performance and Benchmarks + +:::{grid-item-card} Performance Benchmarks +:link: performance +:link-type: doc +Training throughput and performance metrics across GPU systems. ++++ +{bdg-secondary}`benchmarks` {bdg-secondary}`performance` +::: +``` + +--- + +## Implementation Checklist + +### Phase 1: Create Missing Files + +- [ ] **Create `docs/reference/performance.md`** + - [ ] Migrate nomenclature section + - [ ] Migrate performance metrics explanation + - [ ] Migrate all benchmark tables (use tabs for systems) + - [ ] Add progressive disclosure (dropdowns for details) + - [ ] Add frontmatter with proper metadata + - [ ] Link from reference index + +- [ ] **Create `docs/about/comparison.md`** + - [ ] Migrate experiment overview + - [ ] Migrate training stages (use tabs) + - [ ] Migrate training curves (include images) + - [ ] Migrate caveats and analysis + - [ ] Add progressive disclosure + - [ ] Add frontmatter with proper metadata + - [ ] Link from training-paradigms page + +### Phase 2: Integrate Information into Existing Tutorials + +- [ ] **Enhance `docs/get-started/automodel.md`** + - [ ] Integrate preprocessing details (dropdown) + - [ ] Integrate `meta.json` schema (dropdown) + - [ ] Integrate multi-node SLURM setup (dropdown) + - [ ] Integrate validation section + - [ ] Integrate hardware requirements (dropdown) + - [ ] Integrate advanced topics section (dropdowns) + - [ ] **Archive or remove `automodel_training_doc.md`** (information now integrated) + +- [ ] **Enhance `docs/get-started/megatron.md`** + - [ ] Integrate sequence packing details (dropdown) + - [ ] Integrate validation details (dropdown) + - [ ] Integrate Energon format details (dropdown) + - [ ] **Archive or remove `megatron/models/dit/README.md`** (information now integrated) + +### Phase 3: Integrate WAN Information + +- [ ] **Decide**: Separate WAN guide OR tabs in Megatron tutorial +- [ ] **If separate guide**: Create `docs/get-started/megatron-wan.md` + - [ ] Integrate all WAN information + - [ ] Add progressive disclosure + - [ ] **Archive or remove `megatron/recipes/wan/wan2.1.md`** (information now integrated) +- [ ] **If tabs**: Enhance `docs/get-started/megatron.md` + - [ ] Add WAN section with tabs (DiT vs WAN) + - [ ] Integrate all WAN information + - [ ] **Archive or remove `megatron/recipes/wan/wan2.1.md`** (information now integrated) + +### Phase 4: Update Navigation + +- [ ] **Update `docs/get-started/index.md`** + - [ ] Add WAN tutorial card + - [ ] Update comparison table + +- [ ] **Update `docs/about/concepts/training-paradigms.md`** + - [ ] Add comparison link + - [ ] Add performance link + +- [ ] **Update `docs/reference/index.md`** + - [ ] Add performance benchmarks card + +- [ ] **Update `docs/index.md`** (if needed) + - [ ] Ensure all new pages are discoverable + +### Phase 5: Verify Content Preservation + +- [ ] **Content Audit** + - [ ] Verify all nomenclature preserved 
+ - [ ] Verify all tables preserved + - [ ] Verify all code examples preserved + - [ ] Verify all images preserved + - [ ] Verify all configuration examples preserved + - [ ] Verify all troubleshooting content preserved + +- [ ] **Link Verification** + - [ ] All internal links work + - [ ] All reference targets exist + - [ ] All images load correctly + - [ ] All code examples render + +- [ ] **Progressive Disclosure Check** + - [ ] Layer 1 content scannable in 30 seconds + - [ ] Layer 2 content accessible without scrolling + - [ ] Layer 3 content in dropdowns/tabs + - [ ] Layer 4 content linked appropriately + +--- + +## Progressive Disclosure Patterns + +### Pattern 1: Advanced Details → Dropdown +```markdown +## Core Concept + +[Layer 2: Essential explanation] + +:::{dropdown} Advanced: Detailed Analysis +[Layer 3: Full technical details] +::: +``` + +### Pattern 2: Alternative Options → Tabs +```markdown +## Choose Your Approach + +:::: {tab-set} +::: {tab-item} Option A +[Content for option A] +::: +::: {tab-item} Option B +[Content for option B] +::: +:::: +``` + +### Pattern 3: Reference Material → Separate Page + Link +```markdown +## Core Tutorial + +[Layer 1-2: Essential steps] + +## Complete Reference + +For complete configuration options and advanced topics, see: +[Complete Reference Guide](reference-guide.md) +``` + +### Pattern 4: Comparison Tables → Collapsible +```markdown +## Quick Comparison + +[Layer 2: Summary table] + +:::{dropdown} Detailed Comparison +[Layer 3: Full comparison with all details] +::: +``` + +--- + +## Information Mapping to New IA + +| Information Source | Information Type | New Location | Diataxis Type | Integration Method | +|-------------------|------------------|--------------|---------------|-------------------| +| `performance-summary.md` | Performance benchmarks | `docs/reference/performance.md` | REFERENCE | New page (all info) | +| `mcore_automodel_comparision_wan21.md` | Paradigm comparison | `docs/about/comparison.md` OR `docs/about/concepts/training-paradigms.md` | EXPLANATION | New page OR integrate | +| `automodel_training_doc.md` | Detailed training info | `docs/get-started/automodel.md` | TUTORIAL | Integrate (progressive disclosure) | +| `megatron/models/dit/README.md` | DiT-specific details | `docs/get-started/megatron.md` | TUTORIAL | Integrate (progressive disclosure) | +| `megatron/recipes/wan/wan2.1.md` | WAN-specific details | `docs/get-started/megatron-wan.md` OR `docs/get-started/megatron.md` | TUTORIAL/HOW-TO | New page OR integrate with tabs | + +--- + +## Content Fidelity Principles + +1. **Preserve All Technical Details** + - All configuration examples + - All code snippets + - All parameter explanations + - All troubleshooting content + +2. **Preserve All Data** + - All benchmark numbers + - All comparison tables + - All training configurations + - All hardware specifications + +3. **Preserve All Context** + - Experiment methodology + - Caveats and limitations + - Use case guidance + - Best practices + +4. 
**Improve Organization** + - Group related content + - Use progressive disclosure + - Add clear navigation + - Improve scannability + +--- + +## Success Criteria + +✅ **Zero Information Loss** +- All content from old docs present in new structure +- All tables, code examples, images preserved +- All technical details maintained + +✅ **Improved Usability** +- Clear navigation paths +- Progressive disclosure reduces cognitive load +- Scannable structure (30-second test passes) + +✅ **Diataxis Compliance** +- Each page has single clear purpose +- Content type matches user intent +- Cross-links to related types + +✅ **Maintainability** +- Clear file organization +- Consistent structure +- Easy to update +- Single source of truth (new IA) + +--- + +## Next Steps + +1. **Review this plan** with stakeholders +2. **Prioritize phases** (suggest: Phase 1 → 2 → 3 → 4 → 5) +3. **Execute migration** following checklist +4. **Verify information** using audit checklist (verify all info captured, not files) +5. **Test navigation** and user flows +6. **Archive old files** after verification (information is now in new IA) + +--- + +## Notes + +- **Information Preservation**: Focus on preserving information, not file structure +- **File Cleanup**: After integration, old files can be archived or removed (information is captured) +- **Images**: Ensure all images copied to new locations with correct paths +- **Links**: Update all internal links to new structure +- **Frontmatter**: Add consistent frontmatter to all new/modified files +- **Testing**: Build docs locally to verify all MyST directives render correctly +- **Mature IA**: The new structure should be the source of truth; old files are temporary + diff --git a/docs/MIGRATION_SUMMARY.md b/docs/MIGRATION_SUMMARY.md new file mode 100644 index 00000000..5df4492d --- /dev/null +++ b/docs/MIGRATION_SUMMARY.md @@ -0,0 +1,123 @@ +# Migration Plan Summary + +**Quick Reference**: Information mapping strategy - preserve information, not file structure. + +**Key Principle**: Information should be captured in logical locations in the new IA. Files can be merged, split, or reorganized. 
+ +--- + +## Missing Information (Create New Pages) + +| File | Location | Type | Priority | +|------|----------|------|----------| +| Performance Benchmarks | `docs/reference/performance.md` | REFERENCE | High | +| Paradigm Comparison | `docs/about/comparison.md` | EXPLANATION | High | + +--- + +## Information to Integrate (Not Preserve as Separate Files) + +| Source File | Information | Integration Point | Method | +|-------------|------------|-------------------|--------| +| `automodel_training_doc.md` | Detailed training info | `get-started/automodel.md` | Integrate via progressive disclosure | +| `megatron/models/dit/README.md` | DiT-specific details | `get-started/megatron.md` | Integrate via progressive disclosure | +| `megatron/recipes/wan/wan2.1.md` | WAN-specific details | `get-started/megatron-wan.md` OR `get-started/megatron.md` | New page OR tabs | + +--- + +## Content Gaps to Fill + +### Automodel Tutorial (`get-started/automodel.md`) +- [ ] Preprocessing modes (video vs frames) - **Add as dropdown** +- [ ] `meta.json` schema - **Add as dropdown** +- [ ] Multi-node SLURM setup - **Add as dropdown** +- [ ] Validation script details - **Add new section** +- [ ] Hardware requirements - **Add as dropdown** +- [ ] Pretraining vs fine-tuning comparison - **Add as dropdown** +- [ ] Advanced parallelization - **Add as dropdown** +- [ ] Checkpoint cleanup - **Add as dropdown** + +### Megatron Tutorial (`get-started/megatron.md`) +- [ ] Sequence packing details - **Add as dropdown** +- [ ] Validation details - **Add as dropdown** +- [ ] Energon format details - **Add as dropdown** +- [ ] WAN content - **Create separate WAN guide** + +--- + +## Progressive Disclosure Strategy + +### Layer 1 (Always Visible) +- Overview, key concepts, main steps + +### Layer 2 (Scannable) +- Core content, essential details, main workflows + +### Layer 3 (Collapsible) +- Advanced topics → Use `:::{dropdown}` +- Alternative options → Use `:::: {tab-set}` +- Detailed explanations → Use `:::{dropdown}` + +### Layer 4 (Separate Pages) +- Complete reference guides → Link to existing detailed docs + +--- + +## MyST Directives to Use + +```markdown +# Dropdowns (Layer 3 content) +:::{dropdown} Advanced Topic +:icon: info +[Detailed content here] +::: + +# Tabs (Alternative options) +:::: {tab-set} +::: {tab-item} Option A +[Content A] +::: +::: {tab-item} Option B +[Content B] +::: +:::: + +# Cards (Navigation) +::::{grid} 1 2 2 2 +:::{grid-item-card} Title +:link: target +:link-type: ref +Description +::: +:::: +``` + +--- + +## Implementation Order + +1. **Phase 1**: Create missing files (performance, comparison) +2. **Phase 2**: Enhance existing tutorials (add dropdowns/tabs) +3. **Phase 3**: Create WAN guide page +4. **Phase 4**: Update navigation (index pages, links) +5. 
**Phase 5**: Verify (content audit, link check) + +--- + +## Quick Checklist + +- [ ] Performance benchmarks page created (all info from `performance-summary.md`) +- [ ] Comparison page created OR integrated (all info from `mcore_automodel_comparision_wan21.md`) +- [ ] Automodel tutorial enhanced (all info from `automodel_training_doc.md` integrated) +- [ ] Megatron tutorial enhanced (all info from `dit/README.md` integrated) +- [ ] WAN information integrated (all info from `wan2.1.md` integrated) +- [ ] All navigation updated +- [ ] **Information audit**: All information verified (not files - verify content) +- [ ] All links working +- [ ] Progressive disclosure applied correctly +- [ ] Old files archived/removed after verification + +--- + +**Full Plan**: See `MIGRATION_PLAN.md` for detailed implementation guide. + diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 00000000..47595c2d --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,84 @@ +# Makefile for Sphinx documentation + +# Default target shows help +.DEFAULT_GOAL := help + +.PHONY: help docs-html docs-clean docs-live docs-publish ensure-docs-env check-uv + +# Help target +help: ## Show this help message + @echo "" + @echo "📚 Documentation Build System" + @echo "==============================" + @echo "" + @echo "Available targets:" + @echo " make docs-html Build HTML documentation" + @echo " make docs-live Start live-reload server" + @echo " make docs-publish Build for publication (fail on warnings)" + @echo " make docs-clean Clean built documentation" + @echo "" + @echo "Note: Environment is automatically set up on first run." + @echo "" + +# Detect OS for cross-platform compatibility +ifeq ($(OS),Windows_NT) + VENV_PYTHON = ../.venv-docs/Scripts/python.exe + VENV_ACTIVATE = ..\\.venv-docs\\Scripts\\activate + VENV_ACTIVATE_PS = ..\\.venv-docs\\Scripts\\Activate.ps1 + RM_CMD = if exist _build rmdir /s /q _build + ECHO_BLANK = @echo. +else + VENV_PYTHON = ../.venv-docs/bin/python + VENV_ACTIVATE = source ../.venv-docs/bin/activate + RM_CMD = rm -rf _build + ECHO_BLANK = @echo "" +endif + +# Check if uv is installed +check-uv: +ifeq ($(OS),Windows_NT) + @where uv >nul 2>&1 || ( \ + echo. && \ + echo ❌ uv is not installed or not in PATH && \ + echo. && \ + echo Please install uv: https://docs.astral.sh/uv/getting-started/installation/ && \ + exit 1 \ + ) +else + @command -v uv >/dev/null 2>&1 || ( \ + echo ""; \ + echo "❌ uv is not installed or not in PATH"; \ + echo ""; \ + echo "Please install uv: https://docs.astral.sh/uv/getting-started/installation/"; \ + echo ""; \ + exit 1; \ + ) +endif + +# Ensure docs environment exists and is up to date +ensure-docs-env: check-uv + @if [ ! -f "$(VENV_PYTHON)" ]; then \ + echo "📦 Setting up docs environment with uv..."; \ + cd .. && uv venv .venv-docs && uv pip install --group docs --python .venv-docs; \ + echo "✅ Environment ready!"; \ + else \ + echo "🔄 Syncing docs dependencies (this ensures dependencies are up to date)..."; \ + cd .. && uv pip install --group docs --python .venv-docs; \ + echo "✅ Dependencies synced!"; \ + fi + +docs-html: ensure-docs-env + @echo "Building HTML documentation..." + $(VENV_PYTHON) -m sphinx -b html . _build/html + +docs-publish: ensure-docs-env + @echo "Building HTML documentation for publication (fail on warnings)..." + $(VENV_PYTHON) -m sphinx --fail-on-warning --builder html . _build/html + +docs-clean: + @echo "Cleaning built documentation..." 
+ $(RM_CMD) + +docs-live: ensure-docs-env + @echo "Starting live-reload server (sphinx-autobuild)..." + $(VENV_PYTHON) -m sphinx_autobuild --port 8001 . _build/html diff --git a/docs/about/comparison.md b/docs/about/comparison.md new file mode 100644 index 00000000..08cceea8 --- /dev/null +++ b/docs/about/comparison.md @@ -0,0 +1,127 @@ +--- +description: "Experimental comparison between AutoModel and Megatron training paths for WAN 2.1" +categories: ["concepts-architecture"] +tags: ["comparison", "automodel", "megatron", "wan", "experimental"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-comparison)= + +# AutoModel vs Megatron Comparison + +Experimental comparison of two training paths for WAN 2.1: the AutoModel (Diffusers) path versus the Megatron-Core (Megatron-Bridge) path. + +## Experiment Overview + +**Goal**: Compare two training paths for WAN 2.1: + +1. **[Diffusers](https://huggingface.co/docs/diffusers/en/index) implementation + [AutoModel](https://github.com/NVIDIA-NeMo/Automodel/tree/diffusion) training path** +2. **[Megatron-Core](https://github.com/NVIDIA/Megatron-LM) implementation + [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) training path** + +**Training Approach**: Two-stage training + +- **Stage 1**: Text → Image - Learn to connect textual embeddings with visual concepts +- **Stage 2**: Text → Video - Learn visual movements aligning with prompts + +**Dataset**: 3,000 videos; frames extracted from videos are used for text-to-image training stage. + +:::{note} +This experiment is a partial convergence test and only demonstrates the model's ability to reconstruct images and videos from input prompts. With only 3,000 videos, the model cannot generalize to generate novel content. Such generalization can be achieved with larger training datasets and increased training resources. +::: + +--- + +## Dataset Configuration + +:::: {tab-set} + +::: {tab-item} Stage 1: Text-to-Image + +**Dataset**: +- Extract 40 frames per video → **120k images** +- Resolution: **240 × 416** +- Each frame uses same caption as parent video + +**Training Setup**: +- Global batch size: 2560 images +- Learning rate: warmup 10k → 5e-5 constant +- Hardware: 10 nodes (80 GPUs) + +| Path | Parallelism | Notes | +|------|-------------|-------| +| Megatron-Core | TP=1, PP=1, CP=1 | Sequence packing (32 samples/pack) | +| AutoModel | FSDP | micro_batch_size = 32 | + +::: + +::: {tab-item} Stage 2: Text-to-Video + +**Dataset**: +- Full videos → **3,000 videos** +- Resolution: **240 × 416**, duration 4–8 seconds + +**Training Setup**: +- Global batch size: 80 videos +- Learning rate: 5e-5 constant +- Hardware: 10 nodes (80 GPUs) + +| Path | Parallelism | Notes | +|------|-------------|-------| +| Megatron-Core | TP=1, PP=1, CP=1 | micro_batch_size = 1 | +| AutoModel | FSDP | micro_batch_size = 1 | + +::: + +:::: + +--- + +## Results + +### Stage 1 — Loss vs. Steps + +```{image} ../medias/training_curves/lm_loss_text2image_3kvids.png +:alt: Training loss curve for Stage 1 (Text-to-Image) +:width: 700px +``` + +### Stage 2 — Loss vs. Steps + +```{image} ../medias/training_curves/lm_loss_text2video_3kvids.png +:alt: Training loss curve for Stage 2 (Text-to-Video) +:width: 700px +``` + +:::{note} +Training loss is smoothed with 50 steps averaging. +::: + +### Analysis + +The training curves for both stages have similar value ranges, although they do not match exactly. 
This is expected due to differences in implementation and training loop setups. + +:::{dropdown} Important Caveat: Megatron-Core Timestep Handling +:icon: alert + +In the current Megatron-Core implementation, the same diffusion time steps are applied to all samples within a pack for each step, rather than different time steps for each sample. As a result, the training loss for Megatron-Core fluctuates more significantly than for AutoModel, especially at the beginning of training. +::: + +--- + +## Key Takeaways + +- Both paths achieve similar training loss ranges +- Implementation differences lead to curve variations (expected) +- Megatron-Core shows more loss fluctuation due to timestep handling in sequence packing +- Both paths successfully learn reconstruction from prompts + +--- + +## Related Documentation + +- [Training Paradigms](concepts/training-paradigms.md) - Detailed comparison of paradigms +- [Performance Benchmarks](../reference/performance.md) - Training throughput metrics +- [Get Started](../get-started/index.md) - Start training with either path + diff --git a/docs/about/concepts/configuration.md b/docs/about/concepts/configuration.md new file mode 100644 index 00000000..558177ad --- /dev/null +++ b/docs/about/concepts/configuration.md @@ -0,0 +1,251 @@ +--- +description: "Understanding NeMo DFM's configuration system: YAML files, CLI overrides, and configuration precedence" +categories: ["concepts-architecture"] +tags: ["configuration", "yaml", "cli", "overrides"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "beginner" +content_type: "explanation" +--- + +(about-concepts-configuration)= + +# Configuration System + +NeMo DFM uses a layered configuration system: base recipes provide defaults, YAML files define reusable settings, and CLI overrides enable quick experimentation. Each layer overrides the previous, with CLI arguments taking highest precedence. + +## Configuration Layers + +Configuration precedence: Base Recipe < YAML File < CLI Overrides + +1. **Base recipes**: Python functions with framework defaults +2. **YAML files**: Reusable configuration templates +3. **CLI overrides**: Runtime argument overrides (highest precedence) + +## Automodel Configuration + +Automodel is a separate training framework in DFM that uses a simplified, YAML-first configuration approach. It requires the Automodel submodule from `3rdparty/Automodel`. + +### YAML-Based Configuration + +Automodel uses a single YAML file for all configuration: + +```yaml +seed: 42 + +model: + pretrained_model_name_or_path: Wan-AI/Wan2.1-T2V-1.3B-Diffusers + +data: + dataloader: + _target_: Automodel.datasets.build_wan21_dataloader + meta_folder: /path/to/dataset/meta/ + batch_size: 1 + num_workers: 2 + +batch: + batch_size_per_node: 8 + +training: + num_epochs: 100 + +optim: + learning_rate: 5e-6 + optimizer: + weight_decay: 0.01 + betas: [0.9, 0.999] + +fsdp: + tp_size: 1 + cp_size: 1 + pp_size: 1 + dp_size: 8 +``` + +### Loading Configuration + +Load configuration using Automodel's argument parser: + +```python +# From Automodel package (3rdparty/Automodel) +from nemo_automodel.components.config._arg_parser import parse_args_and_load_config + +cfg = parse_args_and_load_config("config.yaml") +``` + +The `nemo_automodel` package is provided by the Automodel submodule in `3rdparty/Automodel`. + +## Megatron Configuration + +### Multi-Level Configuration + +Megatron supports three configuration levels: + +#### 1. 
Base Recipe Configuration + +Python functions define base configurations: + +```python +from dfm.src.megatron.recipes.dit.dit import pretrain_config + +cfg = pretrain_config(dataset_path="/path/to/dataset", mock=False) +``` + +#### 2. YAML Override Files + +YAML files override base configuration: + +```yaml +model: + tensor_model_parallel_size: 4 +train: + global_batch_size: 512 +``` + +#### 3. CLI Overrides + +Command-line arguments override everything: + +```bash +python pretrain_dit_model.py \ + --config-file config.yaml \ + model.tensor_model_parallel_size=8 \ + train.global_batch_size=1024 +``` + +## CLI Override Syntax + +### Basic Syntax + +```bash +key=value +``` + +### Nested Keys + +Use dot notation for nested configuration: + +```bash +model.tensor_model_parallel_size=4 +train.global_batch_size=512 +optimizer.learning_rate=1e-4 +``` + +### Adding New Keys + +Use `+` prefix to add new configuration keys: + +```bash ++new_key=value ++model.custom_setting=42 +``` + +### Removing Keys + +Use `~` prefix to remove configuration keys: + +```bash +~key_to_remove +~model.unused_setting +``` + +### Type Conversion + +CLI overrides automatically convert types: + +```bash +model.tensor_model_parallel_size=4 # int +train.learning_rate=1e-4 # float +model.use_mixed_precision=true # bool +model.model_name="my_model" # string +``` + +### Complex Types + +PyTorch types use string representations that are parsed by OmegaConf: + +```bash +model.pipeline_dtype=torch.bfloat16 # torch dtype (common: torch.float16, torch.bfloat16, torch.float32) +``` + +For function references and complex objects, define them in YAML files rather than CLI overrides. + +## Configuration Structure + +Configuration files organize settings into logical sections: + +**Model**: Architecture and parallelism + +```yaml +model: + tensor_model_parallel_size: 4 + pipeline_model_parallel_size: 2 + pipeline_dtype: torch.bfloat16 +``` + +**Training**: Batch sizes and iteration control + +```yaml +train: + global_batch_size: 512 + max_steps: 10000 + save_interval: 1000 +``` + +**Data**: Dataset paths and loading + +```yaml +data: + dataset_path: /path/to/data + num_workers: 8 +``` + +**Optimizer**: Learning rates and schedules + +```yaml +optim: + learning_rate: 1e-4 + weight_decay: 0.01 +``` + +## Configuration Patterns + +### Experiment Workflows + +Base configuration with CLI variations: + +```bash +# Base run +python train.py --config-file base_config.yaml + +# Learning rate sweep +python train.py --config-file base_config.yaml train.learning_rate=2e-4 +python train.py --config-file base_config.yaml train.learning_rate=5e-4 + +# Scale model parallelism +python train.py --config-file base_config.yaml \ + model.tensor_model_parallel_size=8 \ + model.pipeline_model_parallel_size=2 +``` + +### Verify Final Configuration + +Print merged configuration in Megatron to verify all overrides: + +```python +from megatron.bridge.utils.common_utils import get_rank_safe + +if get_rank_safe() == 0: + cfg.print_yaml() +``` + +This displays the final configuration after all merging, showing effective values for model, training, data, and optimizer settings. 
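+The same precedence behavior can be sanity-checked outside the framework. The sketch below uses OmegaConf directly (the library the docs mention for parsing overrides) to show how a YAML override and a CLI-style dotlist layer on top of base values; the keys and values are illustrative only, not a DFM API.
+
+```python
+from omegaconf import OmegaConf
+
+# Base recipe defaults (lowest precedence).
+base = OmegaConf.create({"train": {"global_batch_size": 256, "max_steps": 10000}})
+
+# YAML file overrides (middle precedence).
+yaml_overrides = OmegaConf.create({"train": {"global_batch_size": 512}})
+
+# CLI overrides in key=value form (highest precedence).
+cli_overrides = OmegaConf.from_dotlist(["train.max_steps=20000"])
+
+merged = OmegaConf.merge(base, yaml_overrides, cli_overrides)
+print(merged.train.global_batch_size)  # 512   (from the YAML file)
+print(merged.train.max_steps)          # 20000 (from the CLI override)
+```
+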
+ +## Environment Variables + +Set runtime behavior with environment variables: + +```bash +export CUDA_VISIBLE_DEVICES=0,1,2,3 # Select GPUs +export NCCL_DEBUG=INFO # Debug distributed communication +``` + diff --git a/docs/about/concepts/diffusion-models.md b/docs/about/concepts/diffusion-models.md new file mode 100644 index 00000000..a9c37bf2 --- /dev/null +++ b/docs/about/concepts/diffusion-models.md @@ -0,0 +1,199 @@ +--- +description: "How diffusion models work for video generation in NeMo DFM, including EDM and Flow Matching paradigms" +categories: ["concepts-architecture"] +tags: ["diffusion", "video-generation", "edm", "flow-matching"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-concepts-diffusion-models)= + +# Diffusion Models for Video + +Diffusion models generate video by learning to reverse a gradual noise-addition process. NeMo DFM implements two paradigms—EDM and Flow Matching—each offering distinct training dynamics and sampling characteristics for video generation. + +## Core Mechanism + +Diffusion models operate through two complementary processes: + +1. **Forward (noise addition)**: The model gradually corrupts clean video data by adding Gaussian noise over many timesteps until the data becomes indistinguishable from pure noise. This forward process is deterministic and follows a predefined noise schedule that controls the rate of corruption. + +2. **Reverse (denoising)**: The model learns to invert the forward process by predicting and removing noise at each timestep. During training, the model sees corrupted data at various noise levels and learns to estimate the original clean data or the noise that was added. During inference, the model starts with random noise and iteratively denoises it to generate new video content. + +The key insight is that learning to denoise at all noise levels enables generation: if you can remove noise step by step, you can transform random noise into coherent video. + +### Video-Specific Challenges + +Video diffusion extends image diffusion with additional complexity: + +- **Temporal consistency**: Models must maintain coherent motion and object identity across frames. This typically requires 3D attention mechanisms that attend across both spatial and temporal dimensions, or causal attention that processes frames sequentially. +- **Computational scale**: A 5-second video at 24 fps contains 120 frames. Generating each frame at 512×512 resolution requires processing over 31 million pixels, making efficient architectures and parallelization essential. +- **Conditioning mechanisms**: Text embeddings from encoders such as T5 provide semantic guidance, but video generation often requires additional conditioning on motion, camera movement, or reference frames. +- **Memory requirements**: Processing multiple frames simultaneously demands substantial GPU memory. Latent diffusion models compress videos into lower-dimensional representations before applying diffusion, reducing memory usage by 16-64×. + +## Diffusion Paradigms in DFM + +NeMo DFM implements two paradigms with different mathematical formulations and sampling characteristics: + +### EDM (Elucidating Diffusion Models) + +EDM frames diffusion as a Stochastic Differential Equation (SDE) where the forward process adds noise according to a continuous-time stochastic process, and the reverse process learns to integrate backward through time. 
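+
+Before the formal details, the following toy PyTorch sketch shows the core idea: corrupt clean latents at a randomly sampled noise level and train a network to predict the added noise. It is illustrative only and simplifies the noise schedule, preconditioning, and loss weighting that EDM actually uses; the `model(noisy, sigma)` signature is hypothetical.
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Toy denoising training step for video latents shaped (B, C, T, H, W).
+# Not DFM's implementation; schedules and weighting are simplified.
+def toy_denoising_step(model, clean_latents):
+    # Forward process: corrupt clean data at a per-sample noise level.
+    sigma = torch.rand(clean_latents.shape[0], 1, 1, 1, 1)
+    noise = torch.randn_like(clean_latents)
+    noisy = clean_latents + sigma * noise
+
+    # Reverse process: the network learns to predict the added noise.
+    predicted_noise = model(noisy, sigma)
+    return F.mse_loss(predicted_noise, noise)
+```
+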
+ +**Mathematical formulation**: EDM uses a variance-preserving SDE formulation where the noise schedule is parameterized to maintain consistent signal-to-noise ratios across timesteps. The model predicts either the noise ε, the denoised data x₀, or the score function ∇log p(x). + +**Sampling characteristics**: + +- Stochastic sampling paths allow controlled randomness during generation +- Classifier-free guidance scales the conditional and unconditional predictions: `output = unconditional + guidance_scale × (conditional - unconditional)` +- Typical inference requires 25-50 sampling steps, with quality improving at higher step counts +- Second-order samplers (Heun, DPM-Solver++) can reduce required steps + +**When to use EDM**: + +- Production inference where generation quality is critical +- Scenarios requiring classifier-free guidance for prompt adherence +- Models trained with variance-preserving objectives + +**Primary architecture**: DiT (Diffusion Transformer) + +### Flow Matching + +Flow matching learns a deterministic ordinary differential equation (ODE) that transports samples from a noise distribution to the data distribution through continuous-time flows. + +**Mathematical formulation**: Instead of learning to denoise at discrete timesteps, flow matching learns a velocity field v(x, t) that defines how samples should move through space over time. The generative process integrates this ODE: dx/dt = v(x, t). The training objective directly matches the learned velocity field to a target conditional flow. + +**Sampling characteristics**: + +- Deterministic sampling paths provide consistent generation given the same seed +- Typically requires fewer sampling steps (10-20) compared to EDM due to the direct ODE formulation +- Time-shift techniques can adjust the speed of the flow at different timesteps +- ODE solvers (Euler, Runge-Kutta) control the numerical integration accuracy + +**When to use Flow Matching**: + +- Applications requiring deterministic generation for reproducibility +- Scenarios where faster inference (fewer steps) is prioritized +- Research exploring flow-based generative models +- Models trained with flow matching objectives + +**Primary architecture**: WAN + +## Training Dynamics + +### EDM Training Objective + +EDM training optimizes the model to predict noise at randomly sampled timesteps. For each training sample, the framework corrupts the clean video by adding Gaussian noise at a random noise level t, then trains the model to estimate either the added noise ε, the clean data x₀, or the score ∇log p(x_t). The loss function typically uses mean squared error between the prediction and target: + +`L = E[||prediction - target||²]` + +The random sampling of timesteps ensures the model learns to denoise at all noise levels, from slight corruptions to nearly pure noise. Variance-preserving formulations maintain signal strength across timesteps, preventing the model from focusing disproportionately on certain noise levels. + +### Flow Matching Training Objective + +Flow matching training optimizes the model to predict velocity fields that transport noise to data. The framework samples a clean video, constructs a conditional flow path from noise to that specific video, then trains the model to predict the velocity field along that path: + +`L = E[||v_θ(x_t, t) - u_t(x_t)||²]` + +where v_θ is the learned velocity field and u_t is the target conditional velocity. 
The key difference from EDM is that flow matching learns a direct mapping through time rather than iterative denoising. Conditional flow matching uses simple linear interpolation paths during training, making the training objective straightforward while still enabling complex generation. + +## Inference Characteristics + +### EDM Sampling + +EDM sampling iteratively denoises random noise by reversing the learned diffusion process. Starting from pure Gaussian noise, the sampler makes multiple predictions at decreasing noise levels, each time removing a portion of the noise. The sampling trajectory can be deterministic or stochastic depending on the sampler choice. + +Classifier-free guidance modifies the sampling process by computing both conditional (text-guided) and unconditional predictions at each step, then extrapolating away from the unconditional prediction. Higher guidance scales (typically 7-15 for video) increase prompt adherence but can reduce diversity. The guidance computation doubles the inference cost since the model must make two predictions per step. + +Sampling quality depends on the number of steps and sampler algorithm. First-order samplers (DDPM, DDIM) require more steps but are simpler, while second-order samplers (Heun, DPM-Solver++) achieve similar quality with 50-70% fewer steps by using higher-order numerical approximations. + +### Flow Matching Sampling + +Flow matching sampling integrates the learned velocity field forward through time using an ODE solver. Starting from noise, the solver numerically integrates dx/dt = v(x, t) from t=0 to t=1, where the velocity field guides the sample along a continuous path toward the data distribution. + +The deterministic nature of ODE integration means the same seed and hyperparameters produce identical outputs, which benefits reproducibility and iterative refinement. Time-shift techniques can reweight the integration schedule to spend more computational budget at critical phases of generation. + +Flow matching typically achieves competitive quality with fewer function evaluations (10-20) compared to EDM because the direct velocity prediction avoids the iterative error accumulation of denoising steps. However, classifier-free guidance is less commonly used with flow matching, as the formulation doesn't naturally separate conditional and unconditional paths. + +## Text Conditioning Mechanisms + +Both paradigms condition generation on text prompts through embedding-based guidance: + +**Text encoder integration**: Models typically use T5 or CLIP text encoders to convert prompts into high-dimensional embeddings (for example, 768 or 1024 dimensions). These embeddings are injected into the diffusion model through cross-attention layers, where the model's hidden states attend to the text representations at each layer of the architecture. + +**Classifier-free guidance**: During training, the model randomly drops conditioning information (typically 10-20% of samples) to learn both conditional p(x|text) and unconditional p(x) distributions. During inference, the two predictions are combined: `output = unconditional + guidance_scale × (conditional - unconditional)`. This extrapolation increases the influence of the text condition, improving prompt adherence at the cost of reduced diversity. + +**Negative prompts**: Some implementations support negative text conditioning, which guides generation away from undesired content by subtracting the influence of negative prompt embeddings from the positive prompt guidance. 
The modified guidance becomes: `output = unconditional + guidance_scale × (positive_conditional - negative_conditional)`. + +## Architecture Implementations + +### DiT (Diffusion Transformer) + +DiT applies transformer architectures to diffusion models by treating the latent video representation as a sequence of patches. Each frame is divided into spatial patches (similar to Vision Transformers), and the patches are processed through transformer blocks with both spatial and temporal attention. + +**Key architectural components**: + +- **Patch embedding**: Divides frames into non-overlapping patches and projects them to the model dimension +- **Positional encoding**: Combines spatial (2D position within frame) and temporal (frame index) positional information +- **Attention patterns**: 3D attention across height, width, and time dimensions enables modeling spatial structure and temporal dynamics simultaneously +- **Adaptive layer normalization (AdaLN)**: Conditions the normalization on timestep and text embeddings, modulating the network behavior based on the current noise level and prompt +- **Hierarchical processing**: Some variants use multi-scale representations with downsampling and upsampling stages + +DiT architectures scale effectively with model size and training compute, making them suitable for large-scale video generation. + +### WAN (Flow-Based Architecture) + +WAN implements flow matching with architectural designs optimized for learning velocity fields. While sharing transformer-based components with DiT, WAN modifications support the continuous-time dynamics of flow matching. + +**Flow-specific design choices**: + +- Velocity prediction heads that output per-patch velocity fields +- Time embeddings that integrate smoothly across the continuous [0,1] interval rather than discrete diffusion timesteps +- Architectural modifications that support deterministic ODE integration during inference + +The WAN architecture demonstrates that flow matching can achieve competitive results with specialized architectural considerations for the flow-based training paradigm. + +## Hyperparameters and Trade-offs + +### Noise Schedule + +The noise schedule defines the variance of noise at each timestep, controlling the diffusion process trajectory. Common schedules include: + +**Linear schedule**: Noise variance increases linearly from near-zero to one. Simple but can be suboptimal for complex data distributions. + +**Cosine schedule**: Uses a cosine function to allocate more capacity to mid-range noise levels where the model learns the most semantic information. Generally produces better results than linear schedules. + +**Learned schedules**: Some advanced formulations learn the optimal noise schedule during training, adapting to the specific data distribution. + +During inference, the schedule determines the timesteps at which the model makes predictions. Non-uniform schedules can concentrate sampling steps at critical noise levels, improving efficiency. + +### Guidance Scale + +The guidance scale parameter γ controls the strength of conditional guidance in the formula: `output = unconditional + γ × (conditional - unconditional)`. 
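+
+In code, this extrapolation is a single operation applied to the two model predictions at each sampling step. The function below is an illustrative sketch, not a DFM API:
+
+```python
+import torch
+
+# Illustrative classifier-free guidance step (not a DFM function).
+def apply_guidance(pred_uncond: torch.Tensor, pred_cond: torch.Tensor, gamma: float) -> torch.Tensor:
+    # gamma = 1.0 recovers the plain conditional prediction; larger gamma
+    # pushes the output further along the unconditional-to-conditional direction.
+    return pred_uncond + gamma * (pred_cond - pred_uncond)
+```
+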
+ +**Trade-offs**: + +- γ = 1: No guidance, equivalent to standard conditional generation +- γ = 7-10: Typical range for video, balances prompt adherence and quality +- γ = 15+: Strong guidance, may improve text alignment but can reduce diversity and introduce artifacts +- γ < 1: Weakens conditioning, increases diversity + +Higher guidance scales amplify the difference between conditional and unconditional predictions, effectively increasing the model's confidence in prompt-related features. + +### Inference Steps + +The number of function evaluations during sampling determines the quality-speed trade-off: + +**EDM typical ranges**: + +- 25-50 steps: Standard quality, 2-5 seconds per video (depending on resolution and hardware) +- 50-100 steps: High quality, diminishing returns above 50 +- <25 steps: Fast sampling, potential quality degradation with first-order samplers + +**Flow matching typical ranges**: + +- 10-20 steps: Competitive quality due to direct velocity prediction +- 20-50 steps: Marginal improvements, higher computational cost + +Second-order ODE solvers can reduce required steps by 30-50% while maintaining quality through better numerical approximation of the integration path. + diff --git a/docs/about/concepts/distributed-training.md b/docs/about/concepts/distributed-training.md new file mode 100644 index 00000000..9e81efdf --- /dev/null +++ b/docs/about/concepts/distributed-training.md @@ -0,0 +1,357 @@ +--- +description: "Understanding distributed training parallelism in NeMo DFM: tensor parallelism, context parallelism, pipeline parallelism, and data parallelism" +categories: ["concepts-architecture"] +tags: ["distributed", "parallelism", "training", "tensor-parallelism"] +personas: ["mle-focused", "admin-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-concepts-distributed-training)= + +# Distributed Training + +NeMo DFM scales training across multiple GPUs and nodes using four parallelism strategies. These strategies address different bottlenecks: model size (TP, PP), sequence length (CP), and throughput (DP). + +## Overview + +| Type | What It Splits | When to Use | Communication | +|------|----------------|-------------|---------------| +| **Tensor Parallelism (TP)** | Model weights across GPUs | Model >40 GB per GPU | High-bandwidth (NVLink) | +| **Context Parallelism (CP)** | Sequence tokens across GPUs | Sequences >32K tokens | High-bandwidth (NVLink) | +| **Pipeline Parallelism (PP)** | Model layers across GPUs | Very deep models, multi-node | Low-bandwidth (point-to-point) | +| **Data Parallelism (DP)** | Training batches across GPUs | Standard scaling | Standard (all-reduce) | + +**Example**: A 70B parameter model with 16K sequence length on 128 GPUs might use TP=4, CP=2, PP=2, DP=8. + +## Tensor Parallelism (TP) + +Splits model weights across GPUs within each layer. A 40 GB layer with TP=4 uses 10 GB per GPU. + +### How It Works + +For a matrix multiplication `Y = XW`: +1. Weight matrix `W` is split column-wise across GPUs +2. Each GPU computes partial result using its weight shard +3. Results are combined via all-reduce operation + +**Example**: For a 12,288 × 12,288 weight matrix with TP=4, each GPU holds 12,288 × 3,072. 
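+
+The sketch below reproduces the shard shapes for that example with plain PyTorch. It is illustrative only; DFM's tensor-parallel layers handle sharding and the all-reduce internally.
+
+```python
+import torch
+
+# Column-wise split of a 12,288 x 12,288 weight across TP=4 ranks (illustration only).
+tp_size = 4
+full_weight = torch.empty(12_288, 12_288)
+shards = torch.chunk(full_weight, tp_size, dim=1)  # one column shard per GPU
+
+print(shards[0].shape)  # torch.Size([12288, 3072])
+print(len(shards))      # 4
+```
+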
+ +### When to Use + +- **Model size**: Model parameters >40 GB per GPU +- **Layer size**: Individual layers >10 GB +- **Hardware**: GPUs connected via NVLink or high-speed interconnect + +**Typical configurations**: +- TP=2: 70B-175B models on A100 80GB +- TP=4: 175B-400B models on H100 80GB +- TP=8: >400B models or limited GPU memory + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 4 # Split across 4 GPUs + cp_size: 1 + pp_size: 1 + dp_size: 2 # Calculated automatically if not specified +``` + +**Megatron**: +```python +model.tensor_model_parallel_size = 4 +``` + +### Performance Impact + +- **Memory**: Reduces per-GPU memory by `1/tp_size` +- **Communication**: All-reduce after each layer forward/backward pass +- **Bandwidth requirement**: High-bandwidth interconnect (NVLink, NVSwitch) required for efficient scaling + +## Context Parallelism (CP) + +Splits sequence tokens across GPUs. A 64K token sequence with CP=2 processes 32K tokens per GPU. + +### How It Works + +For attention computation: +1. Sequence split into chunks across GPUs +2. Each GPU computes attention for its chunk +3. Key-value pairs shared via all-gather +4. Results combined for full attention + +**Example**: A 64K token sequence with CP=4 splits into 4 chunks of 16K tokens, reducing attention memory by 75%. + +### When to Use + +- **Sequence length**: >32K tokens or frames +- **Memory bottleneck**: Attention memory exceeds 40% of total +- **Use case**: Video generation (100+ frames), long-context language models + +**Typical configurations**: +- CP=2: 32K-64K token sequences +- CP=4: 64K-128K token sequences +- CP=8: >128K token sequences + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 1 + cp_size: 2 # Split sequence across 2 GPUs + pp_size: 1 + dp_size: 4 +``` + +**Megatron**: +```python +model.context_parallel_size = 2 +``` + +### Performance Impact + +- **Memory**: Reduces attention memory by `1/cp_size` +- **Communication**: All-gather for key-value pairs per attention layer +- **Scaling**: Most effective when attention is memory bottleneck + +## Pipeline Parallelism (PP) + +Splits model layers across GPUs or nodes. A 48-layer model with PP=4 assigns 12 layers per stage. + +### How It Works + +Model divided into sequential stages: +1. Stage 1 (GPU 0): Layers 1-12 +2. Stage 2 (GPU 1): Layers 13-24 +3. Stage 3 (GPU 2): Layers 25-36 +4. Stage 4 (GPU 3): Layers 37-48 + +Activations flow forward through stages; gradients flow backward. Microbatching overlaps computation to reduce idle time. + +### When to Use + +- **Multi-node training**: Minimizes inter-node bandwidth requirements +- **Very deep models**: >80 layers that don't fit with TP alone +- **Heterogeneous networks**: Lower bandwidth between nodes than within + +**Typical configurations**: +- PP=2: 2-node training with fast inter-node links +- PP=4: 4+ node training +- PP=8: Large-scale multi-node deployments + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 2 + cp_size: 1 + pp_size: 4 # 4 pipeline stages + dp_size: 1 +``` + +**Megatron**: +```python +model.pipeline_model_parallel_size = 4 +``` + +### Performance Impact + +- **Memory**: Reduces per-GPU memory by ~`1/pp_size` +- **Communication**: Point-to-point activation/gradient transfers between stages +- **Efficiency**: Pipeline bubbles cause idle time during stage transitions; mitigated by microbatching and virtual pipeline parallelism + +## Data Parallelism (DP) + +Replicates the model and splits batches across GPUs. 
Each GPU processes different data with the same model. + +### How It Works + +For batch size 64 with DP=8: +1. Each GPU gets 8 samples +2. Each GPU computes gradients independently +3. Gradients averaged across all GPUs via all-reduce +4. All GPUs update with averaged gradients + +This increases effective batch size and training throughput. + +### When to Use + +- **Scaling throughput**: Increase samples per second +- **Batch size**: Increase effective batch size +- **Standard case**: After applying TP/CP/PP, use remaining GPUs for DP + +**Typical configurations**: +- DP=8: Single 8-GPU node +- DP=16-32: Multi-node without model parallelism +- DP=4-16: Remaining GPUs after TP/CP/PP + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 1 + cp_size: 1 + pp_size: 1 + dp_size: 8 # 8 data parallel replicas +``` + +**Megatron**: +```python +# Automatically calculated: DP = total_gpus / (TP × CP × PP) +# Example: 32 GPUs with TP=4, CP=2, PP=2 → DP = 32/(4×2×2) = 2 +``` + +### Performance Impact + +- **Memory**: No memory savings (full model copy per GPU) +- **Communication**: All-reduce for gradients after each backward pass +- **Scaling**: Near-linear speedup; efficiency depends on batch size + +## Combining Parallelism Strategies + +All four parallelism types can be combined. Total GPUs = TP × CP × PP × DP. + +### Real-World Examples + +**Small model, long sequences (8 GPUs)**: +```yaml +# Video generation: 13B model, 128K frames +fsdp: + tp_size: 1 # Model fits on single GPU + cp_size: 4 # Split long sequence + pp_size: 1 # No pipeline needed + dp_size: 2 # Use remaining GPUs for throughput +``` + +**Large model, standard sequences (64 GPUs)**: +```yaml +# Language model: 175B model, 8K tokens +fsdp: + tp_size: 4 # Split large model + cp_size: 1 # Sequence fits in memory + pp_size: 2 # 2-node deployment + dp_size: 8 # Scale throughput +``` + +**Massive model, multi-node (256 GPUs)**: +```yaml +# 500B+ model across 32 nodes +fsdp: + tp_size: 8 # Within-node parallelism + cp_size: 2 # Moderate sequences + pp_size: 4 # Across-node parallelism + dp_size: 4 # Remaining GPUs +``` + +### Design Principles + +1. **Start with TP**: If model doesn't fit, add TP first (requires high bandwidth) +2. **Add CP if needed**: For sequences >32K tokens +3. **Use PP for multi-node**: Pipeline across nodes to reduce inter-node traffic +4. **Fill with DP**: Use remaining GPUs for data parallelism + +## Choosing Parallelism Strategy + +### Decision Flowchart + +**Step 1**: Model fits on single GPU? +- **Yes**: Use DP only (simplest, most efficient) +- **No**: Go to Step 2 + +**Step 2**: Single node or multi-node? +- **Single node (8 GPUs)**: Use TP=2 or TP=4, then DP +- **Multi-node (16+ GPUs)**: Go to Step 3 + +**Step 3**: Configure multi-node strategy +1. Use **PP** across nodes (minimize inter-node bandwidth) +2. Use **TP** within nodes (leverage NVLink) +3. Add **CP** if sequences >32K tokens +4. 
Use **DP** for remaining GPUs + +### Hardware-Specific Guidance + +**8x A100 80GB (single node)**: +```yaml +# 70B model, 8K tokens +fsdp: + tp_size: 2 + cp_size: 1 + pp_size: 1 + dp_size: 4 +``` + +**4 nodes × 8 H100 80GB (32 GPUs)**: +```yaml +# 175B model, 16K tokens +fsdp: + tp_size: 4 # Within node + cp_size: 2 # Long sequences + pp_size: 2 # Across nodes (4 → 2 nodes per stage) + dp_size: 2 # Remaining GPUs +``` + +**32 nodes × 8 H100 80GB (256 GPUs)**: +```yaml +# 500B model, 8K tokens +fsdp: + tp_size: 8 # Full node + cp_size: 1 # Standard sequences + pp_size: 4 # Across nodes + dp_size: 8 # Remaining GPUs +``` + +### Performance vs Memory Trade-offs + +| Priority | Strategy | Rationale | +|----------|----------|-----------| +| **Maximum speed** | DP only | No communication overhead, if model fits | +| **Fit large model** | TP first | Most memory reduction per communication cost | +| **Long sequences** | CP | Only option for >32K tokens | +| **Multi-node scaling** | PP | Minimizes expensive inter-node bandwidth | + +## Implementation Details + +### Automodel (FSDP2) + +Automodel uses FSDP2 (Fully Sharded Data Parallel) with automatic optimizations: + +- **Weight sharding**: Distributes model weights across DP ranks +- **Gradient synchronization**: Overlaps communication with computation +- **Optimizer state sharding**: Distributes optimizer states across DP ranks to reduce per-GPU memory +- **Checkpointing**: Saves only one copy regardless of DP size + +Best for: Standard training workflows with minimal tuning. + +**Note**: Configure all parallelism dimensions in the `fsdp:` section of your YAML config. The framework handles DP calculation automatically if `dp_size` is not specified. + +### Megatron + +Megatron provides explicit control over parallelism configuration: + +- **Fine-grained tuning**: Set communication schedules and buffer sizes +- **Custom patterns**: Optimize for specific network topologies +- **Large-scale focus**: Optimized for 100+ GPU deployments + +Best for: Large-scale training requiring custom optimization. + +### Verifying Parallelism Configuration + +To check your current parallelism settings at runtime: + +**Megatron**: +```python +from megatron.core import parallel_state as ps + +tp_size = ps.get_tensor_model_parallel_world_size() +cp_size = ps.get_context_parallel_world_size() +pp_size = ps.get_pipeline_model_parallel_world_size() +# DP is calculated: dp_size = world_size / (tp_size * cp_size * pp_size) +``` + +**Automodel**: +Check your configuration YAML or training logs for the applied parallelism settings. The framework logs parallelism configuration at initialization. diff --git a/docs/about/concepts/index.md b/docs/about/concepts/index.md new file mode 100644 index 00000000..da3a4f32 --- /dev/null +++ b/docs/about/concepts/index.md @@ -0,0 +1,70 @@ +--- +description: "Core concepts and terminology for NeMo DFM including training paradigms, diffusion models, video data representation, and distributed training" +categories: ["concepts-architecture"] +tags: ["concepts", "fundamentals", "diffusion", "training", "distributed"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "beginner" +content_type: "concept" +modality: "universal" +--- + +(about-concepts)= + +# Concepts + +Learn about the core concepts you need to understand before using NeMo DFM. + +## Core Concepts + +These concepts are essential for understanding how NeMo DFM works and making informed decisions about your training and inference workflows. 
+ +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`git-branch;1.5em;sd-mr-1` Training Paradigms +:link: about-concepts-training-paradigms +:link-type: ref + +Understand the two main training approaches: Automodel (recipe-based) and Megatron (large-scale distributed), and when to use each. +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Diffusion Models for Video +:link: about-concepts-diffusion-models +:link-type: ref + +Learn how diffusion models work for video generation, including EDM and Flow Matching paradigms. +::: + +:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Video Data Representation +:link: about-concepts-video-data +:link-type: ref + +Understand how video data is represented in DFM: latents, VAE encoding, tokenization, and data formats. +::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Distributed Training +:link: about-concepts-distributed-training +:link-type: ref + +Learn about parallelism strategies: tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), and data parallelism (DP). +::: + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration System +:link: about-concepts-configuration +:link-type: ref + +Understand how DFM's configuration system works: YAML files, CLI overrides, and configuration precedence. +::: + +:::: + +```{toctree} +:hidden: +:maxdepth: 2 + +Training Paradigms +Diffusion Models for Video +Video Data Representation +Distributed Training +Configuration System +``` diff --git a/docs/about/concepts/training-paradigms.md b/docs/about/concepts/training-paradigms.md new file mode 100644 index 00000000..ca886854 --- /dev/null +++ b/docs/about/concepts/training-paradigms.md @@ -0,0 +1,279 @@ +--- +description: "Understanding the two training paradigms in NeMo DFM: Automodel and Megatron, and when to use each" +categories: ["concepts-architecture"] +tags: ["training", "automodel", "megatron", "paradigms"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "beginner" +content_type: "explanation" +--- + +(about-concepts-training-paradigms)= + +# Training Paradigms + +NeMo DFM offers two training paradigms: **Automodel** for quick prototyping and fine-tuning, and **Megatron** for large-scale production training. Each paradigm uses different configuration systems, parallelism strategies, and data loading approaches. + +## Overview + +Choose between two approaches based on your training goal: + +| Paradigm | Best For | Complexity | Configuration | Example | +|----------|----------|------------|---------------|---------| +| **Automodel** | Quick prototyping, fine-tuning, research | Lower | YAML-based recipes | `finetune.py` | +| **Megatron** | Large-scale pretraining, production training | Higher | Python recipes + YAML + CLI | `pretrain_dit_model.py` | + +## Understanding the Paradigms + +### Key Features + +Each paradigm takes a different approach to configuration, parallelism, and data loading. Understanding these differences helps you choose the right paradigm for your training workflow. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Automodel provides recipe-based training that abstracts distributed training complexity behind a single YAML configuration file. Pre-built recipes handle model initialization, data loading, and training loops automatically. + +**Configuration**: Single YAML file controls all training parameters. The recipe provides sensible defaults, and you override only what you need to change. 
+ +**Parallelism**: FSDP2 automatically distributes training across GPUs using tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), and data parallelism (DP). You configure parallelism strategy in the `fsdp` section without managing low-level details. + +**Data Loading**: Uses PyTorch DataLoader with standard dataset interfaces. Works with common formats like images, text, and Hugging Face datasets. + +**Model Integration**: Works directly with Hugging Face Diffusers models, making fine-tuning pre-trained models straightforward. +::: + +:::{tab-item} Megatron +:sync: megatron + +Megatron provides explicit control over every aspect of distributed training, from parallelism dimensions to data loading pipelines. Built for large-scale pretraining, it supports multi-node clusters with thousands of GPUs and custom model architectures. + +**Configuration**: Three-level configuration system provides maximum flexibility: + +1. Base recipe (Python) defines training logic and default parameters +2. YAML override files modify specific parameters for experiments +3. CLI overrides (highest precedence) enable quick parameter sweeps + +This layered approach supports Hydra-style syntax for complex configuration changes. + +**Parallelism**: Explicit control over all parallelism dimensions. You specify tensor parallel size, context parallel size, pipeline parallel stages, and data parallel degree independently. This fine-grained control enables optimal scaling for different model architectures and cluster configurations. + +**Data Loading**: Uses Energon data loader with webdataset format, optimized for distributed training at scale. Supports efficient data streaming across nodes and advanced features like sample reweighting and mixing multiple datasets. + +**Model Customization**: Full access to model architecture, forward pass logic, and training step. You define custom `ForwardStep` functions and modify model components directly. +::: +:::: + +### Use Cases + +Your training goal determines which paradigm fits best. Here are the scenarios where each paradigm excels. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +- **Fine-tuning**: Adapt pre-trained models to your dataset +- **Research prototyping**: Test ideas quickly without infrastructure overhead +- **Small-scale training**: Single-node or small multi-node setups +- **Standard architectures**: Using existing model recipes without customization +::: + +:::{tab-item} Megatron +:sync: megatron + +- **Large-scale pretraining**: Training foundation models from scratch on multi-node clusters +- **Production workflows**: Reproducible training with version-controlled configurations +- **Custom architectures**: Implementing novel model designs not available in standard recipes +- **Performance optimization**: Tuning parallelism and memory usage for specific hardware +- **Multi-stage training**: Complex workflows with different training phases +::: +:::: + +### Architecture + +Both paradigms organize code into layers that separate configuration from execution. The layer structure reflects each paradigm's design philosophy. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Automodel uses a three-layer architecture: + +1. **Recipe layer**: Pre-built training recipes (such as `TrainWan21DiffusionRecipe`) encapsulate training logic +2. **Config layer**: YAML files specify hyperparameters, data paths, and parallelism +3. 
**Execution layer**: `recipe.run_train_validation_loop()` handles training iteration +::: + +:::{tab-item} Megatron +:sync: megatron + +Megatron uses a modular architecture with clear separation of concerns: + +1. **Recipe layer**: Base Python configuration (`pretrain_config()`) defines model, optimizer, and training parameters +2. **Override layer**: YAML files and CLI arguments modify base configuration +3. **Execution layer**: `pretrain()` function orchestrates distributed training with custom forward steps +4. **Bridge layer**: Megatron-Bridge handles low-level distributed training mechanics +::: +:::: + +## Comparing the Paradigms + +The paradigms differ fundamentally in how they balance ease of use against control and scalability. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +**Configuration**: Single YAML file with recipe defaults + +**Parallelism**: Automatic FSDP2 (less control) + +**Data Loading**: PyTorch DataLoader, standard formats + +**Scalability**: Small multi-node + +**Setup Complexity**: Low + +**Customization**: Recipe-level only + +**Best For**: Quick experiments, fine-tuning +::: + +:::{tab-item} Megatron +:sync: megatron + +**Configuration**: Python base + YAML overrides + CLI + +**Parallelism**: Explicit TP/CP/PP/DP (full control) + +**Data Loading**: Energon data loader with distributed streaming + +**Scalability**: Large multi-node clusters + +**Setup Complexity**: High + +**Customization**: Full code-level access + +**Best For**: Large-scale pretraining, production +::: +:::: + +### Configuration Systems + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Uses a single YAML file where you specify training parameters. The recipe provides defaults for most settings, so you only override what matters for your experiment. Configuration is simple and flat. +::: + +:::{tab-item} Megatron +:sync: megatron + +Uses a three-level system: start with a Python recipe that defines base configuration, override specific parameters with YAML files for experiments, and apply final tweaks via CLI for parameter sweeps. This complexity enables reproducible experiments with version control. +::: +:::: + +### Parallelism Strategies + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Automatically configures FSDP2 to distribute your model across GPUs. You specify high-level parallelism settings in the `fsdp` section, and the framework determines optimal shard placement. This works well for standard model architectures. +::: + +:::{tab-item} Megatron +:sync: megatron + +Requires you to explicitly set tensor parallel size, context parallel size, pipeline stages, and data parallel degree. This granular control enables optimal memory usage and communication patterns for very large models or custom architectures. +::: +:::: + +### Data Loading Pipelines + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Uses PyTorch DataLoader with standard Python datasets. This familiar interface works with images, text files, and Hugging Face datasets without preprocessing. +::: + +:::{tab-item} Megatron +:sync: megatron + +Uses the Energon data loader optimized for distributed training at scale. This loader enables efficient streaming of massive datasets across nodes and supports advanced features like deterministic sampling and dataset mixing. +::: +:::: + +## Selecting Your Paradigm + +Your training goal determines which paradigm to use. 
+ +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +**Fine-tuning existing models**: Automodel integrates directly with Hugging Face models and provides pre-built fine-tuning recipes. + +**Research experiments**: Quick iteration with YAML-only configuration changes. Test hypotheses in minutes instead of hours. + +**Small-scale training**: Training on single-node or small multi-node setups where automatic parallelism configuration works well. + +**Standard architectures**: Using proven model architectures without custom modifications. +::: + +:::{tab-item} Megatron +:sync: megatron + +**Pretraining foundation models**: Large-scale training from scratch where Energon's data loading efficiency and explicit parallelism control are essential. + +**Production deployments**: Reproducible training with version-controlled Python recipes and configuration overrides. + +**Custom model architectures**: Implementing novel designs that require code-level modifications to model structure and training steps. + +**Performance-critical training**: Optimizing memory usage and communication patterns for specific hardware configurations. + +**Large clusters**: Training on large multi-node clusters where explicit parallelism management becomes necessary. +::: +:::: + +## Paradigm Interoperability + +Model checkpoints from one paradigm can often be loaded in the other, but training workflows are not interchangeable. The paradigms use different: + +- **Configuration formats**: YAML-only versus Python + YAML + CLI +- **Data formats**: PyTorch datasets versus webdataset +- **Parallelism APIs**: FSDP2 versus explicit Megatron parallelism + +Plan to use one paradigm consistently throughout your project. Converting training infrastructure between paradigms requires rewriting configuration and data loading code. + +**Inference**: Both paradigms can export models to standard formats for inference deployment. + +--- + +## Experimental Comparison + +For a detailed experimental comparison of Automodel vs Megatron training paths, including training curves and performance analysis, see [Automodel vs Megatron Comparison](../comparison.md). + +The comparison includes: +- Two-stage training experiment (Text→Image, Text→Video) +- Training loss curves for both paths +- Important caveats about implementation differences +- Performance characteristics analysis diff --git a/docs/about/concepts/video-data.md b/docs/about/concepts/video-data.md new file mode 100644 index 00000000..8c579306 --- /dev/null +++ b/docs/about/concepts/video-data.md @@ -0,0 +1,426 @@ +--- +description: "How video data is represented in NeMo DFM: latents, VAE encoding, tokenization, and data formats" +categories: ["concepts-architecture"] +tags: ["data", "video", "latents", "vae", "tokenization"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-concepts-video-data)= + +# Video Data Representation + +NeMo DFM processes videos in latent space rather than pixel space, reducing memory requirements and accelerating training by up to 64×. + +## Overview + +Videos in DFM follow a four-stage pipeline: + +1. **Encode to latents**: VAE (Variational Autoencoder) compresses raw pixels into latent space +2. **Store as tensors**: Compressed latents are saved with text embeddings +3. **Process with diffusion**: Models operate on compact latent representations +4. 
**Decode to pixels**: VAE reconstructs final video frames + +**Key benefit**: A 1080p video (1920×1080×3 channels×120 frames = 746 million values) compresses to latents of 16×15×135×240 = 8.6 million values—a 64× reduction. + +## Video Latents + +### Tensor Format + +Video latents are 4D tensors with shape `(C, T, H, W)`: + +| Dimension | Description | Example Values | +|-----------|-------------|----------------| +| **C** | Channels | 16 (standard for most VAEs) | +| **T** | Temporal frames | 15, 30, 60, 120 (varies by video length) | +| **H** | Latent height | 135 for 1080p (1080÷8) | +| **W** | Latent width | 240 for 1920p (1920÷8) | + +**Spatial compression**: VAEs downsample by 8× in both height and width. A 1920×1080 frame becomes 240×135 in latent space. + +**Temporal compression**: Some VAEs also compress temporally. A 120-frame video might compress to 15 latent frames (8× temporal compression). + +### Why Latents? + +**Memory efficiency**: Latent representation is 64× smaller than raw pixels. + +- Raw 1080p video (120 frames): 746 MB +- Latent representation: 12 MB +- Enables training on longer videos with limited GPU memory + +**Training speed**: Diffusion models process 8.6 million values instead of 746 million values—approximately 8× faster per iteration. + +**Quality preservation**: VAE reconstruction maintains perceptual quality. Peak Signal-to-Noise Ratio (PSNR) remains above 30 dB for most VAE models. + +## VAE Encoding and Decoding + +### Encoding Process + +The VAE encoder transforms raw video frames into compact latent tensors: + +```python +import torch +from diffusers import AutoencoderKLWan + +# Load video: (batch, channels, time, height, width) +video_frames = torch.randn(1, 3, 120, 1080, 1920) # 1080p, 120 frames + +# Normalize to [-1, 1] range +video_frames = video_frames * 2.0 - 1.0 + +# Initialize VAE (WAN 2.1) +vae = AutoencoderKLWan.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + subfolder="vae" +) + +# Encode to latents +latent_dist = vae.encode(video_frames) +latents = latent_dist.latent_dist.mean # Use mean for deterministic encoding +# Output shape: (1, 16, 120, 135, 240) +# Compression: 1× in time (no temporal compression), 8× in height, 8× in width +``` + +**Encoding steps**: + +1. Normalize input frames to VAE's expected range (usually [-1, 1]) +2. Pass through encoder network +3. Quantize or sample latent distribution +4. Output compressed latent tensor + +### Decoding Process + +The VAE decoder reconstructs video frames from latents: + +```python +# Generate or load latents +latents = torch.randn(1, 16, 120, 135, 240) + +# Decode to video frames +reconstructed_video = vae.decode(latents).sample +# Output shape: (1, 3, 120, 1080, 1920) + +# Denormalize from [-1, 1] to [0, 255] for video output +video_uint8 = ((reconstructed_video + 1.0) * 127.5).clamp(0, 255).to(torch.uint8) +``` + +**Decoding steps**: + +1. Pass latents through decoder network +2. Upsample to original spatial and temporal resolution +3. Denormalize to pixel value range +4. 
Output reconstructed video frames + +### VAE Models + +DFM supports multiple VAE architectures: + +**Cosmos Tokenizer** (Continuous Video: `Cosmos-Tokenizer-CV8x8x8`): + +- Compression: 8×8×8 (time × height × width) +- Channels: 16 latent channels +- Use case: DiT models, continuous latent diffusion +- Normalization: Input frames in [-1, 1] + +**Cosmos Tokenizer** (Discrete Video: `Cosmos-Tokenizer-DV4x8x8`): + +- Compression: 4×8×8 (time × height × width) +- Channels: 6 discrete code channels (codebook size 64K) +- Use case: Autoregressive models, discrete token generation +- Normalization: Input frames in [-1, 1] + +**WAN VAE**: + +- Compression: 1×8×8 (no temporal compression) +- Channels: 16 latent channels +- Use case: WAN models, Flow Matching models +- Normalization: Input frames converted to [-1, 1] internally + +Each VAE requires specific normalization. Check model documentation before preprocessing. + +## Data Formats + +### Training Data Formats + +DFM supports two paradigms with different data formats: + +#### Automodel Format + +Automodel uses pickled `.meta` files containing preprocessed latents: + +```python +# Example .meta file structure +{ + "video_latents": torch.Tensor, # Shape: (C, T, H, W) + "text_embeddings": torch.Tensor, # Shape: (S, D) + "first_frame": np.ndarray, # First frame (H, W, 3) in [0, 255] + "metadata": dict, # Original video metadata + "num_frames": int, # Frame count + "original_filename": str, # Source video filename + "original_video_path": str, # Source video path + "deterministic_latents": bool, # Encoding mode used + "memory_optimization": bool, # Memory optimization enabled + "model_version": str, # VAE model version (e.g., "wan2.1") + "resize_settings": dict, # Resize configuration +} +``` + +**File organization**: + +```text +dataset/ +├── sample_0000.meta +├── sample_0001.meta +├── sample_0002.meta +└── ... +``` + +#### Megatron Format + +Megatron supports two distributed data formats: + +**Webdataset format**: + +- Tar archives containing video samples +- Each sample is a set of files with shared basename +- Example: `sample001.latent.pth`, `sample001.text.pth`, `sample001.json` + +**Energon format**: + +- Optimized for distributed data loading across nodes +- Supports efficient sharding and data parallelism +- Recommended for multi-node training at scale + +Both formats include latents, text embeddings, and metadata per sample. + +### DiffusionSample Structure + +The `DiffusionSample` class represents a training sample: + +```python +@dataclass +class DiffusionSample: + video: torch.Tensor # Video latents (C, T, H, W) + context_embeddings: torch.Tensor # Text embeddings (S, D) + context_mask: torch.Tensor # Text mask + image_size: torch.Tensor # [height, width] + fps: torch.Tensor # Frame rate + num_frames: torch.Tensor # Frame count + # ... additional metadata +``` + +## Text Conditioning + +### Text Embeddings + +Text prompts guide video generation through learned embeddings. DFM uses T5 or similar transformer-based text encoders. 
+ +**Embedding dimensions**: + +| Encoder | Sequence Length (S) | Embedding Dim (D) | Model Size | +|---------|---------------------|-------------------|------------| +| T5-Base | Up to 512 tokens | 768 | 220M params | +| T5-Large | Up to 512 tokens | 1024 | 770M params | +| T5-XXL | Up to 512 tokens | 4096 | 11B params | + +**Process**: Text → Tokenizer → Token IDs → Encoder → Embeddings `(S, D)` + +### Text Encoding Example + +```python +from transformers import AutoTokenizer, UMT5EncoderModel +import torch + +# Initialize UMT5 encoder (used by WAN models) +tokenizer = AutoTokenizer.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + subfolder="text_encoder" +) +text_encoder = UMT5EncoderModel.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + subfolder="text_encoder" +) + +# Encode prompt +prompt = "A robot cooking pasta in a modern kitchen" +inputs = tokenizer( + prompt, + max_length=512, + padding="max_length", + truncation=True, + return_tensors="pt", + return_attention_mask=True, +) + +with torch.no_grad(): + text_embeddings = text_encoder( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"] + ).last_hidden_state +# Output shape: (1, 512, D) where D is embedding dimension + +# Embeddings condition the diffusion model +# via cross-attention layers during generation +``` + +**Attention masking**: Padding tokens are masked so the model only attends to real tokens, not padding. + +## Video Tokenization + +Some models discretize continuous latents into tokens for autoregressive generation. + +### Cosmos Video Tokenizer + +The Cosmos tokenizer converts continuous latents into discrete token sequences: + +**Process**: + +1. Encode video to continuous latents: `(C, T, H, W)` +2. Quantize latents using learned codebook +3. Output discrete token indices: `(T×H×W,)` flattened sequence + +**Use cases**: + +- Autoregressive video models (predict next token) +- Enables language model-style training on videos +- Supports efficient caching during generation + +### Causal Video Tokenizer + +Causal tokenizers maintain temporal causality for autoregressive models: + +- **Temporal masking**: Each frame can only see previous frames +- **Autoregressive generation**: Generate frame-by-frame sequentially +- **Architecture compatibility**: Required for GPT-style video models + +**Example**: Generating a 120-frame video autoregressively produces frames 1→2→3→...→120, where each frame conditions on all previous frames. + +## Sequence Packing + +Sequence packing improves GPU utilization during distributed training: + +**Without packing**: + +```text +Batch 1: [sequence_A (50 tokens), padding (14 tokens)] # 22% wasted +Batch 2: [sequence_B (40 tokens), padding (24 tokens)] # 37% wasted +``` + +**With packing**: + +```text +Batch 1: [sequence_A (50 tokens), sequence_B (14 tokens)] # 0% wasted +``` + +**Implementation**: + +- Combine multiple sequences into fixed-length batches +- Use attention masks to separate sequences +- Track sequence boundaries for gradient computation + +**Benefits**: Up to 2× throughput improvement on datasets with variable-length videos. + +## Data Preprocessing + +### Preparation Pipeline + +Preprocessing transforms raw videos into training-ready samples: + +1. **Load raw video**: Read MP4, AVI, or other video formats +2. **Resize and crop**: Standardize to target resolution (for example, 1080p) +3. **Normalize frames**: Convert to expected range ([-1, 1] or [0, 1]) +4. **Encode to latents**: Apply VAE encoder +5. 
**Encode text prompts**: Apply text encoder +6. **Package sample**: Create `DiffusionSample` with metadata +7. **Save to disk**: Write as `.meta` file or webdataset entry + +**Batch processing**: Process videos in parallel to maximize throughput. Use multi-GPU encoding for large datasets. + +### Preprocessing Example + +```python +from dfm.src.automodel.utils.data.preprocess_resize import VideoPreprocessor +from pathlib import Path + +# Initialize preprocessor +preprocessor = VideoPreprocessor( + video_folder="raw_videos", + wan21_model_id="Wan-AI/Wan2.1-T2V-14B-Diffusers", + output_folder="processed_meta", + device="cuda", + deterministic_latents=True, # Use deterministic encoding (no flares) + target_size=(1080, 1920), # Target resolution (height, width) + resize_mode="bilinear", + maintain_aspect_ratio=True, +) + +# Process all videos in folder +# Requires meta.json with video metadata in video_folder +preprocessor.process_all_videos() + +# Or load existing processed data +data = preprocessor.load_processed_data("sample_0000.meta") + +# Data contains: +# - video_latents: (16, T, 135, 240) +# - text_embeddings: (1, 512, D) +# - first_frame: (1080, 1920, 3) +# - metadata: Original video metadata +``` + +### Preprocessing Tools + +DFM provides command-line tools and Python APIs: + +**Command-line preprocessing**: + +```bash +python dfm/src/automodel/utils/data/preprocess_resize.py \ + --video_folder raw_videos/ \ + --output_folder processed_meta/ \ + --model Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --height 1080 \ + --width 1920 \ + --resize_mode bilinear \ + --device cuda +``` + +**Python API**: + +- `VideoPreprocessor`: End-to-end video preprocessing (`dfm.src.automodel.utils.data.preprocess_resize`) +- `AutoencoderKLWan.encode()` / `.decode()`: Manual latent encoding (Diffusers library) +- `UMT5EncoderModel`: Text prompt encoding (Transformers library) +- `DiffusionSample`: Training sample dataclass (`dfm.src.megatron.data.common.diffusion_sample`) + +## Metadata + +Each training sample includes metadata for proper model conditioning: + +| Metadata Field | Type | Purpose | Example | +|----------------|------|---------|---------| +| **image_size** | `(int, int)` | Original video resolution | `(1080, 1920)` | +| **fps** | `int` | Frame rate | `24`, `30`, `60` | +| **num_frames** | `int` | Total frame count | `120` | +| **padding_mask** | `torch.Tensor` | Valid vs padded regions | Binary mask | +| **position_ids** | `torch.Tensor` | Spatial/temporal positions | 3D position indices | + +**Why metadata matters**: + +- **Resolution conditioning**: Models can generate videos at different resolutions +- **FPS conditioning**: Control playback speed and motion dynamics +- **Frame count conditioning**: Generate videos of varying lengths +- **Padding masks**: Prevent model from learning on invalid padded regions + +**Example usage**: + +```python +# Model conditions on metadata during training +loss = model( + latents=sample.video, + text_embeddings=sample.context_embeddings, + image_size=sample.image_size, # Conditions generation + fps=sample.fps, # Conditions motion dynamics + num_frames=sample.num_frames, # Conditions temporal length +) +``` diff --git a/docs/about/index.md b/docs/about/index.md new file mode 100644 index 00000000..206787f5 --- /dev/null +++ b/docs/about/index.md @@ -0,0 +1,99 @@ +--- +description: "Overview of NeMo DFM, a framework for large-scale training and inference of video diffusion models with Automodel and Megatron support" +categories: ["getting-started"] +tags: 
["overview", "platform", "diffusion", "video-models", "getting-started"] +personas: ["data-scientist-focused", "mle-focused", "admin-focused", "devops-focused"] +difficulty: "beginner" +content_type: "concept" +modality: "universal" +--- + +(about-overview)= + +# Overview of NeMo DFM + +NeMo DFM (Diffusion Foundation Models) trains and runs inference on video diffusion models at scale. It combines two training approaches—Automodel for recipe-based workflows and Megatron for multi-node distributed training—with support for multiple architectures including DiT, WAN, and EDM. + +**Use NeMo DFM to:** + +- Train video diffusion models using Flow Matching or EDM paradigms +- Scale training across GPUs and nodes with tensor, context, and pipeline parallelism +- Run efficient video generation inference on trained models +- Experiment with different architectures (DiT, WAN, EDM) using the same framework + +## Who Should Use DFM + +- **Machine Learning Engineers**: Train video foundation models using diffusion and autoregressive architectures with configuration-driven workflows. +- **Data Scientists**: Process video datasets with VAE encoding and tokenization pipelines for diffusion model training. +- **Cluster Administrators**: Deploy and monitor large-scale distributed training jobs across multi-node GPU clusters. +- **Researchers**: Experiment with diffusion architectures (DiT, EDM, WAN), training paradigms (Flow Matching, EDM), and parallelism strategies. + +## What DFM Provides + +**Two Training Paradigms**: + +- **Automodel**: Recipe-based training with DTensor for 3D parallelism, optimized for experimentation and prototyping +- **Megatron**: Large-scale distributed training with comprehensive parallelism support (TP, CP, PP, DP) for production workloads + +**Architectures**: + +- **DiT** (Diffusion Transformer): Transformer-based diffusion models for video generation +- **WAN**: Flow Matching architecture for alternative training dynamics +- **EDM** (Elucidating Diffusion Models): Improved diffusion training with better convergence + +**Video Processing**: + +- VAE encoding for latent space representation +- Tokenization pipelines for efficient video data handling +- Support for variable-length videos and diverse resolutions + +**Distributed Training**: + +- Tensor parallelism (TP) for splitting model layers across GPUs +- Context parallelism (CP) for long-sequence training +- Pipeline parallelism (PP) for splitting models across stages +- Data parallelism (DP) for scaling batch sizes + +## Learn Core Concepts + +Understand the foundational concepts before training or deploying video diffusion models. + +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`git-branch;1.5em;sd-mr-1` Training Paradigms +:link: about-concepts-training-paradigms +:link-type: ref + +Understand the two main training approaches: Automodel (recipe-based) and Megatron (large-scale distributed), and when to use each. +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Diffusion Models for Video +:link: about-concepts-diffusion-models +:link-type: ref + +Learn how diffusion models work for video generation, including EDM and Flow Matching paradigms. +::: + +:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Video Data Representation +:link: about-concepts-video-data +:link-type: ref + +Understand how DFM represents video data: latents, VAE encoding, tokenization, and data formats. 
+::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Distributed Training +:link: about-concepts-distributed-training +:link-type: ref + +Learn about parallelism strategies: tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), and data parallelism (DP). +::: + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration System +:link: about-concepts-configuration +:link-type: ref + +Understand how DFM's configuration system works: YAML files, CLI overrides, and configuration precedence. +::: + +:::: diff --git a/docs/conf.py b/docs/conf.py index 9f9bb77f..c2439241 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -39,6 +39,7 @@ "sphinx.ext.doctest", # Allows testing in docstrings "sphinx.ext.napoleon", # For google style docstrings "sphinx_copybutton", # For copy button in code blocks + "sphinx_design", # For grid layouts and card components ] templates_path = ["_templates"] @@ -61,9 +62,17 @@ "deflist", # Supports definition lists with term: definition format "fieldlist", # Enables field lists for metadata like :author: Name "tasklist", # Adds support for GitHub-style task lists with [ ] and [x] + "substitution", # Enables variable substitutions like {{product_name}} ] myst_heading_anchors = 5 # Generates anchor links for headings up to level 5 +# MyST substitutions - variables that can be used in markdown files +myst_substitutions = { + "product_name": "NeMo DFM", +} + +myst_heading_anchors = 5 # Generates anchor links for headings up to level 5 + # -- Options for Autodoc2 --------------------------------------------------- sys.path.insert(0, os.path.abspath("..")) diff --git a/docs/get-started/automodel.md b/docs/get-started/automodel.md new file mode 100644 index 00000000..60a76757 --- /dev/null +++ b/docs/get-started/automodel.md @@ -0,0 +1,658 @@ +--- +description: "End-to-end Automodel quickstart: fine-tune and generate videos" +categories: ["getting-started", "automodel"] +tags: ["quickstart", "tutorial", "automodel"] +personas: ["data-scientist-focused"] +difficulty: "beginner" +content_type: "tutorial" +--- + +(gs-automodel)= + +# Automodel Workflow + +Complete end-to-end tutorial for fine-tuning and generating videos using NeMo DFM's Automodel approach. + +:::{card} + +**Goal**: Fine-tune a pretrained video model and generate videos from your checkpoint. + +^^^ + +**In this tutorial, you will**: + +1. Fine-tune the WAN2.1 model on your dataset +2. Generate videos from your trained model +3. Experiment with generation parameters + +**Time**: 30-45 minutes (depending on training duration) + +::: + +:::{button-ref} gs-index +:color: secondary +:outline: +:ref-type: ref + +← Back to Get Started +::: + +## Before You Start + +Make sure you have completed: + +- ✅ [Installation](installation.md) +- ✅ Multi-GPU setup (recommended: 8 GPUs) +- ✅ Dataset in Energon format or custom dataloader + +--- + +(gs-automodel-training-section)= +## Fine-Tune WAN2.1 Model + +Fine-tune the WAN2.1 text-to-video model using Automodel's recipe-based training approach. + +**Key concept**: Automodel handles parallelism automatically using FSDP2—no manual tensor or pipeline parallelism configuration needed. + +:::{dropdown} What happens during training +:icon: info + +1. Load pretrained WAN2.1 model from Hugging Face +2. Configure FSDP2 parallelism automatically +3. Train on your dataset with flow matching +4. Save checkpoints periodically +::: + +### 1. 
Prepare Your Dataset + +(gs-automodel-data-requirements)= + +You can prepare your dataset in two ways: + +- **Start with raw videos**: Place your `.mp4` files in a folder and use data-preparation scripts to scan videos and generate a `meta.json` entry for each sample +- **Bring your own `meta.json`**: If you already have annotations, create `meta.json` yourself following the schema below + +#### Dataset Structure + +```text +/ +├── video1.mp4 +├── video2.mp4 +└── meta.json +``` + +:::{note} +If you have captions, you can also include per-video named `