diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml index d01e3afc..c3582df0 100644 --- a/.github/workflows/build-docs.yml +++ b/.github/workflows/build-docs.yml @@ -44,10 +44,13 @@ jobs: uv pip install \ "myst-parser>=4.0.1" \ "nvidia-sphinx-theme>=0.0.8" \ - "sphinx>=8.1.3" \ - "sphinx-autobuild>=2024.10.3" \ + "sphinx>=8.2.3" \ + "sphinx-autobuild>=2025.8.25" \ "sphinx-autodoc2>=0.5.0" \ - "sphinx-copybutton>=0.5.2" + "sphinx-copybutton>=0.5.2" \ + "sphinxcontrib-mermaid>=1.0.0" \ + "sphinx-design>=0.6.1" \ + "swagger-plugin-for-sphinx>=6.0.0" - name: Build documentation run: | diff --git a/docs/INFORMATION_CHECKLIST.md b/docs/INFORMATION_CHECKLIST.md new file mode 100644 index 00000000..e341ba7e --- /dev/null +++ b/docs/INFORMATION_CHECKLIST.md @@ -0,0 +1,524 @@ +# Information Preservation Checklist + +**Purpose**: Verify all unique information from old docs is captured in new structure. + +**How to Use**: Check off each item as it's integrated into the new docs. Items can be integrated anywhere logical in the new IA. + +--- + +## 1. Performance Benchmarks (`performance-summary.md`) + +**Target Location**: `docs/reference/performance.md` (REFERENCE) + +### Nomenclature Definitions +- [ ] **GBS**: Global Batch Size +- [ ] **MBS**: Micro Batch Size +- [ ] **FSDP**: Fully Sharded Data Parallel + - [ ] FSDP = 1: use FSDP + - [ ] FSDP = 0: use DDP (Distributed Data Parallel) +- [ ] **TP**: Tensor Parallel Size +- [ ] **SP**: Sequence Parallel +- [ ] **PP**: Pipeline Parallel Size +- [ ] **CP**: Context Parallel Size +- [ ] **VP**: Virtual Pipeline Parallel Size +- [ ] **EP**: Expert Parallel Size + +### Performance Metrics +- [ ] **Tokens/sec/GPU**: Throughput per GPU (explanation) +- [ ] **Model TFLOP/sec/GPU**: Model floating-point operations per second per GPU (explanation) + +### Benchmark Tables + +#### Megatron-Core Pre-Training Performance + +**DGX-GB200**: +- [ ] WAN 2.1 14B benchmark row (32 GPUs, GBS=64, MBS=1, SeqLen=37440, FSDP=0, TP=1, SP=0, PP=1, CP=4, VP=0, EP=0, TFLOP=787.59) + +**DGX-GB300**: +- [ ] WAN 2.1 14B benchmark row (32 GPUs, GBS=64, MBS=1, SeqLen=37440, FSDP=0, TP=1, SP=0, PP=1, CP=2, VP=0, EP=0, TFLOP=1,022.26) + +**DGX-H100**: +- [ ] WAN 2.1 14B benchmark row (128 GPUs, GBS=128, MBS=1, SeqLen=37440, FSDP=0, TP=2, SP=1, PP=1, CP=4, VP=0, EP=0, TFLOP=325.77) + +#### NeMo Automodel Pre-Training Performance + +**DGX-H100**: +- [ ] WAN 2.1 14B benchmark row (8 GPUs, GBS=8, MBS=1, SeqLen=37440, FSDP=1, DP=8, TP=1, SP=1, PP=1, CP=1, VP=0, EP=0, TFLOP=175.88) +- [ ] WAN 2.1 14B benchmark row (64 GPUs, GBS=64, MBS=1, SeqLen=37440, FSDP=1, DP=64, TP=1, SP=1, PP=1, CP=1, VP=0, EP=0, TFLOP=228.85) + +### Context Information +- [ ] Note about referring to `examples/megatron/recipes/wan/conf` for updated YAML configs +- [ ] Statement about ongoing optimization + +--- + +## 2. 
Paradigm Comparison (`mcore_automodel_comparision_wan21.md`) + +**Target Location**: `docs/about/comparison.md` OR integrate into `docs/about/concepts/training-paradigms.md` (EXPLANATION) + +### Experiment Overview +- [ ] Goal: Compare two training paths for WAN 2.1 +- [ ] Path 1: Diffusers + Automodel training path (with links) +- [ ] Path 2: Megatron-Core + Megatron-Bridge training path (with links) +- [ ] Two-stage training approach explanation +- [ ] Dataset: 3,000 videos (frames extracted for Stage 1) + +### Stage 1: Text-to-Image +- [ ] Extract 40 frames per video → 120k images +- [ ] Resolution: 240 × 416 +- [ ] Each frame uses same caption as parent video +- [ ] Global batch size: 2560 images +- [ ] Learning rate: warmup 10k → 5e-5 constant +- [ ] Hardware: 10 nodes (80 GPUs) +- [ ] Megatron-Core parallelism: TP=1, PP=1, CP=1, Sequence packing (32 samples/pack) +- [ ] Automodel parallelism: FSDP, micro_batch_size = 32 +- [ ] Training curve image: `lm_loss_text2image_3kvids.png` + +### Stage 2: Text-to-Video +- [ ] Full videos → 3,000 videos +- [ ] Resolution: 240 × 416, duration 4–8 seconds +- [ ] Global batch size: 80 videos +- [ ] Learning rate: 5e-5 constant +- [ ] Hardware: 10 nodes (80 GPUs) +- [ ] Megatron-Core parallelism: TP=1, PP=1, CP=1, micro_batch_size = 1 +- [ ] Automodel parallelism: FSDP, micro_batch_size = 1 +- [ ] Training curve image: `lm_loss_text2video_3kvids.png` + +### Results Analysis +- [ ] Note: Training loss smoothed with 50 steps averaging +- [ ] Observation: Training curves have similar value ranges but don't match exactly +- [ ] Explanation: Expected due to differences in implementation and training loop setups +- [ ] **Critical Caveat**: Megatron-Core applies same diffusion timesteps to all samples in pack (not different timesteps per sample) +- [ ] **Critical Caveat**: Training loss for Megatron-Core fluctuates more than AutoModel, especially at beginning + +### Context Notes +- [ ] Note: Partial convergence test (3K videos insufficient for generalization) +- [ ] Note: Only demonstrates reconstruction ability, not novel generation + +--- + +## 3. 
Automodel Training Information (`automodel_training_doc.md`) + +**Target Location**: Integrate into `docs/get-started/automodel.md` (TUTORIAL with progressive disclosure) + +### Overview +- [ ] Currently Supported: WAN 2.1 Text-to-Video (1.3B and 14B models) + +### Docker Setup +- [ ] Build command: `docker build -f docker/Dockerfile.ci -t dfm-training .` +- [ ] Run command with all flags (--gpus, -v mounts, --ipc=host, ulimit settings) +- [ ] Inside container: Initialize submodules command + +### Data Preparation + +#### Dataset Options +- [ ] Option 1: Start with raw videos (use data-preparation scripts) +- [ ] Option 2: Bring your own `meta.json` + +#### Dataset Structure +- [ ] Folder structure example (`/` with videos and `meta.json`) +- [ ] Note about per-video `.jsonl` captions being picked up automatically + +#### meta.json Schema +- [ ] Complete JSON schema with all fields: + - [ ] `file_name` + - [ ] `width` + - [ ] `height` + - [ ] `start_frame` + - [ ] `end_frame` + - [ ] `vila_caption` +- [ ] Example with two video entries + +#### Preprocessing Modes + +**Full Video Mode (`--mode video`)**: +- [ ] What it is: Converts each source video into single `.meta` preserving full temporal sequence +- [ ] When to use: Fine-tuning text-to-video models where motion/temporal consistency matter +- [ ] Status: Recommended default for most training runs +- [ ] Command example with all flags +- [ ] Output: Creates one `.meta` file per video + +**Extract Frames Mode (`--mode frames`)**: +- [ ] What it is: Uniformly samples N frames, writes each as one-frame `.meta` sample +- [ ] When to use: Image/frame-level training, quick smoke tests, ablations +- [ ] Command example with `--num-frames` flag +- [ ] Output: Creates one `.meta` file per frame + +#### Preprocessing Key Arguments +- [ ] `--mode`: `video` or `frames` explanation +- [ ] `--num-frames`: Number of frames to extract (frames mode only) +- [ ] `--height/--width`: Target resolution +- [ ] `--center-crop`: Crop to exact size after aspect-preserving resize + +#### Preprocessing Output +- [ ] Encoded video latents (normalized) +- [ ] Text embeddings (from UMT5) +- [ ] First frame as JPEG (video mode only) +- [ ] Metadata + +### Training + +#### Single-Node Training +- [ ] Command: `uv run --group automodel --with . 
torchrun --nproc-per-node=8 ...` +- [ ] Config file: `examples/automodel/finetune/wan2_1_t2v_flow.yaml` +- [ ] Note about `UV_PROJECT_ENVIRONMENT` export + +#### Multi-Node SLURM Training +- [ ] Complete SLURM script with all SBATCH directives +- [ ] MASTER_ADDR setup from SLURM_JOB_NODELIST +- [ ] MASTER_PORT setup +- [ ] Per-rank UV cache setup to avoid conflicts +- [ ] UV_CACHE_DIR per job/rank +- [ ] torchrun command with multi-node flags +- [ ] Config file: `wan2_1_t2v_flow_multinode.yaml` + +### Validation + +#### Validation Script Details +- [ ] Purpose: Quick qualitative check of trained checkpoint +- [ ] Reads prompts from `.meta` files in `--meta_folder` +- [ ] Uses `metadata.vila_caption` (latents ignored) +- [ ] Loads `WanPipeline` +- [ ] Checkpoint loading priority: `ema_shadow.pt` → `consolidated_model.bin` → sharded FSDP `model/*.distcp` +- [ ] Generation settings: `--guidance_scale`, `--num_inference_steps`, `--height/--width`, `--num_frames`, `--fps`, `--seed` +- [ ] Output: Writes videos to `--output_dir` +- [ ] Note: Qualitative comparison only, no quantitative metrics +- [ ] Command example +- [ ] Note: `--checkpoint ./checkpoints/LATEST` automatically uses most recent checkpoint + +### Configuration + +#### Fine-tuning Config (`wan2_1_t2v_flow.yaml`) +- [ ] Complete YAML config with all sections: + - [ ] `model.pretrained_model_name_or_path` + - [ ] `step_scheduler` (global_batch_size, local_batch_size, num_epochs, ckpt_every_steps) + - [ ] `data.dataloader` (meta_folder, num_workers) + - [ ] `optim.learning_rate` + - [ ] `flow_matching` (timestep_sampling, flow_shift) + - [ ] `fsdp.dp_size` + - [ ] `checkpoint` (enabled, checkpoint_dir) +- [ ] Note about canonical files in repository + +#### Multi-Node Config Differences +- [ ] `fsdp.dp_size`: Total data-parallel replicas (2 nodes × 8 GPUs = 16) +- [ ] `fsdp.dp_replicate_size`: Number of replicated groups across nodes (2) + +#### Pretraining vs Fine-tuning Comparison Table +- [ ] `learning_rate`: Fine-tuning (5e-6) vs Pretraining (5e-5) +- [ ] `weight_decay`: Fine-tuning (0.01) vs Pretraining (0.1) +- [ ] `flow_shift`: Fine-tuning (3.0) vs Pretraining (2.5) +- [ ] `logit_std`: Fine-tuning (1.0) vs Pretraining (1.5) +- [ ] Dataset size: Fine-tuning (100s-1000s) vs Pretraining (10K+) + +### Hardware Requirements Table +- [ ] GPU: Minimum (A100 40GB) vs Recommended (A100 80GB / H100) +- [ ] GPUs: Minimum (4) vs Recommended (8+) +- [ ] RAM: Minimum (128 GB) vs Recommended (256 GB+) +- [ ] Storage: Minimum (500 GB SSD) vs Recommended (2 TB NVMe) + +### Features List +- [ ] Flow Matching: Pure flow matching training +- [ ] Distributed: FSDP2 + Tensor Parallelism +- [ ] Mixed Precision: BF16 by default +- [ ] WandB: Automatic logging +- [ ] Checkpointing: consolidated and sharded formats +- [ ] Multi-node: SLURM and torchrun support + +### Supported Models Table +- [ ] WAN 2.1 T2V 1.3B: 1.3B params, FSDP2 via Automodel + DDP, Status ✅ +- [ ] WAN 2.1 T2V 14B: 14B params, FSDP2 via Automodel + DDP, Status ✅ +- [ ] FLUX: TBD params, TBD parallelization, Status 🔄 In Progress + +### Advanced Topics + +#### Custom Parallelization +- [ ] Example YAML: `fsdp.tp_size: 2`, `fsdp.dp_size: 4` + +#### Checkpoint Cleanup +- [ ] Python function: `cleanup_old_checkpoints(checkpoint_dir, keep_last_n=3)` +- [ ] Complete code example with Path and shutil usage + +--- + +## 4. 
DiT Model Information (`megatron/models/dit/README.md`) + +**Target Location**: Integrate into `docs/get-started/megatron.md` (TUTORIAL with progressive disclosure) + +### Overview +- [ ] DiT description: Open-source implementation of Diffusion Transformers +- [ ] Purpose: Training text-to-image/video models with EDM Pipeline +- [ ] Based on: Megatron-Core and Megatron-Bridge +- [ ] Parallelism support: Tensor, sequence, and context parallelism + +### Dataset Preparation + +#### Energon Data Loader +- [ ] Uses NVIDIA's Megatron-Energon +- [ ] WebDataset-compatible format (sharded `.tar` archives) +- [ ] Supports: Large-scale distributed loading, sharding, sampling for multi-modal pairs +- [ ] Set `dataset.path` to WebDataset location or shard pattern + +#### Butterfly Dataset Example +- [ ] Dataset: `huggan/smithsonian_butterflies_subset` on Hugging Face +- [ ] Script: `prepare_energon_dataset_butterfly.py` +- [ ] Command with `--nproc-per-node` +- [ ] Optional arguments: `--t5_cache_dir`, `--tokenizer_cache_dir` + +#### Energon Prepare Workflow +- [ ] Command: `energon prepare $dataset_path` +- [ ] Interactive prompts explanation: + - [ ] Train/val/test split entry (e.g., "1,0,0") + - [ ] Sample type selection: "Crude sample (plain dict for cooking)" (option 11) +- [ ] Sample structure: keys include `json`, `pickle`, `pth` +- [ ] Sample JSON content example (`image_height`, `image_width`) +- [ ] Note: CrudeWebdataset doesn't need field map +- [ ] Note: Need to provide `Cooker` in `TaskEncoder` +- [ ] Note: Can add `subflavors` in meta dataset specification + +### Container Build +- [ ] Reference to container section in main README + +### Pretraining + +#### Sequence Packing +- [ ] Purpose: Maximize training efficiency +- [ ] How it works: Stacks multiple samples into single sequence instead of padding +- [ ] Requirement: `micro_batch_size` must be set to 1 +- [ ] Requirement: `qkv_format` should be set to `thd` (signals Transformer Engine) +- [ ] Link to NeMo sequence packing documentation + +#### Sequence Packing Parameters +- [ ] `task_encoder_seq_length`: Controls maximum sequence length passed to model +- [ ] `packing_buffer_size`: Determines number of samples processed to create buckets +- [ ] Reference to `select_samples_to_pack` and `pack_selected_samples` methods +- [ ] Link to DiffusionTaskEncoderWithSequencePacking code +- [ ] Link to Energon packing documentation + +#### Parallelism +- [ ] Multiple parallelism techniques supported (tensor, sequence, context) +- [ ] Configurable based on computational requirements + +#### Model Architecture Customization +- [ ] Parameters: `num_layers`, `num_attention_heads` +- [ ] Link to Megatron-Bridge documentation for comprehensive options + +#### WandB Notes +- [ ] If using `wandb_project` and `wandb_exp_name`, export `WANDB_API_KEY` + +#### Validation Details +- [ ] Model generates one sample per GPU at start of each validation round +- [ ] Samples saved to `validation_generation` folder within `checkpoint_dir` +- [ ] Logged to WandB if `WANDB_API_KEY` configured +- [ ] Requires access to video tokenizer used during dataset preparation +- [ ] Specify VAE artifacts location using `vae_cache_folder` argument +- [ ] Otherwise downloaded in first validation round + +#### Pretraining Script Example +- [ ] Copy config file: `cp examples/megatron/recipes/dit/conf/dit_pretrain_example.yaml ...` +- [ ] Edit instructions for `my_config.yaml`: + - [ ] `model.vae_cache_folder`: Path to VAE cache folder + - [ ] `dataset.path`: Path to dataset folder + 
- [ ] `checkpoint.save` and `checkpoint.load`: Path to checkpoint folder + - [ ] `train.global_batch_size`: Set to be divisible by NUM_GPUs + - [ ] `logger.wandb_exp_name`: Your experiment name +- [ ] Run command with `--config-file` +- [ ] CLI override example: `train.train_iters=20000`, `model.num_layers=32` + +#### Training Split Note +- [ ] If 100% data to training, pass `dataset.use_train_split_for_val=true` +- [ ] Uses subset of training data for validation +- [ ] Command example with this flag + +#### Mock Dataset +- [ ] Use `--mock` flag for performance measurement without dataset +- [ ] Command example with `--mock` flag + +### Inference + +#### Inference Script +- [ ] Script: `inference_dit_model.py` +- [ ] Requires: Trained checkpoint (`--checkpoint_path`), save path (`--video_save_path`) +- [ ] Optional: `--t5_cache_dir`, `--tokenizer_cache_dir` (avoid re-downloading) +- [ ] Command example with all parameters: + - [ ] `--t5_cache_dir` + - [ ] `--tokenizer_cache_dir` + - [ ] `--tokenizer_model Cosmos-0.1-Tokenizer-CV4x8x8` + - [ ] `--checkpoint_path` + - [ ] `--num_video_frames 10` + - [ ] `--height 240` + - [ ] `--width 416` + - [ ] `--video_save_path` + - [ ] `--prompt` + +### Parallelism Support Table +- [ ] DiT-S (330M): Data Parallel (TBD), Tensor Parallel (TBD), Sequence Parallel (TBD), Context Parallel (TBD) +- [ ] DiT-L (450M): Data Parallel (TBD), Tensor Parallel (TBD), Sequence Parallel (TBD), Context Parallel (TBD) +- [ ] DiT-XL (700M): Data Parallel (✅), Tensor Parallel (✅), Sequence Parallel (✅), Context Parallel (✅) + +--- + +## 5. WAN Recipe Information (`megatron/recipes/wan/wan2.1.md`) + +**Target Location**: `docs/get-started/megatron-wan.md` OR integrate into `docs/get-started/megatron.md` with tabs (TUTORIAL/HOW-TO) + +### Overview +- [ ] WAN 2.1 description: Open-source implementation of large-scale text-to-video/image generative models +- [ ] Built on: Megatron-Core and Megatron-Bridge +- [ ] Supports: Advanced parallelism strategies (data, tensor, sequence, context parallelism) +- [ ] Optimized kernels: Transformer Engine fused attention + +### Dataset Preparation + +#### Energon Data Loader +- [ ] Uses NVIDIA's Megatron-Energon +- [ ] WebDataset-compatible format (sharded `.tar` archives) +- [ ] Supports: Large-scale distributed loading, sharding, sampling for video-text and image-text pairs +- [ ] Set `dataset.path` to WebDataset directory or shard pattern +- [ ] Link to Megatron-Energon docs for format details, subflavors, advanced options + +#### Mock Dataset Note +- [ ] If no dataset: See "Quick Start with Mock Dataset" section + +#### WAN Dataset Preparation Example +- [ ] Input: Directory with raw `.mp4` videos and `.json` metadata files with captions +- [ ] Output: WAN-ready WebDataset shards +- [ ] Step 1: Define input/output folders (`DATASET_SRC`, `DATASET_PATH`) +- [ ] Step 2: Optional HF_TOKEN export if auth required +- [ ] Step 3: Create WAN shards with latents + text embeddings + - [ ] Script: `prepare_energon_dataset_wan.py` + - [ ] Uses WAN's VAE encoder and T5 encoder + - [ ] Extracts videos' latents and caption embeddings offline + - [ ] Arguments: `--height/--width` control resize target (832x480 supported for 1.3B and 14B) + - [ ] `--center-crop`: Run center crop to exact target size after resize + - [ ] Command example with all flags +- [ ] Step 4: Use Energon to process shards + - [ ] Command: `energon prepare "${DATASET_PATH}"` + - [ ] Interactive prompts: Enter train/val/test split (e.g., "8,1,1") + - [ ] Sample type: Choose 
"Crude sample (plain dict for cooking)" + +#### What Gets Produced +- [ ] Each shard contains: + - [ ] `pth`: WAN video latents + - [ ] `pickle`: Text embeddings + - [ ] `json`: Useful side-info (text caption, sizes, processing choices) +- [ ] Energon writes `.nv-meta` directory with dataset info +- [ ] Energon writes `dataset.yaml` (can version/control) + +#### Training Config Setup +- [ ] Point WAN config to processed data: `dataset.path=${DATASET_PATH}` + +### Container Build +- [ ] Reference to DFM container guide in main README + +### Pretraining + +#### Sequence Packing for WAN +- [ ] Purpose: Maximize throughput +- [ ] Problem: Naive batching/padding requires significant padded tokens for videos +- [ ] Solution: Sequence packing stacks multiple samples (different resolutions) into single sequence +- [ ] Benefit: No computation wasted on padded tokens +- [ ] Requirements: + - [ ] Set `train.micro_batch_size=1` and `dataset.micro_batch_size=1` + - [ ] Ensure `model.qkv_format=thd` (required with context parallelism, recommended with sequence packing) + +#### Parallelism +- [ ] Multiple parallelism techniques supported (tensor, sequence, context parallelism) +- [ ] Configurable per hardware + +#### Training Script +- [ ] Script: `examples/megatron/recipes/wan/pretrain_wan.py` +- [ ] Supports: YAML config file and CLI overrides + +#### Training Mode Presets +- [ ] `--training-mode` with `pretrain` and `finetune` presets +- [ ] Purpose: Flow-matching hyperparameters as starting point +- [ ] **Pretraining preset**: + - [ ] Uses noisier, biased sampling + - [ ] Examples: logit-normal, higher logit_std, lower flow_shift + - [ ] Purpose: Stability and broad learning +- [ ] **Finetuning preset**: + - [ ] Uses uniform, lower-noise settings + - [ ] Examples: uniform sampling, lower logit_std, higher flow_shift + - [ ] Purpose: Refine details and improve quality + +#### WandB Notes +- [ ] If using `logger.wandb_project` and `logger.wandb_exp_name`, export `WANDB_API_KEY` + +#### Pretraining Script Example +- [ ] Example configs: `wan_1_3B.yaml` and `wan_14B.yaml` under `examples/megatron/recipes/wan/conf` +- [ ] Copy and edit instructions: + - [ ] `dataset.path`: Path to WebDataset directory + - [ ] `train.global_batch_size/micro_batch_size`: Keep micro_batch_size=1 + - [ ] `model.tensor_model_parallel_size` / `model.context_parallel_size`: Based on GPUs + - [ ] `checkpoint.save` and `checkpoint.load`: Checkpoint directory +- [ ] Run command with `--training-mode pretrain` and `--config-file` +- [ ] CLI override example with all parameters: + - [ ] `dataset.path` + - [ ] `train.global_batch_size` + - [ ] `train.micro_batch_size` + - [ ] `model.tensor_model_parallel_size` + - [ ] `model.context_parallel_size` + - [ ] `checkpoint.save` + - [ ] `checkpoint.load` +- [ ] Link to Megatron-Bridge docs for argument details + +#### Mock Dataset +- [ ] Use `--mock` flag for debugging or performance measurement +- [ ] Command example with `--mock` flag +- [ ] Note: Can adjust mock shapes (`F_latents`, `H_latents`, `W_latents`) and packing behavior (`number_packed_samples`) in `WanMockDataModuleConfig` +- [ ] Reference: See `dfm/src/megatron/recipes/wan/wan.py` + +### Inference + +#### Inference Script +- [ ] Script: `examples/megatron/recipes/wan/inference_wan.py` +- [ ] `--checkpoint_step`: Use specific checkpoint for inference +- [ ] `--sizes`: Specify video shape (height, width) +- [ ] `--frame_nums`: Specify number of frames +- [ ] `--sample_steps`: Number of noise diffusion steps (default: 50) +- [ ] 
Command example with all parameters: + - [ ] `--task t2v-1.3B` + - [ ] `--frame_nums 81` + - [ ] `--sizes 480*832` + - [ ] `--checkpoint_dir` + - [ ] `--checkpoint_step 10000` + - [ ] `--prompts` (example prompt) + - [ ] `--sample_steps 50` +- [ ] **Note**: Current inference path is single-GPU. Parallel inference not yet supported. + +### Parallelism Support Table +- [ ] 1.3B model: Data Parallel (✅), Tensor Parallel (✅), Sequence Parallel (✅), Context Parallel (✅), FSDP (Coming Soon) +- [ ] 14B model: Data Parallel (✅), Tensor Parallel (✅), Sequence Parallel (✅), Context Parallel (✅), FSDP (Coming Soon) + +### References +- [ ] WAN Team citation: (2025). Wan: Open and advanced large-scale video generative models (Wan 2.1). GitHub. https://github.com/Wan-Video/Wan2.1/ + +--- + +## Verification Summary + +**Total Information Items**: ~200+ discrete pieces + +**Checklist Status**: +- [ ] All items from `performance-summary.md` captured +- [ ] All items from `mcore_automodel_comparision_wan21.md` captured +- [ ] All items from `automodel_training_doc.md` captured +- [ ] All items from `megatron/models/dit/README.md` captured +- [ ] All items from `megatron/recipes/wan/wan2.1.md` captured + +**Integration Verification**: +- [ ] Each item checked off as integrated +- [ ] Location documented (which file/section) +- [ ] Progressive disclosure applied (Layer 1/2/3/4) +- [ ] Links and references verified +- [ ] Images copied and paths updated + +--- + +## Notes + +- **Information can be integrated anywhere logical** - doesn't need to match old file structure +- **Progressive disclosure**: Layer 3/4 items can be in dropdowns/tabs/separate pages +- **Cross-references**: Related information can be linked rather than duplicated +- **Verification**: Check off items as you integrate them, note location + diff --git a/docs/MIGRATION_PLAN.md b/docs/MIGRATION_PLAN.md new file mode 100644 index 00000000..f2ff108e --- /dev/null +++ b/docs/MIGRATION_PLAN.md @@ -0,0 +1,758 @@ +# Documentation Migration Plan: Preserving All Information + +**Goal**: Capture all information from old docs in the new information architecture, organized logically using Diataxis, progressive disclosure, and MyST directives. + +**Status**: Draft Plan +**Date**: 2025-01-XX + +**Key Principle**: Preserve **information**, not file structure. Content can be merged, split, or reorganized as long as all information is captured in a well-organized manner. + +--- + +## Overview + +This plan ensures: +- ✅ **Zero information loss**: All content from old docs preserved somewhere logical +- ✅ **Mature information architecture**: Content organized by purpose and user need +- ✅ **Diataxis alignment**: Content organized by type (Tutorial, How-To, Explanation, Reference) +- ✅ **Progressive disclosure**: Advanced details in dropdowns/tabs/separate pages +- ✅ **Cognitive load reduction**: Scannable structure with clear navigation + +--- + +## Information Inventory (Not File Inventory) + +### Information Currently Missing from New Structure + +1. **Performance Benchmarks** + - **Source**: `performance-summary.md` + - **Information**: Nomenclature, metrics, benchmark tables (DGX-GB200, GB300, H100) + - **Best Location**: `docs/reference/performance.md` (REFERENCE type) + - **Status**: Missing entirely + +2. 
**Paradigm Comparison Analysis** + - **Source**: `mcore_automodel_comparision_wan21.md` + - **Information**: Experimental comparison, training curves, caveats + - **Best Location**: `docs/about/comparison.md` OR integrate into `docs/about/concepts/training-paradigms.md` + - **Status**: Missing entirely + +### Information in Orphaned Files (Needs Integration) + +1. **Detailed Automodel Training Information** + - **Source**: `automodel_training_doc.md` + - **Information**: Preprocessing modes, validation, hardware reqs, advanced config + - **Best Location**: Integrate into `get-started/automodel.md` (progressive disclosure) + - **Status**: Exists but not integrated + +2. **DiT-Specific Training Details** + - **Source**: `megatron/models/dit/README.md` + - **Information**: Sequence packing details, Energon format, validation + - **Best Location**: Integrate into `get-started/megatron.md` (progressive disclosure) + - **Status**: Exists but not integrated + +3. **WAN-Specific Training Information** + - **Source**: `megatron/recipes/wan/wan2.1.md` + - **Information**: WAN dataset prep, training modes, WAN-specific workflows + - **Best Location**: Either: + - Option A: `get-started/megatron-wan.md` (separate guide) + - Option B: Enhance `get-started/megatron.md` with WAN section (tabs) + - **Status**: Exists but not integrated + +--- + +## Information Mapping Strategy + +**Approach**: Map information to logical locations in new IA, not files to files. + +### Information Organization Principles + +1. **User Intent First**: Where would users look for this information? +2. **Diataxis Alignment**: What type of content is this? (Tutorial/How-To/Explanation/Reference) +3. **Progressive Disclosure**: What layer does this belong to? (Core/Advanced/Reference) +4. **Logical Grouping**: Related information should be together + +## Migration Strategy by Information Type + +### 1. Performance Summary (`performance-summary.md` → `docs/reference/performance.md`) + +**Diataxis Type**: REFERENCE +**Progressive Disclosure**: Use tabs for different systems, dropdowns for detailed metrics + +**Structure**: +```markdown +# Performance Benchmarks + +## Overview +[Layer 1: 30-second overview] + +## Nomenclature +[Layer 2: Core definitions - use dropdowns for detailed explanations] + +## Performance Metrics +[Layer 2: Core metrics explanation] + +## Benchmark Results +[Layer 2: Main results - use tabs for different systems] + +:::: {tab-set} +::: {tab-item} DGX-GB200 +[Results table] +::: +::: {tab-item} DGX-GB300 +[Results table] +::: +::: {tab-item} DGX-H100 +[Results table] +::: +:::: + +## Detailed Configurations +[Layer 3: Advanced details in dropdowns] +``` + +**Content to Preserve**: +- ✅ All nomenclature definitions (GBS, MBS, FSDP, TP, SP, PP, CP, VP, EP) +- ✅ Performance metrics explanation (Tokens/sec/GPU, Model TFLOP/sec/GPU) +- ✅ All benchmark tables (DGX-GB200, DGX-GB300, DGX-H100) +- ✅ Both Megatron-Core and NeMo Automodel results +- ✅ All model configurations + +**Progressive Disclosure**: +- **Layer 1**: Overview + summary table +- **Layer 2**: Core metrics + main results (tabs for systems) +- **Layer 3**: Detailed configurations (dropdowns) +- **Layer 4**: Raw data tables (if needed, separate page) + +--- + +### 2. 
Comparison Document (`mcore_automodel_comparision_wan21.md` → `docs/about/comparison.md`) + +**Diataxis Type**: EXPLANATION +**Progressive Disclosure**: Use tabs for stages, dropdowns for detailed analysis + +**Structure**: +```markdown +# Automodel vs Megatron Comparison + +## Overview +[Layer 1: What this comparison shows] + +## Experiment Overview +[Layer 2: Core experiment details] + +## Training Stages +[Layer 2: Use tabs for Stage 1 vs Stage 2] + +:::: {tab-set} +::: {tab-item} Stage 1: Text-to-Image +[Dataset, setup, results] +::: +::: {tab-item} Stage 2: Text-to-Video +[Dataset, setup, results] +::: +:::: + +## Results Analysis +[Layer 2: Training curves with images] + +:::{dropdown} Detailed Analysis +[Layer 3: Caveats and technical details] +::: + +## Key Takeaways +[Layer 2: Summary comparison] +``` + +**Content to Preserve**: +- ✅ Complete experiment overview +- ✅ Both training stages (Text→Image, Text→Video) +- ✅ Dataset details (3K videos, 120K images) +- ✅ Training setup comparison tables +- ✅ Training curve images (both stages) +- ✅ Important caveat about Megatron-Core timestep handling +- ✅ All parallelism configurations + +**Progressive Disclosure**: +- **Layer 1**: Overview + key findings +- **Layer 2**: Main comparison (tabs for stages) +- **Layer 3**: Detailed analysis (dropdowns) +- **Layer 4**: Full technical details (if needed) + +**Integration**: Also enhance `docs/about/concepts/training-paradigms.md` with link to this comparison. + +--- + +### 3. Automodel Training Doc (`automodel_training_doc.md` → Enhance `get-started/automodel.md`) + +**Diataxis Type**: TUTORIAL (enhanced) +**Progressive Disclosure**: Add missing details as dropdowns and expandable sections + +**Missing Content to Add**: + +#### A. Preprocessing Details (Add to Step 1) +```markdown +### 1. Prepare Your Dataset + +[Current content...] + +:::{dropdown} Detailed Preprocessing Modes +[Layer 3: Full explanation of video vs frames mode] + +**Full Video Mode** (`--mode video`): +- What it is: [detailed explanation] +- When to use: [use cases] +- Output: [what gets created] + +**Extract Frames Mode** (`--mode frames`): +- What it is: [detailed explanation] +- When to use: [use cases] +- Output: [what gets created] +::: + +:::{dropdown} meta.json Format Specification +[Layer 3: Complete schema] + +```json +[Full JSON schema with all fields] +``` +::: +``` + +#### B. Multi-Node Setup (Add to Step 3) +```markdown +### 3. Run Training + +[Current single-node content...] + +:::{dropdown} Multi-Node with SLURM +[Layer 3: Advanced setup] + +[Complete SLURM script from old docs] +::: +``` + +#### C. Validation (Add new section) +```markdown +### 4. Validate Training + +[New section with validation script details] + +:::{dropdown} Validation Script Details +[Layer 3: Advanced validation options] + +[Complete validation documentation] +::: +``` + +#### D. Hardware Requirements (Add as dropdown) +```markdown +:::{dropdown} Hardware Requirements +[Layer 3: System requirements] + +| Component | Minimum | Recommended | +|-----------|---------|-------------| +[Full table from old docs] +::: +``` + +#### E. 
Advanced Configuration (Add as new section) +```markdown +## Advanced Topics + +:::{dropdown} Pretraining vs Fine-tuning +[Layer 3: Comparison table] + +[Full comparison table] +::: + +:::{dropdown} Custom Parallelization +[Layer 3: Advanced parallelism] + +[Custom parallelization examples] +::: + +:::{dropdown} Checkpoint Management +[Layer 3: Advanced checkpointing] + +[Checkpoint cleanup code] +::: +``` + +**Content to Preserve**: +- ✅ All preprocessing mode details +- ✅ Complete `meta.json` schema +- ✅ Multi-node SLURM setup +- ✅ Validation script documentation +- ✅ Hardware requirements table +- ✅ Pretraining vs fine-tuning comparison +- ✅ Advanced parallelization examples +- ✅ Checkpoint cleanup utilities +- ✅ Supported models table + +**Progressive Disclosure**: +- **Layer 1**: Core tutorial steps (current) +- **Layer 2**: Essential details (expand current sections) +- **Layer 3**: Advanced topics (dropdowns) +- **Layer 4**: Complete reference (link to detailed guide) + +**Integration Strategy**: +- Keep current tutorial structure (Layer 1-2) +- Add missing information as progressive disclosure elements (Layer 3) +- **No need to preserve `automodel_training_doc.md` as separate file** - all information integrated + +--- + +### 4. DiT Model Guide (`megatron/models/dit/README.md` → Enhance `get-started/megatron.md`) + +**Diataxis Type**: TUTORIAL (enhanced) +**Progressive Disclosure**: Add DiT-specific details as expandable sections + +**Missing Content to Add**: + +#### A. Sequence Packing Details (Enhance existing section) +```markdown +### Sequence Packing + +[Current brief mention...] + +:::{dropdown} Understanding Sequence Packing +[Layer 3: Detailed explanation] + +[Complete sequence packing explanation from old docs] +- Why use it +- How it works +- Configuration requirements +- Performance impact +::: + +:::{dropdown} Sequence Packing Parameters +[Layer 3: Advanced configuration] + +**Key Parameters**: +- `task_encoder_seq_length`: [explanation] +- `packing_buffer_size`: [explanation] +- `qkv_format=thd`: [why required] +::: +``` + +#### B. Validation Details (Add new section) +```markdown +### Monitor Training + +[Current content...] + +:::{dropdown} Validation and Sample Generation +[Layer 3: Advanced monitoring] + +[Complete validation details from old docs] +- How validation works +- Sample generation +- WandB integration +- VAE cache requirements +::: +``` + +#### C. Energon Dataset Details (Enhance existing section) +```markdown +### Prepare Dataset + +[Current butterfly example...] 
+ +:::{dropdown} Understanding Energon Format +[Layer 3: Advanced data format] + +[Complete Energon explanation] +- WebDataset format +- Sample structure +- Energon prepare command details +::: +``` + +**Content to Preserve**: +- ✅ Complete sequence packing explanation +- ✅ Sequence packing parameters (`task_encoder_seq_length`, `packing_buffer_size`) +- ✅ Validation details (sample generation, WandB) +- ✅ VAE cache folder requirements +- ✅ Energon dataset format details +- ✅ Complete Energon prepare workflow +- ✅ All configuration examples + +**Progressive Disclosure**: +- **Layer 1**: Core tutorial (current) +- **Layer 2**: Essential DiT details (expand current) +- **Layer 3**: Advanced topics (dropdowns) +- **Layer 4**: Complete reference (link to `dit/README.md`) + +**Integration Strategy**: +- Enhance existing Megatron tutorial with DiT-specific details +- Use dropdowns for advanced topics +- **No need to preserve `dit/README.md` as separate file** - all information integrated + +--- + +### 5. WAN Recipe Guide (`megatron/recipes/wan/wan2.1.md` → New page or enhance tutorial) + +**Diataxis Type**: HOW-TO +**Progressive Disclosure**: Use tabs for different workflows, dropdowns for details + +**Decision**: Create separate WAN guide page OR enhance Megatron tutorial with WAN section + +**Option A: Separate WAN Guide Page** (Recommended) +``` +docs/get-started/megatron-wan.md +``` + +**Option B: Enhance Megatron Tutorial** (Alternative) +Add WAN section with tabs: `:::: {tab-set}` for DiT vs WAN + +**Recommended Structure** (Option A): +```markdown +# Megatron WAN Workflow + +## Overview +[Layer 1: What WAN is, when to use it] + +## Choose Your Model +[Layer 2: DiT vs WAN decision] + +:::: {tab-set} +::: {tab-item} DiT Model +:link: megatron +[Link to DiT tutorial] +::: +::: {tab-item} WAN Model +[WAN-specific content] +::: +:::: + +## Prepare WAN Dataset +[Layer 2: WAN-specific dataset prep] + +:::{dropdown} Understanding WAN Data Format +[Layer 3: Detailed format explanation] +::: + +## Train WAN Model +[Layer 2: WAN training] + +:::{dropdown} Training Mode Presets +[Layer 3: pretrain vs finetune modes] + +[Complete explanation of presets] +::: + +:::{dropdown} Sequence Packing for WAN +[Layer 3: WAN-specific packing] + +[WAN sequence packing details] +::: + +## Generate Videos +[Layer 2: WAN inference] + +## Parallelism Support +[Layer 2: WAN parallelism table] +``` + +**Content to Preserve**: +- ✅ Complete WAN overview +- ✅ WAN dataset preparation (Energon workflow) +- ✅ Training mode presets (pretrain vs finetune) +- ✅ Sequence packing for WAN +- ✅ WAN inference details +- ✅ Parallelism support table +- ✅ All configuration examples +- ✅ Mock dataset configuration + +**Progressive Disclosure**: +- **Layer 1**: Overview + quick start +- **Layer 2**: Core workflow steps +- **Layer 3**: Advanced topics (dropdowns) +- **Layer 4**: Complete reference (link to `wan2.1.md`) + +**Integration Strategy**: +- **Decision**: Choose Option A (separate page) OR Option B (tabs in existing tutorial) +- If Option A: Create `docs/get-started/megatron-wan.md` and integrate all WAN information +- If Option B: Add WAN section to `docs/get-started/megatron.md` using tabs +- **No need to preserve `wan2.1.md` as separate file** - all information integrated into chosen location + +--- + +## Navigation Updates + +### Update `docs/get-started/index.md` + +Add WAN option: +```markdown +:::: {grid} 1 2 2 2 +:::{grid-item-card} 2a. Automodel Tutorial +[Current content] +::: +:::{grid-item-card} 2b. 
Megatron DiT Tutorial +[Current content] +::: +:::{grid-item-card} 2c. Megatron WAN Tutorial +:link: megatron-wan +:link-type: doc +Train WAN models with Megatron for video generation. ++++ +{bdg-secondary}`wan` {bdg-secondary}`megatron` +::: +:::: +``` + +### Update `docs/about/concepts/training-paradigms.md` + +Add comparison link: +```markdown +## Learn More + +- [Automodel vs Megatron Comparison](comparison.md) - Detailed experimental comparison +- [Performance Benchmarks](../reference/performance.md) - Training performance metrics +``` + +### Update `docs/reference/index.md` + +Add performance link: +```markdown +## Performance and Benchmarks + +:::{grid-item-card} Performance Benchmarks +:link: performance +:link-type: doc +Training throughput and performance metrics across GPU systems. ++++ +{bdg-secondary}`benchmarks` {bdg-secondary}`performance` +::: +``` + +--- + +## Implementation Checklist + +### Phase 1: Create Missing Files + +- [ ] **Create `docs/reference/performance.md`** + - [ ] Migrate nomenclature section + - [ ] Migrate performance metrics explanation + - [ ] Migrate all benchmark tables (use tabs for systems) + - [ ] Add progressive disclosure (dropdowns for details) + - [ ] Add frontmatter with proper metadata + - [ ] Link from reference index + +- [ ] **Create `docs/about/comparison.md`** + - [ ] Migrate experiment overview + - [ ] Migrate training stages (use tabs) + - [ ] Migrate training curves (include images) + - [ ] Migrate caveats and analysis + - [ ] Add progressive disclosure + - [ ] Add frontmatter with proper metadata + - [ ] Link from training-paradigms page + +### Phase 2: Integrate Information into Existing Tutorials + +- [ ] **Enhance `docs/get-started/automodel.md`** + - [ ] Integrate preprocessing details (dropdown) + - [ ] Integrate `meta.json` schema (dropdown) + - [ ] Integrate multi-node SLURM setup (dropdown) + - [ ] Integrate validation section + - [ ] Integrate hardware requirements (dropdown) + - [ ] Integrate advanced topics section (dropdowns) + - [ ] **Archive or remove `automodel_training_doc.md`** (information now integrated) + +- [ ] **Enhance `docs/get-started/megatron.md`** + - [ ] Integrate sequence packing details (dropdown) + - [ ] Integrate validation details (dropdown) + - [ ] Integrate Energon format details (dropdown) + - [ ] **Archive or remove `megatron/models/dit/README.md`** (information now integrated) + +### Phase 3: Integrate WAN Information + +- [ ] **Decide**: Separate WAN guide OR tabs in Megatron tutorial +- [ ] **If separate guide**: Create `docs/get-started/megatron-wan.md` + - [ ] Integrate all WAN information + - [ ] Add progressive disclosure + - [ ] **Archive or remove `megatron/recipes/wan/wan2.1.md`** (information now integrated) +- [ ] **If tabs**: Enhance `docs/get-started/megatron.md` + - [ ] Add WAN section with tabs (DiT vs WAN) + - [ ] Integrate all WAN information + - [ ] **Archive or remove `megatron/recipes/wan/wan2.1.md`** (information now integrated) + +### Phase 4: Update Navigation + +- [ ] **Update `docs/get-started/index.md`** + - [ ] Add WAN tutorial card + - [ ] Update comparison table + +- [ ] **Update `docs/about/concepts/training-paradigms.md`** + - [ ] Add comparison link + - [ ] Add performance link + +- [ ] **Update `docs/reference/index.md`** + - [ ] Add performance benchmarks card + +- [ ] **Update `docs/index.md`** (if needed) + - [ ] Ensure all new pages are discoverable + +### Phase 5: Verify Content Preservation + +- [ ] **Content Audit** + - [ ] Verify all nomenclature preserved 
+ - [ ] Verify all tables preserved + - [ ] Verify all code examples preserved + - [ ] Verify all images preserved + - [ ] Verify all configuration examples preserved + - [ ] Verify all troubleshooting content preserved + +- [ ] **Link Verification** + - [ ] All internal links work + - [ ] All reference targets exist + - [ ] All images load correctly + - [ ] All code examples render + +- [ ] **Progressive Disclosure Check** + - [ ] Layer 1 content scannable in 30 seconds + - [ ] Layer 2 content accessible without scrolling + - [ ] Layer 3 content in dropdowns/tabs + - [ ] Layer 4 content linked appropriately + +--- + +## Progressive Disclosure Patterns + +### Pattern 1: Advanced Details → Dropdown +```markdown +## Core Concept + +[Layer 2: Essential explanation] + +:::{dropdown} Advanced: Detailed Analysis +[Layer 3: Full technical details] +::: +``` + +### Pattern 2: Alternative Options → Tabs +```markdown +## Choose Your Approach + +:::: {tab-set} +::: {tab-item} Option A +[Content for option A] +::: +::: {tab-item} Option B +[Content for option B] +::: +:::: +``` + +### Pattern 3: Reference Material → Separate Page + Link +```markdown +## Core Tutorial + +[Layer 1-2: Essential steps] + +## Complete Reference + +For complete configuration options and advanced topics, see: +[Complete Reference Guide](reference-guide.md) +``` + +### Pattern 4: Comparison Tables → Collapsible +```markdown +## Quick Comparison + +[Layer 2: Summary table] + +:::{dropdown} Detailed Comparison +[Layer 3: Full comparison with all details] +::: +``` + +--- + +## Information Mapping to New IA + +| Information Source | Information Type | New Location | Diataxis Type | Integration Method | +|-------------------|------------------|--------------|---------------|-------------------| +| `performance-summary.md` | Performance benchmarks | `docs/reference/performance.md` | REFERENCE | New page (all info) | +| `mcore_automodel_comparision_wan21.md` | Paradigm comparison | `docs/about/comparison.md` OR `docs/about/concepts/training-paradigms.md` | EXPLANATION | New page OR integrate | +| `automodel_training_doc.md` | Detailed training info | `docs/get-started/automodel.md` | TUTORIAL | Integrate (progressive disclosure) | +| `megatron/models/dit/README.md` | DiT-specific details | `docs/get-started/megatron.md` | TUTORIAL | Integrate (progressive disclosure) | +| `megatron/recipes/wan/wan2.1.md` | WAN-specific details | `docs/get-started/megatron-wan.md` OR `docs/get-started/megatron.md` | TUTORIAL/HOW-TO | New page OR integrate with tabs | + +--- + +## Content Fidelity Principles + +1. **Preserve All Technical Details** + - All configuration examples + - All code snippets + - All parameter explanations + - All troubleshooting content + +2. **Preserve All Data** + - All benchmark numbers + - All comparison tables + - All training configurations + - All hardware specifications + +3. **Preserve All Context** + - Experiment methodology + - Caveats and limitations + - Use case guidance + - Best practices + +4. 
**Improve Organization** + - Group related content + - Use progressive disclosure + - Add clear navigation + - Improve scannability + +--- + +## Success Criteria + +✅ **Zero Information Loss** +- All content from old docs present in new structure +- All tables, code examples, images preserved +- All technical details maintained + +✅ **Improved Usability** +- Clear navigation paths +- Progressive disclosure reduces cognitive load +- Scannable structure (30-second test passes) + +✅ **Diataxis Compliance** +- Each page has single clear purpose +- Content type matches user intent +- Cross-links to related types + +✅ **Maintainability** +- Clear file organization +- Consistent structure +- Easy to update +- Single source of truth (new IA) + +--- + +## Next Steps + +1. **Review this plan** with stakeholders +2. **Prioritize phases** (suggest: Phase 1 → 2 → 3 → 4 → 5) +3. **Execute migration** following checklist +4. **Verify information** using audit checklist (verify all info captured, not files) +5. **Test navigation** and user flows +6. **Archive old files** after verification (information is now in new IA) + +--- + +## Notes + +- **Information Preservation**: Focus on preserving information, not file structure +- **File Cleanup**: After integration, old files can be archived or removed (information is captured) +- **Images**: Ensure all images copied to new locations with correct paths +- **Links**: Update all internal links to new structure +- **Frontmatter**: Add consistent frontmatter to all new/modified files +- **Testing**: Build docs locally to verify all MyST directives render correctly +- **Mature IA**: The new structure should be the source of truth; old files are temporary + diff --git a/docs/MIGRATION_SUMMARY.md b/docs/MIGRATION_SUMMARY.md new file mode 100644 index 00000000..5df4492d --- /dev/null +++ b/docs/MIGRATION_SUMMARY.md @@ -0,0 +1,123 @@ +# Migration Plan Summary + +**Quick Reference**: Information mapping strategy - preserve information, not file structure. + +**Key Principle**: Information should be captured in logical locations in the new IA. Files can be merged, split, or reorganized. 
+ +--- + +## Missing Information (Create New Pages) + +| File | Location | Type | Priority | +|------|----------|------|----------| +| Performance Benchmarks | `docs/reference/performance.md` | REFERENCE | High | +| Paradigm Comparison | `docs/about/comparison.md` | EXPLANATION | High | + +--- + +## Information to Integrate (Not Preserve as Separate Files) + +| Source File | Information | Integration Point | Method | +|-------------|------------|-------------------|--------| +| `automodel_training_doc.md` | Detailed training info | `get-started/automodel.md` | Integrate via progressive disclosure | +| `megatron/models/dit/README.md` | DiT-specific details | `get-started/megatron.md` | Integrate via progressive disclosure | +| `megatron/recipes/wan/wan2.1.md` | WAN-specific details | `get-started/megatron-wan.md` OR `get-started/megatron.md` | New page OR tabs | + +--- + +## Content Gaps to Fill + +### Automodel Tutorial (`get-started/automodel.md`) +- [ ] Preprocessing modes (video vs frames) - **Add as dropdown** +- [ ] `meta.json` schema - **Add as dropdown** +- [ ] Multi-node SLURM setup - **Add as dropdown** +- [ ] Validation script details - **Add new section** +- [ ] Hardware requirements - **Add as dropdown** +- [ ] Pretraining vs fine-tuning comparison - **Add as dropdown** +- [ ] Advanced parallelization - **Add as dropdown** +- [ ] Checkpoint cleanup - **Add as dropdown** + +### Megatron Tutorial (`get-started/megatron.md`) +- [ ] Sequence packing details - **Add as dropdown** +- [ ] Validation details - **Add as dropdown** +- [ ] Energon format details - **Add as dropdown** +- [ ] WAN content - **Create separate WAN guide** + +--- + +## Progressive Disclosure Strategy + +### Layer 1 (Always Visible) +- Overview, key concepts, main steps + +### Layer 2 (Scannable) +- Core content, essential details, main workflows + +### Layer 3 (Collapsible) +- Advanced topics → Use `:::{dropdown}` +- Alternative options → Use `:::: {tab-set}` +- Detailed explanations → Use `:::{dropdown}` + +### Layer 4 (Separate Pages) +- Complete reference guides → Link to existing detailed docs + +--- + +## MyST Directives to Use + +```markdown +# Dropdowns (Layer 3 content) +:::{dropdown} Advanced Topic +:icon: info +[Detailed content here] +::: + +# Tabs (Alternative options) +:::: {tab-set} +::: {tab-item} Option A +[Content A] +::: +::: {tab-item} Option B +[Content B] +::: +:::: + +# Cards (Navigation) +::::{grid} 1 2 2 2 +:::{grid-item-card} Title +:link: target +:link-type: ref +Description +::: +:::: +``` + +--- + +## Implementation Order + +1. **Phase 1**: Create missing files (performance, comparison) +2. **Phase 2**: Enhance existing tutorials (add dropdowns/tabs) +3. **Phase 3**: Create WAN guide page +4. **Phase 4**: Update navigation (index pages, links) +5. 
**Phase 5**: Verify (content audit, link check) + +--- + +## Quick Checklist + +- [ ] Performance benchmarks page created (all info from `performance-summary.md`) +- [ ] Comparison page created OR integrated (all info from `mcore_automodel_comparision_wan21.md`) +- [ ] Automodel tutorial enhanced (all info from `automodel_training_doc.md` integrated) +- [ ] Megatron tutorial enhanced (all info from `dit/README.md` integrated) +- [ ] WAN information integrated (all info from `wan2.1.md` integrated) +- [ ] All navigation updated +- [ ] **Information audit**: All information verified (not files - verify content) +- [ ] All links working +- [ ] Progressive disclosure applied correctly +- [ ] Old files archived/removed after verification + +--- + +**Full Plan**: See `MIGRATION_PLAN.md` for detailed implementation guide. + diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 00000000..47595c2d --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,84 @@ +# Makefile for Sphinx documentation + +# Default target shows help +.DEFAULT_GOAL := help + +.PHONY: help docs-html docs-clean docs-live docs-publish ensure-docs-env check-uv + +# Help target +help: ## Show this help message + @echo "" + @echo "📚 Documentation Build System" + @echo "==============================" + @echo "" + @echo "Available targets:" + @echo " make docs-html Build HTML documentation" + @echo " make docs-live Start live-reload server" + @echo " make docs-publish Build for publication (fail on warnings)" + @echo " make docs-clean Clean built documentation" + @echo "" + @echo "Note: Environment is automatically set up on first run." + @echo "" + +# Detect OS for cross-platform compatibility +ifeq ($(OS),Windows_NT) + VENV_PYTHON = ../.venv-docs/Scripts/python.exe + VENV_ACTIVATE = ..\\.venv-docs\\Scripts\\activate + VENV_ACTIVATE_PS = ..\\.venv-docs\\Scripts\\Activate.ps1 + RM_CMD = if exist _build rmdir /s /q _build + ECHO_BLANK = @echo. +else + VENV_PYTHON = ../.venv-docs/bin/python + VENV_ACTIVATE = source ../.venv-docs/bin/activate + RM_CMD = rm -rf _build + ECHO_BLANK = @echo "" +endif + +# Check if uv is installed +check-uv: +ifeq ($(OS),Windows_NT) + @where uv >nul 2>&1 || ( \ + echo. && \ + echo ❌ uv is not installed or not in PATH && \ + echo. && \ + echo Please install uv: https://docs.astral.sh/uv/getting-started/installation/ && \ + exit 1 \ + ) +else + @command -v uv >/dev/null 2>&1 || ( \ + echo ""; \ + echo "❌ uv is not installed or not in PATH"; \ + echo ""; \ + echo "Please install uv: https://docs.astral.sh/uv/getting-started/installation/"; \ + echo ""; \ + exit 1; \ + ) +endif + +# Ensure docs environment exists and is up to date +ensure-docs-env: check-uv + @if [ ! -f "$(VENV_PYTHON)" ]; then \ + echo "📦 Setting up docs environment with uv..."; \ + cd .. && uv venv .venv-docs && uv pip install --group docs --python .venv-docs; \ + echo "✅ Environment ready!"; \ + else \ + echo "🔄 Syncing docs dependencies (this ensures dependencies are up to date)..."; \ + cd .. && uv pip install --group docs --python .venv-docs; \ + echo "✅ Dependencies synced!"; \ + fi + +docs-html: ensure-docs-env + @echo "Building HTML documentation..." + $(VENV_PYTHON) -m sphinx -b html . _build/html + +docs-publish: ensure-docs-env + @echo "Building HTML documentation for publication (fail on warnings)..." + $(VENV_PYTHON) -m sphinx --fail-on-warning --builder html . _build/html + +docs-clean: + @echo "Cleaning built documentation..." 
+ $(RM_CMD) + +docs-live: ensure-docs-env + @echo "Starting live-reload server (sphinx-autobuild)..." + $(VENV_PYTHON) -m sphinx_autobuild --port 8001 . _build/html diff --git a/docs/about/comparison.md b/docs/about/comparison.md new file mode 100644 index 00000000..08cceea8 --- /dev/null +++ b/docs/about/comparison.md @@ -0,0 +1,127 @@ +--- +description: "Experimental comparison between AutoModel and Megatron training paths for WAN 2.1" +categories: ["concepts-architecture"] +tags: ["comparison", "automodel", "megatron", "wan", "experimental"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-comparison)= + +# AutoModel vs Megatron Comparison + +Experimental comparison of two training paths for WAN 2.1: the AutoModel (Diffusers) path versus the Megatron-Core (Megatron-Bridge) path. + +## Experiment Overview + +**Goal**: Compare two training paths for WAN 2.1: + +1. **[Diffusers](https://huggingface.co/docs/diffusers/en/index) implementation + [AutoModel](https://github.com/NVIDIA-NeMo/Automodel/tree/diffusion) training path** +2. **[Megatron-Core](https://github.com/NVIDIA/Megatron-LM) implementation + [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) training path** + +**Training Approach**: Two-stage training + +- **Stage 1**: Text → Image - Learn to connect textual embeddings with visual concepts +- **Stage 2**: Text → Video - Learn visual movements aligning with prompts + +**Dataset**: 3,000 videos; frames extracted from videos are used for text-to-image training stage. + +:::{note} +This experiment is a partial convergence test and only demonstrates the model's ability to reconstruct images and videos from input prompts. With only 3,000 videos, the model cannot generalize to generate novel content. Such generalization can be achieved with larger training datasets and increased training resources. +::: + +--- + +## Dataset Configuration + +:::: {tab-set} + +::: {tab-item} Stage 1: Text-to-Image + +**Dataset**: +- Extract 40 frames per video → **120k images** +- Resolution: **240 × 416** +- Each frame uses same caption as parent video + +**Training Setup**: +- Global batch size: 2560 images +- Learning rate: warmup 10k → 5e-5 constant +- Hardware: 10 nodes (80 GPUs) + +| Path | Parallelism | Notes | +|------|-------------|-------| +| Megatron-Core | TP=1, PP=1, CP=1 | Sequence packing (32 samples/pack) | +| AutoModel | FSDP | micro_batch_size = 32 | + +::: + +::: {tab-item} Stage 2: Text-to-Video + +**Dataset**: +- Full videos → **3,000 videos** +- Resolution: **240 × 416**, duration 4–8 seconds + +**Training Setup**: +- Global batch size: 80 videos +- Learning rate: 5e-5 constant +- Hardware: 10 nodes (80 GPUs) + +| Path | Parallelism | Notes | +|------|-------------|-------| +| Megatron-Core | TP=1, PP=1, CP=1 | micro_batch_size = 1 | +| AutoModel | FSDP | micro_batch_size = 1 | + +::: + +:::: + +--- + +## Results + +### Stage 1 — Loss vs. Steps + +```{image} ../medias/training_curves/lm_loss_text2image_3kvids.png +:alt: Training loss curve for Stage 1 (Text-to-Image) +:width: 700px +``` + +### Stage 2 — Loss vs. Steps + +```{image} ../medias/training_curves/lm_loss_text2video_3kvids.png +:alt: Training loss curve for Stage 2 (Text-to-Video) +:width: 700px +``` + +:::{note} +Training loss is smoothed with 50 steps averaging. +::: + +### Analysis + +The training curves for both stages have similar value ranges, although they do not match exactly. 
This is expected due to differences in implementation and training loop setups. + +:::{dropdown} Important Caveat: Megatron-Core Timestep Handling +:icon: alert + +In the current Megatron-Core implementation, the same diffusion time steps are applied to all samples within a pack for each step, rather than different time steps for each sample. As a result, the training loss for Megatron-Core fluctuates more significantly than for AutoModel, especially at the beginning of training. +::: + +--- + +## Key Takeaways + +- Both paths achieve similar training loss ranges +- Implementation differences lead to curve variations (expected) +- Megatron-Core shows more loss fluctuation due to timestep handling in sequence packing +- Both paths successfully learn reconstruction from prompts + +--- + +## Related Documentation + +- [Training Paradigms](concepts/training-paradigms.md) - Detailed comparison of paradigms +- [Performance Benchmarks](../reference/performance.md) - Training throughput metrics +- [Get Started](../get-started/index.md) - Start training with either path + diff --git a/docs/about/concepts/configuration.md b/docs/about/concepts/configuration.md new file mode 100644 index 00000000..558177ad --- /dev/null +++ b/docs/about/concepts/configuration.md @@ -0,0 +1,251 @@ +--- +description: "Understanding NeMo DFM's configuration system: YAML files, CLI overrides, and configuration precedence" +categories: ["concepts-architecture"] +tags: ["configuration", "yaml", "cli", "overrides"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "beginner" +content_type: "explanation" +--- + +(about-concepts-configuration)= + +# Configuration System + +NeMo DFM uses a layered configuration system: base recipes provide defaults, YAML files define reusable settings, and CLI overrides enable quick experimentation. Each layer overrides the previous, with CLI arguments taking highest precedence. + +## Configuration Layers + +Configuration precedence: Base Recipe < YAML File < CLI Overrides + +1. **Base recipes**: Python functions with framework defaults +2. **YAML files**: Reusable configuration templates +3. **CLI overrides**: Runtime argument overrides (highest precedence) + +## Automodel Configuration + +Automodel is a separate training framework in DFM that uses a simplified, YAML-first configuration approach. It requires the Automodel submodule from `3rdparty/Automodel`. + +### YAML-Based Configuration + +Automodel uses a single YAML file for all configuration: + +```yaml +seed: 42 + +model: + pretrained_model_name_or_path: Wan-AI/Wan2.1-T2V-1.3B-Diffusers + +data: + dataloader: + _target_: Automodel.datasets.build_wan21_dataloader + meta_folder: /path/to/dataset/meta/ + batch_size: 1 + num_workers: 2 + +batch: + batch_size_per_node: 8 + +training: + num_epochs: 100 + +optim: + learning_rate: 5e-6 + optimizer: + weight_decay: 0.01 + betas: [0.9, 0.999] + +fsdp: + tp_size: 1 + cp_size: 1 + pp_size: 1 + dp_size: 8 +``` + +### Loading Configuration + +Load configuration using Automodel's argument parser: + +```python +# From Automodel package (3rdparty/Automodel) +from nemo_automodel.components.config._arg_parser import parse_args_and_load_config + +cfg = parse_args_and_load_config("config.yaml") +``` + +The `nemo_automodel` package is provided by the Automodel submodule in `3rdparty/Automodel`. + +## Megatron Configuration + +### Multi-Level Configuration + +Megatron supports three configuration levels: + +#### 1. 
Base Recipe Configuration + +Python functions define base configurations: + +```python +from dfm.src.megatron.recipes.dit.dit import pretrain_config + +cfg = pretrain_config(dataset_path="/path/to/dataset", mock=False) +``` + +#### 2. YAML Override Files + +YAML files override base configuration: + +```yaml +model: + tensor_model_parallel_size: 4 +train: + global_batch_size: 512 +``` + +#### 3. CLI Overrides + +Command-line arguments override everything: + +```bash +python pretrain_dit_model.py \ + --config-file config.yaml \ + model.tensor_model_parallel_size=8 \ + train.global_batch_size=1024 +``` + +## CLI Override Syntax + +### Basic Syntax + +```bash +key=value +``` + +### Nested Keys + +Use dot notation for nested configuration: + +```bash +model.tensor_model_parallel_size=4 +train.global_batch_size=512 +optimizer.learning_rate=1e-4 +``` + +### Adding New Keys + +Use `+` prefix to add new configuration keys: + +```bash ++new_key=value ++model.custom_setting=42 +``` + +### Removing Keys + +Use `~` prefix to remove configuration keys: + +```bash +~key_to_remove +~model.unused_setting +``` + +### Type Conversion + +CLI overrides automatically convert types: + +```bash +model.tensor_model_parallel_size=4 # int +train.learning_rate=1e-4 # float +model.use_mixed_precision=true # bool +model.model_name="my_model" # string +``` + +### Complex Types + +PyTorch types use string representations that are parsed by OmegaConf: + +```bash +model.pipeline_dtype=torch.bfloat16 # torch dtype (common: torch.float16, torch.bfloat16, torch.float32) +``` + +For function references and complex objects, define them in YAML files rather than CLI overrides. + +## Configuration Structure + +Configuration files organize settings into logical sections: + +**Model**: Architecture and parallelism + +```yaml +model: + tensor_model_parallel_size: 4 + pipeline_model_parallel_size: 2 + pipeline_dtype: torch.bfloat16 +``` + +**Training**: Batch sizes and iteration control + +```yaml +train: + global_batch_size: 512 + max_steps: 10000 + save_interval: 1000 +``` + +**Data**: Dataset paths and loading + +```yaml +data: + dataset_path: /path/to/data + num_workers: 8 +``` + +**Optimizer**: Learning rates and schedules + +```yaml +optim: + learning_rate: 1e-4 + weight_decay: 0.01 +``` + +## Configuration Patterns + +### Experiment Workflows + +Base configuration with CLI variations: + +```bash +# Base run +python train.py --config-file base_config.yaml + +# Learning rate sweep +python train.py --config-file base_config.yaml train.learning_rate=2e-4 +python train.py --config-file base_config.yaml train.learning_rate=5e-4 + +# Scale model parallelism +python train.py --config-file base_config.yaml \ + model.tensor_model_parallel_size=8 \ + model.pipeline_model_parallel_size=2 +``` + +### Verify Final Configuration + +Print merged configuration in Megatron to verify all overrides: + +```python +from megatron.bridge.utils.common_utils import get_rank_safe + +if get_rank_safe() == 0: + cfg.print_yaml() +``` + +This displays the final configuration after all merging, showing effective values for model, training, data, and optimizer settings. 
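+The same precedence behavior can be sanity-checked outside the framework. The sketch below uses OmegaConf directly (the library the docs mention for parsing overrides) to show how a YAML override and a CLI-style dotlist layer on top of base values; the keys and values are illustrative only, not a DFM API.
+
+```python
+from omegaconf import OmegaConf
+
+# Base recipe defaults (lowest precedence).
+base = OmegaConf.create({"train": {"global_batch_size": 256, "max_steps": 10000}})
+
+# YAML file overrides (middle precedence).
+yaml_overrides = OmegaConf.create({"train": {"global_batch_size": 512}})
+
+# CLI overrides in key=value form (highest precedence).
+cli_overrides = OmegaConf.from_dotlist(["train.max_steps=20000"])
+
+merged = OmegaConf.merge(base, yaml_overrides, cli_overrides)
+print(merged.train.global_batch_size)  # 512   (from the YAML file)
+print(merged.train.max_steps)          # 20000 (from the CLI override)
+```
+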
+ +## Environment Variables + +Set runtime behavior with environment variables: + +```bash +export CUDA_VISIBLE_DEVICES=0,1,2,3 # Select GPUs +export NCCL_DEBUG=INFO # Debug distributed communication +``` + diff --git a/docs/about/concepts/diffusion-models.md b/docs/about/concepts/diffusion-models.md new file mode 100644 index 00000000..a9c37bf2 --- /dev/null +++ b/docs/about/concepts/diffusion-models.md @@ -0,0 +1,199 @@ +--- +description: "How diffusion models work for video generation in NeMo DFM, including EDM and Flow Matching paradigms" +categories: ["concepts-architecture"] +tags: ["diffusion", "video-generation", "edm", "flow-matching"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-concepts-diffusion-models)= + +# Diffusion Models for Video + +Diffusion models generate video by learning to reverse a gradual noise-addition process. NeMo DFM implements two paradigms—EDM and Flow Matching—each offering distinct training dynamics and sampling characteristics for video generation. + +## Core Mechanism + +Diffusion models operate through two complementary processes: + +1. **Forward (noise addition)**: The model gradually corrupts clean video data by adding Gaussian noise over many timesteps until the data becomes indistinguishable from pure noise. This forward process is deterministic and follows a predefined noise schedule that controls the rate of corruption. + +2. **Reverse (denoising)**: The model learns to invert the forward process by predicting and removing noise at each timestep. During training, the model sees corrupted data at various noise levels and learns to estimate the original clean data or the noise that was added. During inference, the model starts with random noise and iteratively denoises it to generate new video content. + +The key insight is that learning to denoise at all noise levels enables generation: if you can remove noise step by step, you can transform random noise into coherent video. + +### Video-Specific Challenges + +Video diffusion extends image diffusion with additional complexity: + +- **Temporal consistency**: Models must maintain coherent motion and object identity across frames. This typically requires 3D attention mechanisms that attend across both spatial and temporal dimensions, or causal attention that processes frames sequentially. +- **Computational scale**: A 5-second video at 24 fps contains 120 frames. Generating each frame at 512×512 resolution requires processing over 31 million pixels, making efficient architectures and parallelization essential. +- **Conditioning mechanisms**: Text embeddings from encoders such as T5 provide semantic guidance, but video generation often requires additional conditioning on motion, camera movement, or reference frames. +- **Memory requirements**: Processing multiple frames simultaneously demands substantial GPU memory. Latent diffusion models compress videos into lower-dimensional representations before applying diffusion, reducing memory usage by 16-64×. + +## Diffusion Paradigms in DFM + +NeMo DFM implements two paradigms with different mathematical formulations and sampling characteristics: + +### EDM (Elucidating Diffusion Models) + +EDM frames diffusion as a Stochastic Differential Equation (SDE) where the forward process adds noise according to a continuous-time stochastic process, and the reverse process learns to integrate backward through time. 
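+
+Before the formal details, the following toy PyTorch sketch shows the core idea: corrupt clean latents at a randomly sampled noise level and train a network to predict the added noise. It is illustrative only and simplifies the noise schedule, preconditioning, and loss weighting that EDM actually uses; the `model(noisy, sigma)` signature is hypothetical.
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Toy denoising training step for video latents shaped (B, C, T, H, W).
+# Not DFM's implementation; schedules and weighting are simplified.
+def toy_denoising_step(model, clean_latents):
+    # Forward process: corrupt clean data at a per-sample noise level.
+    sigma = torch.rand(clean_latents.shape[0], 1, 1, 1, 1)
+    noise = torch.randn_like(clean_latents)
+    noisy = clean_latents + sigma * noise
+
+    # Reverse process: the network learns to predict the added noise.
+    predicted_noise = model(noisy, sigma)
+    return F.mse_loss(predicted_noise, noise)
+```
+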
+ +**Mathematical formulation**: EDM uses a variance-preserving SDE formulation where the noise schedule is parameterized to maintain consistent signal-to-noise ratios across timesteps. The model predicts either the noise ε, the denoised data x₀, or the score function ∇log p(x). + +**Sampling characteristics**: + +- Stochastic sampling paths allow controlled randomness during generation +- Classifier-free guidance scales the conditional and unconditional predictions: `output = unconditional + guidance_scale × (conditional - unconditional)` +- Typical inference requires 25-50 sampling steps, with quality improving at higher step counts +- Second-order samplers (Heun, DPM-Solver++) can reduce required steps + +**When to use EDM**: + +- Production inference where generation quality is critical +- Scenarios requiring classifier-free guidance for prompt adherence +- Models trained with variance-preserving objectives + +**Primary architecture**: DiT (Diffusion Transformer) + +### Flow Matching + +Flow matching learns a deterministic ordinary differential equation (ODE) that transports samples from a noise distribution to the data distribution through continuous-time flows. + +**Mathematical formulation**: Instead of learning to denoise at discrete timesteps, flow matching learns a velocity field v(x, t) that defines how samples should move through space over time. The generative process integrates this ODE: dx/dt = v(x, t). The training objective directly matches the learned velocity field to a target conditional flow. + +**Sampling characteristics**: + +- Deterministic sampling paths provide consistent generation given the same seed +- Typically requires fewer sampling steps (10-20) compared to EDM due to the direct ODE formulation +- Time-shift techniques can adjust the speed of the flow at different timesteps +- ODE solvers (Euler, Runge-Kutta) control the numerical integration accuracy + +**When to use Flow Matching**: + +- Applications requiring deterministic generation for reproducibility +- Scenarios where faster inference (fewer steps) is prioritized +- Research exploring flow-based generative models +- Models trained with flow matching objectives + +**Primary architecture**: WAN + +## Training Dynamics + +### EDM Training Objective + +EDM training optimizes the model to predict noise at randomly sampled timesteps. For each training sample, the framework corrupts the clean video by adding Gaussian noise at a random noise level t, then trains the model to estimate either the added noise ε, the clean data x₀, or the score ∇log p(x_t). The loss function typically uses mean squared error between the prediction and target: + +`L = E[||prediction - target||²]` + +The random sampling of timesteps ensures the model learns to denoise at all noise levels, from slight corruptions to nearly pure noise. Variance-preserving formulations maintain signal strength across timesteps, preventing the model from focusing disproportionately on certain noise levels. + +### Flow Matching Training Objective + +Flow matching training optimizes the model to predict velocity fields that transport noise to data. The framework samples a clean video, constructs a conditional flow path from noise to that specific video, then trains the model to predict the velocity field along that path: + +`L = E[||v_θ(x_t, t) - u_t(x_t)||²]` + +where v_θ is the learned velocity field and u_t is the target conditional velocity. 
The key difference from EDM is that flow matching learns a direct mapping through time rather than iterative denoising. Conditional flow matching uses simple linear interpolation paths during training, making the training objective straightforward while still enabling complex generation. + +## Inference Characteristics + +### EDM Sampling + +EDM sampling iteratively denoises random noise by reversing the learned diffusion process. Starting from pure Gaussian noise, the sampler makes multiple predictions at decreasing noise levels, each time removing a portion of the noise. The sampling trajectory can be deterministic or stochastic depending on the sampler choice. + +Classifier-free guidance modifies the sampling process by computing both conditional (text-guided) and unconditional predictions at each step, then extrapolating away from the unconditional prediction. Higher guidance scales (typically 7-15 for video) increase prompt adherence but can reduce diversity. The guidance computation doubles the inference cost since the model must make two predictions per step. + +Sampling quality depends on the number of steps and sampler algorithm. First-order samplers (DDPM, DDIM) require more steps but are simpler, while second-order samplers (Heun, DPM-Solver++) achieve similar quality with 50-70% fewer steps by using higher-order numerical approximations. + +### Flow Matching Sampling + +Flow matching sampling integrates the learned velocity field forward through time using an ODE solver. Starting from noise, the solver numerically integrates dx/dt = v(x, t) from t=0 to t=1, where the velocity field guides the sample along a continuous path toward the data distribution. + +The deterministic nature of ODE integration means the same seed and hyperparameters produce identical outputs, which benefits reproducibility and iterative refinement. Time-shift techniques can reweight the integration schedule to spend more computational budget at critical phases of generation. + +Flow matching typically achieves competitive quality with fewer function evaluations (10-20) compared to EDM because the direct velocity prediction avoids the iterative error accumulation of denoising steps. However, classifier-free guidance is less commonly used with flow matching, as the formulation doesn't naturally separate conditional and unconditional paths. + +## Text Conditioning Mechanisms + +Both paradigms condition generation on text prompts through embedding-based guidance: + +**Text encoder integration**: Models typically use T5 or CLIP text encoders to convert prompts into high-dimensional embeddings (for example, 768 or 1024 dimensions). These embeddings are injected into the diffusion model through cross-attention layers, where the model's hidden states attend to the text representations at each layer of the architecture. + +**Classifier-free guidance**: During training, the model randomly drops conditioning information (typically 10-20% of samples) to learn both conditional p(x|text) and unconditional p(x) distributions. During inference, the two predictions are combined: `output = unconditional + guidance_scale × (conditional - unconditional)`. This extrapolation increases the influence of the text condition, improving prompt adherence at the cost of reduced diversity. + +**Negative prompts**: Some implementations support negative text conditioning, which guides generation away from undesired content by subtracting the influence of negative prompt embeddings from the positive prompt guidance. 
The modified guidance becomes: `output = unconditional + guidance_scale × (positive_conditional - negative_conditional)`. + +## Architecture Implementations + +### DiT (Diffusion Transformer) + +DiT applies transformer architectures to diffusion models by treating the latent video representation as a sequence of patches. Each frame is divided into spatial patches (similar to Vision Transformers), and the patches are processed through transformer blocks with both spatial and temporal attention. + +**Key architectural components**: + +- **Patch embedding**: Divides frames into non-overlapping patches and projects them to the model dimension +- **Positional encoding**: Combines spatial (2D position within frame) and temporal (frame index) positional information +- **Attention patterns**: 3D attention across height, width, and time dimensions enables modeling spatial structure and temporal dynamics simultaneously +- **Adaptive layer normalization (AdaLN)**: Conditions the normalization on timestep and text embeddings, modulating the network behavior based on the current noise level and prompt +- **Hierarchical processing**: Some variants use multi-scale representations with downsampling and upsampling stages + +DiT architectures scale effectively with model size and training compute, making them suitable for large-scale video generation. + +### WAN (Flow-Based Architecture) + +WAN implements flow matching with architectural designs optimized for learning velocity fields. While sharing transformer-based components with DiT, WAN modifications support the continuous-time dynamics of flow matching. + +**Flow-specific design choices**: + +- Velocity prediction heads that output per-patch velocity fields +- Time embeddings that integrate smoothly across the continuous [0,1] interval rather than discrete diffusion timesteps +- Architectural modifications that support deterministic ODE integration during inference + +The WAN architecture demonstrates that flow matching can achieve competitive results with specialized architectural considerations for the flow-based training paradigm. + +## Hyperparameters and Trade-offs + +### Noise Schedule + +The noise schedule defines the variance of noise at each timestep, controlling the diffusion process trajectory. Common schedules include: + +**Linear schedule**: Noise variance increases linearly from near-zero to one. Simple but can be suboptimal for complex data distributions. + +**Cosine schedule**: Uses a cosine function to allocate more capacity to mid-range noise levels where the model learns the most semantic information. Generally produces better results than linear schedules. + +**Learned schedules**: Some advanced formulations learn the optimal noise schedule during training, adapting to the specific data distribution. + +During inference, the schedule determines the timesteps at which the model makes predictions. Non-uniform schedules can concentrate sampling steps at critical noise levels, improving efficiency. + +### Guidance Scale + +The guidance scale parameter γ controls the strength of conditional guidance in the formula: `output = unconditional + γ × (conditional - unconditional)`. 
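+
+In code, this extrapolation is a single operation applied to the two model predictions at each sampling step. The function below is an illustrative sketch, not a DFM API:
+
+```python
+import torch
+
+# Illustrative classifier-free guidance step (not a DFM function).
+def apply_guidance(pred_uncond: torch.Tensor, pred_cond: torch.Tensor, gamma: float) -> torch.Tensor:
+    # gamma = 1.0 recovers the plain conditional prediction; larger gamma
+    # pushes the output further along the unconditional-to-conditional direction.
+    return pred_uncond + gamma * (pred_cond - pred_uncond)
+```
+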
+ +**Trade-offs**: + +- γ = 1: No guidance, equivalent to standard conditional generation +- γ = 7-10: Typical range for video, balances prompt adherence and quality +- γ = 15+: Strong guidance, may improve text alignment but can reduce diversity and introduce artifacts +- γ < 1: Weakens conditioning, increases diversity + +Higher guidance scales amplify the difference between conditional and unconditional predictions, effectively increasing the model's confidence in prompt-related features. + +### Inference Steps + +The number of function evaluations during sampling determines the quality-speed trade-off: + +**EDM typical ranges**: + +- 25-50 steps: Standard quality, 2-5 seconds per video (depending on resolution and hardware) +- 50-100 steps: High quality, diminishing returns above 50 +- <25 steps: Fast sampling, potential quality degradation with first-order samplers + +**Flow matching typical ranges**: + +- 10-20 steps: Competitive quality due to direct velocity prediction +- 20-50 steps: Marginal improvements, higher computational cost + +Second-order ODE solvers can reduce required steps by 30-50% while maintaining quality through better numerical approximation of the integration path. + diff --git a/docs/about/concepts/distributed-training.md b/docs/about/concepts/distributed-training.md new file mode 100644 index 00000000..9e81efdf --- /dev/null +++ b/docs/about/concepts/distributed-training.md @@ -0,0 +1,357 @@ +--- +description: "Understanding distributed training parallelism in NeMo DFM: tensor parallelism, context parallelism, pipeline parallelism, and data parallelism" +categories: ["concepts-architecture"] +tags: ["distributed", "parallelism", "training", "tensor-parallelism"] +personas: ["mle-focused", "admin-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-concepts-distributed-training)= + +# Distributed Training + +NeMo DFM scales training across multiple GPUs and nodes using four parallelism strategies. These strategies address different bottlenecks: model size (TP, PP), sequence length (CP), and throughput (DP). + +## Overview + +| Type | What It Splits | When to Use | Communication | +|------|----------------|-------------|---------------| +| **Tensor Parallelism (TP)** | Model weights across GPUs | Model >40 GB per GPU | High-bandwidth (NVLink) | +| **Context Parallelism (CP)** | Sequence tokens across GPUs | Sequences >32K tokens | High-bandwidth (NVLink) | +| **Pipeline Parallelism (PP)** | Model layers across GPUs | Very deep models, multi-node | Low-bandwidth (point-to-point) | +| **Data Parallelism (DP)** | Training batches across GPUs | Standard scaling | Standard (all-reduce) | + +**Example**: A 70B parameter model with 16K sequence length on 128 GPUs might use TP=4, CP=2, PP=2, DP=8. + +## Tensor Parallelism (TP) + +Splits model weights across GPUs within each layer. A 40 GB layer with TP=4 uses 10 GB per GPU. + +### How It Works + +For a matrix multiplication `Y = XW`: +1. Weight matrix `W` is split column-wise across GPUs +2. Each GPU computes partial result using its weight shard +3. Results are combined via all-reduce operation + +**Example**: For a 12,288 × 12,288 weight matrix with TP=4, each GPU holds 12,288 × 3,072. 
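+
+The sketch below reproduces the shard shapes for that example with plain PyTorch. It is illustrative only; DFM's tensor-parallel layers handle sharding and the all-reduce internally.
+
+```python
+import torch
+
+# Column-wise split of a 12,288 x 12,288 weight across TP=4 ranks (illustration only).
+tp_size = 4
+full_weight = torch.empty(12_288, 12_288)
+shards = torch.chunk(full_weight, tp_size, dim=1)  # one column shard per GPU
+
+print(shards[0].shape)  # torch.Size([12288, 3072])
+print(len(shards))      # 4
+```
+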
+ +### When to Use + +- **Model size**: Model parameters >40 GB per GPU +- **Layer size**: Individual layers >10 GB +- **Hardware**: GPUs connected via NVLink or high-speed interconnect + +**Typical configurations**: +- TP=2: 70B-175B models on A100 80GB +- TP=4: 175B-400B models on H100 80GB +- TP=8: >400B models or limited GPU memory + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 4 # Split across 4 GPUs + cp_size: 1 + pp_size: 1 + dp_size: 2 # Calculated automatically if not specified +``` + +**Megatron**: +```python +model.tensor_model_parallel_size = 4 +``` + +### Performance Impact + +- **Memory**: Reduces per-GPU memory by `1/tp_size` +- **Communication**: All-reduce after each layer forward/backward pass +- **Bandwidth requirement**: High-bandwidth interconnect (NVLink, NVSwitch) required for efficient scaling + +## Context Parallelism (CP) + +Splits sequence tokens across GPUs. A 64K token sequence with CP=2 processes 32K tokens per GPU. + +### How It Works + +For attention computation: +1. Sequence split into chunks across GPUs +2. Each GPU computes attention for its chunk +3. Key-value pairs shared via all-gather +4. Results combined for full attention + +**Example**: A 64K token sequence with CP=4 splits into 4 chunks of 16K tokens, reducing attention memory by 75%. + +### When to Use + +- **Sequence length**: >32K tokens or frames +- **Memory bottleneck**: Attention memory exceeds 40% of total +- **Use case**: Video generation (100+ frames), long-context language models + +**Typical configurations**: +- CP=2: 32K-64K token sequences +- CP=4: 64K-128K token sequences +- CP=8: >128K token sequences + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 1 + cp_size: 2 # Split sequence across 2 GPUs + pp_size: 1 + dp_size: 4 +``` + +**Megatron**: +```python +model.context_parallel_size = 2 +``` + +### Performance Impact + +- **Memory**: Reduces attention memory by `1/cp_size` +- **Communication**: All-gather for key-value pairs per attention layer +- **Scaling**: Most effective when attention is memory bottleneck + +## Pipeline Parallelism (PP) + +Splits model layers across GPUs or nodes. A 48-layer model with PP=4 assigns 12 layers per stage. + +### How It Works + +Model divided into sequential stages: +1. Stage 1 (GPU 0): Layers 1-12 +2. Stage 2 (GPU 1): Layers 13-24 +3. Stage 3 (GPU 2): Layers 25-36 +4. Stage 4 (GPU 3): Layers 37-48 + +Activations flow forward through stages; gradients flow backward. Microbatching overlaps computation to reduce idle time. + +### When to Use + +- **Multi-node training**: Minimizes inter-node bandwidth requirements +- **Very deep models**: >80 layers that don't fit with TP alone +- **Heterogeneous networks**: Lower bandwidth between nodes than within + +**Typical configurations**: +- PP=2: 2-node training with fast inter-node links +- PP=4: 4+ node training +- PP=8: Large-scale multi-node deployments + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 2 + cp_size: 1 + pp_size: 4 # 4 pipeline stages + dp_size: 1 +``` + +**Megatron**: +```python +model.pipeline_model_parallel_size = 4 +``` + +### Performance Impact + +- **Memory**: Reduces per-GPU memory by ~`1/pp_size` +- **Communication**: Point-to-point activation/gradient transfers between stages +- **Efficiency**: Pipeline bubbles cause idle time during stage transitions; mitigated by microbatching and virtual pipeline parallelism + +## Data Parallelism (DP) + +Replicates the model and splits batches across GPUs. 
Each GPU processes different data with the same model. + +### How It Works + +For batch size 64 with DP=8: +1. Each GPU gets 8 samples +2. Each GPU computes gradients independently +3. Gradients averaged across all GPUs via all-reduce +4. All GPUs update with averaged gradients + +This increases effective batch size and training throughput. + +### When to Use + +- **Scaling throughput**: Increase samples per second +- **Batch size**: Increase effective batch size +- **Standard case**: After applying TP/CP/PP, use remaining GPUs for DP + +**Typical configurations**: +- DP=8: Single 8-GPU node +- DP=16-32: Multi-node without model parallelism +- DP=4-16: Remaining GPUs after TP/CP/PP + +### Configuration + +**Automodel**: +```yaml +fsdp: + tp_size: 1 + cp_size: 1 + pp_size: 1 + dp_size: 8 # 8 data parallel replicas +``` + +**Megatron**: +```python +# Automatically calculated: DP = total_gpus / (TP × CP × PP) +# Example: 32 GPUs with TP=4, CP=2, PP=2 → DP = 32/(4×2×2) = 2 +``` + +### Performance Impact + +- **Memory**: No memory savings (full model copy per GPU) +- **Communication**: All-reduce for gradients after each backward pass +- **Scaling**: Near-linear speedup; efficiency depends on batch size + +## Combining Parallelism Strategies + +All four parallelism types can be combined. Total GPUs = TP × CP × PP × DP. + +### Real-World Examples + +**Small model, long sequences (8 GPUs)**: +```yaml +# Video generation: 13B model, 128K frames +fsdp: + tp_size: 1 # Model fits on single GPU + cp_size: 4 # Split long sequence + pp_size: 1 # No pipeline needed + dp_size: 2 # Use remaining GPUs for throughput +``` + +**Large model, standard sequences (64 GPUs)**: +```yaml +# Language model: 175B model, 8K tokens +fsdp: + tp_size: 4 # Split large model + cp_size: 1 # Sequence fits in memory + pp_size: 2 # 2-node deployment + dp_size: 8 # Scale throughput +``` + +**Massive model, multi-node (256 GPUs)**: +```yaml +# 500B+ model across 32 nodes +fsdp: + tp_size: 8 # Within-node parallelism + cp_size: 2 # Moderate sequences + pp_size: 4 # Across-node parallelism + dp_size: 4 # Remaining GPUs +``` + +### Design Principles + +1. **Start with TP**: If model doesn't fit, add TP first (requires high bandwidth) +2. **Add CP if needed**: For sequences >32K tokens +3. **Use PP for multi-node**: Pipeline across nodes to reduce inter-node traffic +4. **Fill with DP**: Use remaining GPUs for data parallelism + +## Choosing Parallelism Strategy + +### Decision Flowchart + +**Step 1**: Model fits on single GPU? +- **Yes**: Use DP only (simplest, most efficient) +- **No**: Go to Step 2 + +**Step 2**: Single node or multi-node? +- **Single node (8 GPUs)**: Use TP=2 or TP=4, then DP +- **Multi-node (16+ GPUs)**: Go to Step 3 + +**Step 3**: Configure multi-node strategy +1. Use **PP** across nodes (minimize inter-node bandwidth) +2. Use **TP** within nodes (leverage NVLink) +3. Add **CP** if sequences >32K tokens +4. 
Use **DP** for remaining GPUs + +### Hardware-Specific Guidance + +**8x A100 80GB (single node)**: +```yaml +# 70B model, 8K tokens +fsdp: + tp_size: 2 + cp_size: 1 + pp_size: 1 + dp_size: 4 +``` + +**4 nodes × 8 H100 80GB (32 GPUs)**: +```yaml +# 175B model, 16K tokens +fsdp: + tp_size: 4 # Within node + cp_size: 2 # Long sequences + pp_size: 2 # Across nodes (4 → 2 nodes per stage) + dp_size: 2 # Remaining GPUs +``` + +**32 nodes × 8 H100 80GB (256 GPUs)**: +```yaml +# 500B model, 8K tokens +fsdp: + tp_size: 8 # Full node + cp_size: 1 # Standard sequences + pp_size: 4 # Across nodes + dp_size: 8 # Remaining GPUs +``` + +### Performance vs Memory Trade-offs + +| Priority | Strategy | Rationale | +|----------|----------|-----------| +| **Maximum speed** | DP only | No communication overhead, if model fits | +| **Fit large model** | TP first | Most memory reduction per communication cost | +| **Long sequences** | CP | Only option for >32K tokens | +| **Multi-node scaling** | PP | Minimizes expensive inter-node bandwidth | + +## Implementation Details + +### Automodel (FSDP2) + +Automodel uses FSDP2 (Fully Sharded Data Parallel) with automatic optimizations: + +- **Weight sharding**: Distributes model weights across DP ranks +- **Gradient synchronization**: Overlaps communication with computation +- **Optimizer state sharding**: Distributes optimizer states across DP ranks to reduce per-GPU memory +- **Checkpointing**: Saves only one copy regardless of DP size + +Best for: Standard training workflows with minimal tuning. + +**Note**: Configure all parallelism dimensions in the `fsdp:` section of your YAML config. The framework handles DP calculation automatically if `dp_size` is not specified. + +### Megatron + +Megatron provides explicit control over parallelism configuration: + +- **Fine-grained tuning**: Set communication schedules and buffer sizes +- **Custom patterns**: Optimize for specific network topologies +- **Large-scale focus**: Optimized for 100+ GPU deployments + +Best for: Large-scale training requiring custom optimization. + +### Verifying Parallelism Configuration + +To check your current parallelism settings at runtime: + +**Megatron**: +```python +from megatron.core import parallel_state as ps + +tp_size = ps.get_tensor_model_parallel_world_size() +cp_size = ps.get_context_parallel_world_size() +pp_size = ps.get_pipeline_model_parallel_world_size() +# DP is calculated: dp_size = world_size / (tp_size * cp_size * pp_size) +``` + +**Automodel**: +Check your configuration YAML or training logs for the applied parallelism settings. The framework logs parallelism configuration at initialization. diff --git a/docs/about/concepts/index.md b/docs/about/concepts/index.md new file mode 100644 index 00000000..da3a4f32 --- /dev/null +++ b/docs/about/concepts/index.md @@ -0,0 +1,70 @@ +--- +description: "Core concepts and terminology for NeMo DFM including training paradigms, diffusion models, video data representation, and distributed training" +categories: ["concepts-architecture"] +tags: ["concepts", "fundamentals", "diffusion", "training", "distributed"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "beginner" +content_type: "concept" +modality: "universal" +--- + +(about-concepts)= + +# Concepts + +Learn about the core concepts you need to understand before using NeMo DFM. + +## Core Concepts + +These concepts are essential for understanding how NeMo DFM works and making informed decisions about your training and inference workflows. 
+ +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`git-branch;1.5em;sd-mr-1` Training Paradigms +:link: about-concepts-training-paradigms +:link-type: ref + +Understand the two main training approaches: Automodel (recipe-based) and Megatron (large-scale distributed), and when to use each. +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Diffusion Models for Video +:link: about-concepts-diffusion-models +:link-type: ref + +Learn how diffusion models work for video generation, including EDM and Flow Matching paradigms. +::: + +:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Video Data Representation +:link: about-concepts-video-data +:link-type: ref + +Understand how video data is represented in DFM: latents, VAE encoding, tokenization, and data formats. +::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Distributed Training +:link: about-concepts-distributed-training +:link-type: ref + +Learn about parallelism strategies: tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), and data parallelism (DP). +::: + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration System +:link: about-concepts-configuration +:link-type: ref + +Understand how DFM's configuration system works: YAML files, CLI overrides, and configuration precedence. +::: + +:::: + +```{toctree} +:hidden: +:maxdepth: 2 + +Training Paradigms +Diffusion Models for Video +Video Data Representation +Distributed Training +Configuration System +``` diff --git a/docs/about/concepts/training-paradigms.md b/docs/about/concepts/training-paradigms.md new file mode 100644 index 00000000..ca886854 --- /dev/null +++ b/docs/about/concepts/training-paradigms.md @@ -0,0 +1,279 @@ +--- +description: "Understanding the two training paradigms in NeMo DFM: Automodel and Megatron, and when to use each" +categories: ["concepts-architecture"] +tags: ["training", "automodel", "megatron", "paradigms"] +personas: ["mle-focused", "data-scientist-focused"] +difficulty: "beginner" +content_type: "explanation" +--- + +(about-concepts-training-paradigms)= + +# Training Paradigms + +NeMo DFM offers two training paradigms: **Automodel** for quick prototyping and fine-tuning, and **Megatron** for large-scale production training. Each paradigm uses different configuration systems, parallelism strategies, and data loading approaches. + +## Overview + +Choose between two approaches based on your training goal: + +| Paradigm | Best For | Complexity | Configuration | Example | +|----------|----------|------------|---------------|---------| +| **Automodel** | Quick prototyping, fine-tuning, research | Lower | YAML-based recipes | `finetune.py` | +| **Megatron** | Large-scale pretraining, production training | Higher | Python recipes + YAML + CLI | `pretrain_dit_model.py` | + +## Understanding the Paradigms + +### Key Features + +Each paradigm takes a different approach to configuration, parallelism, and data loading. Understanding these differences helps you choose the right paradigm for your training workflow. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Automodel provides recipe-based training that abstracts distributed training complexity behind a single YAML configuration file. Pre-built recipes handle model initialization, data loading, and training loops automatically. + +**Configuration**: Single YAML file controls all training parameters. The recipe provides sensible defaults, and you override only what you need to change. 
+ +**Parallelism**: FSDP2 automatically distributes training across GPUs using tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), and data parallelism (DP). You configure parallelism strategy in the `fsdp` section without managing low-level details. + +**Data Loading**: Uses PyTorch DataLoader with standard dataset interfaces. Works with common formats like images, text, and Hugging Face datasets. + +**Model Integration**: Works directly with Hugging Face Diffusers models, making fine-tuning pre-trained models straightforward. +::: + +:::{tab-item} Megatron +:sync: megatron + +Megatron provides explicit control over every aspect of distributed training, from parallelism dimensions to data loading pipelines. Built for large-scale pretraining, it supports multi-node clusters with thousands of GPUs and custom model architectures. + +**Configuration**: Three-level configuration system provides maximum flexibility: + +1. Base recipe (Python) defines training logic and default parameters +2. YAML override files modify specific parameters for experiments +3. CLI overrides (highest precedence) enable quick parameter sweeps + +This layered approach supports Hydra-style syntax for complex configuration changes. + +**Parallelism**: Explicit control over all parallelism dimensions. You specify tensor parallel size, context parallel size, pipeline parallel stages, and data parallel degree independently. This fine-grained control enables optimal scaling for different model architectures and cluster configurations. + +**Data Loading**: Uses Energon data loader with webdataset format, optimized for distributed training at scale. Supports efficient data streaming across nodes and advanced features like sample reweighting and mixing multiple datasets. + +**Model Customization**: Full access to model architecture, forward pass logic, and training step. You define custom `ForwardStep` functions and modify model components directly. +::: +:::: + +### Use Cases + +Your training goal determines which paradigm fits best. Here are the scenarios where each paradigm excels. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +- **Fine-tuning**: Adapt pre-trained models to your dataset +- **Research prototyping**: Test ideas quickly without infrastructure overhead +- **Small-scale training**: Single-node or small multi-node setups +- **Standard architectures**: Using existing model recipes without customization +::: + +:::{tab-item} Megatron +:sync: megatron + +- **Large-scale pretraining**: Training foundation models from scratch on multi-node clusters +- **Production workflows**: Reproducible training with version-controlled configurations +- **Custom architectures**: Implementing novel model designs not available in standard recipes +- **Performance optimization**: Tuning parallelism and memory usage for specific hardware +- **Multi-stage training**: Complex workflows with different training phases +::: +:::: + +### Architecture + +Both paradigms organize code into layers that separate configuration from execution. The layer structure reflects each paradigm's design philosophy. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Automodel uses a three-layer architecture: + +1. **Recipe layer**: Pre-built training recipes (such as `TrainWan21DiffusionRecipe`) encapsulate training logic +2. **Config layer**: YAML files specify hyperparameters, data paths, and parallelism +3. 
**Execution layer**: `recipe.run_train_validation_loop()` handles training iteration +::: + +:::{tab-item} Megatron +:sync: megatron + +Megatron uses a modular architecture with clear separation of concerns: + +1. **Recipe layer**: Base Python configuration (`pretrain_config()`) defines model, optimizer, and training parameters +2. **Override layer**: YAML files and CLI arguments modify base configuration +3. **Execution layer**: `pretrain()` function orchestrates distributed training with custom forward steps +4. **Bridge layer**: Megatron-Bridge handles low-level distributed training mechanics +::: +:::: + +## Comparing the Paradigms + +The paradigms differ fundamentally in how they balance ease of use against control and scalability. + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +**Configuration**: Single YAML file with recipe defaults + +**Parallelism**: Automatic FSDP2 (less control) + +**Data Loading**: PyTorch DataLoader, standard formats + +**Scalability**: Small multi-node + +**Setup Complexity**: Low + +**Customization**: Recipe-level only + +**Best For**: Quick experiments, fine-tuning +::: + +:::{tab-item} Megatron +:sync: megatron + +**Configuration**: Python base + YAML overrides + CLI + +**Parallelism**: Explicit TP/CP/PP/DP (full control) + +**Data Loading**: Energon data loader with distributed streaming + +**Scalability**: Large multi-node clusters + +**Setup Complexity**: High + +**Customization**: Full code-level access + +**Best For**: Large-scale pretraining, production +::: +:::: + +### Configuration Systems + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Uses a single YAML file where you specify training parameters. The recipe provides defaults for most settings, so you only override what matters for your experiment. Configuration is simple and flat. +::: + +:::{tab-item} Megatron +:sync: megatron + +Uses a three-level system: start with a Python recipe that defines base configuration, override specific parameters with YAML files for experiments, and apply final tweaks via CLI for parameter sweeps. This complexity enables reproducible experiments with version control. +::: +:::: + +### Parallelism Strategies + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Automatically configures FSDP2 to distribute your model across GPUs. You specify high-level parallelism settings in the `fsdp` section, and the framework determines optimal shard placement. This works well for standard model architectures. +::: + +:::{tab-item} Megatron +:sync: megatron + +Requires you to explicitly set tensor parallel size, context parallel size, pipeline stages, and data parallel degree. This granular control enables optimal memory usage and communication patterns for very large models or custom architectures. +::: +:::: + +### Data Loading Pipelines + +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +Uses PyTorch DataLoader with standard Python datasets. This familiar interface works with images, text files, and Hugging Face datasets without preprocessing. +::: + +:::{tab-item} Megatron +:sync: megatron + +Uses the Energon data loader optimized for distributed training at scale. This loader enables efficient streaming of massive datasets across nodes and supports advanced features like deterministic sampling and dataset mixing. +::: +:::: + +## Selecting Your Paradigm + +Your training goal determines which paradigm to use. 
+ +::::{tab-set} +:sync-group: paradigm + +:::{tab-item} Automodel +:sync: automodel + +**Fine-tuning existing models**: Automodel integrates directly with Hugging Face models and provides pre-built fine-tuning recipes. + +**Research experiments**: Quick iteration with YAML-only configuration changes. Test hypotheses in minutes instead of hours. + +**Small-scale training**: Training on single-node or small multi-node setups where automatic parallelism configuration works well. + +**Standard architectures**: Using proven model architectures without custom modifications. +::: + +:::{tab-item} Megatron +:sync: megatron + +**Pretraining foundation models**: Large-scale training from scratch where Energon's data loading efficiency and explicit parallelism control are essential. + +**Production deployments**: Reproducible training with version-controlled Python recipes and configuration overrides. + +**Custom model architectures**: Implementing novel designs that require code-level modifications to model structure and training steps. + +**Performance-critical training**: Optimizing memory usage and communication patterns for specific hardware configurations. + +**Large clusters**: Training on large multi-node clusters where explicit parallelism management becomes necessary. +::: +:::: + +## Paradigm Interoperability + +Model checkpoints from one paradigm can often be loaded in the other, but training workflows are not interchangeable. The paradigms use different: + +- **Configuration formats**: YAML-only versus Python + YAML + CLI +- **Data formats**: PyTorch datasets versus webdataset +- **Parallelism APIs**: FSDP2 versus explicit Megatron parallelism + +Plan to use one paradigm consistently throughout your project. Converting training infrastructure between paradigms requires rewriting configuration and data loading code. + +**Inference**: Both paradigms can export models to standard formats for inference deployment. + +--- + +## Experimental Comparison + +For a detailed experimental comparison of Automodel vs Megatron training paths, including training curves and performance analysis, see [Automodel vs Megatron Comparison](../comparison.md). + +The comparison includes: +- Two-stage training experiment (Text→Image, Text→Video) +- Training loss curves for both paths +- Important caveats about implementation differences +- Performance characteristics analysis diff --git a/docs/about/concepts/video-data.md b/docs/about/concepts/video-data.md new file mode 100644 index 00000000..8c579306 --- /dev/null +++ b/docs/about/concepts/video-data.md @@ -0,0 +1,426 @@ +--- +description: "How video data is represented in NeMo DFM: latents, VAE encoding, tokenization, and data formats" +categories: ["concepts-architecture"] +tags: ["data", "video", "latents", "vae", "tokenization"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "intermediate" +content_type: "explanation" +--- + +(about-concepts-video-data)= + +# Video Data Representation + +NeMo DFM processes videos in latent space rather than pixel space, reducing memory requirements and accelerating training by up to 64×. + +## Overview + +Videos in DFM follow a four-stage pipeline: + +1. **Encode to latents**: VAE (Variational Autoencoder) compresses raw pixels into latent space +2. **Store as tensors**: Compressed latents are saved with text embeddings +3. **Process with diffusion**: Models operate on compact latent representations +4. 
**Decode to pixels**: VAE reconstructs final video frames + +**Key benefit**: A 1080p video (1920×1080×3 channels×120 frames = 746 million values) compresses to latents of 16×15×135×240 = 8.6 million values—a 64× reduction. + +## Video Latents + +### Tensor Format + +Video latents are 4D tensors with shape `(C, T, H, W)`: + +| Dimension | Description | Example Values | +|-----------|-------------|----------------| +| **C** | Channels | 16 (standard for most VAEs) | +| **T** | Temporal frames | 15, 30, 60, 120 (varies by video length) | +| **H** | Latent height | 135 for 1080p (1080÷8) | +| **W** | Latent width | 240 for 1920p (1920÷8) | + +**Spatial compression**: VAEs downsample by 8× in both height and width. A 1920×1080 frame becomes 240×135 in latent space. + +**Temporal compression**: Some VAEs also compress temporally. A 120-frame video might compress to 15 latent frames (8× temporal compression). + +### Why Latents? + +**Memory efficiency**: Latent representation is 64× smaller than raw pixels. + +- Raw 1080p video (120 frames): 746 MB +- Latent representation: 12 MB +- Enables training on longer videos with limited GPU memory + +**Training speed**: Diffusion models process 8.6 million values instead of 746 million values—approximately 8× faster per iteration. + +**Quality preservation**: VAE reconstruction maintains perceptual quality. Peak Signal-to-Noise Ratio (PSNR) remains above 30 dB for most VAE models. + +## VAE Encoding and Decoding + +### Encoding Process + +The VAE encoder transforms raw video frames into compact latent tensors: + +```python +import torch +from diffusers import AutoencoderKLWan + +# Load video: (batch, channels, time, height, width) +video_frames = torch.randn(1, 3, 120, 1080, 1920) # 1080p, 120 frames + +# Normalize to [-1, 1] range +video_frames = video_frames * 2.0 - 1.0 + +# Initialize VAE (WAN 2.1) +vae = AutoencoderKLWan.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + subfolder="vae" +) + +# Encode to latents +latent_dist = vae.encode(video_frames) +latents = latent_dist.latent_dist.mean # Use mean for deterministic encoding +# Output shape: (1, 16, 120, 135, 240) +# Compression: 1× in time (no temporal compression), 8× in height, 8× in width +``` + +**Encoding steps**: + +1. Normalize input frames to VAE's expected range (usually [-1, 1]) +2. Pass through encoder network +3. Quantize or sample latent distribution +4. Output compressed latent tensor + +### Decoding Process + +The VAE decoder reconstructs video frames from latents: + +```python +# Generate or load latents +latents = torch.randn(1, 16, 120, 135, 240) + +# Decode to video frames +reconstructed_video = vae.decode(latents).sample +# Output shape: (1, 3, 120, 1080, 1920) + +# Denormalize from [-1, 1] to [0, 255] for video output +video_uint8 = ((reconstructed_video + 1.0) * 127.5).clamp(0, 255).to(torch.uint8) +``` + +**Decoding steps**: + +1. Pass latents through decoder network +2. Upsample to original spatial and temporal resolution +3. Denormalize to pixel value range +4. 
Output reconstructed video frames + +### VAE Models + +DFM supports multiple VAE architectures: + +**Cosmos Tokenizer** (Continuous Video: `Cosmos-Tokenizer-CV8x8x8`): + +- Compression: 8×8×8 (time × height × width) +- Channels: 16 latent channels +- Use case: DiT models, continuous latent diffusion +- Normalization: Input frames in [-1, 1] + +**Cosmos Tokenizer** (Discrete Video: `Cosmos-Tokenizer-DV4x8x8`): + +- Compression: 4×8×8 (time × height × width) +- Channels: 6 discrete code channels (codebook size 64K) +- Use case: Autoregressive models, discrete token generation +- Normalization: Input frames in [-1, 1] + +**WAN VAE**: + +- Compression: 1×8×8 (no temporal compression) +- Channels: 16 latent channels +- Use case: WAN models, Flow Matching models +- Normalization: Input frames converted to [-1, 1] internally + +Each VAE requires specific normalization. Check model documentation before preprocessing. + +## Data Formats + +### Training Data Formats + +DFM supports two paradigms with different data formats: + +#### Automodel Format + +Automodel uses pickled `.meta` files containing preprocessed latents: + +```python +# Example .meta file structure +{ + "video_latents": torch.Tensor, # Shape: (C, T, H, W) + "text_embeddings": torch.Tensor, # Shape: (S, D) + "first_frame": np.ndarray, # First frame (H, W, 3) in [0, 255] + "metadata": dict, # Original video metadata + "num_frames": int, # Frame count + "original_filename": str, # Source video filename + "original_video_path": str, # Source video path + "deterministic_latents": bool, # Encoding mode used + "memory_optimization": bool, # Memory optimization enabled + "model_version": str, # VAE model version (e.g., "wan2.1") + "resize_settings": dict, # Resize configuration +} +``` + +**File organization**: + +```text +dataset/ +├── sample_0000.meta +├── sample_0001.meta +├── sample_0002.meta +└── ... +``` + +#### Megatron Format + +Megatron supports two distributed data formats: + +**Webdataset format**: + +- Tar archives containing video samples +- Each sample is a set of files with shared basename +- Example: `sample001.latent.pth`, `sample001.text.pth`, `sample001.json` + +**Energon format**: + +- Optimized for distributed data loading across nodes +- Supports efficient sharding and data parallelism +- Recommended for multi-node training at scale + +Both formats include latents, text embeddings, and metadata per sample. + +### DiffusionSample Structure + +The `DiffusionSample` class represents a training sample: + +```python +@dataclass +class DiffusionSample: + video: torch.Tensor # Video latents (C, T, H, W) + context_embeddings: torch.Tensor # Text embeddings (S, D) + context_mask: torch.Tensor # Text mask + image_size: torch.Tensor # [height, width] + fps: torch.Tensor # Frame rate + num_frames: torch.Tensor # Frame count + # ... additional metadata +``` + +## Text Conditioning + +### Text Embeddings + +Text prompts guide video generation through learned embeddings. DFM uses T5 or similar transformer-based text encoders. 
+ +**Embedding dimensions**: + +| Encoder | Sequence Length (S) | Embedding Dim (D) | Model Size | +|---------|---------------------|-------------------|------------| +| T5-Base | Up to 512 tokens | 768 | 220M params | +| T5-Large | Up to 512 tokens | 1024 | 770M params | +| T5-XXL | Up to 512 tokens | 4096 | 11B params | + +**Process**: Text → Tokenizer → Token IDs → Encoder → Embeddings `(S, D)` + +### Text Encoding Example + +```python +from transformers import AutoTokenizer, UMT5EncoderModel +import torch + +# Initialize UMT5 encoder (used by WAN models) +tokenizer = AutoTokenizer.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + subfolder="text_encoder" +) +text_encoder = UMT5EncoderModel.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + subfolder="text_encoder" +) + +# Encode prompt +prompt = "A robot cooking pasta in a modern kitchen" +inputs = tokenizer( + prompt, + max_length=512, + padding="max_length", + truncation=True, + return_tensors="pt", + return_attention_mask=True, +) + +with torch.no_grad(): + text_embeddings = text_encoder( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"] + ).last_hidden_state +# Output shape: (1, 512, D) where D is embedding dimension + +# Embeddings condition the diffusion model +# via cross-attention layers during generation +``` + +**Attention masking**: Padding tokens are masked so the model only attends to real tokens, not padding. + +## Video Tokenization + +Some models discretize continuous latents into tokens for autoregressive generation. + +### Cosmos Video Tokenizer + +The Cosmos tokenizer converts continuous latents into discrete token sequences: + +**Process**: + +1. Encode video to continuous latents: `(C, T, H, W)` +2. Quantize latents using learned codebook +3. Output discrete token indices: `(T×H×W,)` flattened sequence + +**Use cases**: + +- Autoregressive video models (predict next token) +- Enables language model-style training on videos +- Supports efficient caching during generation + +### Causal Video Tokenizer + +Causal tokenizers maintain temporal causality for autoregressive models: + +- **Temporal masking**: Each frame can only see previous frames +- **Autoregressive generation**: Generate frame-by-frame sequentially +- **Architecture compatibility**: Required for GPT-style video models + +**Example**: Generating a 120-frame video autoregressively produces frames 1→2→3→...→120, where each frame conditions on all previous frames. + +## Sequence Packing + +Sequence packing improves GPU utilization during distributed training: + +**Without packing**: + +```text +Batch 1: [sequence_A (50 tokens), padding (14 tokens)] # 22% wasted +Batch 2: [sequence_B (40 tokens), padding (24 tokens)] # 37% wasted +``` + +**With packing**: + +```text +Batch 1: [sequence_A (50 tokens), sequence_B (14 tokens)] # 0% wasted +``` + +**Implementation**: + +- Combine multiple sequences into fixed-length batches +- Use attention masks to separate sequences +- Track sequence boundaries for gradient computation + +**Benefits**: Up to 2× throughput improvement on datasets with variable-length videos. + +## Data Preprocessing + +### Preparation Pipeline + +Preprocessing transforms raw videos into training-ready samples: + +1. **Load raw video**: Read MP4, AVI, or other video formats +2. **Resize and crop**: Standardize to target resolution (for example, 1080p) +3. **Normalize frames**: Convert to expected range ([-1, 1] or [0, 1]) +4. **Encode to latents**: Apply VAE encoder +5. 
**Encode text prompts**: Apply text encoder +6. **Package sample**: Create `DiffusionSample` with metadata +7. **Save to disk**: Write as `.meta` file or webdataset entry + +**Batch processing**: Process videos in parallel to maximize throughput. Use multi-GPU encoding for large datasets. + +### Preprocessing Example + +```python +from dfm.src.automodel.utils.data.preprocess_resize import VideoPreprocessor +from pathlib import Path + +# Initialize preprocessor +preprocessor = VideoPreprocessor( + video_folder="raw_videos", + wan21_model_id="Wan-AI/Wan2.1-T2V-14B-Diffusers", + output_folder="processed_meta", + device="cuda", + deterministic_latents=True, # Use deterministic encoding (no flares) + target_size=(1080, 1920), # Target resolution (height, width) + resize_mode="bilinear", + maintain_aspect_ratio=True, +) + +# Process all videos in folder +# Requires meta.json with video metadata in video_folder +preprocessor.process_all_videos() + +# Or load existing processed data +data = preprocessor.load_processed_data("sample_0000.meta") + +# Data contains: +# - video_latents: (16, T, 135, 240) +# - text_embeddings: (1, 512, D) +# - first_frame: (1080, 1920, 3) +# - metadata: Original video metadata +``` + +### Preprocessing Tools + +DFM provides command-line tools and Python APIs: + +**Command-line preprocessing**: + +```bash +python dfm/src/automodel/utils/data/preprocess_resize.py \ + --video_folder raw_videos/ \ + --output_folder processed_meta/ \ + --model Wan-AI/Wan2.1-T2V-14B-Diffusers \ + --height 1080 \ + --width 1920 \ + --resize_mode bilinear \ + --device cuda +``` + +**Python API**: + +- `VideoPreprocessor`: End-to-end video preprocessing (`dfm.src.automodel.utils.data.preprocess_resize`) +- `AutoencoderKLWan.encode()` / `.decode()`: Manual latent encoding (Diffusers library) +- `UMT5EncoderModel`: Text prompt encoding (Transformers library) +- `DiffusionSample`: Training sample dataclass (`dfm.src.megatron.data.common.diffusion_sample`) + +## Metadata + +Each training sample includes metadata for proper model conditioning: + +| Metadata Field | Type | Purpose | Example | +|----------------|------|---------|---------| +| **image_size** | `(int, int)` | Original video resolution | `(1080, 1920)` | +| **fps** | `int` | Frame rate | `24`, `30`, `60` | +| **num_frames** | `int` | Total frame count | `120` | +| **padding_mask** | `torch.Tensor` | Valid vs padded regions | Binary mask | +| **position_ids** | `torch.Tensor` | Spatial/temporal positions | 3D position indices | + +**Why metadata matters**: + +- **Resolution conditioning**: Models can generate videos at different resolutions +- **FPS conditioning**: Control playback speed and motion dynamics +- **Frame count conditioning**: Generate videos of varying lengths +- **Padding masks**: Prevent model from learning on invalid padded regions + +**Example usage**: + +```python +# Model conditions on metadata during training +loss = model( + latents=sample.video, + text_embeddings=sample.context_embeddings, + image_size=sample.image_size, # Conditions generation + fps=sample.fps, # Conditions motion dynamics + num_frames=sample.num_frames, # Conditions temporal length +) +``` diff --git a/docs/about/index.md b/docs/about/index.md new file mode 100644 index 00000000..206787f5 --- /dev/null +++ b/docs/about/index.md @@ -0,0 +1,99 @@ +--- +description: "Overview of NeMo DFM, a framework for large-scale training and inference of video diffusion models with Automodel and Megatron support" +categories: ["getting-started"] +tags: 
["overview", "platform", "diffusion", "video-models", "getting-started"] +personas: ["data-scientist-focused", "mle-focused", "admin-focused", "devops-focused"] +difficulty: "beginner" +content_type: "concept" +modality: "universal" +--- + +(about-overview)= + +# Overview of NeMo DFM + +NeMo DFM (Diffusion Foundation Models) trains and runs inference on video diffusion models at scale. It combines two training approaches—Automodel for recipe-based workflows and Megatron for multi-node distributed training—with support for multiple architectures including DiT, WAN, and EDM. + +**Use NeMo DFM to:** + +- Train video diffusion models using Flow Matching or EDM paradigms +- Scale training across GPUs and nodes with tensor, context, and pipeline parallelism +- Run efficient video generation inference on trained models +- Experiment with different architectures (DiT, WAN, EDM) using the same framework + +## Who Should Use DFM + +- **Machine Learning Engineers**: Train video foundation models using diffusion and autoregressive architectures with configuration-driven workflows. +- **Data Scientists**: Process video datasets with VAE encoding and tokenization pipelines for diffusion model training. +- **Cluster Administrators**: Deploy and monitor large-scale distributed training jobs across multi-node GPU clusters. +- **Researchers**: Experiment with diffusion architectures (DiT, EDM, WAN), training paradigms (Flow Matching, EDM), and parallelism strategies. + +## What DFM Provides + +**Two Training Paradigms**: + +- **Automodel**: Recipe-based training with DTensor for 3D parallelism, optimized for experimentation and prototyping +- **Megatron**: Large-scale distributed training with comprehensive parallelism support (TP, CP, PP, DP) for production workloads + +**Architectures**: + +- **DiT** (Diffusion Transformer): Transformer-based diffusion models for video generation +- **WAN**: Flow Matching architecture for alternative training dynamics +- **EDM** (Elucidating Diffusion Models): Improved diffusion training with better convergence + +**Video Processing**: + +- VAE encoding for latent space representation +- Tokenization pipelines for efficient video data handling +- Support for variable-length videos and diverse resolutions + +**Distributed Training**: + +- Tensor parallelism (TP) for splitting model layers across GPUs +- Context parallelism (CP) for long-sequence training +- Pipeline parallelism (PP) for splitting models across stages +- Data parallelism (DP) for scaling batch sizes + +## Learn Core Concepts + +Understand the foundational concepts before training or deploying video diffusion models. + +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`git-branch;1.5em;sd-mr-1` Training Paradigms +:link: about-concepts-training-paradigms +:link-type: ref + +Understand the two main training approaches: Automodel (recipe-based) and Megatron (large-scale distributed), and when to use each. +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Diffusion Models for Video +:link: about-concepts-diffusion-models +:link-type: ref + +Learn how diffusion models work for video generation, including EDM and Flow Matching paradigms. +::: + +:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Video Data Representation +:link: about-concepts-video-data +:link-type: ref + +Understand how DFM represents video data: latents, VAE encoding, tokenization, and data formats. 
+::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Distributed Training +:link: about-concepts-distributed-training +:link-type: ref + +Learn about parallelism strategies: tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), and data parallelism (DP). +::: + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration System +:link: about-concepts-configuration +:link-type: ref + +Understand how DFM's configuration system works: YAML files, CLI overrides, and configuration precedence. +::: + +:::: diff --git a/docs/conf.py b/docs/conf.py index 9f9bb77f..c2439241 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -39,6 +39,7 @@ "sphinx.ext.doctest", # Allows testing in docstrings "sphinx.ext.napoleon", # For google style docstrings "sphinx_copybutton", # For copy button in code blocks + "sphinx_design", # For grid layouts and card components ] templates_path = ["_templates"] @@ -61,9 +62,17 @@ "deflist", # Supports definition lists with term: definition format "fieldlist", # Enables field lists for metadata like :author: Name "tasklist", # Adds support for GitHub-style task lists with [ ] and [x] + "substitution", # Enables variable substitutions like {{product_name}} ] myst_heading_anchors = 5 # Generates anchor links for headings up to level 5 +# MyST substitutions - variables that can be used in markdown files +myst_substitutions = { + "product_name": "NeMo DFM", +} + +myst_heading_anchors = 5 # Generates anchor links for headings up to level 5 + # -- Options for Autodoc2 --------------------------------------------------- sys.path.insert(0, os.path.abspath("..")) diff --git a/docs/get-started/automodel.md b/docs/get-started/automodel.md new file mode 100644 index 00000000..60a76757 --- /dev/null +++ b/docs/get-started/automodel.md @@ -0,0 +1,658 @@ +--- +description: "End-to-end Automodel quickstart: fine-tune and generate videos" +categories: ["getting-started", "automodel"] +tags: ["quickstart", "tutorial", "automodel"] +personas: ["data-scientist-focused"] +difficulty: "beginner" +content_type: "tutorial" +--- + +(gs-automodel)= + +# Automodel Workflow + +Complete end-to-end tutorial for fine-tuning and generating videos using NeMo DFM's Automodel approach. + +:::{card} + +**Goal**: Fine-tune a pretrained video model and generate videos from your checkpoint. + +^^^ + +**In this tutorial, you will**: + +1. Fine-tune the WAN2.1 model on your dataset +2. Generate videos from your trained model +3. Experiment with generation parameters + +**Time**: 30-45 minutes (depending on training duration) + +::: + +:::{button-ref} gs-index +:color: secondary +:outline: +:ref-type: ref + +← Back to Get Started +::: + +## Before You Start + +Make sure you have completed: + +- ✅ [Installation](installation.md) +- ✅ Multi-GPU setup (recommended: 8 GPUs) +- ✅ Dataset in Energon format or custom dataloader + +--- + +(gs-automodel-training-section)= +## Fine-Tune WAN2.1 Model + +Fine-tune the WAN2.1 text-to-video model using Automodel's recipe-based training approach. + +**Key concept**: Automodel handles parallelism automatically using FSDP2—no manual tensor or pipeline parallelism configuration needed. + +:::{dropdown} What happens during training +:icon: info + +1. Load pretrained WAN2.1 model from Hugging Face +2. Configure FSDP2 parallelism automatically +3. Train on your dataset with flow matching +4. Save checkpoints periodically +::: + +### 1. 
Prepare Your Dataset + +(gs-automodel-data-requirements)= + +You can prepare your dataset in two ways: + +- **Start with raw videos**: Place your `.mp4` files in a folder and use data-preparation scripts to scan videos and generate a `meta.json` entry for each sample +- **Bring your own `meta.json`**: If you already have annotations, create `meta.json` yourself following the schema below + +#### Dataset Structure + +```text +/ +├── video1.mp4 +├── video2.mp4 +└── meta.json +``` + +:::{note} +If you have captions, you can also include per-video named `