Code Review
This pull request introduces support for MatrixGame 3.0, adding new DiT architectures, action modules, and I2V pipelines while refactoring existing MatrixGame 2.0 components for clarity. Feedback identifies a critical import mismatch for extrinsic builders and a batching issue where only the first item's actions are used for camera trajectories. Performance concerns were raised regarding inefficient VAE device transfers and redundant timestep embedding computations. Additionally, the review suggests using modulo for angle normalization, warns against silent tensor cropping, and notes potential compatibility regressions in the VAE normalization logic.
```python
batch.keyboard_cond = batch.keyboard_cond.to(device=device, dtype=target_dtype)
batch.mouse_cond = batch.mouse_cond.to(device=device, dtype=target_dtype)

extrinsics_all = build_matrixgame3_extrinsics_from_actions(batch.keyboard_cond[0], batch.mouse_cond[0]).to(device)
```
The camera extrinsics are computed using only the first item in the batch (batch.keyboard_cond[0]), which means all items in a batch will share the same camera trajectory even if they have different action inputs. This will lead to incorrect results for batch sizes greater than 1 when different actions are provided for each batch item.
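A minimal sketch of a batched fix, assuming `build_matrixgame3_extrinsics_from_actions` accepts a single item's keyboard/mouse tensors and returns that item's extrinsics (the helper name `build_batched_extrinsics` and the `build_fn` parameter are hypothetical, used here to keep the sketch self-contained):

```python
import torch

def build_batched_extrinsics(keyboard_cond, mouse_cond, device, build_fn):
    """Compute one extrinsics trajectory per batch item and stack them,
    instead of reusing item 0's trajectory for the whole batch."""
    per_item = [
        build_fn(keyboard_cond[i], mouse_cond[i]).to(device)
        for i in range(keyboard_cond.shape[0])
    ]
    return torch.stack(per_item, dim=0)  # (B, ...) batched extrinsics
```

If `build_fn` itself supports batched inputs, a single vectorized call would be preferable to the Python loop.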
```python
self.vae = self.vae.to(get_local_torch_device())

image = batch.pil_image
if isinstance(image, torch.Tensor):
    if image.dim() == 5:
        image = image[:, :, :1]
    elif image.dim() == 4:
        image = image.unsqueeze(2)
    else:
        raise ValueError(f"Unexpected tensor dimensions for MatrixGame3 image: {image.shape}")
    video_condition = image.to(get_local_torch_device(), dtype=torch.float32)
else:
    image = self.preprocess(image,
                            vae_scale_factor=self.vae.spatial_compression_ratio,
                            height=height,
                            width=width).to(get_local_torch_device(), dtype=torch.float32)
    video_condition = image.unsqueeze(2).to(get_local_torch_device(), dtype=torch.float32)

vae_dtype = PRECISION_TO_TYPE[fastvideo_args.pipeline_config.vae_precision]
vae_autocast_enabled = (vae_dtype != torch.float32) and not fastvideo_args.disable_autocast

with torch.autocast(device_type="cuda", dtype=vae_dtype, enabled=vae_autocast_enabled):
    if fastvideo_args.pipeline_config.vae_tiling:
        self.vae.enable_tiling()
    if not vae_autocast_enabled:
        video_condition = video_condition.to(vae_dtype)
    encoder_output = self.vae.encode(video_condition)

img_cond = encoder_output.mode()

if (hasattr(self.vae.config, 'latents_mean') and hasattr(self.vae.config, 'latents_std')):
    latents_mean = torch.tensor(self.vae.config.latents_mean, device=img_cond.device,
                                dtype=img_cond.dtype).view(1, -1, 1, 1, 1)
    latents_std = torch.tensor(self.vae.config.latents_std, device=img_cond.device,
                               dtype=img_cond.dtype).view(1, -1, 1, 1, 1)
    img_cond = (img_cond - latents_mean) / latents_std
elif (hasattr(self.vae, "shift_factor") and self.vae.shift_factor is not None):
    if isinstance(self.vae.shift_factor, torch.Tensor):
        img_cond -= self.vae.shift_factor.to(img_cond.device, img_cond.dtype)
    else:
        img_cond -= self.vae.shift_factor

if hasattr(self.vae, 'scaling_factor'):
    if isinstance(self.vae.scaling_factor, torch.Tensor):
        img_cond = img_cond * self.vae.scaling_factor.to(img_cond.device, img_cond.dtype)
    else:
        img_cond = img_cond * self.vae.scaling_factor

batch.image_latent = img_cond

if hasattr(self, 'maybe_free_model_hooks'):
    self.maybe_free_model_hooks()

self.vae.to("cpu")
```
Moving the VAE model to the GPU and back to CPU within the forward method is extremely inefficient. This triggers heavy synchronization and PCIe transfers for every batch, significantly impacting inference performance. Model placement should be managed at the pipeline level or via a dedicated component offloader.
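One hedged sketch of pipeline-level placement (the `VAEPlacement` class and `cpu_offload` flag are hypothetical; the point is that the per-forward `.to(device)` / `.to("cpu")` round-trip becomes opt-in rather than unconditional):

```python
class VAEPlacement:
    """Keep the VAE resident on its device by default; only round-trip
    to CPU when offloading is explicitly requested."""

    def __init__(self, vae, device, cpu_offload=False):
        self.vae = vae
        self.device = device
        self.cpu_offload = cpu_offload
        if not cpu_offload:
            self.vae.to(device)  # one transfer for the pipeline's lifetime

    def __enter__(self):
        if self.cpu_offload:
            self.vae.to(self.device)  # transfer only when offloading is on
        return self.vae

    def __exit__(self, *exc):
        if self.cpu_offload:
            self.vae.to("cpu")
        return False
```

The stage's forward would then do `with self.vae_placement as vae: ...`, and memory-constrained users pay the transfer cost only when they ask for it.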
```python
while new_yaw > 180:
    new_yaw -= 360
while new_yaw < -180:
    new_yaw += 360
```
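The loops can be replaced with a single modulo, as the summary suggests. A sketch (note one edge-case assumption: this maps exactly 180 to -180, whereas the loop version leaves 180 unchanged; pick the convention the rest of the codebase expects):

```python
def wrap_yaw(yaw: float) -> float:
    """Normalize an angle in degrees to the half-open interval [-180, 180)."""
    return (yaw + 180.0) % 360.0 - 180.0
```

Unlike the loops, this is constant-time regardless of how far the accumulated yaw has drifted.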
```python
if noise_pred.shape != current_latents.shape:
    aligned_t = min(noise_pred.shape[2], current_latents.shape[2])
    aligned_h = min(noise_pred.shape[3], current_latents.shape[3])
    aligned_w = min(noise_pred.shape[4], current_latents.shape[4])
    logger.warning(
        "Aligning noise/sample shapes before scheduler.step: noise=%s sample=%s -> (*,*,%d,%d,%d)",
        tuple(noise_pred.shape),
        tuple(current_latents.shape),
        aligned_t,
        aligned_h,
        aligned_w,
    )
    noise_pred = noise_pred[:, :, :aligned_t, :aligned_h, :aligned_w]
    current_latents = current_latents[:, :, :aligned_t, :aligned_h, :aligned_w]
```
Silently cropping the noise_pred or current_latents when shapes mismatch can hide underlying issues with model configuration, patch alignment, or padding logic. It is recommended to ensure consistent shapes through proper padding within the model wrapper or by investigating the root cause of the mismatch to avoid data loss at the edges of the video.
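A fail-fast alternative, as a sketch (the helper name `check_step_shapes` is hypothetical; the idea is to surface patch-alignment or padding bugs immediately instead of cropping data away):

```python
def check_step_shapes(noise_pred, current_latents):
    """Raise before scheduler.step if model output and sample disagree."""
    if noise_pred.shape != current_latents.shape:
        raise RuntimeError(
            f"noise_pred {tuple(noise_pred.shape)} and latents "
            f"{tuple(current_latents.shape)} disagree before scheduler.step; "
            "check patch-size alignment and padding in the model wrapper.")
```

If the mismatch turns out to be a legitimate, fixed patching artifact, padding the latents up to the model's patch multiple in the wrapper is still preferable to cropping the prediction.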
```python
timestep_tokens = timestep_tokens.unsqueeze(1).repeat(
    1, post_patch_num_frames * post_patch_height * post_patch_width
)
```
Repeating the timestep for every spatial token and then passing the flattened tensor to the condition_embedder leads to a massive amount of redundant computation in the TimestepEmbedder. Since the timesteps are identical for all tokens in this fallback case, it is much more efficient to compute the embedding once per batch and then expand it spatially.
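A sketch of the embed-once pattern (here `embed_fn` stands in for the `TimestepEmbedder`/`condition_embedder` call, assumed to map `(B,)` timesteps to `(B, D)` embeddings; the helper name is hypothetical):

```python
import torch

def embed_timesteps_once(timesteps, num_tokens, embed_fn):
    """Embed each batch item's timestep once, then broadcast across tokens."""
    emb = embed_fn(timesteps)                           # (B, D): one MLP pass per item
    return emb.unsqueeze(1).expand(-1, num_tokens, -1)  # (B, N, D) broadcast view
```

`expand` creates a view rather than copying, so both the embedder compute and the memory cost drop from O(B * N) to O(B).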
```python
dims = (1 if self.channel_first else -1)
rms = (x.pow(2).mean(dims, keepdim=True) + 1e-6).sqrt()
return (x / rms) * self.gamma + self.bias
```
The manual RMS calculation replaces F.normalize and removes self.scale. If self.scale was non-unity in the original implementation, this change will break compatibility with existing checkpoints. Additionally, the epsilon 1e-6 is hardcoded; it would be better to use a parameter or a standard constant for consistency.
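A sketch of a last-dim variant that keeps both concerns explicit (defaults and the constructor signature are assumptions; `gamma`/`bias` mirror the snippet above, while `scale` and `eps` are reinstated as parameters so checkpoints trained with a non-unity scale remain loadable):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization over the last dimension with explicit scale and eps."""

    def __init__(self, dim: int, scale: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.scale = scale  # preserved from the original implementation
        self.eps = eps      # configurable rather than hardcoded
        self.gamma = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).sqrt()
        return (x / rms) * self.scale * self.gamma + self.bias
```

For the channel-first path, `gamma`/`bias` would additionally need reshaping to broadcast over the channel dimension, which is omitted here for brevity.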
This PR has merge conflicts with the base branch. Please rebase:

```shell
git fetch origin main
git rebase origin/main
# Resolve any conflicts, then:
git push --force-with-lease
```
Purpose

Add MatrixGame 3.0 inference support to FastVideo.

https://huggingface.co/FastVideo/Matrix-Game-3.0-Base-Distilled-Diffusers

Changes

- fastvideo/models/dits/matrixgame3.py
- fastvideo/pipelines/basic/matrixgame.py
- fastvideo/pipelines/stages/matrixgame_denoising.py

Attached videos: test.1.mp4, ours.mp4

TODO

Test Plan

# Commands you ran

Test Results

Test output

Checklist

- Ran `pre-commit run --all-files` and fixed all issues

For model/pipeline changes, also check: