
[feat] Add MatrixGame3.0#1201

Draft
H1yori233 wants to merge 8 commits into hao-ai-lab:main from H1yori233:feat/kaiqin/add-mg-3

Conversation

@H1yori233
Contributor

@H1yori233 H1yori233 commented Mar 31, 2026

Purpose

Add MatrixGame3.0 inference support to FastVideo.
https://huggingface.co/FastVideo/Matrix-Game-3.0-Base-Distilled-Diffusers

Changes

  • Add MatrixGame3.0 model and action module: fastvideo/models/dits/matrixgame3.
  • Add MatrixGame3.0 pipeline: fastvideo/pipelines/basic/matrixgame.
  • Add MatrixGame3.0-specific denoising stage: fastvideo/pipelines/stages/matrixgame_denoising.py.
  • Add MatrixGame3.0-specific lightweight VAE.
test.1.mp4
ours.mp4

TODO

  • Migrate the original MatrixGame2.0 model/pipeline names
  • Inference for the base model (bidirectional, multi-step)
  • Code cleanup

Test Plan

# Commands you ran

Test Results

Test output
# Paste output here

Checklist

  • I ran pre-commit run --all-files and fixed all issues
  • I added or updated tests for my changes
  • I updated documentation if needed
  • I considered GPU memory impact of my changes

For model/pipeline changes, also check:

  • I verified SSIM regression tests pass
  • I updated the support matrix if adding a new model

@mergify mergify bot added type: feat New feature or capability scope: training Training pipeline, methods, configs scope: inference Inference pipeline, serving, CLI scope: infra CI, tests, Docker, build scope: model Model architecture (DiTs, encoders, VAEs) labels Mar 31, 2026
@mergify
Contributor

mergify bot commented Mar 31, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for:

  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
  • check-success~=pre-commit
This rule is failing.
  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model)\]

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for MatrixGame 3.0, adding new DiT architectures, action modules, and I2V pipelines while refactoring existing MatrixGame 2.0 components for clarity. Feedback identifies a critical import mismatch for extrinsic builders and a batching issue where only the first item's actions are used for camera trajectories. Performance concerns were raised regarding inefficient VAE device transfers and redundant timestep embedding computations. Additionally, the review suggests using modulo for angle normalization, warns against silent tensor cropping, and notes potential compatibility regressions in the VAE normalization logic.

Comment thread fastvideo/pipelines/stages/matrixgame_denoising.py Outdated
batch.keyboard_cond = batch.keyboard_cond.to(device=device, dtype=target_dtype)
batch.mouse_cond = batch.mouse_cond.to(device=device, dtype=target_dtype)

extrinsics_all = build_matrixgame3_extrinsics_from_actions(batch.keyboard_cond[0], batch.mouse_cond[0]).to(device)

high

The camera extrinsics are computed using only the first item in the batch (batch.keyboard_cond[0]), which means all items in a batch will share the same camera trajectory even if they have different action inputs. This will lead to incorrect results for batch sizes greater than 1 when different actions are provided for each batch item.
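A per-item fix could look like the sketch below. `build_extrinsics_batched` and `build_fn` are hypothetical names; `build_fn` stands in for `build_matrixgame3_extrinsics_from_actions` applied to a single batch item's conditions, as in the snippet above.

```python
import torch

def build_extrinsics_batched(keyboard_cond, mouse_cond, build_fn):
    # build_fn maps ONE item's (keyboard, mouse) conditions to an extrinsics
    # tensor; here we apply it to every batch item instead of only item 0.
    per_item = [build_fn(kb, ms) for kb, ms in zip(keyboard_cond, mouse_cond)]
    return torch.stack(per_item, dim=0)  # (B, ...) stacked extrinsics
```

This keeps each batch item's camera trajectory tied to its own actions, at the cost of one `build_fn` call per item.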

Comment on lines +890 to +943
self.vae = self.vae.to(get_local_torch_device())

image = batch.pil_image
if isinstance(image, torch.Tensor):
    if image.dim() == 5:
        image = image[:, :, :1]
    elif image.dim() == 4:
        image = image.unsqueeze(2)
    else:
        raise ValueError(f"Unexpected tensor dimensions for MatrixGame3 image: {image.shape}")
    video_condition = image.to(get_local_torch_device(), dtype=torch.float32)
else:
    image = self.preprocess(image,
                            vae_scale_factor=self.vae.spatial_compression_ratio,
                            height=height,
                            width=width).to(get_local_torch_device(), dtype=torch.float32)
    video_condition = image.unsqueeze(2).to(get_local_torch_device(), dtype=torch.float32)

vae_dtype = PRECISION_TO_TYPE[fastvideo_args.pipeline_config.vae_precision]
vae_autocast_enabled = (vae_dtype != torch.float32) and not fastvideo_args.disable_autocast

with torch.autocast(device_type="cuda", dtype=vae_dtype, enabled=vae_autocast_enabled):
    if fastvideo_args.pipeline_config.vae_tiling:
        self.vae.enable_tiling()
    if not vae_autocast_enabled:
        video_condition = video_condition.to(vae_dtype)
    encoder_output = self.vae.encode(video_condition)

img_cond = encoder_output.mode()

if (hasattr(self.vae.config, 'latents_mean') and hasattr(self.vae.config, 'latents_std')):
    latents_mean = torch.tensor(self.vae.config.latents_mean, device=img_cond.device,
                                dtype=img_cond.dtype).view(1, -1, 1, 1, 1)
    latents_std = torch.tensor(self.vae.config.latents_std, device=img_cond.device,
                               dtype=img_cond.dtype).view(1, -1, 1, 1, 1)
    img_cond = (img_cond - latents_mean) / latents_std
elif (hasattr(self.vae, "shift_factor") and self.vae.shift_factor is not None):
    if isinstance(self.vae.shift_factor, torch.Tensor):
        img_cond -= self.vae.shift_factor.to(img_cond.device, img_cond.dtype)
    else:
        img_cond -= self.vae.shift_factor

if hasattr(self.vae, 'scaling_factor'):
    if isinstance(self.vae.scaling_factor, torch.Tensor):
        img_cond = img_cond * self.vae.scaling_factor.to(img_cond.device, img_cond.dtype)
    else:
        img_cond = img_cond * self.vae.scaling_factor

batch.image_latent = img_cond

if hasattr(self, 'maybe_free_model_hooks'):
    self.maybe_free_model_hooks()

self.vae.to("cpu")

high

Moving the VAE model to the GPU and back to CPU within the forward method is extremely inefficient. This triggers heavy synchronization and PCIe transfers for every batch, significantly impacting inference performance. Model placement should be managed at the pipeline level or via a dedicated component offloader.
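One way to hoist placement out of `forward()` is a small context manager that moves the module once per pipeline run and offloads it afterwards. This is a sketch under assumptions: `on_device` is a hypothetical helper, not FastVideo API, and the right place to enter it would be the pipeline driver, not the stage.

```python
from contextlib import contextmanager

@contextmanager
def on_device(module, device, offload_device="cpu"):
    # Keep `module` on `device` for the whole block, then offload once.
    # One host<->GPU transfer per pipeline run instead of one per batch.
    module.to(device)
    try:
        yield module
    finally:
        module.to(offload_device)
```

The pipeline would wrap the whole denoising loop, e.g. `with on_device(self.vae, get_local_torch_device()): ...`, so per-batch `forward()` calls never touch placement.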

Comment on lines +61 to +64
while new_yaw > 180:
    new_yaw -= 360
while new_yaw < -180:
    new_yaw += 360

medium

Using while loops for angle normalization is inefficient and can be slow if the input values are far outside the target range. A more efficient approach is to use the modulo operator.

    new_yaw = (new_yaw + 180) % 360 - 180
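As a self-contained sketch (`normalize_angle` is a hypothetical name): note one boundary difference versus the while-loop version, which wraps into the closed interval [-180, 180], while the modulo form wraps into the half-open [-180, 180), so exactly +180 becomes -180.

```python
def normalize_angle(deg: float) -> float:
    # Wrap an angle in degrees into [-180, 180).
    # Python's % follows the divisor's sign, so this works for any input,
    # however far outside the range, in constant time.
    return (deg + 180.0) % 360.0 - 180.0
```

Callers should not depend on the sign at exactly 180 degrees if the two forms are swapped.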

Comment on lines +1006 to +1020
if noise_pred.shape != current_latents.shape:
    aligned_t = min(noise_pred.shape[2], current_latents.shape[2])
    aligned_h = min(noise_pred.shape[3], current_latents.shape[3])
    aligned_w = min(noise_pred.shape[4], current_latents.shape[4])
    logger.warning(
        "Aligning noise/sample shapes before scheduler.step: noise=%s sample=%s -> (*,*,%d,%d,%d)",
        tuple(noise_pred.shape),
        tuple(current_latents.shape),
        aligned_t,
        aligned_h,
        aligned_w,
    )
    noise_pred = noise_pred[:, :, :aligned_t, :aligned_h, :aligned_w]
    current_latents = current_latents[:, :, :aligned_t, :aligned_h, :aligned_w]


medium

Silently cropping the noise_pred or current_latents when shapes mismatch can hide underlying issues with model configuration, patch alignment, or padding logic. It is recommended to ensure consistent shapes through proper padding within the model wrapper or by investigating the root cause of the mismatch to avoid data loss at the edges of the video.
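One way to guarantee matching shapes up front, rather than cropping afterwards, is to pad the input to multiples of the model's patch sizes before patchification. A sketch under assumptions: `pad_to_multiple` is a hypothetical helper and the (B, C, T, H, W) layout matches the latents in this PR.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, pt: int, ph: int, pw: int) -> torch.Tensor:
    # Right-pad a (B, C, T, H, W) tensor so T, H, W are multiples of the
    # temporal/height/width patch sizes; no data is lost at the edges.
    _, _, t, h, w = x.shape
    # F.pad pads trailing dims in (W, H, T) order: (left, right) per dim.
    return F.pad(x, (0, (-w) % pw, 0, (-h) % ph, 0, (-t) % pt))
```

With the input padded once, `noise_pred` and `current_latents` stay shape-consistent through `scheduler.step`, and any remaining mismatch can be treated as a hard error instead of a warning.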

Comment on lines +735 to +737
timestep_tokens = timestep_tokens.unsqueeze(1).repeat(
    1, post_patch_num_frames * post_patch_height * post_patch_width
)

medium

Repeating the timestep for every spatial token and then passing the flattened tensor to the condition_embedder leads to a massive amount of redundant computation in the TimestepEmbedder. Since the timesteps are identical for all tokens in this fallback case, it is much more efficient to compute the embedding once per batch and then expand it spatially.
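The embed-once-then-broadcast idea can be sketched as follows. `embed_timesteps_once` is a hypothetical helper and `embedder` stands in for the PR's `TimestepEmbedder`; this is only valid in the fallback case the comment describes, where every token shares its batch element's timestep.

```python
import torch

def embed_timesteps_once(timesteps, embedder, num_tokens):
    # Run the embedder on B timesteps instead of B * num_tokens copies.
    emb = embedder(timesteps)                           # (B, D)
    # expand() returns a broadcast view, so no extra memory is allocated.
    return emb.unsqueeze(1).expand(-1, num_tokens, -1)  # (B, N, D)
```

The result matches repeating the timestep per token and embedding every copy, minus the redundant embedder work.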

Comment on lines +231 to +233
dims = (1 if self.channel_first else -1)
rms = (x.pow(2).mean(dims, keepdim=True) + 1e-6).sqrt()
return (x / rms) * self.gamma + self.bias

medium

The manual RMS calculation replaces F.normalize and removes self.scale. If self.scale was non-unity in the original implementation, this change will break compatibility with existing checkpoints. Additionally, the epsilon 1e-6 is hardcoded; it would be better to use a parameter or a standard constant for consistency.
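A hedged reconstruction of what a checkpoint-compatible version might look like: `scale` and `eps` become explicit constructor arguments instead of being dropped or hardcoded. This is a channel-last sketch, not the PR's code; whether `scale` should default to 1.0 or to the sqrt(dim) factor that often accompanies `F.normalize`-based norms depends on the original checkpoint.

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    # Channel-last RMS norm with an explicit eps and scale, so existing
    # checkpoints can reproduce their original numerics.
    def __init__(self, dim: int, eps: float = 1e-6, scale: float = 1.0):
        super().__init__()
        self.eps = eps
        self.scale = scale
        self.gamma = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).sqrt()
        return (x / rms) * self.scale * self.gamma + self.bias
```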

@mergify
Contributor

mergify bot commented Apr 5, 2026

This PR has merge conflicts with the base branch. Please rebase:

git fetch origin main
git rebase origin/main
# Resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts label Apr 5, 2026