A curated collection of papers, models, and resources for the field of Video Generation.
Note
This repository is proudly maintained by the frontline research mentors at QuenithAI (应达学术). It aims to provide the most comprehensive and cutting-edge map of papers and technologies in the field of video generation.
Your contributions are vital as well: feel free to open an issue or submit a pull request to become a collaborator on this repository. We look forward to your participation!
If you require expert 1-on-1 guidance on your submissions to top-tier conferences and journals, we invite you to contact us via WeChat or E-mail.
This repository is built and continuously maintained by the frontline research mentor team at QuenithAI (应达学术), aiming to bring you the most comprehensive and cutting-edge papers in the field of video generation.
Your contributions matter greatly to us and the community: we warmly invite you to open an issue or submit a pull request to become a collaborator on this project. We look forward to having you on board!
⚡ Latest Updates
- (Mar 14th, 2026): Added all accepted papers from ICLR 2026.
- (Mar 9th, 2026): Added all accepted papers from CVPR 2026.
- (Nov 19th, 2025): Added all accepted papers from AAAI 2026.
- (Sep 13th, 2025): Added a new direction: 🎯 Reinforcement Learning for Video Generation.
- (Aug 21st, 2025): Added a new direction: 🗣️ Audio-Driven Video Generation.
- (Aug 20th, 2025): Initial commit and repository structure established.
- Controllable Video Generation: A Survey
- Diffusion Model-Based Video Editing: A Survey
- From Sora What We Can See: A Survey of Text-to-Video Generation
- A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
- A Survey on Video Diffusion Models
- Video Diffusion Models: A Survey
- Survey of Video Diffusion Models: Foundations, Implementations, and Applications
- Video Diffusion Generation: Comprehensive Review and Open Problems
- [AAAI 2026] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
- [AAAI 2026] DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
- [AAAI 2026] GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
- [AAAI 2026] EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
- [CVPR 2026] ID-Composer: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
- [CVPR 2026] StreamDiT: Real-Time Streaming Text-to-Video Generation
- [CVPR 2026] Wan-Alpha: Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
- [CVPR 2026] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
- [CVPR 2026] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
- [CVPR 2026] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- [CVPR 2026] SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
- [CVPR 2026] VISTA: A Test-Time Self-Improving Video Generation Agent
- [CVPR 2026] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- [CVPR 2026] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
- [CVPR 2026] CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
- [CVPR 2026] Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
- [CVPR 2026 Findings] BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
- [ICLR 2026] Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
- [ICLR 2026] VideoRepair: Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
- [ICLR 2026] M4V: Multimodal Mamba for Efficient Text-to-Video Generation
- [ICLR 2026] Video-As-Prompt: Unified Semantic Control for Video Generation
- [ICLR 2026] Generating Human Motion Videos using a Cascaded Text-to-Video Framework
- [ICLR 2026] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
- [ICLR 2026] NoisEasier: Boosting Text-to-Video Generation with Direct Noise Optimization
- [ICLR 2026] Video-MSG: Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
- [ICLR 2026] FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
- [ICLR 2026] DiffuPhyGS: Text-to-Video Generation with 3D Gaussians and Learnable Physical Properties via Diffusion Priors
- [ICLR 2026] RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
- [ICLR 2026] JDM: Joint Distribution Modeling for Fine-Grained Text-to-Video Generation
- [ICLR 2026] Towards One-step Causal Video Generation via Adversarial Self-Distillation
- [ICLR 2026] TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
- [ICLR 2026] Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
- [ICLR 2026] CCC: Prompt Evolution for Video Generation via Structured MLLM Feedback
- [ICLR 2026] Ask-A-Video: Controlling Video Generation with Vision Language Models
- [ICLR 2026] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
- [ICLR 2026] Subject-driven Video Generation Emerges from Experience Replays
- [ICLR 2026] Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
- [ICLR 2026] Cosmos-Eval: Towards Explainable Evaluation of Physics and Semantics in Text-to-Video Models
- [CVPR 2025] AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
- [CVPR 2025] Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
- [CVPR 2025] Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
- [CVPR 2025] Identity-Preserving Text-to-Video Generation by Frequency Decomposition
- [CVPR 2025] Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
- [CVPR 2025] TransPixeler: Advancing Text-to-Video Generation with Transparency
- [CVPR 2025] LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
- [CVPR 2025] Improving Text-to-Video Generation via Instance-aware Structured Caption
- [CVPR 2025] Compositional Text-to-Video Generation with Blob Video Representations
- [CVPR 2025] Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- [ICCV 2025] T2Bs: Text-to-Character Blendshapes via Video Generation
- [ICCV 2025] Animate Your Word: Bringing Text to Life via Video Diffusion Prior
- [NeurIPS 2025] Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking
- [ICCV 2025] Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
- [ICCV 2025] MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
- [ICCV 2025] TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models
- [ICCV 2025] Video-T1: Test-Time Scaling for Video Generation
- [ICCV 2025] AnimateYourMesh: Feed-Forward 4D Foundation Model for Text-Driven Mesh Animation
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation
- [ICLR 2025] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- [ICLR 2025] Pyramidal Flow Matching for Efficient Video Generative Modeling
- [NeurIPS 2025] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- [NeurIPS 2025] ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
- [NeurIPS 2025] PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-Aware Mechanisms
- LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
- S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
- LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
- Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
- QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots
- PoseGuard: Pose-Guided Generation with Safety Guardrails
- GV-VAD: Exploring Video Generation for Weakly-Supervised Video Anomaly Detection
- GVD: Guiding Video Diffusion Model for Scalable Video Distillation
- Compositional Video Synthesis by Temporal Object-Centric Learning
- Enhancing Scene Transition Awareness in Video Generation via Post-Training
- Yume: An Interactive World Generation Model
- EndoGen: Conditional Autoregressive Endoscopic Video Generation
- MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
- TokensGen: Harnessing Condensed Tokens for Long Video Generation
- Conditional Video Generation for High-Efficiency Video Compression
- Taming Diffusion Transformer for Real-Time Mobile Video Generation
- LoViC: Efficient Long Video Generation with Context Compression
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving
- NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective
- Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions
- Scaling RL to Long Videos
- PromptTea: Let Prompts Tell TeaCache the Optimal Threshold
- Bridging Sequential Deep Operator Network and Video Diffusion: Residual Refinement of Spatio-Temporal PDE Solutions
- Omni-Video: Democratizing Unified Video Understanding and Generation
- Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
- MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
- PresentAgent: Multimodal Agent for Presentation Video Generation
- RefTok: Reference-Based Tokenization for Video Generation
- Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
- Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation
- LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
- LLM-based Realistic Safety-Critical Driving Video Generation
- Geometry-aware 4D Video Generation for Robot Manipulation
- Populate-A-Scene: Affordance-Aware Human Video Generation
- FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
- Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
- GenHSI: Controllable Generation of Human-Scene Interaction Videos
- SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution
- Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
- RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
- Emergent Temporal Correspondences from Video Diffusion Transformers
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition
- FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
- PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
- Causally Steered Diffusion for Automated Video Counterfactual Generation
- VideoMAR: Autoregressive Video Generation with Continuous Tokens
- M4V: Multi-Modal Mamba for Text-to-Video Generation
- GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
- DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
- Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
- MagCache: Fast Video Generation with Magnitude-Aware Cache
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models
- Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion
- ContentV: Efficient Training of Video Generation Models with Limited Compute
- Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers
- FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
- LayerFlow: A Unified Model for Layer-Aware Video Generation
- FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
- DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
- Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas
- Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval
- CamCloneMaster: Enabling Reference-based Camera Control for Video Generation
- Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
- Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
- LumosFlow: Motion-Guided Long Video Generation
- Motion aware video generative model
- Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
- OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
- Wan: Open and Advanced Large-Scale Video Generative Models
- [CVPR 2024] Make Pixels Dance: High-Dynamic Video Generation
- [CVPR 2024] VGen: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- [CVPR 2024] GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
- [CVPR 2024] SimDA: Simple Diffusion Adapter for Efficient Video Generation
- [CVPR 2024] MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
- [CVPR 2024] Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
- [CVPR 2024] PEEKABOO: Interactive Video Generation via Masked-Diffusion
- [CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
- [CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- [CVPR 2024] BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
- [CVPR 2024] Mind the Time: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
- [CVPR 2024] MotionDirector: Motion Customization of Text-to-Video Diffusion Models
- [CVPR 2024] Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation
- [CVPR 2024] DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
- [CVPR 2024] Grid Diffusion Models for Text-to-Video Generation
- [ECCV 2024] Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
- [ECCV 2024] W.A.L.T.: Photorealistic Video Generation with Diffusion Models
- [ECCV 2024] MoVideo: Motion-Aware Video Generation with Diffusion Models
- [ECCV 2024] DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
- [ECCV 2024] MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
- [ECCV 2024] HARIVO: Harnessing Text-to-Image Models for Video Generation
- [ECCV 2024] MEVG: Multi-event Video Generation with Text-to-Video Models
- [NeurIPS 2024] DEMO: Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
- [ICML 2024] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- [ICLR 2024] VDT: General-purpose Video Diffusion Transformers via Mask Modeling
- [ICLR 2024] VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation
- [AAAI 2024] Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
- [AAAI 2024] E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning
- [AAAI 2024] ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation
- [AAAI 2024] F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis
- Gender Bias in Text-to-Video Generation Models: A case study of Sora
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
- Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training
- DirectorLLM for Human-Centric Video Generation
- Can Video Generation Replace Cinematographers? Research on the Cinematic Language of Generated Video
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- T-SVG: Text-Driven Stereoscopic Video Generation
- Mojito: Motion Trajectory and Intensity Control for Video Generation
- SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
- Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation
- STIV: Scalable Text and Image Conditioned Video Generation
- GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
- CPA: Camera-pose-awareness Diffusion Transformer for Video Generation
- MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation
- Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop
- Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
- DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
- InTraGen: Trajectory-controlled Video Generation for Object Interactions
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
- Motion Control for Enhanced Complex Action Video Generation
- GameGen-X: Interactive Open-world Game Video Generation
- Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation
- Animating the Past: Reconstruct Trilobite via Video Generation
- ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
- The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
- Compositional 3D-aware Video Generation with LLM Director
- Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
- FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
- Still-Moving: Customized Video Generation without Customized Video Data
- VEnhancer: Generative Space-Time Enhancement for Video Generation
- Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task
- VIMI: Grounding Video Generation through Multi-modal Instruction
- GVDIFF: Grounded Text-to-Video Generation with Diffusion Models
- Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
- Text-Animator: Controllable Visual Text Video Generation
- MotionBooth: Motion-Aware Customized Text-to-Video Generation
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation
- Compositional Video Generation as Flow Equalization
- MotionClone: Training-Free Motion Cloning for Controllable Video Generation
- VideoTetris: Towards Compositional Text-To-Video Generation
- VideoPhy: Evaluating Physical Commonsense for Video Generation
- I4VGen: Image as Free Stepping Stone for Text-to-Video Generation
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control
- The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective
- TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
- MotionMaster: Training-free Camera Motion Transfer For Video Generation
- ConCLVD: Controllable Chinese Landscape Video Generation via Diffusion Model
- MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
- Grid Diffusion Models for Text-to-Video Generation
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
- S2DM: Sector-Shaped Diffusion Models for Video Generation
- Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
- [CVPR 2023] Align your Latents: High-resolution Video Synthesis with Latent Diffusion Models
- [CVPR 2023] Text2Video-Zero: Text-to-image Diffusion Models are Zero-shot Video Generators
- [CVPR 2023] Video Probabilistic Diffusion Models in Projected Latent Space
- [ICCV 2023] PYOCO: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
- [ICCV 2023] Gen-1: Structure and Content-guided Video Synthesis with Diffusion Models
- [NeurIPS 2023] UniPi: Learning Universal Policies via Text-Guided Video Generation
- [NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
- [ICLR 2023] CogVideo: Large-scale Pretraining for Text-to-video Generation via Transformers
- [ICLR 2023] Make-A-Video: Text-to-video Generation without Text-video Data
- [ICLR 2023] Phenaki: Variable Length Video Generation From Open Domain Textual Description
- StreamDiT: Real-Time Streaming Text-to-Video Generation
- SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
- FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- Photorealistic Video Generation with Diffusion Models
- GenTron: Diffusion Transformers for Image and Video Generation
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
- ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models
- MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
- GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
- Make Pixels Dance: High-Dynamic Video Generation
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models
- POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
- Dual-Stream Diffusion Net for Text-to-Video Generation
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
- Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
- ControlVideo: Training-free Controllable Text-to-Video Generation
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
- [AAAI 2026] IPRO: Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
- [CVPR 2026] ALG: Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
- [CVPR 2026] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
- [CVPR 2026] FlexiMMT: Let Your Image Move with Your Motion! - Implicit Multi-Object Multi-Motion Transfer
- [CVPR 2026] Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
- [ICLR 2026] MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
- [ICLR 2026] MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
- [ICLR 2026] Anchor Frame Bridging for Coherent First-Last Frame Video Generation
- [ICLR 2026] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
- [ICLR 2026] ReactID: Synchronizing Realistic Actions and Identity in Personalized Video Generation
- [ICLR 2026] Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
- [ICLR 2026] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation
- [CVPR 2025] MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
- [CVPR 2025] MotionPro: A Precise Motion Controller for Image-to-Video Generation
- [CVPR 2025] Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
- [CVPR 2025] Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
- [CVPR 2025] I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models
- [CVPR 2025] LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
- [ICCV 2025] AnyI2V: Animating Any Conditional Image with Motion Control
- [ICCV 2025] Versatile Transition Generation with Image-to-Video Diffusion
- [ICCV 2025] TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
- [ICCV 2025] Unified Video Generation via Next-Set Prediction in Continuous Domain
- [NeurIPS 2025] GenRec: Unifying Video Generation and Recognition with Diffusion Models
- [ICCV 2025] Precise Action-to-Video Generation Through Visual Action Prompts
- [ICCV 2025] STIV: Scalable Text and Image Conditioned Video Generation
- [ICLR 2025] FrameBridge: Improving Image-to-Video Generation with Bridge Models
- [ICLR 2025] SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
- [ICLR 2025] Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
- [ICLR 2025] Pyramidal Flow Matching for Efficient Video Generative Modeling
- [NeurIPS 2025] MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
- Waver: Wave Your Way to Lifelike Video Generation
- FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
- UniVideo: Unified Understanding, Generation, and Editing for Videos
- Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation
- Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
- Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
- Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM
- Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions
- EvAnimate: Event-Conditioned Image-to-Video Generation for Human Animation
- Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
- DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image
- I2V3D: Controllable image-to-video generation with 3D guidance
- Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling Is Easier Than You Think
- Object-Centric Image-to-Video Generation with Language Guidance
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
- MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
- [CVPR 2024] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- [CVPR 2024] Your Image Is My Video: Reshaping the Receptive Field via Image-to-Video Differentiable AutoAugmentation and Fusion
- [CVPR 2024] TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
- [CVPR 2024] Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning
- [ECCV 2024] MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
- [ECCV 2024] $R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
- [ECCV 2024] PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
- [ECCV 2024] Rethinking Image-to-Video Adaptation: An Object-Centric Perspective
- [NeurIPS 2024] TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation
- [NeurIPS 2024] Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
- [ICML 2024] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- [SIGGRAPH 2024] I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
- [SIGGRAPH 2024] Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
- [AAAI 2024] Continuous Piecewise-Affine Based Motion Model for Image Animation
- OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation
- CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
- CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
- CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers
- $R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
- TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
- Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion
- Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation
- AtomoVideo: High Fidelity Image-to-Video Generation
- ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
- AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI
- [AAAI 2026] Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry
- [AAAI 2026] FAME: Fairness-aware Attention-modulated Video Editing
- [CVPR 2026] Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
- [CVPR 2026] VideoCoF: Unified Video Editing with Temporal Reasoner
- [CVPR 2026] V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
- [CVPR 2026] EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
- [CVPR 2026] EasyV2V: A High-quality Instruction-based Video Editing Framework
- [CVPR 2026] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
- [CVPR 2026] Generative Video Motion Editing with 3D Point Tracks
- [CVPR 2026] NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
- [CVPR 2026] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
- [CVPR 2026] PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- [CVPR 2026] FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
- [ICLR 2026] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
- [ICLR 2026] UniVideo: Unified Understanding, Generation, and Editing for Videos
- [ICLR 2026] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
- [ICLR 2026] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
- [ICLR 2026] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
- [ICLR 2026] Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
- [ICLR 2026] FlowGuide: Precision-Guided Enhancement for Face Image and Video Editing
- [ICLR 2026] DragStream: Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
- [ICLR 2026] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting
- [ICLR 2026] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning
- [ICLR 2026] FastVMT: Eliminating Redundancy in Video Motion Transfer
- [CVPR 2025] VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
- [CVPR 2025] VideoDirector: Precise Video Editing via Text-to-Video Models
- [CVPR 2025] VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
- [CVPR 2025] Align-A-Video: Deterministic Reward Tuning of Image Diffusion Models for Consistent Video Editing
- [CVPR 2025] Unity in Diversity: Video Editing via Gradient-Latent Purification
- [CVPR 2025] VEU-Bench: Towards Comprehensive Understanding of Video Editing
- [CVPR 2025] SketchVideo: Sketch-based Video Generation and Editing
- [CVPR 2025] FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
- [CVPR 2025] Visual Prompting for One-shot Controllable Video Editing without Inversion
- [CVPR 2025] FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
- [ICCV 2025] Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
- [ICCV 2025] DIVE: Taming DINO for Subject-Driven Video Editing
- [ICCV 2025] DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors
- [ICCV 2025] QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing
- [ICCV 2025] Teleportraits: Training-Free People Insertion into Any Scene
- [ICLR 2025] VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing
- [NeurIPS 2025] REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
- [AAAI 2025] FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
- [AAAI 2025] EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
- [AAAI 2025] VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment
- [AAAI 2025] Re-Attentional Controllable Video Diffusion Editing
- [WACV 2025] IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion
- [WACV 2025] SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
- [WACV 2025] MagicStick: Controllable Video Editing via Control Handle Transformations
- [WACV 2025] Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior
- [WACV 2025] FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
- OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
- UniVideo: Unified Understanding, Generation, and Editing for Videos
- InstructX: Towards Unified Visual Editing with MLLM Guidance
- Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
- ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
- First Frame Is the Place to Go for Video Content Customization
- Consistent Video Editing as Flow‑Driven Image‑to‑Video Generation
- UNIC: Unified In‑Context Video Editing
- DreamVE: Unified Instruction‑based Image and Video Editing
- Controllable Pedestrian Video Editing for Multi‑View Driving Scenarios via Motion Sequence
- Low‑Cost Test‑Time Adaptation for Robust Video Editing
- From Long Videos to Engaging Clips: A Human‑Inspired Video Editing Framework with Multimodal Narrative Understanding
- STR‑Match: Matching SpatioTemporal Relevance Score for Training‑Free Video Editing
- Shape‑for‑Motion: Precise and Consistent Video Editing with 3D Proxy
- DFVEdit: Conditional Delta Flow Vector for Zero‑shot Video Editing
- Good Noise Makes Good Edits: A Training‑Free Diffusion‑Based Video Editing with Image and Text Prompts
- LoRA‑Edit: Controllable First‑Frame‑Guided Video Editing via Mask‑Aware LoRA Fine‑Tuning
- TV‑LiVE: Training‑Free, Text‑Guided Video Editing via Layer Informed Vitality Exploitation
- FADE: Frequency‑Aware Diffusion Model Factorization for Video Editing
- FlowDirector: Training‑Free Flow Steering for Precise Text‑to‑Video Editing
- FullDiT2: Efficient In‑Context Conditioning for Video Diffusion Transformers
- OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation
- Motion‑Aware Concept Alignment for Consistent Video Editing
- Zero‑to‑Hero: Zero‑Shot Initialization Empowering Reference‑Based Video Appearance Editing
- REGen: Multimodal Retrieval‑Embedded Generation for Long‑to‑Short Video Editing
- From Shots to Stories: LLM‑Assisted Video Editing with Unified Language Representations
- DAPE: Dual‑Stage Parameter‑Efficient Fine‑Tuning for Consistent Video Editing with Diffusion Models
- Photoshop Batch Rendering Using Actions for Stylistic Video Editing
- Efficient Temporal Consistency in Diffusion‑Based Video Editing with Adaptor Modules: A Theoretical Framework
- Vidi: Large Multimodal Models for Video Understanding and Editing
- Visual Prompting for One‑Shot Controllable Video Editing without Inversion
- CamMimic: Zero‑Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models
- VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
- Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology‑Inspired Computing Methods
- InstructVEdit: A Holistic Approach for Instructional Video Editing
- HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks
- VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation
- GIFT: Generated Indoor video frames for Texture‑less point tracking
- RASA: Replace Anyone, Say Anything — A Training‑Free Framework for Audio‑Driven and Universal Portrait Video Editing
- V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes
- Alias‑Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
- VACE: All‑in‑One Video Creation and Editing
- Get In Video: Add Anything You Want to the Video
- VideoPainter: Any‑length Video Inpainting and Editing with Plug‑and‑Play Context Control
- VideoGrain: Modulating Space‑Time Attention for Multi‑grained Video Editing
- VideoDiff: Human‑AI Video Co‑Creation with Alternatives
- SportsBuddy: Designing and Evaluating an AI‑Powered Sports Video Storytelling Tool Through Real‑World Deployment
- AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming and Keyframe Selection
- MotionCanvas: Cinematic Shot Design with Controllable Image‑to‑Video Generation
- SST‑EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
- IP‑FaceDiff: Identity‑Preserving Facial Video Editing with Diffusion
- Qffusion: Controllable Portrait Video Editing via Quadrant‑Grid Attention Learning
- Text‑to‑Edit: Controllable End‑to‑End Video Ad Creation via Multimodal LLMs
- Enhancing Low‑Cost Video Editing with Lightweight Adaptors and Temporal‑Aware Inversion
- Edit as You See: Image‑Guided Video Editing via Masked Motion Modeling
- [CVPR 2024] A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- [CVPR 2024] VidToMe: Video Token Merging for Zero-Shot Video Editing
- [CVPR 2024] Video-P2P: Video Editing with Cross-Attention Control
- [CVPR 2024] CCEdit: Creative and Controllable Video Editing via Diffusion Models
- [CVPR 2024] RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- [CVPR 2024] DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
- [CVPR 2024] MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- [CVPR 2024] MotionEditor: Editing Video Motion via Content-Aware Diffusion
- [CVPR 2024] CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-Driven Video Editing
- [ICLR 2024] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
- [ICLR 2024] Video Decomposition Prior: Editing Videos Layer by Layer
- [ICLR 2024] FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
- [ICLR 2024] TokenFlow: Consistent Diffusion Features for Consistent Video Editing
- [ECCV 2024] VIDEOSHOP: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
- [ECCV 2024] WAVE: Warping DDIM Inversion Features for Zero-Shot Text-to-Video Editing
- [ECCV 2024] DreamMotion: Space-Time Self-similar Score Distillation for Zero-Shot Video Editing
- [ECCV 2024] Object-Centric Diffusion for Efficient Video Editing
- [ECCV 2024] Video Editing via Factorized Diffusion Distillation
- [ECCV 2024] SAVE: Protagonist Diversification with Structure Agnostic Video Editing
- [ECCV 2024] DNI: Dilutional Noise Initialization for Diffusion Video Editing
- [ECCV 2024] MagDiff: Multi-alignment Diffusion for High-Fidelity Video Generation and Editing
- [ECCV 2024] DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
- MAKIMA: Tuning‑free Multi‑Attribute Open‑domain Video Editing via Mask‑Guided Attention Modulation
- DriveEditor: A Unified 3D Information‑Guided Framework for Controllable Object Editing in Driving Scenes
- Re‑Attentional Controllable Video Diffusion Editing
- MoViE: Mobile Diffusion for Video Editing
- DIVE: Taming DINO for Subject‑Driven Video Editing
- Trajectory Attention for Fine‑grained Video Motion Control
- VideoDirector: Precise Video Editing via Text‑to‑Video Models
- StableV2V: Stablizing Shape Consistency in Video‑to‑Video Editing
- OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models
- A Reinforcement Learning‑Based Automatic Video Editing Method Using Pre‑trained Vision‑Language Model
- Taming Rectified Flow for Inversion and Editing
- AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
- Shaping a Stabilized Video by Mitigating Unintended Changes for Concept‑Augmented Video Editing
- RNA: Video Editing with ROI‑based Neural Atlas
- FreeMask: Rethinking the Importance of Attention Masks for Zero‑Shot Video Editing
- DNI: Dilutional Noise Initialization for Diffusion Video Editing
- Blended Latent Diffusion under Attention Control for Real‑World Video Editing
- DeCo: Decoupled Human‑Centered Diffusion Video Editing with Motion Consistency
- InVi: Object Insertion In Videos Using Off‑the‑Shelf Diffusion Models
- MVOC: A Training‑Free Multiple Video Object Composition Method with Diffusion Models
- VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
- FRAG: Frequency Adapting Group for Diffusion Video Editing
- Zero‑Shot Video Editing through Adaptive Sliding Score Distillation
- Ada‑VE: Training‑Free Consistent Video Editing Using Adaptive Motion Prior
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
- Temporally Consistent Object Editing in Videos using Extended Attention
- MotionFollower: Editing Video Motion via Lightweight Score‑Guided Diffusion
- Streaming Video Diffusion: Online Video Editing with Diffusion Models
- I2VEdit: First‑Frame‑Guided Video Editing via Image‑to‑Video Diffusion Models
- ReVideo: Remake a Video with Motion and Content Control
- Slicedit: Zero‑Shot Video Editing With Text‑to‑Image Diffusion Models Using Spatio‑Temporal Slices
- GenVideo: One‑shot Target‑image and Shape Aware Video Editing using T2I Diffusion Models
- Ctrl‑Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
- S3Editor: A Sparse Semantic‑Disentangled Self‑Training Framework for Face Video Editing
- ExpressEdit: Video Editing with Natural Language and Sketching
- EVA: Zero‑shot Accurate Attributes and Multi‑Object Video Editing
- Edit3K: Universal Representation Learning for Video Editing Components
- Videoshop: Localized Semantic Video Editing with Noise‑Extrapolated Diffusion Inversion
- AnyV2V: A Tuning‑Free Framework For Any Video‑to‑Video Editing Tasks
- DreamMotion: Space‑Time Self‑Similar Score Distillation for Zero‑Shot Video Editing
- EffiVED: Efficient Video Editing via Text‑instruction Diffusion Models
- AICL: Action In‑Context Learning for Video Diffusion Model
- Video Editing via Factorized Diffusion Distillation
- VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
- FastVideoEdit: Leveraging Consistency Models for Efficient Text‑to‑Video Editing
- Place Anything into Any Video
- UniEdit: A Unified Tuning‑Free Framework for Video Motion and Appearance Editing
- Anything in Any Scene: Photorealistic Video Object Insertion
- Object‑Centric Diffusion for Efficient Video Editing
- VASE: Object‑Centric Appearance and Shape Manipulation of Real Videos
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
- [AAAI 2026] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
- [AAAI 2026] MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models
- [CVPR 2026] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- [CVPR 2026] First Frame Is the Place to Go for Video Content Customization
- [CVPR 2026] UCPE: Unified Camera Positional Encoding for Controlled Video Generation
- [CVPR 2026] Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
- [CVPR 2026] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
- [CVPR 2026] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
- [CVPR 2026] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
- [CVPR 2026] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
- [CVPR 2026] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
- [CVPR 2026] OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
- [CVPR 2026] Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
- [CVPR 2026] ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
- [CVPR 2026] PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation
- [CVPR 2026] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
- [CVPR 2026] RealWonder: Real-Time Physical Action-Conditioned Video Generation
- [CVPR 2026] LAMP: Language-Assisted Motion Planning for Controllable Video Generation
- [CVPR 2026] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
- [CVPR 2026] StableWorld: Towards Stable and Consistent Long Interactive Video Generation
- [ICLR 2026] Controllable Video Generation with Provable Disentanglement
- [ICLR 2026] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
- [ICLR 2026] MoCa: Modeling Object Consistency for 3D Camera Control in Video Generation
- [ICLR 2026] MotionStream: Real-Time Video Generation with Interactive Motion Controls
- [ICLR 2026] Time-to-Move: Training-Free Motion-Controlled Video Generation via Dual-Clock Denoising
- [ICLR 2026] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
- [ICLR 2026] Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
- [ICLR 2026] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
- [ICLR 2026] MIMIC: Mask-Injected Manipulation Video Generation with Interaction Control
- [ICLR 2026] NewtonGen: Physics-consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
- [ICLR 2026] MATRIX: Mask Track Alignment for Interaction-aware Video Generation
- [ICLR 2026] ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
- [ICLR 2026] Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models
- [ICLR 2026] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
- [ICLR 2026] Light-X: Generative 4D Video Rendering with Camera and Illumination Control
- [ICLR 2026] Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
- [ICLR 2026] LightCtrl: Training-free Controllable Video Relighting
- [CVPR 2025] IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner
- [CVPR 2025] AnimateAnything: Consistent and Controllable Animation for Video Generation
- [CVPR 2025] Customized Condition Controllable Generation for Video Soundtrack
- [CVPR 2025] StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
- [ICCV 2025] Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
- [ICCV 2025] MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
- [ICCV 2025] MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
- [ICCV 2025] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
- [ICCV 2025] Free-Form Motion Control (SynFMC): Controlling the 6D Poses of Camera and Objects in Video Generation
- [ICCV 2025] RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
- [ICCV 2025] MagicMotion: Video Generation with a Smart Director
- [ICCV 2025] UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
- [ICLR 2025] MotionClone: Training-Free Motion Cloning for Controllable Video Generation
- [AAAI 2025] CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation
- [AAAI 2025] TrackGo: A Flexible and Efficient Method for Controllable Video Generation
- [WACV 2025] Fine-grained Controllable Video Generation via Object Appearance and Context
- Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
- In-Video Instructions: Visual Signals as Generative Control
- IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation
- ATI: Any Trajectory Instruction for Controllable Video Generation
- CamContextI2V: Context‑aware Controllable Video Generation
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
- MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
- [AAAI 2026] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
- [AAAI 2026] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
- [CVPR 2026 Findings] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
- [CVPR 2026] Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- [CVPR 2026] ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
- [CVPR 2026] StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
- [CVPR 2026] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
- [CVPR 2026] UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
- [CVPR 2026] EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
- [ICLR 2026] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
- [ICLR 2026] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers
- [ICLR 2026] AUHead: Realistic Emotional Talking Head Generation via Action Units Control
- [ICLR 2026] ApoAvatar: Expressive Audio-Driven Avatar Generation via Refocused Audio-Pose Priors
- [ICLR 2026] AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective
- [ICLR 2026] RETA: Real-Time and Expressive Talking Head Animation without Emotion Label
- [ICLR 2026] XTalker: Turn, Smile, and Speak in Controllable Talking Portrait Animation
- [ICLR 2026] GenFaceTalk: Generalizable One-Shot Talking-Head Generation for Diverse Styles
- [ICLR 2026] Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
- [ICLR 2026] A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
- [ICLR 2026] Weakly Supervised Motion Learning for Co-speech Gesture Video Generation
- [ICLR 2026] SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representations
- [ICLR 2026] THEval. Evaluation Framework for Talking Head Video Generation
- [ICLR 2026] AnyAvatar: Dynamic and Consistent Audio-Driven Human Animation for Multiple Characters
- [ICLR 2026] Talk2Me: High-Fidelity and Controllable Audio-Driven Avatars with Gaussian Splatting
- [ICLR 2026] Audio-driven 3D Conversational Full-body Human Avatar Generation from a Single Image
- [ICLR 2026] InterAvatar: Real-time Interactive Portrait Animation via Behavioral Interaction Prompts
- [ICLR 2026] KeyVID: Keyframe-Aware Video Diffusion for Audio-Visual Animation
- [CVPR 2025] KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
- [CVPR 2025] AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
- [CVPR 2025] MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
- [CVPR 2025] Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
- [CVPR 2025] INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
- [ICCV 2025] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
- [ICCV 2025] GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars
- [ICCV 2025] ACTalker: Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
- [ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
- [ICLR 2025] Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
- [ICLR 2025] CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation
- [NeurIPS 2025] MTV: Audio-Sync Video Generation with Multi-Stream Temporal Control
- [AAAI 2025] EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
- [AAAI 2025] PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis
- Wan-S2V: Audio-Driven Cinematic Video Generation
- TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
- Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
- Scaling Up Audio‑Synchronized Visual Animation: An Efficient Training Paradigm
- SpA2V: Harnessing Spatial Auditory Cues for Audio‑driven Spatially‑aware Video Generation
- OmniAvatar: Efficient Audio‑Driven Avatar Video Generation with Adaptive Body Animation
- AlignHuman: Improving Motion and Fidelity via Timestep‑Segment Preference Optimization for Audio‑Driven Human Animation
- LLIA — Enabling Low‑Latency Interactive Avatars: Real‑Time Audio‑Driven Portrait Video Generation with Diffusion Models
- TalkingMachines: Real‑Time Audio‑Driven FaceTime‑Style Video via Autoregressive Diffusion Models
- [CVPR 2024] FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
- [ECCV 2024] UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model
- [ECCV 2024] Audio-Driven Talking Face Generation with Stabilized Synchronization Loss
- [NeurIPS 2024] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
- AV‑Link: Temporally‑Aligned Diffusion Features for Cross‑Modal Audio‑Video Generation
- SAVGBench: Benchmarking Spatially Aligned Audio‑Video Generation
- SINGER: Vivid Audio‑driven Singing Video Generation with Multi‑scale Spectral Diffusion Model
- SyncFlow: Toward Temporally Aligned Joint Audio‑Video Generation from Text
- FLOAT: Generative Motion Latent Flow Matching for Audio‑driven Talking Portrait
- Stereo‑Talker: Audio‑driven 3D Human Synthesis with Prior‑Guided Mixture‑of‑Experts
- A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
- DiffTED: One‑shot Audio‑driven TED Talk Video Generation with Diffusion‑based Co‑speech Gestures
- [AAAI 2026] FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework
- [CVPR 2026] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
- [CVPR 2026] MultiAnimate: Pose-Guided Image Animation Made Extensible
- [CVPR 2026] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
- [CVPR 2026] Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
- [CVPR 2026] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
- [CVPR 2026] Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
- [CVPR 2026] Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
- [ICLR 2026] MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation
- [ICLR 2026] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
- [ICLR 2026] DanceTogether: Generating Interactive Multi-Person Video without Identity Drifting
- [ICLR 2026] Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
- [ICLR 2026] AUHead: Realistic Emotional Talking Head Generation via Action Units Control
- [ICLR 2026] MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
- [ICLR 2026] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
- [CVPR 2025] X-Dyna: Expressive Dynamic Human Image Animation
- [CVPR 2025] StableAnimator: High-Quality Identity-Preserving Human Image Animation
- [CVPR 2025] Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
- [ICCV 2025] DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
- [ICCV 2025] Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
- [ICCV 2025] Multi-identity Human Image Animation with Structural Video Diffusion
- [ICCV 2025] OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
- [ICCV 2025] AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
- [ICCV 2025] Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
- [ICLR 2025] Animate-X: Universal Character Image Animation with Enhanced Motion Representation
- SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
- Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
- StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
- FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers
- ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
- StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation
- HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions
- TT‑DF: A Large‑Scale Diffusion‑Based Dataset and Benchmark for Human Body Forgery Detection
- AnimateAnywhere: Rouse the Background in Human Image Animation
- UniAnimate‑DiT: Human Image Animation with Large‑Scale Video Diffusion Transformer
- Taming Consistency Distillation for Accelerated Human Image Animation
- Multi‑identity Human Image Animation with Structural Video Diffusion
- DreamActor‑M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High‑Quality Human Image Animation
- EvAnimate: Event‑conditioned Image‑to‑Video Generation for Human Animation
- [CVPR 2024] MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion
- [CVPR 2024] MotionEditor: Editing Video Motion via Content-Aware Diffusion
- [CVPR 2024] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- [ECCV 2024] Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
- [NeurIPS 2024] HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
- [NeurIPS 2024] TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation
- [ICLR 2024] DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
- DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
- High Quality Human Image Animation using Regional Supervision and Motion Blur Condition
- Dormant: Defending against Pose-driven Human Image Animation
- TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation
- [AAAI 2026] Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
- [CVPR 2026] Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration
- [CVPR 2026] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
- [CVPR 2026] Transition Matching Distillation for Fast Video Generation
- [CVPR 2026] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
- [CVPR 2026] Dual-Granularity Memory for Efficient Video Generation
- [CVPR 2026 Findings] S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation
- [CVPR 2026] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
- [CVPR 2026] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
- [CVPR 2026] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
- [CVPR 2026] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
- [CVPR 2026] SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
- [CVPR 2026] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
- [CVPR 2026] TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
- [CVPR 2026] Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention
- [CVPR 2026] Accelerating Text-to-Video Generation with Calibrated Sparse Attention
- [CVPR 2026] SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
- [CVPR 2026] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
- [CVPR 2026] Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers
- [ICLR 2026] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
- [ICLR 2026] Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers
- [ICLR 2026] LongLive: Real-time Interactive Long Video Generation
- [ICLR 2026] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
- [ICLR 2026] Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
- [ICLR 2026] MotionStream: Real-Time Video Generation with Interactive Motion Controls
- [ICLR 2026] SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation
- [ICLR 2026] DVD-Quant: Data-free Video Diffusion Transformers Quantization
- [ICLR 2026] VMoBA: Mixture-of-Block Attention for Video Diffusion Models
- [ICLR 2026] DSA: Efficient Inference For Video Generation Models via Distributed Sparse Attention
- [ICLR 2026] BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
- [ICLR 2026] PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
- [ICLR 2026] Flow Caching for Autoregressive Video Generation
- [ICLR 2026] Streaming Autoregressive Video Generation via Diagonal Distillation
- [ICLR 2026] FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
- [ICLR 2026] Real-Time Motion-Controllable Autoregressive Video Diffusion
- [ICLR 2026] Neodragon: Mobile Video Generation Using Diffusion Transformer
- [ICLR 2026] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
- [ICLR 2026] QVGen: Pushing the Limit of Quantized Video Generative Models
- [ICLR 2026] BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
- [ICLR 2026] Towards One-step Causal Video Generation via Adversarial Self-Distillation
- [CVPR 2025] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- [CVPR 2025] CausVid: From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
- [CVPR 2025] BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
- [ICCV 2025] AdaCache: Adaptive Caching for Faster Video Generation with Diffusion Transformers
- [ICCV 2025] TaylorSeer: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
- [ICCV 2025] Accelerating Diffusion Transformer via Gradient-Optimized Cache
- [ICCV 2025] V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
- [ICCV 2025] DMDX: Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
- [ICCV 2025] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for DiT
- [ICLR 2025] FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
- [NeurIPS 2025] MagCache: Fast Video Generation with Magnitude-Aware Cache
- [ICML 2025] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- [ICML 2025] Fast Video Generation with Sliding Tile Attention
- [ICML 2025] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
- [ICML 2025] AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
- UltraGen: High-Resolution Video Generation with Hierarchical Attention
- Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
- Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling
- MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration
- SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment
- Taming Diffusion Transformer for Real-Time Mobile Video Generation
- Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
- Region Masking to Accelerate Video Processing on Neuromorphic Hardware
- DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training
- [CVPR 2024] Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- [NeurIPS 2024] Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
- [NeurIPS 2024] Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
- [NeurIPS 2024] Fast and Memory-Efficient Video Diffusion Using Streamlined Inference
- [IJCAI 2024] FasterVD: On Acceleration of Video Diffusion Models
- Accelerating Video Diffusion Models via Distribution Matching
- OSV: One Step is Enough for High-Quality Image to Video Generation
- HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions
- AnimateDiff-Lightning: Cross-Model Diffusion Distillation
- [AAAI 2026] VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
- [CVPR 2026] Inference-time Physics Alignment of Video Generative Models with Latent World Models
- [CVPR 2026] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
- [CVPR 2026] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
- [CVPR 2026] PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation
- [CVPR 2026] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
- [CVPR 2026] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
- [CVPR 2026] FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
- [ICLR 2026] Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
- [ICLR 2026] Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
- [ICLR 2026] Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models
- [ICLR 2026] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
- [ICCV 2025] LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
- [ICLR 2025] DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control
- [ICLR 2025] FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
- [CVPR 2025] VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- LongCat-Video Technical Report
- PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
- LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
- VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
- Video Perception Models for 3D Scene Synthesis
- RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
- VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning
- Toward Rich Video Human-Motion2D Generation
- AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation
- Multimodal Large Language Models: A Survey
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- ContentV: Efficient Training of Video Generation Models with Limited Compute
- Photography Perspective Composition: Towards Aesthetic Perspective Recommendation
- Scaling Image and Video Generation via Test-Time Evolutionary Search
- InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
- AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection
- RLVR-World: Training World Models with Reinforcement Learning
- Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models
- DanceGRPO: Unleashing GRPO on Visual Generation
- VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
- SkyReels-V2: Infinite-length Film Generative Model
- FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos
- Aligning Anime Video Generation with Human Feedback
- Discriminator-Free Direct Preference Optimization for Video Diffusion
- Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments
- OmniCam: Unified Multimodal Video Generation via Camera Control
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
- Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors
- Judge Anything: MLLM as a Judge Across Any Modality
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
- Unified Reward Model for Multimodal Understanding and Generation
- Pre-Trained Video Generative Models as World Simulators
- Harness Local Rewards for Global Benefits: Effective Text-to-Video Generation Alignment with Patch-level Reward Models
- IPO: Iterative Preference Optimization for Text-to-Video Generation
- MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
- HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment
- Zeroth-order Informed Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
- Improving Video Generation with Human Feedback
- OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization
- The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
- Free²Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
- A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model
- Video to Video Generative Adversarial Network for Few-shot Learning Based on Policy Gradient
- WorldSimBench: Towards Video Generation Models as World Simulators
- Animating the Past: Reconstruct Trilobite via Video Generation
- VideoAgent: Self-Improving Video Generation
- E-Motion: Future Motion Simulation via Event Sequence Diffusion
- SePPO: Semi-Policy Preference Optimization for Diffusion Alignment
- Video Diffusion Alignment via Reward Gradients
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
- AdaDiff: Adaptive Step Selection for Fast Diffusion Models
QuenithAI is a professional organization composed of top researchers, dedicated to providing high-quality 1-on-1 research mentoring for university students worldwide. Our mission is to help students bridge the gap from theoretical knowledge to cutting-edge research and publish their work in top-tier conferences and journals.
Maintaining this Video-Generation-Paper-List requires significant effort, just as completing a high-quality paper requires focused dedication and expert guidance. If you are looking for one-on-one support from top scholars on your own research project, from identifying innovative ideas to publishing in top venues, we invite you to contact us.
➡️ Contact us via WeChat or E-mail to start your research journey.
QuenithAI (应达学术) is a professional organization of top researchers dedicated to providing high-quality 1-on-1 research mentoring to university students worldwide. Our mission is to help students develop outstanding research skills and publish their work in top-tier conferences and journals.
Maintaining a GitHub survey repository takes tremendous effort, just as completing a high-quality paper demands focused dedication and expert guidance. If you would like one-on-one support from top scholars on your own research project, we sincerely invite you to get in touch.
➡️ Contact us via WeChat or E-mail to begin your research journey.
Contributions are welcome! Please see our Contribution Guidelines for details on how to add new papers, correct information, or improve the repository.
Join our community to stay up-to-date with the latest advancements, share your work, and collaborate with other researchers and developers in the field of video generation. If you are interested, please contact our administrator to join the group.