MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao¹, Kai Chen², Bo Xiao³, Lanqing Hong⁴†, Zhenguo Li⁴, Qiang Xu¹†

¹The Chinese University of Hong Kong, ²Hong Kong University of Science and Technology, ³Huawei Cloud, ⁴Huawei Noah's Ark Lab


Take a quick look at MagicDrive-V2 in the video, best viewed on a large screen with music!

Abstract

The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for geometry control, rendering existing control methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with 3.3× resolution and 4× frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2’s ability, unlocking broader applications in autonomous driving.

All generated videos on this page are 6×848×1600 (6 views, each 848×1600), 20 s long at 12 FPS, and produced in a single inference!
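For a sense of scale, the figures above can be turned into rough frame and pixel counts. This is a back-of-the-envelope sketch: it assumes the frame count is simply duration × FPS (the exact count may differ by a frame depending on how endpoints are counted), and the function name `video_stats` is made up for illustration.

```python
def video_stats(views, height, width, seconds, fps):
    """Rough size of one multi-view clip (assumes frames = seconds * fps)."""
    frames = seconds * fps
    pixels_per_step = views * height * width  # pixels across all views at one timestep
    return {
        "frames": frames,
        "pixels_per_step": pixels_per_step,
        "total_pixels": frames * pixels_per_step,
    }

# Values taken from the sentence above: 6 views, 848x1600, 20 s at 12 FPS.
stats = video_stats(views=6, height=848, width=1600, seconds=20, fps=12)
print(stats)
```

Under these assumptions a single clip covers about 240 timesteps and roughly 8.1 million pixels per timestep across the six views.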

Architecture: Multi-View DiT for Scalable Video Generation

architecture of MagicDrive-V2

MagicDrive-V2 adopts a newly proposed multi-view DiT architecture for multi-view street video generation. Together with the control blocks, MagicDrive-V2 supports multiple control signals for fine-grained geometric control.

Control: Spatial-Temporal Encoding for Conditions

MagicDrive-V2 encoding

MagicDrive-V2 encoding

Left/Up: The spatial-temporal latents produced by the 3D VAE are incompatible with previous control strategies, which were designed for spatial latents. Right/Down: MagicDrive-V2 adopts spatial-temporal encoding for each condition to correctly control video generation.
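The mismatch described above is largely one of shapes: conditions such as boxes or camera poses are given per frame, while a 3D VAE yields fewer latent steps than there are frames. The sketch below only illustrates that alignment using plain average pooling; it is not the learned spatial-temporal encoder from the paper. The temporal compression ratio of 4 and the function name `pool_conditions_over_time` are assumptions for illustration.

```python
def pool_conditions_over_time(per_frame_cond, ratio=4):
    """Average per-frame condition vectors in chunks of `ratio` frames,
    giving one vector per temporal latent step.

    `ratio` is an assumed temporal compression factor, not a value from the paper.
    A trailing chunk shorter than `ratio` is averaged over the frames it has.
    """
    pooled = []
    for start in range(0, len(per_frame_cond), ratio):
        chunk = per_frame_cond[start:start + ratio]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled

# 8 frames of a toy 2-D condition (e.g. a box centre moving along x).
frames = [[float(t), 1.0] for t in range(8)]
latent_cond = pool_conditions_over_time(frames, ratio=4)
print(len(frames), "frames ->", len(latent_cond), "latent steps")
```

A fixed average like this discards motion within each chunk, which is one reason a learned encoder over both space and time is the more natural choice for geometry control.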

Training: Progressive Training with Mixed Data Configurations

MagicDrive-V2 stages

MagicDrive-V2 data

Left/Up: MagicDrive-V2 adopts progressive bootstrap training, starting from low-resolution images and advancing to high-resolution long videos. Right/Down: At each stage, MagicDrive-V2 trains on a mix of data configurations.
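One way to picture this schedule is as a list of stages, each with its own weighted mix of data configurations. The sketch below is illustrative only: the stage names, the intermediate resolutions and frame counts, and the sampling weights are placeholders rather than the paper's settings. Only the 848×1600 target resolution comes from this page, and the 240-frame entry assumes 20 s × 12 FPS.

```python
import random

# Each stage lists (height, width, frames, weight). All values except the
# 848x1600 target are hypothetical placeholders, not the paper's settings.
STAGES = [
    {"name": "low-res images", "mix": [(224, 400, 1, 1.0)]},
    {"name": "low-res short videos", "mix": [(224, 400, 1, 0.2), (224, 400, 16, 0.8)]},
    {"name": "high-res long videos", "mix": [(424, 800, 64, 0.3), (848, 1600, 240, 0.7)]},
]

def sample_config(stage, rng):
    """Draw one (height, width, frames) configuration according to the stage's weights."""
    configs = [entry[:3] for entry in stage["mix"]]
    weights = [entry[3] for entry in stage["mix"]]
    return rng.choices(configs, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for stage in STAGES:
    batch_configs = [sample_config(stage, rng) for _ in range(4)]
    print(stage["name"], batch_configs)
```

Keeping some shorter or lower-resolution samples in the later stages is the "mixed" part of the idea: the model continues to see the easier configurations while it learns the harder ones.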

Various Driving Scenarios

MagicDrive-V2 provides accurate ego trajectory control in 3D space, which supports generating various driving scenarios.

"yield the way"

"through the intersection"

"over a speed bump"

"change the lane"

More Video Examples

MagicDrive-V2 offers precise control over objects and road semantics. We show generations with the same configurations but different weather/time-of-day conditions.

Object-level Control

MagicDrive-V2 provides precise controls over each object's class, size, and trajectory.

Extension: Fine-tuned Results on Waymo Open Dataset with Rich Text Control

To show diversity and controllability, we generate the following videos with the same boxes, road maps, and ego trajectory from the validation split of the Waymo Open Dataset. We use different text prompts to change the weather, time of day, or background of each video (see Sec. 4.4 of our paper). All videos are 3×848×1600 with 193 frames at 10 FPS, each produced in a single inference. Only 3 same-sized views are available in Waymo for training.

🌧️Rainy:

🌫️Foggy:

🌙At Night:

🌄Twilight:

☀️Sunny:

🏞Hill area, more trees🌳, fewer buildings: