The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for geometry control, rendering existing control methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with 3.3× resolution and 4× frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2’s ability, unlocking broader applications in autonomous driving.
MagicDrive-V2 adopts a newly proposed multi-view DiT architecture for multi-view street video generation. Together with the control blocks, it accepts multiple control signals for fine-grained geometric control.
Left/Up: Strategies designed for spatial latents do not fit the spatial-temporal latents produced by a 3D VAE. Right/Down: MagicDrive-V2 applies spatial-temporal encoding to each condition so that it correctly controls the video generation.
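To make the mismatch in frame counts concrete, here is a minimal sketch. It assumes a hypothetical causal 3D VAE that keeps the first frame and compresses the remaining frames 4× in time; the ratio and the function names are illustrative assumptions, not details stated on this page. Under that assumption, per-frame conditions (such as boxes) outnumber latent frames, so they must be grouped per latent frame before encoding.

```python
def latent_frame_count(num_frames: int, ratio: int = 4) -> int:
    """Latent frame count for an ASSUMED causal 3D VAE.

    Assumption (not from the source): the first frame is kept as-is and
    every following group of `ratio` frames maps to one latent frame.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if (num_frames - 1) % ratio != 0:
        raise ValueError("num_frames must have the form ratio * k + 1")
    return 1 + (num_frames - 1) // ratio


def group_frames_by_latent(num_frames: int, ratio: int = 4) -> list[list[int]]:
    """Return, for each latent frame, the video-frame indices it covers.

    A spatial-temporal condition encoder would consume one such group of
    per-frame conditions and emit one embedding aligned with that latent.
    """
    latent_frame_count(num_frames, ratio)  # validates the frame count
    groups = [[0]]  # the first frame forms its own group
    for start in range(1, num_frames, ratio):
        groups.append(list(range(start, start + ratio)))
    return groups


if __name__ == "__main__":
    groups = group_frames_by_latent(193)
    print(len(groups), groups[0], groups[1], groups[-1])
```

A purely spatial encoder would produce one embedding per video frame, which cannot be added to the latents one-to-one; grouping first gives exactly one condition embedding per latent frame.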
Left/Up: MagicDrive-V2 adopts progressive bootstrap training, starting from low-resolution images and moving to high-resolution long videos. Right/Down: At each stage, MagicDrive-V2 trains on mixed data.
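The idea of a staged schedule with mixed data per stage can be sketched as follows. Every stage definition, resolution below the final one, intermediate frame count, and sampling weight here is a hypothetical placeholder chosen for illustration; only the 848×1600, 193-frame target appears on this page. The sketch shows how each training step might draw a data configuration from the current stage's mix.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DataConfig:
    height: int
    width: int
    frames: int  # 1 means a single image


# HYPOTHETICAL stages: (config, sampling weight) pairs per stage.
# Stage 0 uses low-resolution images only; later stages mix shorter and
# longer clips at increasing resolution.
STAGES = [
    [(DataConfig(224, 400, 1), 1.0)],
    [
        (DataConfig(224, 400, 1), 0.2),
        (DataConfig(224, 400, 17), 0.5),
        (DataConfig(424, 800, 17), 0.3),
    ],
    [
        (DataConfig(424, 800, 17), 0.3),
        (DataConfig(848, 1600, 65), 0.4),
        (DataConfig(848, 1600, 193), 0.3),
    ],
]


def sample_config(stage: int, rng: random.Random) -> DataConfig:
    """Draw one data configuration from the mix assigned to `stage`."""
    configs, weights = zip(*STAGES[stage])
    return rng.choices(configs, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in range(len(STAGES)):
        print(stage, [sample_config(stage, rng) for _ in range(3)])
```

Keeping some lower-resolution or shorter samples in later stages is one plausible way to read "mixed data": the model keeps seeing cheap samples while it adapts to the expensive ones.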
MagicDrive-V2 provides accurate ego trajectory control in 3D space, which supports generating various driving scenarios.
"yield the way"
"through the intersection"
"over a speed bump"
"change the lane"
MagicDrive-V2 offers precise control over objects and road semantics. We show generations with the same configurations but different weather/time-of-day conditions.
MagicDrive-V2 provides precise controls over each object's class, size, and trajectory.
To show diversity and controllability, we generate the following videos with the same boxes, road maps, and ego trajectory from the validation split of the Waymo Open Dataset. We use different text prompts to change the weather/time-of-day/background of each video (from Sec. 4.4 of our paper). All videos are 3×848×1600 with 193 frames at 10 FPS, generated in a single inference. Only 3 same-sized views are available in Waymo for training.
🌧️Rainy:
🌫️Foggy:
🌙At Night:
🌄Twilight:
☀️Sunny:
🏞Hill area, more trees🌳, fewer buildings: