The rapid advancement of diffusion models has greatly improved video synthesis, especially controllable video generation, which is essential for applications such as autonomous driving. However, existing methods are limited in scalability and in how they integrate control conditions, so they fall short of the high-resolution, long-video requirements of autonomous driving. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture that tackles these challenges. Our method improves scalability through flow matching and employs a progressive training strategy to handle complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show superior performance in generating realistic street-scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal control, expanding its potential applications across various autonomous driving tasks.
MagicDriveDiT adopts a newly proposed multi-view DiT architecture for multi-view street video generation. Together with the control blocks, MagicDriveDiT supports multiple control signals for fine-grained geometric control.
Left/Up: The spatial-temporal latents produced by the 3D VAE are incompatible with previous control strategies designed for spatial latents. Right/Down: MagicDriveDiT adopts spatial-temporal encoding for each condition to correctly control video generation.
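The mismatch described in this caption can be illustrated with a minimal sketch. It assumes a hypothetical temporal compression factor `factor`, meaning each latent step covers up to `factor` consecutive video frames; the real 3D VAE may compress time differently. The function names are illustrative and are not taken from the MagicDriveDiT code. The sketch only shows the bookkeeping: per-frame conditions (for example, per-frame bounding boxes) are grouped so that each latent step receives all the conditions for the frames it covers, which a spatial-temporal encoder could then fuse.

```python
def group_frames_for_latents(num_frames: int, factor: int) -> list[list[int]]:
    """Return, for each latent step, the indices of the video frames it covers.

    Assumption: the latent sequence is formed by chunking frames into
    consecutive groups of `factor`; the last group may be shorter.
    """
    if num_frames <= 0 or factor <= 0:
        raise ValueError("num_frames and factor must be positive")
    return [
        list(range(start, min(start + factor, num_frames)))
        for start in range(0, num_frames, factor)
    ]


def align_conditions(per_frame_conditions: list, factor: int) -> list[list]:
    """Group per-frame conditions so that there is one group per latent step."""
    groups = group_frames_for_latents(len(per_frame_conditions), factor)
    return [[per_frame_conditions[i] for i in group] for group in groups]


if __name__ == "__main__":
    # Nine frames of (hypothetical) per-frame box descriptors, compression factor 4.
    boxes = [f"boxes_t{t}" for t in range(9)]
    for step, group in enumerate(align_conditions(boxes, factor=4)):
        print(step, group)
```

A purely spatial control strategy would attach one condition to one latent frame; with temporal compression, each latent step instead needs an encoding of a short window of conditions, which is the role the caption assigns to spatial-temporal encoding.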
Left/Up: MagicDriveDiT adopts progressive bootstrap training, starting from low-resolution images and advancing to high-resolution long videos. Right/Down: In each stage, MagicDriveDiT trains on mixed data.
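The progressive, mixed-data schedule described in this caption can be sketched as follows. This is a minimal illustration, not the authors' training code: the stage names, the resolutions, the frame counts, and the sampling weights are all placeholder assumptions chosen only to show the structure (stages ordered from easy to hard, each stage sampling from a mixture of clip shapes).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    # Maps (height, width, num_frames) to a sampling weight.
    # All shapes and weights here are illustrative placeholders.
    mix: dict


STAGES = [
    Stage("low-res images", {(256, 448, 1): 1.0}),
    Stage("high-res short videos", {(512, 896, 1): 0.2, (512, 896, 16): 0.8}),
    Stage("high-res long videos", {(512, 896, 16): 0.3, (1024, 1792, 128): 0.7}),
]


def sample_shape(stage: Stage, rng: random.Random) -> tuple:
    """Draw one training-clip shape from the stage's data mixture."""
    shapes = list(stage.mix)
    weights = [stage.mix[shape] for shape in shapes]
    return rng.choices(shapes, weights=weights, k=1)[0]


def schedule(stages, steps_per_stage: int, seed: int = 0):
    """Yield (stage name, clip shape) for each training step, stage by stage."""
    rng = random.Random(seed)
    for stage in stages:
        for _ in range(steps_per_stage):
            yield stage.name, sample_shape(stage, rng)


if __name__ == "__main__":
    for name, shape in schedule(STAGES, steps_per_stage=3):
        print(f"{name}: {shape}")
```

Keeping some shorter or lower-resolution clips in the later stages is one plausible reading of "mixed data"; the actual mixture used by MagicDriveDiT is not specified in this caption.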
MagicDriveDiT provides accurate ego trajectory control in the 3D space, which supports generating various driving scenarios.
"yield the way"
"through the intersection"
"over a speed bump"
"change the lane"
Precise control over objects and road semantics is available with MagicDriveDiT. We show generations with the same configuration but different weather and time-of-day conditions.