MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

1The Chinese University of Hong Kong, 2Hong Kong University of Science and Technology, 3Huawei Cloud, 4Huawei Noah's Ark Lab

Take a quick look at MagicDriveDiT through the video, best viewed on a large screen with music!

Abstract

The rapid advancement of diffusion models has greatly improved video synthesis, especially controllable video generation, which is essential for applications like autonomous driving. However, existing methods are limited in scalability and in how control conditions are integrated, so they fail to meet the need for high-resolution, long videos in autonomous driving applications. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture, to tackle these challenges. Our method enhances scalability through flow matching and employs a progressive training strategy to manage complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal control, expanding its potential applications across various tasks in autonomous driving.

All generated videos on this page are 6×848×1600, 20 s long at 12 FPS, and produced in a single inference pass!
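To give a sense of scale, the short calculation below works out roughly how many frames and pixels one such video contains. It assumes that the leading 6 is the number of camera views and that the frame count is simply duration times frame rate; the exact count may differ by one depending on how the first frame is counted.

```python
# Figures taken from the page; reading 6 as the number of camera views is an assumption.
views, height, width = 6, 848, 1600
duration_s, fps = 20, 12

# Approximate frame count per view (duration x frame rate).
frames_per_view = duration_s * fps

# Pixels in one multi-view frame set, and in the whole video.
pixels_per_frame_set = views * height * width
total_pixels = pixels_per_frame_set * frames_per_view

print(frames_per_view, pixels_per_frame_set, total_pixels)
```

Under these assumptions a single video is about 240 frames per view and close to two billion pixels in total.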



Architecture: Multi-View DiT for Scalable Video Generation

architecture of MagicDriveDiT

MagicDriveDiT adopts a newly proposed multi-view DiT architecture for multi-view street video generation. Together with the control blocks, MagicDriveDiT supports multiple control signals for fine-grained geometric control.

Control: Spatial-Temporal Encoding for Conditions

MagicDriveDiT encoding
MagicDriveDiT encoding

Left/Up: The spatial-temporal latents produced by the 3D VAE are incompatible with previous control strategies, which were designed for spatial latents. Right/Down: MagicDriveDiT adopts spatial-temporal encoding for each condition to correctly control video generation.
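The alignment problem described above can be illustrated with a toy example: a 3D VAE compresses the time axis, so T input frames yield fewer latent steps, and per-frame conditions have to be brought to that same temporal length. The sketch below is not MagicDriveDiT's encoder, which is a learned module. It assumes a hypothetical temporal compression factor of 4 and uses plain average pooling, and the function name `temporal_pool` is made up for illustration.

```python
from typing import List


def temporal_pool(frame_embeddings: List[List[float]], factor: int = 4) -> List[List[float]]:
    """Average per-frame condition embeddings over windows of `factor` frames.

    Hypothetical stand-in for a learned spatial-temporal condition encoder.
    It only shows how T per-frame conditions can be reduced to the temporal
    length of the latents, ceil(T / factor).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frame_embeddings), factor):
        window = frame_embeddings[start:start + factor]
        dim = len(window[0])
        # Mean over the frames in this window, one value per embedding dimension.
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Eight frames of 2-D toy embeddings are reduced to two latent-aligned embeddings.
frames = [[float(t), 2.0 * t] for t in range(8)]
aligned = temporal_pool(frames, factor=4)
print(len(aligned), aligned)
```

A learned encoder can weight frames unevenly and mix spatial information as well; the pooling here only makes the temporal length mismatch concrete.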

Training: Progressive Training with Mixed Data Configurations

MagicDriveDiT stages
MagicDriveDiT data

Left/Up: MagicDriveDiT adopts progressive bootstrap training, starting from low-resolution images and moving to high-resolution long videos. Right/Down: At each stage, MagicDriveDiT trains on a mix of data configurations.
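One way to picture progressive training with mixed data is as a schedule of stages, each with its own weighted set of (frames, height, width) configurations from which training batches are drawn. The sketch below assumes this reading. The stage names, resolutions other than 848×1600, frame counts, and sampling weights are all hypothetical placeholders, not the values used to train MagicDriveDiT.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical schedule: each stage lists (frames, height, width, sampling weight).
# None of these numbers are the authors' settings.
Config = Tuple[int, int, int, float]
STAGES: Dict[str, List[Config]] = {
    "stage1_low_res_images": [(1, 224, 400, 1.0)],
    "stage2_low_res_short_videos": [(1, 224, 400, 0.2), (16, 224, 400, 0.8)],
    "stage3_high_res_long_videos": [
        (1, 848, 1600, 0.1),
        (16, 424, 800, 0.3),
        (240, 848, 1600, 0.6),
    ],
}


def sample_config(stage: str, rng: random.Random) -> Tuple[int, int, int]:
    """Draw one (frames, height, width) configuration for a batch in `stage`."""
    configs = STAGES[stage]
    weights = [c[3] for c in configs]
    frames, height, width, _ = rng.choices(configs, weights=weights, k=1)[0]
    return frames, height, width


rng = random.Random(0)  # fixed seed so the draws are reproducible
for stage in STAGES:
    print(stage, [sample_config(stage, rng) for _ in range(3)])
```

Keeping some shorter or lower-resolution configurations in later stages is one plausible motivation for mixing: the model continues to see the cheaper cases while it learns the expensive ones.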

Various Driving Scenarios

MagicDriveDiT provides accurate ego-trajectory control in 3D space, which supports generating a variety of driving scenarios.

More Video Examples

MagicDriveDiT offers precise control over objects and road semantics. We show generations with the same configurations but different weather and time-of-day conditions.