Recent advancements in diffusion models have shown remarkable promise in data synthesis, enhancing a wide range of 2D perception tasks. However, achieving precise control in street view generation for 3D perception tasks remains a formidable challenge. Specifically, when Bird's-Eye View (BEV) is adopted as the sole condition for street view generation, height control, which is indispensable for accurately representing object dimensions, occlusion patterns, and road surface elevations, becomes the major hurdle, particularly for 3D object detection. In this paper, we introduce MagicDrive, a novel street view generation framework that incorporates diverse 3D geometry controls, including camera poses, road maps, 3D bounding boxes, and textual descriptions, by employing customized encoding strategies. A cross-view attention module is further introduced to ensure consistency across the multiple camera views. The versatility of MagicDrive enables high-quality street-view data synthesis that accurately reflects diverse 3D geometry controls, benefiting 3D perception tasks such as BEV segmentation and 3D object detection.
Given a set of geometric conditions, MagicDrive can generate an unlimited number of diverse street views. Here we randomly sample 10 initial noise tensors and perform Spherical Linear Interpolation (Slerp) between them, resulting in 100 noise tensors for generation (similar to Figure 6 in the DDIM paper).
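The noise interpolation described above can be sketched as follows. This is a minimal illustration and not the authors' code: the latent shape, the function names `slerp` and `interpolate_noises`, and the choice to interpolate between consecutive anchors with wrap-around (10 anchors × 10 steps = 100 noise tensors) are all assumptions made for the example.

```python
import numpy as np

def slerp(t, a, b, eps=1e-7):
    """Spherical linear interpolation between two noise tensors a and b."""
    a_flat, b_flat = a.ravel(), b.ravel()
    # Angle between the two tensors, treated as flat vectors.
    cos_theta = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat)
    )
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / sin_theta

def interpolate_noises(anchors, steps_per_pair=10):
    """Interpolate between consecutive anchors (assumed scheme, with wrap-around)."""
    n = len(anchors)
    out = []
    for i in range(n):
        a, b = anchors[i], anchors[(i + 1) % n]
        for k in range(steps_per_pair):
            out.append(slerp(k / steps_per_pair, a, b))
    return np.stack(out)

rng = np.random.default_rng(0)
# Hypothetical latent shape (channels, height, width); the real shape may differ.
anchors = rng.standard_normal((10, 4, 28, 50))
noises = interpolate_noises(anchors)  # 100 noise tensors
```

Slerp is commonly preferred over plain linear interpolation for Gaussian noise because it roughly preserves the norm of the interpolated tensor, which keeps intermediate samples closer to what the diffusion model saw during training.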
Using MagicDrive, one can generate diverse street-view images even for similar scene annotations. We show generations conditioned on continuous annotation sequences.
scene-0012: see road changes with the ego car's direction
scene-0105: see object distance change
scene-0103: see various semantics
MagicDrive controls the position of objects precisely, while keeping other objects unchanged. Drag the slider to see how the vehicle (in the bounding box) moves from left to right.
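As a rough illustration of how such a sequence of conditions might be prepared, the sketch below shifts one 3D bounding box along a lateral axis while leaving everything else untouched. The corner-based box representation, the function names `shift_box` and `sweep_box`, and the choice of axis 1 as the lateral direction are assumptions for this example, not details taken from MagicDrive.

```python
import numpy as np

def shift_box(corners, offset, axis=1):
    """Return a copy of an (8, 3) corner array translated by `offset` along `axis`."""
    shifted = np.array(corners, dtype=float, copy=True)
    shifted[:, axis] += offset
    return shifted

def sweep_box(corners, start, end, num_steps, axis=1):
    """Generate boxes whose offsets move linearly from `start` to `end`."""
    return [shift_box(corners, o, axis) for o in np.linspace(start, end, num_steps)]

# Hypothetical unit cube standing in for a vehicle's 3D bounding box.
box = np.array(
    [[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
)
frames = sweep_box(box, start=-2.0, end=2.0, num_steps=5)
```

Each entry of `frames` would then replace the original box in the condition set for one generation, with all other boxes, the map, and the camera poses held fixed.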
MagicDrive takes control signals from the road BEV map, object bounding boxes, camera poses, and a textual description.
Images generated by MagicDrive can be used for data augmentation, supporting both BEV segmentation and 3D object detection tasks.