Authors:
Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
Paper:
https://arxiv.org/abs/2408.10605
Introduction
In recent years, text-to-image generation has advanced rapidly, with models such as Stable Diffusion and DALL-E pushing the boundaries of what is possible. However, these models often struggle to generate images containing multiple objects with complex 3D spatial relationships, especially when precise control over 3D attributes such as object orientation, inter-object spatial relationships, and camera view is required. To address this challenge, the authors introduce MUSES, a novel AI system for 3D-controllable image generation from user queries. MUSES employs a multi-modal agent collaboration approach that mimics the workflow of human professionals to achieve precise 3D control in image generation.
Related Work
Controllable Image Generation
Before the advent of diffusion models, GAN-based methods like ControlGAN and AttnGAN incorporated text features via attention modules to guide image generation. With the rise of diffusion models, the Stable Diffusion series quickly became dominant in text-to-image generation. However, text-based control alone proved insufficient for precise, fine-grained image generation, so models like ControlNet, GLIGEN, and T2I-Adapter introduced additional control conditions such as depth maps and sketches. Despite these advances, existing methods still struggle to control the 3D properties of objects. MUSES takes a different approach: it plans 3D layouts and incorporates 3D models and simulation to achieve 3D-controllable image generation.
LLM-Based Agents
Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing, while Multimodal LLMs (MLLMs) like LLaVA and InternVL have enabled impressive performance on visual tasks. The combination of LLMs and MLLMs in multi-agent systems has achieved remarkable success across various domains, including visual understanding, gaming, software development, video generation, and autonomous driving. Unlike previous works, MUSES uses LLMs to plan 3D layouts, bridging the gap between linguistic understanding and 3D spatial reasoning, particularly in complex 3D scenes.
Research Methodology
Overview of MUSES
MUSES is a generic AI system with a distinct multi-modal agent collaboration pipeline. It comprises three collaborative agents:
- Layout Manager: Responsible for 2D-to-3D layout lifting.
- Model Engineer: Handles 3D object acquisition and calibration.
- Image Artist: Manages 3D-to-2D image rendering.
Layout Manager: 2D-to-3D Layout Lifting
The Layout Manager employs a Large Language Model (LLM) to plan a 3D layout from the user query in two steps (a code sketch follows the list):
- 2D Layout Planning via In-Context Learning: The LLM generates a 2D layout through in-context learning, using examples from a 2D layout shop.
- 3D Layout Planning via Chain-of-Thought Reasoning: The LLM lifts the 2D layout to 3D space by determining object depth, orientation, and camera view through step-by-step reasoning.
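The lifting step can be pictured as two LLM calls. Below is a minimal sketch; the layout schema, prompt wording, and the `llm_generate` callable are illustrative assumptions rather than the paper's exact interfaces.

```python
# Minimal sketch of the Layout Manager's two-step planning.
# The layout schema, prompt wording, and `llm_generate` callable are
# illustrative assumptions, not the paper's exact interfaces.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Object3DLayout:
    """Illustrative target schema for one entry of the lifted 3D layout."""
    name: str                 # object category, e.g. "red car"
    box_2d: tuple             # (x, y, w, h) in normalized image coordinates
    depth: float              # relative distance from the camera
    orientation: str          # e.g. "facing left", "facing the camera"


def plan_3d_layout(query: str,
                   shop_examples: List[str],
                   llm_generate: Callable[[str], str]) -> str:
    # Step 1: 2D layout planning via in-context learning, conditioned on
    # examples retrieved from the 2D layout shop.
    prompt_2d = "\n\n".join(shop_examples) + f"\n\nQuery: {query}\n2D layout:"
    layout_2d = llm_generate(prompt_2d)

    # Step 2: 2D-to-3D lifting via chain-of-thought reasoning over object
    # depth, object orientation, and camera view.
    prompt_3d = (
        f"Query: {query}\n2D layout: {layout_2d}\n"
        "Let's think step by step: assign each object a depth and an "
        "orientation, then choose a camera view, and output the 3D layout."
    )
    return llm_generate(prompt_3d)
```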
Model Engineer: 3D Object Acquisition and Calibration
The Model Engineer comprises two key roles (sketched in code after the list):
- Model Retriever: Acquires a 3D model for each object via a decision tree that first searches a self-collected 3D model shop, then falls back to online search, and finally to text-to-3D generation.
- Model Aligner: Calibrates the orientations of acquired 3D models to face the camera using a fine-tuned CLIP as a face-camera classifier.
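A rough sketch of both roles is shown below. The three retrieval callables are hypothetical stand-ins for the shop lookup, web search, and text-to-3D components, and the zero-shot CLIP prompt pair only approximates the fine-tuned face-camera classifier described in the paper.

```python
# Sketch of the Model Engineer: a retrieval fallback cascade plus a
# CLIP-based face-camera check. The retrieval callables are hypothetical
# stand-ins; the paper fine-tunes CLIP as a classifier, whereas this uses
# zero-shot prompt scoring as an approximation.
from typing import Callable, List, Optional

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def acquire_model(name: str,
                  search_shop: Callable[[str], Optional[str]],
                  search_online: Callable[[str], Optional[str]],
                  text_to_3d: Callable[[str], str]) -> str:
    """Decision tree: local 3D model shop first, then online search,
    then text-to-3D generation as the last resort."""
    return search_shop(name) or search_online(name) or text_to_3d(name)


def face_camera_score(view: Image.Image, name: str) -> float:
    """Probability that a rendered view shows the object facing the camera."""
    texts = [f"a {name} facing the camera",
             f"a {name} facing away from the camera"]
    inputs = clip_proc(text=texts, images=view, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()


def align_to_camera(views: List[Image.Image], name: str) -> int:
    """Pick the rendered view (rotation) that best faces the camera."""
    scores = [face_camera_score(v, name) for v in views]
    return max(range(len(views)), key=lambda i: scores[i])
```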
Image Artist: 3D-to-2D Image Rendering
The Image Artist assembles the aligned 3D objects into a complete scene in Blender and produces the final high-quality 2D image in three steps (sketched below):
- 3D Scene Composition: Assembling 3D object models according to the 3D layout.
- Condition Image Generation: Creating depth maps and canny edge images for fine-grained control.
- Final Image Generation: Using ControlNet to generate the final image based on the 3D-to-2D condition images.
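To illustrate the last two steps, the sketch below derives a canny-edge condition from a Blender render of the composed scene and feeds it to a ControlNet pipeline. It uses the public SD 1.5 canny ControlNet from diffusers as a stand-in for the SD 3 ControlNet the paper uses, and the prompt and file names are placeholders.

```python
# Sketch of condition-image generation and ControlNet rendering, using the
# public SD 1.5 canny ControlNet as a stand-in for the paper's SD 3 ControlNet.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline


def canny_condition(rendered: Image.Image) -> Image.Image:
    """Turn a Blender render of the composed 3D scene into a canny edge map."""
    edges = cv2.Canny(np.array(rendered.convert("RGB")), 100, 200)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))


controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

rendered = Image.open("scene_render.png")   # placeholder: Blender render of the 3D scene
final = pipe("a photo matching the user query",
             image=canny_condition(rendered)).images[0]
final.save("final.png")
```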
Experimental Design
Datasets and Metrics
The experiments were conducted on two datasets:
- T2I-CompBench: Evaluates object count and spatial relationships but lacks detailed text prompts for object orientations and camera views.
- T2I-3DisBench: A newly introduced dataset with 50 detailed texts encapsulating complex 3D information, including object orientations and camera views.
Both automatic and user evaluations were conducted on T2I-3DisBench. For the automatic evaluation, InternVL was used as a visual question answering (VQA) judge to rate the generated images on four dimensions: object count, object orientation, 3D spatial relationship, and camera view.
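For concreteness, a model-agnostic sketch of that scoring loop is shown below; the `vqa` callable stands in for InternVL, and the question wording and 0-to-1 scale are illustrative rather than the benchmark's exact protocol.

```python
# Sketch of the automatic VQA-based evaluation on T2I-3DisBench.
# The `vqa` callable stands in for InternVL; prompts are illustrative.
from typing import Callable, Dict
from PIL import Image

DIMENSIONS = ["object count", "object orientation",
              "3D spatial relationship", "camera view"]


def rate_image(image: Image.Image, prompt: str,
               vqa: Callable[[Image.Image, str], str]) -> Dict[str, float]:
    scores = {}
    for dim in DIMENSIONS:
        question = (f'On a scale of 0 to 1, how well does this image match the '
                    f'{dim} described in: "{prompt}"? Answer with a number only.')
        scores[dim] = float(vqa(image, question).strip())
    return scores
```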
Implementation Details
MUSES is designed to be modular and extensible, allowing different LLMs, CLIP backbones, and ControlNets to be plugged in. The experiments were conducted on Ubuntu 20.04 with 8 NVIDIA RTX 3090 GPUs, using Llama-3-8B, CLIP ViT-L/14 and ViT-B/32, and SD 3 ControlNet.
Results and Analysis
SOTA Comparison on T2I-CompBench
MUSES consistently outperforms both specialized/multi-agent approaches and generic models across all metrics, including object count, relationships, and attribute binding. Its 3D layout planning and 3D-to-2D condition images improve the handling of object relationships, yielding the best performance on the spatial metrics.
SOTA Comparison on T2I-3DisBench
MUSES also outperforms other methods on T2I-3DisBench, both in automatic and user evaluations. Existing approaches struggle with complex prompts containing 3D information, highlighting the importance of MUSES’ 3D-integration design. User evaluations show a strong preference for MUSES, demonstrating its effectiveness in handling complex 3D scenes.
Ablation Studies
Ablation studies reveal that each component of MUSES is crucial for 3D-controllable image generation. Removing any component degrades the results, with the most significant impact observed when the Model Engineer is removed, resulting in poor object shaping and orientation.
Overall Conclusion
MUSES represents a significant advancement in 3D-controllable image generation, achieving fine-grained control over 3D object properties and camera views. The introduction of T2I-3DisBench provides a comprehensive benchmark for evaluating complex 3D image scenes. Future work will focus on improving efficiency, expanding capabilities to control lighting conditions, and potentially extending to video generation.
Acknowledgments
The authors acknowledge the teams at OpenAI, Meta AI, and InstantX for their open-source models, and CGTrader for supporting free downloads of 3D models. They also appreciate the valuable insights from researchers at the Shenzhen Institute of Advanced Technology and the Shanghai AI Laboratory.