Creating realistic 3D models for applications such as virtual reality, filmmaking, and engineering design has traditionally been a daunting challenge, often involving extensive manual trial and error. In contrast, generative artificial intelligence (AI) models specializing in 2D images have simplified artistic processes, allowing creators to generate lifelike visuals from simple text prompts. However, these models are not optimized for crafting 3D shapes.
To address this gap, a technique known as Score Distillation has emerged, which leverages 2D image-generation models to produce 3D shapes. Despite its promise, the results frequently lack clarity and realism, appearing blurry or cartoonish. Researchers from MIT compared the algorithms that generate 2D images with those that generate 3D shapes and pinpointed the underlying causes of the quality gap. Building on these insights, they developed a straightforward enhancement to Score Distillation that produces sharp, high-quality 3D shapes approaching the quality of the best 2D AI-generated images.
Various methods aim to rectify the challenges associated with generating 3D shapes, often involving the retraining or fine-tuning of generative AI models. However, this process can be both costly and time-consuming. The technique devised by the MIT researchers provides a solution that matches or even surpasses the quality of 3D shapes produced by these more complex methods, without requiring additional training or complicated post-processing. By identifying the root causes of the previously encountered issues, the researchers have advanced the mathematical understanding of Score Distillation and similar techniques. This deeper comprehension paves the way for further enhancements in future models. As Artem Lukoianov, an electrical engineering and computer science (EECS) graduate student and lead author of the research, states, “Now we know where we should be heading, which allows us to find more efficient solutions that are faster and higher-quality.” Ultimately, their work aims to facilitate the design process, acting as a co-pilot for creators seeking to generate more realistic 3D shapes.
Lukoianov collaborated with several co-authors, including Haitz Sáez de Ocáriz Borde from Oxford University, Kristjan Greenewald from the MIT-IBM Watson AI Lab, Vitor Campagnolo Guizilini from the Toyota Research Institute, Timur Bagautdinov from Meta, and senior authors Vincent Sitzmann and Justin Solomon, both from MIT’s CSAIL. Their research is set to be presented at the Conference on Neural Information Processing Systems.
Diffusion models, such as DALL-E, are a leading type of generative AI: they learn to create realistic images from random noise. Researchers train these models by adding noise to existing images and teaching the model to reverse the process, denoising the images according to a user's prompt. However, diffusion models struggle to generate realistic 3D shapes directly, because far less 3D training data is available. To bridge the gap from 2D images to 3D shapes, a technique called Score Distillation Sampling (SDS) was introduced in 2022; it uses a pretrained 2D diffusion model to combine 2D views into a cohesive 3D representation.
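To make the noising-and-denoising idea concrete, here is a minimal sketch of one diffusion training step in the standard epsilon-prediction formulation; the model interface, noise schedule, and variable names are generic placeholders, not the internals of DALL-E or any specific system.

```python
# Minimal sketch of one diffusion-model training step ("add noise, learn to remove it").
# `model` is any callable that predicts the added noise from a noisy image,
# a timestep, and a text embedding; names here are illustrative placeholders.
import torch

TIMESTEPS = 1000
betas = torch.linspace(1e-4, 0.02, TIMESTEPS)           # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal retention

def training_step(model, images, text_embeddings):
    """One denoising-training step on a batch of clean images."""
    b = images.shape[0]
    t = torch.randint(0, TIMESTEPS, (b,))               # random timestep per image
    noise = torch.randn_like(images)                    # Gaussian noise to add
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a.sqrt() * images + (1 - a).sqrt() * noise  # forward (noising) process
    pred_noise = model(noisy, t, text_embeddings)       # model predicts the added noise
    return torch.nn.functional.mse_loss(pred_noise, noise)
```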
The SDS technique begins with a random 3D representation and generates a 2D view of an object from a random camera angle. It adds noise to this image, denoises it using the diffusion model, and optimizes the random 3D representation to align it with the denoised image. This cycle is repeated until the desired 3D object is achieved. However, the resulting 3D shapes typically suffer from a lack of clarity, often appearing blurry or overly saturated. Lukoianov notes, “This has been a bottleneck for a while. We know the underlying model is capable of doing better, but people didn’t know why this is happening with 3D shapes.”
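In code, that loop looks roughly like the sketch below. The renderer, diffusion model, camera sampler, and 3D parameters are all hypothetical stand-ins (for example, a differentiable NeRF or mesh renderer and a pretrained text-to-image denoiser); the point is the structure of the loop, including the randomly sampled noise discussed next.

```python
# Rough sketch of one SDS optimization step, following the loop described above.
# `render`, `diffusion`, `sample_camera`, and `shape_params` are hypothetical
# stand-ins for a differentiable renderer, a pretrained 2D denoiser, a random
# camera sampler, and the 3D representation being optimized.
import torch

def sds_step(shape_params, optimizer, render, diffusion, sample_camera,
             prompt_emb, alphas_cumprod, num_timesteps=1000):
    camera = sample_camera()                             # random viewpoint
    image = render(shape_params, camera)                 # 2D view of the current 3D shape
    t = torch.randint(20, num_timesteps, (1,)).item()    # random noise level
    noise = torch.randn_like(image)                      # freshly drawn Gaussian noise
    a = alphas_cumprod[t]
    noisy = a.sqrt() * image + (1 - a).sqrt() * noise    # add noise to the rendering
    with torch.no_grad():
        pred_noise = diffusion(noisy, t, prompt_emb)     # denoiser's estimate of the noise
    grad = pred_noise - noise                            # SDS direction for this view
    loss = (grad * image).sum()                          # surrogate loss whose gradient
    optimizer.zero_grad()                                # w.r.t. the rendering is `grad`
    loss.backward()
    optimizer.step()
```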
The MIT team delved into the specifics of the SDS method and discovered a key mismatch between a formula essential to the process and its counterpart in 2D diffusion models. This formula dictates how the model updates the random representation by incrementally adding and removing noise to refine it. Because part of the formula is too complex to evaluate exactly, SDS substitutes randomly sampled noise at each step, and the researchers found this substitution to be the cause of the blurry, cartoonish 3D shapes.
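For reference, the widely cited published form of the SDS update (DreamFusion-style notation, not a reproduction of the MIT team's exact derivation) makes the role of that randomly sampled noise explicit:

```latex
% x = g(theta) is the rendered view, epsilon_phi is the pretrained denoiser,
% and epsilon is the Gaussian noise drawn afresh at every step.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
      \big(\epsilon_\phi(x_t;\, y, t) - \epsilon\big)\,
      \frac{\partial x}{\partial \theta} \,\right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad x = g(\theta).
```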
Instead of attempting to solve the cumbersome formula with complete precision, the researchers experimented with approximation techniques until they found the most effective one. Their approach involves deriving the missing term from the current rendering of the 3D shape, rather than relying on random sampling. This method yields sharper and more realistic 3D shapes, an outcome that aligns with their analytical predictions.
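One way to read "deriving the missing term from the current rendering" is to infer the noise by running the pretrained denoiser backwards on the rendered view (a DDIM-style inversion) rather than drawing it at random. The sketch below illustrates that idea under that assumption; it is not the authors' exact algorithm, and all names are hypothetical.

```python
# Hedged sketch: infer the noise term from the current rendering by stepping
# the pretrained denoiser "backwards" (DDIM-style inversion), instead of
# sampling fresh random noise as standard SDS does.
import torch

@torch.no_grad()
def infer_noise_from_rendering(image, diffusion, prompt_emb,
                               alphas_cumprod, target_t, n_steps=10):
    """Walk the clean rendering up to noise level `target_t`, reusing the
    model's own noise predictions rather than random draws."""
    x = image
    ts = torch.linspace(0, target_t, n_steps + 1).long()
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        a_prev, a_next = alphas_cumprod[t_prev], alphas_cumprod[t_next]
        eps = diffusion(x, t_prev, prompt_emb)                 # model's noise estimate
        x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()   # implied clean image
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps     # step up the noise ladder
    a_t = alphas_cumprod[target_t]
    return (x - a_t.sqrt() * image) / (1 - a_t).sqrt()         # effective noise at target_t
```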
The researchers also enhanced the quality of the generated 3D shapes by increasing image resolution and fine-tuning certain model parameters. Ultimately, they demonstrated that an off-the-shelf, pretrained image diffusion model can be used to create smooth, realistic 3D shapes without expensive retraining. The resulting 3D objects are comparable in sharpness to those produced by other, more complex methods.
However, it’s essential to recognize that because their method builds on a pretrained diffusion model, it is susceptible to the biases and limitations inherent in that model, which can result in inaccuracies and hallucinations. Improving the foundational diffusion model could, therefore, significantly enhance the researchers’ results. Alongside examining the formulas for future improvements, the team is also interested in how their findings could inform advancements in image editing techniques.
The work is supported, in part, by the Toyota Research Institute, the U.S. National Science Foundation, and the Amazon Science Hub, among other organizations, and marks a step toward better 3D modeling and rendering capabilities in generative AI.