In today’s AI landscape, sequence models have gained significant traction for their ability to analyze complex data and predict what comes next. A prime example is next-token prediction models such as ChatGPT, which generate coherent responses to user queries by forecasting one word of a sequence at a time. At the other end of the spectrum are full-sequence diffusion models like Sora, which transform textual input into strikingly realistic visuals by iteratively removing noise from an entire video sequence. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a straightforward modification to the diffusion training scheme that makes sequence denoising considerably more flexible.
When next-token and full-sequence diffusion models are applied to areas like computer vision and robotics, they show complementary strengths and weaknesses. Next-token models can generate sequences of varying lengths, but they make these predictions without awareness of desirable states in the far future: they cannot steer a sequence toward a goal that lies many tokens away, and so require additional mechanisms for long-horizon planning. Conversely, while diffusion models can perform such future-conditioned sampling, they lack the ability of next-token models to generate sequences of variable length.
To leverage the advantages of both approaches, the CSAIL team developed a novel training method known as “Diffusion Forcing.” The term is inspired by “Teacher Forcing,” a traditional training approach that breaks down the generation of full sequences into smaller, manageable steps of next-token generation, similar to how a good instructor simplifies complex subjects.
Diffusion Forcing finds common ground between diffusion models and teacher forcing: both rely on training schemes that predict masked (i.e., noisy) tokens from unmasked ones. In diffusion models, noise is added to the data gradually, which can be viewed as a form of fractional masking. The Diffusion Forcing method trains a neural network to clean up a collection of tokens, removing a different amount of noise from each one while simultaneously predicting the next few tokens. The result is a flexible, reliable sequence model that produces higher-quality synthetic videos and more accurate decision-making in robotic systems.
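To make that training scheme concrete, here is a minimal, hypothetical PyTorch sketch of per-token noising. It is not the authors’ code: the `DenoiserRNN` model, the linear noise schedule, and the random stand-in data are illustrative assumptions. The only point it demonstrates is that each token in a sequence is corrupted with its own independently sampled noise level before the network is asked to reconstruct the clean sequence.

```python
# Minimal sketch of per-token noising during training (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 16   # sequence length (tokens, e.g., video frames)
D = 32   # token dimensionality
K = 100  # number of diffusion noise levels

# Linear noise schedule: alpha_bar[k] shrinks as the noise level k grows.
betas = torch.linspace(1e-4, 0.02, K)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class DenoiserRNN(nn.Module):
    """Causal sequence denoiser: predicts the clean token from the noisy token,
    its noise level, and the history of earlier (possibly noisy) tokens."""
    def __init__(self, d, k):
        super().__init__()
        self.level_emb = nn.Embedding(k, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, d)

    def forward(self, noisy_tokens, levels):
        h, _ = self.rnn(noisy_tokens + self.level_emb(levels))
        return self.out(h)

model = DenoiserRNN(D, K)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    clean = torch.randn(8, T, D)            # stand-in for real sequences
    levels = torch.randint(0, K, (8, T))    # independent noise level for every token
    a = alpha_bar[levels].unsqueeze(-1)     # per-token signal fraction
    noisy = a.sqrt() * clean + (1 - a).sqrt() * torch.randn_like(clean)
    pred = model(noisy, levels)             # denoise all tokens in one pass
    loss = F.mse_loss(pred, clean)          # reconstruction objective
    opt.zero_grad(); loss.backward(); opt.step()
```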
By effectively filtering through noisy data and projecting potential future actions, Diffusion Forcing equips robots with the capability to focus on crucial visual data, thereby ignoring distractions while executing tasks. This methodology not only generates stable and coherent video sequences but can also guide AI agents through complex digital environments. The potential applications are profound, ranging from household robots that can adapt to new tasks to AI-generated entertainment that captivates audiences.
“Sequence models aim to condition on known past information and predict unknown future outcomes, employing a form of binary masking,” explains Boyuan Chen, lead author and PhD student in electrical engineering and computer science (EECS) at MIT. “However, this masking doesn’t always have to be binary. With Diffusion Forcing, we introduce various levels of noise to each token, essentially acting as a form of fractional masking. During testing, our model is capable of ‘unmasking’ a set of tokens and diffusing a sequence into the near future at a reduced noise level. It learns to differentiate between reliable information and irrelevant data, thereby addressing out-of-distribution inputs more effectively.”
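The “unmasking” described in the quote can be pictured as a rollout in which every future token carries its own noise level that is gradually lowered to zero, with near-future tokens kept cleaner than far-future ones. The sketch below illustrates that idea only; the pyramid-style schedule and the `denoiser` interface are assumptions made for illustration, not the paper’s sampling algorithm.

```python
# Conceptual sketch of a per-token "fractional masking" rollout (illustrative only).
import torch

K = 100  # number of noise levels; level 0 means fully "unmasked" (clean)

def pyramid_schedule(num_future: int) -> torch.Tensor:
    """Per-token noise levels for each sampling step: every token gets cleaner
    as sampling proceeds, and far-future tokens stay noisier than near ones."""
    slope = K // num_future
    num_steps = K + (num_future - 1) * slope         # enough steps to clean every token
    steps = torch.arange(num_steps).unsqueeze(1)     # (num_steps, 1)
    offsets = torch.arange(num_future).unsqueeze(0)  # (1, num_future)
    levels = (K - 1) - steps + offsets * slope       # farther future = noisier
    return levels.clamp(0, K - 1)                    # (num_steps, num_future)

def rollout(denoiser, num_future=8, d=32):
    """Generate `num_future` tokens by repeatedly denoising under the schedule."""
    tokens = torch.randn(1, num_future, d)           # start from pure noise
    for levels in pyramid_schedule(num_future):      # one denoising pass per step
        tokens = denoiser(tokens, levels.unsqueeze(0))
    return tokens

# Usage with an identity stand-in for a trained denoiser, just to show the interface:
future = rollout(lambda x, levels: x)
print(future.shape)  # torch.Size([1, 8, 32])
```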
The performance of Diffusion Forcing has been tested in a range of experiments, demonstrating its ability to ignore misleading information while completing tasks and anticipating future actions. For instance, when integrated into a robotic arm, the technique enabled the device to swap two toy fruits across three circular mats, a simple instance of the broader family of long-horizon tasks that require memory. The researchers trained the robot by teleoperating it in virtual reality, teaching it to mimic the user’s movements as observed through its camera. Even when confronted with visual distractions, the robot was able to place the objects in their correct positions.
In exploring video generation, the research team trained Diffusion Forcing on gameplay data from “Minecraft” and vibrant digital environments created in Google DeepMind’s Lab Simulator. Given a single frame, their method produced more stable, higher-resolution videos than comparable baselines, including Sora-like full-sequence diffusion models and ChatGPT-like next-token models. The latter approaches sometimes failed to generate coherent video beyond 72 frames.
Beyond video generation, Diffusion Forcing can act as a motion planner that steers actions toward desired outcomes or rewards. Thanks to its flexibility, it can generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. In a 2D maze-solving task, Diffusion Forcing outperformed six baseline models by generating plans that reached the goal locations faster, suggesting it could serve as an effective planner for future robotic applications.
In various demonstrations, Diffusion Forcing showcased its capabilities as either a full sequence model, a next-token prediction model, or a combination of both. According to Chen, this adaptable approach could serve as a foundation for a “world model”—an AI system capable of simulating real-world dynamics by learning from billions of internet videos. This technology could empower robots to undertake novel tasks by predicting necessary actions based on their environment. For instance, a robot asked to open a door could generate a visual guide illustrating the process, even without prior training on that specific task.
The researchers aim to expand their approach to larger datasets and state-of-the-art transformer models to enhance performance further. Their vision includes building a ChatGPT-like model to enable robots to perform successfully in novel environments without the need for human demonstrations.
“With Diffusion Forcing, we bridge the gap between video generation and robotics,” states Vincent Sitzmann, senior author and MIT assistant professor leading the Scene Representation group at CSAIL. “Ultimately, we aspire to harness the vast knowledge contained in online videos to empower robots in their everyday tasks. There are numerous exciting research challenges ahead, especially in understanding how robots can learn to imitate humans based on observational learning, despite the differences in their physical forms.”
Alongside Chen and Sitzmann, the research team includes Diego Martí Monsó, visiting researcher; Yilun Du, EECS graduate student; Max Simchowitz, former postdoc and incoming assistant professor at Carnegie Mellon University; and Russ Tedrake, the Toyota Professor of EECS and a vice president of robotics research at the Toyota Research Institute. Their work is supported in part by the U.S. National Science Foundation, Singapore’s Defence Science and Technology Agency, the Intelligence Advanced Research Projects Activity through the U.S. Department of the Interior, and the Amazon Science Hub. They are set to present their findings at the NeurIPS conference in December.