In the iconic animated series “The Jetsons,” Rosie, the robotic housekeeper, effortlessly transitions between tasks like vacuuming, cooking, and taking out the trash. In contrast, creating robots capable of performing a broad spectrum of tasks in real life poses significant challenges. Engineers typically gather task-specific data to train these robots in controlled environments, which is often a tedious and costly process. This method results in robots that struggle to adapt to unfamiliar tasks or environments.
To address these limitations, researchers at MIT have developed an innovative technique that aggregates a vast array of heterogeneous data from various sources. This comprehensive approach enables the training of general-purpose robots, allowing them to learn a wide range of tasks more effectively. The researchers’ method involves aligning diverse data types—from simulations to real-world robot operations—into a unified “language” that a generative AI model can utilize. By merging such extensive datasets, this technique significantly reduces the need for task-specific data, allowing robots to learn multiple tasks without starting from square one each time.
This method offers a faster, more cost-effective alternative to traditional robot training, improving performance by more than 20 percent in both simulated and real-world tests compared with training from scratch. Lirui Wang, a graduate student in electrical engineering and computer science (EECS) and lead author of the paper describing the technique, emphasizes that the diversity of existing robot training data is itself a major obstacle. He notes, “In robotics, people often claim that we don’t have enough training data. But another major issue is the variety of sources and types of data.” The team’s research will be presented at the Conference on Neural Information Processing Systems.
The framework developed by the MIT team draws inspiration from large language models (LLMs) such as GPT-4. Typically, robotic policies, which map a robot’s sensor observations (e.g., camera images or the spatial orientation of a robotic arm) to actions, are trained with imitation learning: a human demonstrates a task, and the recorded demonstrations are fed into an AI model. However, this approach often falls short when the robot encounters novel tasks or environments.
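As a rough illustration of that imitation-learning setup, the sketch below trains a small policy network to reproduce demonstrated actions. It is a minimal, hypothetical example; the network shape, dimensions, and stand-in data are placeholders, not the researchers’ code:

```python
# Minimal behavior-cloning sketch (hypothetical; dimensions are placeholders).
# A policy maps a sensor observation to an action and is trained to
# reproduce the action a human demonstrated for that observation.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

policy = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),  # 64-dim observation (e.g., image features + joint angles)
    nn.Linear(256, 7),              # 7-dim action (e.g., joint velocity targets)
)

# Stand-in for recorded human demonstrations: (observation, expert action) pairs.
demos = TensorDataset(torch.randn(1024, 64), torch.randn(1024, 7))
loader = DataLoader(demos, batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
for obs, expert_action in loader:
    loss = nn.functional.mse_loss(policy(obs), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A policy trained this way only covers situations resembling its demonstrations, which is exactly the brittleness the MIT approach aims to overcome.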
In their work, Wang and his colleagues leveraged the pretraining approach used for large language models, which entails training a model on an extensive dataset before fine-tuning it on specific tasks. This allows the model to adapt and perform effectively across a range of activities. Wang explains, “In the language domain, the data consist solely of sentences. However, in robotics, where data exhibits significant diversity, we need a tailored architecture for pretraining.”
The complexity of robotic data is multifaceted, encompassing camera imagery, language instructions, and depth maps. Furthermore, individual robots vary greatly in mechanical configuration, with different numbers and arrangements of arms, grippers, and sensors, and the environments in which they operate can differ just as widely.
To address these challenges, the MIT researchers designed an architecture dubbed Heterogeneous Pretrained Transformers (HPT), which integrates data from these varied sources and modalities. At its core, HPT employs a transformer, the machine-learning model at the heart of large language models. Sensory inputs from vision and proprioception (feedback about the robot’s body position and motion) are aligned into a common format of “tokens” that the transformer can process. Because each input is represented by the same fixed number of tokens, the transformer unifies the data into one shared processing space, and it grows into a large pretrained model as it ingests more data.
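The following is a simplified sketch of that idea in PyTorch. The class names, layer sizes, and token counts are illustrative guesses, not the published HPT implementation: each modality gets a small encoder (a “stem”) that emits a fixed number of tokens, and a shared transformer trunk processes the concatenated token sequence:

```python
# Illustrative sketch of the stem/trunk idea (not the published HPT code):
# each modality gets a small encoder that emits a fixed number of tokens,
# and a shared transformer trunk processes the concatenated token sequence.
import torch
import torch.nn as nn

class TokenStem(nn.Module):
    """Projects one modality (vision features, proprioception, ...) into a
    fixed number of d_model-sized tokens, so all inputs look alike to the trunk."""
    def __init__(self, in_dim: int, n_tokens: int, d_model: int):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.proj = nn.Linear(in_dim, n_tokens * d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, in_dim) -> (batch, n_tokens, d_model)
        return self.proj(x).view(-1, self.n_tokens, self.d_model)

class HPTSketch(nn.Module):
    def __init__(self, d_model: int = 256, act_dim: int = 7):
        super().__init__()
        self.vision_stem = TokenStem(in_dim=512, n_tokens=16, d_model=d_model)
        self.proprio_stem = TokenStem(in_dim=32, n_tokens=16, d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=6)  # shared across robots
        self.head = nn.Linear(d_model, act_dim)                  # task/robot-specific

    def forward(self, vision_feat: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([self.vision_stem(vision_feat),
                            self.proprio_stem(proprio)], dim=1)
        pooled = self.trunk(tokens).mean(dim=1)  # shared processing space
        return self.head(pooled)                 # predicted action

model = HPTSketch()
action = model(torch.randn(2, 512), torch.randn(2, 32))  # -> shape (2, 7)
```

Because every stem emits the same fixed number of tokens, data from very different robots become interchangeable to the trunk, which is what lets a single model absorb heterogeneous sources.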
Importantly, users only need to supply HPT with minimal information regarding their robot’s design and the specific tasks to be accomplished. The transformer then transfers the knowledge gained from its pretraining phase to quickly adapt to new tasks.
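In code, that adaptation step might look like the following, continuing the sketch above. This is an assumption about the workflow, not the authors’ released API: the pretrained trunk is reused, while new stems and an action head sized for the target robot are trained on a small amount of task data.

```python
# Hypothetical transfer step: keep the pretrained trunk, learn new
# robot-specific parts on a small task dataset.
pretrained = HPTSketch()                 # imagine this was trained on the big data mixture
new_robot = HPTSketch(act_dim=6)         # new robot with a 6-DoF action space
new_robot.trunk.load_state_dict(pretrained.trunk.state_dict())

for p in new_robot.trunk.parameters():   # optionally freeze the shared trunk
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in new_robot.parameters() if p.requires_grad), lr=1e-4)
# ...then run the same imitation-learning loop as before, on the new task's data.
```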
One of the key hurdles in developing HPT was compiling a comprehensive dataset for pretraining. The corpus combined 52 distinct datasets containing more than 200,000 robot trajectories, drawn from sources such as human demonstration videos and computer simulations. The researchers also devised a method to convert raw proprioceptive signals from an array of sensors into a format suitable for transformer processing.
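The article does not detail that conversion, but one plausible first step, shown below purely as an assumption, is to pad each robot’s raw joint-state vector to a common width before projecting it into tokens:

```python
# One plausible normalization (an assumption, not the paper's method):
# pad each robot's joint-state reading to a common width so robots with
# different numbers of joints share one input format.
import torch
import torch.nn.functional as F

MAX_DOF = 32  # assumed upper bound on joint count across robots in the mixture

def proprio_to_vector(joint_state: torch.Tensor) -> torch.Tensor:
    """Pad a (dof,)-shaped joint-state reading to (MAX_DOF,)."""
    pad = MAX_DOF - joint_state.shape[-1]
    return F.pad(joint_state, (0, pad))

# A 7-DoF arm and a 12-DoF mobile manipulator end up with identical shapes:
v7 = proprio_to_vector(torch.randn(7))
v12 = proprio_to_vector(torch.randn(12))
assert v7.shape == v12.shape == (MAX_DOF,)
# Each padded vector could then be fed to a stem like the TokenStem above.
```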
“Proprioception is vital for enabling intricate motions,” Wang elaborates, “and because the number of tokens remains constant, we attribute equal importance to both proprioception and visual data.” In evaluations, HPT improved robot performance by more than 20 percent compared with traditional training approaches, even in scenarios where the tasks deviated considerably from the pretraining data.
David Held, an associate professor at Carnegie Mellon University’s Robotics Institute, who was not involved in this research, acknowledged the significance of this work, stating, “This paper introduces a novel strategy for training a single policy across multiple robot variations. It facilitates scaling up diverse datasets for robotic learning while allowing rapid adaptation to new robot designs.”
Moving forward, the MIT team aims to investigate how increased data diversity can enhance HPT’s performance further. They also aspire to improve HPT’s capabilities to process unlabeled data in a manner akin to GPT-4 and other large language models. Wang asserts, “Our ultimate goal is to develop a universal robot brain that users can download for their robots without any prior training. We are just beginning, but we anticipate that scaling our approach will lead to breakthroughs in robotic policies similar to those seen in language models.”
The research, partially funded by initiatives from Amazon and the Toyota Research Institute, has promising implications for the future of robotics, suggesting a path toward more adaptable and capable robotic systems.