Toyota's Robots Are Learning to Do Housework—By Copying Humans

Carmaker Toyota is developing robots capable of learning to do household chores by observing how humans perform them. The project is an example of robotics getting a boost from generative AI.
Will Knight at the Toyota Research Institute in Cambridge, Massachusetts. Courtesy of Toyota Research Institute

As someone who quite enjoys the Zen of tidying up, I was only too happy to grab a dustpan and brush and sweep up some beans spilled on a tabletop while visiting the Toyota Research Institute in Cambridge, Massachusetts, last year. The chore was more challenging than usual because I had to do it using a teleoperated pair of robotic arms with two-fingered pincers for hands.

Courtesy of Toyota Research Institute

As I sat before the table, using a pair of controllers like bike handlebars with extra buttons and levers, I could feel the sensation of grabbing solid items and sense their heft as I lifted them, but it still took some getting used to.

After several minutes of tidying, I continued my tour of the lab and forgot about my brief stint as a teacher of robots. A few days later, Toyota sent me a video of the robot I’d operated sweeping up a similar mess on its own, using what it had learned from my demonstrations, combined with a few more demos and several more hours of practice sweeping inside a simulated world.

Autonomous sweeping behavior. Courtesy of Toyota Research Institute

Most robots—and especially those doing valuable labor in warehouses or factories—can only follow preprogrammed routines that require technical expertise to plan out. This makes them very precise and reliable but wholly unsuited to handling work that requires adaptation, improvisation, and flexibility—like sweeping or most other chores in the home. Having robots learn to do things for themselves has proven challenging because of the complexity and variability of the physical world and human environments, and the difficulty of obtaining enough training data to teach them to cope with all eventualities.

There are signs that this could be changing. The dramatic improvements we’ve seen in AI chatbots over the past year or so have prompted many roboticists to wonder if similar leaps might be attainable in their own field. The algorithms that have given us impressive chatbots and image generators are already helping robots learn more efficiently.

The sweeping robot I trained uses a machine-learning technique called a diffusion policy, which borrows the generative approach behind some AI image generators to settle on the next action to take in a fraction of a second, weighing many possible motions and multiple streams of sensor data. The technique was developed by Toyota in collaboration with researchers led by Shuran Song, a professor formerly at Columbia University who now leads a robotics lab at Stanford.
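To give a flavor of the idea, here is a minimal, illustrative sketch rather than Toyota's actual system: a diffusion policy starts from random noise and repeatedly "denoises" it into a short sequence of robot actions, conditioned on what the robot currently observes. The network, dimensions, and simplified update rule below are all assumptions made for illustration.

```python
# Minimal sketch of diffusion-policy-style action sampling (not Toyota's code).
# Dimensions, the tiny network, and the update rule are illustrative assumptions.
import torch

ACTION_DIM = 7        # e.g. 6-DoF arm pose + gripper (assumed)
HORIZON = 16          # number of future actions predicted per step (assumed)
OBS_DIM = 64          # encoded camera/proprioception features (assumed)
DENOISE_STEPS = 20

class NoisePredictor(torch.nn.Module):
    """Stand-in for the learned network that predicts the noise added to actions."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(HORIZON * ACTION_DIM + OBS_DIM + 1, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, HORIZON * ACTION_DIM),
        )

    def forward(self, noisy_actions, obs, t):
        x = torch.cat([noisy_actions.flatten(1), obs, t], dim=-1)
        return self.net(x).view(-1, HORIZON, ACTION_DIM)

@torch.no_grad()
def sample_actions(model, obs):
    """Start from pure noise and iteratively denoise it into an action sequence,
    conditioned on the current observation -- the same recipe image generators
    use for pixels, applied here to robot motions."""
    actions = torch.randn(1, HORIZON, ACTION_DIM)
    for step in reversed(range(DENOISE_STEPS)):
        t = torch.full((1, 1), step / DENOISE_STEPS)
        predicted_noise = model(actions, obs, t)
        # Simplified update: strip out a fraction of the predicted noise each step.
        actions = actions - predicted_noise / DENOISE_STEPS
    return actions

model = NoisePredictor()
observation = torch.randn(1, OBS_DIM)   # placeholder for encoded sensor data
plan = sample_actions(model, observation)
print(plan.shape)  # torch.Size([1, 16, 7]): a short sequence of arm commands
```

In a real system the noise predictor would be a much larger network trained on teleoperated demonstrations like the sweeping session described above, and the denoising loop would follow a proper diffusion schedule.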

Toyota is trying to combine that approach with the kind of language models that underpin ChatGPT and its rivals. The goal is to make it possible for robots to learn how to perform tasks by watching videos, potentially turning sites like YouTube into powerful sources of robot training data. Presumably they will be shown clips of people doing sensible things, not the dubious or dangerous stunts often found on social media.

“If you've never touched anything in the real world, it's hard to get that understanding from just watching YouTube videos,” says Russ Tedrake, vice president of robotics research at Toyota Research Institute and a professor at MIT. The hope, Tedrake says, is that some basic understanding of the physical world, combined with data generated in simulation, will enable robots to learn physical actions from watching YouTube clips. The diffusion approach “is able to absorb the data in a much more scalable way,” he says.

Toyota announced its Cambridge robotics institute back in 2015 along with a second institute and headquarters in Palo Alto, California. In its home country of Japan—as in the US and other rich nations—the population is aging fast. The company hopes to build robots that can help people continue living independent lives as they age.

The lab in Cambridge has dozens of robots working away on chores including peeling vegetables, using hand mixers, preparing snacks, and flipping pancakes. Language models are proving helpful because they contain information about the physical world, helping the robots make sense of the objects in front of them and how they can be used.
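As a rough illustration of how that can work (the function names below are hypothetical stand-ins, not Toyota's or any vendor's actual API), a robot's perception system might hand a language model the name of an object it has detected and ask for commonsense guidance on how to use it:

```python
# Illustrative sketch only: querying a language model for commonsense about an
# object a robot has detected. `ask_language_model` is a hypothetical stand-in
# for whatever hosted or local LLM the system actually calls.
def ask_language_model(prompt: str) -> str:
    # Placeholder response; a real system would call an LLM here.
    return "Hold the handle steady and turn the crank to spin the beaters."

def describe_affordance(detected_object: str, task: str) -> str:
    prompt = (
        f"A household robot sees a {detected_object}. "
        f"In one sentence, explain how to use it to {task}."
    )
    return ask_language_model(prompt)

print(describe_affordance("manual hand mixer", "whisk eggs"))
```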

It’s important to note that despite many demos slick enough to impress a casual visitor, the robots still make lots of errors. Like earlier versions of the model behind ChatGPT, they can veer between seeming humanlike and making strange mistakes. I saw one robot effortlessly operating a manual hand mixer and another struggling to grasp a bottle top.

Toyota is not the only big tech company hoping to use language models to advance robotics research. Last week, for example, a team at Google DeepMind revealed AutoRT, software that uses a large language model to help robots determine which tasks they could realistically—and safely—do in the real world.

Progress is also being made on the hardware needed to advance robot learning. Last week a group at Stanford University led by Chelsea Finn posted videos of Mobile ALOHA, a low-cost, mobile, teleoperated robotics system. They say its mobility lets the robot tackle a wider range of tasks, giving it a broader set of experiences to learn from than a system fixed in one place.

And while it’s easy to be dazzled by robot demo videos, the ALOHA team was good enough to post a highlight reel of failure modes showing the robot fumbling, breaking, and spilling things. Hopefully another robot will learn how to clean up after it.