AI Update: Voice Commands & Chaining Tasks

Research
1X Technologies Humanoid Robot EVE Tidying Office Autonomously
Team Member
Title
Hometown
Languages
Where
When

We have previously developed an autonomous model that can merge many tasks into a single goal-conditioned neural network. However, when multi-task models are small (<100M parameters), adding data to fix one task’s behavior often adversely affects behaviors on other tasks. Increasing the model parameter count can mitigate this forgetting problem, but also take longer to train, which slows down our ability to find out what demonstrations we should gather to improve robot behavior. 

How do we iterate quickly on the data while building a generalist robot that can do many tasks with a single neural network? We want to decouple our ability to quickly improve task performance from our ability to merge multiple capabilities into a single neural network. To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors shown in this video:

Although humans can do long horizon chores trivially, chaining multiple autonomous robot skills in a sequence is hard because the second skill has to generalize to all the slightly random starting positions that the robot finds itself in when the first skill finishes. This compounds with every successive skill - the third skill has to handle the variation in outcomes of the second skill, and so forth.

From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time. Single-task models also provide a good baseline to do shadow mode evaluations: comparing how a new model’s predictions differ from an existing baseline at test-time. Once the goal-conditioned model matches single-task model predictions well, we can switch over to a more powerful, unified model with no change to the user workflow.

Directing robots with this high-level language interface offers a new user experience for data collection. Instead of using VR to control a single robot, an operator can direct multiple robots with high level language and let the low-level policies execute low-level actions to realize those high-level goals. Because high-level actions are sent infrequently, operators can even control robots remotely, as shown below:

Note that the above video is not completely autonomous; humans are dictating when robots should switch tasks. Naturally, the next step after building a dataset of vision-to-natural language command pairs is to automate the prediction of high level actions using vision-language models like GPT-4o, VILA, and Gemini Vision.

Stay tuned! 
Eric Jang

Back to Top      ↑
Be the first to know the latest news and updates from 1X.

AI Update: Voice Commands & Chaining Tasks

Research
1X Technologies Humanoid Robot EVE Tidying Office Autonomously
Team Member
Title
Hometown
Languages
Where
When

We have previously developed an autonomous model that can merge many tasks into a single goal-conditioned neural network. However, when multi-task models are small (<100M parameters), adding data to fix one task’s behavior often adversely affects behaviors on other tasks. Increasing the model parameter count can mitigate this forgetting problem, but also take longer to train, which slows down our ability to find out what demonstrations we should gather to improve robot behavior. 

How do we iterate quickly on the data while building a generalist robot that can do many tasks with a single neural network? We want to decouple our ability to quickly improve task performance from our ability to merge multiple capabilities into a single neural network. To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors shown in this video:

Although humans can do long horizon chores trivially, chaining multiple autonomous robot skills in a sequence is hard because the second skill has to generalize to all the slightly random starting positions that the robot finds itself in when the first skill finishes. This compounds with every successive skill - the third skill has to handle the variation in outcomes of the second skill, and so forth.

From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time. Single-task models also provide a good baseline to do shadow mode evaluations: comparing how a new model’s predictions differ from an existing baseline at test-time. Once the goal-conditioned model matches single-task model predictions well, we can switch over to a more powerful, unified model with no change to the user workflow.

Directing robots with this high-level language interface offers a new user experience for data collection. Instead of using VR to control a single robot, an operator can direct multiple robots with high level language and let the low-level policies execute low-level actions to realize those high-level goals. Because high-level actions are sent infrequently, operators can even control robots remotely, as shown below:

Note that the above video is not completely autonomous; humans are dictating when robots should switch tasks. Naturally, the next step after building a dataset of vision-to-natural language command pairs is to automate the prediction of high level actions using vision-language models like GPT-4o, VILA, and Gemini Vision.

Stay tuned! 
Eric Jang

Back to Top      ↑