News
In machine learning, a world model is a computer program that can imagine how the world evolves in response to an agent’s behavior. Building on advancements in video generation and world models for autonomous vehicles, we have trained a world model that serves as a virtual simulator for our robots.
From the same starting image sequence, our world model can imagine multiple futures from different robot action proposals.
It can also predict non-trivial object interactions like rigid bodies, effects of dropping objects, partial observability, deformable objects (curtains, laundry), and articulated objects (doors, drawers, curtains, chairs).
In this post we’ll share why world models for robots are important, the capabilities and limitations of our current models, and a new dataset and public competition to encourage more research in this direction.
The Robotics Problem
World models solve a very practical and yet often overlooked challenge when building general-purpose robots: evaluation. If you train a robot to perform 1000 unique tasks, it is very hard to know whether a new model has made the robot better at all 1000 tasks, compared to a prior model. Even the same model weights can experience a rapid degradation in performance in a matter of days due to subtle changes in the environment background or ambient lighting.
If the environment keeps changing over time, then old experiments performed in that environment are no longer reproducible because the old environment no longer exists! This problem gets worse if you are evaluating multi-task systems in a constantly-changing setting like the home or the office. This makes careful robotic science in the real world frustratingly hard.
Careful measurement of capabilities allows one to predict how capabilities will scale when one increases data, compute, and model size. These “scaling laws” justify the enormous investment that goes into general-purpose AI systems like ChatGPT. If robotics is to have its “ChatGPT moment”, we must first establish its “Scaling Laws”.
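To make the idea of predicting from measurements concrete, here is a minimal sketch of fitting a scaling curve. It assumes a power-law form, loss = a * x^b, which is a common choice in scaling-law studies but is not something this post specifies. The data points are synthetic and the function name `fit_power_law` is ours, purely for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    var = sum((u - mean_x) ** 2 for u in log_x)
    b = cov / var                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b

# Synthetic, made-up measurements (NOT real results): loss = 5 * hours**-0.25
hours = [10, 100, 1000, 10000]
losses = [5.0 * h ** -0.25 for h in hours]

a, b = fit_power_law(hours, losses)
predicted = a * 100000 ** b  # extrapolate to 10x more data than was measured
```

The fit is only as trustworthy as the measurements behind it, which is why reproducible evaluation comes first.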
Other Ways To Evaluate
Physics-based simulators (Bullet, MuJoCo, Isaac Sim, Drake) are a reasonable way to quickly test robot policies. They are resettable and reproducible, allowing researchers to carefully compare different control algorithms. However, these simulators are mostly designed for rigid body dynamics and require a lot of manual asset authoring. How would one simulate robot hands opening a cardboard box of coffee filters, cutting fruit with a knife, unscrewing a frozen jar of preserves, or interacting with other intelligent agents like humans? Everyday objects and animals encountered in home environments are notoriously difficult to simulate, so simulation environments used in robotics tend to be visually sterile and lack the diversity of the real-world use case. Small-scale evaluation on a limited number of tasks in real or sim is not predictive of large-scale evaluation in the real world.
World Models
We’re taking a radically new approach to evaluation of general-purpose robots: learning a simulator directly from raw sensor data and using it to evaluate our policies across millions of scenarios. By learning a simulator directly from real data, you can absorb the full complexity of the real world without manual asset creation.
Over the last year, we’ve gathered thousands of hours of data on EVE humanoids doing diverse mobile manipulation tasks in homes and offices and interacting with people. We combined the video and action data to train a world model that can anticipate future video from observations and actions.
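As a rough illustration of what "anticipating future video from observations and actions" means in code, here is a minimal sketch of an action-conditioned, autoregressive rollout. The names `rollout` and `toy_predict_next` are hypothetical, and the toy dynamics are a hand-written stand-in, not our learned model. The point is only the interface: the same starting frames plus different action proposals yield different imagined futures.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # stand-in for an image or a grid of video tokens
Action = List[float]  # stand-in for a robot action vector

def rollout(predict_next: Callable[[Sequence[Frame], Action], Frame],
            initial_frames: Sequence[Frame],
            actions: Sequence[Action]) -> List[Frame]:
    """Autoregressively imagine one future frame per action."""
    history = list(initial_frames)
    imagined = []
    for action in actions:
        frame = predict_next(history, action)
        history.append(frame)  # each prediction becomes context for the next step
        imagined.append(frame)
    return imagined

def toy_predict_next(history: Sequence[Frame], action: Action) -> Frame:
    """Toy dynamics (NOT a learned model): shift the last frame by the action."""
    return [x + dx for x, dx in zip(history[-1], action)]

# Same starting frames, two different action proposals, two different futures.
start = [[0.0, 0.0]]
go_left = rollout(toy_predict_next, start, [[-1.0, 0.0]] * 3)
go_right = rollout(toy_predict_next, start, [[1.0, 0.0]] * 3)
```

A learned world model would replace `toy_predict_next` with a neural network, but the surrounding loop would look much the same.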
Action Controllability
Our world model is capable of generating diverse outcomes based on different action commands. Below we show various generations conditioning the world model on four different trajectories, each of which starts from the same initial frames. As before, the examples shown are not included during training.
The main value of the world model comes from simulating object interactions. In the following generations, we provide the model the same initial frames and three different sets of actions to grasp boxes. In each scenario, the box(es) grasped are lifted and moved in accordance with the motion of the gripper, while the other boxes remain undisturbed.
Even when actions are not provided, the world model generates plausible video, such as learning that people and obstacles should be avoided when driving:
Long-Horizon Tasks
We can also generate long-horizon videos. The example below simulates a complete t-shirt folding demonstration. T-shirts and deformable objects tend to be difficult to implement in rigid body simulators.
Current Failure Modes
Object Coherence
Our model can fail to maintain the shape and color of objects during interaction, and at times, objects may completely disappear. Additionally, when objects are occluded or displayed at unfavorable angles, their appearance can become distorted throughout the generation.
Laws of Physics
The generation on the left demonstrates that our model has an emergent understanding of physical properties, as evidenced by the spoon falling to the table when released by the gripper. However, there are many instances where generations fail to adhere to physical laws, such as on the right where the plate remains suspended in the air.
Self-recognition
We placed EVE in front of a mirror to see if generations would result in mirrored actions, but we did not see successful recognition or “self-understanding”.
World Model Challenge
As shown by the examples above, there is still much work to be done. World models have the potential to solve general purpose simulation and evaluation, enabling robots that are safe, reliable, and intelligent in a wide variety of scenarios. As such, we see this effort as a grand challenge in robotics that the community can work on solving together. To help accelerate progress towards solving world models for robotics, we are releasing over 100 hours of vector-quantized video (Apache 2.0), pretrained baseline models, and the 1X World Model Challenge, a three-stage challenge with cash prizes.
Active Challenges
Compression Challenge | Prize: $10,000 USD
The first challenge, compression, is about how well one can minimize training loss on an extremely diverse robot dataset. The lower the loss, the better the model understands the training data. Even though there are many different ways to implement a world model, optimizing loss well is a general objective that underpins nearly all large-scale deep learning tasks. A $10k prize is awarded to the first submission that achieves a loss of 8.0 on our private test set. The GitHub repo provides code and pretrained weights for Llama and GENIE-based world models.
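For readers new to this kind of objective, here is a minimal sketch of one common way to score a model on discrete (vector-quantized) tokens: mean cross-entropy of the true next token under the model's predicted distribution. This is an assumption for illustration only; the exact loss definition and units used for the challenge are given by the evaluation code in the repo, not by this sketch. The toy vocabulary size and all names here are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood (in nats) of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mean_loss(batch_logits, targets):
    """Average cross-entropy over a batch of token predictions."""
    total = sum(cross_entropy(l, t) for l, t in zip(batch_logits, targets))
    return total / len(targets)

vocab_size = 8           # toy value; a real codebook would be far larger
targets = [0, 3, 5, 7]   # made-up "true" next tokens

# A model that knows nothing predicts uniformly: loss = log(vocab_size).
uniform_logits = [[0.0] * vocab_size for _ in targets]
uniform_loss = mean_loss(uniform_logits, targets)

# A model that puts more mass on the true token gets a lower loss.
confident_logits = [[5.0 if i == t else 0.0 for i in range(vocab_size)]
                    for t in targets]
confident_loss = mean_loss(confident_logits, targets)
```

Under this kind of metric, lower loss means the model assigns higher probability to what actually happened next.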
Coming Soon
Sampling Challenge
The second challenge, sampling, is about how well and how quickly a model can generate videos of the future. Details of the Sampling Challenge will be announced soon, based on lessons learned from running the Stage 1 Challenge.
Evaluation Challenge
The third challenge, evaluation, is our holy grail: can you predict how well a robot performs before you test it in the real world? Details of the Evaluation Challenge will be announced after we’ve learned lessons from Stage 1 and Stage 2 Challenges.
Submit solutions to: challenge@1x.tech
We’re Hiring!
If you’re excited about these directions, we have open roles on the 1X AI team. Internally, we have a large dataset of high resolution robot data across even more diverse scenarios. Our ambitions for world models go beyond just solving the general evaluation problem; once you can step an agent in this world model and perform evaluation, you can follow on with policy enhancement and policy training in a completely learned simulation.
GitHub - starter code, evals, baseline implementations
Discord - chat with our engineers
We have previously developed an autonomous model that can merge many tasks into a single goal-conditioned neural network. However, when multi-task models are small (<100M parameters), adding data to fix one task’s behavior often adversely affects behaviors on other tasks. Increasing the model parameter count can mitigate this forgetting problem, but larger models also take longer to train, which slows down our ability to find out what demonstrations we should gather to improve robot behavior.
How do we iterate quickly on the data while building a generalist robot that can do many tasks with a single neural network? We want to decouple our ability to quickly improve task performance from our ability to merge multiple capabilities into a single neural network. To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors shown in this video:
Although humans can do long horizon chores trivially, chaining multiple autonomous robot skills in a sequence is hard because the second skill has to generalize to all the slightly random starting positions that the robot finds itself in when the first skill finishes. This compounds with every successive skill - the third skill has to handle the variation in outcomes of the second skill, and so forth.
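A back-of-the-envelope sketch shows how quickly this compounds. It assumes every skill succeeds with the same probability and that successes are independent, which is a simplification: in practice a later skill's success depends on where the earlier one left the robot. The 90% figure is made up for illustration.

```python
def chain_success(per_skill_success: float, num_skills: int) -> float:
    """Probability that every skill in a chain succeeds, assuming independence."""
    return per_skill_success ** num_skills

# Illustrative only: a skill that works 90% of the time, chained n times.
rates = {n: chain_success(0.9, n) for n in (1, 3, 5, 10)}
```

Under these assumptions, a three-skill chain succeeds about 73% of the time and a ten-skill chain about 35% of the time, even though each individual skill looks reliable.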
From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time. Single-task models also provide a good baseline to do shadow mode evaluations: comparing how a new model’s predictions differ from an existing baseline at test-time. Once the goal-conditioned model matches single-task model predictions well, we can switch over to a more powerful, unified model with no change to the user workflow.
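A shadow mode comparison can be sketched as follows: both models see the same observations, only the baseline's actions would actually be executed, and we record how far the candidate's predictions deviate. This is a minimal illustration, not our production tooling; the function `shadow_eval`, the Euclidean distance metric, the tolerance, and the toy policies are all assumptions.

```python
import math

def shadow_eval(baseline_policy, candidate_policy, observations, tolerance):
    """Compare two policies on the same observations without acting on the candidate.

    Returns (agreement_rate, mean_distance), where agreement means the two
    predicted action vectors are within `tolerance` in Euclidean distance.
    """
    distances = []
    for obs in observations:
        baseline_action = baseline_policy(obs)    # the action that would be executed
        candidate_action = candidate_policy(obs)  # logged only, never executed
        distances.append(math.dist(baseline_action, candidate_action))
    agreement = sum(d <= tolerance for d in distances) / len(distances)
    return agreement, sum(distances) / len(distances)

# Toy stand-ins for a single-task baseline and a goal-conditioned candidate.
def baseline(obs):
    return list(obs)

def candidate(obs):
    return list(obs) if obs[0] < 2.0 else [0.0, 0.0]

observations = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
agreement, mean_distance = shadow_eval(baseline, candidate, observations, tolerance=0.5)
```

In this toy run the candidate matches the baseline on two of three observations, so it would not yet be ready to take over.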
Directing robots with this high-level language interface offers a new user experience for data collection. Instead of using VR to control a single robot, an operator can direct multiple robots with high level language and let the low-level policies execute low-level actions to realize those high-level goals. Because high-level actions are sent infrequently, operators can even control robots remotely, as shown below:
Note that the above video is not completely autonomous; humans are dictating when robots should switch tasks. Naturally, the next step after building a dataset of vision-to-natural language command pairs is to automate the prediction of high level actions using vision-language models like GPT-4o, VILA, and Gemini Vision.
Stay tuned!
Eric Jang
In the latest episode of the Venture Europe Podcast, Bernt Børnich, CEO of 1X, sits down with host Calin Fabri to explore the evolving world of humanoid robotics.
Bernt shares his journey from a curious child dismantling kitchen gadgets to founding and leading 1X. He gives insight into the development of NEO, 1X’s next-generation android designed to assist with everyday tasks at home. He discusses the importance of designing safe, compliant humanoids capable of working alongside people in their daily environments.
Bernt also discusses 1X's strategic expansion, with AI development centered in the San Francisco Bay Area and a new manufacturing facility built in Norway.
Throughout the episode, he explores the technical and ethical challenges of integrating androids into society, aiming to create an abundant supply of labor.
Listen on Apple Podcast
Listen on Google Podcast
Listen on Amazon Music
MOSS, NORWAY: 1X is currently developing its own production facility, actuator manufacturing, and robot assembly facility in Moss, Norway, right next to our campus and engineering team. This decision is more than just a matter of convenience; it's a commitment to keep building a vertically integrated company where every component of EVE and NEO is designed and produced in-house.
“The close proximity of both the actuator manufacturing, robot assembly, and testing site offers great advantages, especially for our team of creative engineers, brimming with fresh, yet untested ideas. Being adjacent to the manufacturing and assembly process allows them to quickly understand the practical aspects of transforming their creative concepts into feasible, efficient-to-manufacture products,” says VP of Manufacturing Operations & Engineering, Csaba Hartmann.
The manufacturing team consists of diverse professionals, including specialized manufacturing engineers and mechanical designers, process engineers, automation experts, quality engineers, supply chain experts, safety officers, and others. Each member plays a role in designing, trialing, and rolling out our large-scale manufacturing initiatives, contributing to enhancing scalability, rapid iterations, and safety at every stage of the manufacturing and assembly process.
“Enabling teams that work side by side with each other and thus can easily get and act on feedback, is crucial for us to evolve and improve our products rapidly”, says Hartmann.
All 1X androids are designed with a safety-first mindset, featuring gearless motors and a soft exterior. Our commitment to safety extends beyond design, incorporating measures throughout the assembly process to ensure products are built to specs: thorough testing, quality control, and precise assembly processes.
We’re adopting quality control measures inspired by the automotive industry. We conduct thorough Design Failure Mode and Effects Analysis (DFMEA) on each assembly component to proactively identify and mitigate potential safety risks.
“Our quality team interprets the results of the DFMEA and PFMEA and then defines the rigorous checks for the assembly process to ensure no safety aspect is overlooked,” says Hartmann.
The assembly process includes rigorous checks of critical quality parameters to ensure no safety aspect is overlooked. Precision in the use of testing and assembly tools is emphasized to maintain high standards of accuracy. All components, especially motors, undergo extensive testing at multiple stages of assembly to validate their performance and reliability.
"At 1X, we prioritize scalable, cost-efficient manufacturing by integrating engineering expertise and rigorous quality control. Our approach leverages advanced technologies and carefully selected materials to enhance production efficiency. Committed to scalability, we ensure every process is optimized for cost-effectiveness and growth", says 1X CEO Bernt Børnich.
Join us
If you find this work interesting, we’d like to call attention to a few roles that we are hiring for to accelerate our mission toward creating an abundant supply of labor via safe intelligent androids:
- CNC Programmer and Operation Specialist
- Quality Engineer
- Senior Full-Stack Engineer
- Senior Mechanical Engineer
- Senior Electric Motor Design Engineer
We also have other open roles across mechanical, electrical, and software disciplines. Follow 1x_tech on X for more updates, and join us in living in the future.
1X will be attending the NVIDIA GTC Conference on March 18th. Our involvement signifies 1X's dedication to advancing in the field of Embodied AI, showcasing our latest developments, and engaging with the global AI community.
The NVIDIA GTC Conference is renowned for being a pivotal event that gathers innovators, researchers, and industry leaders worldwide to explore the latest advancements in AI, machine learning, and related technologies. Attendees can look forward to a program full of insightful talks, dynamic workshops, and demonstrations.
For more information about the conference or to register:
NVIDIA GTC Conference Official Page
Conference Program
We look forward to connecting with professionals to share our passion for AI and robotics at the event. See you at NVIDIA GTC.
1X's mission is to provide an abundant supply of physical labor via safe, intelligent androids. Our environments are designed for humans, so we design our hardware to take after the human form for maximum generality. To make the best use of this general-purpose hardware, we also pursue the maximally general approach to autonomy: learning motor behaviors end-to-end from vision using neural networks.
We deployed this system on EVE for patrolling tasks in 2023, and are now excited to share some of the new capabilities our androids have learned purely end-to-end from data:
Every behavior you see in the above video is controlled by a single vision-based neural network that emits actions at 10Hz. The neural network consumes images and emits actions to control the driving, the arms, gripper, torso, and head. The video contains no teleoperation, no computer graphics, no cuts, no video speedups, no scripted trajectory playback. It's all controlled via neural networks, all autonomous, all 1X speed.
To train the ML models that generate these behaviors, we have assembled a high-quality, diverse dataset of demonstrations across 30 EVE robots. We use that data to train a “base model” that understands a broad set of physical behaviors, from cleaning to tidying homes to picking up objects to interacting socially with humans and other robots. We then fine-tuned that model into a more specific family of capabilities (e.g. a model for general door manipulation and another for warehouse tasks) and then fine-tuned those models further to align the behavior with solving specific tasks (e.g. open this specific door). This strategy allows us to onboard new skills in just a few minutes of data collection and training on a desktop GPU.
All of the capabilities shown in the video were trained by our android operators. They represent a new generation of "Software 2.0 Engineers" who express robot capabilities through data instead of writing code. Our ability to teach our robots short mobile manipulation skills is no longer constrained by the number of AI engineers, so this creates a lot of flexibility in what our androids can do for our customers.
Join Us!
If you find this work interesting, we’d like to call attention to two roles that we are hiring for to accelerate our mission toward general-purpose physically embodied intelligence:
Over the last year we’ve built out a data engine for solving general-purpose mobile manipulation tasks in a completely end-to-end manner. We’ve convinced ourselves that it works, so now we're hiring AI researchers in the SF Bay Area to scale it up to 10x as many robots and teleoperators. We're looking for experts in imitation learning, reinforcement learning, large-scale training, and skills relevant to scaling up deployments of autonomous vehicles. You'll be working in a fast-paced team of generalists that ship features to our fleet on a 24-hour release cycle. The work is a mix of pioneering new learning algorithms and fixing speed bottlenecks in our data flywheel. We are relentless in simplifying algorithms and infrastructure as much as possible.
We're also hiring android operators in both our Oslo and Mountain View offices to collect data, train models with that data, and evaluate those models. Unlike most data collection jobs, our teleoperators are empowered to train their own models to automate their own tasks and think deeply about how data maps to learned robot behavior. If you want to experience what it is like to live in a real-life "Westworld", we'd love for you to apply.
We also have other open roles across mechanical, electrical, and software disciplines that make the foundation possible to ship all of this cutting-edge ML technology. Follow 1x_tech on X for more updates, and join us in living in the future.