At 1X, we build robots that help humans in the most diverse environment imaginable: people’s homes. To deploy our Redwood AI model safely and reliably, we need to anticipate its policy behavior across all that it can encounter. From retrieving a rarely-used kitchen gadget tucked away in a cluttered drawer, to navigating a living room unexpectedly rearranged overnight, physically evaluating each policy we've trained across these varied scenarios would take several lifetimes. How can we accelerate the evaluation of generalist robot models?

Today, we share progress on the 1X World Model (1XWM): a bridge between the world of atoms and the world of bits.

The 1X World Model enables NEO to anticipate the outcomes of robot actions and their consequences on the world.



This approach significantly accelerates experimentation, allowing us to evaluate the reliability and effectiveness of robotic policies in a fraction of the time, without requiring intensive, environment-specific engineering within traditional physics-based simulators or mock real-world sets.

Precise action controllability that enables us to compare policies that take different actions given the same observations.

Scaling results when leveraging a specific source of data: autonomous policy rollouts

High correlation between the world model and real world evaluations

Quickly iterate on architectural decisions.

Select the best checkpoint from training runs.

Curate datasets of long-tail scenarios in production and re-evaluate models on them.

Optimize robot policies at scale through efficient training-evaluation cycles.



We train our world model on sequences of video frames, robot observations, and input action trajectories. We encode the inputs to latent representations, and predict the latent encodings of future frames. We also predict the state value of the final frame to evaluate task success and completion.

Specializing Video Generation Models for Action Controllability

Most video generation models are text-to-video (T2V), using a language prompt to generate the video, and optionally, one or more reference frames for guidance. However, world models for simulating robots need to be action-controllable, steered by exact robot trajectories rather than loose directives like “Grab the mug” or “Wipe the countertop.” We demonstrate the action-controllability of 1XWM by providing it with a few initial frames of real footage along with multiple subsequent action trajectories. From this anchor point, the 1XWM simulates the consequences of taking those exact actions, including the physics of objects like doors being opened and cloths being wiped across a countertop.

LAUNDRY

OPEN

CLEAN

Does 1XWM improve as data is scaled up? What kinds of data best improve its understanding of physics?

To study this, we train 1XWM to not only predict future states and images, but also whether the task attempt succeeded or failed at the end of each generation. We see improvements in accuracy across the board as we scale up the number of tasks and the diversity of robot behaviors. We explore this in the following tasks: Air fryer, Arcade, and Shelf.

Air fryer

Arcade

Shelf



We observe a clear improvement in generation quality as we train on task-specific data. When confronted with an unfamiliar task and environment, the WM often struggles to model the object interactions exactly, without having knowledge of their specific properties. Training on task-specific data allows the WM to update based on the subtle dynamics of the task at hand.



For example, when trained on smaller amounts of data, 1XWM hallucinates the air fryer tray and body as a single unit, pulling the entire unit off the counter. After adding interaction data with the air fryer, 1XWM gains a better understanding of how the tray separates from the air fryer, and even learns to model subtle interactions such as the confinement of the tray movement within the base of the air fryer.



We also observe that training on both shelf and arcade data improves accuracy compared to training on shelf alone. This positive transfer of accuracy and task understanding from one task to another reinforces our belief in 1XWM’s ability to scale.

The more task-oriented robotics data we accumulate, the more accurately we can predict task-level future outcomes. We then turn this capability into an evaluation engine as a novel application of world modeling for robotics development.

An aligned world model can solve the evaluation problem by forecasting the actions of candidate robot policies. Given 1X World Model generations from each policy on datasets of initial states, we can compare their respective performances. Importantly, we can curate datasets of production-setting states and generate counterfactual results from states that an autonomous policy has previously failed in.
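As a rough illustration of comparing candidate policies on a shared dataset of initial states, the sketch below uses a toy one-dimensional stand-in for the world model. `ToyWorldModel`, its `step` and `success_probability` methods, and the proportional policies are all hypothetical and are not 1XWM's interface; the point is only that every policy is rolled out from identical starting states and scored by the model's predicted success.

```python
from typing import Callable, Dict, Sequence

Policy = Callable[[float], float]  # maps an observed state to an action


class ToyWorldModel:
    """Hypothetical stand-in: the 'world' is a point that should reach 1.0."""

    def step(self, state: float, action: float) -> float:
        # Predicted next state after applying the action.
        return state + action

    def success_probability(self, state: float) -> float:
        # Predicted task success for the final state of a rollout.
        return 1.0 if abs(state - 1.0) < 0.05 else 0.0


def predicted_success_rate(
    policy: Policy,
    model: ToyWorldModel,
    initial_states: Sequence[float],
    horizon: int,
) -> float:
    """Roll the policy out inside the model from each initial state."""
    total = 0.0
    for state in initial_states:
        for _ in range(horizon):
            state = model.step(state, policy(state))
        total += model.success_probability(state)
    return total / len(initial_states)


def make_policy(gain: float) -> Policy:
    # Toy proportional policy that moves toward the target at 1.0.
    return lambda state: gain * (1.0 - state)


model = ToyWorldModel()
initial_states = [0.0, 0.5, -1.0, 2.0]  # same states for every candidate
candidates = {"cautious": make_policy(0.1), "decisive": make_policy(0.6)}
rates: Dict[str, float] = {
    name: predicted_success_rate(policy, model, initial_states, horizon=5)
    for name, policy in candidates.items()
}
best = max(rates, key=rates.get)
```

Because the initial states are shared, any difference in `rates` comes from the policies rather than from the scenarios they happened to face.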

For every set of model checkpoint weights, we predict future states and success likelihood that we show are distributionally aligned to actual real-world futures. This gives us insight into model performance at scale and allows us to make model architecture and checkpoint selection decisions with an instant feedback loop.



In the plot below, we ablate the decision to include proprioception (robot joint states) as input to our robot policy, and plot the 1X World Model predicted success rate for each checkpoint. We then run real world evaluation on the most and least promising checkpoints according to 1XWM. We find that there is indeed a correlation between the predicted success rates and the true task scores. This allows us to make a likely forecast that proprioception improves policy performance.

Given a true real-world success rate gap of 15% between two policies, a World Model with 70% accuracy can accurately predict the better policy with 90% success. Given that we see a consistent predicted performance gap across checkpoints, and that we can evaluate policies on identical starting states, we can be even more confident in such a verdict.
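The relationship between evaluator accuracy and ranking confidence can be explored with a small calculation. The sketch below is not the analysis behind the figures above: it assumes the world model labels each episode correctly with a fixed probability (symmetric errors), that episodes are independent, and that both policies are evaluated on the same number of episodes. The function names and the example rates are hypothetical.

```python
from math import comb
from typing import List


def predicted_rate(true_rate: float, wm_accuracy: float) -> float:
    """Success rate reported by an evaluator that labels each episode
    correctly with probability wm_accuracy (assumed symmetric errors)."""
    return true_rate * wm_accuracy + (1.0 - true_rate) * (1.0 - wm_accuracy)


def binomial_pmf(n: int, p: float) -> List[float]:
    return [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def prob_correct_ranking(
    rate_better: float, rate_worse: float, wm_accuracy: float, n_episodes: int
) -> float:
    """Probability that the truly better policy gets more predicted
    successes than the worse one, counting ties as a coin flip."""
    pmf_a = binomial_pmf(n_episodes, predicted_rate(rate_better, wm_accuracy))
    pmf_b = binomial_pmf(n_episodes, predicted_rate(rate_worse, wm_accuracy))
    win = 0.0
    tie = 0.0
    below = 0.0  # P(worse policy has fewer than k predicted successes)
    for k in range(n_episodes + 1):
        win += pmf_a[k] * below
        tie += pmf_a[k] * pmf_b[k]
        below += pmf_b[k]
    return win + 0.5 * tie


# Example with a 15-point true gap (0.75 vs 0.60) and a 70%-accurate evaluator.
p_small = prob_correct_ranking(0.75, 0.60, 0.7, n_episodes=20)
p_large = prob_correct_ranking(0.75, 0.60, 0.7, n_episodes=200)
```

Under these assumptions an imperfect evaluator shrinks the visible gap between policies (here from 15 points to 6), and more episodes are needed to rank them reliably. Evaluating on identical starting states, as described above, would reduce the variance further than this independent-episode model suggests.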



As another experiment, we compare using two different image encoders for a policy. We compare the most promising predicted checkpoints from both policies, and see that the predicted better ViT-L model does indeed perform better in the real world. For more experiments, see our technical report.

As we deploy robots in homes, we will need to move away from task-specific evaluation and towards production-level evaluation, capable of handling a wider, more ambiguous array of full-body manipulation tasks and objects. Improving the generalization capability and accuracy of 1XWM will be the first step towards this goal.

LIMITATION #1

LIMITATION #2

We've shown promising results in scaling up the 1X World Model (1XWM) to predict the future. As we increase the amount of training compute and real world NEO data, 1XWM's accuracy of predicting whether NEO accomplishes a task also increases. 

The implications of accurate hallucinated rollouts extend far beyond rigorous evaluation for humanoid robots. Consider the implications of what happens when the data generated by 1XWM – the joint distribution over all the sensor readings and actions observed by the robot – become indistinguishable from the real data. This moment has already happened for LLMs, and we think that it will also soon be true for synthetic robotics data. Data and evals are the cornerstone of solving autonomy, and 1XWM provides a unified path for tackling both challenges.

1X World Model

Having complete and unfettered access to all parts of the world is important for NEO to accomplish tasks in home environments, and to enable our Redwood AI to learn from the broadest set of physical interactions possible. Our latest RL controller provides Redwood with a complete mobility toolkit to access the world, including natural walking in any direction, sitting, standing, kneeling, getting up from the floor, and going up and down stairs with stereo vision. Bringing all of these capabilities together for the first time in a unified controller represents an important milestone in unlocking the full potential of humanoid robots.

SIT & STAND

Full stairs

Bridging Natural Movement with Omnidirectional steering

The past decade has seen rapid progress in legged robot locomotion, primarily driven by torque-transparent actuators, deep Reinforcement Learning (RL) algorithms, and GPU-accelerated simulation. Using off-the-shelf software packages and a desktop GPU, it is now possible to train a bipedal robot to stand upright in simulation and follow walking direction commands in under an hour. Even though the policy is trained completely in a simulation environment whose physics are merely an approximation of the real world, it is trained on so many randomized physics parameters – e.g. friction, mass, sensor noise – that the model ends up being robust to the real world’s physics parameters. Once trained, this walking controller can receive walking direction commands from either the teleoperator or the AI model (e.g. Redwood), and then translate those high level directions into the dynamic, contact-aware interactions with the world.
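The randomization of physics parameters described above can be sketched as sampling a fresh parameter set for every simulated episode. This is a minimal illustration, not 1X's training setup: the `PhysicsParams` fields, the numeric ranges, and the helper names are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PhysicsParams:
    friction: float          # ground friction coefficient
    mass_scale: float        # multiplier on the robot's nominal link masses
    sensor_noise_std: float  # std-dev of noise added to sensor readings


# Hypothetical sampling ranges; real ranges would be tuned per robot.
RANGES: Dict[str, Tuple[float, float]] = {
    "friction": (0.4, 1.2),
    "mass_scale": (0.8, 1.2),
    "sensor_noise_std": (0.0, 0.05),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one randomized parameter set, e.g. at every episode reset."""
    return PhysicsParams(
        **{name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    )


def noisy_reading(true_value: float, params: PhysicsParams, rng: random.Random) -> float:
    """Corrupt a simulated sensor value with the episode's noise level."""
    return true_value + rng.gauss(0.0, params.sensor_noise_std)


rng = random.Random(0)
episode_params = [sample_physics(rng) for _ in range(1000)]
```

A policy trained across many such draws never sees one fixed set of physics, which is the mechanism the paragraph above credits for robustness to the real world's parameters.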



Beyond walking and turning, bipeds can also perform side-stepping. This is useful for navigating the close quarters of a kitchen or the space between the sofa and the coffee table, where the footprint would be too small for a wheeled base robot.



However, these basic walking RL controllers often require additional “shaping rewards” to achieve a natural human-like gait in all directions. These tend to be highly specialized to walking, which means that the same process of hand-tuning rewards must be repeated for every new behavior. Gait patterns can vary based on the direction of walking, so this often requires unique shaping terms based on the direction of motion.

Is there a more scalable way to increase controller capabilities without hand-written shaping rewards for each of them? One method is to collect motion capture references from humans moving naturally, retarget them to NEO’s joints and body, and then train the RL controller to match those kinematic reference trajectories.

Because the reference only specifies where the body should be, the RL controller still needs to figure out how to keep the robot stable, while “keeping tempo” and tracking the reference trajectory as closely as possible.
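One common way to express "track the reference as closely as possible" in RL is an exponentiated tracking error, used in motion-imitation work generally. The sketch below assumes that form; the source does not state which reward 1X uses, and `tracking_reward`, `reference_at`, and the `sharpness` value are hypothetical.

```python
import math
from typing import Sequence


def reference_at(reference: Sequence[Sequence[float]], step: int) -> Sequence[float]:
    """Reference pose for this control step ("keeping tempo").
    Holds the final pose once the clip has ended."""
    return reference[min(step, len(reference) - 1)]


def tracking_reward(
    joint_positions: Sequence[float],
    reference_positions: Sequence[float],
    sharpness: float = 5.0,
) -> float:
    """Reward in (0, 1]: equals 1.0 for a perfect match and decays
    with the squared joint-space error."""
    if len(joint_positions) != len(reference_positions):
        raise ValueError("pose and reference must have the same length")
    squared_error = sum(
        (q - r) ** 2 for q, r in zip(joint_positions, reference_positions)
    )
    return math.exp(-sharpness * squared_error)


# Two-joint toy clip with three frames.
clip = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4]]
perfect = tracking_reward([0.1, 0.2], reference_at(clip, 1))
lagging = tracking_reward([0.0, 0.0], reference_at(clip, 1))
```

A reward like this only says where the joints should be at each step. Balance and contact handling would have to come from other terms or from the dynamics the policy experiences in simulation.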

Using these techniques, it is possible to over-fit a policy to track a single human motion capture trajectory and achieve very dynamic and fluid motions such as dancing or walking. Examples of such “single trajectory replay controllers” are shown below for natural running and pivoting:

Pivot



These behaviors, while elegant, are not readily useful for general purpose tasks as they can only replay a single trajectory. They do not expose a steerable interface with which a high level policy like the AI model can execute the right actions. It is also not obvious how to transition smoothly from one reference to another, as motion capture datasets rarely include the transition behaviors between arbitrary task-centric motions, e.g. switching from shuffling quickly side-to-side to a skipping motion.

To handle multiple trajectories, we could train the controller to follow multiple mocap references, taking in the encoded kinematic trajectory as an input. However, this approach encounters a teleoperation UX problem: it is not obvious how one provides the high-dimensional kinematic trajectories at test-time using a more limited input device like a gamepad joystick or a VR controller. The model is trained on high-dimensional kinematic trajectories with nuanced rhythm and periodicity, but the commands provided by teleoperation are coarse-grained, so the RL controller turns them into an unnatural gait.

How does one achieve steerability and robustness, while still achieving the fluidity of motion capture-based RL training? We developed a two-stage controller, consisting of a high level “kinematic planner” to synthesize kinematic targets that resemble human motion capture data, and a low level controller that tries to achieve those plans.



The lower level RL controller takes as input a kinematic reference trajectory of body poses that it must attempt to track while maintaining balance. This is paired with a high level motion generator model that is trained with supervised learning to convert input commands like joystick direction to the richer kinematic trajectories. The generative model also plays the role of smoothing transitions during behavior changes.
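To make the division of labor concrete, the sketch below shows only the high-level side: turning a coarse velocity command into a dense, smooth sequence of targets for a low-level tracker. It is a hand-written stand-in, not the learned generative model described above. It plans pelvis positions only rather than full body poses, and `plan_pelvis_targets`, the time step, and the smoothing factor are all hypothetical.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]


def plan_pelvis_targets(
    start_xy: Vec2,
    current_velocity: Vec2,
    commanded_velocity: Vec2,
    dt: float = 0.01,
    steps: int = 50,
    smoothing: float = 0.1,
) -> List[Vec2]:
    """Blend the current velocity toward the joystick command and
    integrate it into a sequence of position targets."""
    x, y = start_xy
    vx, vy = current_velocity
    targets: List[Vec2] = []
    for _ in range(steps):
        # Exponential blending avoids an abrupt jump when the command changes.
        vx += smoothing * (commanded_velocity[0] - vx)
        vy += smoothing * (commanded_velocity[1] - vy)
        x += vx * dt
        y += vy * dt
        targets.append((x, y))
    return targets


# From standstill, the operator pushes the stick forward at 1 m/s.
targets = plan_pelvis_targets((0.0, 0.0), (0.0, 0.0), (1.0, 0.0))
```

The low-level controller would then receive `targets` as the reference it must track while keeping balance, so the operator's input device only needs to supply the coarse command.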

An important reason for having robots with legs in the home is to traverse stairs. We develop a “stair mode” in our controller which engages the use of stereo RGB vision to infer the height of the floor around NEO.

To climb up and down stairs in a graceful way, NEO’s RL controller must anticipate the necessary height of each step well in advance of making contact with that step. Unlike most humanoid robots, which employ a time-of-flight depth sensor or a lidar to estimate the floor plane, NEO’s RL controller is purely vision based. Depth is predicted directly from the RGB stereo pair, and this is fused with the proprioceptive history for NEO to figure out how and where to step.



Stairs are not always idealized; through domain randomization in simulation, the controller is also robust enough to support side-stepping and handling stairs of mixed heights.

There are numerous home chores that require NEO to work at floor height for extended periods of time: removing a stain from the carpet, reorganizing the bottom drawer of a cabinet, packing a suitcase, and sorting socks. We extend our RL controller to be able to safely sit, kneel and lie down on the floor, as well as get back up from each of these poses.

Kneel

Sit & stand

The controller provides an “action interface” through which teleoperation or Redwood AI can interact with the physical world in a safe, contact-rich manner. To demonstrate the controllability of the natural walking behavior, we fine-tuned the Redwood model to do a soccer ball dribbling task.

Here is Redwood interacting with this new RL controller, where it predicts whole body joint targets and walking pelvis velocities from vision. The controller then translates those intents into the specific forces applied by the leg to walk in the direction of the ball.

We have developed the first general-purpose, fully AI and teleoperation compatible controller that unlocks the full kinematic workspace that is available to a bipedal humanoid robot. This will enable us to train Redwood AI to fully explore the entire state space of the home: every high and low shelf, every nook and cranny, every floor.

We will then use that data to make an AI like no one has ever seen.

Redwood AI: Mobility

We are excited to introduce Redwood, 1X’s breakthrough AI model that we will be deploying to homes. Redwood is a vision-language transformer tailored for the humanoid form factor and capable of performing end-to-end mobile manipulation tasks like retrieving objects for users, opening doors, and navigating around the home. Redwood empowers NEO Gamma to learn from real-world experience, on top of hardware designed for compliance, safety, and resilience.

Handles variation in tasks, like picking up never-before-seen objects in unfamiliar locations. Trained on a large dataset of teleoperated and autonomous episodes from EVE and NEO, Redwood exhibits emergent behaviors such as choosing

 Redwood is among the first VLAs to control locomotion jointly with manipulation, enabling bracing and leaning behaviors during manipulation.

 Allows NEO to position itself precisely for tasks, perform actions that require movement across space, and manipulate objects while on the move.

Redwood is compute-efficient and runs fully on NEO’s onboard embedded GPU.

In order to power autonomy on both EVE and NEO platforms, Redwood fuses pre-trained language embeddings, vision tokens from a pre-trained vision transformer, and proprioception embeddings from a sequence of joint positions and joint applied forces. These are passed through several more transformer blocks, which extract a latent representation vector. We decode this representation into EVE or NEO actions using a 



Generalizing to manipulating new objects in locations not seen in the training data is crucial for the model to work in home environments, where the home is never in the exact same configuration twice. This is achieved by training on a diverse dataset gathered on NEOs in 1X offices as well as employee homes.

To further improve generalization to new scenarios despite its small size (160M parameters), Redwood is trained not only to predict actions, but also a variety of “cognitive” prediction targets, like estimating the current location of NEO’s hands and relevant objects in image space. These cognitive tasks help ground NEO’s visual representations and allow it to generalize better to unseen environments. Below, we show a continuous take of Redwood grasping unseen bottles from a variety of locations.

Whole Body Control and Multi-contact manipulation

Manipulation and locomotion behaviors are typically decoupled in most robotic systems. However, manipulation in the home necessitates going beyond picking small objects on counter-tops and tables: humans use their legs, hips and spine to bend down to pick toys and clothes off the ground, and lean into heavy doors when pushing them open. These “whole body control” tasks make it impossible to cleanly separate locomotion and manipulation.

To enable similar capabilities on NEO, Redwood predicts not only the arm and hand commands, but also walking, manipulation, and pelvis pose commands simultaneously. This greatly expands the kinematic reach and payload capacity NEO can work with.



Coordinating all parts of the body to engage with the environment also enables multi-contact manipulation, such as bracing a hand against a wall when pulling open heavy doors.

Solving chores requires combining manipulation with navigating across the home. In real home tasks, the objects of interest are rarely all in front of the robot at the start. Furthermore, navigating to and getting close enough to an object to grasp it needs to take into account the way the model will choose to pick up the object. If one does not train navigation skills jointly with those for manipulation, then a separate navigation stack may fail to position the robot in an optimal position to grasp the target object. Vice versa, if manipulation behavior does not take into account navigation behavior that might follow it, this could lead to unwanted collisions or carrying the object across the room in an unsafe way.

To that end, Redwood is trained on a large diverse set of object navigation and pick-and-place demonstrations within the home and is trained to plan navigation and manipulation behaviors jointly. An emergent property of training from these demonstrations is that Redwood can automatically decide to use the left, right, or both hands to pick up an object.

Running Redwood onboard allows NEO to be deployed in more diverse environments: in basements, in the garden, in homes with spotty Internet infrastructure, in wilderness campsites.

To that end, Redwood is a 160M parameter transformer model that runs on NEO’s onboard GPU at around 5 Hz. To pack as much intelligence as possible into a relatively small number of parameters, we’ve found that the additional cognitive losses help with grounding the representations, especially in unseen environments.

Voice control is an intuitive interface to interact with general-purpose robots in the home. Using an offboard speech-to-speech LLM, we extract the goal the user intends to command NEO with from a conversational context, and then convert the command into a vector offboard using a sentence encoder. This vector is then passed as an input into the Redwood model, which is trained on thousands of such text embeddings.

Large-scale behavior cloning methods typically only imitate successful demonstrations. Redwood is trained to learn from both successful and failure rollouts, allowing NEO to improve from any interaction it has with the world regardless of success. The failure rollouts provide supervision signals on the cognitive prediction heads, which helps prevent overfitting to a relatively narrow distribution of states seen during successful demonstrations. The successful demonstrations supervise both the action diffusion heads and the cognitive predictions.
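The split in supervision described above can be sketched as a masked sum of per-episode losses: every episode contributes its cognitive loss, and only successful episodes contribute an action loss. This is an illustration under assumptions; `EpisodeLosses`, `training_loss`, and the `cognitive_weight` parameter are hypothetical, and the actual loss weighting is not given in the text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EpisodeLosses:
    action_loss: float     # loss of the action head on this episode
    cognitive_loss: float  # loss of the cognitive prediction heads
    succeeded: bool        # whether the rollout accomplished the task


def training_loss(batch: Sequence[EpisodeLosses], cognitive_weight: float = 1.0) -> float:
    """Average loss where failures supervise only the cognitive heads."""
    total = 0.0
    for episode in batch:
        total += cognitive_weight * episode.cognitive_loss
        if episode.succeeded:
            # Only successful demonstrations are imitated by the action head.
            total += episode.action_loss
    return total / len(batch)


batch = [
    EpisodeLosses(action_loss=2.0, cognitive_loss=0.5, succeeded=True),
    EpisodeLosses(action_loss=9.0, cognitive_loss=0.5, succeeded=False),
]
loss = training_loss(batch)
```

In this toy batch the failed episode's large action loss is ignored, so the model is never pushed to imitate the failing actions while its perception heads still learn from the states that episode visited.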

We think that general purpose autonomous humanoids, with their intelligence incubated in the home, will be a generation-defining technology that reshapes quality of life for the elderly, for busy parents, and for use cases that can scarcely be imagined today. We’re looking for driven, high-agency engineers to help us scale up Redwood to the next level, and to deploy a production-grade AI in as many homes as possible this year. If working on Redwood excites you, we have a large number of open roles in Palo Alto:

Research Engineer, Reinforcement Learning

Redwood AI

This move comes as we prepare for large-scale deployment of NEO, your friendly home humanoid. With the aim of having hundreds of NEOs arrive in homes across the United States in 2025, followed by rapid expansion, Mustally will be instrumental in enabling our growth along the way.

“I am excited to welcome Mustally to the Executive Team at 1X,” said Bernt Børnich, CEO of 1X Technologies. “He will play a key role in helping us accelerate our growth plans from our global HQ in Palo Alto, California.”

Mustally brings a strong track record of leadership across Fortune 500 companies and high-growth startups. He has built finance organizations from the ground up, raised over $50 billion in capital through public and private markets, and led transformative commercial partnerships and M&A initiatives. Prior to joining 1X, Mustally served as Managing Director, Global Treasurer, and Head of Financial Services at Lucid Group, and held senior finance roles at Herc Holdings, Hyundai Capital, and National Grid, following a career in investment banking and management consulting.

“I’m thrilled to join the 1X team, which is well positioned to lead the physical AI and robotics sector,” said Mustally Hussain. “My focus will be on driving financial discipline and strategic growth as 1X scales its manufacturing and commercialization footprint in the U.S., followed by global expansion.”

Welcoming Mustally Hussain as CFO

Developing AI for humanoid robots involves tackling many open research challenges – in safety, dexterity, visual understanding, and much more. It helps to compare notes with other labs tackling similar challenges, in order to accelerate progress towards a future of NEOs doing all the tasks needed to keep your home in order autonomously.

To that end, 1X AI and NVIDIA are pleased to announce our research collaboration effort. As a first step, the teams worked together to prepare an autonomy demo for Jensen Huang’s GTC 2025 Keynote, featuring NEO doing a dish loading task autonomously.

The following is a look into where, how and when we taught NEO to do the dishes with the NVIDIA team.



To make this collaboration possible, the 1X AI Team created a dataset API for NVIDIA to access data collected from 1X offices and employee homes, and an inference SDK to serve model predictions at a continuous 5Hz vision-action loop using an onboard NVIDIA GPU in NEO’s head or an offboard GPU.

A crucial step when onboarding a new learning codebase onto NEO is to verify correctness, i.e., overfitting a baseline model to a small amount of demonstration data and making sure that the time synchronization between images and actions is consistent all the way from data collection to training to runtime inference.
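A time-synchronization check of the kind mentioned above can be as simple as measuring how far each image timestamp is from its nearest action timestamp. The sketch below is one hypothetical way to do that; `max_sync_skew`, the example timestamps, and the tolerance are assumptions and do not describe the 1X dataset API or inference SDK.

```python
from bisect import bisect_left
from typing import Sequence


def max_sync_skew(image_times: Sequence[float], action_times: Sequence[float]) -> float:
    """Largest gap (seconds) between any image timestamp and the closest
    action timestamp. action_times must be sorted in ascending order."""
    if not action_times:
        raise ValueError("action_times must not be empty")
    worst = 0.0
    for t in image_times:
        i = bisect_left(action_times, t)
        # The nearest action is either just before or just after index i.
        neighbours = action_times[max(i - 1, 0): i + 1]
        worst = max(worst, min(abs(t - a) for a in neighbours))
    return worst


# A 5 Hz loop has a 0.2 s period; allow a small, hypothetical tolerance.
TOLERANCE_S = 0.02
image_times = [0.0, 0.2, 0.4]
good_actions = [0.01, 0.21, 0.41]
bad_actions = [0.0, 0.2, 0.9]  # the last action arrived far too late

good_skew = max_sync_skew(image_times, good_actions)
bad_skew = max_sync_skew(image_times, bad_actions)
```

Running the same check on logged data, on training batches, and on live inference inputs is one way to confirm that the alignment holds at every stage.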

We demonstrate this by working with the NVIDIA GEAR team to train a single end-to-end neural network based on the 

 model to autonomously grasp a cup, hand it over to the other hand, and place it in a dishwasher to showcase how NEO fits compactly into the kitchen space while still having the kinematic reach to carry the cup from sink to dishwasher.

This is a good “first task” to learn because it checks for basic compatibility of an external research codebase with the logging and inference architecture. The obvious next step after verifying correctness is to feed thousands of hours of internally collected NEO data into the model.

Over the course of a week, our teams developed this model at a 1X employee’s home, swapped notes on action spaces, control frequencies, and other imitation learning tricks needed to get good performance on NEO Gamma. Moments like these – where friends are just hanging out in the home while a NEO does dishes in the background – will soon become an everyday occurrence.



When working in homes, the safety of NEO Gamma becomes particularly evident. NEO’s mechanically compliant and safe design allowed engineers to get in extremely close quarters with the robot while testing a variety of experimental architectures.


Our teams are both looking forward to continuing to learn from each other and push the industry forward. We hope that together we can accelerate our path to humanoids living and learning among us and providing a helping hand wherever one is needed.


 team and Jensen Huang for being gracious with their time and having us be a part of the NVIDIA experience at GTC. 

Additionally, thank you to 

 team for designing the jacket and capturing the moment.

1X & NVIDIA Research Collaboration 

NEO Gamma is the next generation of home humanoids designed and engineered by 1X Technologies. The Gamma series includes improvements across NEO’s hardware and AI, featuring a new design that is deeply considerate of life at home.


NEO Gamma's new design features Emotive Ear Rings to improve communication and has a minimalist design aesthetic that fits into the home.




NEO can now walk with a natural human gait and arm swings, squat down to pick things up from the ground and sit in chairs.




NEO Gamma comes outfitted with soft covers to reduce impact on the surrounding environment and increase overall safety.




1X trained a visual manipulation model capable of picking up a large variety of objects in different scenarios, including environments not seen during training.




NEO Gamma’s Companion feature set integrates a new in-house language model that enables natural conversation and body language.

NEO Gamma’s design opens the door to start internal home testing—a first step in creating fully autonomous humanoids.

“There is a not-so-distant future where we all have our own robot helper at home, like Rosey the Robot or Baymax. But for humanoid robots to truly integrate into everyday life, they must be developed alongside humans, not in isolation.

The home provides real-world context and the diversity of data needed for humanoids to grow in intelligence and autonomy. It also teaches them the nuances of human life—how to open the door for the elderly, move carefully around pets, or adapt to the unpredictability of the surrounding world. Robots confined to industrial space or lab development miss out on this critical understanding.

With NEO Gamma, every engineering and design decision was made with one goal in mind: getting NEO into customers’ homes as quickly as possible… We’re close. We can’t wait to share more soon.”

NEO Gamma introduces a number of AI advancements, enhancing both autonomy and safe teleoperation in the home.




NEO now walks with a natural human gait and arm swings, squats down to pick things up from the ground, and sits in chairs, all while maintaining balance. These dynamic control skills – running at 100 Hz – are learned using Reinforcement Learning from human motion capture data. This range of motion allows NEO Gamma to experience household scenarios and tasks that would otherwise be inaccessible to other robot form factors.




1X trained a visual manipulation model capable of picking up a large variety of objects in different scenarios, including environments not seen during training. NEO Gamma leverages neural networks trained to predict teleoperated actions directly from raw sensor data.




NEO Gamma’s Companion feature set integrates a conversational voice interface, allowing users to converse naturally with a 1X-developed large language model (LLM). These developments bring NEO closer to human-friendly user interaction, bridging the gap between humanoid robotics and daily life.

NEO Gamma is designed to be approachable, with all new features created to integrate seamlessly into people’s everyday lives. NEO’s new look aims to complement living spaces rather than disrupt them.




NEO Gamma’s Ear Rings provide real-time visual feedback for a more intuitive and interactive experience.




3D printed from durable and soft nylon using a Japanese Shima Seiki machine, the suit utilizes a unique whole-garment seamless knitting process that allows the fabric to conform to NEO without impeding performance.




With safety as the top priority, NEO Gamma features 1X’s Tendon Drive for joint actuation, encased in soft covers to enhance passive safety.

NEO Gamma features significant hardware upgrades, making it quieter, more reliable, and well suited to deliver a consumer product experience.




NEO Gamma boasts a 10x increase in hardware reliability and a 10 dB decrease in noise, bringing NEO’s operating levels down to the level of a refrigerator.




NEO Gamma features four microphones (front, back, left, and right) with beamforming and echo cancellation, ensuring crystal-clear audio capture, and a three-speaker system: one in the chest for AI voice interactions and two outward-facing speakers in the pelvis for bass, 360-degree sound effects, and music.

Media Inquiries: press@1x.tech

Introducing NEO Gamma

 1X announces the acquisition of Kind Humanoid, uniting two robotics teams with aligned visions to advance humanoid technology.

“It’s rare to find someone who is not just a powerhouse engineer but also completely aligned philosophically and strategically on how humanoids as products should take shape,” said Bernt Børnich, CEO of 1X. “Having Christoph join the 1X team here in the Bay will accelerate our path to a world full of humanoid robots.”

Kind Humanoid’s unique expertise and culture complement 1X’s mission to create an abundance of labor through safe and intelligent humanoids. The relationship between the two newly joined companies is built on the foundational belief that humanoids need to be developed alongside humans, living and learning among us.

“Joining 1X feels like the perfect next chapter for Kind Humanoid,” said Christoph Kohstall, CEO of Kind Humanoid. “From starting out in a small garage to now becoming part of a team that shares our belief in humanoids living and learning among us, this acquisition brings our vision closer to reality. Together, we can create robots that truly connect with people and make a difference where it matters most.”

Kind Humanoid is a Palo Alto-based robotics startup founded by Christoph Kohstall, a former scientist at Stanford and member of Google's robotics team. The company developed Mona, a bipedal humanoid robot designed for home use and to bring additional labor to fields such as healthcare. Operating from a small garage, the team brought together a deeply bio-inspired robot body with advanced AI and large language models, developing a robot that can interact with and assist everyday people.

1X is a leader in humanoid robotics, developing general-purpose robots designed to live and learn among humans in the home—an early step toward advancing physical intelligence. With a focus on safety, 1X is at the forefront of the next wave of robotics, offering products that are accessible and practical for everyday use. 1X’s mission is to create an abundance of labor through safe, intelligent humanoids that work alongside people.

1X Acquires Kind Humanoid

We have previously developed an autonomous model that can 

 goal-conditioned neural network. However, when multi-task models are small (<100M parameters), adding data to fix one task’s behavior often adversely affects behaviors on other tasks. Increasing the model parameter count can mitigate this forgetting problem, but larger models also take longer to train, which slows down our ability to find out what demonstrations we should gather to improve robot behavior.

How do we iterate quickly on the data while building a generalist robot that can do many tasks with a single neural network? We want to decouple our ability to quickly improve task performance from our ability to merge multiple capabilities into a single neural network. To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors shown in this video:

Although humans can do long horizon chores trivially, chaining multiple autonomous robot skills in a sequence is hard because the second skill has to generalize to all the slightly random starting positions that the robot finds itself in when the first skill finishes. This compounds with every successive skill - the third skill has to handle the variation in outcomes of the second skill, and so forth.
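The chaining idea can be sketched as a lookup from a spoken command to a short-horizon skill, with each skill starting from whatever state the previous one left behind. Everything in the sketch is hypothetical: the `Skill` type, the dictionary-based robot state, and the two toy skills stand in for separate small models and are not 1X's interface.

```python
from typing import Callable, Dict, Sequence

State = dict
Skill = Callable[[State], State]  # takes the current state, returns the next


def pick_up_cup(state: State) -> State:
    return {**state, "holding": "cup"}


def place_in_sink(state: State) -> State:
    # This skill depends on where the previous skill left the robot.
    if state.get("holding") is None:
        return {**state, "failed": True}
    return {**state, "holding": None, "sink": state["sink"] + [state["holding"]]}


def run_chain(skills: Dict[str, Skill], commands: Sequence[str], state: State) -> State:
    """Dispatch each human-issued command to its skill, in order."""
    for command in commands:
        key = command.strip().lower()
        if key not in skills:
            raise KeyError(f"no skill registered for command: {command!r}")
        state = skills[key](state)
    return state


skills = {"pick up cup": pick_up_cup, "place in sink": place_in_sink}
start = {"holding": None, "sink": []}
done = run_chain(skills, ["Pick up cup", "Place in sink"], start)
skipped = run_chain(skills, ["Place in sink"], start)
```

The `skipped` run shows the compounding problem in miniature: a skill invoked from a starting state it was not prepared for fails, even though the same skill succeeds when chained after the right predecessor.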

From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time. Single-task models also provide a good baseline to do 

 evaluations: comparing how a new model’s predictions differ from an existing baseline at test-time. Once the goal-conditioned model matches single-task model predictions well, we can switch over to a more powerful, unified model with no change to the user workflow.
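Comparing a new model's predictions against a baseline on the same inputs can be sketched as an average per-dimension action gap. The metric, the function name `mean_action_gap`, the toy models, and the switch-over threshold below are assumptions for illustration; the text does not say how the comparison is measured.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], List[float]]  # observation -> predicted action vector


def mean_action_gap(baseline: Model, candidate: Model, observations: Sequence[float]) -> float:
    """Mean absolute difference between the two models' predicted actions,
    averaged over all observations and action dimensions."""
    total = 0.0
    count = 0
    for obs in observations:
        base_action = baseline(obs)
        cand_action = candidate(obs)
        if len(base_action) != len(cand_action):
            raise ValueError("models must predict actions of the same size")
        total += sum(abs(b - c) for b, c in zip(base_action, cand_action))
        count += len(base_action)
    return total / count


def single_task_model(obs: float) -> List[float]:
    return [obs, 2.0 * obs]


def goal_conditioned_model(obs: float) -> List[float]:
    return [obs + 0.1, 2.0 * obs]


gap = mean_action_gap(single_task_model, goal_conditioned_model, [0.0, 1.0, 2.0])
SWITCH_THRESHOLD = 0.1  # hypothetical acceptance threshold
ready_to_switch = gap < SWITCH_THRESHOLD
```

Only the baseline's actions would be executed on the robot during such a comparison, so the candidate can be assessed without affecting behavior.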

Directing robots with this high-level language interface offers a new user experience for data collection. Instead of using VR to control a single robot, an operator can direct multiple robots with high level language and let the low-level policies execute low-level actions to realize those high-level goals. Because high-level actions are sent infrequently, operators can even control robots remotely, as shown below:

Note that the above video is not completely autonomous; humans are dictating when robots should switch tasks. Naturally, the next step after building a dataset of vision-to-natural language command pairs is to automate the prediction of high level actions using vision-language models like 

AI Update: Voice Commands & Chaining Tasks
