
Home robots need common sense behavior and a deep understanding of the physical world.
Many robot foundation models today are vision-language-action models (VLAs), which take a pretrained VLM and add an output head to predict robot actions (PI0.6, Helix, Groot N1.5). VLMs benefit from internet-scale knowledge, but are trained on objectives that emphasize visual and semantic understanding over prediction of physical dynamics. As a result, tens of thousands of hours of costly robot data are needed to teach a model how to solve tasks that are simple for a human. Additionally, auxiliary objectives are often used to further coax spatial reasoning about physical interactions (MolmoAct, Gemini-Robotics 1.5).
In this blog, we introduce our video-pretrained world model, 1XWM, integrated into NEO as a robot policy. While VLAs directly predict action trajectories from static image-language input, our world-model-based policy derives robot actions from text-conditioned video generation. By leveraging the world dynamics inherent in internet-scale video, our world model generalizes to novel objects, motions, and tasks without pre-training on large-scale robot data or any related teleoperated demonstrations. This represents a shift towards a new regime: allowing robot intelligence to benefit from the scaling of video pretraining, enabled by a hardware stack designed for high-fidelity transfer from human embodiment.
Internet video implicitly encodes the structural priors of reality: how people and objects move, where forces are applied, and the implicit constraints of interaction. Accurate translation of video into action comes from more than just a model—it benefits from an embodiment that is kinematically and dynamically congruent with the human form.
Treating hardware as a first-class citizen in the AI stack closes the human-robot translation gap. By combining this embodiment with human-like compliance, interaction dynamics (friction, inertia, and contact behavior) often match human motion closely enough for the model's learned priors to remain in distribution. What the model can visualize, NEO can usually do.
Raw video provides the visual "what," but lacks the control "how." To bring video knowledge into world-model form, we use a two-stage grounding process enabled by our integrated stack, following existing works like DreamGen and UniPi: a world model (WM) that generates video of the intended behavior, and an inverse dynamics model (IDM) that recovers the robot actions needed to realize it.
At inference time, the system receives a text prompt and a starting frame. The WM rolls out the intended future, the IDM extracts the necessary trajectory, and the robot executes the sequence in the real world.
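In pseudocode, this loop might look like the sketch below; `world_model`, `idm`, and `robot` are hypothetical stand-ins for our internal interfaces, not the production code.

```python
# Minimal sketch of the two-stage loop (all interfaces here are hypothetical stand-ins).

def run_task(world_model, idm, robot, prompt: str, start_frame):
    # The world model imagines the intended future as a short video rollout,
    # conditioned on the current observation and the text instruction.
    generated_video = world_model.generate(start_frame=start_frame, prompt=prompt)

    # The inverse dynamics model recovers the robot action trajectory
    # that would realize the generated video.
    actions = idm.extract_actions(generated_video)

    # The trajectory is executed on the real robot.
    robot.execute(actions)
```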

At test time, we send a prompt from our frontend to the model inference server, which then executes the resulting actions on-robot.
The 1XWM backbone is built upon a 14B generative video model. To adapt this model to NEO's embodiment, we use a multi-stage training strategy.
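Schematically, the recipe can be summarized as below; the stage names and data descriptions are our illustrative labels for the phases described in this post, not exact training configurations.

```python
# Illustrative summary of the multi-stage training recipe (labels are ours, not exact configs).
TRAINING_STAGES = [
    ("video_pretraining",     "web-scale video (inherited from the 14B base video model)"),
    ("ego_human_midtraining", "egocentric human video, with brief labels upsampled into detailed captions"),
    ("neo_posttraining",      "NEO robot data, primarily tabletop pick-and-place"),
]
```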
Previous works like DALL-E 3 show that the prompt adherence of visual foundation models can be significantly improved by training on descriptive visual captions. However, many egocentric datasets contain only brief task descriptions. To address this, we prompt a VLM to expand them into more detailed captions for training, a process we refer to as caption upsampling.
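As a rough sketch of what caption upsampling can look like, the snippet below prompts a generic VLM client to expand a brief task label into a richer training caption; the `vlm.generate` interface and the prompt wording are hypothetical, not our production pipeline.

```python
# Sketch of caption upsampling with a VLM (hypothetical `vlm` client and prompt).

UPSAMPLE_PROMPT = (
    "You are annotating an egocentric manipulation video. "
    "Given the first frame and the short task label below, write a detailed "
    "caption describing the scene, the objects involved, and the motion "
    "performed, in 2-3 sentences.\n\nTask label: {label}"
)

def upsample_caption(vlm, first_frame, short_label: str) -> str:
    """Expand a brief task description into a detailed training caption."""
    return vlm.generate(image=first_frame, prompt=UPSAMPLE_PROMPT.format(label=short_label))
```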
To select the best 1XWM backbone checkpoint, we use an evaluation metric that compares the dynamic time warping distance between the ground truth future action sequence and the actions from our trained IDM run on the generated video. This helps filter for generations that not only look good visually but also yield accurate actions.
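A simplified version of this checkpoint-selection metric is sketched below. The `world_model`, `idm`, and evaluation-set interfaces are hypothetical, while `dtw_distance` is the standard dynamic time warping recursion.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic time warping distance between two action sequences of shape (T, action_dim)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def score_checkpoint(world_model, idm, eval_set) -> float:
    """Lower is better: average DTW between ground-truth actions and the actions
    the IDM extracts from the checkpoint's generated videos (hypothetical interfaces)."""
    distances = []
    for example in eval_set:  # each example holds a start frame, a prompt, and true actions
        video = world_model.generate(start_frame=example.frame, prompt=example.prompt)
        predicted_actions = idm.extract_actions(video)
        distances.append(dtw_distance(predicted_actions, example.true_actions))
    return float(np.mean(distances))
```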
Similarly to DreamGen, we use a two-image IDM predictor with a sliding window of W=8 frames for efficient training and inference. However, we use a Depth Anything backbone with a separate flow matching head instead of a single diffusion transformer. Frames at times t and t+W are passed through the depth backbone and the embeddings are used as conditioning for the flow matching head. The IDM is trained on 400 hours of unfiltered robot data including random play data and motions that don’t correspond to any meaningful task. This allows us to faithfully track NEO’s motions anywhere.
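The sketch below shows one way such a two-image IDM could be wired up in PyTorch. The `depth_backbone` and `flow_head` modules, and the forward signature, are placeholders of our own for a Depth Anything-style encoder and a flow-matching network, not the production architecture.

```python
import torch
import torch.nn as nn

class TwoFrameIDM(nn.Module):
    """Sketch of a two-image inverse dynamics model with a depth backbone
    and a flow-matching action head (hypothetical module interfaces)."""

    def __init__(self, depth_backbone: nn.Module, flow_head: nn.Module, window: int = 8):
        super().__init__()
        self.depth_backbone = depth_backbone  # maps an image to a feature embedding
        self.flow_head = flow_head            # denoises an action chunk given conditioning
        self.window = window                  # W frames between the two conditioning images

    def forward(self, frame_t, frame_t_plus_w, noisy_actions, flow_time):
        # Embed the frames at times t and t + W with the depth encoder.
        z_t = self.depth_backbone(frame_t)
        z_w = self.depth_backbone(frame_t_plus_w)
        cond = torch.cat([z_t, z_w], dim=-1)

        # The flow-matching head predicts the velocity field for the action chunk
        # spanning the W frames between the two images.
        return self.flow_head(noisy_actions, flow_time, cond)
```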
At test-time, given the starting frame and a text prompt instructing NEO on what to do, 1XWM rolls out future video frames. We then extract the corresponding robot action trajectory with the IDM, and execute it directly on the robot. To ensure smooth trajectories, the IDM’s outputs are timewise averaged across a batch of initial noise values and sliding windows.
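A minimal sketch of this smoothing step is shown below, assuming a hypothetical `idm_sample(frame_a, frame_b)` call that draws one action chunk per pair of frames; overlapping windows and repeated noise samples are simply averaged per timestep.

```python
import numpy as np

def extract_smoothed_actions(idm_sample, frames, window: int = 8, noise_batch: int = 4):
    """Average IDM action predictions over initial noise samples and overlapping
    sliding windows (a sketch of the smoothing described above)."""
    num_steps = len(frames) - 1  # T frames give T-1 action steps
    assert num_steps >= window, "need at least one full window of frames"
    sums, counts = None, None
    for start in range(num_steps - window + 1):
        # Draw several action chunks from different initial noise values and average them.
        chunks = [idm_sample(frames[start], frames[start + window]) for _ in range(noise_batch)]
        chunk = np.mean(chunks, axis=0)  # shape: (window, action_dim)

        if sums is None:
            sums = np.zeros((num_steps, chunk.shape[-1]))
            counts = np.zeros((num_steps, 1))
        # Accumulate overlapping sliding-window predictions for timewise averaging.
        sums[start:start + window] += chunk
        counts[start:start + window] += 1
    return sums / counts
```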
Caption: Our NEO post-training dataset contains primarily high-quality pick-and-place data (98.5%), filtered for tabletop manipulation with hands in view. By leveraging the web-scale pretraining of the base video model, 1XWM can generalize to a wide range of unseen objects, environments, and tasks.
We first seek to understand: what are the limits on task generalization beyond what NEO has already seen and done? How closely does generated video align with real-world execution?
We have NEO attempt a range of tasks that probe these questions.
We observe that 1XWM-generated videos generally align well with real-world execution. The generated video and footage of the actual execution can look very similar when viewed side-by-side, showing that 1XWM has strong spatial, kinematic, and physical understanding.
[Side-by-side videos: 1XWM generated rollouts (WM) and real-world executions (REAL)]
Next, we try tasks requiring two-handed coordination and human interaction, abilities not included in our training dataset. Success here suggests that this knowledge comes from video pre-training and egocentric human mid-training. Because NEO's embodiment is so similar to a human's, the affordances learned from human video data translate directly.
[Side-by-side videos: 1XWM generated rollouts (WM) and real-world executions (REAL)]
Generations can sometimes be overly optimistic about task completion and depth understanding. Generated rollouts can look visually plausible while subtly violating real-world constraints (e.g. object consistency, depth, geometry, contact). Our post-training recipe significantly reduces these errors, but monocular pretraining can still lead to weak 3D grounding, where the real robot undershoots or overshoots even when the generated video "succeeds." This motivates future work on integrating depth or stereo sensing for better spatial grounding.
Beyond anecdotal examples, we show real-world results measuring the performance of 1XWM on in-distribution (ID) and out-of-distribution (OOD) tasks, running each 30 times. 1XWM achieves stable success rates across diverse action primitives, although some dexterous tasks remain challenging (e.g. pouring, drawing).
Does the quality of a generated video predict whether the resulting execution will succeed? If so, we can measure and improve video quality using visual metrics and estimate the likelihood of real-world success.
Sometimes it is visually obvious whether a generated rollout is likely to succeed. For example, prompting 1XWM with "pull tissue" can occasionally produce videos of NEO picking up the tissue box instead. We have generally found a success rate close to 0% when executing such bad generations.
This suggests that ideas like test-time compute scaling can improve task success. Inspired by this, we try generating multiple rollouts in parallel and executing the best one. We study the "pull tissue" task with between one and eight parallel generations, finding that selecting the highest-quality generation from eight choices does lead to improved task success.
This selection process can be done manually, but is amenable to automation with a VLM evaluator. We leave exploration of best-of-N sampling strategies to future work. For simplicity, and to avoid conflating it with our video-quality ranking, all other results in this blog post use a single generation per attempt.
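For illustration, a best-of-N loop might look like the sketch below, where `score_fn` stands in for a human check or a VLM evaluator judging the rollout and all other interfaces are hypothetical.

```python
# Best-of-N sketch: generate several rollouts and execute the highest-scoring one.

def best_of_n(world_model, idm, robot, prompt: str, start_frame, score_fn, n: int = 8):
    # Generate N candidate futures (run in parallel in practice; shown sequentially for clarity).
    rollouts = [world_model.generate(start_frame=start_frame, prompt=prompt) for _ in range(n)]

    # Select the generation judged most likely to succeed at the task.
    best_video = max(rollouts, key=lambda video: score_fn(video, prompt))

    # Extract and execute the corresponding action trajectory.
    robot.execute(idm.extract_actions(best_video))
```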
Given our hypothesis of correlation between video quality and task success, we visually ablate a few training choices, specifically the impact of upsampling captions and training on egocentric human data.
To evaluate, we generate videos of NEO performing tasks and then ask for human feedback. We use three evaluation datasets, each containing 500 starting image-prompt pairs: an in-distribution split, a New Tasks split, and a T2I split.
We ask human annotators to review each generated video and accept or reject it based on physical plausibility, task completion, and consistency with NEO’s embodiment and capabilities.
We find that upsampling captions improves video generation quality on every evaluation split. Upsampled captions better match the detailed text conditioning used during video-model pretraining and also provide clearer conditioning for task-specific motion.
Adding egocentric human data improves generation quality on both New Tasks and T2I splits. This is consistent with our hypothesis that ego human data contributes a transferable prior for manipulation that maps well onto NEO’s humanoid embodiment. For in-distribution tasks for which we have good NEO data coverage, egocentric data may dilute the post-training mix and cause a neutral or negative change in visual score.
Finally, we run real-world ablations on egocentric mid-training and caption upsampling, techniques we found to improve visual generation quality. For each task, we run all models 30 times in the same setting, on the same robot, and without test-time filtering. We do not control further with blind evaluation or checkpoint selection.
On the only in-distribution task, "Grab chips," we find that all models perform similarly, suggesting that in-distribution tasks are the least sensitive to changes in training data or captioning.
Our "Scrub dish" task was the most challenging across models. Models without either the egocentric human mid-training phase or caption upsampling misinterpret the instruction. We find that the model with both is the only one to achieve a nonzero success rate.
Overall, we do see a connection: changes that improve visual generation quality also improve task success in our real-world experiments.
1XWM transfers the knowledge of web-scale video data to imagine and then execute diverse tasks on NEO. From here, we’re excited to improve 1XWM to solve more complex, longer-horizon tasks required for useful household autonomy.
Building on the rapid progress observed in generative video models and equipped with data from our NEO production ramp, we will continue scaling 1XWM to improve quality and task success. Today, the 1XWM backbone takes 11 seconds to run, using multi-GPU inference built in collaboration with our friends at Verda (Antonio and Aditya). Each inference generates 5 seconds of real-time video, and the IDM takes 1 second to extract actions from the generated rollout. To handle reactive tasks and contact-rich manipulation, faster inference will improve reaction time, the latency between prompt and action execution. As we consider tasks longer than 5 seconds, closed-loop replanning with memory context will help us handle drift, partial observability, and recovery.
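As a back-of-the-envelope check, these numbers imply roughly 12 seconds of prompt-to-action latency per 5-second rollout.

```python
# Rough latency budget per rollout, using the numbers above (approximate).
wm_seconds = 11.0      # 1XWM backbone: generate a 5 s rollout
idm_seconds = 1.0      # IDM: extract actions from the generated video
rollout_horizon = 5.0  # seconds of real-time motion per rollout

prompt_to_action = wm_seconds + idm_seconds            # ~12 s before the robot starts moving
realtime_factor = rollout_horizon / prompt_to_action   # ~0.42x real time
print(f"latency: {prompt_to_action:.0f} s, real-time factor: {realtime_factor:.2f}x")
```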
We don’t need perfect performance across the board to make deep progress. Nonzero success across a broad set of tasks provides the lever to unlock self-improvement.
1XWM creates a flywheel where exploration, evaluation, and policy refinement are driven by NEO’s own experience, rather than being limited by expert demonstrations. When learning is driven by experience, NEO can only improve from here, and we are excited to build a future where NEO can teach itself to master any task in any home.
If this sounds motivating to you, come help us build the future: https://1x.recruitee.com/o/ai-research-engineer-world-models