
Home robots need common sense behavior and a deep understanding of the physical world.
Many robot foundation models today are vision-language-action models (VLAs), which take a pretrained VLM and add an output head to predict robot actions (PI0.6, Helix, Groot N1.5). VLMs benefit from internet-scale knowledge, but are trained on objectives that emphasize visual and semantic understanding over prediction of physical dynamics. As a result, tens of thousands of hours of costly robot data are needed to teach a model how to solve tasks that are simple for a human. Additionally, auxiliary objectives are often used to further coax spatial reasoning about physical interactions (MolmoAct, Gemini-Robotics 1.5).
In this blog, we introduce our video-pretrained world model, 1XWM, integrated into NEO as a robot policy. While VLAs directly predict action trajectories from static image-language input, our world-model-based policy derives robot actions from text-conditioned video generation. By leveraging the world dynamics inherent in internet-scale video, our world model generalizes to novel objects, motions, and tasks without pre-training on large-scale robot data or any related teleoperated demonstrations. This represents a shift towards a new regime: allowing robot intelligence to benefit from the scaling of video pretraining, enabled by a hardware stack designed for high-fidelity transfer from human embodiment.
Internet video implicitly encodes the structural priors of reality: how people and objects move, where forces are applied, and the implicit constraints of interaction. Accurate translation of video into action comes from more than just a model—it benefits from an embodiment that is kinematically and dynamically congruent with the human form.
Treating hardware as a first-class citizen in the AI stack closes the human-robot translation gap. By combining this embodiment with human-like compliance, interaction dynamics (friction, inertia, and contact behavior) often match human motion closely enough for the model's learned priors to remain in distribution. What the model can visualize, NEO can usually do.
Raw video provides the visual "what," but lacks the control "how." To bring video knowledge into world-model form, we use a two-stage grounding process enabled by our integrated stack, following existing works like DreamGen and UniPi: a world model (WM) that generates video of the intended behavior, and an inverse dynamics model (IDM) that recovers the robot actions needed to realize it.
At inference time, the system receives a text prompt and a starting frame. The WM rolls out the intended future, the IDM extracts the necessary trajectory, and the robot executes the sequence in the real world.
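In pseudocode, this loop might look like the sketch below; `world_model`, `idm`, and `robot` are hypothetical stand-ins for our internal interfaces, not the production code.

```python
# Minimal sketch of the two-stage loop (all interfaces here are hypothetical stand-ins).

def run_task(world_model, idm, robot, prompt: str, start_frame):
    # The world model imagines the intended future as a short video rollout,
    # conditioned on the current observation and the text instruction.
    generated_video = world_model.generate(start_frame=start_frame, prompt=prompt)

    # The inverse dynamics model recovers the robot action trajectory
    # that would realize the generated video.
    actions = idm.extract_actions(generated_video)

    # The trajectory is executed on the real robot.
    robot.execute(actions)
```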

At test time, we send a prompt from our frontend to the model inference server, which then executes the resulting actions on-robot.
The 1XWM backbone is built upon a 14B generative video model. To adapt this model to NEO's embodiment, we use a multi-stage training strategy.
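Schematically, the recipe can be summarized as below; the stage names and data descriptions are our illustrative labels for the phases described in this post, not exact training configurations.

```python
# Illustrative summary of the multi-stage training recipe (labels are ours, not exact configs).
TRAINING_STAGES = [
    ("video_pretraining",     "web-scale video (inherited from the 14B base video model)"),
    ("ego_human_midtraining", "egocentric human video, with brief labels upsampled into detailed captions"),
    ("neo_posttraining",      "NEO robot data, primarily tabletop pick-and-place"),
]
```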
Previous works like DALL-E 3 show that the prompt adherence of visual foundation models can be significantly improved by training on descriptive visual captions. However, many egocentric datasets contain only brief task descriptions. To address this, we prompt a VLM to expand them into more detailed captions for training, a process we refer to as caption upsampling.
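As a rough sketch of what caption upsampling can look like, the snippet below prompts a generic VLM client to expand a brief task label into a richer training caption; the `vlm.generate` interface and the prompt wording are hypothetical, not our production pipeline.

```python
# Sketch of caption upsampling with a VLM (hypothetical `vlm` client and prompt).

UPSAMPLE_PROMPT = (
    "You are annotating an egocentric manipulation video. "
    "Given the first frame and the short task label below, write a detailed "
    "caption describing the scene, the objects involved, and the motion "
    "performed, in 2-3 sentences.\n\nTask label: {label}"
)

def upsample_caption(vlm, first_frame, short_label: str) -> str:
    """Expand a brief task description into a detailed training caption."""
    return vlm.generate(image=first_frame, prompt=UPSAMPLE_PROMPT.format(label=short_label))
```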
To select the best 1XWM backbone checkpoint, we use an evaluation metric that compares the dynamic time warping distance between the ground truth future action sequence and the actions from our trained IDM run on the generated video. This helps filter for generations that not only look good visually but also yield accurate actions.
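A simplified version of this checkpoint-selection metric is sketched below. The `world_model`, `idm`, and evaluation-set interfaces are hypothetical, while `dtw_distance` is the standard dynamic time warping recursion.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic time warping distance between two action sequences of shape (T, action_dim)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def score_checkpoint(world_model, idm, eval_set) -> float:
    """Lower is better: average DTW between ground-truth actions and the actions
    the IDM extracts from the checkpoint's generated videos (hypothetical interfaces)."""
    distances = []
    for example in eval_set:  # each example holds a start frame, a prompt, and true actions
        video = world_model.generate(start_frame=example.frame, prompt=example.prompt)
        predicted_actions = idm.extract_actions(video)
        distances.append(dtw_distance(predicted_actions, example.true_actions))
    return float(np.mean(distances))
```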
Similarly to DreamGen, we use a two-image IDM predictor with a sliding window of W=8 frames for efficient training and inference. However, we use a Depth Anything backbone with a separate flow matching head instead of a single diffusion transformer. Frames at times t and t+W are passed through the depth backbone and the embeddings are used as conditioning for the flow matching head. The IDM is trained on 400 hours of unfiltered robot data including random play data and motions that don’t correspond to any meaningful task. This allows us to faithfully track NEO’s motions anywhere.
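The sketch below shows one way such a two-image IDM could be wired up in PyTorch. The `depth_backbone` and `flow_head` modules, and the forward signature, are placeholders of our own for a Depth Anything-style encoder and a flow-matching network, not the production architecture.

```python
import torch
import torch.nn as nn

class TwoFrameIDM(nn.Module):
    """Sketch of a two-image inverse dynamics model with a depth backbone
    and a flow-matching action head (hypothetical module interfaces)."""

    def __init__(self, depth_backbone: nn.Module, flow_head: nn.Module, window: int = 8):
        super().__init__()
        self.depth_backbone = depth_backbone  # maps an image to a feature embedding
        self.flow_head = flow_head            # denoises an action chunk given conditioning
        self.window = window                  # W frames between the two conditioning images

    def forward(self, frame_t, frame_t_plus_w, noisy_actions, flow_time):
        # Embed the frames at times t and t + W with the depth encoder.
        z_t = self.depth_backbone(frame_t)
        z_w = self.depth_backbone(frame_t_plus_w)
        cond = torch.cat([z_t, z_w], dim=-1)

        # The flow-matching head predicts the velocity field for the action chunk
        # spanning the W frames between the two images.
        return self.flow_head(noisy_actions, flow_time, cond)
```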
At test-time, given the starting frame and a text prompt instructing NEO on what to do, 1XWM rolls out future video frames. We then extract the corresponding robot action trajectory with the IDM, and execute it directly on the robot. To ensure smooth trajectories, the IDM’s outputs are timewise averaged across a batch of initial noise values and sliding windows.
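A minimal sketch of this smoothing step is shown below, assuming a hypothetical `idm_sample(frame_a, frame_b)` call that draws one action chunk per pair of frames; overlapping windows and repeated noise samples are simply averaged per timestep.

```python
import numpy as np

def extract_smoothed_actions(idm_sample, frames, window: int = 8, noise_batch: int = 4):
    """Average IDM action predictions over initial noise samples and overlapping
    sliding windows (a sketch of the smoothing described above)."""
    num_steps = len(frames) - 1  # T frames give T-1 action steps
    assert num_steps >= window, "need at least one full window of frames"
    sums, counts = None, None
    for start in range(num_steps - window + 1):
        # Draw several action chunks from different initial noise values and average them.
        chunks = [idm_sample(frames[start], frames[start + window]) for _ in range(noise_batch)]
        chunk = np.mean(chunks, axis=0)  # shape: (window, action_dim)

        if sums is None:
            sums = np.zeros((num_steps, chunk.shape[-1]))
            counts = np.zeros((num_steps, 1))
        # Accumulate overlapping sliding-window predictions for timewise averaging.
        sums[start:start + window] += chunk
        counts[start:start + window] += 1
    return sums / counts
```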
Caption: Our NEO post-training dataset contains primarily high-quality pick-and-place data (98.5%), filtered for tabletop manipulation with hands in view. By leveraging the web-scale pretraining of the base video model, 1XWM can generalize to a wide range of unseen objects, environments, and tasks.
We first seek to understand: what are the limits on task generalization beyond what NEO has already seen and done? How closely does generated video align with real-world execution?
We have NEO attempt a range of tasks that probe these questions.
We observe that 1XWM-generated videos generally align well with real-world execution. The generated video and footage of the actual execution can look very similar when viewed side-by-side, showing that 1XWM has strong spatial, kinematic, and physical understanding.
[Side-by-side videos: 1XWM generated rollouts (WM) and real-world executions (REAL)]
Next, we try tasks requiring two-handed coordination and human interaction, abilities not included in our training dataset. Success here suggests that this knowledge comes from video pre-training and egocentric human mid-training. Because NEO's embodiment is so similar to a human's, the affordances learned from human video data translate directly.
[Side-by-side videos: 1XWM generated rollouts (WM) and real-world executions (REAL)]
Generations can sometimes be overly optimistic about task completion and depth understanding. Generated rollouts can look visually plausible while subtly violating real-world constraints (e.g. object consistency, depth, geometry, contact). Our post-training recipe significantly reduces these errors, but monocular pretraining can still lead to weak 3D grounding, where the real robot undershoots or overshoots even when the generated video "succeeds." This motivates future work on integrating depth or stereo sensing for better spatial grounding.
Beyond anecdotal examples, we show real-world results measuring the performance of 1XWM on in-distribution (ID) and out-of-distribution (OOD) tasks, running each 30 times. 1XWM achieves stable success rates across diverse action primitives, although some dexterous tasks remain challenging (e.g. pouring, drawing).
Does the quality of a generated video predict whether the resulting execution will succeed? If so, we can measure and improve video quality using visual metrics and estimate the likelihood of real-world success.
Sometimes it is visually obvious whether a generated rollout is likely to succeed. For example, prompting 1XWM with "pull tissue" can occasionally produce videos of NEO picking up the tissue box instead. We have generally found a success rate close to 0% when executing such bad generations.
This suggests that ideas like test-time compute scaling can improve task success. Inspired by this, we try generating multiple rollouts in parallel and executing the best one. We study the "pull tissue" task with between one and eight parallel generations, finding that selecting the highest-quality generation from eight choices does lead to improved task success.
This selection process can be done manually, but is amenable to automation with a VLM evaluator. We leave exploration of best-of-N sampling strategies to future work. For simplicity, and to avoid conflating it with our video-quality ranking, all other results in this blog post use a single generation per attempt.
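For illustration, a best-of-N loop might look like the sketch below, where `score_fn` stands in for a human check or a VLM evaluator judging the rollout and all other interfaces are hypothetical.

```python
# Best-of-N sketch: generate several rollouts and execute the highest-scoring one.

def best_of_n(world_model, idm, robot, prompt: str, start_frame, score_fn, n: int = 8):
    # Generate N candidate futures (run in parallel in practice; shown sequentially for clarity).
    rollouts = [world_model.generate(start_frame=start_frame, prompt=prompt) for _ in range(n)]

    # Select the generation judged most likely to succeed at the task.
    best_video = max(rollouts, key=lambda video: score_fn(video, prompt))

    # Extract and execute the corresponding action trajectory.
    robot.execute(idm.extract_actions(best_video))
```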
Given our hypothesis of correlation between video quality and task success, we visually ablate a few training choices, specifically the impact of upsampling captions and training on egocentric human data.
To evaluate, we generate videos of NEO performing tasks and then ask for human feedback. We use three evaluation datasets, each containing 500 starting image-prompt pairs: an in-distribution split, a New Tasks split, and a T2I split.
We ask human annotators to review each generated video and accept or reject it based on physical plausibility, task completion, and consistency with NEO’s embodiment and capabilities.
We find that upsampling captions improves video generation quality on every evaluation split. Upsampled captions better match the detailed text conditioning used during video-model pretraining and also provide clearer conditioning for task-specific motion.
Adding egocentric human data improves generation quality on both New Tasks and T2I splits. This is consistent with our hypothesis that ego human data contributes a transferable prior for manipulation that maps well onto NEO’s humanoid embodiment. For in-distribution tasks for which we have good NEO data coverage, egocentric data may dilute the post-training mix and cause a neutral or negative change in visual score.
Finally, we run real-world ablations on egocentric mid-training and caption upsampling, techniques we found to improve visual generation quality. For each task, we run all models 30 times in the same setting, on the same robot, and without test-time filtering. We do not control further with blind evaluation or checkpoint selection.
On the only in-distribution task, "Grab chips," we find that all models perform similarly, suggesting that in-distribution tasks are the least sensitive to changes in training data or captioning.
Our "Scrub dish" task was the most challenging across models. Models without either the egocentric human mid-training phase or caption upsampling misinterpret the instruction. We find that the model with both is the only one to achieve a nonzero success rate.
Overall, we do see a connection: changes that improve visual generation quality also improve task success in our real-world experiments.
1XWM transfers the knowledge of web-scale video data to imagine and then execute diverse tasks on NEO. From here, we’re excited to improve 1XWM to solve more complex, longer-horizon tasks required for useful household autonomy.
Building on the rapid progress observed in generative video models and equipped with data from our NEO production ramp, we will continue scaling 1XWM to improve quality and task success. Today, the 1XWM backbone takes 11 seconds to run, using multi-GPU inference built in collaboration with our friends at Verda (Antonio and Aditya). Each inference generates 5 seconds of real-time video, and the IDM takes 1 second to extract actions from the generated rollout. To handle reactive tasks and contact-rich manipulation, faster inference will improve reaction time, the latency between prompt and action execution. As we consider tasks longer than 5 seconds, closed-loop replanning with memory context will help us handle drift, partial observability, and recovery.
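As a back-of-the-envelope check, these numbers imply roughly 12 seconds of prompt-to-action latency per 5-second rollout.

```python
# Rough latency budget per rollout, using the numbers above (approximate).
wm_seconds = 11.0      # 1XWM backbone: generate a 5 s rollout
idm_seconds = 1.0      # IDM: extract actions from the generated video
rollout_horizon = 5.0  # seconds of real-time motion per rollout

prompt_to_action = wm_seconds + idm_seconds            # ~12 s before the robot starts moving
realtime_factor = rollout_horizon / prompt_to_action   # ~0.42x real time
print(f"latency: {prompt_to_action:.0f} s, real-time factor: {realtime_factor:.2f}x")
```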
We don’t need perfect performance across the board to make deep progress. Nonzero success across a broad set of tasks provides the lever to unlock self-improvement.
1XWM creates a flywheel where exploration, evaluation, and policy refinement are driven by NEO’s own experience, rather than being limited by expert demonstrations. When learning is driven by experience, NEO can only improve from here, and we are excited to build a future where NEO can teach itself to master any task in any home.
If this sounds motivating to you, come help us build the future: https://1x.recruitee.com/o/ai-research-engineer-world-models