Physics-Informed BEV World Model

Anonymous Author(s)

Typical problems in world models, compared with our solution

TL;DR: PIWM is a lightweight, physics-informed generative model that predicts future images from the current image and actions, enabling forecasting with strong existential and temporal consistency in dynamic environments.

Every single frame (every pixel) in these videos is generated by neural networks while humans play in real time.

Physics-Informed BEV World Model Demonstration

Beyond generative quality, adherence to physical consistency more directly determines practical utility.

Baseline (DIAMOND)

High visual quality but weak physical consistency

Hard Mask (Simple Geometry Prior)

Over-constrained behavior; lane changes are difficult

Soft Mask (ours)

High physical consistency (interactive & temporal)

Abstract

A major challenge in deploying world models is the trade-off between size and performance. Large world models can capture rich physical dynamics but require massive computing resources, making them impractical for edge devices. Small world models are easier to deploy but often struggle to learn accurate physics, leading to poor predictions. To address this, we propose the Physics-Informed BEV World Model (PIWM), a compact model designed to efficiently capture physical interactions in bird’s-eye-view (BEV) representations. PIWM incorporates a Soft Mask mechanism during training to improve dynamic object modeling and future prediction. We also introduce a simple yet effective inference technique called Warm Start, which enhances prediction quality even in zero-shot settings. Experiments demonstrate that, at the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared to the largest baseline model (400M), the smallest PIWM variant (130M with Soft Mask) achieves a 7.4% higher weighted overall score while delivering 28% faster inference speed.

Podcast

On a run and want to get the gist of our paper? Listen to the following podcast!

What is Soft Mask?

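The page does not spell out how the Soft Mask is computed, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes the mask acts as a per-pixel weight on a reconstruction loss: a hard mask keeps only object pixels, while a soft mask lets the weight fall off smoothly around objects and never drops below a small floor. The names `box_blur`, `soft_mask`, `weighted_mse`, and the parameters `radius` and `floor` are all hypothetical.

```python
def box_blur(mask, radius=1):
    """Mean-filter a 2D grid; border cells average over in-bounds neighbors only."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                mask[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(vals) / len(vals)
    return out


def soft_mask(binary_mask, radius=1, floor=0.1):
    """Hypothetical soft mask: object pixels keep weight 1, nearby pixels get
    a blurred falloff, and every pixel keeps at least `floor` weight."""
    blurred = box_blur(binary_mask, radius)
    return [
        [max(floor, b, v) for b, v in zip(b_row, v_row)]
        for b_row, v_row in zip(binary_mask, blurred)
    ]


def weighted_mse(pred, target, weights):
    """Squared error per pixel, weighted and normalized by the total weight."""
    num = 0.0
    den = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * (p - t) ** 2
            den += w
    return num / den if den > 0 else 0.0


# Toy 5x5 BEV grid with a single object pixel in the center.
hard = [[0.0] * 5 for _ in range(5)]
hard[2][2] = 1.0
soft = soft_mask(hard, radius=1, floor=0.1)

# A prediction that is wrong only at a background pixel far from the object.
target = [[0.0] * 5 for _ in range(5)]
pred = [[0.0] * 5 for _ in range(5)]
pred[0][0] = 1.0

hard_loss = weighted_mse(pred, target, hard)  # background error is ignored
soft_loss = weighted_mse(pred, target, soft)  # background error still counts
```

In this toy setup the hard mask assigns zero weight to the background, so an error there goes unpenalized, whereas the soft weighting still penalizes it while emphasizing the object and its surroundings.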

Quantitative comparison:

Human evaluation scores. The metrics are Interactive Existential Consistency (IEC), Kinematics Response (KIR), Temporal Existential Consistency (TEC), and Weighted Overall (WO). Marked experiments are evaluated by 4 human raters, while the rest are evaluated by 24. The baseline is DIAMOND.

At the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared with the largest baseline model (400M), the smallest PIWM (130M with Soft Mask) achieves a 7.4% higher weighted overall score with 28% faster inference.
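As a rough illustration of how a weighted overall score and a relative improvement of this kind can be computed, here is a small sketch. The weights and scores below are placeholders chosen for the arithmetic, not values from the paper, and the function names are hypothetical; the page does not state the actual weighting of IEC, KIR, and TEC.

```python
def weighted_overall(scores, weights):
    """Weighted mean of per-metric scores, normalized by the sum of weights."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


def relative_gain_percent(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Placeholder weights and ratings (assumptions, not the paper's numbers).
weights = {"IEC": 1.0, "KIR": 1.0, "TEC": 1.0}
baseline_scores = {"IEC": 2.0, "KIR": 3.0, "TEC": 4.0}
model_scores = {"IEC": 3.0, "KIR": 4.0, "TEC": 5.0}

baseline_wo = weighted_overall(baseline_scores, weights)
model_wo = weighted_overall(model_scores, weights)
gain = relative_gain_percent(model_wo, baseline_wo)
```

With equal weights this reduces to a plain average; unequal weights would let one consistency metric dominate the overall score.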

Why do human subjective ratings matter?

Check out what the Genie 3 creators say in their interview, at timestamp 25:25: How Do You Measure the Quality of a World Model?

BibTeX

@misc{anonymous,
  title={Enhancing Physical Consistency in Lightweight World Models},
  author={Anonymous Author(s) for now},
  year={2025},
  eprint={2509.12437},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.12437}
}