Physics-Informed BEV World Model

Anonymous Author(s)

PIWM: a lightweight, physics‑informed generative model that predicts future images from the current image and actions — enabling forecasting with strong existential and temporal consistency in dynamic environments.

Every single frame in these videos is generated by neural networks, while humans play with them in real time.

Physics-Informed BEV World Model Demonstration

Beyond generative quality, adherence to physical consistency more directly determines practical utility.

Baseline (DIAMOND).

High visual quality but weak physical consistency

Soft Mask.

High physical consistency (interactive & temporal)

Hard Mask.

Over‑constrained behavior, lane changes difficult.

Abstract

A major challenge in deploying world models is the trade-off between size and performance. Large world models can capture rich physical dynamics but require massive computing resources, making them impractical for edge devices. Small world models are easier to deploy but often struggle to learn accurate physics, leading to poor predictions. We propose the Physics-Informed BEV World Model (PIWM), a compact model designed to efficiently capture physical interactions in bird's-eye-view (BEV) representations. PIWM uses Soft Mask during training to improve dynamic object modeling and future prediction. We also introduce Warm Start, a simple yet effective inference technique that enhances prediction quality with a zero-shot model. Experiments on 2,000 driving episodes from HighwayEnv show that PIWM achieves more accurate and physically consistent predictions than baseline small models, while remaining lightweight enough for potential deployment on edge computing platforms.

What is Soft Mask?
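
One plausible reading of Soft Mask, offered here as an assumption rather than the authors' formulation, is a per-pixel loss weighting: pixels covered by dynamic objects keep full weight, while background pixels are down-weighted instead of being dropped. Under that reading, a background weight of 1.0 recovers the plain loss and a weight of 0.0 gives a hard mask. The function name `masked_mse` and the parameter `background_weight` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def masked_mse(
    pred: Sequence[float],
    target: Sequence[float],
    object_mask: Sequence[float],
    background_weight: float,
) -> float:
    """Weighted mean squared error over flattened pixels.

    object_mask[i] is 1.0 for a dynamic-object pixel and 0.0 for background.
    background_weight is an assumed knob, not a value from the paper:
      1.0 -> ordinary MSE, 0.0 -> hard mask, values in between -> soft mask.
    """
    if not (len(pred) == len(target) == len(object_mask)):
        raise ValueError("pred, target and object_mask must have equal length")

    # Object pixels get weight 1.0; background pixels get background_weight.
    weights = [m + (1.0 - m) * background_weight for m in object_mask]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all pixels have zero weight")

    weighted_error = sum(
        w * (p - t) ** 2 for w, p, t in zip(weights, pred, target)
    )
    return weighted_error / total_weight


# Toy frame with four pixels; only the first belongs to a vehicle.
pred = [0.0, 0.0, 0.0, 0.0]
target = [1.0, 1.0, 0.0, 0.0]
mask = [1.0, 0.0, 0.0, 0.0]

plain_loss = masked_mse(pred, target, mask, background_weight=1.0)
soft_loss = masked_mse(pred, target, mask, background_weight=0.5)
hard_loss = masked_mse(pred, target, mask, background_weight=0.0)
```

In this toy frame the soft variant still penalizes the background error at the second pixel, which the hard variant ignores entirely. That is the intuition this sketch is meant to convey, not a claim about the actual training loss.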


Qualitative comparison of typical scenes:

Quantitative comparison:

Human evaluation scores. The metrics are Interactive Existential Consistency (IEC), Kinematics Response (KIR), Temporal Existential Consistency (TEC), and Weighted Overall (WO). Marked experiments were evaluated by 4 humans; the rest were evaluated by 24 humans.

At the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared with the largest baseline model (400M), the smallest PIWM (130M Soft Mask) achieves a 7.4% higher weighted overall score with a 28% faster inference speed.
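
As a rough sketch of how such figures can be derived, the snippet below averages per-rater scores for each metric, combines them into a weighted overall score, and computes a relative improvement over a baseline. The metric weights, the rating values, and the baseline score are all hypothetical placeholders; the page does not state the weights used for WO, and none of these numbers are results from the paper.

```python
from statistics import mean
from typing import Dict, List


def weighted_overall(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-metric scores; weights are normalized here."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100.0


# Hypothetical ratings from three raters per metric (illustrative only).
ratings: Dict[str, List[float]] = {
    "IEC": [3.0, 4.0, 5.0],
    "KIR": [4.0, 4.0, 4.0],
    "TEC": [2.0, 3.0, 4.0],
}
# Hypothetical weights; the actual weighting is not given in the text.
weights = {"IEC": 0.5, "KIR": 0.25, "TEC": 0.25}

mean_scores = {name: mean(values) for name, values in ratings.items()}
model_wo = weighted_overall(mean_scores, weights)

baseline_wo = 3.0  # placeholder baseline score
gain_percent = relative_improvement(model_wo, baseline_wo)
```

A statement such as "surpasses the baseline by 60.6%" corresponds to `relative_improvement` applied to the two weighted overall scores.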

Why do human subjective ratings matter?

Check out what the Genie 3 creators say in their interview (timestamp 25:25): How Do You Measure the Quality of a World Model?

BibTeX

@misc{wang2025enhancingphysicalconsistencylightweight,
        title={Enhancing Physical Consistency in Lightweight World Models}, 
        author={Dingrui Wang and Zhexiao Sun and Zhouheng Li and Cheng Wang and Youlun Peng and Hongyuan Ye and Baha Zarrouki and Wei Li and Mattia Piccinini and Lei Xie and Johannes Betz},
        year={2025},
        eprint={2509.12437},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2509.12437}, 
}