PhysicalRealismBench-U: Attributable Physical Realism Evaluation for Video World Models
Intelligence is not only linguistic, but also visual and physical. While LLMs are becoming an increasingly mature technology and are successfully used in multiple digital domains spanning email editing, text summarization, and even coding, their multimodal extensions lack visual and physical understanding of the world. On the one hand, they can recite complex physics laws using formal languages; on the other hand, they don't fully grasp object permanence, motion, or how objects collide.
Today, we release PhysicalRealismBench-U — a physical realism benchmark with a synthetic dataset containing programmatic physics violations — along with an evaluation pipeline to evaluate state-of-the-art VLMs in the context of physics understanding.
We show that even the best existing models fail at fundamental physical reasoning tasks, which even kids would easily solve. Our findings are especially critical in the fast-emerging space of Physical General Intelligence or World Models.
The Problem: Intuitive Physics
Neither a cat trying to gracefully catch a bird, nor a basketball player who skilfully shoots into the basket needs to write equations of motion to perform their tasks. Instead, they intuitively understand the laws of physics. They “know” how objects interact with each other, or how they fall. This happens due to the combination of evolution and lifelong learning.
The same abilities are needed for physical general intelligence. An autonomous driving system that doesnʼt respect object permanence across occlusions will make catastrophic planning errors. A robot that fails to conserve support relations will take dangerous actions. Yet existing evaluation approaches fall short of catching those failures as they often focus on linguistic skills or generic understanding of concepts in images or videos.
Those shortcomings are becoming increasingly important as VLMs are often re-purposed to serve as the “robotics brain” or used as an evaluator in various world model benchmarks such as VBench-2.0, WorldModelBench, or PAI-Bench, or as a reward function (VLM-RMs, RL-VLM-F, ERL-VLM, etc.).
However, we show that current VLMs may skip frames, rely on spatial heuristics (see Insight 2: Border Proximity Triggers False Reasoning), and miss fundamental violations. These findings have important ramifications: for example, a VLM used to evaluate world models can produce a false sense of progress.
The question is not whether current models sometimes get physics wrong — they do. The question is how systematically they fail, and whether the field has the tools to measure and diagnose these failures precisely enough to drive improvement. Our findings suggest the answer is negative: even state-of-the-art VLMs fail to detect basic violations like objects vanishing or moving without cause, and existing benchmarks lack the attribution machinery needed to turn these failures into actionable insights. A bare violated / not-violated verdict can be correct by chance, so it cannot distinguish a model that perceived the violation from one that guessed; only requiring the offending object and the frame range to be named makes a correct answer evidence of understanding, and that is exactly what existing benchmarks omit.
This motivates both PhysicalRealismBench-U and a broader call for the community to prioritise physical realism as a first-class evaluation citizen.
[Fig. 1: Evaluation pipeline diagram. Video (synthetic ground truth) → Physics Law Templates (template library: Conservation of Mass, Gravity, Impenetrability, Conservation of Momentum) → Scene-Specific Q&A (object, time span, evidence, law tag) → VLM Judge + CV metrics (violation yes/no plus reasoning) → Attributable Diagnosis (which law, which object, which frames)]
How does our work extend existing benchmarks
Our work is inspired by recently proposed benchmarks targeting physical realism (Physion-Eval, PhysBench, PAI-Bench, WorldModelBench, VideoPhy-2). We complement them with synthetic data (3D-rendered videos) that enables attributable and diagnosable results: ground-truth labels are computed programmatically and precisely, and each video exhibits at most one violation type, avoiding confounding factors.
For the evaluation, we score the model on jointly identifying the violation type, the violating frames, and the violating object, which prevents potential shortcuts by the evaluated model and provides a strong certificate that the model has perceived the physical violation, both spatially and temporally.
Comparison Table (Understanding Benchmarks)
Each cell reflects what the benchmark provides regarding VLM evaluation; dataset annotations that aren't used in a quantitative metric are not counted.



Tab. 1: Comparison of PhysicalRealismBench-U with existing physical-understanding benchmarks, restricted to each benchmark's quantitatively scored VLM evaluation. Annotations not consumed by a metric are not counted, since they lie outside what the benchmark actually evaluates. Physion-Eval, for instance, uses an automatic VLM evaluation that contains no law, entity, or time-span labels; its expert annotations of law, object, and time attribution are used to assess VLMs only qualitatively for object and time attribution, and law attribution is not evaluated.
* Per the table's scope (quantitatively scored VLM evaluation): Physion-Eval provides law attribution annotations and reasoning (which includes time attribution and object reference in free form), but does not quantitatively score them for VLM evaluation; object references and timestamps are inspected qualitatively on selected examples rather than evaluated via a defined metric over the full set.
** VideoPhy-2 uses global physical common sense scores and rule level scores but no direct law attribution.
PhysicalRealismBench-U: Physics-First, Scene-Specific, Attributable
PhysicalRealismBench-U is organised around two core design principles. The overarching goal is to turn implicit physics into measurable and diagnosable signals. This release focuses on physical understanding (Track U).
Design Principles
1. Physics-first, but not physics-only.
Every benchmark item is anchored to a specific physical law, so each evaluation result is traceable to a named principle rather than to an aggregate quality score. But no judgment reduces to a closed-form physics check: every law we evaluate requires combining motion analysis with scene context understanding. For instance, an object hovering is only a violation if there is nothing supporting it.
2. Scene-specific evaluation mapped to first principles.
Each test case carries structured and dense ground truth: the physical law being tested, the relevant object(s), the relevant frame range, and the concrete failure mode when a violation is present. Questions are instantiated from per-law templates bound to concrete scenes — not abstract physics categories.
Track U: Physical Understanding
The released track evaluates whether VLMs can perceive and reason about physical events in videos. Given a video and a physics-grounded question, the model must decide for each target physical law whether it is obeyed or violated, and provide a short rationale explaining its judgment.
Physics Ontology
Every question in PhysicalRealismBench-U is tied to a specific physical law, so every result points to a named principle rather than a vague aggregated quality score. Physics has only a handful of fundamental rules, but each can fail in more than one way. We organise the benchmark around four core laws, and for each law we test its distinct failure modes:
Conservation of mass — things shouldn't appear from nothing, vanish into nothing, or suddenly change size.
Gravity & support — unsupported things should fall; supported things shouldn't hover.
Impenetrability — solid objects shouldn't pass through one another, or through solid surfaces.
Conservation of momentum — things shouldn't speed up, stop, or change direction without a cause.
Every video is made by injecting at most one of these failures, so each video breaks at most one law, in one specific way. We grade models on the laws (does this video obey or violate the law?) and report results broken down by the specific kind of failure. That's the payoff: failures aren't just detectable, they're diagnosable, traceable to a named law, the specific object involved, and the exact moment things went wrong.
[Fig. 2: Physics ontology diagram. Four core law cards: Conservation of Mass (object permanence, identity preservation, size / volume consistency); Gravity / Support (hovering objects, falling inconsistencies, missing support); Impenetrability (contact violations, ghosting through surfaces, intersection events); Conservation of Momentum (spontaneous acceleration, velocity discontinuities, motion without cause). Attribution schema per question: law, object_id, trigger_frame, violation_type]
What We Release
PhysicalRealismBench-U comprises three components:
Baseline results: we evaluate seven state-of-the-art VLMs — including models from Gemini, Qwen, and GPT — and find that the realism score (see realism F1 below) ranges from 0.14 to 0.60, well below reliable physical understanding even for frontier models.
Findings: targeted probes reveal why models fail — they skip frames, rely on spatial shortcut heuristics like border proximity rather than temporal reasoning, and recover dramatically when object positions are supplied as text rather than inferred from pixels.
Leaderboard: a public page where any VLM can be evaluated and compared against our baselines, with overall scores and per-law breakdowns.
Because the ground truth is synthetic and exact, there is no annotation ambiguity — a violation either exists or it doesn't. PhysicalRealismBench-U is a diagnostic tool providing precise attribution: it tells you not just whether your model fails, but where and why.
Explore the PhysicalRealismBench-U Benchmark & Leaderboard
Submit your VLM's results, and see how it compares. The evaluation page provides per-category breakdowns, interactive result inspection, and ranked model comparisons.
The PhysicalRealismBench-U Dataset
PhysicalRealismBench-U is built on synthetic videos rendered in NVIDIA Isaac Sim. Each violation is injected programmatically as a single, controlled deviation, giving exact, unambiguous ground truth with no annotation noise.
The dataset contains 1300 videos (1920x1080 resolution, 120 frames each): 600 matched pairs (1200 videos) across six violation types, plus 100 occlusion videos. Each pair shows the same scene twice — once with a violation, once without — making physics changes the only variable. The six physics changes map directly to the physics ontology: spontaneous impulse, spontaneous appearance, spontaneous disappearance, pass-through, shape change, and gravity violation.
The 100 occlusion videos show an object passing behind an occluder and reappearing normally. With no violation present, they can only yield true negatives (the model reports no violation) or false positives (the model reports a violation that is not there), so they specifically probe whether a model hallucinates a violation where none exists.



Tab. 2: PhysicalRealismBench-U dataset statistics.
Dataset Structure: What the VLM Sees
Each sample to be evaluated consists of four components:
Video - The simulated scene rendered with realistic textures and lighting.
Object ID overlays - Each relevant object is assigned an integer ID and labelled with a bounding box on a frame. The VLM is asked to refer to objects by ID in its response, which makes the output programmatically matchable to ground truth — even when the scene contains multiple instances of the same class (e.g., two red balls).
Frame-counter overlay - Some VLMs may not have direct awareness of time or frame numbers. We overlay a frame counter at the top-left of each video so the VLM can cite exact frame numbers when reporting where a violation occurs.
Per-law question prompt - For each video, the VLM is asked one yes/no question per physics violation type in the core set (does this video violate <law>?). The response is parsed for the binary answer, the ID(s) of any violating object(s), and the frame range; all three feed the automated scoring pipeline, allowing precise violation identification and localisation.
Ground truth. Every sample carries a structured ground-truth tuple (law, {object_id, object_name}, trigger_frame_range, violation_type), which is matched component-wise against the VLM's parsed response. Because both ground truth and prediction use the same object IDs, the match is exact rather than text-fuzzy.
For example, a ground truth entry might read: (conservation_of_mass, {id: 1, "small_mover"}, frame 21, spontaneous disappear). This level of granularity lets us aggregate failure rates by law, by object, by violation type — and trace every score back to a specific object and moment in the video.
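The component-wise match described above could be sketched as follows. This is an illustration rather than the released pipeline: the class names `GroundTruth` and `Prediction`, the function `match_components`, and in particular the rule that a predicted frame range counts as correct when it overlaps the ground-truth range are assumptions, since the post does not state the frame tolerance it uses.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class GroundTruth:
    law: str
    object_id: int
    object_name: str
    frame_range: Tuple[int, int]  # inclusive (first, last) trigger frames
    violation_type: str


@dataclass(frozen=True)
class Prediction:
    """Fields parsed from a VLM response (hypothetical structure)."""
    law: str
    violated: bool
    object_ids: FrozenSet[int]
    frame_range: Optional[Tuple[int, int]]


def ranges_overlap(a, b):
    """True when two inclusive frame ranges share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]


def match_components(gt, pred):
    """Compare prediction and ground truth one component at a time."""
    return {
        "law": pred.law == gt.law,
        # Exact ID comparison: no fuzzy matching on object names.
        "object": gt.object_id in pred.object_ids,
        # ASSUMPTION: any overlap with the ground-truth range counts as a hit.
        "frames": pred.frame_range is not None
        and ranges_overlap(gt.frame_range, pred.frame_range),
    }


# The example entry from the text: object 1 disappears at frame 21.
gt = GroundTruth("conservation_of_mass", 1, "small_mover", (21, 21),
                 "spontaneous_disappear")
pred = Prediction("conservation_of_mass", True, frozenset({1}), (19, 23))
result = match_components(gt, pred)
print(result)
```

Keeping the three components separate, rather than collapsing them into one boolean, is what allows failure rates to be aggregated by law, by object, and by frame localisation.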
Design Principles for Synthetic Data
VLMs rely heavily on visual cues to interpret scenes. The following three requirements shaped the final dataset:
1. Realistic textures - untextured surfaces cause VLMs to misread scene geometry (e.g. calling a grounded object "floating").
2. Depth cues - backgrounds and ground planes must convey spatial context; homogeneous surfaces eliminate the cues VLMs need to judge motion and position.
3. Controlled velocities - objects must move slowly enough to appear across multiple frames; fast motions are systematically missed, consistent with Insight 1 (see below).
[Fig. 3 panels, each a VALID / VIOLATION frame pair with its law tag: Spontaneous Impulse (object begins moving with no visible cause; Momentum Conservation) · Spontaneous Appearance (Mass Conservation) · Pass-Through (one object passes through another; Impenetrability) · Spontaneous Disappearance (object vanishes mid-scene with no explanation; Mass Conservation) · Gravity (object hovers or falls inconsistently; Gravity / Support) · Shape Change (object deforms without physical cause; Mass Conservation)]
Fig. 3: Matched valid/violation example pairs across the six violation types. Each pair shares the same scene, isolating the injected physics change.
Failure: Spontaneous disappearance misclassified as 'leaving the scene'
GEMINI 3.1 PRO RESPONSE
"The object moves toward the right edge of the frame and exits the visible area. This is consistent with normal motion — no physics violation detected."
Ground Truth: Object with id=1 disappears spontaneously in frame 40
Failure: Hovering violation misclassified as normal gravitational motion
GEMINI 2.5 PRO RESPONSE
"All objects accelerate downwards in a manner consistent with gravity. Their trajectories are plausible parabolic arcs. The initial state of being in motion in mid-air is valid."
Ground Truth: Object with id=3 hovers (frames=[6,26])
Failure: Spontaneous (uncaused) force misclassified as static equilibrium
GPT-5 RESPONSE
"After establishing rest over frames 0–2, no object begins moving or changes velocity without a cause; the entire clip shows static equilibrium consistent with friction and support."
Ground Truth: On object ID=1 a force is applied without any visible cause (Frames=[14,14]).
Fig. 4: Examples of VLMs failing to detect the physics violation.
How PhysicalRealismBench Scores
PhysicalRealismBench evaluates VLMs on two levels. Throughout, we treat positive = violation, negative = no violation for each (video, law) pair.
Level 1: Classification F1 — can the model detect violations?
For each (video, law) pair, the VLM produces a yes/no judgment. Comparing this to ground truth gives a TP/FP/FN/TN per (video, law), and we report standard classification metrics over all pairs (precision, recall, and F1 as the headline). F1 is computed globally (micro-averaged): we accumulate TP, FP, and FN across all (video, law) pairs and compute a single F1 at the end. We prefer F1 to accuracy because the classes are strongly imbalanced: each video has at most one violation across 6 physics violation types, so positives are rare and a model that always answers "no violation" would still reach high accuracy.
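A minimal sketch of the micro-averaged computation, assuming each (video, law) pair is reduced to a `(truth_violated, predicted_violated)` tuple. The function names and the toy data are invented for illustration. The first toy case is one video with six questions answered "no" every time, which shows why accuracy says little under this class imbalance.

```python
from collections import Counter


def confusion_label(truth_violated, predicted_violated):
    """Positive = violation, negative = no violation."""
    if truth_violated:
        return "TP" if predicted_violated else "FN"
    return "FP" if predicted_violated else "TN"


def micro_f1(pairs):
    """Accumulate counts over all pairs, then compute one precision/recall/F1."""
    counts = Counter(confusion_label(t, p) for t, p in pairs)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


def accuracy(pairs):
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy data: one violating question and five non-violating ones, all answered "no".
all_no = [(True, False)] + [(False, False)] * 5
# Toy data: the violation is found, plus one false alarm.
mixed = [(True, True), (False, True)] + [(False, False)] * 4

print(accuracy(all_no), micro_f1(all_no))
print(accuracy(mixed), micro_f1(mixed))
```

In the first case accuracy is 5/6 while F1 is 0, because the only violation was missed.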
Level 2: Joint F1 — the leaderboard metric
Classification alone doesn't tell us whether the model actually understood the physics. We therefore use a stricter F1 in which a (video, law) pair is counted as a TP only if the model both (a) correctly classifies the law as violated and (b) provides correct reasoning. Reasoning is correct if and only if the response references both the violating object (by ID) and the violation's frame range, i.e. it localises the violation in space and in time. The resulting metric is the realism F1 score.
This is the single number reported on the leaderboard (e.g., 60.1% for GPT-5.5). It penalises a model for both detection failures and ungrounded reasoning, which prevents the "lucky classifier" problem — a model that detects something is wrong but can't explain what, where, or when scores no better than one that simply missed the violation.
The joint metric therefore treats detection and reasoning as one combined decision. The table below shows how we assign TP/FP/FN/TN counts to each outcome that can occur in our setup:



Tab. 3: How each (video, law) outcome maps to TP/FP/FN/TN counts for the joint realism F1 (micro-averaged over all pairs).
Why two levels? Level 1 isolates pure detection performance. Level 2 measures detection-plus-grounding, which is what we actually need from a physics critic. The gap between them tells you whether a model's failures are about finding violations or explaining them — and per-component breakdowns (object-match vs. frame-range-match) further localise whether the weakness is spatial or temporal.
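The relationship between the two levels can be sketched on a handful of invented responses. One point is an assumption rather than something stated in the text: here a violation that is detected but not grounded (wrong object ID or wrong frame range) is counted as an FN, which matches the statement that such a model "scores no better than one that simply missed the violation". Tab. 3 remains the authoritative mapping.

```python
def classification_outcome(truth_violated, predicted_violated):
    """Level 1: only the yes/no answer matters."""
    if truth_violated:
        return "TP" if predicted_violated else "FN"
    return "FP" if predicted_violated else "TN"


def joint_outcome(truth_violated, predicted_violated, object_ok, frames_ok):
    """Level 2: a TP also needs the right object ID and frame range.

    ASSUMPTION: a detected but ungrounded violation is counted as an FN.
    """
    if truth_violated:
        grounded = predicted_violated and object_ok and frames_ok
        return "TP" if grounded else "FN"
    return "FP" if predicted_violated else "TN"


def micro_f1(outcomes):
    tp, fp, fn = (outcomes.count(k) for k in ("TP", "FP", "FN"))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented responses: (truth_violated, predicted_violated, object_ok, frames_ok)
responses = [
    (True, True, True, True),      # detected and grounded
    (True, True, False, True),     # detected, but wrong object ID
    (True, False, False, False),   # missed
    (False, True, False, False),   # hallucinated violation
    (False, False, False, False),  # correctly reported as valid
]

class_f1 = micro_f1([classification_outcome(t, p) for t, p, _, _ in responses])
realism_f1 = micro_f1([joint_outcome(*r) for r in responses])
drop = class_f1 - realism_f1
print(round(class_f1, 3), round(realism_f1, 3), round(drop, 3))
```

On this toy set the class F1 is 2/3 and the realism F1 is 0.4; the difference is entirely due to the one response that detected the violation but named the wrong object.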
Why Physical Realism Needs More Attention
To establish baselines and demonstrate PhysicalRealismBench-U's utility, we evaluated nine state-of-the-art VLMs on the full dataset. These results serve as reference points for anyone evaluating their own models, but more importantly, they reveal how far current models are from reliable physical reasoning. Even on synthetic scenes with unambiguous violations, frontier models struggle: the best realism F1 among our tested models (metric defined in the section 'How PhysicalRealismBench Scores') is only 60.1% (GPT-5.5, where individual frames were provided). If VLMs cannot reliably detect an object vanishing in a controlled environment, they cannot be trusted as physics judges for either real-world videos or generated videos (via t2v or world models). Across all models, F1 drops substantially when evaluation requires not only the correct class but also correct reasoning, indicating that VLMs often predict a violation without actually grounding it in what they see.



Tab. 4: VLM baselines on PhysicalRealismBench-U. Realism F1 (grounded) vs. Class F1 (detection only); Drop = their difference. Except for GPT-5 and GPT-5.5, all models were evaluated at the best quality settings available. All submissions were executed by Reka Labs.
* We used them with the best possible quality settings at 1216x684 resolution.
The scores so far show that models fail; the rest of this section asks why. We ran a series of targeted probes into how VLMs process physics in video, and four findings stood out.
Insight 1: State-of-the-Art VLMs Skip Frames
As a probe, we overlaid a random number on each frame and asked models to read them back. On 120-frame videos, Gemini 3.1 Pro recovered only 72% and Qwen3.5-plus 0% — yet both hit 100% when the same frames were sent as individual images. Replacing random numbers with consecutive ones restored Gemini's video-mode accuracy to 100%, confirming it interpolates rather than inspects. Frame skipping is not a single-model quirk.



[Fig. 5: Frame-recognition probe using Gemini 3.1 Pro, 120-frame videos. Video mode: 72% Accuracy. Sent as separate images: 100%. Consecutive numbers in video mode → 100%]
Implication: When a model interpolates across frames instead of inspecting each one, single-frame events — a one-frame position change, a brief appearance — can slip through undetected. Both leading models we probed (Gemini 3.1 Pro and Qwen3.5-plus) read every frame correctly as images yet miss them in video mode, so this is a real reliability constraint for any VLM used as a physics critic.
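To see why random numbers expose frame skipping while consecutive numbers hide it, consider the toy simulation below. It is not a model of how any VLM works: `interpolating_reader`, which looks at every fourth frame and fills the gaps linearly, is an invented stand-in, and the label range and stride are arbitrary choices.

```python
import random


def random_labels(n_frames, seed=0, low=1000, high=9999):
    """One random number per frame, as in the probe's overlay."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n_frames)]


def readback_accuracy(expected, reported):
    """Fraction of frames whose number was read back correctly."""
    if len(expected) != len(reported):
        raise ValueError("expected and reported must have the same length")
    return sum(e == r for e, r in zip(expected, reported)) / len(expected)


def interpolating_reader(labels, stride=4):
    """Hypothetical reader: inspects every `stride`-th frame, interpolates the rest."""
    last = len(labels) - 1
    reported = []
    for i in range(len(labels)):
        left = (i // stride) * stride
        right = min(left + stride, last)
        if right == left:
            reported.append(labels[left])
            continue
        frac = (i - left) / (right - left)
        reported.append(round(labels[left] + frac * (labels[right] - labels[left])))
    return reported


n_frames = 120
consecutive = list(range(n_frames))
scrambled = random_labels(n_frames)

acc_consecutive = readback_accuracy(consecutive, interpolating_reader(consecutive))
acc_random = readback_accuracy(scrambled, interpolating_reader(scrambled))
print(acc_consecutive, acc_random)
```

With consecutive labels the interpolated guesses are always right, so the skipping reader looks perfect; with random labels only the frames it actually inspected can match.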
Insight 2: Border Proximity Triggers False Reasoning
When an object near the image border disappears, VLMs often claim that the object "left the scene", even though its trajectory shows no outward motion, or a linear extrapolation of that trajectory places the object still inside the frame at the moment it vanishes.
This indicates VLMs struggle to combine position, velocity, and frame timing to verify whether an exit is geometrically plausible, revealing a lack of basic spatiotemporal motion and extrapolation understanding.
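The plausibility check that the models appear to skip can be written in a few lines. The sketch below assumes constant velocity estimated from the last two observed positions, treats the object as a point at its centre (ignoring its extent), and uses the 1920x1080 frame size of the dataset; the function names and the example coordinates are invented.

```python
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080  # dataset resolution stated in the post


def extrapolate(prev_pos, last_pos, frames_ahead=1):
    """Constant-velocity extrapolation from the last two observed positions."""
    vx = last_pos[0] - prev_pos[0]
    vy = last_pos[1] - prev_pos[1]
    return (last_pos[0] + vx * frames_ahead, last_pos[1] + vy * frames_ahead)


def exit_is_plausible(prev_pos, last_pos, frames_ahead=1,
                      width=FRAME_WIDTH, height=FRAME_HEIGHT):
    """True only if the extrapolated centre lies outside the frame."""
    x, y = extrapolate(prev_pos, last_pos, frames_ahead)
    inside = 0 <= x < width and 0 <= y < height
    return not inside


# Case A: static object near the right border, then gone.
case_a = exit_is_plausible((1900, 540), (1900, 540))
# Case B: moving right at 5 px/frame, still well inside one frame later.
case_b = exit_is_plausible((1795, 540), (1800, 540))
# Contrast: moving right at 60 px/frame, which does carry the centre out.
case_c = exit_is_plausible((1820, 540), (1880, 540))
print(case_a, case_b, case_c)
```

In cases A and B an exit is geometrically implausible, so a disappearance there is a violation rather than the object "leaving the scene".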



[Fig. 6 — Border proximity bias: Case A (static near border, vanishes, VLM says exited) and Case B (moving, extrapolation says still inside, VLM says exited)]
Insight 3: VLMs Understand Physics Far Better from Text than from Vision
Supplying an object's trajectory as text, i.e. its per-frame positions (x, y, t), makes VLMs substantially better at judging whether the physics is violated, even though the same information was already visible in the video. On the spontaneous-impulse task, for example, several models gain a lot: e.g., Opus 4.8 improves realism F1 by 141%. The effect is not specific to this task; we observe it on other violation types as well. Once the facts are written out, these models judge the physics correctly; what they lack is the visual understanding to recover those facts from pixels.



Tab. 5: Realism F1 with vs. without per-frame positions (x, y, t) as text (spontaneous-impulse task). Except for GPT-5 and GPT-5.5, all models were evaluated at the best quality settings available. All submissions were executed by Reka Labs.
* We used them with the best possible quality settings at 1216x684 resolution.



[Fig. 7 — Trajectory as text: per-frame positions (x, y, t) injected into the prompt. Without them the VLM misses the violation; with them it judges the physics correctly.]
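One way to serialise a trajectory for the prompt is sketched below. The exact text format used in the experiments is not given in the post, so the layout, the function name `trajectory_to_text`, and the example positions are all hypothetical.

```python
def trajectory_to_text(object_id, positions):
    """Render per-frame (x, y) positions as plain text, one line per frame t."""
    lines = [f"Object {object_id} centre positions in pixels, one line per frame:"]
    for t, (x, y) in enumerate(positions):
        lines.append(f"t={t}: x={x:.1f}, y={y:.1f}")
    return "\n".join(lines)


# Invented trajectory: at rest for three frames, then moving without a cause.
positions = [(100.0, 540.0), (100.0, 540.0), (100.0, 540.0),
             (130.0, 540.0), (160.0, 540.0)]
prompt_block = trajectory_to_text(1, positions)
print(prompt_block)
```

Written out this way, the jump from rest to 30 px/frame is explicit in the numbers, which is the information a model otherwise has to recover from pixels.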
Insight 4: Motion Detection Has an Asymmetric Temporal Bias
In a controlled sweep varying the number of frames an object spends at position A, position B, and then disappears, we found the VLM (Gemini 3.1 Pro) fails to detect motion specifically when the object lingers at position A for many frames but appears at B for only a few. The reverse (short A, long B) is detected correctly. Out of 108 tested configurations, 12 produced failures — all with this asymmetric pattern. This suggests a recency or salience bias in temporal attention.



[Fig. 8 — Asymmetric temporal bias. Same A→B displacement in both timelines; only the dwell split differs. Long at the origin with a brief destination appearance (top) is missed (✗); the mirror case (bottom) is detected (✓). All 12 of 108 swept configurations failed in this one direction — evidence of a recency/salience bias, not a perception limit.]
Conclusions
The case for a stronger focus on realism: Current VLMs skip frames, rely on spatial shortcut heuristics, and miss basic Newtonian violations in clean synthetic scenes — failures no human would make. If they struggle here, their reliability on real-world video is fundamentally in question. VLMs are increasingly used as judges for video generation and as the reasoning backbone of robotics systems; a weak physics evaluator produces a false sense of progress. Physical realism needs to be a first-class evaluation axis — with dedicated benchmarks, explicit diagnostics, and public leaderboards.
Towards Solving Physical Realism
Physical realism evaluation is not yet a solved problem. PhysicalRealismBench-U provides a precise diagnostic tool — exact ground truth, an evaluation pipeline, and public baselines — with the goal of making physics failures not just detectable but diagnosable: traceable to specific laws, objects, and moments. This release covers physical understanding; other tracks will follow.
We believe having a public benchmark is essential for building video world models that are safe and reliable enough for embodied deployment — and PhysicalRealismBench-U is an important stepping stone towards solving physical realism.
References
VBench — Huang et al., CVPR 2024. Link: https://openaccess.thecvf.com/content/CVPR2024/papers/Huang_VBench_Comprehensive_Benchmark_Suite_for_Video_Generative_Models_CVPR_2024_paper.pdf
VBench-2.0 — Zheng et al., arXiv:2503.21755, 2025. Link: https://arxiv.org/abs/2503.21755
PAI-Bench — Zhou et al., CVPR 2026. Link: https://arxiv.org/abs/2512.01989
WorldModelBench — Li et al., NeurIPS 2025 (Datasets & Benchmarks Track). Link: https://openreview.net/forum?id=a3hafrDzuA
PhysBench — Chow et al., ICLR 2025. Link: https://openreview.net/forum?id=Q6a9W6kzv5
VideoPhy-2 — Bansal et al., ICLR 2026. Link: https://arxiv.org/abs/2503.06800
Physion-Eval — Zhang et al., arXiv:2603.19607, 2026. Link: https://arxiv.org/abs/2603.19607
VLM-RMs — Rocamonde et al., ICLR 2024. Link: https://openreview.net/forum?id=N0I2RtD8je
RL-VLM-F — Wang et al., ICML 2024. Link: https://arxiv.org/abs/2402.03681
ERL-VLM — Luu et al., ICML 2025. Link: https://openreview.net/forum?id=k77bq8AJVy
Citation
If you use PhysicalRealismBench-U, please cite this post:
