World models as approximate environments
To achieve their goals, agents, whether biological or artificial, build approximate models of their environment to anticipate the consequences of their actions. In research, these world models are compressed spatial and temporal representations [1] of an environment’s transition dynamics: given a state and an action, they predict the next state and reward. By unrolling these predictions, the model simulates how the world evolves, effectively allowing the agent to 'see' the future.
To help evaluate world models, we propose WorldModelGym, a benchmark that evaluates them indirectly through the lens of decision-based fidelity: how useful is the world model as a stand-in for the real world? We answer this by measuring the real-world consequences of acting on the model's predictions.
This sits alongside our recent PhysicalRealismBench [2], which measures how well a VLM understands real-world physics: can it recognize when a scene breaks the basic rules of dynamics? WorldModelGym asks the complementary question: is the model a sufficiently faithful simulator for an agent to achieve real-world success? Understanding the world and being usable as a simulator of it are different properties, and the two benchmarks cover different aspects of the same goal.
The benchmark employs a single standardized interface to evaluate latent, predictive, and pixel-based models under identical, frozen policies. Designed for efficiency, without requiring the massive rollouts of search-based alternatives, it keeps compute-heavy models tractable and complements existing benchmarks so that together they provide a comprehensive view of model performance.
Existing benchmarks measure visual or physical plausibility [3-7], rollout accuracy [8-9], reconstruction [10], or planning success [11-12]. WorldModelGym expands this ecosystem by introducing a targeted metric for decision-based fidelity, efficiently evaluating the environment utility of a world model's predictions through direct, open-loop action sequences.
The protocol: scoring decisions
Imagine an autopilot AI flying a plane. We could let it fly for many hours to see whether it crashes, but that is expensive, unsafe, and slow. Instead, WorldModelGym poses a sudden, high-stakes multiple-choice question to the autopilot. Each choice is a sequence of actions. All sequences are designed to look plausible to the autopilot, and only we know the correct answer. Thus, the only way for the autopilot to decide which sequence is better is to simulate them and judge the outcomes. If the autopilot (a world model) truly understands the "physics" of the environment, it will correctly simulate and rank the choices. If its internal logic is flawed, it will pick the wrong one, which results in failure.
Following this analogy, WorldModelGym evaluates world models through a series of questions, each representing a single decision-making task. For each question, we provide five different choices (action sequences) for the world model, including one random choice. As we have already run all five in the real environment, we know each one's outcome.
During the evaluation phase, the world model is initialised with a short observation history. It then evaluates each action sequence on the provided menu by simulating its future and predicting the rewards along the way. A simple, environment-neutral argmax rule then selects the action sequence with the highest cumulative predicted reward. Since this rule remains constant across all tests, any differences in performance are a direct reflection of the model's predictive accuracy.
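The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's evaluator: the `RewardPredictor` signature, `pick_sequence`, and the toy predictor are all hypothetical names introduced here, and tie-breaking toward the earliest menu entry is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical signature: given the observation history and one action
# sequence, return the model's predicted per-step rewards.
RewardPredictor = Callable[[Sequence[float], Sequence[int]], Sequence[float]]


def pick_sequence(
    predict_rewards: RewardPredictor,
    history: Sequence[float],
    menu: Sequence[Sequence[int]],
) -> int:
    """Return the index of the menu entry with the highest predicted return."""
    predicted_returns = [sum(predict_rewards(history, actions)) for actions in menu]
    # Environment-neutral argmax; ties resolve to the earliest entry (assumption).
    return max(range(len(menu)), key=predicted_returns.__getitem__)


def toy_predictor(history, actions):
    # Stand-in for a learned world model: reward 1.0 for every action equal to 1.
    return [1.0 if a == 1 else 0.0 for a in actions]


# A five-choice menu of short action sequences (illustrative values).
menu = [[0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
choice = pick_sequence(toy_predictor, history=[0.0, 0.0], menu=menu)
print(choice)
```

With the toy predictor, the second sequence has the highest predicted return, so the script prints 1. Because the rule never looks at anything except predicted rewards, swapping in a different model changes the outcome only through its predictions.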
The headline metric is normalized regret:
normalized_regret = (best_return - true_return_of_the_picked_sequence) / (best_return - worst_return)
Here, best_return is the real-world return of the best sequence on the menu, true_return_of_the_picked_sequence is the real-world return of the sequence the world model picked based on its predicted rewards, and worst_return is the real-world return of the random baseline.
Models achieving a score of 0 have successfully identified the optimal plan, whereas a score of 1 represents performance no better than the worst choice.
Back to our autopilot analogy: imagine two autopilots taking the same test. The first accurately predicts what each maneuver will really do. Its predicted numbers are close to reality, and it selects the right choice. The second gets the numbers wrong but, by luck, still selects the correct choice. Both pick correctly, so both have zero regret. Our second metric, reward-prediction error, acts like a flight instructor who checks the predicted outcomes against what actually happened: it credits the autopilot whose predictions were right and exposes the one that chose well only by coincidence. More formally, reward-prediction error is defined as the normalized difference between predicted and actual rewards. This allows us to distinguish between models that make high-quality decisions through precise reward modeling and those that appear successful due to internal error cancellation.
Below is an example of a single probe.
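The two-autopilot scenario can be made concrete with a small sketch. The exact normalization is not spelled out here, so this version assumes mean absolute error divided by the track's scoring range; the function name, the score bounds, and all numbers are illustrative.

```python
from typing import Sequence


def reward_prediction_error(
    predicted: Sequence[float],
    actual: Sequence[float],
    score_min: float,
    score_max: float,
) -> float:
    """Mean absolute error between predicted and actual returns, scaled by the score range.

    Dividing by (score_max - score_min) is an assumed normalization.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    span = score_max - score_min
    if span <= 0:
        raise ValueError("score_max must exceed score_min")
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / (len(predicted) * span)


# Illustrative numbers only: real returns for a five-choice menu.
actual = [600.0, 200.0, 250.0, 300.0, 50.0]
accurate = [590.0, 210.0, 240.0, 310.0, 60.0]  # close to reality
lucky = [900.0, 800.0, 700.0, 100.0, 500.0]    # wrong numbers, same argmax

results = {}
for name, predicted in [("accurate", accurate), ("lucky", lucky)]:
    picked = max(range(len(predicted)), key=predicted.__getitem__)
    error = reward_prediction_error(predicted, actual, score_min=0.0, score_max=1000.0)
    results[name] = (picked, error)
    print(name, picked, round(error, 3))
```

Both toy models pick index 0, the truly best sequence, so both would score zero regret. Their errors differ by a wide margin (0.01 versus 0.4 under this normalization), which is the gap the second metric is meant to expose.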
In this task, a robot arm must reach a red button on the table and press it down from above. The reward has two parts: getting the gripper to the button, and how far it then pushes the button down. So a sequence earns reward for both reaching and pressing. The best sequence does both and gets a reward of 620. The other variations get near the button but barely press it, earning rewards of around 210-270. The random one hardly moves toward it and receives only 33. The figure shows how the five outcomes compare.
The WorldModelGym contract
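To see what these returns mean for the headline metric, here is a small worked sketch. It assumes normalized regret is (best - picked) / (best - worst), which matches the description of 0 as the best pick and 1 as the worst. The returns 620 and 33 come from the probe; the three middle values are placeholders chosen inside the stated 210-270 range.

```python
def normalized_regret(best_return: float, picked_return: float, worst_return: float) -> float:
    """0.0 when the best sequence is picked, 1.0 when the pick is no better than the worst."""
    span = best_return - worst_return
    if span <= 0:
        # Degenerate menu: every choice is equally good, so nothing is lost.
        return 0.0
    return (best_return - picked_return) / span


# Real returns for the button-press probe. 620 and 33 are from the text;
# the variant values are placeholders within the reported 210-270 range.
true_returns = {
    "best": 620.0,
    "variant_a": 210.0,
    "variant_b": 240.0,
    "variant_c": 270.0,
    "random": 33.0,
}

best = max(true_returns.values())
worst = true_returns["random"]
for name, value in true_returns.items():
    print(f"{name}: {normalized_regret(best, value, worst):.3f}")
```

Under these placeholder values, a model that picks one of the near-miss variations lands at a regret of roughly 0.60 to 0.70, while picking the random sequence gives exactly 1.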
We adopt the Gymnasium contract, the industry standard for agent-environment interfaces in reinforcement learning, and extend it to support learned world models as follows:
The step method follows the standard gymnasium.Env.step interface, returning the next state and the model's predicted reward. The model acts purely as an environment simulator: it consumes actions rather than selecting them, and it does not need to render visual frames, since it simply updates its internal state and estimates the expected reward. Because the evaluator treats the state representation as an opaque object, the model can use any format (e.g. JEPA-style latent vectors, tokens, or pixels from a video world model), making the protocol highly flexible. Any world model that predicts reward fits this structure, regardless of how it predicts states. The primary requirement is that the model supports branching from a frozen state so that multiple plans can be evaluated efficiently.
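One way such an adapter could look is sketched below. This is not the published interface: `WorldModelEnv`, `reset`, `fork`, and the toy model are hypothetical names, and only the five-element return shape of `step` mirrors Gymnasium's `Env.step` (observation, reward, terminated, truncated, info). How the real contract handles termination flags is an assumption here.

```python
import copy
from typing import Any, Protocol, Sequence, Tuple


class WorldModelEnv(Protocol):
    """Hypothetical adapter interface; method names are assumptions."""

    def reset(self, history: Sequence[Any]) -> Any:
        """Encode a short observation history and return an opaque state."""
        ...

    def step(self, action: Any) -> Tuple[Any, float, bool, bool, dict]:
        """Gymnasium-style 5-tuple; the observation slot carries the opaque state."""
        ...

    def fork(self) -> "WorldModelEnv":
        """Return an independent copy frozen at the current state."""
        ...


class ToyLatentModel:
    """Stand-in model: the latent state is a running sum of actions."""

    def __init__(self) -> None:
        self.state = 0.0

    def reset(self, history):
        self.state = float(sum(history))
        return self.state

    def step(self, action):
        self.state += float(action)
        reward = -abs(self.state)  # toy predicted reward: stay close to zero
        return self.state, reward, False, False, {}

    def fork(self):
        return copy.deepcopy(self)


def predicted_return(model: WorldModelEnv, actions: Sequence[Any]) -> float:
    branch = model.fork()  # leave the frozen root state untouched
    return sum(branch.step(a)[1] for a in actions)


root = ToyLatentModel()
root.reset([1, 1])
returns = [predicted_return(root, plan) for plan in ([-1, -1], [1, 1], [0, 0])]
print(returns, root.state)
```

The evaluator side only ever touches `fork` and `step`, so it never needs to know whether the state is a latent vector, a token sequence, or a frame. Forking before each rollout is what lets several plans be scored from the same starting point without re-encoding the history.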
Statistics
WorldModelGym launches with 100+ tracks across multiple environment families, including Atari games, Meta-World manipulation tasks, DeepMind Control tasks, and classic control.

Environment selection is integral to our design, as effective RL tasks do not necessarily translate into effective world-model benchmarks. For example, our suite prioritizes environments that are observable via fixed history windows, support reproducible resets with controlled stochasticity, and offer horizons long enough for prediction errors to compound meaningfully. Each track is defined by a declarative file specifying the environment, canonical observation, and scoring range, allowing the suite to be extended without modifying the underlying evaluator.
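As a rough illustration of what a declarative track definition might carry, the sketch below models one as a validated record. The actual file format and field names are not shown here, so `TrackSpec`, `load_track`, and every field are assumptions; `CartPole-v1` is a Gymnasium classic-control ID used purely as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackSpec:
    """Hypothetical track definition; field names are illustrative assumptions."""

    env_id: str          # which environment to instantiate
    observation: str     # canonical observation type, e.g. "pixels" or "state"
    history_length: int  # size of the fixed observation-history window
    score_min: float     # lower end of the scoring range
    score_max: float     # upper end of the scoring range

    def __post_init__(self) -> None:
        if self.history_length < 1:
            raise ValueError("history_length must be at least 1")
        if self.score_max <= self.score_min:
            raise ValueError("score_max must exceed score_min")


def load_track(raw: dict) -> TrackSpec:
    """Build a validated TrackSpec from an already-parsed declarative file."""
    return TrackSpec(**raw)


# Placeholder values standing in for the contents of one track file.
track = load_track(
    {
        "env_id": "CartPole-v1",
        "observation": "state",
        "history_length": 4,
        "score_min": 0.0,
        "score_max": 500.0,
    }
)
print(track)
```

Keeping the definition as plain data means a new track is a new file rather than a code change, which is the extensibility property described above.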
World models integrate with our benchmarks via a lightweight adapter that implements the contract mentioned above. Standardized adapter checks and a unified format ensure that all submissions remain comparable. The leaderboard displays every track, providing a preview of the evaluation policy in the ground-truth environment alongside its baseline score.
Limitations
WorldModelGym focuses on decision-relevant accuracy. It does not judge how realistic a model's imagined frames look, which is a separate perceptual question. Nor does it capture the long-horizon, closed-loop behavior a deployed agent would face, since the test is open-loop.
Try it out!
If your world model can encode an observation history, take an action, and predict the resulting reward, it can be scored. Each track ships its questions and answers, its menus, and the precomputed true returns; a submission is graded by normalized regret across the suite.
We will make the leaderboard available soon. For evaluation requests or questions, reach out to us at contact@reka.ai.








