CS2-10k: A Large-Scale Egocentric Counter-Strike 2 Dataset

Training interactive world models requires data that is hard to find: egocentric video sequences with densely aligned action signals (keyboard inputs, camera motion, and ego state), all synchronized to the visual stream. Real-world embodied data is costly to collect, and synthetic data often lacks the visual richness or behavioral diversity needed for generalization.

Counter-Strike 2 demos offer a compelling middle ground. Matches are recorded as a deterministic replay — a compact file that encodes the full state trajectory of all players across every tick of the game. From a single demo file, we can reconstruct clean first-person video for any player at any point in the match, and extract the precise control inputs that drove each visual change.

Today we release CS2-10k, a large-scale egocentric gameplay dataset built from professional CS2 matches. It contains 646,578 player-round videos spanning 11,072 hours of first-person footage, paired with per-frame annotations covering keyboard state, mouse movement, and 3D player trajectory. We are also releasing cs2-dem-renderer, the open-source pipeline used to produce it.

9,548 hours · 512K videos · 1,520 pro matches · ~1K rounds

Dataset Overview

CS2-10k is built from public professional match demos sourced from HLTV. For each demo, we render clean first-person video at 1280×720 resolution and 48 fps using the demo replay tool inside CS2, producing one video per player per round. Alongside each video, we store a parquet file containing per-frame annotations synchronized to the video timeline.

Annotation Schema

Every video clip has its corresponding annotations stored in a .parquet file:

Per-Frame Annotations

Each entry in frame_data contains:

The combination of video and per-frame control signals creates a tight action-observation loop.

No Abrupt Visual Changes

Each clip is a contiguous segment of a single round from a single player's perspective. There are no mid-round cuts, no editing transitions, and no HUD overlay. The camera moves through the world in a physically plausible way, and we hide the player's weapon to avoid the sudden visual changes caused by recoil, reloads, and weapon switching.

Use Cases for World Models

CS2-10k is designed specifically as a training substrate for interactive world models: models that must predict how first-person visual observations evolve in response to control inputs. Below are representative use cases:

Action-Conditioned Video Generation

Train models to generate the next N frames given the current frame and a keyboard+mouse action sequence. Dense per-frame controls make CS2-10k a natural fit for models like GameNGen, Genie, and OASIS.
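One way to turn a clip into training samples for this setup is to slide a window over the per-frame records, pairing a conditioning frame with the next N actions and the N frames they should produce. The sketch below is only an illustration: the `FrameRecord` fields (`keys`, `mouse_dx`, `mouse_dy`) are hypothetical stand-ins rather than the dataset's actual column names, and it assumes the action logged at frame t drives the transition from t to t + 1.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class FrameRecord:
    """Hypothetical per-frame annotation; field names are illustrative only."""
    keys: Tuple[str, ...]  # keys held during this frame, e.g. ("W", "A")
    mouse_dx: float        # horizontal mouse movement for this frame
    mouse_dy: float        # vertical mouse movement for this frame


def action_windows(
    records: List[FrameRecord], horizon: int, stride: int = 1
) -> Iterator[Tuple[int, List[FrameRecord], List[int]]]:
    """Yield (context_frame, actions, target_frames) samples.

    Assumes the action at frame t drives the transition t -> t + 1, so
    records[t : t + horizon] condition frames t + 1 .. t + horizon.
    """
    if horizon < 1 or stride < 1:
        raise ValueError("horizon and stride must be positive")
    last_start = len(records) - horizon - 1  # last t with all targets in range
    for t in range(0, last_start + 1, stride):
        actions = records[t : t + horizon]
        targets = list(range(t + 1, t + horizon + 1))
        yield t, actions, targets
```

The function returns frame indices rather than pixels, so the same windows can be reused with whatever video decoder the training code relies on.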

Egocentric Navigation

With 3D player positions and yaw/pitch per frame, the dataset supports learning navigation priors: what does moving forward look like in a confined corridor vs an open site? How does camera control correlate with positional change?
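The relationship between camera control and positional change can be made concrete by differencing consecutive poses. The following is a minimal sketch under stated assumptions: the `(x, y, z, yaw, pitch)` tuple layout is hypothetical, and angles are assumed to be in degrees, so yaw differences are wrapped to avoid a spurious 340° jump when the view crosses the ±180° boundary.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical pose layout: (x, y, z, yaw_deg, pitch_deg).
Pose = Tuple[float, float, float, float, float]


def wrap_degrees(angle: float) -> float:
    """Wrap an angle difference into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def motion_deltas(poses: Sequence[Pose]) -> List[Tuple[float, float, float]]:
    """Per-step (distance, yaw_change, pitch_change) between consecutive frames."""
    deltas = []
    for prev, curr in zip(poses, poses[1:]):
        distance = math.dist(prev[:3], curr[:3])    # Euclidean step length
        yaw_change = wrap_degrees(curr[3] - prev[3])
        pitch_change = wrap_degrees(curr[4] - prev[4])
        deltas.append((distance, yaw_change, pitch_change))
    return deltas
```

For example, a step from yaw 170 to yaw -170 is reported as a 20 degree turn rather than a -340 degree one.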

Inverse Dynamics Modeling

Given two consecutive frames, predict the keyboard and mouse inputs that caused the transition. Per-frame ground-truth controls allow supervised training of action classifiers without manual annotation.
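For the keyboard part of this task, a common framing is multi-label classification: each frame pair gets a multi-hot vector saying which keys were held. The sketch below shows one way to build such labels. `KEY_VOCAB` is a hypothetical vocabulary chosen for illustration, not the dataset's real key set, and the label for the pair (t, t + 1) is assumed to come from the keys recorded at frame t.

```python
from typing import List, Sequence, Tuple

# Hypothetical key vocabulary; the dataset's real key set may differ.
KEY_VOCAB: Tuple[str, ...] = ("W", "A", "S", "D", "SPACE", "CTRL")


def encode_keys(pressed: Sequence[str]) -> List[int]:
    """Multi-hot encode the held keys over KEY_VOCAB."""
    held = set(pressed)
    unknown = held - set(KEY_VOCAB)
    if unknown:
        raise ValueError(f"keys outside vocabulary: {sorted(unknown)}")
    return [1 if key in held else 0 for key in KEY_VOCAB]


def idm_pairs(
    keys_per_frame: Sequence[Sequence[str]],
) -> List[Tuple[Tuple[int, int], List[int]]]:
    """Return ((t, t + 1), label) for every pair of consecutive frames."""
    return [
        ((t, t + 1), encode_keys(keys_per_frame[t]))
        for t in range(len(keys_per_frame) - 1)
    ]
```

Mouse movement is continuous, so it would more naturally be handled as a separate regression target alongside these key labels.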

Multi-View Synchronization

Multiple players are recorded per round with shared map and round identifiers. This enables training models that reason about the same event from different spatial viewpoints — important for scene understanding and 3D consistency.
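Collecting the views of one event amounts to grouping clips by their shared identifiers. The sketch below assumes a hypothetical `Clip` record with `match_id`, `map_name`, `round_number`, `player`, and `path` fields; the actual metadata fields in CS2-10k may be named or structured differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Clip(NamedTuple):
    """Hypothetical clip metadata; field names are illustrative only."""
    match_id: str
    map_name: str
    round_number: int
    player: str
    path: str


def group_views(clips: Iterable[Clip]) -> Dict[Tuple[str, int], List[Clip]]:
    """Group clips by (match_id, round_number), keeping rounds with 2+ views."""
    groups: Dict[Tuple[str, int], List[Clip]] = defaultdict(list)
    for clip in clips:
        groups[(clip.match_id, clip.round_number)].append(clip)
    return {key: views for key, views in groups.items() if len(views) >= 2}
```

Clips in the same group do not necessarily start and end together, since each one covers a single player's time alive in the round, so frame-level alignment would additionally need a shared clock if the annotations provide one.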

Long-Horizon Planning

Full rounds of 60–90 seconds provide significantly longer temporal horizons than most embodied datasets. A model can learn tactical structure: site entry, holds, rotations, and retakes — each as a coherent visual sequence.
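To put these horizons in frame terms: at the dataset's 48 fps, a 60 second round is 2,880 frames and a 90 second round is 4,320 frames, which is typically longer than a video model's context. The sketch below shows the arithmetic and one simple way to split a round into consecutive chunks; the chunk length of 1,024 frames in the usage note is an arbitrary illustrative value.

```python
from typing import List, Tuple

FPS = 48  # frame rate stated for CS2-10k clips


def frames_in(seconds: float, fps: int = FPS) -> int:
    """Number of frames covering a duration at the given frame rate."""
    return round(seconds * fps)


def chunk_bounds(num_frames: int, chunk_len: int) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into consecutive [start, end) chunks.

    The final chunk is shorter when num_frames is not a multiple of chunk_len.
    """
    if chunk_len < 1:
        raise ValueError("chunk_len must be positive")
    return [
        (start, min(start + chunk_len, num_frames))
        for start in range(0, num_frames, chunk_len)
    ]
```

With these definitions, `chunk_bounds(frames_in(60), 1024)` yields three chunks, the last one covering frames 2,048 to 2,879.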

Multi-Agent World Modeling

All 10 players per match are recorded simultaneously with shared round and map identifiers, making it possible to model how one agent's actions causally affect another's observations.

Rendering Pipeline

The pipeline that produced CS2-10k is open-source at github.com/reka-ai/cs2-dem-renderer. Given a .dem file, it performs a two-pass parse to extract per-player spawn/death intervals and per-frame button inputs, then drives CS2's built-in demo replay system via a lightweight server plugin to render clean first-person video for each player round. Frames are streamed in real time from CS2's movie output to ffmpeg (VAAPI HEVC), producing .mp4 clips alongside synchronized .parquet annotation files. A worker mode processes entire directories of demos with automatic deduplication, making it straightforward to run at the scale of CS2-10k.
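The encoding step can be illustrated with a small helper that builds an ffmpeg command line for reading raw frames on stdin and encoding HEVC through VAAPI. This is a sketch under assumptions, not the cs2-dem-renderer implementation: the render node path, the `rgb24` input pixel format, and the idea of piping frames over stdin are all assumed here, while the resolution and frame rate defaults follow the values stated for the dataset.

```python
from typing import List


def vaapi_hevc_command(
    output_path: str,
    width: int = 1280,
    height: int = 720,
    fps: int = 48,
    device: str = "/dev/dri/renderD128",  # assumed render node; varies by machine
    pixel_format: str = "rgb24",          # assumed layout of the incoming raw frames
) -> List[str]:
    """Build an ffmpeg argv that reads raw frames on stdin and encodes HEVC via VAAPI."""
    return [
        "ffmpeg", "-y",
        "-vaapi_device", device,           # open the VAAPI device for encoding
        "-f", "rawvideo",                  # input is headerless raw frames
        "-pixel_format", pixel_format,
        "-video_size", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", "-",                         # read frames from stdin
        "-vf", "format=nv12,hwupload",     # convert and upload frames to the GPU
        "-c:v", "hevc_vaapi",
        output_path,
    ]
```

The returned list could be passed to `subprocess.Popen(cmd, stdin=subprocess.PIPE)`, with each frame written to the process's stdin as `width * height * 3` bytes for `rgb24`.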

Author

Reka Labs Team