Jun 24, 2026

CS2-10k: A Large-Scale Egocentric Counter-Strike 2 Dataset

Training interactive world models requires data that is notoriously hard to find: ego-centric video sequences with densely aligned action signals (keyboard inputs, camera motion, and ego state) all synchronized to the visual stream.

Real-world embodied data is costly to collect, while synthetic data often lacks the visual richness or behavioral diversity needed for generalization. Counter-Strike 2 demos offer a compelling middle ground: because matches are recorded as deterministic replays, we can reconstruct clean first-person video at any point in a match, extracting the precise control inputs that drove each visual change. For these reasons, Counter-Strike is fast becoming a popular substrate for embodied AI and world-model research, with recent efforts such as EgoCS-400k reflecting a growing community interest in it as a rich source of egocentric training data.

Today we release CS2-10k, a large-scale egocentric gameplay dataset built from professional CS2 matches. It contains 600,000+ player-round videos spanning 10,000+ hours of first-person footage, paired with per-frame annotations covering keyboard state, mouse movement, and 3D player trajectory. Alongside this ready-to-use dataset, we are also releasing the ready-to-extend cs2-dem-renderer, the open-source pipeline used to produce it. All of this, so we can build better world models, together.

154+ hours | 512+ videos | 587 pro matches | 893+ rounds

Dataset Overview

CS2-10k is built from public professional match demos sourced from HLTV. For each demo, we render clean first-person video at 720p, 48fps using the demo replay tool inside CS2, producing one video per player per round. Alongside each video, we store a parquet file containing per-frame annotations synchronized to the video timeline.

Annotation Schema

Every video clip has its corresponding annotations stored in a .parquet file:

Field	Type	Description
`map`	string	Map name (e.g. "mirage", "dust2")
`round_number`	int	Round within the match
`team`	int	0 = Counter-Terrorist, 1 = Terrorist
`num_frames`	int	Total frames in the clip
`fps`	float	Video frame rate (48.0)
`total_time`	float	Clip duration in seconds
`fov`	float	Camera field of view (90.0°)
`frame_data`	list[dict]	Per-frame annotation array (see below)
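
As a quick sanity check on the clip-level fields, the sketch below validates one metadata record. It assumes the record has already been read into a plain Python dict by whatever parquet reader you prefer. The function name, the tolerance, and the two consistency rules (that `num_frames` equals the length of `frame_data`, and that `total_time` is roughly `num_frames / fps`) are our assumptions, not guarantees stated by the schema.

```python
from typing import Any

# Team ids as listed in the schema table.
TEAM_NAMES = {0: "Counter-Terrorist", 1: "Terrorist"}


def check_clip_metadata(meta: dict[str, Any], tol_frames: float = 1.0) -> list[str]:
    """Return a list of problems found; an empty list means the record looks sane."""
    problems = []
    if meta["team"] not in TEAM_NAMES:
        problems.append(f"unexpected team id: {meta['team']}")
    # Assumption: one frame_data entry per video frame.
    if meta["num_frames"] != len(meta["frame_data"]):
        problems.append("num_frames does not match len(frame_data)")
    # Assumption: clip duration is frame count divided by frame rate.
    expected_time = meta["num_frames"] / meta["fps"]
    if abs(expected_time - meta["total_time"]) > tol_frames / meta["fps"]:
        problems.append(
            f"total_time {meta['total_time']} differs from num_frames / fps = {expected_time}"
        )
    return problems
```

A record that fails these checks is not necessarily wrong; it may simply mean one of the assumptions above does not hold for the dataset.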

Per-Frame Annotations

Each entry in frame_data contains:

Field	Description
`actions`	Concatenated active keys: W/A/S/D (movement), J (jump), C (crouch), R (run), V (freefall), [ (fire), ] (scope/secondary), - (no input)
`mouse_x_delta`	Horizontal camera delta — proxy for mouse X movement
`mouse_y_delta`	Vertical camera delta — proxy for mouse Y movement
`position_x / y / z`	Player world position in game units
`rotation_yaw`	Camera yaw angle (−180° to 180°)
`rotation_pitch`	Camera pitch angle (−90° to 90°)
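
For model input, the `actions` string usually needs to become a fixed-size vector. The sketch below turns it into a multi-hot list over the ten key symbols from the table. It assumes the symbols are concatenated without separators (for example "WD[") and that "-" only marks the absence of input; the function and constant names are ours.

```python
# Key symbols in the order listed in the per-frame annotation table.
ACTION_KEYS = ["W", "A", "S", "D", "J", "C", "R", "V", "[", "]"]
KEY_INDEX = {key: i for i, key in enumerate(ACTION_KEYS)}
NO_INPUT = "-"


def encode_actions(actions: str) -> list[int]:
    """Map an actions string such as "WD[" to a multi-hot vector over ACTION_KEYS."""
    vec = [0] * len(ACTION_KEYS)
    for symbol in actions:
        if symbol == NO_INPUT:
            continue  # "-" means no key was active on this frame
        if symbol not in KEY_INDEX:
            raise ValueError(f"unknown action symbol: {symbol!r}")
        vec[KEY_INDEX[symbol]] = 1
    return vec
```

Raising on unknown symbols, rather than ignoring them, makes it obvious early if the real encoding differs from the assumption above.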

The combination of video and per-frame control signals creates a tight action-observation loop.
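
One practical detail when relating `rotation_yaw` to `mouse_x_delta`: yaw lives in the range −180° to 180°, so a naive frame-to-frame difference jumps by roughly 360° whenever the camera crosses the wrap point. The sketch below computes wrapped yaw differences that can then be compared against `mouse_x_delta`. Whether the two agree in sign or scale is not stated in the schema, so treat any comparison as exploratory; the function names are ours.

```python
def wrapped_delta(prev_deg: float, curr_deg: float) -> float:
    """Smallest signed difference curr - prev in degrees, in the range [-180, 180)."""
    return (curr_deg - prev_deg + 180.0) % 360.0 - 180.0


def yaw_deltas(frame_data: list[dict]) -> list[float]:
    """Per-step yaw change for consecutive frames (one fewer entry than frames)."""
    yaws = [frame["rotation_yaw"] for frame in frame_data]
    return [wrapped_delta(a, b) for a, b in zip(yaws, yaws[1:])]
```

For example, a camera turning from 179° to −179° yields a 2° step instead of −358°.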

No Abrupt Visual Changes

Each clip is a contiguous segment of a single round from a single player's perspective. There are no mid-round cuts, no editing transitions, and no HUD overlay. The camera moves through the world in a physically plausible way, and we hide the player's weapon to remove the sudden visual changes caused by recoil, reloads, and weapon switching.

Many Use Cases

CS2-10k is designed for training interactive world models that learn how first-person visual observations change in response to player actions. The same aligned video, control, and state signals also support a range of related research workflows:

Action-Conditioned Video Generation

Train models to generate the next N frames given the current frame and a keyboard+mouse action sequence. Dense per-frame controls make CS2-10k a natural fit for models like GameNGen, Genie, DIAMOND, and OASIS.
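
A minimal sketch of how training samples for this setup could be indexed from one clip's `frame_data`. It assumes the annotation stored at frame i describes the input associated with that frame, so the actions at frames t+1..t+N condition the N target frames; the dataset does not spell out this alignment, so verify it before training. The function name and sample layout are hypothetical.

```python
def make_samples(frame_data: list[dict], horizon: int, stride: int = 1) -> list[dict]:
    """Build (context frame, action sequence, target frames) index records."""
    samples = []
    last_start = len(frame_data) - horizon - 1  # last t with a full target window
    for t in range(0, last_start + 1, stride):
        target_idx = list(range(t + 1, t + horizon + 1))
        actions = [
            (
                frame_data[i]["actions"],
                frame_data[i]["mouse_x_delta"],
                frame_data[i]["mouse_y_delta"],
            )
            for i in target_idx
        ]
        samples.append(
            {"context_frame": t, "target_frames": target_idx, "actions": actions}
        )
    return samples
```

The records hold frame indices rather than pixels, so video decoding can stay lazy and happen inside the data loader.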

Egocentric Navigation

With 3D player positions and yaw/pitch per frame, the dataset supports learning navigation priors: what does moving forward look like in a confined corridor vs an open site? How does camera control correlate with positional change?
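
To study how camera control relates to positional change, it helps to express each step in the camera's own frame. The sketch below projects the horizontal displacement between two frames onto the forward and right directions implied by yaw. It assumes the per-frame keys are `position_x`, `position_y`, and `rotation_yaw`, and that yaw follows the usual Source-engine convention (0° faces +X, positive yaw turns toward +Y); neither is confirmed by the schema, so check the sign on a clip where the player only holds W.

```python
import math


def local_motion(prev: dict, curr: dict) -> tuple[float, float]:
    """Return (forward, right) displacement in game units between two frames."""
    dx = curr["position_x"] - prev["position_x"]
    dy = curr["position_y"] - prev["position_y"]
    yaw = math.radians(prev["rotation_yaw"])
    # Assumed convention: forward = (cos yaw, sin yaw), right = (sin yaw, -cos yaw).
    forward = dx * math.cos(yaw) + dy * math.sin(yaw)
    right = dx * math.sin(yaw) - dy * math.cos(yaw)
    return forward, right
```

Multiplying either component by `fps` gives an approximate speed in game units per second.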

Long-Horizon Planning

Full rounds of 60–90 seconds provide significantly longer temporal horizons than most embodied datasets. A model can learn tactical structure such as site entries, holds, rotations, and retakes, each as a coherent visual sequence.

Multi-Agent World Modeling

All 10 players per match are recorded simultaneously with shared round and map identifiers, making it possible to model how one agent's actions causally affect another's observations.
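
A minimal sketch of collecting the clips that belong to the same round so they can be modeled jointly. The schema provides `map` and `round_number`; the `match_id` key used here is hypothetical (for instance, derived from the source demo's file name), since round numbers repeat across matches. Whether clips from one round start at the same moment is also not stated, so time alignment across players needs to be checked separately.

```python
from collections import defaultdict


def group_by_round(clips: list[dict]) -> dict[tuple, list[dict]]:
    """Group clip metadata records by (match_id, map, round_number)."""
    groups = defaultdict(list)
    for clip in clips:
        # match_id is an assumed, user-supplied field; it is not in the schema.
        key = (clip["match_id"], clip["map"], clip["round_number"])
        groups[key].append(clip)
    return dict(groups)
```

Each resulting group should contain at most ten clips, one per player.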

Rendering Pipeline

If CS2-10k does not cover the scale, matches, or annotations you need, you can use our open-source pipeline at github.com/reka-ai/cs2-dem-renderer to render your own CS2 datasets. Given a .dem file, it performs a two-pass parse to extract per-player spawn/death intervals and per-frame button inputs, then drives CS2's built-in demo replay system to render first-person video for each player each round. Frames are streamed in real time from CS2's movie output to ffmpeg (VAAPI HEVC), producing .mp4 clips alongside synchronized .parquet annotation files. A worker mode processes entire directories of demos with automatic deduplication, making it straightforward to run at the scale of CS2-10k.
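
To illustrate the spawn/death interval step, here is a minimal sketch of turning a stream of events into per-player alive intervals for one round. This is not the cs2-dem-renderer code: the event tuple format, the function name, and the use of a round-end tick to close intervals for survivors are all assumptions made for illustration.

```python
def alive_intervals(
    events: list[tuple[int, str, str]], round_end_tick: int
) -> dict[str, list[tuple[int, int]]]:
    """Compute (start_tick, end_tick) intervals per player.

    events: (tick, player, kind) tuples sorted by tick, with kind "spawn" or "death".
    """
    open_spawn: dict[str, int] = {}
    intervals: dict[str, list[tuple[int, int]]] = {}
    for tick, player, kind in events:
        if kind == "spawn":
            open_spawn[player] = tick
        elif kind == "death" and player in open_spawn:
            start = open_spawn.pop(player)
            intervals.setdefault(player, []).append((start, tick))
    # Players still alive at the end of the round get an interval closed there.
    for player, start in open_spawn.items():
        intervals.setdefault(player, []).append((start, round_end_tick))
    return intervals
```

Each interval corresponds to one candidate clip: the renderer would record that player's view from the start tick to the end tick.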

Citation

If you use CS2-10k in your work, please cite:
