Jul 10, 2025
Reinforcement Learning for Reka Flash 3.1
We are excited to announce Reka Flash 3.1—an updated version of our 21 billion parameter reasoning model, Reka Flash 3. This new iteration significantly improves its capabilities, particularly in coding and as a base planner for finetuning on agentic tasks. It powers Reka Research, our agentic AI designed to navigate the web and private documents to answer complex questions.
The performance improvement comes from our new reinforcement learning algorithm and enhanced RL infrastructure, which allow us to train at a much larger scale. The chart below compares Reka Flash 3.1 to other closed and open-source models on code datasets.

Reinforcement Learning Algorithm
We use a variant of REINFORCE that incorporates dynamic sampling, token-level loss computation, gradient clipping informed by effective gradient norms, and improved handling of long samples, similar to DAPO. We also ensure that training is always on-policy by performing an update after every rollout, which we find leads to better performance and training stability. Lastly, we remove overlap between RL training examples and supervised fine-tuning examples so that the model can consistently roll out negative trajectories; prompts the model has already memorized during fine-tuning are almost always solved, leaving RL without a learning signal.
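To make the update rule concrete, here is a minimal sketch of a token-level REINFORCE step in PyTorch. The group-centered advantage, the dynamic-sampling filter, and all tensor shapes are our illustrative assumptions, not Reka's actual implementation:

```python
import torch

def reinforce_token_loss(logprobs, rewards, mask):
    """Token-level REINFORCE loss for one group of on-policy rollouts
    sampled from the same prompt. Illustrative sketch only; shapes and
    conventions are assumptions, not Reka's code.

    logprobs: (num_rollouts, seq_len) log-probs of the sampled tokens
    rewards:  (num_rollouts,) scalar verifiable reward per rollout
    mask:     (num_rollouts, seq_len) 1 for generated tokens, 0 for padding
    """
    # Dynamic sampling: a prompt whose rollouts all receive the same
    # reward (all correct or all wrong) carries no learning signal.
    if rewards.std() == 0:
        return None

    # Center rewards within the group so better-than-average rollouts
    # are reinforced and worse-than-average ones are suppressed.
    advantages = rewards - rewards.mean()

    # Token-level loss: every generated token shares its rollout's
    # advantage, and we normalize by the total token count so that
    # long samples are neither over- nor under-weighted.
    per_token = -(advantages[:, None] * logprobs) * mask
    return per_token.sum() / mask.sum()
```

A gradient-norm clip (for example, torch.nn.utils.clip_grad_norm_) would then run before each optimizer step; the clipping informed by effective gradient norms described above would replace a fixed threshold there.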
For the open-source version of Reka Flash 3.1, we use verifiable rewards from the math and code domains. Our math data comes from Numina-1.5. We filter out examples that lack valid answers, duplicates, examples that are too easy, and examples that are hard to verify with rule-based checks. We also convert multiple-choice problems into fill-in-the-blank questions to avoid reward hacking. Our code data comes from various sources. We focus on difficult coding problems and ensure each example has several test cases. We execute the code in a distributed fashion, where each rollout's code begins executing as soon as its trajectory completes.
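To give a sense of what a rule-based verifiable reward can look like, the sketch below scores a completion against a reference answer. The \boxed{...} answer convention and the numeric-tolerance fallback are assumptions for this example; the actual verifier is not described in detail here.

```python
import re

def math_reward(completion: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the final boxed answer in the
    completion matches the reference, else 0.0. The \\boxed{...}
    convention is our assumption for illustration."""
    # Take the last \boxed{...} in the completion as the final answer.
    matches = re.findall(r"\\boxed\{([^{}]+)\}", completion)
    if not matches:
        return 0.0
    answer = matches[-1].strip()

    # Exact string match first, then a numeric comparison as a fallback
    # so plain-number formatting differences ("7" vs "7.0") still pass.
    if answer == reference.strip():
        return 1.0
    try:
        return float(abs(float(answer) - float(reference)) < 1e-6)
    except ValueError:
        return 0.0
```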
The plot below shows performance improving on AIME 2024 and LiveCodeBench v5 (LCB-v5) as training progresses.

Quick Start
For ease of deployment, Reka Flash 3.1 is released in a Llama-compatible format, so you can run it with any Llama-compatible library. Download it on Hugging Face.
Hugging Face
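A minimal sketch using transformers follows. The checkpoint id RekaAI/reka-flash-3.1 and the generation settings are our assumptions; check the model card for the exact recommended usage.

```python
# Sketch only: the checkpoint id below is an assumption, not confirmed
# by this post; see the Hugging Face model card for exact usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RekaAI/reka-flash-3.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Write a function that checks whether a number is prime."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```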

vLLM
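And an equivalent vLLM sketch, under the same checkpoint-id assumption:

```python
# Sketch only: same assumed checkpoint id as above.
from vllm import LLM, SamplingParams

llm = LLM(model="RekaAI/reka-flash-3.1")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

outputs = llm.chat(
    [{"role": "user", "content": "Write a function that checks whether a number is prime."}],
    params,
)
print(outputs[0].outputs[0].text)
```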
