Sep 1, 2025
Research at Reka: Reasoning
This short post is the first in a series of posts that will discuss research topics that are important to us.
What are fundamental building blocks of intelligence? At Reka, we train large multimodal models but we also spend a lot of time thinking about what the missing components are at the foundational level and how they translate to our product offerings. In this post, we share our perspective on reasoning.
What is Reasoning
Reasoning became a popular topic with the releases of models such as OpenAI o1 and DeepSeek R1. As a result, many people associate it with models that generate thinking tokens (i.e., longer chains of thought) before providing answers. Leaderboards and benchmarks separate results from reasoning and non-reasoning models based on this behavior. We do not think this is the right characterization.
Rather than distinguishing models by how they produce outputs, we think of reasoning as an underlying capability that allows a model to solve complex goals by breaking them down into steps, verifying its own work, and backtracking along the way. We believe such a capability is necessary for our models not just to replicate human intelligence, but to discover new things that surpass our capabilities.
How to Get Models to Reason
There are two complementary approaches to achieve this goal.
Test Time Compute
The first way to get reasoning behavior is by orchestrating inference with test-time compute algorithms. Examples include search techniques such as MCTS, parallel thinking, using process reward models to guide decoding, or simply workflow orchestration: creating an execution graph and having another model verify the next node to explore. In this scenario, the underlying model is not explicitly trained to perform multi-step inference; instead, we enforce the step-by-step and self-verification behavior by supervising it at test time.
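As a concrete illustration of one of these techniques, here is a toy sketch of process-reward-guided decoding. The functions `propose_step` and `process_reward` are hypothetical stand-ins for model calls (the real versions would be an LLM and a learned process reward model); the search loop is the point.

```python
import random

# Toy sketch of process-reward-guided decoding: at each step, sample several
# candidate next steps and keep the one the process reward model scores highest.
# `propose_step` and `process_reward` are stand-ins for real model calls.
def propose_step(state, rng):
    # A real system would ask the LLM for a candidate next reasoning step.
    return state + [rng.randint(0, 9)]

def process_reward(state):
    # A real process reward model scores partial reasoning traces;
    # this toy scorer prefers partial sums near 21.
    return -abs(sum(state) - 21)

def guided_decode(n_steps=5, beam=4, seed=0):
    rng = random.Random(seed)
    state = []
    for _ in range(n_steps):
        candidates = [propose_step(state, rng) for _ in range(beam)]
        state = max(candidates, key=process_reward)
    return state
```

The same skeleton generalizes to MCTS or beam search by keeping more than one partial trace alive at a time.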
The last of these, workflow orchestration, is perhaps the most commonly used method for creating an agent that can perform complex tasks right now. An agentic system calls an LLM multiple times with specific instructions to decide what to do next (e.g., proceed to the next step, retry the current step). While this approach is sufficient for a simple workflow that only needs shallow reasoning, it is not as elegant or generalizable, since every workflow is manually designed with human supervision. None of our agentic offerings follow this template-based approach. We instead choose to build reasoning capabilities directly into the model (see below).
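The template-based pattern can be sketched as follows. Note that `call_llm` is a hypothetical stand-in for a model API, and the step list and retry policy are hand-designed, which is exactly what limits generality.

```python
# Minimal sketch of a template-based agentic workflow: a fixed list of steps,
# an LLM call per step, and a second "verifier" call deciding proceed vs retry.
def call_llm(instruction, context):
    # Stand-in: a real system would call a model here.
    return {"ok": True, "output": f"done: {instruction}"}

def run_workflow(task, steps, max_retries=2):
    context = {"task": task, "history": []}
    for step in steps:
        for _attempt in range(max_retries + 1):
            result = call_llm(step, context)
            # A separate verification call decides whether to accept the step.
            verdict = call_llm(f"verify: {step}", {**context, "result": result})
            if verdict["ok"]:
                context["history"].append(result["output"])
                break
        else:
            raise RuntimeError(f"step failed after retries: {step}")
    return context["history"]
```

Every branch of this loop is human-authored, so the agent cannot invent a plan outside the template.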
Training a Reasoning Model
A second variant is training a model to natively exhibit this behavior. Models trained this way can either explicitly output their thinking process (e.g., <reasoning> … </reasoning>) or think implicitly in their latent space. What is the best way to train such a model? The main question is whether we can get this capability by scaling up unsupervised pretraining and preference-based reinforcement learning, or whether we need a different approach.
Distillation
One surprisingly effective method is learning from reasoning examples. These examples can be created either manually, by asking humans to mimic the desired thinking process, or more commonly by generating synthetic data from other models that already exhibit this behavior (i.e., distillation). It is perhaps surprising how effective distillation is, not just for reasoning but for transferring other kinds of capabilities. Distillation can be done either by training on the teacher's outputs directly (more common, since it is easier to do in practice) or by training to mimic the logits of a teacher model.
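The two distillation objectives can be sketched in a few lines. This is a generic illustration of the standard losses, not Reka's training code: output-level distillation is just cross-entropy on the teacher's generated tokens, while logit matching minimizes a KL divergence between softened teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    # Softened distribution over the vocabulary for one position.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Logit matching: KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sequence_distillation_loss(student_logprob_of_teacher_tokens):
    """Output-level distillation: cross-entropy on the teacher's sampled tokens."""
    return -student_logprob_of_teacher_tokens
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.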
We found that with a strong base model it does not take a lot of examples to trigger this behavior, and it improves performance on a lot of tasks across the board. We also found that stronger coding models are better reasoners, which is one of the reasons that we have been focusing on pretraining and posttraining Reka Flash to become a better coding model.
Reinforcement Learning
The second approach is using reinforcement learning. Among other learning methods we have today, RL is uniquely suitable to tackle this problem since it only needs to rely on high-level feedback (reward) to discover complex sequential strategies. In RL, an agent interacts with an environment by taking actions to achieve its goal and receives a reward. The agent can iteratively learn through trial and error, either via sparse or dense rewards. This well-defined feedback loop allows the agent to quickly converge on effective problem-solving approaches.
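The feedback loop described above can be made concrete with the smallest possible RL problem, a two-armed bandit. The agent never sees the reward table directly; it converges on the better action purely through trial and error, which is the property that makes RL suitable here.

```python
import random

# Toy epsilon-greedy bandit: the agent observes only a scalar reward per action
# and learns action-value estimates by incremental averaging.
def train_bandit(rewards=None, steps=2000, eps=0.1, seed=0):
    rewards = rewards or {"a": 0.2, "b": 0.8}  # hidden per-action reward
    rng = random.Random(seed)
    values = {a: 0.0 for a in rewards}
    counts = {a: 0 for a in rewards}
    for _ in range(steps):
        # Mostly exploit the current best estimate, sometimes explore.
        if rng.random() < eps:
            action = rng.choice(list(rewards))
        else:
            action = max(values, key=values.get)
        reward = rewards[action]
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return max(values, key=values.get)
```

Training a reasoning LLM replaces the two arms with token-by-token generation and the reward table with a task verifier, but the loop is structurally the same.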
Improving reasoning via reinforcement learning requires creating hard tasks, defining rewards for those tasks, and setting up infrastructure to optimize the system end to end. Early work in the open-source community has focused on math reasoning models because rewards for that task are easy to verify (the answer is either correct or not, which can be checked directly with string matching after some regular-expression formatting) and it does not involve any interaction with external tools. We found performing reinforcement learning in this setup to be straightforward. The main challenge is coming up with harder and harder tasks that fit this setup and designing verifiable rewards for them. However, we also believe that the gain obtainable in a pure model-only setup without external tools is limited.
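A verifiable math reward of the kind described above can be as simple as the following sketch. The \boxed{...} answer convention and the normalization rules are illustrative assumptions, not a specific training format.

```python
import re

# Verifiable reward for math answers: extract the final boxed answer with a
# regular expression, normalize it, and string-match against the reference.
def extract_answer(completion):
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def normalize(ans):
    # Illustrative normalization: drop spaces and a trailing period.
    return ans.replace(" ", "").rstrip(".")

def math_reward(completion, reference):
    pred = extract_answer(completion)
    if pred is None:
        return 0.0  # no parseable answer
    return 1.0 if normalize(pred) == normalize(reference) else 0.0
```

No tool call is needed, which is exactly why this setup was the first to be explored at scale.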
As a result, our focus has been on establishing general end-to-end reinforcement learning infrastructure with tool use. This setup is harder to build, since leaving the training environment to call external tools can introduce significant latency. The simplest example is training a coding model, where the model needs access to a code interpreter to execute its output and receive a reward. The setup can be extended to many diverse tools, from web browsing to MCP calls that accomplish other tasks.
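The code-interpreter reward can be sketched as follows: run the model's program in a subprocess together with hidden tests, and grant reward if everything passes. A production setup would sandbox execution and batch tool calls to hide the latency mentioned above; this is illustration only.

```python
import os
import subprocess
import sys
import tempfile

# Reward for a coding task: execute the model's program plus hidden test
# assertions in a fresh interpreter; exit code 0 means the tests passed.
def code_reward(program: str, test_code: str, timeout: float = 5.0) -> float:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + test_code + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # treat hangs as failure
    finally:
        os.unlink(path)
```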
Reinforcement learning has another advantage: it is the most promising learning approach for allowing the model to explore during training. Regardless of whether compressing world knowledge (i.e., the web) into a neural network via unsupervised pretraining can lead to AI agents that generalize to complex tasks beyond human capabilities, results from other domains have shown that allowing agents to discover novel strategies via reinforcement learning has produced superhuman capabilities in narrow settings (e.g., AlphaGo, AlphaZero). We also think this is a necessary ingredient for self-improving systems via open-ended discovery.
Reasoning in Reka Products
The ability to reason is core to our product experience. Our goal is to push the boundaries of what is possible, leveraging our state-of-the-art research projects to offer a new experience that inspires our users.
In both of the products below, instead of going directly to reinforcement learning, we start with supervised examples (both synthetic and human-generated) to reduce the number of training steps needed in the reinforcement learning stage. Early successes of reasoning in other domains, such as AlphaGo, AlphaStar, and AlphaCode, also relied on supervised data from human replays before large-scale reinforcement learning.
Reka Research
Reka Research achieves market-leading performance at synthesizing information from multiple sources, browsing the web and private documents, thanks to its excellent reasoning capabilities. It can interact with a web browsing tool, a private document search tool, and a document analysis tool to produce an answer. Beyond merely integrating the tools, we taught Reka Research to use them effectively: formatting search queries, interpreting search results, analyzing documents, and integrating results into its thought process. The same training infrastructure we use to train Reka Research end-to-end can also be used by our customers to optimize it further according to their business goals, creating a research agent that is uniquely theirs.
Reka Research also has a test-time compute option using parallel thinking which uses a reward model to rank multiple generations at test time. It delivers consistent performance improvements by increasing the number of parallel generations without a significant increase in response time.
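Because the generations are independent, they can be issued concurrently, which is why wall-clock latency stays close to that of a single generation. A minimal sketch, with `generate_once` and `reward_model` as hypothetical stand-ins for the model and the reward model:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_once(prompt, seed):
    # Stand-in for one sampled generation from the model.
    return f"answer-{seed}"

def reward_model(prompt, candidate):
    # Stand-in for a learned reward model; toy scorer for illustration.
    return len(candidate)

def parallel_thinking(prompt, n=8):
    # Launch n generations concurrently, then keep the highest-scoring one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_once(prompt, s), range(n)))
    return max(candidates, key=lambda c: reward_model(prompt, c))
```

Increasing n raises quality (more candidates for the reward model to choose from) while the latency cost stays roughly one generation plus the ranking pass.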
Reka Vision
Reka Vision is a state-of-the-art visual understanding environment that is supported with tools such as our proprietary multimodal search engine, basic video editing, and triggers for live alerts. At its core is a multimodal reasoning model that has an excellent understanding of the physical world by combining the strengths of a multimodal language model and traditional computer vision methods for object detection and tracking. This model effectively uses available tools to intelligently break down videos into chunks, derive metadata for each part, and combine information from each part to perform a long-horizon task such as producing reels from a long video with a natural language prompt or triggering smart alerts for incident monitoring.
The Research Frontier
We believe that the ability to break down complex problems into steps and dynamically solve them is core to future artificial intelligence models. We think that research on reasoning in general-purpose models such as multimodal models is still in its early days.
Reinforcement learning, in particular, offers a path to learning beyond the limitations of existing data. It enables autonomous skill acquisition, allowing agents to discover novel strategies, adapt to new environments, and continuously improve through self-play or extensive interaction. This ability is crucial for tackling truly open-ended problems. On the other hand, test-time compute approaches such as parallel thinking allow scaling inference time to solve harder problems that require stronger reasoning skills.
Reasoning is also inextricably linked with safety. Ensuring an agent behaves predictably, especially when operating in complex environments with ill-defined rewards, is of paramount importance. We plan to continue sharing our research progress on reasoning via our products, open-source models and libraries, and technical reports.
