Aug 28, 2025

Introducing Research-Eval: A Benchmark for Search-Augmented LLMs


Search-augmented large language models (LLMs), such as Reka Research, are transforming how we access information and interact with AI. By retrieving fresh information from the web, these models extend beyond the static knowledge in their parameters, providing up-to-date answers with grounded citations across a wide variety of domains.

But despite the rapid progress, the field lacks robust benchmarks to evaluate search-augmented LLMs. SimpleQA, for example, is widely used (it is the primary benchmark reported by OpenAI, Mistral, and Perplexity for their search-augmented models). However, it was originally designed to measure the ability to answer factual questions without browsing, so it is dominated by one-hop encyclopedic questions grounded in Wikipedia, and it has already approached saturation when used to evaluate search-augmented models. On the other end of the spectrum, newer efforts like BrowseComp are significantly more challenging, but focus on highly artificial puzzle-like questions that require deep research and do not reflect common real-world use cases.

To address this gap, we are releasing Research-Eval: a high-quality benchmark designed specifically for evaluating search-augmented LLMs.

What is Research-Eval?

Research-Eval consists of 374 diverse, high-quality questions, each paired with a checklist of requirements to be used with an LLM judge to evaluate correctness. The benchmark is:

  • Diverse – Questions span a wide range of topics and require grounding across different types of websites (see examples below). This prevents overfitting to narrow sources like Wikipedia, and provides a more accurate measure of real-world performance.

  • Discriminative – Current frontier models achieve between 26.7% and 59.1% accuracy, making Research-Eval well-calibrated to distinguish between existing systems, while remaining challenging enough to drive future progress.

  • High-quality – Every example has been vetted, corrected, and filtered through a rigorous multi-step annotation process (see below).

The result is a benchmark that meaningfully separates model performance while reflecting realistic use cases.
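Concretely, grading an answer means checking it against every requirement on the question's checklist. Here is a minimal sketch of that loop; `judge_requirement` is a hypothetical stand-in for the LLM judge (a real judge would be a model prompted with the question, the answer, and one requirement), replaced here by a crude keyword heuristic purely for illustration:

```python
# Sketch of checklist-based grading with a pluggable judge.
# `judge_requirement` is a placeholder for an LLM call; the heuristic below
# is NOT the actual judging prompt used by Research-Eval.
def judge_requirement(answer: str, requirement: str) -> bool:
    # Crude stand-in: check whether the requirement's final token
    # (e.g. a fee, date, or platform name) appears in the answer.
    return requirement.split()[-1].strip(".").lower() in answer.lower()

def grade(answer: str, requirements: list[str]) -> bool:
    # An answer is correct only if every requirement on the checklist holds.
    return all(judge_requirement(answer, r) for r in requirements)

reqs = ["The answer must state the title is available on Roblox"]
print(grade('"Mission: Mars" is available on Roblox.', reqs))  # True
print(grade("It is available on Steam.", reqs))                # False
```

The all-or-nothing aggregation is what makes the checklist discriminative: an answer that gets one fee right but the total wrong is still scored as incorrect.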

What the Questions Look Like

Here are a few representative examples from Research-Eval:

Example 1: Regulatory lookup

I'm a farmer in the United Kingdom. Can you send me the fees I have to pay for these animal by-product services provided by APHA:

  1. Consideration of an application for approval of an incinerator plant, with a 60 minute site visit

  2. Administrative fee for production of invoices for additional visits and connected activities.

Can you send me the fee for each service and also the total fee I have to pay as well?

Requirements:

  1. The answer must state that the fee for the Consideration of an application for approval of an incinerator plant, with a 60 minute site visit costs £557

  2. The answer must state that the fee for the Administrative fee for production of invoices for additional visits and connected activities costs £21

  3. The answer must state that the total fee for both services costs £578

  4. The answer must not state the fee of either service as being any other price

  5. The answer must not state the total as any other price

Example 2: Business contact information

I was recommended two pipeline cleaning companies in the Houston area, Flowmore and TX Hydrojet. What's their contact email, phone number, and address?

Requirements:

  1. The answer should list Flowmore's email as sales@flowmore.com

  2. The answer should list Flowmore's phone number as 800-356-9667 or 281-351-7979.

  3. The answer should list Flowmore's office address as 11230 Timber Tech Ave. Tomball TX 77375

  4. The answer should list TX Hydrojet's email as info@txhydrojet.com

  5. The answer should list TX Hydrojet's phone number as (346) 258-7870.

  6. The answer should list TX Hydrojet's office address as 5338 PRUDENCE DR, HOUSTON, TX 77045

Example 3: Academic deadlines

Give me EMNLP full paper submission deadlines from 2019 to 2021

Requirements:

  1. The answer must state "May 17, 2021" (or equivalent format) for EMNLP 2021

  2. The answer must state "June 3, 2020" (or equivalent format) for EMNLP 2020

  3. The answer must state "May 21, 2019" (or equivalent format) for EMNLP 2019

  4. The answer should not claim that any other dates are the EMNLP full paper submission deadlines from 2019 to 2021

Example 4: Game platform availability

On what platform is Filament Games "Mission: Mars" available?

Requirement:

  1. The answer must state Filament Games' "Mission: Mars" is available on Roblox

How We Built It

We followed a careful multi-stage annotation pipeline to build Research-Eval, comprising the following steps:

  1. Authoring questions – Our in-house AI tutors wrote 606 candidate questions, each with a list of correctness requirements.

  2. Generating answers – For each question, we produced 29 answers using a variety of systems, both internal and third-party.

  3. Independent annotation – Two AI tutors independently judged the correctness of each answer against the requirements, marking problematic examples as invalid (e.g., ambiguous question or wrong requirements).

  4. Consensus & refinement – The two AI tutors resolved disagreements from the previous step, refining requirements or marking examples as invalid where needed.

  5. Re-annotation – We repeated steps 3 and 4 for all examples with changed requirements.

  6. Filtering – We removed examples that were:

    • Marked as invalid at the end of the previous process

    • Prone to high disagreement between AI tutors or top LLM judges

    • Noisy (where LLM judges were inconsistent across runs)

After all passes, we arrived at 374 high-quality examples, forming the final Research-Eval benchmark.
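The filtering step above can be sketched as a simple predicate. The data shapes are our assumption for illustration: each example carries an invalid flag from annotation, per-annotator verdicts, and per-run LLM-judge verdicts:

```python
# Sketch of the final filtering pass, under assumed data shapes.
# An example survives only if it was never marked invalid, the annotators
# agree, and the LLM judge is consistent across repeated runs.
def keep_example(invalid: bool,
                 annotator_verdicts: list[bool],
                 judge_runs: list[bool]) -> bool:
    if invalid:                           # marked invalid during annotation
        return False
    if len(set(annotator_verdicts)) > 1:  # AI tutors disagree
        return False
    if len(set(judge_runs)) > 1:          # judge noisy across runs
        return False
    return True

print(keep_example(False, [True, True], [True, True, True]))   # kept
print(keep_example(False, [True, False], [True, True, True]))  # dropped
```

In practice the disagreement thresholds may be softer than "any disagreement", but the intent is the same: only examples that grade stably end up in the final 374.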

Leaderboard

We evaluated several frontier search-augmented models on Research-Eval. Below are the results, averaged across 5 runs. Cost reflects USD per 1,000 queries on the Research-Eval distribution.
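The arithmetic behind the reported numbers is straightforward; the figures below are made up for illustration, not taken from the leaderboard:

```python
# Hypothetical figures only: accuracy is the mean over 5 independent runs,
# and cost is normalized to USD per 1,000 queries.
run_accuracies = [0.58, 0.60, 0.59, 0.61, 0.57]  # 5 runs, hypothetical
accuracy = sum(run_accuracies) / len(run_accuracies)

total_cost_usd = 7.48   # hypothetical spend over the full benchmark
num_queries = 374       # size of Research-Eval
cost_per_1k = total_cost_usd / num_queries * 1000

print(f"accuracy: {accuracy:.1%}")             # accuracy: 59.0%
print(f"cost per 1k queries: ${cost_per_1k:.2f}")  # $20.00
```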

Availability

The dataset and evaluation code are openly available at github.com/reka-ai/research-eval.

The evaluation harness supports most frontier models out of the box and includes scripts to easily reproduce the full leaderboard.

Final Remarks

Search-augmented LLMs are one of the most promising directions for building useful, reliable AI assistants. However, progress requires robust evaluation. With Research-Eval, we aim to provide the community with a diverse and discriminative benchmark that drives forward research in this critical area—continuing our commitment to openly share key artifacts of our research, such as Reka Flash 3.1, Reka Quant, and Reka Vibe-Eval.

Finally, we’re growing our team. If you’re passionate about building the next generation of AI systems, we are hiring across all roles—come join us in shaping the future of AI.