Reka Flash: An Efficient and Capable Multimodal Language Model

We introduce Reka Flash, our efficient, fast, and highly capable multimodal and multilingual language model.

Reka Flash is a state-of-the-art 21B model trained entirely from scratch and pushed to its absolute limits. It serves as the “turbo-class” offering in our lineup of models. Reka Flash rivals the performance of many significantly larger models, making it an excellent choice for fast workloads that require high quality. On a myriad of language and vision benchmarks, it is competitive with Gemini Pro and GPT-3.5.

We also present a compact variant, Reka Edge, that is significantly smaller (7B) and more efficient, making it suitable for resource-constrained scenarios (e.g., on-device or local deployment).

Meanwhile, our largest and most capable model, Reka Core, will be available to the public in the coming weeks.



We evaluate our base models on four key benchmarks: MMLU (knowledge-based question answering), GSM8K (reasoning and math), HumanEval (code generation), and GPQA (a challenging, Google-proof graduate-level question answering benchmark).

Reka Flash achieves very strong results on these benchmarks. It outperforms Gemini Pro on MMLU and GPQA and is competitive on GSM8K and HumanEval. Reka Flash also outperforms many larger models (e.g., Llama-2 70B, Grok-1, GPT-3.5) on these evaluations, often by a wide margin.

| Model | MMLU (knowledge QA), accuracy | GSM8K (reasoning), accuracy | HumanEval (code generation), pass@1 | GPQA (graduate-level QA), accuracy |
| --- | --- | --- | --- | --- |
| Reka Flash (base model v1.0) | 73.5 (5-shot direct) | 81.0 (8-shot CoT, maj@4) | 65.2 (0-shot) | 33.7 (5-shot) |
| Gemini Pro | 71.8 (5-shot direct) | 86.5 (CoT + self-consistency) | 67.7 (instruction-tuned) | 25.9 (5-shot) |
| GPT-3.5 | 70.0 (5-shot direct) | 57.1 (8-shot CoT) | 48.1 (0-shot) | 28.9 (5-shot) |
| Grok-1 | 73.0 (5-shot direct) | 62.9 (8-shot CoT) | 63.2 (0-shot) | - |
| Mixtral 45B (8×7) | 70.6 (5-shot direct) | 74.4 (8-shot CoT, maj@4) | 40.4 (0-shot) | 24.1 (5-shot) |
| Llama-2 70B | 68.9 (5-shot direct) | 56.8 (8-shot CoT) | 29.9 (0-shot) | 26.9 (5-shot) |
| GPT-4 | 86.4 (5-shot direct) | 92.0 (5-shot CoT) | 67.0 (0-shot) | 38.1 (5-shot) |
| Gemini Ultra | 83.7 (5-shot direct) | 94.4 (CoT + self-consistency) | 74.4 (instruction-tuned) | - |
Comparison of Reka Flash against leading models on base language model evaluations. Most of the numbers above are self-reported with the exception of Gemini Pro on GPQA (we use their API) and Mixtral (we use the publicly released model). GPT-4 and Gemini Ultra are grayed out because they are in a different compute class.
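The table reports two scoring conventions that are easy to confuse: maj@4 (majority vote over four sampled answers) and pass@1 (fraction of problems solved by a generated program). The Python sketch below illustrates both under stated assumptions. It is not the evaluation harness used for the numbers above: the function names are hypothetical, answers are assumed to be already extracted as strings, and `pass_at_k` uses the commonly cited unbiased estimator from the HumanEval paper, which may differ from how each reported pass@1 was obtained (for example, a single greedy sample).

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer among k samples (maj@k).

    Assumes `answers` is a non-empty list of already-extracted final answers.
    Ties are broken in favour of the answer that was sampled first.
    """
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def maj_at_k_accuracy(samples_per_problem, references):
    """Accuracy when each problem is scored on its majority-voted answer."""
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of programs sampled, c: number that pass the unit tests,
    k: budget being evaluated. Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to c / n, the fraction of sampled programs that pass.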

Multilingual Reasoning

Reka Flash is pretrained on text from over 32 languages (see below) and is therefore a strong multilingual model. We compare models on three diverse multilingual benchmarks encompassing multilingual commonsense reasoning, causal reasoning, and question answering. Our results show that Reka Flash outperforms both Llama-2 70B and Mixtral on all of these tasks.

| Model | XStoryCloze (commonsense reasoning), accuracy | XCOPA (causal reasoning), accuracy | BELEBELE (multilingual QA), accuracy |
| --- | --- | --- | --- |
| Reka Flash v1.0 | 68.4 (0-shot) | 66.2 (0-shot) | 49.5 (0-shot) |
| Llama-2 70B | 63.2 (0-shot) | 60.5 (0-shot) | 48.0 (0-shot) |
| Mixtral 45B (8×7) | 62.6 (0-shot) | 59.6 (0-shot) | 39.8 (0-shot) |

Languages pretrained on:

English, German, Chinese, Japanese, French, Korean, Spanish, Italian, Arabic, Hindi, Indonesian, Vietnamese, Thai, Czech, Dutch, Finnish, Bulgarian, Basque, Portuguese, Tamil, Persian, Greek, Russian, Turkish, Telugu, Burmese, Swahili, Urdu, Estonian, Malay, Swedish, Norwegian.

Vision and Video

We evaluate Reka Flash on a suite of multimodal benchmarks, including visual question answering (MMMU, VQA-v2), video captioning (VATEX), and video question answering (Perception Test). We find that Reka Flash is competitive with Gemini Pro on all four benchmarks.

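| Model | MMMU (image QA) | VQA-v2 (image QA) | VATEX (video captioning) | Perception Test (video QA) |
| --- | --- | --- | --- | --- |
| Reka Flash v1.0 | 51.3 | 77.7 | 58.1 | 51.6 |
| Gemini Pro | 47.9 | 71.2 | 57.4 | 51.1 |
| Adept Fuyu-Heavy | 48.3 | 72.5 | - | - |
| Flamingo | - | 67.6 | 56.0 | 43.6 |
| Gemini Ultra | 62.4 | 77.8 | 62.7 | 54.7 |
| GPT4-V | 56.8 | 77.2 | - | - |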

For multimodal inputs, models deployed at Reka Playground work best in English at the moment.

Reka Chat Models

The Reka Flash and Reka Edge base models are instruction tuned and then trained with RLHF using PPO, with Reka Flash as the reward model. We assess the resulting chat models through a series of human evaluations.

We consider two setups: 1) text-only chat and 2) multimodal chat. Each setup has its own set of baseline models for comparison. We conduct blind evaluations with human raters from a third-party data provider. We compute Elo scores following Askell et al., considering only pairwise comparisons in which annotators express a preference stronger than the weakest available option. We report the overall win rate of each model against the rest together with Elo scores.
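To make the scoring pipeline concrete, here is a minimal Python sketch of how win rates and Elo scores can be derived from filtered pairwise comparisons. It is an illustration under assumptions, not the procedure used for the leaderboards in this post: the signed preference scale, the `WEAKEST` threshold, the K-factor of 32, the starting rating of 1000, and all function names are hypothetical, and the sequential Elo update shown here is the textbook one rather than necessarily the fitting method of Askell et al.

```python
from collections import defaultdict

# Hypothetical scale: strength > 0 means the first model is preferred,
# strength < 0 means the second; |strength| == 1 is the weakest preference.
WEAKEST = 1


def filter_comparisons(comparisons, weakest=WEAKEST):
    """Keep only comparisons with a preference stronger than the weakest level."""
    return [c for c in comparisons if abs(c[2]) > weakest]


def win_rates(comparisons):
    """Overall win rate of each model over all its (already filtered) comparisons."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, strength in comparisons:
        winner = model_a if strength > 0 else model_b
        wins[winner] += 1
        games[model_a] += 1
        games[model_b] += 1
    return {model: wins[model] / games[model] for model in games}


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_ratings(comparisons, k=32.0, initial=1000.0):
    """Sequential Elo updates over (already filtered) comparisons, in order."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, strength in comparisons:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        score_a = 1.0 if strength > 0 else 0.0
        delta = k * (score_a - exp_a)
        ratings[model_a] += delta  # winner gains what the loser gives up
        ratings[model_b] -= delta
    return dict(ratings)
```

Under the standard Elo model, equal ratings give an expected score of 0.5 and a 400-point gap corresponds to 10:1 odds. Because sequential updates depend on the order of comparisons, a real leaderboard would typically shuffle and average over several passes or fit all ratings jointly.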

Text Chat

Our text human evaluation comprises over 1000 prompts designed to test model capabilities across diverse categories such as reasoning, coding, knowledge, input manipulation, and creative writing. We benchmark against leading models such as GPT-4, Claude 2.1, and Gemini Pro (API version). We also include Mistral 7B and Llama 2 7B chat for comparisons to Reka Edge.

| Model | Elo score | Win rate |
| --- | --- | --- |
| GPT-4 (1106-preview) | 1215 | 80.5% |
| GPT-4 (0613) | 1109 | 69.0% |
| Reka Flash v1.0 | 1045 | 57.4% |
| GPT-3.5 Turbo | 1017 | 54.8% |
| Claude 2.1 | 1001 | 51.3% |
| Mixtral | 961 | 46.4% |
| Gemini Pro (20240201) | 961 | 46.3% |
| Claude Instant 1.2 | 950 | 44.5% |
| Reka Edge v1.0 | 941 | 41.2% |
| Mistral 7B v0.2 | 911 | 36.5% |
| Llama-2 7B Chat | 850 | 28.7% |

Our human evaluation results indicate that Reka Flash ranks competitively on our internal Elo leaderboard, outperforming GPT-3.5 Turbo, Claude 2.1, Mixtral, and Gemini Pro. Reka Edge is ahead of the two other 7B models and comes close to Claude Instant 1.2.

While we generally do not recommend automatic evaluation based on proprietary models such as GPT-4, we check peak MT-Bench scores of Reka Flash and Reka Edge. Reka Flash obtains an MT-Bench score of 8.2, matching models such as Claude 1 and Claude 2. Reka Edge obtains an MT-Bench score of 7.6, which is competitive with the best models of similar size in the industry. We find that models overfitted to MT-Bench do not necessarily perform better on human evaluation. Therefore, we consider this benchmark inadequate and often misleading for model selection and measuring progress.

Multimodal Chat

Our multimodal chat evaluation measures the quality of a model’s response given an image and text prompt. We compare our models with multimodal language models such as GPT4-V, Gemini Pro, Llava 1.6, IDEFICS 80B, and Adept Fuyu 8B.

We design our evaluation prompts to be diverse and of moderate to hard difficulty, spanning categories such as food recognition, chart understanding, table understanding, humor understanding, and shape detection, among others.

| Model | Elo score | Win rate |
| --- | --- | --- |
| GPT4-V | 1461 | 82.3% |
| Reka Flash v1.0 | 1314 | 67.7% |
| Gemini Pro | 1309 | 62.5% |
| Llava 1.6 34B | 1308 | 65.4% |
| Reka Edge v1.0 | 1299 | 64.9% |
| Llava 1.6 Mistral-7B | 1154 | 41.2% |
| IDEFICS 80B | 1063 | 27.1% |
| Adept Fuyu 8B | 800 | 6.5% |

Our multimodal chat results show that Reka Flash outperforms all models except GPT4-V. Reka Edge also achieves a strong ranking, outperforming Llava 1.6 Mistral-7B and approaching the performance of Gemini Pro.

Reka Edge

Reka Edge is our compact 7B model designed for local deployments and latency-sensitive applications. We compare its performance on language benchmarks with other models of similar scale, i.e., Mistral 7B and Llama-2 7B. Our results show that Reka Edge outperforms both on standard language benchmarks.

| Model | MMLU (knowledge QA), accuracy | GSM8K (reasoning), accuracy | HumanEval (code generation), pass@1 | GPQA (graduate-level QA) | MMMU (image QA) | VQA-v2 (image QA) | VATEX (video captioning) | Perception Test (video QA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reka Edge v1.0 | 63.1 (5-shot direct) | 53.1 (8-shot CoT) | 41.5 (0-shot) | 31.3 (5-shot direct) | 42.8 | 71.9 | 47.9 | 47.4 |
| Mistral 7B | 62.0 (5-shot direct) | 50.1 (8-shot CoT) | 26.7 (0-shot) | 19.4 (5-shot CoT) | - | - | - | - |
| Llama-2 7B | 44.0 (5-shot direct) | 16.0 (8-shot CoT) | 11.6 (0-shot) | 10.9 (5-shot direct) | - | - | - | - |
| Gemini Nano | 55.8 (5-shot direct) | - | - | - | 32.6 | 67.5 | - | - |

Concluding Remarks

