Announcing our Multimodal AI Assistant

Introduction

We are excited to release the first version of our multimodal assistant Yasa-1, a language assistant with visual and auditory sensors that can take actions via code execution.

We trained Yasa-1 from scratch: we pretrained the base models from the ground up, aligned them, and heavily optimized both our training and serving infrastructure.

In addition to being multimodal, Yasa-1 comes with a myriad of exciting features including long-context document processing, fast natively optimized retrieval-augmented generation, multilingual support (20 languages), a search engine interface, and a code interpreter.

Yasa-1 is currently in private preview. It is available today both via our APIs and as Docker containers for an on-premise or virtual private cloud deployment (see the documentation here). At Reka, we are dedicated to deploying Yasa safely and responsibly. In the coming weeks, we will be expanding access to more enterprise and organization partners. Please reach out to us at contact@reka.ai for a demo of Yasa-1.

Capabilities and Features of Yasa-1

Multimodal Understanding

Search and Retrieval

Supporting fresh content with live search

Yasa-1 features a convenient search engine flag that connects it to the web. When activated, this flag gives the model access to various commercial search engines. This enables the model to use up-to-date information, free from any cutoff date limitations.
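
As a minimal sketch of how this might look through our API, assuming the search flag is exposed as a boolean parameter on the chat call (the parameter name use_search_engine below is illustrative, not confirmed):

import reka

reka.API_KEY = "your-api-key"

# Ask a question that depends on fresh information; with the search flag
# enabled, the model can consult live web results instead of relying
# solely on its training data. The flag name here is illustrative.
response = reka.chat(
    "What were the most notable AI announcements this week?",
    use_search_engine=True,
)
print(response)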

Retrieval augmented generation

Yasa can be taught to understand private datasets. Both our API and our on-premise deployment setup allow seamless integration of internal datasets of any modality.

We handle the construction of embedding services and vector databases, as well as the adaptation process to the private datasets, to allow users to focus on building amazing experiences.

As an end-to-end model provider, we’re able to train Yasa-1 to use information more accurately than with standard prompting techniques.

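The snippet below illustrates the flow through our API: upload a dataset, prepare it for retrieval, and then chat against it.
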
import reka

reka.API_KEY = "your-api-key"

reka.add_dataset(
    filepath="reka_candidate_resumes.zip",
    name="candidate_resumes",
)
reka.prepare_retrieval(dataset_name="candidate_resumes")

response = reka.chat(
    "Summarize and compare the professional experience of Alice and Bob",
    retrieval_dataset="candidate_resumes",
)

Long context model and retrieval for very long documents

Our long-context model currently supports 24K tokens by default. However, our research indicates significant headroom in natively optimizing retrieval to work with long-context models. We have verified that our setup works with documents as long as 100K tokens.
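
As an illustrative sketch, a document collection far longer than the default context can be handled with the same retrieval calls shown above; the file name and question here are hypothetical:

import reka

reka.API_KEY = "your-api-key"

# Index a corpus that exceeds the 24K-token default context window.
reka.add_dataset(
    filepath="movie_plots.zip",  # hypothetical long-document corpus
    name="movie_plots",
)
reka.prepare_retrieval(dataset_name="movie_plots")

# Retrieval surfaces only the relevant passages, so the effective document
# length can exceed the model's native context size.
response = reka.chat(
    "Where does the final confrontation in the plot of Movie X take place?",
    retrieval_dataset="movie_plots",
)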

To test this, we created a high-quality benchmark specifically for monitoring and tracking performance on realistic, long-context tasks. One of our internal datasets was constructed by collecting publicly available movie plots. We measured the speed and accuracy of Yasa-1 on questions about these plots. Our setup achieves comparable quality while being approximately 8x faster than using a state-of-the-art 100K-context model directly.

Model                             | Accuracy | Median Time Per Query
External model – A (no context)   | 26%      | 1.96s
External model – B (no context)   | 20%      | 4.78s
External model – B (100K context) | 83%      | 59.4s
Reka – Yasa-1                     | 82%      | 7.9s

Table 1: Comparing Yasa-1 long context against external models on an internal benchmark of reading movie plots.

Code Interpreter

Yasa is more than just a passive AI assistant; it has the capability to actively execute code. This feature is enabled via a simple flag. When active, Yasa automatically identifies the code block within its response, executes the code, and appends the result at the end of the block. 
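
A minimal sketch of how this could be invoked through our API, assuming the feature is exposed as a boolean parameter on the chat call (the name use_code_interpreter is illustrative, not confirmed):

import reka

reka.API_KEY = "your-api-key"

# With the (illustratively named) code-interpreter flag enabled, Yasa writes
# a code block, executes it, and appends the execution result to its reply.
response = reka.chat(
    "Read sales.csv, plot monthly revenue, and report the yearly total.",
    use_code_interpreter=True,
)
print(response)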

Below are examples of the feature’s ability to perform arithmetic operations, analyze spreadsheets, and create visualizations.

Arithmetic

File reading and graph plotting

Customization

For any of the use cases above, our model can be further customized to get the best performance. If you are interested in a customized Yasa, please reach out to us at contact@reka.ai.

Evaluation

We recognize the importance of deploying responsibly for a frontier technology such as a multimodal AI assistant. We designed a dynamic, multidimensional evaluation framework to rigorously benchmark our AI assistant across fine-grained categories along several dimensions, such as correctness, safety, and helpfulness.

Correctness evaluates the accuracy and factuality of an answer, penalizing any false or misleading information. Safety measures how appropriate an answer is for an AI assistant; outputs that could harm users or others, or that involve controversial or illegal content, are penalized. Helpfulness checks whether an answer helps the user achieve their goal, penalizing outputs that disregard user instructions. We use either human evaluation or automatic evaluation to obtain these scores.

These axes, among others, can be combined to produce an overall quality score. For example, our initial evaluation indicates that Yasa’s responses are comparable to or better than those of a publicly available multimodal AI assistant on 69% of image-related prompts. For text-based use cases, Yasa is comparable or better than a publicly available language-only AI assistant 65% of the time.
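
To make the aggregation concrete, here is a minimal sketch of how pairwise judgments could be turned into a comparable-or-better rate; the judgment data below are hypothetical and this is not our actual evaluation pipeline:

from collections import Counter

# Hypothetical pairwise judgments: for each prompt, a human or automatic
# judge labels Yasa's answer as a win, tie, or loss against the baseline.
judgments = ["win", "tie", "loss", "win", "tie", "win", "loss", "tie"]

counts = Counter(judgments)
total = sum(counts.values())

# "Comparable or better" counts both wins and ties.
comparable_or_better = (counts["win"] + counts["tie"]) / total
print(f"Comparable or better: {comparable_or_better:.0%}")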

We report some of the dimensions below to provide more insight. While Yasa-1 scores higher on helpfulness compared to Model A, it still fares worse on the correctness axis. On the safety aspect, Yasa-1 is also slightly worse. However, it is worth noting that both models have a large percentage of ties and are already relatively safe. We are committed to deploying responsibly and are working on improving these areas.

Limitations

While Yasa offers a range of capabilities, it is important to note that it may produce inaccurate outputs. For critical advice, it is essential not to rely solely on Yasa.

For multimodal tasks, Yasa excels at providing high-level descriptions of images, videos, or audio content. However, without further customization, its ability to discern intricate details in multimodal media is limited. For the current version, we recommend audio or video clips be no longer than one minute for the best experience.

Regarding search and retrieval, whilst we do provide citations, there is no guarantee that Yasa fetches the most relevant documents for a particular query. We do, however, provide customization options to enhance retrieval performance.

Currently, the code execution feature is exclusively available to on-premise deployments.

Expect Yasa’s capabilities to improve significantly in the next few months.

Closing Remarks

We are proud to have one of the best models in its compute class, but we are only getting started. Yasa is a generative agent with multimodal capabilities.

It is a first step towards our long-term mission to build a future where superintelligent AI is a force for good, working alongside humans to solve our major challenges.

We are hiring strong technical talent anywhere in the world. If you are excited about our mission and our work, please apply here.


Reka Team

Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eugenie Lamprecht, Hai Pham, Kaloyan Aleksiev, Lei Li, Matt Henderson, Max Bain, Mikel Artetxe, Qi Liu, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu