How the Kinyarwanda AI tokenization tool works
Saturday, March 29, 2025

An AI company based in Kigali, focused on building inclusive AI solutions, has built the first Kinyarwanda AI tokenizer, Alta Tokenizer, for AI models, applications, and large language models, to ease the development of Kinyarwanda-language solutions.


While AI has become a part of daily life, few people explore its inner workings—how it processes data and how it can be harnessed to develop tailored solutions.

The New Times spoke with Philbert Murwanashyaka, co-founder and head of the company, Yali Labs, who said that tokenizers act as a bridge between human language and AI models. This innovation simplifies the development of Kinyarwanda AI applications, natural language software, and large language models that truly capture the nuances of Kinyarwanda, Murwanashyaka said.


According to Murwanashyaka, while some large language models include Kinyarwanda, they are primarily trained using foreign tokenization tools like OpenAI’s Tiktoken, which do not fully account for the complexities of Kinyarwanda.

“Most of the existing global AI tools for large language models, even though they try, don’t understand what Kinyarwanda is about,” Murwanashyaka said.

“That is putting us in a stage where we are said to be the consumers, not the ones who are also contributing to the AI sector.”

How Alta Tokenizer works

AI applications do not inherently understand human languages like English or Kinyarwanda. Instead, they process text as numerical representations.

Tokenization is the process of converting text into tokens, or numerical values, which AI models use to analyze input and generate accurate outputs, Murwanashyaka explained.
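The idea can be sketched in a few lines of code. Alta Tokenizer's internals are not described in the article, so this is only a minimal word-level illustration of tokenization in general; a production tokenizer would use a learned, subword-aware vocabulary built from a large Kinyarwanda corpus.

```python
# Minimal sketch of tokenization: mapping text to numeric token IDs.
# The toy corpus and whitespace splitting are illustrative only.

corpus = "muraho neza murakoze cyane muraho cyane"  # toy training text

# Build a vocabulary: one integer ID per unique word.
vocab = {word: idx for idx, word in enumerate(sorted(set(corpus.split())))}

def encode(text):
    """Convert text into a list of token IDs."""
    return [vocab[w] for w in text.split()]

def decode(ids):
    """Convert token IDs back into text."""
    inverse = {idx: word for word, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)

tokens = encode("muraho cyane")
print(tokens)          # the numeric representation the model actually sees
print(decode(tokens))  # round-trips back to "muraho cyane"
```

A model never sees the words themselves, only these IDs, which is why a tokenizer built for the target language matters.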

“When we are training models, we first convert the dataset into tokens and train the model on tokens, and these tokens serve as a representation of words. This tool is essential for building robust Kinyarwanda AI models.”

Murwanashyaka pointed out that with Alta Tokenizer, users input Kinyarwanda words, and the system converts them into numerical tokens, effectively functioning as a Kinyarwanda dictionary for AI models and applications.
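The "dictionary" analogy also hints at why coverage matters. The sketch below, with made-up vocabularies rather than Alta Tokenizer's, shows a common behaviour of real tokenizers: a word missing from the vocabulary gets broken into many small fallback pieces, which makes it harder for a model to learn.

```python
# Sketch of why vocabulary coverage matters. A tokenizer whose
# "dictionary" lacks Kinyarwanda words must fall back to splitting
# them into many small pieces. Vocabularies here are illustrative.

kinyarwanda_vocab = {"murakoze": 0, "cyane": 1}

def tokenize(word, vocab):
    """Return the whole word if it is in the vocabulary, else per-character pieces."""
    if word in vocab:
        return [word]    # one token: the word is in the dictionary
    return list(word)    # fallback: one token per character

print(tokenize("murakoze", kinyarwanda_vocab))  # ['murakoze'] -- 1 token
print(tokenize("murakoze", {}))                 # 8 single-character tokens
```

Fewer, more meaningful tokens per word is exactly what a language-specific tokenizer buys.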

The goal is to equip Rwandan developers with the tools to build AI solutions specifically for Kinyarwanda, rather than merely adapting existing open-source models (those whose source code is publicly available), which lack deep Kinyarwanda datasets.

“We are trying to equip the developers with the right tools to start building AI solutions for Kinyarwanda, not trying to utilize the existing models like open-source models. We have our own culture, needs, and we need our solutions to be tailored for us,” he said.

Murwanashyaka explained that large language models (LLMs) rely on deep learning algorithms, allowing them to construct meaningful sentences much like humans do. The Alta Tokenizer would be key in developing Kinyarwanda LLMs, he said.

Instead of being explicitly programmed, he added, these models learn to generate text naturally by analyzing vast amounts of data.

According to Murwanashyaka, Yali Labs is currently focused on feeding data into Alta Tokenizer to build a robust Kinyarwanda dataset.

“We are collecting all the information about Kinyarwanda. When I say information, I mean like textbooks. We remove images and even foreign words, in a process called data cleaning, to make sure it is pure Kinyarwanda.”
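The data-cleaning step he describes can be sketched as below. Yali Labs' actual pipeline is not public, so the word list and regular expressions here are stand-ins; a real pipeline would use a far larger lexicon or a language-identification model.

```python
import re

# Illustrative sketch of data cleaning: strip non-text artifacts and
# filter out foreign words so the dataset is "pure Kinyarwanda".
# KINYARWANDA_WORDS is a tiny stand-in for a real lexicon.

KINYARWANDA_WORDS = {"muraho", "amakuru", "ni", "meza", "murakoze"}

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)            # drop leftover HTML/image tags
    words = re.findall(r"[a-zA-Z']+", text.lower())  # keep only alphabetic words
    return " ".join(w for w in words if w in KINYARWANDA_WORDS)

raw = "Muraho <img src='x.png'> hello amakuru ni meza!"
print(clean(raw))  # "muraho amakuru ni meza" -- markup and foreign words removed
```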

Murwanashyaka said he hopes Alta Tokenizer will inspire AI developers and other Kinyarwanda application developers to build Kinyarwanda solutions that address the needs of Rwandans while leveraging emerging technologies.

To promote accessibility, Murwanashyaka and his team have made Alta Tokenizer open source, allowing anyone to contribute to and benefit from its development. Global AI companies often lack sufficient Kinyarwanda data, he said, noting that this makes Alta Tokenizer a critical tool in training AI models to understand and generate text in the language accurately.