Text-to-speech – Samsung Newsroom Philippines

The Learning Curve, Part 3: Taking AI Data From Good to Great

Wed, 29 May 2024 12:12:24 +0000

Samsung is pioneering premium mobile AI experiences. To learn how Galaxy AI is maximizing the potential of its users, we are visiting Samsung Research centers around the world. Now supporting 16 languages, Galaxy AI is enabling more people to expand their language capabilities, even when offline, thanks to on-device translation in features such as Live Translate, Interpreter, Note Assist and Browsing Assist. We recently visited Jordan to learn the complexities of developing an AI model for Arabic, a language with many dialects. This time, we’re going to Vietnam to explore how data is prepared to train AI models.

What is the difference between a ghost, grave and mother in Vietnamese? For a language spoken by 97 million people worldwide, very little. The three words translate to “ma,” “mả” and “má,” respectively, and can only be distinguished by tone. This illustrates how difficult it can be for AI models to learn a language, considering they cannot perceive firsthand the context and emotions of conversations or the intentions of those speaking.

Samsung R&D Institute Vietnam (SRV) used carefully refined data to help its AI model properly recognize even the most subtle differences in language.

The quality of data used directly affects the accuracy of automatic speech recognition (ASR), neural machine translation (NMT) and text-to-speech (TTS) — processes that help Galaxy AI features such as Live Translate, Interpreter, Chat Assist and Browsing Assist break down language barriers.

A Typhoon of Challenges

“Vietnamese is a complex and diverse language with rich expressions, many of which are challenging to capture,” says Ngô Hồng Thái, NMT lead at SRV. Of the 16 languages that Galaxy AI supports, Vietnamese was particularly difficult to develop.

“Personally, creating an AI model for Vietnamese was more daunting than our typhoons!” he adds before explaining the hurdles faced during the development process.

Vietnamese is a tonal language with six distinct tones. As evident in the “ma” example above, small nuances in vocalization can drastically alter the meanings of words. Therefore, a meticulous and detailed approach was necessary.
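To make the “ma” example concrete, the short Python sketch below separates a written Vietnamese syllable into its toneless base and its tone by decomposing it with Unicode normalization. It only illustrates how the six tones are encoded in text and is not SRV's method; the function name `split_tone` and the ASCII tone labels are assumptions made for this example.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones.
# The sixth tone (level) is written with no tone mark at all.
# The ASCII labels are illustrative names chosen for this sketch.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}


def split_tone(syllable: str) -> tuple[str, str]:
    """Return (toneless syllable, tone label) for one written syllable."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "ngang"  # default: level tone, no mark
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            # Vowel-quality marks (circumflex, breve, horn) are kept.
            kept.append(ch)
    # NFC recomposes the remaining letters and marks.
    return unicodedata.normalize("NFC", "".join(kept)), tone


if __name__ == "__main__":
    for word in ["ma", "m\u1ea3", "m\u00e1"]:  # ma, mả, má
        print(word, split_tone(word))
```

All three words share the base “ma” and differ only in the tone label, which is exactly the distinction a speech model has to recover from audio alone.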

“When similar sounding words are broken down, one word consists of several short segments, or ‘frame sets’,” says Bui Ngoc Tung, ASR lead at SRV. “The AI model differentiates between the short audio frames of around 20 milliseconds to recognize what words correspond to a certain set of consecutive frames. As such, it is critical to put great effort into the early stages of the AI learning process.”
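The framing described in the quote can be sketched in a few lines of Python. This is a simplified illustration, not SRV's pipeline: the 16 kHz sample rate, the non-overlapping frames and the function name `split_into_frames` are all assumptions, and production ASR front ends commonly use overlapping windows instead.

```python
def split_into_frames(samples, sample_rate_hz=16_000, frame_ms=20):
    """Cut a sequence of audio samples into consecutive fixed-length frames.

    Assumes non-overlapping frames; a trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    last_start = len(samples) - frame_len
    return [
        samples[start:start + frame_len]
        for start in range(0, last_start + 1, frame_len)
    ]


if __name__ == "__main__":
    one_second = [0.0] * 16_000  # placeholder for one second of silence
    frames = split_into_frames(one_second)
    print(len(frames), len(frames[0]))
```

Under these assumptions, one second of audio yields 50 frames of 320 samples each, and the model's job is to map runs of such frames to words.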

Furthermore, homophones and homonyms are common in Vietnamese. People can normally rely on context and nonverbal elements in conversations to differentiate between words that sound the same or are written the same but have different meanings. However, AI models need to be taught to accurately identify and differentiate between tones and similar words.

“This isn’t a straightforward task,” Thái explains. “Apart from the amount, the data needs to be accurate to ensure it is capable of recognizing the linguistic nuances that exist in Vietnamese.”

Rigorous Preparation

The data refinement process consists of three steps. First, the audio and text used to train the AI model must be reviewed and corrected. Then, this dataset goes through random checks for overall quality. Finally, the dataset is normalized and cleaned before use in training.
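As a rough sketch of the last two steps, the Python below draws a random sample of records for a manual quality check and then normalizes and cleans the transcripts. The first step, human review and correction, is assumed to have happened already. The record layout (audio ID paired with transcript) and the function names are hypothetical, and the specific cleaning rules are illustrative choices rather than SRV's actual criteria.

```python
import random
import unicodedata


def sample_for_review(records, k, seed=0):
    """Step 2: pick up to k records at random for a manual spot check."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(records, min(k, len(records)))


def normalize_transcript(text):
    """Step 3a: use one Unicode form, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).lower()


def clean_dataset(records):
    """Step 3b: drop empty transcripts and repeated audio IDs."""
    seen_ids = set()
    cleaned = []
    for audio_id, transcript in records:
        normalized = normalize_transcript(transcript)
        if not normalized or audio_id in seen_ids:
            continue
        seen_ids.add(audio_id)
        cleaned.append((audio_id, normalized))
    return cleaned


if __name__ == "__main__":
    raw = [("a1", "  Xin   Ch\u00e0o "), ("a2", "   "), ("a1", "duplicate")]
    print(sample_for_review(raw, k=2))
    print(clean_dataset(raw))
```

Normalizing to a single Unicode form matters for Vietnamese in particular, because the same accented letter can be stored either as one code point or as a base letter plus combining marks.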

“We thoroughly performed a series of tests to check the accuracy of our dataset,” says Nguyen Manh Duy, TTS lead at SRV who oversees database creation. “We faced a number of unexpected problems including misspelled words in scripts and background noise or incorrect pronunciation during audio recordings. We spent significant time refining and improving our training data.”

In addition to the unique linguistic challenges in Vietnamese, there is a lack of universally accessible data compared to more widely spoken languages. “This is another reason why the data refinement stage is so important,” he adds. “Since we had limited sources, every piece of data had to be fully reliable. There was no margin for error.”

Moreover, the AI model for Vietnamese must consider both tonal and regional differences. To improve the AI model’s accuracy, the team collected vast amounts of data with Vietnam’s northern, central and southern accents — resulting in an enormous amount of information to refine and verify.

Continued Improvement

Developers at SRV completed the project after months of hard work, and Vietnamese became one of the first languages to be supported by Galaxy AI. Despite this success, the team is ceaselessly working to improve the Vietnamese Galaxy AI experience.

“We’re continuing to enhance the AI model by incorporating user feedback about the relevance of words and phrases in Galaxy AI,” says Tran Tuan Minh, leader of the AI language development project at SRV. “We have just taken our first steps into a more open world — and we have so much more to explore together.”

In the next episode of The Learning Curve, we will head to China to dig into how AI models are trained and fine-tuned.

The Learning Curve, Part 2: How to Build an AI for Diverse Dialects

Wed, 29 May 2024 12:01:37 +0000

Galaxy AI now supports 16 languages, helping more people to lower language barriers with real-time and on-device translation. Samsung opened the door to a new era of mobile AI, so we are visiting Samsung Research centers all over the world to learn how Galaxy AI came to life and what it took to overcome the challenges of AI development. While part one of the series examines the task of determining what data is needed, this installment looks at the complex task of accounting for dialects.

Teaching a language to an AI model is a complex process, but what if it isn’t a singular language, but a collection of diverse dialects? That was the challenge faced by the team at Samsung R&D Institute Jordan (SRJO). To add Arabic as a language option for Galaxy AI features such as Live Translate, the team had to cater to the various Arabic dialects that span the Middle East and North Africa, with each varying in pronunciation, vocabulary and grammar.

Arabic is one of the top six most widely spoken languages around the world, used daily by more than 400 million people.¹ The language is categorized into two forms: Fus’ha (Modern Standard Arabic) and Ammiya (the dialects of Arabic). Fus’ha is typically used in public and official events, as well as in news broadcasts, while Ammiya is more commonly used for day-to-day conversations. Over 20 countries use Arabic, and there are currently around 30 dialects in the region.

Unwritten Rules

Recognizing the variation presented by these dialects, the team at SRJO employed a range of techniques to discern and process the unique linguistic features inherent in each. This approach was crucial in ensuring that Galaxy AI could understand and respond in a way that accurately reflects the regional nuances.

“Unlike other languages, the pronunciation of the object in Arabic varies depending on the subject and verb in the sentence,” says Mohammad Hamdan, project leader of the Arabic language development team. “Our goal is to develop a model that understands all these dialects and can answer in standard Arabic.”

TTS is the component of Galaxy AI’s Live Translate feature that vocally reproduces translated text, the final step after spoken words have been converted into written text and translated. This lets users interact with speakers of different languages. The TTS team faced a unique challenge caused by a quirk of written Arabic.

Arabic uses diacritics, which are guides for the pronunciation of words in some contexts, such as religious texts, poetry and books for language learners. Diacritics are widely understood by native speakers but absent in everyday writing. This makes it difficult for a machine to convert raw text into phonemes, the basic units of sound that are the building blocks of speech.

“There is a shortage of high-quality and reliable datasets that accurately represent how diacritics are correctly used,” explains Haweeleh. “We had to design a neural model that can predict and restore those missing diacritics with high accuracy.”

Neural models work similarly to human brains. To predict diacritics, a model needs to study lots of Arabic text, learn the language’s rules and understand how words are used in different contexts. For instance, the pronunciation of a word can vary greatly depending on the action or gender it describes. Extensive training from the team was the key to enhancing the Arabic TTS model’s accuracy.
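One common way to frame diacritic restoration as a learning problem is to take fully diacritized text, strip the marks, and train a model to map the stripped form back to the original. The Python sketch below shows only that data-preparation idea. It is an assumption about how such training pairs could be built, not a description of SRJO's model, and the function names are hypothetical.

```python
# Arabic short-vowel and related marks (fathatan through sukun),
# Unicode code points U+064B to U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Remove the diacritic marks, leaving the everyday written form."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def make_training_pair(diacritized: str) -> tuple[str, str]:
    """Return (model input, expected output) for a restoration model."""
    return strip_diacritics(diacritized), diacritized


if __name__ == "__main__":
    # "kataba" (he wrote), written with a fatha after each consonant.
    kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"
    plain, target = make_training_pair(kataba)
    print(len(plain), len(target))
```

The stripped form contains only the three consonants, so the same letters could stand for several differently pronounced words; the model has to use surrounding context to choose the right marks.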

Enhancing Understanding

The SRJO team also had to collect diverse audio recordings of the dialects from various sources, which had to be transcribed, focusing on unique sounds, words and phrases. “We assembled a team of native speakers in the dialects who were well-versed in the nuances and variations,” says Ayah Hasan, whose team was responsible for database creation. “They listened to the recordings and manually converted the spoken words into text.”

This work was crucial for enhancing the Automatic Speech Recognition (ASR) process so that Galaxy AI could handle the rich tapestry of Arabic dialects. ASR is pivotal in enabling Galaxy AI’s real-time understanding and response capabilities.

“Building an ASR system that supports multiple dialects in a single model is a complex undertaking,” says Mohammad Hamdan, ASR lead for the project. “It demands a thorough understanding of the language’s intricacies, careful data selection and advanced modeling techniques.”

The Culmination of Innovation

After months of planning, building and testing, the team was ready to release Arabic as a language option for Galaxy AI, enabling many more people to communicate across borders. This single team has made Galaxy AI services accessible to Arabic speakers, lowering the language and cultural barriers between them and people all over the world. In doing so, they have established new best practices that can be rolled out globally. This success is only the beginning: the team continues to refine their models and enhance the quality of Galaxy AI’s language capabilities.

In the next episode, we go to Vietnam to see how the team makes language data better. Plus, what does it take to train an effective AI model?

Arabic is just one of the languages and dialects newly supported by Galaxy AI and available for download from the Settings app. Galaxy AI’s language features such as Live Translate and Interpreter are available on Galaxy devices running Samsung’s One UI 6.1 update.²

¹ UNESCO, World Arabic Language Day 2023, https://www.unesco.org/en/world-arabic-language-day
² One UI 6.1 was first released on Galaxy S24 series devices, with a wider rollout to other Galaxy devices including S23 series, S23 FE, S22 series, S21 series, Z Fold5, Z Fold4, Z Fold3, Z Flip5, Z Flip4, Z Flip3, Tab S9 series and Tab S8 series.

The Learning Curve, Part 1: Why Teaching AI New Languages Begins with Data

Wed, 29 May 2024 09:28:12 +0000

As Samsung continues to pioneer premium mobile AI experiences, we visit Samsung Research centers around the world to learn how Galaxy AI is enabling more users to maximize their potential. Galaxy AI now supports 16 languages, so more people can expand their language capabilities, even when offline, thanks to on-device translation in features such as Live Translate, Interpreter, Note Assist and Browsing Assist. But what does AI language development involve? This series examines the challenges of working with mobile AI and how we overcame them. First up, we head to Indonesia to learn where one begins teaching AI to speak a new language.

The first step is establishing targets, according to the team at Samsung R&D Institute Indonesia (SRIN). “Great AI begins with good quality and relevant data. Each language demands a different way to process this, so we dive deep to understand the linguistic needs and the unique conditions of our country,” says Junaidillah Fadlil, head of AI at SRIN, whose team recently added Bahasa Indonesia (Indonesian language) support to Galaxy AI. “Local language development has to be led by insight and science, so every process for adding languages to Galaxy AI starts with us planning what information we need and can legally and ethically obtain.”

Galaxy AI features such as Live Translate perform three core processes: automatic speech recognition (ASR), neural machine translation (NMT) and text-to-speech (TTS). Each process needs a distinct set of information.
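The way the three processes chain together can be sketched as a simple function composition. The Python below is a conceptual outline only: `live_translate` and the three stand-in stages are hypothetical names invented for this example and do not reflect Galaxy AI's actual interfaces or models.

```python
from typing import Callable


def live_translate(
    audio: bytes,
    recognize: Callable[[bytes], str],   # ASR: speech to source-language text
    translate: Callable[[str], str],     # NMT: source text to target text
    synthesize: Callable[[str], bytes],  # TTS: target text to speech
) -> bytes:
    """Run the three stages in order, passing each output to the next."""
    source_text = recognize(audio)
    target_text = translate(source_text)
    return synthesize(target_text)


# Toy stand-ins so the sketch runs; real stages would be trained models.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_nmt(text: str) -> str:
    return {"selamat pagi": "good morning"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    print(live_translate(b"selamat pagi", fake_asr, fake_nmt, fake_tts))
```

Because each stage consumes the previous stage's output, an error in recognition carries through translation and synthesis, which is one reason each process needs its own carefully prepared data.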

ASR, for instance, needs extensive recordings of speech in numerous environments, each paired with an accurate text transcription. Varying background noise levels help account for different environments. “It’s not enough just to add noises to recordings,” explains Muchlisin Adi Saputra, the team’s ASR lead. “In addition to the language data we obtained from authorized third-party partners, we must go out into coffee shops or working environments to record our own voices. This allows us to authentically capture unique sounds from real life, like people calling out or the clattering of keyboards.”

The ever-changing nature of languages must also be considered. Saputra adds: “We need to keep up to date with the latest slang and how it is used, and mostly we find it on social media!”

Next, NMT requires translation training data. “Translating Bahasa Indonesia is challenging,” says Muhamad Faisal, the team’s NMT lead. “Its extensive use of contextual and implicit meanings relies on social and situational cues, so we need numerous translated texts that the AI could reference for new words, foreign words, proper nouns, and idioms – any information that helps AI understand the context and rules of communication.”

TTS then requires recordings that cover a range of voices and tones, with additional context on how parts of words sound in different circumstances. “Good voice recordings could do half the job and cover all the required phonemes (units of sound in speech) for the AI model,” adds Harits Abdurrohman, TTS lead. “If a voice actor did a great job in the earlier phase, the focus shifts to refining the AI model to clearly pronounce specific words.”

Stronger Together

It takes vast resources to plan for so much data, and SRIN worked closely with linguistics experts. “This challenge requires creativity, resourcefulness and expertise in both Bahasa Indonesia and machine learning,” Fadlil reflects. “Samsung’s philosophy of open collaboration played a big part in getting the job done, as did our scale of operations and history of AI development.”

Working with other Samsung Research centers around the world, the SRIN team was able to quickly adopt best practices and overcome the complexities of establishing data targets. Furthermore, collaboration was good for advancing not only technology but also culture. When the SRIN team joined their counterparts in Bangalore, India, they observed the local fasting customs, creating deeper connections and expanding their understanding of different cultures.

For the team, Galaxy AI’s language expansion project took on a new significance. “We are particularly proud of our achievements here as this was our first AI project, and it won’t be our last as we continue to refine our models and improve the quality of output,” Fadlil concludes. “This expansion not only reflects our values of openness but also respects and incorporates our cultural identities through language.”

In the next episode of The Learning Curve, we will head to Samsung R&D Institute Jordan to speak to the team who led Galaxy AI’s Arabic language project. Tune in to learn about the complexities of building and training an AI model for a language with diverse dialects.