Meta AI Introduces SeamlessM4T, a Multilingual and Multimodal Machine Translation Model

In today’s interconnected world, where the internet, mobile devices, and social media connect people globally, the ability to communicate across languages is more important than ever. Advances in artificial intelligence are bringing the long-imagined idea of a universal translator closer to reality. Meta AI’s SeamlessM4T is a multilingual and multitask model designed to bridge language barriers across both speech and text.

SeamlessM4T offers automatic speech recognition and translation for nearly 100 languages, covering both text and speech. It supports speech-to-text, text-to-text, text-to-speech, and speech-to-speech translation. Existing systems typically cover only a fraction of the world’s languages and often chain separate subsystems together; SeamlessM4T handles all of these tasks in a single model.

What sets SeamlessM4T apart is its improved performance on low- and mid-resource languages, where digital linguistic resources are scarce, while it maintains strong performance on high-resource languages like English, Spanish, and German. It also recognizes the source language implicitly, eliminating the need for a separate language identification model.

In a move towards open science, SeamlessM4T is released under the CC BY-NC 4.0 license, allowing researchers and developers to build upon it for non-commercial purposes. Meta is also releasing the metadata of SeamlessAlign, a dataset comprising 270,000 hours of mined speech and text alignments, making it easier for the community to conduct research in this domain.

SeamlessM4T is a culmination of years of research and development, drawing insights from projects like No Language Left Behind (NLLB), Universal Speech Translator, SpeechMatrix, and Massively Multilingual Speech. This model promises to bring us closer to the universal translator, enabling effective communication across languages and cultures, and opening up new possibilities for global collaboration and understanding.

SeamlessM4T supports:

  • Automatic speech recognition for nearly 100 languages
  • Speech-to-text translation for nearly 100 input and output languages
  • Speech-to-speech translation, supporting nearly 100 input languages and 35 (+ English) output languages
  • Text-to-text translation for nearly 100 languages
  • Text-to-speech translation, supporting nearly 100 input languages and 35 (+ English) output languages
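The five tasks above differ only in their input modality, their output modality, and whether they translate. As a rough illustration, they can be written down as a small lookup table. This is a minimal sketch, not part of any SeamlessM4T API: the `Task` class, the `tasks_for` helper, and the short task labels are names chosen here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str         # short label used only in this sketch
    source: str       # input modality: "speech" or "text"
    target: str       # output modality: "speech" or "text"
    translates: bool  # False for ASR, which transcribes within one language


# The five tasks listed in the article.
TASKS = [
    Task("ASR", "speech", "text", False),
    Task("S2TT", "speech", "text", True),
    Task("S2ST", "speech", "speech", True),
    Task("T2TT", "text", "text", True),
    Task("T2ST", "text", "speech", True),
]


def tasks_for(source: str, target: str) -> List[str]:
    """Return the labels of all tasks with the given input/output modalities."""
    return [t.name for t in TASKS if t.source == source and t.target == target]


print(tasks_for("speech", "text"))  # ['ASR', 'S2TT']
```

Laid out this way, it is easy to see that speech output is the only case limited to 35 (+ English) target languages, while every text-output task covers nearly 100.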

How SeamlessM4T Works: Approach and Architecture

Building SeamlessM4T as a unified multilingual and multimodal translation model relied on several key components and innovations:

  • Redesigned Sequence Modeling Toolkit: The foundation for this project is fairseq2, a redesigned version of the fairseq sequence modeling toolkit. It has been optimized for efficiency and is designed to work with other PyTorch ecosystem libraries, so it can be integrated easily into the broader AI development landscape.
  • Multitask UnitY Model Architecture: The model architecture employed is called UnitY, which is a multitask model capable of handling various translation and speech tasks. It can perform automatic speech recognition, text-to-text, text-to-speech, speech-to-text, and speech-to-speech translations. This unified architecture streamlines the translation process, eliminating the need for separate models for different tasks.
  • Speech and Text Encoders: The model incorporates two main encoders: a speech encoder and a text encoder. These encoders are responsible for recognizing speech input in nearly 100 languages and understanding text in nearly 100 languages, respectively.
  • Text Decoder: The text decoder takes encoded speech or text representations and produces text in the target language. It handles both same-language tasks, such as transcription, and translation across languages. Its training is guided by token-level knowledge distillation from NLLB (No Language Left Behind), a strong text-to-text translation model.
  • Text-to-Unit (T2U) Component: To generate speech representations, a text-to-unit (T2U) component is utilized. It converts text output into discrete acoustic units, and it is pre-trained on automatic speech recognition (ASR) data.
  • HiFi-GAN Unit Vocoder: After generating discrete speech units, a multilingual HiFi-GAN unit vocoder is used to convert these units into audio waveforms, enabling the synthesis of human-like speech.
  • Data Scaling and Mining: Training such a model requires a large amount of high-quality end-to-end data. To obtain it, the team first built SONAR, a massively multilingual text embedding space covering 200 languages, and then used a teacher-student approach to extend this embedding space to the speech modality for 36 languages. Mining over publicly available web data and speech repositories produced SeamlessAlign, a corpus of speech/speech and speech/text alignments that is one of the largest open datasets in terms of volume and language coverage.
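The mining step rests on one idea: once speech and text live in the same embedding space, items that mean the same thing sit close together and can be paired by similarity. The sketch below illustrates that idea only. It uses plain cosine similarity with a fixed threshold as a simplified stand-in; the article does not specify the actual scoring rule, and the vectors, identifiers, and the `mine_pairs` helper are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_pairs(speech_vecs, text_vecs, threshold=0.9):
    """Pair each speech item with its most similar text item.

    A pair is kept only if the similarity reaches the threshold.
    This is a simplified stand-in for the real mining criterion.
    """
    pairs = []
    for s_id, s_vec in speech_vecs.items():
        best_id, best_sim = max(
            ((t_id, cosine(s_vec, t_vec)) for t_id, t_vec in text_vecs.items()),
            key=lambda item: item[1],
        )
        if best_sim >= threshold:
            pairs.append((s_id, best_id))
    return pairs


# Toy 3-dimensional embeddings, invented for illustration.
speech_vecs = {
    "audio_1": [0.9, 0.1, 0.0],
    "audio_2": [0.0, 0.2, 0.9],
    "audio_3": [0.5, 0.5, 0.5],  # not close to either sentence
}
text_vecs = {
    "sent_a": [1.0, 0.0, 0.0],
    "sent_b": [0.0, 0.0, 1.0],
}

print(mine_pairs(speech_vecs, text_vecs))
# [('audio_1', 'sent_a'), ('audio_2', 'sent_b')]
```

In this toy run `audio_3` is left unpaired because its best match falls below the threshold, which is how a mining pipeline trades corpus size against alignment quality.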

This approach leverages a combination of advanced modeling techniques, data scaling strategies, and efficient encoders to build SeamlessM4T, a comprehensive multilingual and multimodal translation model capable of bridging language barriers and facilitating seamless communication across languages and modalities.
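To make the division of labor concrete, the sketch below traces which of the components described above would run for each task: one encoder chosen by input modality, the shared text decoder, and the T2U component plus vocoder only when speech output is needed. It is a schematic under the assumptions of this article's description, not the model's actual code; the `ROUTES` table and `stages_for` function are hypothetical names.

```python
# Input and output modality for each task named in the article.
ROUTES = {
    "automatic speech recognition": ("speech", "text"),
    "speech-to-text": ("speech", "text"),
    "speech-to-speech": ("speech", "speech"),
    "text-to-text": ("text", "text"),
    "text-to-speech": ("text", "speech"),
}


def stages_for(task):
    """List the components a task passes through, in order."""
    source, target = ROUTES[task]
    # Exactly one encoder runs, chosen by the input modality.
    stages = ["speech encoder" if source == "speech" else "text encoder"]
    # Every task produces text first via the shared text decoder.
    stages.append("text decoder")
    # Speech output adds unit generation and waveform synthesis.
    if target == "speech":
        stages += ["text-to-unit (T2U)", "HiFi-GAN unit vocoder"]
    return stages


for task in ROUTES:
    print(f"{task}: {' -> '.join(stages_for(task))}")
```

Every route shares the text decoder, which is why a single model can serve all five tasks instead of a cascade of separate systems.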

What SeamlessM4T Has Achieved: S2ST & S2T

[Figure: Translation quality of SeamlessM4T and state-of-the-art competitor models, including direct and cascaded systems, averaged over 81 FLEURS X-English language pairs.]

SeamlessM4T excels in multiple language-related tasks, delivering state-of-the-art results for nearly 100 languages. It offers support for various tasks, including automatic speech recognition, speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation, all within a single model. The model also notably enhances performance for languages with limited resources while maintaining strong performance for languages with abundant resources.

To provide a more accurate evaluation of the system’s performance, a text-less metric, BLASER 2.0, has been introduced. This metric allows for evaluation across both speech and text units and exhibits comparable accuracy to its predecessor. In robustness testing on speech-to-text tasks, SeamlessM4T handles background noise and speaker variation better than existing models, with average improvements of 37% and 48%, respectively.

Meta hopes this technology will help connect people across languages and is exploring how this foundational model can enable new communication capabilities, ultimately bringing us closer to a world where everyone can be understood.

This article was drafted with the assistance of AI, with reference to the sources below:

The work described in this article was supported by InnoHK initiative, The Government of the HKSAR, and Laboratory for AI-Powered Financial Technologies.
(AIFT strives but cannot guarantee the accuracy and reliability of the content, and will not be responsible for any loss or damage caused by any inaccuracy or omission.)

Copyright © 2024 Laboratory for AI-Powered Financial Technologies Ltd. All Rights Reserved.