
Meta AI creates a translator that works in dozens of languages

Building a Model for Universal Translation Using Crowdsourced Data: An Empirical Study of Massive Amounts of Speech, Text, and Video

The team at Meta built on its previous work on speech-to-speech translation2, as well as on a project called No Language Left Behind3, which aimed to provide text-to-text translation for some 200 languages. Research has shown that making translation systems multilingual can improve their performance even with limited training data, although why this happens remains unclear.

Many existing models work only for text, or use text as an intermediate step in speech-to-speech translation, and they focus on a small subset of the world's languages.

These challenges were taken up by the SEAMLESS Communication Team1, which set out to develop technologies that could make rapid universal translation a reality.

To train their AI model, the researchers relied on methods called self-supervised and semi-supervised learning. These approaches help a model to learn from huge amounts of raw data — such as text, speech and video — without requiring humans to annotate the data with specific labels or categories that provide context. Such labels might be accurate transcripts or translations, for example.
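To make the idea concrete, here is a minimal sketch of a self-supervised objective, assuming a wav2vec-style masked-prediction task on speech features: random frames are hidden and the model is trained to reconstruct them, so no transcripts or labels are needed. The architecture, dimensions and masking rate below are illustrative stand-ins, not the SEAMLESS team's actual setup.

```python
import torch
import torch.nn as nn

class MaskedSpeechModel(nn.Module):
    """Toy masked-prediction model: hide random frames, predict them back."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(hidden, feat_dim)
        self.mask_embed = nn.Parameter(torch.zeros(hidden))

    def forward(self, feats, mask):
        # feats: (batch, time, feat_dim) spectrogram frames; mask: (batch, time) bool
        x = self.proj_in(feats)
        x[mask] = self.mask_embed           # hide the masked frames from the encoder
        return self.proj_out(self.encoder(x))

model = MaskedSpeechModel()
feats = torch.randn(8, 100, 80)             # raw audio features, no transcripts needed
mask = torch.rand(8, 100) < 0.15            # mask ~15% of frames at random
pred = model(feats, mask)
loss = nn.functional.mse_loss(pred[mask], feats[mask])  # predict what was hidden
loss.backward()
print(f"self-supervised loss: {loss.item():.3f}")
```

Because the training signal comes from the data itself, this kind of objective scales to arbitrarily large unlabelled corpora.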

The part of the model that is responsible for translating speech was pre-trained on a massive data set containing 4.5 million hours’ worth of multilingual spoken audio. This kind of training helps the model to learn the patterns in data, making it easier to fine-tune the model for specific tasks without the need for large amounts of bespoke training data.
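The pay-off of such pre-training can be sketched as follows: the pretrained encoder is frozen and reused, and only a small task-specific head is trained on a modest labelled set. The encoder below is a randomly initialised stand-in, and the checkpoint path and label scheme are hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for an encoder pretrained on millions of hours of speech; here it
# is randomly initialised, and the checkpoint path below is hypothetical.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
# encoder.load_state_dict(torch.load("speech_encoder_pretrained.pt"))
for p in encoder.parameters():
    p.requires_grad = False                 # freeze the pretrained features

head = nn.Linear(256, 1000)                 # small task head, trained from scratch
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

feats = torch.randn(4, 100, 80)             # a *small* labelled fine-tuning batch
labels = torch.randint(0, 1000, (4, 100))   # e.g. per-frame target-token labels
logits = head(encoder(feats))
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), labels.reshape(-1))
loss.backward()
optimizer.step()
print(f"fine-tuning loss: {loss.item():.3f}")
```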

One of the SEAMLESS team's savviest strategies involved 'mining' the Internet for training pairs that align across languages, such as audio snippets in one language that match subtitles in another. Starting with some data that they knew to be reliable, the authors trained the model to recognize when a video clip and a subtitle actually correspond to each other. Applying this technique at scale, they collected 443,000 hours of audio with matching text and aligned approximately 30,000 hours of speech pairs, which they used to further train their model.
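In outline, this mining step amounts to embedding speech and text into a shared space and keeping cross-language pairs whose similarity clears a threshold. The sketch below assumes precomputed embeddings and an illustrative cut-off; the SEAMLESS pipeline uses purpose-built multilingual encoders and more refined matching criteria.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for embeddings produced by multilingual speech and text encoders.
speech_emb = F.normalize(torch.randn(1000, 512), dim=1)  # audio snippets, language A
text_emb = F.normalize(torch.randn(800, 512), dim=1)     # subtitles, language B

sim = speech_emb @ text_emb.T               # cosine similarity for every pair
best_score, best_text = sim.max(dim=1)      # closest subtitle per audio snippet

THRESHOLD = 0.8                             # illustrative; genuinely parallel pairs
mined = [(i, j.item())                      # score high, random ones almost never do
         for i, (s, j) in enumerate(zip(best_score, best_text))
         if s > THRESHOLD]
print(f"mined {len(mined)} aligned speech-text pairs")
```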

I believe the biggest virtue of this work is not the proposed idea or method, but its openness: the data and code are publicly available, although the model can be used only for non-commercial purposes. The authors say that their translation model is foundational, meaning it can be fine-tuned on specific data sets for particular purposes, such as improving translation quality for certain language pairs.

Meta is one of the biggest advocates of open-source language technology. Its research team was instrumental in developing PyTorch, a software library for training AI models, which is widely used by companies such as OpenAI and Tesla, as well as by many researchers around the world. Meta has also released the Llama family of large language models, which developers can use to build their own applications. This level of openness is a huge advantage for researchers who lack the massive computational resources needed to build these models from scratch.

Although speech technologies can be more cost-effective and efficient than humans, they are also prone to biases, and it's important to understand how they fail for some groups of people. Future work must ensure that speech-technology researchers ameliorate performance disparities and that users are well informed about the potential benefits and harms associated with these models.

Design-oriented thinking will also be needed to help users understand the translations offered by these models. As well as the toxicity warnings explored by the SEAMLESS authors, developers should consider how to display translations in ways that make a model's limitations clear: flagging, for example, when an output involves the model simply guessing a gender. This could involve forgoing an output entirely when its accuracy is questionable, or accompanying low-quality outputs with written caveats or visual cues. Users should also be able to decide whether they want to rely on speech technology in high-stakes settings, such as medical or legal contexts.
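A simple way to implement such gating is to threshold the model's own confidence score, as in this sketch; the present_translation function and both thresholds are hypothetical illustrations of the design idea, not part of the SEAMLESS system.

```python
from typing import Optional

def present_translation(text: str, confidence: float,
                        caveat_below: float = 0.5,
                        forgo_below: float = 0.2) -> Optional[str]:
    """Decide how (or whether) to show a translation to the user."""
    if confidence < forgo_below:
        return None                         # forgo the output entirely
    if confidence < caveat_below:
        return f"Low-confidence translation: {text}"  # visible written caveat
    return text

print(present_translation("Soy profesora", confidence=0.35))
# -> Low-confidence translation: Soy profesora
```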

The authors also looked for gender bias in the translations produced by their model. Their analysis examined whether the model over-represented one gender when translating gender-neutral phrases into gendered languages: does "I am a teacher" in English translate to the masculine "Soy profesor" or to the feminine "Soy profesora" in Spanish? Such analyses are currently limited to languages with masculine and feminine forms, however, and a broader range of linguistic biases should be studied in future work.
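A probe of this kind can be approximated by translating gender-neutral prompts repeatedly and counting which gendered form appears, as in this sketch; the word lists are illustrative, and the fixed output list stands in for real model translations.

```python
from collections import Counter

MASCULINE = {"profesor", "enfermero", "abogado"}   # illustrative word lists
FEMININE = {"profesora", "enfermera", "abogada"}

def gender_of(spanish: str) -> str:
    words = set(spanish.lower().split())
    if words & MASCULINE:
        return "masculine"
    if words & FEMININE:
        return "feminine"
    return "neutral"

# Stand-ins for model translations of gender-neutral prompts such as
# "I am a teacher"; in a real probe these would come from the model.
outputs = ["Soy profesor", "Soy profesora", "Soy enfermero", "Soy abogado"]
counts = Counter(gender_of(o) for o in outputs)
print(counts)   # an unbiased model would split roughly evenly between forms
```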

SEAMLESSM4T: an open-source translation model trained on speech from the Internet and United Nations archives

Meta, which is headquartered in California and runs social-media sites including Facebook and Instagram, says it is open-sourcing SEAMLESSM4T, following the success of its LLaMA large language models.

The team collected millions of hours of audio recordings of speech, along with human-generated translations of that speech, from the Internet and other sources, such as United Nations archives. The authors also collected transcripts of some of those speeches.