Is long-awaited worldwide communication without borders a distant dream or a reality?

In a world with more than 7,000 languages, roughly half of all people speak only one, which creates huge barriers to cross-cultural communication. Thanks to advances in Artificial Intelligence and Machine Learning, we are steadily moving towards a future where hosting a dinner party for 20 guests, each speaking a different language, won’t be a problem anymore. Indeed, scenarios where people talk to each other in different languages are no longer a distant dream, and many speech-to-speech translation apps and products are already available.

This highly innovative technology is advancing at a rapid pace thanks to the continued development of its three underlying components. Analyzing these individual technologies can be helpful in understanding how multilingual voice-enabled technologies work.

Automatic Speech Recognition (ASR), also known as Speech-to-Text (STT), transcribes human voice input into text. The quality of STT systems depends on audio quality, clarity of spoken text, specific vocabulary, etc. These systems must be trained on high-quality voice data.
Machine Translation (MT) automatically translates this text from the source language to the target language. MT systems can be categorized into general or custom MT. General MT systems are usually trained on large amounts of parallel data (i.e., paired original and translated sentences) from different sectors and domains, whereas custom MT systems are trained on sector- or customer-specific data (terminology, translation memories, glossaries, parallel and monolingual text). Custom MT systems typically deliver more accurate and consistent translations than general systems.
Text-to-speech (TTS) converts the translated text into audio.

There are two alternative approaches to developing speech-to-speech translation technologies:

Pipeline-based speech-to-speech translation leverages separately trained STT, MT and TTS systems and runs them sequentially as above.
End-to-end speech-to-speech translation leverages audio-to-text parallel data, i.e., voice signals paired with the corresponding translated text, to train a single system that directly converts voice input into translated text, which is then converted into audio by TTS. Such parallel data is often hard and expensive to obtain, so pipeline-based systems are often the more practical and realistic choice.
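The pipeline-based approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name SpeechTranslationPipeline and the three fake_* functions are hypothetical stand-ins, not any real product's API. In a real system each stage would wrap a separately trained STT, MT or TTS engine.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real STT, MT and TTS engines would sit behind them.
SpeechToText = Callable[[bytes], str]
Translate = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class SpeechTranslationPipeline:
    stt: SpeechToText
    mt: Translate
    tts: TextToSpeech

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)       # 1. speech -> source-language text
        translation = self.mt(transcript)  # 2. source text -> target-language text
        return self.tts(translation)       # 3. target text -> speech


# Toy stand-ins so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    # Pretend the audio bytes already are the spoken words.
    return audio.decode("utf-8")


def fake_mt(text: str) -> str:
    # Word-by-word lookup in a two-entry glossary; unknown words pass through.
    glossary = {"good": "guten", "morning": "Morgen"}
    return " ".join(glossary.get(word, word) for word in text.lower().split())


def fake_tts(text: str) -> bytes:
    # Pretend these bytes are synthesized audio.
    return text.encode("utf-8")


pipeline = SpeechTranslationPipeline(fake_stt, fake_mt, fake_tts)
result = pipeline.run(b"Good morning")
print(result)
```

Because the stages only meet at plain text, any one of them can be swapped out, for example replacing a general MT system with a custom one, without retraining the other two.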

Even though many solutions are available on the market, speech-to-speech translation is still at a relatively early stage. STT, neural MT and TTS each play a huge part in achieving seamless, high-quality translation. At the same time, each of these technologies presents its own difficulties, so there is still a long way to go before this emerging technology is perfected. For instance, spoken language often contains ungrammatical, colloquial expressions, and it does not include punctuation. This can lead to STT errors, which can in turn lead to major MT errors, to the point where the translated output becomes incomprehensible.
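A toy example may make this error propagation concrete. The glossary and the toy_translate function below are hypothetical and far simpler than any real MT system. They only show that a translation step has no way of knowing that the transcript it receives is wrong.

```python
# Toy English-to-German glossary standing in for an MT system (hypothetical).
GLOSSARY = {
    "i": "ich",
    "need": "brauche",
    "some": "etwas",
    "flour": "Mehl",
    "flower": "Blume",
}


def toy_translate(transcript: str) -> str:
    # Translate word by word; unknown words pass through unchanged.
    return " ".join(GLOSSARY.get(word, word) for word in transcript.lower().split())


correct_transcript = "I need some flour"
misheard_transcript = "I need some flower"  # plausible homophone error from STT

good = toy_translate(correct_transcript)
bad = toy_translate(misheard_transcript)

print(good)
print(bad)
```

The second output is a faithful translation of the wrong word. The mistake was made in the STT step, but it only becomes visible to the listener after MT and TTS have passed it along.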

In its article on “10 Emerging Technologies That Will Change Your World”, the MIT Enterprise Technology Review listed speech-to-speech translation as one of ten technologies that will revolutionize our world. This is no surprise, because being able to speak and have one’s words automatically translated into another person’s language has countless benefits. Speech-to-speech translation technology can enable people to communicate seamlessly in different languages, and it directly contributes to creating a world without language barriers, advancing global business and promoting cross-cultural exchange, among other things. However, only the future will show where this powerful technology will take us.
