The year 2021 marked the arrival of speech-to-speech translation in the commercial world.
Scientists are working on making the underlying technology smoother and more accurate, while engineers are integrating it into practical use cases. At the same time, there is an explosion in neural voices. Between July and September, three companies in this area got funded: Murf.ai, WellSaid Labs, and Lovo.ai. As a result, localization professionals can now choose from 20+ providers of pre-made voices, or even clone their own voice to generate personalized speech.
Speech Translation Full Steam Ahead
Lately, big tech providers and other important players have introduced major changes to their speech-to-speech translation technologies. Amazon AI, among others, showcased some of its work at the MT Summit conference in August; prosodic alignment featured prominently in the presentation given by Marcello Federico. Google introduced “Translatotron-2” in August, which outperforms the original model: iterations resulted in better translation quality and more natural predicted speech. The new model also retains the source speaker’s voice in the translated speech. In academia, the ELITR project, run by Ondřej Bojar of Charles University in Prague and funded by Horizon 2020, concludes in December.
Neural Voices into Products
In a live voiceover scenario, the source text can come from MT. In an “automated dubbing” scenario, an engineer can edit the text and the pronunciation of the neural voice to improve quality. As the technology comes to market, this type of offering has begun to crystallize into “AI voiceovers” at an average price of $20 per minute.
In September, Russian tech giant Yandex unveiled a free automatic English-Russian translator for YouTube (as well as other video streaming services). Through a “Translate” button added to Yandex Browser (Chrome compatible), it translates and dubs videos with the recognizable cheer of Alisa, Yandex’s counterpart to Amazon Alexa. Under the hood, the translator chains several neural networks to make the dubbed speech sound more natural.
At Custom.MT, we took the translator for a spin!
Overall, the feature left a favorable impression: we could follow TED talks and narrated videos with clear diction in Russian. Even though the translation was literal, it was clear enough to follow the speaker. In a more specialized video with hardcore terminology, the MT engine predictably stumbled. Here is what our team noticed:
- The YouTube translator works better for short videos. A 30-minute clip took 8 minutes to render.
- It understands the need for speed. Russian sentences may get longer than English. To compensate, the digital narrators change their delivery. They sometimes spray words at a comic rate to keep up with rapid-fire jabber.
- Prosody is in! The voice changed in pitch depending on the source pitch and tone. It felt close to a human voice on shorter stretches.
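
The speed compensation we observed can be pictured as a simple rate calculation. The sketch below is our own illustration, not Yandex’s implementation: the function name `speaking_rate`, the idea of comparing the dubbed duration at normal speed with the source duration, and the 1.5x cap are all assumptions made for the example.

```python
def speaking_rate(source_seconds: float, target_seconds: float,
                  max_rate: float = 1.5) -> float:
    """Return a hypothetical speed-up factor for the dubbed voice.

    source_seconds: how long the original speaker took to say the sentence.
    target_seconds: how long the translated sentence would take at normal speed.
    max_rate: assumed upper limit on the speed-up, so speech stays intelligible.
    """
    if source_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    # Ratio > 1 means the translation is longer than the source slot.
    ratio = target_seconds / source_seconds
    # Never slow down below normal speed, and never exceed the cap.
    return min(max(ratio, 1.0), max_rate)


if __name__ == "__main__":
    # A 5-second Russian rendering of a 4-second English sentence.
    print(speaking_rate(4.0, 5.0))
```

With a cap like this, a translation that runs far longer than the source cannot be fully compressed into the original slot, which may be one reason very dense passages end up sounding rushed.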
David Talbot, the head of NLP at Yandex, told us they plan to add more translation pairs and voices to the Yandex YouTube translator, and to make it faster in the near future.
Considering the billions of videos on YouTube that could potentially be translated, the processing power needed to run so many neural networks behind the service, and the fact that it is completely free, this is an ambitious project!