Today’s race for leadership in machine translation (MT) is between IT giants Google and Microsoft, a few small specialized companies like DeepL and ModernMT, and open-source projects. What does it take, and how does it feel, to expand the boundaries of science and business at the very top of the machine translation field? Arul Menezes, the founder of Microsoft Translator, provides a perspective.
Over the last 22 years, Arul Menezes has grown Translator from a small research project into one of Microsoft’s most successful AI services, translating more than 100 languages and dialects.
The Team Behind Microsoft Translator
Thank you for joining us today. And I’d like to maybe start with a little bit about how machine translation work is done at Microsoft. How is the Microsoft Translator team structured? What kind of people does it take to create one of the best MT brands in the world?
It’s a really diverse team in six different sites: Redmond, Edinburgh, Munich, Cairo, Hyderabad, and Beijing. We have three categories of specialists.
- The first is what I would call researchers and research engineers who have a lot of experience in deep learning and neural networks. They know how to train models, run parameter sweeps, and debug things if something goes wrong.
- The second category is infrastructure engineers. They maintain the very large-scale infrastructure for data and model training as well as the translation API that serves billions of requests globally.
- The third is data specialists. They lead the effort for testing and evaluation of models, as well as data collection, selection, cleaning, filtering, massaging, synthesizing, and trying to improve our domain coverage. A lot of translation capability so far has been focused on the web and news, but there are other domains, such as medical, legal, patents, etc.
The Hunt for Rare Language Data
Where does the data come from?
Our main data source is of course the Web. We have the Bing index of the whole web, where we find parallel and monolingual data.
How does one get highly specialized multilingual data from the web? Patents, medical, legal?
There are general medical sources on the web, like Mayo Clinic or the CDC. You have to find them either by classifying them at the website level, the URL level, or the page level.
When I look at the languages you’ve been adding over the last couple of years, they are all exotic languages for which it’s not so easy to find data. Specialized technical and patent corpora do not exist in languages like Assamese and Māori.
Indeed. So for many of our new languages, we work closely with governments. For example, we worked with the Federal Government of Canada to release the Inuktitut language, with the New Zealand government to release Māori, and with the Welsh Assembly to add the Welsh language.
Sometimes our partner may not necessarily be the government but someone else working in that country. In Kazakhstan, there were companies that were interested in us developing the Kazakh MT system. They shared some of the data that they had.
We recently had an interview with ModernMT, and they have a different source of data: translators working in their tool MateCAT, while engineers track corrections and cognitive effort. At Microsoft, are there any other sources besides the web that you could use to build a better training dataset?
Unlike ModernMT, we don’t have any access to user data, because of the Microsoft Azure compliance and privacy requirements.
When needed, we purchase a lot of data from different sources. Our initial focus is the domain data for the Top-15 to Top-30 commercially most important languages.
And which domains do you see as the most important?
Technical, because we have a lot of software localization customers, as well as eCommerce and conversational. Our big conversational use case would be Microsoft Teams.
Fighting Long-Tail Errors
How is data collection from the web different today from what it was three years ago? Is it about building parsers for specific websites, or are we talking about more granular sifting through the search results?
Today we aim at the quality as well as the quantity of data. Neural networks learn whatever you give them, so if you give them something that has a systematic misleading pattern, they will learn it too. For example, if you have a few thousand examples of translating dollars into euros, they will learn to often translate dollars into euros.
In technical terms, where previously the alignment between pages and sentences was more heuristic, today we heavily use embeddings (sentence embeddings, page embeddings, word embeddings) to match pages and sentences.
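As a rough illustration of embedding-based matching, the sketch below pairs each source sentence with the most similar target sentence by cosine similarity and drops pairs below a threshold. It is not Microsoft's pipeline: the function names, the 0.8 threshold, and the three-dimensional toy vectors are all assumptions made for the example, and a real system would get its vectors from a multilingual sentence encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """For each source vector, keep the best-scoring target at or above the threshold.

    Returns a list of (source_index, target_index, score) tuples.
    The threshold value is a hypothetical choice for this sketch.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = None, threshold
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((i, best_j, best_score))
    return pairs

# Toy embeddings standing in for encoder output (made-up numbers).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.5, 0.5, 0.5]]

for i, j, score in match_sentences(src, tgt):
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

With these toy vectors the first two source sentences find a partner and the third is left unmatched, which is the behaviour one would want for a sentence that has no translation on the other page.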
And in order not to exacerbate systematic error with embeddings, we have a whole series of internal heuristics and models applied at the document, page, and website level. For example, our detection would say something like ‘this entire page or web domain looks like it’s machine translated rather than human translated’, so we don’t want to learn from that.
It’s important that the data algorithms can run at the Web scale. Recently we’ve sped up the data processing and training pipeline, so the cycle from finding data to making a better engine is now much faster. What used to take a month, we had one of our really smart guys look at it, and he got it down to two days.
The other aspect we are quite focused on is detecting long-tail errors: like random hallucinatory patterns where MT suddenly says ‘ha ha ha ha’ or ‘0 0 0 0’ and things like that. These are quite hard to detect in test sets or average BLEU scores, or even human evaluations because they are not very common. Our approach is to run millions of sentences through the system and then apply detectors that are looking for specific types of problem patterns that we find. It may be less than 1% but it is still important because when a user sees a mistake that a human would never make, it breaks the trust in the system. People get really upset if you mistranslate the price in their e-commerce listing, for example.
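A minimal sketch of what such pattern detectors might look like, assuming two simple checks: one for a token repeated several times in a row (the ‘ha ha ha ha’ case) and one for digits that differ between source and translation (the mistranslated price case). The function names, the repeat threshold of four, and the sample sentences are hypothetical; the actual detectors are not described in the interview.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def has_repeated_token_run(text, min_repeats=4):
    """True if any whitespace-separated token occurs min_repeats times in a row."""
    tokens = text.lower().split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

def numbers_differ(source, translation):
    """True if the digit sequences in the two strings do not match.

    Separators are stripped so that '1,299.00' and '1.299,00' compare equal.
    This is a deliberately crude check: it would also flag '5.0' against '5'.
    """
    def digits(text):
        return sorted(re.sub(r"[.,]", "", m) for m in NUMBER.findall(text))
    return digits(source) != digits(translation)

def flag_outputs(pairs):
    """Run both detectors over (source, translation) pairs; return flagged items."""
    flagged = []
    for index, (source, translation) in enumerate(pairs):
        reasons = []
        if has_repeated_token_run(translation):
            reasons.append("repetition")
        if numbers_differ(source, translation):
            reasons.append("number_mismatch")
        if reasons:
            flagged.append((index, reasons))
    return flagged

# Made-up examples for illustration only.
samples = [
    ("How are you?", "ha ha ha ha"),
    ("Price: $1,299.00", "Preis: 129,90 $"),
    ("Good morning", "Guten Morgen"),
]
print(flag_outputs(samples))
```

Checks like these are cheap enough to run over millions of outputs, and the flagged fraction can then be reviewed by hand or tracked from one model version to the next.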
Monetization in MT
Let’s talk about the business side of MT. I ran a research project a year ago to size up the MT market and came up with a very small number, something like $270 million. Do you have a better insight into what generates business today?
The biggest difficulty in estimating the size of the MT market is anticipating and being able to figure out what the new opportunities are. Twenty years ago it was governments, because they spent big on translation, and it was easy to create value by replacing manual translation work.
Now the biggest opportunities are in the new use cases that people didn’t do in the past. There are tons and tons of new use cases coming up, and the volume of machine translation overall is going to keep growing to an almost infinite extent, particularly with user-generated content.
I don’t know how many new tweets are generated every day but if all of them get translated, that is a huge volume. The same thing is with Facebook, WeChat, and any other social media. All eCommerce listings and any kind of B2B marketplace have the need for MT.
Take Zoom or Teams, for example, with enabled translation. Suppose only 1% of the meetings are cross-lingual but even that 1% of a very large number of online meetings is huge. The business challenge is that much of this translation is provided as a free feature of a larger product, such as a web browser, a social media feed, an email app etc. Let’s say we enable free translation for conversations in Teams, that’s an internal cost that unlocks more usage of Teams, but it’s not a part of the dollar volume of the MT market.
The world is so much better connected thanks to you guys offering us MT for free. Thank you from the bottom of our hearts!
Yes, Google and Microsoft just gave translation in browsers for free and that has become a huge benefit to everyone. But it has not paid for my new car, figuratively speaking. Many of the really big use cases are not monetized now: every browser owner is consuming MT without being explicitly a part of the market. On the corporate front, Facebook in-housed their machine translation when they realized their volume was going to grow exponentially.
For Microsoft, which use cases generate the most business in machine translation?
In terms of paid products, in the Azure Cognitive Services family, we have our main Microsoft Translator text API, and our document Translator API that translates documents while preserving the rich formatting, and the Custom Translator product that allows customization of Microsoft Translator to the user domain and vocabulary.
We’ve released two new successful things in the last year: one is the containerized version of the translation and the other – a very exciting document translator product. Its key advantage is that for the first time it’s a turnkey solution for a non-technical user: you just drop in your PDF and you get a translated PDF without having to do any parsing of the document, OCR, or anything. We’ve had a lot of uptake on this product and we cover a very wide range of formats now.
And a new feature that’s coming next in that product is using the multi-sentence context. The idea is to get away from translating sentence by sentence and translate the whole document, to get consistency in pronoun translation or terminology translation.
I have a feeling that both you and Google launched this set of features in the same year, so was there a race for this feature?
Yes, we and Google tend to be neck and neck. For example, we announced our first neural machine translation system years ago at the Microsoft Ignite Conference. And it just so happened that Google released their Google neural translation paper on almost the exact same day, which is a bit of a coincidence. We also released NMT to our public APIs on almost the exact same day.
Expectations for 2022
Are there any game-changers this year, like video translation or anything else, where you see definite opportunity? For example, Yandex created the video translator for YouTube.
If video translation becomes a big thing, it may just be built into YouTube and Google may eat the cost of that. Microsoft doesn’t have the same video assets as Google, and we don’t yet have an entertainment channel like Amazon’s Twitch, so I haven’t focused that much on video. We have Microsoft Game Studios and we work with Xbox on live chat in video games.
I believe the game today is about user-generated content because its volume is always going to be much higher than that of the published content. For example, the travel market is great, we’d love to win this space.
Long-term, the biggest change will be human parity translation. If we ever get to a point where people are confident to just MT everything without post-editing, it will presumably be game-changing. Maybe it’s not like flipping the switch but an incremental thing with more and more cases of people publishing unedited content. But as long as post-editing is there, it really limits the efficiency of the whole end-to-end process: you may save from 20 to 40% of the revision time, but it’s not a hundred times faster or a thousand times cheaper, the way completely automated translation would be.
Unsolved Problems for the New Generation
And finally, where do you think young people from our new wave of MT specialists can apply themselves?
I think the opportunities lie with minority languages in relatively rich countries, where the government can afford to invest in them. Examples are Māori in New Zealand, or the Aboriginal languages in Australia, which has a special fund for them.
India is also an interesting challenge: there are 22 national languages with tens or even hundreds of millions of speakers. Yet you struggle to get even a couple of million sentences of parallel data in some of the languages.
In the technical domain, I think the opportunity is to find better ways to train MT systems in an unsupervised way, with monolingual data. There are languages without a defined orthography, so maybe you can translate straight from speech and bypass the written version of the language.
This has been demonstrated at a small scale in the research context. They have essentially parallel speech tracks, spoken language in one and spoken translation in the other. And you literally train audio signal to audio signal. And I think so far it’s in the toy stage, it’s amazing that it works at all! And it’s not clear what the next step is, because there isn’t much parallel spoken language data recorded. You’d have to actually go out there and create it.
Well, it is exciting, right? It means that MT is remaining an area where there’s a lot to do.
Indeed, I think so.
Authors: Victoria Burbelo and Konstantin Dranch