Trained MT in automotive: 83% needs no editing

Guest post by Jourik Ciesielski*

Case study

Jourik Ciesielski trained a Google AutoML engine for an automotive customer in the English to Spanish (Latin America) language combination. As a result of the training, only 17% of segments needed editing.

In this case study we discuss the impact of training on the output generated by an MT engine. The technology behind this project is Google Cloud AutoML, but note that other providers also support engine training (e.g. Microsoft, SYSTRAN, Yandex).

Project details

Language combination: English to Spanish (Latin America)

Domain: Automotive

Training data: 932K unique and approved translation units

Achieved BLEU score: 54.16

The data set was extended with 500 random sentences coming from owner manuals that are available online. Those sentences were processed with the trained engine after which they were evaluated against a “ready to publish” standard. For each translation we determined whether or not it could be published without any human intervention. Perfect translations were categorized as “No post-editing”, translations that required one correction as “Light post-editing” and translations that needed more than one correction as “Heavy post-editing”.
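The bucketing described above is simple to automate once each segment has been annotated with its correction count. A minimal sketch, assuming a hypothetical list `corrections_per_segment` holding the number of corrections each test sentence needed:

```python
# Minimal sketch of the post-editing bucketing described above.
# `corrections_per_segment` is a hypothetical list of per-sentence
# correction counts produced by the evaluators.
from collections import Counter

def bucket(corrections: int) -> str:
    """Map a correction count to a post-editing category."""
    if corrections == 0:
        return "No post-editing"
    if corrections == 1:
        return "Light post-editing"
    return "Heavy post-editing"

def summarize(corrections_per_segment: list[int]) -> dict[str, float]:
    """Return the share of segments in each category, as percentages."""
    counts = Counter(bucket(c) for c in corrections_per_segment)
    total = len(corrections_per_segment)
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# Example: 415 perfect and 85 lightly edited segments out of 500
print(summarize([0] * 415 + [1] * 85))
```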


83% of the translations were correct and didn’t need any post-editing effort. The remaining 17% required only one correction each; not a single translation needed heavy post-editing.

Even taking into account that both the domain (automotive) and the language combination (EN > LatAm ES) are very suitable for MT, the overall quality of the generated output must be labeled “excellent”. The trained engine produces accurate and consistent translations, while style and tone of voice correspond to the training data. The Latin American flavor is beautifully maintained as well.


Inconsistent translations of term candidates across multiple sentences (e.g. both “concesionario” and “distribuidor” for “dealer”) were the biggest problem. Those inconsistencies only occur when the term candidates appear either inconsistently or not frequently enough in the training data. The truth of the matter is that a lot of LSPs will face this problem; translation memories that are built up over several years may not always be very consistent.

Note that there’s actually only one term candidate (“dealer”) that the engine doesn’t handle well at all (both “concesionario” and “distribuidor”, 55 occurrences each). This term candidate alone pushes the error rate up considerably.
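Such splits are easy to surface automatically by counting target-side variants for a given source term. A minimal sketch (function and sample data are illustrative, not the actual tooling used in the project):

```python
# Sketch of a terminology consistency check: count how often each target
# variant appears for a given source term across segment pairs.
from collections import Counter

def variant_counts(pairs, source_term, variants):
    """Count target-side variants in segments whose source contains the term."""
    counts = Counter()
    for src, tgt in pairs:
        if source_term in src.lower():
            for v in variants:
                if v in tgt.lower():
                    counts[v] += 1
    return counts

pairs = [
    ("Contact your dealer.", "Póngase en contacto con su concesionario."),
    ("Visit an authorized dealer.", "Visite un distribuidor autorizado."),
    ("Your dealer can help.", "Su concesionario puede ayudarle."),
]
print(variant_counts(pairs, "dealer", ["concesionario", "distribuidor"]))
```

A lopsided split like 2 vs. 1 (or 55 vs. 55, as above) is a signal that the training data itself is inconsistent.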

Besides the inconsistencies, 4 translations had grammar issues while another 4 translations had semantic problems.


Since the inconsistency problem is caused by deficiencies in the training data, the solution is simple: retrain the engine until it generates the desired results. We estimate that the amount of correct translations will reach 98% if the engine is retrained appropriately.

Note that Google AutoML supports the use of glossaries (API only), which can help to avoid certain terminology errors:


    {
        "translations": [
            {
                "translatedText": "El volante calefactado se apaga cada vez que arranca el motor, incluso si lo encendió la última vez que condujo el vehículo.",
                "model": "projects/XXXXXXXXXXXX/locations/us-central1/models/TRLXXXXXXXXXXXXXXXXXXX"
            }
        ],
        "glossaryTranslations": [
            {
                "translatedText": "El volante térmico de dirección se apaga cada vez que arranca el motor, incluso si lo encendió la última vez que condujo el vehículo.",
                "model": "projects/XXXXXXXXXXXX/locations/us-central1/models/TRLXXXXXXXXXXXXXXXXXXX",
                "glossaryConfig": {
                    "glossary": "projects/XXXXXXXXXXXX/locations/us-central1/glossaries/Automotive_en_es"
                }
            }
        ]
    }

Nevertheless, the feature doesn’t take casing, gender, inflections or plurals into account (it only does one-to-one replacements, similar to the Custom Terminology feature in Amazon Translate), so one should be very careful with it. Glossaries are preferably used only for non-translatables and perhaps ambiguous terms. Training must take priority over glossaries.
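To see why one-to-one replacement is risky, here is a deliberately naive sketch of that behavior (the `apply_glossary` function is illustrative, not any provider’s actual implementation):

```python
# Naive one-to-one glossary substitution, form-blind and case-sensitive,
# illustrating the pitfalls described above.
import re

def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace whole-word glossary matches, ignoring gender and inflection."""
    for src, tgt in glossary.items():
        text = re.sub(rf"\b{re.escape(src)}\b", tgt, text)
    return text

glossary = {"airbag": "bolsa de aire"}
print(apply_glossary("El airbag se despliega.", glossary))
# -> "El bolsa de aire se despliega."  (wrong article: should be "La bolsa")
print(apply_glossary("Airbag delantero", glossary))
# -> unchanged, because "Airbag" is capitalized and the match is case-sensitive
```

Gender agreement and casing both break, which is exactly why glossaries work best for non-translatables.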

Unlike Google and Amazon, DeepL’s custom vocabulary feature does take casing, gender, inflections and plurals into account. For the time being it is available for a few language combinations only, so we don’t know how it will react to more exotic (Asian) or heavily inflected (Slavic) languages. If DeepL delivers, this might be a serious breakthrough for terminology in MT.


Despite the requirements that MT engine training entails (collecting and preparing data, evaluating, testing, etc.), it makes sense to train. A well-trained (and frequently retrained) engine is very suitable for raw MT projects and ensures that post-editing efforts are reduced to a minimum, which enables companies (both LSPs and enterprises) to improve their gross margins.


Jourik Ciesielski holds a Master in Translation as well as a Postgraduate Certificate in Specialized Translation from KU Leuven, Faculty of Arts in Antwerp (Belgium). In 2013 he started as an intern at Yamagata Europe in Ghent (Belgium) as part of his studies and then stayed with the company as full-time localization engineer. In addition to his responsibilities at Yamagata, he is a frequent speaker at the universities of Antwerp and Ghent.

He launched his own company, C-Jay International, in October 2020. With C-Jay he provides consulting services and technical support to enterprises as well as LSPs. Main fields of expertise are localization strategies, translation technology and machine translation.


After training the engine with 932K unique and approved translation units, we used it to translate a set of 500 random sentences from owner manuals available online, and categorized each translation according to the post-editing criteria described above.

We subsequently analyzed the different errors produced by the engine and categorized them as well. We ended up with three error categories: grammar, inconsistencies and semantics.

Machine Translation Trainer Tools


Machine translation trainers belong to a new professional category that is becoming more and more relevant in the localization industry. In 2021, MT specialist will be one of the jobs in the spotlight at most LSPs and buy-side localization programs.

Our team at Custom.MT collaborated with Effectiff to make a list of tools MT trainers use in their daily workflow.

The report below provides a comparison of the tools available on the market to get data, clean it up, and customize language AI models.

High-res infographic is available here.

TMX Editor Comparison


Translation memories are usually database-type files that contain previously translated texts, their formatting and other properties. Some of the properties are defined by default (e.g., source and target language, date, time, the user ID or CAT tool that performed the translation, etc.), while others can be added as custom attributes. Each CAT tool has its own way of storing translation memories, but language service providers must be able to exchange translation memories to carry out their activities.

Translation Memory eXchange (TMX) is an XML-based format designed for exchanging translation memories between different computer-aided translation and localization tools.
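Since TMX is plain XML, the standard library is enough to pull segment pairs out of it. A minimal sketch (real TMX files carry many more attributes such as dates, user IDs, and inline formatting):

```python
# Minimal sketch of reading (source, target) pairs from a TMX document.
import xml.etree.ElementTree as ET

# xml:lang is a namespaced attribute in parsed XML.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(tmx_text: str, src_lang: str, tgt_lang: str):
    """Yield (source, target) tuples from a TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):          # one <tu> per translation unit
        segs = {}
        for tuv in tu.iter("tuv"):      # one <tuv> per language variant
            lang = tuv.get(XML_LANG, "")
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.lower()] = seg.text
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

sample = """<tmx version="1.4"><header/><body>
  <tu>
    <tuv xml:lang="en"><seg>Fasten your seat belt.</seg></tuv>
    <tuv xml:lang="es"><seg>Abróchese el cinturón de seguridad.</seg></tuv>
  </tu>
</body></tmx>"""

print(list(read_tmx_pairs(sample, "en", "es")))
```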

In this report, we provide a comparison of TMX editors available on the market, useful for machine translation trainers cleaning up and preparing datasets from translation memories.

High-res infographic is available here.

English to Russian Medical Machine Translation improves 31% with training

Case study

In this case study, our client is a medium-sized translation company in Moscow that specializes in the medical field. They are a regional leader in medical documentation, with tons of clinical studies, pharma labels, and Covid-19 announcements. The agency has experimented with a stock Yandex engine and contracted Custom.MT to see how far the improvement can go with training. 

Language combination: English to Russian

Domain: Medical

Training dataset: 250k segments

Highest BLEU score attained: 43 (above average)

Gains over stock engine: +31% segments that need no editing

The project gripped us with challenges from the first week. Over 15 years in business, the client has accumulated 1.1 million parallel segments in the subject matter area. When we looked at the size of the dataset, it made our eyes water with excitement and anticipation: this is one killer TMX to contend with.

Our production team was also wary of the inflectional character of Russian, which means word endings change depending on gender, singular/plural, and case. A machine translation engine needs to take this into consideration; otherwise, editing all these suffixes can take as much time as translating from scratch. On the scale of difficulty, Russian is a harder nut to crack for MT than French and Italian, but thankfully not quite as difficult as Korean and Turkish.

Here is how we went about it.

First, for the proof of concept project, we split off the part of the monster TMX with the most relevant, high-quality translations, processing 0.35 million parallel sentences. From our previous medical project, we knew that combining too many areas of Life Sciences can be harmful to training. You don’t go to your allergist for a knee operation. Likewise, you don’t train Covid engines with optometry. After selecting a proper sample, we cleaned it up with our usual data pipeline operations and uploaded it into 4 different MT consoles. Once training had run its course, we measured BLEU scores.
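Typical cleanup operations on such a dataset include deduplication and dropping empty or badly misaligned pairs. A minimal sketch (the thresholds and the `clean` function are illustrative, not our actual pipeline settings):

```python
# Sketch of common pre-training cleanup filters for parallel data.
def clean(pairs, max_len_ratio=3.0, min_chars=3):
    """Deduplicate and drop empty, too-short, or length-mismatched pairs."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if len(src) < min_chars or len(tgt) < min_chars:
            continue  # drop empty or near-empty segments
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # drop likely misaligned pairs
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

pairs = [
    ("Start the engine.", "Arranque el motor."),
    ("Start the engine.", "Arranque el motor."),   # duplicate
    ("Yes.", "De acuerdo con todas las condiciones aplicables del manual"),  # mismatch
]
print(clean(pairs))
```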

BLEU scores:

    Trained MT Engine 1:  43.27
    Stock MT Engine 1:    34.34
    Trained MT Engine 2:  37.38
    Glossary MT Engine 2: 29.67
    Stock MT Engine 2:    31.72
    Stock MT Engine 3:    35.4
    MT Engine 4:          32.80
    Stock MT Engine 5:    24.19

Already at this stage, it was clear that E4 and E1 perform at a similar level, but training spices things up. E1 improved its score by 26% after we loaded it up with data, and training E2 gave the engine an 18% better performance. These were good initial results to be checked by an actual human evaluation. By contrast, E5 did not show improvement after training in this project, so we eliminated it from the next stage.
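The 26% and 18% gains quoted above follow directly from the trained and stock BLEU scores:

```python
# Relative BLEU gain of a trained engine over its stock counterpart,
# recomputed from the scores above.
def gain(trained: float, stock: float) -> int:
    """Percentage improvement, rounded to the nearest whole percent."""
    return round(100 * (trained - stock) / stock)

print(gain(43.27, 34.34))  # E1: 26
print(gain(37.38, 31.72))  # E2: 18
```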

Human Evaluation

E1, E2, E3, and E4 made it into the human evaluation phase in this project. For the evaluation, the translation company provided three specialist medical translators, who scored segments according to our methodology, marking them as either Perfect or Good if they were good enough to leave without edits.

MT did not beat human translation this time, but it came really close, with a 4-point difference. E1 was awarded 59 of these no-edit scores, much better than the 45 of the stock E3.


E2 scored poorly this time even with training and was rated last by every reviewer. E4 and E3 vied for second place after E1, and opinions were split.

Engine preference by reviewer

Since the translation company uses Memsource and Smartcat as its TMS/CAT tools, we were limited in the choice of technologies to what is already integrated into Smartcat. Memsource naturally has the largest pool of MT integrations among CAT tools, while Smartcat is only beginning to add engines with training capabilities.

E1 winning on BLEU scoring, human evaluation, and integrations simplified our final recommendation. We picked E1 and configured it in the client’s workflows. The first live projects are already running with trained engine support.

French to English Finance machine translation beats Human

Case study

In this case study, we look at a machine translation engine training project with a French LSP. We trained a set of MT engines and evaluated the performance both automatically and with a human eye with the client’s pool of linguists.

Language combination: French to English 

Domain: Financial documentation

Training dataset: 607k parallel segments (295k after cleanup)

BLEU scores attained: 44 and 43

There was a “Kasparov vs Deep Blue” moment when the machine won against specialist human translation 5 out of 5. In our test, five specialist translators ran a blind test on a group of engines, among which the human reference was hidden as just another MT output. Without knowing which was which, the translators scored 96 segments for each engine from 5 (Perfect) to 1 (Useless). By the number of segments that needed no editing, the linguists placed human translation only in 3rd position.
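The blind ranking itself is a simple tally over those 1-to-5 scores. A sketch of the aggregation, with made-up scores (the engine names and numbers below are purely illustrative):

```python
# Sketch of the blind evaluation tally: each system, human reference
# included, gets 96 segment scores from 5 (Perfect) to 1 (Useless);
# systems are ranked by the share of segments needing no edits.
def no_edit_share(scores: list[int]) -> float:
    """Percentage of segments rated 4 or 5 (publishable without edits)."""
    return round(100 * sum(s >= 4 for s in scores) / len(scores), 1)

blind_scores = {
    "engine_A": [5] * 60 + [4] * 20 + [3] * 16,          # 96 scores
    "human_reference": [5] * 50 + [4] * 22 + [3] * 24,   # 96 scores
}
ranking = sorted(blind_scores, key=lambda e: no_edit_share(blind_scores[e]),
                 reverse=True)
print(ranking)
```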

The results

The trained engine gained a lot on every metric compared to the very strong stock engine from DeepL that the client used before:

  • +30% in the human evaluation score
  • 42% less time needed to edit
  • 62.5% less effort (WER) to edit
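The WER effort metric in the list above is the word-level edit distance between the MT output and the post-edited reference, divided by the reference length. A self-contained sketch:

```python
# Word error rate (WER): word-level edit distance divided by reference length.
def wer(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(hyp)][len(ref)] / len(ref)

# One substitution in five words: WER = 0.2
print(wer("the engine were trained well", "the engine was trained well"))
```

A lower WER means less post-editing effort, which is why a 62.5% WER reduction translates so directly into time savings.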

The company is now implementing a new compensation scheme, while we’re analyzing mistakes still made by the machine to retrain it a couple of months down the road.

Error types

We estimate that the savings from upgrading to a trained engine will improve the LSP’s gross margins by more than 10% in 2021.

The machine does not always perform so well. In another evaluation, this time for English to Russian, humans won against every trained engine. Russian is a harder nut to crack for MT because it is an inflectional language. However, it still makes sense to train: the difference between stock and the best performing trained engine was huge, and the client still gained 60% better MT performance after training.