Jourik Ciesielski trained a Google AutoML engine for an automotive customer in the English to Spanish (Latin America) language combination. As a result of the training, only 17% of the segments needed editing after machine translation.
In this case study we discuss the impact of training on the output generated by an MT engine. The technology behind this project is Google Cloud AutoML, but note that other providers support engine training as well (e.g. Microsoft, SYSTRAN, Yandex).
Project details
Language combination: English to Spanish (Latin America)
Domain: Automotive
Training data: 932K unique and approved translation units
Achieved BLEU score: 54.16
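AutoML reports a BLEU score on a held-out portion of the training data once training completes. For readers who want to spot-check a score like this against a test set of their own, here is a minimal sketch in Python, assuming the open-source sacrebleu package (the file names are placeholders):

import sacrebleu

# One MT hypothesis and one reference translation per line, aligned by line number
with open("hypotheses.es", encoding="utf-8") as h, open("references.es", encoding="utf-8") as r:
    hyps = [line.strip() for line in h]
    refs = [line.strip() for line in r]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # corpus-level BLEU with one reference set
print(f"BLEU: {bleu.score:.2f}")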
The trained engine was then tested on 500 random sentences coming from owner’s manuals that are available online. Those sentences were processed with the trained engine, after which the output was evaluated against a “ready to publish” standard. For each translation we determined whether or not it could be published without any human intervention. Perfect translations were categorized as “No post-editing”, translations that required one correction as “Light post-editing”, and translations that needed more than one correction as “Heavy post-editing”.
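The bucketing logic itself is trivial; the real work lies in having an evaluator count the corrections per segment. A toy sketch of that categorization in Python (the evaluation data is invented purely for illustration):

from collections import Counter

def categorize(corrections: int) -> str:
    """Map an evaluator's correction count to a post-editing category."""
    if corrections == 0:
        return "No post-editing"
    if corrections == 1:
        return "Light post-editing"
    return "Heavy post-editing"

# (segment, corrections) pairs as logged by the evaluator -- invented examples
evaluations = [("Contacte a su concesionario.", 0), ("Revise la bolsa de aire.", 1)]
print(Counter(categorize(n) for _, n in evaluations))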
Results of the Training
83% of the translations were correct and didn’t need any post-editing effort. The remaining 17% required only one correction. There was not a single translation that needed heavy post-editing.
Even allowing for the fact that both the domain (automotive) and the language combination (EN > LatAm ES) are very well suited to MT, the overall quality of the generated output must be labeled “excellent”. The trained engine produces accurate and consistent translations, while the style and tone of voice correspond to the training data. The Latin American flavor is beautifully maintained as well.
Errors that Required Editing after Machine Translation
Inconsistent translations of term candidates across multiple sentences (e.g. both “concesionario” and “distribuidor” for “dealer”) were the biggest problem. Those inconsistencies only occur when the term candidates appear either inconsistently or not frequently enough in the training data. The truth of the matter is that a lot of LSPs will face this problem: translation memories that are built up over several years are not always consistent.
Note that there is actually only one term candidate (“airbag”) that the engine doesn’t handle well at all (rendering it as both “airbag” and “bolsa de aire”, each with 55 occurrences). This single term candidate pushes the error rate up considerably.
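Inconsistencies of this kind can be surfaced before training ever starts. A hypothetical sketch that counts competing renderings of a source term in a TM exported as (source, target) sentence pairs:

from collections import Counter

def term_variants(pairs, source_term, candidates):
    """Count which candidate target renderings co-occur with source_term."""
    counts = Counter()
    for src, tgt in pairs:
        if source_term in src.lower():
            for cand in candidates:
                if cand in tgt.lower():
                    counts[cand] += 1
    return counts

# Invented TM excerpt for illustration
tm = [
    ("Contact your dealer.", "Contacte a su concesionario."),
    ("Visit an authorized dealer.", "Visite un distribuidor autorizado."),
]
print(term_variants(tm, "dealer", ["concesionario", "distribuidor"]))
# Counter({'concesionario': 1, 'distribuidor': 1}) -> an inconsistent term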
Besides the inconsistencies, four translations had grammar issues, while another four had semantic problems.
Terminology Data for Training
Since the inconsistency problem is caused by deficiencies in the training data, the solution is straightforward: clean up the data and retrain the engine until it generates the desired results. We estimate that the share of correct translations will reach 98% if the engine is retrained appropriately.
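One pragmatic way to prepare such a retraining set is to harmonize dispreferred term variants before uploading the TM again. A deliberately naive sketch (the mapping is illustrative; plain string replacement ignores casing and inflection, so the output still needs human review):

# Dispreferred rendering -> approved rendering (illustrative values)
PREFERRED = {"distribuidor": "concesionario"}

def harmonize(target: str) -> str:
    """Rewrite dispreferred term variants in a target segment."""
    for bad, good in PREFERRED.items():
        target = target.replace(bad, good)  # naive: misses inflected forms
    return target

print(harmonize("Visite un distribuidor autorizado."))
# Visite un concesionario autorizado.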
Note that Google AutoML supports the use of glossaries (API only). They can help to avoid certain terminology errors, as this API response illustrates:
{
  "translations": [
    {
      "translatedText": "El volante calefactado se apaga cada vez que arranca el motor, incluso si lo encendió la última vez que condujo el vehículo.",
      "model": "projects/XXXXXXXXXXXX/locations/us-central1/models/TRLXXXXXXXXXXXXXXXXXXX"
    }
  ],
  "glossaryTranslations": [
    {
      "translatedText": "El volante térmico de dirección se apaga cada vez que arranca el motor, incluso si lo encendió la última vez que condujo el vehículo.",
      "model": "projects/XXXXXXXXXXXX/locations/us-central1/models/TRLXXXXXXXXXXXXXXXXXXX",
      "glossaryConfig": {
        "glossary": "projects/XXXXXXXXXXXX/locations/us-central1/glossaries/Automotive_en_es"
      }
    }
  ]
}
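For reference, a response of that shape can be requested as follows. This is a minimal sketch, assuming the google-cloud-translate v3 Python client; the project, model and glossary IDs are the same placeholders as above, and the English source sentence is our approximation:

from google.cloud import translate_v3 as translate

client = translate.TranslationServiceClient()
parent = "projects/XXXXXXXXXXXX/locations/us-central1"

response = client.translate_text(
    request={
        "parent": parent,
        # Approximate English source of the response shown above
        "contents": ["The heated steering wheel turns off every time you start the engine."],
        "mime_type": "text/plain",
        "source_language_code": "en",
        "target_language_code": "es",
        "model": f"{parent}/models/TRLXXXXXXXXXXXXXXXXXXX",
        "glossary_config": translate.TranslateTextGlossaryConfig(
            glossary=f"{parent}/glossaries/Automotive_en_es"
        ),
    }
)

print(response.translations[0].translated_text)           # raw model output
print(response.glossary_translations[0].translated_text)  # glossary-enforced output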
Nevertheless, the glossary feature doesn’t take casing, gender, inflections or plurals into account (it only does one-to-one replacements, similar to the Custom Terminology feature in Amazon Translate), so one should be very careful with it. Glossaries are preferably used only for non-translatables and perhaps ambiguous terms. Training must take priority over glossaries.
Unlike Google and Amazon, DeepL’s custom vocabulary feature does take casing, gender, inflections and plurals into account. For the time being it is available for a few language combinations only, so we don’t know how it will react to more exotic (Asian) or heavily inflected (Slavic) languages. If DeepL delivers, this might be a serious breakthrough for terminology in MT.
Conclusion
Despite the requirements that MT engine training entails (collecting and preparing data, evaluating, testing, etc.), it makes sense to train. A well-trained (and frequently retrained) engine is very suitable for raw MT projects and ensures that post-editing efforts are reduced to a minimum, which enables companies (both LSPs and enterprises) to improve their gross margins.
Guest post by Jourik Ciesielski*
*Biography:
Jourik Ciesielski holds a Master in Translation as well as a Postgraduate Certificate in Specialized Translation from KU Leuven, Faculty of Arts in Antwerp (Belgium). In 2013 he started as an intern at Yamagata Europe in Ghent (Belgium) as part of his studies and then stayed with the company as a full-time localization engineer. In addition to his responsibilities at Yamagata, he is a frequent speaker at the universities of Antwerp and Ghent.
He launched his own company, C-Jay International, in October 2020. With C-Jay he provides consulting services and technical support to enterprises as well as LSPs. Main fields of expertise are localization strategies, translation technology and machine translation.
Process:
After training the engine with 932K unique and approved translation units, we used it to translate a set of 500 random sentences coming from owner’s manuals that are available online. For each translation we determined whether or not it could be published without any human intervention, applying the post-editing categories described above (“No post-editing”, “Light post-editing”, “Heavy post-editing”).
We subsequently analyzed the different errors produced by the engine and categorized them as well. We ended up with three error categories: grammar, inconsistencies and semantics.