French to English Finance Machine Translation Beats Human


Case study

In this case study, we look at a machine translation engine training project with a French LSP. We trained a set of MT engines and evaluated their performance both with automatic metrics and through human evaluation by the client’s pool of linguists.

Language combination: French to English

Domain: Financial documentation

Training dataset: 607k parallel segments (295k after cleanup)

BLEU scores attained: 44 and 43

There was a “Kasparov vs Deep Blue” moment when the machine beat a specialist human translation 5 out of 5. In our test, five specialist translators blind-scored the output of a group of engines, with the human reference translation hidden among them as just another MT output. Without knowing which was which, the translators scored 96 segments per engine on a scale from 5 (Perfect) to 1 (Useless). Ranked by the number of segments that needed no editing, the human translation came only third.
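The ranking step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the system names and scores below are invented toy data, not the figures from this evaluation, and the function name `rank_by_perfect_segments` is hypothetical. It counts, per system, the segments scored 5 (no editing needed) and sorts the systems by that count.

```python
def rank_by_perfect_segments(scores):
    """Rank systems by the number of segments scored 5 (no editing needed).

    `scores` maps a system name to its list of 1-5 segment scores.
    Returns (name, perfect_count) pairs, best system first.
    """
    perfect = {
        name: sum(1 for s in segment_scores if s == 5)
        for name, segment_scores in scores.items()
    }
    return sorted(perfect.items(), key=lambda item: item[1], reverse=True)


# Toy data for illustration only (NOT the scores from the case study).
# The human reference is listed like any other system, as in a blind test.
toy_scores = {
    "engine_a": [5, 5, 4, 5, 3, 5],
    "engine_b": [5, 4, 5, 5, 2, 4],
    "human_reference": [5, 4, 4, 3, 5, 4],
    "engine_c": [3, 2, 4, 5, 3, 1],
}

ranking = rank_by_perfect_segments(toy_scores)
for position, (name, count) in enumerate(ranking, start=1):
    print(position, name, count)
```

In a real evaluation each system would have 96 scores per evaluator, and the scores from the five evaluators would need to be pooled or averaged first. How that was done here is not stated in the case study.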

The results

The trained engine improved on every metric compared to the very strong stock engine from DeepL that the client used before:

+30% in the human evaluation score
42% less time needed to edit
62.5% less effort (WER) needed to edit
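For readers unfamiliar with WER (word error rate) as a measure of editing effort, here is a minimal sketch. It assumes the common definition: the word-level edit distance between the raw MT output and its post-edited version, divided by the number of words in the post-edited version. The exact tooling and tokenization used in this project are not stated, so the function below and its example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    reference:  the post-edited (final) translation
    hypothesis: the raw machine translation output
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example sentences: the editor added one word to the MT output.
mt_output = "the company reported net profit of 5 million"
post_edited = "the company reported a net profit of 5 million"
wer = word_error_rate(post_edited, mt_output)
print(round(wer, 3))
```

A lower WER means the linguist changed fewer words. Under this reading, the 62.5% figure above would be the relative drop in WER from the stock engine to the trained one.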

The company is now implementing a new compensation scheme. Meanwhile, we’re analyzing mistakes still made by the machine to retrain it a couple of months down the road.

Error types

The error counts below compare four translation services (DeepL, Globaltese, Google, and ModernMT) across five categories of errors: spelling, grammar, incorrect spaces, partial translation, and punctuation.

Error type            DeepL   Globaltese   Google   ModernMT
Spelling                  6            9        7          6
Grammar                   6           10        5          6
Incorrect spaces          1            1        9          5
Partial translation       1            5        5          2
Punctuation               3            1        1          5

Globaltese had the most spelling and grammar errors, Google the most incorrect spaces, and ModernMT the most punctuation errors. Globaltese and Google tied for the most partial translations.

This comparison highlights the strengths and weaknesses of each translation service in handling different types of errors.
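To make the comparison easier to read, the counts reported above can be totalled per service and the weakest service identified per category. The sketch below only re-uses the numbers from this section. The variable names are our own, and summing across categories treats every error type as equally serious, which is an assumption rather than part of the original evaluation.

```python
# Error counts as reported in this case study.
ERROR_COUNTS = {
    "DeepL": {"spelling": 6, "grammar": 6, "incorrect spaces": 1,
              "partial translation": 1, "punctuation": 3},
    "Globaltese": {"spelling": 9, "grammar": 10, "incorrect spaces": 1,
                   "partial translation": 5, "punctuation": 1},
    "Google": {"spelling": 7, "grammar": 5, "incorrect spaces": 9,
               "partial translation": 5, "punctuation": 1},
    "ModernMT": {"spelling": 6, "grammar": 6, "incorrect spaces": 5,
                 "partial translation": 2, "punctuation": 5},
}

# Total errors per service (unweighted sum over the five categories).
totals = {service: sum(counts.values())
          for service, counts in ERROR_COUNTS.items()}

# For each category, the service(s) with the highest count; ties are kept.
categories = list(next(iter(ERROR_COUNTS.values())))
worst_by_category = {}
for category in categories:
    highest = max(counts[category] for counts in ERROR_COUNTS.values())
    worst_by_category[category] = [
        service for service, counts in ERROR_COUNTS.items()
        if counts[category] == highest
    ]

for service, total in sorted(totals.items(), key=lambda item: item[1]):
    print(service, total)
print(worst_by_category)
```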

We estimate that the savings from upgrading to a trained engine will improve the LSP’s gross margins by more than 10% in 2021.

The machine-versus-human translation debate does not always end this way. In another evaluation, this time for English to Russian, human translators beat every trained engine. Russian is a harder nut to crack for MT because it is a highly inflected language. Training still makes sense, though: the gap between the stock engine and the best-performing trained engine was large, and the client gained 60% in MT performance after training.

Konstantin Dranch

Language Industry Researcher | Founder Custom.MT learn something new every week, create transparency in specialized markets
