English to Russian Medical Machine Translation improves 31% with training

December 23, 2020

Case study

In this case study, our client is a medium-sized translation company in Moscow that specializes in the medical field. They are a regional leader in medical documentation, with tons of clinical studies, pharma labels, and Covid-19 announcements. The agency has experimented with a stock Yandex engine and contracted Custom.MT to see how far the improvement can go with training. 

Language combination: English to Russian

Domain: Medical

Training dataset: 250k segments

Highest BLEU score attained: 43 (above average)

Gains over stock engine: +31% segments that need no editing

The project gripped us with challenges from the first week. Over 15 years in business, the client has accumulated 1.1 million parallel segments in the subject matter area. When we looked at the size of the dataset, it made our eyes water with excitement and anticipation: this is one killer TMX to contend with.

Our production team was also wary of the inflectional character of Russian, which means word endings change depending on gender, singular/plural, and cases. Machine translation engine needs to take it into consideration, otherwise, editing all these suffixes can take as much time as translating from scratch. On the line of difficulty, Russian is a harder nut to crack for MT than French and Italian but thankfully not quite as difficult as Korean and Turkish.

Here is how we went about it.

First, for the proof of concept project, we splintered a part of the monster TMX with the most relevant and high-quality translations, processing 0.35 million parallel sentences. From our previous Medical project, we knew that combining too many areas of Life Sciences together can be harmful to training. You don’t go to your allergologist for a knee operation. Likewise, you don’t train Covid engines with optometry. After selecting a proper sample, we cleaned it up with our usual data pipeline operations and uploaded it into 4 different MT consoles. Once training had run its course, we measured BLEU scores.

Trained MT Engine 143.27
Stock MT Engine 134.34
Trained MT Engine 237.38
Glossary MT Engine 229.67
Stock MT Engine 231.72
Stock MT Engine 335.4
MT Engine 432.80
Stock MT  Engine 524.19

Already at this stage, it was clear that E4 and E1 perform on a similar level, but training spices things up. E1 improved the score by 26% after we loaded it up with data, training E2 gave the engine an 18% better performance. These were good initial results to be checked by an actual human evaluation. By contrast, E5 did not show improvement post-training in this project, so we eliminated it from the project’s next stage.

Human Evaluation

E1, E2, E3, and E4 made it into the human evaluation phase in this project. For the evaluation, the translation company provided three specialist medical translators who scored segments according to our methodology. They marked them with either Perfect or Good if the segments were good enough to leave them without edits.

MT did not beat human translation this time but caught up really close with a 4-point difference. E1 has been awarded 59 of these green points, much better than 45 of the stock E3. 


  • 70% faster editing
  • +32% in BLEU
  • +31% good and perfect segments that need no editing

E2 scored poorly even with training this time, as was rated the last by every reviewer. E4 and E3 vied for second place after E1, and opinions were split.

Engine preference by reviewer

Since the translation company is using Memsource and Smartcat as TMS/CAT tools, we were limited in the choice of technologies to only what is already integrated into Smartcat. Memsource naturally has the largest pool of MT integrations among CAT-tools, while Smartcat is only beginning to add engines with training capabilities.

E1 winning on both BLEU scoring, human evaluation and integrations simplified our final recommendation. The picked E1 and configured it in the client’s workflows. The first live projects are already running with trained engine support.

Related posts

October 14, 2021
The Arrival of Automatic Dubbing

The year 2021 marked the arrival of speech to speech translation in the commercial world. Scientists are working on making the underlying technology smoother and more accurate, engineers are integrating it into practical use cases. At the same time, there is an explosion in neural voices. Between July and September, three companies in this area […]

Read More
July 28, 2021
The Rise of Government NLP Programs – with Manuel Herranz, Pangeanic

Partner Spotlight: Pangeanic Smart governments are hiring data scientists to further automate what governments do for their  citizens. These data scientists work on creating data highways, so that the information that flows into systems is structured, and a thousand different applications can spring forth from it in the future. In the meanwhile, Manuel Herranz and his company […]

Read More
March 29, 2021
MT engine from Globalese gains 115% after training

Case Study Engines from IT giants such as Google Translate, Microsoft, and Yandex often win in quality because search engine companies possess the whole internet as their data pool. However, with very specialized content and excellent translation memory, this advantage is nullified. In this case study, the engine from a smaller MT vendor Globalese won […]

Read More
Subscribe to our newsletter