Machine Translation Evaluation

Measure your MT model accuracy with independent specialists.

Stages of Quality Assessment:

  • Benchmarking of customized engines vs Google Translate and DeepL
  • Measurements of edit distance and BLEU scores
  • Evaluation by specialist linguists

Automated Metrics Supported

  • BLEU
  • hLEPOR
  • COMET
  • METEOR
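For reference, the sketch below shows how a corpus-level BLEU score can be computed with the open-source sacrebleu library. The file names are illustrative placeholders, and hLEPOR, COMET and METEOR each require their own tooling (COMET additionally needs a trained scoring model).

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (pip install sacrebleu).
# File names are placeholders for line-aligned MT output and human references.
import sacrebleu

with open("mt_output.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# sacrebleu expects a list of reference streams, hence the extra list.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")
```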

Human Evaluation Methods:

Segment Scoring by Expert Panel

Scoring covers a lot of ground quickly. It’s practical when comparing 5 or more different models at the same time.

Specialist linguists go segment by segment and score each one from 1 (Useless) to 5 (Perfect). A business analyst works with 3 reviewers on each evaluation and reconciles their scores to reduce subjectivity in judgement.

All tests are blind: linguists do not know which specific technologies are being evaluated.

Time required: 60 minutes per model

Key metric: % of segments scored Perfect or Good without human corrections

An engineer prepares a test dataset of 100 segments per model. These segments are selected based on their automated metric scores so that all models are represented fairly.
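As an illustration of how the three reviewers' scores can be reconciled and turned into the headline metric, here is a minimal sketch. Taking the median per segment is one reasonable reconciliation rule, and the scores shown are made up.

```python
# Minimal sketch: reconcile three reviewers' 1-5 segment scores and compute
# the share of segments rated Good (4) or Perfect (5). Data is illustrative.
from statistics import median

# scores[model][segment_index] = (reviewer1, reviewer2, reviewer3)
scores = {
    "model_A": [(5, 4, 5), (3, 4, 4), (2, 3, 3)],
    "model_B": [(4, 4, 3), (5, 5, 4), (4, 3, 4)],
}

for model, segments in scores.items():
    # Reconcile by taking the median of the three scores per segment.
    reconciled = [median(triple) for triple in segments]
    good_or_perfect = sum(1 for s in reconciled if s >= 4)
    share = 100 * good_or_perfect / len(reconciled)
    print(f"{model}: {share:.0f}% of segments scored Good or Perfect")
```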

Blind ABC Test

The ABC test is useful for comparing 2-3 models and finding a clear winner among them.

It takes only minutes, not hours, and is easy to run. However, it does not measure editing effort, and it is not practical for comparing more than 4 models.

Time required: 40-50 minutes total

Key metric: % of segments preferred by linguists

Three specialist linguists pick the best translation from 2-3 options. If two or more linguists agree on the best output, that model scores a point. The test continues for 100 segments, and the model with the highest total score wins.
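The tallying logic behind the ABC test is simple majority voting, as sketched below with made-up votes:

```python
# Minimal sketch: tally a blind ABC test. Each of three linguists picks the
# preferred output per segment; a model scores a point only when at least two
# linguists agree. The vote data is illustrative.
from collections import Counter

# votes[segment_index] = (choice of linguist 1, 2, 3), values are model labels
votes = [
    ("A", "A", "B"),
    ("B", "B", "B"),
    ("A", "C", "B"),   # no majority -> no point awarded
    ("C", "C", "A"),
]

points = Counter()
for segment_votes in votes:
    winner, count = Counter(segment_votes).most_common(1)[0]
    if count >= 2:                 # at least two linguists agree
        points[winner] += 1

for model, score in points.most_common():
    print(f"Model {model}: {score} points")
```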

Edit Distance Measurement

Intensive test to estimate the human effort needed to polish MT

Significantly more accurate than scoring, but more labor-intensive.

Time required: 120 minutes per model

Key metric: word error rate

Three linguists edit a sample MT output and bring it to the required level of quality. An engineer then runs a script to measure the percentage of words changed (the word error rate) at both the document and the segment level.
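The word error rate itself is a word-level edit distance between the raw MT output and its post-edited version. One common convention, assumed in the sketch below, is to normalize by the length of the post-edited text; a dedicated script or library would be used in practice.

```python
# Minimal sketch: segment-level word error rate (WER) as a word-level edit
# distance between raw MT output and its post-edited version, normalized by
# the post-edited length.

def word_error_rate(mt_segment: str, post_edited: str) -> float:
    hyp = mt_segment.split()
    ref = post_edited.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Document-level WER can be obtained by summing edits and post-edited word
# counts across all segments before dividing.
wer = word_error_rate("the cat sat on mat", "the cat sat on the mat")
print(f"Segment WER: {wer:.2%}")
```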

Error Classification

Intensive test to understand model performance in detail

Significantly more accurate than scoring, but more labor-intensive.

Time required: 150 minutes per model

Key metrics: weighted error score, error distribution

On top of editing a sample MT output, linguists flag and classify errors based on a pre-agreed typology. We rely on a streamlined DQF/MQM harmonized metric that covers accuracy, language fluency, style, locale conventions, spacing issues and other error types. This approach gives project leaders an in-depth understanding of the cognitive effort required to edit the MT output.
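A weighted error score of this kind is typically computed by assigning a penalty to each flagged error according to its severity and normalizing by the size of the evaluated sample. The sketch below uses illustrative severity weights and a per-1,000-words normalization; the exact weights and normalization in the DQF/MQM-based metric may differ.

```python
# Minimal sketch: weighted error score in the spirit of DQF/MQM.
# Severity weights and the per-1000-words normalization are illustrative
# assumptions, as is the error data.
from collections import Counter

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

# Each flagged error: (category, severity).
errors = [
    ("accuracy", "major"),
    ("fluency", "minor"),
    ("style", "minor"),
    ("locale convention", "minor"),
    ("accuracy", "critical"),
]
evaluated_words = 1500  # size of the edited sample

penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
score_per_1000_words = 1000 * penalty / evaluated_words

# Error distribution by category
distribution = Counter(category for category, _ in errors)

print(f"Weighted error score: {score_per_1000_words:.1f} per 1,000 words")
print("Error distribution:", dict(distribution))
```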