![](https://custom.mt/wp-content/uploads/2025/02/Deepseek-for-translation-CustomMT-evaluation.jpg)
Part 1 of the evaluation with scores
The community’s evaluation of Deepseek’s translation accuracy is complete.
- Deepseek won in 9 out of 18 linguistic tests. In six of the nine remaining tests it stayed within 5% of the leading model.
- A clean human test. Volunteer professional linguists rated the translations with no information about which output came from which model: a double-blind test with no bias toward any particular brand.
- Support for low-resource language combinations. The evaluation included language combinations without English: Czech to Hungarian, Swedish to Norwegian, French to Italian, Turkish to Russian, and French to German. Subject matter ranged from marketing websites to legal and medical texts. Deepseek performed well in almost all of them, falling short of the leader in only one test, English to Japanese. In that test, however, all models received poor ratings.
Conclusion: Deepseek is a top-notch translation powerhouse racing nose-to-nose with the market leaders. The differences were small, and the ratings depended more on the evaluators’ personal preferences toward computer-generated translation than on model performance.
And it will get better
We structured this first evaluation to mimic the way translation happens in professional computer-assisted translation (CAT) tools: sentence by sentence, without the model seeing the adjacent sentences or the full text. This approach penalizes large language models, which thrive when given context. As we expand the benchmark to full texts, we expect to see improved translation performance from LLMs.
Moreover, in this first evaluation we used a simple zero-shot prompt that did not prime the model for specific subject areas or styles. With a little tweaking, linguists could bring the outputs more in line with their preferences.
With a few tweaks to the process, we can significantly improve LLM performance.
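To make the difference concrete, here is a minimal sketch of zero-shot versus domain-primed prompting, assuming Together.AI’s OpenAI-compatible endpoint; the model identifier and the prompt wording are our own illustration, not the exact prompts used in the evaluation.

```python
from openai import OpenAI

# Assumptions: Together.AI's OpenAI-compatible endpoint and the
# "deepseek-ai/DeepSeek-R1" model identifier; adjust for your provider.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",
)

# Plain zero-shot instruction, as in our first evaluation round.
ZERO_SHOT = (
    "Translate the following text from French to Italian. "
    "Reply with the translation only."
)

# A primed prompt pins down domain, audience, and register, which is
# where LLMs tend to pull ahead of sentence-level NMT.
PRIMED = (
    "You are a professional medical translator. Translate the following "
    "text from French to Italian for a clinical-trial consent form. "
    "Use a formal register and standard regulatory terminology. "
    "Reply with the translation only."
)

def translate(text: str, system_prompt: str, temperature: float = 0.3) -> str:
    # Note: R1's raw output may contain <think>...</think> artifacts
    # (more on this below).
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",  # assumed model identifier
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        temperature=temperature,
    )
    return response.choices[0].message.content

source = "Le patient doit signer le formulaire de consentement avant l'examen."
print(translate(source, ZERO_SHOT))
print(translate(source, PRIMED))
```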
Technical Limitations
Yet, in its present form, Deepseek R1 is unusable for professional translation.
- Slow: using Deepseek’s R1 hosted by Together.AI in a Western datacenter, we had to wait 20–60 seconds per translated sentence. In contrast, neural machine translation takes 1–2 seconds per sentence.
- Artifacts: the model generated <think> tokens, especially when presented with 5+ sentences at once (“Hmm, the user asked me to translate this, let’s go word by word…”). Our engineers had to remove them manually before sending the translations to linguists; see the clean-up sketch after this list.
- Unwanted markup: Deepseek occasionally generated Markdown syntax even when the source text contained none.
- Instruction non-compliance: Deepseek often failed to follow prompt instructions, and we had to regenerate with a higher or lower temperature.
- Incomplete translations: in 10 out of 400 test segments, the model stopped generating mid-inference and failed to return a translation. Increasing the temperature and relaunching the translation resolved the issue; the retry wrapper sketched after this list automates exactly that.
- High entry costs for on-prem: the full-sized Deepseek model requires a minimum of 6x A100 Nvidia GPUs to run, which translates to about $52,500 in GPU rental costs per year (roughly $1 per GPU-hour). For faster speeds and larger volumes, engineers will ask for 8x H100 units, at a rental cost of more than $100,000 per year.
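The artifact and incompleteness problems, at least, are mechanical and easy to patch around. Below is a minimal sketch of the kind of post-processing and retry wrapper we mean, assuming a callable that maps (text, temperature) to raw model output, like the helper sketched earlier; the regex and retry policy are illustrative, not our production pipeline.

```python
import re
from typing import Callable

# R1 wraps its chain-of-thought in <think>...</think> blocks.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean(raw: str) -> str:
    """Strip reasoning blocks and surrounding whitespace from raw output."""
    return THINK_RE.sub("", raw).strip()

def translate_with_retry(
    translate: Callable[[str, float], str],  # (text, temperature) -> raw output
    text: str,
    temperatures=(0.3, 0.7, 1.0),
) -> str:
    """Relaunch stuck or empty translations at a different temperature."""
    for temp in temperatures:
        candidate = clean(translate(text, temp))
        if candidate:  # non-empty once <think> artifacts are removed
            return candidate
    raise RuntimeError("no usable translation after all retries")

# Quick check of the cleaner on a typical R1 artifact:
raw = ("<think>Hmm, the user asked me to translate this, "
       "let's go word by word...</think>Il paziente deve firmare il modulo.")
assert clean(raw) == "Il paziente deve firmare il modulo."
```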
In short, the largest Deepseek R1 model is slow and unstable out of the box. It needs to be reduced in size (distilled), fine-tuned for translation, and integrated into translation tools with protocols in place to relaunch stuck translations. We’re sure to see many AI teams around the world doing exactly that in the near future.
A Robin Hood moment
Unlike OpenAI’s GPT models and Anthropic’s Claude, Deepseek’s models are open source, and AI engineers are free to reuse and distill them into their own creations and recipes. With an injection of proprietary data, such as a collection of high-quality translations, a derivative model becomes proprietary itself, and the translation company or enterprise team that trained it may use it for business, skipping commercial products such as DeepL or Google Translate.
In November 2024, the Tower-2 model by Ricardo Rei and André Martins from Unbabel outperformed popular commercial products such as Google Translate, DeepL, and Microsoft Translator in 9 out of 11 language tasks at the WMT24 competition. According to the Tower paper, it was built by continuing the pre-training of Meta’s Llama-2 and fine-tuning it on high-quality machine-translation instructions. With Deepseek outperforming Llama-2 and others in translation tasks, model makers can switch from Llama and get leading performance with less investment and know-how. There is no guarantee that a novice engineer will reach the level of reliability tried-and-tested commercial products offer, but with a stronger foundation model to start from, the path into the race has just gotten shorter.
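For a sense of what that recipe involves, instruction fine-tuning for translation largely comes down to assembling records like the ones below. The chat-style JSONL schema is a common convention among fine-tuning stacks and is our own illustration, not Tower’s actual training format.

```python
import json

# Illustrative machine-translation instruction records in the common
# chat-messages JSONL convention; NOT Tower's actual training format.
examples = [
    {
        "messages": [
            {"role": "system",
             "content": "Translate from French to Italian. Reply with the translation only."},
            {"role": "user",
             "content": "Le contrat entre en vigueur le jour de la signature."},
            {"role": "assistant",
             "content": "Il contratto entra in vigore il giorno della firma."},
        ]
    },
    # ...thousands more high-quality, human-verified pairs across
    # domains and language directions...
]

with open("mt_instructions.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```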
A hundred viable alternatives to Google Translate, ChatGPT, DeepL and other popular translation products may spring up. And perhaps one day soon, quality will no longer be the main differentiator between models.
And the trickiest part? Should the suspicions that Deepseek distilled closed-source OpenAI models to make its own product prove accurate, it will already be too late to trace and recall all the derivatives from the market. It’s not as if OpenAI or other model makers paid for their training data in the first place, which is why most companies don’t disclose which datasets they actually used, or explicitly state that their models are built with synthetic data (Microsoft’s Phi-4, for example).
The “gold” of translation AI has already left the research labs of companies with billions of dollars in investment and is being distributed to thousands of developers. And there is no Sheriff of Nottingham in sight to take it back.
It’s a Robin Hood-like moment in language AI.
Thank you for contributing ideas and counter-arguments:
- Gema Ramírez-Sánchez (Prompsit, EuroLLM)
- Ricardo Rei and Joao Graca (Unbabel)
- Marco Trombetti (Translated)
Opinions expressed are our own.