
The 20th Machine Translation Summit, held in Geneva, featured 76 presentations, including 37 posters, that together shape the future of machine translation as a field. Below is a curated list of impactful papers from the User track proceedings.
Key Trends from MT Summit 2025
Key topics this year included:
- Context-Aware MT
- Named Entity-Aware MT
- Multilingual and Low-Resource Translation, addressing data scarcity across diverse languages
- Translation Evaluation and Productivity, with a focus on workflow efficiency
- Large Language Models (LLMs) in emergency response and education
- Growing interest in machine-assisted translation of expressive texts, including literature
- Gender-Inclusive and Signed/Spoken Language MT, reflecting increased attention to accessibility and inclusive language
Top 10 Innovations from MT Summit 2025
Here are ten solutions and ideas from this year’s Summit that we believe are actionable and worth knowing for industry professionals:
- ProMut: Evolving MT Education
A training platform for MT professionals built by Prompsit. Includes MarianNMT integration, COMET evaluation, and expanded training capabilities. ProMut builds on MutNMT and includes 500k training sentence pairs. It recently went open-source.
- AI Workflows for Multimedia Localization
An integrated pipeline that combines transcription, MT, and voice-over synthesis. Delivers up to 86% time and 71% cost savings in subtitling and dubbing.
- Speech-to-Speech Translation for Low-Resource Languages
Modular systems integrating ASR, MT, and TTS for under-resourced languages. Proven feasible with as little as 3 hours of training audio, tested on over 60 pipeline configurations. Built at the School of Engineering and Management Vaud.
- Cultural Transcreation in Asian Languages with Prompted LLMs
In a project by Unbabel, prompted LLMs deliver culturally nuanced translations for East Asian languages, enhancing customer service communication with greater cultural fidelity.
- Cross-Locale Adaptation via LLMs
A project by Vera Senderowicz Guerra at Welocalize automatically localizes Spanish variants (Mexico, Argentina, Spain) using LLM prompts, ensuring consistency in tone and terminology across regions.
- CAT-GPT: Purpose-Based Translation
An open-source CAT tool built for GenAI. While this is a university project, it showcases the need for a new generation of CAT systems.
- MTUOC Server for Engine Integration
The project’s main goal is to make training, using, and integrating neural machine translation systems easier. This open-source application integrates NMT systems and LLMs, supporting the MTUOC, Moses, and ModernMT protocols.
- UniOr PET: Real-Time Post-Editing Platform
A collaborative web-based tool for post-editing with detailed action tracking and real-time quality metrics (hTER, BLEU, ChrF). Another signal of an MT- and LLM-first CAT-tool generation.
- Multilingual Chatbot Applications
LLM-powered bots for real-time multilingual use cases, including meeting summaries and quality checks.
- BridgeAI: Aligning AI Policy, Ethics, and Practice in Portugal
A Portuguese national initiative creating a multidisciplinary framework to implement the EU AI Act. BridgeAI delivers tools for risk assessment, ethics, and AI literacy, fostering responsible innovation with implications across NLP and MT.
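Several of the tools above report quality via character-level metrics such as ChrF. To make the idea concrete, here is a minimal, simplified sketch of ChrF-style scoring: character n-gram precision and recall combined into an F-score with recall weighted more heavily. This is an illustration only; production implementations (e.g. sacreBLEU's chrF) handle whitespace, word n-grams, and smoothing differently.

```python
from collections import Counter

def char_ngrams(text, n):
    # Simplification: drop spaces before extracting character n-grams.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score (simplified sketch of chrF)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical hypothesis and reference score 1.0; strings with no shared character n-grams score 0.0.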
Key Topic: Prompt Engineering
Prompt engineering emerged as a standout theme at this year’s Summit, with research showing how it can significantly improve translation outcomes:
- Helena Wu et al. demonstrated that GPT-4o reached human-level quality in 92.86% of segments for cultural transcreation in Asian languages.
- Vera Senderowicz Guerra introduced a standardized prompt-based workflow that regionalizes Spanish content using a human-reviewed NMT root.
- In a study on Arabizi, prompt tuning boosted BLEU scores by 16% (EN) and 20% (AR) using one-shot prompts.
The takeaway? Localization teams should invest in prompt engineering training, integrate prompt-based tools like CAT-GPT into workflows, and maintain human oversight—especially for sensitive content. For example, Abeer Alfaify’s analysis of the Gaza-Israel conflict showed that MT struggled with cultural nuance and cursive handwriting, underscoring the need for human validation in high-risk scenarios.
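To illustrate what a prompt-based localization step can look like in practice, here is a minimal one-shot prompt builder in the chat-message format most LLM APIs accept. The instruction wording, function name, and example sentence pair are illustrative assumptions, not the workflows presented at the Summit.

```python
def build_locale_prompt(source_text, target_locale, example_source, example_target):
    """Assemble a one-shot chat prompt for regionalizing Spanish content.

    A single human-reviewed example pair demonstrates the desired
    register before the model sees the new input.
    """
    system = (
        "You are a localization specialist. Adapt the text to "
        f"{target_locale}, preserving meaning, tone, and terminology."
    )
    return [
        {"role": "system", "content": system},
        # One-shot demonstration: reviewed source/target pair.
        {"role": "user", "content": example_source},
        {"role": "assistant", "content": example_target},
        # The actual segment to localize.
        {"role": "user", "content": source_text},
    ]

messages = build_locale_prompt(
    "Pulse el botón para continuar.",
    "Spanish (Mexico)",
    "Pulse el botón de enviar.",
    "Haz clic en el botón de enviar.",
)
```

The resulting message list can be passed to any chat-completion endpoint; keeping the example pair human-reviewed is what preserves the oversight the Summit speakers emphasized.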
Key Topic: Fine-Tuning LLMs
Another major trend from the Summit was the targeted fine-tuning of large language models—balancing prompt simplicity with contextual depth, all while preserving human oversight.
- In “DeMINT”, Miquel Esplà-Gomis et al. fine-tuned LLaMA 3.1-8B to create a chatbot that assists English learners using real-world transcripts. The model acts as a context-aware language tutor, showcasing how fine-tuning can build empathy and pedagogical value into MT systems.
- Andrei Popescu-Belis and team fine-tuned Whisper and NLLB-200 (1.3B) with just 3 hours of data per language (Turkish, Pashto, French), achieving strong results, particularly in Pashto. This is a practical solution for humanitarian MT projects with limited data.
- The ZuBidasoa initiative by Xabier Soto and colleagues is developing MT systems for migrants in the Basque Country. By fine-tuning Latxa-based LLMs and engaging local communities, the project demonstrates how MT can drive real-world impact in underserved populations.
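Part of why fine-tuning 8B-class models on modest data is tractable is that adapter methods such as LoRA update only a small fraction of the weights. A back-of-the-envelope sketch (the layer count, hidden size, and rank below are illustrative assumptions, not the configuration of any Summit paper):

```python
def lora_trainable_params(hidden_size, num_layers, rank, matrices_per_layer=4):
    """Trainable parameters when LoRA factors (A: d x r, B: r x d) are
    attached to `matrices_per_layer` square projection matrices per layer."""
    per_matrix = 2 * hidden_size * rank  # the A and B low-rank factors
    return num_layers * matrices_per_layer * per_matrix

# Illustrative numbers loosely shaped like an 8B-parameter transformer:
# 32 layers, hidden size 4096, LoRA rank 16 on four attention projections.
trainable = lora_trainable_params(hidden_size=4096, num_layers=32, rank=16)
fraction = trainable / 8e9
print(f"{trainable:,} trainable params (~{fraction:.4%} of 8B)")
```

Under these assumptions only around 17M of 8B parameters are trained, which is what makes few-hour, few-GPU adaptation realistic for humanitarian and community projects.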
Industry Perspectives

Bruno Ciola, Head of Language Technology at Diction AG (Switzerland)
“1. Using plain or LLM-augmented NMT versus replacing NMT entirely with LLMs: academia mostly focuses on the latter. In industry, LLMs can still be unpredictable, so testing and benchmarking are a must. It’s clear there’s no one-size-fits-all solution with the many LLMs out there.
2. Document-level processing is close to overtaking segment-based systems; it’s time to wake up, technology vendors!
3. Commercial systems often outperform open-source ones, and surprisingly, old-school methods like regular expressions are still effective in some cases.”

Maria Carmen Staiano, PhD Candidate in Humanities and Technologies at the University of Macerata
“I attended the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts. Among the key themes that stood out was the emphasis on the use of plain and clear language to ensure equitable access to information, particularly in the medical domain.
Another major focus was on sign language and speech translation, reflecting the broader commitment to inclusivity and multilingual accessibility across different modalities.”

Steve Dept, and Laura Casanellas Luri, cApStAn
“1. Monolingual evaluation could be a game-changer: it is less time-consuming and resource-intensive.
2. LLMs perform well in creative translation tasks when used at the paragraph level with prompts like “translate creatively”, but the models seem to lose attention and begin to produce more errors at the document level.
3. There is a rising focus on ethics and sustainability, particularly on the environmental cost of AI as an essential part of how future MT systems should be assessed.”

Gema Ramírez-Sánchez, CEO of Prompsit Language Engineering
“A crowded room, eager to learn how GenAI can help deliver linguistically and culturally localized content beyond the extraordinary, deserves serious evaluation processes and standards: high-quality benchmarks, dynamic evaluation, solid metrics, and easily customizable frameworks.
Without proper evaluation we cannot measure the value proposition that GenAI has to offer in a fast-changing environment. And we’d better do evaluation in the open, so that everyone can see, share, and compare, but also with the capacity to adapt to specific use cases.”
Conclusion
From our perspective, this year’s Summit confirmed what many in the field are already sensing: the machine translation ecosystem is evolving fast towards LLM- and prompt-driven workflows. The proceedings list ready-to-use tools for translation teams, smart adaptations of LLMs to real-world constraints, and fresh ideas that bring policy, ethics, and technology into closer alignment. What stood out most was the shift from theoretical ambition to hands-on application. Whether it’s fine-tuned models for refugees, prompt-based workflows that actually save time, or practical frameworks for regulation, the focus is clearly turning toward usability and accountability. And that’s good news for everyone building or relying on MT today.
Bonus: Datasets for MT Research & Experimentation
We’ve also compiled newly released open datasets from the Summit to support innovation, language development and multilingual evaluation.
| Dataset Name | Hosted On | Languages | Data Volume |
| --- | --- | --- | --- |
| OpenHQ-SpeechT-GL-EN | HuggingFace (link) | Galician–English | 4,798 train, 507 dev, 282 test |
| FLEURS-SpeechT-GL-EN | HuggingFace (link) | Galician–English | 2,742 train, 496 dev, 212 test |
| COVOST2_ID-EN | HuggingFace (link) | Indonesian–English | 1,243 train, 792 dev, 844 test |
| FLEURS-AR-EN-split | HuggingFace (link) | Arabic–English | 2,228 train, 278 dev, 279 test |
| indic-en-bn | HuggingFace (link) | Bengali–English | 41,984 train, 9,000 dev, 1,000 test |
| voxpopuli_es-ja | HuggingFace (link) | Spanish–Japanese | 9,972 train, 1,440 dev, 1,345 test |
| Darija Open Dataset | GitHub (link) | Darija–English | 50,000 sentences |
| FLORES+ Mayas | GitHub (link) | Six Mayan languages | 2,000 sentences (~50,000 words) |
| HPLT | HPLT / OPUS / HuggingFace (multiple links) | 193 mono, 51 parallel pairs | 7.6T tokens (mono), 380M sentences (parallel), 1,275 language pairs |
| eSTÓR | estor.ie | Irish–English | 184 resources (185,343 TUs), 201,719 monolingual words |
| Prompt-based QE (En–Ml) | GitHub (link) | English–Malayalam | 8,000 segments |
| AI4Culture | ai4culture.eu | Multilingual | Covers OCR/HTR, subtitles, MT, image analysis, semantic linking |