AI prompt engineering for localization is the process of designing structured prompts that help large language models produce accurate translations, terminology-consistent output, and reliable localization quality assessments.
In 2024, one year into the GPT boom, language teams are working to implement generative AI more effectively. Prompts are becoming increasingly elaborate, with engineers and project managers employing various techniques to reduce hallucinations, improve accuracy, and lower costs.
Key Takeaways
- The workshop covered: prompt structure, LQA automation, chain-of-thought prompting, RAG, multimodal prompts, agents, and fine-tuning.
- Nearly 200 localization professionals joined the Custom.MT workshop on April 18–19, creating 535 prompts and executing 38,947 generations.
- Models used included GPT-4 Turbo, Claude 3, Mistral 7B, and others.
- This article provides a recap of the most effective prompt engineering techniques used by localization teams in 2024.
1. Live examples of GenAI Industrialization by language teams
More than 40% of the participants indicated that generative AI helps their organizations’ localization workflows. Five presenters provided an overview of their work.
- Terminology extraction for translation (Marta Castello, Creative Words)
- Generating product descriptions in multiple languages for eCommerce (Lionel Rowe, Clearly Local)
- Translation + RAG glossaries (Silvio Picinini, eBay)
- Copy generation and GenAI hub in a gaming company (Bartlomiej Piatkiewicz, Ten Square Games)
- Video voiceover and summarization of descriptions in eLearning (Mirko Plitt, WHO Academy)
2. Recommended Prompt Structure

The first practical task involved building a terminology extraction prompt that applies best practices (an example sketch follows the list below):
- Using clear section titles (e.g., ### Section Title ###)
- Using variables such as {{CONTENT}}, {{LANGUAGE}}, or {{GLOSSARY}}
- Keeping instructions minimal
- Adding examples to improve consistency
- Keeping temperature = 0 for reproducibility
- Adding an emotional marker (e.g., “It’s very important…”) to improve model compliance
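Putting these practices together, a terminology extraction prompt might look like the sketch below. The exact prompt used in the workshop may have differed; the {{EXAMPLE_TERMS}} placeholder for few-shot examples is an added assumption.

### Role ###
You are a terminology specialist preparing a bilingual glossary for translation.
### Task ###
Extract domain-specific terms from the content below and suggest a translation for each into {{LANGUAGE}}. It's very important that you only return terms that actually appear in the content.
### Glossary ###
{{GLOSSARY}}
### Example ###
{{EXAMPLE_TERMS}}
### Content ###
{{CONTENT}}
### Output ###
Term | Suggested translation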
3. Spreadsheet Integration for LQA
The second task was to integrate prompts with spreadsheets so that variables could be applied across many rows of content at once. Participants created their own language quality assurance bots by asking LLMs to detect and label translation errors. We used the public DEMETR dataset from the WMT competitions as task material.

Example LQA Prompt
Evaluate the translation below.
### Task ###
Identify translation errors and classify them as:
- Minor
- Major
- Critical
### Output ###
Severity:
Error Type:
Explanation:
### Source ###
{{SOURCE_SENTENCE}}
### Translation ###
{{TRANSLATED_SENTENCE}}
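Outside a spreadsheet add-on, the same per-row evaluation can be scripted. Below is a minimal Python sketch, assuming the official OpenAI Python client and a hypothetical translations.csv file; the column names and model choice are illustrative, not the workshop's exact setup.

import csv
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """### Task ###
Identify translation errors and classify them as Minor, Major, or Critical.
### Output ###
Severity:
Error Type:
Explanation:
### Source ###
{source}
### Translation ###
{target}"""

with open("translations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # expects "source" and "target" columns
        response = client.chat.completions.create(
            model="gpt-4-turbo",   # model name is illustrative
            temperature=0,         # reproducible output, as recommended above
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(source=row["source"],
                                                         target=row["target"])}],
        )
        print(row["source"], "->", response.choices[0].message.content)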
4. Chain of Thought Prompts for Localization
Giving an AI model too many instructions at once, for example a 20-line localization style guide, leads to many of them being skipped. Instead, in this exercise the participants split the prompt into a chain of smaller prompts, or instructed the model to work through the task step by step.
In our spreadsheet exercise, the first prompt evaluated the severity of the error, and the second prompt classified it by type based on the results of the previous generation.
How-To: Building a Step-by-Step LQA Prompt Workflow
- Prompt 1 – Evaluating severity
Ask the model to read the source and target text and assign an error severity (Minor, Major, Critical).
- Prompt 2 – Classifying the error type
Use the severity result from Prompt 1 as input and classify the error (Accuracy, Fluency, Terminology, Style).
- Prompt 3 – Suggesting an alternative translation
Based on the previous two outputs, request an improved translation that fixes the identified issue.
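A chained version of this workflow can also be scripted, passing the output of each step into the next prompt. The Python sketch below is an assumption about the wiring; the prompt wording, model name, and sample sentence pair are invented for illustration.

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model choice
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

source = "Click Save to keep your changes."
target = "Cliquez sur Enregistrer pour garder vos modifications."

# Prompt 1 - evaluating severity
severity = ask(f"Read the source and translation and assign an error severity "
               f"(Minor, Major, Critical).\nSource: {source}\nTranslation: {target}")

# Prompt 2 - classifying the error type, reusing the result of Prompt 1
error_type = ask(f"The severity of the error is: {severity}\n"
                 f"Classify the error as Accuracy, Fluency, Terminology, or Style.\n"
                 f"Source: {source}\nTranslation: {target}")

# Prompt 3 - suggesting an alternative translation based on the two previous outputs
fix = ask(f"Severity: {severity}\nError type: {error_type}\n"
          f"Suggest an improved translation that fixes the identified issue.\n"
          f"Source: {source}\nTranslation: {target}")

print(severity, error_type, fix, sep="\n")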
5. Vision in multimodal models
In this task, participants worked with the vision-capable GPT-4 Turbo model to complete assignments involving images:
- image translation
- extracting text from scanned PDFs
- screenshot localization testing
- generating multilingual product descriptions from images
- identifying fonts, and more

Example: a participant translates the menu of a sample website from a screenshot.
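A request like this can be reproduced with the OpenAI Python client by attaching the screenshot as an image input. The sketch below is an assumption about how such a call might look; the file name, target language, and model name are placeholders.

import base64
from openai import OpenAI

client = OpenAI()

# Encode a local screenshot as a data URL (a hosted image URL works as well)
with open("menu_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # a vision-capable model; name is illustrative
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Translate all menu items visible in this screenshot into German."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)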
6. Retrieval-Augmented Generation (RAG) for Glossaries & Assets
With retrieval, a large language model can generate output grounded in facts from a linked database, such as company data, glossaries, style guides, and translation memories, instead of relying on the LLM's general memory alone.
In the exercise, we used retrieval for the following use cases:
- translate with glossaries
- check existing translations for terminology compliance
- create a "chat-with-your-website" bot

Example: for retrieval, the database splits the text into chunks, and relevant chunks are found through semantic similarity search, expressed as a percentage in the screenshot above.
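A bare-bones version of this retrieval step can be sketched with the OpenAI embeddings endpoint and cosine similarity. The glossary chunks, query, and model names below are illustrative assumptions, and production setups typically use a vector database instead of in-memory arrays.

import numpy as np
from openai import OpenAI

client = OpenAI()

# Glossary entries stand in for database "chunks" (illustrative content)
chunks = [
    "backlog - product backlog, keep in English",
    "checkout - la caisse",
    "wishlist - la liste de souhaits",
]

def embed(texts):
    """Return embedding vectors for a list of strings."""
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data])

chunk_vectors = embed(chunks)

query = "Translate into French: Add the item to your wishlist before checkout."
query_vector = embed([query])[0]

# Cosine similarity between the query and every chunk, then keep the top two
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector))
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:2]]

# The retrieved glossary lines are injected into the translation prompt
prompt = ("Use the glossary entries below when translating.\n### Glossary ###\n"
          + "\n".join(top_chunks) + "\n### Task ###\n" + query)
answer = client.chat.completions.create(
    model="gpt-4-turbo", temperature=0,
    messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)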
7. Agents: Using LLMs to Control External Apps
This module covered the ability of LLMs to operate external apps via their APIs. We explored translating with DeepL and proofreading with ChatGPT in a single chat interface. While not immediately applicable to production workflows, the overview of agents showcases a potential future scenario in which LLMs act as a user interface to other applications.

Example: GPT-4 Turbo calls DeepL to translate a paragraph of text.
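One way to reproduce this is with the tool-calling feature of the Chat Completions API, where the model decides when to call a translation function that our code then executes against DeepL. The sketch below is an assumption about the wiring: the tool name and schema are hypothetical, the DeepL call uses the official deepl Python package, and the API key is a placeholder.

import json
import deepl                    # official DeepL Python client
from openai import OpenAI

client = OpenAI()
translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key

tools = [{
    "type": "function",
    "function": {
        "name": "translate_with_deepl",          # hypothetical tool name
        "description": "Translate text with DeepL",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "target_lang": {"type": "string", "description": "e.g. DE, FR"},
            },
            "required": ["text", "target_lang"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Translate this paragraph into German with DeepL: "
                        "Our store ships worldwide within five business days."}]
response = client.chat.completions.create(
    model="gpt-4-turbo", messages=messages, tools=tools)

# A real implementation should check that the model actually requested a tool call
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# The LLM chose the arguments; our code performs the actual DeepL call
result = translator.translate_text(args["text"], target_lang=args["target_lang"])
print(result.text)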
8. Fine-tuning GPT-3.5 Turbo
By training GPT-3.5 Turbo on translation memory and glossary data, it is possible to bring its output quality closer to GPT-4 while keeping the cost below that of conventional machine translation. In this module of the workshop, we covered how fine-tuning is done and its impact on the quality and cost of LLM localization.
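At a high level, fine-tuning takes translation memory segments formatted as chat examples in a JSONL file and submits them as a training job. The Python sketch below shows the general shape using the OpenAI client; the file name, system prompt, and sample segment are assumptions, not the workshop's actual data.

import json
from openai import OpenAI

client = OpenAI()

# Translation memory segments converted into chat-style training examples
tm_pairs = [
    ("Add to cart", "In den Warenkorb"),   # illustrative TM segment
]
with open("tm_training.jsonl", "w", encoding="utf-8") as f:
    for source, target in tm_pairs:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": "Translate from English into German."},
            {"role": "user", "content": source},
            {"role": "assistant", "content": target},
        ]}) + "\n")

# Upload the file and start the fine-tuning job
# (a real job needs a substantially larger set of examples)
training_file = client.files.create(file=open("tm_training.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
print(job.id)  # the resulting fine-tuned model can then be called like any other model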

Presenters: Dominic Wever (Promptitude) and Konstantin Dranch (Custom.MT).
The workshop recordings are available at:
https://www.youtube.com/watch?v=MJNlhyStv14 – part 1.
https://www.youtube.com/watch?v=QPPRtquyvgQ – part 2.
Frequently Asked Questions
How do you evaluate translation quality with an LLM?
You can evaluate translation quality by asking the AI model to read the source and target text and label the error severity as Minor, Major, or Critical. This is the first step in an LQA workflow and helps structure later AI quality checks.
How do you classify translation error types?
After severity is set, use a second AI prompt to classify the error type, such as Accuracy, Fluency, Terminology, or Style. This keeps the LQA process consistent and improves reliability across language pairs.
How can AI suggest an improved translation?
Once severity and error type are identified, a third prompt can request an improved translation. The AI produces a corrected version that fixes the issue, improving translation quality through structured LQA prompting.
Upcoming GenAI in Localization Workshops
Subscribe to our newsletter to get notified about the next AI Prompt Engineering for Localization workshop.