AI prompt engineering for localization is the process of designing structured prompts that help large language models produce accurate translations, terminology-consistent output, and reliable localization quality assessments.
In 2024, one year into the GPT boom, language teams are working to implement generative AI more effectively. Prompts are becoming increasingly elaborate, with engineers and project managers employing various techniques to reduce hallucinations, improve accuracy, and lower costs.
Key Takeaways
- The workshop covered: prompt structure, LQA automation, chain-of-thought prompting, RAG, multimodal prompts, agents, and fine-tuning.
- Nearly 200 localization professionals joined the Custom.MT workshop on April 18–19, creating 535 prompts and executing 38,947 generations.
- Models used included GPT-4 Turbo, Claude 3, Mistral 7B, and others.
- This article provides a recap of the most effective prompt engineering techniques used by localization teams in 2024.
1. Live examples of GenAI Industrialization by language teams
More than 40% of the participants indicated that generative AI helps their organizations’ localization workflows. Five presenters provided an overview of their work.
- Terminology extraction for translation (Marta Castello, Creative Words)
- Generating product descriptions in multiple languages for eCommerce (Lionel Rowe, Clearly Local)
- Translation + RAG glossaries (Silvio Picinini, eBay)
- Copy generation and GenAI hub in a gaming company (Bartlomiej Piatkiewicz, Ten Square Games)
- Video voiceover and summarization of descriptions in eLearning (Mirko Plitt, WHO Academy)
2. Recommended Prompt Structure

The first practical task involved building a terminology extraction prompt that applies best practices (an example sketch follows the list below):
- Using clear section titles (e.g., ### Section Title ###)
- Using variables such as {{CONTENT}}, {{LANGUAGE}}, or {{GLOSSARY}}
- Keeping instructions minimal
- Adding examples to improve consistency
- Keeping temperature = 0 for reproducibility
- Adding an emotional marker (e.g., “It’s very important…”) to improve model compliance
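Putting these practices together, a terminology extraction prompt might look like the sketch below. The exact prompt used in the workshop may have differed; the {{EXAMPLE_TERMS}} placeholder for few-shot examples is an added assumption.

### Role ###
You are a terminology specialist preparing a bilingual glossary for translation.
### Task ###
Extract domain-specific terms from the content below and suggest a translation for each into {{LANGUAGE}}. It's very important that you only return terms that actually appear in the content.
### Glossary ###
{{GLOSSARY}}
### Example ###
{{EXAMPLE_TERMS}}
### Content ###
{{CONTENT}}
### Output ###
Term | Suggested translation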
3. Spreadsheet Integration for LQA
The second task was to integrate prompts with spreadsheets so that variables could be applied across many rows of content at once. Participants created their own language quality assurance bots by asking LLMs to detect and label translation errors. We used the public DEMETR dataset from the WMT competitions as task material.

Example LQA Prompt
Evaluate the translation below.
### Task ###
Identify translation errors and classify them as:
- Minor
- Major
- Critical
### Output ###
Severity:
Error Type:
Explanation:
### Source ###
{{SOURCE_SENTENCE}}
### Translation ###
{{TRANSLATED_SENTENCE}}
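Outside a spreadsheet add-on, the same per-row evaluation can be scripted. Below is a minimal Python sketch, assuming the official OpenAI Python client and a hypothetical translations.csv file; the column names and model choice are illustrative, not the workshop's exact setup.

import csv
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """### Task ###
Identify translation errors and classify them as Minor, Major, or Critical.
### Output ###
Severity:
Error Type:
Explanation:
### Source ###
{source}
### Translation ###
{target}"""

with open("translations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # expects "source" and "target" columns
        response = client.chat.completions.create(
            model="gpt-4-turbo",   # model name is illustrative
            temperature=0,         # reproducible output, as recommended above
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(source=row["source"],
                                                         target=row["target"])}],
        )
        print(row["source"], "->", response.choices[0].message.content)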
4. Chain of Thought Prompts for Localization
Giving an AI model too many instructions at once, for example a 20-line localization style guide, leads to many of them being skipped. Instead, in this exercise the participants split the prompt into a chain of smaller prompts, or instructed the model to work through the task step by step.
In our spreadsheet exercise, the first prompt evaluated the severity of the error, and the second prompt classified it by type based on the results of the previous generation.
How-To: Building a Step-by-Step LQA Prompt Workflow
- Prompt 1 – Evaluating severity
Ask the model to read the source and target text and assign an error severity (Minor, Major, Critical).
- Prompt 2 – Classifying the error type
Use the severity result from Prompt 1 as input and classify the error (Accuracy, Fluency, Terminology, Style).
- Prompt 3 – Suggesting an alternative translation
Based on the previous two outputs, request an improved translation that fixes the identified issue.
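A chained version of this workflow can also be scripted, passing the output of each step into the next prompt. The Python sketch below is an assumption about the wiring; the prompt wording, model name, and sample sentence pair are invented for illustration.

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model choice
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

source = "Click Save to keep your changes."
target = "Cliquez sur Enregistrer pour garder vos modifications."

# Prompt 1 - evaluating severity
severity = ask(f"Read the source and translation and assign an error severity "
               f"(Minor, Major, Critical).\nSource: {source}\nTranslation: {target}")

# Prompt 2 - classifying the error type, reusing the result of Prompt 1
error_type = ask(f"The severity of the error is: {severity}\n"
                 f"Classify the error as Accuracy, Fluency, Terminology, or Style.\n"
                 f"Source: {source}\nTranslation: {target}")

# Prompt 3 - suggesting an alternative translation based on the two previous outputs
fix = ask(f"Severity: {severity}\nError type: {error_type}\n"
          f"Suggest an improved translation that fixes the identified issue.\n"
          f"Source: {source}\nTranslation: {target}")

print(severity, error_type, fix, sep="\n")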
5. Vision in multimodal models
In this task, participants worked with the vision-capable GPT-4 Turbo model to complete assignments involving images:
- image translation
- extracting text from scanned PDFs
- screenshot localization testing
- generating multilingual product descriptions from images
- identifying fonts, and more

Example: a participant translates the menu of a sample website from a screenshot.
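A request like this can be reproduced with the OpenAI Python client by attaching the screenshot as an image input. The sketch below is an assumption about how such a call might look; the file name, target language, and model name are placeholders.

import base64
from openai import OpenAI

client = OpenAI()

# Encode a local screenshot as a data URL (a hosted image URL works as well)
with open("menu_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # a vision-capable model; name is illustrative
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Translate all menu items visible in this screenshot into German."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)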
6. Retrieval-Augmented Generation (RAG) for Glossaries & Assets
With retrieval, a large language model can generate output grounded in facts from a linked database, such as company data, glossaries, style guides, and translation memories, instead of relying on the LLM's general memory alone.
In the exercise, we used retrieval for the following use cases:
- translate with glossaries
- check existing translations for terminology compliance
- create a "chat-with-your-website" bot

Example: for retrieval, the database splits the text into chunks, and relevant chunks are found through semantic similarity search, expressed as a percentage in the screenshot above.
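A bare-bones version of this retrieval step can be sketched with the OpenAI embeddings endpoint and cosine similarity. The glossary chunks, query, and model names below are illustrative assumptions, and production setups typically use a vector database instead of in-memory arrays.

import numpy as np
from openai import OpenAI

client = OpenAI()

# Glossary entries stand in for database "chunks" (illustrative content)
chunks = [
    "backlog - product backlog, keep in English",
    "checkout - la caisse",
    "wishlist - la liste de souhaits",
]

def embed(texts):
    """Return embedding vectors for a list of strings."""
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data])

chunk_vectors = embed(chunks)

query = "Translate into French: Add the item to your wishlist before checkout."
query_vector = embed([query])[0]

# Cosine similarity between the query and every chunk, then keep the top two
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector))
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:2]]

# The retrieved glossary lines are injected into the translation prompt
prompt = ("Use the glossary entries below when translating.\n### Glossary ###\n"
          + "\n".join(top_chunks) + "\n### Task ###\n" + query)
answer = client.chat.completions.create(
    model="gpt-4-turbo", temperature=0,
    messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)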
7. Agents: Using LLMs to Control External Apps
This module covered the ability of LLMs to operate external apps via their APIs. We explored translating with DeepL and proofreading with ChatGPT in a single chat interface. While not immediately applicable to production workflows, the overview of agents showcases a potential future scenario in which LLMs act as a user interface to other applications.

Example: GPT-4 Turbo calls DeepL to translate a paragraph of text.
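One way to reproduce this is with the tool-calling feature of the Chat Completions API, where the model decides when to call a translation function that our code then executes against DeepL. The sketch below is an assumption about the wiring: the tool name and schema are hypothetical, the DeepL call uses the official deepl Python package, and the API key is a placeholder.

import json
import deepl                    # official DeepL Python client
from openai import OpenAI

client = OpenAI()
translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key

tools = [{
    "type": "function",
    "function": {
        "name": "translate_with_deepl",          # hypothetical tool name
        "description": "Translate text with DeepL",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "target_lang": {"type": "string", "description": "e.g. DE, FR"},
            },
            "required": ["text", "target_lang"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Translate this paragraph into German with DeepL: "
                        "Our store ships worldwide within five business days."}]
response = client.chat.completions.create(
    model="gpt-4-turbo", messages=messages, tools=tools)

# A real implementation should check that the model actually requested a tool call
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# The LLM chose the arguments; our code performs the actual DeepL call
result = translator.translate_text(args["text"], target_lang=args["target_lang"])
print(result.text)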
8. Fine-tuning GPT-3.5 Turbo
By training GPT-3.5 Turbo on translation memory and glossary data, it is possible to bring its output quality closer to GPT-4 while keeping the cost below that of conventional machine translation. In this module of the workshop, we covered how fine-tuning is done and its impact on the quality and cost of LLM localization.
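At a high level, fine-tuning takes translation memory segments formatted as chat examples in a JSONL file and submits them as a training job. The Python sketch below shows the general shape using the OpenAI client; the file name, system prompt, and sample segment are assumptions, not the workshop's actual data.

import json
from openai import OpenAI

client = OpenAI()

# Translation memory segments converted into chat-style training examples
tm_pairs = [
    ("Add to cart", "In den Warenkorb"),   # illustrative TM segment
]
with open("tm_training.jsonl", "w", encoding="utf-8") as f:
    for source, target in tm_pairs:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": "Translate from English into German."},
            {"role": "user", "content": source},
            {"role": "assistant", "content": target},
        ]}) + "\n")

# Upload the file and start the fine-tuning job
# (a real job needs a substantially larger set of examples)
training_file = client.files.create(file=open("tm_training.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
print(job.id)  # the resulting fine-tuned model can then be called like any other model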

Presenters: Dominic Wever (Promptitude) and Konstantin Dranch (Custom.MT).
The workshop recordings are available at:
https://www.youtube.com/watch?v=MJNlhyStv14 – part 1.
https://www.youtube.com/watch?v=QPPRtquyvgQ – part 2.
Frequently Asked Questions
How do you evaluate translation quality with an LLM?
You can evaluate translation quality by asking the AI model to read the source and target text and label the error severity as Minor, Major, or Critical. This is the first step in an LQA workflow and helps structure later AI quality checks.
How do you classify translation error types?
After severity is set, use a second AI prompt to classify the error type, such as Accuracy, Fluency, Terminology, or Style. This keeps the LQA process consistent and improves reliability across language pairs.
How can AI suggest an improved translation?
Once severity and error type are identified, a third prompt can request an improved translation. The AI produces a corrected version that fixes the issue, improving translation quality through structured LQA prompting.
Upcoming GenAI in Localization Workshops
Subscribe to our newsletter to get notified about the next AI Prompt Engineering for Localization workshop.