Services

DATA ACQUISITION

This service is for:

  • Localization directors adding new languages
  • Machine learning specialists training models

 

We recommend at least 3 million words of parallel text to customize a machine translation model and significantly improve its in-domain accuracy. This amount of language data is not always available, especially when starting with a new language or subject matter area. Hence the need to acquire data.

 

While text is abundant on the web, high-quality specialist in-domain datasets and datasets for low-resource languages are hard to come by. Finding and licensing them requires expertise and a grassroots presence.

 

Purchasing in-domain data

Using a sample of customer data of at least 20 000 segments, our team can scan available public and commercial databases for matching material, or request data from in-country linguists.

 

Parsing web sources

Using web crawlers and automated alignment tools, we can secure parallel texts from public sources such as disclosure websites, legal notice websites, multilingual news sources and specialized encyclopedias. Parsers that connect to these sources can return large bodies of text.

 

Data manufacturing

Using a sample of customer data of at least 20 000 segments, our team can scan available public and commercial databases for matching material, or request data from in-country linguists.

Quality

Before purchase, we verify datasets using our proprietary scanner and a network of language professionals to ensure they are new, non-repetitive, and of high quality.
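The "new" and "non-repetitive" checks can be illustrated with a small sketch. This is not the proprietary scanner mentioned above; it is a minimal example that assumes a dataset is a list of (source, target) segment pairs and that exact matching after light normalization is enough to spot repeats. The names `normalize` and `screen_dataset` are hypothetical.

```python
import re
import unicodedata


def normalize(segment: str) -> str:
    """Apply Unicode NFKC, lowercase, and collapse whitespace so that
    trivially different copies of a segment compare equal."""
    text = unicodedata.normalize("NFKC", segment).lower()
    return re.sub(r"\s+", " ", text).strip()


def screen_dataset(candidate, existing):
    """Return (repetition_rate, overlap_rate) for candidate segment pairs.

    repetition_rate: share of candidate pairs that repeat an earlier
                     candidate pair (the "non-repetitive" check).
    overlap_rate:    share of distinct candidate pairs already present
                     in `existing` (the "new" check).
    """
    if not candidate:
        return 0.0, 0.0

    known = {(normalize(s), normalize(t)) for s, t in existing}
    seen = set()
    repeats = 0
    for source, target in candidate:
        key = (normalize(source), normalize(target))
        if key in seen:
            repeats += 1
        else:
            seen.add(key)

    overlap = len(seen & known)
    return repeats / len(candidate), overlap / len(seen)


if __name__ == "__main__":
    candidate = [
        ("Hello world", "Bonjour le monde"),
        ("hello  world", "bonjour le monde"),  # repeat after normalization
        ("Good morning", "Bonjour"),
        ("Thank you", "Merci"),
    ]
    existing = [("Thank you", "Merci")]
    repetition, overlap = screen_dataset(candidate, existing)
    print(f"repetition rate: {repetition:.2f}, overlap rate: {overlap:.2f}")
```

Exact matching only catches literal repeats. Near-duplicates and translation quality would still need fuzzy matching and review by language professionals.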