Data Acquisition

While text is abundant on the web, high-quality in-domain datasets and datasets for low-resource languages are hard to come by.

Finding and licensing them requires expertise and grassroots presence.

How it works

Custom.MT obtains new data using the following three techniques:

Purchasing in-domain data

Using a sample of customer data of at least 20 000 segments, our team can scan available public and commercial databases for matching material, or request data from in-country linguists.

Parsing web sources

Using web crawlers and automated alignment tools, we can secure parallel texts from public sources such as disclosure websites, legal notice websites, multilingual news sources and specialized encyclopedias. Parsers that connect to such sources can return large bodies of text.
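As an illustration only, crawled and automatically aligned text usually needs a sanity filter before it is usable as training data. The sketch below applies a common length-ratio heuristic to candidate sentence pairs. The function name filter_pairs and the thresholds max_ratio and min_chars are assumptions made for this example and do not describe Custom.MT's own tooling.

```python
def filter_pairs(pairs, max_ratio=2.0, min_chars=3):
    """Keep (source, target) pairs whose character lengths are plausible.

    max_ratio and min_chars are illustrative defaults, not tuned values.
    """
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Drop empty or near-empty sides, which usually signal a misalignment.
        if len(source) < min_chars or len(target) < min_chars:
            continue
        # Compare the longer side to the shorter side.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio <= max_ratio:
            kept.append((source, target))
    return kept


candidates = [
    ("The court issued a notice.", "Le tribunal a publié un avis."),
    ("Home", "Cliquez ici pour revenir à la page d'accueil du site"),
    ("", "Bonjour"),
]
clean = filter_pairs(candidates)
```

A character-length ratio suits language pairs with similar scripts. Pairs such as English and Chinese would need a different threshold or a token-based measure.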

Data manufacturing

Quality

We verify datasets before purchase to ensure they are new, non-repetitive, and of high quality, using our proprietary scanner and a network of language professionals.
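The proprietary scanner is not described here, so the following is only a minimal sketch of what a "new and non-repetitive" check could look like. It assumes segments are compared after simple normalization (Unicode NFKC, lowercasing, whitespace collapsing). The names normalize and screen_dataset are hypothetical.

```python
import re
import unicodedata


def normalize(segment: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", segment).lower()
    return re.sub(r"\s+", " ", text).strip()


def screen_dataset(candidate, existing):
    """Sort candidate segments into kept, repeated, and already-owned lists.

    candidate: segments offered for purchase
    existing:  segments the customer already holds
    """
    owned = {normalize(s) for s in existing}
    seen = set()
    kept, repeated, already_owned = [], [], []
    for segment in candidate:
        key = normalize(segment)
        if key in owned:
            already_owned.append(segment)   # not new
        elif key in seen:
            repeated.append(segment)        # repeats an earlier candidate
        else:
            seen.add(key)
            kept.append(segment)
    return kept, repeated, already_owned


kept, repeated, already_owned = screen_dataset(
    ["Hello world", "hello  World", "New text", "Old one"],
    ["old one"],
)
```

Exact matching after normalization catches only trivial repeats. Near-duplicate detection and the judgment of translation quality would still rely on further tooling and on human reviewers.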