Data acquisition
While text is abundant on the web, high-quality domain-specific datasets and datasets for low-resource languages are hard to come by.
Finding and licensing them requires expertise and grassroots presence.
Custom.MT obtains new data using the following three techniques:
Using a sample of customer data of at least 20 000 segments, our team can scan available public and commercial databases for matching material, or request data from in-country linguists.
Using web crawlers and automated alignment tools, we can secure parallel texts from public sources such as disclosure websites, legal notice websites, multilingual news sources and specialized encyclopedias. Parsers that connect to these sources can return large volumes of text.
Before purchase, we verify datasets using our proprietary scanner and a network of language professionals to ensure they are new, non-repetitive, and of high quality.
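To make "new" and "non-repetitive" concrete, here is a minimal Python sketch of one way such a pre-purchase check could be approached. It is not Custom.MT's proprietary scanner. The function names `normalize` and `screen_candidate`, the exact-match comparison after normalization, and the report fields are all assumptions made for illustration. A real check would likely add fuzzy matching and review by linguists.

```python
import re
import unicodedata


def normalize(segment: str) -> str:
    """Apply NFKC, lowercase, and collapse whitespace so trivial variants compare equal."""
    text = unicodedata.normalize("NFKC", segment).lower()
    return re.sub(r"\s+", " ", text).strip()


def screen_candidate(candidate_pairs, existing_pairs):
    """Report how repetitive a candidate parallel dataset is and how much is already owned.

    Both arguments are iterables of (source, target) string pairs.
    Hypothetical helper: exact matching after normalization only.
    """
    existing = {(normalize(s), normalize(t)) for s, t in existing_pairs}
    seen = set()
    total = repeated = already_owned = 0
    for src, tgt in candidate_pairs:
        total += 1
        key = (normalize(src), normalize(tgt))
        if key in seen:
            repeated += 1       # duplicate inside the candidate dataset
        elif key in existing:
            already_owned += 1  # first occurrence, but already in existing data
        seen.add(key)
    if total == 0:
        return {"total": 0, "repetition_rate": 0.0, "overlap_rate": 0.0, "new_unique": 0}
    return {
        "total": total,
        "repetition_rate": repeated / total,
        "overlap_rate": already_owned / total,
        "new_unique": total - repeated - already_owned,
    }


if __name__ == "__main__":
    candidate = [
        ("Hello world", "Bonjour le monde"),
        ("hello  world", "bonjour le monde"),  # repeats the first pair after normalization
        ("Good morning", "Bonjour"),
        ("Thank you", "Merci"),                # already present in existing data
    ]
    existing = [("Thank you", "Merci")]
    print(screen_candidate(candidate, existing))
```

A high repetition or overlap rate would suggest that the dataset adds little beyond what is already available, which is the kind of signal worth having before committing to a purchase.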