Brussels / 1 & 2 February 2020

The unsupervised free CAT for low resource languages

Building a pipeline for the communities


We present: 1) a full pipeline for training unsupervised machine translation (MT) from monolingual corpora, aimed at languages with few available resources; 2) a translation server that exposes that unsupervised MT through an API compatible with MateCAT, the EU-funded free Computer Aided Translation (CAT) tool; 3) a Docker-packaged version of MateCAT for ease of deployment. Together, these enable a non-technical user who speaks a non-FIGS language with scarce parallel corpora to start translating documents and software following translation industry standards.

Localization within communities suffers from fragmented technologies (the gap between commercial Computer Aided Translation tools and free ones is too wide), from scarce language resources (which makes it difficult to train a Machine Translation system), and from the lack of clear and robust pipelines to get started. Low resource language communities suffer the most, since MT systems require training corpora of millions of words and the industry has come to expect the massive corpora available for FIGS (French, Italian, German, Spanish) languages. Moreover, the community is slow to adopt established technologies and workflows, which leads to reinventing the wheel and to suboptimal results. Today we would like to present a connector for an implementation of unsupervised MT (by Artetxe et al.) that claims a BLEU score of 26 on limited language resources, which is enough for a support system. The connector integrates it with MateCAT, an industry-level, free, web-based tool funded by the EU, in order to provide a more viable alternative to resorting to Google Translate and commercial LSPs.
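
For readers unfamiliar with the BLEU metric quoted above, here is a minimal sketch of how such a score is computed: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty, on a 0 to 100 scale. It assumes whitespace tokenization, one reference per sentence and no smoothing; the function names are illustrative. It is not the evaluation setup behind the reported figure, and real evaluations normally rely on a standard tool with fixed tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis.

    Assumption: sentences are tokenized by splitting on whitespace.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: an order with zero matches gives BLEU 0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize system output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    system_output = ["the cat sat on the mat"]
    reference = ["the cat sat on a mat"]
    print(round(corpus_bleu(system_output, reference), 2))
```

A score of 100 means the output matches the reference exactly, so a system in the mid-20s produces drafts that still need human post-editing, which is the workflow a CAT tool is built around.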

Speakers

Alberto Massidda

Links