The parallel corpus of the Official Diary of the Catalan Government

  • Antoni Oliver González, Universitat Oberta de Catalunya
Keywords: parallel corpus, neural machine translation

Abstract

This paper presents the compilation of the new version of the Catalan-Spanish parallel corpus of the Official Diary of the Catalan Government (DOGC). It describes the steps of downloading, conversion to text, segmentation and automatic alignment. All the programs developed to perform these steps are distributed under a free license, and the compiled corpus can be freely downloaded. The paper also presents the training and evaluation of two neural machine translation systems, Catalan-Spanish and Spanish-Catalan, using this corpus.

Author Biography

Antoni Oliver González, Universitat Oberta de Catalunya
Associate professor in the Arts and Humanities Studies at the Universitat Oberta de Catalunya (UOC). Director of the postgraduate program in Translation and Technologies.
Published
2023-01-07
How to Cite
Oliver González, A. (2023). The parallel corpus of the Official Diary of the Catalan Government. Linguamática, 14(2), 75-81. https://doi.org/10.21814/lm.14.2.380
Section
Technical Articles