Automatic text readability classification: resources and models for Galician

  • Sandra Rodríguez Rey CITIUS - Universidade de Santiago de Compostela
  • Marcos Garcia Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela https://orcid.org/0000-0002-6557-0210
Keywords: readability corpus, automatic readability assessment, text classification, Galician, fine-tuning, adult learning

Abstract

Automatic readability assessment is a growing field within Natural Language Processing, with significant implications for language teaching and learning and for accessibility. In this context, this paper presents Corlega, the first corpus of Galician texts classified by readability level, consisting of 480 texts aimed at adult readers. The corpus covers 11 categories and 36 subcategories, including a variety of text types, genres and subgenres. The selection, compilation and classification of documents follow the standards of the iRead4Skills project, which develops resources and computational models for Portuguese, Spanish and French. To compile Corlega, this work defines six levels of readability in Galician and proposes a set of linguistic descriptors for each level. Using this taxonomy, we describe the compilation process of the corpus, its current distribution (across four of the six readability levels) and the main features of this new resource. Additionally, we used the corpus to train and evaluate automatic readability classifiers by fine-tuning monolingual and multilingual Transformer models and by implementing hybrid models. The results suggest that, with small training corpora, feature extraction from pre-trained models is an efficient way to achieve results competitive with supervised fine-tuning. However, combining corpora from different languages enables the fine-tuning of multilingual models with better performance. Both the corpus and the models are available to the scientific community.

Published
2025-11-23
How to Cite
Rodríguez Rey, S., & Garcia, M. (2025). Automatic text readability classification: resources and models for Galician. Linguamática, 17(2), preprint. Retrieved from https://linguamatica.com/index.php/linguamatica/article/view/488
Section
Research Articles