Enhancing Automatic Hyphenation in Portuguese for TeX

  • Leonardo Carneiro Araujo, UFSJ
  • Aline Benevides
Keywords: hyphenation, hyphenation patterns, automatic hyphenation in Portuguese

Abstract

Portuguese hyphenation rules for TeX have been in use for over three decades and perform well overall. However, they still produce some incorrect hyphenations and miss some valid hyphenation points. Although these errors mostly occur near word boundaries, where they are irrelevant for typographic purposes in TeX, they can matter in specific contexts, such as when dealing with words outside the standard lexicon or in applications that rely on syllabic/typographic segmentation. Based on an analysis of 49,528 hyphenated words obtained from online dictionaries, we propose 120 new rules to be incorporated into the existing Portuguese hyphenation rules. We also used patgen to create new rules or improve existing ones, but the rules it generated did not generalize well. Ultimately, the manually adjusted rules showed the best performance, yielding a 2.1% increase in the success rate. The number of correct hyphenation points increased from 38,519 to 39,808, while the number of incorrect hyphenation points dropped from 2,059 to 33. Notably, the manually crafted rules also generalized better than the rules automatically generated by patgen.
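
For readers unfamiliar with how TeX hyphenation rules (patterns) operate, the following is a minimal sketch in Python of pattern-based hyphenation in the style of Liang's algorithm, which TeX uses. The function names `parse_pattern` and `hyphenate`, the parameters `left_min` and `right_min`, and the three toy patterns are illustrative assumptions; they are not taken from the actual Portuguese pattern file or from the rules proposed in the article.

```python
import re


def parse_pattern(pattern):
    """Split a TeX-style pattern such as '1ba' into its letters ('ba')
    and one numeric level per inter-letter gap ([1, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)   # digit applies to the gap before the next letter
        else:
            pos += 1
    return letters, levels


def hyphenate(word, patterns, left_min=1, right_min=1):
    """Insert hyphens where the highest matching level is odd.

    left_min/right_min (assumed to be >= 1) are the minimum number of
    letters kept before the first and after the last break.
    """
    padded = "." + word.lower() + "."        # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)         # scores[i]: gap before padded[i]
    for letters, levels in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                scores[start + offset] = max(scores[start + offset], level)
            start = padded.find(letters, start + 1)

    pieces, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        # The gap before word[k] is the gap before padded[k + 1].
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


# Toy patterns only (hypothetical, not from the Portuguese pattern file).
print(hyphenate("banana", ["1ba", "1na"]))          # ba-na-na
print(hyphenate("banana", ["1ba", "1na", "2na."]))  # ba-nana
```

Odd levels permit a break and even levels forbid one, with the highest level at each position winning. In the second call the toy pattern `2na.` overrides `1na` at the end of the word, which illustrates how adding a rule can suppress an incorrect hyphenation point without affecting other positions.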

Published
2024-12-30
How to Cite
Araujo, L. C., & Benevides, A. (2024). Enhancing Automatic Hyphenation in Portuguese for TeX. Linguamática, 16(2), preprint. Retrieved from https://linguamatica.com/index.php/linguamatica/article/view/435
Section
Technical Articles