Automatic Analysis and Classification of Discourse Domains in Brazilian Portuguese

Keywords: automatic text classification, identification of textual properties, automated textual analysis, discursive domains, Brazilian Portuguese, recognition of discursive patterns, computational study of language

Abstract

This paper addresses the sentence-level identification of five discourse domains of Brazilian Portuguese (Juridical, Entertainment, Journalistic, Virtual, and Instructional), using sentences sampled from the Carolina corpus. We evaluate grammatical, lexical, and semantic properties. We demonstrate that the domains are discernible and organized into a consistent scale, which, by comparison with prior work, we associate with the oral-involved vs. literate-informational distinction. We trained Transformer classifiers on a new sentence dataset for domain identification, achieving high performance. The models' error patterns correlate with the identified scale, suggesting that the models captured this dimension of variation. The datasets and models developed in this study are publicly available.
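
As an illustration of what it means for error patterns to correlate with a scale, the following minimal sketch checks whether a classifier confuses two domains less often the farther apart they sit on an ordinal scale. It is not the authors' procedure: the labels `d0` to `d4`, their ordering, the confusion counts, and all function names are hypothetical placeholders, and Spearman's rank correlation is simply one plausible choice of statistic.

```python
from math import sqrt

# Hypothetical domain labels, listed in an assumed scale order.
# This is a placeholder ordering, not the scale reported in the paper.
SCALE = ["d0", "d1", "d2", "d3", "d4"]

# Toy confusion matrix (rows: true domain, columns: predicted domain).
# The counts are invented purely to make the example runnable.
TOY_CONFUSION = [
    [90, 6, 3, 1, 0],
    [5, 88, 5, 2, 0],
    [2, 6, 85, 6, 1],
    [1, 2, 6, 87, 4],
    [0, 1, 2, 5, 92],
]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the tie-aware ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def confusion_vs_distance(confusion):
    """Correlate scale distance |i - j| with the rate of mistaking i for j."""
    distances, rates = [], []
    for i, row in enumerate(confusion):
        row_total = sum(row)
        for j, count in enumerate(row):
            if i != j:  # skip correct predictions on the diagonal
                distances.append(abs(i - j))
                rates.append(count / row_total)
    return spearman(distances, rates)


if __name__ == "__main__":
    rho = confusion_vs_distance(TOY_CONFUSION)
    print(f"Spearman correlation between scale distance and confusion: {rho:.3f}")
```

Under these assumptions, a clearly negative coefficient would indicate that misclassifications concentrate on neighboring domains of the scale, which is the kind of pattern the abstract describes; the toy counts above say nothing about the paper's actual results.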

Published
2026-01-07
How to Cite
Serras, F. R., Carpi, M. de M., Sturzeneker, M. L., Palma, M. F., Costa, A. S., Monte, V. M. do, Namiuti, C., Crespo, M. C. R. M., Paixão de Sousa, M. C., & Finger, M. (2026). Automatic Analysis and Classification of Discourse Domains in Brazilian Portuguese. Linguamática, 17(2), preprint. Retrieved from https://linguamatica.com/index.php/linguamatica/article/view/476
Section
PROPOR 2024 | Invited Articles