Toxic Content Detection in Online Social Networks: A New Dataset from Brazilian Reddit Communities

  • Luiz Henrique Quevedo Lima, Universidade Federal de Minas Gerais (UFMG)
  • Ana Clara Souza Pagano, Universidade Federal de Minas Gerais (UFMG)
  • Adriana Silvina Pagano, Universidade Federal de Minas Gerais (UFMG)
  • Ana Paula Couto da Silva, Universidade Federal de Minas Gerais (UFMG)
Keywords: toxicity, Portuguese, dataset

Abstract

The proliferation of online social interactions in recent years, with the consequent growth in user-generated content, has brought the escalating issue of toxic language. While automatic machine learning models have been effective in moderating the vast amount of data on online social networks, low-resource languages, such as Brazilian Portuguese, still lack efficient automated moderation tools. We address this gap by creating a novel dataset collected from some of the most popular Brazilian Reddit communities. To that end, we manually labeled a sample dataset of 2,500 comments extracted from the most engaging communities. We conducted an in-depth exploratory analysis to gain valuable insights into the language of toxic and non-toxic content. Our results show a high level of agreement among annotators, attesting to the suitability of this dataset for various downstream machine learning tasks. This research offers a significant contribution to the creation of a safer online environment for users engaging in discussions in Portuguese and paves the way for more effective automatic moderation tools using machine learning.

Published
2024-12-31
How to Cite
Lima, L. H. Q., Pagano, A. C. S., Pagano, A. S., & Silva, A. P. C. da (2024). Toxic Content Detection in Online Social Networks: A New Dataset from Brazilian Reddit Communities. Linguamática, 16(2), 201-218. https://doi.org/10.21814/lm.16.2.459
Section
PROPOR 2024 | Invited Articles