Named Entity Recognition and Data Leakage in Legislative Texts

A Literature Reassessment

  • Rafael Oleques Nunes (UFRGS)
  • André Susliz Spritzer
  • Carla Maria Dal Sasso Freitas
  • Dennis Giovani Balreira
Keywords: Data Leakage, Named Entity Recognition, Legislative Texts, Benchmark, Self-learning, Portuguese

Abstract

This work addresses data leakage in training Named Entity Recognition (NER) models on Brazilian Portuguese legislative texts. The leakage results from duplicates and inconsistent annotations, which compromise model evaluation. After correcting this leakage in the UlyssesNER-Br corpus, we ran a new benchmark and compared the results with previous studies in a more reliable setting. We also re-evaluated a semi-supervised approach that uses self-learning and active sampling. However, when we reused a fixed threshold that had been chosen from a range of values before the correction, the results were unsatisfactory. This indicates that a dynamic threshold, one that adapts to the characteristics of the post-correction data, could provide a more efficient and accurate evaluation, and it highlights the need for future studies on threshold selection.
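
As an illustration of the two leakage sources named above, the following minimal Python sketch flags test sentences that also occur in the training split and sentences that carry more than one label sequence. It is not the authors' procedure: the function names, the lowercase normalization, the (tokens, labels) data layout, and the toy sentences and tags are all assumptions made for this example.

```python
from collections import defaultdict


def normalize(tokens):
    """Assumed normalization: lowercase every token and freeze as a tuple."""
    return tuple(token.lower() for token in tokens)


def find_leaked_test_indices(train, test):
    """Return indices of test sentences whose tokens also appear in train."""
    train_keys = {normalize(tokens) for tokens, _ in train}
    return [i for i, (tokens, _) in enumerate(test)
            if normalize(tokens) in train_keys]


def find_inconsistent_annotations(examples):
    """Return sentences that occur with more than one distinct label sequence."""
    labels_by_sentence = defaultdict(set)
    for tokens, labels in examples:
        labels_by_sentence[normalize(tokens)].add(tuple(labels))
    return {key: seqs for key, seqs in labels_by_sentence.items()
            if len(seqs) > 1}


# Hypothetical toy data; the tags are placeholders, not the corpus tag set.
train = [
    (["Lei", "8.666"], ["B-LAW", "I-LAW"]),
    (["Senado", "Federal"], ["B-ORG", "I-ORG"]),
]
test = [
    (["lei", "8.666"], ["B-LAW", "O"]),
    (["Câmara"], ["B-ORG"]),
]

leaked = find_leaked_test_indices(train, test)
inconsistent = find_inconsistent_annotations(train + test)
```

In this toy data, the first test sentence duplicates a training sentence and also disagrees with it on the second label, so it is reported by both checks. A real cleanup would also need a policy for which copy and which annotation to keep, which this sketch does not address.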

Published
2025-01-09
How to Cite
Nunes, R. O., Spritzer, A. S., Freitas, C. M. D. S., & Balreira, D. G. (2025). Named Entity Recognition and Data Leakage in Legislative Texts: A Literature Reassessment. Linguamática, 16(2), preprint. Retrieved from https://linguamatica.com/index.php/linguamatica/article/view/450
Section
PROPOR 2024 | Invited Articles