Named Entity Recognition and Data Leakage in Legislative Texts

A Literature Reassessment

Authors

DOI:

https://doi.org/10.21814/lm.16.2.450

Keywords:

Data Leakage, Named Entity Recognition, Legislative Texts, Benchmark, Self-learning, Portuguese

Abstract

This work addresses data leakage in the training of Named Entity Recognition (NER) models on Brazilian Portuguese legislative texts. The leakage stems from duplicated examples and inconsistent annotations, both of which compromise model evaluation. After correcting it in the UlyssesNER-Br corpus, we ran a new benchmark and compared the results with those of previous studies in a more reliable setting. We also re-evaluated a semi-supervised approach based on self-learning and active sampling. However, reusing a fixed threshold that had been chosen from a range of candidate values before the correction produced unsatisfactory results. This suggests that a dynamic threshold, one that adapts to the characteristics of the corrected data, could provide a more efficient and accurate evaluation, and it highlights the need for future studies on threshold selection.
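
To make the two sources of leakage named in the abstract concrete, the following is a minimal sketch of how they might be detected. It is not the procedure used in the article: the function names, the exact-match criterion on token sequences, and the tag labels (B-LAW, B-ORG) are all assumptions made for illustration, and the sample sentences are invented rather than taken from UlyssesNER-Br.

```python
from collections import defaultdict


def find_leakage(train, test):
    """Return test examples whose token sequence also occurs in train.

    Each example is a (tokens, tags) pair of equal-length lists.
    Exact token match is an assumed criterion, not the article's.
    """
    train_keys = {tuple(tokens) for tokens, _ in train}
    return [(tokens, tags) for tokens, tags in test if tuple(tokens) in train_keys]


def find_inconsistent(examples):
    """Return token sequences that carry more than one distinct tag sequence."""
    tags_by_sentence = defaultdict(set)
    for tokens, tags in examples:
        tags_by_sentence[tuple(tokens)].add(tuple(tags))
    return {
        key: variants
        for key, variants in tags_by_sentence.items()
        if len(variants) > 1
    }


# Invented toy data with hypothetical tag names.
train = [
    (["Lei", "nº", "8.112"], ["B-LAW", "I-LAW", "I-LAW"]),
    (["Câmara", "dos", "Deputados"], ["B-ORG", "I-ORG", "I-ORG"]),
]
test = [
    (["Lei", "nº", "8.112"], ["B-LAW", "O", "O"]),
    (["Senado", "Federal"], ["B-ORG", "I-ORG"]),
]

leaked = find_leakage(train, test)          # duplicates shared by both splits
conflicts = find_inconsistent(train + test)  # same sentence, different tags
```

In this toy data the sentence "Lei nº 8.112" appears in both splits and with two different annotations, so it is reported by both checks. A real corpus would probably also need near-duplicate matching (for example after case or whitespace normalization), which this sketch does not attempt.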

Published

2024-12-31

Issue

Section

PROPOR 2024 | Invited Articles

How to Cite

Named Entity Recognition and Data Leakage in Legislative Texts: A Literature Reassessment. (2024). Linguamática, 16(2), 141-166. https://doi.org/10.21814/lm.16.2.450