Estratégias de Seleção Informada de Dados para Aprendizado com Dados Escassos e Desbalanceados

Alexandre Alcoforado; Lucas Okamura; Thomas Ferraz; Israel Campos Fama; Bárbara Dias Bueno; Bruno Miguel Veloso; Anna Helena Reali Costa

doi:10.21814/lm.17.1.451

Informed Data Selection Strategies for Few-Shot Learning on Imbalanced Data

Alexandre Alcoforado Universidade de São Paulo https://orcid.org/0000-0003-3184-1534
Lucas Okamura Universidade de São Paulo https://orcid.org/0000-0002-7198-6140
Thomas Ferraz Télécom Paris, Institut Polytechnique de Paris https://orcid.org/0000-0002-5385-9164
Israel Campos Fama Universidade de São Paulo https://orcid.org/0000-0001-6325-4153
Bárbara Dias Bueno Universidade de São Paulo https://orcid.org/0009-0004-7455-3342
Bruno Miguel Veloso Universidade do Porto, INESC-TEC https://orcid.org/0000-0001-7980-0972
Anna Helena Reali Costa Universidade de São Paulo https://orcid.org/0000-0001-7309-4528

Keywords: imbalanced data, NLP, transformers, few-shot learning, reverse semantic search

Abstract

Acquiring high-quality annotated data remains one of the most significant challenges in Natural Language Processing (NLP), especially for supervised learning approaches. In scenarios where pre-existing labeled data is unavailable, common solutions like crowdsourcing and zero-shot approaches often fall short, suffering from limitations such as the need for large datasets and a lack of guarantees regarding annotation quality. Traditionally, data for human annotation has been selected randomly, a practice that is not only costly and inefficient but also prone to bias, particularly in imbalanced datasets where minority classes are underrepresented. To address these challenges, this work introduces an automatic and informed data selection architecture designed to minimize the volume of required annotations while maximizing the diversity and representativeness of the selected data. Among the evaluated methods, Reverse Semantic Search (RSS) demonstrated superior performance, consistently outperforming random sampling in imbalanced scenarios and enhancing the effectiveness of trained classifiers. Furthermore, we compared RSS with other clustering-based approaches, providing insights into their respective strengths and weaknesses.
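The abstract does not spell out how RSS operates, so the following is only a minimal sketch of one diversity-driven alternative to random sampling, not the authors' implementation. It assumes each document has already been encoded as a dense embedding vector, and it uses greedy farthest-point selection under cosine similarity: seed with the two least similar documents, then repeatedly add the document that is least similar to everything chosen so far. The function name `select_diverse` and the toy vectors are hypothetical.

```python
import numpy as np


def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Pick k row indices that are mutually dissimilar (illustrative only)."""
    n = embeddings.shape[0]
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of documents")

    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Seed with the least similar pair, ignoring self-similarity.
    masked = sim.copy()
    np.fill_diagonal(masked, np.inf)
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    selected = [int(i), int(j)]

    # For every document, track its highest similarity to the selected set.
    closest = np.maximum(sim[i], sim[j])
    while len(selected) < k:
        closest[selected] = np.inf  # never re-pick a selected document
        nxt = int(np.argmin(closest))
        selected.append(nxt)
        closest = np.maximum(closest, sim[nxt])
    return selected


# Toy pool: four near-duplicate "majority" vectors and one "minority" vector.
pool = np.array([
    [1.00, 0.00],
    [0.99, 0.05],
    [0.98, 0.10],
    [0.97, 0.02],
    [0.00, 1.00],  # the lone minority-class document
])
picked = select_diverse(pool, k=3)
print(picked)
```

In this toy pool the single minority vector is chosen in the seed pair, whereas a uniform random draw of the same size could easily miss it. Whether that behaviour carries over to real imbalanced corpora depends on the embedding model and the data, which is what the paper's experiments evaluate.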

Published

2025-06-30

How to Cite

Alcoforado, A., Okamura, L., Ferraz, T., Campos Fama, I., Dias Bueno, B., Veloso, B. M., & Reali Costa, A. H. (2025). Informed Data Selection Strategies for Few-Shot Learning on Imbalanced Data. Linguamática, 17(1), 105-120. https://doi.org/10.21814/lm.17.1.451

Issue

Vol. 17 No. 1

Section

PROPOR 2024 | Invited Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access).