Informed Data Selection Strategies for Few-Shot Learning on Imbalanced Data

  • Alexandre Alcoforado (Universidade de São Paulo)
  • Lucas Okamura
  • Thomas Ferraz
  • Israel Campos Fama
  • Bárbara Dias Bueno
  • Bruno Miguel Veloso
  • Anna Helena Reali Costa
Keywords: imbalanced data, NLP, transformers, few-shot learning, reverse semantic search

Abstract

Acquiring high-quality annotated data remains one of the most significant challenges in Natural Language Processing (NLP), especially for supervised learning approaches. When no pre-existing labeled data is available, common solutions such as crowdsourcing and zero-shot approaches often fall short: they suffer from limitations such as the need for large datasets and a lack of guarantees regarding annotation quality. Traditionally, data for human annotation has been selected randomly, a practice that is not only costly and inefficient but also prone to bias, particularly in imbalanced datasets where minority classes are underrepresented. To address these challenges, this work introduces an automatic and informed data selection architecture designed to minimize the volume of required annotations while maximizing the diversity and representativeness of the selected data. Among the evaluated methods, Reverse Semantic Search (RSS) performed best, consistently outperforming random sampling in imbalanced scenarios and improving the effectiveness of the trained classifiers. We also compared RSS with other clustering-based approaches, providing insights into their respective strengths and weaknesses.
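
The point about random selection and underrepresented minority classes can be made concrete with a small calculation. The sketch below is only an illustration and is not part of the proposed architecture or of RSS: it assumes examples are drawn independently and uniformly at random, and the function names, the 2% minority share, and the budget of 20 annotations are hypothetical values chosen for the example.

```python
def prob_class_missed(minority_fraction: float, n_samples: int) -> float:
    """Probability that n_samples independent uniform draws contain
    no example of a class that makes up minority_fraction of the data.

    Assumption (for illustration only): draws are independent, which is
    a close approximation when the pool is much larger than the sample.
    """
    if not 0.0 <= minority_fraction <= 1.0:
        raise ValueError("minority_fraction must be in [0, 1]")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return (1.0 - minority_fraction) ** n_samples


def expected_minority_count(minority_fraction: float, n_samples: int) -> float:
    """Expected number of minority-class examples in the random sample."""
    return minority_fraction * n_samples


if __name__ == "__main__":
    # Hypothetical few-shot setting: 2% minority class, 20 annotations.
    p_missed = prob_class_missed(0.02, 20)
    expected = expected_minority_count(0.02, 20)
    print(f"P(no minority example selected) = {p_missed:.3f}")
    print(f"Expected minority examples      = {expected:.1f}")
```

Under these assumed numbers, a random annotation batch misses the minority class entirely about two times out of three, which is the kind of failure an informed selection strategy is meant to avoid.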

Published
2025-06-26
How to Cite
Alcoforado, A., Okamura, L., Ferraz, T., Campos Fama, I., Dias Bueno, B., Veloso, B. M., & Reali Costa, A. H. (2025). Informed Data Selection Strategies for Few-Shot Learning on Imbalanced Data. Linguamática, 17(1), preprint. Retrieved from https://linguamatica.com/index.php/linguamatica/article/view/451
Section
PROPOR 2024 | Invited Articles