Desenvolvimento e avaliação de um modelo NER no domínio da análise cultural e do turismo

Susana Sotelo Docío; Pablo Gamallo; Álvaro Iriarte

doi:10.21814/lm.15.2.405

Development and evaluation of a NER model in the domain of cultural analysis and tourism

Susana Sotelo Docío, Universidade de Santiago de Compostela, https://orcid.org/0000-0002-0067-7957
Pablo Gamallo, Universidade de Santiago de Compostela, https://orcid.org/0000-0002-2429-0984
Álvaro Iriarte, Universidade do Minho, https://orcid.org/0000-0003-0077-8843


Keywords: Named-Entity Recognition, Machine Learning, Neural Networks, Transformers, evaluation

Abstract

Named Entity Recognition (NER) is an essential information extraction task in which the entities in a text are identified and classified. One of the main challenges facing NER systems is generalizing what they have learned to corpora that differ from the training data. The problem is compounded by the fact that most training corpora are journalistic, so systems trained on them need to be adapted to other genres and domains. In this paper, we use a Spanish corpus of interviews with visitors to the city of Santiago de Compostela, annotated with named entities, to train and evaluate NER systems tailored to the domain of cultural analysis and tourism. We provide a comprehensive comparison of the approaches employed, ranging from classical machine learning algorithms to fine-tuned Transformer models. The results significantly outperform the baseline, represented here by the toolkits Stanza, spaCy and Flair. However, initial tests with entities unseen during training highlight the need for further evaluation of the models' generalization capability and for the use of adversarial splits of the corpus.


Published

2023-12-30

How to Cite

Sotelo Docío, S., Gamallo, P., & Iriarte, Á. (2023). Development and evaluation of a NER model in the domain of cultural analysis and tourism. Linguamática, 15(2), 3-18. https://doi.org/10.21814/lm.15.2.405


Issue

Vol. 15 No. 2

Section

Research Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.
