A SMS-like language analyzer for Spanish

Authors

  • Andrés Alfonso Caurcel Díaz Universidad Politécnica de Madrid
  • Jose Maria Gomez Hidalgo Departamento de I+D, Optenet S.A.
  • Yovan Iñiguez del Rio Universidad Politécnica de Madrid

Keywords:

SMS language, chat language, tokenizer, automated translation, Natural Language Processing, Age detection

Abstract

The use of specific language codes and of chat and SMS-like messages is a major trend in electronic communications. This makes Natural Language Processing quite hard, even at the simplest step of text message tokenization, due to the widespread use of non-alphanumeric symbols, frequent typos, and non-standard word separators.

In this work we present a new approach to text message tokenization, specific to the Spanish language as used in Social Networks and in electronic communications. Our system has been integrated into a more general application for age detection in Social Networks, developed in the research and development project WENDY. It has been quantitatively evaluated both directly and indirectly, through its impact on the general age-detection application, showing very promising results.

Published

2013-07-20

Section

Research Articles