Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information

  • Fernando Balbachan Facultad de Filosofía y Letras - Universidad de Buenos Aires (UBA)
  • Diego Dell'Era Facultad de Filosofía y Letras - Universidad de Buenos Aires (UBA)
Keywords: computational linguistics, statistical parsing, syntax constituency, distributional information

Abstract

The Argument from the Poverty of the Stimulus (APS) is the main arena of epistemological debate between the symbolic and statistical paradigms in computational linguistics. Since 2000, several works within the statistical paradigm have challenged the APS by presenting unsupervised, general-purpose algorithms for language acquisition. Among the most important contributions, Clark’s Ph.D. thesis (2001) draws on diverse statistical techniques to build an unsupervised, general-purpose algorithm for inducing language and, more precisely, a complete Context-Free Grammar (CFG) for English.

Clark (2001) uses a different induction technique for each linguistic phenomenon modeled: morphology with Hidden Markov Models (HMMs), POS tagging with clustering, and so on. In this paper we focus on the induction of syntactic constituency from a POS-tagged corpus, as a preliminary step towards inducing a complete CFG. In his thesis, Clark acknowledges that more crosslinguistic evidence is needed to support the psycholinguistic plausibility of his approach. Currently, no work has tested Clark’s approach on highly inflected languages with free constituent order, such as Spanish. Our work is therefore intended to contribute such crosslinguistic evidence by analyzing the feasibility of applying Clark’s algorithm to induce constituency in Spanish.

Clark’s (2001) approach applies K-means clustering to group sequences of morpho-syntactic tags according to their distributional information. The resulting clusters are then filtered with a criterion based on the mutual information between the symbols that occur immediately before and after each sequence. This criterion guards against the bias typical of sparse corpora and, in turn, succeeds in identifying co-occurrences of adjacent symbols that exceed the default entropy threshold for short distances (Li 1990).
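
The filtering criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sentence-boundary markers `<s>` and `</s>`, and the toy tag sequences are assumptions made only for illustration, and the K-means step and the actual threshold are not shown. The sketch collects the tags that occur immediately before and after each occurrence of a tag sequence and computes the mutual information (in bits) between those two positions.

```python
import math
from collections import Counter

def context_pairs(corpus, seq):
    """Collect (tag before, tag after) for every occurrence of seq.

    corpus is a list of sentences, each a list of POS tags.
    "<s>" and "</s>" are hypothetical sentence-boundary markers.
    """
    n = len(seq)
    pairs = []
    for sent in corpus:
        padded = ["<s>"] + list(sent) + ["</s>"]
        # i is the start of a candidate occurrence; i + n must be a valid index
        for i in range(1, len(padded) - n):
            if padded[i:i + n] == list(seq):
                pairs.append((padded[i - 1], padded[i + n]))
    return pairs

def mutual_information(pairs):
    """Mutual information, in bits, between left and right context symbols."""
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    mi = 0.0
    for (l, r), count in joint.items():
        p_lr = count / total
        # p(l, r) / (p(l) * p(r)) simplifies to count * total / (left * right)
        mi += p_lr * math.log2(count * total / (left[l] * right[r]))
    return mi

# Toy corpus of POS-tag sequences (illustrative only, not the paper's data)
corpus = [["DET", "NOUN", "VERB", "DET", "NOUN"]]
pairs = context_pairs(corpus, ["DET", "NOUN"])
print(pairs)
print(mutual_information(pairs))
```

In a full pipeline along the lines the abstract describes, a score of this kind would be computed per cluster and compared against a threshold; the value of that threshold is not given here and would have to be taken from the paper itself.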

Our implementation has been tested on a prototypical corpus, with encouraging results: recall = 74%, precision = 58% and F-measure = 65% at this prototypical stage. These results encourage us to continue our long-term research, whose goal is to develop an algorithm for the complete acquisition of Spanish.
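
The reported figures are consistent with one another if the F-measure is taken to be the balanced harmonic mean of precision and recall. That is an assumption, since the abstract does not state which variant was used. A short check, with a hypothetical helper `f_measure`:

```python
def f_measure(precision, recall, beta=1.0):
    """General F-beta score; beta=1.0 gives the balanced harmonic mean."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values reported in the abstract: precision = 58%, recall = 74%
print(round(f_measure(0.58, 0.74), 2))  # 0.65
```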

Author Biographies

Fernando Balbachan, Facultad de Filosofía y Letras - Universidad de Buenos Aires (UBA)
Lecturer in the chair of Modelos Formales No Transformacionales (UBA); specialization in Linguistics.
Diego Dell'Era, Facultad de Filosofía y Letras - Universidad de Buenos Aires (UBA)
Assistant in the chair of Modelos Formales No Transformacionales (UBA); specialization in Linguistics.
Published
2010-06-09
How to Cite
Balbachan, F., & Dell’Era, D. (2010). Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information. Linguamática, 2(2), 39-57. Retrieved from https://linguamatica.com/index.php/linguamatica/article/view/60
Section
Research Articles