Induction of syntax constituents in Spanish through clustering and filtering out criterion based on mutual information
Abstract
The Argument from the Poverty of the Stimulus (APS) is the main epistemological battleground between the symbolic and statistical paradigms in computational linguistics. Since 2000, several works within the statistical paradigm have challenged the APS by presenting unsupervised, general-purpose algorithms for language acquisition. Among the most important contributions, Clark’s Ph.D. thesis (2001) draws on diverse statistical techniques to build an unsupervised, general-purpose algorithm for inducing language and, more precisely, a complete Context-Free Grammar (CFG) for English.
Clark (2001) uses a different induction technique for each linguistic phenomenon modeled: morphology from Hidden Markov Models (HMMs), POS tagging from clustering, etc. In this paper we are interested specifically in the induction of syntactic constituency from a POS-tagged corpus, as a preliminary step towards inducing a complete CFG. In his thesis, Clark acknowledges that more crosslinguistic evidence is needed to support the psycholinguistic plausibility of an approach such as his. To date, no work has tested Clark’s approach on highly inflected languages with free constituent order, such as Spanish. Our work is therefore intended to contribute such crosslinguistic evidence by analyzing the feasibility of applying Clark’s algorithm to induce constituency in Spanish.
Clark’s (2001) approach applies K-means clustering to group sequences of morpho-syntactic labels according to their distributional information. The resulting clusters are then filtered with a criterion based on the mutual information between the symbols that occur immediately before and after the sequences. This criterion guards against the typical bias of sparse corpora and, in turn, succeeds in distinguishing co-occurrences of adjacent symbols that exceed the default entropy threshold for short distances (Li 1990).
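The filtering criterion can be illustrated with a minimal Python sketch. It is not the implementation evaluated here: the function names, the `#` boundary symbol, the toy tag names and the `threshold` parameter are all hypothetical, and the K-means clustering step is omitted. The sketch only shows how the mutual information between the tag immediately before and the tag immediately after a candidate sequence might be estimated from raw counts.

```python
import math
from collections import Counter


def context_pairs(corpus, seq, boundary="#"):
    """Collect (tag before, tag after) for every occurrence of seq.

    corpus: list of sentences, each a list of POS tags.
    boundary: hypothetical padding symbol marking sentence edges.
    """
    seq = tuple(seq)
    n = len(seq)
    pairs = []
    for sent in corpus:
        padded = [boundary] + list(sent) + [boundary]
        # i is the start of a candidate occurrence; i + n is the tag after it.
        for i in range(1, len(padded) - n):
            if tuple(padded[i:i + n]) == seq:
                pairs.append((padded[i - 1], padded[i + n]))
    return pairs


def mutual_information(pairs):
    """Estimate I(before; after) in bits from observed context pairs."""
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(before for before, _ in pairs)
    right = Counter(after for _, after in pairs)
    mi = 0.0
    for (before, after), count in joint.items():
        p_joint = count / total
        p_indep = (left[before] / total) * (right[after] / total)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def passes_filter(pairs, threshold):
    """Assumed decision rule: keep a candidate whose MI exceeds threshold."""
    return mutual_information(pairs) > threshold


# Toy corpus with invented tags, for illustration only.
corpus = [["DET", "NOUN", "VERB"], ["PREP", "DET", "NOUN"]]
pairs = context_pairs(corpus, ["DET", "NOUN"])
```

With maximum-likelihood estimates like these, mutual information tends to be overestimated when a sequence has few occurrences, so a real threshold would need to account for sample size; the sketch leaves that choice open.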
We tested our implementation on a prototypical corpus and measured recall = 74%, precision = 58% and F-measure = 65% at this prototypical stage. These results encourage us to continue our long-term research, whose goal is to develop an algorithm for the complete acquisition of Spanish.
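As a consistency check on the reported figures, and assuming the F-measure is the balanced harmonic mean of precision and recall (the abstract does not state the weighting), the three values agree:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.58 \times 0.74}{0.58 + 0.74}
  = \frac{0.8584}{1.32}
  \approx 0.65
```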