Hierarchical Multi-Label Dialog Act Recognition on Spanish Data

Dialog acts reveal the intention behind the uttered words. Thus, their automatic recognition is important for a dialog system trying to understand its conversational partner. The study presented in this article approaches that task on the DIHANA corpus, whose three-level dialog act annotation scheme poses problems which have not been explored in recent studies. In addition to the hierarchical problem, the two lower levels pose multi-label classification problems. Furthermore, each level in the hierarchy refers to a different aspect concerning the intention of the speaker both in terms of the structure of the dialog and the task. Also, since its dialogs are in Spanish, it allows us to assess whether the state-of-the-art approaches on English data generalize to a different language. More specifically, we compare the performance of different segment representation approaches focusing on both sequences and patterns of words and assess the importance of the dialog history and the relations between the multiple levels of the hierarchy. Concerning the single-label classification problem posed by the top level, we show that the conclusions drawn on English data also hold on Spanish data. Furthermore, we show that the approaches can be adapted to multi-label scenarios. Finally, by hierarchically combining the best classifiers for each level, we achieve the best results reported for this corpus.


Introduction
It is valuable for a dialog system to identify the intention behind its conversational partners' words since it provides an important cue concerning the information contained in a segment and how it should be interpreted. According to Searle (1969), that intention is revealed by dialog acts, which are the minimal units of linguistic communication. Consequently, automatic dialog act recognition is an important task in the context of Natural Language Understanding (NLU), which has been widely explored over the years on multiple corpora with different characteristics. Still, recently, most studies have focused on English data and, more specifically, on the Switchboard Dialog Act Corpus (SwDA) (Jurafsky et al., 1997), since it is the largest annotated corpus and its label set is independent from both task and domain. However, there are corpora and annotation schemes that pose problems in the context of dialog act recognition which are not covered by the SwDA corpus and its SWBD-DAMSL annotations. With this in mind, in this article we explore the DIHANA corpus (Benedí et al., 2006), which features interactions in Spanish between humans and a Wizard of Oz (WoZ) dialog system. In the context of dialog act recognition, the differentiating aspect of this corpus is its three-level annotation scheme, in which the top level refers to the generic task-independent dialog act and the others complement it with task-specific information. Additionally, while each segment has a single top-level label, it may have none or multiple labels on the other levels. Thus, the DIHANA corpus allows us to approach dialog act recognition as both a hierarchical and a multi-label classification problem.
Similarly to other text classification tasks, such as news categorization and sentiment analysis (Kim, 2014; Conneau et al., 2017), most of the recent approaches on dialog act recognition take advantage of Deep Neural Networks (DNNs). We provide an overview of these approaches in Section 2.2. Overall, they use a Recurrent Neural Network (RNN)- or Convolutional Neural Network (CNN)-based approach to generate a representation of the segment from the embedding representation of its words and then use the information present in that representation to obtain the classification of the segment. The distinction between RNN- and CNN-based approaches is relevant since they are able to capture different information. The former focus on identifying relevant word sequences, including long-range dependencies, while the latter focus on identifying relevant word patterns by inspecting limited context windows surrounding each word. Additionally, the top performing approaches on dialog act recognition do not consider each segment on its own, but rather in combination with context information from the surrounding segments and concerning the speakers.
Considering the characteristics of the DIHANA corpus and the state-of-the-art approaches on single-label dialog act recognition, in this article we explore different aspects. First, we assess whether those approaches perform similarly on a language other than English, by using them to predict the task-independent labels of the top level. Then, we explore their applicability in the multi-label classification scenarios of the other levels. Furthermore, since those levels refer to different task-specific aspects, we also assess how context information from the preceding segments influences the ability to predict each of those aspects. Similarly, we assess how that ability is influenced by information from the upper levels in the hierarchy. Finally, we explore the hierarchical combination of the best approaches for each level and compare its performance with that of the flat approach that was used in previous studies on the corpus.
In the remainder of the article we start by providing an overview of related work in Section 2. First, we provide an overview of existing corpora for dialog act recognition in Section 2.1. Then, we discuss the state-of-the-art approaches on dialog act recognition in Section 2.2. Additionally, previous studies on dialog act recognition on Spanish data are summarized in Section 2.3. Then, in Section 3, we describe our experimental setup. We start by describing the DIHANA corpus and its dialog act annotations in Section 3.1. Section 3.2 presents the generic network architecture used in our experiments and describes what changes between experiments. Finally, Section 3.3 introduces the training and evaluation procedures according to the level of the hierarchy in focus. The results achieved by our experiments on each of those levels, as well as their combination, are presented and discussed in Section 4. Finally, Section 5 states the most important conclusions that can be drawn from the experiments described in this article and provides pointers for future work.

Related Work
As previously stated, automatic dialog act recognition is a task that has been widely explored over the years on multiple corpora with different characteristics and using a variety of classical machine learning approaches, from Hidden Markov Models (HMMs) (Stolcke et al., 2000) to Support Vector Machines (SVMs) (Gambäck et al., 2011). The article by Král & Cerisara (2010) provides an overview of most of those approaches on the task. However, recently, most approaches take advantage of DNN architectures. Below, we present an overview of such approaches. Additionally, since our study focuses on the DIHANA corpus (Benedí et al., 2006), we also present previous approaches on dialog act recognition on Spanish data. However, first, we provide an overview on existing corpora for dialog act recognition.

Corpora for Dialog Act Recognition
Multiple corpora have been annotated in terms of dialog acts. Table 1 presents a non-exhaustive set of those corpora and their characteristics. We can see that multiple domains, languages, and kinds of interaction are covered, which enables the assessment of the generalization capabilities of dialog act recognition approaches in multiple scenarios. However, on the other hand, the used tag sets are not standardized among corpora. In fact, there are even different tag sets for the same corpus. This means that the sets were developed with different objectives and have different hierarchies and levels of abstraction, which makes cross-corpora and generalization experiments hard to perform. This is particularly problematic when the used tag sets are domain-dependent, since they cannot be applied to corpora in other domains.
Concerning alternative tag sets for the same corpus, while those of SwDA, the ICSI Meeting Recorder Dialog Act Corpus (MRDA), and CallHome Spanish (CHS) are just compressed versions of the original sets, the two tag sets used to annotate VERBMOBIL are disjoint. Furthermore, the first one includes domain-dependent labels (Jekat et al., 1995), while the second is completely domain-independent (Alexandersson et al., 1998).
Multiple corpora have complementary tag sets which refer to different aspects. For instance, MRDA, DIHANA, and NESPOLE have a set of generic labels which can be specialized using labels from different sets. However, while in the first case the specialized labels are still domain-independent, in the remaining two the generic labels are complemented with domain-specific information at different levels. On the DIME corpus, the two tag sets refer to different aspects of the dialog, namely, obligations and grounding. Finally, the LEGO corpus has independent tag sets for user and system segments.
In an attempt to standardize dialog act annotation and, consequently, set the ground for more comparable research in the area, Bunt et al. (2012) defined the ISO 24617-2 standard. According to it, dialog act annotations should be performed on functional segments rather than on turns or utterances (Carroll & Tanenhaus, 1978). Furthermore, the annotation of each segment does not consist of a single label, but rather of a complex structure containing information about the participants, relations with other functional segments, the semantic dimension of the dialog act, its communicative function, and optional qualifiers concerning certainty, conditionality, partiality, and sentiment. However, annotating all of these aspects is an exhaustive process and, consequently, the amount of data annotated according to the standard is still reduced and, in many cases, not all of the aspects are considered (Petukhova et al., 2014; Bunt et al., 2016; Ribeiro et al., 2016).
As previously stated, most recent studies on automatic dialog act recognition take advantage of different DNN architectures. Such approaches require large amounts of data to train. Consequently, the automatic prediction of dialog acts as defined by the standard has only been approached in a few studies (Ribeiro et al., 2015; Mezza et al., 2018). On the other hand, SwDA is the most explored corpus for the task, since it is the one with the highest number of annotated segments, it features open-domain dialogs, and its tag set is domain-independent. Thus, the conclusions drawn from experiments on it are expected to generalize well to other scenarios.

State-of-the-Art on Dialog Act Recognition
The approaches that achieve highest performance on the dialog act recognition task are based on DNNs. Thus, in this section, we focus on studies that use such approaches. To our knowledge, the first of those studies was that by Kalchbrenner & Blunsom (2013). It uses a CNN to generate segment representations from randomly initialized word embeddings. Then, it uses an RNN-based discourse model that combines the sequence of segment representations with speaker information and outputs the corresponding sequence of dialog acts. By limiting the discourse model to consider information from the two preceding segments only, this approach achieved 73.9% accuracy on the SwDA corpus.
Lee & Dernoncourt (2016) compared the performance of a Long Short-Term Memory (LSTM) unit against that of a CNN to generate segment representations from pre-trained embeddings of its words. In order to generate the corresponding dialog act classifications, the segment representations were then fed to a 2-layer feed-forward network, in which the first layer normalizes the representations and the second selects the class with highest probability. In their experiments, the CNN-based approach consistently led to similar or better results than the LSTM-based one. The architecture was also used to provide context information from up to two preceding segments at two levels. The first level refers to the concatenation of the representations of the preceding segments with that of the current segment before providing it to the feed-forward network. The second refers to the concatenation of the normalized representations before providing them to the output layer. This approach achieved 65.8% accuracy on the Dialog State Tracking Challenge 4 (DSTC4) corpus, 84.6% on MRDA with 5 classes (Ang et al., 2005), and 71.4% on SwDA. However, the influence of context information varied across corpora.
Ji et al. (2016) [...] 2010) to model the sequence of words in the dialog with a latent variable model over shallow discourse structure to model the relations between adjacent segments which, in this context, represent the dialog acts. This way, the model can perform word prediction using discriminatively-trained vector representations while maintaining a probabilistic representation of a targeted linguistic element, such as the dialog act. In order to function as a dialog act classifier, the model was trained to maximize the conditional probability of a sequence of dialog acts given a sequence of segments, achieving 77.0% accuracy on SwDA.
Tran et al. (2017b) used a hierarchical RNN with an attentional mechanism to predict the dialog act classifications of a whole dialog. The model is hierarchical, since it includes an utterance-level RNN to generate the representation of the utterance from its tokens and another to generate the sequence of dialog act labels from the sequence of utterance representations. The attentional mechanism sits between the two, since it uses information from the dialog-level RNN to identify the most important tokens in the current utterance and filter its representation. Using this approach they achieved 74.5% accuracy on SwDA and 63.3% on the HCRC Map Task Corpus (MapTask). Later, they were able to improve the performance on SwDA to 75.6% by propagating uncertainty information concerning the previous predictions (Tran et al., 2017c). Additionally, they experimented with gated attention in the context of a generative model, achieving 74.2% on SwDA and 65.94% on MapTask (Tran et al., 2017a).

Corpus                              Interaction  Domain          Language  Segments  Tags          DD
MRDA (Shriberg et al., 2004)        Human        Meetings        English   106k      5 / 11 + 39   N
AMI (Carletta et al., 2005)         Human        Meetings        English   102k      15            N
VERBMOBIL (Kay et al., 1992)        Human        Schedules       Multiple  59k       42 / 33       M
CHS (Levin et al., 1998)            Human        Open            Spanish   45k       10 / 37       N
DSTC4 (Kim et al., 2016)            Human        Travel          English   31k       89            Y
MapTask (Anderson et al., 1991)     Human        Routes          English   27k       12            N
DIHANA (Benedí et al., 2006)        WoZ          Trains          Spanish   23k       11 + 10 + 13  M
LEGO (Schmitt et al., 2012)         Machine      Buses           English   14k       22 + 28       Y
NESPOLE (Costantini et al., 2002)   Human        Travel          Multiple  8k        67 + 91       M
DIME (Villaseñor et al., 2001)      WoZ          Kitchen Design  Spanish   5k        15 + 15       M
Table 1: Corpora annotated with dialog act information, ordered by number of segments. The values for the number of segments are rounded. The interaction column states whether the dialogs are between humans or if there is a dialog system involved. In the latter case it distinguishes between WoZ scenarios and interactions with a real machine. In the tags column, the / and - symbols refer to alternative tag sets, while the + symbol refers to different levels of annotation. The last column, DD, states whether the tag set is domain-dependent (Y), domain-independent (N), or mixed (M).
The previous studies explored the use of a single recurrent or convolutional layer to generate the segment representation from those of its words. However, the approaches with highest performance on the task use several of those layers. On the one hand, Khanpour et al. (2016) achieved their best results using a segment representation generated by concatenating the outputs of a stack of 10 LSTM units at the last time step. This way, the model is able to capture long distance relations between tokens. On the other hand, Liu et al. (2017) generated the segment representation by combining the outputs of three parallel CNNs with different context window sizes, in order to capture different functional patterns. In both cases, pre-trained word embeddings were used as input to the network. Overall, from the reported results, it is not possible to state which is the top performing segment representation approach since the evaluation was performed on different subsets of SwDA. Still, Khanpour et al. (2016) reported 73.9% accuracy on the validation set and 80.1% on the test set, while Liu et al. (2017) reported 74.5% and 76.9% accuracy on the two sets used to evaluate their experiments. Additionally, Khanpour et al. (2016) reported 86.8% accuracy on MRDA.
Additionally, Liu et al. (2017) explored the use of context information concerning speaker changes and from the surrounding segments. The first was provided as a flag and concatenated to the segment representation. Concerning the latter, they explored the use of discourse models, as well as of approaches that concatenated the context information directly to the segment representation. The discourse models transform the model into a hierarchical one by generating a sequence of dialog act classifications from the sequence of segment representations. Thus, when predicting the classification of a segment, the surrounding ones are also taken into account. However, when the discourse model is based on a CNN or a bidirectional LSTM unit, it considers information from future segments, which is not available to a dialog system. Still, even when relying on future information, the approaches based on discourse models performed worse than those that concatenated the context information directly to the segment representation. In this sense, providing that information in the form of the classification of the surrounding segments led to better results than using their words, even when those classifications were obtained automatically. This conclusion is in line with what we had shown in our previous study using SVMs (Ribeiro et al., 2015). Furthermore, both studies have shown that, as expected, the first preceding segment is the most important and that the influence decays with the distance. Using the setup with gold standard labels from three preceding segments, the results on the two sets used to evaluate the approach improved to 79.6% and 81.8%, respectively.
It is important to make some remarks concerning tokenization and token representation. In all the previously described studies, tokenization was performed at the word level. Furthermore, with the exception of the first study (Kalchbrenner & Blunsom, 2013), which used randomly initialized embeddings, and those by Tran et al. (2017a,b,c), for which the embedding approach was not disclosed, the representation of those words was given by pre-trained embeddings. Khanpour et al. (2016) compared the performance when using Word2Vec (Mikolov et al., 2013) and Global Vectors for Word Representation (GloVe) (Pennington et al., 2014) embeddings trained on multiple corpora. Although both embedding approaches capture information concerning words that commonly appear together, the best results were achieved using Word2Vec embeddings. In terms of dimensionality, Khanpour et al. (2016) achieved the best results when using 150-dimensional embeddings. However, other studies (Lee & Dernoncourt, 2016; Liu et al., 2017) used 200-dimensional embeddings, a value that was not among those compared.
The approaches described in all of the previous studies perform tokenization at the word level. However, we have shown that there are also important cues for intention at a sub-word level which can only be captured when using a finer-grained tokenization, such as at the character level (Ribeiro et al., 2018). The cues at that level mostly refer to aspects concerning the morphology of words, such as lemmas and affixes. To capture that information, we adapted the CNN-based segment representation approach by Liu et al. (2017) to use characters instead of words as tokens. This way, we were able to explore context windows of different sizes to capture those different morphological aspects. In this sense, our best results were achieved when using three parallel CNNs with window sizes 3, 5, and 7, which are able to capture affixes, lemmas, and inter-word relations, respectively. Using this approach we achieved 76.8% and 73.2% accuracy on the validation and test sets of SwDA, respectively. These results are in line with those of the word-level approach. However, the combination of the two levels improved the results to 78.0% and 74.0%, respectively, which shows that character- and word-level tokenizations provide complementary information. Finally, by including context information from three preceding segments, we improved the results to 82.0% accuracy on the validation set and 79.0% on the test set.

Dialog Act Recognition on Spanish Data
Research on dialog act recognition on Spanish data has been mainly performed on two corpora, DIHANA and CHS. Both feature spontaneous telephonic dialogs. However, as shown in Table 1, while the dialogs from the first are between humans and a WoZ dialog system, the ones from the latter are between humans. Furthermore, while CHS is annotated using task-independent labels, DIHANA is annotated using a three-level hierarchical label scheme, in which the first level refers to the generic task-independent dialog act and the others complement it with task-specific information. There is also a series of experiments on dialog act recognition on the DIME corpus (Coria & Pineda, 2005, 2009). However, these focused on using prosodic information to predict specific subsets of the obligations and grounding dialog acts that the corpus is annotated with. Since our work focuses on dialog act recognition from textual data, we will only provide further detail on the studies performed on the first two corpora. The first dialog act recognition experiments on the DIHANA corpus employed HMMs using both prosodic features, namely energy and pitch (Tamarit & Martínez-Hinarejos, 2008), and textual features, namely n-grams (Martínez-Hinarejos et al., 2008). The first achieved 60.70% accuracy on the first level, while the latter achieved 93.40% on the combination of the first two levels and 89.70% on the combination of all levels. The latter study, as well as a more recent one (Martínez-Hinarejos et al., 2015), also explored the recognition of dialog acts on unsegmented turns using n-gram transducers. However, in those cases, the focus was on the segmentation process and the classification approaches did not differ from the previous ones. Finally, the approach which obtained the best results on the manually segmented dialogs was based on SVMs using n-grams, the presence of wh-words, and punctuation, as well as context information from three preceding segments as features (Gambäck et al., 2011).
This approach also applied Active Learning (AL) to reduce the amount of data required for training, achieving 94.08% accuracy on the combination of the first two levels and 90.97% on the combination of all levels.
Similarly to the DIHANA corpus, the first dialog act recognition experiments on the CHS corpus also employed HMMs with different types of n-gram (Levin et al., 1999;Ries, 1999). The latter study improved the results by combining the HMMs with NNs using unigrams and Part of Speech (POS) tags as features, achieving 76.1% accuracy. The task was also approached using Latent Semantic Analysis (LSA) in three different studies (Serafin et al., 2003;Serafin & Di Eugenio, 2004;Di Eugenio et al., 2010). The first used both plain LSA and multiple adaptations based on clustering and the incorporation of features concerning the preceding dialog acts. However, there was no improvement over plain LSA, which achieved 65.36% accuracy on the tag set with 37 classes and 68.91% on the compressed set of 10 classes. On the other hand, the remaining studies experimented with multiple syntactic and dialog related features and were able to improve the results of plain LSA, up to 77.74% and 81.27%, respectively. In the last study, these results were further improved to 80.34% and 82.88% by applying an instance-based learning approach, namely k-Nearest Neighbors (k-NN), to the reduced semantic spaces computed by LSA. However, in both cases, the improvements were achieved using features concerning the dialog game, that is, the generic intention of the whole dialog, and whether the speaker is taking initiative or replying or providing feedback to the other speaker. Although in general the dialog game is known, there are also cases in which a dialog system is not aware of it. Furthermore, identifying whether a speaker is taking initiative, replying, or providing feedback can be seen as a simplification of the dialog act recognition task. Thus, it is not fair to use that information if it is not obtained automatically as well. Finally, the corpus was also explored in domain adaptation experiments for dialog act classification using a reduced set of classes (Margolis et al., 2010).

Experimental Setup
We want to assess whether the top performing approaches described in the previous section perform similarly on a language other than English. Furthermore, we want to explore their applicability in the multi-label classification scenarios posed by the two bottom levels of the DIHANA corpus dialog act annotations. Since those levels refer to different task-specific aspects, we also assess how context information from the preceding segments influences the ability to predict each of those aspects. Similarly, we assess how that ability is influenced by information from the upper levels. Finally, we want to assess whether the hierarchical combination of the best approaches for each level is able to outperform the flat approach that was used in previous studies on the corpus.
In this section we describe our experimental setup, starting with a description of the DIHANA corpus and its dialog act annotations. Then, we present the generic architecture used in our experiments and explain how it changes according to the aspect and the characteristics of the level in focus, especially considering the differences between single- and multi-label classification. Finally, we describe our training and evaluation approaches, including the differences in the metrics used for single- and multi-label problems.

Dataset
The DIHANA corpus (Benedí et al., 2006) consists of 900 dialogs between 225 human speakers and a WoZ telephonic train information system. There are 6,280 user turns and 9,133 system turns, with a vocabulary size of 823 words and a total of 48,243 words. The turns were manually transcribed, segmented, and annotated with dialog acts (Alcácer et al., 2005). The total number of annotated segments is 23,547, with 9,715 corresponding to user segments and 13,832 to system segments. One of the annotated dialogs is shown in Figure 1.
The dialog act annotations are hierarchically decomposed in three levels (Martínez-Hinarejos et al., 2002). The first level (L1) represents the generic intention of the segment, independently of task details, while the remaining (L2 and L3) represent task-specific information. The first level has 11 labels, distributed according to [...]. Two labels are exclusive to user segments, Acceptance and Rejection, and four to system segments, Opening, Waiting, New Consult, and Confirmation. Furthermore, the most common label, Question, covers 27% of the segments.
[Figure 1 (fragment of an annotated dialog): L1: Espera, L2: Nil, L3: Nil. System: El único tren que realiza el trayecto es un Diurno que sale a las 9 y 25 de la mañana. (The only train that makes that journey is a Diurno that departs at 9:25 a.m.)]
Although they share most labels, the two task-specific levels of the hierarchy focus on different information. While the second level is related to the kind of information that is implicitly focused in the segment, the third level is related to the [...] Figure 1. Since it reveals the intention of finding a train schedule, it has Departure Time as a Level 2 label. However, since that departure time is not explicitly referred to in the segment, that label is not part of its Level 3 labels. On the other hand, the segment explicitly refers to a departure place, a destination, and a date. Thus, it has the corresponding Level 3 labels: Origin, Destination, and Day. The label distributions in both levels are shown in Table 3. We can see that there are 10 common labels and three additional ones on Level 3: Order Number, Number of Trains, and Trip Type. Furthermore, both levels have the Nil label, which represents the absence of a label in that level. In this sense, we can see that only 63% of the segments have Level 2 labels, and that the percentage is even lower, 52%, when considering Level 3 labels. This is mainly due to the fact that segments labeled as Opening, Closing, Undefined, Not Understood, Waiting, and New Consult on the first level cannot have labels on the remaining levels. Finally, it is important to note that while each segment may only have one Level 1 label, it may have multiple Level 2 and Level 3 labels.
As a final remark, it is important to note that some Level 2 labels (Duration, Ticket Class, and Service) and some Level 3 labels (Service and Duration) only occur in 0.1% of the segments or less. Thus, these are hard to predict using machine learning approaches that focus on maximizing the overall accuracy on the corpus.
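The difference between the single-label first level and the multi-label lower levels can be illustrated with a small sketch. It assumes a multi-hot target encoding in which Nil corresponds to the all-zero vector; both that encoding and the helper name `multi_hot` are illustrative assumptions rather than the representation used in our experiments. The label list contains only Level 3 labels named in the text above, not the complete set.

```python
# Illustrative subset of the Level 3 labels named in the text (not the full set).
LEVEL3_LABELS = ["Origin", "Destination", "Day", "Order Number",
                 "Number of Trains", "Trip Type", "Service", "Duration"]


def multi_hot(labels, label_set):
    """Encode a set of labels as a 0/1 vector over label_set.

    Assumption: the absence of labels (Nil) maps to the all-zero vector.
    """
    unknown = set(labels) - set(label_set)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in labels else 0 for label in label_set]


# The example segment discussed above explicitly mentions
# a departure place, a destination, and a date.
segment_l3 = multi_hot({"Origin", "Destination", "Day"}, LEVEL3_LABELS)

# A segment without Level 3 labels (Nil).
nil_l3 = multi_hot(set(), LEVEL3_LABELS)
```

With such an encoding, each position can be predicted independently, which is one common way of framing multi-label classification.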

Network Architecture
Since we want to assess the performance of different DNN-based approaches on dialog act recognition on the DIHANA corpus, we must define a common ground for comparison. Thus, we use a generic network architecture, shown in Figure 2, which is based on those of the top performing approaches referred to in Section 2.2. The generic approach to obtain a dialog act classification for a segment is as follows: First, the segment is split into tokens, which are passed to an embedding layer. Then, the sequence of token embeddings is passed to the segment representation approach. The obtained representation can then be concatenated with additional information from other sources before being passed to a dimensionality reduction layer. Finally, the obtained reduced representation is passed to the output layer, which generates the dialog act classification. The motivation for each of these steps and their characteristics according to the level of the hierarchy in focus are described below.

Embedding Layer
The input of our network is the sequence of tokens in the segment. Similarly to most previous approaches on dialog act recognition, we perform tokenization at the word level. As shown in our previous study (Ribeiro et al., 2018), the character level is also able to provide important information. However, for simplification, we do not include it in this study. Furthermore, we ignore punctuation, since it may not be available for a dialog system. The tokens are then passed to the embedding layer to be transformed into a vectorial representation corresponding to their position in the embedding space. In our experiments, we use pre-trained word embeddings obtained by applying Word2Vec (Mikolov et al., 2013) on the Spanish Billion Words Corpus (Cardellino, 2016). Although we have explored embedding spaces with different dimensionality, we only report the results obtained using dimensionality 200, as used by Liu et al. (2017), since it consistently led to better results than the ones explored by Khanpour et al. (2016).
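The tokenization and embedding lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding table is a toy dictionary of random vectors standing in for the pre-trained Word2Vec embeddings, and lowercasing and mapping unknown words to zero vectors are choices made for the example, not details taken from our setup. The names `tokenize` and `embed` are hypothetical.

```python
import re

import numpy as np

EMBEDDING_DIM = 200  # dimensionality reported in the text


def tokenize(segment):
    """Word-level tokenization that discards punctuation.

    Lowercasing is an assumption made for this sketch.
    """
    return re.findall(r"\w+", segment.lower())


def embed(tokens, table, dim=EMBEDDING_DIM):
    """Map tokens to vectors, returning an array of shape (n_tokens, dim).

    Assumption: tokens missing from the table map to zero vectors.
    """
    if not tokens:
        return np.zeros((0, dim))
    return np.stack([table.get(token, np.zeros(dim)) for token in tokens])


# Toy table of random vectors standing in for pre-trained embeddings.
rng = np.random.default_rng(0)
toy_table = {word: rng.normal(size=EMBEDDING_DIM)
             for word in ["hora", "sale", "el", "tren"]}

tokens = tokenize("¿A qué hora sale el tren?")
matrix = embed(tokens, toy_table)
```

The resulting matrix has one row per token and is the input to the segment representation step.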

Segment Representation
The segment representation step generates a vectorial representation of the segment through the combination of the representations of its tokens. As stated in Section 2.2, the two approaches with highest performance on dialog act recognition on English data vary on this step. While the approach by Khanpour et al. (2016) is based on RNNs, the one by Liu et al. (2017) is based on CNNs. Both have their own advantages, as while the first focuses on capturing information from relevant sequences of tokens, the latter focuses on the context surrounding each token and, thus, captures relevant patterns. Since the different levels in the label hierarchy have different characteristics, we use both approaches in our experiments to assess whether one outperforms the other in every situation or the one with best performance varies according to the level. As described in Section 2.2, the RNN-based approach by Khanpour et al. (2016) uses a stack of 10 LSTM units. The segment representation is given by the concatenation of the outputs of the 10 LSTM units at the last time step, that is, after processing all the tokens in the segment. Using the output at the last time step instead of another pooling operation makes sense, since the recurrent units process the tokens sequentially. Thus, that output contains information from the whole segment. The results reported in this article were obtained using a stack of five Gated Recurrent Units (GRUs) instead of the stack of 10 LSTMs, since it led to similar performance with reduced resource consumption in our preliminary experiments. A graphical representation of this approach is shown in Figure 3.
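As a rough illustration of the recurrent approach, the sketch below stacks five GRU layers and concatenates the output of each layer at the last time step. It is a forward pass only, with randomly initialized weights; the hidden size of 64, the initialization, the particular GRU update convention, and the names `GRULayer` and `rnn_segment_representation` are assumptions made for the example and do not describe our actual implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRULayer:
    """Minimal GRU forward pass with randomly initialized weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, size=(3, input_dim, hidden_dim))
        self.U = rng.normal(0.0, 0.1, size=(3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))
        self.hidden_dim = hidden_dim

    def forward(self, inputs):
        """Return the hidden state at every time step, shape (T, hidden_dim).

        Expects at least one time step.
        """
        h = np.zeros(self.hidden_dim)
        states = []
        for x in inputs:
            z = sigmoid(x @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x @ self.W[1] + h @ self.U[1] + self.b[1])
            cand = np.tanh(x @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            # One common convention for the GRU state update.
            h = (1.0 - z) * h + z * cand
            states.append(h)
        return np.stack(states)


def rnn_segment_representation(embeddings, layers):
    """Concatenate the last-time-step output of every layer in the stack."""
    last_steps = []
    sequence = embeddings
    for layer in layers:
        sequence = layer.forward(sequence)  # feeds the next layer
        last_steps.append(sequence[-1])
    return np.concatenate(last_steps)


rng = np.random.default_rng(0)
hidden = 64  # hypothetical hidden size
layers = [GRULayer(200 if i == 0 else hidden, hidden, rng) for i in range(5)]
embeddings = rng.normal(size=(6, 200))  # six tokens, 200-dimensional
representation = rnn_segment_representation(embeddings, layers)
```

Under these assumptions, the representation has 5 × 64 = 320 dimensions, one block per layer in the stack.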
Also as described in Section 2.2, the CNNbased approach by Liu et al. (2017) uses three parallel temporal CNNs with window sizes between one and three, inclusively. This means that it focuses on sets of at most three consecutive words. A previous study by Kim (2014) used window sizes between three and five, in order to capture relations between more distant words, which were relevant for the approached tasks. Considering the task at hand, the most relevant window sizes depend on the level in focus, as the task-specific dialog acts are typically related to the presence of specific words, while generic dialog acts are more related to the structure of the segment and, consequently, larger windows. Thus, we explore the use of different window sizes for each level. The outputs of the CNNs suffer a max pooling operation and are afterwards concatenated to generate the segment representation. A graphical representation of the approach is shown in Figure 4.
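The convolution and pooling steps of the CNN-based approach can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments; the function names, the toy filters, the zero padding of short segments, and the absence of bias terms and activation functions are all assumptions made to keep the example short.

```python
def temporal_conv_max(embeddings, filters, window):
    """Apply each filter to every run of `window` consecutive token vectors
    and keep the maximum response of each filter (max pooling over time)."""
    dim = len(embeddings[0])
    # Zero padding (an assumption) so that short segments yield one position.
    padded = embeddings + [[0.0] * dim] * max(0, window - len(embeddings))
    pooled = []
    for weights in filters:  # each filter is a flat list of window * dim weights
        responses = []
        for start in range(len(padded) - window + 1):
            flat = [x for vec in padded[start:start + window] for x in vec]
            responses.append(sum(w * x for w, x in zip(weights, flat)))
        pooled.append(max(responses))
    return pooled


def cnn_segment_representation(embeddings, filters_by_window):
    """Concatenate the pooled outputs of the parallel CNNs, one per window size."""
    representation = []
    for window in sorted(filters_by_window):
        representation += temporal_conv_max(
            embeddings, filters_by_window[window], window
        )
    return representation


# Three tokens with 2-dimensional toy embeddings and one toy filter per window.
segment = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
toy_filters = {1: [[1.0, 0.0]], 2: [[1.0, 1.0, 1.0, 1.0]]}
representation = cnn_segment_representation(segment, toy_filters)
```

Changing the keys of `toy_filters` from narrow to wide windows corresponds to switching between the two sets of window sizes discussed above.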

Context Information
Previous studies have confirmed the importance of context information provided by the preceding segments for dialog act recognition (Ribeiro et al., 2015; Lee & Dernoncourt, 2016; Liu et al., 2017). Additionally, those studies have shown that the influence of preceding segments decays with distance and that the dialog act classification of those segments is more informative than their words. Thus, in our experiments we use the same label-based representation approach used in our study on the SwDA corpus (Ribeiro et al., 2015) and also by Liu et al. (2017) to provide context information to the network. That is, the labels from the preceding segments are transformed into the corresponding one-hot encodings and concatenated to the segment representation. Similarly to Liu et al. (2017), we explore the use of context information from up to three preceding segments, since our previous study has shown that the improvement achieved by using additional segments is negligible. In the context of a dialog system identifying its conversational partner's intention, the system only has access to the preceding segments. Thus, in our experiments we do not use information extracted from future segments. It is important to note that we use the manual annotations of the segments to provide the context information. Thus, the obtained results represent an upper bound for the approach. We did not use automatic labels in our experiments since both our study and that by Liu et al. (2017) have shown that this approach performs better than its competitors, which use features concerning the words of preceding segments, even when the labels are obtained automatically. According to those studies, accuracy is expected to drop around 2 percentage points when using automatic labels. However, in the context of a dialog system, the system is aware of the dialog acts of its own segments. Thus, only the classification of user segments is subject to error, which should reduce the accuracy drop.
Still, as future work, it is important to assess the concrete performance decay in this scenario. Additionally, since the DIHANA corpus has hierarchical dialog act labels, when dealing with a certain level, we also explore the use of context information from the upper levels, relative to both the current and preceding segments. To provide this information we use the same label-based representation approach described for providing context information from the preceding segments.
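The label-based context representation can be sketched as follows. This is a minimal illustration for single-label (Level 1) context; the function names, the toy label subset, and the use of all-zero vectors when fewer than `history` preceding segments exist are assumptions of the example. For the multi-label levels, a multi-hot vector would presumably take the place of the one-hot encoding.

```python
def one_hot(label, label_set):
    """One-hot encoding of a label; an unknown or missing label yields zeros."""
    return [1.0 if label == candidate else 0.0 for candidate in label_set]


def add_context(segment_repr, preceding_labels, label_set, history=3):
    """Concatenate the encodings of the labels of up to `history` preceding
    segments to the segment representation, closest segment first."""
    recent = list(preceding_labels[-history:]) if history > 0 else []
    # Missing history at the start of a dialog is padded with None (assumption).
    padded = [None] * (history - len(recent)) + recent
    features = list(segment_repr)
    for label in reversed(padded):
        features += one_hot(label, label_set)
    return features


# Toy subset of Level 1 labels, used only for illustration.
toy_labels = ["Question", "Answer", "Undefined"]
features = add_context([0.5], ["Question", "Answer"], toy_labels, history=3)
```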

Dimensionality Reduction Layer
In order to avoid result differences caused by using segment representations with different dimensionality, our architecture includes a dimensionality reduction layer that maps the generated segment representations into a 100-dimensional space. This way, the observed differences in performance are due to the nature of the segment representation approach and the information it is able to capture and not to factors related to the dimensionality. Furthermore, in order to reduce the probability of overfitting to the training data, this layer also applies dropout, disabling 50% of the neurons during the training phase.

Output Layer
The output layer maps the 100-dimensional representation into a dialog act label. This is done using a dense layer with number of units equal to the number of labels. Since each segment has a single Level 1 label, we use the softmax activation together with the categorical cross entropy loss function when predicting those labels. However, that is not valid for the other levels, since they allow each segment to have multiple labels. Thus, in those cases, we use the sigmoid activation together with the binary cross entropy loss function, which, given the possibility of multiple labels, is actually the Hamming loss function, which is appropriate for this kind of problem (Díez et al., 2015). In both cases, for performance reasons, we use the Adam optimizer (Kingma & Ba, 2015).
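The difference between the two output configurations can be made concrete with a small pure-Python sketch. The function names and the 0.5 decision threshold for the sigmoid outputs are assumptions of this example; the threshold in particular is not stated in the text.

```python
import math


def softmax(logits):
    """Single-label case: a probability distribution over the labels."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, gold_index):
    """Loss for the single gold label of a Level 1 segment."""
    return -math.log(probs[gold_index])


def sigmoid(z):
    """Multi-label case: an independent probability for each label."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(probs, gold):
    """Mean binary cross entropy over the labels; `gold` is a 0/1 vector."""
    losses = [
        -(g * math.log(p) + (1 - g) * math.log(1 - p))
        for p, g in zip(probs, gold)
    ]
    return sum(losses) / len(losses)


def predict_multi_label(logits, threshold=0.5):
    """Indices of the labels whose probability reaches the threshold.

    The threshold value is an assumption; the result may be empty.
    """
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]
```

With the softmax output exactly one label is selected per segment, whereas the sigmoid outputs allow zero, one, or several labels, which is what Levels 2 and 3 require.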

Training and Evaluation
To implement our networks we used Keras (Chollet et al., 2015) with the TensorFlow (Abadi et al., 2015) backend. We used mini-batching with batches of size 512 and the training phase stopped after 10 epochs without improvement on the validation set. Since there is some nondeterminism involved, the results presented in the next section refer to the mean (m) and standard deviation (s) of the results obtained over 10 runs.
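The stopping criterion and the aggregation over runs can be sketched in a few lines of Python. The function names are hypothetical, the monitored quantity is assumed to be a validation score where higher is better, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def stopping_epoch(validation_scores, patience=10):
    """Return the 1-based epoch at which training stops, that is, after
    `patience` consecutive epochs without improvement on the validation set."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_scores)


def summarize_runs(scores):
    """Mean and sample standard deviation of the results of several runs."""
    return mean(scores), stdev(scores)
```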
In order to evaluate our approaches, we performed 5-fold cross-validation using the folds defined in the first experiments on the DIHANA corpus (Tamarit & Martínez-Hinarejos, 2008; Martínez-Hinarejos et al., 2008). The evaluation metrics vary according to the level of the hierarchy in focus. Since each segment has a single Level 1 label, at this stage we are dealing with a single-label classification problem. Thus, similarly to previous approaches on dialog act recognition, performance can be evaluated using accuracy. However, that is not the most appropriate metric for Levels 2 and 3, since they pose multi-label classification problems. Thus, we assess performance on those levels using the adapted metrics described by Sorower (2010). The multi-label equivalent of accuracy is the exact match ratio (MR), defined as

$$MR = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = Z_i)$$

where n is the number of examples, Y_i is the set of gold standard labels of example i, Z_i is the set of labels predicted by the classifier for the same example, and I is the indicator function. The problem with this metric is that it does not account for partial correctness, which is common in multi-label classification problems. Thus, the single-label metrics of accuracy (Acc), precision (P), recall (R) and F-measure (F_1) are adapted to the multi-label problem as follows:

$$Acc = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$

$$P = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Z_i|}$$

$$R = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i|}$$

$$F_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|Y_i \cap Z_i|}{|Y_i| + |Z_i|}$$

where the operator |X| is used to obtain the number of elements in the set X. Additionally, as previously stated, the Hamming loss (HL), which states how many times, on average, the relevance of an example to a class label is incorrectly predicted and is defined as

$$HL = \frac{1}{n\,|L|} \sum_{i=1}^{n} \sum_{l \in L} \left[ I(l \in Z_i \wedge l \notin Y_i) + I(l \notin Z_i \wedge l \in Y_i) \right]$$

where L is the set of all possible labels, is also an appropriate metric to evaluate the performance on multi-label classification problems. In the next section, the results for every metric except the Hamming loss are presented as a percentage.
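The example-based metrics described above translate directly into set operations. The following Python sketch computes all of them for a list of gold and predicted label sets. The function name and the conventions adopted when a gold or predicted set is empty (an example with both sets empty counts as fully correct) are assumptions of this sketch, since the text does not specify how such cases are handled.

```python
def multilabel_metrics(gold, predicted, label_set):
    """Example-based multi-label metrics, averaged over the examples."""
    n = len(gold)
    totals = dict.fromkeys(("MR", "Acc", "P", "R", "F1", "HL"), 0.0)
    for y, z in zip(gold, predicted):
        y, z = set(y), set(z)
        common = len(y & z)
        totals["MR"] += 1.0 if y == z else 0.0
        # Conventions for empty sets below are assumptions of this sketch.
        totals["Acc"] += common / len(y | z) if (y | z) else 1.0
        totals["P"] += common / len(z) if z else (0.0 if y else 1.0)
        totals["R"] += common / len(y) if y else (0.0 if z else 1.0)
        totals["F1"] += 2 * common / (len(y) + len(z)) if (y or z) else 1.0
        # The symmetric difference counts labels wrongly added or missed.
        totals["HL"] += len(y ^ z) / len(label_set)
    return {name: value / n for name, value in totals.items()}


# Toy example with a reduced label set, used only for illustration.
toy_label_set = ["Departure Time", "Arrival Time", "Train Type"]
toy_gold = [{"Departure Time"}, {"Arrival Time", "Train Type"}]
toy_predicted = [{"Departure Time"}, {"Arrival Time"}]
scores = multilabel_metrics(toy_gold, toy_predicted, toy_label_set)
```

In the toy example the second segment is only partially correct, so the exact match ratio is 0.5 while precision remains 1.0 and recall drops to 0.75, which illustrates why the exact match ratio alone hides partial correctness.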
To assess whether the differences between two approaches are statistically significant, we randomly selected one of the runs for each approach and performed a binomial test on their accuracy in Level 1 experiments and on their exact match ratio in Level 2 and 3 experiments. In the discussion in the next section we consider a confidence level of 95%, that is, we consider that there is a statistically significant difference between the approaches if the p-value of the binomial test is lower than 0.05.
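The text does not detail how the binomial test is set up. One plausible reading, which is purely an assumption of this sketch, is an exact two-sided test of whether the number of segments correctly classified by one approach is compatible with the success rate of the other. The function names are hypothetical and the computation is done in log space so that it remains stable for corpora with tens of thousands of segments.

```python
import math


def binomial_log_pmf(k, n, p):
    """Logarithm of the binomial probability of k successes in n trials."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )


def binomial_test(successes, trials, p):
    """Exact two-sided p-value for 0 < p < 1: the total probability of all
    outcomes that are at most as likely as the observed one."""
    observed = binomial_log_pmf(successes, trials, p)
    tolerance = 1e-9  # guards against floating-point ties
    total = 0.0
    for k in range(trials + 1):
        log_prob = binomial_log_pmf(k, trials, p)
        if log_prob <= observed + tolerance:
            total += math.exp(log_prob)
    return min(1.0, total)


def significantly_different(successes, trials, reference_rate, alpha=0.05):
    """Decision at the 95% confidence level used in the discussion."""
    return binomial_test(successes, trials, reference_rate) < alpha
```

Here `successes` would be the number of correct predictions (Level 1) or exact matches (Levels 2 and 3) of the selected run of one approach, and `reference_rate` the corresponding ratio of the other approach.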
Since each level in the hierarchical dialog act annotation of the DIHANA corpus has different characteristics and poses different problems, we start by presenting the results achieved on each of the levels independently. Furthermore, since we want to assess the importance of context information from upper levels, we start on the top level and descend the hierarchy. Finally, we present the results achieved on the hierarchical combination of the different levels.

Level 1
The results obtained when using the recurrent and convolutional segment representation approaches to predict Level 1 labels are shown in Table 4. We can see that the CNN-based approach leads to better performance than the RNN-based one (p ≈ 0.04). However, both approaches lead to average accuracy results above 90% and the difference between them is just 0.5 percentage points, which suggests that they are able to capture similar generic intention information. Still, while the network using the CNN-based approach takes an average of 0.61 seconds per epoch to train and 27 epochs to converge, training the network using the RNN-based approach takes much longer, with an average of 17.63 seconds per epoch and 46 epochs to converge. Additionally, as expected, using wider context windows around each token leads to better results (p ≈ 0.03), which confirms that the generic Level 1 dialog acts are more related to the structure of the segment than to specific keywords. Still, since three different context windows are used in parallel and the two sets used in our experiments overlap, the accuracy difference between using the narrower windows used by Liu et al. (2017) in their study and the wider ones used by Kim (2014)
Table 4: Accuracy results on Level 1 using the two segment representation approaches.
Concerning context information provided by the preceding segments, the results in Table 5 show that the first preceding segment is the most important, leading to an accuracy improvement of 4.45 percentage points (p ≈ 6.7e-167). An additional improvement of 1.77 percentage points is achieved by providing information from two additional segments (p ≈ 4.6e-58). This pattern was expected, since it had already been observed in our study (Ribeiro et al., 2015) and that by Liu et al. (2017) on the SwDA corpus, which is also annotated with task-independent dialog act labels.
Table 6: Level 1 accuracy results on user and system segments.
When information from three preceding segments is used, the classifier only fails to accurately predict two percent of the segments. That result takes into account all the segments in the DIHANA Corpus. However, the system segments are scripted and, thus, are easier to predict than the user segments. In fact, if we consider the scenario of a dialog system trying to predict dialog acts, it is aware of its own and must only predict those of its conversational partners. In this sense, in Table 6 we can see the results achieved when considering user and system segments independently. As expected, the average accuracy on system segments is 99.91%. On user segments that value decreases to 95.17%, which still reveals high performance.
Looking at each label individually, the hardest to identify is Undefined, with a recall around 57%. This was expected since that label covers all the cases which cannot be labeled with any of the other labels, including problems in the dialog. All the remaining labels have a recall above 95%, with the lowest being that of the Answer label, which is also the lowest in terms of precision (96%). In both cases, the confusion is typically with the Question label, which makes sense, since questions and answers may have the same words and only differ in terms of their order. In fact, if we consider questions in declarative form, there may be no difference at all.
Considering previous studies on dialog act recognition on the DIHANA Corpus, only Tamarit & Martínez-Hinarejos (2008) assessed the performance on Level 1 alone, achieving 60.70% accuracy. However, their study focused on the use of prosodic information and, thus, it is not fair to compare their results with ours, since our approach takes advantage of the transcriptions.

Level 2
As stated in Section 3.1, some Level 1 labels can only be paired with the Nil label on the remaining levels. Thus, segments labeled with one of those labels on Level 1 have their labels on the remaining levels defined, independently of their content. Consequently, we do not take those segments into account in our experiments on Levels 2 and 3.
Similarly to what happened on Level 1, in Table 7, we can see that using the CNN-based segment representation approach leads to better results than the RNN-based one. The only exception is the Hamming loss, which, on average, is equal for both approaches. On every non-loss metric, the CNN-based approach surpasses the RNN-based one by over 1 percentage point (p ≈ 1.1e-14). In this case, the discrepancy in the number of epochs required for training is smaller, with an average of 46 for the CNN-based approach and 56 for the RNN-based one. Furthermore, since we are considering fewer segments, the training times per epoch are reduced to 0.40 and 11.67 seconds, respectively.
Contrary to what happened on Level 1, using narrower context windows apparently leads to better results. However, the difference is not statistically significant (p ≈ 0.12). Still, this suggests that task-specific dialog act labels are more related to certain keywords than the generic labels of Level 1. Furthermore, since the number of labels per segment is typically low, the classifiers tend to avoid selecting incorrect labels, which is reflected in higher precision than recall for every approach.
The results in Table 8 show that, similarly to what happened on Level 1, the preceding segments are able to provide relevant context information for the task. However, in this case, the importance of the first preceding segment is more pronounced, reducing the loss to less than a third and improving the remaining metrics by around 20 percentage points (p ≈ 5.0e-324). This makes sense considering that the dialogs feature many question-answer pairs focused on the same target information, which is the focus of Level 2 labels. Thus, in those cases, the labels of both segments are the same. Consequently, the labels of the first preceding segment provide an important cue for the identification of those of the current segment.
In Table 9, we can see that context information from Level 1 is also important. Using information from the current segment only leads to a slight but significant improvement (p ≈ 0.01). However, also considering the Level 1 classification of the first preceding segment leads to an improvement around 1.5 percentage points on every non-loss metric (p ≈ 8.7e-6). This is still explained by the presence of multiple question-answer pairs in the dialogs: if the preceding segment is labeled as Question on Level 1, then the current segment probably has the same Level 2 labels as the preceding segment. The improvement achieved using information from additional preceding segments is not statistically significant.
Similarly to what happened on Level 1, the performance on user segments is different from that on system segments. In Table 10, we can see that on system segments, the average value of every non-loss metric is around 98.4%, while on user segments the average exact match ratio is 91.28% and the remaining non-loss metrics are around 92%.
Considering the labels individually, the best approach is unable to identify any of the three least predominant labels in the dataset. However, this was expected, since none of them appears in more than 29 segments. Thus, they are irrelevant for an approach focused on reducing the loss on the overall dataset and require specialized approaches or additional data to be identified. The Arrival Time label has an F-measure around 75% since it is easily confused with the Departure Time label and is the less predominant of the two. Although the Train Type label has precision above 95%, it only has around 87% recall. This happens since the label only appears in 2% of the segments. Thus, in segments that focus on multiple aspects, information from the keywords that refer to the type of train is neglected in favor of that which allows the identification of more predominant labels. All the remaining labels have an F-measure above 95% with balanced precision and recall.
Previous studies on dialog act recognition on the DIHANA Corpus did not explore Level 2 on its own, but rather combined it with Level 1, using the combination of the labels of the two levels as the label set and looking at the problem as a single-label classification problem, similar to the classification of Level 1. Thus, our results on Level 2 cannot be compared directly with those of previous studies. The results achieved on the combination of both levels are discussed in Section 4.4.
Table 7: Results on Level 2 using the two segment representation approaches.
Table 10: Level 2 results on user and system segments.

Level 3
Table 11 shows that, similarly to what happened on the other levels, the CNN-based segment representation approach leads to better results than the RNN-based one (p ≈ 9.6e-5). However, in this case, the difference is less pronounced. In fact, when using the set of wider windows, the CNN-based approach performs worse than the RNN-based one (p ≈ 1.2e-4). This is due to the fact that Level 3 focuses on the information that is explicitly referred to in the segments and, thus, is even more keyword-oriented than Level 2. That also explains the average results above 96% on every non-loss metric. The average training times per epoch are the same as those for Level 2. However, in this case, more epochs are required until convergence: 86 for the RNN-based approach and 80 for the CNN-based one.
The results in Table 12 show that, in this case, the improvement provided by context information from the preceding segments is negligible and not statistically significant (p ≈ 0.48). Once again, this is explained by the nature of Level 3 and its focus on what is explicitly referred to in the current segment. Thus, the preceding segments are not relevant.
In Table 13 we can see that the improvement provided by Level 2 information is slightly superior to that provided by the Level 3 information from the preceding segments. In this case, considering the Level 2 label of the same segment leads to a statistically significant improvement (p ≈ 0.03). This can be explained by the fact that when a certain kind of information is explicitly referred to in a segment, it is typically also the focus of the segment and, thus, overlaps between the Level 2 and 3 labels of a segment are common. Considering Level 2 information from preceding segments does not lead to statistically significant improvements (p ≈ 0.13).

Since the Level 1 labels are related to the generic intention of the segment, they have no direct relation to what is explicitly referred to in the segment and, thus, to the Level 3. This is confirmed by the results in Table 14, which show that the improvement provided by Level 1 information is negligible and not statistically significant (p ≈ 0.13).
In Table 15, we can see that, in this case, the performance difference between user and system segments is not as pronounced. Once again, this is explained by the fact that the Level 3 is highly focused on keywords and, thus, the fact that the system segments are scripted does not have the same influence on classification.
Considering the labels individually, similarly to what happened on Level 2, the best approach is unable to identify the least predominant labels, Duration and Service, since neither of them appears in more than 19 segments. Of the remaining labels, Arrival Time is the one with the lowest recall, 88%, since it is easily confused with the more predominant Departure Time label. All the remaining labels have an F-measure above 97% with balanced precision and recall.
Similarly to the Level 2, previous studies on dialog act recognition on the DIHANA corpus did not explore the Level 3 alone, but rather combined it with the remaining levels. Consequently, we are also unable to directly compare our results on the Level 3 with those of previous studies. The hierarchical combination of the multiple levels is explored in the next section.

Hierarchical Classification
As previously stated, previous studies on dialog act recognition on the DIHANA corpus did not explore the task-specific levels of the hierarchy independently, but rather in combination with the levels above them. This makes sense from a hierarchical point of view, as each level is supposed to depend on those above it. However, as discussed in Section 3.1, since each level focuses on a different aspect concerning the intention of the speaker, the only restriction imposed by the annotation scheme is that segments annotated with a Level 1 label that refers to dialog structuring or communication problems cannot have labels on the remaining levels. Still, the results reported in the previous sections show that the ability to predict the label at a given level is improved when context information from the level directly above it is used. Furthermore, in order to accurately identify the intention of a speaker, the system must be able to accurately predict the labels at the three levels. Thus, we also assess the performance on the hierarchical combination of the multiple levels.
The previous studies on the task approached the problem of the combined classification of the different levels as a single-label classification problem in which each combination of labels present in the corpus is considered a single independent label. However, this approach has two flaws. On the one hand, it is a simplification of the problem as it limits the possible labels to the combinations existing in the dataset. On the other hand, it does not take the multi-label nature of the task-specific levels into account.
Contrary to those studies, we approach the problem hierarchically by combining the best classifiers for each level. That is, for each segment, we start by predicting its Level 1 label using the CNN-based classifier with wide context windows and context information from three preceding segments. Then, we predict its Level 2 labels using the CNN-based classifier with narrow context windows, Level 2 context information from three preceding segments, and Level 1 context information from the current and first preceding segment. Finally, we predict its Level 3 labels using the CNN-based classifier with narrow context windows and Level 2 context information from the current segment. In order to account for the fact that the Level 2 and 3 classifiers were not trained on the segments with Level 1 labels that do not allow labels on the remaining levels, if the Level 1 classifier predicts one of those labels for the segment, it is automatically assigned no labels on the remaining levels.
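The control flow of this hierarchical combination can be sketched as follows. The three classifiers are represented by hypothetical callables standing in for the trained networks, and the handling of context from preceding segments is assumed to happen inside them; the function name, the argument names, and the placeholder label used in the usage example are all assumptions of this sketch.

```python
def classify_hierarchically(segment, level1, level2, level3, no_lower_levels):
    """Predict the labels of a segment level by level.

    level1(segment) returns a single label;
    level2(segment, label1) and level3(segment, labels2) return label sets.
    `no_lower_levels` holds the Level 1 labels that do not allow labels
    on the remaining levels.
    """
    label1 = level1(segment)
    if label1 in no_lower_levels:
        # The lower-level classifiers were not trained on these segments,
        # so the segment is automatically assigned no further labels.
        return label1, frozenset(), frozenset()
    labels2 = frozenset(level2(segment, label1))
    labels3 = frozenset(level3(segment, labels2))
    return label1, labels2, labels3


# Toy stand-ins for the trained classifiers, used only for illustration.
prediction = classify_hierarchically(
    "a qué hora sale el tren",
    level1=lambda seg: "Question",
    level2=lambda seg, l1: {"Departure Time"},
    level3=lambda seg, l2: set(),
    no_lower_levels={"L1_STRUCTURING"},  # hypothetical placeholder label
)
```

Because Levels 2 and 3 return sets, any combination of labels can be produced, including combinations that never occur in the training data.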
Using this hierarchical approach, the bottom levels are still considered multi-label classification problems. Thus, every combination of labels is possible and not just those that appear in the dataset. Still, in order to confirm that the problem approached by previous studies is actually simpler, we also present the results achieved when the task is approached as a single-label classification problem. To obtain those results, we used a classifier with the same architecture as the best Level 1 classifier, that is, a CNN-based classifier with wide context windows and context information from three preceding segments. However, in this case, the classifier was trained to predict the combination of all labels for the segment at once and each of those combinations is seen as an independent label.
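The reduction to a single-label problem can be illustrated with a short Python sketch in which each combination of labels observed in the data becomes one independent label. The function names and the separators used to build the combined label are assumptions of the example.

```python
def combine_labels(label1, labels2, labels3):
    """Build a single label from the labels of the three levels."""
    return "|".join(
        [label1, "+".join(sorted(labels2)), "+".join(sorted(labels3))]
    )


def build_combined_label_set(annotations):
    """Label set of the single-label problem: only the combinations that
    actually occur in the annotated data."""
    return sorted({combine_labels(*annotation) for annotation in annotations})


# Toy annotations, used only for illustration.
toy_annotations = [
    ("Question", {"Departure Time"}, set()),
    ("Answer", {"Departure Time"}, {"Departure Time"}),
    ("Question", {"Departure Time"}, set()),
]
combined_label_set = build_combined_label_set(toy_annotations)
unseen = combine_labels("Question", {"Departure Time", "Train Type"}, set())
```

In the toy data, `unseen` is a valid annotation under the scheme but is absent from `combined_label_set`, so a classifier trained on the combined labels could never predict it. This is the simplification that the hierarchical approach avoids.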
For comparison with the results achieved in previous studies, we use the exact match ratio to evaluate the performance of both the hierarchical and single-label approaches. Thus, if the prediction of the Level 1 label is inaccurate or there is any missing or additional Level 2 or 3 label, the whole prediction for the segment is considered wrong. Table 16 shows the results achieved on the combination of Levels 1 and 2. Using the hierarchical approach we achieved an average of 94.28% exact match ratio, which is already above the 93.40% reported by Martínez-Hinarejos et al. (2008) (p ≈ 3.0e-8) and in line with the 94.08% reported by Gambäck et al. (2011) (p ≈ 0.20). By approaching the task as a single-label classification problem we achieved 96.24%, which is almost two percentage points above the result achieved using the hierarchical approach (p ≈ 7.0e-43). This confirms that this view on the problem is actually a simplification.
Table 16: Results achieved on the combination of Levels 1 and 2.
Table 17 shows the results achieved on the combination of the three levels. We can see that most of the conclusions drawn for the combination of Levels 1 and 2 can also be drawn in this case. Using the hierarchical approach we achieved an average of 92.34% exact match ratio, which is above the 89.70% reported by Martínez-Hinarejos et al. (2008) (p ≈ 9.5e-44) and the 90.97% reported by Gambäck et al. (2011). However, while on the combination of the two top levels the result of the hierarchical approach was not statistically different from that reported by Gambäck et al. (2011), in this case there is a statistically significant improvement of 1.37 percentage points (p ≈ 6.6e-14). By approaching the task as a single-label classification problem, the exact match ratio is improved to 93.98% (p ≈ 1.5e-22), once again confirming that the problem is simpler.

Conclusions
In this article we have explored automatic dialog act recognition on the DIHANA corpus. This dataset and its three-level annotation scheme pose problems which have been neglected since the studies on dialog act recognition started focusing on English data and, especially, on the SwDA corpus. The first problem concerns the language difference. Additionally, contrary to the flat and single-label classification problem posed by the SWBD-DAMSL annotations of the SwDA corpus, the dialog act annotations of the DIHANA corpus pose a hierarchical classification problem. Furthermore, the two lower levels of that hierarchy pose multi-label classification problems. We have studied how the state-of-the-art approaches on dialog act recognition on English data can be applied to these problems and which aspects of those approaches are relevant for the prediction of the labels of each level, according to its characteristics. A conclusion that was common to all levels was that the CNN-based approach on segment representation led to better performance than the RNN-based approach. This approach, applied to dialog act recognition on English data by Liu et al. (2017), features three parallel temporal CNNs with context windows of different sizes. This way, the segment representation approach takes sets of words of different sizes into account and, depending on the sizes of the windows, is able to capture information concerning both specific words and the structure of the segment. In this sense, the task-independent labels of Level 1 are more related to the structure of the segment and, thus, the best results were achieved using a set of wider context windows. On the other hand, the task-specific labels of Levels 2 and 3 are more related to certain keywords and, thus, using a set of narrower windows led to improved performance.
The importance of the selected window sizes was especially pronounced on the experiments on Level 3, since when using wider windows the CNN-based approach performed worse than the RNN-based one. However, that is explainable by the nature of that level, which focuses on the kind of information that is explicitly referred to in the segments and, thus, the classification of a segment is given by the presence of specific words.
The relation between the Level 3 labels and the presence of certain keywords in the segment also explains the lack of importance of context information from the preceding segments to the prediction of those labels. On the other hand, that information is relevant for predicting the labels of the remaining levels. On Level 1, the experiments revealed a pattern similar to that revealed in both our study (Ribeiro et al., 2015) and that by Liu et al. (2017) on SwDA, which is also annotated with task-independent labels. However, the importance of context information from the preceding segments was especially pronounced on the experiments on Level 2, reducing Hamming loss to less than a third and improving the remaining metrics by around 20 percentage points. Level 2 focuses on the kind of information implicitly focused by the segment. Thus, since the dialogs in the DIHANA corpus feature multiple pairs of segments focused on the same kind of information, the preceding segments, especially the first, provide an important cue for the classification of the current segment.
Still considering the Level 2 and the characteristics of the dialogs, most of the pairs of segments focused on the same kind of information are question-answer pairs. Question and Answer are Level 1 labels. Thus, Level 1 context information from both the current and preceding segments also provides cues for the prediction of Level 2 labels. On the other hand, it is irrelevant when predicting Level 3 labels. However, there is a relation between the kind of information that is implicitly focused in a segment and that which is explicitly referred to in it. Thus, the sets of Level 2 and 3 labels of a segment typically overlap. Consequently, Level 2 context information is able to slightly improve the performance when predicting Level 3 labels.
The system segments of the DIHANA corpus are scripted and, thus, are easier to predict than the user segments. Furthermore, a dialog system is aware of the dialog acts of its own segments and must only predict those of its conversational partner's segments. Thus, for such a scenario, only the performance on user segments is relevant. As expected, the performance was higher on system segments on every level. However, on user segments, the average accuracy on Level 1 and the average exact match ratio on the remaining levels were still above 90%. Furthermore, it is important to note that since Level 3 is highly keyword-related, the performance difference is not as pronounced.
Finally, by hierarchically combining the best classifiers for each level, we achieved an average exact match ratio of 94.28% on the combination of Levels 1 and 2 and 92.34% on the combination of the three levels. These results are already in line with or above those achieved in previous studies on dialog act recognition on the DIHANA corpus. However, those studies considered a simplified version of the problem by reducing it to a single-label classification problem with the label of a segment consisting of the concatenation of the labels of the three levels. Since this approach only considers the label combinations present in the corpus, the number of possible labels is reduced in comparison to our approach, which looks at the prediction of Level 2 and 3 labels as multi-label classification problems. By approaching the problem in a manner comparable to that of those studies, the previous values increase to 96.24% and 93.98%, respectively.
In terms of future work it would be interesting to assess whether the conclusions drawn on this study on Spanish data and previously on English data also hold on data in other languages with different morphological typology. In terms of multi-label dialog act recognition, it would be interesting to explore the use of other loss functions when training the network, especially one based on F-measure, which is not as influenced by the reduced number of positive classes per segment as the Hamming loss. Furthermore, it is important to assess whether segment representation approaches based on character-level tokenization are able to capture additional information for predicting the task-specific labels. It would also be interesting to explore means to perform the hierarchical classification of the multiple levels using a single network instead of three independent classifiers. Finally, it is important to assess the decay in performance in a real scenario. That is, one in which the dialog system is not simulated and, thus, must deal with problems related to Automatic Speech Recognition (ASR) and use automatically predicted labels as context information.