DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

Shikib Mehri, Mihail Eric, Dilek Hakkani-Tur
Language Technologies Institute, Carnegie Mellon University
Amazon Alexa AI
amehri@cs.cmu.edu, {mihaeric, hakkanit}@amazon.com


* This work was done while the first author was an intern at Amazon.
Abstract

A long-standing goal of task-oriented dialogue research is the ability to flexibly adapt dialogue models to new domains. To advance research in this direction, we introduce DialoGLUE (Dialogue Language Understanding Evaluation), a public benchmark consisting of 7 task-oriented dialogue datasets covering 4 distinct natural language understanding tasks, designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning. We release several strong baseline models, demonstrating performance improvements over a vanilla BERT architecture and state-of-the-art results on 5 out of 7 tasks, by pre-training on a large open-domain dialogue corpus and task-adaptive self-supervised training. Through the DialoGLUE benchmark, the baseline methods, and our evaluation scripts, we hope to facilitate progress towards the goal of developing more general task-oriented dialogue models.

1 Introduction

One of the ultimate goals of task-oriented conversational systems is the ability to flexibly bootstrap new dialogue functionalities across diverse domains of user interest. For example, once we have successfully built a dialogue assistant that can handle restaurant booking queries, we would ideally like that assistant to quickly start servicing hotel reservation queries without too much additional manual effort. Unfortunately, in the modern conversational assistant ecosystem, the work required to scale up functionalities is often linear with respect to the number of desired domains.

For modelling improvements to claim meaningful progress towards generality, the improvements must extend beyond a single dataset and instead hold across several different dialogue tasks and corpora. We argue that one of the core roadblocks in progressing the generality of conversational systems toward this desired state is a lack of standardization in both datasets and evaluation used by the community. These problems have been noted more broadly in the natural language understanding field, inspiring numerous efforts to propose unified benchmarks spanning multiple downstream tasks across different corpora with consolidated evaluation procedures (Wang et al., 2018, 2019; McCann et al., 2018).

While today there is a reasonable abundance of available corpora for building data-driven dialogue systems (Serban et al., 2018), little work has been done to bring together the efforts of researchers to reflect the properties we care about in our systems: statistical learning that is data-efficient and robustly transfers across domains and tasks.

In this work we propose DialoGLUE, a public benchmark consisting of 7 diverse task-oriented spoken-language datasets across 4 distinct natural language understanding tasks: intent prediction, slot filling, semantic parsing, and dialogue state tracking. Many of these datasets include multiple system functionalities, and in total the DialoGLUE benchmark covers over 40 different domains. Our benchmark is designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning. Furthermore, it consists entirely of previously-published datasets with reported results, as these resources have been vetted by the broader community to be sufficiently difficult and interesting.

As part of DialoGLUE, we also release evaluation scripts and competitive BERT-based baselines on the downstream tasks. We introduce ConvBERT, a BERT-base model trained on a large open-domain dialogue corpus. In combination with task-adaptive pre-training and multi-tasking, ConvBERT matches or exceeds state-of-the-art results on five of the seven DialoGLUE datasets. Most notably, we attain a +2.98 improvement in joint goal accuracy over the best dialogue state tracking models on the MultiWOZ corpus. While our baselines demonstrate the efficacy of task-adaptive fine-tuning in transferring across datasets, there is still plenty of headroom in the aggregated benchmark scores, encouraging further research.

In summary, the contributions of this work are four-fold: (1) a challenging task-oriented dialogue benchmark consisting of 7 distinct datasets spanning 4 tasks, (2) standardized evaluation measures for reporting results, (3) competitive baselines across the tasks, and (4) a public leaderboard for tracking scientific progress on the benchmark.

The DialoGLUE leaderboard and evaluation scripts are hosted using the EvalAI platform (Yadav et al., 2019). Our code and baseline models are open-sourced.

2 Related Work

2.1 NLP Benchmarks

In part, the development of unified benchmarks has helped drive progress towards more general models of language. The GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks have provided consistent benchmarks that allow pre-trained language models (Devlin et al., 2018; Radford et al., 2018) to be evaluated on a variety of tasks.

While GLUE and SuperGLUE are concentrated on language understanding tasks, decaNLP (McCann et al., 2018) consists of a broader set of NLP tasks including question answering, machine translation and summarization.

Within conversational systems, the recently proposed Dialogue Dodecathlon is a benchmark for knowledge-grounded, situated, and multi-modal dialogue generation consisting of several open-domain dialogue datasets (Shuster et al., 2019). In contrast to the Dialogue Dodecathlon, DialoGLUE focuses on task-oriented dialogue and language understanding.

DialoGLUE is inspired by GLUE and SuperGLUE; however, our efforts differ in that we aim to produce a benchmark for natural language understanding in the context of task-oriented dialogue. Task-oriented dialogue poses a unique set of challenges; dialogue is intrinsically goal-driven, multi-turn, and often informal/noisy (Henderson et al., 2019; Zhang et al., 2019b). Through DialoGLUE, we hope to provide a benchmark for assessing models that aim to tackle these unique challenges across several tasks within task-oriented dialogue.

2.2 Pre-trained Models

Large-scale pre-training has attained significant performance gains across many tasks within NLP (Devlin et al., 2018; Radford et al., 2018). Through self-supervised pre-training on large natural language corpora, these models gain generalized language understanding capabilities that transfer effectively to downstream tasks (Wang et al., 2018), including intent prediction (Chen et al., 2019; Castellucci et al., 2019) and dialogue state tracking (Heck et al., 2020). However, recent work has demonstrated that leveraging dialogue-specific pre-trained models, such as ConveRT (Henderson et al., 2019; Casanueva et al., 2020), yields better results.

Large-scale pre-training on open-domain dialogue has demonstrated surprising generality, with models like DialoGPT (Zhang et al., 2019b), Meena (Adiwardana et al., 2020) and Blender (Roller et al., 2020) achieving response generation performance competitive with humans in certain settings. ConveRT has demonstrated that pre-training on open-domain dialogue transfers to task-oriented dialogue, with significant performance improvements over BERT on both intent prediction (Casanueva et al., 2020) and slot filling (Coope et al., 2020). Inspired by these results, we fine-tune BERT (Devlin et al., 2018) on a large open-domain dialogue corpus prior to fine-tuning on the downstream tasks of DialoGLUE.

2.3 Task-Adaptive Training

Though large-scale pre-training results in strong performance when transferring to downstream tasks, performing self-supervised training on a target dataset allows the model to better adapt to the dataset prior to fine-tuning (Mehri et al., 2019; Gururangan et al., 2020). Since our target datasets differ from the pre-training data, we hypothesize that task-adaptive training will allow pre-trained models to better adapt to task-oriented dialogue.

3 DialoGLUE Tasks

The DialoGLUE benchmark consists of 7 different datasets spanning 4 different tasks: intent prediction, slot filling, semantic parsing and dialogue state tracking. These datasets all share the common goal of language understanding in the context of dialogue, making them suitable for DialoGLUE.

3.1 Intent Prediction

Intent prediction is the task of extracting meaning from a natural language utterance in order to understand the user’s goal (Hemphill et al., 1990; Coucke et al., 2018). We leverage three different datasets for the task of intent prediction, all of which span several domains and consist of many different intents.

banking77 (Casanueva et al., 2020) contains 13,083 utterances related to banking with 77 different fine-grained intents. Despite consisting of only a single domain, this dataset is challenging as it requires fine-grained differentiation between very similar intents, such as card payment wrong exchange rate and wrong exchange rate for cash withdrawal.

clinc150 (Larson et al., 2019) contains 23,700 utterances spanning 10 domains (e.g., travel, kitchen/dining, utility, small talk) and 150 different intent classes. This dataset also includes out-of-scope utterances, which do not belong to any of the other intents and must therefore be classified as out of scope.

hwu64 (Liu et al., 2019) includes 25,716 utterances for 64 intents spanning 21 domains (e.g., alarm, music, IoT, news, calendar, etc.). The domains and intents of this dataset are similar to ones we expect users to ask a virtual assistant (e.g., Alexa, Google Assistant, Siri).

Casanueva et al. (2020) forego a validation set when using these datasets for intent prediction and instead only use a training and testing set. We instead designate a portion of the training set to be the validation set.

3.2 Slot Filling

Slot filling, a vital component of task-oriented dialogue systems, is the task of identifying values for pre-defined attributes in a natural language utterance. Following the setup of Coope et al. (2020), we include two slot filling datasets in DialoGLUE.

restaurant8k (Coope et al., 2020) comprises 8,198 utterances from a commercial restaurant booking system and covers 5 slots (date, time, people, first name, last name).

DSTC8 (Rastogi et al., 2019) consists of slot annotations extracted by Coope et al. (2020) spanning 4 domains (buses, events, homes, rental cars) for a total of 5,569 utterances.

For both these datasets, the value for a particular slot will always be a contiguous span of the utterance. For some utterances, the expected slot (e.g., “people”, “time”) is provided. This allows an otherwise ambiguous utterance like “four” to be interpreted either as “four people” or “four o’clock”.

3.3 Semantic Parsing

TOP (Gupta et al., 2018) is a dataset of 44k utterances wherein each utterance is annotated with a hierarchical semantic representation. Each utterance has a top-level intent, which serves as the root of the tree. Every word of the utterance is a leaf node of the tree. The sub-trees correspond to different slots and intents, with the dataset having an average tree depth of 2.54. TOP covers two different domains, navigation (directions, distance, traffic) and events.

3.4 Dialogue State Tracking

MultiWOZ (Budzianowski et al., 2018; Eric et al., 2019) is a multi-domain dialogue dataset that contains 7 domains, including restaurants, hotels, and attractions. We use MultiWOZ for dialogue state tracking, the task of interpreting the user utterances throughout a dialogue in order to maintain a state representation of their requests. Dialogue state tracking is an important component of task-oriented dialogue systems; in order to fulfill a user request, it is necessary to track the user’s goals over the course of multiple turns.

4 Methods

In this section, we describe several methods employed for the DialoGLUE tasks. We begin by describing the architectures for the 4 different tasks, all of which are built around an underlying BERT-like model. We then discuss ConvBERT, a BERT model that was fine-tuned on a large open-domain dialogue corpus. Finally, we describe our use of task-adaptive masked language modelling, which allows us to better adapt our pre-trained models to the DialoGLUE tasks.

4.1 Architectures

Intent Prediction: We fine-tune a BERT model to encode an utterance and predict its intent. Specifically, we use the pooled representation output by a BERT model and pass it through a linear layer to predict the intent class.
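As an illustration, the following minimal sketch shows this formulation; it assumes a PyTorch/Hugging Face implementation, and the class names and example values are illustrative rather than our released baseline code.

```python
# Minimal sketch of the intent prediction head: BERT pooled output -> linear layer.
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class IntentClassifier(nn.Module):
    def __init__(self, num_intents, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_intents)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output      # pooled [CLS] representation
        return self.classifier(pooled)      # logits over intent classes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier(num_intents=77)    # e.g., banking77
batch = tokenizer(["how do I reset my card PIN?"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```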

Slot Filling: For slot filling, we represent the problem as IOB tagging (Ramshaw and Marcus, 1999), wherein every token in the utterance is labeled as either the beginning of a slot value (B-), inside a slot value (I-), or not belonging to any slot value (O). We use BERT to produce a latent representation of each token, which is passed through a linear layer that predicts the appropriate tag (e.g., “B-time”, “I-people”).
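A corresponding sketch of the slot filling head follows, again assuming a Hugging Face-style implementation; the tag inventory shown is an illustrative rendering of the restaurant8k slots, not the released code.

```python
# Minimal sketch of the IOB-tagging head: per-token BERT representations -> linear layer over tags.
import torch.nn as nn
from transformers import BertModel

class SlotTagger(nn.Module):
    def __init__(self, tags, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, len(tags))

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)   # one tag distribution per (sub)token

# Illustrative tag set for restaurant8k's five slots:
tags = ["O"] + [f"{prefix}-{slot}"
                for slot in ["date", "time", "people", "first_name", "last_name"]
                for prefix in ("B", "I")]
tagger = SlotTagger(tags)
```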

Semantic Parsing: We transform the hierarchical representations of the TOP dataset into (i) a top-level intent for the utterance which corresponds to the root of the tree and (ii) a label for each word of the utterance which is the path from the root to each leaf node (which is always a word). Given this, we train a model to simultaneously predict the top-level intent for the utterance using the BERT pooled representation and the labels for each word using BERT’s latent representation of each word.
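For concreteness, the conversion described above can be sketched as follows, assuming TOP's standard bracketed serialization (space-separated tokens with [IN:/[SL: openers and ] closers); the exact preprocessing used for our experiments may differ in detail.

```python
# Sketch: extract (top-level intent, per-word root-to-leaf path labels) from a TOP tree.
def top_to_labels(serialized):
    stack, word_labels = [], []
    top_intent = None
    for token in serialized.split():
        if token.startswith("["):          # opening a nested intent/slot node
            label = token[1:]              # e.g. "IN:GET_EVENT" or "SL:CATEGORY_EVENT"
            if top_intent is None:
                top_intent = label
            stack.append(label)
        elif token == "]":                 # closing the current subtree
            stack.pop()
        else:                              # a word: its label is the path from the root
            word_labels.append((token, tuple(stack)))
    return top_intent, word_labels

intent, labels = top_to_labels(
    "[IN:GET_EVENT any [SL:CATEGORY_EVENT concerts ] nearby ]")
# intent == "IN:GET_EVENT"
# labels == [("any", ("IN:GET_EVENT",)),
#            ("concerts", ("IN:GET_EVENT", "SL:CATEGORY_EVENT")),
#            ("nearby", ("IN:GET_EVENT",))]
```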

Dialogue State Tracking: Our state tracking architecture is inspired by TripPy (Heck et al., 2020), which uses an underlying BERT model and a triple copy strategy to perform state tracking. The TripPy model uses (i) span prediction and a copy mechanism to extract values from the user utterance, (ii) a copy mechanism over concepts mentioned in the system utterance, and (iii) a copy mechanism over the DS memory, i.e., the existing dialogue state.

These architectures are held consistent throughout our experiments. Given a BERT-like encoder, we can plug it into all of the aforementioned architectures and evaluate its performance on all of the DialoGLUE tasks. This allows us to assess the quality of the underlying encoders, with confidence that performance improvements come from the improved representational power of the encoders.

4.2 ConvBERT

Though pre-trained models (e.g., BERT) have exhibited strong language understanding capabilities, recent work has suggested that they may be insufficient for modelling dialogue, due to dialogue’s intrinsically goal-driven, linguistically diverse, multi-turn and often informal/noisy nature (Henderson et al., 2019; Zhang et al., 2019b). The unique challenges of modelling dialogue have been addressed by training on large amounts of conversational data from online forums. We extend these efforts by fine-tuning BERT on a large open-domain dialogue corpus consisting of nearly 700 million conversations to produce ConvBERT.

By training ConvBERT with large amounts of open-domain dialogue, we hypothesize that it is better able to produce semantically meaningful latent representations of utterances and multi-turn dialogues than a BERT model. Specifically, we fine-tune an uncased BERT-base model for 4 epochs using a masked language modelling objective. Our input representation is the last 3 turns of dialogue context, followed by the [SEP] token and then the dialogue response. We truncate the entire input to a sequence length of 72, and we train using the Adam optimizer with an initial learning rate of 3 × 10^-4.
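A sketch of this input construction and the MLM objective is shown below. Only the stated choices (last 3 context turns, a [SEP] token before the response, maximum length 72, masked language modelling) come from our setup; the helper name, corpus example, and collator usage are illustrative.

```python
# Sketch: build a ConvBERT-style training example and apply MLM masking.
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def make_example(context_turns, response, max_len=72):
    # last three turns of context, then [SEP], then the dialogue response
    text = " ".join(context_turns[-3:]) + " [SEP] " + response
    return tokenizer(text, truncation=True, max_length=max_len)

example = make_example(
    ["hi", "hey, how can I help?", "I need a table for four"],
    "sure, for what time?")
batch = collator([example])   # pads the batch and adds masked-token labels for the MLM loss
```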

4.3 Task-Adaptive Training

Task-adaptive training is the process of adapting a pre-trained model to a specific task or domain, by performing self-supervised training prior to fine-tuning on the downstream task. This has been shown to help with domain adaptation (Mehri et al., 2019Gururangan et al., 2020). To adapt BERT-based encoders to the various DialoGLUE tasks, we leverage task-adaptive training. Specifically, we perform self-supervised training with the masked language modelling (MLM) objective on each dataset. We explore both (i) pre-training with MLM prior to fine-tuning on the specific task and (ii) multi-tasking by simultaneously performing self-supervised MLM and fine-tuning on the task. An example experimental setting is as follows: (1) start with the pre-trained ConvBERT model, (2) do MLM pre-training on the utterances of HWU, (3) fine-tune on HWU using the intent prediction architecture and simultaneously perform self-supervised MLM training on the utterances of HWU.
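Schematically, this recipe can be summarized by the following sketch, with placeholder functions standing in for the task-specific and MLM training steps; the per-epoch alternation and the patience-based stopping criterion follow the setup described in Section 5.1.

```python
# Schematic of task-adaptive training: MLM pre-training on the target dataset,
# then fine-tuning multi-tasked with MLM, alternating once per epoch.
def task_adaptive_training(model, task_data, mlm_data,
                           train_task_epoch, train_mlm_epoch, evaluate,
                           mlm_pretrain_epochs=3, patience=10):
    # (1) task-adaptive MLM pre-training on the target dataset
    for _ in range(mlm_pretrain_epochs):
        train_mlm_epoch(model, mlm_data)

    # (2) fine-tuning, multi-tasked with MLM, until validation performance
    #     stops improving for `patience` epochs
    best, stale = float("-inf"), 0
    while stale < patience:
        train_task_epoch(model, task_data)   # supervised objective
        train_mlm_epoch(model, mlm_data)     # self-supervised MLM objective
        score = evaluate(model, task_data)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
    return model
```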

To further study the benefits of self-supervised training, we fine-tune both BERT and ConvBERT with masked language modelling over all the DialoGLUE datasets. In this manner we aim to adapt the two pre-trained models to task-oriented dialogue through self-supervised training.

5 Experiments

Model               Average   banking77   hwu64   clinc150   restaurant8k   DSTC8   TOP     MultiWOZ
BERT                86.08     93.02       89.87   95.93      95.53          90.05   81.90   56.30
  + Pre             86.18     92.34       91.82   96.27      95.78          89.48   81.54   56.07
  + Multi           85.97     92.27       90.99   96.22      95.61          89.93   81.46   55.30
  + Pre, Multi      85.92     93.20       90.99   95.67      95.04          89.96   82.08   55.06
ConvBERT            86.01     92.95       90.43   97.07      95.90          87.58   82.13   56.00
  + Pre             86.19     93.25       92.84   97.09      95.33          87.02   82.00   55.67
  + Multi           85.97     93.20       91.36   97.09      95.39          90.02   82.63   56.48
  + Pre, Multi      86.89     93.44       92.38   97.11      95.44          91.20   82.08   56.56
BERT-DG             86.11     91.75       90.89   95.98      95.23          90.24   81.16   57.54
  + Pre             86.16     92.01       91.26   96.20      94.61          89.79   81.29   58.00
  + Multi           86.38     92.53       90.61   95.89      95.44          90.81   81.04   58.34
  + Pre, Multi      86.18     92.57       91.26   96.22      95.11          88.69   80.89   58.53
ConvBERT-DG         82.90     93.21       91.64   96.96      93.44          74.54   72.22   58.57
  + Pre             84.10     93.05       92.94   97.11      95.38          90.88   60.68   58.65
  + Multi           82.78     93.02       91.73   97.13      95.93          88.97   53.97   58.70
  + Pre, Multi      85.34     92.99       91.82   97.11      94.34          86.49   76.36   58.29

Table 1: Full data experiments on DialoGLUE. The average score on the DialoGLUE benchmark is shown in the leftmost column.

5.1 Experimental Setup

The experiments are carried out with four BERT-like models: (1) BERT-base, (2) ConvBERT which is BERT trained on open-domain dialogues, (3) BERT-DG which is BERT trained on the full DialoGLUE data in a self-supervised manner with masked language modelling and (4) ConvBERT-DG which is ConvBERT trained on the full DialoGLUE data in a self-supervised manner.

We carry out experiments with each of these four models in four different settings: (1) directly fine-tuning on the target task, (2) pre-training with MLM on the target dataset prior to fine-tuning, (3) multi-tasking with MLM on the target dataset during fine-tuning and (4) both pre-training and multi-tasking with MLM.

We perform self-supervised MLM pre-training for 3 epochs prior to fine-tuning. Fine-tuning on a target task is carried out until the performance on the validation set does not improve for 10 epochs. When multi-tasking, we alternate between fine-tuning on the target task and self-supervised training with MLM after every epoch.

To assess the effectiveness of our pre-trained models for few-shot learning, we carry out few-shot experiments. In these experiments, self-supervised training is performed only on the few-shot data, which is 10% of the full data. The MLM pre-training and multi-tasking are performed with only the few-shot versions of each dataset. However, both BERT-DG and ConvBERT-DG are trained with the full DialoGLUE data, albeit in a self-supervised manner, meaning that they see more in-domain data than either BERT or ConvBERT in the few-shot experiments. For all the few-shot experiments, we train five times with different random seeds and report the average performance across the five runs.



banking77 (accuracy)
USE (Casanueva et al., 2020)              92.81
ConveRT (Casanueva et al., 2020)          93.01
USE + ConveRT (Casanueva et al., 2020)    93.36
ConvBERT + Pre + Multi                    93.44

hwu64 (accuracy)
USE (Casanueva et al., 2020)              91.25
ConveRT (Casanueva et al., 2020)          91.24
USE + ConveRT (Casanueva et al., 2020)    92.62
ConvBERT-DG + Pre                         92.94

clinc150 (accuracy)
USE (Casanueva et al., 2020)              95.06
ConveRT (Casanueva et al., 2020)          97.16
USE + ConveRT (Casanueva et al., 2020)    97.16
ConvBERT-DG + Multi                       97.13

restaurant8k (F-1)
Span-BERT (Coope et al., 2020)            93.00
V-CNN-CRF (Coope et al., 2020)            94.00
Span-ConveRT (Coope et al., 2020)         96.00
ConvBERT-DG + Multi                       95.93

DSTC8 (F-1)
Span-BERT (Coope et al., 2020)            91.50
V-CNN-CRF (Coope et al., 2020)            91.25
Span-ConveRT (Coope et al., 2020)         94.00
ConvBERT + Pre + Multi                    91.20

TOP (Exact Match)
RNNG (Gupta et al., 2018)                 78.51
SR + ELMo (Einolghozati et al., 2019)     87.25
Seq2Seq-PTR (Rongali et al., 2020)        86.67
ConvBERT + Multi                          82.63

MultiWOZ (Joint Goal Accuracy)
DST-Picklist (Zhang et al., 2019a)        53.30
TripPy (Heck et al., 2020)                55.30
SimpleTOD (Hosseini-Asl et al., 2020)     55.72
ConvBERT-DG + Multi                       58.70

Table 2: Comparison to prior work on all seven datasets. We match or exceed state-of-the-art results on five of the seven datasets, with significant improvements (+3 points) on the MultiWOZ corpus.

5.2 Evaluation

Our evaluation metrics are consistent with prior work on these datasets. For intent prediction (banking77, clinc150, hwu64) we use accuracy. For the slot filling tasks (restaurant8k, DSTC8), we use macro-averaged F-1 score as defined by Coope et al. (2020). We use exact-match for TOP, which measures how often we exactly reconstruct the hierarchical semantic representation. For MultiWOZ, we use joint goal accuracy.
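For reference, illustrative versions of two of these metrics are sketched below. They follow the standard definitions (exact reproduction of the full semantic tree for TOP, and an exact match of the full predicted dialogue state at every turn for MultiWOZ) rather than the released evaluation scripts.

```python
# Sketches of exact match (TOP) and joint goal accuracy (MultiWOZ).
def exact_match(pred_trees, gold_trees):
    # credit only when the full hierarchical representation is reproduced
    correct = sum(p == g for p, g in zip(pred_trees, gold_trees))
    return correct / len(gold_trees)

def joint_goal_accuracy(pred_states, gold_states):
    # each state is a dict mapping "domain-slot" -> value for one dialogue turn;
    # a turn counts only if every slot-value pair matches the gold state exactly
    correct = sum(p == g for p, g in zip(pred_states, gold_states))
    return correct / len(gold_states)
```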

Model               Average   banking77   hwu64   clinc150   restaurant8k   DSTC8   TOP     MultiWOZ
BERT                66.07     79.87       81.69   89.52      87.28          45.05   74.38   4.69
  + Pre             66.57     80.72       83.05   89.73      86.37          47.17   74.41   4.55
  + Multi           66.11     79.89       82.32   89.69      87.53          44.92   74.45   3.95
  + Pre, Multi      66.87     81.49       82.70   90.53      86.34          48.55   74.17   4.29
ConvBERT            68.03     83.63       83.77   92.10      86.90          49.08   74.86   5.90
  + Pre             67.36     83.68       83.77   92.10      86.90          45.20   74.92   5.09
  + Multi           68.16     83.15       82.32   92.33      86.71          50.49   75.21   5.48
  + Pre, Multi      68.22     83.99       84.52   92.75      86.17          48.40   78.84   6.87
BERT-DG             72.70     81.47       83.23   90.57      85.31          43.85   74.80   49.70
  + Pre             72.80     81.79       83.74   90.44      86.66          43.45   74.34   49.40
  + Multi           73.00     81.60       83.18   90.43      86.48          44.86   74.79   49.67
  + Pre, Multi      72.90     81.08       83.40   90.09      86.26          46.32   73.56   49.86
ConvBERT-DG         73.75     84.42       85.17   92.87      87.65          41.94   75.27   48.94
  + Pre             74.10     84.74       85.63   93.16      86.95          43.61   75.32   49.26
  + Multi           74.35     84.09       85.74   93.14      87.48          45.31   75.37   49.35
  + Pre, Multi      73.80     85.06       85.69   93.06      87.58          44.36   72.01   48.89

Table 3: Few-shot data experiments on DialoGLUE. The values in this table are averaged across five runs with different random seeds.

5.3 Results

The results of the full data experiments are shown in Table 1. We attain a performance gain over the vanilla BERT model (Devlin et al., 2018) across all seven datasets. These results highlight the efficacy of both the ConvBERT model and the task-adaptive self-supervised training. Across four datasets, the best results are attained by ConvBERT with both MLM pre-training and multi-tasking. We compare to prior work in Table 2, demonstrating that ConvBERT, in combination with task-adaptive training, matches or exceeds state-of-the-art performance on five of the seven datasets. We attain a +2.98 improvement in joint goal accuracy on the dialogue state tracking task of MultiWOZ. These strong results, which hold across several datasets, suggest that large-scale pre-training on open-domain dialogue data, in combination with task-adaptive self-supervised training, transfers effectively to several task-oriented dialogue tasks.

When looking at the aggregate performance across all the DialoGLUE tasks, neither ConvBERT nor task-adaptive training on its own attains improvements over BERT. However, by combining these two approaches, there is a +0.81 improvement in the average score. This suggests that through large-scale pre-training on open-domain dialogue, ConvBERT learns skills that are valuable for DialoGLUE; however, it is only through task-adaptive training that these skills are transferred effectively to the downstream tasks.

A noteworthy outcome of these experiments is that the BERT model with task-adaptive self-supervised training sometimes outperforms ConvBERT. This indicates that in certain settings, it is more beneficial to perform self-supervised training on the downstream dataset rather than on a much larger dialogue dataset.

Performing self-supervised training across the combination of the DialoGLUE datasets gives mixed results. ConvBERT-DG attains a significant performance gain on MultiWOZ, suggesting that self-supervised training on other task-oriented dialogue corpora helps significantly in modelling MultiWOZ dialogues. Across other datasets, it is only marginally better than the ConvBERT model and sometimes worse. Aside from the unique case of MultiWOZ, it appears that self-supervised training with additional task-oriented dialogue data, beyond just the dataset in question, does not provide significant improvements. For two datasets, DSTC8 and TOP, there is a decrease in performance which may be indicative of catastrophic forgetting. Namely, the ConvBERT-DG model may have lost the language understanding capabilities captured by the ConvBERT model through the additional self-supervised training, and only partially recovers this through the task-specific self-supervised training. Future work should explore better mechanisms for performing self-supervised training across the combination of the DialoGLUE datasets, as well as multi-tasking across the seven tasks.

While our models achieve state-of-the-art performance across five of the seven tasks, they underperform on TOP and DSTC8. On the TOP dataset, the best models use sophisticated architectures which have been tailored to the task of semantic parsing (Einolghozati et al., 2019; Rongali et al., 2020). With the DialoGLUE benchmark, our objective is to improve the underlying language encoders in a manner that results in consistent performance gains across all of the tasks. We are more concerned with the aggregate improvement across the DialoGLUE benchmark than with the performance on a single task. As such, we try to avoid complex task-specific architectures when simpler models achieve competitive results.

The results of the few-shot experiments are shown in Table 3. The few-shot experiments are particularly important for assessing the generalizability of the methods and their ability to transfer to downstream tasks. In low data environments, self-supervised training on the entirety of the DialoGLUE datasets results in performance gains – with BERT-DG and ConvBERT-DG doing better than BERT and ConvBERT respectively. However, this is not entirely surprising as these models are exposed to more utterances from every dataset, albeit without any of the labels.

Most significantly, on MultiWOZ we see a difference of more than 40 points in joint goal accuracy between BERT-DG and ConvBERT-DG and their counterparts BERT and ConvBERT. For state tracking in particular, it appears that seeing additional dialogue data in a self-supervised setting results in significant improvements. This may suggest that dialogue state tracking is more dependent on having semantically meaningful representations of dialogue.

Self-supervised training on the same dataset also helps significantly in few-shot environments. Across almost every dataset, the best result is obtained through some form of task-adaptive MLM training. Especially in settings with fewer training examples, adapting the pre-trained models to the domains of the dataset is necessary for good performance on the downstream task.

ConvBERT is also far more effective in the few-shot experiments than in the full data experiments, with a +1.96 point improvement in the aggregate score over BERT. While the full datasets may be sufficient to effectively transfer BERT to task-oriented dialogue, with only 10% of the data the benefits of the large-scale open-domain pre-training are far clearer.

6 Conclusion

To facilitate research into producing generalizable models of dialogue, we introduce DialoGLUE, a benchmark for natural language understanding in the context of task-oriented dialogue. We experiment with several baseline methods for the DialoGLUE benchmark, demonstrating the efficacy of large-scale pre-training on open-domain dialogue and task-adaptive self-supervised training.

To improve performance on DialoGLUE, future work should explore: (1) Large-scale pre-training that attains generalized language understanding capabilities which transfer effectively to task-oriented dialogue. (2) Mechanisms of adapting pre-trained models to specific tasks, beyond the task-adaptive masked language modelling we explore. In particular, we believe there to be potential in extending our preliminary exploration of self-supervised training on the combination of all the DialoGLUE datasets. (3) Multi-tasking across the seven datasets of DialoGLUE, as a means of transferring skills across the datasets.

The results of our experiments demonstrate that there is significant room for improvement on DialoGLUE, particularly in the few-shot settings. The DialoGLUE benchmark is hosted publicly, and we invite the research community to submit to the leaderboard.

References

   Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

   Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

   Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient intent detection with dual sentence encoders. arXiv preprint arXiv:2003.04807.

   Giuseppe Castellucci, Valentina Bellomaria, Andrea Favalli, and Raniero Romagnoli. 2019. Multi-lingual intent detection and slot filling in a joint bert-based model. arXiv preprint arXiv:1907.02884.

   Qian Chen, Zhu Zhuo, and Wen Wang. 2019. Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.

   Sam Coope, Tyler Farghly, Daniela Gerz, Ivan Vulić, and Matthew Henderson. 2020. Span-convert: Few-shot span extraction for dialog with pretrained conversational representations. arXiv preprint arXiv:2005.08866.

   Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

   Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

   Arash Einolghozati, Panupong Pasupat, Sonal Gupta, Rushin Shah, Mrinal Mohit, Mike Lewis, and Luke Zettlemoyer. 2019. Improving semantic parsing for task oriented dialog. arXiv preprint arXiv:1902.06000.

   Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. Multiwoz 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.

   Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic parsing for task oriented dialog using hierarchical representations. arXiv preprint arXiv:1810.07942.

   Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.

   Michael Heck, Carel van Niekerk, Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Marco Moresi, and Milica Gašić. 2020. Trippy: A triple copy strategy for value independent neural dialog state tracking. arXiv preprint arXiv:2005.02877.

   Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

   Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Ivan Vulić, et al. 2019. Convert: Efficient and accurate conversational representations from transformers. arXiv preprint arXiv:1911.03688.

   Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.

   Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. 2019. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1311–1316, Hong Kong, China. Association for Computational Linguistics.

   Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. 2019. Benchmarking natural language understanding services for building conversational agents. arXiv preprint arXiv:1903.05566.

   Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. ArXiv, abs/1806.08730.

   Shikib Mehri, Evgeniia Razumovsakaia, Tiancheng Zhao, and Maxine Eskenazi. 2019. Pretraining methods for dialog context representation learning. arXiv preprint arXiv:1906.00414.

   Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

   Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora, pages 157–176. Springer.

   Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.

   Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.

   Subendhu Rongali, Luca Soldaini, Emilio Monti, and Wael Hamza. 2020. Don’t parse, generate! a sequence to sequence architecture for task-oriented semantic parsing. In Proceedings of The Web Conference 2020, pages 2962–2968.

   Iulian V. Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems. ArXiv, abs/1512.05742.

   Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. 2019. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768.

   Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. ArXiv, abs/1905.00537.

   Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

   Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, and Dhruv Batra. 2019. Evalai: Towards better evaluation systems for ai agents.

   Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S Yu, Richard Socher, and Caiming Xiong. 2019a. Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.

   Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019b. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.