MVP: Multi-task Supervised Pre-training for Natural Language Generation

Tianyi Tang1,4, Junyi Li1,3, Wayne Xin Zhao1,4 (corresponding author), Ji-Rong Wen1,2,4
1Gaoling School of Artificial Intelligence, Renmin University of China
2School of Information, Renmin University of China
3DIRO, Université de Montréal
4Beijing Key Laboratory of Big Data Management and Analysis Methods
steventianyitang@outlook.com lijunyi@ruc.edu.cn batmanfly@gmail.com

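Abstract

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Up to now, most NLG-oriented PLMs are pre-trained in an unsupervised manner using large-scale general corpora. Meanwhile, an increasing number of models pre-trained with labeled data (i.e., "supervised pre-training") have shown superior performance compared to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. We then unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train specific soft prompts to stimulate the model's capacity to perform that task. Our MVP model can be seen as a practice that applies recent instruction tuning to relatively small PLMs. Extensive experiments demonstrate the effectiveness and generality of our MVP model on a number of NLG tasks: it achieves state-of-the-art performance on 13 out of 17 datasets, outperforming BART by 9.3% and Flan-T5 by 5.8%.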

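1 Introduction

Natural language generation (NLG, also known as text generation) is a crucial capacity for language intelligence, which aims to generate human-like texts on demand (Garbacea and Mei, 2020). Since the emergence of the pre-training and fine-tuning paradigm, pre-trained language models (PLMs) have dominated the mainstream approaches for NLG tasks (Lewis et al., 2020; Brown et al., 2020). With a large-scale general corpus, the majority of PLMs are pre-trained in an unsupervised (self-supervised) manner by leveraging intrinsic data correlations as supervision signals. However, unsupervised pre-training is likely to incorporate noise that affects the performance of downstream tasks (Feng et al., 2022), and it also leads to a slower rate of acquiring knowledge (Zhang et al., 2021).

In the meanwhile, more and more large-scale labeled datasets have become easily accessible (Deng et al., 2009; Liu et al., 2020). There is growing evidence that pre-training with labeled data can further improve the performance of PLMs, both in computer vision (He et al., 2016; Dosovitskiy et al., 2021) and in natural language processing (Lin et al., 2020b; Su et al., 2022). These promising developments motivate us to consider pre-training text generation models with labeled data, which is called "supervised pre-training" (Feng et al., 2022). Existing work has shown that supervised pre-training can explicitly learn task-specific characteristics and alleviate the discrepancy between unsupervised pre-training and supervised fine-tuning (Lin et al., 2020b).

Furthermore, most NLG systems are trained in a supervised way, requiring supervision signals to learn the input-to-output transformation. For example, dialogue systems learn to generate appropriate responses based on historical utterances, and text summarization systems learn to extract essential information from long documents according to human-written summaries. Therefore, we suspect that supervised pre-training is inherently better suited for NLG-oriented PLMs, since it can provide task-related instructions early, in the pre-training stage, instead of later, in the fine-tuning stage.

Inspired by the recent success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation by leveraging a variety of labeled text generation datasets. Specifically, we collect a large-scale labeled corpus, MVPCorpus, consisting of 77 datasets over 11 text generation tasks. Since recent research shows that an extensive scale of multi-task pre-training (Aribandi et al., 2022) is the key to generalizing to new tasks for large PLMs, we combine these labeled datasets for multi-task pre-training. Existing popular works, as shown in Table 1, mainly focus on NLU tasks (Sanh et al., 2022; Aribandi et al., 2022) or use unsupervised pre-training (Lewis et al., 2020; Raffel et al., 2020), with no consideration of supervised pre-training on NLG tasks. To fill this gap, we explore supervised pre-training and multi-task learning for deriving NLG models that are both effective and general.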

Settings Supervised Pre-training Unsupervised Pre-training
NLG MVP (ours) GPT-2, MASS, BART, T5
NLU FLAN, T0, Muppet, ExT5 BERT, XLNet, RoBERTa, T5
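Table 1: Representative PLMs for NLG and NLU tasks using (un)supervised pre-training. We present a more detailed comparison and discussion of supervised pre-training in Section 5.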

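To develop our approach, we adopt a Transformer-based (Vaswani et al., 2017) sequence-to-sequence model as the backbone. In multi-task training, different tasks may "neutralize" the ability learned through other tasks (He and Choi, 2021). To mitigate this potential issue, we propose to learn task-specific prompts based on the MVP model, following the structure of prefix-tuning (Li and Liang, 2021). Task-specific pre-training enables the prompts to "store" specialized knowledge for each corresponding task. Integrating MVP with task-specific prompts can further stimulate the model's capacity to perform specific tasks.

To summarize, our main contributions center around the following research questions:

  • How to train an NLG-oriented PLM in a supervised pre-training way? To prepare the supervised corpus, we collect a massive labeled MVPCorpus, consisting of 77 datasets over 11 NLG tasks across various domains and specific objectives. To the best of our knowledge, MVPCorpus is the largest collection of NLG datasets. We first formulate different NLG tasks as a general text-to-text form using task instructions, so that the supervised corpus can be used in a unified way for pre-training an NLG model. Our work presents a simple yet general approach for pre-training a more capable NLG model by leveraging various labeled NLG datasets.

  • Can supervised pre-trained NLG models be both effective and general? Extensive experiments show that the supervised pre-trained MVP outperforms the unsupervised pre-trained BART in both full tuning (+9.3% in ratio) and parameter-efficient tuning (+4.3% in ratio) settings. Our MVP model achieves state-of-the-art performance on 13 out of 17 datasets and outperforms Flan-T5 (Chung et al., 2022) by 5.8%. Our zero-shot performance also surpasses T0-11B (Sanh et al., 2022) by a large margin. Furthermore, experiments on unseen NLG and NLU tasks demonstrate that our supervised MVP model generalizes well to unseen tasks.

For reproducing and reusing our work, we release the MVPCorpus collection, all the MVP model variants, and the accompanying code at https://github.com/RUCAIBox/MVP.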

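2 Related Work

Figure 1: Overview of the pre-training process of our MVP model and task-specific prompts.

Pre-trained Language Models.

Pre-trained language models have achieved exceptional success in a wide range of tasks, and the majority of them are pre-trained in an unsupervised manner (Devlin et al., 2019; Brown et al., 2020). For example, with large-scale plain texts as the unsupervised pre-training corpus (570GB), GPT-3 (Brown et al., 2020) employs language modeling as the pre-training task, i.e., predicting the next token conditioned on the previous tokens. In the meanwhile, the computer vision community has benefited greatly from the labeled dataset ImageNet (Deng et al., 2009). Influential models, such as ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2021), leverage ImageNet for pre-training. Inspired by the success of pre-training with labeled data, machine translation researchers explore supervised pre-training (McCann et al., 2017; Lin et al., 2020b). Lin et al. (2020b) attempt to pre-train a translation model, mRASP, with parallel data in multiple languages. Despite using much less pre-training data, mRASP still achieves better performance than translation models pre-trained in an unsupervised manner (Liu et al., 2020). In this paper, we propose to pre-train a universal NLG model in a supervised manner with a collection of labeled datasets (23GB).

Multi-task Learning.

Our pre-training process is also related to multi-task learning (MTL), a method of mixing multiple tasks into a single training process (Collobert and Weston, 2008). A model trained with MTL can benefit from helpful knowledge of relevant tasks, resulting in improved performance (Subramanian et al., 2018). Recently, MT-DNN (Liu et al., 2019a) and Muppet (Aghajanyan et al., 2021) collect tens of datasets in the multi-task procedure and achieve better performance on downstream tasks. The pre-finetuning schema proposed in Muppet shares a similar idea with our study. Aribandi et al. (2022) further combine the denoising pre-training task of T5 (Raffel et al., 2020) with multi-task learning to pre-train a new model, ExT5. MTL has also contributed to sub-fields of text generation, such as open-ended dialogue systems (Zhang et al., 2020), task-oriented dialogue systems (Su et al., 2022), text style transfer (Bujnowski et al., 2020), and question answering (Khashabi et al., 2020). At the same time, researchers explore the transferability of models trained on multi-task datasets (Mishra et al., 2022). FLAN (Wei et al., 2022), T0 (Sanh et al., 2022), ZeroPrompt (Xu et al., 2022), and FLAN-T5 (Chung et al., 2022) investigate the zero-shot or few-shot generalization abilities of large language models (LLMs) (Zhao et al., 2023) trained on numerous task datasets with well-designed prompts. Compared with these works, we aim to explore multi-task learning to derive NLG models that are both effective and general in a supervised pre-training manner.

Prompt Learning.

Prompt learning is a thriving method in the field of NLP. Prompt learning converts fine-tuning text into a format similar to pre-training in order to leverage implicit pre-training knowledge and alleviate the discrepancy between pre-training and fine-tuning (Liu et al., 2021b). GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020) add human-written task prompts to the input text. For instance, T5 prepends "Summarize:" to the input document for summarization tasks. Some researchers also design elaborate prompts for each task and dataset and investigate their effectiveness and robustness (Wei et al., 2022; Sanh et al., 2022). To overcome the constraints of manually constructed prompts, researchers develop continuous (soft) prompts that can be optimized in continuous space (Lester et al., 2021; Qin and Eisner, 2021; Tang et al., 2022b). Considering the random initialization of soft prompts, Gu et al. (2022) propose PPT to pre-train continuous prompts using unlabeled data. SPoT (Vu et al., 2022), UnifiedSKG (Xie et al., 2022), and PTG (Li et al., 2022a) further learn prompts on related tasks and transfer the prompts to new tasks.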

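3 The MVP Model

This section introduces our MVP model: a Multi-task superVised Pre-trained model for natural language generation. The overview of our model is illustrated in Figure 1.

3.1 Data Collection

Formally, the natural language generation (NLG) task aims to generate a sequence of tokens 𝒴 = (y_1, y_2, …, y_n) conditioned on input data 𝒳 (e.g., a piece of text or structured data) (Li et al., 2022b).

In this paper, we collect a large-scale labeled MVPCorpus consisting of 77 labeled datasets from 11 representative NLG tasks (Footnote 1: we do not consider machine translation tasks in this work and focus only on English tasks), including commonsense generation, data-to-text generation, open-ended dialogue system, paraphrase generation, question answering, question generation, story generation, task-oriented dialogue system, text simplification, text style transfer, and text summarization. These datasets come from various domains and are of different sizes. Some datasets are elaborately hand-crafted and thus relatively small in size, while others are created for large-scale weak supervision. A detailed description of these tasks can be found in Appendix A.1.

Next, we convert the different input data 𝒳 of each task into a unified text-to-text format. For instance, we linearize structured data (e.g., knowledge graphs or tables) by concatenating triples or key-value pairs with the special token "[SEP]" for data-to-text generation, and we use the special token "[X_SEP]" to separate the answer and the passage for question generation. The transformed input format for each task can be found in Appendix E.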
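As an illustration of the conversion described above, the following minimal Python sketch linearizes triples with "[SEP]", joins an answer and a passage with "[X_SEP]", and prepends a task instruction. The helper names, the space-joined layout inside each triple, and the answer-before-passage order are assumptions made for illustration only; the formats actually used for MVPCorpus are those listed in Appendix E.

```python
from typing import Iterable, Tuple

SEP = " [SEP] "      # separates linearized triples or key-value pairs (token from the text)
X_SEP = " [X_SEP] "  # separates the answer from the passage (token from the text)


def linearize_triples(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Flatten (subject, relation, object) triples into one string.

    Joining the three fields with single spaces is an assumption.
    """
    return SEP.join(" ".join(triple) for triple in triples)


def question_generation_input(answer: str, passage: str) -> str:
    """Build a question-generation input; the field order is an assumption."""
    return f"{answer}{X_SEP}{passage}"


def with_instruction(instruction: str, text: str) -> str:
    """Prepend a human-written task instruction such as 'Summarize:'."""
    return f"{instruction} {text}"
```

For example, `with_instruction("Summarize:", document)` produces the kind of instruction-prefixed input used for summarization, and `linearize_triples` turns a small knowledge graph into a single input string.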

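We divide MVPCorpus into two parts, used for pre-training and fine-tuning (evaluation), respectively. For supervised pre-training, we utilize 50 datasets from 7 tasks: data-to-text generation, open-ended dialogue system, question answering, question generation, story generation, task-oriented dialogue system, and text summarization. We also eliminate pre-training examples that overlap with the evaluation data to avoid data leakage (more details in Appendix A.2). Finally, we obtain a 25GB supervised pre-training corpus containing 32M examples. The statistics of the pre-training datasets are listed in Table 9.

For evaluation, we utilize the remaining 27 datasets, which are more commonly used in the literature. Among these datasets, 23 are from the 7 tasks used in pre-training. We refer to them as seen tasks and use them to test the effectiveness of our model. The remaining 4 datasets are from the tasks of commonsense generation, paraphrase generation, simplification, and style transfer, respectively. We refer to them as unseen tasks and use them to examine the generality of our model.

3.2 Model Architecture

Our MVP model is built on the standard Transformer encoder-decoder architecture (Vaswani et al., 2017). Compared to decoder-only PLMs such as GPT-3 (Brown et al., 2020) and prefix LMs such as UniLM (Dong et al., 2019), the encoder-decoder architecture is more effective for text generation tasks (Raffel et al., 2020). In the first stage, we pre-train the MVP backbone using a mixture of labeled datasets from the seven tasks. To indicate each task, we apply human-written instructions to each task instance. For example, we write "Summarize:" as the prompt for summarization tasks. The manual instructions for each task are shown in Appendix E.

In the second stage, we freeze the MVP backbone and pre-train a set of task-specific prompts (i.e., continuous vectors) to stimulate the model's capacity to perform specific tasks. Specifically, we follow prefix-tuning (Li and Liang, 2021) to insert continuous vectors at each Transformer layer and learn them using a mixture of the corresponding intra-task datasets, i.e., datasets under the same task (Footnote 2: for instance, we train summarization-specific prompts using summarization datasets, e.g., Newsroom (Grusky et al., 2018), WikiHow (Koupaee and Wang, 2018), and MSNews (Liu et al., 2021a)). Compared to prompt tuning (Lester et al., 2021), which only adds prompts to the input layer, layer-wise prompts are more effective and stable (Liu et al., 2022), especially for NLG tasks. These soft prompts, which are not shared between tasks, encode task-specific semantic knowledge to alleviate the blurring-out problem induced by multi-task learning (He and Choi, 2021).

3.3 Training Details

Our MVP model adopts a Transformer with 12 layers in both the encoder and the decoder (406M parameters), the same model size as BART-large (Lewis et al., 2020). We initialize the backbone with the BART parameters to provide a good starting point for NLG tasks, following previous work (Dong et al., 2019; Zhang et al., 2020). We pre-train the model with a batch size of 8,192 and adopt a temperature-scaled mixing strategy (Raffel et al., 2020) with a rate of T = 2 to mitigate the disparity in tasks and datasets.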
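The temperature-scaled mixing strategy can be sketched as follows. The sketch assumes that each dataset is sampled with probability proportional to its size raised to the power 1/T and that no cap on dataset size is applied (Raffel et al. (2020) additionally describe an artificial size limit); whether mixing is applied per dataset or per task in MVP is not stated here. The function name and the example sizes are hypothetical.

```python
from typing import Dict


def temperature_mixing_rates(sizes: Dict[str, int], temperature: float = 2.0) -> Dict[str, float]:
    """Return sampling probabilities proportional to size ** (1 / temperature).

    temperature = 1 gives size-proportional sampling; larger values
    flatten the distribution toward uniform sampling.
    """
    scaled = {name: count ** (1.0 / temperature) for name, count in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen only to show the flattening effect.
    example_sizes = {"large_dataset": 1_000_000, "small_dataset": 10_000}
    print(temperature_mixing_rates(example_sizes, temperature=1.0))
    print(temperature_mixing_rates(example_sizes, temperature=2.0))
```

With T = 2 the small dataset is sampled more often than its raw share of the corpus would suggest, which is the intended effect of reducing the disparity between large and small datasets.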

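Following prefix-tuning (Li and Liang, 2021), we pre-train task-specific prompts by prepending trainable vectors to the multi-head attention modules at each layer. The prompt length is set to 100, and we utilize an MLP reparameterization function with a hidden size of 800 to improve training robustness and performance (Li and Liang, 2021). Hence, each group of task prompts has approximately 62M parameters. We then freeze the MVP model and train seven groups of task-specific prompts, each corresponding to a different task.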
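The figure of roughly 62M parameters per task prompt can be checked with a back-of-the-envelope count. The sketch below assumes a hidden size of 1,024 (BART-large), a two-layer MLP reparameterization that maps each prompt embedding to one key vector and one value vector per layer, and three separate prefix sets (encoder self-attention, decoder self-attention, and decoder cross-attention). These structural details are assumptions for illustration and are not taken from the description above.

```python
D_MODEL = 1024        # hidden size of BART-large (assumption)
NUM_LAYERS = 12       # layers per encoder/decoder stack (from the text)
PROMPT_LENGTH = 100   # prompt length (from the text)
MLP_HIDDEN = 800      # hidden size of the reparameterization MLP (from the text)
NUM_PREFIX_SETS = 3   # assumed: encoder self-attn, decoder self-attn, decoder cross-attn


def prefix_set_parameters() -> int:
    """Count the trainable parameters of one reparameterized prefix set."""
    embedding = PROMPT_LENGTH * D_MODEL
    first_linear = D_MODEL * MLP_HIDDEN + MLP_HIDDEN   # weights + bias
    out_dim = NUM_LAYERS * 2 * D_MODEL                 # one key and one value per layer
    second_linear = MLP_HIDDEN * out_dim + out_dim     # weights + bias
    return embedding + first_linear + second_linear


def total_prompt_parameters() -> int:
    """Count the parameters of all assumed prefix sets for one task."""
    return NUM_PREFIX_SETS * prefix_set_parameters()


if __name__ == "__main__":
    print(f"{total_prompt_parameters() / 1e6:.1f}M parameters per task prompt")
```

Under these assumptions the count comes to about 61.8M, which is consistent with the approximate figure of 62M.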

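In both stages, the maximum length of the input and output sequences is set to 1,024 to support examples that contain more tokens. We optimize the model with a constant learning rate of 3×10^-5 using the standard sequence-to-sequence cross-entropy loss. We apply the AdamW optimizer with β1 = 0.9, β2 = 0.98, and ε = 1×10^-6 to improve training stability (Liu et al., 2019b). The weight decay coefficient is 0.1. For testing, we select the checkpoint with the highest validation performance. All the experiments are conducted on 32 NVIDIA Tesla V100 32GB GPUs. We implement our model using the text generation library TextBox (Tang et al., 2022a).

In summary, we pre-train a 406M generation model MVP and seven groups of 62M task-specific prompts. For each downstream task, users can either utilize the backbone (406M) directly or further combine MVP with task-specific prompts (468M).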

Methods CNN/DailyMail WebNLG SQuAD (QG) CoQA
R-1 R-2 R-L B-4 ME R-L B-4 ME R-L F1 EM
MVP 44.52 21.62 41.10 67.82 47.47 76.88 26.26 27.35 53.49 86.43 77.78
BART 44.16e 21.28 40.90 64.55b 46.51 75.13 22.00f 26.40 52.55 68.60f -
Flan-T5 43.45 21.01 40.03 66.60 46.93 75.76 25.55 26.90 53.51 84.18 75.44
Single 44.36 21.54 40.88 67.74 46.89 76.94 26.09 27.15 53.29 86.20 77.26
MVP+S 44.63 21.72 41.21 68.19 47.75 76.81 25.69 27.04 53.20 86.65 77.93
MVP+R 44.14 21.45 40.72 67.61 47.65 76.70 25.71 27.03 53.09 85.95 77.22
MVP+M 43.97 21.16 40.46 67.45 47.57 76.81 25.46 26.79 52.95 86.28 77.26
SOTA 47.16a 22.55 43.87 66.14b 47.25 76.10 25.97c 27.33 53.43 84.50d -
Methods ROCStories PersonaChat MultiWOZ
B-1 B-2 D-1 D-4 B-1 B-2 D-1 D-2 B-4 Success Inform
MVP 33.79 15.76 3.02 75.65 50.73 40.69 1.65 11.23 20.26 76.40 85.00
BART 30.70g 13.30 - 69.90 49.90f 40.00 1.30 8.00 17.89j 74.91 84.88
Flan-T5 32.72 15.23 2.97 68.97 48.55 40.22 1.40 7.85 19.73 70.20 78.70
Single 32.67 15.29 2.72 72.97 49.96 40.53 1.27 7.63 19.73 75.60 83.70
MVP+S 33.92 15.60 3.44 80.58 47.91 39.97 1.52 9.54 20.32 79.90 86.80
MVP+R 32.93 15.32 2.88 73.83 48.45 40.09 1.30 7.95 19.02 73.30 81.80
MVP+M 33.30 15.51 2.71 74.24 46.26 39.30 1.36 8.07 19.93 72.70 79.70
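SOTA 33.40g 15.40 - 69.30 49.90f 40.00 1.50h 9.40 20.50i 85.30 94.40
Table 2: The main results on seven seen tasks under full tuning settings. The best and second-best results among all the methods are marked in bold and underlined, respectively. The SQuAD dataset here is used for the question generation task. The letters B, R, D, and ME denote BLEU, ROUGE, Distinct, and METEOR, respectively. "-" means the work does not compute the corresponding result. a (Ravaut et al., 2022) b (Ke et al., 2021) c (Bao et al., 2021) d (Xiao et al., 2020) e (Lewis et al., 2020) f (Liu et al., 2021a) g (Guan et al., 2021) h (Chen et al., 2022) i (He et al., 2022) j (Lin et al., 2020c)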

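4 Experiment Results

In this section, we mainly investigate the effectiveness and generality of our MVP model. We conduct extensive experiments in different settings:

  • Under full tuning scenarios, we employ the 27 generation datasets and the GLUE benchmark (Wang et al., 2019) for evaluation. Section 4.1 and Appendix C analyze the results on 23 datasets from 7 seen tasks. Section 4.3 includes the results of 4 unseen generation tasks and 8 understanding tasks. To better compare with ExT5, we conduct experiments on the GEM benchmark (Gehrmann et al., 2021) in Appendix C.2.

  • In zero-shot learning, we compare our model with T0 in Section 4.2.

  • In parameter-efficient tuning settings, we utilize the same datasets as in Section 4.1, and the results can be found in Section 4.4.

  • We conduct a human evaluation in Section 4.5.

For the full tuning setting (Tables 2 and 11), we fine-tune the entire model (including the backbone MVP and the prompts), while for parameter-efficient tuning (Table 6), we only fine-tune the prompts and freeze the parameter weights of MVP. We optimize the model via the seq2seq loss with a label smoothing (Szegedy et al., 2016) factor of 0.1 and the AdamW optimizer with default hyper-parameters. We sweep over the batch size in {16, 64, 256} and the learning rate in {5×10^-6, 1×10^-5, 3×10^-5} to find the optimal hyper-parameters for each evaluation task. We utilize the checkpoint with the best validation performance for test set inference. During inference, we set the beam size to 5 and the no-repeat n-gram size to 3. Details regarding fine-tuning and evaluation can be found in Appendix B.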
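The hyper-parameter sweep described above amounts to a small grid search. The following sketch enumerates the nine configurations and picks the one with the best validation score. It assumes the learning rates are 5e-6, 1e-5, and 3e-5; the dictionary keys and the `validation_score` callable are hypothetical names, and the actual training and evaluation loop is not shown.

```python
from itertools import product
from typing import Callable, Dict, List

BATCH_SIZES = (16, 64, 256)
LEARNING_RATES = (5e-6, 1e-5, 3e-5)  # assumes negative exponents


def sweep_configs() -> List[Dict[str, float]]:
    """Enumerate every (batch size, learning rate) pair of the sweep."""
    return [
        {
            "batch_size": batch_size,
            "learning_rate": learning_rate,
            "label_smoothing": 0.1,     # training-time setting from the text
            "beam_size": 5,             # inference-time setting from the text
            "no_repeat_ngram_size": 3,  # inference-time setting from the text
        }
        for batch_size, learning_rate in product(BATCH_SIZES, LEARNING_RATES)
    ]


def select_best(configs: List[Dict[str, float]],
                validation_score: Callable[[Dict[str, float]], float]) -> Dict[str, float]:
    """Return the configuration with the highest validation score."""
    return max(configs, key=validation_score)
```

In practice `validation_score` would fine-tune the model with the given configuration and return the validation metric of the best checkpoint for that run.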

Methods CNN/DailyMail WebNLG SQuAD (QG) CoQA
R-1 R-2 R-L B-4 ME R-L B-4 ME R-L F1 EM
FT BART 44.16 21.28 40.90 64.55 46.51 75.13 22.00 26.40 52.55 68.60 -
FT MVP 44.52 21.62 41.10 67.82 47.47 76.88 26.26 27.35 53.49 86.43 77.78
T0-3B - - - 1.40 10.20 18.43 3.06 12.43 14.91 13.30 6.60
T0-11B - - - 0.26 6.13 14.12 2.63 7.00 15.25 9.18 4.36
MVP 29.50 11.29 25.92 34.42 31.33 52.33 2.90 13.94 15.48 29.40 18.20
MVP+S 25.60 9.51 22.67 39.43 34.32 55.34 2.96 15.23 18.23 52.40 37.30
Methods ROCStories PersonaChat MultiWOZ
B-1 B-2 D-1 D-4 B-1 B-2 D-1 D-2 B-4 Success Inform
FT BART 30.70 13.30 - 69.90 49.90 40.00 1.30 8.00 17.89 74.91 84.88
FT MVP 33.79 15.76 3.02 75.65 50.73 40.69 1.65 11.23 20.26 76.40 85.00
T0-3B 8.69 3.02 4.37 35.49 23.20 23.57 2.56 12.06 0.02 2.50 22.10
T0-11B 0.63 0.16 12.41 92.86 32.17 28.35 1.56 7.19 0.00 3.90 22.10
MVP 1.01 0.31 7.18 86.26 35.54 32.71 2.87 16.38 3.08 2.50 22.20
MVP+S 10.52 3.54 2.13 69.55 37.04 33.38 2.66 14.84 0.38 2.50 22.10
Table 3: The results on seven unseen datasets in zero-shot learning. Given that T0 has been pre-trained on the CNN/DailyMail dataset, we exclude its results on that dataset to provide a fair comparison (denoted as "-").

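4.1 Full Tuning Performance

We conduct experiments on seven new datasets of the seven seen tasks to verify the effectiveness of our two-stage pre-training method. We design several model variants. In the first stage, MVP uses multi-task supervised pre-training, and we compare it with three others that use different training strategies:

  • BART-large (Lewis et al., 2020): BART is a widely used PLM for natural language generation that uses denoising auto-encoding as the unsupervised pre-training objective.

  • Flan-T5-large (Chung et al., 2022): Flan-T5 is a recent language model trained in a supervised manner on various NLP tasks, which makes it a strong competitor to our model.

  • Single-task pre-training (Single): We individually train a single model for each task using the intra-task datasets under the same pre-training settings as in multi-task training. For instance, we pre-train a summarization model using summarization datasets (e.g., Newsroom, WikiHow, and MSNews). Therefore, we have seven single-task pre-trained models in total.

For the second stage, which integrates single-task pre-trained prompts (denoted as MVP+S), we compare it with two variants that use different prompts:

  • Randomly initialized prompts (MVP+R): The layer-wise prompts for the MVP model are randomly initialized without pre-training.

  • Multi-task pre-trained prompts (MVP+M): We pre-train only one group of prompts for all tasks, using the same mixed datasets as in the backbone pre-training.

Besides these variants, we also include the best results reported in the original papers in the literature for comparison (denoted as SOTA). From the results in Table 2, we can see that:

First, the supervised pre-trained models (i.e., MVP, Flan-T5, and Single) achieve better performance than the unsupervised pre-trained model BART, yielding average improvements of 9.3%, 3.13%, and 4.4% (in ratio), respectively. This finding verifies the effectiveness of our supervised pre-training method, which enables the model to acquire more task-specific information. Regarding multi-task pre-training (MVP) versus single-task pre-training (Single), our MVP model outperforms its single-task counterparts by 5.0%. This result indicates that the multi-task learning approach can enhance single-task performance by learning transferable semantic information across tasks. Notably, our MVP model outperforms Flan-T5 by 5.8%, which shows the significance of training on our NLG dataset collection, MVPCorpus.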
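The improvements "in ratio" quoted above are relative gains. As a hedged illustration, the sketch below computes the relative gain for a single metric and an unweighted mean over several metrics. The exact aggregation behind the reported averages is not specified here, so the averaging scheme is an assumption; the example values are the CNN/DailyMail scores of MVP and BART from Table 2.

```python
from typing import Iterable, Tuple


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


def mean_relative_gain(pairs: Iterable[Tuple[float, float]]) -> float:
    """Unweighted mean of per-metric relative gains (assumed aggregation)."""
    gains = [relative_gain(new, base) for new, base in pairs]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    # (MVP, BART) pairs for ROUGE-1, ROUGE-2, and ROUGE-L on CNN/DailyMail.
    cnn_dailymail = [(44.52, 44.16), (21.62, 21.28), (41.10, 40.90)]
    print(f"{mean_relative_gain(cnn_dailymail):.2f}% mean relative gain")
```

This covers a single dataset only and is not expected to reproduce the averages reported over all datasets and metrics.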

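Second, task-specific prompt learning is effective in alleviating the "blurring-out" issue of multi-task learning. For tasks such as data-to-text generation and question answering, MVP with single-task prompts (MVP+S) consistently surpasses the other two variants (MVP+R and MVP+M). This verifies that task-specific prompts can acquire task-specialized knowledge and stimulate the capacity of the MVP model to perform certain tasks.

Finally, our supervised pre-training approach achieves five new SOTA results on data-to-text generation, question generation, question answering, story generation, and open-ended dialogue tasks. We also achieve SOTA performance on six out of eight datasets in Table 11, which shows the strong text generation capability of our MVP model. As for the remaining tasks, the SOTA models incorporate tailored techniques, e.g., a re-ranking framework (Ravaut et al., 2022) and various task-specific objectives (He et al., 2022), which yield better performance. In contrast, our MVP model can produce competitive results with just a general architecture and a unified learning objective.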

AESOP Quora
B-4 R-1 R-2 R-L ME
+BART 47.30a 73.30 54.10 75.10 49.70
+MVP 49.81 74.78 56.84 76.34 53.40
SC & BLEU GYAFC E&M GYAFC F&R
B-4 Accuracy HM B-4 Accuracy HM
+BART 76.50b 93.70 83.90 79.30 92.00 85.20
+MVP 77.18 94.49 84.96 79.43 92.12 85.31
Table 4: The results on unseen NLG tasks. We use AESOP and SC & BLEU to denote the methods proposed by Sun et al. (2021) and Lai et al. (2021), respectively. a (Sun et al., 2021) b (Lai et al., 2021)
Methods CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE Average
Matt. Acc. F1/Acc. P/S Corr. F1/Acc. m./mm. Acc. Acc.
BART 60.30 96.30 90.47 / 86.70 90.97 / 90.30 73.03 / 89.87 90.03 / 89.27 94.60 79.83 85.17
MVP 59.87 96.43 92.07 / 89.43 91.37 / 90.90 73.20 / 90.13 89.70 / 88.73 95.10 82.87 85.88
Table 5: The results of NLU tasks on the GLUE benchmark.

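4.2 Zero-shot Performance

Since we do not pre-train MVP on the seven commonly used datasets, we further conduct zero-shot experiments to examine the domain transfer abilities of our models. We include T0-3B and T0-11B (Sanh et al., 2022) as our baselines, which are large models trained on various downstream tasks. The results are listed in Table 3. We can observe that our small MVP model (406M) outperforms T0-3B and T0-11B by a large margin on all metrics, except for a few metrics on ROCStories and MultiWOZ. This demonstrates the effectiveness of supervised pre-training on our MVPCorpus.

However, on all tasks, models in the zero-shot setting perform significantly worse than those with full tuning. This suggests that training strategies that are effective for NLU tasks may not produce satisfactory results for NLG tasks. Even though our model has acquired task knowledge, it struggles to perform well in a new domain without being fine-tuned. Hence, it is still necessary to develop specific NLG models for certain tasks and domains. Our MVP models can serve as effective starting points for further investigation.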

Methods CNN/DailyMail WebNLG SQuAD (QG) CoQA
R-1 R-2 R-L B-4 ME R-L B-4 ME R-L F1 EM
MVP+S 43.03 20.27 39.72 66.73 47.42 76.36 25.28 26.66 52.69 86.44 76.84
BART+R 42.47 19.82 39.15 65.54 46.86 75.24 24.27 26.07 52.03 82.22 71.92
MVP+R 42.84 20.21 39.61 66.12 47.12 75.83 25.05 26.34 52.57 85.51 75.56
MVP+M 42.99 20.36 39.70 66.40 47.16 75.89 25.24 26.49 52.88 85.90 76.34
FT BART 44.16 21.28 40.90 64.55 46.51 75.13 22.00 26.40 52.55 68.60 -
FT MVP 44.52 21.62 41.10 67.82 47.47 76.88 26.26 27.35 53.49 86.43 77.78
Methods ROCStories PersonaChat MultiWOZ
B-1 B-2 D-1 D-4 B-1 B-2 D-1 D-2 B-4 Success Inform
MVP+S 32.94 15.12 2.98 71.09 47.11 39.51 1.39 7.28 19.24 71.40 77.80
BART+R 32.14 14.71 2.85 68.94 46.23 38.98 1.30 6.82 17.94 62.20 69.20
MVP+R 32.28 14.85 2.97 70.29 46.70 39.23 1.31 6.98 18.86 64.40 71.40
MVP+M 32.62 15.28 2.95 69.58 46.78 39.40 1.33 7.13 19.13 67.20 72.90
FT BART 30.70 13.30 - 69.90 49.90 40.00 1.30 8.00 17.89 74.91 84.88
FT MVP 33.79 15.76 3.02 75.65 50.73 40.69 1.65 11.23 20.26 76.40 85.00
Table 6: The results on seven seen tasks under parameter-efficient settings. We also include the results of BART and MVP under the full tuning setting (denoted as FT) for comparison.

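4.3 Generality to Unseen Tasks

In this subsection, we test our MVP model on unseen NLG and NLU tasks to verify its generality.

Unseen NLG Tasks.

According to Deng et al. (2021), an NLG task can be assigned to one of the following three categories: compression (e.g., summarization), transduction (e.g., translation), or creation (e.g., story generation). Since we do not include any transduction tasks during pre-training, we evaluate our MVP model on two unseen transduction NLG tasks: paraphrase generation and text style transfer. We select the SOTA methods for these two tasks, i.e., AESOP (Sun et al., 2021) for paraphrase generation and SC & BLEU (Lai et al., 2021) for text style transfer, and replace their backbone BART with our MVP model for comparison. From the results in Table 4, we can see that our model outperforms BART by 2.3% and achieves two new SOTA results, which verifies the strong generality of our model. This finding shows that our MVP model is more capable than BART and can serve as a general yet effective backbone.

Unseen NLU Tasks.

Although MVP is designed especially for NLG tasks, we also evaluate its performance on unseen NLU tasks using the widely used GLUE benchmark (Wang et al., 2019). We compare our model to BART-large using its sequence classification method (Lewis et al., 2020). According to the results presented in Table 5, our MVP model outperforms BART on 9 of 12 metrics and achieves an overall performance that is 0.71% better. This result indicates the generality of our MVP model and further demonstrates that supervised pre-training not only learns generation ability but also improves overall semantic representations.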

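4.4 Parameter-Efficient Tuning Performance

In the lightweight fine-tuning setting, we only tune the prompts while freezing the backbone MVP model, to verify its effectiveness in resource-constrained situations. Besides our MVP+S model, we consider the following methods for comparison:

  • Prefix-tuning (Li and Liang, 2021): Prefix-tuning is a popular prompt-based lightweight tuning method for text generation. We employ BART as its backbone, denoted as BART+R.

  • Only tuning randomly initialized prompts (MVP+R): This variant only tunes the randomly initialized prompts of MVP+R, and it shares a similar idea with prefix-tuning.

  • Only tuning multi-task pre-trained prompts (MVP+M): This variant only tunes the multi-task pre-trained prompts of MVP+M. Such an idea has been used in SPoT (Vu et al., 2022).

From the experimental results in Table 6, we can see that the good performance of the MVP model in lightweight settings further demonstrates the effectiveness of supervised pre-training. By comparing the two randomly initialized prompting methods (BART+R and MVP+R), we can see that MVP+R achieves superior performance to BART+R (+2.0%) thanks to its multi-task supervised backbone. Furthermore, when initialized with pre-trained prompts, MVP+S and MVP+M achieve better results than MVP+R, which is consistent with the findings of SPoT (Vu et al., 2022). Compared with MVP+M, MVP+S performs marginally better, by 1.2%, indicating that task-specific prompts are useful for improving the model on generation tasks. Surprisingly, our lightweight MVP+S can even outperform fully tuned BART on tasks such as question generation and question answering, showcasing the effectiveness of the proposed supervised pre-training approach.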

Datasets MVP wins (%) Ties (%) BART wins (%)
CNN/DM 46.50 10.67 42.83
WebNLG 32.17 45.67 22.17
ROCStories 46.50 11.33 42.17
PersonaChat 35.33 34.00 30.67
Table 7: Human evaluation on four tasks, with Krippendorff's α = 0.418, which measures the inter-annotator agreement among the human judges.

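4.5 Human Evaluation

Considering that there is a certain gap between automatic metrics and human judgments (Sai et al., 2022), we further conduct a human evaluation to better demonstrate the generation capabilities of our MVP model. We compare MVP with BART on four tasks: text summarization, data-to-text generation, open-ended dialogue system, and story generation. Following the practice of van der Lee et al. (2021), we utilize a stratified sample of 100 inputs of low, medium, and high word frequency for each task. We invite six human judges to evaluate the texts generated by MVP and BART. They are asked to choose which one is better, or to choose a tie, according to fluency, informativeness, consistency, task features, etc. More human evaluation details are listed in Appendix D. Table 7 shows the percentages of "MVP wins", "Ties", and "BART wins" for each dataset. From the results, we can see that MVP generates better texts overall than BART from a human perspective.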
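The percentages in Table 7 are shares of pairwise preference votes. A minimal sketch of this tally is given below; the label strings and the function name are hypothetical, and pooling all judgments of a dataset before computing the shares is an assumption about how the table was produced.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("mvp", "tie", "bart")  # hypothetical vote labels


def preference_shares(votes: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of votes for each label."""
    counts = Counter(votes)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no votes to tally")
    return {label: 100.0 * counts[label] / total for label in LABELS}
```

For instance, four pooled votes of two "mvp", one "tie", and one "bart" give shares of 50%, 25%, and 25%.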

Methods #NLG (PT) #NLU (PT) #NLG (FT) #NLU (FT) SP model SP prompts Open source
FLAN 3 9 2 9
T0 2 6 0 4
Muppet 1 3 1 3
ExT5 3 8 6 8
SPoT 1 4 0 6
MVP (ours) 7 0 11 3
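Table 8: Comparison of MVP with existing supervised pre-training works. #NLG/#NLU are the numbers of NLG and NLU tasks, respectively. PT, FT, and SP denote pre-training, fine-tuning, and supervised pre-training, respectively.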

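5 Discussion

Differences from Existing Methods.

To the best of our knowledge, existing supervised pre-training works mainly focus on NLU tasks (Aghajanyan et al., 2021; Aribandi et al., 2022) or on a small number of NLG tasks (Lin et al., 2020b; Su et al., 2022). Given the superior performance achieved by supervised pre-training approaches, it is important to explore supervised pre-training for deriving NLG models that are both effective and general. Our work makes a significant contribution in this direction, achieving SOTA performance with a single model on 13 of 17 datasets. Compared with its strong counterpart ExT5 (Aribandi et al., 2022), our MVP model performs better on 26 out of 27 metrics (detailed in Appendix C.2). To better understand the differences between our work and previous supervised (multi-task) pre-training studies, we present a detailed comparison in Table 8. As we can see, our work conducts the study with the largest number of NLG tasks for both supervised pre-training and fine-tuning, incorporates task-specific prompts, and releases all the important resources for reproducing or reusing our work.

Applicability.

To facilitate the application of our work, we have released the collected corpus, the pre-trained models, the task-specific prompts, and the generated texts. Our collected MVPCorpus is the largest NLG task collection, which can serve as a high-quality resource for recent large language models (Zhao et al., 2023). We can use all the data to pre-train a general model, or select a subset to continue pre-training a domain- or task-specific model (Gururangan et al., 2020). Our MVPCorpus can also be considered as an evaluation benchmark for different NLG tasks. Furthermore, our MVP model can be employed to achieve competitive results on various NLG tasks. Given sufficient labeled data, users can either fine-tune the MVP model or integrate it with task-specific prompts. Notably, our MVP model can be used directly to obtain good performance in zero-shot learning. In addition, our MVP model can provide effective parameter initialization for improving existing methods, as described in Section 4.3. Finally, the task-specific prompts and the generated texts can be further used to study task similarity and its effect on multi-task pre-training.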

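6 Conclusion

In this paper, we present Multi-task superVised Pre-training (MVP) for natural language generation. First, we collect a large-scale NLG corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. After converting the various NLG tasks into a unified text-to-text format, we propose multi-task supervised pre-training to learn an effective and general model MVP with task-specific prompts for NLG tasks. Extensive experiments demonstrate that: (1) supervised pre-training is beneficial for NLG tasks as an effective solution. Our MVP model outperforms its strong counterparts BART and Flan-T5 and even achieves SOTA performance on 13 out of 17 datasets; (2) supervised pre-trained models have strong generality on unseen generation tasks and even on understanding tasks.

In future work, we will explore a multilingual version of our MVP model by covering more datasets in other languages. Such a model is expected to capture language-independent task characteristics and improve generation tasks in low-resource languages. Besides, it is interesting to study how different tasks relate to each other in the unified semantic space, which can inspire methods that incorporate task relations as a prior.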

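Acknowledgements

This work was partially supported by the National Natural Science Foundation of China under Grant No. 62222215, the Beijing Natural Science Foundation under Grant No. 4222027, and the Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098. Xin Zhao is the corresponding author.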

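Limitations

Despite our efforts to collect as many generation tasks and datasets as possible, we only evaluate the generation quality and generality of our models on a small number of tasks and datasets. The interpretability and robustness of our models require further analysis. Besides, there exists subjectivity in collecting downstream tasks and intra-task datasets, although we attempt to follow the categorizations widely recognized in the literature. Due to limited computing power, we do not study the performance of our method at different model scales. The effectiveness of multi-task pre-training from scratch, similar to ExT5 (Aribandi et al., 2022), also merits an in-depth study.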

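Broader Impacts

In this paper, we pre-train a language model, MVP, using labeled NLG datasets. According to prior studies (Bender et al., 2021; Bommasani et al., 2021), PLMs tend to "remember" what they have "seen" in the pre-training corpus. This could result in the reproduction of undesirable biases from the pre-training data on downstream tasks. Training data intervention could be a solution to alleviate this issue (Lu et al., 2020). It is also interesting to investigate whether supervised pre-training produces fewer biases than unsupervised pre-training.

Environmental impact is another factor we should consider. We adopt a more efficient pre-training strategy and release our PLM for future work. In contrast to large PLMs with tens of billions of parameters, such as T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), we pre-train only a small model with hundreds of millions of parameters. In addition, we utilize supervised pre-training data and initialize our model with pre-trained BART, both of which improve the convergence of our model. Ultimately, our model is pre-trained for about 20,000 steps, whereas the BART of the same size is pre-trained for 500,000 steps.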

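Reproducibility

For reproducing and reusing our work, we have released the MVPCorpus collection, the models (e.g., MVP, task-specific prompts, and multi-task variants), the intermediate results (e.g., the generated texts), and the source code for pre-training and fine-tuning at https://github.com/RUCAIBox/MVP. The detailed experimental settings are listed in Appendix B. We hope that these open-source resources will facilitate future work on supervised pre-training and contribute to the advancement of NLG research.

References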

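Appendix A Tasks and Datasets

A.1 Description of Tasks and Datasets

We provide the details of the tasks and datasets used in our paper for pre-training and fine-tuning in Tables 9 and 10. If a pre-training dataset does not have a validation set, we split off 10% of the training set for validation.

We list the licenses of all datasets where available. All datasets are publicly available. The majority of them can be directly downloaded from GitHub or Google Drive. ROCStories (Mostafazadeh et al., 2016) and CommonGen (Lin et al., 2020a) can be obtained after filling out a form. GYAFC (Rao and Tetreault, 2018) is accessible after requesting it from Yahoo and the authors of the dataset.

The tasks and datasets we use in this paper are as follows: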

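  • Data-to-text generation aims to generate descriptive text about structured data, such as knowledge graphs and tables. We use the following datasets for pre-training:

    1. AGENDA (Koncel-Kedziorski et al., 2019);
    2. ENT-DESC (Cheng et al., 2020);
    3. GenWiki (Jin et al., 2020);
    4. LogicNLG (Chen et al., 2020a);
    5. TEKGEN (Agarwal et al., 2021);
    6. WEATHERGOV (Liang et al., 2009);
    7. WikiTableT (Chen et al., 2021).

    We utilize the following datasets for fine-tuning evaluation:

    1. WebNLG (Gardent et al., 2017), for which we use version 2.1;
    2. WikiBio (Lebret et al., 2016).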

  • Open-ended dialogue system, also known as chatbots, focuses on daily communication. We use the following datasets for pre-training:

    1. Cleaned OpenSubtitles Dialogs (Cleaned OS Dialogs) (Welivita et al., 2021), which is a cleaned variant of OpenSubtitles Dialogs (Lison et al., 2018);
    2. CMU Document Grounded Conversations (CMUDoG) (Zhou et al., 2018);
    3. Curiosity (Rodriguez et al., 2020);
    4. DREAM (Sun et al., 2019);
    5. Empathetic Dialogues (Rashkin et al., 2019);
    6. Movie Dialog (Dodge et al., 2016);
    7. MuTual (Stratos, 2019);
    8. OpenDialKG (Moon et al., 2019);
    9. Topical-Chat (Gopalakrishnan et al., 2019);
    10. Wizard of Wikipedia (Dinan et al., 2019).

    We utilize the following datasets for fine-tuning evaluation:

    1. DailyDialog (Li et al., 2017);
    2. DSTC7-AVSD (Alamri et al., 2018);
    3. PersonaChat (Zhang et al., 2018).

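  • Paraphrase generation involves rewriting a sentence with the same semantic meaning but a different syntactic or lexical form. We utilize the following datasets for fine-tuning evaluation:

    1. Quora (also known as QQP-Pos) (Kumar et al., 2020), which is a subset of Quora Question Pairs (Footnote 3: https://www.kaggle.com/c/quora-question-pairs).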

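  • Question answering requires the model to answer a question based on optional background information. Note that we conduct this task in a generative way in our paper. We use the following datasets for pre-training:

    1. HotpotQA (Yang et al., 2018);
    2. MS MARCO (Nguyen et al., 2016);
    3. MSQG (Liu et al., 2021a), which is designed for QG, so we reverse the question and answer to enrich the QA examples;
    4. NarrativeQA (Kočiský et al., 2018);
    5. Natural Questions (Kwiatkowski et al., 2019);
    6. NewsQA (Trischler et al., 2017);
    7. QuAC (Choi et al., 2018);
    8. TriviaQA (Joshi et al., 2017);
    9. WebQuestions (Berant et al., 2013).

    We utilize the following datasets for fine-tuning evaluation:

    1. CoQA (Reddy et al., 2019);
    2. SQuAD (Rajpurkar et al., 2016), for which we use version 1.1.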

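  • Question generation generates a coherent question given a passage and its corresponding answer. We use the following datasets for pre-training:

    1. HotpotQA (Yang et al., 2018);
    2. MS MARCO (Nguyen et al., 2016);
    3. MSQG (Liu et al., 2021a);
    4. NarrativeQA (Kočiský et al., 2018);
    5. NewsQA (Trischler et al., 2017);
    6. QuAC (Choi et al., 2018).

    Most of them are QA datasets, and we reverse the question and answer to enrich the QG examples.

    We utilize the following datasets for fine-tuning evaluation:

    1. CoQA (Reddy et al., 2019);
    2. SQuAD (Rajpurkar et al., 2016), for which we use version 1.1.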

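  • Story generation creates a long and informative text from a short title. We use the following datasets for pre-training:

    1. ChangeMyView (Hua and Wang, 2020);
    2. English Gigaword (Rush et al., 2015);
    3. Hippocorpus (Sap et al., 2020);
    4. WikiPlots (Markriedl);
    5. WritingPrompts (Fan et al., 2018), for which we split the original training set into pre-training data and the corresponding validation data.

    Considering that English Gigaword is a large summarization dataset, we use the summary as the title to generate the passage in turn, in order to enrich the examples of story generation.

    We utilize the following datasets for fine-tuning evaluation:

    1. ROCStories (Mostafazadeh et al., 2016);
    2. WritingPrompts (Fan et al., 2018), for which we use the sets created by Guan et al. (2021) (who split the original validation and test sets into training, validation, and test sets) to fine-tune our model for a fair comparison.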

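  • Task-oriented dialogue system meets the real-life needs of users, such as restaurant reservations and airplane bookings. We use the following datasets for pre-training, following Su et al. (2022):

    1. CamRest676 (Wen et al., 2017);
    2. Frames (El Asri et al., 2017);
    3. KVRET (Eric et al., 2017);
    4. MetaLWOZ (Lee et al., 2019);
    5. MSR-E2E (Li et al., 2018);
    6. MultiWOZ (Budzianowski et al., 2018);
    7. Schema-Guided (Rastogi et al., 2020a);
    8. TaskMaster (Byrne et al., 2019);
    9. WOZ (Mrkšić et al., 2017).

    We utilize the following datasets for fine-tuning evaluation:

    1. MultiWOZ (Budzianowski et al., 2018), for which we use version 2.0.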

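  • Text style transfer modifies the style (e.g., sentiment and formality) of given texts while retaining their style-independent content. We utilize the following datasets for fine-tuning evaluation:

    1. GYAFC (Rao and Tetreault, 2018), which has two sub-domains: "Entertainment and Music" (E&M) and "Family and Relationships" (F&R).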

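  • Text summarization condenses a long document into a brief text while retaining the essential details. We use the following datasets for pre-training:

    1. English Gigaword (Graff et al., 2003), for which we use the variant provided by Rush et al. (2015);
    2. MediaSum (Zhu et al., 2021);
    3. MSNews (Liu et al., 2021a);
    4. Newsroom (Grusky et al., 2018);
    5. WikiHow (Koupaee and Wang, 2018).

    We utilize the following datasets for fine-tuning evaluation:

    1. CNN/DailyMail (Hermann et al., 2015), for which we use the variant provided by See et al. (2017);
    2. SAMSum (Gliwa et al., 2019);
    3. XSum (Narayan et al., 2018).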

To better compare with ExT5 (Aribandi et al., 2022), we utilize the language generation benchmark GEM (Gehrmann et al., 2021) for fine-tuning evaluation. GEM includes five tasks:

  • Commonsense generation:

    1. CommonGen (CG) (Lin et al., 2020a).

  • Data-to-text generation:

    1. DART (Nan et al., 2021);
    2. E2E NLG cleaned (Novikova et al., 2017);
    3. ToTTo (Su et al., 2021);
    4. WebNLG (Gardent et al., 2017).

  • Dialogue system:

    1. Schema-Guided Dialog (SGD) (Rastogi et al., 2020b).

  • Text simplification:

    1. WikiAuto + Turk/ASSET (WiA-T/A) (Jiang et al., 2020; Xu et al., 2016; Alva-Manchego et al., 2020).

  • Text summarization:

    1. Wiki-Lingua (WLE) (Ladhak et al., 2020).

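To test the generalization ability of our model, we also utilize the natural language understanding benchmark GLUE (Wang et al., 2019), which is composed of three groups of tasks:

  • Natural language inference:

    1. MNLI (Williams et al., 2018);
    2. QNLI (Rajpurkar et al., 2016; Wang et al., 2019);
    3. RTE (Dagan et al., 2006; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009).

  • Paraphrase detection:

    1. MRPC (Dolan and Brockett, 2005);
    2. QQP (see Footnote 3);
    3. STS-B (Cer et al., 2017).

  • Text classification:

    1. CoLA (Warstadt et al., 2019);
    2. SST-2 (Socher et al., 2013).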

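A.2 Data Leakage

Since our model is pre-trained on a large number of labeled datasets, it may have "seen" examples from the fine-tuning test sets during pre-training, which would lead to an unfair comparison with other methods. Hence, we eliminate the pre-training examples that share n-gram overlap with any of the test datasets. Following Brown et al. (2020), n is the 5th-percentile example length in words, and the maximum value of n is set to 13. Finally, we remove 17,848 examples from the pre-training datasets. The number of "cleaned" examples for each dataset can be found in Table 9.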
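The overlap filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: texts are tokenized by whitespace, the percentile is computed with the nearest-rank method over the test examples, and an example is dropped if any of its n-grams also occurs in a test example. Which fields (input, output, or both) are compared, and all function names, are hypothetical.

```python
import math
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def choose_n(test_texts: List[str], cap: int = 13, percentile: float = 0.05) -> int:
    """Pick n as the 5th-percentile test example length in words, capped at 13."""
    lengths = sorted(len(text.split()) for text in test_texts)
    index = max(0, math.ceil(percentile * len(lengths)) - 1)  # nearest-rank percentile
    return max(1, min(lengths[index], cap))


def remove_overlapping(pretrain_texts: List[str], test_texts: List[str], n: int) -> List[str]:
    """Keep only pre-training examples that share no n-gram with the test set."""
    test_grams: Set[Tuple[str, ...]] = set()
    for text in test_texts:
        test_grams |= ngrams(text.split(), n)
    return [text for text in pretrain_texts
            if not (ngrams(text.split(), n) & test_grams)]
```

A typical use would call `choose_n` once per test dataset and then apply `remove_overlapping` to every pre-training dataset with the resulting n.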

Dataset #Train Cleaned #Train #Valid #Test Input Output License
AGENDA 38,720 38,720 1,000 1,000 52.1 141.2 N/A
ENT-DESC 88,652 88,652 11,081 11,081 279.9 31.0 N/A
GenWiki 681,436 681,436 75,716 1,000 21.4 29.5 MIT
LogicNLG 28,450 28,450 4,260 4,305 178.4 14.2 MIT
TEKGEN 6,310,061 6,307,995 788,746 796,982 17.0 21.2 CC BY-SA 2.0
WEATHERGOV 25,000 25,000 1,000 3,528 148.7 30.6 N/A
WikiTableT 1,453,794 1,452,778 4,533 4,351 81.0 99.7 MIT
Cleaned OS Dialogs 13,355,487 13,355,368 1,483,944 - 75.5 16.7 N/A
CMUDoG 82,818 82,818 5,555 14,510 433.0 12.2 N/A
Curiosity 64,930 64,551 8,539 8,495 144.4 20.2 CC BY-NC 4.0
DREAM 14,264 14,242 4,709 4,766 75.6 13.6 N/A
Empathetic Dialogues 64,636 64,636 9,308 8,426 52.7 12.9 CC BY-NC 4.0
Movie Dialog 762,751 762,711 8,216 8,066 126.9 44.0 N/A
MuTual 33,691 33,691 4,090 3,248 53.6 14.5 N/A
OpenDialKG 69,680 69,680 7,743 - 54.2 12.4 CC BY-NC 4.0
Topical-Chat 179,750 179,750 22,295 22,452 223.3 20.0 CDLA-Sharing-1.0
Wizard of Wikipedia 148,357 147,702 15,767 15,564 297.0 16.7 MIT
HotpotQA 90,447 87,815 7,405 - 187.9 2.2 CC BY-SA 4.0
MS MARCO 681,445 681,226 77,580 - 68.7 13.3 N/A
MSQG 198,058 198,029 11,008 - 48.1 3.7 CC BY-SA 4.0
NarrativeQA 65,494 65,494 6,922 21,114 584.1 4.2 Apache 2.0
Natural Questions 96,676 96,676 10,693 6,490 9.0 2.1 CC BY-SA 3.0
NewsQA 97,850 97,700 5,486 5,396 726.8 5.0 MIT
QuAC 83,568 83,485 31,906 - 487.9 12.5 CC BY-SA 4.0
TriviaQA 78,785 78,785 8,837 11,313 14.0 2.0 Apache 2.0
WebQuestions 8,933 8,933 4,863 4,863 6.7 2.4 CC BY 4.0
HotpotQA 90,440 87,808 6,972 - 79.6 19.8 CC BY-SA 4.0
MS MARCO 681,445 681,226 77,580 - 75.9 6.0 N/A
MSQG 198,058 198,029 11,008 11,022 45.9 6.0 CC BY-SA 4.0
NarrativeQA 65,494 65,494 6,922 21,114 579.7 8.6 Apache 2.0
NewsQA 97,850 97,700 5,486 5,396 724.2 7.6 MIT
QuAC 69,109 69,026 26,301 - 496.7 6.5 CC BY-SA 4.0
ChangeMyView 42,462 42,459 6,480 7,562 17.9 104.1 MIT
English Gigaword 3,803,957 3,802,620 189,651 1,951 8.8 33.3 MIT
Hippocorpus 6,168 6,168 686 - 34.1 262.6 CDLA-Permissive 2.0
WikiPlots 101,642 101,641 11,294 - 3.4 338.5 N/A
WritingPrompts 272,600 272,518 15,620 15,138 28.4 630.8 MIT
CamRest676 4,872 4,872 616 - 55.3 9.4 N/A
Frames 26,631 26,631 2,106 - 116.1 13.0 MIT
KVRET 14,136 14,136 1,616 - 30.5 9.3 N/A
MetaLWOZ 176,073 176,073 17,912 - 45.6 8.0 N/A
MSR-E2E 103,362 103,362 5,235 - 51.3 12.8 Microsoft
Schema-Guided 494,946 494,933 73,089 - 120.8 12.5 CC BY-SA 4.0
TaskMaster 249,664 249,662 20,680 - 95.6 12.0 CC BY 4.0
WOZ 6,364 6,359 1,260 - 47.0 10.6 N/A
English Gigaword 3,803,957 3,802,620 189,651 1,951 33.3 8.8 MIT
MediaSum 443,596 442,021 10,000 10,000 1641.0 14.4 N/A
MSNews 136,082 135,937 7,496 7,562 309.9 9.8 CC BY-SA 4.0
Newsroom 995,041 989,351 108,837 108,862 642.4 26.7 N/A
WikiHow 157,252 157,247 5,599 5,577 502.6 45.6 CC BY-NC-SA
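Table 9: The statistics and licenses of the datasets used for pre-training our MVP model. #Train, #Valid, and #Test denote the number of examples in the training, validation, and test sets, respectively. Cleaned #Train denotes the number of training examples after filtering. Input and Output are the average numbers of words (split by space) in the input and output sequences, respectively.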
Task Dataset #Train #Valid #Test Input Output License
Commonsense generation CommonGen 67,389 993 5.5 11.6 MIT
Data-to-text generation DART 62,659 2,768 27.5 21.5 MIT
E2E 33,525 4,299 9.5 20.6 CC BY-SA 4.0
ToTTo 120,761 7,700 37.8 18.0 CC BY-SA 3.0
WebNLG 34,338 4,313 4,222 18.0 19.9 CC BY-NC-SA 4.0
WebNLG (GEM) 35,426 1,667 17.7 22.7 CC BY-NC-SA 4.0
WikiBio 582,659 72,831 72,831 81.6 26.1 CC BY-SA 3.0
Open-ended dialogue DailyDialog 76,052 7,069 6,740 72.5 13.9 CC BY-NC-SA 4.0
DSTC7-AVSD 76,590 17,870 1,710 148.2 11.5 MIT
PersonaChat 122,499 14,602 14,056 132.1 11.9 MIT
SGD 164,982 10,000 134.7 11.3 CC BY-SA 4.0
Natural language inference MNLI-m 392,702 9,815 9,796 29.8 Mixed
MNLI-mm 9,832 9,847
QNLI 104,743 5,463 5,463 36.6 CC BY-SA 4.0
RTE 2,490 277 3,000 51.0 N/A
Paraphrase generation Quora 137,185 3,000 3,000 10.9 10.8 N/A
Paraphrase detection MRPC 3,668 408 1,725 43.8 N/A
QQP 363,846 40,430 390,965 22.3 N/A
STS-B 5,749 1,500 1,379 20.3 N/A
Question answering CoQA 107,286 31,621 349.4 2.6 Mixed
SQuAD 75,722 10,570 11,877 156.2 3.6 CC BY-SA 4.0
Question generation CoQA 107,286 31,621 346.6 5.5 Mixed
SQuAD 75,722 10,570 11,877 148.3 11.6 CC BY-SA 4.0
Story generation ROCStories 176,688 9,816 4,909 9.0 40.7 N/A
WritingPrompts 53,516 4,000 2,000 25.5 150.4 MIT
Task-oriented dialogue MultiWOZ 170,220 22,074 22,116 128.3 11.3 MIT
Text classification CoLA 8,551 1,043 1,063 7.7 N/A
SST-2 67,349 872 1,821 9.8 N/A
Text simplification WiA-A 483,801 20,000 359 26.2 21.5 Mixed
WiA-T 359
Text style transfer GYAFC-E&M 52,595 11,508 1,416 9.9 10.6 N/A
GYAFC-F&R 51,967 11,152 1,332 10.7 11.3
Text summarization CNN/DailyMail 287,227 13,368 11,490 679.8 48.3 MIT
SAMSum 14,732 818 819 103.4 20.3 CC BY-NC-ND 4.0
WLE 99,020 28,614 367.6 33.4 CC0 1.0
XSum 204,045 11,332 11,334 373.7 21.1 MIT
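Table 10: The statistics and licenses of the datasets used for evaluating our MVP model. The license of the MNLI dataset is composed of OANC, CC BY-SA 3.0, and CC BY 3.0. The license of the CoQA dataset is composed of CC BY-SA 4.0, MSR-LA, and Apache 2.0. The license of the WiA-A/T datasets is composed of CC BY-NC 3.0, CC BY-NC 4.0, and GNU General Public License v3.0.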
Methods XSum SAMSum CoQA QG
R-1 R-2 R-L R-1 R-2 R-L B-4 ME R-L
BART 45.14d 22.27 37.25 51.74b 26.46 48.72 12.34c 35.78 46.88
MVP 45.60 22.47 37.42 53.78 29.12 49.37 23.48 47.79 55.09
MVP+S 45.67 22.63 37.50 53.81 29.75 49.43 23.43 47.49 55.25
SOTA 49.57a 25.08 41.81 53.89b 28.85 49.29 15.78c 40.15 50.98
Methods WritingPrompts DailyDialog WikiBio
B-1 B-2 D-1 D-4 B-1 B-2 D-1 D-2 B-4
BART 22.40e 8.40 - 31.30 44.30f 39.20 3.90 21.10 -
MVP 32.34 13.11 2.12 64.58 46.19 41.81 4.61 25.06 48.42
MVP+S 30.12 11.46 3.97 83.70 45.71 42.92 5.10 27.14 48.19
SOTA 22.40e 8.40 - 31.30 46.10f 40.70 4.10 22.20 45.10g
Methods DSTC7-AVSD SQuAD
B-1 B-2 B-3 B-4 ME R-L CIDEr F1 EM
BART 82.40f 69.10 58.20 48.70 31.30 63.50 1.38 91.56i 84.23
MVP 83.75 70.89 60.19 50.94 32.12 65.04 1.45 93.45 87.20
MVP+S 83.81 71.07 60.45 51.20 31.77 64.76 1.44 93.45 87.17
SOTA 83.20f 70.50 59.80 50.60 31.40 63.80 1.39 96.22h 91.26
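Table 11: The results on six seen tasks under full tuning settings. a (Nguyen et al., 2021) b (Tang et al., 2022c) c (Gu et al., 2021) d (Lewis et al., 2020) e (Guan et al., 2021) f (Chen et al., 2022) g (Chen et al., 2020b) h (Raffel et al., 2020) i (Xu et al., 2021)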
Methods DART E2E ToTTo
B-4 R-2 ME B-4 R-2 ME B-4 R-2 ME
T5.1.1 34.31 45.22 36.30 42.57 46.60 38.20 39.79 49.90 36.80
ExT5 36.62 48.14 37.60 42.25 46.70 38.10 40.14 50.33 36.90
MVP 39.13 48.92 38.53 37.38 47.96 39.39 50.58 55.24 41.27
MVP+S 38.83 48.49 38.41 37.32 47.40 38.90 50.69 55.52 41.29
Methods WebNLG CommonGen SGD
B-4 R-2 ME B-4 R-2 ME B-4 R-2 ME
T5.1.1 31.67 43.31 34.40 8.38 17.01 20.20 33.15 36.17 32.40
ExT5 35.03 48.17 36.50 9.68 19.04 21.40 34.74 37.77 33.00
MVP 47.03 59.00 42.34 32.59 37.71 33.00 45.63 48.29 38.48
MVP+S 47.03 59.03 42.28 34.10 37.87 33.11 45.24 48.25 38.47
Methods WiA-A WiA-T WLE
B-4 R-2 ME B-4 R-2 ME B-4 R-2 ME
T5.1.1 29.30 38.37 30.10 42.12 50.52 36.2 15.55 20.47 19.60
ExT5 29.23 37.98 30.00 41.39 50.38 35.8 16.64 21.16 20.40
MVP 71.55 70.88 48.19 91.73 83.46 57.34 18.80 22.84 21.95
MVP+S 70.37 70.65 47.70 91.12 83.59 56.95 18.52 22.57 22.02
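Table 12: The results on the GEM benchmark under full tuning settings. We use the large versions of T5.1.1 and ExT5, and all their results are taken from Aribandi et al. (2022).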

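Appendix B Fine-tuning and Evaluation Details

In this section, we introduce the details of fine-tuning and evaluation for each downstream task.

For the experiments in Section 4 (Tables 2 and 6) and Appendix C (Table 11), the fine-tuning details are described in Section 4.

For the experiments on the GEM benchmark in Appendix C.2 (Table 12), the fine-tuning settings are the same as above. We use BLEU-4, ROUGE-2, and METEOR for evaluation. We use the GEM evaluation scripts (Footnote 11: https://github.com/GEM-benchmark/GEM-metrics).

For the experiments in Section 4.3 (Tables 4 and 5), the fine-tuning and evaluation details are as follows: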

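Appendix C Additional Results

In this section, we provide additional results of our MVP model and other baselines.

C.1 Results on Common Datasets

We also conduct experiments on eight common datasets under full tuning settings. Due to the space limitations of Section 4, these results are shown in Table 11. We can see that these results share a similar trend with those in Section 4, and we achieve SOTA performance on 6 of the 8 datasets.

C.2 Results on the GEM Benchmark

To better compare with ExT5 (Aribandi et al., 2022), we conduct experiments on the GEM benchmark (Gehrmann et al., 2021). For the "unseen" commonsense generation and text simplification tasks, we utilize the prompts of data-to-text generation and summarization, respectively. The results are presented in Table 12, and our MVP model outperforms ExT5 on 26 out of 27 metrics.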

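Appendix D Human Evaluation

We hired six university students who are proficient in English, with TOEFL or IELTS scores above 110 or 7.0, respectively. We paid each judge 0.2 US dollars per instance, with a total budget of 320 US dollars for 400 instances. The written instructions we provided to each judge are shown in Figure 2.

Appendix E Qualitative Examples

In this section, we present the linearized inputs, the human-written task instructions, and the corresponding outputs on a single dataset for each of the tasks in Section 4. We provide the results of BART, MVP, and MVP+S under full tuning settings. To minimize human intervention, we select the first and second instances of the test set.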

Thank you for taking the time to help us evaluate our scientific research! Our task is to present you with two pieces of machine-generated text and ask you to decide which one is superior. Your opinion will only be used to compare our two models; it will not be used for any other purpose.
We have four tasks to evaluate: 1. Text summarization: the input is a lengthy piece of news, and the output is a brief description of the content. Examine whether the abstract covers the majority of the news and whether there are any factual errors. 2. Knowledge-graph-to-text generation: the input is a knowledge graph (multiple triples), and the output is a text description of the graph. Note whether the description encompasses all of the input triples. 3. Open-ended dialogue: the input is two users’ background information and chat history, and the output is the next response. Examine whether the response is consistent with the contexts and background of the user at the time. 4. Story generation: the input is the beginning of the story, and the output is the following story. Keep in mind that the story needs to be coherent and consistent.
For each instance, you will see an input and two outputs (you will not know which model it comes from) in the table below, and you need to choose which one you believe is better (or a tie). You can base your decision on the output’s fluency, grammar, logic, whether it conforms to the input, and the features of each task.
Input she was on a flight . Output she was trying to take a nap . suddenly , her ears started ringing . the flight attendant tried to fix it but she could n’t . she had to call for help . luckily , they were able to fix the problem . she was bored and her ears hurt . she decided to take a nap . luckily , she was able to get a good night ’s sleep . but the next morning , she woke up and felt sick .
Left Wins Ties Right Wins
Figure 2: Human evaluation guidelines.
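The three choices offered to the judges (left wins, tie, right wins) lend themselves to a simple tally. The following Python sketch shows one way such pairwise judgments could be aggregated into proportions. The label strings and the function name `tally_preferences` are assumptions made for illustration, and this is not necessarily the aggregation procedure used for the reported results.

```python
from collections import Counter

LABELS = ("left", "tie", "right")  # hypothetical encoding of the three choices


def tally_preferences(judgments):
    """Return the share of each label among a list of pairwise judgments."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(judgments)
    if total == 0:
        raise ValueError("no judgments to tally")
    return {label: counts[label] / total for label in LABELS}
```

Because the judges do not know which model produced which output, the left and right labels would still need to be mapped back to the two models before the shares are reported per model.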
Input
Summarize: Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." Robin’s comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps. All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites. The publications said that they watched the video, which was found by a source close to the investigation. "One can hear cries of ’My God’ in several languages," Paris Match reported. "Metallic banging can also be heard more than three times, perhaps of the pilot trying to open the cockpit door with a heavy object. Towards the end, after a heavy shake, stronger than the others, the screaming intensifies. Then nothing." "It is a very disturbing scene," said Julian Reichelt, editor-in-chief of Bild online. An official with France’s accident investigation agency, the BEA, said the agency is not aware of any such video. Lt. Col. Jean-Marc Menichini, a French Gendarmerie spokesman in charge of communications on rescue efforts around the Germanwings crash site, told CNN that the reports were "completely wrong" and "unwarranted." Cell phones have been collected at the site, he said, but that they "hadn’t been exploited yet." Menichini said he believed the cell phones would need to be sent to the Criminal Research Institute in Rosny sous-Bois, near Paris, in order to be analyzed by specialized technicians working hand-in-hand with investigators. 
But none of the cell phones found so far have been sent to the institute, Menichini said. Asked whether staff involved in the search could have leaked a memory card to the media, Menichini answered with a categorical "no." Reichelt told "Erin Burnett: Outfront" that he had watched the video and stood by the report, saying Bild and Paris Match are "very confident" that the clip is real. He noted that investigators only revealed they’d recovered cell phones from the crash site after Bild and Paris Match published their reports. "That is something we did not know before. … Overall we can say many things of the investigation weren’t revealed by the investigation at the beginning," he said. What was mental state of Germanwings co-pilot? German airline Lufthansa confirmed Tuesday that co-pilot Andreas Lubitz had battled depression years before he took the controls of Germanwings Flight 9525, which he’s accused of deliberately crashing last week in the French Alps. Lubitz told his Lufthansa flight training school in 2009 that he had a "previous episode of severe depression," the airline said Tuesday. Email correspondence between Lubitz and the school discovered in an internal investigation, Lufthansa said, included medical documents he submitted in connection with resuming his flight training. The announcement indicates that Lufthansa, the parent company of Germanwings, knew of Lubitz’s battle with depression, allowed him to continue training and ultimately put him in the cockpit. Lufthansa, whose CEO Carsten Spohr previously said Lubitz was 100% fit to fly, described its statement Tuesday as a "swift and seamless clarification" and said it was sharing the information and documents – including training and medical records – with public prosecutors. Spohr traveled to the crash site Wednesday, where recovery teams have been working for the past week to recover human remains and plane debris scattered across a steep mountainside. 
He saw the crisis center set up in Seyne-les-Alpes, laid a wreath in the village of Le Vernet, closer to the crash site, where grieving families have left flowers at a simple stone memorial. Menichini told CNN late Tuesday that no visible human remains were left at the site but recovery teams would keep searching. French President Francois Hollande, speaking Tuesday, said that it should be possible to identify all the victims using DNA analysis by the end of the week, sooner than authorities had previously suggested. In the meantime, the recovery of the victims’ personal belongings will start Wednesday, Menichini said. Among those personal belongings could be more cell phones belonging to the 144 passengers and six crew on board. Check out the latest from our correspondents. The details about Lubitz’s correspondence with the flight school during his training were among several developments as investigators continued to delve into what caused the crash and Lubitz’s possible motive for downing the jet. A Lufthansa spokesperson told CNN on Tuesday that Lubitz had a valid medical certificate, had passed all his examinations and "held all the licenses required." Earlier, a spokesman for the prosecutor’s office in Dusseldorf, Christoph Kumpa, said medical records reveal Lubitz suffered from suicidal tendencies at some point before his aviation career and underwent psychotherapy before he got his pilot’s license. Kumpa emphasized there’s no evidence suggesting Lubitz was suicidal or acting aggressively before the crash. Investigators are looking into whether Lubitz feared his medical condition would cause him to lose his pilot’s license, a European government official briefed on the investigation told CNN on Tuesday. While flying was "a big part of his life," the source said, it’s only one theory being considered. 
Another source, a law enforcement official briefed on the investigation, also told CNN that authorities believe the primary motive for Lubitz to bring down the plane was that he feared he would not be allowed to fly because of his medical problems. Lubitz’s girlfriend told investigators he had seen an eye doctor and a neuropsychologist, both of whom deemed him unfit to work recently and concluded he had psychological issues, the European government official said. But no matter what details emerge about his previous mental health struggles, there’s more to the story, said Brian Russell, a forensic psychologist. "Psychology can explain why somebody would turn rage inward on themselves about the fact that maybe they weren’t going to keep doing their job and they’re upset about that and so they’re suicidal," he said. "But there is no mental illness that explains why somebody then feels entitled to also take that rage and turn it outward on 149 other people who had nothing to do with the person’s problems." Germanwings crash compensation: What we know. Who was the captain of Germanwings Flight 9525? CNN’s Margot Haddad reported from Marseille and Pamela Brown from Dusseldorf, while Laura Smith-Spark wrote from London. CNN’s Frederik Pleitgen, Pamela Boykoff, Antonia Mortensen, Sandrine Amiel and Anna-Maja Rappard contributed to this report.
Gold
Match are "very confident" the video clip is real, an editor says. Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says.
BART
Prosecutor: "A person who has such a video needs to immediately give it to the investigators" French Gendarmerie spokesman: Cell phones have been collected at the site, but haven’t been exploited.
MVP
Prosecutor: "So far no videos were used in the crash investigation" An official with France’s accident investigation agency says the agency is not aware of any video.
MVP+S
Prosecutor: "So far no videos were used in the crash investigation" Two publications claim to have found a cell phone video of the crash. "It is a very disturbing scene," says editor-in-chief of Bild online.
Table 13: The first instance from the CNN/Daily Mail dataset. Human-written task instructions are labeled in italics.
Input
Summarize: The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC’s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians’ efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday’s ceremony, said it was a move toward greater justice. "As Palestine formally becomes a State Party to the Rome Statute today, the world is also a step closer to ending a long era of impunity and injustice," he said, according to an ICC news release. "Indeed, today brings us closer to our shared goals of justice and peace." Judge Kuniko Ozaki, a vice president of the ICC, said acceding to the treaty was just the first step for the Palestinians. "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. These are substantive commitments, which cannot be taken lightly," she said. Rights group Human Rights Watch welcomed the development. 
"Governments seeking to penalize Palestine for joining the ICC should immediately end their pressure, and countries that support universal acceptance of the court’s treaty should speak out to welcome its membership," said Balkees Jarrah, international justice counsel for the group. "What’s objectionable is the attempts to undermine international justice, not Palestine’s decision to join a treaty to which over 100 countries around the world are members." In January, when the preliminary ICC examination was opened, Israeli Prime Minister Benjamin Netanyahu described it as an outrage, saying the court was overstepping its boundaries. The United States also said it "strongly" disagreed with the court’s decision. "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement. It urged the warring sides to resolve their differences through direct negotiations. "We will continue to oppose actions against Israel at the ICC as counterproductive to the cause of peace," it said. But the ICC begs to differ with the definition of a state for its purposes and refers to the territories as "Palestine." While a preliminary examination is not a formal investigation, it allows the court to review evidence and determine whether to investigate suspects on both sides. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." The war between Israel and Hamas militants in Gaza last summer left more than 2,000 people dead. The inquiry will include alleged war crimes committed since June. The International Criminal Court was set up in 2002 to prosecute genocide, crimes against humanity and war crimes. CNN’s Vasco Cotovio, Kareem Khadder and Faith Karimi contributed to this report.
Gold
Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June. Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis.
BART
Palestinian Authority becomes 123rd member of the International Criminal Court. The move gives the court jurisdiction over alleged crimes in Palestinian territories. Israel and the United States opposed the Palestinians’ efforts to join the body.
MVP
"Today brings us closer to our shared goals of justice and peace," foreign minister says. The Palestinians signed the ICC’s founding Rome Statute in January. The move gives the court jurisdiction over alleged crimes in Palestinian territories.
MVP+S
"Today brings us closer to our shared goals of justice and peace," foreign minister says. The United States says it "strongly" disagrees with the decision. The Palestinian Authority is the 123rd member of the International Criminal Court.
Table 14: The second instance from the CNN/Daily Mail dataset.
Input
Describe the following data: Abilene,_Texas | cityServed | Abilene_Regional_Airport
Gold
Abilene, Texas is served by the Abilene regional airport.
Abilene Regional Airport serves the city of Abilene in Texas.
BART
Abilene Regional Airport serves the city of Abilene in Texas.
MVP
Abilene Regional Airport serves the city of Abilene, Texas.
MVP+S
Abilene Regional Airport serves the city of Abilene, Texas.
Table 15: The first instance from the WebNLG dataset, which has two golden target sentences.
Input
Describe the following data: "Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas" | location | Adolfo_Suárez_Madrid–Barajas_Airport
Gold
Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
Adolfo Suarez Madrid-Barajas airport is located at Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
Adolfo Suarez Madrid-Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastian de los Reyes and Alcobendas.
BART
Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
MVP
Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
MVP+S
Adolfo Suárez Madrid–Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.
Table 16: The second instance from the WebNLG dataset, which has three golden target sentences.
Input
Generate the question based on the answer: Saint Bernadette Soubirous [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ’ s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .
Gold
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France ?
BART
Who is believed to have appeared to the Virgin Mary at Lourdes ?
MVP
Who did the Virgin Mary appear to in Lourdes ?
MVP+S
The Grotto is a replica of the grotto at Lourdes , France where the Virgin Mary appeared to whom ?
Table 17: The first instance from the SQuAD dataset.
Input
Generate the question based on the answer: a copper statue of Christ [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ’ s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .
Gold
What is in front of the Notre Dame Main Building ?
BART
What is in front of the Main Building and facing it ?
MVP
What is immediately in front of the Main Building ?
MVP+S
What is immediately in front of the Main Building ?
Table 18: The second instance from the SQuAD dataset.
Input
Answer the following question: what color was cotton ? [X_SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ’ s horses slept . but cotton wasn ’ t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ’ s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ’ s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ’ s mommy rubbed her face on cotton ’ s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ’ s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ’ s fur was all all dry . " don ’ t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ’ t want that ! " then cotton thought , " i change my mind . i like being special " .
Gold
white
BART
white
MVP
white
MVP+S
white
Table 19: The first instance from the CoQA dataset.
Input
Answer the following question: what color was cotton ? [SEP] white [X_SEP] where did she live ? [X_SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ’ s horses slept . but cotton wasn ’ t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ’ s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ’ s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ’ s mommy rubbed her face on cotton ’ s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ’ s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ’ s fur was all all dry . " don ’ t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ’ t want that ! " then cotton thought , " i change my mind . i like being special " .
Gold
in a barn
BART
in a barn
MVP
in a barn
MVP+S
in a barn
Table 20: The second instance from the CoQA dataset.
Input
Given the story title: male was out jogging one morning .
Gold
the weather was crisp and cool . male felt good and energetic . he decided to keep jogging longer than normal . male went several more miles out of his way .
BART
he saw a man struggling to walk across a field . male stopped to help him . he saw a guy struggling to walking across a fields . the man thanked male profusely !
MVP
he ran into a man who was holding a knife . male was scared , so he ran away . the man threatened male . male ran back into his house .
MVP+S
he noticed a sound coming from a tree . male went to investigate . he found a small baby bird stuck in the trunk . male took the bird to the vet and they released it .
Table 21: The first instance from the ROCStories dataset.
Input
Given the story title: ken was driving around in the snow .
Gold
he needed to get home from work . he was driving slowly to avoid accidents . unfortunately the roads were too slick and ken lost control . his tires lost traction and he hit a tree .
BART
he hit a patch of black ice . ken slid and hurt his knee . he went to the doctor for it . krampus was sent home with a concussion .
MVP
he was in his car and it was very cold . ken started to back up . the car started to make a loud noise . kena had to back out and call for help .
MVP+S
he was having trouble starting his car . he had to push it a couple of blocks . his car would n’t start . ken had to call a tow truck for help .
Table 22: The second instance from the ROCStories dataset.
Input
Given the dialog: i love to meet new people . [SEP] i have a turtle named timothy . [SEP] my favorite sport is ultimate frisbee . [SEP] my parents are living in bora bora . [SEP] autumn is my favorite season . [X_SEP] hello , how are you doing tonight ?
Gold
i am well an loving this interaction how are you ?
BART
i am doing well , how are you ?
MVP
i am doing well . how are you ?
MVP+S
i am doing well . how are you ?
Table 23: The first instance from the PersonaChat dataset.
Input
Given the dialog: i just bought a brand new house . [SEP] i like to dance at the club . [SEP] i run a dog obedience school . [SEP] i have a big sweet tooth . [SEP] i like taking and posting selkies . [X_SEP] hello , how are you doing tonight ? [SEP] i am well an loving this interaction how are you ?
Gold
i am great . i just got back from the club .
BART
i am doing well , thanks for asking . what do you do for fun ?
MVP
i am doing well . i just got home from dancing at the club .
MVP+S
i am doing well . i just bought a new house .
Table 24: The second instance from the PersonaChat dataset.
Input
Given the task dialog: Belief state [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .
Given the task dialog: Dialogue action [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .
Given the task dialog: System response [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .
Gold
[taxi] destination pizza hut fen ditton departure saint john ’s college
[taxi] [request] leave arrive
what time do you want to leave and what time do you want to arrive by ?
BART
[taxi] destination pizza express fen ditton departure saint johns college
[taxi] [request] leave
what time would you like to leave or arrive ?
MVP
[taxi] destination pizza hut fen ditton departure saint john ’s college
[taxi] [request] leave
sure ! when would you like to arrive ?
MVP+S
[taxi] destination pizza hut fen ditton departure saint john ’s college
[taxi] [request] leave arrive
what time would you like to leave ?
Table 25: The first instance from the MultiWOZ dataset. The task-oriented dialogue is composed of dialogue state tracking, dialogue action learning, and system response generation.
Input
Given the task dialog: Belief state [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .
Given the task dialog: Dialogue action [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .
Given the task dialog: System response [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .
Gold
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! your taxi will be [value_car] contact number is [value_phone]
BART
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! booked car type : [value_car] contact number : [value_phone]
MVP
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! booked car type : [value_car] contact number : [value_phone]
MVP+S
[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15
[taxi] [inform] car phone
booking completed ! booked car type : [value_car] contact number : [value_phone]
Table 26: The second instance from the MultiWOZ dataset.
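The two MultiWOZ instances show that each dialogue turn is turned into three inputs, one per sub-task, which share the same dialogue history. A minimal Python sketch of that construction follows. The function name `task_dialog_inputs` is hypothetical, the `[db_nores]` token is copied from the examples, and exposing it as a parameter is an assumption, since no other value appears in this appendix.

```python
SEP = " [SEP] "      # joins consecutive utterances of the dialogue history
X_SEP = " [X_SEP] "  # joins the sub-task name, the database token, and the history


def task_dialog_inputs(turns, db_token="[db_nores]"):
    """Build the belief-state, dialogue-action, and system-response inputs."""
    history = SEP.join(turns)
    prefix = "Given the task dialog: "
    return {
        # Dialogue state tracking sees only the history.
        "belief_state": f"{prefix}Belief state{X_SEP}{history}",
        # The other two sub-tasks also receive the database token.
        "dialogue_action": f"{prefix}Dialogue action{X_SEP}{db_token}{X_SEP}{history}",
        "system_response": f"{prefix}System response{X_SEP}{db_token}{X_SEP}{history}",
    }
```

Calling the function with the single user utterance of the first instance reproduces the three input lines of Table 25, and calling it with the three utterances of the second instance reproduces those of Table 26.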