MVP: Multi-task Supervised Pre-training for Natural Language Generation

Tianyi Tang^{1,4}, Junyi Li^{1,3}, Wayne Xin Zhao^{1,4,🖂}, Ji-Rong Wen^{1,2,4} (🖂 Corresponding author)
¹Gaoling School of Artificial Intelligence, Renmin University of China
²School of Information, Renmin University of China
³DIRO, Université de Montréal
⁴Beijing Key Laboratory of Big Data Management and Analysis Methods
steventianyitang@outlook.com lijunyi@ruc.edu.cn batmanfly@gmail.com

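Abstract

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Up to now, most NLG-oriented PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, a growing number of models pre-trained with labeled data (i.e., "supervised pre-training") have shown superior performance compared to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from $77$ datasets over $11$ diverse NLG tasks. We then unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train specific soft prompts to stimulate the model's capacity to perform that task. Our MVP model can be seen as a practice of applying recent instruction tuning to relatively small PLMs. Extensive experiments demonstrate the effectiveness and generality of our MVP model on a number of NLG tasks: it achieves state-of-the-art performance on $13$ out of $17$ datasets, outperforming BART by $9.3\%$ and Flan-T5 by $5.8\%$.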

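1 Introduction

Natural language generation (NLG, also known as text generation) is a crucial capacity for language intelligence, which aims to generate human-like texts on demand (Garbacea and Mei, 2020). Since the emergence of the pre-training and fine-tuning paradigm, pre-trained language models (PLMs) have dominated mainstream approaches for NLG tasks (Lewis et al., 2020; Brown et al., 2020). With a large-scale general corpus, the majority of PLMs are pre-trained in an unsupervised (self-supervised) manner by leveraging intrinsic data correlations as supervision signals. However, unsupervised pre-training is likely to incorporate noise that affects the performance of downstream tasks (Feng et al., 2022), and it also leads to a slower rate of acquiring knowledge (Zhang et al., 2021).

In the meantime, more and more large-scale labeled datasets have become easily accessible (Deng et al., 2009; Liu et al., 2020). There is growing evidence that pre-training with labeled data can further improve the performance of PLMs, both in computer vision (He et al., 2016; Dosovitskiy et al., 2021) and in natural language processing (Lin et al., 2020b; Su et al., 2022). These promising developments motivate us to consider pre-training text generation models with labeled data, which is called "supervised pre-training" (Feng et al., 2022). Existing work has shown that supervised pre-training can explicitly learn task-specific characteristics and alleviate the discrepancy between unsupervised pre-training and supervised fine-tuning (Lin et al., 2020b).

Furthermore, most NLG systems are trained in a supervised way, requiring supervision signals to learn the input-to-output transformation. For example, dialogue systems learn to generate appropriate responses based on historical utterances, and text summarization systems learn to extract essential information from long documents according to human-written summaries. Therefore, we suspect that supervised pre-training is inherently better suited to NLG-oriented PLMs, since it can provide task-related instructions early in the pre-training stage instead of in the later fine-tuning stage.

Inspired by the recent success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation by leveraging a variety of labeled text generation datasets. Specifically, we collect a large-scale labeled corpus, MVPCorpus, consisting of $77$ datasets over $11$ text generation tasks. Since recent research shows that extensive multi-task pre-training (Aribandi et al., 2022) is the key to generalizing to new tasks for large PLMs, we combine these labeled datasets for multi-task pre-training. Existing popular works, as shown in Table 1, mainly focus on NLU tasks (Sanh et al., 2022; Aribandi et al., 2022) or use unsupervised pre-training (Lewis et al., 2020; Raffel et al., 2020), with no consideration of supervised pre-training on NLG tasks. To fill this gap, we explore supervised pre-training and multi-task learning for deriving both effective and general NLG models.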

Settings	Supervised Pre-training	Unsupervised Pre-training
NLG	MVP (ours)	GPT-2, MASS, BART, T5
NLU	FLAN, T0, Muppet, ExT5	BERT, XLNet, RoBERTa, T5

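Table 1: Representative PLMs for NLG and NLU tasks using (un)supervised pre-training. We present a more detailed comparison and discussion of supervised pre-training in Section 5.

To develop our approach, we adopt a Transformer-based (Vaswani et al., 2017) sequence-to-sequence model as the backbone. In multi-task training, different tasks may "neutralize" the abilities learned through other tasks (He and Choi, 2021). To mitigate this potential issue, we propose to learn task-specific prompts based on the MVP model, following the structure of prefix-tuning (Li and Liang, 2021). Task-specific pre-training enables the prompts to "store" specialized knowledge for each corresponding task. Integrating MVP with task-specific prompts can further stimulate the model's capacity to perform specific tasks.

To summarize, our main contributions center around the following research questions: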

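• How to train an NLG-oriented PLM in a supervised pre-training way? To prepare the supervised corpus, we collect a massive labeled corpus, MVPCorpus, consisting of $77$ datasets over $11$ NLG tasks across various domains and specific objectives. To the best of our knowledge, MVPCorpus is the largest collection of NLG datasets. We first formulate different NLG tasks in a general text-to-text form using task instructions, so that the supervised corpus can be used in a unified way for pre-training an NLG model. Our work presents a simple yet general approach for pre-training a more capable NLG model by leveraging various labeled NLG datasets.

• Can supervised pre-trained NLG models be both effective and general? Extensive experiments show that the supervised pre-trained MVP outperforms the unsupervised pre-trained BART in both the full tuning ($+9.3\%$ in ratio) and the parameter-efficient tuning ($+4.3\%$ in ratio) settings. Our MVP model achieves state-of-the-art performance on $13$ out of $17$ datasets and outperforms Flan-T5 (Chung et al., 2022) by $5.8\%$. Our zero-shot performance also surpasses T0-11B (Sanh et al., 2022) by a large margin. Furthermore, experiments on unseen NLG and NLU tasks demonstrate that our supervised MVP model generalizes strongly to unseen tasks.

For reproducing and reusing our work, we release the MVPCorpus collection, all the MVP model variants, and the accompanying code at https://github.com/RUCAIBox/MVP.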

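2 Related Work

Figure 1: Overview of the pre-training process of our MVP model and the task-specific prompts.

Pre-trained Language Models.

Pre-trained language models have achieved exceptional success in a wide range of tasks, and the majority of them are pre-trained in an unsupervised manner (Devlin et al., 2019; Brown et al., 2020). For example, with large-scale plain text as the unsupervised pre-training corpus ($570$ GB), GPT-3 (Brown et al., 2020) employs language modeling as the pre-training task, i.e., predicting the next token conditioned on the previous tokens. In the meantime, the computer vision community has benefited greatly from the labeled dataset ImageNet (Deng et al., 2009). Influential models, such as ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2021), leverage ImageNet for pre-training. Inspired by the success of pre-training with labeled data, machine translation researchers have explored supervised pre-training (McCann et al., 2017; Lin et al., 2020b). Lin et al. (2020b) pre-train a translation model, mRASP, with parallel data in multiple languages. Despite using much less pre-training data, mRASP still achieves better performance than translation models pre-trained in an unsupervised manner (Liu et al., 2020). In this paper, we propose to pre-train a universal NLG model in a supervised manner with collections of labeled datasets ($23$ GB).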

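Multi-task Learning.

Our pre-training process is also related to multi-task learning (MTL), a method of mixing multiple tasks into a single training process (Collobert and Weston, 2008). A model trained with MTL can benefit from helpful knowledge of relevant tasks, resulting in improved performance (Subramanian et al., 2018). Recently, MT-DNN (Liu et al., 2019a) and Muppet (Aghajanyan et al., 2021) collect tens of datasets in the multi-task procedure and achieve better performance on downstream tasks. The pre-finetuning schema proposed in Muppet shares a similar idea with our study. Aribandi et al. (2022) further combine the denoising pre-training task of T5 (Raffel et al., 2020) and multi-task learning to pre-train a new model, ExT5. MTL has also contributed to sub-fields of text generation, such as open-ended dialogue systems (Zhang et al., 2020), task-oriented dialogue systems (Su et al., 2022), text style transfer (Bujnowski et al., 2020), and question answering (Khashabi et al., 2020). At the same time, researchers explore the transferability of models trained on multi-task datasets (Mishra et al., 2022). FLAN (Wei et al., 2022), T0 (Sanh et al., 2022), ZeroPrompt (Xu et al., 2022), and FLAN-T5 (Chung et al., 2022) investigate the zero-shot or few-shot generalization abilities of large language models (LLMs) (Zhao et al., 2023) trained on numerous task datasets with well-designed prompts. Compared with these works, we aim to explore multi-task learning to derive both effective and general NLG models in a supervised pre-training manner.

Prompt Learning.

Prompt learning is a thriving method in the field of NLP. It converts fine-tuning text into a format similar to pre-training in order to leverage implicit pre-training knowledge and alleviate the discrepancy between pre-training and fine-tuning (Liu et al., 2021b). GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020) add human-written task prompts to the input text. For instance, T5 prepends "Summarize:" to the input document for summarization tasks. Some researchers also design elaborate prompts for each task and dataset and investigate their effectiveness and robustness (Wei et al., 2022; Sanh et al., 2022). To overcome the constraints of manually constructed prompts, researchers develop continuous (soft) prompts that can be optimized in continuous space (Lester et al., 2021; Qin and Eisner, 2021; Tang et al., 2022b). Considering the random initialization of soft prompts, Gu et al. (2022) propose PPT to pre-train continuous prompts using unlabeled data. SPoT (Vu et al., 2022), UnifiedSKG (Xie et al., 2022), and PTG (Li et al., 2022a) further learn prompts on related tasks and transfer them to new tasks.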

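3 The MVP Model

This section introduces our MVP model: a Multi-task superVised Pre-trained model for natural language generation. An overview of our model is shown in Figure 1.

3.1 Data Collection

Formally, the natural language generation (NLG) task aims to generate a sequence of tokens $\mathcal{Y}=(y_{1},y_{2},\dots,y_{n})$ conditioned on input data $\mathcal{X}$ (e.g., a piece of text or structured data) (Li et al., 2022b).

In this paper, we collect a large-scale labeled MVPCorpus consisting of $77$ labeled datasets from $11$ representative NLG tasks¹, including commonsense generation, data-to-text generation, open-ended dialogue system, paraphrase generation, question answering, question generation, story generation, task-oriented dialogue system, text simplification, text style transfer, and text summarization. These datasets come from various domains and are of different sizes. Some datasets are elaborately hand-crafted and thus relatively small, while others are created for large-scale weak supervision. Detailed descriptions of these tasks can be found in Appendix A.1.

¹ We do not consider machine translation tasks in this work and only focus on English tasks.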

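Next, we convert the different input data $\mathcal{X}$ of each task into a unified text-to-text format. For instance, for data-to-text generation we linearize structured data (e.g., knowledge graphs or tables) by concatenating triples or key-value pairs with the special token "[SEP]", and for question generation we use the special token "[X_SEP]" to separate answers and paragraphs. The transformed input format for each task can be found in Appendix E.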
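The conversion described here can be illustrated with a minimal sketch. Only the two special tokens "[SEP]" and "[X_SEP]" come from the text; the helper names, the ` | ` separator inside a triple, and the instruction wording are hypothetical choices made for illustration, since the exact formats are given in Appendix E rather than in this section.

```python
from typing import Iterable, Tuple

SEP = "[SEP]"      # special token named in the text (data-to-text generation)
X_SEP = "[X_SEP]"  # special token named in the text (question generation)


def linearize_triples(triples: Iterable[Tuple[str, str, str]],
                      field_sep: str = " | ") -> str:
    """Flatten (head, relation, tail) triples into one string.

    Triples are joined with [SEP]; the separator used inside a triple
    (field_sep) is an assumption, not taken from the paper.
    """
    return f" {SEP} ".join(field_sep.join(triple) for triple in triples)


def build_qg_input(answer: str, passage: str,
                   instruction: str = "Generate the question:") -> str:
    """Build a question-generation input: answer and passage split by [X_SEP].

    The instruction string is a hypothetical placeholder.
    """
    return f"{instruction} {answer} {X_SEP} {passage}"


if __name__ == "__main__":
    graph = [("Alan Turing", "field", "computer science"),
             ("Alan Turing", "born in", "London")]
    print(linearize_triples(graph))
    print(build_qg_input("London", "Alan Turing was born in London."))
```

Keeping every task as a plain string on both sides is what lets a single sequence-to-sequence model consume all of them without task-specific input encoders.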

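We divide MVPCorpus into two parts, which are used for pre-training and fine-tuning (evaluation), respectively. For supervised pre-training, we utilize $50$ datasets from $7$ tasks, including data-to-text generation, open-ended dialogue system, question answering, question generation, story generation, task-oriented dialogue system, and text summarization. We also eliminate pre-training examples that overlap with the evaluation data to avoid data leakage (more details in Appendix A.2). Finally, we have a $25$ GB supervised pre-training corpus containing $32$ M examples. The statistics of the pre-training datasets are listed in Table 9.

For evaluation, we utilize the remaining $27$ datasets, which are more commonly used in the literature. Among these datasets, $23$ are from the $7$ tasks used in pre-training. We refer to them as seen tasks and use them to test the effectiveness of our model. The remaining $4$ datasets are from the tasks of commonsense generation, paraphrase generation, simplification, and style transfer, respectively. We refer to them as unseen tasks and use them to examine the generality of our model.

3.2 Model Architecture

Our MVP model is built on the standard Transformer encoder-decoder architecture (Vaswani et al., 2017). Compared to decoder-only PLMs such as GPT-3 (Brown et al., 2020) and prefix LMs such as UniLM (Dong et al., 2019), the encoder-decoder architecture is more effective for text generation tasks (Raffel et al., 2020). In the first stage, we pre-train the MVP backbone using a mixture of labeled datasets from seven tasks. To indicate each task, we apply human-written instructions to each task instance. For example, we write "Summarize:" as the prompt for summarization tasks. The manual instructions for each task are shown in Appendix E.

In the second stage, we freeze the MVP backbone and pre-train a set of task-specific prompts (i.e., continuous vectors) to stimulate the model's capacity to perform specific tasks. Specifically, we follow prefix-tuning (Li and Liang, 2021) to insert continuous vectors at each Transformer layer and learn them using a mixture of the corresponding intra-task datasets (i.e., datasets under the same task²). Compared to prompt tuning (Lester et al., 2021), which only adds prompts to the input layer, layer-wise prompts are more effective and stable (Liu et al., 2022), especially for NLG tasks. These soft prompts, which are not shared between tasks, encode task-specific semantic knowledge to alleviate the blurring-out problem induced by multi-task learning (He and Choi, 2021).

² For instance, we train summarization-specific prompts using summarization datasets, e.g., Newsroom (Grusky et al., 2018), WikiHow (Koupaee and Wang, 2018), and MSNews (Liu et al., 2021a).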

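3.3 Training Details

Our MVP model adopts a Transformer with $12$ layers in both the encoder and the decoder ($406$ M parameters), the same model size as BART_large (Lewis et al., 2020). We initialize the backbone with the BART parameters to provide a good starting point for NLG tasks, following previous work (Dong et al., 2019; Zhang et al., 2020). We pre-train the model with a batch size of $8{,}192$ and adopt a temperature-scaled mixing strategy (Raffel et al., 2020) with a rate of $T=2$ to mitigate the disparity in tasks and datasets.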
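As a rough illustration of temperature-scaled mixing, the sketch below raises each dataset's size to the power $1/T$ and renormalizes, which is the scheme described by Raffel et al. (2020). The function name, the optional size cap, and the dataset sizes are hypothetical; the text only specifies $T=2$.

```python
from typing import Dict, Optional


def temperature_mixing_rates(num_examples: Dict[str, int],
                             temperature: float = 2.0,
                             size_limit: Optional[int] = None) -> Dict[str, float]:
    """Return sampling rates proportional to size ** (1 / temperature).

    temperature = 1 gives size-proportional sampling; larger values flatten
    the distribution toward uniform. size_limit optionally caps very large
    datasets before scaling (an assumption, not stated in this section).
    """
    scaled = {}
    for name, n in num_examples.items():
        capped = n if size_limit is None else min(n, size_limit)
        scaled[name] = capped ** (1.0 / temperature)
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen only to make the effect visible.
    sizes = {"summarization": 900_000, "story": 100_000}
    print(temperature_mixing_rates(sizes, temperature=1.0))
    print(temperature_mixing_rates(sizes, temperature=2.0))
```

With these made-up sizes, proportional sampling would draw the smaller dataset 10% of the time, while $T=2$ raises that to 25%, which is how the strategy reduces the gap between large and small datasets.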

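We follow prefix-tuning (Li and Liang, 2021) to pre-train task-specific prompts by prepending trainable vectors to the multi-head attention modules at each layer. The prompt length is set to $100$, and we utilize an MLP reparameterization function with a hidden size of $800$ to improve training robustness and performance (Li and Liang, 2021). Hence, every group of task prompts has approximately $62$ M parameters. Then, we freeze the MVP model and train seven groups of task-specific prompts, each of which corresponds to a different task.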
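The figure of roughly $62$ M parameters per prompt group can be reproduced with a back-of-the-envelope count. The sketch below assumes a particular layout that is not spelled out in this section: three independent prefix sets (encoder self-attention, decoder self-attention, and decoder cross-attention), each consisting of a prompt embedding table followed by a two-layer MLP that outputs a key and a value vector for every layer, with the BART_large hidden size of 1024. The prompt length, MLP hidden size, and layer count are the values stated above.

```python
def prefix_param_count(prompt_len: int = 100,
                       d_model: int = 1024,      # BART_large hidden size
                       mlp_hidden: int = 800,
                       n_layers: int = 12,
                       n_prefix_sets: int = 3) -> int:  # assumed layout
    """Count trainable parameters of MLP-reparameterized layer-wise prompts."""
    embedding = prompt_len * d_model                 # prompt embedding table
    down = d_model * mlp_hidden + mlp_hidden         # first linear layer + bias
    out_dim = n_layers * 2 * d_model                 # key and value per layer
    up = mlp_hidden * out_dim + out_dim              # second linear layer + bias
    return n_prefix_sets * (embedding + down + up)


if __name__ == "__main__":
    total = prefix_param_count()
    print(f"{total:,} parameters per task")          # 61,823,328
    print(f"backbone + prompts: {406 + total / 1e6:.0f}M")
```

Under these assumptions a prompt group has 61,823,328 parameters, and adding it to the $406$ M backbone gives the $468$ M total quoted for MVP with task-specific prompts.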

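In both stages, the maximum length of the input and output sequences is set to $1{,}024$ so that examples can contain more tokens. We optimize the model with a constant learning rate of $3\times 10^{-5}$ using the standard sequence-to-sequence cross-entropy loss. We apply the AdamW optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.98$, and $\epsilon=1\times 10^{-6}$ to improve training stability (Liu et al., 2019b). The weight decay coefficient is $0.1$. For testing, we select the checkpoint with the highest validation performance. All experiments are conducted on $32$ NVIDIA Tesla V100 $32$ GB GPUs. We implement our model using the text generation library TextBox (Tang et al., 2022a).

In summary, we pre-train a $406$ M generation model, MVP, and seven groups of $62$ M task-specific prompts. For each downstream task, users can either use the backbone ($406$ M) directly or further combine MVP with task-specific prompts ($468$ M).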

Methods	CNN/DailyMail			WebNLG			SQuAD (QG)			CoQA
Methods	R- $1$	R- $2$	R-L	B- $4$	ME	R-L	B- $4$	ME	R-L	F1	EM
MVP	44.52	21.62	41.10	67.82	47.47	76.88	26.26	27.35	53.49	86.43	77.78
BART	44.16^e	21.28	40.90	64.55^b	46.51	75.13	22.00^f	26.40	52.55	68.60^f	–
Flan-T5	43.45	21.01	40.03	66.60	46.93	75.76	25.55	26.90	53.51	84.18	75.44
Single	44.36	21.54	40.88	67.74	46.89	76.94	26.09	27.15	53.29	86.20	77.26
MVP+S	44.63	21.72	41.21	68.19	47.75	76.81	25.69	27.04	53.20	86.65	77.93
MVP+R	44.14	21.45	40.72	67.61	47.65	76.70	25.71	27.03	53.09	85.95	77.22
MVP+M	43.97	21.16	40.46	67.45	47.57	76.81	25.46	26.79	52.95	86.28	77.26
SOTA	47.16^a	22.55	43.87	66.14^b	47.25	76.10	25.97^c	27.33	53.43	84.50^d	–
Methods	ROCStories				PersonaChat				MultiWOZ
Methods	B- $1$	B- $2$	D- $1$	D- $4$	B- $1$	B- $2$	D- $1$	D- $2$	B- $4$	Success	Inform
MVP	33.79	15.76	3.02	75.65	50.73	40.69	1.65	11.23	20.26	76.40	85.00
BART	30.70^g	13.30	–	69.90	49.90^f	40.00	1.30	8.00	17.89^j	74.91	84.88
Flan-T5	32.72	15.23	2.97	68.97	48.55	40.22	1.40	7.85	19.73	70.20	78.70
Single	32.67	15.29	2.72	72.97	49.96	40.53	1.27	7.63	19.73	75.60	83.70
MVP+S	33.92	15.60	3.44	80.58	47.91	39.97	1.52	9.54	20.32	79.90	86.80
MVP+R	32.93	15.32	2.88	73.83	48.45	40.09	1.30	7.95	19.02	73.30	81.80
MVP+M	33.30	15.51	2.71	74.24	46.26	39.30	1.36	8.07	19.93	72.70	79.70
SOTA	33.40^g	15.40	–	69.30	49.90^f	40.00	1.50^h	9.40	20.50ⁱ	85.30	94.40

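Table 2: The main results on seven seen tasks under full tuning settings. The best and second-best results among all the methods are marked in bold and underlined, respectively. The SQuAD dataset here is used for the question generation task. The letters B, R, D, and ME denote BLEU, ROUGE, Distinct, and METEOR, respectively. "–" means the work does not compute the corresponding result. ^a (Ravaut et al., 2022) ^b (Ke et al., 2021) ^c (Bao et al., 2021) ^d (Xiao et al., 2020) ^e (Lewis et al., 2020) ^f (Liu et al., 2021a) ^g (Guan et al., 2021) ^h (Chen et al., 2022) ⁱ (He et al., 2022) ^j (Lin et al., 2020c)

4 Experiment Results

In this section, we mainly investigate the effectiveness and generality of our MVP model. We conduct extensive experiments in different settings: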

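• Under full tuning scenarios, we employ the $27$ generation datasets and the GLUE benchmark (Wang et al., 2019) for evaluation. Section 4.1 and Appendix C analyze the results on $23$ datasets from $7$ seen tasks. Section 4.3 includes the results on $4$ unseen generation tasks and $8$ understanding tasks. To better compare with ExT5, we conduct experiments on the GEM benchmark (Gehrmann et al., 2021) in Appendix C.2.

• In zero-shot learning, we compare our models with T0 in Section 4.2.

• In parameter-efficient tuning settings, we utilize the same datasets as in Section 4.1, and the results can be found in Section 4.4.

• We conduct a human evaluation in Section 4.5.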

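For the full tuning setting (Tables 2 and 11), we fine-tune the entire model (including the backbone MVP and the prompts), while for parameter-efficient tuning (Table 6), we only fine-tune the prompts and freeze the parameter weights of MVP. We optimize the model via the seq2seq loss with a label smoothing (Szegedy et al., 2016) factor of $0.1$ and the AdamW optimizer with default hyper-parameters. We sweep over the batch size in $\{16,64,256\}$ and the learning rate in $\{5\times 10^{-6},1\times 10^{-5},3\times 10^{-5}\}$ to find the optimal hyper-parameters for each evaluation task. We utilize the checkpoint with the best validation performance for test set inference. During inference, we set the beam size to $5$ and the no-repeat n-gram size to $3$. Details regarding fine-tuning and evaluation can be found in Appendix B.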
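For readers unfamiliar with label smoothing, the following sketch computes a label-smoothed cross-entropy for a single token position. It uses one common convention, in which the target distribution is $(1-\epsilon)$ times the one-hot label plus $\epsilon$ spread uniformly over all classes; toolkits differ in the exact convention, and this section does not say which one was used, so the code is an illustration rather than the authors' implementation.

```python
import math
from typing import Sequence


def label_smoothed_nll(logits: Sequence[float], target: int,
                       epsilon: float = 0.1) -> float:
    """Label-smoothed negative log-likelihood for one token position.

    Smoothed target = (1 - epsilon) * one_hot(target) + epsilon / K,
    where K is the number of classes (vocabulary size).
    """
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]                        # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)      # cross-entropy vs. uniform
    return (1.0 - epsilon) * nll + epsilon * uniform


if __name__ == "__main__":
    toy_logits = [2.0, 0.5, -1.0, 0.0]              # hypothetical 4-word vocabulary
    print(label_smoothed_nll(toy_logits, target=0, epsilon=0.0))
    print(label_smoothed_nll(toy_logits, target=0, epsilon=0.1))
```

With $\epsilon=0$ the function reduces to the ordinary cross-entropy; with $\epsilon=0.1$ a confident correct prediction is penalized slightly, which discourages over-confident output distributions.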

Methods	CNN/DailyMail			WebNLG			SQuAD (QG)			CoQA
Methods	R- $1$	R- $2$	R-L	B- $4$	ME	R-L	B- $4$	ME	R-L	F1	EM
FT BART	44.16	21.28	40.90	64.55	46.51	75.13	22.00	26.40	52.55	68.60	–
FT MVP	44.52	21.62	41.10	67.82	47.47	76.88	26.26	27.35	53.49	86.43	77.78
T0-3B	–	–	–	01.40	10.20	18.43	3.06	12.43	14.91	13.30	06.60
T0-11B	–	–	–	00.26	06.13	14.12	2.63	07.00	15.25	09.18	04.36
MVP	29.50	11.29	25.92	34.42	31.33	52.33	2.90	13.94	15.48	29.40	18.20
MVP+S	25.60	09.51	22.67	39.43	34.32	55.34	2.96	15.23	18.23	52.40	37.30
Methods	ROCStories				PersonaChat				MultiWOZ
Methods	B- $1$	B- $2$	D- $1$	D- $4$	B- $1$	B- $2$	D- $1$	D- $2$	B- $4$	Success	Inform
FT BART	30.70	13.30	–	69.90	49.90	40.00	1.30	8.00	17.89	74.91	84.88
FT MVP	33.79	15.76	3.02	75.65	50.73	40.69	1.65	11.23	20.26	76.40	85.00
T0-3B	08.69	3.02	04.37	35.49	23.20	23.57	2.56	12.06	0.02	2.50	22.10
T0-11B	00.63	0.16	12.41	92.86	32.17	28.35	1.56	07.19	0.00	3.90	22.10
MVP	01.01	0.31	07.18	86.26	35.54	32.71	2.87	16.38	3.08	2.50	22.20
MVP+S	10.52	3.54	02.13	69.55	37.04	33.38	2.66	14.84	0.38	2.50	22.10

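Table 3: The results on seven unseen datasets in zero-shot learning. Given that T0 has been pre-trained on the CNN/DailyMail dataset, we exclude its results to provide a fair comparison (denoted as "–").

4.1 Full Tuning Performance

We conduct experiments on seven new datasets from the seven seen tasks to verify the effectiveness of our two-stage pre-training method. We design several model variants. In the first stage, MVP uses multi-task supervised pre-training, and we compare it with the following models that use different strategies: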

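• BART_large (Lewis et al., 2020): BART is a widely used PLM for natural language generation that uses denoising auto-encoding as its unsupervised pre-training objective.

• Flan-T5_large (Chung et al., 2022): Flan-T5 is a recent language model trained in a supervised manner on various NLP tasks, which makes it a strong competitor to our model.

• Single-task pre-training (Single): We individually train a single model for each task using intra-task datasets under the same pre-training settings as in multi-task training. For instance, we pre-train a summarization model using summarization datasets (e.g., Newsroom, WikiHow, and MSNews). Therefore, we have seven single-task pre-trained models in total.

For the second stage, which integrates single-task pre-trained prompts (denoted as MVP+S), we compare it with two variants using different prompts: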

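• Randomly initialized prompts (MVP+R): The layer-wise prompts for the MVP model are randomly initialized without pre-training.

• Multi-task pre-trained prompts (MVP+M): We pre-train only one group of prompts for all tasks, using the same mixed datasets as in the backbone pre-training.

Besides these variants, we further include the best-reported results from the original papers in the literature for comparison (denoted as SOTA). From the results in Table 2, we can see that:

First, the supervised pre-trained models (i.e., MVP, Flan-T5, and Single) achieve better performance than the unsupervised pre-trained model BART, yielding average improvements of $9.3\%$, $3.13\%$, and $4.4\%$ (in ratio), respectively. This finding verifies the effectiveness of our supervised pre-training method, which enables the model to acquire more task-specific information. Regarding multi-task pre-training (MVP) and single-task pre-training (Single), our MVP model outperforms its single-task counterparts by $5.0\%$. This result indicates that the multi-task learning approach can enhance single-task performance by learning transferable semantic information across tasks. Notably, our MVP model outperforms Flan-T5 by $5.8\%$, which shows the importance of training on our NLG dataset collection, MVPCorpus.

Second, task-specific prompt learning is effective in alleviating the "blurring-out" issue of multi-task learning. For tasks such as data-to-text generation and question answering, MVP with single-task prompts (MVP+S) consistently surpasses the other two variants (MVP+R and MVP+M). This verifies that task-specific prompts can acquire task-specialized knowledge and stimulate the capacity of the MVP model to perform certain tasks.

Finally, our supervised pre-training approach achieves five new SOTA results on data-to-text generation, question generation, question answering, story generation, and open-ended dialogue tasks. We also achieve SOTA performance on six out of eight datasets in Table 11, which shows the strong text generation capability of our MVP model. As for the remaining tasks, the SOTA models incorporate tailored techniques, e.g., a re-ranking framework (Ravaut et al., 2022) and various task-specific objectives (He et al., 2022), which yield better performance. In contrast, our MVP model can produce competitive results with just a general architecture and a unified learning objective.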
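The "in ratio" improvements quoted in this subsection are not defined explicitly here. One reading that reproduces the $9.3\%$ figure for MVP over BART is the mean of the per-metric relative gains over the 20 metrics in Table 2 for which BART has a reported score (CoQA EM and ROCStories D-1 are excluded because BART's entries are "–"). The sketch below follows that reading; treat the procedure as an assumption rather than the authors' documented method.

```python
# Scores copied from Table 2, in column order, skipping the two metrics
# (CoQA EM, ROCStories D-1) for which BART has no reported value.
MVP_SCORES = [44.52, 21.62, 41.10,        # CNN/DailyMail R-1, R-2, R-L
              67.82, 47.47, 76.88,        # WebNLG B-4, ME, R-L
              26.26, 27.35, 53.49,        # SQuAD (QG) B-4, ME, R-L
              86.43,                      # CoQA F1
              33.79, 15.76, 75.65,        # ROCStories B-1, B-2, D-4
              50.73, 40.69, 1.65, 11.23,  # PersonaChat B-1, B-2, D-1, D-2
              20.26, 76.40, 85.00]        # MultiWOZ B-4, Success, Inform
BART_SCORES = [44.16, 21.28, 40.90,
               64.55, 46.51, 75.13,
               22.00, 26.40, 52.55,
               68.60,
               30.70, 13.30, 69.90,
               49.90, 40.00, 1.30, 8.00,
               17.89, 74.91, 84.88]


def mean_relative_gain(new, base):
    """Average of (new - base) / base over paired metrics, in percent."""
    gains = [(n - b) / b for n, b in zip(new, base)]
    return 100.0 * sum(gains) / len(gains)


if __name__ == "__main__":
    gain = mean_relative_gain(MVP_SCORES, BART_SCORES)
    print(f"MVP vs. BART: +{gain:.1f}% on average")  # +9.3%
```

Because each metric contributes its own relative gain, large jumps on small-valued metrics (for example PersonaChat D-2, from 8.00 to 11.23) weigh as much as small changes on large-valued ones, which is worth keeping in mind when interpreting the averaged figure.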

AESOP	Quora
AESOP	B- $4$	R- $1$	R- $2$	R-L	ME
+BART	47.30^a	73.30	54.10	75.10	49.70
+MVP	49.81	74.78	56.84	76.34	53.40

SC & BLEU	GYAFC E&M			GYAFC F&R
SC & BLEU	B- $4$	Accuracy	HM	B- $4$	Accuracy	HM
+BART	76.50^b	93.70	83.90	79.30	92.00	85.20
+MVP	77.18	94.49	84.96	79.43	92.12	85.31

Table 4: The results on unseen NLG tasks. We use AESOP and SC & BLEU to denote the methods proposed by Sun et al. (2021) and Lai et al. (2021), respectively. ^a (Sun et al., 2021) ^b (Lai et al., 2021)

Methods	CoLA	SST-2	MRPC	STS-B	QQP	MNLI	QNLI	RTE	Average
Methods	Matt.	Acc.	F1/Acc.	P/S Corr.	F1/Acc.	m./mm.	Acc.	Acc.	Average
BART	60.30	96.30	90.47 / 86.70	90.97 / 90.30	73.03 / 89.87	90.03 / 89.27	94.60	79.83	85.17
MVP	59.87	96.43	92.07 / 89.43	91.37 / 90.90	73.20 / 90.13	89.70 / 88.73	95.10	82.87	85.88

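Table 5: The results of NLU tasks on the GLUE benchmark.

4.2 Zero-shot Performance

Since we do not pre-train MVP on the seven commonly used datasets, we further conduct zero-shot experiments to examine the domain transfer abilities of our models. We include T0-3B and T0-11B (Sanh et al., 2022) as our baselines, which are large models trained on various downstream tasks. The results are listed in Table 3. We can observe that our small MVP model (406M) outperforms T0-3B and T0-11B by a large margin on all metrics except for a few on ROCStories and MultiWOZ. This demonstrates the effectiveness of supervised pre-training on our MVPCorpus.

However, all tasks show that models in the zero-shot setting perform significantly worse than those in the full tuning setting. This suggests that training strategies that are effective for NLU tasks may not produce satisfactory results for NLG tasks. Even though our model has acquired task knowledge, it struggles to perform well in a new domain without being fine-tuned. Hence, it is still necessary to develop specific NLG models for certain tasks and domains. Our MVP models can serve as effective starting points for further investigation.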

Methods	CNN/DailyMail			WebNLG			SQuAD (QG)			CoQA
Methods	R- $1$	R- $2$	R-L	B- $4$	ME	R-L	B- $4$	ME	R-L	F1	EM
MVP+S	43.03	20.27	39.72	66.73	47.42	76.36	25.28	26.66	52.69	86.44	76.84
BART+R	42.47	19.82	39.15	65.54	46.86	75.24	24.27	26.07	52.03	82.22	71.92
MVP+R	42.84	20.21	39.61	66.12	47.12	75.83	25.05	26.34	52.57	85.51	75.56
MVP+M	42.99	20.36	39.70	66.40	47.16	75.89	25.24	26.49	52.88	85.90	76.34
FT BART	44.16	21.28	40.90	64.55	46.51	75.13	22.00	26.40	52.55	68.60	–
FT MVP	44.52	21.62	41.10	67.82	47.47	76.88	26.26	27.35	53.49	86.43	77.78
Methods	ROCStories				PersonaChat				MultiWOZ
Methods	B- $1$	B- $2$	D- $1$	D- $4$	B- $1$	B- $2$	D- $1$	D- $2$	B- $4$	Success	Inform
MVP+S	32.94	15.12	2.98	71.09	47.11	39.51	1.39	7.28	19.24	71.40	77.80
BART+R	32.14	14.71	2.85	68.94	46.23	38.98	1.30	6.82	17.94	62.20	69.20
MVP+R	32.28	14.85	2.97	70.29	46.70	39.23	1.31	6.98	18.86	64.40	71.40
MVP+M	32.62	15.28	2.95	69.58	46.78	39.40	1.33	7.13	19.13	67.20	72.90
FT BART	30.70	13.30	–	69.90	49.90	40.00	1.30	8.00	17.89	74.91	84.88
FT MVP	33.79	15.76	3.02	75.65	50.73	40.69	1.65	11.23	20.26	76.40	85.00

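Table 6: The results on seven seen tasks under parameter-efficient settings. We also include the results of BART and MVP under the full tuning setting (denoted as FT) for comparison.

4.3 Generality to Unseen Tasks

In this subsection, we test our MVP model on unseen NLG and NLU tasks to verify its generality.

Unseen NLG Tasks.

According to Deng et al. (2021), an NLG task can be assigned to one of the following three categories: compression (e.g., summarization), transduction (e.g., translation), or creation (e.g., story generation). Since we do not include any transduction tasks during pre-training, we evaluate our MVP model using two unseen transduction NLG tasks: paraphrase generation and text style transfer. We select the SOTA methods for these two tasks, i.e., AESOP (Sun et al., 2021) for paraphrase generation and SC & BLEU (Lai et al., 2021) for text style transfer, and replace their backbone BART with our MVP model for comparison. From the results in Table 4, we can see that our model outperforms BART by a ratio of $2.3\%$ and achieves two new SOTA results, which verifies the strong generality of our model. This finding shows that our MVP model is more capable than BART and can serve as a general yet effective backbone.

Unseen NLU Tasks.

Although MVP is designed especially for NLG tasks, we also evaluate its performance on unseen NLU tasks using the widely used GLUE benchmark (Wang et al., 2019). We compare our model to BART_large using its sequence classification method (Lewis et al., 2020). According to the results presented in Table 5, our MVP model outperforms BART on $9$ of $12$ metrics and has a superior overall performance of $0.71\%$. This result indicates the generality of our MVP model and further demonstrates that supervised pre-training not only learns generation ability but also improves overall semantic representations.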
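The two GLUE figures in this paragraph can be checked directly against Table 5. The sketch below counts the metrics on which MVP scores higher than BART and takes the difference of the reported Average column; under this reading the $0.71$ figure is an absolute difference in points between the two averages. The list layout is simply the table flattened left to right.

```python
# Table 5, flattened left to right: CoLA, SST-2, MRPC (F1, Acc),
# STS-B (P, S), QQP (F1, Acc), MNLI (m, mm), QNLI, RTE.
BART_GLUE = [60.30, 96.30, 90.47, 86.70, 90.97, 90.30,
             73.03, 89.87, 90.03, 89.27, 94.60, 79.83]
MVP_GLUE = [59.87, 96.43, 92.07, 89.43, 91.37, 90.90,
            73.20, 90.13, 89.70, 88.73, 95.10, 82.87]

# Values of the "Average" column as reported in Table 5.
BART_AVERAGE, MVP_AVERAGE = 85.17, 85.88


def count_wins(new, base):
    """Number of metrics on which `new` is strictly higher than `base`."""
    return sum(1 for n, b in zip(new, base) if n > b)


if __name__ == "__main__":
    wins = count_wins(MVP_GLUE, BART_GLUE)
    print(f"MVP is better on {wins} of {len(MVP_GLUE)} metrics")  # 9 of 12
    print(f"Average gap: {MVP_AVERAGE - BART_AVERAGE:.2f} points")  # 0.71
```

The three metrics on which BART stays ahead are CoLA and the two MNLI accuracies.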

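4.4 Parameter-Efficient Tuning Performance

In the lightweight fine-tuning setting, we only tune the prompts while freezing the backbone MVP model to verify its effectiveness in resource-constrained situations. Besides our MVP+S model, we compare the following methods: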

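• Prefix-tuning (Li and Liang, 2021): Prefix-tuning is a popular prompt-based lightweight tuning method for text generation. We employ BART as its backbone, denoted as BART+R.

• Only tuning randomly initialized prompts (MVP+R): This variant only tunes the randomly initialized prompts of MVP+R, and it shares a similar idea with prefix-tuning.

• Only tuning multi-task pre-trained prompts (MVP+M): This variant only tunes the multi-task pre-trained prompts of MVP+M. Such an idea has been used in SPoT (Vu et al., 2022).

From the experimental results in Table 6, we can see that the good performance of the MVP model in lightweight settings further demonstrates the effectiveness of supervised pre-training. By comparing the two randomly initialized prompting methods (BART+R and MVP+R), we can see that MVP+R achieves superior performance to BART+R ($+2.0\%$) due to its multi-task supervised backbone. Furthermore, when initialized with pre-trained prompts, MVP+S and MVP+M achieve better results than MVP+R, which is consistent with the findings of SPoT (Vu et al., 2022). Compared to MVP+M, MVP+S performs marginally better by $1.2\%$, indicating that task-specific prompts are useful for improving the model in generation tasks. Surprisingly, our lightweight MVP+S can even outperform fully tuned BART on tasks such as question generation and question answering, showcasing the effectiveness of the proposed supervised pre-training approach.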

Datasets	MVP wins (%)	Ties (%)	BART wins (%)
CNN/DM	46.50	10.67	42.83
WebNLG	32.17	45.67	22.17
ROCStories	46.50	11.33	42.17
PersonaChat	35.33	34.00	30.67

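Table 7: Human evaluation on four tasks with Krippendorff's $\alpha=0.418$, which measures the inter-annotator correlation of the human judges.

4.5 Human Evaluation

Considering that there exists a certain gap between automatic metrics and human judgments (Sai et al., 2022), we further conduct a human evaluation to better demonstrate the generation capabilities of our MVP model. We compare MVP with BART on four tasks, including text summarization, data-to-text generation, open-ended dialogue system, and story generation. Following the practices of van der Lee et al. (2021), we utilize a stratified sample of $100$ inputs of low, medium, and high word frequency for each task. We invite six human judges to evaluate the texts generated by MVP and BART. They are then required to choose which one is better, or to choose a tie, according to fluency, informativeness, consistency, task features, etc. More human evaluation details are listed in Appendix D. Table 7 shows the percentages of "MVP wins", "Ties", and "BART wins" for each dataset. From the results, we can see that MVP generates better texts than BART overall from a human perspective.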

Methods	#NLG (PT)	#NLU (PT)	#NLG (FT)	#NLU (FT)	SP model	SP prompts	Open source
FLAN	3	9	2	9	✓	✗	✗
T0	2	6	0	4	✓	✗	✓
Muppet	1	3	1	3	✓	✗	✓
ExT5	3	8	6	8	✓	✗	✗
SPoT	1	4	0	6	✗	✓	✗
MVP (ours)	7	0	11	3	✓	✓	✓

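Table 8: Comparison of MVP with existing supervised pre-training works. #NLG/#NLU are the numbers of NLG and NLU tasks, respectively. PT, FT, and SP denote pre-training, fine-tuning, and supervised pre-training, respectively.

5 Discussion

Differences with Existing Methods.

To the best of our knowledge, existing supervised pre-training works mainly focus on NLU tasks (Aghajanyan et al., 2021; Aribandi et al., 2022) or a small number of NLG tasks (Lin et al., 2020b; Su et al., 2022). Given the superior performance achieved by supervised pre-training approaches, it is important to explore supervised pre-training for deriving both effective and general NLG models. Our work makes a significant contribution in this direction, achieving SOTA performance with a single model on $13$ of $17$ datasets. Compared with its strong counterpart, ExT5 (Aribandi et al., 2022), our MVP model outperforms it on $26$ out of $27$ metrics (detailed in Appendix C.2). To better understand the differences between our work and previous supervised (multi-task) pre-training studies, we present a detailed comparison in Table 8. As we can see, our work studies supervised pre-training and fine-tuning with the largest number of NLG tasks, incorporates task-specific prompts, and releases all the important resources for reproducing or reusing our work.

Applicability.

To facilitate the application of our work, we have released the collection corpus, pre-trained models, task-specific prompts, and generated texts. Our collected MVPCorpus is the largest NLG task collection to date and can serve as a high-quality resource for recent LLMs (Zhao et al., 2023). We can use all the data to pre-train a general model or select a subset to continue pre-training a domain- or task-specific model (Gururangan et al., 2020). Our MVPCorpus can also be considered as an evaluation benchmark for different NLG tasks. Furthermore, our MVP model can be employed to achieve competitive results in various NLG tasks. Given sufficient labeled data, users can either fine-tune the MVP model or integrate it with task-specific prompts. Notably, our MVP model can be directly employed to obtain good performance in zero-shot learning. In addition, our MVP model can provide effective parameter initialization for improving existing methods, as described in Section 4.3. Finally, the task-specific prompts and the generated texts can be further used to study task similarity and its effect on multi-task pre-training.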

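6 Conclusion

In this paper, we present Multi-task superVised Pre-training (MVP) for natural language generation. First, we collect a large-scale NLG corpus, MVPCorpus, from $77$ datasets over $11$ diverse NLG tasks. After converting the various NLG tasks into a unified text-to-text format, we propose multi-task supervised pre-training to learn an effective and general model, MVP, with task-specific prompts for NLG tasks. Extensive experiments have demonstrated that: (1) supervised pre-training is an effective solution that benefits NLG tasks, as our MVP model outperforms its strong counterparts BART and Flan-T5 and even achieves SOTA performance on $13$ out of $17$ datasets; (2) supervised pre-trained models have strong generality on unseen generation or even understanding tasks.

In future work, we will explore a multilingual version of our MVP model by covering datasets in more languages. Such a model is expected to capture language-independent task characteristics and improve generation tasks in minority languages. Besides, it is interesting to study how different tasks relate to each other in the unified semantic space, which can inspire methods that incorporate task relations as a prior.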

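Acknowledgements

This work was partially supported by the National Natural Science Foundation of China under Grant No. 62222215, the Beijing Natural Science Foundation under Grant No. 4222027, and the Beijing Outstanding Young Scientist Program under Grant No. BJJWZYJH012019100020098. Xin Zhao is the corresponding author.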

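Limitations

Despite our efforts to collect as many generation tasks and datasets as possible, we only evaluate the generation quality and generality of our models on a small number of tasks and datasets. The interpretability and robustness of our models require further analysis. Besides, there exists subjectivity in collecting downstream tasks and intra-task datasets, although we attempt to employ widely recognized categorizations from the literature. Due to limited computing power, we do not study the performance of our method at different model scales. The effectiveness of multi-task pre-training from scratch, similar to ExT5 (Aribandi et al., 2022), also merits an in-depth study.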

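Broader Impacts

In this paper, we pre-trained a language model, MVP, using labeled NLG datasets. According to prior studies (Bender et al., 2021; Bommasani et al., 2021), PLMs tend to "remember" what they have "seen" in the pre-training corpus. This could result in the reproduction of undesirable biases from the pre-training data on downstream tasks. Training data intervention could be a solution to alleviate this issue (Lu et al., 2020). It is also interesting to investigate whether supervised pre-training produces fewer biases than unsupervised pre-training.

Environmental impact is another factor we should consider. We attempt a more efficient pre-training strategy and release our PLM for future work. In contrast to large PLMs with tens of billions of parameters, such as T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), we pre-train only a small model with hundreds of millions of parameters. In addition, we utilize supervised pre-training data and initialize our model with pre-trained BART, both of which improve the convergence of our model. Ultimately, our model is pre-trained for about $20,000$ steps, whereas a BART of the same size is pre-trained for $500,000$ steps.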

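Reproducibility

For reproducing and reusing our work, we have released the collection MVPCorpus, the models (e.g., MVP, task-specific prompts, and multi-task variants), the intermediate results (e.g., the generated texts), and the source code for pre-training and fine-tuning at https://github.com/RUCAIBox/MVP. The detailed settings of the experiments are listed in Appendix B. We hope that these open-source resources will facilitate future work on supervised pre-training and contribute to the advancement of NLG research.

References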

Agarwal et al. (2021) Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
Aghajanyan et al. (2021) Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. 2021. Muppet: Massive multi-task representations with pre-finetuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5799–5811, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Alamri et al. (2018) Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K Marks, et al. 2018. Audio visual scene-aware dialog (avsd) challenge at dstc7. arXiv preprint arXiv:1806.00525.
Alva-Manchego et al. (2020) Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679, Online. Association for Computational Linguistics.
Aribandi et al. (2022) Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. 2022. ExT5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations.
Bao et al. (2021) Hangbo Bao, Li Dong, Wenhui Wang, Nan Yang, and Furu Wei. 2021. s2s-ft: Fine-tuning pretrained transformer encoders for sequence-to-sequence learning. arXiv preprint arXiv:2110.13640.
Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA. Association for Computing Machinery.
Bentivogli et al. (2009) Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proc. Text Analysis Conference (TAC’09).
Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.
Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
Bujnowski et al. (2020) Pawel Bujnowski, Kseniia Ryzhova, Hyungtak Choi, Katarzyna Witkowska, Jaroslaw Piersa, Tymoteusz Krumholc, and Katarzyna Beksa. 2020. An empirical study on multi-task learning for text style transfer and paraphrase generation. In Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, pages 50–63, Online. International Committee on Computational Linguistics.
Byrne et al. (2019) Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Ben Goodrich, Daniel Duckworth, Semih Yavuz, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4516–4525, Hong Kong, China. Association for Computational Linguistics.
Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
Chen et al. (2021) Mingda Chen, Sam Wiseman, and Kevin Gimpel. 2021. WikiTableT: A large-scale data-to-text dataset for generating Wikipedia article sections. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 193–209, Online. Association for Computational Linguistics.
Chen et al. (2022) Wei Chen, Yeyun Gong, Song Wang, Bolun Yao, Weizhen Qi, Zhongyu Wei, Xiaowu Hu, Bartuer Zhou, Yi Mao, Weizhu Chen, Biao Cheng, and Nan Duan. 2022. DialogVED: A pre-trained latent variable encoder-decoder model for dialog response generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4852–4864, Dublin, Ireland. Association for Computational Linguistics.
Chen et al. (2020a) Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and William Yang Wang. 2020a. Logical natural language generation from open-domain tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942, Online. Association for Computational Linguistics.
Chen et al. (2020b) Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. 2020b. KGPT: Knowledge-grounded pre-training for data-to-text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8635–8648, Online. Association for Computational Linguistics.
Cheng et al. (2020) Liying Cheng, Dekun Wu, Lidong Bing, Yan Zhang, Zhanming Jie, Wei Lu, and Luo Si. 2020. ENT-DESC: Entity description generation by exploring knowledge graph. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1187–1197, Online. Association for Computational Linguistics.
Choi et al. (2018) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, pages 160–167. ACM.
Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pages 248–255, Los Alamitos, CA, USA. IEEE Computer Society.
Deng et al. (2021) Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing, and Zhiting Hu. 2021. Compression, transduction, and creation: A unified framework for evaluating natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7580–7605, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.
Dodge et al. (2016) Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander H. Miller, Arthur Szlam, and Jason Weston. 2016. Evaluating prerequisite qualities for learning end-to-end dialog systems. In 4th International Conference on Learning Representations, ICLR 2016.
Dolan and Brockett (2005) William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
El Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 207–219, Saarbrücken, Germany. Association for Computational Linguistics.
Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrücken, Germany. Association for Computational Linguistics.
Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
Feng et al. (2022) Yutong Feng, Jianwen Jiang, Mingqian Tang, Rong Jin, and Yue Gao. 2022. Rethinking supervised pre-training for better downstream transferring. In International Conference on Learning Representations.
Garbacea and Mei (2020) Cristina Garbacea and Qiaozhu Mei. 2020. Neural language generation: Formulation, methods, and evaluation. arXiv preprint arXiv:2007.15780.
Gardent et al. (2017) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
Gehrmann et al. (2021) Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 96–120, Online. Association for Computational Linguistics.
Giampiccolo et al. (2007) Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague. Association for Computational Linguistics.
Gliwa et al. (2019) Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China. Association for Computational Linguistics.
Gopalakrishnan et al. (2019) Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, pages 1891–1895. ISCA.
Graff et al. (2003) David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.
Grusky et al. (2018) Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.
Gu et al. (2021) Jing Gu, Mostafa Mirshekari, Zhou Yu, and Aaron Sisto. 2021. ChainCQG: Flow-aware conversational question generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2061–2070, Online. Association for Computational Linguistics.
Gu et al. (2022) Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2022. PPT: Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8410–8423, Dublin, Ireland. Association for Computational Linguistics.
Guan et al. (2021) Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. 2021. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6379–6393, Online. Association for Computational Linguistics.
Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
Haim et al. (2006) R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 7.
He and Choi (2021) Han He and Jinho D. Choi. 2021. The stem cell hypothesis: Dilemma behind multi-task learning with transformer encoders. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5555–5577, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Los Alamitos, CA, USA. IEEE Computer Society.
He et al. (2022) Wanwei He, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, Min Yang, Fei Huang, Luo Si, Jian Sun, and Yongbin Li. 2022. GALAXY: A generative pre-trained model for task-oriented dialog with semi-supervised learning and explicit policy injection. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):10749–10757.
Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Hua and Wang (2020) Xinyu Hua and Lu Wang. 2020. PAIR: Planning and iterative refinement in pre-trained transformers for long text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 781–793, Online. Association for Computational Linguistics.
Jiang et al. (2020) Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, and Wei Xu. 2020. Neural CRF model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7943–7960, Online. Association for Computational Linguistics.
Jin et al. (2020) Zhijing Jin, Qipeng Guo, Xipeng Qiu, and Zheng Zhang. 2020. GenWiki: A dataset of 1.3 million content-sharing text and graphs for unsupervised graph-to-text generation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2398–2409, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
Ke et al. (2021) Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, and Minlie Huang. 2021. JointGT: Graph-text joint representation learning for text generation from knowledge graphs. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2526–2538, Online. Association for Computational Linguistics.
Khashabi et al. (2020) Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.
Kočiský et al. (2018) Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
Koncel-Kedziorski et al. (2019) Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text generation from knowledge graphs with graph transformers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2284–2293, Minneapolis, Minnesota. Association for Computational Linguistics.
Koupaee and Wang (2018) Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305.
Kumar et al. (2020) Ashutosh Kumar, Kabir Ahuja, Raghuram Vadapalli, and Partha Talukdar. 2020. Syntax-guided controlled generation of paraphrases. Transactions of the Association for Computational Linguistics, 8:329–345.
Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
Ladhak et al. (2020) Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4034–4048, Online. Association for Computational Linguistics.
Lai et al. (2021) Huiyuan Lai, Antonio Toral, and Malvina Nissim. 2021. Thank you BART! Rewarding pre-trained models improves formality style transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 484–494, Online. Association for Computational Linguistics.
Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, Texas. Association for Computational Linguistics.
Lee et al. (2019) Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. 2019. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges, volume 8.
Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Li et al. (2022a) Junyi Li, Tianyi Tang, Jian-Yun Nie, Ji-Rong Wen, and Xin Zhao. 2022a. Learning to transfer prompts for text generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3506–3518, Seattle, United States. Association for Computational Linguistics.
Li et al. (2022b) Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2022b. A survey of pretrained language models based text generation. arXiv preprint arXiv:2201.05273.
Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
Li et al. (2018) Xiujun Li, Yu Wang, Siqi Sun, Sarah Panda, Jingjing Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. arXiv preprint arXiv:1807.11125.
Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Liang et al. (2009) Percy Liang, Michael Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 91–99, Suntec, Singapore. Association for Computational Linguistics.
Lin et al. (2020a) Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2020a. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online. Association for Computational Linguistics.
Lin et al. (2020b) Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020b. Pre-training multilingual neural machine translation by leveraging alignment information. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2649–2663, Online. Association for Computational Linguistics.
Lin et al. (2020c) Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, and Pascale Fung. 2020c. MinTL: Minimalist transfer learning for task-oriented dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3391–3405, Online. Association for Computational Linguistics.
Lison et al. (2018) Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Liu et al. (2021a) Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, and Nan Duan. 2021a. GLGE: A new general language generation evaluation benchmark. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 408–420, Online. Association for Computational Linguistics.
Liu et al. (2021b) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021b. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
Liu et al. (2022) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, Dublin, Ireland. Association for Computational Linguistics.
Liu et al. (2019a) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.
Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Lu et al. (2020) Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2020. Gender Bias in Neural Natural Language Processing, pages 189–202. Springer International Publishing, Cham.
Markriedl (n.d.) Markriedl. WikiPlots. https://github.com/markriedl/WikiPlots. Accessed: 2022-12-18.
McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
Moon et al. (2019) Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.
Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.
Mrkšić et al. (2017) Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Vancouver, Canada. Association for Computational Linguistics.
Nan et al. (2021) Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. 2021. DART: Open-domain structured data record to text generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 432–447, Online. Association for Computational Linguistics.
Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.
Nguyen et al. (2021) Thong Nguyen, Anh Tuan Luu, Truc Lu, and Tho Quan. 2021. Enriching and controlling global semantics for text summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9443–9456, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org.
Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics.
Qin and Eisner (2021) Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Rao and Tetreault (2018) Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.
Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
Rastogi et al. (2020a) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020a. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.
Rastogi et al. (2020b) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020b. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696.
Ravaut et al. (2022) Mathieu Ravaut, Shafiq Joty, and Nancy Chen. 2022. SummaReranker: A multi-task mixture-of-experts re-ranking framework for abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4504–4524, Dublin, Ireland. Association for Computational Linguistics.
Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Rodriguez et al. (2020) Pedro Rodriguez, Paul Crook, Seungwhan Moon, and Zhiguang Wang. 2020. Information seeking in the spirit of learning: A dataset for conversational curiosity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8153–8172, Online. Association for Computational Linguistics.
Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
Sai et al. (2022) Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. 2022. A survey of evaluation metrics used for NLG systems. ACM Comput. Surv., 55(2).
Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Sap et al. (2020) Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James Pennebaker. 2020. Recollection versus imagination: Exploring human memory and cognition via neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1970–1978, Online. Association for Computational Linguistics.
See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
Stratos (2019) Karl Stratos. 2019. Mutual information maximization for simple and accurate part-of-speech induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1095–1104, Minneapolis, Minnesota. Association for Computational Linguistics.
Su et al. (2022) Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai, and Yi Zhang. 2022. Multi-task pre-training for plug-and-play task-oriented dialogue system. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4661–4676, Dublin, Ireland. Association for Computational Linguistics.
Su et al. (2021) Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. 2021. Plan-then-generate: Controlled data-to-text generation via planning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 895–909, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Subramanian et al. (2018) Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations.
Sun et al. (2021) Jiao Sun, Xuezhe Ma, and Nanyun Peng. 2021. AESOP: Paraphrase generation with adaptive syntactic control. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5176–5189, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Sun et al. (2019) Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.
Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, Los Alamitos, CA, USA. IEEE Computer Society.
Tang et al. (2022a) Tianyi Tang, Junyi Li, Zhipeng Chen, Yiwen Hu, Zhuohao Yu, Wenxun Dai, Wayne Xin Zhao, Jian-yun Nie, and Ji-rong Wen. 2022a. TextBox 2.0: A text generation library with pre-trained language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 435–444, Abu Dhabi, UAE. Association for Computational Linguistics.
Tang et al. (2022b) Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. 2022b. Context-tuning: Learning contextualized prompts for natural language generation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6340–6354, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Tang et al. (2022c) Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, and Dragomir Radev. 2022c. CONFIT: Toward faithful dialogue summarization with linguistically-informed contrastive fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5657–5668, Seattle, United States. Association for Computational Linguistics.
Trischler et al. (2017) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Computational Linguistics.
van der Lee et al. (2021) Chris van der Lee, Albert Gatt, Emiel van Miltenburg, and Emiel Krahmer. 2021. Human evaluation of automatically generated text: Current trends and best practice guidelines. Computer Speech and Language, 67:101151.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, Los Alamitos, CA, USA. IEEE Computer Society.
Vu et al. (2022) Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. 2022. SPoT: Better frozen model adaptation through soft prompt transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5039–5059, Dublin, Ireland. Association for Computational Linguistics.
Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
Welivita et al. (2021) Anuradha Welivita, Yubo Xie, and Pearl Pu. 2021. A large-scale dataset for empathetic response generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1251–1264, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. Association for Computational Linguistics.
Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
Xiao et al. (2020) Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3997–4003. International Joint Conferences on Artificial Intelligence Organization. Main track.
Xie et al. (2022) Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I Wang, et al. 2022. UnifiedSKG: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. arXiv preprint arXiv:2201.05966.
Xu et al. (2022) Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. 2022. ZeroPrompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. arXiv preprint arXiv:2201.06910.
Xu et al. (2021) Peng Xu, Davis Liang, Zhiheng Huang, and Bing Xiang. 2021. Attention-guided generative models for extractive question answering. arXiv preprint arXiv:2110.06393.
Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
Zhang et al. (2021) Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. 2021. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Online. Association for Computational Linguistics.
Zhang et al. (2020) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278, Online. Association for Computational Linguistics.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
Zhou et al. (2018) Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.
Zhu et al. (2021) Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. 2021. MediaSum: A large-scale media interview dataset for dialogue summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5927–5934, Online. Association for Computational Linguistics.

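Appendix A Tasks and Datasets

A.1 Description of Tasks and Datasets

We provide the details of the tasks and datasets used in our paper for pre-training and fine-tuning in Tables 9 and 10. If a pre-training dataset does not have a validation set, we hold out 10% of its training set for validation.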
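The 10% hold-out described above can be sketched as follows. This is a minimal illustration only: the function name `split_train_valid`, the shuffling step, and the fixed seed are assumptions, since the paper does not state how the held-out examples are selected.

```python
import random
from typing import List, Sequence, Tuple


def split_train_valid(
    examples: Sequence, valid_fraction: float = 0.1, seed: int = 0
) -> Tuple[List, List]:
    """Hold out a fraction of the training examples for validation.

    Shuffling with a fixed seed is an assumption made for reproducibility;
    the paper only states that 10% of the training set is used.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_fraction)
    # The first n_valid shuffled items become the validation set.
    return items[n_valid:], items[:n_valid]
```

For a dataset of 100 examples, this yields 90 training and 10 validation examples with no overlap between the two.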

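We list the licenses of all datasets where available. All datasets are publicly available. Most of them can be downloaded directly from GitHub or Google Drive. ROCStories (Mostafazadeh et al., 2016) and CommonGen (Lin et al., 2020a) can be obtained after filling out a form. GYAFC (Rao and Tetreault, 2018) is accessible upon request to Yahoo and the authors of the dataset.

The tasks and datasets we use in this paper are as follows: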

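• Data-to-text generation aims to generate descriptive text about structured data, such as knowledge graphs and tables. We use the following datasets for pre-training:
  1. AGENDA (Koncel-Kedziorski et al., 2019);
  2. ENT-DESC (Cheng et al., 2020);
  3. GenWiki (Jin et al., 2020);
  4. LogicNLG (Chen et al., 2020a);
  5. TEKGEN (Agarwal et al., 2021);
  6. WEATHERGOV (Liang et al., 2009);
  7. WikiTableT (Chen et al., 2021).
  We utilize the following datasets for fine-tuning evaluation:
  1. WebNLG (Gardent et al., 2017), for which we use version 2.1;
  2. WikiBio (Lebret et al., 2016).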
• Open-ended dialogue systems, also known as chatbots, focus on daily communication. We use the following datasets for pre-training:
  1. Cleaned OpenSubtitles Dialogs (Cleaned OS Dialogs) (Welivita et al., 2021), which is a cleaned variant of OpenSubtitles Dialogs (Lison et al., 2018);
  2. CMU Document Grounded Conversations (CMUDoG) (Zhou et al., 2018);
  3. Curiosity (Rodriguez et al., 2020);
  4. DREAM (Sun et al., 2019);
  5. Empathetic Dialogues (Rashkin et al., 2019);
  6. Movie Dialog (Dodge et al., 2016);
  7. MuTual (Stratos, 2019);
  8. OpenDialKG (Moon et al., 2019);
  9. Topical-Chat (Gopalakrishnan et al., 2019);
  10. Wizard of Wikipedia (Dinan et al., 2019).
  We utilize the following datasets for fine-tuning evaluation:
  1. DailyDialog (Li et al., 2017);
  2. DSTC7-AVSD (Alamri et al., 2018);
  3. PersonaChat (Zhang et al., 2018).
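• Paraphrase generation rewrites a sentence with the same semantic meaning but a different syntactic or lexical form. We utilize the following dataset for fine-tuning evaluation:
  1. Quora (also known as QQP-Pos) (Kumar et al., 2020), which is a subset of Quora Question Pairs (https://www.kaggle.com/c/quora-question-pairs).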
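• Question answering requires the model to answer a question based on optional background information. Note that we conduct this task in a generative way in our paper. We use the following datasets for pre-training:
  1. HotpotQA (Yang et al., 2018);
  2. MS MARCO (Nguyen et al., 2016);
  3. MSQG (Liu et al., 2021a); since it is designed for QG, we reverse the question and answer to enrich QA examples;
  4. NarrativeQA (Kočiský et al., 2018);
  5. Natural Questions (Kwiatkowski et al., 2019);
  6. NewsQA (Trischler et al., 2017);
  7. QuAC (Choi et al., 2018);
  8. TriviaQA (Joshi et al., 2017);
  9. WebQuestions (Berant et al., 2013).
  We utilize the following datasets for fine-tuning evaluation:
  1. CoQA (Reddy et al., 2019);
  2. SQuAD (Rajpurkar et al., 2016), for which we use version 1.1.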
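• Question generation produces a coherent question given a passage and its corresponding answer. We use the following datasets for pre-training:
  1. HotpotQA (Yang et al., 2018);
  2. MS MARCO (Nguyen et al., 2016);
  3. MSQG (Liu et al., 2021a);
  4. NarrativeQA (Kočiský et al., 2018);
  5. NewsQA (Trischler et al., 2017);
  6. QuAC (Choi et al., 2018).
  Most of these are QA datasets, and we invert the question and answer to enrich QG examples.
  We utilize the following datasets for fine-tuning evaluation:
  1. CoQA (Reddy et al., 2019);
  2. SQuAD (Rajpurkar et al., 2016), for which we use version 1.1.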
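• Story generation creates a long and informative text from a short title. We use the following datasets for pre-training:
  1. ChangeMyView (Hua and Wang, 2020);
  2. English Gigaword (Rush et al., 2015);
  3. Hippocorpus (Sap et al., 2020);
  4. WikiPlots (Markriedl);
  5. WritingPrompts (Fan et al., 2018), for which we split the original training set for pre-training and the corresponding validation.
  Considering that English Gigaword is a large summarization dataset, we use the summary as the title and generate the passage from it, in order to enrich the examples of story generation.
  We utilize the following datasets for fine-tuning evaluation:
  1. ROCStories (Mostafazadeh et al., 2016);
  2. WritingPrompts (Fan et al., 2018), for which we use the sets created by Guan et al. (2021) (who split the original validation and test sets into training, validation, and test sets) so that our model can be compared fairly.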
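• Task-oriented dialogue systems meet the real-life needs of users, such as restaurant reservations and flight bookings. Following Su et al. (2022), we use the following datasets for pre-training:
  1. CamRest676 (Wen et al., 2017);
  2. Frames (El Asri et al., 2017);
  3. KVRET (Eric et al., 2017);
  4. MetaLWOZ (Lee et al., 2019);
  5. MSR-E2E (Li et al., 2018);
  6. MultiWOZ (Budzianowski et al., 2018);
  7. Schema-Guided (Rastogi et al., 2020a);
  8. TaskMaster (Byrne et al., 2019);
  9. WOZ (Mrkšić et al., 2017).
  We utilize the following dataset for fine-tuning evaluation:
  1. MultiWOZ (Budzianowski et al., 2018), for which we use version 2.0.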
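• Text style transfer modifies the style (e.g., sentiment and formality) of a given text while retaining its style-independent content. We utilize the following dataset for fine-tuning evaluation:
  1. GYAFC (Rao and Tetreault, 2018), which has two sub-domains: "Entertainment and Music" (E&M) and "Family and Relationships" (F&R).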
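• Text summarization condenses a long document into a brief text while retaining the essential details. We use the following datasets for pre-training:
  1. English Gigaword (Graff et al., 2003), for which we use the variant provided by Rush et al. (2015);
  2. MediaSum (Zhu et al., 2021);
  3. MSNews (Liu et al., 2021a);
  4. Newsroom (Grusky et al., 2018);
  5. WikiHow (Koupaee and Wang, 2018).
  We utilize the following datasets for fine-tuning evaluation:
  1. CNN/DailyMail (Hermann et al., 2015), for which we use the variant provided by See et al. (2017);
  2. SAMSum (Gliwa et al., 2019);
  3. XSum (Narayan et al., 2018).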

To better compare with ExT5 (Aribandi et al., 2022), we utilize the language generation benchmark GEM (Gehrmann et al., 2021) for fine-tuning evaluation. GEM includes five tasks:

• Commonsense generation:
  1. CommonGen (CG) (Lin et al., 2020a).
• Data-to-text generation:
  1. DART (Nan et al., 2021);
  2. E2E NLG cleaned (Novikova et al., 2017);
  3. ToTTo (Su et al., 2021);
  4. WebNLG (Gardent et al., 2017).
• Dialogue system:
  1. Schema-Guided Dialog (SGD) (Rastogi et al., 2020b).
• Text simplification:
  1. WikiAuto + Turk/ASSET (WiA-T/A) (Jiang et al., 2020; Xu et al., 2016; Alva-Manchego et al., 2020).
• Text summarization:
  1. Wiki-Lingua (WLE) (Ladhak et al., 2020).

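To test the generalization ability of our model, we also utilize the standard natural language understanding benchmark GLUE (Wang et al., 2019), which is composed of three tasks:

• Natural language inference:
  1. MNLI (Williams et al., 2018);
  2. QNLI (Rajpurkar et al., 2016; Wang et al., 2019);
  3. RTE (Dagan et al., 2006; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009).
• Paraphrase detection:
  1. MRPC (Dolan and Brockett, 2005);
  2. QQP (https://www.kaggle.com/c/quora-question-pairs);
  3. STS-B (Cer et al., 2017).
• Text classification:
  1. CoLA (Warstadt et al., 2019);
  2. SST-2 (Socher et al., 2013).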

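A.2 Data Leakage

Since our model is pre-trained on a large number of labeled datasets, it may have "seen" examples from the fine-tuning test sets during pre-training, which would make the comparison with other methods unfair. Hence, we eliminate the pre-training examples that share an $n$-gram overlap with any of the test datasets. Following Brown et al. (2020), $n$ is the 5th-percentile example length in words, and the maximum value of $n$ is set to 13. In total, we removed 17,848 examples from the pre-training datasets. The number of "cleaned" examples for each dataset can be found in Table 9.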
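The filtering rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' released pipeline: the function names (`ngrams`, `choose_n`, `remove_overlapping`), whitespace tokenization, and the nearest-rank percentile convention are all choices made here for the example, and the test set is assumed to be non-empty.

```python
import math
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def choose_n(test_texts: List[str], max_n: int = 13) -> int:
    """Pick n as the 5th-percentile test example length in words, capped at max_n."""
    lengths = sorted(len(text.split()) for text in test_texts)
    # Nearest-rank percentile; the exact percentile convention is an assumption.
    rank = max(1, math.ceil(0.05 * len(lengths)))
    return max(1, min(max_n, lengths[rank - 1]))


def remove_overlapping(
    pretrain_texts: List[str], test_texts: List[str], max_n: int = 13
) -> Tuple[List[str], int]:
    """Drop every pre-training text that shares an n-gram with any test text."""
    n = choose_n(test_texts, max_n)
    test_grams: Set[Tuple[str, ...]] = set()
    for text in test_texts:
        test_grams |= ngrams(text, n)
    # Texts shorter than n words have no n-grams and are therefore always kept.
    kept = [t for t in pretrain_texts if ngrams(t, n).isdisjoint(test_grams)]
    return kept, n
```

Storing the test n-grams in a single set keeps each membership check constant-time on average, which matters when the pre-training corpus holds millions of examples.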

Dataset	#Train	Cleaned #Train	#Valid	#Test	Input	Output	License
AGENDA	38,720	38,720	1,000	1,000	52.1	141.2	N/A
ENT-DESC	88,652	88,652	11,081	11,081	279.9	31.0	N/A
GenWiki	681,436	681,436	75,716	1,000	21.4	29.5	MIT
LogicNLG	28,450	28,450	4,260	4,305	178.4	14.2	MIT
TEKGEN	6,310,061	6,307,995	788,746	796,982	17.0	21.2	CC BY-SA 2.0
WEATHERGOV	25,000	25,000	1,000	3,528	148.7	30.6	N/A
WikiTableT	1,453,794	1,452,778	4,533	4,351	81.0	99.7	MIT
Cleaned OS Dialogs	13,355,487	13,355,368	1,483,944	-	75.5	16.7	N/A
CMUDoG	82,818	82,818	5,555	14,510	433.0	12.2	N/A
Curiosity	64,930	64,551	8,539	8,495	144.4	20.2	CC BY-NC 4.0
DREAM	14,264	14,242	4,709	4,766	75.6	13.6	N/A
Empathetic Dialogues	64,636	64,636	9,308	8,426	52.7	12.9	CC BY-NC 4.0
Movie Dialog	762,751	762,711	8,216	8,066	126.9	44.0	N/A
MuTual	33,691	33,691	4,090	3,248	53.6	14.5	N/A
OpenDialKG	69,680	69,680	7,743	-	54.2	12.4	CC BY-NC 4.0
Topical-Chat	179,750	179,750	22,295	22,452	223.3	20.0	CDLA-Sharing-1.0
Wizard of Wikipedia	148,357	147,702	15,767	15,564	297.0	16.7	MIT
HotpotQA	90,447	87,815	7,405	-	187.9	2.2	CC BY-SA 4.0
MS MARCO	681,445	681,226	77,580	-	68.7	13.3	N/A
MSQG	198,058	198,029	11,008	-	48.1	3.7	CC BY-SA 4.0
NarrativeQA	65,494	65,494	6,922	21,114	584.1	4.2	Apache 2.0
Natural Questions	96,676	96,676	10,693	6,490	9.0	2.1	CC BY-SA 3.0
NewsQA	97,850	97,700	5,486	5,396	726.8	5.0	MIT
QuAC	83,568	83,485	31,906	-	487.9	12.5	CC BY-SA 4.0
TriviaQA	78,785	78,785	8,837	11,313	14.0	2.0	Apache 2.0
WebQuestions	8,933	8,933	4,863	4,863	6.7	2.4	CC BY 4.0
HotpotQA	90,440	87,808	6,972	-	79.6	19.8	CC BY-SA 4.0
MS MARCO	681,445	681,226	77,580	-	75.9	6.0	N/A
MSQG	198,058	198,029	11,008	11,022	45.9	6.0	CC BY-SA 4.0
NarrativeQA	65,494	65,494	6,922	21,114	579.7	8.6	Apache 2.0
NewsQA	97,850	97,700	5,486	5,396	724.2	7.6	MIT
QuAC	69,109	69,026	26,301	-	496.7	6.5	CC BY-SA 4.0
ChangeMyView	42,462	42,459	6,480	7,562	17.9	104.1	MIT
English Gigaword	3,803,957	3,802,620	189,651	1,951	8.8	33.3	MIT
Hippocorpus	6,168	6,168	686	-	34.1	262.6	CDLA-Permissive 2.0
WikiPlots	101,642	101,641	11,294	-	3.4	338.5	N/A
WritingPrompts	272,600	272,518	15,620	15,138	28.4	630.8	MIT
CamRest676	4,872	4,872	616	-	55.3	9.4	N/A
Frames	26,631	26,631	2,106	-	116.1	13.0	MIT
KVRET	14,136	14,136	1,616	-	30.5	9.3	N/A
MetaLWOZ	176,073	176,073	17,912	-	45.6	8.0	N/A
MSR-E2E	103,362	103,362	5,235	-	51.3	12.8	Microsoft
Schema-Guided	494,946	494,933	73,089	-	120.8	12.5	CC BY-SA 4.0
TaskMaster	249,664	249,662	20,680	-	95.6	12.0	CC BY 4.0
WOZ	6,364	6,359	1,260	-	47.0	10.6	N/A
English Gigaword	3,803,957	3,802,620	189,651	1,951	33.3	8.8	MIT
MediaSum	443,596	442,021	10,000	10,000	1641.0	14.4	N/A
MSNews	136,082	135,937	7,496	7,562	309.9	9.8	CC BY-SA 4.0
Newsroom	995,041	989,351	108,837	108,862	642.4	26.7	N/A
WikiHow	157,252	157,247	5,599	5,577	502.6	45.6	CC BY-NC-SA

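Table 9: The statistics and licenses of the datasets used for pre-training our MVP model. #Train, #Valid, and #Test denote the number of examples in the training, validation, and test sets, respectively. Cleaned #Train is the number of training examples after filtering. Input and Output are the average numbers of words (split by space) in the input and output sequences, respectively.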
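As a small illustration of how the Input and Output columns are defined, the sketch below computes the average whitespace-split word counts for a list of (input, output) pairs. The function name `average_lengths` and the rounding to one decimal place (matching the table's display) are assumptions made for this example.

```python
from typing import Iterable, Tuple


def average_lengths(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Average number of whitespace-separated words in inputs and outputs."""
    pairs = list(pairs)
    count = len(pairs)
    avg_input = sum(len(source.split()) for source, _ in pairs) / count
    avg_output = sum(len(target.split()) for _, target in pairs) / count
    # One decimal place, as displayed in the statistics tables.
    return round(avg_input, 1), round(avg_output, 1)
```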

Task	Dataset	#Train	#Valid	#Test	Input	Output	License
Commonsense generation	CommonGen	67,389	993	–	5.5	11.6	MIT
Data-to-text generation	DART	62,659	2,768	–	27.5	21.5	MIT
	E2E	33,525	4,299	–	9.5	20.6	CC BY-SA 4.0
	ToTTo	120,761	7,700	–	37.8	18.0	CC BY-SA 3.0
	WebNLG	34,338	4,313	4,222	18.0	19.9	CC BY-NC-SA 4.0
	WebNLG (GEM)	35,426	1,667	–	17.7	22.7	CC BY-NC-SA 4.0
	WikiBio	582,659	72,831	72,831	81.6	26.1	CC BY-SA 3.0
Open-ended dialogue	DailyDialog	76,052	7,069	6,740	72.5	13.9	CC BY-NC-SA 4.0
	DSTC7-AVSD	76,590	17,870	1,710	148.2	11.5	MIT
	PersonaChat	122,499	14,602	14,056	132.1	11.9	MIT
	SGD	164,982	10,000	–	134.7	11.3	CC BY-SA 4.0
Natural language inference	MNLI-m	392,702	9,815	9,796	29.8	–	Mixed
	MNLI-mm	392,702	9,832	9,847	29.8	–	Mixed
	QNLI	104,743	5,463	5,463	36.6	–	CC BY-SA 4.0
	RTE	2,490	277	3,000	51.0	–	N/A
Paraphrase generation	Quora	137,185	3,000	3,000	10.9	10.8	N/A
Paraphrase detection	MRPC	3,668	408	1,725	43.8	–	N/A
	QQP	363,846	40,430	390,965	22.3	–	N/A
	STS-B	5,749	1,500	1,379	20.3	–	N/A
Question answering	CoQA	107,286	31,621	–	349.4	2.6	Mixed
Question answering	SQuAD	75,722	10,570	11,877	156.2	3.6	CC BY-SA 4.0
Question generation	CoQA	107,286	31,621	–	346.6	5.5	Mixed
Question generation	SQuAD	75,722	10,570	11,877	148.3	11.6	CC BY-SA 4.0
Story generation	ROCStories	176,688	9,816	4,909	9.0	40.7	N/A
Story generation	WritingPrompts	53,516	4,000	2,000	25.5	150.4	MIT
Task-oriented dialogue	MultiWOZ	170,220	22,074	22,116	128.3	11.3	MIT
Text classification	CoLA	8,551	1,043	1,063	7.7	–	N/A
Text classification	SST-2	67,349	872	1,821	9.8	–	N/A
Text simplification	WiA-A	483,801	20,000	359	26.2	21.5	Mixed
Text simplification	WiA-T	483,801	20,000	359	26.2	21.5	Mixed
Text style transfer	GYAFC-E&M	52,595	11,508	1,416	9.9	10.6	N/A
Text style transfer	GYAFC-F&R	51,967	11,152	1,332	10.7	11.3	N/A
Text summarization	CNN/DailyMail	287,227	13,368	11,490	679.8	48.3	MIT
	SAMSum	14,732	818	819	103.4	20.3	CC BY-NC-ND 4.0
	WLE	99,020	28,614	–	367.6	33.4	CC0 1.0
	XSum	204,045	11,332	11,334	373.7	21.1	MIT

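Table 10: The statistics and licenses of the datasets used for evaluating our MVP model. The license of the MNLI dataset is composed of OANC, CC BY-SA 3.0, and CC BY 3.0. The license of the CoQA dataset is composed of CC BY-SA 4.0, MSR-LA, and Apache 2.0. The license of the WiA-A/T datasets is composed of CC BY-NC 3.0, CC BY-NC 4.0, and GNU General Public License v3.0.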

Methods	XSum			SAMSum			CoQA QG
Methods	R-1	R-2	R-L	R-1	R-2	R-L	B-4	ME	R-L
BART	45.14^d	22.27	37.25	51.74^b	26.46	48.72	12.34^c	35.78	46.88
MVP	45.60	22.47	37.42	53.78	29.12	49.37	23.48	47.79	55.09
MVP+S	45.67	22.63	37.50	53.81	29.75	49.43	23.43	47.49	55.25
SOTA	49.57^a	25.08	41.81	53.89^b	28.85	49.29	15.78^c	40.15	50.98
Methods	WritingPrompts				DailyDialog				WikiBio
Methods	B-1	B-2	D-1	D-4	B-1	B-2	D-1	D-2	B-4
BART	22.40^e	8.40	–	31.30	44.30^f	39.20	3.90	21.10	–
MVP	32.34	13.11	2.12	64.58	46.19	41.81	4.61	25.06	48.42
MVP+S	30.12	11.46	3.97	83.70	45.71	42.92	5.10	27.14	48.19
SOTA	22.40^e	8.40	–	31.30	46.10^f	40.70	4.10	22.20	45.10^g
Methods	DSTC7-AVSD							SQuAD
Methods	B-1	B-2	B-3	B-4	ME	R-L	CIDEr	F1	EM
BART	82.40^f	69.10	58.20	48.70	31.30	63.50	1.38	91.56ⁱ	84.23
MVP	83.75	70.89	60.19	50.94	32.12	65.04	1.45	93.45	87.20
MVP+S	83.81	71.07	60.45	51.20	31.77	64.76	1.44	93.45	87.17
SOTA	83.20^f	70.50	59.80	50.60	31.40	63.80	1.39	96.22^h	91.26

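Table 11: The results on six seen tasks under full tuning settings. ^a (Nguyen et al., 2021) ^b (Tang et al., 2022c) ^c (Gu et al., 2021) ^d (Lewis et al., 2020) ^e (Guan et al., 2021) ^f (Chen et al., 2022) ^g (Chen et al., 2020b) ^h (Raffel et al., 2020) ⁱ (Xu et al., 2021)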

Methods	DART			E2E			ToTTo
Methods	B-4	R-2	ME	B-4	R-2	ME	B-4	R-2	ME
T5.1.1	34.31	45.22	36.30	42.57	46.60	38.20	39.79	49.90	36.80
ExT5	36.62	48.14	37.60	42.25	46.70	38.10	40.14	50.33	36.90
MVP	39.13	48.92	38.53	37.38	47.96	39.39	50.58	55.24	41.27
MVP+S	38.83	48.49	38.41	37.32	47.40	38.90	50.69	55.52	41.29
Methods	WebNLG			CommonGen			SGD
Methods	B-4	R-2	ME	B-4	R-2	ME	B-4	R-2	ME
T5.1.1	31.67	43.31	34.40	8.38	17.01	20.20	33.15	36.17	32.40
ExT5	35.03	48.17	36.50	9.68	19.04	21.40	34.74	37.77	33.00
MVP	47.03	59.00	42.34	32.59	37.71	33.00	45.63	48.29	38.48
MVP+S	47.03	59.03	42.28	34.10	37.87	33.11	45.24	48.25	38.47
Methods	WiA-A			WiA-T			WLE
Methods	B-4	R-2	ME	B-4	R-2	ME	B-4	R-2	ME
T5.1.1	29.30	38.37	30.10	42.12	50.52	36.2	15.55	20.47	19.60
ExT5	29.23	37.98	30.00	41.39	50.38	35.8	16.64	21.16	20.40
MVP	71.55	70.88	48.19	91.73	83.46	57.34	18.80	22.84	21.95
MVP+S	70.37	70.65	47.70	91.12	83.59	56.95	18.52	22.57	22.02

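Table 12: The results on the GEM benchmark under full tuning settings. We use the large versions of T5.1.1 and ExT5, and all their results are taken from Aribandi et al. (2022).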

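Appendix B Fine-tuning and Evaluation Details

In this section, we introduce the details of fine-tuning and evaluation for each downstream task.

For the experiments in Section 4 (Tables 2 and 6) and Appendix C (Table 11), the fine-tuning details are described in Section 4, and the evaluation details for each task are as follows: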

• For data-to-text generation tasks, we use BLEU(-4), ROUGE-L, and METEOR for evaluation. We use the script provided by Chen et al. (2020b) (https://github.com/wenhuchen/Data-to-text-Evaluation-Metric);
• For open-ended dialogue system tasks, we use BLEU-1, BLEU-2, Distinct-1, and Distinct-2 for evaluation. For DSTC7-AVSD, we also utilize CIDEr (Vedantam et al., 2015). We use NLTK 3.5 with smoothing function 7 to compute BLEU for PersonaChat and DailyDialog, and we use the script at https://github.com/lemuria-wchen/DialogVED/blob/main/src/utils/evaluate.py to evaluate DSTC7-AVSD;
• For question answering tasks, we use Exact Match (EM) and macro-averaged F1 score (F1) for evaluation. We use the provided scripts for CoQA (https://github.com/PaddlePaddle/ERNIE/blob/repro/ernie-gen/eval/tasks/coqa/eval.py) and SQuAD (https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py);
• For question generation tasks, we use BLEU-4, ROUGE-L, and METEOR for evaluation. We use the script provided by Dong et al. (2019) (https://github.com/microsoft/unilm/blob/master/unilm-v1/src/qg/eval.py);
• For story generation, we employ nucleus sampling with $p=0.9$ and a temperature of 0.7, following Guan et al. (2021). We use corpus BLEU-1, BLEU-2, Distinct-1, and Distinct-4 for evaluation. We use NLTK 3.5 to compute corpus BLEU, following Guan et al. (2021);
• For task-oriented dialogue system tasks, we use BLEU(-4), inform (rate), success (rate), and the combined score for evaluation. Inform and success are two accuracy metrics designed specifically for task-oriented dialogue systems, and the combined score is defined as $(\text{Inform}+\text{Success})\times 0.5+\text{BLEU}$ (Budzianowski et al., 2018). We use the script provided by Su et al. (2022) (https://github.com/awslabs/pptod/blob/main/E2E_TOD/eval.py);
• For text summarization tasks, we use ROUGE-1, ROUGE-2, and ROUGE-L for evaluation. We use the toolkit files2rouge (https://github.com/pltrdy/files2rouge).
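Two of the metrics listed above are simple enough to sketch directly. The code below is an illustration only, not the evaluation scripts linked above: `distinct_n` uses the common corpus-level definition (unique n-grams divided by total n-grams over whitespace tokens), which is an assumption about how Distinct-n is computed, and `combined_score` implements the formula stated for task-oriented dialogue. Both function names are hypothetical.

```python
from typing import List, Set, Tuple


def distinct_n(texts: List[str], n: int) -> float:
    """Corpus-level Distinct-n: unique n-grams divided by total n-grams."""
    unique: Set[Tuple[str, ...]] = set()
    total = 0
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Return 0.0 when no text is long enough to contain an n-gram.
    return len(unique) / total if total else 0.0


def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined score for task-oriented dialogue: (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu
```

For instance, with hypothetical values Inform = 90, Success = 80, and BLEU = 20, the combined score is (90 + 80) × 0.5 + 20 = 105.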

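For the experiments on the GEM benchmark in Appendix C.2 (Table 12), the fine-tuning settings are the same as above. We use BLEU-4, ROUGE-2, and METEOR for evaluation, computed with the GEM evaluation scripts (https://github.com/GEM-benchmark/GEM-metrics).

For the experiments in Section 4.3 (Tables 4 and 5), the fine-tuning and evaluation details are as follows: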

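• For paraphrase generation tasks, we use the fine-tuning and evaluation scripts provided by AESOP (Sun et al., 2021) (https://github.com/PlusLabNLP/AESOP). The evaluation metrics are BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR.
• For text style transfer tasks, we use the fine-tuning and evaluation scripts provided by SC & BLEU (Lai et al., 2021) (https://github.com/laihuiyuan/pre-trained-formality-transfer). Following Lai et al. (2021), we conduct informal-to-formal transfer and train the model on data from both the E&M and F&R domains. The evaluation metrics are BLEU-4, accuracy, and HM. Accuracy is computed by a pre-trained TextCNN to evaluate the style strength, and HM denotes the harmonic mean of BLEU-4 and style accuracy (Lai et al., 2021).
• For GLUE tasks, we use the text-classification example scripts from Hugging Face (https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification). The hyper-parameters are consistent with those of the original BART (Lewis et al., 2020) (https://github.com/facebookresearch/fairseq/blob/main/examples/bart/README.glue.md). The evaluation is computed by the official website (https://gluebenchmark.com/).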
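The HM metric used for text style transfer is the harmonic mean of BLEU-4 and style accuracy. A minimal sketch follows; the function name `harmonic_mean` and the convention of returning 0 when both inputs are 0 are assumptions made for this example, and the numbers in the usage note are hypothetical.

```python
def harmonic_mean(bleu: float, accuracy: float) -> float:
    """Harmonic mean of BLEU-4 and style accuracy (both on the same scale)."""
    if bleu + accuracy == 0:
        # Avoid division by zero; treating this case as 0 is an assumption.
        return 0.0
    return 2 * bleu * accuracy / (bleu + accuracy)
```

With BLEU-4 = 60 and accuracy = 90, HM is 2 × 60 × 90 / 150 = 72. Because the harmonic mean is dominated by the smaller value, a system cannot reach a high HM by copying the input (high BLEU, low accuracy) or by ignoring content (high accuracy, low BLEU).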

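Appendix C Additional Results

In this section, we provide additional results of our MVP model and other baselines.

C.1 Results of Common Datasets

We also conduct experiments on eight common datasets under full tuning settings. Due to the space limit in Section 4, these results are shown in Table 11. They follow a trend similar to the results in Section 4, and we achieve SOTA performance on 6 of the 8 datasets.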

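C.2 Results on the GEM Benchmark

To better compare with ExT5 (Aribandi et al., 2022), we conduct experiments on the GEM benchmark (Gehrmann et al., 2021). For the "unseen" commonsense generation and text simplification tasks, we use the prompts of data-to-text generation and summarization, respectively. The results are presented in Table 12, and our MVP model outperforms ExT5 on 26 out of 27 metrics.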
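The "26 out of 27" count can be checked directly from the ExT5 and MVP rows of Table 12. The short script below copies those numbers (BLEU-4, ROUGE-2, METEOR per dataset) and tallies the metrics on which MVP is strictly higher; the variable names are ours and are used only for illustration.

```python
METRICS = ("B-4", "R-2", "ME")

# (BLEU-4, ROUGE-2, METEOR) per dataset, copied from Table 12.
EXT5 = {
    "DART": (36.62, 48.14, 37.60),
    "E2E": (42.25, 46.70, 38.10),
    "ToTTo": (40.14, 50.33, 36.90),
    "WebNLG": (35.03, 48.17, 36.50),
    "CommonGen": (9.68, 19.04, 21.40),
    "SGD": (34.74, 37.77, 33.00),
    "WiA-A": (29.23, 37.98, 30.00),
    "WiA-T": (41.39, 50.38, 35.8),
    "WLE": (16.64, 21.16, 20.40),
}
MVP = {
    "DART": (39.13, 48.92, 38.53),
    "E2E": (37.38, 47.96, 39.39),
    "ToTTo": (50.58, 55.24, 41.27),
    "WebNLG": (47.03, 59.00, 42.34),
    "CommonGen": (32.59, 37.71, 33.00),
    "SGD": (45.63, 48.29, 38.48),
    "WiA-A": (71.55, 70.88, 48.19),
    "WiA-T": (91.73, 83.46, 57.34),
    "WLE": (18.80, 22.84, 21.95),
}

wins = 0
losses = []
for dataset, scores in MVP.items():
    for metric, ours, theirs in zip(METRICS, scores, EXT5[dataset]):
        if ours > theirs:
            wins += 1
        else:
            losses.append((dataset, metric))

total = len(MVP) * len(METRICS)
print(f"MVP is higher on {wins} of {total} metrics; exceptions: {losses}")
```

The single exception is BLEU-4 on E2E, where ExT5 scores 42.25 against 37.38 for MVP.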

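Appendix D Human Evaluation

We hired six college students who are proficient in English, with TOEFL or IELTS scores higher than 110 or 7.0, respectively. We paid each judge $0.2 per instance, for a total budget of $320 for 400 instances. The text instructions we provided to each judge are shown in Figure 2.

Appendix E Qualitative Examples

In this section, we show the linearized inputs, the human-written task instructions, and the corresponding outputs on a single dataset for each task in Section 4. We provide the results of BART, MVP, and MVP+S under full tuning settings. To minimize human intervention, we select the first and second instances of the test set.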

Thank you for taking the time to help us evaluate our scientific research! Our task is to present you with two pieces of machine-generated text and ask you to decide which one is superior. Your opinion will only be used to compare our two models; it will not be used for any other purpose.

We have four tasks to evaluate: 1. Text summarization: the input is a lengthy piece of news, and the output is a brief description of the content. Examine whether the abstract covers the majority of the news and whether there are any factual errors. 2. Knowledge-graph-to-text generation: the input is a knowledge graph (multiple triples), and the output is a text description of the graph. Note whether the description encompasses all of the input triples. 3. Open-ended dialogue: the input is two users’ background information and chat history, and the output is the next response. Examine whether the response is consistent with the contexts and background of the user at the time. 4. Story generation: the input is the beginning of the story, and the output is the following story. Keep in mind that the story needs to be coherent and consistent.

For each instance, you will see an input and two outputs (you will not know which model it comes from) in the table below, and you need to choose which one you believe is better (or a tie). You can base your decision on the output’s fluency, grammar, logic, whether it conforms to the input, and the features of each task.

Input she was on a flight . Output she was trying to take a nap . suddenly , her ears started ringing . the flight attendant tried to fix it but she could n’t . she had to call for help . luckily , they were able to fix the problem . she was bored and her ears hurt . she decided to take a nap . luckily , she was able to get a good night ’s sleep . but the next morning , she woke up and felt sick .

Left Wins	Ties	Right Wins

Figure 2: Human evaluation guidelines.

Input

Summarize: Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." Robin’s comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps. All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites. The publications said that they watched the video, which was found by a source close to the investigation. "One can hear cries of ’My God’ in several languages," Paris Match reported. "Metallic banging can also be heard more than three times, perhaps of the pilot trying to open the cockpit door with a heavy object. Towards the end, after a heavy shake, stronger than the others, the screaming intensifies. Then nothing." "It is a very disturbing scene," said Julian Reichelt, editor-in-chief of Bild online. An official with France’s accident investigation agency, the BEA, said the agency is not aware of any such video. Lt. Col. Jean-Marc Menichini, a French Gendarmerie spokesman in charge of communications on rescue efforts around the Germanwings crash site, told CNN that the reports were "completely wrong" and "unwarranted." Cell phones have been collected at the site, he said, but that they "hadn’t been exploited yet." Menichini said he believed the cell phones would need to be sent to the Criminal Research Institute in Rosny sous-Bois, near Paris, in order to be analyzed by specialized technicians working hand-in-hand with investigators. 
But none of the cell phones found so far have been sent to the institute, Menichini said. Asked whether staff involved in the search could have leaked a memory card to the media, Menichini answered with a categorical "no." Reichelt told "Erin Burnett: Outfront" that he had watched the video and stood by the report, saying Bild and Paris Match are "very confident" that the clip is real. He noted that investigators only revealed they’d recovered cell phones from the crash site after Bild and Paris Match published their reports. "That is something we did not know before. … Overall we can say many things of the investigation weren’t revealed by the investigation at the beginning," he said. What was mental state of Germanwings co-pilot? German airline Lufthansa confirmed Tuesday that co-pilot Andreas Lubitz had battled depression years before he took the controls of Germanwings Flight 9525, which he’s accused of deliberately crashing last week in the French Alps. Lubitz told his Lufthansa flight training school in 2009 that he had a "previous episode of severe depression," the airline said Tuesday. Email correspondence between Lubitz and the school discovered in an internal investigation, Lufthansa said, included medical documents he submitted in connection with resuming his flight training. The announcement indicates that Lufthansa, the parent company of Germanwings, knew of Lubitz’s battle with depression, allowed him to continue training and ultimately put him in the cockpit. Lufthansa, whose CEO Carsten Spohr previously said Lubitz was 100% fit to fly, described its statement Tuesday as a "swift and seamless clarification" and said it was sharing the information and documents – including training and medical records – with public prosecutors. Spohr traveled to the crash site Wednesday, where recovery teams have been working for the past week to recover human remains and plane debris scattered across a steep mountainside. 
He saw the crisis center set up in Seyne-les-Alpes, laid a wreath in the village of Le Vernet, closer to the crash site, where grieving families have left flowers at a simple stone memorial. Menichini told CNN late Tuesday that no visible human remains were left at the site but recovery teams would keep searching. French President Francois Hollande, speaking Tuesday, said that it should be possible to identify all the victims using DNA analysis by the end of the week, sooner than authorities had previously suggested. In the meantime, the recovery of the victims’ personal belongings will start Wednesday, Menichini said. Among those personal belongings could be more cell phones belonging to the 144 passengers and six crew on board. Check out the latest from our correspondents. The details about Lubitz’s correspondence with the flight school during his training were among several developments as investigators continued to delve into what caused the crash and Lubitz’s possible motive for downing the jet. A Lufthansa spokesperson told CNN on Tuesday that Lubitz had a valid medical certificate, had passed all his examinations and "held all the licenses required." Earlier, a spokesman for the prosecutor’s office in Dusseldorf, Christoph Kumpa, said medical records reveal Lubitz suffered from suicidal tendencies at some point before his aviation career and underwent psychotherapy before he got his pilot’s license. Kumpa emphasized there’s no evidence suggesting Lubitz was suicidal or acting aggressively before the crash. Investigators are looking into whether Lubitz feared his medical condition would cause him to lose his pilot’s license, a European government official briefed on the investigation told CNN on Tuesday. While flying was "a big part of his life," the source said, it’s only one theory being considered. 
Another source, a law enforcement official briefed on the investigation, also told CNN that authorities believe the primary motive for Lubitz to bring down the plane was that he feared he would not be allowed to fly because of his medical problems. Lubitz’s girlfriend told investigators he had seen an eye doctor and a neuropsychologist, both of whom deemed him unfit to work recently and concluded he had psychological issues, the European government official said. But no matter what details emerge about his previous mental health struggles, there’s more to the story, said Brian Russell, a forensic psychologist. "Psychology can explain why somebody would turn rage inward on themselves about the fact that maybe they weren’t going to keep doing their job and they’re upset about that and so they’re suicidal," he said. "But there is no mental illness that explains why somebody then feels entitled to also take that rage and turn it outward on 149 other people who had nothing to do with the person’s problems." Germanwings crash compensation: What we know. Who was the captain of Germanwings Flight 9525? CNN’s Margot Haddad reported from Marseille and Pamela Brown from Dusseldorf, while Laura Smith-Spark wrote from London. CNN’s Frederik Pleitgen, Pamela Boykoff, Antonia Mortensen, Sandrine Amiel and Anna-Maja Rappard contributed to this report.

Gold

Match are "very confident" the video clip is real, an editor says. Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says.

BART

Prosecutor: "A person who has such a video needs to immediately give it to the investigators" French Gendarmerie spokesman: Cell phones have been collected at the site, but haven’t been exploited.

MVP

Prosecutor: "So far no videos were used in the crash investigation" An official with France’s accident investigation agency says the agency is not aware of any video.

MVP+S

Prosecutor: "So far no videos were used in the crash investigation" Two publications claim to have found a cell phone video of the crash. "It is a very disturbing scene," says editor-in-chief of Bild online.

Table 13: The first instance from the CNN/Daily Mail dataset. Human-written task instructions are labeled in italics.

Input

Summarize: The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC’s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians’ efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday’s ceremony, said it was a move toward greater justice. "As Palestine formally becomes a State Party to the Rome Statute today, the world is also a step closer to ending a long era of impunity and injustice," he said, according to an ICC news release. "Indeed, today brings us closer to our shared goals of justice and peace." Judge Kuniko Ozaki, a vice president of the ICC, said acceding to the treaty was just the first step for the Palestinians. "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. These are substantive commitments, which cannot be taken lightly," she said. Rights group Human Rights Watch welcomed the development. 
"Governments seeking to penalize Palestine for joining the ICC should immediately end their pressure, and countries that support universal acceptance of the court’s treaty should speak out to welcome its membership," said Balkees Jarrah, international justice counsel for the group. "What’s objectionable is the attempts to undermine international justice, not Palestine’s decision to join a treaty to which over 100 countries around the world are members." In January, when the preliminary ICC examination was opened, Israeli Prime Minister Benjamin Netanyahu described it as an outrage, saying the court was overstepping its boundaries. The United States also said it "strongly" disagreed with the court’s decision. "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement. It urged the warring sides to resolve their differences through direct negotiations. "We will continue to oppose actions against Israel at the ICC as counterproductive to the cause of peace," it said. But the ICC begs to differ with the definition of a state for its purposes and refers to the territories as "Palestine." While a preliminary examination is not a formal investigation, it allows the court to review evidence and determine whether to investigate suspects on both sides. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." The war between Israel and Hamas militants in Gaza last summer left more than 2,000 people dead. The inquiry will include alleged war crimes committed since June. The International Criminal Court was set up in 2002 to prosecute genocide, crimes against humanity and war crimes. CNN’s Vasco Cotovio, Kareem Khadder and Faith Karimi contributed to this report.

Gold

Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June. Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis.

BART

Palestinian Authority becomes 123rd member of the International Criminal Court. The move gives the court jurisdiction over alleged crimes in Palestinian territories. Israel and the United States opposed the Palestinians’ efforts to join the body.

MVP

"Today brings us closer to our shared goals of justice and peace," foreign minister says. The Palestinians signed the ICC’s founding Rome Statute in January. The move gives the court jurisdiction over alleged crimes in Palestinian territories.

MVP+S

"Today brings us closer to our shared goals of justice and peace," foreign minister says. The United States says it "strongly" disagrees with the decision. The Palestinian Authority is the 123rd member of the International Criminal Court.

Table 14: The second instance from the CNN/Daily Mail dataset.

Input

Describe the following data: Abilene,_Texas | cityServed | Abilene_Regional_Airport

Gold

Abilene, Texas is served by the Abilene regional airport.

Abilene Regional Airport serves the city of Abilene in Texas.

BART

Abilene Regional Airport serves the city of Abilene in Texas.

MVP

Abilene Regional Airport serves the city of Abilene, Texas.

MVP+S

Abilene Regional Airport serves the city of Abilene, Texas.

Table 15: The first instance from the WebNLG dataset, which has two gold target sentences.

Input

Describe the following data: "Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas" | location | Adolfo_Suárez_Madrid–Barajas_Airport

Gold

Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

Adolfo Suarez Madrid-Barajas airport is located at Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

Adolfo Suarez Madrid-Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastian de los Reyes and Alcobendas.

BART

Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

MVP

Adolfo Suárez Madrid–Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

MVP+S

Adolfo Suárez Madrid–Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

Table 16: The second instance from the WebNLG dataset, which has three gold target sentences.

Input

Generate the question based on the answer: Saint Bernadette Soubirous [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ’ s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .

Gold

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France ?

BART

Who is believed to have appeared to the Virgin Mary at Lourdes ?

MVP

Who did the Virgin Mary appear to in Lourdes ?

MVP+S

The Grotto is a replica of the grotto at Lourdes , France where the Virgin Mary appeared to whom ?

Table 17: The first instance from the SQuAD dataset.

Input

Generate the question based on the answer: a copper statue of Christ [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ’ s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .

Gold

What is in front of the Notre Dame Main Building ?

BART

What is in front of the Main Building and facing it ?

MVP

What is immediately in front of the Main Building ?

MVP+S

What is immediately in front of the Main Building ?

Table 18: The second instance from the SQuAD dataset.

Input

Answer the following question: what color was cotton ? [X_SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ’ s horses slept . but cotton wasn ’ t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ’ s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ’ s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ’ s mommy rubbed her face on cotton ’ s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ’ s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ’ s fur was all all dry . " don ’ t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ’ t want that ! " then cotton thought , " i change my mind . i like being special " .

Gold

white

BART

white

MVP

white

MVP+S

white

Table 19: The first instance from the CoQA dataset.

Input

Answer the following question: what color was cotton ? [SEP] white [X_SEP] where did she live ? [X_SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ’ s horses slept . but cotton wasn ’ t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ’ s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ’ s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ’ s mommy rubbed her face on cotton ’ s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ’ s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ’ s fur was all all dry . " don ’ t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ’ t want that ! " then cotton thought , " i change my mind . i like being special " .

Gold

in a barn

BART

in a barn

MVP

in a barn

MVP+S

in a barn

Table 20: The second instance from the CoQA dataset.
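
The input layout shared by the two CoQA instances can be reproduced with a short helper. The following is a minimal sketch inferred only from Tables 19 and 20, not the authors' preprocessing code: the function name `build_coqa_input` and its arguments are hypothetical, and joining more than one earlier question-answer pair with `[X_SEP]` is an assumption, since the tables show at most one previous turn.

```python
from typing import List, Tuple

SEP = " [SEP] "      # separates a previous question from its answer
X_SEP = " [X_SEP] "  # separates turns and the passage

def build_coqa_input(passage: str, question: str,
                     history: List[Tuple[str, str]] = ()) -> str:
    """Build a conversational QA input in the layout of Tables 19 and 20.

    `history` holds earlier (question, answer) pairs. How several pairs
    are joined is an assumption; only one earlier pair appears in the tables.
    """
    segments = [f"{q}{SEP}{a}" for q, a in history]
    segments.append(question)  # the question to be answered now
    segments.append(passage)   # the supporting passage comes last
    return "Answer the following question: " + X_SEP.join(segments)

# Second-turn example, with the passage shortened for readability.
print(build_coqa_input(
    passage="once upon a time , in a barn near a farm house , ...",
    question="where did she live ?",
    history=[("what color was cotton ?", "white")],
))
```

With an empty history the helper yields the single-question layout of Table 19; with one earlier pair it yields the layout of Table 20.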

Input

Given the story title: male was out jogging one morning .

Gold

the weather was crisp and cool . male felt good and energetic . he decided to keep jogging longer than normal . male went several more miles out of his way .

BART

he saw a man struggling to walk across a field . male stopped to help him . he saw a guy struggling to walking across a fields . the man thanked male profusely !

MVP

he ran into a man who was holding a knife . male was scared , so he ran away . the man threatened male . male ran back into his house .

MVP+S

he noticed a sound coming from a tree . male went to investigate . he found a small baby bird stuck in the trunk . male took the bird to the vet and they released it .

Table 21: The first instance from the ROCStories dataset.

Input

Given the story title: ken was driving around in the snow .

Gold

he needed to get home from work . he was driving slowly to avoid accidents . unfortunately the roads were too slick and ken lost control . his tires lost traction and he hit a tree .

BART

he hit a patch of black ice . ken slid and hurt his knee . he went to the doctor for it . krampus was sent home with a concussion .

MVP

he was in his car and it was very cold . ken started to back up . the car started to make a loud noise . kena had to back out and call for help .

MVP+S

he was having trouble starting his car . he had to push it a couple of blocks . his car would n’t start . ken had to call a tow truck for help .

Table 22: The second instance from the ROCStories dataset.

Input

Given the dialog: i love to meet new people . [SEP] i have a turtle named timothy . [SEP] my favorite sport is ultimate frisbee . [SEP] my parents are living in bora bora . [SEP] autumn is my favorite season . [X_SEP] hello , how are you doing tonight ?

Gold

i am well an loving this interaction how are you ?

BART

i am doing well , how are you ?

MVP

i am doing well . how are you ?

MVP+S

i am doing well . how are you ?

Table 23: The first instance from the PersonaChat dataset.

Input

Given the dialog: i just bought a brand new house . [SEP] i like to dance at the club . [SEP] i run a dog obedience school . [SEP] i have a big sweet tooth . [SEP] i like taking and posting selkies . [X_SEP] hello , how are you doing tonight ? [SEP] i am well an loving this interaction how are you ?

Gold

i am great . i just got back from the club .

BART

i am doing well , thanks for asking . what do you do for fun ?

MVP

i am doing well . i just got home from dancing at the club .

MVP+S

i am doing well . i just bought a new house .

Table 24: The second instance from the PersonaChat dataset.

Input

Given the task dialog: Belief state [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .

Given the task dialog: Dialogue action [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .

Given the task dialog: System response [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton .

Gold

[taxi] destination pizza hut fen ditton departure saint john ’s college

[taxi] [request] leave arrive

what time do you want to leave and what time do you want to arrive by ?

BART

[taxi] destination pizza express fen ditton departure saint johns college

[taxi] [request] leave

what time would you like to leave or arrive ?

MVP

[taxi] destination pizza hut fen ditton departure saint john ’s college

[taxi] [request] leave

sure ! when would you like to arrive ?

MVP+S

[taxi] destination pizza hut fen ditton departure saint john ’s college

[taxi] [request] leave arrive

what time would you like to leave ?

Table 25: The first instance from the MultiWOZ dataset. Each task-oriented dialogue turn comprises three sub-tasks, listed in order in every row: dialogue state tracking, dialogue action learning, and system response generation.
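
The three MultiWOZ inputs differ only in the sub-task name and in whether a database token follows it. The sketch below reconstructs that layout from the inputs shown in Tables 25 and 26. It is an illustration under stated assumptions rather than the authors' pipeline: the name `build_multiwoz_inputs` is hypothetical, and the default token `[db_nores]` is simply the value that appears in these two instances (in practice it would presumably depend on the database query result).

```python
from typing import Dict, List

SEP = " [SEP] "      # separates utterances in the dialogue history
X_SEP = " [X_SEP] "  # separates the sub-task name, DB token, and history
PREFIX = "Given the task dialog: "

def build_multiwoz_inputs(history: List[str],
                          db_token: str = "[db_nores]") -> Dict[str, str]:
    """Build the three sub-task inputs in the layout of Tables 25 and 26.

    `history` is the list of utterances so far, oldest first.
    `db_token` is an assumed placeholder for the database result.
    """
    context = SEP.join(history)
    return {
        # Dialogue state tracking sees only the dialogue history.
        "belief_state": f"{PREFIX}Belief state{X_SEP}{context}",
        # The other two sub-tasks also receive the database token.
        "dialogue_action": f"{PREFIX}Dialogue action{X_SEP}{db_token}{X_SEP}{context}",
        "system_response": f"{PREFIX}System response{X_SEP}{db_token}{X_SEP}{context}",
    }

# Print the three inputs for a two-utterance history.
for name, text in build_multiwoz_inputs(
    ["i would like a taxi .", "what time do you want to leave ?"]
).items():
    print(name, "->", text)
```

Passing the single user utterance of the first instance reproduces the three inputs listed in that table, and appending later utterances to `history` gives the layout of the second instance.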

Input

Given the task dialog: Belief state [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .

Given the task dialog: Dialogue action [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .

Given the task dialog: System response [X_SEP] [db_nores] [X_SEP] i would like a taxi from saint john ’s college to pizza hut fen ditton . [SEP] what time do you want to leave and what time do you want to arrive by ? [SEP] i want to leave after 17:15 .

Gold

[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15

[taxi] [inform] car phone

booking completed ! your taxi will be [value_car] contact number is [value_phone]

BART

[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15

[taxi] [inform] car phone

booking completed ! booked car type : [value_car] contact number : [value_phone]

MVP

[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15

[taxi] [inform] car phone

booking completed ! booked car type : [value_car] contact number : [value_phone]

MVP+S

[taxi] destination pizza hut fen ditton departure saint john ’s college leave 17:15

[taxi] [inform] car phone

booking completed ! booked car type : [value_car] contact number : [value_phone]

Table 26: The second instance from the MultiWOZ dataset.