下一章：讲故事中大型语言模型的研究

Zhuohan Xie Trevor Cohn Jey Han Lau
School of Computing and Information Systems
The University of Melbourne
zhuohanx@student.unimelb.edu.au, {t.cohn, laujh}@unimelb.edu.au Now at Google DeepMind.

摘要

为了提高生成故事的质量，最近的故事生成模型一直在研究情节或常识知识等更高级别属性的利用。以 GPT-3 为代表的大型语言模型（大语言模型）基于提示的学习的应用在各种自然语言处理（NLP）任务中表现出了卓越的性能。本文利用自动和人工评估进行了全面的调查，将大语言模型的故事生成能力与三个数据集的最新模型进行比较，这些数据集的风格、语体和故事长度各不相同。结果表明，与其他故事生成模型相比，大语言模型生成的故事质量明显更高。此外，他们表现出的表现水平可以与人类作者竞争，尽管根据初步知识观察，他们倾向于在涉及世界知识的情况下复制真实的故事，类似于一种抄袭形式。

1简介

自动故事生成提出了重大挑战，因为它需要的不仅仅是单独连贯的句子。一个好的故事应该表现出自然的流畅性，遵循常识性的逻辑，并且能够吸引读者。近年来，故事生成的流行方法涉及微调预训练语言模型（PLM），例如 GPT-2 （Radford 等人，2019）或 BART （Lewis 等人， 2020）在特定数据集上。这些模型通常擅长生成流畅的句子，没有明显的语法问题。然而，他们常常难以构建一个符合常识的连贯故事，并且无法创造出引人入胜的叙事（参见等人，2019；管等人，2021a）。为了克服这些挑战，最先进的（SOTA）故事生成模型集成了更高级别的特征，例如情节和常识知识。

基于提示的学习（Liu等人，2021）是一种专门为具有上下文学习能力的大型语言模型（大语言模型）量身定制的范式（Brown等人，2020；赵等人，2023）。与传统的“预训练和调整”方法需要大量数据进行微调相比，基于提示的学习使大语言模型能够通过提供多个示例作为“提示”来学习任务，无需基于梯度的微调（Liu等人，2021）。最近，大语言模型在各种语言生成任务中表现出了出色的性能，尤其是 ChatGPT 和 GPT-4 等模型（Qin 等人，2023；Liu 等人，2023b；OpenAI，2023）。例如，Qin 等人 (2023) 的比较分析强调了在对话和摘要等任务的零样本场景中，经过微调的大语言模型相对于较小的预训练模型具有优越的性能。但值得注意的是，他们的实验中没有专门检查故事的生成。

本文旨在通过对自动故事生成进行综合评估来弥补这一研究空白。具体来说，我们将使用基于提示的学习的大语言模型（特别关注 GPT-3）与 SOTA 模型的性能进行比较。我们根据各种自动评估指标（从词汇和语义匹配指标到最近提出的基于模型的指标）对生成的故事进行比较。我们遵循文献中的最佳实践，进行严格的人工评估，包括 Amazon Mechanical Turk 的众包工作者和内部评委，并在细粒度的水平上评估故事质量，例如连贯性和逻辑性。总而言之，我们的贡献是：

•

我们在风格、语体和长度不同的三个不同语料库上对 GPT-3 和其他用于生成开放式故事的 SOTA 技术进行了实证比较。
•

我们使用各种自动故事评估指标进行测试，发现最近基于模型的指标效果更好，与文献一致。
•

我们使用两种类型的注释者（众包工作者和内部评委）进行了实验，以评估故事各个方面的质量。两组获得的结果是一致的。我们发布此带注释的资源作为测试平台，用于在故事生成任务中开发新的自动指标。¹¹1https://github.com/ZhuohanX/TheNextChapter
•

我们的实验结果提供了全面的证据，表明 GPT-3 生成的故事与 SOTA 技术相比表现出显着的改进，并且在各个方面与人类创作的故事相当。
•

我们对故事抄袭进行了初步知识研究，发现GPT-3在生成新闻时倾向于（软）“抄袭”真实故事，即使它不直接复制源文本，这引发了进一步的问题：GPT-3在多大程度上抄袭在记忆中循环利用故事，而不是产生新的叙述。

2相关工作

故事生成

参见等人(2019)发现经过微调的GPT-2已经可以生成句子流畅的故事，但需要更多的关注来融入常识和更高层次的故事规划。大多数作品随后使用 GPT-2 或 BART 等 PLM 作为骨干，并结合更高级别的功能来帮助生成过程。具体来说，Rashkin 等人 (2020)； Goldfarb-Tarrant 等人 (2020); Tan 等人 (2021) 构造一个故事情节来指导生成过程。关等人 (2021a);于等人 (2021); Xie 等人(2021)将连贯性和话语关系等句子间关系纳入生成过程。 Guan 等人 (2020); Peng 等人 (2021) 探索利用常识等外部知识来生成故事。徐等人 (2020); Ammanabrolu 等人 (2021) 结合故事情节规划和常识推理。

还有一些研究探索使用 GPT-3 来生成故事。例如，Clark 等人 (2021) 在 GPT-3 生成的故事和人类编写的故事之间进行图灵测试，以及 Lucy 和 Bamman (2021) 探索性别和代表性偏见在 GPT-3 生成的故事中。然而，这些研究并未提供针对 SOTA 故事生成模型来评估 GPT-3 的系统评估。

故事评价

自动故事评估无疑是一项具有挑战性的任务，缺乏标准化的评估指标在一定程度上阻碍了故事生成的进展（关等人，2021b）。人工评估通常被认为是故事质量评估的黄金标准，但其成本高昂且耗时（Guan and Huang，2020）并且无法捕获多样性（Hashimoto等人， 2017）。随后，引入了几种自动评估指标作为替代措施来评估生成的故事的质量（可读性程度）和多样性（变化程度）。在质量方面，大多数度量方法是测量字符串（Papineni等人，2002年；Lin，2004年；Tan等人，2021年）之间的词汇重叠度，或通过比较生成的故事与人类参考文献之间的模型嵌入（赵等人，2019年；张等人，2020年）来测量语义相似度。最近，探索了基于学习（Sellam 等人，2020）和生成（Yuan 等人，2021）的方法，它们基于预训练的语言模型，例如BERT (Devlin 等人, 2019) 和 BART。然而，这些评估指标的局限性在于它们提供单一分数来表明故事的整体质量，并且很少有指标是专门设计来评估特定方面的，例如逻辑性（对常识的遵守）或趣味性（读者参与度） ) (Chhun 等人, 2022)。

3实验设置

3.1 故事生成模型

为了确保全面的比较，我们进行了涉及 GPT-3 和各种 SOTA 故事生成模型的广泛实验。

在我们的实验中，我们使用了 GPT-3 最大的初始版本，即 text-davinci-001，它最初于 2020 年 6 月推出，包含 175B 个参数。也许值得注意的是，该模型在我们实验时（2022 年 3 月）被认为是最强大的，尽管后来发布了 GPT-4 等后续模型，拥有更强大的功能。因此，我们在这里报告的结果可以解释为大语言模型故事生成性能的“下限”。为了使 GPT-3 适应故事领域而不进行明确的微调，我们采用了基于提示的学习方法。我们选择了少量的故事（通常是 2 或 3 个）作为目标领域中 GPT-3 的范例。

对于 SOTA 故事生成模型，我们使用 1）基于知识增强的模型：KGGPT2 (Guan 等人, 2020) 和 HINT (管等人, 2021a); 2）基于故事情节规划的模型：PROGEN （Tan等人，2021）； 3）MTCL （徐等人，2020），结合故事情节规划和常识推理。我们还将 BART 作为附加基线。为了保持一致性，所有模型均使用核采样（Holtzman等人，2020），并以 $p=0.95$ 作为解码方法。我们在Table 1中总结了这些模型，更多详细信息可以在Appendix A中找到。

Model	Backbone	Size	Method	Story Datasets
GPT-3	text-davinci-001	175B	Prompt-based learning with several examples from the story dataset (3 for ROC and WP and 2 for CNN)	ROC, WP, CNN
KGGPT2	GPT-2 small	124M	Fine-tuned on commonsense data before more fine-tuning with auxiliary classification tasks	ROC
PROGEN	BART large	400M	Three-stage generation where at each stage a fine-tuned BART generates stories based on word importance in the story datasets	ROC, WP, CNN
MTCL	GPT-2 small BERT large	124M 336M	(1) a GPT-2 model to generate keywords; (2) a BERT model to rank retrieved knowledge triples; and (3) a second GPT-2 model that takes top-ranked knowledge triples and context as input for story generation	ROC
HINT	BART base	140M	BART is first fine-tuned on BookCorpus with additional objectives to learn internal structure in a story and then further fine-tuned on the story datasets	ROC, WP
BART	BART large	400M	Baseline model that is fine-tuned on the story datasets using a standard language modelling objective	ROC, WP, CNN

表格1：故事生成模型的主干（“Backbone”）及其参数数量（“Size”）。 “故事数据集”指示哪些数据集用于为特定模型生成故事。 KGGPT2和MTCL故事均来自原作者；对于 PROGEN 和 HINT，我们重新运行作者提供的实现。

3.2 故事数据集

最流行的故事数据集是 ROCStories (ROC) (Mostafazadeh 等人, 2016)，它由简短的常识性故事组成，被大多数故事生成作品所使用。还有一些难度更大、篇幅更长的故事数据集，例如 WritePrompts (WP) (Fan 等人, 2018) 和 CNN News (CNN) (Hermann 等人, 2015)它们由虚构和新闻故事（两个不同的领域）组成。在我们的实验设置中，我们利用了所有三个数据集。 ROC 数据集用于评估由 5 个句子组成的短篇故事的生成。 WP 数据集用于评估中等长度的故事，这些故事被缩减为 10 个句子。最后，利用 CNN 数据集来评估长故事的生成，每个故事大约有 20 个句子。有关这些数据集的更多详细信息，请参阅Appendix B。

只要有可能，我们就会评估每个故事数据集上的所有模型。然而，这有时是不可行的，因为某些模型被设计为在特定数据集上工作，因此无法轻松适应其他数据集。此外，我们在这项工作中专注于条件故事生成，这意味着我们有一些上下文来生成故事（详细信息如下）。

鹏

我们评估该数据集中的所有模型。我们用来生成故事的上下文是第一个句子，因此模型经过训练以生成最后 4 个句子。评估结果是使用从测试分区中随机采样的主要句子对 800 个生成的故事进行计算得出的。

湿性粉剂

我们在此数据集上评估 HINT、PROGEN、GPT-3 和 BART。上下文是一个简短的段落（“提示”），描述了故事的想法。我们从测试分区中随机抽取 1000 个提示进行自动评估。

美国有线电视新闻网

我们只在 CNN 上运行 GPT-3、BART、PROGEN，因为 HINT 最初是为 ROC 和 WP 开发的，应用于 CNN 时效果不佳。 CNN 的故事是根据新闻标题生成的。我们从测试分区中随机抽取 600 个标题进行自动评估。

4自动评估

4.1评估指标

我们使用两种类型的自动评估指标：1）基于参考的指标，我们将生成的故事与基于相同调节上下文的人类参考故事进行比较； 2）无参考指标，我们直接评估故事的质量。

4.1.1 基于参考的指标

大多数基于参考的指标衡量生成的故事与其人类参考之间的词汇或语义接近度。我们尝试使用基于字符串的匹配（CBL、MSJ）和基于嵌入的匹配（BES）的指标以及基于学习的指标（BRT）来评估生成的故事的质量。我们还使用基于召回的指标（BBL）来评估生成的故事的多样性。具体来说，Corpus BLEU (CBL) 根据所有人类参考（Caccia 等人，2020）计算每个生成故事的平均 BLEU 分数（Papineni 等人，2002）；谢等人，2021）。 MS-Jaccard (MSJ) 通过使用 Jaccard 索引计算生成的故事和引用的故事之间的 n-gram 重叠来测量词汇重叠（Alihosseini 等人，2019）。 BERTScore (BES) 衡量生成的故事和引用的故事之间每个词符的上下文嵌入的最大相似度（Zhang 等人，2020）。 BLEURT (BRT) 在合成数据上进行训练，以预测生成的故事和引用的故事之间的相似度分数（Sellam 等人，2020）。后向 BLEU (BBL) 根据生成的故事集（石等人，2018）计算参考故事中 n-gram 的覆盖率。²²2我们使用BLEU4进行CBL和BBL； MSJ 的 4 克重叠； roberta-BES 的大型模型； BRT 的 bert-base-128。

4.1.2 无参考指标

无参考指标评估生成的故事，而不将它们与人类创作的参考进行比较。我们尝试基于故事内（D-3、LR-n）和故事间多样性（SBL）的多样性指标。我们还根据 BART 计算故事的负对数似然，以上下文 (BAS) 为条件计算相关性，并计算以单词为单位的故事长度 (LEN) 以计算复杂性。

具体来说，词汇重复（LR-n）计算在生成的故事中至少出现 $n$ 次的4-gram的平均百分比（Shao等人，2019）. Distinct-3 (D-3) 计算不同 3-gram 与所有 3-gram 的平均比率(Li 等人, 2016)。 Self-BLEU (SBL) 衡量故事间多样性，使用所有生成的故事作为参考来计算每个生成故事的平均 BLEU 分数（Zhu 等人，2018）。 BARTScore (BAS) 根据上下文（即 ROC 的引导句、WP 的提示和 CNN 的标题）计算故事的生成可能性，以衡量生成的故事与其条件相关的程度（袁等人，2021）。³³3我们设置 $n=$ n= ROC、WP 和 CNN 分别为 3/8/8，SBL 使用 BLEU4。我们使用 BART 的“PARA”版本，方向为“从源头到假设”。长度（LEN）衡量生成故事的平均长度，用作生成复杂性的粗略指标。

4.2结果

Table 2和Table 3分别呈现了基于参考和无参考的评估结果。乍一看，这些指标似乎彼此不一致，尽管其中一些指标旨在评估相同的方面（例如，在流畅性/连贯性或多样性方面的最佳模型根据指标的不同而不同）。总体而言，GPT-3 在质量（CBL 和 MSJ）和多样性（BBL、SBL、D-3 和 LR-n）指标方面似乎比大多数其他模型表现更弱。

	Model	Flu./Coh.				Div.
		CBL	MSJ	BES	BRT	BBL
		$\Uparrow$	$\Uparrow$	$\Uparrow$	$\Downarrow$	$\Uparrow$
ROC	GPT-3	27.2	11.6	86.6	8.6	24.0
	KGGPT2	33.5	15.0	87.0	9.5	25.6
	PROGEN3	26.6	14.6	86.7	9.7	25.0
	MTCL	31.4	14.2	86.9	9.7	24.0
	HINT	39.6	13.7	87.0	8.6	24.6
	BART	27.5	14.7	86.8	9.5	25.1
WP	GPT-3	28.6	12.3	81.6	11.7	24.4
	PROGEN3	32.3	16.4	81.4	13.3	27.6
	HINT	45.5	12.8	80.8	12.1	23.7
	BART	32.6	16.2	81.4	13.0	27.2
CNN	GPT-3	33.2	11.0	83.5	7.5	19.8
	PROGEN3	29.6	14.8	82.2	9.3	26.2
	BART	29.1	14.7	82.2	9.8	25.7

表2：基于参考的评估结果。 CBL、MSJ、BES 和 BRT 评估生成的故事与整个测试参考数据之间的接近程度，作为一般流畅度 (Flu.) 的指标和连贯性（Coh.）。 BBL 专注于对生成故事的回忆作为多样性指标（Div.）。

\Uparrow

：越高越好；

\Downarrow

：越低越好。 BRT 值在此被取反。

	Model	Div.			Rel.	Com.
		SBL	D-3	LR-n	BAS	LEN
		$\Downarrow$	$\Uparrow$	$\Downarrow$	$\Downarrow$	$\Uparrow$
ROC	GPT-3	38.5	67.7	39.1	4.2	47.3
	KGGPT2	41.9	67.2	51.9	4.6	38.4
	PROGEN3	30.0	76.9	39.5	5.0	40.9
	MTCL	39.4	69.6	44.4	4.9	49.7
	HINT	55.1	54.3	68.1	4.3	35.8
	BART	30.5	77.4	37.8	5.0	40.6
	human	33.1	80.2	35.8	5.2	40.3
WP	GPT-3	37.5	69.6	9.7	4.3	120.6
	PROGEN3	35.2	77.2	2.6	5.4	136.9
	HINT	64.1	33.9	67.4	4.1	119.0
	BART	35.3	77.5	1.6	5.4	129.2
	human	27.1	83.7	1.5	5.7	150.0
CNN	GPT-3	26.5	82.9	9.8	4.4	147.3
	PROGEN3	28.9	82.3	2.3	5.2	395.8
	BART	27.9	83.2	0.8	5.2	374.1
	human	27.3	83.8	6.3	5.4	498.6

表3：无参考评估结果。 SBL 通过评估不同故事之间的差异来衡量故事间多样性，而 D-3 和 LR-n（ROC 为 3，WP 和 CNN 为 8）侧重于同一故事内的重复 n 元语法。我们还将 LEN 作为故事复杂性的指标 (Com.)。我们在给定故事相关性的条件下计算故事的 BAS（Rel.）。

然而，当我们查看最近基于模型的指标（BERTScore、BLEURT 和 BARTScore）时，GPT-3 似乎是一个更好的模型（当我们查看人类评估结果时，我们将回顾这一发现）。有趣的是，我们注意到人类编写的故事在 BARTScore (BAS) 方面的表现非常差。我们怀疑 BARTScore 可能会对机器生成的故事表现出偏见，因为该指标主要根据序列的生成可能性来评估质量。机器生成的故事经过专门设计，可以最大限度地提高这种可能性，而人类创作的故事通常包含不同的元素，例如令人惊讶或富有创意的词语选择（Holtzman 等人，2020）。一般来说，除了 CNN 数据集中的 GPT-3 之外，所有模型都能够生成适当长度的故事。 CNN 数据集中的 GPT-3 很难生成超过 150 个单词的故事，而人类编写的故事通常平均由 500 个单词左右组成。考虑到使用各种自动指标的整体评估，没有一个获胜者能够始终优于其他模型。

5人类评价

为了对生成的故事进行全面评估，我们聘请人工注释者来评估其质量。为了深入了解一致性，我们雇用了众包工作人员和内部注释人员。这种方法使我们能够收集不同的观点并对故事质量有更细致的了解。

5.1众包标注

我们首先使用 Amazon Mechanical Turk (AMT) 平台收集人类判断。⁴⁴4https://requester.mturk.com/ 按照Karpinska等人(2021)建议的方法，我们评估了四个方面，即流畅性、连贯性、关联性和兴趣度。此外，我们引入了一个称为逻辑性的新方面，它评估故事符合常识的程度。这五个方面中的每一个都按照从 1（最差）到 5（最好）的顺序等级进行评估。我们从每个数据集中随机抽取 20 个条件上下文（例如标题），并收集所有模型生成的故事以供人类评估。每个故事（包括人写的故事）由 3 位注释者进行评审，因此我们总共有 320 个故事的注释（ROC、WP 和 CNN 分别为 140/100/80）。亚马逊关于 AMT 的资格要求和问题详细信息可以在Appendix C 中找到。质量控制详情请参阅Appendix D。

	Model	Flu.	Coh.	Rel.	Log.	Int.
ROC	GPT-3	$\mathbf{4.40}$	$4.43$	$4.37$	$4.37$	$3.57$
	KGGPT2	$3.90^{*}$	$3.48^{*}$	$3.53^{*}$	$3.00^{*}$	$2.62^{*}$
	PROGEN3	$3.88^{*}$	$3.45^{*}$	$3.37^{*}$	$2.95^{*}$	$2.57^{*}$
	MTCL	$3.55^{*}$	$3.12^{*}$	$3.18^{*}$	$2.73^{*}$	$2.42^{*}$
	HINT	$3.90^{*}$	$3.27^{*}$	$3.33^{*}$	$3.12^{*}$	$2.58^{*}$
	BART	$3.92^{*}$	$3.38^{*}$	$3.48^{*}$	$3.03^{*}$	$2.60^{*}$
	human	$4.22$	$\mathbf{4.58}$	$\mathbf{4.42}$	$\mathbf{4.48}$	$\mathbf{3.77}$
WP	GPT-3	4.37	4.67	4.28	4.48	3.47
	PROGEN3	$3.45^{*}$	$3.08^{*}$	$2.35^{*}$	$2.57^{*}$	$1.98^{*}$
	HINT	$3.32^{*}$	$2.63^{*}$	$2.02^{*}$	$2.25^{*}$	$1.77^{*}$
	BART	$3.42^{*}$	$2.73^{*}$	$2.08^{*}$	$2.27^{*}$	$1.87^{*}$
	human	$4.13^{*}$	$4.22^{*}$	$3.05^{*}$	$3.75^{*}$	$2.97^{*}$
CNN	GPT-3	$\mathbf{4.22}$	$\mathbf{4.52}$	$\mathbf{4.58}$	$\mathbf{4.60}$	$3.20$
	PROGEN3	$3.63^{*}$	$3.32^{*}$	$3.30^{*}$	$3.22^{*}$	$2.28^{*}$
	BART	$3.58^{*}$	$3.37^{*}$	$3.30^{*}$	$3.27^{*}$	$2.17^{*}$
	human	4.10	$4.10^{*}$	$4.23^{*}$	$4.18^{*}$	$\mathbf{3.72^{*}}$

表 4：众包人类评估结果。我们计算模型每个方面的平均得分：流畅度（Flu.）、连贯性（Coh.）、相关性（Rel.)、逻辑性(Log.) 和兴趣（国际）。标有

*

的模型分数表明模型与GPT-3之间的性能差异显着。

Table 4呈现了人类评估结果。总体而言，GPT-3 生成的故事始终比其他 SOTA 模型生成的故事质量更高。为了了解差异是否显着，我们通过将 GPT-3 与其他模型（包括人类）进行比较来执行配对 t 检验，发现在大多数情况下这些结果具有显着性， $p$ -值 ¡ 0.05 (表中的“*”）。与人类作者相比，GPT-3 似乎生成的故事与（ROC）一样好或优于（WP 和 CNN）人类作者，证实了 Clark 等人 (2021) 的发现>。特别是对于 WP，人类故事被修剪为前 10 个句子（用于训练故事生成模型的数据预处理）。这突然缩短了故事，因此它们可能无法提供正确的结论，并且不可避免地会受到惩罚（请参阅Appendix K 中的示例）。对于 CNN 来说，GPT-3 似乎是在“抄袭”真实故事，其中许多故事元素并非创意生成的产物，而是从真实新闻故事中复制的细节(Section 6)。另一个原因可能是 GPT-3 故事比其他模型和人类作者生成的故事短得多（150 字与 300-400 字；Table 3)，这使得它们更易于阅读，从而领先以获得更好的成绩。请注意，这是 GPT-3 的一个缺点，即很难让它生成长篇故事(Section 7)。

	Model	Flu.	Coh.	Rel.	Log.	Int.
ROC	GPT-3	$\mathbf{4.78}$	$\mathbf{4.73}$	$\mathbf{4.50}$	$\mathbf{4.82}$	$\mathbf{3.37}$
	KGGPT2	$4.52^{*}$	$3.67^{*}$	$3.57^{*}$	$3.47^{*}$	$2.50^{*}$
	PROGEN3	$4.27^{*}$	$3.47^{*}$	$3.78^{*}$	$3.23^{*}$	$2.48^{*}$
	MTCL	$4.27^{*}$	$3.27^{*}$	$3.45^{*}$	$3.15^{*}$	$2.37^{*}$
	HINT	$4.38^{*}$	$4.03^{*}$	$3.38^{*}$	$3.70^{*}$	$2.38^{*}$
	BART	$4.37^{*}$	$3.95^{*}$	$3.85^{*}$	$3.53^{*}$	$2.70^{*}$
	human	$4.52^{*}$	$4.38^{*}$	$4.22$	$4.32{*}$	$3.18$
WP	GPT-3	4.57	4.65	4.08	4.22	3.82
	PROGEN3	$3.55^{*}$	$3.03^{*}$	$2.23^{*}$	$2.57^{*}$	$2.45^{*}$
	HINT	$3.60^{*}$	$2.72^{*}$	$2.07^{*}$	$2.68^{*}$	$2.08^{*}$
	BART	$3.45^{*}$	$2.77^{*}$	$2.08^{*}$	$2.38^{*}$	$2.30^{*}$
	human	$4.05^{*}$	$4.07^{*}$	$3.73$	$3.87^{*}$	$3.78$
CNN	GPT-3	$\mathbf{4.50}$	$\mathbf{4.33}$	$\mathbf{4.48}$	$\mathbf{4.40}$	$\mathbf{3.45}$
	PROGEN3	$3.80^{*}$	$3.45^{*}$	$3.63^{*}$	$3.45^{*}$	$2.52^{*}$
	BART	$3.73^{*}$	$3.25^{*}$	$3.58^{*}$	$3.32^{*}$	$2.57^{*}$
	human	$4.22^{*}$	$4.00^{*}$	$4.35$	$4.13^{*}$	$3.22$

表 5：内部人工评估结果。

从KGGPT2、PROGEN3、MTCL、HINT、BART等SOTA模型的各个方面来看，这些模型在流畅度方面都表现出了很强的表现，大多数情况下得分始终超过3.5。这表明该模型可以生成自然且流畅的句子。然而，一致性性能因数据集而异。大多数模型在 ROC 和 CNN 数据集上表现良好，但在 WP 上却表现不佳，一致性得分低于 3.1。与较长的 CNN 故事相比，这些模型在处理较短的 WP 故事时遇到了困难，这可能是因为它们所构建的 PLM 主要是根据包含大量新闻文章的网络数据进行训练的。对于相关性、逻辑性和趣味性，我们看到了类似的趋势，模型在 ROC 中表现最好，在 WP 中表现最差。我们还观察到从相关性到逻辑性和趣味性的表现持续下降，这表明这些模型特别难以生成有趣且合理的故事。有趣性可能是最难优化的方面，因为很难定义什么使叙述有趣。

5.2内部标注

接下来我们招募大学志愿者来收集内部判断。⁵⁵5从人口结构来看，博士生14人，大学教职员1人；他们都精通英语。我们要求他们使用相同的量表评估相同的 5 个方面。在此，我们从每个数据集中抽取 20 个不相连的条件上下文进行故事生成，因为我们有兴趣检验我们之前研究结果的稳健性（使用不同的工作人员和故事集）。与众包标注一样，每个故事也由 3 位注释者进行评判。注释者之间协议的详细信息可以在Appendix G中找到。

Table 5 列出了内部注释者对故事质量的评分。有趣的是，内部分数的幅度通常略高于众包分数（在所有指标、数据集和模型中）。我们假设这可能是因为我们的内部员工比众包员工更“容忍”错误，因为他们更多地接触机器生成的文本。也就是说，两组注释者的总体结果是一致的：1）GPT-3 是最好的故事生成模型，并且优于 SOTA 模型和人类故事； 2）SOTA模型在流畅性方面表现良好，但在其他方面表现较差（趣味性最差）； 3）SOTA 模型在 WP 领域面临着显着的挑战，与其他领域相比，其连贯性、相关性、逻辑性得分较差就证明了这一点。

当将自动指标(Section 4.2)的结果与人类评估结果进行比较时，出现了显着的差异，导致对 GPT-3 的性能得出不同的结论并确定明确的“最佳”故事生成模型。也就是说，如果我们只考虑基于模型的指标，例如 BERTScore、用于流畅性/连贯性的 BLEURT 以及用于相关性的 BARTScore，则可以得出更一致的结论，表明这些指标可能更可靠（尽管与到人类评估结果）。这一观察结果与最近的文献一致，后者强调了现代基于模型的指标与人类评估具有更好的相关性（Chiang 和 Lee，2023；Ke 等人，2023；Xie 等人，2023）。

6 抄袭

考虑到 GPT-3 在故事生成方面的强大性能，我们接下来提供了一项基本知识调查，以了解 GPT-3 从其训练数据中复制的程度。

我思考

我们使用 iThenticate⁶⁶6https://www.ithenticate.com——一款专业的抄袭检测软件，全面覆盖在线文章——评估 GPT-3 的抄袭程度。在检查抄袭时，我们仅包含生成的内容（无条件）。结果显示，不存在强烈抄袭：ROC、WP 和 CNN 的相似度得分分别为 4%、3% 和 14%。这与McCoy等人(2021)的研究结果一致，即语言模型不是简单地记忆，而是将熟悉的部分组合成新颖的方式。然而，iThenticate 会寻找词汇重叠来识别抄袭。另一种更微妙的抄袭形式是复制想法，但不鹦鹉学舌地模仿（Lee等人，2022），从而促使我们进行手动检查。

手动检查

我们分别为 WP 和 CNN 随机抽取 10 个 GPT-3 生成的故事，并通过搜索关键字、命名实体和长短语来手动检查它们是否从网络文章中复制。对于 WP，在大多数情况下，我们找不到与我们的故事相似的网络文章，除了一个从维基百科复制故事的例子（请注意，它不是直接逐字复制）。对于 CNN，我们发现十分之七的实例是基于新闻文章中的真实故事，其中大多数细节（例如姓名、事件起源、刑罚）都忠实于原始故事。这些结果突显了 GPT-3 可以回忆训练数据中的细节，让人质疑它的生成有多少创造性。更多详情请参阅Appendix M。

7讨论

Model	Story
GPT-3	“ yes , lord cthulhu ? ” you answer the phone . “ i need you to pick me up from the airport . ” “ of course , my lord . when will you be arriving ? ” “ in about an hour . i ’ll text you the flight number . ” you hang up and start getting ready . you know cthulhu is a demanding deity , but you do n’t mind . you enjoy serving him .
PROGEN3	but everything ’s different ! i miss my child in there … what are you talking about ! you have to be mad . you must have misheard ! ! you come and get your mommy … ” momma always grabbed him before they climbed into the bathtub . after that happened they were always late . bam . * wait , it ’s never like this . ctlulhu always puts up a fight . i repeat : it ’s never like this ! it ’s always for the umpteenth time , now or never . *the phone rings . i answer it and ctlulhu ’s voice booms out .
HINT	“ sir , i ’m afraid i ’m late for work. “ “ i ’m sorry sir , “ the technician said , “ but i have to get my cell phone back. “ “ sir , you have to pick up this cell. “ “ yes sir , i do. “ “ and how are you going to find it ? “ “ i ’m just going to have to pick it up. “ “ well , sir , “ the technician said , “ it ’s just that i do n’t have the time to pick it up. “ “ why not ? “ “ sir , “ the technician said , “ i do n’t have the time. “ “ i ’m just going to make sure it ’s picked up. “
BART	“ ah it ’ s here . i thought you wanted to leave , a phone call . * ” my coworkers looked at me confused . “ i wasn ’ t in the mood to go running around scared . but it was important . you need to pick up on your cravings . the kid-ta-bye , gabriel , payliver , cilia de los angeles . and four people ’ s fault. ” he was right about the fifth , at least my attention was fixed on it . three girls were outside screaming in the street , and a family of four was picking up one of them . the two men yelled and approached as they walked to the mailbox , visibly distressed .

表 6：由 GPT-3、HINT、BART 和 PROGEN3 生成的故事具有相同的标题“克苏鲁呼叫你的手机，他需要被接起。”

在本文中，我们仅尝试了一种简单的基于提示的学习方法，以使 GPT-3 适应不同的故事数据集，并且可以认为可以探索更多的提示工程（Liu 等人，2022；Lu 等人，2022 ; 米什拉等人，2022）。然而，我们认为，我们的目标并不是提出一种更好的基于提示的方法来使用 GPT-3 生成故事，最终，即使使用我们的简单方法，我们也发现 GPT-3 可以生成高质量的故事，这表明通过额外的及时的工程它可以做得更好。接下来我们定性讨论 GPT-3 在故事生成方面的一些优点和缺点。

7.1优势

与其他条件生成任务（例如机器翻译或摘要）不同，其中输入上下文信息丰富，目标是翻译或压缩输入信息，故事生成以相反的方式工作，模型需要“幻觉”新信息并给出简洁上下文的细节。这意味着为了做好任务，拥有丰富的世界知识很重要。阅读一些 GPT-3 的故事，我们观察到 GPT-3 在这方面的优势，特别是在 WP 数据集中，其中一些提示需要有关字符的小众知识。在Table 6中，我们展示了WP中的一个示例，其中提示是cthulhu呼叫您的手机，他需要被接听，其中cthulhu是一个虚构的宇宙实体，只有 GPT-3 能够产生连贯的故事，并且 SOTA 模型陷入困境。

7.2缺点

尽管 GPT-3 表现出了出色的生成能力并且显着优于 SOTA 模型，但我们仍然发现 GPT-3 有许多生成错误可以改进。

故事长度

GPT-3有一个参数来控制生成 Token 的最大数量，但没有提供控制最小 Token 数量的方法。从Table 3中可以看出，尽管提示的故事很长，但 GPT-3 无法为 CNN 生成超过 150 字的故事。我们还尝试通过添加具体说明作为 GPT-3 提示的一部分来鼓励更长的故事，但这并不起作用。

空一代

有时 GPT-3 决定不生成输出。这通常不是问题，因为可以通过强制它再次生成来解决这个问题，尽管尚不清楚为什么会发生这种情况。

直接复制

除了软剽窃问题(Section 6)之外，GPT-3 偶尔也会复制长文本块，例如故事中的标题或提示。

多种语言

尽管给出的提示始终是英语，但 GPT-3 有时会生成英语以外的语言的故事。从统计来看，在1000代人中，我们发现了14个非英语故事（5个中文、4个德文、1个日文、1个法文、1个俄文、1个挪威尼诺斯克文和1个中英混合故事）。有趣的是，在大多数情况下，故事都与条件相关（即使使用不同的语言），尽管有时我们观察到输出是提示的直接翻译，而不是创意故事。

Token 化问题

GPT-3 代偶尔会出现缺少空格的“粘性”单词（例如，understand.With 和 timewhen)。我们怀疑这是由于 GPT-3 的字节对编码造成的，其中空格被“粘合”到每个子字上，因此每个子字都有两个版本（一个有空格，一个没有）。当 GPT-3 使用不带空格后缀的子字生成时，就会出现此问题。

脏话

GPT-3 偶尔会生成带有脏话的故事。有趣的是，它有时会自我审查（例如，b****)。

8结论

我们对 GPT-3 与 SOTA 模型在故事生成方面进行了广泛的比较，发现 GPT-3 生成的故事在多个方面都明显优于 SOTA 模型，甚至优于人类作者。这项研究的结果表明我们已经进入了大语言模型故事生成的新篇章。未来的研究可能会集中在即时工程大语言模型上，以实现增强的定制化，例如改变其风格和长度，进一步提高故事生成模型的能力。在评估指标方面，我们的工作：1）揭示了基于词汇的自动评估指标与人类评估之间的弱相关性，并且最近提出的基于模型的指标似乎更可靠； 2) 通过发布包含两组评委对故事质量注释的数据集，为度量开发提供了一个新的测试平台。尽管 GPT-3 在故事生成方面取得了积极的成果，但我们还是讨论了它的一些问题，其中最主要的一个是它倾向于从记忆中再现细节或情节，从而提出了有关其生成创造力的基本问题。

局限性

正如 Mishra 等人 (2022) 所观察到的，设计适当的提示可以显着影响语言模型的性能。在我们当前的研究中，我们随机抽取了一些训练示例作为 GPT-3（上下文学习）的演示。然而，更有效的方法可能涉及战略性地选择上下文更相关的示例。

尽管text-davinci-001是我们实验时最好的模型，但该领域的最新进展导致了更强大的大语言模型的发布。尽管有这些改进的模型，我们仍然认为它们不太可能实质性改变本研究得出的结论。研究结果强烈表明，在可预见的未来，大语言模型仍将是故事生成的主导方法。此外，我们在实验中仅使用 GPT-3 进行探索，尽管我们认为我们的发现可能推广到其他大语言模型，但这尚未得到经验验证。

自从我们2022年开始这项工作以来，文本生成评估指标方面有了很大的发展（Chiang and Lee, 2023; Ke 等人, 2023; Xie 等人, 2023; Fu 等人, 2023; Liu等人, 2023a)，其中一些本身就使用大语言模型。尽管我们声称人类评估仍然是故事生成的黄金标准，但这些新指标在多大程度上缩小了差距仍有待观察。我们预见，循环性问题，即使用大语言模型来评估LLM生成的文本，将是该领域需要解决的下一个挑战。

在我们的工作中，我们承认我们没有让领域专家（例如故事作者）参与更专业的评估。在故事评估中调查外行人和专家评估者之间判断的潜在差异将是很有趣的（Chiang 和 Lee，2023）。最近的研究表明，人类标注过程中的某些做法，例如使用李克特量表，在捕捉人类的真实偏好方面存在局限性（Ethayarajh 和 Jurafsky，2022；Liu 等人，2023c）。然而，我们认为，我们在两组不同的注释者之间发现了一致的结果，这一事实表明我们的发现可能是可靠的。

道德声明

本文中进行的所有 Mechanical Turk 实验均得到了我们机构内部伦理审查委员会的批准。（道德 ID 号：21961）。我们的评估人员的工资估计为每小时 14.83 美元。对于每个数据集，我们估计他们将花费的时间，并根据估计的时间改变付款方式。每个HIT包含7个故事（5个故事待评估，2个受控故事控制AMT上的评估质量）。我们为 ROC 支付每个 HIT 2.50 美元，为 WP 支付 3.50 美元，为 CNN 支付 4.50 美元。

我们在同意书上提醒工人，这项工作存在潜在风险，他们可能不得不阅读和评价含有污言秽语或冒犯故事情节的故事，欢迎他们退出任务，我们仍将根据他们的努力程度支付报酬。花费。

致谢

我们衷心感谢审稿人提出的宝贵反馈意见，这对本文工作的改进做出了巨大贡献。我们还感谢在线注释者和同事的努力，他们做出了有益的标注贡献。谢卓瀚受到墨尔本研究奖学金的支持，对此项目表示衷心感谢。

参考

Alihosseini et al. (2019) Danial Alihosseini, Ehsan Montahaei, and Mahdieh Soleymani Baghshah. 2019. Jointly measuring diversity and quality in text generation models. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 90–98, Minneapolis, Minnesota. Association for Computational Linguistics.
Ammanabrolu et al. (2021) Prithviraj Ammanabrolu, Wesley Cheung, William Broniec, and Mark O. Riedl. 2021. Automated storytelling via causal, commonsense plot ordering. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 5859–5867. AAAI Press.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Caccia et al. (2020) Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2020. Language gans falling short. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Chhun et al. (2022) Cyril Chhun, Pierre Colombo, Fabian M. Suchanek, and Chloé Clavel. 2022. Of human criteria and automatic metrics: A benchmark of the evaluation of story generation. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 5794–5836. International Committee on Computational Linguistics.
Chiang and Lee (2023) David Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 15607–15631. Association for Computational Linguistics.
Clark et al. (2021) Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ethayarajh and Jurafsky (2022) Kawin Ethayarajh and Dan Jurafsky. 2022. The authenticity gap in human evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 6056–6070. Association for Computational Linguistics.
Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
Fu et al. (2023) Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. CoRR, abs/2302.04166.
Goldfarb-Tarrant et al. (2020) Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. 2020. Content planning for neural story generation with aristotelian rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4319–4338, Online. Association for Computational Linguistics.
Guan et al. (2020) Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.
Guan and Huang (2020) Jian Guan and Minlie Huang. 2020. UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9157–9166, Online. Association for Computational Linguistics.
Guan et al. (2021a) Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. 2021a. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6379–6393, Online. Association for Computational Linguistics.
Guan et al. (2021b) Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and Minlie Huang. 2021b. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407, Online. Association for Computational Linguistics.
Hashimoto et al. (2017) Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1923–1933, Copenhagen, Denmark. Association for Computational Linguistics.
Hermann et al. (2015) Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1693–1701.
Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Karpinska et al. (2021) Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Ke et al. (2023) Pei Ke, Fei Huang, Fei Mi, Yasheng Wang, Qun Liu, Xiaoyan Zhu, and Minlie Huang. 2023. DecompEval: Evaluating generated texts as unsupervised decomposed question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9676–9691, Toronto, Canada. Association for Computational Linguistics.
Lau et al. (2020) Jey Han Lau, Carlos Armendariz, Shalom Lappin, Matthew Purver, and Chang Shu. 2020. How furiously can colorless green ideas sleep? sentence acceptability in context. Transactions of the Association for Computational Linguistics, 8:296–310.
Lee et al. (2022) Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. 2022. Do language models plagiarize? CoRR, abs/2203.07618.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Liu et al. (2022) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.
Liu et al. (2021) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586.
Liu et al. (2023a) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023a. G-eval: NLG evaluation using GPT-4 with better human alignment. CoRR, abs/2303.16634.
Liu et al. (2023b) Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Dajiang Zhu, Xiang Li, Ning Qiang, Dinggang Shen, Tianming Liu, and Bao Ge. 2023b. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. CoRR, abs/2304.01852.
Liu et al. (2023c) Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir Radev. 2023c. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 4140–4170. Association for Computational Linguistics.
Lu et al. (2022) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.
Lucy and Bamman (2021) Li Lucy and David Bamman. 2021. Gender and representation bias in GPT-3 generated stories. In Proceedings of the Third Workshop on Narrative Understanding, pages 48–55, Virtual. Association for Computational Linguistics.
McCoy et al. (2021) R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2021. How much do language models copy from their training data? evaluating linguistic novelty in text generation using RAVEN. CoRR, abs/2111.09509.
Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022. Reframing instructional prompts to GPTk’s language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589–612, Dublin, Ireland. Association for Computational Linguistics.
Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Lucy Vanderwende, Wen-tau Yih, Pushmeet Kohli, and James Allen. 2016. Story cloze evaluator: Vector space representation evaluation by predicting what happens next. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 24–29, Berlin, Germany. Association for Computational Linguistics.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Peng et al. (2021) Xiangyu Peng, Siyan Li, Sarah Wiegreffe, and Mark O. Riedl. 2021. Inferring the reader: Guiding automated story generation with commonsense reasoning. CoRR, abs/2105.01311.
Qin et al. (2023) Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? CoRR, abs/2302.06476.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Rashkin et al. (2020) Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4274–4295, Online. Association for Computational Linguistics.
Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: an atlas of machine commonsense for if-then reasoning. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 3027–3035. AAAI Press.
See et al. (2019) Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Manning. 2019. Do massively pretrained language models make better storytellers? In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 843–861, Hong Kong, China. Association for Computational Linguistics.
Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
Shao et al. (2019) Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3257–3268, Hong Kong, China. Association for Computational Linguistics.
Shi et al. (2018) Zhan Shi, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. 2018. Toward diverse text generation with inverse reinforcement learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4361–4367. ijcai.org.
Shoeybi et al. (2019) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.
Speer and Havasi (2012) Robyn Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 3679–3686, Istanbul, Turkey. European Language Resources Association (ELRA).
Tan et al. (2021) Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric Xing, and Zhiting Hu. 2021. Progressive generation of long text with pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4313–4324, Online. Association for Computational Linguistics.
Xie et al. (2021) Zhuohan Xie, Jey Han Lau, and Trevor Cohn. 2021. Exploring story generation with multi-task objectives in variational autoencoders. In Proceedings of the The 19th Annual Workshop of the Australasian Language Technology Association, pages 97–106, Online. Australasian Language Technology Association.
Xie et al. (2023) Zhuohan Xie, Miao Li, Trevor Cohn, and Jey Han Lau. 2023. Deltascore: Evaluating story generation with differentiating perturbations. CoRR, abs/2303.08991.
Xu et al. (2020) Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2831–2845, Online. Association for Computational Linguistics.
Yu et al. (2021) Wenhao Yu, Chenguang Zhu, Tong Zhao, Zhichun Guo, and Meng Jiang. 2021. Sentence-permuted paragraph generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5051–5062, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 27263–27277.
Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pages 1097–1100. ACM.
Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 19–27. IEEE Computer Society.

附录 A SOTA 故事模型详细信息

知识增强型 GPT-2 (KGGPT2)

Guan 等人 (2020) 使用启发式规则从常识知识库翻译常识三元组（例如，ConceptNet (Speer and Havasi, 2012) 和 ATOMIC (Sap 等人, 2019)) 转换为自然语言句子，并使用这些句子将 GPT-2 小型化。他们还使用规则从原始故事中构建负样本来创建“坏故事”，并执行额外的训练来鼓励模型学习能够区分 ROC 上原始故事和负故事的表示。

渐进式长文本生成（PROGEN）

Tan 等人 (2021) 将故事生成过程分为多个阶段，根据重要性顺序生成单词（使用 TF-IDF 估计）。换句话说，PROGEN 不会以从左到右的方式生成故事。它们在不同阶段中对 BART-large 产生影响，其中早期阶段侧重于生成关键字，中间阶段侧重于生成下一组内容词。我们在实验中使用 PROGEN3，它有 3 个阶段，每次通过后都会生成 15%/25%/100% 的故事单词。

威震天-CNTRL (MTCL)

徐等人(2020)结合常识推理和故事情节规划。他们首先使用 GPT-2 训练关键字预测器，然后使用预测的关键字从知识库中检索相关知识三元组。然后，他们使用 BERT 训练上下文知识排序器，对 top- $N$ 预测的知识三元组进行排序。第二个 GPT-2 被训练为条件生成器，在生成故事时将排名最高的知识三元组和其他条件（例如标题）作为输入。请注意，GPT-2 和 BERT 两个模型的参数均使用 MEGATRON 参数（Shoeybi 等人，2019）进行初始化。

长文本生成的高级表示（提示）

Guan 等人 (2021a) 基于 BookCorpus (Zhu 等人, 2015) 预训练 BART，附加目标是捕获句子级相似性和句子顺序来学习故事的内部结构。然后，该模型在故事数据集上进一步微调，以在特定数据集中生成故事。

捷运

这是一个基线模型，我们使用标准的下一个单词预测目标对故事数据集进行 BART-large 处理。

附录 B数据集详细信息

ROC故事 (ROC)

ROC由Mostafazadeh等人(2016)开发，包含98K个五句常识故事。为了获得更通用的词典，我们遵循先前研究（Guan等人，2020；Xu等人，2020）的去词汇化过程，其中男性/女性/未知名称被替换为[MALE]/[女]/[中性]哨兵。对于每个故事，第一个（前导）句子用作条件上下文，并训练模型以生成其余 4 个句子。

写作提示 (WP)

WP 包含从 Reddit 写作提示论坛 Fan 等人 (2018) 挖掘的 303K 人类编写的故事。⁷⁷7https://www.reddit.com/r/WritingPrompts/ 每个故事都经过修剪，仅包含前 10 个句子（以下是Guan 等人 (2021a))。对于 WP，我们使用提示（通常是设置故事场景的一段文本）作为故事生成的条件。

美国有线电视新闻网 (CNN)

CNN News (Hermann 等人, 2015) 是一个包含带有标题的长新闻文章的数据集。 CNN 是一个非常大的数据集，包含 311K 篇新闻文章和摘要。我们对标准训练、验证和测试拆分进行子采样，以分别为我们的实验生成每个包含 10K/5K/1K 故事的拆分。新闻报道的标题被用作故事生成的条件。

附录 C Amazon Mechanic Turk 设置

资质要求

我们对标注者设定了以下资格要求： 1）接受率大于或等于97%。 2）他们的位置在美国。 3）他们必须完成1000个以上的HIT。

问题

我们在调查问卷中提出以下问题。

1.

流利程度：“故事文本的语法正确程度如何？”
2.

连贯性：“故事中的句子组合得如何？”
3.

相关性：“故事与标题的相关程度如何？”
4.

逻辑性：“这个故事在多大程度上符合常识？”
5.

趣味性：“你觉得这个故事有多有趣？”

附录 DAmazon Mechanic Turk 试点研究

虽然 AMT 很容易找到从事标注工作的工人，但要找到可靠的工人却相当困难（Karpinska 等人，2021；Clark 等人，2021）。我们的一位工人告诉我们，许多工人安装网站插件来帮助他们使用 AMT 管理工作流程，以便他们可以同时囤积许多 HIT。因此，高薪的 HIT 很容易吸引不负责任的工人，即使之前已经设定了资格，因为大多数 AMT 请求者都不会拒绝工作。

因此，我们开展了一项试点研究，以帮助我们帮助可靠的工人。我们随机选择 ROC 上不同模型生成的 5 个故事和测试数据集中的 1 个故事。然后，我们在 ROC 上训练一个 trigram 语言模型来模仿这种风格，并从 trigram 模型生成 1 个故事。所有故事都有不同的标题。我们随机打乱这7个故事，任务是要求人们用Appendix C中提到的问题来评估所有故事，我们将根据人和卦故事来判断他们评估的质量。

我们邀请了7位来自非英语国家的同事来大致了解任务的难度。我们计算除兴趣度之外的所有质量指标的平均分，因为它是主观的。平均而言，我们的同事将人类故事评分为 4.5，卦故事评分为 1.425，这表明我们的任务区分人类故事和卦故事并不难。我们设定了一个相当宽松的标准，即“对人类故事排序 ¿= 3.5，对卦故事排序 ¡= 2.0”来选择试点研究中的工作人员。

我们在不同时间创建 100 个相同 HIT 的作业，并具有Appendix C 中提到的资格。我们发现在不同时间运行相同的试点研究可以从 AMT 中获得截然不同的结果，这与 Karpinska 等人 (2021) 中的研究结果一致。一般来说，我们发现在东部夏令时间 (EDT) 晚上可以找到更可靠的工人。我们 100 人中有 10 人通过了试点研究，但当天只有 5 人通过。它显示了当今在 AMT 领域获得可靠工作人员的难度以及在进行实际研究之前进行试点研究的经济重要性。我们为那些可靠的工人提供定制的资格，只邀请他们来我们真正的研究，我们还通过受控故事来监控工人的质量，每个 HIT 中插入 2 个受控故事。

附录 EAmazon Mechanic Turk 问题

我们的人工评估是通过 AMT 进行的，尽管它方便且负担得起，但我们发现注释者之间存在很大分歧。我们首先进行了一项试点研究，测试注释者评估英语故事的能力，并且只邀请通过我们熟练的英语故事阅读测试的工作人员来评估样本故事。我们只给了他们两个例子来展示我们如何评估示例故事，但我们没有向我们的注释者提供详细的英语故事评估训练。我们没有一个主要的注释者可以为示例故事提供标准评分，这增加了我们判断 AMT 评估工作质量的难度。

此外，正如 Karpinska 等人 (2021) 中指出的，AMT 平台上注释者的工作质量可能存在很大差异且校准较差，因此，如果我们聘请专业评审员，例如专业作家或英语教师。

附录 F MTurk Workers 注释者间协议

	IAA	Flu.	Coh.	Rel.	Log.	Int.
ROC	$r$	0.64	0.81	0.79	0.80	0.68
ROC	TA	17.24	24.98	25.57	27.37	22.03
WP	$r$	0.51	0.70	0.74	0.71	0.54
WP	TA	18.37	17.01	32.65	19.73	12.93
CNN	$r$	0.46	0.54	0.61	0.59	0.50
CNN	TA	15.13	12.61	15.97	11.76	14.29

表 7：每个方面的注释者间协议 (IAA) 结果：流畅性 (Flu.）、连贯性（Coh.）、相关性（Rel.)、逻辑性(Log.) 和兴趣（国际）。我们使用 one-vs-rest Pearson 的

r

来评估每个注释者同意共识的程度。总一致性（TA）是指所有 3 个注释者选择相同分数的百分比。

我们按照 Lau 等人 (2020) 使用 Pearson 的 $r$ 来估计一对一的一致性。对于每个故事，我们挑选出一个注释者的分数，并将其与其他两个注释者给出的平均分数进行比较，并且我们对故事中的每个分数以及所有故事重复此过程，以计算 Pearson 的 $r$ 两组分数（单独分数与平均分数）。我们还计算了所有 3 个注释者选择相同分数的百分比，并指出这是一个更严格的一致性度量（因为它不捕获分数的序数范围）。随机评分将为该指标产生 4%。

IAA 结果列于Table 7。就一对一一致性 ( $r$ ) 而言，我们发现整体一致性良好，有 9 个强一致性结果 ( $r$ ¿= 0.6) 和 6 个中等一致性结果 (0.45 ¡= $r$ ¡= 0.6）。我们发现故事长度和一致性之间存在一定的相关性，因为 ROC 的一致性最高（最短，有 5 个句子），而 CNN 的一致性最低（超过 20 个句子）。在方面方面，连贯性、相关性和逻辑性比流畅性和趣味性具有更高的一致性。虽然有趣性是主观的，这很直观，但流畅性却有些令人惊讶。人工检查发现，标注者在流畅性方面的标准差异很大，有些工作人员对语法更为严格，这导致一致性较低。对于总一致性 (TA)，数字范围在 10-25% 之间，这是令人鼓舞的，因为它表明仍然有很大比例的案例所有注释者都同意一个分数。

附录 G内部员工注释者间协议

	IAA	Flu.	Coh.	Rel.	Log.	Int.
ROC	$r$	0.42	0.54	0.66	0.59	0.32
ROC	TA	38.57	25.0	25.71	25.71	8.57
WP	$r$	0.36	0.57	0.73	0.49	0.54
WP	TA	10.0	10.0	18.57	10.0	10.0
CNN	$r$	0.36	0.41	0.47	0.37	0.35
CNN	TA	17.14	10.71	14.29	10.0	4.29

表8：内部 IAA 结果。

内部标注协议结果如Table 8所示。令人惊讶的是，与Table 7 相比，我们看到了类似的趋势，这意味着即使针对不同故事集的不同工作人员组，我们的发现也是一致的。

我们再次看到流畅性和趣味性是一致性较低的两个方面。然而，有趣的是，我们发现流利度的 TA 最高，但 $r$ 却较低。人工检查发现，标注者的流畅度主要选择 4 或 5（从Table 5 中 ROC 的流畅度分数也可以看出），这对 TA 来说是有利的，因为他们有更高的机会就相同的值达成一致，但可能会损害 $r$ ，因为分数可能会受到其他稀有值的负面影响。相关性再次具有最高的一致性，这可能是因为注释者一致认为 SOTA 模型生成的许多故事与给定条件无关。然而，与众包注释者的 IAA 结果相比，内部注释者的一致性总体较低。我们发现 2 个强一致性结果 ( $r$ ¿= 0.6) 和 6 个中等一致性结果 (0.45 ¡= $r$ ¡= 0.6)。

附录 H GPT-3 世界知识优势

Table 9和10表明，由于其广泛的世界知识优势，GPT-3可以生成比SOTA模型更好的故事。

Model	Story
GPT-3	as soon as i realized that i had died , the first thing that came to mind was how pointless my life had been . i had devoted myself to a cause that was ultimately fruitless and destructive , and in the end i had caused the deaths of millions of people .but then i realized that there was still a chance for redemption . in the after life , i was given the opportunity to travel to different moments in history and try to make things right . i could see the consequences of my actions , and i could try to make different choices that would lead to a better outcome .i started with the holocaust , and i was able to save many people from being killed . but it was difficult , and i had to make a lot of tough choices . eventually i moved on to other moments in history , trying to make things better wherever i could .it was a difficult journey , but it was worth it . i was finally able to make up for the mistakes that i had made in life .
PROGEN3	i knew i was in a bad mood . the blood dripped from my nose and down my chin like the burly balding drifter . was it dark yet ? i could feel my brain becoming small dots of the rotten material . the blood stopped flowing . almost instantly , the world dissolved in my hands . that ’s when i noticed something . i was still standing in the middle of the street . i turned to my left and saw the madman .
HINT	“ hi , “ i said , opening my eyes to a bright light . “ this is hitler . “ i looked at the clock . 5:00 am . “ what ’s your name ? “ “ hitler . “ “ hitler . “ “ hitler . “ “ hitler . “ “ hitler . “
BART	i ’m struck with a fierce sense of shame at that thought . unfortunately it ’s been a while since i ’ve felt this way , and normally it ’s * bad * to feel that way , but i think that too should be expected when you ’re still alive . there has been so much turmoil since the war . we ’ve lost children , we ’ve lost society , we ’ve even lost ourselves . but now , we ’re free . we can finally go on with our lives , like we always should . but something happens that seems wrong , and it ’s really hard to forget . there are my reasons for feeling so sad . why will i be mad today ?

表 9： GPT-3、HINT、BART 和 PROGEN3 生成的故事具有相同的标题“你是希特勒，你刚刚死了”。有来生。告诉我怎么回事儿。”

Model	Story
GPT-3	the young man had lost everything . his family , his home , his country . he had nothing left to lose . when he was offered the chance to join isis , he saw it as a way to finally have a purpose in life . he was given a sense of belonging and a sense of worth . he was told that he was fighting for a good cause and that he was helping to protect his people . he was given a sense of identity and a sense of purpose . he was told that he was part of something bigger than himself .
PROGEN3	they ’ve crippled the ranks of these countries . their leaders not going to the president ’s party , everyone having their noses to the grindstone , the war . political parties shooting each other with a firing squad , only to end up murdering each other . “ the one who executed your followers was the first one to hell houng taqibs . “ he ’s right , ha is not ’d understand * reason * if he knew that . “ you ’re killing my people . “
HINT	my eyes stare at the ceiling . it ’s just a window , it ’s just a window . i ’m not sure if i should close it or close it again . i ’m afraid i ca n’t move it . i do n’t want to be here . i do n’t want to be here . i want to be able to talk to the people who will help me . to be able to hear the voices of people who will help me . to be able to see the lights . to be able to see the people who will help me .
BART	missdamie looked at her young daughter sat with c and o . she ’d put on a radiant glow and bound herself up in an artificial bubble . she was beautiful . she twitched at the vibrancy in her hands . her hair was beautiful and silver . her smile was off . her black nails placed on gentle curves . when she smiled her smile widened and the small eyes of her daughter ’s eyes danced . “ isis ? “ i asked , shakily , behind the mask . ms. de los angeles nodded and smiled . “ yes , my name is isis de los angeles , and “ her mother held her hand . “ my son is wondering if isis can cause such a thing . “

表 10： GPT-3、HINT、BART 和 PROGEN3 生成的故事具有相同的标题“让我同情一名 ISIS 成员”。

附录一土耳其机械师说明

Figure 1 和 2 展示了我们的说明和准则。

附录 J提示示例

Table 11 显示了我们在生成提示中使用的一些示例。

	Example
ROC	prompt: [MALE] wants to move to a new house . completion: his wife [FEMALE] does not want to move . [MALE] convinces her to look at a couple new houses . she does not see any that she likes . [FEMALE] refuses to move .
WP	prompt: the real reason we ’re not supposed to look at the sun during a total eclipse is because it reveals something horrifying . completion: the harvest was here . father xander made his way through priory halls toward the sanctuary in the back garden . at his advanced years , the trip was a struggle ; each step brought a hefty dose of pain and unpleasant memories . it was eclipse day , and while he always prayed for a good outcome , in his eighty years as the priory ’ s eyes , there never was . “ the initiates are assembled in the sanctuary , father , ” spoke a fellow priest in the hall . “ thank you , my son , ” father xander replied . he paused and looked at the young man . memories came flooding back like sour wine . his name was thomas hampstead , and at thirty years of age , he should have been a robust healthy lad ; but he was not . thomas ’ s skin had a greyish tint and his body was frail .
CNN	prompt: (CNN) – Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men’s 4x100m relay. completion: The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. Ï’m proud of myself and I’ll continue to work to dominate for as long as possible,B̈olt said, having previously expressed his intention to carry on until the 2016 Rio Olympics. Victory was never seriously in doubt once he got the baton safely in hand from Ashmeade, while Gatlin and the United States third leg runner Rakieem Salaam had problems. Gatlin strayed out of his lane as he struggled to get full control of their baton and was never able to get on terms with Bolt. Earlier, Jamaica’s women underlined their dominance in the sprint events by winning the 4x100m relay gold, anchored by Shelly-Ann Fraser-Pryce, who like Bolt was completing a triple. Their quartet recorded a championship record of 41.29 seconds, well clear of France, who crossed the line in second place in 42.73 seconds. Defending champions, the United States, were initially back in the bronze medal position after losing time on the second handover between Alexandria Anderson and English Gardner, but promoted to silver when France were subsequently disqualified for an illegal handover. The British quartet, who were initially fourth, were promoted to the bronze which eluded their men’s team. Fraser-Pryce, like Bolt aged 26, became the first woman to achieve three golds in the 100-200 and the relay. In other final action on the last day of the championships, France’s Teddy Tamgho became the third man to leap over 18m in the triple jump, exceeding the mark by four centimeters to take gold. Germany’s Christina Obergfoll finally took gold at global level in the women’s javelin after five previous silvers, while Kenya’s Asbel Kiprop easily won a tactical men’s 1500m final.

表 11：提示和 GPT-3 完成，包括 ROC、WP 和 CNN 各一个示例。

附录 KGPT-3 故事与人类参考

Table 12、13 和 14 显示了 GPT-3 获得比人类参考更高分数的一些示例。

Title [SEP] Story	From	Flu.	Coh.	Rel.	Log.	Int.
tell me the story of a totally inconsequential character who does n’t realize he ‘s wrapped up in a massive high-stakes plotline - and keep them in the dark about it . [SEP] ivan was just an ordinary guy who led an ordinary life . he worked a dead-end job at a local factory , and spent his evenings and weekends playing his favorite online game with his friends . one day , out of the blue , ivan was contacted by a representative of the game company , who offered him a chance to participate in a closed beta test for their new game . ivan was excited to participate and eagerly agreed to the terms and conditions .little did he know that he was about to become embroiled in a global conspiracy . the game company was actually a front for a secret organization that was planning to use the game to recruit new members . ivan was the perfect candidate , because he was someone who was looking for an escape from his ordinary life .the organization was planning to use the game to recruit new members from all over the world , and they were counting on ivan to help them . ivan had no idea what he was getting himself into , and the organization was careful to keep him in the dark about their true intentions . ivan thought he was just playing a game , but he was actually a pawn in a much larger scheme .	GPT-3	5.0	5.0	5.0	5.0	5.0
tell me the story of a totally inconsequential character who does n’t realize he ‘s wrapped up in a massive high-stakes plotline - and keep them in the dark about it . [SEP] someone was moving around on screen three . * probably just another junkie stumbling around . fucking dumbass . * barry did n’t really give a shit what the junkie was up to anyway . his job did n’t pay enough for that . it ’s not like they thought he was actually all that important , anyway . a security guard ? minimum wage , take your paycheck , shut up , and move along , thank you very much . you have to sit for twelve hours in the dead of night and devote all of your attention to watching the same twelve tiny boxes ? that ’s nice . how ’s $ 7.25 an hour sound ? you ’re preventing vandalism and break-ins ? how wonderful . would you like a cookie ?	human	3.67	2.33	2.0	3.33	2.33

表 12： GPT-3 生成的故事获得比人类参考更高的分数。

Title [SEP] Story	From	Flu.	Coh.	Rel.	Log.	Int.
you meet the grim reaper at a crossroads . you are fearful as he approaches , but notice he seems just as fearful of you . [SEP] the grim reaper was a fearful figure , draped in a cloak of blackness with a face that was hidden in shadow . as he came closer , you could see the fear in his eyes . “ why are you so afraid of me ? “ you asked . “ because you ’re the first person to ever see me like this , “ he replied . “ i ’m not here to take you , i ’m here to ask for your help . “	GPT-3	4.0	4.67	4.67	3.67	3.67
you meet the grim reaper at a crossroads . you are fearful as he approaches , but notice he seems just as fearful of you . [SEP] my eyes jump from snowbank to snowbank , one featureless as the next . the road i walk upon is a dark grey slush , the result of many cars and snow plows making their way through the void . yet i had not seen a car in miles . i shiver . but not from the cold . the compelling . it had awakened me . this ominous feeling that i had to be at a certain place at a certain time . it controlled me . it made me get dressed . it made me open the door . it had even made me ignore my wife , who i usually kiss goodbye . the road seemed endless , but i felt no reason to stop .	human	3.33	3.33	1.67	2.33	2.33

表 13： GPT-3 生成的故事获得比人类参考更高的分数。

Title [SEP] Story	From	Flu.	Coh.	Rel.	Log.	Int.
you are able to see the number of days a person has left to live above their heads . for the first time in your life , you see someone with a negative number . [SEP] as you walk the streets , dodging the rushing crowds , you notice something different about one of the people walking by . their number is displayed in a negative value . for a moment , you feel a pang of sadness and empathy . you know what that means - that person is going to die soon . you watch as they go about their day , completely unaware of the timer ticking away above their head . you wonder who they are , what they ’re doing , and who will be there with them when their time runs out .	GPT-3	4.67	4.33	4.67	4.0	5.0
you are able to see the number of days a person has left to live above their heads . for the first time in your life , you see someone with a negative number . [SEP] crowds are the worst . frank squints through the flurry of hovering digits as he shoulders his way through the masses of shoppers . the cacophony of vociferous patrons , distressed employees , and typical black friday commotion crescendos around him . he pauses to watch two shoppers vie over the last remaining television monitor ( only $ 600 ! ) . one is a blonde woman whose pigtails have been jostled into disarray ; above her hovers the number 10951 . the other is n’t as fortunate ; the number 582 glows above her greying hair . frank sighs . he ’s sure there ’s something … vaguely poignant about fighting over a product priced at a greater value than your remaining days of life . socio-rhetorical commentary on modern consumerism , perhaps . he has little time to contemplate , however , as another shopper impatiently shoves him forward , and frank resumes his pace . while he walks , frank studies the various faces passing by : a woman in an elegant , blue hijab ; a portly , bearded man ; a fatigued mother clutching her two children by the hands … .	human	3.67	3.0	3.33	4.0	3.33

表 14： GPT-3 生成的故事获得比人类参考更高的分数。

附录 L示例生成

Table 15 显示了每个数据集随机挑选的生成示例。

Title	Generation
the boy cried .	the boy cried because he was sad and scared . his parents were getting a divorce and he did n’t know what was going to happen to him . he was going to have to leave his home and his friends and move to a new house . his mother said that he could choose where he wanted to live , but his father said that he had to live with him . the boy did n’t know what to do .
your extreme apathy has been mistaken as spiritual enlightenment , and now people are building a religion around you .	my name is john and i have a problem . well , actually , a lot of people have a problem with me but that ’s not the point . the point is that i don ’t care about anything . at all . not the way most people care about things anyway . when most people see a homeless person on the street they feel compassion and want to help . when i see a homeless person on the street i think “ well , there goes another lazy bum who ’s too good for work “ . and i ’m not joking when i say that most people see the world in a completely different way than i do . they see the good in people and want to help them . i see the bad in people and want to help them . see the difference ? i ’m not saying that my way of seeing the world is wrong and theirs is right . i ’m just saying that we see the world differently and that ’s okay . or at least it should be . but it ’s not because people can ’t accept that i don ’t care about anything . they think that i ’m enlightened because i don ’t feel the pain and suffering that they do . they think that because i don ’t care about anything i must be in a better place than they are .and that ’s why they ’re building a religion around me .
(CNN) – Hurricane Ida moved into the southern Gulf of Mexico Sunday, prompting a declaration of emergency in Louisiana and concern along the U.S. Gulf Coast.	ida Louisiana Gov. Bobby Jindal declared a state of emergency Saturday as the Category 1 hurricane churned in the Gulf of Mexico, according to a news release from the governor’s office. ”Based on the latest forecasts, Hurricane Ida is expected to make landfall in Louisiana as a Category 1 hurricane and we must be prepared for significant impacts,” Jindal said in the release. ”I have directed my team to do everything possible to protect the lives and property of the people of Louisiana.” The hurricane could bring storm surges, high winds and heavy rains to the coast, the National Hurricane Center said. The storm was about 220 miles (355 kilometers) south-southeast of the mouth of the Mississippi River on Sunday morning and was moving north-northwest at about 9 mph (15 kph), the center said. The hurricane had maximum sustained winds of 75 mph (120 kph). Forecasters said Ida could strengthen a bit before making landfall. The center of the storm is expected to come ashore early Tuesday morning.

表 15：从每个数据集中随机选择 GPT-3 生成示例。

附录M软抄袭示例

Table 16 显示了一些 GPT-3 生成的新闻报道，我们在其中找到了真实的在线报道。

GPT-3 Generated Story (Link to the Online Resource)
Washington (CNN) – With the posting of 400,000 classified documents from the Iraq war, WikiLeaks has shown a much heavier hand redacting compared to its previous publication of documents. In an interview with Wolf Blitzer on Wednesday, WikiLeaks founder Julian Assange said that the organization redacted the names of Iraqis who provided information to the United States military in order to protect them from retaliation.”We have a duty to protect some of the people who are named in these documents,” Assange said.WikiLeaks has redacted the names of coalition forces and informants, as well as the names of Iraqis who cooperated with the United States military, in all of its recent releases of classified documents. (https://www.reuters.com/article/britain-assange-idUSL4N2GD2PZ)
(CNN) – A former University of Virginia lacrosse player convicted of killing his ex-girlfriend was sentenced to 23 years in prison on Thursday, according to the Commonwealth Attorney’s office in Charlottesville, Virginia. George Huguely V was convicted in February of second-degree murder in the May 2010 death of Yeardley Love. He was also convicted of grand larceny and burglary.Prosecutors had asked for a sentence of 30 years, while Huguely’s defense attorneys had asked for 10 years.”Yeardley was my everything,” Huguely said in a statement before sentencing. ”I was not thinking, I was not rationale, and I was not in control of my own actions.”Love was found beaten to death in her off-campus apartment. Huguely’s DNA was found on Love’s body and in her bedroom.During the trial, prosecutors said Huguely had sent Love a series of emails in which he threatened to kill her.”I’m going to kill you. You deserve it. You took the one thing I loved the most,” Huguely wrote in one email, according to prosecutors. (https://en.wikipedia.org/wiki/Murder_of_Yeardley_Love)
(CNN) – Crisis club Racing Santander has been thrown out of next season’s Copa del Rey after its players staged a bizarre on-pitch boycott of a quarterfinal tie against Real Sociedad. Racing Santander’s players staged a bizarre on-pitch boycott of a Copa del Rey quarterfinal tie against Real Sociedad on Wednesday, leading to their expulsion from the competition.The players refused to take to the field in the first leg of the last-eight tie at the Anoeta Stadium in San Sebastian, protesting at the club’s failure to pay their wages.After a delay of around 45 minutes, the referee finally called off the game, with the Spanish Football Federation (RFEF) confirming Racing Santander’s expulsion from the competition.”The Competition Committee has decided to expel Racing Santander from the Copa del Rey for not appearing in the match,” said an RFEF statement.”Real Sociedad have been awarded a 3-0 win and will now face the winners of the match between Real Madrid and Valencia, who will play the semi-final.”Racing Santander, who are currently bottom of the Spanish first division, released a statement on their website protesting at the club’s treatment by the RFEF.”Racing Santander has today been expelled from the Copa del Rey for not appearing in the match against Real Sociedad, despite the fact that our players were prepared to play,” said the statement.”We believe that this decision is unjust, given that our players have not been paid their wages for several months and have made it clear that they would not be able to compete in the match.”The club also released a video of the players refusing to take to the field, with captain David Lafuente leading the protest.Racing Santander’s expulsion from the Copa del Rey means they will not compete in next season’s competition, while Real Sociedad will face the winners of the match between Real Madrid and Valencia in the semi-final. (http://edition.cnn.com/2014/01/30/sport/football/spanish-cup-boycott-santander-football/index.html)

表 16： GPT-3 生成 CNN 新闻并链接到在线资源

附录 N GPT-3 生成问题

Table 17 显示了一些 GPT-3 生成的故事，但存在一些问题。

Issue	Title	Generation
Generate Chinese from an English title	you are an alien body snatcher , and this is your first day in your new host body . you have no idea what his life is .	你不是人，你是外星人。这是你第一天在新的主人身上。你不知道他的生活是什么。你只是偶然得到他的身体，他的生活已经开始了，而你还不知道这是什么。你的主人是一个普通人，他没有你的外星人能力。你不知道你的主人在做什么，也不知道他有没有你的朋友。你只是在他的身体里，完全不知道他的生活。
Generate Japanese from an English title	what it means to live ( 250 words ) .	人生とは何かを探して生きることである。探し続けることで、生きる価値を見つける。例えば夢を探し続けることで、その夢が叶う確率を高める。夢を叶えるために努力をし続けることで、その夢が叶う可能性が高まる。しかし、夢を叶えるために努力をし続けることは難しい。夢を叶えるためには、その夢に対して情熱を持つことが大切だ。そして、その夢を叶えるためには、自分を信じることが大切だ。夢を叶えるためには、その夢にチャレンジすることが大切だ。そして、夢を叶えるためには、失敗を恐れずに前進することが大切だ。
Stick word together to create “new” words	write a short story based off the lyrics to any song .	there was a timewhen we were youngand our dreams were alivewe wanted to be rock starsand travel the worldwe wanted to be someonesomeone famoussomeone lovednow we ’re just oldour dreams have diedour passions gonewe just want to be left aloneto die in peace
Copy from the leading text	[MALE] was a very sweet dog .	[MALE] was a very sweet dog . he loved his family and was always happy to see them . even though he was old , he never stopped wagging his tail and licking everyone ’s faces .
Copy from the prompt	[MALE] is a manager at kfc .	he started as a fry cook [MALE] wants to move to a new house , but his wife [FEMALE] does not want to move . [MALE] convinces her to look at a couple new houses , and she does not see any that she likes . [FEMALE] refuses to move . (Table 11)
Generate profanity words	[FEMALE] was eating lunch at school .	a boy came up to her and asked her to go out with him . she said no and he called her a stuck up b * * * * .

表 17：我们发现的一些有趣的 GPT-3 生成错误示例。

附录 O 每个故事方面之间的皮尔逊相关性

Figure 3 和 4 分别显示了 MTurk 和内部员工每个故事方面之间的 Pearson 相关性。