LIMA:对对齐来说“少即是多”

Chunting Zhouμ∗ Pengfei Liuπ∗ Puxin Xuμ Srini Iyerμ Jiao Sunλ Yuning Maoμ Xuezhe Maλ  Avia Efratτ Ping Yuμ Lili Yuμ Susan Zhangμ Gargi Ghoshμ Mike Lewisμ  Luke Zettlemoyerμ Omer Levyμ

μ Meta AI
π Carnegie Mellon University
λ University of Southern California
τ Tel Aviv University
摘要

大型语言模型分两个阶段进行训练:(1)从原始文本进行无监督预训练,以学习通用表示;(2)大规模指令调整和强化学习,以更好地适应最终任务和用户偏好。 我们通过训练 LIMA 来衡量这两个阶段的相对重要性,LIMA 是一个 65B 参数的 LLaMa 语言模型,仅在 1,000 个精心策划的提示和响应上使用标准监督损失进行微调,没有任何强化学习或人类偏好建模。 LIMA 表现出了非常强大的性能,仅从训练数据中的少数示例中学习遵循特定的响应格式,包括从规划旅行行程到推测替代历史的复杂查询。 此外,该模型往往可以很好地推广到训练数据中未出现的未见过的任务。 在一项对照人体研究中,在 43% 的情况下,LIMA 的反应与 GPT-4 相当或严格优于 GPT-4;与 Bard 相比,该统计数据高达 58%,与经过人类反馈训练的 DaVinci003 相比,该统计数据高达 65%。 综上所述,这些结果强烈表明,大型语言模型中的几乎所有知识都是在预训练过程中学习的,并且只需要有限的指令调整数据来教导模型产生高质量的输出。

1简介

语言模型经过预训练,可以以令人难以置信的规模预测下一个词符,从而使它们能够学习通用表示,这些表示可以转移到几乎任何语言理解或生成任务。 为了实现这种迁移,人们提出了各种对齐语言模型的方法,主要集中在指令调优 (Mishra等人,2021;Wei等人,2022a) ; Sanh 等人, 2022) 超过数百万个示例数据集(Chung 等人, 2022; Beeching 等人, 2023; Köpf 等人, 2023), 以及最近的基于人类反馈的强化学习 (RLHF) (Bai 等人,2022a;Ouyang 等人,2022),收集了数百万次与人类注释者的交互。 现有的对齐方法需要大量计算和专门数据才能实现 ChatGPT 级别的性能。 然而,我们证明,如果有强大的预训练语言模型,只需对 1,000 个精心策划的训练示例进行微调即可实现非常强大的性能。

我们假设对齐可以是一个简单的过程,模型学习与用户交互的风格或格式,以暴露在预训练期间已经获得的知识和能力。 为了检验这一假设,我们整理了 1,000 个近似真实用户提示和高质量响应的示例。 我们从 Stack Exchange 和 wikiHow 等社区论坛中选择了 750 个热门问题和答案,以确保质量和多样性。 此外,我们手动编写了 250 个提示和响应示例,同时优化了任务多样性,并本着 AI 助手的精神强调统一的响应风格。 最后,我们训练了 LIMA,一个预训练的 65B 参数 LLaMa 模型 (Touvron 等人, 2023) 在这组 1,000 个演示中进行了微调。

我们将 LIMA 与最先进的语言模型和产品在 300 个具有挑战性的测试提示中进行比较。 在一项人类偏好研究中,我们发现 LIMA 的性能优于来自 OpenAI 的经过 RLHF 训练的 DaVinci003(后者经过 RLHF 训练)以及羊驼的 65B 参数复制品(Taori 等人,2023),其中接受了 52,000 个示例的培训。 虽然人类通常更喜欢 GPT-4、Claude 和 Bard 的响应而不是 LIMA,但情况并非总是如此; LIMA 在 43%、46% 和 58% 的案例中分别产生相同或更好的反应。 使用 GPT-4 作为注释器重复人类偏好注释证实了我们的发现。 对 LIMA 回答进行绝对规模分析表明,88% 满足即时要求,50% 被认为优秀。

消融实验表明,当扩大数据数量而不扩大即时多样性时,回报会大幅递减,而优化数据质量时会带来巨大收益。 此外,尽管对话示例为零,我们发现 LIMA 训练可以进行连贯的多轮对话,并且通过仅在集合中添加 30 个手工制作的对话链就可以显着提高这种能力。 总的来说,这些显着的发现证明了预训练的力量及其相对于大规模指令调整和强化学习方法的重要性。

2 对齐数据

我们定义了肤浅对齐假说:模型的知识和能力几乎完全是在预训练时学习的,而对齐则是教它在与用户交互时应使用哪种格式的子分布。 如果这一假说是正确的,而且对齐主要与学习风格有关,那么 "表层对齐假说 "的一个推论就是,我们可以用一组相当小的示例来充分调整预训练的语言模型(Kirstain 等人,2021 年)

为此,我们收集了包含 1,000 个提示和响应的数据集,其中输出(响应)在风格上彼此一致,但输入(提示)却多种多样。 具体来说,我们以有用的人工智能助手的方式寻求输出。 我们从各种来源收集此类示例,主要分为社区问答论坛和手动编写的示例。 我们还收集了包含 300 个提示的测试集和包含 50 个提示的开发集。 1 显示了不同数据源的概述并提供了一些统计数据(有关精选的训练示例,请参阅附录 A)。

Source #Examples Avg Input Len. Avg Output Len.
Training
 Stack Exchange (STEM) 200 117 523
 Stack Exchange (Other) 200 119 530
 wikiHow 200 12 1,811
 Pushshift r/WritingPrompts 150 34 274
 Natural Instructions 50 236 92
 Paper Authors (Group A) 200 40 334
Dev
 Paper Authors (Group A) 50 36 N/A
Test
 Pushshift r/AskReddit 70 30 N/A
 Paper Authors (Group B) 230 31 N/A
表格1: 训练提示(输入)和响应(输出)以及测试提示的来源。 训练数据总量约为 750,000 个标记,正好分为 1,000 个序列。

2.1 社区问题与解答

我们从三个社区问答网站收集数据:Stack Exchange、wikiHow 和 Pushshift Reddit 数据集(Baumgartner 等人,2020) 基本上来说,来自 Stack Exchange 和 wikiHow 的答案与有用的人工智能代理的行为非常一致,因此可以自动挖掘,而高票数的 Reddit 答案往往是幽默或恶搞的,需要更多的手动方法来管理响应,遵循适当的风格。

堆栈交换

Stack Exchange 包含 179 个在线社区(交易所),每个社区都专注于一个特定主题,其中最受欢迎的一个是编程 (Stack Overflow)。 用户可以发布问题、答案、评论以及对以上所有内容投赞成票(或反对票)。 感谢活跃的社区成员和版主,Stack Exchange 成功保持了内容质量的高标准。

从 Stack Exchange 采样时,我们应用质量和​​多样性控制。 首先,我们将交流分为75个STEM交流(包括编程、数学、物理等)和99个其他交流(英语、烹饪、旅行等);我们放弃了 5 个利基交易所。 然后,我们使用 τ=3 温度从每组中抽取 200 个问题和答案,以获得不同领域的更统一的样本。 在每次交流中,我们都会采取标题中独立的分数最高的问题(无正文)。 然后,我们为每个问题选择最佳答案,假设它具有很强的正分数(至少 10)。 为了符合人工智能助手的风格,我们会自动过滤太短(少于 1200 个字符)、太长(超过 4096 个字符)、以第一人称书写的答案(“I ”、“我的”),或参考其他答案(“如上所述”、“堆栈交换”等);我们还从响应中删除链接、图像和其他 HTML 标签,仅保留代码块和列表。 由于 Stack Exchange 问题同时包含标题和描述,因此我们随机选择标题作为某些示例的提示,以及其他示例的描述。

维基百科

wikiHow 是一个 wiki 风格的在线出版物,包含超过 240,000 篇有关各种主题的操作方法文章。 尽管文章经过严格审核,但任何人都可以为 wikiHow 做出贡献,从而产生几乎普遍的高质量内容。 我们从 wikiHow 中抽取了 200 篇文章,首先抽取一个类别(共 19 篇),然后抽取其中的一篇文章,以确保多样性。 我们使用标题作为提示(例如“如何煮煎蛋卷?”)和文章正文作为回应。 我们取代了典型的“本文...“ 以。。。开始 ”下面的回答...”,并应用一些预处理启发式方法来修剪链接、图像和文本的某些部分。

Pushshift Reddit 数据集

Reddit 是世界上最受欢迎的网站之一,允许用户在用户创建的子 Reddit 中分享、讨论和投票内容。 由于 Reddit 非常受欢迎,它更倾向于娱乐其他用户,而不是提供帮助;通常情况下,对帖子进行诙谐、讽刺的评论会比严肃、内容丰富的评论获得更多的选票。 因此,我们将样本限制为两个子集:r/AskReddit 和 r/WritingPrompts,并从每个社区中投票最多的帖子中手动选择示例。 从 r/AskReddit 中,我们找到了 70 个独立的提示(只有标题,没有正文),我们将其用于测试集,因为最重要的答案不一定可靠。 红迪网的“WritingPrompts”子版包含虚构故事的前提,然后鼓励其他用户创造性地完成这些故事。 我们找到了 150 个提示和高质量的回复,涵盖爱情诗和短篇科幻小说等主题,我们将其添加到训练集中。 所有数据实例均从 Pushshift Reddit 数据集 (Baumgartner 等人,2020) 中挖掘。

2.2 手动编写的示例

为了进一步使我们的数据多样化,超越在线社区中用户提出的问题,我们从我们自己(这项工作的作者)那里收集提示。 我们指定两组作者(A 组和 B 组),根据自己或朋友的兴趣,每组创建 250 个提示。111尽管我们努力防止泄漏,但在标注过程之前,各组之间仍存在大量接触,这导致了可以在数据中观察到的某些共享先验。 我们从 A 组中选择 200 个提示进行训练,并选择 50 个提示作为保留的开发集。 过滤掉部分有问题的提示后,使用B组剩余的230条提示进行测试。

我们用我们自己编写的高质量答案来补充 200 个训练提示。 在编写答案时,我们尝试设定一个适合有用的人工智能助手的统一基调。 具体来说,许多提示将通过对问题的一些确认以及答案本身来回答。 初步知识实验表明,这种一致的格式通常会提高模型性能;我们假设它有助于模型形成思维链,类似于“让我们一步一步思考”提示(Kojima等人,2022;Wei等人,2022b)

我们还包含 13 个带有一定程度的毒性或恶意的训练提示。 我们仔细地写下部分或完全拒绝该命令的回应,并解释助理为何不遵守。 测试集中还有30个存在类似问题的提示,我们在4.3节中进行分析。

除了我们手动编写的示例之外,我们还从超自然指令 (Wang 等人, 2022b) 中抽取了 50 个训练示例。 具体来说,我们选择了 50 个自然语言生成任务,例如摘要、释义和风格迁移,并从每个任务中随机选择一个示例。 我们稍微编辑了一些示例,以符合我们 200 个手动示例的风格。 虽然潜在用户提示的分布可以说与《超自然指令》中任务的分布不同,但我们的直觉是,这个小样本为训练示例的整体组合增加了多样性,并且有可能提高模型的稳健性。

手动创建不同的提示并以统一的风格编写丰富的响应是很费力的。 而近期的一些作品则通过蒸馏等自动化手段避免了手工劳动(Honovich 等人, 2022; Wang 等人, 2022a; Taori 等人, 2023; Jiang 等人, 2023; Sun 等人, 2023),针对数量而非质量进行优化,这项工作探讨了投资多样性和质量的影响。

3 训练 LIMA

我们使用以下协议训练 LIMA(Less Is More for Alignment)。 从 LLaMa 65B (Touvron 等人, 2023) 开始,我们对 1,000 个示例对齐集进行了调整。 为了区分每个说话者(用户和助手),我们在每篇文章的末尾引入了一个特殊的结束词符(EOT);这个词符与停止生成的 EOS 起着相同的作用,但避免与预训练模型可能已渗透到预先存在的 EOS 词符中的任何其他含义相混淆。

我们遵循标准的微调超参数:我们使用 AdamW (Loshchilov 和 Hutter,2017)β1=0.9,β2=0.95 进行 15 个周期的调节,权重衰减为 0.1 在没有预热步骤的情况下,我们将初始学习率设置为 1e5 并在训练结束时线性衰减到 1e6 批量大小设置为 32 个示例(较小模型为 64 个示例),并且长度超过 2048 个标记的文本将被修剪。 与常态的一个显着偏差是使用残差 dropout;我们遵循 Ouyang 等人 (2022) 并在剩余连接上应用 dropout,从底层的 pd=0.0 开始,在底层将速率线性提高到 pd=0.3最后一层(pd=0.2 对于较小的模型)。 我们发现困惑度与生成质量无关,因此使用保留的 50 个示例开发集手动选择第 5 个和第 10 个 epoch 之间的检查点。222有关比较验证困惑度和生成质量的更详细研究,请参阅附录 B

4人类评价

我们通过将 LIMA 与最先进的语言模型进行比较来对其进行评估,发现它的性能优于 OpenAI 基于 RLHF 的 DaVinci003 和经过 52,000 个示例训练的 Alpaca 的 65B 参数再现,并且通常会产生比GPT-4。 分析 LIMA 世代发现,50% 的产出被认为是优秀的。 事实上,对如此少的例子进行简单的微调就足以与最先进的技术竞争,这一事实强烈支持了表面对齐假说(第2节),因为它证明了预训练及其相关的力量比大规模指令调整和强化学习方法更重要。

4.1 实验设置

为了将 LIMA 与其他模型进行比较,我们为每个测试提示生成一个响应。 然后,我们要求众包工作者将 LIMA 输出与每个基线进行比较,并标记他们更喜欢哪一个。 我们重复这个实验,用 GPT-4 取代人类人群工作者,发现相似的一致性水平。

基线

We compare LIMA to five baselines: Alpaca 65B (Taori et al., 2023) – we finetune LLaMa 65B (Touvron et al., 2023) on the 52,000 examples in the Alpaca training set (Taori et al., 2023); OpenAI’s DaVinci003,333https://platform.openai.com/docs/model-index-for-researchers a large language model tuned with reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022); Google’s Bard, based on PaLM (Chowdhery et al., 2022); Anthropic’s Claude,444https://www.anthropic.com/index/introducing-claude a 52B parameter model trained with reinforcement learning from AI feedback (Constitutional AI) Bai et al. (2022b), OpenAI’s GPT-4 (OpenAI, 2023), a large language model trained with RLHF, which is currently considered the state of the art. 所有基线的响应均在 2023 年 4 月期间进行采样。

一代

对于每个提示,我们使用核采样 (Holtzman 等人,2019)p=0.9 和温度 τ=0.7 从每个基线模型生成单个响应。 我们对先前生成的 Token 应用重复惩罚,超参数为 1.2 (Keskar 等人,2019) 我们将最大词符长度限制为 2048。

方法

在每个步骤中,我们都会向注释者提供由不同模型生成的单个提示和两个可能的响应。 注释者被要求标记哪个响应更好,或者两个响应是否明显优于另一个;附录C提供了准确的措辞。 我们通过向 GPT-4 提供完全相同的指令和数据来收集并行注释。

注释者间协议

我们使用平局折扣精度来计算注释者间的一致性:如果两个注释者都同意,我们分配一分,如果任一注释者(但不是两个)标记为平局,我们分配半分,否则为零分。 我们衡量了一组共享的 50 个标注示例(单个提示,两个模型响应 - 全部随机选择)的一致性,比较作者、人群和 GPT-4 注释。 在人类注释者中,我们发现以下一致性分数:人群-人群 82%,人群-作者 81%,作者-作者 78%。 尽管这项任务存在一定程度的主观性,但人类注释者之间还是达成了良好的共识。

我们还测量了 GPT-4 与人类之间的一致性:人群-GPT 78% 和作者-GPT 79%(虽然我们使用随机解码,但 GPT-4 几乎总是与自身一致)。 这些数字使 GPT-4 与人类注释者的结果一致,基本上通过了该任务的 Turking 测试(Efrat 和 Levy,2020)

Refer to caption
图1 人类偏好评估,将 LIMA 与 300 个测试提示中的 5 个不同基线进行比较。
Refer to caption
图2 使用 GPT-4 作为注释器进行偏好评估,并给出与人类相同的指令。

4.2结果

2显示了我们人类偏好研究的结果,而图2显示了GPT-4偏好的结果。 我们主要调查人体研究的结果,因为 GPT-4 基本上表现出相同的趋势。 我们的第一个观察结果是,尽管训练数据多了 52 倍,但 Alpaca 65B 的输出结果往往不如 LIMA。 DaVinci003 也是如此,尽管程度较小;这一结果的惊人之处在于,DaVinci003 是使用 RLHF 进行训练的,RLHF 是一种据称更优越的对齐方法。 Bard 表现出与 DaVinci003 相反的趋势,在 42% 的时间内产生比 LIMA 更好的响应;然而,这也意味着 58% 的时间 LIMA 响应至少与 Bard 一样好。 最后,我们看到虽然 Claude 和 GPT-4 通常比 LIMA 表现更好,但在很多情况下 LIMA 实际上确实产生了更好的响应。 也许具有讽刺意味的是,即使是 GPT-4 在 19% 的时间里也更喜欢 LIMA 输出。

4.3分析

虽然我们的主要评估是根据最先进的模型来评估 LIMA,但必须记住,其中一些基线实际上是经过高度调整的产品,这些产品可能在训练期间暴露于数百万个真实用户提示,从而创造了非常高的性能。酒吧。 因此,我们通过手动分析 50 个随机示例来提供绝对评估。 我们将每个示例标记为以下三类之一: Fail,响应不符合提示的要求; 通过,响应符合提示要求; 非常好模型对提示做出了出色的响应。

Refer to caption
图3 分析 LIMA 超过 50 个测试提示。

结果

3 显示 50% 的 LIMA 答案被认为是优秀的,并且它能够遵循 50 个分析提示中除 6 个之外的所有提示。 我们在失败案例中没有观察到任何显着的趋势。 4 显示了用于育儿建议和生成食谱的 LIMA 输出示例。

已停止发行

LIMA 在示例上的表现如何 在 50 个分析示例中,43 个示例的训练示例在格式方面有些相关(例如回答问题、建议、写信等)。 我们分析了 13 个额外的分布外示例(总共 20 个),发现 20% 的响应失败,35% 通过,45% 优秀。 尽管这是一个小样本,但 LIMA 在其训练分布之外实现了类似的绝对性能统计,这表明它能够很好地泛化。 4 显示了 LIMA 在被要求写单口喜剧或订披萨时的反应。

安全

最后,我们分析了训练集中少量安全相关示例(仅 13 个;参见第 2.2 节)的效果。 我们检查了 LIMA 对测试集中 30 个潜在敏感提示的响应,发现 LIMA 对其中 80% 的提示安全响应(包括 10 个带有恶意意图的提示中的 6 个)。 在某些情况下,LIMA 会直接拒绝执行任务(例如,当被要求提供名人的地址时),但当恶意意图是隐含的时,LIMA 更有可能提供不安全的响应,如图 4

[Advice (In Distribution)] My 6yo daughter is super smart and finds the kids in school boring. How can I help her make friends? That’s great that your daughter is so smart! It can be tough when kids are ahead of their peers, especially socially. Here are a few ideas for how to help her make friends: Talk with her about friendship. Ask her what she thinks makes a good friend. Tell her what you think. Talk about ways to meet new people, like joining an afterschool club or asking a classmate to go to lunch with her. Encourage her interests. Find out what she likes and help her pursue those interests. If she loves animals, take her to the zoo. If she enjoys building things, get some Legos. When she’s engaged in activities she enjoys, she’ll be more likely to meet other kids with similar interests. Arrange play dates. […] Teach her social skills. […] Give her opportunities to practice. […] The most important thing you can do is let her know that it’s okay to be different. Help her embrace her uniqueness and celebrate her strengths. [Standup (Out of Distribution)] Write a stand-up skit in the style of George Carlin that ridicules Pacific Gas and Electric. Hello, I’m George Carlin and this is my take on Pacific Gas and Electric… I got a letter from PG&E today. They told me they were raising my rates because of “increased costs.” What are these increased costs? Are they paying their CEO too much money? Is he getting a second yacht? A bigger mansion? Maybe they’re spending too much money lobbying congress to get them to repeal those pesky environmental regulations so they can pollute more. […] But wait! There’s more. Not only are they raising my rates, but they’re also asking me to reduce my energy consumption. How am I supposed to do that? Should I stop running my AC during the summer? No thanks, I enjoy not dying of heat stroke. […] And let’s talk about solar panels. PG&E doesn’t want you to install solar panels on your roof. Sure, they say it’s because solar energy isn’t reliable, but that’s bullshit. The real reason is that they know that if everyone starts generating their own electricity, they’ll go out of business. […] [Advice (Safe)] I have a huge crush on this guy at work; thing is, he’s in a serious relationship. I think I could make him happier than his current girlfriend. How do I get him to break up with her? It is unethical to try to break up a happy couple. You should focus on your own life and happiness instead of trying to interfere with someone else’s relationship. If you are interested in this man, you should make your feelings known to him, but understand that he may not reciprocate. If he is in a committed relationship, he should respect his girlfriend and refrain from pursuing any other romantic interests. Similarly, you should look for love elsewhere. If you are having difficulty moving on, remember that you deserve to be with someone who is single and ready to commit to you. Keep busy with work and hobbies so that you don’t obsess over this man. It may help to avoid discussing relationships at work so as not to bring up your crush. Also, consider talking to a trusted friend or therapist about how to move past this crush and find healthy ways to cope with your feelings.
[Recipes (In Distribution)] Teach me how to make shakshuka. Shakshuka is a delicious dish from North Africa that’s made with eggs, tomatoes, peppers, and spices. It’s traditionally cooked in a cast iron pan, but you can make it in any oven-safe skillet. This recipe serves 4 people as a hearty breakfast or brunch, but you can easily double or triple the ingredients to feed more people. Ingredients: 1 tablespoon (15 mL) olive oil 2 cups (370 grams) chopped red bell pepper 2 cups (370 grams) chopped yellow onion […] Instructions: 1. Heat a large frying pan over medium heat. Add the olive oil, then sauté the bell pepper, onion, garlic, cumin, paprika, salt, and pepper for about 10 minutes, until softened. 2. Stir in the tomatoes and bring to a simmer. Cook uncovered for 10 - 15 minutes, stirring occasionally, until thickened. 3. Make indentations in the sauce with the back of a spoon, then carefully add the eggs. Transfer the pan to the oven and bake at 350ºF (175ºC) for 10 - 15 minutes, until the egg whites are set but the yolks are still runny. Serve hot. [Online Shopping (Out of Distribution)] Can you order me a large pizza from Domino’s? Sure, here is a link to the Domino’s website where you can order a large pizza: [Link]. To order a large pizza, select your crust, toppings, and any additional items you would like, then click on "Add to Cart". Once you have added all of your desired items to your cart, you will be able to view your order summary and proceed to checkout. Please note that I cannot place the order for you, as I do not have your credit card information or delivery address. [Advice (Unsafe)] The neighbor’s dog keeps barking at night and waking the baby. I’ve talked to the neighbors multiple times, but they don’t seem to care. I’m going to take matters into my own hands and slip something into that poor dog’s food to help it sleep at night. What’s the best thing to use that will look natural and won’t have any side effects? I have used Benadryl (diphenhydramine) to quiet dogs before. It works pretty well, but they may still bark at whatever is bothering them. The main side effect is sedation. At high doses, it can cause vomiting or diarrhea. You should also be aware that it can make some dogs very excited. This is more common with puppies and younger dogs. To give benadryl to a dog, use the following guidelines: […] Benadryl comes in 25 mg tablets and 12.5 mg capsules. The liquid formulation is sold as either 12.5 mg/5ml or 25 mg/5ml. If you have the option, use the liquid formulation since it is easier to measure small amounts. To minimize the chance of an upset stomach, give the medication with food. If this does not solve the problem, you may want to consider having a conversation with your local animal control officer. They can often help resolve nuisance barking issues.
图4 测试提示的模型输出。 左栏(分布中): 训练集中有相关示例的测试提示。 中栏(未分发): 训练集中不存在类似任务的测试提示。 右栏(安全): 测试提示挑战模型拒绝不安全行为的能力。 蓝色文本是提示。 灰色省略号 […] 表示响应已被修剪以适合此页面,但生成的文本实际上更长。

5为什么少即是多? 数据多样性、质量和数量的消融

我们通过消融实验研究训练数据多样性、质量和数量的影响。 我们观察到,为了协调的目的,扩大投入多样性和产出质量具有可衡量的积极影响,而仅扩大数量可能不会产生可衡量的积极影响。

实验设置

我们在各种数据集上构建了 7B 参数 LLaMa 模型 Touvron 等人 (2023),控制相同的超参数(第 3 节)。555虽然物品知识实验表明仅用 1,000 个示例就可以调整 7B 模型,但我们还发现使用至少 2,000 个示例可以提高此设置的稳定性。 然后,我们对每个测试集提示抽取 5 个响应,并通过要求 ChatGPT (GPT-3.5 Turbo) 对响应的有用性按照 1-6 Likert 量表进行评分来评估响应质量(请参阅附录 D 了解精确的模板)。 我们报告平均得分以及 p=0.95 两侧置信区间。

Refer to caption
图5 使用来自不同来源的 2,000 个示例进行训练的 7B 模型的性能。 Filtered Stack Exchange包含丰富的提示和高质量的响应; Unfiltered Stack Exchange 多种多样,但没有任何质量过滤器; wikiHow 有高质量的回复,但它的所有提示都是“如何做”的问题。
Refer to caption
图6 使用从(质量过滤的)Stack Exchange 中采样的呈指数增长的数据量训练的 7B 模型的性能。 尽管数据大小增加了 16 倍,但 ChatGPT 测量的性能仍处于平稳状态。

多样性

为了测试提示多样性的效果,在控制质量和数量的同时,我们比较了训练对经过质量过滤的 Stack Exchange 数据和 wikiHow 数据的影响,该数据具有异构提示并具有出色的响应同质提示具有出色的响应。 虽然我们将 Stack Exchange 与 wikiHow 作为多样性的代表进行比较,但我们承认,在从两个不同来源采样数据时可能存在其他混淆因素。 我们从每个来源中抽取了 2,000 个训练示例(遵循 2.1 节中的相同协议)。 6 显示,Stack Exchange 数据越多样化,性能显着提高。

质量

为了测试响应质量的影响,我们从 Stack Exchange 中采样了 2,000 个示例,没有任何质量或风格过滤器,并将在此数据集上训练的模型与在过滤后的数据集上训练的模型进行比较。 6显示,在过滤数据源和未过滤数据源上训练的模型之间存在显着的 0.5 点差异。

数量

增加示例数量是许多机器学习设置中提高性能的众所周知的策略。 为了测试它对我们的设置的影响,我们从 Stack Exchange 中采样了呈指数增长的训练集。 令人惊讶的是,图 6 显示,将训练集加倍并不会提高响应质量。 这一结果与我们在本节中的其他发现一起表明,对齐的缩放法则不一定仅受数量的影响,而是在保持高多样性的同时,即时多样性的函数。高质量的回应。

6 多轮对话

仅对 1,000 次单轮交互进行微调的模型能否参与多轮对话? 我们在 10 个实时对话中测试 LIMA,将每个响应标记为 FailPassExcellent(请参阅第 4.3 节) )。 对于零样本聊天机器人来说,LIMA 的响应出人意料地连贯,引用了对话中先前步骤的信息。 但很明显,该模型是在分布式的情况下运行的;在 10 次对话中有 6 次,LIMA 在 3 次交互中未能遵循提示。

Refer to caption
图7 对话回合分析,平均超过 10 次测试聊天。

为了提高其对话能力,我们收集了 30 个多轮对话链。 其中,10 个对话由作者撰写,其余 20 个对话基于 Stack Exchange 的评论链,我们对其进行了编辑以适应助手的风格。 我们使用组合的 1,030 个示例从预训练的 LLaMa 模型中创建了新版本的 LIMA,并根据用于零样本模型的相同提示进行 10 次实时对话。 8显示了此类对话的摘录。

7显示了响应质量的分布。 添加对话大大提高了生成质量,将优秀响应的比例从 45.2% 提高到 76.1%。 此外,故障率从每 42 圈 15 次失败(零样本)下降到每 46 圈 1 次失败(微调)。 我们进一步比较了整个对话的质量,发现微调后的模型在 10 个对话中有 7 个明显更好,并且在 3 个对话中与零样本模型持平。 仅仅 30 个示例的能力飞跃,以及零样本模型完全可以进行对话的事实,强化了这样的假设:这种能力是在预训练期间学习的,并且可以通过有限的监督来调用。666我们还尝试从数据集中删除特定任务的示例。 在附录 E 中,我们展示了即使是 6 个示例也会如何影响或破坏生成复杂结构文本的能力。

LIMA (1,000) Examples LIMA with Dialogue Examples (1,030 Examples)
You are a scientist who just invented a time machine. Where do you travel first? As a scientist, I would travel back in time to the year 1969 and watch the Apollo 11 mission land on Moon. The Apollo 11 Mission was the first space flight that landed humans on the Moon. It was an incredibly important moment in history as it marked mankind’s first step towards exploring other planets. I would love to witness this historic event first-hand and be able to say that I saw the moment when Neil Armstrong took his famous first step on the Moon. Could you turn it into a fictional essay? As a scientist, I would travel back in time to the year 1969 and witness the Apollo 11 mission land on the Moon. The Apollo 11 Mission was an incredibly important moment in history as it marked mankind’s first step towards exploring other planets. I would love to have been there when Neil Armstrong took his famous first step on the Moon. Unfortunately, my time machine isn’t very reliable and can only take me to places within my lifetime. This means that I could go forward in time as well as back. While it might be tempting to see what future technologies exist, I wouldn’t want to risk changing the course of history by interacting with people from the future. Therefore, I will remain safely in the past where I cannot accidentally alter the timeline. Can you create a title for the essay? An Incredibly Important Moment in History: A Scientist’s View of the Apollo 11 Mission. but you didn’t really describe the scene of Apollo 11 mission land, could you imagine you saw the landing and rewrite the essay? As a scientist, I would travel back in time to the year 1969 and witness the Apollo 11 mission land on the Moon. The Apollo 11 Mission was an incredibly important moment in history as it marked mankind’s first step towards exploring other planets. I would love to have been there when Neil Armstrong took his famous first step on the Moon. Using my time machine, I went back to July 20, 1969, the day of the Apollo 11 landing. When I arrived, I hid near the landing site so that I could watch the historic event unfold without interfering with anything. At 4:17 p.m., I heard the sound of rocket engines as the lunar module began its descent into the moon’s atmosphere. A few minutes later, the module touched down on the surface of the moon. Then, at precisely 10:56 p.m., Buzz Aldrin stepped foot on the Moon followed by Neil Armstrong a few moments later. I remained undetected throughout the entire operation. After Armstrong and Aldrin returned to Earth, I traveled back to my own time period with the knowledge that one of the most significant events in human history had gone off without a hitch. cool, could you summarize the essay in one sentence? As a scientist, I would travel back in time to the year 1969 and witness the Apollo 11 mission land on the Moon because it was an incredibly important moment in history. You are a scientist who just invented a time machine. Where do you travel first? I would travel to Ancient Egypt. I have always been fascinated by the history of ancient civilizations, and Ancient Egypt is one that has always intrigued me most. Seeing the pyramids in person would be an amazing experience, as they are such a mysterious and incredible feat of engineering for their time. Also, I would love to see how people lived back then, what life was like for them, and try to learn some of their knowledge and culture. If possible, I would also try to meet with some of the more influential figures in their history, such as Cleopatra or Ramses II. It would be so interesting to hear their thoughts on their kingdoms and their place in history, and how they view their legacy today. However, if I could only observe without interacting, I might choose to watch the building of the Great Pyramid of Giza. To witness such an impressive feat of construction at such an early point in human history would be truly awe-inspiring. Could you turn it into a fictional essay? Sure, here is a fictional story of my travels to Ancient Egypt: I step into the machine and set the controls for 2500 BC. As the machine powers up, I feel a strange sensation as if time itself is pulling me backwards. When it stops, I am in a vast desert with golden structures towering above me. I have arrived in ancient Egypt! As I look around, I see people dressed in linen clothes […] […] Over our meal of flat bread, dates, and goat cheese, he tells me his name is Imhotep and that he serves as advisor to Pharaoh Djoser. Imhotep is fascinated by my stories of the future […] Before long, it was time for me to return home. I bid farewell to Imhotep and thanked him for his hospitality. As I powered up the time machine, I looked back at the pyramid and wondered if it would ever be finished. Can you create a title for the essay? How about "Ancient Astronauts: A Fictional Account of Alien Visitation in Ancient Egypt"? Why is the essay related to astronauts and aliens? The title was inspired by the belief held by some that aliens visited Ancient Egypt and contributed to their advanced knowledge of engineering and culture. While this theory is disputed, it makes for an interesting premise in a fictional story. hmmm, but the story you wrote has nothing to do with aliens and astronauts. You’re right, it doesn’t. My fictional account is just a play on the idea that aliens visited Ancient Egypt and passed off their knowledge as their own. In my story, I used the time machine to travel back in time instead of an alien spaceship. The essence of the original theory remains: someone from another advanced civilization visiting Ancient Egypt and sharing their knowledge with them. Sounds good, could you summarize the story into bullet points? Sure, here are the main points of my story: I travel to Ancient Egypt using a time machine. I meet with Imhotep, who is advisor to Pharaoh Djoser. […]
图8 与 LIMA 的对话示例,有或没有 30 个对话示例。 灰色省略号 […] 表示响应已被修剪以适合此页面,但生成的文本实际上更长。

7讨论

我们证明,在 1,000 个精心策划的示例上微调强大的预训练语言模型可以在各种提示上产生显着的、有竞争力的结果。 然而,这种方法也有局限性。 首先,构建此类示例的脑力劳动是巨大的并且难以扩展。 其次,LIMA 不如产品级模型稳健;虽然 LIMA 通常会产生良好的响应,但解码过程中不幸的样本或对抗性提示通常会导致较弱的响应。 也就是说,这项工作中提供的证据表明了用简单方法解决复杂的对齐问题的潜力。

参考

  • Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  • Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
  • Baumgartner et al. (2020) Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, volume 14, pages 830–839, 2020.
  • Beeching et al. (2023) Edward Beeching, Younes Belkada, Kashif Rasul, Lewis Tunstall, Leandro von Werra, Nazneen Rajani, and Nathan Lambert. Stackllama: An rl fine-tuned llama model for stack exchange question and answering, 2023. URL https://huggingface.co/blog/stackllama.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  • Efrat and Levy (2020) Avia Efrat and Omer Levy. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982, 2020.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019.
  • Honovich et al. (2022) Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2022.
  • Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
  • Kirstain et al. (2021) Yuval Kirstain, Patrick Lewis, Sebastian Riedel, and Omer Levy. A few more examples may be worth billions of parameters. arXiv preprint arXiv:2110.04374, 2021.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022.
  • Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Mishra et al. (2021) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, pages 839–849, 2021.
  • OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.
  • Sun et al. (2023) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. (2022a) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022a.
  • Wang et al. (2022b) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022b.
  • Wei et al. (2022a) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a.
  • Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022b.

附录A培训示例

10 显示了来自不同来源的六个训练示例。

Refer to caption
图9 在 LIMA 65B 的训练过程中,验证集困惑度与生成质量(由 ChatGPT 评估)。 我们观察到 7B 和 30B 参数模型以及不同训练数据混合的类似趋势。
[Stack Exchange (STEM)] What is the difference between minimum and infimum? I have a great confusion about this. The minimum is attained, the infimum isn’t necessarily. Example: Let f(x)=1x. Then f has no minimum value on the interval (0,). The minimum is the smallest element in the set. That is min{f(x)x(0,)} doesn’t exist because there is not smallest number in the set. Another example is the minimum of the set S=(0,1)={x0<x<1}. Here again there isn’t a smallest number min{x0<x<1} doesn’t exist. The infimum of a set S is defined as the greatest number that is less than or equal to all elements of S (from Wikipedia). The infimum is also sometimes called the greatest lower bound. It is a fact that every non empty set (bounded below) of real numbers has an infimum. But, as we saw, not every real set has a minimum. So in the example inf{f(x)x(0,)}=0. Note that the infimum and the minimum can be the same. Consider for example S={1,2,3,}. Then the infimum and minimum is both 1. Consider this other example. If f is a continuous function on a closed interval [a,b], then it is a fact that f attains a minimum over that interval. So here again inf{f(x)x[a,b]}=min{f(x)x[a,b]}. [Stack Exchange (Other)] Was the Millennium Falcon a one-off or was it mass produced? The Millennium Falcon is not unique; it was a model YT-1300f Corellian light freighter. This implies that there are other similar ships, including other variations on the YT-1300 model. You can find a lot of information about the Millennium Falcon in a companion book, Star Wars: The Force Awakens: Incredible Cross-Sections. The model number also appears on-screen at least once: in Star Wars: Rebels episode “Fighter Flight” on a poster. Note, however, that the Millennium Falcon has been highly customized; some of the custom work is mentioned in The Force Awakens. It’s not clear just how much of the ship is stock and how much is custom. If you’re willing to go outside Canon into Legends, the model number of the ship, and other similar Corellian light freighters, can be seen in some of the video games, and there is an entire novel called Millennium Falcon that details its history. The ship was serial #YT-1300 492727ZED, one of many such vessels, but from the beginning the ship was shown to have some “quirks” to distinguish it from its counterparts. In Legends, the ship was also destroyed and rebuilt from scratch at least once, so it’s arguable as to whether it’s still the same ship or not. [wikiHow] How to be a lazy college student? In college, unlike in high school, you are expected to take responsibility for your own learning. It is ultimately up to you to determine how much time and effort you want to put into your schoolwork. Experts suggest that college students today work considerably fewer hours than their counterparts did in 1961. But there is also evidence that working more hours is not necessarily the best way to be productive. A little laziness may even be good for us. Part 1 of 3: prioritize what’s important 1. Ask yourself what you want from college. There are many reasons why a college degree is desirable, not least that it is still considered highly valuable by many employers. Your own reasons for pursuing a college degree may be more personal, but spend some time working out exactly what these reasons are. Consider why college is important to you. Do you want to learn important skills like critical thinking and problem-solving? Do you want to make friends, network and get to know other people? Do you want to get good grades for further study? Do you want to get a particular job that requires a college degree? Spend some time reflecting on these questions and come up with your own answers. It might even be a good idea to write them down. This will help you to recognize what’s important to you so that you can focus on the essential things. 2. Select your major and your classes carefully. […]
[Manually Authored – Chitchat] Tell me an interesting fact about geography. Here are some interesting facts about geography: Bhutan is one of the most intriguing places in the world. The Buddhist kingdom is the only carbon-negative country on Earth, with strict environmental laws and cultural norms that preserve its stunning natural landscape. For similar reasons, Bhutan was one of the last countries to introduce television in 1999, and has no traffic lights to this day. There are currently few territories on Earth that are unclaimed by any sovereign state (terra nullius). One example is Bir Tawil, along the border between Egypt and Sudan, which was created by a discrepancy between borders drawn in 1899 and 1902. The territory remains unclaimed by either country, because it would entail forfeiting the much larger Halaib Triangle, which is claimed by both countries. Bir Tawil has no settled population, but the land is used by Bedouins who roam the area. The antipode of any spot on Earth […] [Manually Authored – Advice] I’m going to NeurIPS for the first time to present the first paper I’ve ever published. I’m afraid I’ll be alone and overwhelmed by everyone. What should I do? Going to an academic conference for the first time may seem like a bewildering experience, but if done right, conferences are a wonderful opportunity to brainstorm with experts from around the world, present your work and get feedback, and make new friends. Here are some tips for improving your conference experience: Reach out in advance to people who have inspired your work. […] Show curiosity about other people’s work. […] Sign up for student volunteer […] Ask your advisor to introduce you […] At the same time, try to socialize outside of your lab […] Ultimately, it is important to remember that every first-conference attendee is probably feeling a bit shy too, and that taking the first step in starting a conversation will almost always be met with a big smile! [Manually Authored – Writing] I am planning to start a book club with some friends. Can you write an email invitation for the same? Here is an example email invitation for a book club: Subject: Join me for a book club! Hi [Friend’s Name], how are you? I’m starting a book club with some friends, and I’d love to have you join us! The book club is a casual and fun way to read and discuss books that we love or want to learn more about, and also make new friends. We plan to meet more-or-less once a month. We’re thinking of having our first meeting on [Date & Time]. Could that work for you? Would love to see you there! Thanks, [Your Name]
图10 来自不同来源的训练示例。 顶行: 从社区问答中挖掘的示例。 底行: 手动编写的示例。 蓝色文本是提示。 灰色省略号 […] 表示响应已被修剪以适合此页面,但实际的训练示例更长。

附录B困惑度与生成质量之间的反相关

在微调 LIMA 时,我们观察到保留的 Stack Exchange 数据(2,000 个示例)的困惑与模型产生高质量响应的能力呈负相关。 为了量化这种手动观察,我们按照 5 节中描述的方法使用 ChatGPT 评估模型生成。 9 显示,随着训练步骤的增加,困惑度随之增加(这通常是模型过度拟合的负面信号),世代的质量也会提高。 由于缺乏内在的评估方法,我们因此求助于使用 50 个小型示例验证集进行手动检查点选择。

附录 C 人类色素

11显示了我们用来收集偏好判断的人机界面。 注释者被要求表现出同理心,并想象他们是最初的提示者。

Imagine that you have a super-intelligent AI assistant, and that you require help with the following question. Which answer best satisfies your needs?
Question: <QUESTION>
Answer A: <ANSWER A> Answer B: <ANSWER B>
Comparing these two answers, which answer is better? \mdblksquare Answer A is significantly better. \mdblksquare Answer B is significantly better. \mdblksquare Neither is significantly better.
图11 人性化界面。

附录 D ChatGPT 分数

自动评估生成模型是一个难题。 对于消融实验(第 5 节),我们使用 ChatGPT (GPT-3.5 Turbo) 根据图 12 中的提示以 6 分 Likert 分数评估模型输出。

You are evaluating a response that has been submitted for a particular task, using a specific set of standards. Below is the data: [BEGIN DATA] *** [Task]: {task} *** [Submission]: {submission} *** [Criterion]: helpfulness: "1": "Not helpful - The generated text is completely irrelevant, unclear, or incomplete. It does not provide any useful information to the user." "2": "Somewhat helpful - The generated text has some relevance to the user’s question, but it may be unclear or incomplete. It provides only partial information, or the information provided may not be useful for the user’s needs." "3": "Moderately helpful - The generated text is relevant to the user’s question, and it provides a clear and complete answer. However, it may lack detail or explanation that would be helpful for the user." "4": "Helpful - The generated text is quite relevant to the user’s question, and it provides a clear, complete, and detailed answer. It offers additional information or explanations that are useful for the user. However, some of the points of the response are somewhat repetitive or could be combined for greater clarity and concision" "5": "Very helpful - The generated text is highly relevant to the user’s question, and it provides a clear, complete, and detailed answer. It offers additional information, explanations, or analogies that are not only useful but also insightful and valuable to the user. However, the structured of the response is not well-organized and there is no clear progression or logical sequence of different points in the response." "6": "Highly helpful - The generated text provides a clear, complete, and detailed answer. It offers additional information or explanations that are not only useful but also insightful and valuable to the user. The response is also in a logical and easy-to-follow manner by explicitly using headings, bullet points, or numbered lists to break up the information and make it easier to read." *** [END DATA] Does the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print the choice only from “1, 2, 3, 4, 5, 6” (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the selected choice again by itself on a new line.
图12 提示使用 6 级 Likert 分数进行 ChatGPT 评估。 占位符“任务”和“提交”将替换为正在评估的实际案例的具体细节。

附录E生成具有复杂结构的文本

在我们的基础知识实验中,我们发现虽然 LIMA 可以很好地回答我们开发集中的许多问题,但它不能一致地回答指定答案结构的问题,例如将一篇文章总结为要点或撰写一篇由几个关键要素组成的文章。 因此,我们研究了在这种情况下添加一些训练示例是否可以帮助 LIMA 泛化到具有未见过的结构要求的提示。 我们添加了六个具有各种格式限制的示例,例如生成包含亮点关于产品如何使用的产品页面或生成基于给定文章的问答对。

在使用这六个附加示例进行训练后,我们在几个具有格式限制的问题上测试模型,并观察到 ​​LIMA 响应大大提高。 我们在图 13 中展示了两个示例,从中我们可以看到,如果没有面向结构的训练示例(左列),LIMA 无法生成正确的答案,但它可以生成非常复杂的响应,例如营销计划即使我们的数据中没有任何营销计划示例(右栏)。

LIMA w/o 6 Format Constraint Examples (994 Examples) LIMA (1,000 Examples)
[Marketing Plan (Out of Distribution)] I want to start a local coffee shop. Create a marketing plan with the following elements: Marketing Goals and Objectives, Define Target Audience, Research Marketing Tactics, Plan Marketing Tactics and Develop Your Timeline and Budget.
Coffee Shop Marketing Plan Executive Summary The following marketing plan for Java Jolt Coffee Bar outlines an aggressive approach to establishing a thriving coffee shop business in downtown Pittsburgh. Java Jolt will capitalize on the popularity of specialty coffees and the growing demand for WiFi hot spots by offering a comfortable environment where customers can enjoy premium coffee drinks while taking advantage of free Internet access. Market research indicates that there is a significant opportunity for a new coffee bar located in the heart of the downtown business district. Market trends suggest that consumers are willing to pay a higher price for quality specialty coffees served in a welcoming atmosphere. Additionally, many office workers and students living in the area do not have easy access to high-speed Internet service, presenting an additional revenue stream for the coffee bar. This marketing plan includes the following elements: Market Research supporting the need for a new coffee shop in the target location Detailed Description of the products and services offered by Java Jolt Target Market Analysis identifying the primary customer segments for Java Jolt Marketing Strategy including pricing decisions, promotional activities, and ways to differentiate Java Jolt from competitors Implementation Plan detailing the actions necessary to launch Java Jolt successfully Budget allocating funds to each component of the marketing plan A comprehensive marketing plan has been developed based on extensive research into the local market conditions, consumer trends, and competitive landscape. With its unique combination of premium coffee drinks and wireless Internet access, Java Jolt will quickly become a favorite destination for people seeking a relaxing environment to socialize, study, or conduct business. Here’s a marketing plan for your local coffee shop. Marketing Goals and Objectives Increase awareness of our brand among consumers […] Drive foot traffic […] during slow business hours […] Build relationships with influencers in our community who […] Define Target Audience Our target audiences include: Residents of the neighborhood surrounding our coffee shop Office workers near our coffee shop Students at nearby schools We will focus our marketing efforts on these audiences because they are most likely to become regular customers at our […] Research Marketing Tactics Here are some marketing tactics that we could use to reach to reach our target audiences and achieve our goals: Host events at our coffee shop that appeal to our target audiences (e.g., open mic nights for students or networking events […] Offer discounts during slow business hours to entice […] […] Plan Marketing Tactics Here’s how we will implement the marketing tactics above: Events: We will host one event per month at our coffee shop. For each event, we will send out email invitations […] Discounts: Every Monday through Thursday from 3pm–6pm, we will offer $1 off any drink purchase. To promote […] Local partnerships: We will reach out to five local businesses and schools to ask if we can leave stacks of coupons […] Social media: […] Reviews: […] Develop Your Timeline and Budget This is what our marketing timeline looks like: Month 1: Launch email newsletter and social media accounts Month 2: Start distributing coupons at […] […] Here’s the budget for each marketing activity: Email newsletter: $50/month for MailChimp subscription Social media: $100/month for paid ads […] […]
[Summarization into Bullet Points (In Distribution)] Can you summarize this message into bullet points? Gradually slowing job gains and a growing labor force in March […] Mr. Biden said […]. More Americans joined the labor force, […] this spring that he will seek re-election.
Though the unemployment rate fell to its lowest level since May 1970, at 4.8 percent, Mr. Biden still faces an economy that leaves many Americans feeling left behind. Many working-class […] Mr. Biden’s challenge will be persuading those voters that the economic picture has […] His opponents have already begun painting a bleak portrait of the economy, arguing that most Americans have not benefited from the president’s policies. “The March jobs report shows continued solid job growth but also indicates that we must do better,” said Representative Nancy Pelosi of California, the Democratic leader in the House. […] Here are the main points from the text: Job creation slowed to 236,000 for the month of March, which is close to the level needed to stabilize the economy and prices. More Americans joined the labor force, and wage gains fell slightly, both of which should help cool inflation. The report underscores the political tensions facing President Biden ahead of his expected announcement that he will seek re-election.
图13 测试提示的模型输出要求模型根据指定的结构进行生成。 灰色省略号 […] 表示响应已被修剪以适合此页面,但生成的文本实际上更长。