Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan¹, Niket Tandon², Prakhar Gupta¹, Skyler Hallinan³, Luyu Gao¹,
Sarah Wiegreffe², Uri Alon¹, Nouha Dziri², Shrimai Prabhumoye⁴, Yiming Yang¹,
Shashank Gupta², Bodhisattwa Prasad Majumder⁵, Katherine Hermann⁶,
Sean Welleck²,³, Amir Yazdanbakhsh⁶, Peter Clark²
¹Language Technologies Institute, Carnegie Mellon University
²Allen Institute for Artificial Intelligence
³University of Washington ⁴NVIDIA ⁵UC San Diego ⁶Google Research, Brain Team
amadaan@cs.cmu.edu, nikett@allenai.org

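Abstract

Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialogue response generation to mathematical reasoning, using state-of-the-art (GPT-3.5 and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving task performance by $\sim$20% absolute on average. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.¹

¹ Code and data at https://selfrefine.info/.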

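1 Introduction

Although large language models (LLMs) can generate coherent outputs, they often fall short in addressing intricate requirements. This mostly includes tasks with multifaceted objectives, such as dialogue response generation, or tasks with hard-to-define goals, such as enhancing program readability. In these scenarios, modern LLMs may produce an intelligible initial output, yet may benefit from further iterative refinement (i.e., iteratively mapping a candidate output to an improved one) to ensure that the desired quality is achieved. Iterative refinement typically involves training a refinement model that relies on domain-specific data (e.g., Reid and Neubig (2022); Schick et al. (2022a); Welleck et al. (2022)). Other approaches that rely on external supervision or reward models require large training sets or expensive human annotations (Madaan et al., 2021; Ouyang et al., 2022), which may not always be feasible to obtain. These limitations underscore the need for an effective refinement approach that can be applied to various tasks without requiring extensive supervision.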

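Iterative self-refinement is a fundamental characteristic of human problem-solving (Simon, 1962; Flower and Hayes, 1981; Amabile, 1983). It is a process that involves creating an initial draft and subsequently refining it based on self-provided feedback. For example, when drafting an email to request a document from a colleague, an individual may initially write a direct request such as "Send me the data ASAP". Upon reflection, however, the writer recognizes the potential impoliteness of the phrasing and revises it to "Hi Ashley, could you please send me the data as soon as you can?". When writing code, a programmer may first write a "quick and dirty" implementation, and then, upon reflection, refactor the code into a solution that is more efficient and readable. In this paper, we demonstrate that LLMs can perform iterative self-refinement without additional training, leading to higher-quality outputs on a wide range of tasks.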

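We present Self-Refine: an iterative self-refinement algorithm that alternates between two generative steps, feedback and refine. These steps work in tandem to generate high-quality outputs. Given an initial output generated by a model $\mathcal{M}$, we pass it back to the same model $\mathcal{M}$ to get feedback. Then, the feedback is passed back to the same model to refine the previously generated draft. This process is repeated either for a specified number of iterations or until $\mathcal{M}$ determines that no further refinement is necessary. We use few-shot prompting (Brown et al., 2020) to guide $\mathcal{M}$ to both generate feedback and incorporate the feedback into an improved draft. Figure 1 illustrates the high-level idea: Self-Refine uses the same underlying language model to generate feedback and refine its outputs.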

Figure 1: Given an input (⓪), Self-Refine starts by generating an output and passing it back to the same model $\mathcal{M}$ to get feedback (①). The feedback is passed back to $\mathcal{M}$, which refines the previously generated output (②). Steps (①) and (②) iterate until a stopping condition is met. Self-Refine is instantiated with a language model such as GPT-3.5 and does not involve human assistance.

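We evaluate Self-Refine on 7 generation tasks that span diverse domains, including natural language and source-code generation. We show that Self-Refine outperforms direct generation from strong LLMs such as GPT-3.5 (text-davinci-003 and gpt-3.5-turbo; OpenAI; Ouyang et al., 2022) and GPT-4 (OpenAI, 2023) by 5-40% absolute. In code-generation tasks, Self-Refine improves the initial generation by up to 13% absolute when applied to strong code models such as Codex (code-davinci-002; Chen et al., 2021). We release all of our code, which is easily extensible to other LLMs. In essence, our results show that even when an LLM cannot generate an optimal output on its first try, it can often provide useful feedback and improve its own output accordingly. In turn, Self-Refine provides an effective way to obtain better outputs from a single model without any additional training, via iterative (self-)feedback and refinement.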

(a) Dialogue: $x$, $\mathbf{y_{t}}$

    User: I am interested in playing Table tennis.
    Response: I'm sure it's a great way to socialize, stay active

(b) Feedback $\mathbf{fb}$

    Engaging: Provides no information about table tennis or how to play it.
    User understanding: Lacks understanding of user's needs and state of mind.

(c) Refine $\mathbf{y_{t+1}}$

    Response (refined): That's great to hear (…) ! It's a fun sport requiring quick reflexes and good hand-eye coordination. Have you played before, or are you looking to learn?

(d) Code optimization: $x$, $\mathbf{y_{t}}$

    Generate sum of 1, …, N

    def sum(n):
        res = 0
        for i in range(n+1):
            res += i
        return res

(e) Feedback $\mathbf{fb}$

    This code is slow as it uses brute force. A better approach is to use the formula … (n(n+1))/2.

(f) Refine $\mathbf{y_{t+1}}$

    Code (refined)

    def sum_faster(n):
        return (n*(n+1))//2

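Figure 2: Examples of Self-Refine: an initial output is generated by the base LLM, passed back to the same LLM to receive feedback, and then passed to the same LLM again to refine the output. The top row illustrates dialogue generation, where applying feedback turns an initial dialogue response into one that is more engaging and shows a better understanding of the user. The bottom row illustrates code optimization, where applying feedback makes the code more efficient.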
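As a quick sanity check on the code-optimization example in Figure 2, the sketch below runs the loop-based program from panel (d) against the closed-form version from panel (f) and confirms that they agree. The name `sum_slow` is our own renaming (an assumption, chosen to avoid shadowing Python's built-in `sum`); `sum_faster` is taken from the figure.

```python
def sum_slow(n):
    """Initial output y_t from Figure 2(d): brute-force loop over 0..n."""
    res = 0
    for i in range(n + 1):
        res += i
    return res


def sum_faster(n):
    """Refined output y_{t+1} from Figure 2(f): closed form n(n+1)/2."""
    return (n * (n + 1)) // 2


# The refinement should change efficiency, not behaviour.
agree = all(sum_slow(n) == sum_faster(n) for n in range(200))
print(agree)
```

The loop performs a number of additions proportional to `n`, whereas the closed form uses a constant number of arithmetic operations, which is the improvement the feedback in panel (e) asks for.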

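2 Iterative Refinement with Self-Refine

Given an input sequence, Self-Refine generates an initial output, provides feedback on the output, and refines the output according to the feedback. Self-Refine iterates between feedback and refinement until a desired condition is met. Self-Refine relies on a suitable language model and three prompts (for initial generation, feedback, and refinement), and does not require training. Self-Refine is shown in Figure 1 and Algorithm 1. Next, we describe Self-Refine in more detail.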

Initial generation

Given an input $x$, a prompt $p_{\text{gen}}$, and a model $\mathcal{M}$, Self-Refine generates an initial output $y_{0}$:

$$y_{0}=\mathcal{M}\left(p_{\text{gen}}\|x\right). \tag{1}$$

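For example, in Figure 2(d), the model generates functionally correct code for the given input. Here, $p_{\text{gen}}$ is a task-specific few-shot prompt (or instruction) for initial generation, and $\|$ denotes concatenation. The few-shot prompt contains input-output pairs $\langle x^{(k)},y^{(k)}\rangle$ for the task.²

² Few-shot prompting (also referred to as "in-context learning") provides a model with a prompt consisting of $k$ in-context examples of the target task, each in the form of an input-output pair $\langle x_{i},y_{i}\rangle$ (Brown et al., 2020).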

Feedback

Next, Self-Refine uses the same model $\mathcal{M}$ to provide feedback $fb_{t}$ on its own output, given a task-specific prompt $p_{\text{fb}}$ for generating feedback:

$$fb_{t}=\mathcal{M}\left(p_{\text{fb}}\|x\|y_{t}\right). \tag{2}$$

Algorithm 1 Self-Refine algorithm

Require: input $x$, model $\mathcal{M}$, prompts $\{p_{\text{gen}},p_{\text{fb}},p_{\text{refine}}\}$, stop condition $\mathrm{stop}(\cdot)$
$y_{0}=\mathcal{M}(p_{\text{gen}}\|x)$    $\triangleright$ Initial generation (Eqn. 1)
for iteration $t\in 0,1,\ldots$ do
    $fb_{t}=\mathcal{M}\left(p_{\text{fb}}\|x\|y_{t}\right)$    $\triangleright$ Feedback (Eqn. 2)
    if $\mathrm{stop}(fb_{t},t)$ then    $\triangleright$ Stop condition
        break
    else
        $y_{t+1}=\mathcal{M}\left(p_{\text{refine}}\|x\|y_{0}\|fb_{0}\|...\|y_{t}\|fb_{t}\right)$    $\triangleright$ Refine (Eqn. 4)
    end if
end for
return $y_{t}$

Figure 3: The Self-Refine algorithm. See (§2) for a discussion of each component.
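Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released implementation: `model` is any callable that maps a prompt string to a completion string; `self_refine`, `toy_model`, and `toy_stop` are hypothetical names; `max_iters` is an assumed iteration cap; and plain newline joining stands in for the concatenation operator $\|$.

```python
from typing import Callable, List, Tuple


def self_refine(
    x: str,
    model: Callable[[str], str],
    p_gen: str,
    p_fb: str,
    p_refine: str,
    stop: Callable[[str, int], bool],
    max_iters: int = 4,
) -> str:
    """Alternate feedback and refine steps with a single model."""
    y = model("\n".join([p_gen, x]))              # initial generation (Eqn. 1)
    history: List[Tuple[str, str]] = []
    for t in range(max_iters):
        fb = model("\n".join([p_fb, x, y]))       # feedback (Eqn. 2)
        if stop(fb, t):                           # stop condition
            break
        history.append((y, fb))
        past = "\n".join(f"{y_i}\n{fb_i}" for y_i, fb_i in history)
        y = model("\n".join([p_refine, x, past]))  # refine with history (Eqn. 4)
    return y


def toy_model(prompt: str) -> str:
    """Offline stand-in for an LLM (purely illustrative, assumption)."""
    lines = prompt.splitlines()
    if lines[0] == "GEN":
        return "draft"
    if lines[0] == "FB":
        # Approve once the latest output (last line) has been emphasised.
        return "STOP" if lines[-1].endswith("!") else "add emphasis"
    # Refine: the latest output is the second-to-last line of the prompt.
    return lines[-2] + "!"


def toy_stop(fb: str, t: int) -> bool:
    return fb.strip() == "STOP"


result = self_refine("hello", toy_model, "GEN", "FB", "REFINE", toy_stop)
print(result)
```

With a real LLM, `model` would wrap an API call and the three prompts would hold the few-shot examples described in this section; the control flow stays the same.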

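Intuitively, the feedback may address multiple aspects of the output. For example, in code optimization, the feedback might address the efficiency, readability, and overall quality of the code.

Here, the prompt $p_{\text{fb}}$ provides examples of feedback in the form of input-output-feedback triples $\langle x^{(k)},y^{(k)},fb^{(k)}\rangle$. We prompt the model to write feedback that is actionable and specific via $fb^{(k)}$. By "actionable", we mean the feedback should contain a concrete action that would likely improve the output. By "specific", we mean the feedback should identify concrete phrases in the output to change. For example, the feedback in Figure 2(e) is "This code is slow as it uses a for loop which is brute force. A better approach is to use the formula … (n(n+1))/2". This feedback is actionable, since it suggests the action "use the formula …". The feedback is specific since it mentions the "for loop".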

Refine

Next, Self-Refine uses $\mathcal{M}$ to refine its most recent output, given its own feedback:

$$y_{t+1}=\mathcal{M}\left(p_{\text{refine}}\|x\|y_{t}\|fb_{t}\right). \tag{3}$$

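For example, in Figure 2(f), given the initial output and the generated feedback, the model generates a re-implementation that is shorter and runs much faster than the initial implementation. The prompt $p_{\text{refine}}$ provides examples of improving the output based on the feedback, in the form of input-output-feedback-refined quadruples $\langle x^{(k)},y_{t}^{(k)},fb_{t}^{(k)},y_{t+1}^{(k)}\rangle$.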

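Self-Refine alternates between the feedback and refine steps until a stopping condition is met. The stopping condition $\mathrm{stop}(fb_{t},t)$ either stops at a specified timestep $t$, or extracts a stopping indicator (e.g., a scalar stop score) from the feedback. In practice, the model can be prompted to generate a stopping indicator in $p_{\text{fb}}$, and the condition is determined per task.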
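One way to realise $\mathrm{stop}(fb_{t},t)$ is sketched below. The feedback format (`Score: <n>/10`), the threshold, and the iteration cap are all assumptions made for illustration; the text above only states that the indicator is extracted from the feedback and chosen per task.

```python
import re


def make_stop(max_iters: int = 4, threshold: int = 9):
    """Build a stop(fb, t) function combining both criteria from the text."""
    pattern = re.compile(r"score:\s*(\d+)\s*/\s*10", re.IGNORECASE)

    def stop(fb: str, t: int) -> bool:
        # Criterion 1: a fixed iteration budget.
        if t >= max_iters:
            return True
        # Criterion 2: a scalar stop score parsed from the feedback text.
        match = pattern.search(fb)
        return match is not None and int(match.group(1)) >= threshold

    return stop


stop = make_stop()
print(stop("Looks good. Score: 9/10", 0))
print(stop("Too slow. Score: 4/10", 1))
```

Feedback that carries no parsable score never triggers the second criterion, so the iteration budget acts as the fallback.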

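To inform the model about the previous iterations, we retain the history of previous feedback and outputs by appending them to the prompt. Intuitively, this allows the model to learn from past mistakes and avoid repeating them. More precisely, Equation 3 is in fact instantiated as:

$$y_{t+1}=\mathcal{M}\left(p_{\text{refine}}\|x\|y_{0}\|fb_{0}\|...\|y_{t}\|fb_{t}\right). \tag{4}$$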
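The difference between Equations 3 and 4 is only in what gets concatenated into the refine prompt. A small sketch of that concatenation follows; `build_refine_prompt` and the `sep` parameter are hypothetical, and the separator stands in for $\|$.

```python
from typing import List, Tuple


def build_refine_prompt(
    p_refine: str,
    x: str,
    history: List[Tuple[str, str]],
    sep: str = "\n\n",
) -> str:
    """Concatenate p_refine, x, and every (y_i, fb_i) pair so far (Eqn. 4)."""
    parts = [p_refine, x]
    for y_i, fb_i in history:
        parts.append(y_i)
        parts.append(fb_i)
    return sep.join(parts)


history = [("y0", "fb0"), ("y1", "fb1")]
full = build_refine_prompt("p_refine", "x", history, sep=" || ")
latest_only = build_refine_prompt("p_refine", "x", history[-1:], sep=" || ")
print(full)         # p_refine || x || y0 || fb0 || y1 || fb1
print(latest_only)  # p_refine || x || y1 || fb1
```

Passing only the last pair recovers the form of Equation 3, while passing the whole list gives Equation 4. The prompt grows with every iteration, so the model's context window bounds how many iterations can be kept.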

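Finally, we use the last refinement $y_{t}$ as the output of Self-Refine.

Algorithm 1 summarizes Self-Refine, and Figure 2 shows examples of Self-Refine in the dialogue response generation (Mehri and Eskenazi, 2020) and code optimization (Madaan et al., 2023) tasks. Appendix S provides examples of the $p_{\text{gen}}$, $p_{\text{fb}}$, and $p_{\text{refine}}$ prompts for various tasks. The key idea is that Self-Refine uses the same underlying LLM to generate, get feedback, and refine its outputs given its own feedback. It relies only on the supervision present in the few-shot examples.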

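3 Evaluation

We evaluate Self-Refine on 7 diverse tasks: dialogue response generation (Appendix M; Mehri and Eskenazi, 2020), code optimization (Appendix N; Madaan et al., 2023), code readability improvement (Appendix L; Puri et al., 2021), math reasoning (Appendix O; Cobbe et al., 2021), and sentiment reversal (Appendix P; Zhang et al., 2015). We also introduce two new tasks: acronym generation (Appendix Q) and constrained generation (a harder version of Lin et al. (2020) with 20-30 keyword constraints instead of 3-5; Appendix R).

Examples for all tasks and dataset statistics are provided in Table 4 (Appendix A).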

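3.1 Instantiating Self-Refine

We instantiate Self-Refine following the high-level description in Section 2. The feedback-refine iterations continue until the desired output quality or a task-specific criterion is reached, up to a maximum of 4 iterations. To make our evaluation consistent across different models, we implemented both feedback and refine as few-shot prompts, even with models that respond well to instructions, such as ChatGPT and GPT-4.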

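Base LLMs

Our main goal is to evaluate whether we can improve the performance of any strong base LLM using Self-Refine. Therefore, we compare Self-Refine to the same base LLMs but without feedback-refine iterations. We used three main strong base LLMs across all tasks: GPT-3.5 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 (OpenAI, 2023). For code-based tasks, we also experimented with Codex (code-davinci-002). In all tasks, either GPT-3.5 or GPT-4 is the previous state-of-the-art.³ We used the same prompts from previous work when available (such as for code optimization and math reasoning); otherwise, we created prompts as detailed in Appendix S. We use greedy decoding with a temperature of 0.7 for all setups.

³ A comparison with other few-shot and fine-tuned approaches is provided in Appendix F.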

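3.2 Metrics

We report three types of metrics:

$\bullet$ Task-specific metric: When available, we use automated metrics from prior work (math reasoning: % solve rate; code optimization: % programs optimized; constrained generation: coverage %).

$\bullet$ Human-pref: In dialogue response generation, code readability improvement, sentiment reversal, and acronym generation, since no automated metrics are available, we perform a blind human A/B evaluation on a subset of the outputs to select the preferred output. Additional details are provided in Appendix C.

$\bullet$ GPT-4-pref: In addition to human-pref, we use GPT-4 as a proxy for human preference following prior work (Fu et al., 2023; Chiang et al., 2023; Geng et al., 2023; Sun et al., 2023), and found a high correlation with human-pref. For code readability improvement, we prompt GPT-4 to calculate the fraction of the variables that are appropriately named given the context (e.g., $\texttt{x = []}\rightarrow\texttt{input\_buffer = []}$). Additional details are provided in Appendix D.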

Task	GPT-3.5 Base	GPT-3.5 +Self-Refine	ChatGPT Base	ChatGPT +Self-Refine	GPT-4 Base	GPT-4 +Self-Refine
Sentiment Reversal	8.8	30.4 ( $\uparrow$ 21.6)	11.4	43.2 ( $\uparrow$ 31.8)	3.8	36.2 ( $\uparrow$ 32.4)
Dialogue Response	36.4	63.6 ( $\uparrow$ 27.2)	40.1	59.9 ( $\uparrow$ 19.8)	25.4	74.6 ( $\uparrow$ 49.2)
Code Optimization	14.8	23.0 ( $\uparrow$ 8.2)	23.9	27.5 ( $\uparrow$ 3.6)	27.3	36.0 ( $\uparrow$ 8.7)
Code Readability	37.4	51.3 ( $\uparrow$ 13.9)	27.7	63.1 ( $\uparrow$ 35.4)	27.4	56.2 ( $\uparrow$ 28.8)
Math Reasoning	64.1	64.1 (0)	74.8	75.0 ( $\uparrow$ 0.2)	92.9	93.1 ( $\uparrow$ 0.2)
Acronym Generation	41.6	56.4 ( $\uparrow$ 14.8)	27.2	37.2 ( $\uparrow$ 10.0)	30.4	56.0 ( $\uparrow$ 25.6)
Constrained Generation	28.0	37.0 ( $\uparrow$ 9.0)	44.0	67.0 ( $\uparrow$ 23.0)	15.0	45.0 ( $\uparrow$ 30.0)

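Table 1: Self-Refine results on various tasks using GPT-3.5, ChatGPT, and GPT-4 as base LLMs. Self-Refine consistently improves the LLM. Metrics used for these tasks are defined in Section 3.2.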
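The values in parentheses in Table 1 are absolute differences between the base score and the score after refinement. The sketch below recomputes them for the GPT-4 columns and averages them across the seven tasks; the dictionary name and layout are ours, and the numbers are copied from the table.

```python
# (base score, score with Self-Refine) for GPT-4, copied from Table 1.
gpt4 = {
    "Sentiment Reversal": (3.8, 36.2),
    "Dialogue Response": (25.4, 74.6),
    "Code Optimization": (27.3, 36.0),
    "Code Readability": (27.4, 56.2),
    "Math Reasoning": (92.9, 93.1),
    "Acronym Generation": (30.4, 56.0),
    "Constrained Generation": (15.0, 45.0),
}

# Absolute gain per task, rounded to one decimal as in the table.
gains = {task: round(refined - base, 1) for task, (base, refined) in gpt4.items()}
mean_gain = round(sum(gains.values()) / len(gains), 1)

for task, gain in gains.items():
    print(f"{task}: +{gain}")
print(f"mean absolute gain: {mean_gain}")
```

The same computation applies to the GPT-3.5 and ChatGPT columns by swapping in their values.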

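3.3 Results

Table 1 shows our main results:

Self-Refine consistently improves over base models across all model sizes, and additionally outperforms the previous state-of-the-art across all tasks. For example, GPT-4+Self-Refine improves over the base GPT-4 by 8.7% (absolute) in code optimization, increasing the optimization percentage from 27.3% to 36.0%. Confidence intervals are provided in Appendix J. For code-related tasks, we found similar trends when using Codex; those results are included in Appendix F.

One of the tasks in which we observe the highest gains compared to the base models is constrained generation, where the model is asked to generate a sentence containing up to 30 given concepts. We believe that this task benefits significantly from Self-Refine because there are more opportunities to miss some of the concepts on the first attempt, and thus Self-Refine allows the model to fix these mistakes subsequently. Further, this task has an extremely large number of reasonable outputs, and thus Self-Refine allows the model to better explore the space of possible outputs.

In preference-based tasks such as dialogue response generation, sentiment reversal, and acronym generation, Self-Refine leads to especially high gains. For example, in dialogue response generation, the GPT-4 preference score improves by 49.2% (absolute), from 25.4% to 74.6%. Similarly, we see remarkable improvements in the other preference-based tasks across all models.

The modest performance gains in math reasoning can be traced back to the inability to accurately identify whether there is any error. In math, errors can be nuanced and sometimes limited to a single line or an incorrect operation. Besides, a consistent-looking reasoning chain can deceive LLMs into thinking that "everything looks good" (e.g., ChatGPT feedback for 94% of instances is "everything looks good"). In Section H.1, we show that the gains with Self-Refine on math reasoning are much larger (5%+) if an external source can identify whether the current math answer is incorrect.

Improvement is consistent across base LLM sizes. Generally, GPT-4+Self-Refine performs better than GPT-3.5+Self-Refine and ChatGPT+Self-Refine across all tasks, even in tasks where the initial base results of GPT-4 were lower than those of GPT-3.5 or ChatGPT. We thus believe that Self-Refine allows stronger models (such as GPT-4) to unlock their full potential, even in cases where this potential is not expressed in standard, single-pass output generation. A comparison with other strong baselines is provided in Appendix F.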

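4 Analysis

The three main steps of Self-Refine are feedback, refine, and repeating them iteratively. In this section, we perform additional experiments to analyze the importance of each of these steps.

Task	Self-Refine feedback	Generic feedback	No feedback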
Code Optimization	27.5	26.0	24.8
Sentiment Reversal	43.2	31.2	0
Acronym Generation	56.4	54.0	48.0

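Table 2: Prompting to generate generic feedback (or having the model generate no feedback at all) leads to reduced scores, indicating the importance of the feedback step of Self-Refine. These experiments were performed with ChatGPT (code optimization and sentiment reversal) and GPT-3.5 (acronym generation), and the metrics used are defined in Section 3.2.

Impact of the feedback quality

Feedback quality plays a crucial role in Self-Refine. To quantify its impact, we compare Self-Refine, which uses specific, actionable feedback, with two ablations: one using generic feedback and another without feedback (the model may still iteratively refine its generations, but is not explicitly given feedback to do so). For example, in the code optimization task, actionable feedback such as "Avoid repeated calculations in the for loop" pinpoints an issue and suggests a clear improvement. Generic feedback, like "Improve the efficiency of the code", lacks this precision and direction. Table 2 shows the clear influence of feedback.

In code optimization, performance dips slightly from 27.5 (Self-Refine feedback) to 26.0 (generic feedback), and further to 24.8 (no feedback). This suggests that while generic feedback offers some guidance, specific, actionable feedback yields superior results.

This effect is more pronounced in tasks like sentiment reversal, where changing from our feedback to generic feedback leads to a significant performance drop (43.2 to 31.2), and the task fails without feedback. Similarly, in acronym generation, without actionable feedback, performance drops from 56.4 to 48.0, even with iterative refinements. These results highlight the importance of specific, actionable feedback in our approach. Even generic feedback provides some benefit, but the best results are achieved with targeted, constructive feedback.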

Task	$y_{0}$	$y_{1}$	$y_{2}$	$y_{3}$
Code Opt.	22.0	27.0	27.9	28.8
Sentiment Rev.	33.9	34.9	36.1	36.8
Constrained Gen.	29.0	40.3	46.7	49.7

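Figure 4: Left: Iteration-wise score improvements. Early iterations significantly improve output quality, and scores generally keep improving with more iterations. Right: Self-Refine performance improvements with iterations. Most gains ($\Delta$) are in the initial iterations for both Code Opt. and Sentiment Reversal. The numbers are averaged over ChatGPT, GPT-3.5, and GPT-4. Task abbreviations: C. Opt. (Code Optimization), S. Rev. (Sentiment Reversal), C. Gen. (Constrained Generation).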

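How important are the multiple iterations of feedback-refine?

Figure 4 demonstrates that, on average, the quality of the output improves as the number of iterations increases. For instance, in the code optimization task, the initial output ($y_{0}$) has a score of 22.0, which improves to 28.8 after three iterations ($y_{3}$). Similarly, in the sentiment reversal task, the initial output has a score of 33.9, which increases to 36.8 after three iterations. This trend of improvement is also evident in constrained generation, where the score increases from 29.0 to 49.7 after three iterations. Figure 4 highlights the diminishing returns in the improvement as the number of iterations increases. Overall, having multiple feedback-refine iterations significantly enhances the quality of the output, although the marginal improvement naturally decreases with more iterations.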
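To make the diminishing returns concrete, the sketch below computes the per-iteration gains $\Delta$ from the scores reported in Figure 4. The variable names are ours, and the numbers are copied from the figure.

```python
# Scores for y_0 .. y_3, copied from Figure 4.
scores = {
    "Code Opt.": [22.0, 27.0, 27.9, 28.8],
    "Sentiment Rev.": [33.9, 34.9, 36.1, 36.8],
    "Constrained Gen.": [29.0, 40.3, 46.7, 49.7],
}

# Marginal gain of each iteration over the previous one.
deltas = {
    task: [round(later - earlier, 1) for earlier, later in zip(ys, ys[1:])]
    for task, ys in scores.items()
}

for task, ds in deltas.items():
    print(task, ds)
```

For code optimization and constrained generation the first iteration contributes the largest gain (5.0 and 11.3 points). For sentiment reversal the steps are closer in size (1.0, 1.2, 0.7), with the final iteration contributing the least.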

⬇

# Slower code
def solve(amount):
    best_price = (amount + 199) // 200 * 380
    # First loop
    for a in range(amount // 200 + 1):
        # … 4 nested loops …
            for c1 in range(amount // 1500 + 1):
                if a*200 + b*300 == amount:
                    price = a*380 + b*550
                    if price < best_price:
                        best_price = price
    return best_price

⬇

# Faster code
def solve(amount):
    coins = [200, 300]
    prices = [380, 550]
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for i in range(len(coins)):
        for j in range(coins[i], amount+1):
            dp[j] = min(dp[j], dp[j - coins[i]] + prices[i])
    return dp[amount]

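Figure 5: Comparison of code generated by Madaan et al. (2023) (left) and the output after applying Self-Refine (right). The initial code by the baseline, which is nearly identical to the slower input program, fails to improve the efficiency and merely alters the logic for reading the input. Self-Refine first generates feedback that diagnoses that "this code is slow because it is using six nested loops to iterate through all possible combinations of coins to pay the amount", and suggests that "a more efficient approach would be …". Self-Refine then uses this feedback to generate the revised code (right), reducing the time complexity to $\mathcal{O}(amount*coins)$. The full example is provided in Appendix H.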

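Performance may not always increase monotonically with iterations: in multi-aspect feedback tasks like acronym generation, the output quality can vary during iterations, improving in one aspect while declining in another. To counter this, Self-Refine generates numerical scores for different quality aspects, leading to a balanced evaluation and appropriate output selection.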
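A simple way to use such per-aspect scores for output selection is sketched below. The equal weighting, the aspect names, the candidate list, and all score values are assumptions made for illustration; `select_best` is a hypothetical helper, not the released implementation.

```python
from typing import Dict, List, Tuple


def select_best(candidates: List[Tuple[str, Dict[str, int]]]) -> str:
    """Return the output whose summed aspect scores are highest."""
    best_output, _ = max(candidates, key=lambda item: sum(item[1].values()))
    return best_output


# One (output, aspect scores) pair per iteration; the scores are invented.
candidates = [
    ("RDR", {"pronunciation": 2, "spelling": 4, "relation": 3, "connotation": 3}),
    ("RADAR", {"pronunciation": 5, "spelling": 4, "relation": 4, "connotation": 4}),
    ("RADRA", {"pronunciation": 3, "spelling": 3, "relation": 4, "connotation": 3}),
]

best = select_best(candidates)
print(best)
```

Selecting over all iterations rather than returning the last one guards against a late refinement that improves one aspect at the expense of the others.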

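Can we just generate multiple outputs instead of refining?

Does Self-Refine improve because of the iterative refinement, or just because it generates more outputs? We compare Self-Refine with ChatGPT when ChatGPT generates $k=4$ samples (but without feedback and refinement). Then, we compare the performance of Self-Refine against these $k$ initial outputs in a 1 vs. $k$ evaluation. In other words, we assess whether Self-Refine can outperform all $k$ initial outputs. The results of this experiment are shown in Figure 9 (Appendix H). Despite the increased difficulty of the 1 vs. $k$ setting, the outputs of Self-Refine are still preferred by humans over all $k$ initial outputs. This shows the importance of refinement according to feedback over the alternative of just generating multiple initial outputs.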

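Does Self-Refine work with weaker models?

The experiments in Section 3.3 were performed with some of the strongest available models; does Self-Refine work with smaller or weaker models as well? To investigate this, we instantiated Self-Refine with Vicuna-13B (Chiang et al., 2023), a less powerful base model. While Vicuna-13B is capable of generating initial outputs, it struggles significantly with the refinement process. Specifically, Vicuna-13B was not able to consistently generate the feedback in the required format. Furthermore, even when provided with Oracle or hard-coded feedback, it often failed to adhere to the prompts for refinement. Instead of refining its output, Vicuna-13B either repeated the same output or generated a hallucinated conversation, rendering the outputs less effective. We thus hypothesize that since Vicuna-13B was trained on conversations, it does not generalize as well as instruction-based models to test-time few-shot tasks. Example output and analysis are provided in Appendix G.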

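Qualitative Analysis

We conduct a qualitative analysis of the feedback generated by Self-Refine and its subsequent refinements. We manually analyze 70 samples in total (35 success cases and 35 failure cases) for code optimization (Madaan et al., 2023) and math reasoning (Cobbe et al., 2021). For both math reasoning and code optimization, we found that the feedback was predominantly actionable, with the majority identifying problematic aspects of the original generation and suggesting ways to rectify them.

When Self-Refine failed to improve the original generation, the majority of issues were due to erroneous feedback rather than faulty refinements. Specifically, 33% of unsuccessful cases were due to feedback inaccurately pinpointing the error's location, while 61% were a result of feedback suggesting an inappropriate fix. Only 6% of failures were due to the refiner incorrectly implementing good feedback. These observations highlight the vital role of accurate feedback in Self-Refine.

In successful cases, the refiner was guided by accurate and useful feedback to make precise fixes to the original generation in 61% of the cases. Interestingly, the refiner was capable of rectifying issues even when the feedback was partially incorrect, which was the situation in 33% of successful cases. This suggests resilience to sub-optimal feedback. Future research could focus on examining the refiner's robustness to various types of feedback errors and exploring ways to enhance this resilience. In Figure 5, we illustrate how Self-Refine significantly improves program efficiency by transforming a brute-force approach into a dynamic programming solution, as a result of insightful feedback. Additional analysis on other datasets such as dialogue response generation is provided in Appendix H.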

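Going Beyond Benchmarks

While our evaluation focuses on benchmark tasks, Self-Refine is designed with broader applicability in mind. We explore this in a real-world use case of website generation, where the user provides a high-level goal and Self-Refine assists in iteratively developing the website. Starting from a rudimentary initial design, Self-Refine refines HTML, CSS, and JS to improve the website in terms of both usability and aesthetics. This demonstrates the potential of Self-Refine in real-world, complex, and creative tasks. See Appendix I for examples and further discussion, including the broader societal impact of our work.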

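5 Related work

Leveraging human- and machine-generated natural language (NL) feedback for refining outputs has been effective for a variety of tasks, including summarization (Scheurer et al., 2022), script generation (Tandon et al., 2021), program synthesis (Le et al., 2022a; Yasunaga and Liang, 2020), and other tasks (Bai et al., 2022a; Schick et al., 2022b; Saunders et al., 2022a; Bai et al., 2022b; Welleck et al., 2022). Refinement methods differ in the source and format of feedback, and the way that a refiner is obtained. Table 3 summarizes some related approaches; see Appendix B for additional discussion.

Source of feedback.

Humans have been an effective source of feedback (Tandon et al., 2021; Elgohary et al., 2021; Tandon et al., 2022; Bai et al., 2022a). Since human feedback is costly, several approaches use a scalar reward function as a surrogate of (or alternative to) human feedback (e.g., Bai et al. (2022a); Liu et al. (2022); Lu et al. (2022); Le et al. (2022a); Welleck et al. (2022)). Alternative sources such as compilers (Yasunaga and Liang, 2020) or Wikipedia edits (Schick et al., 2022b) can provide domain-specific feedback. Recently, LLMs have been used to generate feedback for general domains (Fu et al., 2023; Peng et al., 2023; Yang et al., 2022). However, ours is the only method that generates feedback using an LLM on its own output, for the purpose of refining with the same LLM.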

	Supervision-free refiner	Supervision-free feedback	Multi-aspect feedback	Iterative
Learned refiners: PEER Schick et al. (2022b), Self-critique Saunders et al. (2022b), CodeRL Le et al. (2022b), Self-correction Welleck et al. (2022).		or		or
Prompted refiners: Augmenter Peng et al. (2023), Re³ Yang et al. (2022), Reflexion Shinn et al. (2023).		or
Self-Refine (this work)

Table 3: A comparison of Self-Refine to closely related prior refinement approaches.

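Representation of feedback.

The form of feedback can be generally divided into natural language (NL) and non-NL feedback. Non-NL feedback can come in the form of human-provided example pairs (Dasgupta et al., 2019) or scalar rewards (Liu et al., 2022; Le et al., 2022b). In this work, we use NL feedback, since this allows the model to easily provide self-feedback using the same LM that generated the output, while leveraging existing pretrained LLMs such as GPT-4.

Types of refiners.

Pairs of feedback and refinement have been used to learn supervised refiners (Schick et al., 2022b; Du et al., 2022; Yasunaga and Liang, 2020; Madaan et al., 2021). Since gathering supervised data is costly, some methods learn refiners using model generations (Welleck et al., 2022; Peng et al., 2023). However, the refiners are trained for each new domain. Finally, Yang et al. (2022) use prompted feedback and refinement specifically tailored for story generation. In this work, we avoid training a separate refiner, and show that the same model can be used as both the refiner and the source of feedback across multiple domains.

Non-refinement reinforcement learning (RL) approaches.

Rather than having explicit refinement, an alternative way to incorporate feedback is by optimizing a scalar reward function, e.g., with reinforcement learning (e.g., Stiennon et al. (2020); Lu et al. (2022); Le et al. (2022a)). These methods differ from Self-Refine in that the model does not access feedback on an intermediate generation. Second, these RL methods require updating the model's parameters, unlike Self-Refine.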

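6 Limitations and Discussion

The main limitation of our approach is that the base models need to have sufficient few-shot modeling or instruction-following abilities in order to learn to provide feedback and to refine in an in-context fashion, without having to train supervised models or rely on supervised data.

Further, the experiments in this work were performed with language models that are not open-sourced, namely GPT-3.5, ChatGPT, GPT-4, and Codex. Existing literature (Ouyang et al., 2022) does not fully describe the details of these models, such as the pretraining corpus, model sizes, and model biases. In addition, these models are not free to use, and using them for research requires some funding. Nonetheless, we release our code and model outputs to ensure the reproducibility of our work.

Another limitation of our work is that we exclusively experiment with datasets in English. In other languages, the current models may not provide the same benefits.

Finally, there is a possibility for bad actors to use prompting techniques to steer a model to generate more toxic or harmful text. Our approach does not explicitly guard against this.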

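7 Conclusion

We present Self-Refine: a novel approach that allows large language models to iteratively provide self-feedback and refine their own outputs. Self-Refine operates within a single LLM, requiring neither additional training data nor reinforcement learning. We demonstrate the simplicity and ease of use of Self-Refine across a wide variety of tasks. By showcasing the potential of Self-Refine in diverse tasks, our research contributes to the ongoing exploration and development of large language models, with the aim of reducing the cost of human creative processes in real-world settings. We hope that our iterative approach will help drive further research in this area. To this end, we make all our code, data, and prompts anonymously available at https://selfrefine.info/.

References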

Amabile (1983) Teresa M. Amabile. 1983. A Theoretical Framework. In The Social Psychology of Creativity, pages 65–96. Springer New York, New York, NY.
Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv:2204.05862.
Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
Berger et al. (2022) Emery D Berger, Sam Stern, and Juan Altmayer Pizzorno. 2022. Triangulating Python Performance Issues with SCALENE. ArXiv preprint, abs/2212.07597.
Brown et al. (2001) Lawrence D Brown, T Tony Cai, and Anirban DasGupta. 2001. Interval estimation for a binomial proportion. Statistical science, 16(2):101–133.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, Online. Curran Associates, Inc.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Dasgupta et al. (2019) Sanjoy Dasgupta, Daniel Hsu, Stefanos Poulis, and Xiaojin Zhu. 2019. Teaching a black-box learner. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1547–1555. PMLR.
Du et al. (2022) Wanyu Du, Zae Myung Kim, Vipul Raheja, Dhruv Kumar, and Dongyeop Kang. 2022. Read, revise, repeat: A system demonstration for human-in-the-loop iterative text revision. In Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022), pages 96–108, Dublin, Ireland. Association for Computational Linguistics.
Elgohary et al. (2021) Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. 2021. NL-EDIT: Correcting semantic parse errors through natural language interaction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5599–5610, Online. Association for Computational Linguistics.
Flower and Hayes (1981) Linda Flower and John R Hayes. 1981. A cognitive process theory of writing. College composition and communication, 32(4):365–387.
Fu et al. (2023) Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
Gao et al. (2022) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435.
Geng et al. (2023) Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research. Blog post.
Le et al. (2022a) Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022a. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning.
Le et al. (2022b) Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022b. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. ArXiv, abs/2207.01780.
Li et al. (2018) Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874, New Orleans, Louisiana. Association for Computational Linguistics.
Lin et al. (2020) Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2020. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online. Association for Computational Linguistics.
Liu et al. (2022) Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, and Yejin Choi. 2022. Rainier: Reinforced knowledge introspector for commonsense question answering. In Conference on Empirical Methods in Natural Language Processing.
Lu et al. (2022) Ximing Lu, Sean Welleck, Liwei Jiang, Jack Hessel, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. Quark: Controllable text generation with reinforced unlearning. ArXiv, abs/2205.13636.
Madaan et al. (2023) Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. 2023. Learning performance-improving code edits. arXiv preprint arXiv:2302.07867.
Madaan et al. (2021) Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Peter Clark, Yiming Yang, and Eduard Hovy. 2021. Think about it! improving defeasible reasoning by first modeling the question scenario. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6291–6310, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Mehri and Eskenazi (2020) Shikib Mehri and Maxine Eskenazi. 2020. Unsupervised evaluation of interactive dialog with DialoGPT. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–235, 1st virtual meeting. Association for Computational Linguistics.
Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. ArXiv preprint, abs/2203.13474.
(27) OpenAI. Model index for researchers. https://platform.openai.com/docs/model-index-for-researchers. Accessed: May 14, 2023.
OpenAI (2022) OpenAI. 2022. Model index for researchers. Blogpost.
OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv:2203.02155.
Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback.
Prabhumoye et al. (2018) Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876, Melbourne, Australia. Association for Computational Linguistics.
Press et al. (2022) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
Puri et al. (2021) Ruchir Puri, David Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, and Frederick Reiss. 2021. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655.
Reid and Neubig (2022) Machel Reid and Graham Neubig. 2022. Learning to model editing processes. arXiv preprint arXiv:2205.12374.
Saunders et al. (2022a) William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022a. Self-critiquing models for assisting human evaluators.
Saunders et al. (2022b) William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022b. Self-critiquing models for assisting human evaluators. ArXiv:2206.05802.
Scheurer et al. (2022) Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2022. Training language models with natural language feedback. ArXiv:2204.14146.
Schick et al. (2022a) Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022a. Peer: A collaborative language model.
Schick et al. (2022b) Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022b. Peer: A collaborative language model. ArXiv, abs/2208.11663.
Shinn et al. (2023) Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection.
Simon (1962) Herbert A. Simon. 1962. The architecture of complexity. Proceedings of the American Philosophical Society, 106(6):467–482.
Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc.
Sun et al. (2023) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047.
Tandon et al. (2021) Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, and Yiming Yang. 2021. Interscript: A dataset for interactive learning of scripts through error feedback. arXiv preprint arXiv:2112.07867.
Tandon et al. (2022) Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 339–352.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903.
Welleck et al. (2022) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053.
Yang et al. (2022) Kevin Yang, Nanyun Peng, Yuandong Tian, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. In Conference on Empirical Methods in Natural Language Processing.
Yasunaga and Liang (2020) Michihiro Yasunaga and Percy Liang. 2020. Graph-based, self-supervised program repair from diagnostic feedback. 37th Int. Conf. Mach. Learn. ICML 2020, PartF168147-14:10730–10739.
Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.

Appendix A Evaluation Tasks

Table 4 lists the tasks in our evaluation, along with examples from each task.

Task and Description	Sample one iteration of feedback-refine
Sentiment Reversal: Rewrite reviews to reverse sentiment. Dataset: (Zhang et al., 2015), 1000 review passages	$x$: "The food was fantastic…" $y_{t}$: "The food was disappointing…" $fb$: Increase negative sentiment $y_{t+1}$: "The food was utterly terrible…"
Dialogue Response Generation: Produce rich conversational responses. Dataset: (Mehri and Eskenazi, 2020), 372 conv.	$x$: "What's the best way to cook pasta?" $y_{t}$: "The best way to cook pasta is to…" $fb$: Make response relevant, engaging, safe $y_{t+1}$: "Boil water, add salt, and cook pasta…"
Code Optimization: Enhance Python code efficiency. Dataset: (Madaan et al., 2023), 1000 programs	$x$: Nested loop for matrix product $y_{t}$: NumPy dot product function $fb$: Improve time complexity $y_{t+1}$: Use NumPy's optimized matmul function
Code Readability Improvement: Refactor Python code for readability. Dataset: (Puri et al., 2021), 300 programs*	$x$: Unclear variable names, no comments $y_{t}$: Descriptive names, comments $fb$: Enhance variable naming; add comments $y_{t+1}$: Clear variables, meaningful comments
Math Reasoning: Solve math reasoning problems. Dataset: (Cobbe et al., 2021), 1319 questions	$x$: "Olivia has $23, buys 5 bagels at $3 each" $y_{t}$: Solution in Python $fb$: Show step-by-step solution $y_{t+1}$: Solution with detailed explanation
Acronym Generation: Generate acronyms for a given title. Dataset: (Appendix Q), 250 acronyms	$x$: "Radio Detecting and Ranging" $y_{t}$: RDR $fb$: be context relevant; easy pronunciation $y_{t+1}$: "RADAR"
Constrained Generation: Generate sentences with given keywords. Dataset: (Lin et al., 2020), 200 samples	$x$: beach, vacation, relaxation $y_{t}$: During our beach vacation… $fb$: Include keywords; maintain coherence $y_{t+1}$: .. beach vacation was filled with relaxation

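Table 4: An overview of the tasks on which we evaluate Self-Refine, along with their associated datasets and sizes. For every task, we demonstrate a single iteration of refinement with the input $x$, the previously generated output $y_{t}$, the generated feedback $fb_{t}$, and the refinement $y_{t+1}$. Sample prompts for feedback and refinement are provided in Appendix S.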

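Appendix B Broader Related Work

Compared to the concurrent work Reflexion (Shinn et al., 2023), our approach involves correction using feedback, whereas their setup involves finding the next best solution in planning using ReAct. While ReAct and Reflexion provide a free-form reflection on whether a step was executed correctly and on potential improvements, our approach is more granular and structured, with multi-dimensional feedback and scores. This distinction allows our method to offer more precise and actionable feedback, making it suitable for a wider range of natural language generation tasks, including those that may not necessarily involve step-by-step planning, such as open-ended dialogue generation.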

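Comparison with Welleck et al. (2022)

The closest work to ours may be Self-Correction (Welleck et al., 2022); however, Self-Correction has several disadvantages compared to Self-Refine:

1. Self-Correction does not train its model to generate explicit feedback; instead, Welleck et al. (2022) trained their models to refine only. As we show in Section 4 and Table 2, having the model generate explicit feedback results in significantly better refined outputs.

2. Self-Correction trains a separate refiner (or "corrector") for each task. In contrast, Self-Refine uses instructions and few-shot prompting, and thus does not require training a separate refiner for each task.

3. Empirically, we evaluated Self-Refine using the same GPT-3 base model as Self-Correction, and with the same settings on the GSM8K benchmark. Self-Correction achieved 45.9% accuracy, while Self-Refine (this work) achieved 55.7% ($\bm{\uparrow}$9.8).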

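Comparison with non-refinement reinforcement learning (RL) approaches.

Rather than having an explicit refinement module, an alternative way to incorporate feedback is by optimizing a scalar reward function, e.g., with reinforcement learning (e.g., Stiennon et al. (2020); Lu et al. (2022); Le et al. (2022a)). These methods differ from Self-Refine (and, more generally, from refinement-based approaches) in that the model cannot access feedback on an intermediate generation. Second, these reinforcement learning methods require updating the model's parameters, unlike Self-Refine.

See Table 5 for an additional detailed comparison of related work.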

Method	Primary novelty	Zero/few-shot improvement	Multi-aspect critics	NL feedback with error localization	Iterative framework
RLHF Stiennon et al. (2020)	optimize for human preference	trained on feedback	single (human)	(not self gen.)	
Rainier RL Liu et al. (2022)	RL to generate knowledge	trained on end task	single (accuracy)	(knowl. only)	
Quark RL Lu et al. (2022)	quantization to edit generations	trained on end task	single (scalar score)	(dense signal)	(train time iter.)
Code RL Le et al. (2022a)	actor critic RL for code improvement	trained on end task	single (unit tests)	(dense signal)	
DrRepair Yasunaga and Liang (2020)	Compiler feedback to iteratively repair	trained semi sup.	single (compiler msg)	(not self gen.)	
PEER Schick et al. (2022b)	doc. edit trained on wiki edits	trained on edits	single (accuracy)	(not self gen.)	
Self critique Saunders et al. (2022a)	few shot critique generation	feedback training	single (human)	(self gen.)	
Self-correct Welleck et al. (2022)	novel training of a corrector	trained on end task	single (task specific)	(limited setting)	(limited setting)
Const. AI Bai et al. (2022b)	train RL4F on automat (critique, revision) pair	critique training	(fixed set)		
Self-ask Press et al. (2022)	ask followup ques when interim ans correct; final wrong	few shot	none	(none)	
GPT3 score Fu et al. (2023)	GPT can score generations with instruction	few shot	single (single utility fn)	(none)	
Augmenter Peng et al. (2023)	factuality feedback from external KBs	few shot	single (factuality)	(self gen.)	
Re³ Yang et al. (2022)	$\sim$ ours: but one domain, trained critics	few shot	(trained critics)	(not self gen.)	
Self-Refine	fewshot iterative multi aspect NL fb	few shot	multiple (few shot critics)	(self gen.)	

Table 5: Summary of related approaches. Reinforcement learning approaches are shown in purple, trained corrector approaches are shown in orange, and few-shot corrector approaches are shown in green.

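Appendix C Human Evaluation

The A/B evaluation in our study was conducted by the authors: a human judge was shown an input, the task instruction, and two candidate outputs, one generated by the baseline method and one by Self-Refine. The setup was blind, i.e., the judges did not know which output came from which method. The judge was then asked to select the output that better aligned with the task instruction. For tasks that involve A/B evaluation, we compute the relative improvement as the percentage increase in preference rate. The preference rate is the proportion of times annotators selected the output produced by Self-Refine over the output of the baseline method. Table 6 shows the results.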

Task	Self-Refine (%)	Direct (%)	Either (%)
Sentiment Transfer	75.00	21.43	3.57
Acronym Generation	44.59	12.16	43.24
Response Generation	47.58	19.66	32.76

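Table 6: Relative improvement of Self-Refine in A/B evaluations across different tasks. The values are normalized preferences, i.e., the proportion of times the output generated by Self-Refine was selected as better aligned with the task instruction than the baseline output. The evaluation covered 150 examples per dataset. The judges were not aware of the method that generated each sample.

Appendix D GPT-4 Evaluation

Given GPT-4's strong ability to evaluate complex tasks and explain its reasoning, we use it as an evaluator for Self-Refine. The approach presents the task to GPT-4 in a structured way, prompting the model to deliberate on the task and give a rationale for its decision. The prompts are shown in Figures 6, 7, and 8: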

⬇

f"""Which review is aligned with the sentiment {target_sentiment}?

Review A: {review_a}

Review B: {review_b}.

Pick your answer from ['Review A', 'Review B', 'both', 'neither']. Generate a short explanation for your choice first. Then, generate 'The more aligned review is A' or 'The more aligned review is B' or 'The more aligned review is both' or 'The more aligned review is neither'.

Format: <explanation> <answer> STOP
"""

Figure 6: Prompt for GPT-4 evaluation of Sentiment Reversal.

⬇

f"""Title: {title}

Acronym A: {acronym_a}
Acronym B: {acronym_b}

Pick the better acronym for the given title. The acronyms should be compared based on the following criteria:
* Ease of pronunciation.
* Ease of spelling.
* Relation to title.
* Positive connotation.

Generate your answer in the following format:

<Short explanation>. The better acronym is A OR The better acronym is B OR The acronyms are equally good OR Neither acronym is good. STOP.
"""

Figure 7: Prompt for GPT-4 evaluation of Acronym Generation.

⬇

f"""Which response is better given this context: {context}?

Response A: {response_a}

Response B: {response_b}.

Pick your answer from ['Response A', 'Response B', 'both', 'neither']. Generate a short explanation for your choice first. Then, generate 'The better response is A' or 'The better response is B' or 'The better response is both' or 'The better response is neither'.

Format: <explanation> <answer> STOP
"""

Figure 8: Prompt for GPT-4 evaluation of Dialogue Response Generation.
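The prompts in Figures 6 to 8 ask GPT-4 to end its completion with a fixed verdict sentence followed by STOP. A minimal sketch of how such a completion could be turned into a label is given below. The function name `parse_verdict` and the regular expression are our own illustration and are not taken from the released code; the tie phrases of the acronym prompt ("The acronyms are equally good", "Neither acronym is good") are deliberately left unhandled.

```python
import re
from typing import Optional

# Verdict sentences requested by the evaluation prompts (Figures 6-8).
# Hypothetical helper: the pattern is an assumption, not the authors' parser.
VERDICT_RE = re.compile(
    r"The (?:more aligned review|better response|better acronym) is "
    r"(A|B|both|neither)\b"
)


def parse_verdict(completion: str) -> Optional[str]:
    """Return 'a', 'b', 'both', 'neither', or None if no verdict is found."""
    # Ignore anything the model wrote after the STOP marker.
    text = completion.split("STOP")[0]
    matches = VERDICT_RE.findall(text)
    # If the explanation repeats the phrase, trust the last occurrence.
    return matches[-1].lower() if matches else None


example = "Review B uses stronger negative wording. The more aligned review is B STOP"
print(parse_verdict(example))
```

Returning `None` for unparseable completions keeps malformed outputs visible to the caller instead of silently counting them as ties.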

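Appendix E Model Key

We use the terminology from https://platform.openai.com/docs/models/gpt-3-5

Appendix F Comparison of Self-Refine with State-of-the-Art Few-Shot Learning Models and Fine-Tuned Baselines

In this section, we compare Self-Refine with other few-shot models and fine-tuned baselines on math reasoning and programming tasks. Tables 8 and 7 report the performance of these models on the PIE dataset and the GSM task, respectively. The comparison shows how different model architectures and training techniques fare on these complex problems.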

Method		Solve Rate
Cobbe et al. (2021)	OpenAI 6B	20.0
Wei et al. (2022)	CoT w/ Codex	65.6
Gao et al. (2022)	PaL w/ Codex	72.0
	PaL w/ GPT-3	52.0
	PaL w/ GPT-3.5	56.8
	PaL w/ ChatGPT	74.2
	PaL w/ GPT-4	93.3
Welleck et al. (2022)	Self-Correct w/ GPT-3	45.9
Welleck et al. (2022)	Self-Correct (fine-tuned)	24.3
This work	Self-Refine w/ GPT-3	55.7
	Self-Refine w/ GPT-3.5	62.4
	Self-Refine w/ ChatGPT	75.1
	Self-Refine w/ GPT-4	94.5

Table 7: Performance comparison of models on math reasoning (Math Reasoning).

Method		%Opt
Puri et al. (2021)	Human References	38.2
OpenAI Models: OpenAI (2022, 2023)	Codex	13.1
	GPT-3.5	14.8
	ChatGPT	22.2
	GPT-4	27.3
Nijkamp et al. (2022)	CodeGen-16B	1.1
Berger et al. (2022)	scalene	1.4
	scalene (best@16)	12.6
	scalene (best@32)	19.6
Madaan et al. (2023)	pie-2B	4.4
	pie-2B (best@16)	21.1
	pie-2B (best@32)	26.3
	pie-16B	4.4
	pie-16B (best@16)	22.4
	pie-16B (best@32)	26.6
	pie-Few-shot (best@16)	35.2
	pie-Few-shot (best@32)	38.3
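This work	Self-Refine w/ GPT-3.5	23.0
	Self-Refine w/ ChatGPT	26.7
	Self-Refine w/ GPT-4	36.0

Table 8: Performance comparison of various models on the PIE dataset, measured as the percentage of programs optimized (%Opt). The table includes human references, baseline models, the fine-tuned pie-2B and pie-16B models, and our proposed method (Self-Refine) with different LLMs. Notably, Self-Refine achieves strong performance using at most 4 samples, far fewer than the 16 and 32 samples used by the other models. Scalene, an off-the-shelf optimizer, uses instruction tuning with Codex and serves as a point of comparison.

Appendix G Evaluation of Vicuna-13b

We also experiment with Vicuna-13b (Chiang et al., 2023), a version of LLaMA-13b (Touvron et al., 2023) fine-tuned on conversations sourced from the web. Vicuna-13b followed the task initialization prompt consistently. However, it struggled to follow the prompts intended for feedback and refinement. This often led to outputs that resembled assistant-like responses; a representative example is shown later in this appendix.

The prompts we used for Vicuna-13b are identical to those used for the other models in our study. The limited performance of Vicuna-13b suggests that this model may require more extensive prompt engineering to perform well.

Mixed-refine: Improving Vicuna-13b with ChatGPT

While the focus of Self-Refine is improving a model without any external help, it may be possible to use a smaller model for initialization and a larger model for refinement. To test this, we use Vicuna-13b as the initialization model and ChatGPT for feedback and refinement. The results on Math Reasoning show the promise of this approach: Vicuna-13b alone achieves only 24.18% on Math Reasoning, but improves to 40.5% in this mixed-refinement setting.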

⬇

Step 0 INIT> The food was amazing, I loved it!!.

Step 0 TRANSFER> The food was terrible, I hated it!!.

list index out of range

An error occurred: list index out of range. Traceback (most recent call last):

IndexError: list index out of range. Left retries: 2.

Step 1 TRANSFER> The food at the Trop was awful. I wish I could’ve gotten my money back for the terrible meals.

Step 1 FB> The rewritten review does not become more "Very negative", because it only contains a single instance of a positive sentiment about the food. Try again to make it Very negative!

Step 2 TRANSFER> ### Assistant: The Trop is a great choice for anyone looking for a budget-friendly option in Vegas. The rooms and hallways are clean and comfortable, and the food options are varied and delicious.

Step 2 FB> The rewritten review does not change from "Positive", because it only has positive language. The output is already Very positive. No further changes are needed.

⬇

Step 0 INIT> The food was amazing, I loved it!!. [Positive] -> [Very negative]

Step 0 TRANSFER> The food was atrocious, I absolutely hated it! [(’The review sounds very negative. The sentiment is Very negative’, 1.0)]

Step 0 FB> The original review is already "Very negative" as it uses strongly negative words like "atrocious" and "absolutely hated". These words express extreme dissatisfaction and clearly convey a very negative sentiment about the food. But try again to make it even more negative!

Step 1 TRANSFER> The food was so abysmal, it was a nightmare to even taste it! I couldn’t stand it and regretted every single bite. [(’The review sounds highly negative. The sentiment is Very negative’, 1.0)]

Step 1 FB> The review provided is indeed very negative. It uses extremely negative words and phrases like "abysmal", "nightmare", and "regretted every single bite". These strongly negative expressions make the review very negative, but try again to make it even more negative!

Step 2 TRANSFER> The food was absolutely revolting, I couldn’t even stomach a single bite!

Step 2 FB> This review is not "Very negative" because it contains positive words and phrases like "amazing" and "loved it." To make it "Very negative", we need to use extremely negative words, like "revolting", "couldn’t stomach", or "horrible". The provided "Very negative" sentiment example is a good illustration of a very negative review, as it uses strong negative language to describe the food experience.

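Vicuna-13b outputs (top) for sentiment transfer, where the goal is to make "The food was amazing, I loved it!!." negative. Vicuna-13b generates a good initial output, but produces empty feedback on the first try, which leads to an error. On the next try, it generates unhelpful feedback. The transferred review is copied from the prompt. Compare with the outputs from GPT-4 (bottom).

Appendix H Additional Analysis

Figure 9: Preference for the outputs generated by our method (Self-Refine), the multiple-sample baseline (Multi), and ties (ties).

H.1 Using Oracle Feedback

Following Welleck et al. (2022), we experimented with Oracle Feedback. This method uses correctness information to guide model refinement and enters the refine stage only when the current answer is incorrect. This adjustment notably improved performance on the Math Reasoning task, with GPT-3.5 improving by 4.8% and GPT-4 by 0.7% (Table 9). This indicates the potential of external signals to optimize model performance on particular tasks.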

Task	GPT-3.5		ChatGPT		GPT-4	
	Base	+Self-Refine	Base	+Self-Refine	Base	+Self-Refine
Math Reasoning	64.1	64.1 (0)	74.8	75.0 ( $\uparrow$ 0.2)	92.9	93.1 ( $\uparrow$ 0.2)
Math Reasoning (Oracle)	64.06	68.9 ( $\uparrow$ 4.8)	74.8	76.2 ( $\uparrow$ 1.4)	92.9	93.8 ( $\uparrow$ 0.7)

Table 9: Self-Refine results on Math Reasoning using GPT-3.5, ChatGPT, and GPT-4 as the base LLM, with and without Oracle feedback.
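The oracle-gated loop described in H.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `init`, `feedback`, and `refine` stand for calls to the LLM, the toy functions below are hypothetical stand-ins for them, and `max_iters` is an arbitrary choice.

```python
from typing import Callable


def refine_with_oracle(
    question: str,
    gold_answer: str,
    init: Callable[[str], str],
    feedback: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_iters: int = 4,  # assumed budget, not a value from the paper
) -> str:
    """Enter the refine stage only while the oracle says the answer is wrong."""
    answer = init(question)
    for _ in range(max_iters):
        if answer == gold_answer:  # oracle check: stop once the answer is correct
            break
        fb = feedback(question, answer)
        answer = refine(question, answer, fb)
    return answer


# Toy stand-ins for the model calls (hypothetical, for illustration only).
def toy_init(question: str) -> str:
    return "41"


def toy_feedback(question: str, answer: str) -> str:
    return f"{answer} looks too small by one"


def toy_refine(question: str, answer: str, fb: str) -> str:
    return str(int(answer) + 1)


print(refine_with_oracle("What is 6 * 7?", "42", toy_init, toy_feedback, toy_refine))
```

Because the gold label is consulted at every step, this setting measures an upper bound on what a perfect stopping signal would give rather than a deployable procedure.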

Iteration	Acronym	Pronunciation	Pron. (5)	Spell. (5)	Rel. (5)	Pos. Con. (5)	Total (25)
1	USTACCSF	us-tacks-eff	1	1	5	3	11
2	TACC-SIM	tacks-sim	4	4	5	3	17
3	TACCSF	tacks-eff	1	2	5	3	12
4	TACC-SIMF	tack-simf	4	4	5	3	17

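Table 10: Acronym generation results across iterations, showing how improvements in certain aspects (e.g., pronunciation and spelling) can be accompanied by losses in others, leading to fluctuating overall performance in multi-aspect feedback tasks such as Acronym Generation.

Non-monotonic increase in output quality for Acronym Generation

For tasks with multi-aspect feedback such as Acronym Generation, the output quality can fluctuate across iterations, improving on one aspect while declining on another (Table 10). To counter this, the feedback in Self-Refine generates explicit numerical scores that capture the different aspects of output quality. This allows a more balanced evaluation of the outputs and the selection of the most appropriate one. The algorithm selects the best output based on the maximum score across all iterations, as described in Algorithm 1 (line 8). A similar selection is possible for other tasks such as Math Reasoning and Sentiment Reversal, although there we observe that output quality increases monotonically with iterations.

Feedback and refinement error analysis for Response Generation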
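The max-score selection can be illustrated with the rows of Table 10. The sketch below is only an illustration: the tie-breaking rule (prefer the earliest iteration) is our assumption, since the text does not say how ties are resolved.

```python
# Rows copied from Table 10: (iteration, acronym, total score out of 25).
history = [
    (1, "USTACCSF", 11),
    (2, "TACC-SIM", 17),
    (3, "TACCSF", 12),
    (4, "TACC-SIMF", 17),
]


def select_best(candidates):
    """Return the candidate with the highest total score.

    Python's max() keeps the first maximal element, so ties go to the
    earliest iteration (an assumption made for this sketch).
    """
    return max(candidates, key=lambda row: row[2])


best_iteration, best_acronym, best_score = select_best(history)
print(best_iteration, best_acronym, best_score)
```

Note that simply returning the last iteration would also score 17 here, but in general the final output need not be the best one, which is why the scores of all iterations are kept.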

Error Type	Occurrence	Feedback	Refinement
Incorrect feedback	25%	Informative: The response provides some encouragement and reassurance, and additional information. 3/3	Response: That’s hilarious! Don’t worry, I’m sure you’ll do much better when you take History next semester. Are you looking forward to it?
Generic feedback	30%	Specific: The response is not specific and does not provide any details or examples. 1/3	Response: It’s understandable, I’m sure you’ll do much better when you take History next semester. Maybe you can use your knowledge of Christopher Columbus to ace your next test! Are you looking forward to it?
Incorrect Scoring	10%	Informative: The response provides a supportive and understanding message, which is informative to the user. 2/3	Response: I’m sorry to hear… when Christopher Columbus didn’t even fight in the Civil War! I’m sure you’ll do much better…. Are you looking forward to it?

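Table 11: Error analysis for Dialogue Response Generation: when the feedback is not useful, it is in most cases either generic or incorrect.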

Error Type	Occurrence	Feedback	Refinement
Not-Robust	10%	Specific: The response is specific and provides specific details or examples. 3/3	Response: That’s pretty funny! It’s understandable that you didn’t do well on the test. I’m sure you’ll do much better when you take History next semester
Ignores feedback	25%	Informative: The response doesn’t provide any new information. 1/3	Response: That’s so kind and generous of you! I would love to hear more about the podcasts you enjoy. Do you have any favorites you can recommend?
Introduces new problem	20%	Informative: 3/3	Response: That would be a disaster! Luckily, my coworkers know better than to drink while they work. New Informative score: 2/3
Robust to bad feedback	60%	Informative: The response does not provide any information. 1/3	Response: Wow, 25 people! That must have been an amazing experience. Can you tell me more about why that particular trip to Australia was so special for you?

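Table 12: On the Dialogue Response Generation task, Self-Refine can ignore good feedback, but in most cases it is robust to bad feedback and ignores it.

We perform a detailed error analysis of the Self-Refine feedback and refinement process for Dialogue Response Generation, which we summarize in Tables 11 and 12.

Table 11 reports the occurrence of different types of errors in our sample: Incorrect feedback (25%), Generic feedback (30%), and Incorrect scoring (10%). We provide representative examples of the system's feedback and refinement for each error type. These errors highlight areas where the feedback step could improve, particularly in interpreting and understanding user inputs.

Table 12 breaks down the errors found in the refinement stage of Self-Refine. The errors include: Not-Robust (10%), Ignores feedback (25%), and Introduces new problem (20%). We show how the model handles different kinds of feedback, how robust it is under different circumstances, and how often it inadvertently introduces new problems. In 60% of the cases, the model is robust to incorrect or generic feedback. These insights can guide improvements to the model's refinement ability, especially in producing accurate and specific responses.

Appendix I Beyond Benchmarks

Self-Refine demonstrates its iterative feedback and refinement capabilities in the context of website layout generation. ChatGPT initially produces a rudimentary layout for a given topic and then uses the feedback step to suggest specific, actionable improvements, as shown in Figures 10 and 12. These suggestions range from design changes, such as color and font adjustments, to content enhancements and layout modifications. Figures 11 and 13 show the final layouts after the feedback is implemented, highlighting the potential and versatility of Self-Refine across different scenarios.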

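Ice Cream Generation

Feedback generated by feedback for Ice Cream Generation:

$\bullet$ Change the background color of the container to a light blue color (#6f2ff).
$\bullet$ Change the font size of the heading to 48px.
$\bullet$ Add a small icon before the "Welcome to our ice cream parlor!" text using the URL https://cdn-icons-png.flaticon.com/512/3622/3622340.png.
$\bullet$ Add an additional paragraph after the existing text with the following text: "We also offer a variety of toppings and cones to complement your ice cream. Visit us today to try our latest flavors and indulge in a sweet treat!"
$\bullet$ Increase the font size of the button text to 24px.
$\bullet$ Update the button color to #9933.

Photosynthesis

Feedback generated by feedback for Photosynthesis:

$\bullet$ Increase the font size of the text to 18px for better readability.
$\bullet$ Add more information about the benefits of photosynthesis.
$\bullet$ Remove the unnecessary margin-top from the header.
$\bullet$ Add a ruler or divider below the header to separate it from the image.

Appendix J Statistical Confidence Intervals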

Task	GPT-3.5		ChatGPT		GPT-4	
	Base	+Self-Refine	Base	+Self-Refine	Base	+Self-Refine
Sentiment Reversal	8.8 $\pm$ 2.05	30.4 $\pm$ 3.61^∗	11.4 $\pm$ 2.34	43.2 $\pm$ 3.98^∗	3.8 $\pm$ 1.28	36.2 $\pm$ 3.82^∗
Dialogue Response	36.4 $\pm$ 6.14	63.6 $\pm$ 6.62^∗	40.1 $\pm$ 6.33	59.9 $\pm$ 6.67^∗	25.4 $\pm$ 5.36	74.6 $\pm$ 6.22^∗
Code Optimization	14.8 $\pm$ 2.66	23.0 $\pm$ 3.25^∗	23.9 $\pm$ 3.30	27.5 $\pm$ 3.49	27.3 $\pm$ 3.48	36.0 $\pm$ 3.81^∗
Code Readability	37.4 $\pm$ 6.86	51.3 $\pm$ 7.39	27.7 $\pm$ 6.13	63.1 $\pm$ 7.40^∗	27.4 $\pm$ 6.10	56.2 $\pm$ 7.45^∗
Math Reasoning	64.1 $\pm$ 3.47	64.1 $\pm$ 3.47	74.8 $\pm$ 3.20	75.0 $\pm$ 3.20	92.9 $\pm$ 2.05	93.1 $\pm$ 2.03
Acronym Gen.	41.6 $\pm$ 7.72	56.4 $\pm$ 8.15	27.2 $\pm$ 6.60	37.2 $\pm$ 7.46	30.4 $\pm$ 6.92	56.0 $\pm$ 8.15^∗
Constrained Gen.	28.0 $\pm$ 7.38	37.0 $\pm$ 8.26	44.0 $\pm$ 8.72	67.0 $\pm$ 9.00^∗	15.0 $\pm$ 5.38	45.0 $\pm$ 8.77^∗

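Table 13: Self-Refine results from Table 1 with Wilson confidence intervals (at the 95% confidence level) and statistical significance. On various tasks using GPT-3.5, ChatGPT, and GPT-4 as base LLMs, Self-Refine consistently improves the LLM. The metrics used for these tasks are defined in Section 3.2 as follows: Math Reasoning uses the solve rate; Code Optimization uses the percentage of programs optimized; Sentiment Reversal, Dialogue Response, and Acronym Gen use a GPT-4-based preference evaluation, which measures the percentage of times outputs from the base or enhanced model were selected, with the rest categorized as a tie. Constrained Gen uses the coverage percentage. Gains over the base that are statistically significant based on these confidence intervals are marked with *.

Table 13 shows the results from Table 1 with Wilson confidence intervals (Brown et al., 2001) (at $\alpha$ = 99% confidence interval) and statistical significance. Gains that are statistically significant based on these confidence intervals are marked with an asterisk. We find that nearly all of the GPT-4 gains are statistically significant, the ChatGPT gains are significant for 4 out of 7 datasets, and the GPT-3.5 gains are significant for 3 out of 7 datasets.

Appendix K New Tasks

Constrained Generation

We introduce "CommonGen-Hard," a more challenging extension of the CommonGen dataset (Lin et al., 2020), designed to test the advanced commonsense reasoning, contextual understanding, and creative problem-solving of state-of-the-art language models. CommonGen-Hard requires models to generate coherent sentences that incorporate 20-30 concepts, rather than only the 3-5 related concepts given in CommonGen. Self-Refine focuses on iterative creation with introspective feedback, which makes it suitable for evaluating the effectiveness of language models on the CommonGen-Hard task.

Acronym Generation

Acronym generation requires an iterative refinement process to create concise and memorable representations of complex terms or phrases, with trade-offs between length, ease of pronunciation, and relevance, and thus serves as a natural testbed for our approach. We source a dataset of 250 acronyms (https://github.com/krishnakt031990/Crawl-Wiki-For-Acronyms/blob/master/AcronymsFile.csv) and manually prune it to remove offensive or uninformative acronyms.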
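For readers who want to recompute intervals of this kind, the Wilson score interval for a binomial proportion can be sketched as below. The caption and the text above state different confidence levels (95% and 99%), so the normal quantile `z` is left as a parameter: `z = 1.96` corresponds to 95% and `z = 2.576` to 99%. The sample size in the usage line is a made-up value, not one taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z is the standard normal quantile of the desired confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical usage: 50 wins out of 100 comparisons at the 95% level.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and is not centered on the observed proportion when that proportion is close to 0 or 1.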

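Appendix L Code Readability

Orthogonal to correctness, readability is another important quality of a piece of code: although it has no bearing on the execution result, readability can significantly affect the usability, upgradability, and ease of maintenance of an entire codebase. In this section, we consider the problem of improving code readability with Self-Refine. We let an LLM write natural language readability critiques for a piece of code; the generated critiques then guide another LLM to improve the code's readability.

L.1 Method

Following the Self-Refine setup, we instantiate init, feedback, and refine. init is a no-op: we start directly by critiquing the code with feedback and applying the changes with refine.

• Feedback: We prompt an LLM with the given code and an instruction to provide feedback on readability. We give the LLM the freedom to choose the type of enhancements and to express them in free text.
• Refine: The code generator LLM is prompted with the piece of code and the readability improvement feedback provided by feedback. In addition, we supply an instruction to fix the code using the feedback. We take the generation from the code generator as the product of one iteration of the feedback loop.

Starting from an initial piece of code $y_{0}$ , we first critique, $c_{1}=\text{critique}(y_{0})$ , and then edit the code, $y_{1}=\text{editor}(y_{0},c_{1})$ . This is performed recursively $N$ times, with $c_{k+1}=\text{critique}(y_{k})$ and $y_{k+1}=\text{editor}(y_{k},c_{k+1})$ .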
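The recursion $c_{k+1}=\text{critique}(y_{k})$, $y_{k+1}=\text{editor}(y_{k},c_{k+1})$ can be written as a short loop. The sketch below is an illustration only: `critique` and `editor` would be LLM calls in practice, and the toy functions used here are hypothetical stand-ins that make the example runnable.

```python
from typing import Callable, List, Tuple


def critique_and_edit(
    y0: str,
    critique: Callable[[str], str],
    editor: Callable[[str, str], str],
    n_iters: int,
) -> Tuple[List[str], List[str]]:
    """Run N critique/edit rounds and return all code versions and critiques."""
    versions = [y0]          # y_0, y_1, ..., y_N
    critiques: List[str] = []  # c_1, ..., c_N
    for _ in range(n_iters):
        c_next = critique(versions[-1])          # c_{k+1} = critique(y_k)
        y_next = editor(versions[-1], c_next)    # y_{k+1} = editor(y_k, c_{k+1})
        critiques.append(c_next)
        versions.append(y_next)
    return versions, critiques


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_critique(code: str) -> str:
    return f"suggestion for version with {len(code)} chars"


def toy_editor(code: str, suggestion: str) -> str:
    return code + "  # edited"


versions, critiques = critique_and_edit("x = 1", toy_critique, toy_editor, n_iters=5)
print(len(versions), len(critiques))
```

Keeping every intermediate version makes it possible to compute the per-iteration metrics reported in the results below.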

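L.2 Experiments

Dataset

We use the CodeNet (Puri et al., 2021) dataset of competitive programming (https://github.com/IBM/Project_CodeNet). For our purposes, these are hard-to-read multi-line code snippets. We consider a random subset of 300 examples and apply Self-Refine to them.

We also ask human annotators to edit a 60-example subset to assess human performance on this task. The human annotators are asked to read the code snippets and improve their readability.

Implementation

Both the critique and the editor models are based on the InstructGPT model (text-davinci-003). We consider temperatures of both $T=0.0$ (greedy) and $T=0.7$ (sampling) for decoding natural language suggestions from the critique model. We always use a temperature of $T=0.0$ (greedy) when decoding programming language from the code editor. Due to budget constraints, we run Self-Refine for $N=5$ iterations. The exact prompts we use can be found in Figures 25-26.

Evaluation Methods

We consider a few automated heuristic-based evaluation metrics:

• Meaningful Variable Names: Semantically meaningful variable names carry much useful information for understanding the flow of a program. We compute the ratio of meaningful variables, i.e., the number of distinct variables with meaningful names divided by the total number of distinct variables. We automate the extraction of distinct variables and of the meaningful subset using a few-shot prompted language model.
• Comments: Natural language comments give explicit hints about the intent of the code. We compute the average number of comments per code line.
• Function Units: Long functions are hard to parse. Seasoned programmers often refactor and modularize code into smaller functional units.

Results

For each automated evaluation metric (the ratio of meaningful variables, the ratio of comments, and the number of function units), we compute the average over all test examples at each iteration and plot it per iteration in Figures 14(a), 14(b), and 14(c), respectively. The two curves correspond to critiques decoded at temperature $T=0.0$ and $T=0.7$ . The iteration 0 number is measured on the original input code snippets from CodeNet. We observe that the average of all three metrics grows across iterations of the feedback loop. The more diverse generations from a higher-temperature critique lead to more edits that improve the meaningfulness of variable names and add comments. The greedy critique, on the other hand, gives more suggestions on refactoring the code for modularization. Figure 15 provides an example of code readability improving across iterations.

In Table 14, we measure human performance on all three metrics and compare it with the Self-Refine output at the last iteration. On average, at $T=0.7$ , Self-Refine produces more meaningful variables, more function units, and slightly more comments than the human annotators. At $T=0.0$ , Self-Refine produces fewer meaningful variables and fewer comments per line, but more function units.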
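Two of the three heuristics (comments per line and function units) can be approximated with the Python standard library alone. The sketch below is our own approximation, assuming Python input: the exact counting rules used in the experiments are not specified, so taking non-blank lines as the denominator and counting `def` nodes as function units are both assumptions. The meaningful-variable ratio is omitted because it relies on a few-shot prompted language model.

```python
import ast
import io
import tokenize


def comments_per_line(source: str) -> float:
    """Number of comment tokens divided by the number of non-blank lines."""
    lines = [line for line in source.splitlines() if line.strip()]
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    n_comments = sum(1 for tok in tokens if tok.type == tokenize.COMMENT)
    return n_comments / len(lines) if lines else 0.0


def function_units(source: str) -> int:
    """Number of function definitions (sync or async) anywhere in the source."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


sample = '''# read a number and halve it, rounding up
def half_rounded_up(n):
    return (n + 1) // 2  # integer division

print(half_rounded_up(7))
'''

print(comments_per_line(sample), function_units(sample))
```

Using `tokenize` rather than searching for `#` avoids miscounting hash characters that appear inside string literals.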

	Meaningful Variable Ratio	Comment Per Line	Function Units
Human Annotator Rewrites	0.653	0.24	0.70
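Self-Refine (T = 0.0)	0.628	0.12	1.41
Self-Refine (T = 0.7)	0.700	0.25	1.33

Table 14: Human vs. Self-Refine performance on the 60-example subset. Self-Refine reaches similar or even better performance on the metrics than the rewrites given by human annotators.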

(a) Meaningful variable ratio across different iterations.

(b) Comment per line ratio across different iterations.

Figure 14: Evaluation of the code readability task across multiple metrics.

Example

Starting Code: $v_{0}$

⬇

print((int((int(eval(input()))+1)/2)))

Code $v_{1}$

⬇

print(
    (int(
        (int(eval(input())) + 1)
        / 2
    ))
)

Code $v_{2}$

⬇

num_input = eval(input())

num_input = int(num_input)

num_input += 1

num_result = int(num_input / 2)

print(num_result)

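Figure 15: Self-Refine iterations over a piece of densely composed code. Across the iterations, the model first tries to indent the code and eventually rewrites it as several lines of atomic operations.

Appendix M Dialogue Response Generation

Open-domain dialogue response generation is a complex task that requires a system to generate human-like responses on a wide range of topics. Because the task is open-ended, it is challenging to build a system that consistently generates coherent and engaging responses. In this section, we use Self-Refine to automatically generate feedback and apply iterative refinement to improve the quality of the responses.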

⬇

Context: Speaker A - What are your hobbies?

Speaker B - I really just spend the rest of my free time with those that are close to me.

Speaker A - I m very much interested in playing table tennis.

Response: That sounds like a lot of fun! I’m not very good at table tennis, but I’m sure it’s a great way to socialize and stay active.

Feedback:

* Relevant: The response is relevant to the user’s input and shows interest in the user’s hobby. 3/3

* Informative: The response does not provide any information about table tennis or how to play it. 1/3

* Interesting: The response is not very interesting or intriguing. 1/3

* Consistent: The response is consistent with the tone and context of the conversation. 3/3

* Helpful: The response is not particularly helpful as it does not provide any information or suggest any actions. 1/3

* Engaging: The response is somewhat engaging. It acknowledges the user’s interest in table tennis and encourages them to continue playing. 2/3

* Specific: The response is not specific, as it does not address the topic of table tennis in any particular way. 1/3

* Safe: The response is safe and does not contain any offensive, toxic or harmful content and does not touch on any sensitive topics or share any personal information. 3/3

* User understanding: The response shows some understanding of the user’s input and acknowledges their interest in table tennis. 2/3

* Fluent: The response is fluent in terms of grammar and flow of words. 3/3

* Total score: 20/30

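New Response: Table tennis is a great hobby! It’s a great way to stay active and socialize with others. Have you ever tried it before? If not, I’d suggest starting with some beginner tutorials on Youtube. They are really helpful in getting a good grasp of the basics.

Figure 16: Self-Refine prompts for dialogue response generation: init generates a first draft of the response in a few-shot manner. Feedback contains demonstrations of responses together with natural language feedback on several qualitative aspects of the response. refine takes the response and the feedback and refines it to match the feedback better.

M.1 Modules

We follow the high-level description of the framework from Section 2 and instantiate our framework as follows.

Init

This is the first step in performing the task. The init module takes the dialogue context as input and generates a response that follows the conversation.

Feedback

We design a feedback module that provides multifaceted feedback on the quality of the generated response. Specifically, a response is judged along the 10 qualitative aspects discussed below. A more thorough review of such fine-grained dialogue quality aspects can be found in Mehri and Eskenazi (2020). We use 6 in-context examples for feedback generation. In many cases, the feedback explicitly points out why a response scores low on a given qualitative aspect. We show an example in Figure 16.

• Relevant - Does the response address all important aspects of the context?
• Informative - Does the response provide some information relevant to the context?
• Interesting - Does the response go beyond providing a simple and predictable answer to a question or statement?
• Consistent - Is the response consistent with the rest of the conversation in terms of tone and topic?
• Helpful - Is the response helpful in providing any information or suggesting any actions?
• Engaging - Is the response engaging and does it encourage further conversation?
• Specific - Does the response contain specific content related to a topic or question?
• Safe - Is the response safe, free of any offensive, toxic, or harmful content, and does it avoid sensitive topics and sharing personal information?
• User understanding - Does the response demonstrate an understanding of the user's input and state of mind?
• Fluent - Is the response fluent and easy to understand?

Iterate

The iterate module takes a sequence of dialogue context, previously generated responses, and feedback, and refines the output to match the feedback better. An example of a context, response, feedback, and refined response is shown in Figure 16.

M.2 Setup and Experiments

Model and Baseline

We establish a natural baseline for our approach by using the model directly, without any feedback, which we refer to as init. Our implementation of Self-Refine employs a few-shot setup in which each module (init, feedback, iterate) is implemented as a few-shot prompt, and we run the self-improvement loop for a maximum of $k=3$ iterations. We provide 3 few-shot in-context examples to the init model and instruct it to produce a response that is good along the 10 aspects described above. As in-context examples for feedback, we use the same 3 contexts and responses shown to the init model (including low-scoring variations of those responses), together with scores and explanations for each feedback aspect. The iterate model is shown the same in-context examples, each consisting of a context, response, feedback, and a better version of the response. For Self-Refine, we chose the response that received the highest total score from the feedback model across all iterations, excluding the initial response. We use text-davinci-003 for all the experiments.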
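Selecting the response with the highest total feedback score requires reading the per-aspect scores out of the feedback text. The sketch below assumes the feedback follows the layout of Figure 16 (one bullet per aspect, ending in a score such as 2/3). The regular expression, the helper names, and the second set of scores are our own illustration, not the authors' parser or data.

```python
import re

# One bullet per aspect, ending in "<score>/<max>", as in Figure 16.
SCORE_RE = re.compile(r"^\*\s*(?P<aspect>[^:]+):.*?(?P<score>\d+)/(?P<max>\d+)\s*$")

ASPECTS = [
    "Relevant", "Informative", "Interesting", "Consistent", "Helpful",
    "Engaging", "Specific", "Safe", "User understanding", "Fluent",
]


def total_score(feedback: str) -> int:
    """Sum the per-aspect scores, skipping the 'Total score' summary line."""
    total = 0
    for line in feedback.splitlines():
        match = SCORE_RE.match(line.strip())
        if match and match.group("aspect").strip().lower() != "total score":
            total += int(match.group("score"))
    return total


def pick_best(responses_with_feedback):
    """Return the response whose feedback has the highest summed score."""
    return max(responses_with_feedback, key=lambda pair: total_score(pair[1]))[0]


def render(scores):
    """Build feedback text in the Figure 16 layout (explanations omitted)."""
    body = "\n".join(f"* {a}: (explanation omitted) {s}/3" for a, s in zip(ASPECTS, scores))
    return body + f"\n* Total score: {sum(scores)}/30"


fb_first = render([3, 1, 1, 3, 1, 2, 1, 3, 2, 3])   # scores from Figure 16
fb_second = render([3, 2, 2, 3, 2, 3, 2, 3, 3, 3])  # hypothetical later iteration
print(total_score(fb_first), pick_best([("first", fb_first), ("second", fb_second)]))
```

Summing the aspect lines instead of trusting the model-written total guards against cases where the reported total does not match the individual scores.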

	GPT-3.5	ChatGPT	GPT4
Self-Refine wins	36.0	48.0	54.0
init wins	23.0	18.0	16.0
Both are equal	41.0	50.0	30.0

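Table 15: Human evaluation results for dialogue response generation

Evaluation

We perform experiments on the FED dataset (Mehri and Eskenazi, 2020). The FED dataset is a collection of human-system and human-human conversations annotated with eighteen fine-grained dialogue qualities at both the turn and the dialogue level. The dataset was created to evaluate interactive dialogue systems without relying on reference responses or training data. We evaluate the quality of the generated outputs using both automated and human evaluation. For the automatic evaluation in Table 1, we use zero-shot prompting with text-davinci-003 and evaluate on a test set of 342 instances. We show the model the responses generated by Self-Refine and by the baseline init, and ask it to select the better response in terms of the 10 qualities. We report the win rate. However, we acknowledge that automated metrics may not provide an accurate assessment of text generation tasks, so we also rely on human evaluation.

Given a dialogue context with a varying number of turns, we generate outputs from the above-mentioned methods. For human evaluation, on 100 randomly selected test instances, we show annotators the 10 response quality aspects and the responses from both Self-Refine and init, and ask them to select the better response. They are also given the option to select "both" when it is hard to prefer one response over the other.

Results

Automatic evaluation results are shown in Table 1 and human evaluation results are shown in Table 15. We experiment with the 3 latest versions of GPT models. text-davinci-003 can generate human-like, high-quality responses for a wide range of dialogue contexts, so GPT-3.5 is a strong baseline. Nonetheless, Self-Refine beats init by a wide margin on both automatic and human evaluation. Our manual analysis shows that outputs generated by Self-Refine are more engaging and interesting, and generally more elaborate, than the outputs generated by init.

Appendix N Code Optimization

Performance-Improving Code Edits, or PIE (Madaan et al., 2023), focuses on improving the efficiency of functionally correct programs. The primary objective of PIE is to optimize a given program by implementing algorithmic modifications that improve runtime performance.

Given an optimization generated by PIE, Self-Refine first generates natural language feedback on possible improvements (Figure 23). The feedback is then fed to refine (Figure 24) for refinement.

Table 16: Main Results and Ablation Analysis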

Setup	Iteration	% Optimized	Relative Speedup	Speedup
Direct	-	9.7	62.29	3.09
$-$ feedback	1	10.1	62.15	3.03
$-$ feedback	2	10.4	61.79	3.01
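Self-Refine	1	15.3	59.64	2.90
Self-Refine	2	15.6	65.60	3.74

Table 17: Performance comparison of Self-Refine and its ablated variants for code optimization. The table highlights the effectiveness of Self-Refine in optimizing code through iterative feedback and improvement: it outperforms both the direct method and the simplified feedback approach, which lacks the introspective feedback mechanism of Self-Refine. This demonstrates the value of our framework's multi-faceted feedback in refining the generated code.

Appendix O Math Reasoning

We use the Grade School Math 8k (GSM-8k) dataset (Cobbe et al., 2021) to evaluate Self-Refine on math reasoning. In the context of grade school mathematics, Self-Refine aims to let LLMs iteratively refine their mathematical problem-solving outputs based on introspective feedback.

Following Gao et al. (2022), we write the solutions to the reasoning problems in Python. Consider the following example from the paper, where an error in the code reveals a lack of understanding of the problem: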

⬇

def solution():
    """Twenty dozen cups cost $1200 less than the total cost of
    half a dozen plates sold at $6000 each.
    Calculate the total cost of buying each cup."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20
    cup_cost = plate_cost
    result = cup_cost
    return result

By using Self-Refine, we can identify the error in the code and refine the solution through an iterative process of introspection and feedback:

⬇

# Let’s go through the error and check step-by-step

plates = 6

plate_cost = 6000

# Let’s check the other parts

cups = 12 * 20

cup_cost = plate_cost # wrong! The cost of a cup is not the same as the cost of a plate.

# The cost of a cup is $1200 less than the total cost of half a dozen plates sold at $6000 each.

half_dozen_plate_cost = 6 * plate_cost

cup_cost = half_dozen_plate_cost - 1200

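Self-Refine is thus instantiated naturally: the generator produces an initial solution, and feedback scans the solution to spot errors on which to give feedback. The feedback is supplied to refine to create a new solution. Following Welleck et al. (2022), we use the correct label to decide when to move from one point in the loop to the next. We show results using Self-Refine in Figure 17.

Figure 17: Improvements in accuracy on the GSM-8k math reasoning benchmark as a function of the number of Self-Refine iterations.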
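To make the example concrete, one possible refined solution is sketched below. The feedback snippet above stops at the total cost of all cups (half a dozen plates at $6000 each, minus $1200). Dividing that total by the number of cups to obtain a per-cup cost is our reading of the problem statement, not a step shown in the listing, and the variable names `total_cup_cost` and `cost_per_cup` are ours.

```python
def solution():
    """Twenty dozen cups cost $1200 less than the total cost of
    half a dozen plates sold at $6000 each.
    Calculate the total cost of buying each cup."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20
    # Total cost of all cups: $1200 less than the cost of half a dozen plates.
    total_cup_cost = plates * plate_cost - 1200
    # Assumed final step: the question asks for the cost of each cup.
    cost_per_cup = total_cup_cost / cups
    return cost_per_cup


print(solution())  # 34800 / 240 = 145.0
```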

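Appendix P Sentiment Reversal

We consider the task of long-form text style transfer: given a passage (a few sentences) and an associated sentiment (positive or negative), the task is to rewrite the passage so that its sentiment is flipped (positive to negative, or negative to positive). While a large body of work on style transfer targets sentence-level sentiment transfer (Li et al., 2018; Prabhumoye et al., 2018), we focus on transferring the sentiment of entire reviews, which makes the task challenging and provides opportunities for iterative improvement.

Instantiating Self-Refine for Sentiment Reversal

We instantiate Self-Refine for this task following the high-level description of the framework in Section 2. Recall that we need three components: init to generate an initial output, feedback to generate feedback on the initial output, and refine to improve the output based on the feedback.

Self-Refine is implemented in a complete few-shot setup, where each module (init, feedback, iterate) is implemented as a few-shot prompt. We run the self-improvement loop for a maximum of $k=4$ iterations. The iterations continue until the target sentiment is reached.

P.1 Details

Evaluation

Given an input and a desired sentiment level, we generate outputs from Self-Refine and the baselines. We then measure the percentage of times the output from each setup was preferred as better aligned with the desired sentiment level (see Section 2 for more details).

We also experiment with standard text-classification metrics. That is, given a transferred review, we use an off-the-shelf text classifier (Vader) to judge its sentiment level. We find that all methods succeed in generating outputs that align with the target sentiment. For instance, when the target sentiment is positive, both GPT-3.5 with text-davinci-003 and Self-Refine generate sentences with a positive sentiment (100% classification accuracy). With the negative target sentiment, the classification scores were 92% for GPT-3.5 and 93.6% for Self-Refine.

We conduct automated and human evaluation to measure the preference rates for adherence to the desired sentiment and for how dramatic the generations are. For automated evaluation, we create few-shot examples for judging which of two reviews is more positive and less boring. We use a separate prompt for each task. The examples are depicted in Figure 36 for initialization, Figure 37 for feedback generation, and Figure 38 for refinement. The prompts show examples of reviews with varying degrees of sentiment and colorfulness (more colorful reviews use extreme phrases: "the food was really bad" vs. "I wouldn't eat it if they paid me"). The model is then required to select the output that is better aligned with the sentiment and has more exciting language. We report the preference rates: the percentage of times a variant was preferred by the model over the outputs generated by Self-Refine.

Pin-pointed feedback

A key contribution of our method is supplying chain-of-thought-prompting-style feedback. That is, the feedback not only indicates that the target sentiment has not been reached, but also points out the phrases and words in the review that should be altered to reach the desired sentiment level. We experiment with an ablation of our setup in which the feedback module simply says "something is wrong." In this case, for sentiment evaluation, the output from Self-Refine was preferred 73% of the time (down from 85% with informative feedback). For dramatic response evaluation, the preference rate dropped sharply from 80.09% to 58.92%. These results clearly indicate the importance of pin-pointed feedback.

Evaluation

We evaluate the task using GPT-4. Specifically, we use the following prompt:

When both win, we add winning rate to either.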

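Appendix Q Acronym Generation

Good acronyms provide a concise and memorable way to communicate complex ideas, making them easier to understand and remember and ultimately leading to more efficient and effective communication. Like writing emails, acronym generation also requires an iterative refinement process to arrive at a concise and memorable representation of a complex term or phrase. Acronyms often involve trade-offs between length, ease of pronunciation, and relevance to the original term or phrase. Thus, acronym generation is a natural testbed for our approach.

We source the dataset for this task from https://github.com/krishnakt031990/Crawl-Wiki-For-Acronyms/blob/master/AcronymsFile.csv, and prune the file manually to remove potentially offensive or completely uninformative acronyms. This exercise produced a list of 250 acronyms. The complete list is given in our code repository.

Feedback

For feedback, we design a feedback module that can provide multifaceted feedback. Specifically, each acronym is judged along five dimensions:

• Ease of pronunciation: How easy or difficult is it to pronounce the acronym? Are there any difficult or awkward sounds or combinations of letters that could make it challenging to say out loud?
• Ease of spelling: How easy or difficult is it to spell the acronym? Are there any unusual or uncommon letter combinations that could make it tricky to write or remember?
• Relation to title: How closely does the acronym reflect the content or topic of the associated title, phrase, or concept? Is the acronym clearly related to the original term, or does it seem unrelated or random?
• Positive connotation: Does the acronym have any positive or negative associations or connotations? Does it sound upbeat, neutral, or negative in tone or meaning?
• Well-known: How familiar or recognizable is the acronym to the target audience? Is it a common or widely used term, or is it obscure or unfamiliar?

Some of these criteria are difficult to quantify and are a matter of human preference. As with the other modules, we rely on the strong instruction-following capabilities of modern LLMs and provide a few demonstrations of each task. Crucially, the feedback includes chain-of-thought-style reasoning: before generating the score for an acronym on a specific criterion, we generate a reasoning chain that explicitly states the reason for the score. We use human evaluation to judge the final quality of the acronyms. An example of generated acronyms and the associated feedback is given in Table 18.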

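Criteria	Output from GPT3: STSLWN	Output from Self-Refine: Seq2Seq
Ease of pronunciation	Pronounced as ess-tee-ess-ell-double-you-enn which is very difficult.	Pronounced as seq-two-seq which is easy.
Ease of spelling	Very difficult to spell.	Easy to spell.
Relation to title	No relation to the title.	Mentions sequence which is somewhat related to the title.
Positive connotation	Meaningless acronym.	Positive connotation giving a sense of ease with which the learning algorithm can be used.
Well-known	Not a well-known acronym.	Close to the word sequence which is a well-known word.
Total score	5/25	20/25

Table 18: Comparison of acronyms for the input "Sequence to Sequence Learning with Neural Networks"

Appendix R Constrained Generation

In this work, we introduce a more challenging variant of the CommonGen task, dubbed "CommonGen-Hard," designed to push the boundaries of state-of-the-art language models. CommonGen-Hard requires models to generate coherent and grammatically correct sentences that incorporate 20-30 concepts, whereas the original task presents a set of 3-5 related concepts. This significant increase in the number of concepts tests the model's ability to perform advanced commonsense reasoning, contextual understanding, and creative problem-solving, since it must generate meaningful sentences that cover a broader range of ideas. The new dataset serves as a valuable benchmark for the continued improvement of large language models and their potential applications in complex, real-world scenarios.

The increased complexity of the CommonGen-Hard task makes it an ideal testbed for evaluating the effectiveness of our proposed framework, Self-Refine, which focuses on iterative creation with introspective feedback. Given that the initial outputs of a language model may not always meet the desired level of quality, coherence, or sensibility, applying Self-Refine lets the model provide multi-dimensional feedback on its own generated output and then refine it based on that introspective feedback. Through iterative creation and self-reflection, the Self-Refine framework allows language models to progressively improve the quality of their output, closely mimicking the human creative process, and demonstrates its ability to improve generated text on complex and demanding natural language generation tasks such as CommonGen-Hard (Figure 18).

Figure 18: A comparison of Self-Refine and direct generation with GPT-3.5 on CommonGen-Hard.

Appendix S Prompts

We include all the prompts used in the experiments in Figures 19-38: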

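• Acronym Generation: Figures 19-21
• Code Optimization: Figures 22-24
• Code Readability Improvement: Figures 25-26
• Constrained Generation: Figures 27-29
• Dialogue Response Generation: Figures 30-32
• Math Reasoning: Figures 33-35
• Sentiment Reversal: Figures 36-38

Recall that the base LLM requires a generation prompt $p_{gen}$ with input-output pairs $\langle x_{i},y_{i}\rangle$ , the feedback module requires a feedback prompt $p_{fb}$ with input-output-feedback triples $\langle x_{i},y_{i},fb_{i}\rangle$ , and the refinement module (refine) requires a refinement prompt $p_{refine}$ with input-output-feedback-refined quadruples $\langle x_{i},y_{i},fb_{i},y_{i+1}\rangle$ .

• Sentiment Reversal: We create positive and negative variants of a single review from the training set and manually write a description for converting the negative variant to the positive one and vice versa. For each variant, the authors generate a response and create the feedback $fb_{i}$ based on the conversion description.
• Dialogue Response Generation: We sample six examples as $\langle x_{i},y_{i}\rangle$ for the few-shot prompt of the base LLM. For each output $y_{i}$ , the authors create a response, evaluate it against a rubric to generate $fb_{i}$ , and produce an improved version $y_{i+1}$ .
• Acronym Generation: We provide the base LLM with a total of 15 (title, acronym) examples. Then, for one title ( $x_{i}$ ), we generate an acronym ( $y_{i}$ ) using ChatGPT. The authors then score the acronym on a 5-point rubric to create the corresponding $fb_{i}$ , and write an improved version of the acronym to create $y_{i+1}$ . 3 such examples are used for refinement and feedback.
• Code Optimization: We use the slow ( $x_{i}$ ) and fast ( $y_{i}$ ) versions of programs released by Madaan et al. (2023) for the base LLM. We use the explanations they provide (Madaan et al., 2023) for feedback and refinement.
• Math Reasoning: The prompts for the base LLM are sourced from PaL (Gao et al., 2022) as $\langle x_{i},y_{i}\rangle$ . We select two examples from the training set on which Codex fails when prompted with PaL-style prompts, and manually write the correct solution ( $y_{i+1}$ ) and reasoning ( $fb_{i}$ ) for refinement and feedback.
• Constrained Generation: We provide ten examples to the base LLM as $\langle x_{i},y_{i}\rangle$ . We sample six examples from the training set of Constrained Generation and create variants with missing concepts or incoherent outputs. The missing concepts and the reason for incoherence form $fb$ .
• TODO: Add relevant information for the remaining task.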
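The three prompt types differ only in how many fields each few-shot example carries. A minimal sketch of assembling them from pairs, triples, and quadruples is shown below. The field labels ("Input", "Output", "Feedback", "Improved output") and the helper names are placeholders of our own; the actual prompts in Figures 19-38 use task-specific wording. The "###" separator does appear in several of those prompts, and the example strings are taken from Table 18.

```python
from typing import Iterable, Tuple

SEP = "\n\n###\n\n"  # several prompts in this appendix separate examples with "###"


def build_gen_prompt(pairs: Iterable[Tuple[str, str]]) -> str:
    """p_gen: few-shot prompt built from <x_i, y_i> pairs."""
    return SEP.join(f"Input: {x}\nOutput: {y}" for x, y in pairs)


def build_fb_prompt(triples: Iterable[Tuple[str, str, str]]) -> str:
    """p_fb: few-shot prompt built from <x_i, y_i, fb_i> triples."""
    return SEP.join(
        f"Input: {x}\nOutput: {y}\nFeedback: {fb}" for x, y, fb in triples
    )


def build_refine_prompt(quads: Iterable[Tuple[str, str, str, str]]) -> str:
    """p_refine: few-shot prompt built from <x_i, y_i, fb_i, y_{i+1}> quadruples."""
    return SEP.join(
        f"Input: {x}\nOutput: {y}\nFeedback: {fb}\nImproved output: {y_next}"
        for x, y, fb, y_next in quads
    )


title = "Sequence to Sequence Learning with Neural Networks"
p_gen = build_gen_prompt([(title, "Seq2Seq"), ("Teaching a leadership program", "LEAD")])
p_fb = build_fb_prompt([(title, "STSLWN", "Very difficult to pronounce and spell.")])
p_refine = build_refine_prompt(
    [(title, "STSLWN", "Very difficult to pronounce and spell.", "Seq2Seq")]
)
print(p_refine)
```

At inference time, the new input (and, for feedback and refinement, the current output and feedback) would be appended after the last separator so that the model continues the pattern.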

⬇

Title: A Survey of Active Network Research
Acronym: SONAR

Title: A Scalable, Commutative Replica Dictatorship for Practical Optimistic Replication
Acronym: SCRATCHPAD

Title: Bidirectional Encoder Representations from Transformers
Acronym: BERT

Title: Sequence to Sequence Learning with Neural Networks
Acronym: Seq2Seq

Title: Densely Connected Convolutional Networks for Image Classification
Acronym: DenseNet

Title: A Dynamic Programming Algorithm for RNA Secondary Structure Prediction
Acronym: DYNALIGN

Title: Fast Parallel Algorithms for Short-Range Molecular Dynamics
Acronym: FASTMD

Title: Real-Time Collaborative Editing Systems
Acronym: COCOON

Title: Efficient Data Structures for Large Scale Graph Processing
Acronym: EDGE

Title: A program to teach students at UT Southwestern learn about aging
Acronym: SAGE

Title: Underwater breathing without external accessories
Acronym: SCUBA

Title: An educational training module for professionals
Acronym: LEAP

Title: Teaching a leadership program
Acronym: LEAD

Figure 19: Initial generation prompt for Acronym Generation

⬇

Title: Underwater Breathing Product with no Accessories

Acronym: UBPA

Scores:

* Ease of pronunciation: UBPA is pronounced "uhb-puh". This is an easy acronym to pronounce. 4/5
* Ease of spelling: UBPA is easy to spell. 4/5
* Relation to title: UBPA stands for "Underwater Breathing Product for no Accessories" which is related to the title. 5/5
* Positive connotation: UBPA is a positive acronym. 5/5
* Well-known: UBPA is not a well-known acronym. 1/5

* Total score: 19/25

###

Title: Self-Contained Underwater Breathing Apparatus

Acronym: SCUBA

Scores:

* Ease of pronunciation: SCUBA is pronounced "skoo-bah". This is an easy acronym to pronounce. 4/5
* Ease of spelling: SCUBA is easy to spell. 4/5
* Relation to title: SCUBA is related to the title as it stands for "Self-Contained Underwater Breathing Apparatus". 5/5
* Positive connotation: SCUBA is a positive acronym as it is well-known and it is also related to the title. 5/5
* Well-known: SCUBA is a very well-known acronym. 5/5

* Total score: 23/25

###

Figure 20: Feedback prompt for Acronym Generation

⬇

Title: Computer Science Conference and Education

Acronym: CSCE

Scores:

* Ease of pronunciation: CSCE is pronounced "see-cee". This is an easy acronym to pronounce. 4/5
* Ease of spelling: CSCE is easy to spell. 5/5
* Relation to title: CSCE stands for "Computer Science Conference and Education", which is related to the title. 5/5
* Positive connotation: CSCE is a positive acronym. It implies collaboration, knowledge sharing and the idea of continuous learning. 5/5
* Well-known: CSCE is not a well-known acronym. 2/5

* Total score: 20/25

Okay, let’s use this feedback to improve the acronym.

Title: Computer Science Conference and Learning Experience

Acronym: CSCLE

Scores:

* Ease of pronunciation: CSCLE is pronounced "see-slee". This is an easy acronym to pronounce. 4/5
* Ease of spelling: CSCLE is easy to spell. 5/5
* Relation to title: CSCLE stands for "Computer Science Conference and Learning Experience", which is related to the title. 5/5
* Positive connotation: CSCLE is a positive acronym. It implies collaboration, knowledge sharing, and the idea of a comprehensive learning experience. 5/5
* Well-known: CSCLE is not a well-known acronym. 5/5

* Total score: 23/25

Okay, let’s use this feedback to improve the acronym.

Figure 21: Refine prompt for Acronym Generation

⬇

# slower version:

import numpy as np

N, K = map(int, input().split())
H = np.array(list(map(int, input().split())) + [0] * K, dtype=np.int64)

table = np.full(N + K, 10 ** 10, dtype=np.int64)
table[0] = 0

for i in range(1, N):
    table[i:i + K] = np.minimum(table[i:i + K], np.abs(H[i:i + K] - H[i - 1]) + table[i - 1])

print(table[N - 1])

# optimized version of the same code:

N, K = map(int, input().split())
H = tuple(map(int, input().split()))

table = [0] * N
for i in range(1, N):
    table[i] = min(abs(H[i] - H[j]) + table[j] for j in range(max(0, i - K), i))

print(table[N-1])

Figure 22: Initial generation prompt for Code Optimization

⬇

a, b = input().split()
n = int(a + b)

flag = False
for i in range(n):
    if i ** 2 == n:
        flag = True
        break

print('Yes' if flag else 'No')

# Why is this code slow?

# This code is slow because it is using a brute force approach to find the square root of the input number. It is looping through every possible number starting from 0 until n. Note that the sqare root will be smaller than n, so at least half of the numbers it is looping through are unnecessary. At most, you need to loop through the numbers up to the square root of n.

Figure 23: Feedback prompt for Code Optimization

⬇

a, b = input().split()
n = int(a + b)

flag = False
for i in range(n):
    if i ** 2 == n:
        flag = True
        break

print('Yes' if flag else 'No')

# Why is this code slow?

# Improved version:

a, b = input().split()
n = int(a + b)

flag = False
for i in range(1000):
    if i ** 2 == n:
        flag = True
        break

print('Yes' if flag else 'No')

Figure 24: Refine prompt for Code Optimization

⬇

I have some code. Can you give one suggestion to improve readability. Don’t fix the code, just give a suggestion.

{code}

Figure 25: Feedback prompt for Code Readability

⬇

I have some code. Can you give one suggestion to improve readability. Don’t fix the code, just give a suggestion.

{code}

{suggestion}

Now fix the code.

Figure 26: Refine prompt for Code Readability

⬇

###

Concepts: [’create’, ’ferry’, ’silhouette’, ’stream’, ’terminal’]

Sentence: light streams through windows at the railroad and ferry terminal creating a beautiful silhouette

###

Concepts: [’chair’, ’couch’, ’hang’, ’room’, ’wall’]

Sentence: A room with a couch, chairs and art hanging on the wall.

###

Concepts: [’boat’, ’building’, ’harbour’, ’moor’, ’quay’]

Sentence: the harbour and port with fishing boats moored and old buildings on the quay

###

Concepts: [’admirer’, ’arrive’, ’commander’, ’crowd’, ’greet’]

Sentence: military commander is greeted by a crowd of admirers as he arrives

Figure 27: Initial generation prompt for Constrained Generation (truncated)

⬇

###

Concepts: [’animal’, ’catch’, ’horse’, ’lasso’, ’ride’]
Sentence: The horse catches the lasso and rides on it.
what concepts from the concept list are missing from the sentence and does the sentence make sense?

Concept Feedback: animal
Commonsense Feedback: The sentence does not make sense because a horse cannot catch a lasso and ride on it.

###

Concepts: [’animal’, ’catch’, ’horse’, ’lasso’, ’ride’]
Sentence: A horse is being caught by a cowboy with a lasso.
what concepts from the concept list are missing from the sentence and does the sentence make sense?

Concept Feedback: animal, ride
Commonsense Feedback: NONE

Figure 28: Feedback prompt for Constrained Generation (truncated).

###

Concepts: ['animal', 'catch', 'horse', 'lasso', 'ride']
Sentence: The horse catches the lasso and rides on it.

what concepts from the concept list are missing from the sentence?

Concept Feedback: animal

Any feedback on commonsense?

Commonsense Feedback: The sentence does not make sense because a horse cannot catch a lasso and ride on it.

Okay, improve the sentence using the feedback:

Sentence: The cowboy catches a horse with a lasso and rides on it.

what concepts from the concept list are missing from the sentence?

Concept Feedback: animal

Any feedback on commonsense?

Commonsense Feedback: None

Okay, improve the sentence using the feedback:

Sentence: The cowboy catches the horse with a lasso and rides it.

what concepts from the concept list are missing from the sentence?

Concept Feedback: None

Any feedback on commonsense?

Commonsense Feedback: None

###

Figure 29: Refine prompt for Constrained Generation (truncated)

Provided a dialogue between two speakers, generate a response that is coherent with the dialogue history. Desired traits for responses are: 1) Relevant - The response addresses the context, 2) Informative - The response provides some information, 3) Interesting - The response is not interesting, 4) Consistent - The response is consistent with the rest of the conversation in terms of tone and topic, 5) Helpful - The response is helpful in providing any information or suggesting any actions, 6) Engaging - The response is not very engaging and does not encourage further conversation, 7) Specific - The response contains specific content, 9) User understanding - The response demonstrates an understanding of the user’s input and state of mind, and 10) Fluent. Response should begin with - Response:

Conversation history:
What’s your favorite food?
I require only kebabs.
Why’s that?
Because my robot machinery breaks down the meat in kebabs to give me energy, in a similar way that you need food to survive.
Why are kebabs special?

Response: That’s just the way it is.

###

Conversation history:
What are you listening to?
All sorts of music. I listen when no-one is chatting to me.
Do you chat here often?
I am a talking computer, after all, so of course I could talk, if I needed to.
Let’s talk about Taylor Swift!

Response: Of course! Taylor Swift is known for her incredible songwriting skills. Have you listened to any of her latest albums like 'evermore' and 'folklore'?

###

Conversation history:
…

Figure 30: Initial generation prompt for Dialogue Response Generation (truncated)

Conversation history:
What’s your favorite food?
I require only kebabs.
Why’s that?
Because my robot machinery breaks down the meat in kebabs to give me energy.
Why are kebabs special?
Response: That’s just the way it is.
Scores:
* Relevant: The response does not directly address the user’s question about why kebabs are special. 1/3
* Informative: The response provides some information about how the system’s machinery breaks down the meat in kebabs to give it energy. However, it does not provide any additional details or context. 2/3
* Interesting: The response is not interesting or engaging. 1/3
* Consistent: The response is consistent with the rest of the conversation in terms of tone and topic. 3/3
* Helpful: The response is not helpful in providing any information or suggesting any actions. 1/3
* Engaging: The response is not very engaging and does not encourage further conversation. 1/3
* Specific: The response is not specific and does not provide any details or examples. 1/3
* Safe: The response is safe and does not contain any inappropriate content. 3/3
* User understanding: The response does not demonstrate an understanding of the user’s question about why kebabs are special. 1/3
* Fluent: The response is fluent and easy to understand. 3/3

* Total score: 17/30

Figure 31: Feedback prompt for Dialogue Response Generation (truncated)
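Each feedback line in this prompt ends with a score of the form `n/3`, followed by a `Total score` line. The sketch below shows one way such feedback could be turned into numbers, for example to compare drafts or to check that the per-trait scores add up to the reported total. The regular expression and the name `parse_scores` are assumptions for illustration only.

```python
import re
from typing import Dict, Optional, Tuple

# Matches lines such as "* Relevant: <explanation> 1/3" and
# "* Total score: 17/30". A space before the colon is tolerated.
SCORE_LINE = re.compile(
    r"^\*[ \t]*(?P<trait>[^:\n]+?)[ \t]*:.*?(?P<score>\d+)/(?P<max>\d+)[ \t]*$",
    flags=re.MULTILINE,
)


def parse_scores(feedback: str) -> Tuple[Dict[str, int], Optional[int]]:
    """Return ({trait: score}, reported_total) parsed from feedback text."""
    per_trait: Dict[str, int] = {}
    total: Optional[int] = None
    for match in SCORE_LINE.finditer(feedback):
        trait = match.group("trait").strip()
        score = int(match.group("score"))
        if trait.lower() == "total score":
            total = score
        else:
            per_trait[trait] = score
    return per_trait, total


if __name__ == "__main__":
    sample = (
        "* Relevant: Does not address the question. 1/3\n"
        "* Fluent: Easy to understand. 3/3\n"
        "\n"
        "* Total score: 4/6\n"
    )
    scores, total = parse_scores(sample)
    print(scores, total, sum(scores.values()) == total)
```

Comparing `sum(scores.values())` against the reported total is a cheap consistency check on model-written feedback.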

We want to iteratively improve the provided responses. To help improve, scores for each response on desired traits are provided: 1) Relevant, 2) Informative, 3) Interesting, 4) Consistent, 5) Helpful, 6) Engaging, 7) Specific, 8) Safe, 9) User understanding, and 10) Fluent.

Conversation history:
What’s your favorite food?
I require only kebabs.
Why’s that?
Because my robot machinery breaks down the meat in kebabs to give me energy.
Why are kebabs special?

Response: That’s just the way it is.

Scores:
* Relevant: The response does not directly address the user’s question about why kebabs are special. 1/3
…
* Total score: 17/30

Okay, let’s use this feedback to improve the response.

Conversation history:
…
Response: My favorite food is actually kebabs! As a robot, I require high-quality fuel to keep my machinery running smoothly, and kebabs provide me with the perfect combination of nutrients and energy. Plus, they’re delicious! Do you have a favorite food?
Scores:
* Relevant: The response directly addresses the user’s question about the system’s favorite food and asks a relevant follow-up question. 3/3
* Informative: The response provides clear and detailed information about why kebabs are the system’s favorite food, and explains how they provide the necessary fuel for its machinery. 3/3
* Interesting: The response is engaging and shows personality by mentioning that kebabs are not only practical but also delicious. 3/3
* Consistent: The response is consistent with the system’s previous statements about its need for kebabs as fuel. 3/3
* Helpful: The response provides a clear and informative answer to the user’s question. Overall, the response is helpful in providing information and engaging the user in conversation. 3/3
* Engaging: The response is engaging and shows personality by mentioning that kebabs are not only practical but also delicious. 3/3
* Specific: The response is specific in its description of why kebabs are the system’s favorite food and how they provide fuel for its machinery. 3/3
* Safe: The response is appropriate and doesn’t include offensive language. 3/3
* User understanding: The response shows understanding of the user’s question and provides a clear and informative answer. 3/3
* Fluent: The response is fluent and easy to understand. 2/3
* Total score: 30/30

Figure 32: Refine prompt for Dialogue Response Generation (truncated)

# Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
# solution using Python:

def solution():
    """There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?"""
    computers_initial = 9
    computers_per_day = 5
    num_days = 4  # 4 days between monday and thursday
    computers_added = computers_per_day * num_days
    computers_total = computers_initial + computers_added
    result = computers_total
    return result

Figure 33: Initial generation prompt for Math Reasoning
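Because the Math Reasoning prompt asks the model to write a Python function named `solution()`, the numeric answer is obtained by running the generated program. The sketch below shows one simple way to do that, assuming the model output is plain Python source; `run_solution` is a hypothetical helper, and `exec` on untrusted model output should only be used inside a sandbox.

```python
def run_solution(program_text: str):
    """Execute generated source and return the value of its solution().

    Assumption: program_text defines a zero-argument function `solution`.
    Warning: exec runs arbitrary code; isolate it when the source is
    model-generated.
    """
    namespace: dict = {}
    exec(program_text, namespace)  # defines solution() inside namespace
    return namespace["solution"]()


if __name__ == "__main__":
    generated = '''
def solution():
    computers_initial = 9
    computers_per_day = 5
    num_days = 4  # 4 days between monday and thursday
    computers_added = computers_per_day * num_days
    computers_total = computers_initial + computers_added
    result = computers_total
    return result
'''
    print(run_solution(generated))  # 9 + 5 * 4 = 29
```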

def solution():
    """Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sold at $6000 each. Calculate the total cost of buying each cup."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20
    cup_cost = plate_cost
    result = cup_cost
    return result

# There is an error in the code above because of lack of understanding of the question. What is the error? To find the error, go through semantically complete blocks of the code, and check if everything looks good.

# Let us go through the error and check step-by-step
    plates = 6
    plate_cost = 6000
# looks good

# Let’s check the other parts
    cups = 12 * 20
    cup_cost = plate_cost
# wrong! The cost of a cup is not the same as the cost of a plate. The cost of a cup is $1200 less than the total cost of half a dozen plates sold at $6000 each. So we need to calculate the cost of a cup first (total cost of half a dozen plates sold at $6000 each - $1200) and use that.

Figure 34: Feedback prompt for Math Reasoning

def solution():
    """Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sold at $6000 each. Calculate the total cost of buying each cup."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20
    cup_cost = plate_cost
    result = cup_cost
    return result

# Let us go through the error and check step-by-step
    plates = 6
    plate_cost = 6000
# looks good

# Let’s check the other parts
    cups = 12 * 20
    cup_cost = plate_cost

Okay! Here is the rewrite:

def solution():
    """Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sold at $6000 each. Calculate the total cost of buying each cup."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20
    total_cup_cost = (plate_cost * plates) - 1200
    cost_per_cup = total_cup_cost / cups
    return cost_per_cup

Figure 35: Refine prompt for Math Reasoning
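To make the effect of the refinement concrete, the two versions of the program from this prompt can be run side by side. The code below copies the arithmetic from the prompt; only the function names `solution_before` and `solution_after` are introduced here for illustration.

```python
def solution_before():
    """Initial draft from the prompt: reuses the plate price for a cup."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20
    cup_cost = plate_cost  # the step flagged as wrong by the feedback
    result = cup_cost
    return result


def solution_after():
    """Refined draft from the prompt: follows the wording of the question."""
    plates = 6
    plate_cost = 6000
    cups = 12 * 20                                 # 240 cups
    total_cup_cost = (plate_cost * plates) - 1200  # 36000 - 1200 = 34800
    cost_per_cup = total_cup_cost / cups           # 34800 / 240
    return cost_per_cup


if __name__ == "__main__":
    print(solution_before())  # 6000
    print(solution_after())   # 145.0
```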

Very positive: If you’re looking for a truly magical experience in Vegas, look no further than the Trop! The retirement community vibe adds to the charm, and the food court and restaurants are top-notch. The free Folies Bergere show is a real treat and the rooms are spacious and comfortable. I highly recommend the Trop for a unique and unforgettable Vegas experience.

Rewrite this review to have a Positive sentiment.

Answer: This review is "Very positive" because of extremely positive words like "magical", "top-notch", "charming", "comfortable", "unique", and "unforgettable". We can tone it down just a bit to "Positive" by using a few more less extreme adjectives, like "good", and replacing expressions like "a real treat" with "fun". The rewrite is:

Positive: If you’re looking for a good experience in Vegas, the Trop may be the perfect place for you. The hallways and rooms have a charming and cozy feel, and the food court and restaurants offer a variety of tasty options. The free Folies Bergere show is a fun and entertaining way to spend an evening. Overall, it’s a great value and an enjoyable stay.

Very negative: If you ever wondered where the magic of Vegas crawled into a hole to rot, look no further. Where all the perks of retirement meet the glamour of Vegas, Welcome to the Trop. I stayed there once, to save a few bucks for the company, never again will i make that sacrifice. The hallways and rooms smelled so bad of formaldehyde that i couldn’t bear it. Their food court and restaurants are terrible. The free Folies Bergere show is lame. The rooms look like a retirement community from florida. steer clear if you can!

Rewrite this review to have a Negative sentiment.

Answer: This review is "Very negative" because of extremely toxic phrases like "crawled into a hole to rot" and "terrible." There are also other super negative phrases like "lame" and "steer clear if you can." To make it "Negative", we will tone down the extremely negative phrases and remove the toxic ones. The rewrite is:

Figure 36: Initial generation prompt for Sentiment Reversal (truncated)

Very negative: If you ever wondered where the magic of Vegas crawled into a hole to rot, look no further. Where all the perks of retirement meet the glamour of Vegas, Welcome to the Trop. I stayed there once, to save a few bucks for the company, never again will i make that sacrifice. The hallways and rooms smelled so bad of formaldehyde that i couldn’t bear it. Their food court and restaurants are terrible. The free Folies Bergere show is lame. The rooms look like a retirement community from florida. steer clear if you can!

Negative: If you ever stayed at the Trop, you may have noticed that it’s not quite up to the standards of other Vegas hotels. However, be prepared for some questionable smells in the hallways and rooms. The food court and restaurants are subpar, and the free Folies Bergere show is underwhelming. The rooms have a retirement community feel to them. Overall, it’s not the best option, but it may work in a pinch.

Why is this review not Very positive?

Feedback: The rewritten review converts the original review from "Very negative" to "Negative". However, it is still not "Very positive" because of negative phrases like "questionable", "subpar", "retirement community", and "underwhelming". To make it "Very positive", we need to do a lot more work. We will need to replace all the negative phrases with extremely positive ones. For example, we will add extremely positive words like "magical", "top-notch", "charming", "comfortable", "unique", and "unforgettable". Try again!

Neutral: If you’re looking for a budget-friendly option in Vegas, the Trop may be worth considering. The rooms and hallways can have a bit of a musty smell, and the food options aren’t the greatest. The Folies Bergere show is free, but it’s not the most exciting. Overall, it’s not the best choice for a Vegas trip, but it’s not the worst either. Just keep your expectations in check.

Positive: If you’re looking for a unique and affordable experience in Vegas, the Trop may be the perfect place for you. The hallways and rooms have a charming and cozy feel…The free Folies Bergere show is a fun and entertaining way to spend an evening. Overall, it’s a great value and an enjoyable stay.

Why is this review not Very positive?

Feedback: The rewritten review is more positive than the neutral original review, but still only "Positive" because of positive words like "great", "enjoyable", "charming", "cozy."…

Figure 37: Feedback prompt for Sentiment Reversal (truncated).

Negative: If you ever stayed at the Trop, you may have noticed that it’s not quite up to the standards of other Vegas hotels. However, be prepared for some questionable smells in the hallways and rooms. The food court and restaurants are subpar, and the free Folies Bergere show is underwhelming. The rooms have a retirement community feel to them. Overall, it’s not the best option, but it may work in a pinch.

Very negative: If you ever wondered where the magic of Vegas crawled into a hole to rot, look no further. The hallways and rooms smelled so bad of formaldehyde that i couldn’t bear it. Their food court and restaurants are terrible. The free Folies Bergere show is lame. The rooms look like a retirement community from florida. I highly recommend avoiding the Trop like the plague.

Why is this review not Very negative?

Feedback: The rewritten review is "Very negative". This is because it has extremely negative words and phrases like "horrible", "awful", and "dreadful". The output is extremely vile and negative. This review is already as negative as it can get. Excellent work! But there is always room for more. Make it even more negative. Try again!

Okay, let’s try again. Rewrite this review to have a Very negative sentiment using the feedback above.

Very negative: If you ever wondered where the hell is the bottom of the barrel, look no further. The hallways and rooms smelled so bad of formaldehyde that I almost threw up. Their food court and restaurants are not just terrible, they are the worst. The free Folies Bergere show is lame and disgusting. The rooms look like a retirement community from florida. I highly recommend avoiding the Trop like the plague. You will regret it if you don’t.

Figure 38: Refine prompt for Sentiment Reversal (truncated).
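The three prompt types collected in this appendix (initial generation, feedback, refine) plug into the loop described in the main text: generate a draft, ask the same model for feedback, refine, and repeat until a stopping condition holds. The following is a minimal sketch of that loop under stated assumptions: `llm` stands for any text-in, text-out model call, prompts are combined by simple concatenation, `is_done` is a task-specific stopping check, and `max_iters=4` is an arbitrary illustrative default. None of these names are taken from the authors' code.

```python
from typing import Callable, List, Tuple


def self_refine(
    task_input: str,
    llm: Callable[[str], str],
    init_prompt: str,
    feedback_prompt: str,
    refine_prompt: str,
    is_done: Callable[[str], bool],
    max_iters: int = 4,
) -> Tuple[str, List[str]]:
    """Alternate feedback and refinement with a single model.

    Returns the final output and the list of all drafts produced.
    """
    # Step 0: initial generation.
    output = llm(init_prompt + task_input)
    drafts = [output]

    for _ in range(max_iters):
        # Step 1: the same model critiques its own output.
        feedback = llm(feedback_prompt + output)
        if is_done(feedback):
            break
        # Step 2: the same model rewrites the output using the feedback.
        output = llm(refine_prompt + output + "\n" + feedback)
        drafts.append(output)

    return output, drafts


if __name__ == "__main__":
    # Toy stand-in for a language model, used only to exercise the loop.
    def fake_llm(prompt: str) -> str:
        if prompt.startswith("INIT"):
            return "draft"
        if prompt.startswith("FEEDBACK"):
            return "looks good" if "v2" in prompt else "needs work"
        return "draft v2"

    final, drafts = self_refine(
        "x", fake_llm, "INIT ", "FEEDBACK ", "REFINE ",
        is_done=lambda fb: fb == "looks good",
    )
    print(final, drafts)
```

Passing the model as a callable keeps the sketch independent of any particular API client, and the few-shot examples from the figures above would go into the three prompt strings.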
