通过多智能体辩论鼓励大型语言模型
中的发散思维
摘要
像ChatGPT这样的现代大语言模型在一般语言任务上表现出色,但在复杂推理任务上仍然表现不佳,这推动了大语言模型认知行为的研究探索类人的问题解决策略。 沿着这个方向,一种代表性的策略是自我反思,它要求大语言模型利用自身迭代产生的反馈来完善解决方案。 然而,我们的研究表明,这种反思式方法存在思想退化(DoT)问题:一旦大语言模型对其解决方案建立了信心,它就无法产生新的思想即使最初的立场是错误的,后来也经过反思。 为了解决DoT问题,我们提出了一个多智能体辩论(MAD)框架,其中多个智能体以“针锋相对”的状态表达他们的论点,并且法官管理辩论过程以获得最终解决方案。 显然,我们的 MAD 框架鼓励大语言模型中的发散思维,这对于需要深度思考的任务很有帮助。 在两个具有挑战性的数据集(常识机器翻译和反直觉算术推理)上的实验结果证明了我们的 MAD 框架的有效性。 广泛的分析表明,MAD 要想获得良好的表现,需要适应性的辩论中断和适度的“以牙还牙”状态。 此外,我们发现,如果不同的大语言模型用于代理,大语言模型可能无法做出公平的判断。 代码:https://github.com/Skytliang/Multi-Agents-Debate
1简介
现代大型语言模型(大语言模型),例如 ChatGPT、GPT-4 OpenAI (2023) 和 Bard111https://bard.google.com/,在一般语言任务上表现出色 Jiao 等人 (2023 );吴等人 (2023); Bang 等人 (2023),但仍难以完成复杂的推理任务 Zhu 等人 (2022);苟等人(2023),推动大语言模型认知行为研究,探索类人问题解决策略。 特别是自我反思 Madaan 等人 (2023); Shinn 等人 (2023),这个概念通常指的是对一个人自己的想法进行内省和检查的过程,它已被探索用于解决零样本生成甚至链式分析可能具有挑战性的复杂任务。 -thought (CoT) 提示Wei 等人 (2022)。 具体来说,自我反思涉及到一个迭代细化的过程,即大语言模型根据之前迭代的答案和反馈生成新答案,然后为新答案提供反馈。 虽然自我反思可以有效地创建更好的解决方案,但它高度依赖于大语言模型的自我评估能力,而这种能力并没有正式保证Shinn 等人 (2023)。
在这项工作中,我们首次提出并定义了自我反思中的思想退化(DoT)问题。 DoT 正式描述了以下场景:
一旦大语言模型对自己的答案建立了信心,即使最初的立场不正确,它也无法通过自我反思产生新的想法。
为了证明这个问题,我们将平均分歧定义为两个辩手在辩论中(或自我反思中的自我冲突)对每个问题的反对百分比。 如图1所示,我们计算了自我反思中每两次迭代之间的立场分歧并显示了趋势。 自我反思的低分歧表明大语言模型坚持CoT预测的错误答案,无法进行有意义的自我反思。 可能导致 DoT 的因素有多种,我们在此概述三个:(1) 偏见和扭曲的感知。 自我认知可能会受到偏见、先入为主的观念和扭曲的思维模式的影响,这些都可以从预训练过程中的大量数据中了解到。 如果大语言模型的自我反思被这种偏见或扭曲的思维所蒙蔽,它可能会本能地导致不准确的结论。 (2)刚性和变革阻力。 自我反思通常涉及挑战一个人的信念、假设和行为。 如果一个大语言模型抵制改变或持有僵化的信念,它可能很难进行有意义的自我反思,从而得出更好的解决方案。 (3) 有限的外部反馈。 自我反思主要是一个内部过程,但外部反馈可以提供有价值的观点和见解。 如果不寻求或考虑外部反馈,大语言模型可能会错过重要的盲点或可以丰富其自我反思的替代观点。
为了解决DoT问题,我们利用人类解决问题的另一个基本特征,即辩论,来鼓励大语言模型中的发散性思维。 具体来说,我们提出了 MAD 框架,是 Multi-Agent Debate 的缩写,其中两个代理表示他们自己的争论处于“针锋相对”的状态,由法官监督和管理辩论过程以获得最终的解决方案。 MAD的本质决定了:(1)一个大语言模型的扭曲思维可以被其他大语言模型纠正; (2)某一大语言模型对变革的抵制将会得到其他大语言模型的补充; (3)每个智能体都可以从其他智能体获得外部反馈。 因此,MAD不易受到DoT因素的影响,可以探索发散的思路来获得准确的解决方案。
我们通过常识机器翻译(Common MT)和反直觉算术推理(Counter-Intuitive AR)这两个具有挑战性的任务,对自然语言生成和理解进行了实验。 这两个任务的共同特点是,我们的直觉大多只基于问题的表面表达而产生错误,需要更深层次的思考才能得到更好的解决方案。 实验结果表明,我们的 MAD 框架的性能比基线方法好得多,特别是,使用 GPT-3.5-Turbo 的 MAD 可以超越 GPT-4 在 Common MT 上的性能。
这项工作的贡献总结如下:
-
•
我们在自我反思中提出并定义了思想退化(DoT)问题,并通过提出多智能体辩论(MAD)框架来探索不同的思想链来解决该问题。
-
•
我们在两个具有挑战性的任务上证明了 MAD 的有效性,并发现采用 MAD 的 GPT-3.5-Turbo 甚至可以在 Common MT 数据集上超越 GPT-4。
-
•
大量分析表明,MAD 需要自适应的辩论休息策略和适度的“针锋相对”状态才能获得良好的表现。 更有趣的是,我们发现,如果不同的大语言模型用于代理,大语言模型可能无法做出公平的判断。
2 多智能体辩论框架
算法1说明了MAD的详细过程。 一般来说,我们的MAD框架由三个组件组成,详细说明如下:
元提示。
我们使用元提示来介绍要解决的主题、辩手数量、迭代限制和其他要求。 比如,我们要求代理人“针锋相对”,营造辩论气氛。
辩手们。
该框架涉及位辩手。 在每次辩论迭代中,辩手按照固定的顺序逐一发言,并根据之前的辩论历史(即)表达自己的论点。 下面是辩手提示的示例:
你是一位辩论家。 大家好,欢迎来到翻译比赛,比赛将以辩论形式进行。 没有必要完全同意彼此的观点,因为我们的目标是找到正确的翻译。
法官。
我们还设计了一个法官来管理和监控整个辩论过程。 法官包含两种不同的模式: (a) 判别模式,法官在当前迭代中所有辩手完成辩论后决定是否能够得到正确的解决方案:
(1) |
如果它是True,则辩论结束。 否则,争论仍在继续。 (b) 提取模式,法官需要根据整个辩论历史提取最终解决方案:,因为没有正确的解决方案在辩论的迭代限制内确定。 判断提示的示例如下:
您是主持人。 将有两名辩手参加翻译辩论比赛。 他们将展示自己的翻译并讨论他们对给定中文文本“吃掉敌人一个师。”的正确英文翻译的看法。 在每轮结束时,您将评估候选人提交的翻译。
3 具有挑战性的测试平台
Source | 吃掉敌人一个师。 |
---|---|
Correct | Destroy a division of the enemy. |
Incorrect | Eat up an enemy division. |
我们对两个具有挑战性的任务进行了实验,即常识机器翻译(即 Common MT)和反直觉算术推理(即 Counter-Intuitive AR),这需要对大语言模型进行深入的思考。
3.1常识机器翻译
Common MT数据集由中文英文翻译样例He等人(2020)组成,用于检验翻译模型的歧义消解能力。 在 Common MT 的挑战性部分中,每个源句子都包含一个歧义词。 虽然这些模棱两可的词可能看起来有直截了当的翻译,但这种字面解释是错误的。 未能识别和解决此类歧义可能会导致翻译不准确。 在这项工作中,我们在以下实验中采用了词汇歧义测试集。 表1列出了一个例子,考虑到现实世界中的常识,源词“吃掉”应该翻译为“destroy”,而不是直接翻译为“eat up”。
3.2 反直觉算术推理
先前关于思维层次的研究Kong 等人 (2022); Wei等人(2022)认为,我们人类有一个快速而直观的系统和一个缓慢而逻辑的系统,并且倾向于先运行较低级别的系统,然后再运行较高级别的系统。 受此启发,我们创建了一个更具挑战性的数据集,名为反直觉算术推理(Counter-Intuitive AR),用于深层次评估大语言模型的推理能力。
Components | Content |
---|---|
Question | When Alice walks up the hill, her speed is 1 m/s and when she goes down the hill, her speed is 3 m/s. Then when Alice walks up and down the hill, what is her average speed? |
Correct Answer | 1.5 m/s |
Explanation | If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d / (4d/3) = 3/2 m/s. |
Incorrect Answer | 2 m/s |
Explanation | Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s. |
数据集描述。
我们的 Counter-Intuitive AR 数据集包含从启发问题中收集的 50 个问题 Kong 等人 (2022)222https://elicitation.info/questionnaire/1/,网络数据333https://www.geeksforgeeks.org/puzzles/和手动采集。 与常用数据集(例如 MultiArith Roy and Roth (2015)、GSM8K Cobbe 等人 (2021))相比,我们的数据集提出了两个不同的挑战:
-
•
抵制直觉。 我们数据集中的问题嵌入了隐藏的陷阱,旨在引出直观且有吸引力的答案,但这些答案通常是不正确的。 该功能评估大语言模型抵抗表面表达陷阱的能力。
-
•
多步骤推理。 数据集中的每个正确答案都需要严格的多步骤推理过程,从而评估大语言模型进行复杂决策和解决问题的能力。
数据集格式。
在我们的 Counter-Intuitive AR 数据集中,每个示例都包含三个关键组件(有关示例,请参阅表 2)。 我们详细阐述如下:
-
•
问题。 我们数据集中的问题旨在激发反直觉思维,旨在通过呈现即时、直觉反应通常不正确的情况来挑战传统决策。
-
•
答案。 每个问题都提供了正确答案,这需要对问题的深入理解和常识性知识。 此外,我们还提供了一个看似合理但不正确的答案来进行比较。
-
•
解释。 我们为每个正确答案提供了详细的解释。 解释概述了得出正确答案的逐步推理过程。 每个错误答案都会有一个解释来补充,展示看似逻辑推理过程,但最终会导致错误答案。 这种推理过程凸显了决策过程中潜在的陷阱和误解,特别是当直觉优先于严格的逻辑推理时。
4实验
4.1设置
骨干模型。
在这项工作中,我们主要在 MAD 框架中使用三个代理,包括两个辩手(即正方和反方)和一名法官。 除非另有说明,我们默认使用 GPT-3.5-Turbo 作为所有代理的骨干模型。
比较方法。
一般来说,我们在这两个任务上将我们的 MAD 框架与 GPT-3.5-Turbo、GPT-4 和 Self-Reflect 进行比较。 我们还单独包含其他基线方法,即常见 MT 的 Rerank 和 MAPS、反直觉 AR 的 CoT 和 Self-Consistency。 下面详细阐述它们:
-
•
自我反思 Shinn 等人 (2023):这种方法需要大语言模型仔细检查和完善其翻译,直到它认为当前的输出令人满意。
-
•
重新排序 He 等人 (2023):我们对大语言模型的翻译进行了四次采样,并根据质量评估 (QE) 评分器从中选择最佳候选者444我们使用 wmt21-comet-qe-da 作为 QE 记分器。. 这种方法可以被视为类似于自我一致性Wang等人(2022),其中多数投票被外部量化宽松评分者取代。
-
•
MAPS He 等人 (2023):该方法使大语言模型能够模仿人类翻译过程:分析然后翻译,可以将其视为一个链条-应用于翻译任务的思维方法。
-
•
CoT Kojima 等人 (2022):这种方法将触发句“Let’s think step by ste”连接到测试问题。
-
•
自洽 Wang 等人 (2022):该方法从大语言模型中采样多个响应,并通过多数投票确定最终答案。
我们在 GPT-3.5-Turbo 之上实现这些方法。 实施细节在附录A.1中描述。
Method | Automatic | Human | ||
COMET | BLEURT | Score | ACC (%) | |
GPT-4 | 82.0 | 70.1 | 3.41 | 68.5 |
GPT-3.5-Turbo | 80.3 | 68.2 | 3.14 | 62.5 |
+ Rerank | 80.9 | 68.6 | 3.16 | 63.5 |
+ MAPS | 81.9 | 70.1 | 3.43 | 70.5 |
+ Self-Reflect | 81.0 | 69.1 | 3.43 | 69.0 |
+ MAD | 82.0 | 70.9 | 3.78 | 79.5↑17.0 |
Source | 吃掉敌人一个师。 |
---|---|
Correct Reference | Destroy a division of the enemy. |
Incorrect Reference | Eat up an enemy division. |
GPT-4 | Eat up an enemy division. |
GPT-3.5-Turbo | Eat up an enemy division. |
+ Self-Reflect | Eat up an enemy division. |
+ MAD | Eliminate an enemy division. |
Source | 他从后门搞到了不少名酒。 |
Correct Reference | He got a lot of famous wines from the road of fraud. |
Incorrect Reference | He got a lot of famous wines from the back door. |
GPT-4 | He got quite a few famous wines from the back door. |
GPT-3.5-Turbo | He obtained a lot of famous wines from the back door. |
+ Self-Reflect | He obtained a good amount of high-quality liquor through the back door. |
+ MAD | He got a lot of famous liquor from an unofficial source. |
评估指标。
对于反直觉 AR,我们报告预测的准确性 (ACC)。 对于 Common MT,我们采用 COMET555https://github.com/Unbabel/COMET/, Unbabel/wmt22-cometkiwi-da 和 BLEURT666https://github.com/google-research/bleurt, BLEURT-20,这是基于 LLM 的翻译文献广泛采用的评估指标 He 等人 (2023); Hendy 等人 (2023);加西亚等人 (2023); Pilault 等人 (2023)6>。 此外,我们还从两个方面对翻译结果进行人工评估:歧义消除准确性和直接评估范围内的翻译质量。
4.2普通机器翻译
结果。
表3展示了实验结果。 MAPS 和 Self-Reflec 相对于基准 GPT-3.5-Turbo 实现了改进。 值得注意的是,我们提出的 MAD 通过利用 GPT-3.5 作为骨干模型,在自动和人工评估指标方面都表现出了比 GPT-4 显着的进步。
案例分析。
4.3 反直觉 AR
Method | ACC (%) |
---|---|
GPT-4 | 52.0 |
GPT-3.5-Turbo | 20.0 |
+ CoT | 24.0 |
+ Self-Consistency | 30.0 |
+ Self-Reflect | 20.0 |
+ MAD | 36.0 |
结果
表5列出了推理准确性方面的实验结果。 我们可以观察到,Self-Reflect 相对于基线 GPT-3.5-Turbo 并没有改进,而 CoT 和 Self-Consistency 带来了一些改进。 我们的MAD框架虽然不如GPT-4,但优于所有其他基于GPT-3.5-Turbo的比较方法,这进一步证明了其有效性。
Question A | |
---|---|
The two circles are externally tangent and there | |
is no relative sliding. The radius of circle A is | |
1/3 the radius of circle B. Circle A rolls around | |
circle B one trip back to its starting point. How | |
many times will circle A revolve in total? | |
Correct Answer | 4 |
GPT-4 | 4 |
GPT-3.5-Turbo | 3 |
+ Self-Reflect | 3 |
+ MAD | 4 |
Question B | |
When Alice walks up the hill, her speed is 1 m/s | |
and when she goes down the hill, her speed is 3 | |
m/s. Then when Alice walks up and down the | |
hill, what is her average speed? | |
Correct Answer | 1.5 m/s |
GPT-4 | 1.5 m/s |
GPT-3.5-Turbo | 2 m/s |
+ Self-Reflect | 2 m/s |
+ MAD | 1.5 m/s |
案例分析
5分析
我们进行广泛的分析,以更深入地了解我们的 MAD 框架。 默认情况下,我们使用 Common MT 数据集。
自适应中断的效果。
“针锋相对”状态的本质。
然后我们研究“以牙还牙”的强度如何影响 MAD 的表现。 为此,我们设计了不同的提示(参见附录B)来初始化辩论过程。 如图4所示,要求辩手“以牙还牙”(即更高的分歧)对于MAD取得良好的表现是必要的。 然而,我们发现“必须在每一点上都不一致”(分歧为 0.988)并不能带来最佳性能。 我们推测,持续存在分歧而找不到共同点可能会导致两极分化,使辩论更多地是为了赢得争论而不是寻求真理或理解。 这可能会加剧已有的偏见,并使达成共识或有意义的决定变得困难。
ID | Aff | Neg | Jud | V.Aff | V.Neg | V.Tie |
\small{1}⃝ | Turbo | Turbo | Turbo | 87 | 104 | 9 |
\small{2}⃝ | GPT-4 | GPT-4 | GPT-4 | 67 | 124 | 9 |
\small{3}⃝ | Turbo | GPT-4 | Turbo | 78 | 114 | 8 |
\small{4}⃝ | Turbo | GPT-4 | GPT-4 | 52 | 136 | 12 |
\small{5}⃝ | GPT-4 | Turbo | GPT-4 | 120 | 77 | 3 |
代理人的行为。
我们通过计算法官选择每个辩手的答案作为最终解决方案的次数来研究智能体的行为。 结果如表7所示,我们有以下观察结果:(1)比较行\small{1}⃝和\small{2}⃝,我们发现法官始终偏向于消极的一方,这被认为对 MAD 的性能改进有显着贡献。 当遇到复杂的任务时,正方容易犯错误,应该纠正错误,以取得改进。 (2) 比较行\small{3}⃝和\small{4}⃝(或者行\small{4}⃝和\small{5}⃝),我们发现法官更倾向于具有相同大的一方语言模型为骨干。 这种偏差表明,当不同的大语言模型用于代理时,大语言模型可能无法公平地判断Wang 等人(2023)。
6相关工作
思维链提示
最近,Wei等人(2022)提出了思想链(CoT)提示来提高大语言模型的推理能力。 具体来说,CoT 提示大语言模型生成一系列中间步骤,从而得出多步骤问题的最终答案。 大多数早期工作主要集中在两个主要方面:提示设计和解码策略。 零样本CoTKojima等人(2022)采用触发句“让我们一步一步思考”为大语言模型的解码提供指导。 人们已经探索了先进的采样策略,通过生成不同的推理路径来提高 CoT,例如,自我一致性 Wang 等人 (2022)、Auto-CoT Zhang 等人 (2022) 、主动提示Diao 等人(2023)、基于复杂性的一致性Fu 等人(2022)、多链推理Yoran 等人(2023) 和渐进提示Zheng 等人 (2023)。
随着强大的大语言模型的出现,基于自我评价的方法引起了越来越多的关注。 这些方法涉及生成初始输出,然后评估输出以获取反馈,然后利用反馈来完善输出。 评估反馈可以来自模型本身,例如Self-refine Madaan 等人 (2023) 和 Tree of Thought Yao 等人 (2023))或外部环境,例如、QAaP Zhu 等人 (2023a) 和 Reflection Shinn 等人 (2023)。 这些方法背后的直觉涉及利用强大的大语言模型来模仿人类认知过程。
生成代理
最近,基于LLM的多智能体,例如Generative Agents Park等人(2023),Ghost in the Minecraft Zhu等人(2023b),GPT-Bargaining Fu 等人 (2023),由于能够模拟人类行为而引起了极大的关注。 我们的工作就是沿着这个研究思路来解决大语言模型的DoT问题。 在我们工作的同时,还有一些研究Xiong 等人 (2023); Du等人(2023)也探索了多智能体辩论框架来增强大语言模型的推理能力。 所提出的MAD框架与这些方法之间的主要区别是:(1)我们的工作旨在解决DoT问题,这是大语言模型的固有缺陷; (2)我们凭经验发现,我们的 MAD 框架可以通过采用具有相同骨干大语言模型的代理来提高性能。
7结论
我们在自我反思中提出并定义了思想退化(DoT)问题,并通过提出多智能体辩论(MAD)框架来探索不同的思想链来解决该问题。 我们证明了 MAD 在两项具有挑战性的任务上的有效性,并发现采用 MAD 的 GPT-3.5-Turbo 甚至可以在 Common MT 数据集上超越 GPT-4。 大量分析表明,MAD 需要自适应的辩论休息策略和适度的“针锋相对”状态才能获得良好的表现。 更有趣的是,我们发现,如果不同的大语言模型用于代理,大语言模型可能无法做出公平的判断。 未来的工作可能包括在辩论中安排更多的智能体、棋盘游戏的多智能体以及用于模型对齐的人工智能反馈。
参考
- Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv.
- Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv.
- Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv.
- Fu et al. (2023) Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
- Garcia et al. (2023) Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. arXiv.
- Gou et al. (2023) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing.
- He et al. (2020) Jie He, Tao Wang, Deyi Xiong, and Qun Liu. 2020. The box is in the pen: Evaluating commonsense reasoning in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662–3672, Online. Association for Computational Linguistics.
- He et al. (2023) Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2023. Exploring human-like translation strategy with large language models. arXiv.
- Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv.
- Jiao et al. (2023) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
- Kong et al. (2022) Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. 2022. Eliciting thinking hierarchy without a prior. NeurIPS.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv.
- OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv.
- Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv.
- Pilault et al. (2023) Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. arXiv.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1743–1752. Association for Computational Linguistics (ACL).
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.
- Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
- Wu et al. (2023) Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. arXiv.
- Xiong et al. (2023) Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv.
- Yoran et al. (2023) Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. arXiv.
- Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv.
- Zheng et al. (2023) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv.
- Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. arXiv.
- Zhu et al. (2023a) Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023a. Question answering as programming for solving time-sensitive questions. arXiv preprint arXiv:2305.14221.
- Zhu et al. (2023b) Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023b. Ghost in the minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv.
附录 A示例附录
A.1实施
图 5 显示了辩论历史记录的典型模板,根据 Turbo API 进行格式化。
I believe the correct translation of the Chinese text “吃掉敌人一个师” is “Eat up an enemy division.” The word “吃掉” literally means “eat up” or “consume.” |
I disagree with your translation. I believe the correct translation of the Chinese text “吃掉敌人一个师” is “Eliminate an enemy division." While “吃掉” can be translated as “eat up” or “consume”, it can also mean "eliminate" or "annihilate" in a military context. In this case, it is more appropriate to use "eliminate" as it conveys the idea of completely destroying the enemy division. |
Both sides presented valid arguments and translations that accurately convey the meaning of the original Chinese text. The choice between ’eat up’ and ’eliminate’ may depend on the context and tone of the text. Therefore, there is no clear preference for either side at this time. |
While “eliminate” does convey the idea of completely destroying the enemy division, it doesn’t capture the connotation of consuming or devouring the enemy that is implied by “吃掉.” |
In a military context, “吃掉” is often used to describe the complete destruction of the enemy, rather than the consumption of their resources or territory. |
Both sides presented valid arguments, but the negative side’s argument that ‘eliminate’ better captures the intensity and seriousness of the military language used in the original Chinese text was more convincing. Additionally, ‘eliminate’ accurately conveys the idea of completely destroying the enemy division. Therefore, the correct English translation of the Chinese text ‘吃掉敌人一个师’ is ‘Eliminate an enemy division.’ |
|
The phrase “从后门” literally translates to “from the back door.” |
I actually disagree with your translation. I believe a more accurate translation would be "He got a lot of famous liquor from the back door. Lastly, while “从后门” does literally translate to “from the back door,” it is a common Chinese idiom that means obtaining something through unofficial or illegitimate means. Therefore, the use of “from the back door” in the translation is not entirely accurate. Instead, the use of “from an unofficial source” or "through unofficial means" would be more appropriate. |
Both sides presented valid arguments and translations that accurately capture the meaning of the original Chinese text. The choice between wine and liquor is a matter of interpretation and context. The use of from the back door versus from an unofficial source is a matter of preference and style. Therefore, there is no clear preference for a translation candidate. |
I agree that “从后门” is a common Chinese idiom that means obtaining something through unofficial or illegitimate means. The use of “from an unofficial source” or “through unofficial means” can be more appropriate than “from the back door.” |
Yes, I agree with your perspective. |
While both translations accurately capture the meaning of the original Chinese text, the use of ‘liquor’ in candidate 2 is more accurate as ‘名酒’ is more commonly used to refer to liquor. Additionally, the use of ’from an unofficial source’ in candidate 3 more accurately conveys the connotation of ‘从后门’ as obtaining something through unofficial or illegitimate means. Therefore, the correct translation is: ‘He got a lot of famous liquor from an unofficial source.’ |
Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s. |
I disagree with your answer. The problem doesn’t give us any information about the distance Alice covers while going up and down the hill. So, we cannot simply add her speeds and divide by 2 to get her average speed. We need to know the distance she covers to calculate her average speed. |
The average speed is calculated by dividing the total distance by the total time taken. As Alice walks up and down the hill, the distance covered in both directions is the same. Therefore, the total distance covered is 2 times the distance of either direction. |
That’s a good point. But we can assume that Alice covers the same distance going up and down the hill. In that case, we can use the formula for average speed, which is total distance divided by total time. If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d/(4d/3) = 3/2 m/s. |
Thank you for providing your answer. This average speed of 1.5 m/s takes into account both the uphill and downhill speeds, providing a balanced representation of Alice’s overall performance while walking up and down the hill. |
Yes. Let’s assume the distance is d. The time taken to walk up the hill is d/1 and the time taken to walk down the hill is d/3. The total time taken is d/1 + d/3 = 4d/3. Therefore, the average speed is (2d)/(4d/3) = 1.5 m/s. Hence, the negative side is correct. |
First, we need to find out how much of the original weight was water. If we have 1 ton of apples with 90% water, then the weight of water in those apples is 0.9 x 1 ton = 0.9 tons. Next, we need to find out how much of the new weight is water. If the apples now have 80% water, then the weight of water in those apples is 0.8 x new weight. We can set up an equation: |
0.8 x new weight = 0.9 tons |
To solve for the new weight, we can divide both sides by 0.8: |
new weight = 0.9 tons / 0.8 |
new weight = 1.125 tons |
Therefore, the weight of the apples now is 1.125 tons. |
That’s not correct. It’s important to note that the weight of the apples has decreased due to the loss of water, but the weight of the remaining solids remains the same. This is why we need to calculate the weight of remaining solids in both scenarios to find the new weight of the apples. |
Great idea! While it is true that the water content of the apples decreases from 90% to 80%, the solid, non-water content of the apples does not change. |
You are correct. The weight of the apples has decreased due to the loss of water, but the weight of the remaining solids remains the same. Therefore, we need to calculate the weight of the remaining solids in both scenarios to find the new weight of the apples. We can set up an equation: |
0.2 x new weight = 0.1 tons |
To solve for the new weight, we can divide both sides by 0.2: |
new weight = 0.1 tons / 0.2 |
new weight = 0.5 tons |
Therefore, the weight of the apples now is 0.5 tons. |
Thank you for your perspective. While it’s true that the weight of the apples themselves hasn’t changed, the overall weight of the apples with the water included has decreased. So, the weight of the apples now is 0.5 tons. |
The negative side correctly calculated the new weight of the apples by considering the solid content as a constant and using the formula to find the new weight based on the reduced water content percentage. |
A.2辩论案例
我们观察到,肯定方()通常依赖于直接直觉,这可能会导致不正确或不恰当的反应。 相反,消极的一面()表现出识别和纠正错误的能力。
附录B“针锋相对”状态的电平控制
我们通过将自然语言指令附加到辩手的元提示中来调整第 5 节中概述的“以牙还牙”状态的级别。 表 12 中列出了所有相应的提示。
Level | Prompt |
---|---|
0 | Both sides must reach a full consensus on every point of the debate. Every statement must be agreed upon by both sides. |
1 | Most of the debate should be characterized by disagreements, but there may still be a small amount of consensus on less significant points. |
2 (Default) | It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct answer. |
3 | Both sides must disagree with each other on every point of the debate. There should be no consensus whatsoever. |