Encouraging Divergent Thinking in Large Language Models
through Multi-Agent Debate

Tian Liang1*  Zhiwei He2*  Wenxiang Jiao3*  Xing Wang3†  Yan Wang
Rui Wang2  Yujiu Yang1  Zhaopeng Tu3  Shuming Shi3
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Shanghai Jiao Tong University 3Tencent AI Lab

{liangt21@mails,yang.yujiu@sz}.tsinghua.edu.cn {zwhe.cs}@sjtu.edu.cn
{joelwxjiao,brightxwang,shumingshi}@tencent.com
Tian, Zhiwei and Wenxiang contributed equally and are co-first authors. Work was done when Tian and Zhiwei were interning at Tencent AI Lab. Xing Wang and Yujiu Yang are co-corresponding authors.
Abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of “tit for tat” and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Codes: https://github.com/Skytliang/Multi-Agents-Debate

1 Introduction

Figure 1: Degeneration-of-thought with respect to the iteration of self-reflection (or debate), which is measured by the disagreement of stances ([0,1]) between two adjacent iterations. Results are calculated for the examples that are incorrectly predicted by CoT. Clearly, the proposed multi-agent debate method can alleviate the DoT problem by producing more divergent thoughts, while the self-reflection method fails.
Figure 2: Framework of Multi-Agent Debate. Here we designate the devil as the affirmative side and the angel as the negative side. We want the angel to correct the devil’s mistakes.

Modern large language models (LLMs) like ChatGPT, GPT-4 OpenAI (2023), and Bard (https://bard.google.com/) have shown remarkable performance on general language tasks Jiao et al. (2023); Wu et al. (2023); Bang et al. (2023) but still struggle on complex reasoning tasks Zhu et al. (2022); Gou et al. (2023), which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. In particular, self-reflection Madaan et al. (2023); Shinn et al. (2023), a concept that usually refers to the process of introspection and examination of a person’s own thoughts, has been explored to solve intricate tasks that could be challenging for zero-shot generation or even chain-of-thought (CoT) prompting Wei et al. (2022). Specifically, self-reflection involves an iterative refinement process such that the LLM generates a new answer based on the answers and feedback in previous iterations and then provides feedback for the new answer. While self-reflection can be effective in creating better solutions, it is highly dependent on the self-evaluation capabilities of LLMs, which are not formally guaranteed Shinn et al. (2023).

In this work, we focus on the Degeneration-of-Thought (DoT) problem in self-reflection, which is proposed and defined by us for the first time. Formally, DoT describes the following scenario:

Once the LLM has established confidence in its answers, it is unable to generate novel thoughts later through self-reflection even if the initial stance is incorrect.

To demonstrate this problem, we define the average disagreement as the percentage of opposition between two debaters in debate (or self-confliction in self-reflection) for each question. As shown in Figure 1, we calculate the disagreement of stances between every two adjacent iterations in self-reflection and plot the trend. The low disagreement of self-reflection suggests that the LLM sticks to the incorrect answers predicted by CoT and is unable to engage in meaningful self-reflection. There are various factors that could result in DoT, and we outline three here: (1) Bias and Distorted Perception. Self-perception can be influenced by biases, preconceived notions, and distorted thinking patterns, which can be learned from the massive amount of data during pretraining. If an LLM’s self-reflection is clouded by such biases or distorted thinking, it can lead to inaccurate conclusions instinctively. (2) Rigidity and Resistance to Change. Self-reflection often involves challenging one’s beliefs, assumptions, and behaviors. If an LLM is resistant to change or holds rigid beliefs, it may struggle to engage in meaningful self-reflection that leads to better solutions. (3) Limited External Feedback. Self-reflection is primarily an internal process, but external feedback can provide valuable perspectives and insights. Without seeking or considering external feedback, an LLM may miss important blind spots or alternative viewpoints that can enrich its self-reflection.
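The disagreement measure described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and comparing stances by exact string equality is our simplification, since the text does not specify how two stances are judged to differ.

```python
from typing import Sequence


def adjacent_disagreement(stances: Sequence[str]) -> float:
    """Fraction of adjacent iteration pairs whose stances differ, in [0, 1].

    `stances` holds one stance per iteration for a single question.
    Exact string comparison is an assumption made for this sketch.
    """
    pairs = list(zip(stances, stances[1:]))
    if not pairs:
        return 0.0  # fewer than two iterations: nothing to compare
    return sum(a != b for a, b in pairs) / len(pairs)


def average_disagreement(runs: Sequence[Sequence[str]]) -> float:
    """Mean of the per-question disagreement over a set of questions."""
    if not runs:
        return 0.0
    return sum(adjacent_disagreement(r) for r in runs) / len(runs)


# A run that never changes its stance has zero disagreement,
# which is the degenerate behaviour the DoT definition describes.
stuck = ["2 m/s", "2 m/s", "2 m/s"]
revised = ["2 m/s", "1.5 m/s", "1.5 m/s"]
print(adjacent_disagreement(stuck), adjacent_disagreement(revised))
```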

To address the DoT issue, we leverage another fundamental characteristic of human problem-solving, i.e., debate, to encourage divergent thinking in LLMs. Specifically, we propose the MAD framework, short for Multi-Agent Debate, where two agents express their own arguments in the state of “tit for tat” and a judge monitors and manages the debate process to obtain a final solution. The nature of MAD determines that (1) The distorted thinking of one LLM can be corrected by the others; (2) The resistance to change of one LLM will be complemented by the others; and (3) each agent can obtain external feedback from the others. Therefore, MAD is less susceptible to the factors of DoT, and can explore divergent chain-of-thoughts to achieve accurate solutions.

We conducted experiments on both natural language generation and understanding through two challenging tasks, namely, Commonsense Machine Translation (Common MT) and Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR). The common characteristic of the two tasks is that our instincts are mostly incorrect based on only the superficial expressions of the questions, and deeper levels of contemplation are required for better solutions. Experimental results demonstrate that our MAD framework performs much better than the baseline methods; in particular, MAD with GPT-3.5-Turbo surpasses GPT-4 on Common MT.

The contributions of this work are summarized as follows:

  • We propose and define the Degeneration-of-Thought (DoT) problem in self-reflection, and address it by proposing the Multi-Agent Debate (MAD) framework to explore divergent chain-of-thoughts.

  • We demonstrate the effectiveness of MAD on two challenging tasks, and find that GPT-3.5-Turbo with MAD can even surpass GPT-4 on the Common MT dataset.

  • Extensive analyses suggest that the adaptive break strategy of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. More interestingly, we find that LLMs might not be a fair judge if different LLMs are used for agents.

2 Multi-Agent Debate Framework

Algorithm 1 MAD: Multi-Agents Debate
Require: Debate topic t, maximum number of rounds M, and number of debaters N
Ensure: Final answer a
procedure MAD(t, M, N)
    J                              ▷ Initialize the judge
    D ← [D_1, ..., D_N]            ▷ Initialize debaters
    H ← [t]                        ▷ Initialize debate history
    m ← 0                          ▷ Current round
    while m ≤ M do
        m ← m + 1
        for each D_i in D do
            h ← D_i(H)             ▷ Generate argument
            H ← H + [h]            ▷ Append h to H
        if J_d(H) then
            break                  ▷ Debate is over
    a ← J_e(H)                     ▷ Extract the final answer
    return a

Algorithm 1 illustrates the detailed process of MAD. Generally, our MAD framework is composed of three components which are elaborated as follows:

Meta Prompts.

We use meta prompts to introduce the topic to be solved, the number of debaters, the iteration limit, and other requirements. For example, we require the agents to “tit for tat” so as to create an atmosphere of debate.

Debaters.

There are N debaters D = {D_i}_{i=1}^{N} involved in the framework. In each debate iteration, the debaters D_i speak one by one in a fixed order and express their arguments based on the previous debate history H, i.e., D_i(H) = h. An example of a debater prompt appears below:

You are a debater. Hello and welcome to the translation competition, which will be conducted in a debate format. It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct translation.

Judge.

We also design a judge J to manage and monitor the whole debate process. The judge operates in two different modes: (a) Discriminative Mode, in which the judge J decides whether the correct solution can be obtained after all the debaters finish their arguments in the current iteration:

J_d(H) = \begin{cases} \text{True}, & \text{solution obtained} \\ \text{False}, & \text{otherwise} \end{cases} \qquad (1)

If it is True, the debate is over. Otherwise, the debate continues. (b) Extractive Mode, in which the judge J needs to extract the final solution based on the whole debate history: J_e(H) = a, since no correct solution is identified within the iteration limit of debate. An example of a judge prompt appears below:

You are a moderator. There will be two debaters involved in a translation debate competition. They will present their translations and discuss their perspectives on the correct English translation of the given Chinese text: "吃掉敌人一个师。". At the end of each round, you will evaluate the candidates’ translation submissions.
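The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the released implementation: the debaters and the two judge modes are passed in as plain callables (hypothetical stand-ins for prompted LLM calls), and we read the iteration limit as "at most max_rounds rounds", which is an assumption on our part.

```python
from typing import Callable, List

Debater = Callable[[List[str]], str]         # D_i(H) -> h
Discriminator = Callable[[List[str]], bool]  # J_d(H) -> True if solved
Extractor = Callable[[List[str]], str]       # J_e(H) -> final answer a


def mad(
    topic: str,
    debaters: List[Debater],
    judge_done: Discriminator,
    judge_extract: Extractor,
    max_rounds: int,
) -> str:
    """Run a debate and return the answer extracted by the judge."""
    history: List[str] = [topic]          # H <- [t]
    for _ in range(max_rounds):           # at most max_rounds rounds (assumed)
        for debater in debaters:          # fixed speaking order
            history.append(debater(history))
        if judge_done(history):           # adaptive break (discriminative mode)
            break
    return judge_extract(history)         # extractive mode


# Toy stand-ins for the LLM-backed agents, for illustration only.
affirmative = lambda h: "Eat up an enemy division."
negative = lambda h: "Destroy a division of the enemy."
done_after_one_round = lambda h: len(h) >= 3
take_last_argument = lambda h: h[-1]

print(mad("吃掉敌人一个师。", [affirmative, negative],
          done_after_one_round, take_last_argument, max_rounds=3))
```

Keeping the agents as callables separates the debate schedule from the prompting details, so the same loop can be reused with any backbone model.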

3 Challenging Testbeds

Source 吃掉敌人一个师。
Correct Destroy a division of the enemy.
Incorrect Eat up an enemy division.
Table 1: An example from the Common MT dataset. The underlined Chinese words are translated into the corresponding colored words in English.

We conduct experiments on two challenging tasks, namely, commonsense machine translation (i.e., Common MT), and counter-intuitive arithmetic reasoning (i.e., Counter-Intuitive AR), which require deep levels of contemplation for LLMs.

3.1 Commonsense Machine Translation

The Common MT dataset is composed of Chinese-to-English translation examples He et al. (2020), which are used to examine the ambiguity resolution ability of translation models. Within the challenging part of Common MT, each source sentence contains an ambiguous word. While these ambiguous words might appear to have a straightforward translation, such a literal interpretation is erroneous. Failure to identify and address such ambiguities may result in inaccurate translations. In this work, we adopt the lexical ambiguity test set in the following experiment. Table 1 lists an example, where the source word “吃掉” should be translated to “destroy” rather than the straightforward translation “eat up” by considering the common sense in the real world.

3.2 Counter-Intuitive Arithmetic Reasoning

Previous studies on thinking hierarchy Kong et al. (2022); Wei et al. (2022) suggest that we humans have a fast and intuitive system and a slow and logical system, and tend to run the lower level system before the higher level one. Inspired by this, we created a more challenging dataset named Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR) to evaluate the reasoning abilities of LLMs at deep levels.

Components Content
Question When Alice walks up the hill, her speed is 1 m/s and when she goes down the hill, her speed is 3 m/s. Then when Alice walks up and down the hill, what is her average speed?
Correct Answer 1.5 m/s
Explanation If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d / (4d/3) = 3/2 m/s.
Incorrect Answer 2 m/s
Explanation Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s.
Table 2: An example in Counter-Intuitive AR dataset.
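The arithmetic behind Table 2 can be checked with a short script. It is included only as a worked illustration; the function name is ours. The hill length d cancels, so the round-trip average is the harmonic mean of the two speeds rather than their arithmetic mean.

```python
from fractions import Fraction


def round_trip_average_speed(up: Fraction, down: Fraction) -> Fraction:
    """Average speed over equal distances: 2d / (d/up + d/down), with d cancelled."""
    return Fraction(2) / (Fraction(1) / up + Fraction(1) / down)


up, down = Fraction(1), Fraction(3)
correct = round_trip_average_speed(up, down)  # total distance / total time
intuitive = (up + down) / 2                   # the tempting but wrong shortcut

print(correct, intuitive)
```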

Dataset Description.

Our Counter-Intuitive AR dataset contains 50 questions collected from elicitation questions Kong et al. (2022) (https://elicitation.info/questionnaire/1/), web data (https://www.geeksforgeeks.org/puzzles/), and manual collection. Compared to the commonly-used datasets, e.g., MultiArith Roy and Roth (2015), GSM8K Cobbe et al. (2021), our dataset presents two distinct challenges:

  • Resistance to Intuition. The questions in our dataset contain hidden traps designed to elicit intuitive and appealing answers that are often incorrect. This feature evaluates the abilities of LLMs to resist the traps of superficial expressions.

  • Multi-Step Reasoning. Each correct answer within the dataset requires a rigorous multi-step reasoning process, thereby evaluating the capacity of LLMs to engage in complex decision-making and problem-solving.

Dataset Format.

In our Counter-Intuitive AR dataset, each example contains three key components (see Table 2 for an example). We elaborate on the details below:

  • Questions. The questions in our dataset are designed to stimulate counter-intuitive thinking, which aims to challenge conventional decision-making by presenting situations where the immediate, intuitive response is often incorrect.

  • Answers. Each question is provided with a correct answer, which requires deep comprehension of the question and commonsense knowledge. Additionally, we also provide a plausible yet incorrect answer for comparison.

  • Explanations. We provide a detailed explanation for each correct answer. The explanation outlines the step-by-step reasoning process that leads to the correct answer. Each incorrect answer is also complemented by an explanation demonstrating a seemingly logical reasoning process but ultimately leading to the incorrect answer. This reasoning process highlights the potential pitfalls and misconceptions during decision-making, especially when intuition is prioritized over rigorous logical reasoning.
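One way to hold such an example in code is sketched below. The class and field names are hypothetical and chosen for this illustration; they are not the schema of the released dataset. The instance reuses the content of Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterIntuitiveExample:
    """One record with the three components: question, answers, explanations.

    Field names are illustrative assumptions, not the dataset's actual keys.
    """
    question: str
    correct_answer: str
    correct_explanation: str
    incorrect_answer: str
    incorrect_explanation: str


EXAMPLE = CounterIntuitiveExample(
    question=(
        "When Alice walks up the hill, her speed is 1 m/s and when she goes "
        "down the hill, her speed is 3 m/s. Then when Alice walks up and down "
        "the hill, what is her average speed?"
    ),
    correct_answer="1.5 m/s",
    correct_explanation=(
        "Total distance is 2d; total time is d + d/3 = 4d/3; "
        "so the average speed is 2d / (4d/3) = 3/2 m/s."
    ),
    incorrect_answer="2 m/s",
    incorrect_explanation="(1 m/s + 3 m/s) / 2 = 2 m/s.",
)

print(EXAMPLE.correct_answer, "vs", EXAMPLE.incorrect_answer)
```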

4 Experiment

4.1 Setups

Backbone Models.

In this work, we mainly use three agents in our MAD framework, including two debaters (i.e., affirmative and negative) and a judge. Unless otherwise stated, we use GPT-3.5-Turbo as the backbone model for all agents by default.

Compared Methods.

Generally, we compare our MAD framework with GPT-3.5-Turbo, GPT-4, and Self-Reflect on both tasks. We also include task-specific baselines, namely, Rerank and MAPS for Common MT, and CoT and Self-Consistency for Counter-Intuitive AR. These methods are detailed below:

  • Self-Reflect Shinn et al. (2023): This approach requires the LLM to scrutinize and refine its translation until it deems the current output satisfactory.

  • Rerank He et al. (2023): We sample translations from the LLM four times and select the best candidate based on a quality estimation (QE) scorer (we use wmt21-comet-qe-da as the QE scorer). This approach can be seen as analogous to self-consistency Wang et al. (2022), where the majority voting is replaced by an external QE scorer.

  • MAPS He et al. (2023): This method enables the LLM to mimic the human translation process: analyze and then translate, which can be viewed as a chain-of-thought method applied to the translation task.

  • CoT Kojima et al. (2022): This approach concatenates a trigger sentence “Let’s think step by step” to the test question.

  • Self-Consistency Wang et al. (2022): This method samples multiple responses from LLMs and determines the final answer through a majority vote.

We implement the methods on top of GPT-3.5-Turbo. The implementation details are described in Appendix A.1.
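The two sampling-based baselines differ only in how they pick among candidates, which the following sketch makes explicit. It is an illustration under assumptions: the function names are ours, answers are compared as exact strings, and the QE scorer is passed in as an arbitrary callable rather than a call to any real scoring library.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]


def rerank(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reference-free quality score.

    `qe_score(source, candidate)` stands in for an external QE scorer.
    """
    return max(candidates, key=lambda c: qe_score(source, c))


# Majority vote over sampled answers to an arithmetic question.
print(self_consistency(["2 m/s", "1.5 m/s", "2 m/s"]))

# Toy scorer for illustration only; a real QE model would be used instead.
toy_scorer = lambda src, cand: 1.0 if "Destroy" in cand else 0.0
print(rerank("吃掉敌人一个师。",
             ["Eat up an enemy division.", "Destroy a division of the enemy."],
             toy_scorer))
```

Note that a majority vote returns the intuitive answer whenever most samples fall into the same trap, which is one reason such baselines can struggle on counter-intuitive questions.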

Method           COMET   BLEURT   Human Score   Human ACC (%)   Δ ACC vs. GPT-3.5-Turbo
GPT-4            82.0    70.1     3.41          68.5            -
GPT-3.5-Turbo    80.3    68.2     3.14          62.5            -
+ Rerank         80.9    68.6     3.16          63.5            ↑1.0
+ MAPS           81.9    70.1     3.43          70.5            ↑8.0
+ Self-Reflect   81.0    69.1     3.43          69.0            ↑6.5
+ MAD            82.0    70.9     3.78          79.5            ↑17.0
Table 3: Translation performance on Common MT. COMET and BLEURT are automatic metrics; Score and ACC are from human evaluation. Note that Rerank and MAPS use the external quality estimation tool to select the best translation from multiple translation candidates.
Source 吃掉敌人一个师。
Correct Reference Destroy a division of the enemy.
Incorrect Reference Eat up an enemy division.
GPT-4 Eat up an enemy division.
GPT-3.5-Turbo Eat up an enemy division.
+ Self-Reflect Eat up an enemy division.
+ MAD Eliminate an enemy division.
Source 他从后门搞到了不少名酒。
Correct Reference He got a lot of famous wines from the road of fraud.
Incorrect Reference He got a lot of famous wines from the back door.
GPT-4 He got quite a few famous wines from the back door.
GPT-3.5-Turbo He obtained a lot of famous wines from the back door.
+ Self-Reflect He obtained a good amount of high-quality liquor through the back door.
+ MAD He got a lot of famous liquor from an unofficial source.
Table 4: Example translations generated by baseline GPT-3.5-Turbo, Self-Reflect and the proposed MAD. We also provide the translation outputs generated by GPT-4. Best viewed in color.

Evaluation Metrics.

For Counter-Intuitive AR, we report the accuracy (ACC) of predictions. For Common MT, we adopt automatic metrics like COMET (https://github.com/Unbabel/COMET/, Unbabel/wmt22-cometkiwi-da) and BLEURT (https://github.com/google-research/bleurt, BLEURT-20), which are widely adopted evaluation metrics in the LLM-based translation literature He et al. (2023); Hendy et al. (2023); Garcia et al. (2023); Pilault et al. (2023). Moreover, we also employ human evaluation for the translation results in terms of two aspects: ambiguity resolution accuracy and direct assessment of translation quality in the range [1,5].

4.2 Common MT

Results.

Table 3 presents the experimental results. MAPS and Self-Reflect achieve improvements over the baseline GPT-3.5-Turbo. Remarkably, our proposed MAD, with GPT-3.5-Turbo as the backbone model, matches GPT-4 on COMET (82.0) and surpasses it on BLEURT and on both human evaluation metrics.

Case Study.

Table 4 shows example translations generated by baseline GPT-3.5-Turbo and the proposed MAD. We can find that the baseline GPT-3.5-Turbo (even the more powerful GPT-4) incorrectly translates the source words literally. Because of the DoT issue, Self-Reflect cannot rectify the literal translation. The proposed MAD framework, which explores divergent chain-of-thoughts, can generate the free translation of the underlined words within the source sentences. The detailed debate process of translation examples can be found in Appendix A.2.

4.3 Counter-Intuitive AR

Method ACC (%)
GPT-4 52.0
GPT-3.5-Turbo 20.0
+ CoT 24.0
+ Self-Consistency 30.0
+ Self-Reflect 20.0
+ MAD 36.0
Table 5: Reasoning accuracy on Counter-Intuitive AR.

Results.

Table 5 lists the experimental results in terms of reasoning accuracy. We can observe that Self-Reflect does not improve over the baseline GPT-3.5-Turbo, while CoT and Self-Consistency bring some improvements. Our MAD framework, though not as good as GPT-4, outperforms all the other compared methods based on GPT-3.5-Turbo, which further demonstrates its effectiveness.

Question A
The two circles are externally tangent and there is no relative sliding. The radius of circle A is 1/3 the radius of circle B. Circle A rolls around circle B one trip back to its starting point. How many times will circle A revolve in total?
Correct Answer 4
GPT-4 4
GPT-3.5-Turbo 3
+ Self-Reflect 3
+ MAD 4
Question B
When Alice walks up the hill, her speed is 1 m/s and when she goes down the hill, her speed is 3 m/s. Then when Alice walks up and down the hill, what is her average speed?
Correct Answer 1.5 m/s
GPT-4 1.5 m/s
GPT-3.5-Turbo 2 m/s
+ Self-Reflect 2 m/s
+ MAD 1.5 m/s
Table 6: Example predictions generated by baseline GPT-3.5-Turbo, Self-Reflect and the proposed MAD. We also provide the results by GPT-4.

Case Study.

Table 6 presents two example outputs on Counter-Intuitive AR. We find that both CoT and Self-Reflect fail to reach the right answer. With divergent thinking, our MAD framework arrives at the insight that “we need to consider both the rotation around circle B and the rotation of circle A itself” and finds the correct answer. The detailed debate process can be found in Appendix A.2.
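For reference, the answer to Question A can be derived as follows. This is a standard geometric argument added for illustration, with r and R introduced here as the radii of circles A and B, so that R = 3r. Because the circles are externally tangent, the center of circle A travels along a circle of radius R + r, and rolling without sliding means circle A turns once for every 2πr its center advances:

```latex
\[
  \text{revolutions of } A
  = \frac{2\pi (R + r)}{2\pi r}
  = \frac{R + r}{r}
  = \frac{3r + r}{r}
  = 4 .
\]
```

The intuitive answer of 3 corresponds to R / r, which accounts only for the arc length of circle B that is rolled over and omits the extra turn contributed by the trip around circle B.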

5 Analysis

We conduct extensive analyses to gain a deeper understanding of our MAD framework. By default, we use the Common MT dataset.

Effect of Adaptive Break.

We first investigate the stopping strategy of debate. For each iteration, we force the judge J to extract the final answer (a = J_e(H)) instead of adaptively breaking the debate as in Algorithm 1. Figure 3 shows the results. We can observe that MAD performs better than self-reflection as the iteration increases. However, the highest COMET score appears at the first iteration and is also lower than the result of the adaptive break. It indicates that, for most examples, MAD can generate good translations at the first iteration such that the debate should be stopped. Forcing the debate to continue will harm the translation results, which demonstrates the reasonableness of our adaptive break strategy.

Figure 3: Translation performance with respect to the iteration of debate or self-reflection.

Essence of “Tit for Tat” State.

We then study how the intensity of “tit for tat” affects the performance of MAD. To do so, we design different prompts (see Appendix B) to initialize the debate process. As shown in Figure 4, asking the debaters to “tit for tat” (i.e., higher disagreement) is necessary for MAD to achieve good performance. However, we find that “must disagree with each other on every point” (with a disagreement of 0.988) does not lead to the best performance. We speculate that continuous disagreement without finding common ground can contribute to polarization, where the debate becomes more about winning the argument than seeking truth or understanding. This can reinforce pre-existing biases and make it difficult to reach a consensus or meaningful decision.

Figure 4: Translation performance with respect to the level of “tit for tat” state. We also show the corresponding disagreement during the debate.
ID   Aff     Neg     Jud     V.Aff   V.Neg   V.Tie
①    Turbo   Turbo   Turbo   87      104     9
②    GPT-4   GPT-4   GPT-4   67      124     9
③    Turbo   GPT-4   Turbo   78      114     8
④    Turbo   GPT-4   GPT-4   52      136     12
⑤    GPT-4   Turbo   GPT-4   120     77      3
Table 7: Behavior of agents in MAD. V.Aff (V.Neg) denotes the number of times the affirmative (negative) side is chosen for the final solution.

Behavior of Agents.

We study the behavior of agents by calculating how many times the judge chooses the answers of each debater as the final solution. The results are listed in Table 7 and we have the following observations: (1) Comparing rows ① and ②, we find that the judge consistently favors the negative side, which is believed to contribute significantly to the performance improvement in MAD. When encountering complex tasks, the affirmative side tends to make mistakes that should be corrected to achieve improvements. (2) Comparing rows ③ and ④ (or rows ④ and ⑤), we find the judge shows a preference for the side with the same LLM as the backbone. This bias indicates that LLMs might not be a fair judge Wang et al. (2023) when different LLMs are used for the agents.
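The comparisons above are easier to read as vote shares. The short script below recomputes them from the counts in Table 7; the variable and function names are ours, and nothing beyond the tabulated numbers is assumed.

```python
# (affirmative model, negative model, judge model, V.Aff, V.Neg, V.Tie), rows 1-5
TABLE_7 = [
    ("Turbo", "Turbo", "Turbo", 87, 104, 9),
    ("GPT-4", "GPT-4", "GPT-4", 67, 124, 9),
    ("Turbo", "GPT-4", "Turbo", 78, 114, 8),
    ("Turbo", "GPT-4", "GPT-4", 52, 136, 12),
    ("GPT-4", "Turbo", "GPT-4", 120, 77, 3),
]


def vote_shares(v_aff: int, v_neg: int, v_tie: int) -> tuple:
    """Convert raw judge decisions into fractions of all decisions."""
    total = v_aff + v_neg + v_tie
    return v_aff / total, v_neg / total, v_tie / total


for row_id, (aff, neg, jud, v_aff, v_neg, v_tie) in enumerate(TABLE_7, start=1):
    aff_share, neg_share, _ = vote_shares(v_aff, v_neg, v_tie)
    print(f"row {row_id}: aff={aff} neg={neg} judge={jud} "
          f"aff_share={aff_share:.1%} neg_share={neg_share:.1%}")
```

Every row sums to 200 decisions. With Turbo as affirmative and GPT-4 as negative, the negative share rises from 57% under a Turbo judge (row 3) to 68% under a GPT-4 judge (row 4), which is the same-backbone preference noted in observation (2).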

6 Related Work

Chain-of-Thought Prompting

Recently, Wei et al. (2022) has proposed chain-of-thought (CoT) prompting to improve the reasoning ability of LLMs. Specifically, CoT prompts LLMs to generate a series of intermediate steps that lead to the final answer of a multi-step problem. Most earlier work primarily concentrates on two main aspects: prompt design and decoding strategies. Zero-shot CoT Kojima et al. (2022) employs the trigger sentence “Let’s think step by step” to provide guidance for the decoding of LLMs. Advanced sampling strategies have been explored to improve CoT by generating diverse reasoning paths, e.g., Self-Consistency Wang et al. (2022), Auto-CoT Zhang et al. (2022), Active-Prompting Diao et al. (2023), Complexity-based Consistency Fu et al. (2022), Multi-Chain Reasoning Yoran et al. (2023), and Progressive-Hint Prompting Zheng et al. (2023).

With the emergence of powerful LLMs, approaches based on self-evaluation have attracted increasing attention. These approaches involve the generation of initial output, followed by evaluating the output to acquire feedback, which is then utilized to refine the output. Evaluation feedback can come from the model itself (e.g., Self-refine Madaan et al. (2023) and Tree of Thoughts Yao et al. (2023)) or from external environments (e.g., QAaP Zhu et al. (2023a) and Reflection Shinn et al. (2023)). The intuition behind these approaches involves the utilization of robust LLMs to mimic the human cognition process.

Generative Agents

Recently, LLM-based multi-agent intelligence, e.g., Generative Agents Park et al. (2023), Ghost in the Minecraft Zhu et al. (2023b), and GPT-Bargaining Fu et al. (2023), has drawn significant attention for enabling simulations of human behavior. Our work follows this research line to address the DoT problem of LLMs. Concurrent with our work, a few studies Xiong et al. (2023); Du et al. (2023) also explore the multi-agent debate framework to enhance the reasoning ability of LLMs. The main differences between the proposed MAD framework and these approaches are: (1) our work aims to address the DoT problem, which is an inherent deficiency of LLMs; and (2) we empirically find that our MAD framework can yield enhanced performance by employing agents with the identical backbone LLM.

7 Conclusion

We propose and define the Degeneration-of-Thought (DoT) problem in self-reflection, and address it by proposing the Multi-Agent Debate (MAD) framework to explore divergent chain-of-thoughts. We demonstrate the effectiveness of MAD on two challenging tasks and find that GPT-3.5-Turbo with MAD can even surpass GPT-4 on the Common MT dataset. Extensive analyses suggest that the adaptive break strategy of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. More interestingly, we find that LLMs might not be a fair judge if different LLMs are used for agents. Future work may include scheduling more agents in the debate, multi-agent systems for board games, and AI feedback for model alignment.

References

  • Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv.
  • Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv.
  • Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv.
  • Fu et al. (2023) Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv.
  • Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
  • Garcia et al. (2023) Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. arXiv.
  • Gou et al. (2023) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing.
  • He et al. (2020) Jie He, Tao Wang, Deyi Xiong, and Qun Liu. 2020. The box is in the pen: Evaluating commonsense reasoning in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662–3672, Online. Association for Computational Linguistics.
  • He et al. (2023) Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2023. Exploring human-like translation strategy with large language models. arXiv.
  • Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv.
  • Jiao et al. (2023) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
  • Kong et al. (2022) Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. 2022. Eliciting thinking hierarchy without a prior. NeurIPS.
  • Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv.
  • Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv.
  • Pilault et al. (2023) Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. arXiv.
  • Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1743–1752. Association for Computational Linguistics (ACL).
  • Shinn et al. (2023) Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.
  • Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
  • Wu et al. (2023) Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. arXiv.
  • Xiong et al. (2023) Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv.
  • Yoran et al. (2023) Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. arXiv.
  • Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv.
  • Zheng et al. (2023) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv.
  • Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. arXiv.
  • Zhu et al. (2023a) Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023a. Question answering as programming for solving time-sensitive questions. arXiv preprint arXiv:2305.14221.
  • Zhu et al. (2023b) Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023b. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv.

Appendix A Example Appendix

A.1 Implementation

Figure 5 displays a typical template of debate history, formatted according to the Turbo API.

[
{
"role": "system",
"content": "You are a debater. Hello and welcome to the translation competition, which will be conducted in a debate format. It's not necessary to fully agree with each other's perspectives, as our objective is to find the correct translation. The debate topic is stated as follows: What is the correct English translation of the following Chinese text: ______"
},
{
"role": "user",
"content": "Translate the following text from Chinese to English: ______"
},
{
"role": "assistant",
"content": "I think ______ is a correct translation because ______"
},
{
"role": "user",
"content": "I disagree with you. Here is my reason: ______"
},
{
"role": "assistant",
"content": "I can see your point of view, but ______"
}
]
Figure 5: A typical template of debate history, formatted according to the Turbo API.
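A minimal sketch of how a history in the shape of Figure 5 could be assembled from one debater's point of view. The helper name `build_history` and its parameters are hypothetical and are not taken from the released code. The sketch only builds the list of role/content dictionaries and makes no API call.

```python
from typing import Dict, List

# Hypothetical template mirroring the system message in Figure 5.
SYSTEM_TEMPLATE = (
    "You are a debater. Hello and welcome to the translation competition, "
    "which will be conducted in a debate format. It's not necessary to fully "
    "agree with each other's perspectives, as our objective is to find the "
    "correct translation. The debate topic is stated as follows: "
    "What is the correct English translation of the following Chinese text: {source}"
)


def build_history(
    source: str, own_turns: List[str], opponent_turns: List[str]
) -> List[Dict[str, str]]:
    """Build a chat history for one debater.

    The debater's own statements become "assistant" messages and the
    opponent's statements become "user" messages, alternating after the
    opening translation request.
    """
    history = [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(source=source)},
        {
            "role": "user",
            "content": f"Translate the following text from Chinese to English: {source}",
        },
    ]
    for i, own in enumerate(own_turns):
        history.append({"role": "assistant", "content": own})
        if i < len(opponent_turns):
            history.append({"role": "user", "content": opponent_turns[i]})
    return history


history = build_history(
    "吃掉敌人一个师",
    own_turns=["I think ... is a correct translation because ...",
               "I can see your point of view, but ..."],
    opponent_turns=["I disagree with you. Here is my reason: ..."],
)
print([message["role"] for message in history])
```

With two own turns and one opponent turn, the roles follow the same order as the template: system, user, assistant, user, assistant.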
Affirmative: I believe the correct translation of the Chinese text “吃掉敌人一个师” is “Eat up an enemy division.” The word “吃掉” literally means “eat up” or “consume.”
Negative: I disagree with your translation. I believe the correct translation of the Chinese text “吃掉敌人一个师” is “Eliminate an enemy division.” While “吃掉” can be translated as “eat up” or “consume”, it can also mean “eliminate” or “annihilate” in a military context. In this case, it is more appropriate to use “eliminate” as it conveys the idea of completely destroying the enemy division.
Judge: Both sides presented valid arguments and translations that accurately convey the meaning of the original Chinese text. The choice between ‘eat up’ and ‘eliminate’ may depend on the context and tone of the text. Therefore, there is no clear preference for either side at this time.
Affirmative: While “eliminate” does convey the idea of completely destroying the enemy division, it doesn’t capture the connotation of consuming or devouring the enemy that is implied by “吃掉.”
Negative: In a military context, “吃掉” is often used to describe the complete destruction of the enemy, rather than the consumption of their resources or territory.
Judge: Both sides presented valid arguments, but the negative side’s argument that ‘eliminate’ better captures the intensity and seriousness of the military language used in the original Chinese text was more convincing. Additionally, ‘eliminate’ accurately conveys the idea of completely destroying the enemy division. Therefore, the correct English translation of the Chinese text ‘吃掉敌人一个师’ is ‘Eliminate an enemy division.’

Table 8: The debate process of the translation example for the Chinese sentence “吃掉敌人一个师。”
Affirmative: The phrase “从后门” literally translates to “from the back door.”
Negative: I actually disagree with your translation. I believe a more accurate translation would be “He got a lot of famous liquor from the back door.” Lastly, while “从后门” does literally translate to “from the back door,” it is a common Chinese idiom that means obtaining something through unofficial or illegitimate means. Therefore, the use of “from the back door” in the translation is not entirely accurate. Instead, the use of “from an unofficial source” or “through unofficial means” would be more appropriate.
Judge: Both sides presented valid arguments and translations that accurately capture the meaning of the original Chinese text. The choice between ‘wine’ and ‘liquor’ is a matter of interpretation and context. The use of ‘from the back door’ versus ‘from an unofficial source’ is a matter of preference and style. Therefore, there is no clear preference for a translation candidate.
Affirmative: I agree that “从后门” is a common Chinese idiom that means obtaining something through unofficial or illegitimate means. The use of “from an unofficial source” or “through unofficial means” can be more appropriate than “from the back door.”
Negative: Yes, I agree with your perspective.
Judge: While both translations accurately capture the meaning of the original Chinese text, the use of ‘liquor’ in candidate 2 is more accurate as ‘名酒’ is more commonly used to refer to liquor. Additionally, the use of ‘from an unofficial source’ in candidate 3 more accurately conveys the connotation of ‘从后门’ as obtaining something through unofficial or illegitimate means. Therefore, the correct translation is: ‘He got a lot of famous liquor from an unofficial source.’
Table 9: The debate process of the translation example for the Chinese sentence “他从后门搞到了不少名酒。”
Affirmative: Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s.
Negative: I disagree with your answer. The problem doesn’t give us any information about the distance Alice covers while going up and down the hill. So, we cannot simply add her speeds and divide by 2 to get her average speed. We need to know the distance she covers to calculate her average speed.
Judge: The average speed is calculated by dividing the total distance by the total time taken. As Alice walks up and down the hill, the distance covered in both directions is the same. Therefore, the total distance covered is 2 times the distance of either direction.
Affirmative: That’s a good point. But we can assume that Alice covers the same distance going up and down the hill. In that case, we can use the formula for average speed, which is total distance divided by total time. If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d/(4d/3) = 3/2 m/s.
Negative: Thank you for providing your answer. This average speed of 1.5 m/s takes into account both the uphill and downhill speeds, providing a balanced representation of Alice’s overall performance while walking up and down the hill.
Judge: Yes. Let’s assume the distance is d. The time taken to walk up the hill is d/1 and the time taken to walk down the hill is d/3. The total time taken is d/1 + d/3 = 4d/3. Therefore, the average speed is (2d)/(4d/3) = 1.5 m/s. Hence, the negative side is correct.
Table 10: The debate process of arithmetic reasoning example.
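The arithmetic in Table 10 can be checked directly. The sketch below assumes equal uphill and downhill distances, as the debaters do, and the function name is illustrative only. It contrasts the arithmetic mean from the first answer with the distance-over-time result that the judge accepts.

```python
from fractions import Fraction


def average_speed_round_trip(speed_up, speed_down):
    """Average speed over a round trip with equal distance each way.

    Total distance is 2d and total time is d/speed_up + d/speed_down, so d
    cancels and the result is the harmonic mean of the two speeds.
    """
    up, down = Fraction(speed_up), Fraction(speed_down)
    return 2 / (1 / up + 1 / down)


naive = Fraction(1 + 3, 2)                # arithmetic mean of the two speeds
correct = average_speed_round_trip(1, 3)  # total distance / total time
print(naive, correct)  # prints: 2 3/2
```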
Affirmative: First, we need to find out how much of the original weight was water. If we have 1 ton of apples with 90% water, then the weight of water in those apples is 0.9 x 1 ton = 0.9 tons. Next, we need to find out how much of the new weight is water. If the apples now have 80% water, then the weight of water in those apples is 0.8 x new weight. We can set up an equation:
0.8 x new weight = 0.9 tons
To solve for the new weight, we can divide both sides by 0.8:
new weight = 0.9 tons / 0.8
new weight = 1.125 tons
Therefore, the weight of the apples now is 1.125 tons.
Negative: That’s not correct. It’s important to note that the weight of the apples has decreased due to the loss of water, but the weight of the remaining solids remains the same. This is why we need to calculate the weight of remaining solids in both scenarios to find the new weight of the apples.
Judge: Great idea! While it is true that the water content of the apples decreases from 90% to 80%, the solid, non-water content of the apples does not change.
Affirmative: You are correct. The weight of the apples has decreased due to the loss of water, but the weight of the remaining solids remains the same. Therefore, we need to calculate the weight of the remaining solids in both scenarios to find the new weight of the apples. We can set up an equation:
0.2 x new weight = 0.1 tons
To solve for the new weight, we can divide both sides by 0.2:
new weight = 0.1 tons / 0.2
new weight = 0.5 tons
Therefore, the weight of the apples now is 0.5 tons.
Negative: Thank you for your perspective. While it’s true that the weight of the apples themselves hasn’t changed, the overall weight of the apples with the water included has decreased. So, the weight of the apples now is 0.5 tons.
Judge: The negative side correctly calculated the new weight of the apples by considering the solid content as a constant and using the formula to find the new weight based on the reduced water content percentage.
Table 11: The debate process of arithmetic reasoning example.
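The calculation in Table 11 can be verified in the same way. The sketch assumes the problem the debaters describe (1 ton of apples at 90% water drying to 80% water) and treats the solid mass as the conserved quantity. The function name is illustrative.

```python
from fractions import Fraction


def weight_after_drying(initial_weight, initial_water, final_water):
    """New total weight when only water is lost.

    The solid mass, initial_weight * (1 - initial_water), is unchanged, and
    it makes up a fraction (1 - final_water) of the new weight.
    """
    solids = Fraction(initial_weight) * (1 - Fraction(initial_water))
    return solids / (1 - Fraction(final_water))


# Conserving solids: 0.2 x new weight = 0.1 tons.
new_weight = weight_after_drying(1, "0.9", "0.8")
# Mistaken first attempt, which wrongly conserves water: 0.8 x new weight = 0.9 tons.
mistaken = Fraction("0.9") / Fraction("0.8")
print(float(new_weight), float(mistaken))  # prints: 0.5 1.125
```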

A.2 Debate Case

Table 8 and Table 9 present the debate processes for the example translations in Section 4.2. Table 10 and Table 11 show the debate processes for the example answers in Section 4.3.

We observe that the affirmative side often relies on direct intuition, which can lead to incorrect or inappropriate responses. Conversely, the negative side demonstrates an ability to identify and rectify these mistakes.

Appendix B Level Control of “tit for tat” State

We modulate the level of the “tit for tat” state outlined in Section 5 by appending natural language instructions to the debaters’ meta prompt. The corresponding prompts are listed in Table 12.

Level | Prompt
0 | Both sides must reach a full consensus on every point of the debate. Every statement must be agreed upon by both sides.
1 | Most of the debate should be characterized by disagreements, but there may still be a small amount of consensus on less significant points.
2 (Default) | It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct answer.
3 | Both sides must disagree with each other on every point of the debate. There should be no consensus whatsoever.
Table 12: Prompts for different levels of “tit for tat” state.
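A minimal sketch of the level control described above, assuming the instruction is appended to a base meta prompt as plain text. The dictionary name, the function `build_meta_prompt`, and the example base prompt are hypothetical and are not taken from the released code. The level instructions are copied from Table 12.

```python
# Level instructions copied from Table 12.
TIT_FOR_TAT_PROMPTS = {
    0: ("Both sides must reach a full consensus on every point of the debate. "
        "Every statement must be agreed upon by both sides."),
    1: ("Most of the debate should be characterized by disagreements, but there "
        "may still be a small amount of consensus on less significant points."),
    2: ("It's not necessary to fully agree with each other's perspectives, "
        "as our objective is to find the correct answer."),
    3: ("Both sides must disagree with each other on every point of the debate. "
        "There should be no consensus whatsoever."),
}


def build_meta_prompt(base_prompt: str, level: int = 2) -> str:
    """Append the instruction for the chosen level to the base meta prompt."""
    if level not in TIT_FOR_TAT_PROMPTS:
        raise ValueError(f"level must be one of {sorted(TIT_FOR_TAT_PROMPTS)}")
    return f"{base_prompt.rstrip()} {TIT_FOR_TAT_PROMPTS[level]}"


# Hypothetical base prompt, used only for illustration.
print(build_meta_prompt("You are a debater.", level=3))
```

Keeping the instructions in a single mapping makes the default (level 2) explicit and lets the level be swept in an experiment without editing the base prompt.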