Encouraging Divergent Thinking in Large Language Models
through Multi-Agent Debate

Tian Liang¹ Zhiwei He²¹¹footnotemark: 1 Wenxiang Jiao³¹¹footnotemark: 1 Xing Wang³²²footnotemark: 2 Yan Wang
Rui Wang² Yujiu Yang¹ Zhaopeng Tu³ Shuming Shi³
¹Tsinghua Shenzhen International Graduate School, Tsinghua University
²Shanghai Jiao Tong University ³Tencent AI Lab
{liangt21@mails,yang.yujiu@sz}.tsinghua.edu.cn {zwhe.cs}@sjtu.edu.cn
{joelwxjiao,brightxwang,shumingshi}@tencent.com Tian, Zhiwei and Wenxiang contributed equally and are co-first authors. Work was done when Tian and Zhiwei were interning at Tencent AI Lab. Xing Wang and Yujiu Yang are co-corresponding authors.

Abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of “tit for tat” and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Codes: https://github.com/Skytliang/Multi-Agents-Debate

1 Introduction

Refer to caption — Figure 1: Degeneration-of-thought with respect to the iteration of self-reflection (or debate), which is measured by the disagreement of stances ( $\in[0,1]$ ) between two adjacent iterations. Results are calculated for the examples that are incorrectly predicted by CoT. Clearly, the proposed multi-agent debate method can alleviate the DoT problem by producing more divergent thoughts, while the self-reflection method fails.

Modern large language models (LLMs) like ChatGPT, GPT-4 OpenAI (2023) and Bard¹¹1https://bard.google.com/, have shown remarkable performance on general language tasks Jiao et al. (2023); Wu et al. (2023); Bang et al. (2023) but still struggle on complex reasoning tasks Zhu et al. (2022); Gou et al. (2023), which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. In particular, self-reflection Madaan et al. (2023); Shinn et al. (2023), a concept that usually refers to the process of introspection and examination of a person’s own thoughts, has been explored to solve intricate tasks that could be challenging for a zero-shot generation or even chain-of-thought (CoT) prompting Wei et al. (2022). Specifically, self-reflection involves an iterative refinement process such that the LLM generates a new answer based on the answers and feedback in previous iterations and then provides feedback for the new answer. While self-reflection can be effective in creating better solutions, it is highly dependent on the self-evaluation capabilities of LLMs, which are not formally guaranteed Shinn et al. (2023).

In this work, we focus on the Degeneration-of-Thought (DoT) problem in self-reflection, which is proposed and defined by us for the first time. Formally, DoT describes the following scenario:

Once the LLM has established confidence in its answers, it is unable to generate novel thoughts later through self-reflection even if the initial stance is incorrect.

To demonstrate this problem, we define the average disagreement as the percentage of opposition between two debaters in debate (or self-confliction in self-reflection) for each question. As Figure 1 seen, we calculate the disagreement of stances between every two iterations in self-reflection and show the trends. The low disagreement of self-reflection suggests that the LLM sticks to the incorrect answers predicted by CoT and is unable to engage in meaningful self-reflection. There are various factors that could result in DoT, and we outline three here: (1) Bias and Distorted Perception. Self-perception can be influenced by biases, preconceived notions, and distorted thinking patterns, which can be learned from the massive amount of data during pretraining. If an LLM’s self-reflection is clouded by such biases or distorted thinking, it can lead to inaccurate conclusions instinctively. (2) Rigidity and Resistance to Change. Self-reflection often involves challenging one’s beliefs, assumptions, and behaviors. If an LLM is resistant to change or holds rigid beliefs, it may struggle to engage in meaningful self-reflection that leads to better solutions. (3) Limited External Feedback. Self-reflection is primarily an internal process, but external feedback can provide valuable perspectives and insights. Without seeking or considering external feedback, an LLM may miss important blind spots or alternative viewpoints that can enrich its self-reflection.

To address the DoT issue, we leverage another fundamental characteristic of human problem-solving, i.e., debate, to encourage divergent thinking in LLMs. Specifically, we propose the MAD framework, short for Multi-Agent Debate, where two agents express their own arguments in the state of “tit for tat” and a judge monitors and manages the debate process to obtain a final solution. The nature of MAD determines that (1) The distorted thinking of one LLM can be corrected by the others; (2) The resistance to change of one LLM will be complemented by the others; and (3) each agent can obtain external feedback from the others. Therefore, MAD is less susceptible to the factors of DoT, and can explore divergent chain-of-thoughts to achieve accurate solutions.

We conducted experiments on both natural language generation and understanding through two challenging tasks, namely, Commonsense Machine Translation (Common MT) and Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR). The common characteristic of the two tasks is that our instincts are mostly incorrect based on only the superficial expressions of the questions, and deeper levels of contemplation are required for better solutions. Experimental results demonstrate that our MAD framework performs much better than the baseline methods, especially, MAD with GPT-3.5-Turbo can surpass the performance of GPT-4 on Common MT.

The contributions of this work are summarized as follows:

•

We propose and define the Degeneration-of-Thought (DoT) problem in self-reflection, and address it by proposing the Multi-Agent Debate (MAD) framework to explore divergent chain-of-thoughts.
•

We demonstrate the effectiveness of MAD on two challenging tasks, and find that GPT-3.5-Turbo with MAD can even surpass GPT-4 on the Common MT dataset.
•

Extensive analyses suggest that the adaptive break strategy of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. More interestingly, we find that LLMs might not be a fair judge if different LLMs are used for agents.

2 Multi-Agent Debate Framework

Algorithm 1 MAD: Multi-Agents Debate

1:Debate topic

t

, maximum number of rounds

M

and number of debaters

N

2:Final answer

a

3:procedure MAD(

t

M

N

)

J

\triangleright

Initialize the judge

D\leftarrow[D_{1},\cdots,D_{N}]

\triangleright

Initialize debaters

H\leftarrow[t]

\triangleright

Initialize debate history

m\leftarrow 0

\triangleright

Current round

8: while

m\leq M

m\leftarrow m+1

10: for each

D_{i}

D

11:

h\leftarrow D_{i}(H)

\triangleright

Generate argument

12:

H\leftarrow H+[h]

\triangleright

Append

h

H

13: if

J_{d}(H)

then

14: break

\triangleright

Debate is over

15:

a\leftarrow J_{e}(H)

\triangleright

Extract the final answer

16: return

a

Algorithm 1 illustrates the detailed process of MAD. Generally, our MAD framework is composed of three components which are elaborated as follows:

Meta Prompts.

We use meta prompts to introduce the topic to be solved, the number of debaters, the iteration limit, and other requirements. For example, we require the agents to “tit for tat” so as to create an atmosphere of debate.

Debaters.

There are $N$ debaters $D=\{D_{i}\}_{i=1}^{N}$ involved in the framework. In each debate iteration, the debaters $D_{i}$ speak one by one in a fixed order and express their arguments based on the previous debate history $H$ , i.e., $D_{i}(H)=h$ . An example of a debater prompt appears below:

You are a debater. Hello and welcome to the translation competition, which will be conducted in a debate format. It’s not necessary to fully agree with each other ’ s perspectives, as our objective is to find the correct translation.

Judge.

We also design a judge $J$ to manage and monitor the whole debate process. The judge contains two different modes: (a) Discrinative Mode, in which the judge $J$ decides whether the correct solution can be obtained after all the debaters finish their arguments in the current iteration:

J_{d}(H)=\begin{cases}\texttt{True},~{}~{}&\textrm{solution obtained}\\ \texttt{False},~{}~{}&\textrm{otherwise}\end{cases}.

(1)

If it is True, the debate is over. Otherwise, the debate continues. (b) Extractive Mode, in which the judge $J$ needs to extract the final solution based on the whole debate history: $J_{e}(H)=a$ , since no correct solution is identified within the iteration limit of debate. An example of a judge prompt appears below:

You are a moderator. There will be two debaters involved in a translation debate competition. They will present their translations and discuss their perspectives on the correct English translation of the given Chinese text: "吃掉敌人一个师。". At the end of each round, you will evaluate the candidates’ translation submissions.

3 Challenging Testbeds

Source	吃掉敌人一个师。
Correct	Destroy a division of the enemy.
Incorrect	Eat up an enemy division.

Table 1: An example from the Common MT dataset. The underlined Chinese words are translated into the corresponding colored words in English.

We conduct experiments on two challenging tasks, namely, commonsense machine translation (i.e., Common MT), and counter-intuitive arithmetic reasoning (i.e., Counter-Intuitive AR), which require deep levels of contemplation for LLMs.

3.1 Commonsense Machine Translation

The Common MT dataset is composed of Chinese $\Rightarrow$ English translation examples He et al. (2020), which are used to examine the ambiguity resolution ability of translation models. Within the challenging part of Common MT, each source sentence contains an ambiguous word. While these ambiguous words might appear to have a straightforward translation, such a literal interpretation is erroneous. Failure to identify and address such ambiguities may result in inaccurate translations. In this work, we adopt the lexical ambiguity test set in the following experiment. Table 1 lists an example, where the source word “吃掉” should be translated to “destroy” rather than the straightforward translation “eat up” by considering the common sense in the real world.

3.2 Counter-Intuitive Arithmetic Reasoning

Previous studies on thinking hierarchy Kong et al. (2022); Wei et al. (2022) suggest that we humans have a fast and intuitive system and a slow and logical system, and tend to run the lower level system before the higher level one. Inspired by this, we created a more challenging dataset named Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR) to evaluate the reasoning abilities of LLMs at deep levels.

Components	Content
Question	When Alice walks up the hill, her speed is 1 m/s and when she goes down the hill, her speed is 3 m/s. Then when Alice walks up and down the hill, what is her average speed?
Correct Answer	1.5 m/s
Explanation	If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d / (4d/3) = 3/2 m/s.
Incorrect Answer	2 m/s
Explanation	Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s.

Table 2: An example in Counter-Intuitive AR dataset.

Dataset Description.

Our Counter-Intuitive AR dataset contains 50 questions collected from elicitation questions Kong et al. (2022)²²2https://elicitation.info/questionnaire/1/, web data³³3https://www.geeksforgeeks.org/puzzles/ and manual collection. Compared to the commonly-used datasets, e.g., MultiArith Roy and Roth (2015), GSM8K Cobbe et al. (2021), our dataset presents two distinct challenges:

•

Resistance to Intuition. The questions in our dataset are embedded in hidden traps designed to elicit intuitive and appealing answers that are often incorrect. This feature evaluates the abilities of LLMs to resist the traps of superficial expressions.
•

Multi-Step Reasoning. Each correct answer within the dataset requires a rigorous multi-step reasoning process, thereby evaluating the capacity of LLMs to engage in complex decision-making and problem-solving.

Dataset Format.

In our Counter-Intuitive AR dataset, each example contains three key components (see Table 2 for an example). We elaborate on the details below:

•

Questions. The questions in our dataset are designed to stimulate counter-intuitive thinking, which aims to challenge conventional decision-making by presenting situations where the immediate, intuitive response is often incorrect.
•

Answers. Each question is provided with a correct answer, which requires deep comprehension of the question and commonsense knowledge. Additionally, we also provide a plausible yet incorrect answer for comparison.
•

Explanations. We provide a detailed explanation for each correct answer. The explanation outlines the step-by-step reasoning process that leads to the correct answer. Each incorrect answer is also complemented by an explanation demonstrating a seemingly logical reasoning process but ultimately leading to the incorrect answer. This reasoning process highlights the potential pitfalls and misconceptions during decision-making, especially when intuition is prioritized over rigorous logical reasoning.

4 Experiment

4.1 Setups

Backbone Models.

In this work, we mainly use three agents in our MAD framework, including two debaters (i.e., affirmative and negative) and a judge. Unless other stated, we use GPT-3.5-Turbo as the backbone model for all agents by default.

Compared Methods.

Generally, we compare our MAD framework with GPT-3.5-Turbo, GPT-4, and Self-Reflect on both tasks. We also include other baseline methods individually, namely, Rerank and MAPS for Common MT, CoT and Self-Consistency for Counter-Intuitive AR. Below elaborates the details of them:

•

Self-Reflect Shinn et al. (2023): This approach requires the LLM to scrutinize and refine its translation until it deems the current output satisfactory.
•

Rerank He et al. (2023): We sample the translations from the LLM for four times, from which we select the best candidate based on a quality estimation (QE) scorer⁴⁴4We use wmt21-comet-qe-da as the QE scorer.. This approach can be seen as analogous to self-consistency Wang et al. (2022), where the majority voting is replaced by an external QE scorer.
•

MAPS He et al. (2023): This method enables the LLM to mimic the human translation process: analyze and then translate, which can be viewed as a chain-of-thought method applied to translation task.
•

CoT Kojima et al. (2022): This approach concatenates a trigger sentence “Let’s think step by ste” to the test question.
•

Self-Consistency Wang et al. (2022): This method samples multiple responses from LLMs and determines the final answer through a majority vote.

We implement the methods on top of GPT-3.5-Turbo. The implementation details are described in Appendix A.1.

Method	Automatic		Human
Method	COMET	BLEURT	Score	ACC (%)
GPT-4	82.0	70.1	3.41	68.5 ${\color[rgb]{1,1,1}{}_{\uparrow~{}x.x}}$
GPT-3.5-Turbo	80.3	68.2	3.14	62.5 ${\color[rgb]{1,1,1}{}_{\uparrow~{}x.x}}$
+ Rerank	80.9	68.6	3.16	63.5 ${\color[rgb]{0,.5,.5}{}_{\uparrow~{}1.0}}$
+ MAPS	81.9	70.1	3.43	70.5 ${\color[rgb]{0,.5,.5}{}_{\uparrow~{}8.0}}$
+ Self-Reflect	81.0	69.1	3.43	69.0 ${\color[rgb]{0,.5,.5}{}_{\uparrow~{}6.5}}$
+ MAD	82.0	70.9	3.78	79.5_↑17.0

Table 3: Translation performance on Common MT. Note that Rerank and MAPS use the external quality estimation tool to select the best translation from multiple translation candidates.

Source	吃掉敌人一个师。
Correct Reference	Destroy a division of the enemy.
Incorrect Reference	Eat up an enemy division.
GPT-4	Eat up an enemy division.
GPT-3.5-Turbo	Eat up an enemy division.
+ Self-Reflect	Eat up an enemy division.
+ MAD	Eliminate an enemy division.
Source	他从后门搞到了不少名酒。
Correct Reference	He got a lot of famous wines from the road of fraud.
Incorrect Reference	He got a lot of famous wines from the back door.
GPT-4	He got quite a few famous wines from the back door.
GPT-3.5-Turbo	He obtained a lot of famous wines from the back door.
+ Self-Reflect	He obtained a good amount of high-quality liquor through the back door.
+ MAD	He got a lot of famous liquor from an unofficial source.

Table 4: Example translations generated by baseline GPT-3.5-Turbo, Self-Reflect and the proposed MAD. We also provide the translation outputs generated by GPT-4. Best viewed in color.

Evaluation Metrics.

For Counter-Intuitive AR, we report the accuracy (ACC) of predictions. For Common MT, we adopt automatic metrics like COMET⁵⁵5https://github.com/Unbabel/COMET/, Unbabel/wmt22-cometkiwi-da and BLEURT⁶⁶6https://github.com/google-research/bleurt, BLEURT-20, which are widely adopted evaluation metrics for LLM-based translation literature He et al. (2023); Hendy et al. (2023); Garcia et al. (2023); Pilault et al. (2023). Moreover, we also employ human evaluation for the translation results in terms of two aspects: ambiguity resolution accuracy and direct assessment of translation quality in range $[1,5]$ .

4.2 Common MT

Results.

Table 3 presents the experimental results. MAPS and Self-Reflec achieve improvements over baseline GPT-3.5-Turbo. Remarkably, our proposed MAD, by utilizing GPT-3.5 as the backbone model, has demonstrated significant advancements over GPT-4 across both automatic and human evaluation metrics.

Case Study.

Table 4 shows example translations generated by baseline GPT-3.5-Turbo and the proposed MAD. We can find that the baseline GPT-3.5-Turbo (even the more powerful GPT-4) incorrectly translates the source words literally. Because of the DoT issue, Self-Reflect cannot rectify the literal translation. The proposed MAD framework, which explores divergent chain-of-thoughts, can generate the free translation of the underlined words within the source sentences. The detailed debate process of translation examples can be found in Appendix A.2.

4.3 Counter-Intuitive AR

Method	ACC (%)
GPT-4	52.0
GPT-3.5-Turbo	20.0
+ CoT	24.0
+ Self-Consistency	30.0
+ Self-Reflect	20.0
+ MAD	36.0

Table 5: Reasoning accuracy on Counter-Intuitive AR.

Results

Table 5 lists the experimental results in terms of reasoning accuracy. We can observe that Self-Reflect does not improve over the baseline GPT-3.5-Turbo, while CoT and Self-Consistency bring some improvements. Our MAD framework, though not as good as GPT-4, outperforms all the other compared methods based on GPT-3.5-Turbo, which further demonstrates its effectiveness.

Question A
The two circles are externally tangent and there
is no relative sliding. The radius of circle A is
1/3 the radius of circle B. Circle A rolls around
circle B one trip back to its starting point. How
many times will circle A revolve in total?
Correct Answer	4
GPT-4	4
GPT-3.5-Turbo	3
+ Self-Reflect	3
+ MAD	4
Question B
When Alice walks up the hill, her speed is 1 m/s
and when she goes down the hill, her speed is 3
m/s. Then when Alice walks up and down the
hill, what is her average speed?
Correct Answer	1.5 m/s
GPT-4	1.5 m/s
GPT-3.5-Turbo	2 m/s
+ Self-Reflect	2 m/s
+ MAD	1.5 m/s

Table 6: Example predictions generated by baseline GPT-3.5-Turbo, Self-Reflect and the proposed MAD. We also provide the results by GPT-4.

Case Study

Table 6 presents two example outputs on Counter-Intuitive AR. We find both CoT and Self-Reflect fail to reach the right answer. With divergent thinking, our MAD framework emerges “we need to consider both the rotation around circle B and the rotation of circle A itself” and find the correct answer. The detailed debate process can be found in Appendix A.2.

5 Analysis

We conduct extensive analyses to gain a deeper understanding on our MAD framework. By default, we use the Common MT dataset.

Effect of Adaptive Break.

We first investigate the stopping strategy of debate. For each iteration, we force the judge $J$ to extract the final answer ( $a=J_{e}(H)$ ) instead of adaptively breaking the debate as in Algorithm 1. Figure 3 shows the results. We can observe that MAD performs better than self-reflection as the iteration increases. However, the highest COMET score appears at the first iteration and is also lower than the result of the adaptive break. It indicates that, for most examples, MAD can generate good translations at the first iteration such that the debate should be stopped. Forcing the debate to continue will harm the translation results, which demonstrates the reasonableness of our adaptive break strategy.

Essense of “Tit for Tat” State.

We then study how the intensity of “tit for tat” affects the performance of MAD. To achieve so, we design different prompts (see Appendix B) to initialize the debate process. As shown in Figure 4, asking the debaters to “tit for tat” (i.e., higher disagreement) is necessary for MAD to achieve good performance. However, we find that “must disagree with each other on every point ” (with a disagreement of 0.988) does not lead to the best performance. We speculate that continuous disagreement without finding common ground can contribute to polarization, where the debate becomes more about winning the argument than seeking truth or understanding. This can reinforce pre-existing biases and make it difficult to reach a consensus or meaningful decision.

ID	Aff	Neg	Jud	V.Aff	V.Neg	V.Tie
\small{1}⃝	Turbo	Turbo	Turbo	87	104	9
\small{2}⃝	GPT-4	GPT-4	GPT-4	67	124	9
\small{3}⃝	Turbo	GPT-4	Turbo	78	114	8
\small{4}⃝	Turbo	GPT-4	GPT-4	52	136	12
\small{5}⃝	GPT-4	Turbo	GPT-4	120	77	3

Table 7: Behavior of agents in MAD. V.Aff (V.Neg) denotes the times affirmative (negative) is chosen for the final solution.

Behavior of Agents.

We study the behavior of agents by calculating how many times the judge chooses the answers of each debater as the final solution. The results are listed in Table 7 and we have the following observations: (1) Comparing row \small{1}⃝ and \small{2}⃝, we find that the judge consistently favors the negative side, which is believed to contribute significantly to the performance improvement in MAD. When encountering complex tasks, the affirmative side tends to make mistakes that should be corrected to achieve improvements. (2) Comparing row \small{3}⃝ and \small{4}⃝ (or row \small{4}⃝ and \small{5}⃝), we find the judge shows a preference to the side with the same LLM as the backbone. This bias indicates that LLMs might not be a fair judge Wang et al. (2023) when different LLMs are used for the agents.

6 Related Work

Chain-of-Thought Prompting

Recently, Wei et al. (2022) has proposed chain-of-thought (CoT) prompting to improve the reasoning ability of LLMs. Specifically, CoT prompts LLMs to generate a series of intermediate steps that lead to the final answer of a multi-step problem. Most earlier work primarily concentrates on two main aspects: prompt design and decoding strategies. Zero-shot CoT Kojima et al. (2022) employs the trigger sentence “Let’s think step by step” to provide guidance for the decoding of LLMs. Advanced sampling strategies have been explored to improve CoT by generating diverse reasoning paths, e.g., Self-Consistency Wang et al. (2022), Auto-CoT Zhang et al. (2022), Active-Prompting Diao et al. (2023), Complexity-based Consistency Fu et al. (2022), Multi-Chain Reasoning Yoran et al. (2023), and Progressive-Hint Prompting Zheng et al. (2023).

With the emergence of powerful LLMs, approaches based on self-evaluation have attracted increasing attention. These approaches involve the generation of initial output, followed by evaluating the output to acquire feedback, which is then utilized to refine the output. Evaluation feedback can come from the model itself, e.g., Self-refine Madaan et al. (2023) and Tree of Thoughts Yao et al. (2023)) or external environments, e.g., QAaP Zhu et al. (2023a) and Reflection Shinn et al. (2023). The intuition behind these approaches involves the utilization of robust LLMs to mimic the human cognition process.

Generative Agents

Recently, LLMs-based multi-agent intelligent, e.g., Generative Agents Park et al. (2023), Ghost in the Minecraft Zhu et al. (2023b), GPT-Bargaining Fu et al. (2023), has drawn significant attention for enabling simulations of human behavior. Our work follows this research line to address the DoT problem of LLMs. Concurrent with our work, a few studies Xiong et al. (2023); Du et al. (2023) also explore the multi-agent debate framework to enhance the reasoning ability of LLMs. The main differences between the proposed MAD framework and these approaches are: (1) our work aims to address the DoT problem, which is an inherent deficiency of LLMs; and (2) we empirically find that our MAD framework can yield enhanced performance by employing agents with the identical backbone LLM.

7 Conclusion

We propose and define the Degeneration-of-Thought (DoT) problem in self-reflection, and address it by proposing the Multi-Agent Debate (MAD) framework to explore divergent chain-of-thoughts. We demonstrate the effectiveness of MAD on two challenging tasks and find that GPT-3.5-Turbo with MAD can even surpass GPT-4 on the Common MT dataset. Extensive analyses suggest that the adaptive break strategy of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. More interestingly, we find that LLMs might not be a fair judge if different LLMs are used for agents. Future works may include scheduling more agents in the debate, multi-agents for board games, and AI feedback for model alignment.

References

Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv.
Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv.
Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv.
Fu et al. (2023) Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv.
Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
Garcia et al. (2023) Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. arXiv.
Gou et al. (2023) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing.
He et al. (2020) Jie He, Tao Wang, Deyi Xiong, and Qun Liu. 2020. The box is in the pen: Evaluating commonsense reasoning in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662–3672, Online. Association for Computational Linguistics.
He et al. (2023) Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2023. Exploring human-like translation strategy with large language models. arXiv.
Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv.
Jiao et al. (2023) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
Kong et al. (2022) Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. 2022. Eliciting thinking hierarchy without a prior. NeurIPS.
Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv.
Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv.
Pilault et al. (2023) Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. arXiv.
Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1743–1752. Association for Computational Linguistics (ACL).
Shinn et al. (2023) Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.
Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators.
Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
Wu et al. (2023) Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. arXiv.
Xiong et al. (2023) Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv.
Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv.
Yoran et al. (2023) Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. arXiv.
Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv.
Zheng et al. (2023) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv.
Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. arXiv.
Zhu et al. (2023a) Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023a. Question answering as programming for solving time-sensitive questions. arXiv preprint arXiv:2305.14221.
Zhu et al. (2023b) Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023b. Ghost in the minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv.

Appendix A Example Appendix

A.1 Implementation

Figure 5 displays a typical template of debate history, formatted according to the Turbo API.

⬇

[

{

"role": "system",

"content": "You are a debater. Hello and welcome to the translation competition, which will be conducted in a debate format. It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct translation. The debate topic is stated as follows: What is the correct English translation of the following Chinese text: ______"

{

"role": "user",

"content": "Translate the following text from Chinese to English: ______"

{

"role": "assistant",

"content": "I think ______ is a correct translation because ______"

{

"role": "user",

"content": "I disagree with you. Here is my reason: ______"

{

"role": "assistant",

"content": "I can see your point of view, but ______"

}

Figure 5: A typical template of debate history, formatted according to the Turbo API.

[Uncaptioned image] — Table 8: The debate process of translation example on Chinese sentence “吃掉敌人一个师。”.

A.2 Debate Case

Table 9 and Table 10 present the debate process of example translations in Section 4.2. Table 10 and Table 11 show the debate process of example answers in Section 4.3.

We observe that the affirmative side ( [Uncaptioned image] ) often relies on direct intuition, which can lead to incorrect or inappropriate responses. Conversely, the negative side ( ) demonstrates an ability to identify and rectify his mistakes.

Appendix B Level Control of “tit for tat” State

We modulate the level of “tit for tat” state outlined in Section 5 through appending natural language instructions to the debaters’ meta prompt. All the corresponding prompts are itemized in Table 12.

Level	Prompt
0	Both sides must reach a full consensus on every point of the debate. Every statement must be agreed upon by both sides.
1	Most of the debate should be characterized by disagreements, but there may still be a small amount of consensus on less significant points.
2 (Default)	It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct answer.
3	Both sides must disagree with each other on every point of the debate. There should be no consensus whatsoever.

Table 12: Prompts for different levels of “tit for tat” state.

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Abstract

1 Introduction

2 Multi-Agent Debate Framework

Meta Prompts.

Debaters.

Judge.

3 Challenging Testbeds

3.1 Commonsense Machine Translation

3.2 Counter-Intuitive Arithmetic Reasoning

Dataset Description.

Dataset Format.

4 Experiment

4.1 Setups

Backbone Models.

Compared Methods.

Evaluation Metrics.

4.2 Common MT

Results.

Case Study.

4.3 Counter-Intuitive AR

Results

Case Study

5 Analysis

Effect of Adaptive Break.

Essense of “Tit for Tat” State.

Behavior of Agents.

6 Related Work

Chain-of-Thought Prompting

Generative Agents

7 Conclusion

References

Appendix A Example Appendix

A.1 Implementation

A.2 Debate Case

Appendix B Level Control of “tit for tat” State

Encouraging Divergent Thinking in Large Language Models
through Multi-Agent Debate