Encouraging Divergent Thinking in Large Language Models
through Multi-Agent Debate
Abstract
Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of “tit for tat” and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Codes: https://github.com/Skytliang/Multi-Agents-Debate
1 Introduction
Modern large language models (LLMs) like ChatGPT, GPT-4 OpenAI (2023) and Bard111https://bard.google.com/, have shown remarkable performance on general language tasks Jiao et al. (2023); Wu et al. (2023); Bang et al. (2023) but still struggle on complex reasoning tasks Zhu et al. (2022); Gou et al. (2023), which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. In particular, self-reflection Madaan et al. (2023); Shinn et al. (2023), a concept that usually refers to the process of introspection and examination of a person’s own thoughts, has been explored to solve intricate tasks that could be challenging for a zero-shot generation or even chain-of-thought (CoT) prompting Wei et al. (2022). Specifically, self-reflection involves an iterative refinement process such that the LLM generates a new answer based on the answers and feedback in previous iterations and then provides feedback for the new answer. While self-reflection can be effective in creating better solutions, it is highly dependent on the self-evaluation capabilities of LLMs, which are not formally guaranteed Shinn et al. (2023).
In this work, we focus on the Degeneration-of-Thought (DoT) problem in self-reflection, which is proposed and defined by us for the first time. Formally, DoT describes the following scenario:
Once the LLM has established confidence in its answers, it is unable to generate novel thoughts later through self-reflection even if the initial stance is incorrect.
To demonstrate this problem, we define the average disagreement as the percentage of opposition between two debaters in debate (or self-confliction in self-reflection) for each question. As Figure 1 seen, we calculate the disagreement of stances between every two iterations in self-reflection and show the trends. The low disagreement of self-reflection suggests that the LLM sticks to the incorrect answers predicted by CoT and is unable to engage in meaningful self-reflection. There are various factors that could result in DoT, and we outline three here: (1) Bias and Distorted Perception. Self-perception can be influenced by biases, preconceived notions, and distorted thinking patterns, which can be learned from the massive amount of data during pretraining. If an LLM’s self-reflection is clouded by such biases or distorted thinking, it can lead to inaccurate conclusions instinctively. (2) Rigidity and Resistance to Change. Self-reflection often involves challenging one’s beliefs, assumptions, and behaviors. If an LLM is resistant to change or holds rigid beliefs, it may struggle to engage in meaningful self-reflection that leads to better solutions. (3) Limited External Feedback. Self-reflection is primarily an internal process, but external feedback can provide valuable perspectives and insights. Without seeking or considering external feedback, an LLM may miss important blind spots or alternative viewpoints that can enrich its self-reflection.
To address the DoT issue, we leverage another fundamental characteristic of human problem-solving, i.e., debate, to encourage divergent thinking in LLMs. Specifically, we propose the MAD framework, short for Multi-Agent Debate, where two agents express their own arguments in the state of “tit for tat” and a judge monitors and manages the debate process to obtain a final solution. The nature of MAD determines that (1) The distorted thinking of one LLM can be corrected by the others; (2) The resistance to change of one LLM will be complemented by the others; and (3) each agent can obtain external feedback from the others. Therefore, MAD is less susceptible to the factors of DoT, and can explore divergent chain-of-thoughts to achieve accurate solutions.
We conducted experiments on both natural language generation and understanding through two challenging tasks, namely, Commonsense Machine Translation (Common MT) and Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR). The common characteristic of the two tasks is that our instincts are mostly incorrect based on only the superficial expressions of the questions, and deeper levels of contemplation are required for better solutions. Experimental results demonstrate that our MAD framework performs much better than the baseline methods, especially, MAD with GPT-3.5-Turbo can surpass the performance of GPT-4 on Common MT.
The contributions of this work are summarized as follows:
-
•
We propose and define the Degeneration-of-Thought (DoT) problem in self-reflection, and address it by proposing the Multi-Agent Debate (MAD) framework to explore divergent chain-of-thoughts.
-
•
We demonstrate the effectiveness of MAD on two challenging tasks, and find that GPT-3.5-Turbo with MAD can even surpass GPT-4 on the Common MT dataset.
-
•
Extensive analyses suggest that the adaptive break strategy of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. More interestingly, we find that LLMs might not be a fair judge if different LLMs are used for agents.
2 Multi-Agent Debate Framework
Algorithm 1 illustrates the detailed process of MAD. Generally, our MAD framework is composed of three components which are elaborated as follows:
Meta Prompts.
We use meta prompts to introduce the topic to be solved, the number of debaters, the iteration limit, and other requirements. For example, we require the agents to “tit for tat” so as to create an atmosphere of debate.
Debaters.
There are debaters involved in the framework. In each debate iteration, the debaters speak one by one in a fixed order and express their arguments based on the previous debate history , i.e., . An example of a debater prompt appears below:
You are a debater. Hello and welcome to the translation competition, which will be conducted in a debate format. It’s not necessary to fully agree with each other ’ s perspectives, as our objective is to find the correct translation.
Judge.
We also design a judge to manage and monitor the whole debate process. The judge contains two different modes: (a) Discrinative Mode, in which the judge decides whether the correct solution can be obtained after all the debaters finish their arguments in the current iteration:
(1) |
If it is True, the debate is over. Otherwise, the debate continues. (b) Extractive Mode, in which the judge needs to extract the final solution based on the whole debate history: , since no correct solution is identified within the iteration limit of debate. An example of a judge prompt appears below:
You are a moderator. There will be two debaters involved in a translation debate competition. They will present their translations and discuss their perspectives on the correct English translation of the given Chinese text: "吃掉敌人一个师。". At the end of each round, you will evaluate the candidates’ translation submissions.
3 Challenging Testbeds
Source | 吃掉敌人一个师。 |
---|---|
Correct | Destroy a division of the enemy. |
Incorrect | Eat up an enemy division. |
We conduct experiments on two challenging tasks, namely, commonsense machine translation (i.e., Common MT), and counter-intuitive arithmetic reasoning (i.e., Counter-Intuitive AR), which require deep levels of contemplation for LLMs.
3.1 Commonsense Machine Translation
The Common MT dataset is composed of ChineseEnglish translation examples He et al. (2020), which are used to examine the ambiguity resolution ability of translation models. Within the challenging part of Common MT, each source sentence contains an ambiguous word. While these ambiguous words might appear to have a straightforward translation, such a literal interpretation is erroneous. Failure to identify and address such ambiguities may result in inaccurate translations. In this work, we adopt the lexical ambiguity test set in the following experiment. Table 1 lists an example, where the source word “吃掉” should be translated to “destroy” rather than the straightforward translation “eat up” by considering the common sense in the real world.
3.2 Counter-Intuitive Arithmetic Reasoning
Previous studies on thinking hierarchy Kong et al. (2022); Wei et al. (2022) suggest that we humans have a fast and intuitive system and a slow and logical system, and tend to run the lower level system before the higher level one. Inspired by this, we created a more challenging dataset named Counter-Intuitive Arithmetic Reasoning (Counter-Intuitive AR) to evaluate the reasoning abilities of LLMs at deep levels.
Components | Content |
---|---|
Question | When Alice walks up the hill, her speed is 1 m/s and when she goes down the hill, her speed is 3 m/s. Then when Alice walks up and down the hill, what is her average speed? |
Correct Answer | 1.5 m/s |
Explanation | If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d / (4d/3) = 3/2 m/s. |
Incorrect Answer | 2 m/s |
Explanation | Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s. |
Dataset Description.
Our Counter-Intuitive AR dataset contains 50 questions collected from elicitation questions Kong et al. (2022)222https://elicitation.info/questionnaire/1/, web data333https://www.geeksforgeeks.org/puzzles/ and manual collection. Compared to the commonly-used datasets, e.g., MultiArith Roy and Roth (2015), GSM8K Cobbe et al. (2021), our dataset presents two distinct challenges:
-
•
Resistance to Intuition. The questions in our dataset are embedded in hidden traps designed to elicit intuitive and appealing answers that are often incorrect. This feature evaluates the abilities of LLMs to resist the traps of superficial expressions.
-
•
Multi-Step Reasoning. Each correct answer within the dataset requires a rigorous multi-step reasoning process, thereby evaluating the capacity of LLMs to engage in complex decision-making and problem-solving.
Dataset Format.
In our Counter-Intuitive AR dataset, each example contains three key components (see Table 2 for an example). We elaborate on the details below:
-
•
Questions. The questions in our dataset are designed to stimulate counter-intuitive thinking, which aims to challenge conventional decision-making by presenting situations where the immediate, intuitive response is often incorrect.
-
•
Answers. Each question is provided with a correct answer, which requires deep comprehension of the question and commonsense knowledge. Additionally, we also provide a plausible yet incorrect answer for comparison.
-
•
Explanations. We provide a detailed explanation for each correct answer. The explanation outlines the step-by-step reasoning process that leads to the correct answer. Each incorrect answer is also complemented by an explanation demonstrating a seemingly logical reasoning process but ultimately leading to the incorrect answer. This reasoning process highlights the potential pitfalls and misconceptions during decision-making, especially when intuition is prioritized over rigorous logical reasoning.
4 Experiment
4.1 Setups
Backbone Models.
In this work, we mainly use three agents in our MAD framework, including two debaters (i.e., affirmative and negative) and a judge. Unless other stated, we use GPT-3.5-Turbo as the backbone model for all agents by default.
Compared Methods.
Generally, we compare our MAD framework with GPT-3.5-Turbo, GPT-4, and Self-Reflect on both tasks. We also include other baseline methods individually, namely, Rerank and MAPS for Common MT, CoT and Self-Consistency for Counter-Intuitive AR. Below elaborates the details of them:
-
•
Self-Reflect Shinn et al. (2023): This approach requires the LLM to scrutinize and refine its translation until it deems the current output satisfactory.
-
•
Rerank He et al. (2023): We sample the translations from the LLM for four times, from which we select the best candidate based on a quality estimation (QE) scorer444We use wmt21-comet-qe-da as the QE scorer.. This approach can be seen as analogous to self-consistency Wang et al. (2022), where the majority voting is replaced by an external QE scorer.
-
•
MAPS He et al. (2023): This method enables the LLM to mimic the human translation process: analyze and then translate, which can be viewed as a chain-of-thought method applied to translation task.
-
•
CoT Kojima et al. (2022): This approach concatenates a trigger sentence “Let’s think step by ste” to the test question.
-
•
Self-Consistency Wang et al. (2022): This method samples multiple responses from LLMs and determines the final answer through a majority vote.
We implement the methods on top of GPT-3.5-Turbo. The implementation details are described in Appendix A.1.
Method | Automatic | Human | ||
COMET | BLEURT | Score | ACC (%) | |
GPT-4 | 82.0 | 70.1 | 3.41 | 68.5 |
GPT-3.5-Turbo | 80.3 | 68.2 | 3.14 | 62.5 |
+ Rerank | 80.9 | 68.6 | 3.16 | 63.5 |
+ MAPS | 81.9 | 70.1 | 3.43 | 70.5 |
+ Self-Reflect | 81.0 | 69.1 | 3.43 | 69.0 |
+ MAD | 82.0 | 70.9 | 3.78 | 79.5↑17.0 |
Source | 吃掉敌人一个师。 |
---|---|
Correct Reference | Destroy a division of the enemy. |
Incorrect Reference | Eat up an enemy division. |
GPT-4 | Eat up an enemy division. |
GPT-3.5-Turbo | Eat up an enemy division. |
+ Self-Reflect | Eat up an enemy division. |
+ MAD | Eliminate an enemy division. |
Source | 他从后门搞到了不少名酒。 |
Correct Reference | He got a lot of famous wines from the road of fraud. |
Incorrect Reference | He got a lot of famous wines from the back door. |
GPT-4 | He got quite a few famous wines from the back door. |
GPT-3.5-Turbo | He obtained a lot of famous wines from the back door. |
+ Self-Reflect | He obtained a good amount of high-quality liquor through the back door. |
+ MAD | He got a lot of famous liquor from an unofficial source. |
Evaluation Metrics.
For Counter-Intuitive AR, we report the accuracy (ACC) of predictions. For Common MT, we adopt automatic metrics like COMET555https://github.com/Unbabel/COMET/, Unbabel/wmt22-cometkiwi-da and BLEURT666https://github.com/google-research/bleurt, BLEURT-20, which are widely adopted evaluation metrics for LLM-based translation literature He et al. (2023); Hendy et al. (2023); Garcia et al. (2023); Pilault et al. (2023). Moreover, we also employ human evaluation for the translation results in terms of two aspects: ambiguity resolution accuracy and direct assessment of translation quality in range .
4.2 Common MT
Results.
Table 3 presents the experimental results. MAPS and Self-Reflec achieve improvements over baseline GPT-3.5-Turbo. Remarkably, our proposed MAD, by utilizing GPT-3.5 as the backbone model, has demonstrated significant advancements over GPT-4 across both automatic and human evaluation metrics.
Case Study.
Table 4 shows example translations generated by baseline GPT-3.5-Turbo and the proposed MAD. We can find that the baseline GPT-3.5-Turbo (even the more powerful GPT-4) incorrectly translates the source words literally. Because of the DoT issue, Self-Reflect cannot rectify the literal translation. The proposed MAD framework, which explores divergent chain-of-thoughts, can generate the free translation of the underlined words within the source sentences. The detailed debate process of translation examples can be found in Appendix A.2.
4.3 Counter-Intuitive AR
Method | ACC (%) |
---|---|
GPT-4 | 52.0 |
GPT-3.5-Turbo | 20.0 |
+ CoT | 24.0 |
+ Self-Consistency | 30.0 |
+ Self-Reflect | 20.0 |
+ MAD | 36.0 |
Results
Table 5 lists the experimental results in terms of reasoning accuracy. We can observe that Self-Reflect does not improve over the baseline GPT-3.5-Turbo, while CoT and Self-Consistency bring some improvements. Our MAD framework, though not as good as GPT-4, outperforms all the other compared methods based on GPT-3.5-Turbo, which further demonstrates its effectiveness.
Question A | |
---|---|
The two circles are externally tangent and there | |
is no relative sliding. The radius of circle A is | |
1/3 the radius of circle B. Circle A rolls around | |
circle B one trip back to its starting point. How | |
many times will circle A revolve in total? | |
Correct Answer | 4 |
GPT-4 | 4 |
GPT-3.5-Turbo | 3 |
+ Self-Reflect | 3 |
+ MAD | 4 |
Question B | |
When Alice walks up the hill, her speed is 1 m/s | |
and when she goes down the hill, her speed is 3 | |
m/s. Then when Alice walks up and down the | |
hill, what is her average speed? | |
Correct Answer | 1.5 m/s |
GPT-4 | 1.5 m/s |
GPT-3.5-Turbo | 2 m/s |
+ Self-Reflect | 2 m/s |
+ MAD | 1.5 m/s |
Case Study
Table 6 presents two example outputs on Counter-Intuitive AR. We find both CoT and Self-Reflect fail to reach the right answer. With divergent thinking, our MAD framework emerges “we need to consider both the rotation around circle B and the rotation of circle A itself” and find the correct answer. The detailed debate process can be found in Appendix A.2.
5 Analysis
We conduct extensive analyses to gain a deeper understanding on our MAD framework. By default, we use the Common MT dataset.
Effect of Adaptive Break.
We first investigate the stopping strategy of debate. For each iteration, we force the judge to extract the final answer () instead of adaptively breaking the debate as in Algorithm 1. Figure 3 shows the results. We can observe that MAD performs better than self-reflection as the iteration increases. However, the highest COMET score appears at the first iteration and is also lower than the result of the adaptive break. It indicates that, for most examples, MAD can generate good translations at the first iteration such that the debate should be stopped. Forcing the debate to continue will harm the translation results, which demonstrates the reasonableness of our adaptive break strategy.
Essense of “Tit for Tat” State.
We then study how the intensity of “tit for tat” affects the performance of MAD. To achieve so, we design different prompts (see Appendix B) to initialize the debate process. As shown in Figure 4, asking the debaters to “tit for tat” (i.e., higher disagreement) is necessary for MAD to achieve good performance. However, we find that “must disagree with each other on every point ” (with a disagreement of 0.988) does not lead to the best performance. We speculate that continuous disagreement without finding common ground can contribute to polarization, where the debate becomes more about winning the argument than seeking truth or understanding. This can reinforce pre-existing biases and make it difficult to reach a consensus or meaningful decision.
ID | Aff | Neg | Jud | V.Aff | V.Neg | V.Tie |
\small{1}⃝ | Turbo | Turbo | Turbo | 87 | 104 | 9 |
\small{2}⃝ | GPT-4 | GPT-4 | GPT-4 | 67 | 124 | 9 |
\small{3}⃝ | Turbo | GPT-4 | Turbo | 78 | 114 | 8 |
\small{4}⃝ | Turbo | GPT-4 | GPT-4 | 52 | 136 | 12 |
\small{5}⃝ | GPT-4 | Turbo | GPT-4 | 120 | 77 | 3 |
Behavior of Agents.
We study the behavior of agents by calculating how many times the judge chooses the answers of each debater as the final solution. The results are listed in Table 7 and we have the following observations: (1) Comparing row \small{1}⃝ and \small{2}⃝, we find that the judge consistently favors the negative side, which is believed to contribute significantly to the performance improvement in MAD. When encountering complex tasks, the affirmative side tends to make mistakes that should be corrected to achieve improvements. (2) Comparing row \small{3}⃝ and \small{4}⃝ (or row \small{4}⃝ and \small{5}⃝), we find the judge shows a preference to the side with the same LLM as the backbone. This bias indicates that LLMs might not be a fair judge Wang et al. (2023) when different LLMs are used for the agents.
6 Related Work
Chain-of-Thought Prompting
Recently, Wei et al. (2022) has proposed chain-of-thought (CoT) prompting to improve the reasoning ability of LLMs. Specifically, CoT prompts LLMs to generate a series of intermediate steps that lead to the final answer of a multi-step problem. Most earlier work primarily concentrates on two main aspects: prompt design and decoding strategies. Zero-shot CoT Kojima et al. (2022) employs the trigger sentence “Let’s think step by step” to provide guidance for the decoding of LLMs. Advanced sampling strategies have been explored to improve CoT by generating diverse reasoning paths, e.g., Self-Consistency Wang et al. (2022), Auto-CoT Zhang et al. (2022), Active-Prompting Diao et al. (2023), Complexity-based Consistency Fu et al. (2022), Multi-Chain Reasoning Yoran et al. (2023), and Progressive-Hint Prompting Zheng et al. (2023).
With the emergence of powerful LLMs, approaches based on self-evaluation have attracted increasing attention. These approaches involve the generation of initial output, followed by evaluating the output to acquire feedback, which is then utilized to refine the output. Evaluation feedback can come from the model itself, e.g., Self-refine Madaan et al. (2023) and Tree of Thoughts Yao et al. (2023)) or external environments, e.g., QAaP Zhu et al. (2023a) and Reflection Shinn et al. (2023). The intuition behind these approaches involves the utilization of robust LLMs to mimic the human cognition process.
Generative Agents
Recently, LLMs-based multi-agent intelligent, e.g., Generative Agents Park et al. (2023), Ghost in the Minecraft Zhu et al. (2023b), GPT-Bargaining Fu et al. (2023), has drawn significant attention for enabling simulations of human behavior. Our work follows this research line to address the DoT problem of LLMs. Concurrent with our work, a few studies Xiong et al. (2023); Du et al. (2023) also explore the multi-agent debate framework to enhance the reasoning ability of LLMs. The main differences between the proposed MAD framework and these approaches are: (1) our work aims to address the DoT problem, which is an inherent deficiency of LLMs; and (2) we empirically find that our MAD framework can yield enhanced performance by employing agents with the identical backbone LLM.
7 Conclusion
We propose and define the Degeneration-of-Thought (DoT) problem in self-reflection, and address it by proposing the Multi-Agent Debate (MAD) framework to explore divergent chain-of-thoughts. We demonstrate the effectiveness of MAD on two challenging tasks and find that GPT-3.5-Turbo with MAD can even surpass GPT-4 on the Common MT dataset. Extensive analyses suggest that the adaptive break strategy of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. More interestingly, we find that LLMs might not be a fair judge if different LLMs are used for agents. Future works may include scheduling more agents in the debate, multi-agents for board games, and AI feedback for model alignment.
References
- Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv.
- Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv.
- Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv.
- Fu et al. (2023) Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
- Garcia et al. (2023) Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. arXiv.
- Gou et al. (2023) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing.
- He et al. (2020) Jie He, Tao Wang, Deyi Xiong, and Qun Liu. 2020. The box is in the pen: Evaluating commonsense reasoning in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662–3672, Online. Association for Computational Linguistics.
- He et al. (2023) Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2023. Exploring human-like translation strategy with large language models. arXiv.
- Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv.
- Jiao et al. (2023) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
- Kong et al. (2022) Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. 2022. Eliciting thinking hierarchy without a prior. NeurIPS.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv.
- OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv.
- Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv.
- Pilault et al. (2023) Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. arXiv.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1743–1752. Association for Computational Linguistics (ACL).
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.
- Wang et al. (2023) Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
- Wu et al. (2023) Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. arXiv.
- Xiong et al. (2023) Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv.
- Yoran et al. (2023) Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. arXiv.
- Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv.
- Zheng et al. (2023) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv.
- Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. arXiv.
- Zhu et al. (2023a) Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023a. Question answering as programming for solving time-sensitive questions. arXiv preprint arXiv:2305.14221.
- Zhu et al. (2023b) Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023b. Ghost in the minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv.
Appendix A Example Appendix
A.1 Implementation
Figure 5 displays a typical template of debate history, formatted according to the Turbo API.
I believe the correct translation of the Chinese text “吃掉敌人一个师” is “Eat up an enemy division.” The word “吃掉” literally means “eat up” or “consume.” |
I disagree with your translation. I believe the correct translation of the Chinese text “吃掉敌人一个师” is “Eliminate an enemy division." While “吃掉” can be translated as “eat up” or “consume”, it can also mean "eliminate" or "annihilate" in a military context. In this case, it is more appropriate to use "eliminate" as it conveys the idea of completely destroying the enemy division. |
Both sides presented valid arguments and translations that accurately convey the meaning of the original Chinese text. The choice between ’eat up’ and ’eliminate’ may depend on the context and tone of the text. Therefore, there is no clear preference for either side at this time. |
While “eliminate” does convey the idea of completely destroying the enemy division, it doesn’t capture the connotation of consuming or devouring the enemy that is implied by “吃掉.” |
In a military context, “吃掉” is often used to describe the complete destruction of the enemy, rather than the consumption of their resources or territory. |
Both sides presented valid arguments, but the negative side’s argument that ‘eliminate’ better captures the intensity and seriousness of the military language used in the original Chinese text was more convincing. Additionally, ‘eliminate’ accurately conveys the idea of completely destroying the enemy division. Therefore, the correct English translation of the Chinese text ‘吃掉敌人一个师’ is ‘Eliminate an enemy division.’ |
|
The phrase “从后门” literally translates to “from the back door.” |
I actually disagree with your translation. I believe a more accurate translation would be "He got a lot of famous liquor from the back door. Lastly, while “从后门” does literally translate to “from the back door,” it is a common Chinese idiom that means obtaining something through unofficial or illegitimate means. Therefore, the use of “from the back door” in the translation is not entirely accurate. Instead, the use of “from an unofficial source” or "through unofficial means" would be more appropriate. |
Both sides presented valid arguments and translations that accurately capture the meaning of the original Chinese text. The choice between wine and liquor is a matter of interpretation and context. The use of from the back door versus from an unofficial source is a matter of preference and style. Therefore, there is no clear preference for a translation candidate. |
I agree that “从后门” is a common Chinese idiom that means obtaining something through unofficial or illegitimate means. The use of “from an unofficial source” or “through unofficial means” can be more appropriate than “from the back door.” |
Yes, I agree with your perspective. |
While both translations accurately capture the meaning of the original Chinese text, the use of ‘liquor’ in candidate 2 is more accurate as ‘名酒’ is more commonly used to refer to liquor. Additionally, the use of ’from an unofficial source’ in candidate 3 more accurately conveys the connotation of ‘从后门’ as obtaining something through unofficial or illegitimate means. Therefore, the correct translation is: ‘He got a lot of famous liquor from an unofficial source.’ |
Alice’s average speed can be calculated by adding her speed going up the hill and her speed going down the hill, and then dividing by 2. So, (1 m/s + 3 m/s) / 2 = 2 m/s. Therefore, Alice’s average speed is 2 m/s. |
I disagree with your answer. The problem doesn’t give us any information about the distance Alice covers while going up and down the hill. So, we cannot simply add her speeds and divide by 2 to get her average speed. We need to know the distance she covers to calculate her average speed. |
The average speed is calculated by dividing the total distance by the total time taken. As Alice walks up and down the hill, the distance covered in both directions is the same. Therefore, the total distance covered is 2 times the distance of either direction. |
That’s a good point. But we can assume that Alice covers the same distance going up and down the hill. In that case, we can use the formula for average speed, which is total distance divided by total time. If Alice covers a distance of d going up and down the hill, then her total distance is 2d. Her time going up the hill is d/1 = d, and her time going down the hill is d/3. So, her total time is d + d/3 = 4d/3. Therefore, her average speed is 2d/(4d/3) = 3/2 m/s. |
Thank you for providing your answer. This average speed of 1.5 m/s takes into account both the uphill and downhill speeds, providing a balanced representation of Alice’s overall performance while walking up and down the hill. |
Yes. Let’s assume the distance is d. The time taken to walk up the hill is d/1 and the time taken to walk down the hill is d/3. The total time taken is d/1 + d/3 = 4d/3. Therefore, the average speed is (2d)/(4d/3) = 1.5 m/s. Hence, the negative side is correct. |
First, we need to find out how much of the original weight was water. If we have 1 ton of apples with 90% water, then the weight of water in those apples is 0.9 x 1 ton = 0.9 tons. Next, we need to find out how much of the new weight is water. If the apples now have 80% water, then the weight of water in those apples is 0.8 x new weight. We can set up an equation: |
0.8 x new weight = 0.9 tons |
To solve for the new weight, we can divide both sides by 0.8: |
new weight = 0.9 tons / 0.8 |
new weight = 1.125 tons |
Therefore, the weight of the apples now is 1.125 tons. |
That’s not correct. It’s important to note that the weight of the apples has decreased due to the loss of water, but the weight of the remaining solids remains the same. This is why we need to calculate the weight of remaining solids in both scenarios to find the new weight of the apples. |
Great idea! While it is true that the water content of the apples decreases from 90% to 80%, the solid, non-water content of the apples does not change. |
You are correct. The weight of the apples has decreased due to the loss of water, but the weight of the remaining solids remains the same. Therefore, we need to calculate the weight of the remaining solids in both scenarios to find the new weight of the apples. We can set up an equation: |
0.2 x new weight = 0.1 tons |
To solve for the new weight, we can divide both sides by 0.2: |
new weight = 0.1 tons / 0.2 |
new weight = 0.5 tons |
Therefore, the weight of the apples now is 0.5 tons. |
Thank you for your perspective. While it’s true that the weight of the apples themselves hasn’t changed, the overall weight of the apples with the water included has decreased. So, the weight of the apples now is 0.5 tons. |
The negative side correctly calculated the new weight of the apples by considering the solid content as a constant and using the formula to find the new weight based on the reduced water content percentage. |
A.2 Debate Case
Table 9 and Table 10 present the debate process of example translations in Section 4.2. Table 10 and Table 11 show the debate process of example answers in Section 4.3.
We observe that the affirmative side ( ) often relies on direct intuition, which can lead to incorrect or inappropriate responses. Conversely, the negative side ( ) demonstrates an ability to identify and rectify his mistakes.
Appendix B Level Control of “tit for tat” State
We modulate the level of “tit for tat” state outlined in Section 5 through appending natural language instructions to the debaters’ meta prompt. All the corresponding prompts are itemized in Table 12.
Level | Prompt |
---|---|
0 | Both sides must reach a full consensus on every point of the debate. Every statement must be agreed upon by both sides. |
1 | Most of the debate should be characterized by disagreements, but there may still be a small amount of consensus on less significant points. |
2 (Default) | It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct answer. |
3 | Both sides must disagree with each other on every point of the debate. There should be no consensus whatsoever. |