Self-Harmonized Chain of Thought

Ziqi Jin    Wei Lu
StatNLP Research Group
Singapore University of Technology and Design


Chain-of-thought (CoT) prompting reveals that large language models are capable of performing complex reasoning via intermediate steps. CoT prompting is mainly categorized into two paradigms. The first approach utilizes straightforward prompts like “Let’s think step by step” to generate a sequential thought process before yielding an answer. The second approach makes use of human-crafted, step-by-step demonstrations to guide the model’s reasoning process. Auto-CoT builds on the second paradigm by replacing manually written demonstrations with ones generated by large language models using “Let’s think step by step”. The generated demonstrations sometimes contain reasoning errors, highlighting the need to diversify demonstrations to mitigate their misleading effects. However, diverse demonstrations pose challenges for effective representations. In this work, we propose ECHO, a self-harmonized chain-of-thought prompting method. By consolidating diverse solution paths into a uniform and effective solution pattern, ECHO demonstrates the best overall performance in three reasoning domains.

Footnote 1: Our code is available:

1 Introduction

A recent technique known as chain-of-thought (CoT) prompting Wei et al. (2022) has enhanced the ability of large language models (LLMs) to tackle complex reasoning tasks. By decomposing complex problems into a series of intermediate steps, CoT prompting effectively guides LLMs in their reasoning process.

Figure 1: A comparison between ECHO and other CoT baselines. “Zero-CoT” is short for Zero-shot-CoT and “Few-CoT” is short for Few-shot-CoT. The demonstrations generated by Auto-CoT and ECHO will be applied as few-shot examples during inference.

CoT methods in large language models typically use two prompting paradigms: Zero-shot-CoT Kojima et al. (2022) and Few-shot-CoT Wei et al. (2022). Zero-shot-CoT employs a single, universal prompt like “Let’s think step by step” to guide models in forming reasoning chains without needing specific examples, enhancing their zero-shot reasoning capabilities. The second paradigm, Few-shot-CoT, involves few-shot prompting with human-crafted demonstrations, pairing each question with a detailed reasoning chain comprising multiple intermediate steps that lead to the final answer.

Few-shot-CoT necessitates the creation of human-crafted examples for each specific domain, which can be time-consuming and expensive. To address this, a method called Auto-CoT Zhang et al. (2023) has been developed. Auto-CoT automates the creation of demonstrations by initially employing Zero-shot-CoT. Specifically, it clusters questions based on their similarity and selects a representative question from each cluster. Then, using the prompt “Let’s think step by step,” Auto-CoT generates the rationale for each chosen question. This automated process efficiently forms demonstrations and reduces the manual effort required in Few-shot-CoT.

Although it removes the need for the human-crafted examples of Few-shot-CoT, Auto-CoT faces its own challenge, known as “misleading by similarity” Zhang et al. (2023), because the method depends on demonstrations initialized by Zero-shot-CoT. Some of these demonstrations may contain incorrect reasoning processes and answers. If one of the wrong demonstrations happens to be similar to the problem to be solved, it can mislead the model. To make the method more robust, Zhang et al. (2023) select one question from each cluster to maximize diversity. Although this diversity mitigates the misleading effect of similar demonstrations, it introduces two new problems. First, some demonstrations may be too dissimilar or irrelevant to the actual question, making them ineffective. Second, diverse demonstrations might encompass varied solution patterns, including nuanced ones, making them less representative to learn from.

To address the two problems brought about by diverse demonstrations, we propose a novel method that aims to unify diverse rationale patterns into one general pattern, which is the balance point between all patterns. We name our method ECHO (Self-Harmonized Chain of Thought). It consists of three main steps: First, split a given dataset into several clusters and select one representative question from each cluster. Second, use Zero-shot-CoT for each question to generate its rationale. Third, employ a dynamic prompting mechanism where demonstrations can be improved by each other. In this step, one demonstration is randomly chosen to regenerate its rationale at each iteration, while the remaining demonstrations serve as in-context examples. This process ensures continuous refinement and adaptation of reasoning patterns between demonstrations.

We conduct experiments across three different reasoning domains. ECHO achieves the best overall performance among the compared methods, improving on Auto-CoT by 2.8% on average (Table 1). We also conduct comprehensive ablation studies to understand why unifying diversity improves performance.

We summarize our main contributions in this work as follows:

  • We propose ECHO, a novel approach that automatically improves the quality of demonstrations in the CoT process by unifying the diversity.

  • We design an iterative unifying prompting framework that is generally effective across various tasks via reducing the demonstrations’ variety.

  • Extensive experiments have shown that our proposed approach offers competitive results on arithmetic, commonsense and symbolic reasoning domains by reducing diversity.

Figure 2: Overview of our ECHO method. In the demonstration unification process, ECHO iteratively re-generates the rationale of one demonstration with other demonstrations as in-context examples.

2 Related Work

CoT prompting, a gradient-free method, effectively encourages LLMs to generate intermediate reasoning steps before yielding the final answer. Wei et al. (2022) propose Few-shot-CoT, which utilizes effective manual demonstrations to facilitate CoT reasoning. Later, Kojima et al. (2022) extended this method to zero-shot reasoning.

Recent studies have focused on creating more complex demonstrations or employing ensemble-like strategies. For example, some prompting methods Zhou et al. (2022); Wang et al. (2023); Yao et al. (2023) adopt a problem decomposition approach, breaking down complex problems into simpler subproblems. Self-consistency and some other CoT methods Wang et al. (2022); Asai et al. (2023) involve reasoning from multiple paths. Moreover, some prompting methods can generate executable programs to aid in the computation process within the arithmetic domain Gao et al. (2022); Chen et al. (2022); Pi et al. (2022).

Although designing increasingly complex prompts improves CoT effectiveness, it also increases the human efforts involved. Surprisingly, little attention has been paid to automating prompt generation. Among the few efforts in this area, Auto-CoT Zhang et al. (2023) stands out. Auto-CoT automates the process by clustering a dataset and selecting a representative question from each cluster, followed by using Zero-shot-CoT to generate rationales for these questions.

A key consideration in Auto-CoT Zhang et al. (2023) is the avoidance of using questions in the same cluster as demonstrations. The rationale behind this choice is rooted in the observation that if demonstrations are too similar to the target question, there is a higher risk that the model might replicate its mistakes, since Zero-shot-CoT Kojima et al. (2022) does not guarantee correctness in its responses (i.e. the demonstrations might be wrong). However, the use of diverse demonstrations might include irrelevant demonstrations or non-representative solution patterns. Our approach is designed to address these potential limitations.

3 Self-Harmonized Chain of Thought

Our method involves three main steps, as shown in Figure 2: (1) question clustering: partition the questions of a given dataset into a few clusters based on their similarity; (2) demonstration sampling: select a representative question from each cluster and generate its reasoning chain using Zero-shot-CoT; (3) demonstration unification: regenerate the rationale of one demonstration at a time, with the remaining demonstrations serving as in-context examples. The first two steps are similar to Auto-CoT Zhang et al. (2023), with minor differences. The unification process forces each rationale to learn from the remaining ones to build a coherent pattern. Each iteration updates every demonstration once, and the process continues over multiple iterations to achieve consistency across all demonstrations.

Question Clustering

In our method, we first apply clustering to a set of questions $\mathcal{Q}$. Each question is transformed into a fixed-size vector representation using Sentence-BERT Reimers and Gurevych (2019). Following this, we employ a k-means clustering model to group these vector representations into $k$ distinct clusters. Within each cluster $i$, the questions are ordered in a list $\mathbf{q}^{(i)} = [q_1^{(i)}, q_2^{(i)}, \ldots]$ by their distance to the centroid of the cluster. This procedure is detailed in lines 1 through 5 of Algorithm 1. In Auto-CoT Zhang et al. (2023), the number of clusters equals the number of output demonstrations, and one representative of each cluster is chosen as a demonstration. Our approach instead allows for a greater number of clusters, which brings a wider range of demonstrations into the unification process. This strategy has two benefits: it facilitates learning from more diverse patterns through the additional demonstrations, and it ensures that the final pattern can be applied to a wider range of demonstrations.
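The ordering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sentence embeddings, the k-means cluster labels, and the centroids have already been computed elsewhere, and the function name `sort_questions_by_centroid_distance` and the toy 2-D vectors are hypothetical.

```python
import math
from collections import defaultdict


def sort_questions_by_centroid_distance(questions, embeddings, labels, centroids):
    """Group questions by cluster and sort each group by distance to its centroid.

    questions:  list of question strings
    embeddings: one float vector per question (assumed to come from a
                sentence encoder such as Sentence-BERT)
    labels:     cluster index of each question (assumed to come from k-means)
    centroids:  list of centroid vectors, indexed by cluster
    """
    clusters = defaultdict(list)
    for question, vector, label in zip(questions, embeddings, labels):
        distance = math.dist(vector, centroids[label])  # Euclidean distance
        clusters[label].append((distance, question))
    # Ascending distance: the most central question comes first in each list.
    return {
        label: [question for _, question in sorted(members, key=lambda m: m[0])]
        for label, members in clusters.items()
    }


# Toy 2-D vectors standing in for real sentence embeddings.
questions = ["q_a", "q_b", "q_c", "q_d"]
embeddings = [[0.0, 0.0], [0.9, 1.0], [0.2, 0.1], [1.0, 1.0]]
labels = [0, 1, 0, 1]
centroids = [[0.0, 0.0], [1.0, 1.0]]

ordered = sort_questions_by_centroid_distance(questions, embeddings, labels, centroids)
print(ordered)
```

Keeping the embedding and clustering steps outside the function means any encoder or k-means implementation could be plugged in; only the sorting behaviour is fixed here.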

Demonstration Sampling

In this phase, we sample one demonstration $d^{(i)}$ for each cluster $i$. Within each cluster, we evaluate the questions in the list $\mathbf{q}^{(i)} = [q_1^{(i)}, q_2^{(i)}, \ldots]$ in order against our predefined selection criteria. When assessing each question $q_j^{(i)}$, we generate its rationale using the prompt “Let’s think step by step” Kojima et al. (2022), akin to the methodology used in Auto-CoT Zhang et al. (2023). Our selection criteria include two constraints: the question length should not exceed 60 tokens, and its corresponding rationale $r^{(i)}$ should contain no more than 5 steps. As per the Zero-shot-CoT approach, steps are demarcated with ‘\n’, and we count the steps by tallying the number of these separators. The specifics of this procedure are detailed from line 6 to line 14 in Algorithm 1.

Algorithm 1 Self-Harmonized CoT
Require: A set of questions $\mathcal{Q}$, the number of demonstrations for unification $k$, the number of iterations $T$, and the number of output demonstrations $m$
Ensure: Demonstration list $\mathcal{D} = [d^{(1)}, \ldots, d^{(m)}]$
1: Encode each $q$ in $\mathcal{Q}$ by Sentence-BERT    ▷ Question Clustering
2: Cluster all the encoded question representations into $k$ clusters
3: for each cluster $i = 1, \ldots, k$ do
4:      Sort questions $\mathbf{q}^{(i)} = [q_1^{(i)}, q_2^{(i)}, \ldots]$ in ascending order of distance to the cluster center
5: end for
6: for each cluster $i = 1, \ldots, k$ do    ▷ Demonstration Sampling
7:      for each question $q_j^{(i)}$ in $\mathbf{q}^{(i)}$ do
8:          Generate rationale $r_j^{(i)}$ for $q_j^{(i)}$ using Zero-Shot-CoT
9:          if $q_j^{(i)}, r_j^{(i)}$ satisfy the selection criteria then
10:              Add $d^{(i)} = q^{(i)} \circ r_0^{(i)}$ to $\mathcal{D}$, where $q^{(i)} = q_j^{(i)}$ and $r_0^{(i)} = r_j^{(i)}$
11:              break
12:         end if
13:     end for
14: end for
15: for each iteration $t = 1, \ldots, T$ do    ▷ Demonstration Unification
16:     for each demonstration $d^{(i)}$ in $\mathcal{D}$ do
17:         Create prompt $P$ with demonstrations $\mathcal{D} \setminus d^{(i)}$ in random order
18:         Regenerate rationale $r_{new}^{(i)}$ for question $q^{(i)}$ using Few-Shot-CoT
19:         Update $d^{(i)} = q^{(i)} \circ r_{new}^{(i)}$ in $\mathcal{D}$
20:     end for
21: end for
22: Keep the first $m$ elements and drop the remaining elements from $\mathcal{D}$
23: return $\mathcal{D}$
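The selection criteria of the demonstration sampling step can be sketched as below. This is an illustrative sketch under stated assumptions: tokens are approximated by whitespace splitting because the tokenizer is not specified in the text, `generate_rationale` is a hypothetical callable standing in for a Zero-shot-CoT call, and `fake_zero_shot_cot` is a toy stub rather than a real model.

```python
def count_steps(rationale):
    """Count reasoning steps as the number of newline separators."""
    return rationale.count("\n")


def satisfies_criteria(question, rationale, max_question_tokens=60, max_steps=5):
    """Check the two selection constraints: question length and step count.

    Tokens are approximated by whitespace splitting (an assumption; the
    tokenizer is not specified in the text).
    """
    return (len(question.split()) <= max_question_tokens
            and count_steps(rationale) <= max_steps)


def sample_demonstration(sorted_questions, generate_rationale):
    """Return the first (question, rationale) pair that meets the criteria.

    sorted_questions:   questions of one cluster, closest to the centroid first
    generate_rationale: hypothetical callable wrapping a Zero-shot-CoT call
    """
    for question in sorted_questions:
        rationale = generate_rationale(question)
        if satisfies_criteria(question, rationale):
            return question, rationale
    return None  # no question in this cluster met the criteria


def fake_zero_shot_cot(question):
    # Toy stand-in for an LLM call: the question "long" gets too many steps.
    if question == "long":
        return "\n".join(f"step {i}" for i in range(8))  # 7 separators
    return "step 1\nstep 2"


demo = sample_demonstration(["long", "short"], fake_zero_shot_cot)
print(demo)
```

The early return mirrors the `break` in the algorithm: only the first acceptable question of each cluster is kept.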

Demonstration Unification

In this step, we need to regenerate the reasoning chain for the sampled questions to form a convergent pattern.

In each iteration, each demonstration $d^{(i)} = q^{(i)} \circ r^{(i)}$ in the set $\mathcal{D}$ is updated once, starting from the initial rationale $r^{(i)} = r_0^{(i)}$ produced in the sampling step. The rationale $r^{(i)}$ of a selected demonstration $d^{(i)}$ is regenerated using the remaining shuffled demonstrations $\mathcal{D} \setminus d^{(i)}$ as in-context demonstrations. The newly generated rationale then replaces the previous one. Since the number of demonstrations at this stage might exceed the number used during inference, we drop some of them to match the number required for inference. The detailed implementation is shown from line 15 to line 23 in Algorithm 1.

As each regenerated rationale is conditioned on other demonstrations (as in-context examples), it gradually aligns with the patterns observed in those example rationales. Through successive iterations, this process leads to convergence, resulting in a uniform pattern across all rationales.
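The unification loop can be sketched as follows. This is a minimal illustration rather than the released implementation: `regenerate` is a hypothetical callable standing in for a Few-shot-CoT call to an LLM, and `fake_few_shot_cot` is a toy stub that only imitates convergence by copying the most common opening word of its context.

```python
import random


def unify_demonstrations(demos, regenerate, num_iterations, num_output, seed=0):
    """Iteratively regenerate each rationale with the other demos as context.

    demos:          list of (question, rationale) pairs from the sampling step
    regenerate:     hypothetical callable (question, context_demos) -> rationale
    num_iterations: T in the text
    num_output:     m in the text; only the first m demonstrations are kept
    """
    rng = random.Random(seed)
    demos = list(demos)
    for _ in range(num_iterations):
        for i, (question, _) in enumerate(demos):
            # All demonstrations except the one being updated, in random order.
            context = demos[:i] + demos[i + 1:]
            rng.shuffle(context)
            # In-place update: later demos in this pass see the new rationale.
            demos[i] = (question, regenerate(question, context))
    return demos[:num_output]


def fake_few_shot_cot(question, context):
    # Toy stand-in: adopt the most common opening word among context rationales.
    openings = [rationale.split()[0] for _, rationale in context]
    opening = max(sorted(set(openings)), key=openings.count)
    return f"{opening} solve {question}"


initial = [("q1", "First add"), ("q2", "First subtract"), ("q3", "Sure let us see")]
unified = unify_demonstrations(initial, fake_few_shot_cot, num_iterations=2, num_output=2)
print(unified)
```

With the toy stub, the outlier opening disappears after the first pass, which is a simplified picture of the pattern alignment described above.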

In our method, the number of clusters $k$ typically exceeds the number of output demonstrations $m$, allowing a wider range of demonstrations to be included in the unification process. This makes the final unified pattern more robust, more generalizable, and suitable for a wider array of samples. The process is similar to information compression, where the insights from a larger set of demonstrations are distilled into the final set of $m$ demonstrations. This oversampling approach enhances both the representativeness and the adaptability of the final output demonstrations. Note that although the demonstration unification process introduces more demonstrations, we use the same number of demonstrations during the inference stage as prior approaches in order to make a fair comparison.

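| Method | MultiArith | GSM8K | SingleEq | AddSub | AQuA | SVAMP | Arith. avg. | CSQA | Strategy | Comm. avg. | Letter | Coin | Symb. avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot | 74.0 | 20.8 | 87.2 | 86.8 | 26.8 | 72.3 | 61.3 | 71.4 | 55.6 | 63.5 | 1.2 | 48.4 | 24.8 | 54.5 |
| Zero-Shot-CoT | 84.2 | 74.5 | 88.0 | 84.3 | 54.3 | 78.5 | 77.3 | 69.6 | 53.1 | 61.4 | 69.6 | 81.6 | 63.1 | 71.3 |
| Few-Shot | 80.0 | 20.8 | 87.0 | 85.6 | 30.7 | 76.0 | 63.4 | 78.4 | 49.5 | 64.0 | 6.2 | 57.2 | 31.7 | 57.1 |
| Few-Shot-CoT | 98.3 | 77.9 | 92.5 | 85.6 | 56.7 | 81.5 | 82.1 | 76.1 | 63.2 | 69.7 | 81.6 | 95.4 | 88.5 | 80.9 |
| Auto-CoT | 96.0 | 76.2 | 92.1 | 85.8 | 52.4 | 82.6 | 80.8 | 74.9 | 56.4 | 65.7 | 76.2 | 99.4 | 87.8 | 79.2 |
| ECHO (k=m, T=1) | 98.0 | 78.4 | 91.3 | 87.3 | 52.8 | 81.1 | 81.5 | 77.2 | 59.9 | 68.6 | 83.6 | 99.4 | 91.5 | 80.9 |
| ECHO (k=max, T=1) | 97.7 | 78.5 | 89.8 | 87.1 | 55.5 | 84.2 | 82.1 | 73.8 | 58.0 | 65.9 | 81.3 | 99.8 | 90.6 | 80.8 |
| ECHO (k=max, T=4) | 97.2 | 76.9 | 93.1 | 86.8 | 59.1 | 85.4 | 83.1 | 77.5 | 63.4 | 70.5 | 81.0 | 99.6 | 90.3 | 82.0 |

Table 1: Accuracy on ten datasets from three categories of reasoning tasks: arithmetic (MultiArith, GSM8K, SingleEq, AddSub, AQuA, SVAMP), commonsense (CSQA, Strategy), and symbolic (Letter, Coin).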

4 Why does it Work?

In this section, we offer some insight into why the proposed approach can be effective.

Consider a set of questions $\mathcal{Q} = \{q^{(1)}, q^{(2)}, \ldots, q^{(m)}\}$. Assume that we can use Zero-Shot-CoT to arrive at the rationales (with answers): $\mathcal{R} = \{r^{(1)}, r^{(2)}, \ldots, r^{(m)}\}$. Auto-CoT performs the following: First, it constructs demonstrations $\mathcal{D} = \{d^{(1)}, d^{(2)}, \ldots, d^{(m)}\}$, where $d^{(k)} = q^{(k)} \circ r^{(k)}$. During the inference stage, such demonstrations are then used as few-shot examples for generating “refined” rationales for the $i$-th instance, based on $\mathcal{D} \setminus d^{(i)}$, which we denote as $r_0^{(i)}$. Completing this refinement process leads to $\mathcal{R}_0 = \{r_0^{(1)}, r_0^{(2)}, \ldots, r_0^{(m)}\}$.

Empirically, Auto-CoT was shown to yield better results than Zero-Shot-CoT. Mathematically, this leads to the following hypothesis:

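$$p(\mathcal{Q}, \mathcal{R}_0) \geq p(\mathcal{Q}, \mathcal{R}) \qquad (1)$$

where $p(\mathcal{Q}, \mathcal{R})$ returns the probability for the set of rationales $\mathcal{R}$ to be assessed as correct, for the set of questions $\mathcal{Q}$.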

Eq. 1 shows that the set of refined rationales is likely to be more accurate than the original set of rationales, which were generated without any demonstrations.

In practice, we can keep refining the set of rationales by reconstructing the rationales. Specifically, we can first construct $\mathcal{D}_0 = \{d_0^{(1)}, d_0^{(2)}, \ldots, d_0^{(m)}\}$, where $d_0^{(k)} = q^{(k)} \circ r_0^{(k)}$. Repeating the above process, we arrive at $\mathcal{R}_1 = \{r_1^{(1)}, r_1^{(2)}, \ldots, r_1^{(m)}\}$.

If the above hypothesis for Auto-CoT is true, it is not unreasonable to state the following hypothesis as its extension:

$$p(\mathcal{Q}, \mathcal{R}_1) \geq p(\mathcal{Q}, \mathcal{R}_0) \qquad (2)$$

We can repeat the above process $T$ times to arrive at a chain of inequalities. This chain offers an intuition for why our proposed approach works empirically. Note, however, that the above process updates the rationales in a batch mode, whereas in our work we adopt an on-line approach, in which the most recently updated rationales are used for the next instance within the same iteration.
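Written out, the hypothesized chain would take the following form. This is a sketch under assumptions: $\mathcal{R}$ denotes the Zero-Shot-CoT rationales and $\mathcal{R}_t$ the rationale set after round $t+1$ of refinement, the relation $\geq$ is our reading of "likely to be more accurate", and each link is a hypothesis rather than a proven result.

```latex
% Hypothesized chain after T rounds of refinement (each link is an assumption).
\begin{equation*}
p(\mathcal{Q}, \mathcal{R}_{T-1})
  \;\geq\; \cdots
  \;\geq\; p(\mathcal{Q}, \mathcal{R}_{1})
  \;\geq\; p(\mathcal{Q}, \mathcal{R}_{0})
  \;\geq\; p(\mathcal{Q}, \mathcal{R})
\end{equation*}
```

Since Section 6.3 reports that performance declines after too many iterations, the chain should be read as plausible only for a small number of rounds.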

5 Experimental Setup

Tasks and Datasets

Following prior work on CoT, we evaluate our method on 10 reasoning datasets, including 6 arithmetic datasets: SingleEq Koncel-Kedziorski et al. (2015), AddSub Hosseini et al. (2014), MultiArith Roy and Roth (2015), GSM8K Cobbe et al. (2021), AQUA-RAT Ling et al. (2017), and SVAMP Patel et al. (2021), 2 commonsense reasoning datasets: CommonsenseQA Talmor et al. (2019) and StrategyQA Geva et al. (2021), and 2 symbolic reasoning datasets: Last Letter and Coin Flip Wei et al. (2022). Table 6 contains the statistics for all benchmarks.


Following Wei et al. (2022), we used the OpenAI API for our experiments. We chose GPT-3.5-Turbo-0301 in our main experiments because it is easy to access and more affordable. We also tested on Mixtral-8x7B in the ablation study to validate the generalizability of our method. To ensure the reproducibility of our experiments, we fixed the temperature parameter at 0.


For our primary experiment, we simply set the iteration count $T=1$. We also set the number of demonstrations for the unification process equal to the number of output demonstrations, $k=m$.

Footnote 2: We followed Wei et al. (2022) for the number of demonstrations: 4 for AQUA, Coin Flip, and Last Letters; 6 for StrategyQA; 7 for CSQA; and 8 for all other datasets.

6 Results

6.1 Performance of ECHO

The data presented in Table 1 offers a detailed comparison of various methods across three categories: Arithmetic, Commonsense, and Symbolic. In these evaluations, ECHO consistently outperforms Auto-CoT Zhang et al. (2023) in each domain, matching Few-Shot-CoT’s overall performance.

Notably, Auto-CoT falls short in performance compared to Few-Shot-CoT across all domains, particularly in Commonsense reasoning. This suggests that Auto-CoT may not replace human effort in Few-Shot-CoT. ECHO, on the other hand, aligns closely with Few-Shot-CoT’s overall performance, indicating its potential as a significant advancement towards a powerful reasoning framework without manual efforts.

However, ECHO outperforms Few-Shot-CoT’s average score on symbolic tasks (91.5 vs. 88.5) but lags on arithmetic (81.5 vs. 82.1) and commonsense tasks (68.6 vs. 69.7). We observe that symbolic reasoning tasks tend to have a uniform pattern, making a single iteration sufficient for effective reasoning, whereas arithmetic and commonsense reasoning tasks are more diverse in question contexts and answer patterns. We hypothesize that this lag in performance is due to an insufficient number of rationales and iterations in the self-harmonization process.

Manual Rationale (0 iterations)
  • There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6.
  • There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5.
  • Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8.

Manual Rationale (3 iterations)
  • There are 15 trees in the grove. After the grove workers plant trees today, there will be 21 trees. So the grove workers planted 21 - 15 = 6 trees today.
  • There are 3 cars in the parking lot. 2 more cars arrive, making a total of 3 + 2 = 5 cars in the parking lot.
  • Jason started with 20 lollipops. After giving some to Denny, he had 12 lollipops left. So he gave away 20 - 12 = 8 lollipops to Denny.

Generated Rationale (0 iterations)
  • First, we need to find out how many total nickels Sam had before his dad gave him more. Sam had 24 nickels to start with, and his dad gave him 39 more nickels. So, Sam now has a total of 24 + 39 = 63 nickels.
  • First, we need to add the weight of peanuts and raisins to find the total weight of snacks. 0.1 pounds (peanuts) + 0.4 pounds (raisins) = 0.5 pounds Therefore, Kelly bought 0.5 pounds of snacks in all.
  • First, we need to add the number of pencils that were already in the drawer (33) to the number of pencils Joan placed in the drawer (27). 33 + 27 = 60 So there are now 60 pencils in the drawer.

Generated Rationale (3 iterations)
  • First, we need to add the number of nickels Sam had originally to the number of nickels his dad gave him. 24 (original nickels) + 39 (new nickels) = 63 So Sam now has 63 nickels.
  • First, we need to add the weight of peanuts to the weight of raisins. 0.1 (peanuts) + 0.4 (raisins) = 0.5 So Kelly bought a total of 0.5 pounds of snacks.
  • First, we need to add the number of pencils Joan placed in the drawer to the original number of pencils in the drawer. 33 (original pencils) + 27 (new pencils) = 60 So there are now 60 pencils in the drawer in total.

Table 2: A comparison of 3 demonstrations between manual prompts and generated prompts.
| Type | Divergence (RoBERTa-Large) | Divergence (T5-XL) | Divergence (T5-XXL) |
|---|---|---|---|
| Manual | 0.758 | 0.245 | 0.276 |
| Generated | 0.660 | 0.201 | 0.244 |
| Δ | 0.098 | 0.044 | 0.032 |
| Δ % | 12.9% | 18.0% | 11.6% |

Table 3: The averaged divergence for manual and generated rationales.
Figure 3: Performance for ECHO initialized by manual prompts and Zero-Shot-CoT generated prompts with 0, 1 and 3 iterations.

6.2 A comparison between ECHO-generated and Manual Prompts

To understand the difference between ECHO-generated and manually written prompts, we also initialized ECHO with manually written prompts. These experiments report the average score over the six arithmetic reasoning benchmarks.

The rationales of ECHO are initialized with the Zero-Shot-CoT Kojima et al. (2022), while manual prompts are crafted by humans. We initialized our method with both automatically generated and human-written prompts to check if ECHO can be directly applied to existing manual prompts. The outcomes, depicted in Figure 3, clearly indicate that manually created prompts surpassed those generated via Zero-Shot-CoT before ECHO is applied. However, the difference in performance diminishes after one iteration of ECHO. Following three iterations, the automatically generated prompts begin to exceed the manual ones, achieving peak performance in the experiment.

We selected three examples from both manually written prompts and ECHO-generated prompts for analysis in Table 2. The manually written prompts are coherent before the first iteration; therefore, additional iterations do not enhance their uniformity. In contrast, the generated rationales start off highly varied, but upon the application of ECHO, they tend to unify towards a singular pattern. It follows from this that employing generated rationales provides a more diverse set of patterns to choose from, thereby aiding in identifying the most effective pattern. Additionally, we quantified the divergence among the highest-performing demonstrations from manual (0 iterations) and generated (3 iterations) processes.

To better understand to what extent the divergence is reduced, we compare the average divergence between the manual rationales and the generated rationales. To calculate the divergence, we employ the “roberta-large-nli-stsb-mean-tokens”, “sentence-t5-large” and the “sentence-t5-xxl” model as our encoder, which are some of the most powerful open-source encoders available. We compute the pairwise cosine similarity for each sentence pair and average them to obtain the average similarity. The average divergence is defined as 1 - average similarity. Table 3 presents these findings, revealing that the average divergence is significantly reduced in the generated demonstrations.
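The divergence measure defined above (1 minus the average pairwise cosine similarity) can be sketched as follows. This is a minimal illustration that assumes the rationale embeddings are already available as plain float vectors and that each unordered pair of distinct rationales is counted once; the toy vectors are hypothetical and do not come from any of the encoders named in the text.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_divergence(embeddings):
    """Return 1 minus the mean cosine similarity over all unordered pairs."""
    pairs = list(combinations(embeddings, 2))
    avg_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - avg_similarity


# Toy vectors: parallel embeddings have no divergence, orthogonal ones have 1.
parallel = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]

print(average_divergence(parallel))
print(average_divergence(orthogonal))
```

A lower value indicates that the rationales are closer to one another in embedding space, which is the sense in which Table 3 compares manual and generated demonstrations.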

6.3 Effect of Hyperparameters

Building on these insights, we see that the initial diversity matters to our method. Therefore, we tested our method in a different configuration, where the number of demonstrations considered, $k$, exceeds the original count, $m$, in the self-harmonized process. It is important to note that this adjustment does not alter the number of output demonstrations used during inference.

The reason behind this enhancement is to increase diversity by giving a more varied set of patterns to start with. This increased diversity is expected to increase the adaptability of ECHO, further increasing its performance across different reasoning tasks. To maximize the diversity, we choose the largest k within the token limit of the model.

The results for this adjustment are shown in Table 1. Although it brings some improvement in the arithmetic domain (81.5 to 82.1 on average), the overall performance declines slightly (80.9 to 80.8), driven by the commonsense and symbolic domains. Based on the observation from Figure 3, we hypothesize that a single iteration may be insufficient when dealing with a larger number of demonstrations. This finding points towards the need for more iterations.

We further allow our ECHO method to have more iterations. To understand how many iterations are needed, we test ECHO across a spectrum of iteration counts in Figure 4. In all cases, our method’s average overall performance remained superior to that of Auto-CoT Zhang et al. (2023).

Figure 4: Performance of ECHO in different domains. We label the performance of “Auto-CoT” in orange.

Our experiments reveal a notable trend for overfitting in our method when subjected to an excessive number of iterations. This is evident from Figure 4, where the model’s performance peaks at an optimal iteration count and then gradually declines. The decline in effectiveness is primarily due to the model’s rationales becoming excessively fit to the output demonstrations.

The case study in Appendix A provides further insight into this situation. Even after a single iteration, the model begins to adopt a consistent rationale structure. Phrases like “Sure, let’s break this down” or “First, we need to find” become standard openings, regardless of the problem’s complexity. This early emergence of a uniform approach is observed in both single-step and multi-step solutions.

After 32 iterations, the model increasingly attempts to condense multi-step reasoning into a single step. This simplification results in complex equations, where every numerical value is meticulously followed by its unit, adding unnecessary complexity to the reasoning process. This tendency suggests an overemphasis on uniformity at the cost of conciseness and clarity.

Upon analyzing the data presented in Figure 4, we find that an iteration count of $T=4$ offers the best balance for our method. The overall average performance peaks at this iteration count, and performance also holds up in each individual domain, indicating that $T=4$ generalizes across domains.

6.4 Does ECHO Suffer from “Misleading by Similarity”?

The Auto-CoT approach Zhang et al. (2023) acknowledges that their method is affected by “misleading by similarity.” Auto-CoT relies on various demonstrations provided by Zero-shot-CoT Kojima et al. (2022), which might include incorrect reasoning processes or answers. Such misleading examples become particularly problematic when they closely resemble the target problem, causing the LLM to repeat their mistakes.

To understand this limitation, we performed a series of experiments on mathematical reasoning tasks under three conditions: (1) Diverse. Here, we chose demonstrations that sat at the centroid of each cluster, ensuring a diverse representation. (2) Random. We randomly selected demonstrations from the dataset, introducing variability. (3) Uniform. We strictly used demonstrations from the same cluster as the target question, promoting similarity. For efficiency, we tested iteration counts $T \in \{1, 2\}$ and set the number of demonstrations to $m=8$ for all our tests.
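The three selection conditions can be sketched as below. This is an illustrative sketch only: the function name `select_demonstrations`, the toy clusters, and the choice to take the questions closest to the centroid in the uniform setting are assumptions, since the text does not specify how questions are picked within a cluster.

```python
import random


def select_demonstrations(clusters, target_cluster, mode, m, seed=0):
    """Pick up to m demonstration questions under one of three conditions.

    clusters:       dict mapping cluster id -> questions sorted by distance
                    to the centroid (closest first)
    target_cluster: cluster id of the question to be answered
    mode:           "diverse", "random", or "uniform"
    """
    rng = random.Random(seed)
    if mode == "diverse":
        # One question per cluster: the one closest to each centroid.
        pool = [members[0] for members in clusters.values() if members]
        return pool[:m]
    if mode == "random":
        # Sample from the whole dataset, ignoring cluster membership.
        pool = [q for members in clusters.values() for q in members]
        return rng.sample(pool, min(m, len(pool)))
    if mode == "uniform":
        # Only questions from the target question's own cluster
        # (taking the most central ones is an assumption).
        return clusters[target_cluster][:m]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical toy clusters.
toy_clusters = {0: ["a1", "a2", "a3"], 1: ["b1", "b2"], 2: ["c1"]}

diverse = select_demonstrations(toy_clusters, 0, "diverse", m=3)
uniform = select_demonstrations(toy_clusters, 0, "uniform", m=2)
randomly = select_demonstrations(toy_clusters, 0, "random", m=2)
print(diverse, uniform, randomly)
```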

The results are illustrated in Figure 5. Unlike Auto-CoT, our method improved as the demonstrations became more uniform. It appears that a closer relationship between the demonstrations and the question enhances utility, outweighing the risk of being misled by incorrect demonstrations.

Figure 5: Performance under different settings.
| Method | MultiArith | GSM8K | SingleEq | AddSub | AQuA | SVAMP | Arith. avg. | CSQA | Strategy | Comm. avg. | Letter | Coin | Symb. avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Few-Shot-CoT | 94.0 | 65.3 | 87.4 | 83.5 | 47.2 | 78.2 | 75.9 | 74.1 | 61.5 | 67.8 | 59.4 | 97.0 | 78.2 | 74.8 |
| Auto-CoT | 92.1 | 71.6 | 88.8 | 84.1 | 45.7 | 82.2 | 77.4 | 71.8 | 55.9 | 63.9 | 53.2 | 71.4 | 62.3 | 71.7 |
| ECHO (k=max, T=4) | 96.8 | 72.4 | 88.0 | 85.3 | 55.5 | 82.8 | 80.2 | 74.2 | 54.6 | 64.4 | 56.8 | 73.8 | 65.3 | 74.0 |

Table 4: Accuracy on ten datasets for Mixtral-8x7B.
| Method | MultiArith | GSM8K | SingleEq | AddSub | AQuA | SVAMP | Arith. avg. | CSQA | Strategy | Comm. avg. | Letter | Coin | Symb. avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Few-Shot-CoT | 98.3 | 77.9 | 92.5 | 85.6 | 56.7 | 81.5 | 82.1 | 76.1 | 63.2 | 69.7 | 81.6 | 95.4 | 88.5 | 80.9 |
| Few-Shot-CoT -half | 97.5 | 75.1 | 91.5 | 85.6 | 57.9 | 78.3 | 81.0 | 70.8 | 58.8 | 64.8 | 81.8 | 98.6 | 90.2 | 79.6 |
| Auto-CoT | 96.0 | 76.2 | 92.1 | 85.8 | 52.4 | 82.6 | 80.8 | 74.9 | 56.4 | 65.7 | 76.2 | 99.4 | 87.8 | 79.2 |
| Auto-CoT -half | 97.0 | 77.2 | 92.3 | 85.6 | 54.7 | 80.0 | 81.1 | 74.2 | 54.8 | 64.5 | 77.8 | 99.6 | 88.7 | 79.3 |
| ECHO (k=max, T=4) | 97.2 | 76.9 | 93.1 | 86.8 | 59.1 | 85.4 | 83.1 | 77.5 | 63.4 | 70.5 | 81.0 | 99.6 | 90.3 | 82.0 |
| ECHO (k=max, T=4) -half | 98.3 | 78.1 | 92.3 | 87.3 | 58.7 | 84.9 | 83.3 | 72.5 | 59.8 | 66.15 | 82.4 | 97.6 | 90.0 | 81.2 |

Table 5: Accuracy with half the number of demonstrations.

Surprisingly, ECHO, with one iteration, achieved its best performance in a random setting, indicating its ability to effectively learn from all available demonstrations, regardless of their direct relevance to the question at hand. The results for ECHO with two iterations illustrate that when demonstrations are overly diverse, a single iteration of the unification process may not be enough to establish a uniform solution pattern. Conversely, if the demonstrations are too similar, there is a risk of overfitting after just one iteration. This finding highlights that balancing the diversity can ensure adaptability and accuracy in the solution pattern.

From this we infer that diverse demonstrations, even if not directly related, can help build a robust problem solving framework that the model can generalize to new tasks. This leads to a conclusion: A strategic mix of demonstrations, combining both relatedness and diversity, could potentially offer the generalization without the loss of effectiveness. Future work could explore a hybrid selection strategy that dynamically adjusts the demonstration set based on the property of the dataset.

6.5 Results with Mixtral-8x7B

In addition to GPT-3.5-Turbo-0301, we evaluated our method with Mixtral-8x7B, without fine-tuning, on the same 10 benchmarks as in the main experiments. The results are reported in Table 4. With Mixtral-8x7B, our approach outperforms Auto-CoT by an average margin of +2.3%, which is consistent with our earlier findings and suggests that the method generalizes to other models.

However, we note that the overall numbers are all lower than those obtained with GPT-3.5-Turbo-0301, indicating the impact of the underlying model. Our method fails to outperform Few-shot-CoT on average, and we believe there are two main reasons:

First, the quality of the generated rationales may differ depending on the model’s capability, while the quality of manual prompts will not change. In this case, the rationale generated from Mixtral 8x7B may not be as good as those from GPT-3.5-Turbo-0301.

Second, we noticed a significant drop on the “Coin Flip” dataset, with examples such as “Q: A coin is heads up. Irving flips the coin. Hans flips the coin. Moses does not flip the coin. Nicole does not flip the coin. Is the coin still heads up?” We found that, as with GPT-3.5-Turbo-0301, the rationales generated by Auto-CoT and ECHO track the coin’s state flip by flip. In contrast, the human-written prompts used by Few-shot-CoT employ a shortcut that counts the total number of flips to determine the outcome. As the coin always starts heads up, it remains heads up if the total number of flips is even and ends tails up if the number of flips is odd. We observed that Mixtral-8x7B is less capable of tracking the coin’s state.
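The two solution patterns for Coin Flip can be contrasted with a short sketch. This is an illustration of the reasoning strategies only, not of how any model computes its answer; the function names are hypothetical, and each entry of `flips` records whether a person flipped the coin.

```python
from itertools import product


def track_state(flips):
    """Follow the coin flip by flip; True means heads up."""
    heads_up = True  # the coin starts heads up
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return heads_up


def parity_shortcut(flips):
    """Count the flips; an even count leaves the coin heads up."""
    return sum(flips) % 2 == 0


# Irving flips, Hans flips, Moses does not flip, Nicole does not flip.
example = [True, True, False, False]
print(track_state(example), parity_shortcut(example))

# Both strategies agree on every possible sequence of four decisions.
agree = all(track_state(list(f)) == parity_shortcut(list(f))
            for f in product([True, False], repeat=4))
print(agree)
```

The two strategies always give the same answer, but state tracking requires carrying an intermediate value through every step, whereas the shortcut needs only a single count.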

6.6 Results with 50% Demonstrations

One hypothesis of our work is that following diverse demonstrations is hard. Given that the rationales generated with ECHO are unified, we evaluated the impact on performance by reducing the number of demonstrations by half.

When we reduce the number of demonstrations by half, the overall performance of ECHO decreases by 0.8%, while that of Few-shot-CoT declines by 1.3%. We infer that the consistency of ECHO’s rationales allows the remaining demonstrations to retain information from those that are dropped, thereby enhancing robustness against this reduction. Our demonstrations are made consistent by the unification process, in which each of them learns from the others’ patterns, so reducing the number of demonstrations has minimal impact.

Interestingly, the performance of Auto-CoT improved by 0.1% with a reduced number of demonstrations. We hypothesize that this improvement is due to the reduced diversity of demonstrations, suggesting that diversity can harm performance. Therefore, increasing the number of shots does not necessarily lead to better performance.

7 Conclusion

In this work, we propose a novel method called ECHO for improving chain-of-thought prompting in large language models. We have shown that our method results in a significant improvement in three domains, confirming the feasibility and significance of adopting the self-harmonization mechanism in the CoT prompting process. We also carried out extensive experiments and case studies to understand the behavior of the proposed approach. We hope that these insights can serve as one step towards improving the automatic reasoning framework.


Limitations

Our study has identified several limitations of the ECHO method: (1) The method incurs a higher inference cost because demonstrations require an additional unification process. This process requires extra computational resources, as it involves running multiple inferences to consolidate the demonstrations into a coherent pattern. For a benchmark of n samples, other methods require n inferences, whereas our method requires n+Tk inferences (T is the number of iterations and k is the number of samples used). For example, on GSM8K, ECHO with 4 iterations requires 5.8% more inferences. (2) The method is prone to overfitting, which can lead to a decrease in generalizability. Although we have introduced an equation to estimate the optimal number of iterations, this equation may not hold universally across different domains or datasets. (3) The method assumes a certain level of similarity within the data from which it learns, which might not always be the case. In scenarios where the data are highly unrelated or where the relationships between data points are complex, the unification process may struggle to recognize a representative pattern. Future work may explore adaptive mechanisms that can recognize and adapt to the diversity of the data, ensuring that the unification process remains effective for various types of problems. (4) Similar to Zhang et al. (2023), our method applies only to a dataset containing multiple questions, not to a single question.
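The inference-cost expression n+Tk can be made concrete with a small worked sketch. The values of n, T and k below are purely illustrative assumptions and are not the settings behind the GSM8K figure reported above.

```python
def total_inferences(n, T, k):
    """Inference calls for a benchmark of n samples with T unification
    iterations over k samples, following the n + Tk expression."""
    return n + T * k


def relative_overhead(n, T, k):
    """Extra calls caused by unification, as a fraction of the n baseline calls."""
    return (T * k) / n


# Illustrative values only (assumed for this example, not taken from the paper).
n, T, k = 1000, 4, 8
total = total_inferences(n, T, k)
overhead = relative_overhead(n, T, k)
```

Because the extra term Tk does not depend on n, the relative overhead shrinks as the benchmark grows.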


Appendix A Examples for Case Studies

We demonstrate how the rationales evolve with 0, 1, 5 and 32 iterations of our method in Table 9.

Appendix B Data Statistics

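| Reasoning Type | Dataset | Size | Answer Type |
| --- | --- | --- | --- |
| Arithmetic | SingleEq | 508 | Numeral |
| Arithmetic | AddSub | 395 | Numeral |
| Arithmetic | MultiArith | 600 | Numeral |
| Arithmetic | GSM8K | 1,319 | Numeral |
| Arithmetic | AQUA | 254 | Multiple Choice |
| Arithmetic | SVAMP | 1,000 | Numeral |
| Symbolic | Coin Flip | 1,000 | Yes or No |
| Symbolic | Last Letter | 254 | String |
| Common Sense | StrategyQA | 2,290 | Yes or No |
| Common Sense | CommonsenseQA | 1,221 | Multiple Choice |

Table 6: Data Statistics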

Appendix C Effect of Incorrect Demonstrations on Performance

Our research indicates that demonstrations with incorrect answers do not necessarily impair overall performance. This observation is particularly evident in the context of the AQUA dataset, where a notable instance occurred: among the four demonstrations generated by ECHO, two contained wrong answers, while in the manually written prompts, all demonstrations were correct. Despite this, ECHO’s performance significantly surpassed that of the human-written prompts.

This outcome suggests that in the ECHO method, the collective contribution of the demonstrations to the reasoning pattern is more critical than the individual precision of each demonstration. It appears that the model can effectively extract and leverage useful patterns from the demonstrations, even if some contain errors. This ability to distill valuable reasoning patterns from imperfect data underscores the robustness of the ECHO approach. It indicates that the model’s effectiveness relies more on the breadth and diversity of demonstrations than on their individual correctness. This finding opens up possibilities for utilizing a wider range of demonstrations, including those with inaccuracies, without necessarily compromising the model’s overall performance.

Appendix D Implementation Details

The inertia term originates from the inherent attribute of the KMeans model in sklearn. It is calculated by summing up the squared distances between each sample in a cluster and its corresponding centroid.
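As a minimal sketch of this definition, the function below computes the same quantity by hand on toy data. The function name `inertia` and the toy points, labels and centroids are illustrative assumptions; in scikit-learn, a fitted `KMeans` model exposes this value through its `inertia_` attribute.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances between each sample and the
    centroid of the cluster it is assigned to."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data: two clusters of two points each (assumed for illustration).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 2.0)]
toy_inertia = inertia(points, labels, centroids)  # 1 + 1 + 4 + 4 = 10.0
```

A lower value means the samples sit closer to their assigned centroids, that is, the clusters are tighter.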

Appendix E ECHO-Constructed Demonstrations

We append a full list of ECHO-constructed demonstrations.

Appendix F More Demonstrations Help Commonsense Reasoning

The overall performance of ECHO (k=max,T=1) is not as strong as that of ECHO (k=m,T=1). We hypothesize that a single iteration is not enough when handling a larger number of demonstrations. To further investigate the impact of increasing the number of demonstrations in the unification process, we conducted experiments with both 8 demonstrations and the maximum possible number of demonstrations, using 4 iterations. (Here, “maximum” refers to the highest number of demonstrations that can be accommodated within the token limit, approximately 20.) The results are shown in Table 7.

Our results reveal that in the arithmetic domain, the performance of ECHO under both conditions is comparable, with each achieving an average score of 83.1%. The notable difference emerges in commonsense reasoning. In this area, ECHO (k=max,T=1) with the maximum number of demonstrations surpasses its counterpart by a margin of 3.0% across both benchmarks. This outcome suggests that the sensitivity to the number of demonstrations varies across domains, with commonsense reasoning being particularly influenced by the number of demonstrations.

| Type | CommonsenseQA | StrategyQA | Average |
| --- | --- | --- | --- |
| max | 77.5 | 63.4 | 70.5 |
| 8 | 74.9 | 59.9 | 67.4 |

Table 7: The performances (accuracy) in commonsense reasoning.
| Method | GSM8K | StrategyQA | Average |
| --- | --- | --- | --- |
| ECHO (original) | 78.1 | 59.7 | 68.9 |
| ECHO (combined) | 74.2 | 58.0 | 66.1 |

Table 8: A comparison of the performances (accuracy) of ECHO using demonstrations from the original and combined datasets.

Appendix G A Study on Diverse Dataset

Similar to Zhang et al. (2023), our work requires access to the whole dataset to construct demonstrations. We assume that each dataset has a certain level of internal similarity. For example, a math dataset should contain only math questions, and a yes-or-no dataset should contain only yes-or-no questions. However, some datasets are more diverse. In this section, we test our method on a more complicated case: we randomly select 500 samples from GSM8K and StrategyQA; the former consists of arithmetic tasks, and the latter consists of yes-or-no commonsense tasks. We apply our method to this combined dataset. The results are shown in Table 8.

Out of the 8 demonstrations, 3 are from GSM8K and 5 are from StrategyQA. We observed a performance drop on both datasets. Our method aims to find a uniform solution pattern, but there is clearly no single pattern that applies to two entirely different datasets.
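The construction of such a combined dataset can be sketched as follows. This is only an illustration under stated assumptions: the helper `build_combined_dataset`, the source tags "A" and "B", the seed, and the toy question lists are hypothetical, and the sketch assumes the same number of samples is drawn from each source, which the description above does not specify.

```python
import random


def build_combined_dataset(dataset_a, dataset_b, per_dataset, seed=0):
    """Draw per_dataset questions from each source without replacement,
    tag each question with its source, and shuffle the result."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample_a = rng.sample(dataset_a, per_dataset)
    sample_b = rng.sample(dataset_b, per_dataset)
    combined = [("A", q) for q in sample_a] + [("B", q) for q in sample_b]
    rng.shuffle(combined)
    return combined


# Toy stand-ins for an arithmetic dataset and a yes-or-no dataset.
arithmetic = [f"arithmetic question {i}" for i in range(20)]
yes_no = [f"yes/no question {i}" for i in range(20)]
combined = build_combined_dataset(arithmetic, yes_no, per_dataset=5)
```

Keeping the source tag makes it easy to check afterwards how many constructed demonstrations come from each source, as in the 3 versus 5 split reported above.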

| Context | Q | Question/Solution | Remark |
| --- | --- | --- | --- |
| Questions | 1 | Q: Mary is baking a cake. The recipe calls for 9 cups of flour and 5 cups of sugar. She already put in 3 cups of flour and 62 cups of sugar. How many more cups of flour does she need to add? | GT: 6 |
| | 2 | Q: Danny collects bottle caps and wrappers. He found 22 bottle caps and 30 wrappers at the park. Now he has 17 bottle caps and 57 wrappers in his collection. How many wrappers did danny have at first? | GT: 27 |
| | 3 | Q: Jerry had 8 action figures and 9 books on a shelf in his room. Later he added 10 more books to the shelf. How many books were on his shelf in all? | GT: 19 |
| | 4 | Q: Every day Ryan spends 6 hours on learning english and 7 hours on learning chinese. How many more hours does he spend on learning chinese than he does on learning english? | GT: 1 |
| 0 iteration | 1 | Sure, let’s break it down. Mary needs a total of 9 cups of flour for the recipe, and she has already added 3 cups. So, she still needs to add 9 - 3 = 6 cups of flour. Therefore, Mary needs to add 6 more cups of flour to complete the recipe. | correct |
| | 2 | First, we need to find out how many bottle caps and wrappers Danny had at first. We can do this by adding the number of bottle caps and wrappers he found at the park to the number he has in his collection now: Initial number of bottle caps = 22 + 17 = 39 Initial number of wrappers = 30 + 57 = 87 Therefore, Danny had 87 wrappers at first. | wrong |
| | 3 | Sure, let’s solve this problem step by step. First, we need to find the total number of books Jerry had on the shelf before adding 10 more books. We know that he had 9 books on the shelf initially, so we add that to the number of books he added later: 9 + 10 = 19 So, Jerry had 19 books on the shelf after adding 10 more books. | correct |
| | 4 | Sure, let’s break it down. First, we need to find the difference between the number of hours Ryan spends on learning Chinese and English. To do this, we can subtract the number of hours he spends on learning English from the number of hours he spends on learning Chinese: 7 hours (Chinese) - 6 hours (English) = 1 hour Therefore, Ryan spends 1 more hour on learning Chinese than he does on learning English. | correct |
| 1 iteration | 1 | Sure, let’s break it down. First, we need to find out how many more cups of flour Mary needs to add. We can do this by subtracting the amount of flour she already put in from the total amount of flour the recipe calls for: 9 cups (recipe) - 3 cups (already added) = 6 cups. Therefore, Mary needs to add 6 more cups of flour. | correct |
| | 2 | Sure, let’s break it down. First, we need to find out how many wrappers Danny had before he found the ones at the park. We know that he found 30 wrappers at the park and now has 57 wrappers in his collection, so we can subtract the number of wrappers he found from the number he has now: 57 - 30 = 27 wrappers. Next, we need to find out how many bottle caps Danny had before he found the ones at the park. We know that he found 22 bottle caps at the park and now has 17 bottle caps in his collection, so we can subtract the number of bottle caps he found from the number he has now: 17 - 22 = -5 bottle caps. This doesn’t make sense, so we can assume that Danny had 0 bottle caps before he found the ones at the park. Therefore, Danny had 27 wrappers at first. | correct |
| | 3 | Sure, let’s break it down. First, we need to find out how many books Jerry had on the shelf before he added more. We know he had 9 books, so the total number of books on the shelf before he added more was 9. Next, we need to add the 10 books he added to the shelf: 9 + 10 = 19. Therefore, there were 19 books on Jerry’s shelf in all. | correct |
| | 4 | Sure, let’s break it down. First, we need to find the difference between the number of hours Ryan spends on learning Chinese and the number of hours he spends on learning English. We can do this by subtracting the number of hours he spends on learning English from the number of hours he spends on learning Chinese: 7 hours (Chinese) - 6 hours (English) = 1 hour. Therefore, Ryan spends 1 more hour on learning Chinese than he does on learning English. | correct |
| 5 iterations | 1 | Sure, let’s break it down. First, we need to find out how many more cups of flour Mary needs to add. We can do this by subtracting the amount of flour she already put in from the total amount of flour the recipe calls for: 9 cups (recipe) - 3 cups (already added) = 6 cups of flour needed. Therefore, Mary needs to add 6 more cups of flour. | correct |
| | 2 | Sure, let’s break it down. First, we need to find out how many wrappers Danny had before he found the 30 wrappers at the park. We can do this by subtracting the 57 wrappers he has now from the total number of wrappers he had after he found the 30 wrappers at the park: 57 wrappers (now) - 30 wrappers (found at park) = 27 wrappers (before). Next, we need to find out how many bottle caps Danny had before he found the 22 bottle caps at the park. We can do this by subtracting the 17 bottle caps he has now from the total number of bottle caps he had after he found the 22 bottle caps at the park: 22 bottle caps (found at park) + 17 bottle caps (now) = 39 bottle caps (before). Therefore, Danny had 27 wrappers and 39 bottle caps at first. | correct |
| | 3 | Sure, let’s break it down. First, we need to find out how many books Jerry had on the shelf before he added 10 more. We can do this by adding the number of books he had initially to the number of action figures he had: 9 books + 8 action figures = 17 items on the shelf. Next, we need to find out how many books Jerry had on the shelf after he added 10 more. We can do this by adding the 10 books he added to the number of books he had initially: 9 books (initially) + 10 books (added) = 19 books on the shelf in all. Therefore, there were 19 books on Jerry’s shelf in all. | correct |
| | 4 | Sure, let’s break it down. First, we need to find out the difference between the number of hours Ryan spends on learning Chinese and the number of hours he spends on learning English. We can do this by subtracting the number of hours he spends on learning English from the number of hours he spends on learning Chinese: 7 hours (Chinese) - 6 hours (English) = 1 hour. Therefore, Ryan spends 1 more hour on learning Chinese than he does on learning English. | correct |
| 32 iterations | 1 | Sure, let’s break it down. First, we need to find out how many more cups of flour Mary needs to add. We can subtract the amount of flour she already put in from the amount of flour the recipe calls for: 9 cups - 3 cups = 6 cups (adding the units after each number in the equation). Therefore, Mary needs to add 6 more cups of flour. | correct |
| | 2 | Sure, let’s break it down. First, we need to find out how many wrappers Danny had at first. We can subtract the number of wrappers he has now from the number of wrappers he found at the park and add it to the number of wrappers he had initially: Danny’s wrappers at first = Danny’s wrappers now - Danny’s wrappers found + Wrappers at first 57 wrappers = 30 wrappers - 17 wrappers + Wrappers at first 57 wrappers = 13 wrappers + Wrappers at first Wrappers at first = 57 wrappers - 13 wrappers Wrappers at first = 44 wrappers Therefore, Danny had 44 wrappers at first. | wrong |
| | 3 | Sure, let’s break it down. First, we need to find out how many books Jerry had on the shelf initially. We can add the number of books he had initially to the number of books he added later: 9 books + 10 books = 19 books. Therefore, Jerry had 19 books on his shelf in all. | correct |
| | 4 | Sure, let’s break it down. First, we need to find out how many more hours Ryan spends on learning Chinese than he does on learning English. We can subtract the number of hours he spends on learning English from the number of hours he spends on learning Chinese: 7 hours - 6 hours = 1 hour. Therefore, Ryan spends 1 more hour on learning Chinese than he does on learning English. | correct |

Table 9: Case studies of 4 demonstrations from the SVAMP dataset.
