Discovering Preference Optimization Algorithms
with and for Large Language Models
Abstract
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains underexplored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously evaluated performance metrics. This process leads to the discovery of previously unknown and performant preference optimization algorithms. We call the best-performing of these Discovered Preference Optimization (DiscoPOP; code: https://github.com/luchris429/DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
1 Introduction
Training Large Language Models (LLMs) usually involves starting with a model pre-trained on large text corpora and then fine-tuning it to match human preferences. Pre-trained, and even instruction fine-tuned, LLMs can generate harmful, dangerous, and unethical completions (Carlini et al., 2021; Gehman et al., 2020). To mitigate this and align an LLM with human values, we use human preference alignment through preference-ranked completion data. This approach has become an industry standard, popularized by reinforcement learning from human feedback (Christiano et al., 2017, RLHF), and more recently, by offline preference optimization algorithms like direct preference optimization (Rafailov et al., 2023, DPO) and sequence likelihood calibration (Zhao et al., 2023, SLiC), which cast the problem as a supervised learning objective. Many algorithms have been proposed in the literature for offline preference optimization, and it remains an open question which one performs best across tasks. While a strictly dominant algorithm may not exist, some algorithms likely exhibit generally improved performance. To date, all existing state-of-the-art preference optimization algorithms (Rafailov et al., 2023; Azar et al., 2023; Zhao et al., 2023) have been developed by human experts. Despite their advancements, these solutions are inherently constrained by human limitations, including creativity, ingenuity, and expert knowledge.
In this work, we aim to address these limitations by performing LLM-driven discovery in order to automatically generate new state-of-the-art preference optimization algorithms without continual expert human intervention in the development process. While previous works (Ma et al., 2023; Yu et al., 2023) have used LLMs to design environment-specific RL reward functions, we discover general-purpose objective functions which can be used across various preference optimization tasks. More specifically, we iteratively prompt an LLM to propose new preference optimization loss functions and evaluate them, with the previously proposed loss functions and their task performance metric (in our case, MT-Bench scores (Zheng et al., 2024)) as in-context examples. After performing this automatic discovery process, we catalog high-performing loss functions and introduce a particularly strong one, a new algorithm we call Discovered Preference Optimization (DiscoPOP). To ensure robustness beyond MT-Bench, we validate DiscoPOP using AlpacaEval 2.0 (Dubois et al., 2024), showing an improvement over DPO in win rate against GPT-4 $(11.23\%\to 13.21\%)$. Additionally, in separate, held-out tasks such as summarization and controlled generation, models trained with the DiscoPOP loss outperform or perform competitively with existing preference optimization algorithms.
Contributions: (1) We propose an LLM-driven objective discovery pipeline to discover novel offline preference optimization algorithms (Section 3). (2) We discover multiple high-performing preference optimization losses. One such loss, which we call Discovered Preference Optimization (DiscoPOP), achieves strong performance across multiple held-out evaluation tasks: single-turn dialogue (AlpacaEval 2.0), controlled sentiment generation (IMDb), and summarization (TL;DR). (3) We provide an initial analysis of DiscoPOP, which is a weighted sum of logistic and exponential losses, and discover surprising features. For example, DiscoPOP is non-convex.
2 Background
Preference Optimization. Consider a pre-trained language model policy ${\pi}_{\theta}$ and a dataset $\mathcal{D}={\{({x}^{i},{y}_{w}^{i},{y}_{l}^{i})\}}_{i=1}^{N}$ consisting of prompts $x$ and preference-ranked completions ${y}_{w}$ and ${y}_{l}$. In this dataset, a human rater prefers ${y}_{w}$ over ${y}_{l}$, denoted as ${y}_{w}\succ {y}_{l}$. The task is to align ${\pi}_{\theta}$ with the human values implicit in these preferences. Canonically, this has been achieved through reinforcement learning from human feedback (Christiano et al., 2017, RLHF), an approach that proceeds in two phases: First, a reward modeling stage that learns a parameterized reward model ${r}_{\varphi}$. By assuming a Bradley-Terry model (Bradley and Terry, 1952) of preferences, the probability of the data can be expressed as $P({y}_{w}\succ {y}_{l})=\mathrm{exp}{r}_{\varphi}({y}_{w},x)/(\mathrm{exp}{r}_{\varphi}({y}_{w},x)+\mathrm{exp}{r}_{\varphi}({y}_{l},x))$, and subsequently simply optimized over $\varphi $ through the maximum likelihood principle. The second stage of policy optimization employs a reinforcement learning algorithm to train the language model against the learned reward. Usually, a KL penalty is introduced between the model and the pre-RL reference policy ${\pi}_{ref}$ (Jaques et al., 2019; Stiennon et al., 2020) to prevent over-optimization and straying too far from the original policy, resulting in the final objective:
$\max_{\pi_{\theta}}\ \underbrace{\mathbb{E}_{y\sim \pi_{\theta},\,x\sim \mathcal{P}}\left[r_{\varphi}(y,x)\right]}_{\text{reward maximization}}-\beta\,\underbrace{\mathbb{KL}(\pi_{\theta},\pi_{\text{ref}})}_{\text{regularization}}.$ | (1) |
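As an illustration of the reward modeling stage described above, the Bradley-Terry probability and the corresponding negative log-likelihood can be sketched in a few lines of plain Python. The function names are hypothetical and the rewards are passed in as plain floats; in practice they would come from the parameterized reward model ${r}_{\varphi}$.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bt_preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w > y_l) = exp(r_w) / (exp(r_w) + exp(r_l)) under Bradley-Terry."""
    # Algebraically equal to sigmoid(r_w - r_l); branching avoids overflow.
    diff = r_w - r_l
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def reward_model_nll(reward_pairs) -> float:
    """Mean negative log-likelihood over (r_w, r_l) pairs.

    Uses -log sigmoid(d) = softplus(-d), so no log(0) can occur.
    """
    return sum(softplus(-(r_w - r_l)) for r_w, r_l in reward_pairs) / len(reward_pairs)
```

Minimizing `reward_model_nll` over the reward model's parameters corresponds to the maximum likelihood step mentioned in the text.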
Despite success in frontier models (Anthropic, 2023; Gemini-Team, 2023), deep RL has many implementation (Engstrom et al., 2019) and training challenges (Sutton, 1984; Razin et al., 2023) that hinder its adoption. In order to simplify the whole process, direct preference optimization (Rafailov et al., 2023, DPO) aims to forego both the reward modeling and online RL procedure. Rewriting (1) with a decomposition of the KL term into:
$\max_{\pi_{\theta}}\ \mathbb{E}_{y\sim \pi_{\theta},\,x\sim \mathcal{P}}\Big[\underbrace{r_{\varphi}(y,x)}_{\text{reward}}+\underbrace{\beta \log \pi_{\text{ref}}(y|x)}_{\pi_{\text{ref}}\text{ regularization}}\Big]+\underbrace{\beta\,\mathscr{H}(\pi_{\theta})}_{\text{policy entropy}},$ | (2) |
expresses the problem as an entropy-regularised RL bandit task (Ziebart et al., 2008), for which a known analytical solution exists: ${\pi}^{\ast}(y|x)=Z{(x)}^{-1}{\pi}_{ref}(y|x)\mathrm{exp}\left({\beta}^{-1}{r}_{\varphi}(y,x)\right)$. By rearranging the reward, we can express the task as a binary classification problem based on the reward difference:
$\min_{\pi_{\theta}}\ \mathbb{E}_{(y_{w},y_{l},x)\sim \mathcal{D}}\Bigg[f\Bigg(\underbrace{\beta \cdot \left(\log \frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-\log \frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}\right)}_{r_{\varphi}(y_{w},x)-r_{\varphi}(y_{l},x)}\Bigg)\Bigg].$ | (3) |
Here, we define the log ratio difference as $\rho =\mathrm{log}\frac{{\pi}_{\theta}({y}_{w}|x)}{{\pi}_{\text{ref}}({y}_{w}|x)}-\mathrm{log}\frac{{\pi}_{\theta}({y}_{l}|x)}{{\pi}_{\text{ref}}({y}_{l}|x)}$. In DPO, the function $f=-\mathrm{log}\sigma $ is derived as the negative log of the sigmoid function given the BT model assumptions. However, Tang et al. (2024) highlighted that more generally we can obtain a recipe for offline preference optimization algorithms by letting $f:\mathbb{R}\to \mathbb{R}$ be any scalar loss function. For example, setting $f(x)={\left(x-1\right)}^{2}$, the squared loss function (Rosasco et al., 2004) yields IPO (Azar et al., 2023), while employing the max-margin inspired hinge loss (Boser et al., 1992; Cortes and Vapnik, 1995) $f(x)=\mathrm{max}(0,1-x)$ produces SLiC (Zhao et al., 2023).
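The recipe above can be sketched in plain Python: compute $\rho$ from four log-probabilities, scale by $\beta$, and apply a scalar loss $f$. This is a minimal sketch with hypothetical function names; the three choices of $f$ follow the forms given in the text, and the default $\beta = 0.05$ is only an illustrative value.

```python
import math


def log_ratio_difference(logp_w, ref_logp_w, logp_l, ref_logp_l):
    """rho = log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x))."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def f_dpo(z):
    """Logistic loss -log(sigmoid(z)) = log(1 + exp(-z)), computed stably."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def f_ipo(z):
    """Squared loss (z - 1)^2, as written in the text."""
    return (z - 1.0) ** 2


def f_slic(z):
    """Hinge loss max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


def preference_loss(f, batch, beta=0.05):
    """Mean of f(beta * rho) over a batch of 4-tuples of log-probabilities."""
    return sum(f(beta * log_ratio_difference(*ex)) for ex in batch) / len(batch)
```

Swapping `f` is the only change needed to move between the three objectives, which is what makes $f$ a natural target for automated search.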
Meta-Optimization for Algorithm Discovery. The goal of meta-optimization (optimizing the optimization process) is to uncover novel learning algorithms using a data-driven process. Suppose that an algorithm uses an objective function ${f}^{\gamma}$ to train a model for $K$ iterations, where $\gamma $ denotes a set of meta-parameters. Meta-optimization searches for an objective that maximizes the expected downstream performance $\max_{\gamma}\mathbb{E}[\eta ({\pi}_{K})|\text{train}({f}^{\gamma})]$, where $\eta $ is a downstream performance metric. Unlike previous methods that rely on a predefined parameterization of $\gamma $ (e.g., a neural network (Hospedales et al., 2021) or domain-specific language (Alet et al., 2020)), we leverage LLMs to directly propose code-level objective functions in Python. This approach eliminates the need for a carefully designed search space and utilizes the extensive knowledge embedded in the LLM for flexible selection and mutation.
3 LLM-Driven Objective Discovery
Choosing an appropriate objective function is crucial for instilling capabilities into networks. Here, we detail our discovery process facilitated by LLM code-level objective function proposals:
Initial Context Construction. In the initial system prompt, we ‘burn-in’ the LLM using several established objective functions given in code and their corresponding performance. Furthermore, we provide problem details and an example of the output response format as a JSON dictionary.
LLM Querying, Parsing & Output Validation. We query the LLM, parse the response JSON, and run a set of unit tests (e.g. for valid output shapes) before starting a training run. If the parsing or unit tests fail, we resample a new solution after providing the error message as feedback to the LLM.
Performance Evaluation. The proposed objective function is then evaluated based on its ability to optimize a model for a predefined downstream validation task. We refer to the resulting performance metric as $\eta $.
Iterative Refinement. By using the performance provided as feedback, the LLM iteratively refines its proposals. In each iteration, the model synthesizes a new candidate loss function, exploring both variations of previously successful formulas and entirely new formulations that might improve upon the existing benchmarks. This iterative process is repeated for a specified number of generations or until convergence when a set of optimal loss functions is observed.
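The four steps above can be sketched as a single loop. This is a minimal sketch, not the pipeline's actual code: `propose`, `validate`, and `evaluate` are hypothetical callables standing in for the LLM query, the unit tests, and the training-plus-evaluation run, and the JSON keys `"name"` and `"code"` are assumed for illustration. Whether a failed proposal counts against the generation budget is not specified in the text; here it does.

```python
import json


def discovery_loop(propose, validate, evaluate, seed_archive, generations):
    """Query a proposer, validate its output, score it, and keep an archive.

    seed_archive holds (name, code, score) tuples used as burn-in examples.
    Returns the best (name, code, score) tuple seen.
    """
    archive = list(seed_archive)
    feedback = None  # error message from the previous failed attempt, if any
    for _ in range(generations):
        raw = propose(archive, feedback)  # an LLM call in a real pipeline
        try:
            candidate = json.loads(raw)
            validate(candidate["code"])  # expected to raise on failure
        except Exception as err:
            # Parsing or unit tests failed: resample with the error as feedback.
            feedback = f"error: {err}"
            continue
        score = evaluate(candidate["code"])  # downstream metric eta
        archive.append((candidate["name"], candidate["code"], score))
        feedback = None
    return max(archive, key=lambda item: item[2])
```

Because the archive is passed back to `propose` on every iteration, the proposer always sees all previously evaluated objectives and their scores, which is the in-context feedback the text describes.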
Small case study: Discovering supervised classification loss functions. Consider the case of supervised classification on the CIFAR-10 dataset as a simple starting example. We train a simple ResNet-18 for 5 epochs using the objectives proposed by GPT-4 (OpenAI, 2023). After each training run we provide the LLM with the corresponding validation accuracy and query it for the next PyTorch-based (Paszke et al., 2017) candidate objective function.
Figure 2 depicts the performance of the proposed objective functions across the discovery process. The different discovered objectives all outperform the standard cross-entropy loss. Interestingly, we observe that the LLM-driven discovery alternates between several different exploration, fine-tuning, and knowledge composition steps: Initially, the LLM proposes a label-smoothed cross-entropy objective. After tuning the smoothing temperature, it explores a squared error loss variant, which improved the observed validation performance. Next, the two conceptually different objectives are combined, leading to another significant performance improvement. Hence, the LLM discovery process does not perform a random search over objectives previously outlined in the literature but instead composes various concepts in a complementary fashion. Furthermore, the discovered objectives also generalize to different architectures and longer training runs. In Section D.3 we show that this process of discovery is robust to the choice of sampling temperature and prompt/context construction.
4 Discovering Offline Preference Optimization Objectives
In this section, we run our LLM-driven discovery to automatically generate new state-of-the-art preference optimization algorithms.
4.1 Discovery Task - Multi-turn Dialogue on MT-Bench
In this section we use our LLM-driven discovery method to discover new objective functions $f$ for offline preference optimization, as defined in Section 2 and Equation 3. Specifically, at each generation $i$, GPT-4 generates PyTorch (Paszke et al., 2017) code of candidate objective function ${f}_{i}$. Each objective function takes as input the variables of $\{\mathrm{log}{\pi}_{\theta}({y}_{w}|x),\mathrm{log}{\pi}_{\text{ref}}({y}_{w}|x),\mathrm{log}{\pi}_{\theta}({y}_{l}|x),\mathrm{log}{\pi}_{\text{ref}}({y}_{l}|x)\}$, and returns a scalar. For each proposed objective ${f}_{i}$, we check if ${f}_{i}$ is valid with a unit test.
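A validity check of this kind could look like the sketch below. It is written in plain Python on floats rather than PyTorch tensors, the probe inputs are arbitrary, and `is_valid_objective` and `dpo_objective` are hypothetical names; the actual unit tests are not described beyond checking properties such as output shapes.

```python
import math


def is_valid_objective(fn, probes=None):
    """Return True if fn maps four log-probabilities to one finite real number."""
    probes = probes or [
        (-1.0, -1.0, -2.0, -2.0),
        (-5.0, -4.0, -3.0, -6.0),
        (-0.1, -0.2, -0.3, -0.4),
    ]
    for args in probes:
        try:
            out = fn(*args)
        except Exception:
            return False  # wrong arity or a runtime error
        if isinstance(out, bool) or not isinstance(out, (int, float)):
            return False  # not a scalar
        if not math.isfinite(out):
            return False  # NaN or infinity
    return True


def dpo_objective(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.05):
    """Reference objective: log(1 + exp(-beta * rho)), computed stably."""
    rho = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * rho
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Rejecting malformed candidates before training matters because each accepted candidate triggers a full fine-tuning run.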
For each valid generated objective function ${f}_{i}$, we finetune an LLM and then collect a performance evaluation score. Specifically, we build on top of the ‘alignment-handbook’ (Tunstall et al., 2023a) repository to finetune our models. Notably, this repository, when using DPO, reproduces ‘Zephyr 7B Gemma’ (https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1; Tunstall and Schmid, 2024; Tunstall et al., 2023b), which, at the time of release, achieved state-of-the-art scores on MT-Bench for 7B models. ‘Zephyr 7B Gemma’ first takes gemma-7b (Gemma-Team et al., 2024) and finetunes it on the ‘deita-10k-v0-sft’ dataset (Liu et al., 2023) to produce ‘zephyr-7b-gemma-sft’ (https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-sft-v0.1). It is then trained on the pairwise preference dataset ‘Argilla DPO Mix 7K’ (https://huggingface.co/datasets/argilla/dpo-mix-7k). When evaluating a new objective function, we replace DPO in this last step with the generated objective function, keeping the same hyperparameters. We show example runs in Figure 3 and provide further experimental details in Appendix B.
Once we have a trained LLM for the proposed objective function ${f}_{i}$, we evaluate that LLM on the popular multi-turn dialogue evaluation benchmark of MT-Bench (Zheng et al., 2024). This is a multi-turn open-ended question set, which uses GPT-4 to assess the quality of the trained model’s responses, obtaining a high correlation with the popular Chatbot Arena (Zheng et al., 2024). We provide further evaluation details in Appendix C.
4.2 Discovery Results
After evaluating approximately $100$ objective functions, we cataloged the best-performing ones in Table 1. We tabulate the high-level objective forms here and provide the full objective loss functions and their associated code in Appendix E. Moreover, we also plot the best performing sub-task evaluations in Figure 4.
Name | Full Name | Objective $f$ Function | Score (/ 10) $\uparrow $ |
DPO | Direct Preference Optimization | $\text{log}(1+\text{exp}(-\beta \rho ))$ | 7.888 |
DPO* | Official HuggingFace ‘zephyr-7b-gemma’ DPO model | $\text{log}(1+\text{exp}(-\beta \rho ))$ | 7.810 |
SLiC | Sequence Likelihood Calibration | $\text{ReLU}(1-\beta \rho )$ | 7.881 |
KTO | Pairwise Kahneman-Tversky Optimization | see (Ethayarajh et al., 2024) | 7.603 |
DBAQL | Dynamic Blended Adaptive Quantile Loss | $\sigma (\mathbb{V}\text{ar}[\rho /\tau ])\cdot {f}_{dpo}(\beta \rho /0.9)+(1-\sigma (\mathbb{V}\text{ar}[\rho /\tau ]))\cdot {f}_{exp}(\beta \rho \cdot 0.9)$ | 7.978 |
AQL | Adaptive Quantile Loss | $q\cdot {f}_{dpo}(\beta \rho )+(1-q)\cdot {f}_{slic}(\beta \rho )$ | 7.953 |
PADLL | Performance Adaptive Decay Logistic Loss | $$ | 7.941 |
AQFL | Adaptive Quantile Feedback Loss | $r\cdot {f}_{dpo}(\beta \rho )+(1-r)\cdot {f}_{slic}(\beta \rho )$ | 7.931 |
CELL | Combined Exponential + Logistic Loss | $0.5\cdot {f}_{dpo}(\beta \rho )+0.5\cdot {f}_{exp}(\beta \rho )$ | 7.925 |
LRML (DiscoPOP) | Log Ratio Modulated Loss | $(1-\sigma (\rho /\tau ))\cdot {f}_{dpo}(\beta \rho )+\sigma (\rho /\tau )\cdot {f}_{exp}(\beta \rho )$ | 7.916 |
PFL | Policy Focused Loss | $1/2\cdot {f}_{dpo}(\beta \rho )\cdot \mathbb{1}[{\pi}_{w}>{\pi}_{r}]+2\cdot {f}_{slic}(\beta \rho )\cdot \mathbb{1}[{\pi}_{w}\le {\pi}_{r}]$ | 7.900 |
5 Held-Out Evaluations
We next validate each of our discovered objective functions (shown in Table 1) on held-out tasks. We find that the Performance Adaptive Decay Logistic Loss (PADLL) and the Log Ratio Modulated Loss (LRML) consistently perform well. Because of its unconventional properties and performance, we refer to LRML as our discovered preference optimization, or DiscoPOP, algorithm.
We consider three different standard (Rafailov et al., 2023) open-ended text generation tasks each designed to evaluate different properties of the fine-tuned LLM policy ${\pi}_{\theta}$ where each LLM policy is trained with one of our discovered objective functions $f$ on a preference dataset $\mathcal{D}={\{({x}^{i},{y}_{w}^{i},{y}_{l}^{i})\}}_{i=1}^{N}$.
5.1 Single-turn Dialogue - Alpaca Eval 2.0
We evaluate the trained models on Alpaca Eval 2.0 (Li et al., 2023; Dubois et al., 2023, 2024). This is a single-turn dialogue LLM-based automatic evaluation using GPT-4 to assess the win rate of the trained LLM policy’s completion compared to that of the underlying SFT base model. Alpaca Eval 2.0 (https://github.com/tatsu-lab/alpaca_eval) has been validated against 20K human annotations and aims to reduce the length bias of Alpaca Eval 1.0. Length-controlled (LC) Alpaca Eval shows a correlation of 0.98 with Chatbot Arena, making it a popular benchmark with the highest correlation to Chatbot Arena (Dubois et al., 2024). Training details for this task are given in Section B.1.
Function | Win Rate (%) $\uparrow $ vs. GPT-4 | Win Rate - LC (%) $\uparrow $ vs. GPT-4 | Win Rate (%) $\uparrow $ vs. SFT Checkpoint | Win Rate - LC (%) $\uparrow $ vs. SFT Checkpoint |
DPO | $11.23\pm 0.97$ | $12.81\pm 0.66$ | $78.72\pm 1.26$ | $63.34\pm 0.30$ |
DPO^{∗} | $11.99\pm 1.00$ | $\underline{14.73\pm 0.71}$ | $75.75\pm 1.31$ | $59.88\pm 0.41$ |
SLiC | $10.67\pm 0.94$ | $13.16\pm 0.69$ | $75.05\pm 1.34$ | $59.67\pm 0.42$ |
KTO | $\underline{12.57\pm 1.00}$ | $13.58\pm 0.67$ | $\underline{78.81\pm 1.25}$ | $62.76\pm 0.31$ |
DBAQL | $10.68\pm 0.92$ | $11.41\pm 0.57$ | $72.06\pm 1.42$ | $54.40\pm 0.38$ |
AQL | $11.11\pm 0.96$ | $13.63\pm 0.68$ | $76.34\pm 1.30$ | $60.94\pm 0.36$ |
PADLL | $\mathbf{14.07}\pm \mathbf{1.04}$ | $\mathbf{14.89}\pm \mathbf{0.66}$ | $\mathbf{81.10}\pm \mathbf{1.21}$ | $\underline{64.14\pm 0.28}$ |
AQFL | $\mathbf{13.63}\pm \mathbf{1.05}$ | $\mathbf{15.55}\pm \mathbf{0.71}$ | $\mathbf{79.32}\pm \mathbf{1.23}$ | $64.41\pm 0.34$ |
CELL | $10.27\pm 0.93$ | $12.26\pm 0.61$ | $71.75\pm 1.39$ | $57.48\pm 0.34$ |
LRML | $\mathbf{13.21}\pm \mathbf{1.02}$ | $\mathbf{14.78}\pm \mathbf{0.67}$ | $\mathbf{79.27}\pm \mathbf{1.24}$ | $\mathbf{65.18}\pm \mathbf{0.32}$ |
PFL | $8.15\pm 0.83$ | $10.67\pm 0.57$ | $68.27\pm 1.44$ | $56.14\pm 0.43$ |
We provide the Alpaca Eval 2.0 results in Table 2. As reference policies, we used GPT-4 for absolute comparison and the SFT-trained model for relative comparison. We observe that the discovered LRML (DiscoPOP), PADLL, and AQFL functions outperform the baselines and other discovered losses on the normal and length-controlled win rates. The differences in scores among these top performing losses are not significant, except for the LC win rate against the SFT reference model, where DiscoPOP performs best.
5.2 Summarization (TL;DR)
We train an LLM policy to generate, given a Reddit forum post $x$, a summary $y$ of its main points. We finetune ‘zephyr-7b-gemma-sft’ using 10% of the Reddit TL;DR summarization preference dataset (Völske et al., 2017) on each of the baseline and discovered objective functions. As a reference model, we again use ‘zephyr-7b-gemma-sft’. Further details on the training pipeline are outlined in Section B.2. To evaluate the quality of the summaries, we make use of the Alpaca Eval 2.0 library with a custom evaluation dataset consisting of 694 test samples from the TL;DR dataset and a custom GPT-4 annotator template as described in Rafailov et al. (2023). For additional details regarding the summarization evaluation, see Section C.3.
In Table 3, the PADLL loss and DPO loss perform best on the summarization task in three out of four metrics, with little difference between them. Additionally, the LRML (DiscoPOP) function achieves scores slightly below the top performers, especially in the length-controlled win rates. In contrast to the single-turn dialogue task, the AQFL loss does not achieve high scores in this held-out evaluation.
Function | Win Rate (%) $\uparrow $ vs. Human Preference | Win Rate - LC (%) $\uparrow $ vs. Human Preference | Win Rate (%) $\uparrow $ vs. SFT Checkpoint | Win Rate - LC (%) $\uparrow $ vs. SFT Checkpoint |
DPO | $\mathbf{88.27}\pm \mathbf{1.07}$ | $\mathbf{82.82}\pm \mathbf{0.00}$ | $\mathbf{54.38}\pm \mathbf{1.52}$ | $54.64\pm 0.00$ |
SLiC | $83.02\pm 1.29$ | $63.41\pm 0.00$ | $53.03\pm 1.52$ | $54.11\pm 0.00$ |
KTO | $85.34\pm 1.18$ | $80.26\pm 0.00$ | $51.15\pm 1.54$ | $50.0\pm 0.00$ |
DBAQL | $84.71\pm 1.21$ | $78.68\pm 0.00$ | $52.55\pm 1.52$ | $\underline{55.14\pm 0.00}$ |
AQL | $81.87\pm 1.32$ | $68.89\pm 0.00$ | $46.00\pm 1.54$ | $50.0\pm 0.00$ |
PADLL | $\mathbf{88.54}\pm \mathbf{1.05}$ | $76.13\pm 0.00$ | $\mathbf{55.34}\pm \mathbf{1.52}$ | $\mathbf{55.64}\pm \mathbf{0.00}$ |
AQFL | $85.03\pm 1.22$ | $76.23\pm 0.00$ | $49.56\pm 1.53$ | $50.38\pm 0.00$ |
CELL | $86.33\pm 1.14$ | $73.72\pm 0.00$ | $50.35\pm 1.52$ | $51.90\pm 0.00$ |
LRML | $\mathbf{87.63}\pm \mathbf{1.10}$ | $\underline{81.88\pm 0.00}$ | $\underline{53.46\pm 1.52}$ | $55.10\pm 0.00$ |
PFL | $79.84\pm 1.35$ | $69.23\pm 0.00$ | $44.12\pm 1.52$ | $44.57\pm 0.00$ |
5.3 Positive sentiment generation (IMDb)
In this task, we train an LLM policy to generate movie review completions $y$ with positive sentiment, where $x$ is a prompt at the start of a movie review from the IMDb dataset (Maas et al., 2011). We start with a GPT-2 (Radford et al., 2019) model, which had supervised fine-tuning on the IMDb dataset, and we perform preference optimization using the baseline and discovered objective loss functions. Details of the training implementations can be found in Section B.3. Inspired by Rafailov et al. (2023)’s experiments, we calculate the model rewards through a pre-trained sentiment classifier, which we use as a proxy for ground truth, as well as the KL-Divergence of the trained model and the reference model. Section C.4 provides further details into the evaluation for this task.
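One common way to obtain the two quantities plotted against each other is a Monte Carlo estimate over completions sampled from the trained policy. The sketch below assumes that setup; the exact evaluation procedure is given in Section C.4 and may differ. The function name and the tuple layout are hypothetical.

```python
def mean_reward_and_kl(samples):
    """Estimate mean reward and sequence-level KL(policy || reference).

    samples: list of (reward, policy_logp, ref_logp) tuples, one per
    completion drawn from the trained policy. reward is the sentiment
    classifier's score; the log-probabilities are summed over tokens.
    """
    n = len(samples)
    mean_reward = sum(reward for reward, _, _ in samples) / n
    # E_{y ~ policy}[log policy(y|x) - log ref(y|x)], estimated by a sample mean.
    kl_estimate = sum(lp - lr for _, lp, lr in samples) / n
    return mean_reward, kl_estimate
```

Each trained model then contributes one (KL, reward) point, and sweeping $\beta $ traces out a reward-versus-KL frontier for each objective.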
We provide results of models with converging $\beta $ values in Figure 5 for LRML compared against DPO and SLiC, displaying the model rewards against the KL-Divergence to the reference model. In Figure 5(a), the LRML-trained text generator outperforms the DPO model in terms of rewards and KL-divergence with low $\beta $ values (0.025, 0.05, 0.1). At higher $\beta $ values (0.5 and 1.0) both methods show trends of increased KL-Divergence and lower rewards, but generally LRML maintains a higher reward than DPO. In Figure 5(b), we note that LRML slightly outperforms DPO, SLiC, AQFL, and PADLL at $\beta \in \{0.05,0.1\}$ in terms of reward. For larger $\beta $ values (0.5 and 1.0), LRML follows trends in KL-Divergence and reward similar to those of the other objective functions. A more detailed comparison between the individual discovered losses and the baselines can be found in Appendix Figure 8.
6 Analysis of DiscoPOP
We list all our discovered objectives in Table 1, as well as the code and mathematical representations in Appendix E. In this section, we analyze the Log Ratio Modulated Loss, which we define as the DiscoPOP loss function, as it performs consistently well across the held-out evaluation tasks, and we provide some intuitive understanding of how it outperforms the existing state-of-the-art objectives.
6.1 Log Ratio Modulated Loss (DiscoPOP)
The Log Ratio Modulated Loss is a dynamically weighted sum of the logistic loss (as used in DPO) and the exponential loss. The weight of each is determined through a sigmoid calculation of the difference of log-ratios ($\rho $). Mathematically, the LRML function can be described with a temperature parameter $\tau =0.05$ as follows:
${f}_{lrml}(\beta \rho )=(1-\sigma (\rho /\tau ))\cdot {f}_{dpo}(\beta \rho )+\sigma (\rho /\tau )\cdot {f}_{exp}(\beta \rho )$ | (4) |
$\phantom{{f}_{lrml}(\beta \rho )}=(1-\sigma (\rho /\tau ))\cdot \text{log}(1+\text{exp}(-\beta \rho ))+\sigma (\rho /\tau )\cdot \text{exp}(-\beta \rho )$ | (5) |
If the difference of log ratios is zero ($\rho =0$), which is the case at the start of training when the model policy ${\pi}_{\theta}$ is equal to the reference policy ${\pi}_{\text{ref}}$, then the loss is equally balanced between the logistic and exponential loss. If $\rho \to \mathrm{\infty}$, meaning the model policy diverges from the reference policy and chosen outputs are preferred, the exponential term dominates. This emphasizes larger differences more strongly. On the other hand, if $\rho \to -\mathrm{\infty}$, the model policy diverges from the reference policy and rejected outputs are preferred. In this case, the logistic term dominates, and it can handle moderate differences well. The baseline objective losses and the LRML, PADLL, and AQFL functions are displayed in Figure 6, including their gradients. Surprisingly, we see that the DiscoPOP function has a non-convex segment and negative gradients at the starting point $\rho =0$. This is potentially helpful for introducing a curriculum or for stochasticity.
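The loss in Equation (5) is short enough to write out directly. The sketch below is a plain-Python scalar version rather than the PyTorch code in Appendix E; it assumes $\tau =0.05$ as stated above and $\beta =0.05$, the value that Section 6.2 indicates was used during discovery. A second-difference helper gives one simple numerical probe of the non-convexity.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def f_lrml(rho, beta=0.05, tau=0.05):
    """Log Ratio Modulated Loss, Equation (5)."""
    w = sigmoid(rho / tau)               # weight on the exponential term
    logistic = softplus(-beta * rho)     # log(1 + exp(-beta * rho))
    exponential = math.exp(-beta * rho)
    return (1.0 - w) * logistic + w * exponential


def second_difference(f, x, h):
    """f(x - h) - 2 f(x) + f(x + h); a negative value rules out convexity."""
    return f(x - h) - 2.0 * f(x) + f(x + h)
```

At $\rho =0$ both weights equal $0.5$, so `f_lrml(0.0)` evaluates to $0.5\cdot (\text{log}\,2+1)$. With these parameter values, `second_difference(f_lrml, 0.2, 0.2)` is negative, which is one point where the midpoint condition for convexity fails.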
6.2 Limitations of DiscoPOP
While LRML performs very well on single-turn text generation and text summarization, we observed during the IMDb experiment that it struggles to converge when $\beta $ is too low ($\beta \le 0.01$) or too high ($\beta \ge 2.5$), likely because $\beta \ne 0.05$ was never seen or used during the discovery process.
In Figure 9 and Figure 10 of the Appendix, we plot the LRML objective function for $\beta \in \{0.01,0.025,0.05,0.1,0.25,0.5,1,2.5,5\}$ against DPO. Notably, when $\beta $ is high, the DiscoPOP objective function takes the form of the DPO log sigmoid loss. During training with $\beta =0.01$, we observed that DiscoPOP gets stuck generating negative reviews. We hypothesize that this is because the loss is stuck in the local minimum to the left, where the difference of log-ratios is negative. While training with $\beta \in \{2.5,5.0\}$, we observed that the model collapsed after a sharp spike in the loss, after which the loss value was 0 and the outputs were NaN. This is potentially due to large gradients in the non-convex part, which could potentially be amended with gradient clipping.
7 Related Work
Evolution and Search with Large Language Models. LLMs provide a fast and automated way to create multiple candidate solutions for a problem stated in natural language (Song et al., 2024). This makes them powerful tools for driving population-based search procedures such as evolutionary meta-discovery. Various recent works have applied this approach to coding problems (Romera-Paredes et al., 2024), neural architecture search (Chen et al., 2024a), virtual robotic design settings (Lehman et al., 2023), and reward functions (Ma et al., 2023; Yu et al., 2023). Finally, LLMs have recently been shown to be capable of acting as recombination operators for black-box optimization with Evolution Strategies (Lange et al., 2024) and for Quality-Diversity approaches (Lim et al., 2024).
Automated Discovery for Machine Learning. There are many other approaches to automating the discovery of generalizable machine learning algorithms. Some prior works explore the space of ML functions using genetic algorithms and a hand-crafted domain-specific language for reinforcement learning algorithms (Co-Reyes et al., 2021), curiosity algorithms (Alet et al., 2020), and optimizers (Chen et al., 2024b). Other works instead parameterize a transferrable objective function using neural networks and optimize them with evolution strategies. For example, Lu et al. (2022); Jackson et al. (2024); Houthooft et al. (2018); Alfano et al. (2024) evolve policy optimization objectives, Metz et al. (2022) evolves neural network optimizers, and Lange et al. (2023b, a) evolve blackbox optimizers.
Preference Optimization Algorithms. While the reduction to supervised learning makes DPO and alternatives easier to use, other approaches have sought to simplify the RL step, including using variants of REINFORCE (Ahmadian et al., 2024; Gemma-Team et al., 2024) as well as more fine-grained feedback (Wu et al., 2024) through preferences over individual steps in the reasoning process (Uesato et al., 2022; Lightman et al., 2023) or reward redistribution (Chan et al., 2024). Others use iterative offline training interleaved with sampling from the policy model and obtaining a preference ranking from themselves (Xu et al., 2023), another judge LLM (Guo et al., 2024), or an oracle (Swamy et al., 2024).
8 Conclusion
Summary. In this paper, we proposed and used LLM-driven objective discovery to generate novel offline preference optimization algorithms. Specifically, we were able to discover high-performing preference optimization losses that achieve strong performance across held-out evaluation tasks, with the highest-performing one providing new insights into what properties an optimal objective may need to possess, such as being a blend of logistic and exponential losses and possibly being non-convex.
Limitations & Future work. There are multiple limitations of our current approach. First, we have only scratched the surface of how to generate LLM objective proposals most effectively. Initial exploratory experiments using techniques such as temperature sampling or worst-to-best performance sorting in the context did not yield significant improvements. However, one could imagine leveraging more information about the training runs and automatically tuning instruction prompt templates, e.g., by providing entire learning curve plots to a Visual Language Model (see Figure 12) or by meta-meta-optimizing (Lu et al., 2023) the LLM prompt. Second, the highest-performing loss repurposed $\beta $ from its traditional role, so that it affects the functional behavior of the loss as well as the KL penalty of the model with respect to the base model. This motivates future work to study different forms, perhaps with multiple floating point parameters that could each be tuned separately. Although we provided an initial analysis sweep over this single parameter and observed some instances in which the functional behavior led to training instability, a further multi-parameter analysis that reformulates the objective would be beneficial for future work. Finally, our work uses closed-source models (GPT-4) to generate code, which limits reproducibility and is costly to run. Future work could use the produced models themselves to generate code, resulting in code-level self-improvement.
Broader Impact and Ethical Considerations. This paper presents an LLM-driven, in-context learning discovery pipeline that is used to generate better-performing novel offline preference optimization algorithms. However, a user could misuse the pipeline as a tool, or use it to train an LLM to produce undesirable, unethical, or harmful outputs. Furthermore, because the pipeline both uses and trains LLMs, its outputs are susceptible to hallucinations, which motivates always applying a content filter to the outputs of the LLMs. Finally, this work takes a small step towards code-level self-improvement in language models, which could potentially result in unintended behaviors.
Acknowledgments and Disclosure of Funding
This work was supported by Azure sponsorship credits granted by Microsoft’s AI for Good Research Lab and by Microsoft’s Accelerate Foundation Models Academic Research initiative. The hardware used for training was sponsored by GoodAI. SH is funded by AstraZeneca. CF is funded by Canon Medical. AJC is funded by a Microsoft Research and EPSRC ICASE scholarship award. The code can also be accessed at https://github.com/samholt/DiscoPOP.
References
- Ahmadian et al. [2024] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
- Alet et al. [2020] Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020.
- Alfano et al. [2024] Carlo Alfano, Sebastian Towers, Silvia Sapora, Chris Lu, and Patrick Rebeschini. Meta-learning the mirror map in policy mirror descent. arXiv preprint arXiv:2402.05187, 2024.
- Anthropic [2023] Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
- Azar et al. [2023] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
- Boser et al. [1992] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
- Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Carlini et al. [2021] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
- Chan et al. [2024] Alex J Chan, Hao Sun, Samuel Holt, and Mihaela van der Schaar. Dense reward for free in reinforcement learning from human feedback. arXiv preprint arXiv:2402.00782, 2024.
- Chen et al. [2024a] Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024a.
- Chen et al. [2024b] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024b.
- Christiano et al. [2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Co-Reyes et al. [2021] John D Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V Le, Honglak Lee, and Aleksandra Faust. Evolving reinforcement learning algorithms. arXiv preprint arXiv:2101.03958, 2021.
- Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
- Cui et al. [2023] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.
- Dubois et al. [2023] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2023.
- Dubois et al. [2024] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
- Engstrom et al. [2019] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep rl: A case study on ppo and trpo. In International conference on learning representations, 2019.
- Ethayarajh et al. [2024] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024.
- Gehman et al. [2020] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
- Gemini-Team [2023] Google DeepMind Gemini-Team. Gemini: A family of highly capable multimodal models, 2023.
- Gemma-Team et al. [2024] Gemma-Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Guo et al. [2024] Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
- Hospedales et al. [2021] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9):5149–5169, 2021.
- Houthooft et al. [2018] Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, OpenAI Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. Advances in Neural Information Processing Systems, 31, 2018.
- Jackson et al. [2024] Matthew Thomas Jackson, Chris Lu, Louis Kirsch, Robert Tjarko Lange, Shimon Whiteson, and Jakob Nicolaus Foerster. Discovering temporally-aware reinforcement learning algorithms. arXiv preprint arXiv:2402.05828, 2024.
- Jaques et al. [2019] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
- Lange et al. [2023a] Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929–937, 2023a.
- Lange et al. [2023b] Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29–30, 2023b.
- Lange et al. [2024] Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024.
- Lehman et al. [2023] Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331–366. Springer, 2023.
- Li et al. [2023] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Lim et al. [2024] Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024.
- Liu et al. [2023] Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023.
- Longpre et al. [2023] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pages 22631–22648. PMLR, 2023.
- Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017. URL https://api.semanticscholar.org/CorpusID:53592270.
- Lu et al. [2022] Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455–16468, 2022.
- Lu et al. [2023] Chris Lu, Sebastian Towers, and Jakob Foerster. Arbitrary order meta-learning with simple population-based evolution. In ALIFE 2023: Ghost in the Machine: Proceedings of the 2023 Artificial Life Conference. MIT Press, 2023.
- Ma et al. [2023] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
- Maas et al. [2011] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011.
- Metz et al. [2022] Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. Velo: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022.
- OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
- Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Razin et al. [2023] Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin. Vanishing gradients in reinforcement finetuning of language models. arXiv preprint arXiv:2310.20703, 2023.
- Romera-Paredes et al. [2024] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
- Rosasco et al. [2004] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural computation, 16(5):1063–1076, 2004.
- Song et al. [2024] Xingyou Song, Yingtao Tian, Robert Tjarko Lange, Chansoo Lee, Yujin Tang, and Yutian Chen. Position paper: Leveraging foundational models for black-box optimization: Benefits, challenges, and future directions. arXiv preprint arXiv:2405.03547, 2024.
- Stiennon et al. [2020] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Sutton [1984] Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. University of Massachusetts Amherst, 1984.
- Swamy et al. [2024] Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056, 2024.
- Tang et al. [2024] Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
- Tunstall and Schmid [2024] Lewis Tunstall and Philipp Schmid. Zephyr 7b gemma. https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1, 2024.
- Tunstall et al. [2023a] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Shengyi Huang, Kashif Rasul, Alexander M. Rush, and Thomas Wolf. The alignment handbook. https://github.com/huggingface/alignment-handbook, 2023a.
- Tunstall et al. [2023b] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023b.
- Uesato et al. [2022] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- Völske et al. [2017] Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. Tl; dr: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017.
- Wu et al. [2024] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36, 2024.
- Xu et al. [2023] Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
- Yu et al. [2023] Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023.
- Zhao et al. [2023] Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Zheng et al. [2024] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
- Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix
Appendix A LLM-Driven Objective Discovery Implementation Details
A.1 Prompts
We use the following system prompt to generate the model responses:
We then provide the first user prompt as such:
Upon testing the generated code, if an error is encountered, we provide the following prompt, where ‘error’ is the text containing the system error:
Upon a successful completion, we return the following user prompt, where ‘val’ is the MT-Bench score:
Appendix B Training Details
B.1 Discovery Task - Single-turn Dialogue
For each valid generated objective function ${f}_{i}$, we use it to train an LLM and then collect a performance evaluation score. Specifically, we follow the same process when training and evaluating all objective functions, starting with the pre-trained supervised fine-tuned (SFT) 7 billion parameter Gemma model ‘zephyr-7b-gemma-sft’. This is a 7 billion parameter base Gemma model [Gemma-Team et al., 2024] supervised fine-tuned on the ‘deita-10k-v0-sft’ dataset [Liu et al., 2023]. Starting with this model, we train it on the pairwise preference dataset ‘Argilla DPO Mix 7K’, which attempts to create a high-quality preference dataset by keeping only highly rated chosen responses from a multi-turn dataset, an instruction-following dataset [Longpre et al., 2023], and a diverse preference dataset that covers truthfulness, honesty and helpfulness [Cui et al., 2023]. For each training run, we trained all the parameters of the starting model, using a fixed $\beta =0.05$. We used the same fixed hyper-parameters for all training runs unless explicitly noted. Specifically, we used a learning rate of 5e-7, the bfloat16 floating-point format, two epochs, a batch size per device of two, a gradient accumulation step of 8, a cosine learning rate scheduler, and the AdamW optimization algorithm [Loshchilov and Hutter, 2017]. We use the popular TRL transformers library [vonwerra2022trl], adapting the offline preference optimization objective function to train all models. The models were trained on 8 NVIDIA A100 GPUs. An individual training run takes approximately 30 minutes. We provide training and evaluation statistics for discovered objective functions in Figure 7.
B.2 TL;DR Summarization
To determine whether the discovered objective functions also generalize well to other tasks, we use them to preference optimize an LLM for text summarization. Specifically, we start again with the pre-trained supervised fine-tuned (SFT) 7 billion parameter Gemma model ‘zephyr-7b-gemma-sft’, and we optimize it with the objective function ${f}_{i}$ on a subsample of the Reddit TL;DR summarization preference dataset [Völske et al., 2017] (https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons). More precisely, we use the first 10% of the dataset for preference optimization, which amounts to around 8,000 training samples. During training, the hyperparameters are kept the same as in the single-turn dialogue task explained in subsection B.1, except that the LLMs were trained on 4 NVIDIA A100 GPUs using a gradient accumulation step of 16. An individual training run takes approximately 1.5 hours.
B.3 IMDb Positive Text Generation
Another popular generalization task for preference optimization [Rafailov et al., 2023] is to fine-tune a small LLM to generate positive text for movie reviews, based on the IMDb sentiment dataset [Maas et al., 2011] (https://huggingface.co/datasets/ZHZisZZ/imdb_preference). As the starting model, we use a GPT2 model [Radford et al., 2019] that was supervised fine-tuned on the IMDb dataset (https://huggingface.co/lvwerra/gpt2-imdb). Subsequently, we apply the baseline and discovered objective functions ${f}_{i}$ for preference optimization. Given a short prompt of 2-8 tokens that indicates the start of a movie review, the goal of the LLM is to generate a positive review. As we are interested in the effect of $\beta $ on the rewards and KL-divergence, we train the objective functions over a sweep of $\beta \in \{0.01,0.025,0.05,0.1,0.25,0.5,1,2.5,5\}$. Every LLM is trained for three epochs using the AdamW optimizer, with an initial learning rate of 5.0e-5, a warm-up scheduler of 0.1, and a cosine learning rate scheduler. The models are trained on 4 NVIDIA A100 GPUs, using a gradient accumulation step of 8 and a batch size per device of 2. The training takes around 30 minutes.
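As a sanity check on the three training setups in B.1 to B.3, the effective batch size per optimizer step can be derived from the reported per-device batch size, gradient accumulation steps, and GPU count. The sketch below assumes plain data parallelism with one process per GPU, and assumes B.2 keeps the per-device batch size of two from B.1 (the text states that the remaining hyperparameters are unchanged); the helper name is illustrative and not taken from the released code.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Number of samples contributing to one optimizer step under data parallelism."""
    return per_device * grad_accum * num_gpus


# Values as reported in subsections B.1, B.2, and B.3.
setups = {
    "B.1 single-turn dialogue": effective_batch_size(per_device=2, grad_accum=8, num_gpus=8),
    "B.2 TL;DR summarization": effective_batch_size(per_device=2, grad_accum=16, num_gpus=4),
    "B.3 IMDb": effective_batch_size(per_device=2, grad_accum=8, num_gpus=4),
}

for name, size in setups.items():
    print(f"{name}: {size} samples per optimizer step")
```

Under these assumptions, halving the number of GPUs while doubling the gradient accumulation steps leaves the TL;DR setup with the same effective batch size as the discovery task.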
Appendix C Evaluation Metrics
C.1 MT-Bench
To assess the fitness of the discovered preference optimization loss function during the discovery phase, we evaluate the trained LLMs on the MT-Bench [Zheng et al., 2024] benchmark. The evaluation benchmark consists of 80 high-quality multi-turn questions from various disciplines. The goal is to assess an LLM’s ability to follow instructions and keep the flow of a conversation. A larger LLM, in our case GPT-4, is then used as a judge to score the quality of the answers with a number from 0 (lowest) to 10 (highest). Scores are given based on the quality of the LLM’s first-turn answer (single-turn), as well as on the first and second answers (multi-turn). Finally, the MT-Bench score is the average of the single-turn and multi-turn scores. For answer generation and evaluation, we used the FastChat library (https://github.com/lm-sys/FastChat) and its standard sampling and temperature parameters, provided by Zheng et al. [2024].
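The aggregation described above can be sketched as follows. This is a minimal illustration, assuming that judge scores are first averaged over questions within each turn type and that the two averages are then averaged; the function name and the example scores are made up for illustration and are not FastChat's API or real results.

```python
from statistics import mean


def mt_bench_score(single_turn_scores, multi_turn_scores):
    """Average of the mean single-turn and mean multi-turn judge scores."""
    return 0.5 * (mean(single_turn_scores) + mean(multi_turn_scores))


# Hypothetical judge scores for three questions (not real data).
single_turn = [8.0, 7.0, 9.0]
multi_turn = [7.0, 6.0, 8.0]
score = mt_bench_score(single_turn, multi_turn)
print(score)  # 7.5
```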
C.2 Alpaca Eval
Currently, Alpaca Eval 2.0 [Li et al., 2023, Dubois et al., 2023, 2024] is also a popular benchmark to evaluate LLMs. This is a single-turn dialogue LLM-based automatic evaluation using a stronger LLM, here GPT-4 Turbo, to assess the win rate of the trained LLM policy’s completion compared to that of either GPT-4 or the underlying SFT base model. Specifically, Alpaca Eval 2.0 has been validated against 20K human annotations and aims to reduce the length bias of Alpaca Eval; length-controlled (LC) Alpaca Eval shows a correlation with Chatbot Arena of 0.98, making it a popular benchmark with the highest correlation to Chatbot Arena [Dubois et al., 2024]. The Alpaca evaluation dataset consists of 841 high-quality instructions originating from different datasets. The library (https://github.com/tatsu-lab/alpaca_eval) provided by Dubois et al. [2024] calculates the win-rate (the percentage of cases where the trained policy is preferred over the reference policy, first introduced in Alpaca Eval 1.0) and a length-controlled win-rate, where a linear model is fitted in order to de-bias for the length of the prompt and instruction difficulty. To generate the answers, we use sampling with a temperature of 0.7 and a maximum of 1024 new tokens. Furthermore, the library provides the standard error of the mean, which indicates the confidence of the win-rate and LC win-rate.
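To make the two reported quantities concrete, the sketch below computes a raw win-rate and its standard error of the mean from per-example preferences. It assumes each preference is encoded as 1.0 when the trained policy is preferred and 0.0 otherwise, and it uses the sample standard deviation; the Alpaca Eval library may differ in these details, and the length-controlled variant (which fits a linear model) is not reproduced here. The preference values are made up for illustration.

```python
import math
from statistics import mean, stdev


def win_rate_and_sem(preferences):
    """Return (win-rate in %, standard error of the mean in %)."""
    n = len(preferences)
    win_rate = 100.0 * mean(preferences)
    # SEM = sample standard deviation / sqrt(n), expressed in percentage points.
    sem = 100.0 * stdev(preferences) / math.sqrt(n)
    return win_rate, sem


# Hypothetical judge decisions for eight prompts (not real data).
prefs = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
win_rate, sem = win_rate_and_sem(prefs)
print(f"win-rate = {win_rate:.1f}%, SEM = {sem:.1f}")
```

With only a handful of examples the standard error is large, which is why it is reported alongside the win-rate.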
C.3 TL;DR Summarization Win-Rate
To evaluate how well the discovered objective functions generalize to the task of summarization, we make use of the Alpaca Eval 2.0 library, similar to subsection C.2. Instead of using the Alpaca evaluation dataset, we create a custom dataset consisting of 694 samples from the TL;DR preference test dataset. Additionally, we change the prompt of the annotator LLM to fit the "Summarization GPT-4 win rate prompt (C)" as described in Rafailov et al. [2023]. The (LC) win-rate is calculated against either the existing human-chosen test sample or the summary generated by the SFT reference model. For summary generation, we apply sampling with a temperature of 0.7 and a maximum of 256 new tokens. Moreover, we stop the summarization after the "\n" token to avoid nonsensical generations. Furthermore, as we do not have access to an instruction difficulty for the length-controlled win-rate, we omit this term from the linear model (this has only a small impact on the metric). In addition to the win-rates, we also provide the standard error as a measure of confidence.
C.4 IMDb Rewards vs KL-Divergence
For the positive text generation task, unlike for MT-Bench, Alpaca Eval 2.0, and the TL;DR evaluation, we do not require an LLM judge, as we take a pre-trained sentiment classifier (https://huggingface.co/siebert/sentiment-roberta-large-english) as the ground-truth reward scorer. The LLMs generate text with sampling and a maximum of 60 new tokens. The rewards and KL-divergence are averaged over 10 different generations from the trained LLMs.
Appendix D Additional Results
D.1 Frontiers of Expected Reward vs KL Divergence
D.2 Loss Sweeps for Different $\beta $ Parameters
D.3 Discovery Robustness with respect to LLM Hyperparameters
D.4 Visual Language Models for Objective Discovery
Appendix E Discovered Objective $f$ Functions
To describe the discovered losses mathematically, we define three existing preference optimization losses here:
$${f}_{dpo}(\beta \rho )=-\text{log}(\sigma (\beta \rho ))=-\text{log}(\frac{1}{1+\text{exp}(-\beta \rho )})=\text{log}(1+\text{exp}(-\beta \rho ))$$ | (6) |
$${f}_{slic}(\beta \rho )=\text{ReLU}(1-\beta \rho )$$ | (7) |
$${f}_{exp}(\beta \rho )=\text{exp}(-\beta \rho )$$ | (8) |
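For reference, Equations (6) to (8) can be written as scalar functions of the scaled margin $x=\beta \rho $. The following is a minimal sketch in plain Python rather than the PyTorch code used for training; the numerically stable form of the logistic loss is an implementation choice made here, and the example margins are illustrative.

```python
import math


def f_dpo(x: float) -> float:
    """Logistic loss log(1 + exp(-x)), Eq. (6), in a numerically stable form."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def f_slic(x: float) -> float:
    """Hinge loss ReLU(1 - x), Eq. (7)."""
    return max(0.0, 1.0 - x)


def f_exp(x: float) -> float:
    """Exponential loss exp(-x), Eq. (8)."""
    return math.exp(-x)


beta = 0.05  # value used during discovery
for rho in (-20.0, 0.0, 20.0):  # illustrative log-ratio differences
    x = beta * rho
    print(f"rho={rho:+.0f}: dpo={f_dpo(x):.4f} slic={f_slic(x):.4f} exp={f_exp(x):.4f}")
```

All three losses decrease as the margin grows; they differ in how strongly they penalize negative margins, with the exponential loss growing fastest.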
Moreover, we display the code of the discovered losses as it is output by the LLM. In addition, we provide a mathematical representation of each, which we have adapted to be consistent with $\beta $ being the KL-divergence regularization parameter. This is because the generated code for LRML, DBAQL, AQL, AQFL, and PFL did not uphold the requirement that $\beta $ be multiplied with the difference of log-ratios before any further calculations. If this requirement is not upheld, the loss function could change shape based on the KL-regularization term, and models could therefore fail to converge or potentially collapse. In future work, we should constrain the exploring LLM to multiply $\beta $ with the input before any other calculations are done with the difference of log-ratios $\rho $. As the meta exploration was done with a fixed $\beta =0.05$, and we wish to stay consistent with this scale of regularization, we have adapted the losses by dividing the $\rho $ values used in intermediate calculations by a scalar $\tau =0.05$.
In the IMDb experiment in Section 5, we have thus used the corrected versions of the code for the discovered losses, based on the provided mathematical representations, as we were most interested in the effect of the KL-divergence compared to the model rewards.
E.1 DBAQL: Dynamic Blended Adaptive Quantile Loss
MT-Bench Score: 7.978
${f}_{dbaql}(\beta \rho )$ | $=\sigma (\mathbb{V}\text{ar}[\rho /\tau ])\cdot {f}_{dpo}(\beta \rho /0.9)+(1-\sigma (\mathbb{V}\text{ar}[\rho /\tau ]))\cdot {f}_{exp}(\beta \rho \cdot 0.9)$ | (9) | ||
$\tau $ | $=0.05$ | (10) |
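Equations (9) and (10) can be sketched per batch as below. This is an illustration in plain Python and not the LLM-generated code; it assumes the variance is taken over the batch of $\rho /\tau $ values and uses the population variance, which is one possible reading of $\mathbb{V}\text{ar}$. The helper names are hypothetical.

```python
import math
from statistics import pvariance

TAU = 0.05  # Eq. (10)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def f_dpo(x: float) -> float:
    """Logistic loss log(1 + exp(-x)), Eq. (6)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def f_exp(x: float) -> float:
    """Exponential loss exp(-x), Eq. (8)."""
    return math.exp(-x)


def blend_weight(rhos) -> float:
    """sigma(Var[rho / tau]) over the batch (population variance assumed)."""
    return sigmoid(pvariance([r / TAU for r in rhos]))


def f_dbaql(rhos, beta: float):
    """Per-example DBAQL losses for a batch of log-ratio differences, Eq. (9)."""
    w = blend_weight(rhos)
    return [w * f_dpo(beta * r / 0.9) + (1.0 - w) * f_exp(beta * r * 0.9) for r in rhos]


print(blend_weight([0.0, 0.0]))        # zero variance gives a weight of 0.5
print(blend_weight([-1.0, 0.0, 1.0]))  # large scaled variance saturates the weight
print(f_dbaql([-1.0, 0.0, 1.0], beta=0.05))
```

Because the variance is non-negative, the blend weight lies in [0.5, 1), and with $\tau =0.05$ even moderately spread batches push it close to 1, where the loss behaves like the logistic term.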
E.2 AQL: Adaptive Quantile Loss
MT-Bench Score: 7.953
${f}_{aql}(\beta \rho )$ | $=q\cdot {f}_{dpo}(\beta \rho )+(1-q)\cdot {f}_{slic}(\beta \rho )$ | (11) | ||
$q$ | $=\sigma (-\beta \cdot (\rho /\tau -{m}_{2}))$ | (12) | ||
${m}_{2}$ | $=0.5+0.01\cdot \left(\mathbb{E}[\sigma (\rho /\tau )]-0.5\right)$ | (13) | ||
$\tau $ | $=0.05$ | (14) |
E.3 PADLL: Performance Adaptive Decay Logistic Loss
MT-Bench Score: 7.941
${f}_{padll}(\beta \rho )$ | $={\delta}_{\text{adpt}}\cdot {f}_{dpo}(\beta \rho )$ | (15) | ||
$$ | (16) | |||
$$ | (17) | |||
$$ | (18) |
This loss can also be rewritten as:
$$ | (19) |
E.4 AQFL: Adaptive Quantile Feedback Loss
MT-Bench Score: 7.931
${f}_{aqfl}(\beta \rho )$ | $=r\cdot {f}_{dpo}(\beta \rho )+(1-r)\cdot {f}_{slic}(\beta \rho )$ | (20) | ||
$r$ | $=\sigma (0.1\ast d)$ | (21) | ||
$d$ | $=|\rho /\tau -{m}_{2}|$ | (22) | ||
${m}_{2}$ | $={m}_{1}+0.05\cdot \left(\sigma (\mathbb{E}[\rho /\tau ]-{m}_{1})\right)$ | (23) | ||
${m}_{1}$ | $=\mathbb{E}[\sigma (-\rho /\tau )]\cdot \sqrt{\mathbb{V}\text{ar}[\rho /\tau ]}$ | (24) | ||
$\tau $ | $=0.05$ | (25) |
E.5 CELL: Combined Exponential + Logistic Loss
MT-Bench Score: 7.925
${f}_{cell}(\beta \rho )$ | $=0.5\cdot {f}_{dpo}(\beta \rho )+0.5\cdot {f}_{exp}(\beta \rho )$ | (26) |
E.6 LRML: Log Ratio Modulated Loss
MT-Bench Score: 7.916
${f}_{lrml}(\beta \rho )$ | $=(1-\sigma (\rho /\tau ))\cdot {f}_{dpo}(\beta \rho )+\sigma (\rho /\tau )\cdot {f}_{exp}(-\beta \rho )$ | (27) | ||
$\tau $ | $=0.05$ | (28) |
E.7 PFL: Policy Focused Loss
MT-Bench Score: 7.900
Interestingly, the generated function code for PFL did not include any $\beta $ values in the loss function. We have added $\beta $ to the corrected code for the IMDb experiment, as well as to the mathematical expression below.
${f}_{pfl}(\beta \rho )$ | $=1/2\cdot {f}_{dpo}(\beta \rho )\cdot {\mathbb{1}}_{[{\pi}_{w}>{\pi}_{r}]}+2\cdot {f}_{slic}(\beta \rho )\cdot {\mathbb{1}}_{[{\pi}_{w}\le {\pi}_{r}]}$ | (29) |
Appendix F Full Run Log
We provide a full run below, formatted for readability.