BAdam: A Memory Efficient Full Parameter
Training Method for Large Language Models

Qijun Luo
School of Science and Engineering
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen
qijunluo@link.cuhk.edu.cn

Hengxu Yu
School of Data Science
The Chinese University of Hong Kong, Shenzhen
hengxuyu@link.cuhk.edu.cn

Xiao Li
School of Data Science
The Chinese University of Hong Kong, Shenzhen
lixiao@cuhk.edu.cn

Abstract

This work presents $\mathsf{BAdam}$ , an optimizer that leverages the block coordinate optimization framework with Adam as the inner solver. $\mathsf{BAdam}$ offers a memory efficient approach to the full parameter finetuning of large language models and reduces running time of the backward process thanks to the chain rule property. Experimentally, we apply $\mathsf{BAdam}$ to instruction-tune the Llama 2-7B model on the Alpaca-GPT4 dataset using a single RTX3090-24GB GPU. The results indicate that $\mathsf{BAdam}$ exhibits superior convergence behavior in comparison to LoRA and LOMO. Furthermore, our downstream performance evaluation of the instruction-tuned models using the MT-bench shows that $\mathsf{BAdam}$ modestly surpasses LoRA and more substantially outperforms LOMO. Finally, we compare $\mathsf{BAdam}$ with Adam on a medium-sized task, i.e., finetuning RoBERTa-large on the SuperGLUE benchmark. The results demonstrate that $\mathsf{BAdam}$ is capable of narrowing the performance gap with Adam. Our code is available at https://github.com/Ledzy/BAdam.

1 Introduction

Large language models (LLMs) such as GPT-4 [1] and Llama 2 [22] have shown strong abilities in language understanding, generation, reasoning, translation, etc. Due to their broad applicability, LLMs have been regarded as a feasible approach towards artificial general intelligence [2]. Finetuning or adaptation has become an important step in applying pretrained LLMs to follow human instructions or perform specific downstream tasks.

Background. When GPU memory is not a major limitation, full parameter tuning methods, such as applying Adam to the entire set of parameters of LLMs on the finetuning dataset, often offer greater flexibility for parameter search and optimization. This optimization scheme unlocks the full potential for the model to learn and adapt to downstream tasks, since every parameter is free to be updated. However, executing such a full parameter training method typically requires a significant amount of GPU memory. For instance, to finetune an LLM with $M$ billion parameters, Adam [9] necessitates at least $18M$ GB of GPU memory for successful training, and this estimate does not even account for the storage of activations used in the backpropagation (BP) process; see Section 2.2.1 for a detailed analysis. This requirement poses challenges for computational resources as models scale up, given that GPU memory is often limited in practical settings.

Parameter efficient finetuning (PEFT) methods such as low-rank adaptation (LoRA) [8], Adapter [7], prompt- and Prefix-tuning [13, 12], among others, play a critical role in finetuning large language models under memory resource constraints. The principal idea of PEFT is to represent the parameter updates in a much lower-dimensional subspace. For instance, LoRA parameterizes the trainable incremental weights as the product of two lower-dimensional low-rank matrices, which significantly reduces the number of trainable parameters and, consequently, the GPU memory consumption. Despite the success of LoRA and related PEFT methods, finetuning within a substantially lower-dimensional subspace may potentially limit practical downstream performance; see, e.g., [17, 27].

The observations outlined above motivate us to explore a memory efficient full parameter training method, which can enable us to leverage the advantages of full parameter finetuning.

Main results. In this work, we have the following main contributions:

(C.1) We propose a block coordinate-type optimization method with Adam as the incorporated inner solver, termed $\mathsf{BAdam}$ ; see Section 2.1 for the detailed design. This method partitions the entire set of model parameters into $D$ blocks, updating one block at a time using Adam’s efficient update steps. Such a block coordinate optimization scheme of $\mathsf{BAdam}$ offers a memory efficient solution to full parameter finetuning of LLMs. For example, by partitioning an LLM with $M$ billion parameters into $D$ nearly equal-sized blocks, $\mathsf{BAdam}$ requires only about $2M+\frac{16M}{D}$ GB of GPU memory for successful training with mixed-precision training; see Section 2.2.1 for further analysis. This represents a significant reduction in memory demands compared to traditional full parameter finetuning using Adam. It is important to note that $\mathsf{BAdam}$ is not simply blocklizing the Adam optimizer; rather, it is fundamentally building on the block coordinate optimization framework that utilizes Adam’s update steps as the inner solver. These two schemes are fundamentally different in terms of managing the Adam optimizer states. We refer to the last paragraph of Section 2.1 for a deeper analysis.
(C.2) We apply $\mathsf{BAdam}$ to finetune the Llama 2-7B model on the Alpaca-GPT4 dataset using a single RTX3090-24GB GPU and compare its performance with existing methods such as LoRA and LOMO. The experimental results demonstrate that $\mathsf{BAdam}$ converges faster and achieves a lower training loss. Additionally, we present the wall-clock running times of these methods, highlighting $\mathsf{BAdam}$ ’s substantial improvement in running time compared to LoRA and LOMO. This improvement is attributed to the chain rule property of the backpropagation (BP) process, which allows $\mathsf{BAdam}$ to save a considerable amount of backward computation time; for a more involved analysis, see Section 2.2.2. We further evaluate the downstream performance of the instruction-tuned models using the MT-bench. $\mathsf{BAdam}$ attained an MT-bench score of 5.06, surpassing the scores of 4.91 and 4.21 achieved by LoRA and LOMO, respectively, after the same number of data passes. Therefore, $\mathsf{BAdam}$ demonstrates a modest improvement over LoRA in this context.
(C.3) We finally compare $\mathsf{BAdam}$ with Adam on the medium-sized language model RoBERTa-large and the SuperGLUE benchmark. We observe that $\mathsf{BAdam}$ is capable of closing the performance gap with Adam in this setting. Consequently, we extrapolate that $\mathsf{BAdam}$ has the potential to perform nearly as well as Adam, even when tuning larger models.

We compare our $\mathsf{BAdam}$ to several representative methods in Table 1. In summary, we believe that our $\mathsf{BAdam}$ can serve as a competitive optimizer for finetuning large language models in scenarios with limited memory.

Algorithm	Memory	Full parameter training	Momentum and second moment	Update precision	Gradient accumulation
Adam [9]	$18M$	✓	✓	Float32	✓
LoRA [8]	$2M+\frac{36rM}{m}$	✗	✓	Float32	✓
LOMO [17]	$2M+\frac{2M}{D}$	✓	✗	Float16	✗
BAdam	$2M+\frac{16M}{D}$	✓	✓	Float32	✓

Table 1: Algorithm feature summary. Here, $M$ represents that the model to be trained has $M$ billion parameters, $r$ is the LoRA rank, $m$ is the weight matrix dimension (here, we consider the case where the model’s weights are all square matrices to ease the analysis), and $D$ is the number of transformer layers or the number of partitioned blocks in $\mathsf{BAdam}$ . $\mathsf{BAdam}$ performs full parameter mixed-precision training using Adam’s update rule, while requiring only memory that is comparable to LoRA and LOMO. Note that we exclude the memory cost of the activations here.

2 The BAdam Method

Block coordinate optimization has a long history in the optimization community; see, e.g., [23, 19, 25] and the references therein. At each iteration, such an optimization strategy keeps the majority of the optimization parameters at their up-to-date iteration values, while approximately optimizing the objective function over only the remaining parameters. This procedure ensures that each iteration solves a much lower dimensional optimization problem than the original one, and hence is simpler to approximately optimize. This feature makes block coordinate-type methods especially suitable for huge-scale problems that have a tremendous number of optimization parameters.

We reveal an interesting link between the block coordinate optimization strategy and the finetuning of LLMs. Namely, the finetuning process boils down to an optimization problem that needs to handle a huge number of trainable model parameters. This setting matches exactly the advantage of block coordinate optimization, which decomposes a huge-scale optimization problem into many smaller ones over smaller blocks, making it possible to relax the requirement on large GPU memory. Based on these observations, we pursue our goal of finetuning LLMs with low memory consumption by designing a block coordinate optimization method.

2.1 Algorithm Design

Figure 1: Illustration of the proposed $\mathsf{BAdam}$ , which is based on the block coordinate optimization framework. Colors represent the states of the partitioned blocks in one block-epoch, including the active block, inactive blocks, and updated blocks.

In this subsection, we propose $\mathsf{BAdam}$ , a block coordinate optimization method embedded with Adam updates as the inner solver. The method is displayed in Algorithm 1 and illustrated in Figure 1. Formally, let us consider an abstract form of the training problem formulation of LLMs $\min_{\theta}\ \mathcal{L}(\theta)=\frac{1}{n}\sum_{j=1}^{n}\ell_{j}(\theta)$ . Here, $\theta\in\mathbb{R}^{d}$ represents the concatenation of vectorized parameters of the model, $n$ is the number of training data points, and $\ell_{j}$ is the loss function for the $j$ -th training data point, which can be the negative log-likelihood loss when performing language modeling or supervised finetuning.

Block partition and block-coordinate optimization framework. At the $t$ -th block-epoch, $\mathsf{BAdam}$ first generates an ordered block partition $\pi=\{\pi_{1},\ldots,\pi_{i},\ldots,\pi_{D}\}$ , which splits the whole model parameters $\theta\in\mathbb{R}^{d}$ into $D$ blocks, i.e., $\theta=\{\theta_{\pi_{1}},\ldots,\theta_{\pi_{i}},\ldots,\theta_{\pi_{D}}\}$ with $\theta_{\pi_{i}}\in\mathbb{R}^{d_{i}}$ and $\sum_{i=1}^{D}d_{i}=d$ . Note that $\pi$ may be either a deterministic or a random partition, as long as the aggregation of all the blocks $\{\theta_{\pi_{i}}\}$ forms the whole set of parameters $\theta$ .

The partition $\pi$ can be very flexible and is a unified representation. Given a large language model such as Llama 2-7B, a natural partition is the layer of the model, including the transformer layers, embedding layer, and the LM head layer. For such a layer-based partition, we may list the partition $\pi$ in a forward order (from the input to the output layer), backward order (from the output to the input layer), or reshuffling order (random). Apart from this most natural partition, one can also choose a small part of parameters from each layer and regard these parameters as one block $\theta_{\pi_{i}}$ . However, different partitions have different effects on the BP process, which is directly related to the running time as discussed in Section 2.2.2. In the rest of this paper, we will use the natural layer-based partition, unless explicitly specified.
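As a rough illustration of the three orderings of a layer-based partition described above, the following sketch builds an ordered list of block labels. The function name `layer_partition` and the labels `embedding`, `layer_i`, and `lm_head` are hypothetical and chosen only for this example; they are not identifiers from the released code.

```python
import random


def layer_partition(num_layers, order="forward", seed=None):
    """Return an ordered list of block labels for a layer-based partition.

    The labels are hypothetical placeholders, not names from any library.
    """
    blocks = ["embedding"] + [f"layer_{i}" for i in range(num_layers)] + ["lm_head"]
    if order == "forward":       # from the input layer to the output layer
        return blocks
    if order == "backward":      # from the output layer to the input layer
        return blocks[::-1]
    if order == "random":        # reshuffled order
        rng = random.Random(seed)
        shuffled = blocks[:]
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown order: {order}")
```

Whatever the ordering, the blocks together still cover every parameter exactly once, which is the only requirement placed on $\pi$ in the text.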

We are now ready to present the optimization framework of our $\mathsf{BAdam}$ . Our core idea is to adopt the spirit of the block coordinate optimization. Namely, we optimize over only one active block $\theta_{\pi_{i}}$ at one time, given that the other inactive blocks are fixed at their up-to-date values. Mathematically, at the $t$ -th block-epoch, suppose that the current active block is $\theta_{\pi_{i}}$ , then updating block $\theta_{\pi_{i}}$ amounts to solving the following problem:

$$\theta_{\pi_{i}}^{t+1}=\operatorname*{arg\,min}_{\theta_{\pi_{i}}\in\mathbb{R}^{d_{i}}}\ \frac{1}{n}\sum_{j=1}^{n}\ell_{j}(\theta^{t+1}_{\pi_{1}},\dots,\theta^{t+1}_{\pi_{i-1}},\theta_{\pi_{i}},\theta^{t}_{\pi_{i+1}},\dots,\theta^{t}_{\pi_{D}}). \tag{1}$$

One can see that problem (1) fixes the inactive blocks at their most recent values, and hence it is a much lower dimensional optimization problem compared to $\min_{\theta}\ \frac{1}{n}\sum_{j=1}^{n}\ell_{j}(\theta)$ , providing the possibility to implement the algorithm in the situation with limited GPU memory resources; see Section 2.2.1 for the analysis of memory consumption. Solving problem (1) sequentially for $i=1,\ldots,D$ moves the block-epoch from $t$ to $t+1$ .

input: $\beta_{1}$ , $\beta_{2}$ , $\varepsilon$ , and learning rate $\alpha$ .
initialization: block-epoch index $t\leftarrow 0$ and model parameters $\theta^{0}$ from the pretrained model.
while stopping criterion not met do
    generate a block partition $\pi=\{\pi_{1},\cdots,\pi_{D}\}$ ;
    repeat for one block-epoch $i\leftarrow 1,\ldots,D$
        // reset iteration index, block momentum, and block model parameter
        $k\leftarrow 0$ ; $m_{\pi_{i}}^{t,0}\leftarrow 0$ ; $v_{\pi_{i}}^{t,0}\leftarrow 0$ ; $\theta_{\pi_{i}}^{t,0}\leftarrow\theta_{\pi_{i}}^{t}$ ;
        // Adam steps for updating the active block $\theta_{\pi_{i}}$
        repeat for $K$ Adam steps (sample data points)
            $k\leftarrow k+1$ ;
            $g^{t,k}_{\pi_{i}}\leftarrow$ stochastic approximation of $\frac{\partial}{\partial\theta_{\pi_{i}}}\mathcal{L}(\theta^{t+1}_{\pi_{1}},\dots,\theta^{t+1}_{\pi_{i-1}},\theta^{t,k-1}_{\pi_{i}},\theta^{t}_{\pi_{i+1}},\dots,\theta^{t}_{\pi_{D}})$ ;
            $m_{\pi_{i}}^{t,k}\leftarrow\beta_{1}m_{\pi_{i}}^{t,k-1}+(1-\beta_{1})g_{\pi_{i}}^{t,k}$ ;
            $v_{\pi_{i}}^{t,k}\leftarrow\beta_{2}v^{t,k-1}_{\pi_{i}}+(1-\beta_{2})(g^{t,k}_{\pi_{i}})^{2}$ ;
            $\hat{m}_{\pi_{i}}^{t,k}\leftarrow m_{\pi_{i}}^{t,k}/(1-\beta_{1}^{k})$ ;
            $\hat{v}_{\pi_{i}}^{t,k}\leftarrow v_{\pi_{i}}^{t,k}/(1-\beta_{2}^{k})$ ;
            $\theta_{\pi_{i}}^{t,k}\leftarrow\theta_{\pi_{i}}^{t,k-1}-\alpha\hat{m}_{\pi_{i}}^{t,k}/\left(\sqrt{\hat{v}_{\pi_{i}}^{t,k}}+\varepsilon\right)$ ;
        end
        $\theta_{\pi_{i}}^{t+1}\leftarrow\theta_{\pi_{i}}^{t,K}$ ;
        $g_{\pi_{i}},m_{\pi_{i}},v_{\pi_{i}}\leftarrow$ None ; // clear out the memory for gradient and optimizer states
    end
    $t\leftarrow t+1$ ;
end while
return learned model parameters $\theta^{t}$

Algorithm 1 $\mathsf{BAdam}$ : A block coordinate optimization method embedded with $\mathsf{Adam}$ update. In this algorithm, vector square, vector square root, and vector division are all element-wise operations.

Update using Adam steps. Due to the sophisticated structure of LLMs, it is almost impossible to compute an accurate solution to subproblem (1). We instead propose to approximately solve (1) using several gradient-based steps starting at $\theta^{t}_{\pi_{i}}$ . Abstractly, $\mathsf{BAdam}$ executes the following update:

$$\left[\begin{aligned} &\text{Fix blocks }\theta_{\pi_{i^{\prime}}},\ \forall i^{\prime}\neq i\text{ at their most recent values},\\ &\theta_{\pi_{i}}^{t+1}\leftarrow\mathcal{A}(\theta^{t+1}_{\pi_{1}},\dots,\theta^{t+1}_{\pi_{i-1}},\theta^{t}_{\pi_{i}},\theta^{t}_{\pi_{i+1}},\dots,\theta^{t}_{\pi_{D}}).\end{aligned}\right. \tag{2}$$

Here, $\mathcal{A}$ is a certain algorithmic procedure. In this work, we choose $\mathcal{A}$ to be $K$ Adam steps [9] starting at $\theta_{\pi_{i}}^{t}$ , in order to efficiently approximate the solution of (1). To specify the concrete Adam steps, we first note that the gradient of the training objective function can be correspondingly decomposed as

$$\nabla\mathcal{L}(\theta)=\begin{bmatrix}\frac{\partial\mathcal{L}}{\partial\theta_{\pi_{1}}}\\ \vdots\\ \frac{\partial\mathcal{L}}{\partial\theta_{\pi_{D}}}\end{bmatrix}=\begin{bmatrix}\frac{\partial}{\partial\theta_{\pi_{1}}}\frac{1}{n}\sum_{j=1}^{n}\ell_{j}(\theta)\\ \vdots\\ \frac{\partial}{\partial\theta_{\pi_{D}}}\frac{1}{n}\sum_{j=1}^{n}\ell_{j}(\theta)\end{bmatrix}. \tag{3}$$

We call $\frac{\partial\mathcal{L}}{\partial\theta_{\pi_{i}}}$ the block gradient of the objective function $\mathcal{L}$ over block $\theta_{\pi_{i}}$ . Importantly, the BP process naturally defines the block gradient. However, it is impractical or even prohibitive to compute the block gradient using all the $n$ training data points. Instead, following the main spirit of stochastic optimization methods, one can select a batch of data points to compute a block stochastic gradient $g_{\pi_{i}}$ for approximating the block gradient, as outlined in the gradient computation step of Algorithm 1. With this block stochastic gradient $g_{\pi_{i}}$ , we are able to construct the Adam optimizer states for the active block $\theta_{\pi_{i}}$ , as shown in the momentum and second moment updates of Algorithm 1. Finally, we implement $K$ $\mathsf{Adam}$ steps in the inner loop of Algorithm 1 for approximately solving (1). We note that one can also apply decoupled weight decay regularization [16] to the active block $\theta_{\pi_{i}}$ .
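To make the loop structure of Algorithm 1 concrete, here is a minimal sketch on a toy problem, written in plain Python rather than a deep learning framework. Everything in it is an assumption made for illustration: the coupled quadratic objective, the `TARGET` data, the function names, and the hyperparameter defaults. It also uses exact gradients and computes the full gradient before slicing out the active block, which a memory efficient implementation would avoid.

```python
import math

TARGET = [1.0, -2.0, 3.0, 0.5]   # hypothetical toy problem data


def loss(theta):
    """Toy objective: a coupled quadratic standing in for L(theta)."""
    s = sum(theta)
    return 0.5 * sum((t - a) ** 2 for t, a in zip(theta, TARGET)) + 0.05 * s * s


def grad(theta):
    """Exact gradient of `loss`; a real run would use a mini-batch estimate."""
    s = sum(theta)
    return [(t - a) + 0.1 * s for t, a in zip(theta, TARGET)]


def badam(grad_fn, theta, blocks, K=50, block_epochs=20,
          alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Cycle over the blocks and run K Adam steps on the active block only."""
    theta = list(theta)                       # do not modify the caller's list
    for _ in range(block_epochs):
        for block in blocks:                  # one block-epoch visits every block
            # Optimizer states exist only for the active block and start at zero.
            m = [0.0] * len(block)
            v = [0.0] * len(block)
            for k in range(1, K + 1):
                full = grad_fn(theta)
                g = [full[idx] for idx in block]   # block gradient
                for j, idx in enumerate(block):
                    m[j] = beta1 * m[j] + (1 - beta1) * g[j]
                    v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
                    m_hat = m[j] / (1 - beta1 ** k)
                    v_hat = v[j] / (1 - beta2 ** k)
                    theta[idx] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
            # m and v are discarded here before the next block becomes active.
    return theta
```

A call such as `badam(grad, [0.0] * 4, blocks=[[0, 1], [2, 3]])` treats the first two and the last two coordinates as two blocks. The inactive coordinates stay fixed at their most recent values while the active block is updated, mirroring update (2).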

BAdam is not blocklizing Adam. We close this section by remarking that the block coordinate optimization framework underlying $\mathsf{BAdam}$ is the essential ingredient for achieving low memory consumption, as the $\mathsf{Adam}$ optimizer states of the active block $\theta_{\pi_{i}}$ can be progressively updated in the inner loop of Algorithm 1 using only a block storage memory. We refer to Section 2.2.1 for a detailed analysis of memory consumption. If we simply blocklize Adam, i.e., we sequentially update all the blocks in an Adam iteration, then it is unclear how to progressively update the Adam optimizer states. Therefore, it is worth emphasizing that our $\mathsf{BAdam}$ is essentially a block coordinate optimizer embedded with Adam updates as the inner solver, which is fundamentally different from simply blocklizing the Adam optimizer. Indeed, apart from the chosen Adam update rule, it is possible to propose other efficient optimization procedures for approximately solving (1).

2.2 Analysis of Memory Consumption and Time Saving of BP

2.2.1 Memory consumption analysis

We analyze the memory consumption of $\mathsf{BAdam}$ , caused by storing the model parameters and optimizer states. Let us consider an LLM with $d=M$ billion parameters. To store so many parameters, we need roughly $4M$ GB of GPU memory in floating point 32 (FP32) precision or $2M$ GB in FP16 precision. In our following discussion on memory consumption, we will use GB as the unit of memory.

Let us first analyze the memory use of $\mathsf{Adam}$ . It is typical to employ the mixed-precision training approach for accelerating the BP process. One needs to store the FP16 model parameters for the BP process, which costs $2M$ memory. For a more precise update, the optimizer also maintains a master copy of the model in FP32, which costs $4M$ memory. Next, one needs to store the stochastic gradient, momentum, and second moment in FP32 precision, costing $4M+4M+4M=12M$ memory. In total, $\mathsf{Adam}$ needs roughly $\mathbf{18}\boldsymbol{M}$ memory.

In terms of our $\mathsf{BAdam}$ , it needs to store the up-to-date model parameters (see Figure 1) in FP16 precision, which costs $2M$ memory. Importantly, since $\mathsf{BAdam}$ only updates the active block at one time, we can store the model parameters, stochastic gradient, momentum, and second moment only for the active block $\theta_{\pi_{i}}$ in FP32 precision. Note that the FP32 model parameters of the active block can be obtained by transforming their stored FP16 version to the FP32 version. Let us consider the simple case where each block of the partitioned $D$ blocks has the same size. Then, $\mathsf{BAdam}$ only needs in total

$$\mathbf{2}\boldsymbol{M}+\frac{\mathbf{16}\boldsymbol{M}}{\boldsymbol{D}}\ \text{memory}. \tag{4}$$

Thus, we regard $\mathsf{BAdam}$ as a memory efficient optimization method for LLMs. Note that we do not account for the memory required to store activations, as this is associated with the backpropagation (BP) process rather than the optimization method itself. Furthermore, gradient checkpointing can be employed to reduce the memory requirement needed for storing activations.

To provide a more comprehensive comparison, we compare the theoretical memory consumption of $\mathsf{BAdam}$ to those of $\mathsf{Adam}$ , LoRA, and LOMO in Table 1. We also provide an actual memory consumption for training Llama 2-7B in Section 3.2.
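The memory formulas of Table 1 can be evaluated directly. The sketch below is only a calculator for those formulas; the function names are hypothetical, and the choice $D=32$ in the usage note assumes one block per transformer layer of Llama 2-7B with equal block sizes, ignoring the embedding and LM head blocks. The LoRA formula is read here as an $18$-byte cost per trainable parameter times the trainable fraction $2r/m$, which is our interpretation of $\frac{36rM}{m}$ rather than a statement from the text.

```python
def adam_memory_gb(M):
    """FP16 model (2M) + FP32 master copy (4M) + FP32 gradient, momentum, second moment (12M)."""
    return 2 * M + 4 * M + 3 * 4 * M


def badam_memory_gb(M, D):
    """FP16 full model (2M) + FP32 weights, gradient, momentum, second moment of one block (16M/D)."""
    return 2 * M + 16 * M / D


def lomo_memory_gb(M, D):
    """Formula from Table 1: 2M + 2M/D."""
    return 2 * M + 2 * M / D


def lora_memory_gb(M, r, m):
    """Formula from Table 1: 2M + 36rM/m, i.e. 18 * (2r/m) * M on top of the FP16 model."""
    return 2 * M + 36 * r * M / m
```

For example, with $M=6.7$ and $D=32$, `adam_memory_gb(6.7)` evaluates to about 120.6 GB while `badam_memory_gb(6.7, 32)` evaluates to about 16.75 GB, before activations and framework overhead are counted.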

2.2.2 Time saving analysis of the BP Process

When the partitioned $D$ blocks $\{\theta_{\pi_{i}}\}$ are the natural $D$ transformer layers of LLMs, thanks to the chain rule property of the BP process, our $\mathsf{BAdam}$ can significantly reduce the computation time of BP compared to Adam, LoRA, and LOMO, after utilizing the same amount of data.

Let us consider one block-epoch of $\mathsf{BAdam}$ , meaning that it has utilized $K\cdot D$ data batches, where $K$ is defined in Algorithm 1. We consider the simple case where each data point has the same sequence length and each transformer layer has the same number of parameters, in order to ease the analysis. Recall that a BP consists of a forward pass and a backward pass. For the forward pass, $\mathsf{BAdam}$ has almost the same computational load as that of Adam and LOMO, while it has less forward computation than that of LoRA due to LoRA’s extra inference time spent on the low-rank adapter. Hence, it remains to consider the number of unit-backward-pass after utilizing $K D$ data batches, where the unit-backward-pass is defined as a backward pass through a single transformer layer. It is important to note that $\mathsf{BAdam}$ only updates the active block, and hence the number of unit-backward-pass largely depends on the depth of the active block. For instance, if the input layer or output layer is the current active block, we need $D$ unit-backward-pass or only $1$ unit-backward-pass, respectively. Thus, after one block-epoch (i.e., utilizing $K D$ data batches), $\mathsf{BAdam}$ requires

$$K(1+\cdots+D)=\frac{KD(D+1)}{\mathbf{2}}\quad\text{unit-backward-pass}. \tag{5}$$

However, Adam, LoRA, and LOMO need to backpropagate through all the $D$ transformer layers for every batch, thus requiring $KD^{2}$ unit-backward-pass after utilizing $K D$ data batches. We conclude that $\mathsf{BAdam}$ roughly saves half of the unit-backward-pass number compared to Adam, LoRA, and LOMO, after using the same amount of data.
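The count in (5) can be checked by summing the backward depth layer by layer. This is a small sketch under the same simplifying assumptions as the analysis (equal-sized layers and a layer-wise partition); the function names and the values of $K$ and $D$ in the usage note are illustrative only.

```python
def badam_unit_backward_passes(K, D):
    """Sum the backward depth over one block-epoch with a layer-wise partition."""
    total = 0
    for layer in range(1, D + 1):      # layer 1 is the input side, layer D the output side
        depth = D - layer + 1          # number of layers the backward pass traverses
        total += K * depth
    return total


def full_unit_backward_passes(K, D):
    """Adam, LoRA, and LOMO traverse all D layers for each of the K*D batches."""
    return K * D * D
```

With $K=100$ and $D=32$, the two functions return 52800 and 102400, so the ratio is $(D+1)/(2D)=33/64$, which approaches one half as $D$ grows.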

Apart from saving the number of unit-backward-pass, some of the unit-backward-pass of $\mathsf{BAdam}$ may even take less computational time compared to those of Adam, LoRA, and LOMO. Let us take the backward pass of the input layer as an example. $\mathsf{BAdam}$ does not require explicit stochastic gradient computation of the model parameters of the intermediate layers $\partial z_{l}/\partial\theta_{l}$ , where $\{z_{l}\}$ are the activations of the intermediate layers and $\{\theta_{l}\}$ are the trainable model parameters of these layers. However, the other three methods do need to compute these quantities explicitly. We refer to Table 3 for an experimental illustration of this analysis.
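The chain rule argument can be illustrated on a deliberately simple model: a scalar chain $z_{l+1}=w_{l}z_{l}$, which is an assumption made purely for this sketch and not the transformer architecture. To obtain the gradient of the output with respect to one active weight, the backward pass only propagates activation gradients through the layers above it and never forms the weight gradients of those layers.

```python
def forward(weights, x):
    """Return activations z_0, ..., z_D of the scalar chain z_{l+1} = w_l * z_l."""
    zs = [x]
    for w in weights:
        zs.append(w * zs[-1])
    return zs


def block_grad(weights, x, active):
    """Gradient of the output z_D with respect to weights[active] only.

    Returns the gradient and the number of layers the backward pass visited.
    """
    zs = forward(weights, x)
    upstream = 1.0                     # d z_D / d z_D
    unit_passes = 0
    # Walk from the output layer down to just above the active layer.
    for l in range(len(weights) - 1, active, -1):
        upstream *= weights[l]         # d z_{l+1} / d z_l, an activation gradient
        unit_passes += 1               # no weight gradient is formed for layer l
    grad = upstream * zs[active]       # d z_{active+1} / d w_active = z_active
    unit_passes += 1
    return grad, unit_passes
```

For a chain of $D$ weights, the input-side weight (index 0) needs $D$ unit passes and the output-side weight (index $D-1$) needs one, matching the depth dependence described in the text.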

In summary, $\mathsf{BAdam}$ saves computational load of the BP process compared to Adam, LoRA, and LOMO, after training on the same amount of data. We will demonstrate it through experiments in Section 3.2.

3 Experiment Results

In this section, we evaluate our proposed $\mathsf{BAdam}$ in terms of several aspects, namely, the convergence of loss with respect to data pass, the wall-clock running time, the memory profile, and the downstream task performance.

3.1 Experiment Setup

We consider both natural language generation (NLG) and natural language understanding (NLU) tasks. For NLG, we adopt the Alpaca-GPT4 dataset [20], which consists of 52k instruction-following examples generated by GPT-4, using prompts from the Alpaca dataset [21]. Our implementation is based on [30]. We perform supervised finetuning (SFT) on the Alpaca-GPT4 dataset for the Llama 2-7B model [22], which contains approximately 6.7 billion parameters. The resulting model is then evaluated on MT-bench [29] to test its downstream performance. As for NLU, we finetune the RoBERTa-large model [15] with 355 million parameters on the SuperGLUE benchmark [24] with a particular focus on 6 tasks, i.e., BoolQ, COPA, MultiRC, RTE, WiC, and WSC, as they are selected in [17, 18]. We evaluate the NLU downstream performance over the test dataset of the 6 tasks.

For each task, we compare $\mathsf{BAdam}$ with existing approaches including 1) LoRA [8], which adds trainable low-rank adapters to the original pretrained base model, 2) LOMO [17], which executes the stochastic gradient descent (SGD) update on the fly when performing the BP process, so that one does not need to physically store the stochastic gradient of the full trainable model parameters, and 3) Adam [9], which is the standard optimizer for full parameter training. For training Llama 2-7B on the Alpaca-GPT4 dataset, we set the learning rate to 1e-5 for all the methods. The batch size is set to 8, and we apply 15 steps of gradient accumulation for all the methods, resulting in an effective batch size of 120. Note that LOMO does not support gradient accumulation as it has to perform the update during the backward process, and hence its effective batch size is 8. For a fair comparison, we will count 15 actual iterations of LOMO as one iteration in the sequel. For tasks in SuperGLUE, the learning rate is set to 1e-5 and the batch size is set to 16 for the tested methods. For all the experiments, we choose the rank of LoRA to be $100$ and use low-rank adaptations for all the trainable matrices rather than only the query and key matrices. In this manner, the number of trainable parameters for LoRA is nearly the same as that for $\mathsf{BAdam}$ at each iteration, ensuring a fairer comparison.
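The relation between batch size, gradient accumulation, and the iteration counts quoted in this section is simple arithmetic. The sketch below assumes exactly 52,000 training examples as a stand-in for the "52k" figure, so its outputs are only approximate; the function names are hypothetical.

```python
def effective_batch_size(batch_size, accumulation_steps):
    """Number of examples contributing to one optimizer update."""
    return batch_size * accumulation_steps


def iterations_per_epoch(num_examples, batch_size, accumulation_steps):
    """Optimizer updates needed for one pass over the data (not rounded)."""
    return num_examples / effective_batch_size(batch_size, accumulation_steps)
```

Under this assumption one data epoch corresponds to roughly 433 iterations at an effective batch size of 120, which is in line with the order of magnitude of the iteration counts reported for the Llama 2-7B experiments.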

Due to the limitation of GPU memory, the performance of Adam is only reported for the RoBERTa-large model. Throughout all the experiments for training Llama 2-7B, we enable gradient checkpointing [4] to reduce the memory cost caused by storing activations for all the tested optimization methods, so that a larger batch size can be applied.

3.2 Experiments on Llama 2-7B using a Single RTX3090-24GB GPU

In this subsection, we conduct instruction-tuning for the Llama 2-7B model on the Alpaca-GPT4 dataset. We illustrate the convergence behaviors of different methods. Additionally, we evaluate the downstream performance of the instruction-tuned models on MT-bench. Note that all the experiments in this subsection are conducted using a single RTX3090-24GB GPU.

Convergence performance. We first display the convergence properties of the training loss on the Alpaca-GPT4 dataset for different methods; see Figure 2. On the one hand, it can be observed that $\mathsf{BAdam}$ converges faster in terms of iterations and achieves a lower online training loss compared to LoRA.

Figure 2: Online training loss versus data pass.

Such a faster convergence is indeed expected, as LoRA confines its parameter search in a lower dimensional subspace. On the other hand, it is clear that LOMO has a worse convergence behavior compared to $\mathsf{BAdam}$ and LoRA. This is reasonable due to the fact that LOMO only implements the suboptimal SGD update, while the other two methods employ Adam updates.

In terms of online training loss versus running time, our $\mathsf{BAdam}$ should have an even clearer advantage over the other two methods. Such a time saving of $\mathsf{BAdam}$ mainly comes from the backward stage in computing stochastic gradients, which is rigorously analyzed in Section 2.2.2. In the following, we will demonstrate the exact wall-clock running time comparison through experiments to clarify our claim.

Wall-clock running time comparison. Time consumption of each method mainly consists of three parts, i.e., forward, backward, and update. Among them, update time is negligible, while the backward time is the dominant part since the computation of the gradient of the trainable model parameters involves intense Jacobian-vector product calculations. Note that different data points have different sequence lengths, which directly affects the running time. To mitigate this issue, we let each method go through 2 data epochs (around $850$ iterations) and display the total running time to enable a fair and comprehensive comparison. The results are shown in Table 2. The forward times of $\mathsf{BAdam}$ and LOMO are comparable, while LoRA takes roughly twice as much time in this stage. This is predictable since LoRA needs to pass through extra low-rank adapters during inference. For backward time, our $\mathsf{BAdam}$ takes only a little over half of the time of LoRA and LOMO (5.54 hours versus 9.45 and 9.71 hours). Such a time saving of the backward stage is analyzed in Section 2.2.2. It is also worth emphasizing that the backward time includes the re-forward time for all methods due to gradient checkpointing, which actually weakens the running time advantage of our $\mathsf{BAdam}$ .

In Table 3, we conduct tailored experiments to further support our analysis in Section 2.2.2. It can be observed that: 1) backward for "Output layer only" is almost time-free, as it requires only 1 unit-backward-pass as discussed in Section 2.2.2, 2) backward for "All layers" takes significantly more time, as it has to implement $D$ unit-backward-pass and 3) backward for "Input layer only" in our BAdam actually takes less time than $D$ unit-backward-pass or backward for "All layers", as the former scheme does not need to compute the stochastic gradients of the model parameters of the intermediate layers, corroborating the analysis in Section 2.2.2.

Table 2: Time spent on forward, backward, and update for 2 data epochs for finetuning Llama 2-7B. Note that LOMO updates on the fly and hence it does not have update time. The proposed $\mathsf{BAdam}$ significantly reduces the backward time. Importantly, we note that the backward time also contains the re-forward time for all the methods due to gradient checkpointing, which actually diminishes the running time advantage of our $\mathsf{BAdam}$ .

Method	Forward	Backward	Update
LoRA	2.48 hours	9.45 hours	56 seconds
LOMO	1.35 hours	9.71 hours	—
BAdam	1.16 hours	5.54 hours	39 seconds

Backward scheme	Backward time
All layers	5.180 seconds
Input layer only	3.903 seconds
Output layer only	0.053 seconds

Table 3: Time spent on different backward scheme in one backward pass with batch size 8. The result is averaged over 100 backward passes for finetuning Llama 2-7B. Note that the "Input layer only" backward scheme represents not computing the stochastic gradients of the trainable parameters of other intermediate transformer layers. Again, the backward time contains the re-forward time due to gradient checkpointing.
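The ratios discussed in the text can be recomputed from the reported numbers. The sketch below only transcribes the forward and backward columns of Table 2 and the entries of Table 3; the dictionary and function names are hypothetical.

```python
# (forward hours, backward hours) from Table 2
TABLE2_HOURS = {"LoRA": (2.48, 9.45), "LOMO": (1.35, 9.71), "BAdam": (1.16, 5.54)}
# per-pass backward seconds from Table 3
TABLE3_SECONDS = {"all": 5.180, "input_only": 3.903, "output_only": 0.053}


def backward_ratio(method, baseline):
    """Backward time of `method` relative to `baseline` over 2 data epochs."""
    return TABLE2_HOURS[method][1] / TABLE2_HOURS[baseline][1]


def forward_plus_backward(method):
    """Total forward and backward hours, ignoring the small update time."""
    return sum(TABLE2_HOURS[method])


def scheme_ratio(scheme):
    """Per-pass backward time of a scheme relative to the all-layers scheme."""
    return TABLE3_SECONDS[scheme] / TABLE3_SECONDS["all"]
```

On these numbers, the backward time of $\mathsf{BAdam}$ is a bit under 60% of that of LoRA and LOMO, and the input-layer-only backward pass costs about three quarters of a full backward pass. Both figures include the re-forward time introduced by gradient checkpointing.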

Memory consumption. We now report the actual memory consumption of $\mathsf{BAdam}$ for instruction-tuning the Llama 2-7B model. We also list the memory costs of LoRA, LOMO, and Adam. The results can be found in Table 4. Due to limited memory resources, the memory consumption for Adam is estimated rather than tested. The batch size is set to 8, and the maximum sequence length of the tested input is 728. In order to finetune Llama 2-7B on a single RTX3090, we apply gradient checkpointing [4] to avoid caching the full activations for all the tested methods. In particular, the checkpointing technique is applied for every layer so that we only store each layer’s input and re-forward through this layer starting at the stored input when performing backward for this layer’s parameters.

The actual peak memory consumption during training is displayed in the "Memory consumption" column of Table 4. We also report the memory costs for storing the FP16 full model, FP32 gradient, FP32 optimizer states, and FP16 activation. It is easy to observe that all of $\mathsf{BAdam}$, LoRA, and LOMO are able to finetune Llama 2-7B using a single RTX3090-24GB GPU.

One can observe that the actual total memory consumption (last column of Table 4) is somewhat higher than the sum of the listed quantities in Table 4, i.e., "Model", "Gradient", etc. The additional memory cost is due to the memory cache pre-allocated by PyTorch and to other buffers that hold intermediate computation results, which can be further reduced through implementation-level improvements.

Method	Model	Gradient	Optimizer states	Activation	Memory consumption
Adam	13.4GB	13.4GB	80.4GB	2.2GB+	109.4GB+
LoRA	14.0GB	1.0GB	2.0GB	2.2GB+	22.1GB
LOMO	13.4GB	0.8GB	–	1.6GB+	18.8GB
BAdam	13.4GB	0.8GB	1.6GB	2.2GB+	21.8GB

Table 4: Peak memory costs of using a mixed precision approach to finetune Llama 2-7B with batch size 8 using gradient checkpointing. The input sequence length is around 500. Here, the "Memory consumption" item in the last column represents the actual total memory consumption during training, including the memory cache pre-allocated by PyTorch and additional buffers that are not displayed in the table. Note that the memory costs of Adam are estimated rather than tested due to the limitation of memory resources.
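The difference between the reported peak and the sum of the listed components can be read off Table 4 with a few lines of arithmetic. The snippet below only re-uses the numbers in the table. The Adam row is left out because its total is an estimate, and entries such as "2.2GB+" are lower bounds that are taken at face value here, so the result is an upper bound on the unlisted portion. The name `unlisted_overhead_gb` is hypothetical.

```python
# Rows of Table 4 in GB:
# (model, gradient, optimizer states, activation), reported peak.
TABLE4 = {
    "LoRA": ([14.0, 1.0, 2.0, 2.2], 22.1),
    "LOMO": ([13.4, 0.8, 0.0, 1.6], 18.8),
    "BAdam": ([13.4, 0.8, 1.6, 2.2], 21.8),
}


def unlisted_overhead_gb(components, peak):
    """Reported peak minus the sum of the listed components."""
    return peak - sum(components)


for method, (components, peak) in TABLE4.items():
    listed = sum(components)
    extra = unlisted_overhead_gb(components, peak)
    print(f"{method}: listed {listed:.1f} GB, peak {peak:.1f} GB, unlisted {extra:.1f} GB")
```

For all three tested methods, the unlisted portion comes out at roughly 3 to 4 GB.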

Method	MT-bench score
Vanilla Llama 2-7B	3.93
LOMO	4.21
LoRA	4.91
BAdam	5.06

Table 5: MT-bench scores for instruction-tuning Llama 2-7B on Alpaca-GPT4 by different optimization methods.

Downstream performance evaluation using MT-bench. To illustrate the tuned models’ downstream performance, we evaluate the MT-bench scores of the instruction-tuned models obtained by the different optimization methods. The tested models are the outputs obtained after running each method for around 3k iterations (around $7$ data epochs). The results are shown in Table 5. We can observe that all the optimization methods lead to an improved MT-bench score compared to the pretrained base model. In addition, LoRA and $\mathsf{BAdam}$ outperform LOMO by a large margin (4.91 and 5.06 versus 4.21), which we again attribute to the fact that the latter employs only the SGD optimizer. Moreover, $\mathsf{BAdam}$ has a slightly better score than that achieved by LoRA (5.06 versus 4.91), illustrating the promising performance of our proposed method. We note that our implementation of $\mathsf{BAdam}$ is not yet optimized. We believe that through code-level optimization and more careful choices of hyperparameters, we can further improve the performance of $\mathsf{BAdam}$.

3.3 BAdam versus Adam on Medium-sized Language Model

Due to limited memory resources, we compare the performance of our $\mathsf{BAdam}$ with that of Adam on the medium-sized language model RoBERTa-large for the SuperGLUE benchmark. All the experiments are conducted using a single RTX3090-24GB GPU.

We display the test results on 6 tasks from the SuperGLUE benchmark, i.e., BoolQ, COPA, WSC, RTE, MultiRC, and WiC. We choose these tasks since they are the ones selected in [17, 18]. The results can be found in Table 6. It can be observed that $\mathsf{BAdam}$ outperforms LoRA in 5 out of the 6 tasks, the exception being RTE. Furthermore, $\mathsf{BAdam}$ is comparable to Adam on most tasks: it is ahead on COPA and MultiRC and within 0.03 on BoolQ and WSC, while a larger gap remains on RTE and WiC. Based on these results, we conclude that $\mathsf{BAdam}$ is capable of narrowing the performance gap with Adam. Consequently, we conjecture that $\mathsf{BAdam}$ has the potential to perform nearly as well as Adam, even when tuning larger models.

Method	BoolQ	COPA	WSC	RTE	MultiRC	WiC
Adam	0.86	0.59	0.68	0.87	0.76	0.70
LoRA	0.81	0.56	0.62	0.79	0.69	0.59
BAdam	0.85	0.69	0.65	0.76	0.77	0.64

Table 6: SuperGLUE benchmark scores for finetuning RoBERTa-large using different optimization methods.
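The task-by-task comparison in Table 6 can be reproduced directly from the reported scores. The snippet below only re-uses the numbers in the table. The helper names `wins` and `mean_score` are hypothetical, and the unweighted mean is an illustrative summary rather than a metric reported in this paper.

```python
TASKS = ["BoolQ", "COPA", "WSC", "RTE", "MultiRC", "WiC"]
SCORES = {
    "Adam": [0.86, 0.59, 0.68, 0.87, 0.76, 0.70],
    "LoRA": [0.81, 0.56, 0.62, 0.79, 0.69, 0.59],
    "BAdam": [0.85, 0.69, 0.65, 0.76, 0.77, 0.64],
}


def wins(a, b):
    """Tasks on which method a scores strictly higher than method b."""
    return [t for t, x, y in zip(TASKS, SCORES[a], SCORES[b]) if x > y]


def mean_score(method):
    """Unweighted mean over the six tasks (illustrative only)."""
    return sum(SCORES[method]) / len(TASKS)


print("BAdam > LoRA on:", wins("BAdam", "LoRA"))
print("BAdam > Adam on:", wins("BAdam", "Adam"))
for method in SCORES:
    print(method, round(mean_score(method), 3))
```

On these numbers, $\mathsf{BAdam}$ is ahead of LoRA on every task except RTE and ahead of Adam on COPA and MultiRC.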

4 Related Works

Finetuning LLMs in limited-resource scenarios has become an important research topic. We present a review of the relevant literature below. Given the extensive and rapidly growing body of work in this field, it is important to note that the references we include here are not exhaustive.

Block coordinate optimization. Block coordinate optimization is a well-established algorithmic approach in the field of optimization [23, 19, 25], with a history that can be traced back to the very origins of the discipline. We refer to [3, 28] and the references therein for some recent developments from a theoretical point of view. Such a scheme is particularly suitable for tackling large-scale problems where the large-scale feature is characterized by a huge number of trainable parameters. In the context of finetuning LLMs, we encounter precisely this type of challenge, as the GPU memory requirements are substantial due to the huge number of parameters that need to be trained. Our $\mathsf{BAdam}$ leverages this critical observation by decomposing a huge-scale problem into a series of much lower dimensional ones. Consequently, we highlight that the underlying structure of $\mathsf{BAdam}$ is rooted in the block coordinate optimization framework, with the Adam optimizer being utilized to effectively solve the emerging lower dimensional subproblems.
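The block coordinate structure with Adam as the inner solver can be illustrated on a small toy problem. The sketch below is not the released $\mathsf{BAdam}$ code. It assumes a coupled quadratic objective, a cyclic block order, and Adam states that are re-initialized each time a block becomes active. The objective, the hyperparameter values, and the names `adam_on_block` and `block_coordinate_adam` are hypothetical.

```python
import math

TARGETS = [1.0, -2.0, 3.0, 0.5]
RHO = 0.1  # coupling strength of the toy objective


def loss(x):
    """Toy objective: 0.5 * sum((x_i - t_i)^2) + 0.5 * RHO * (sum x_i)^2."""
    fit = 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, TARGETS))
    return fit + 0.5 * RHO * sum(x) ** 2


def block_grad(x, block):
    """Partial gradient of `loss` with respect to the coordinates in `block`."""
    total = sum(x)
    return [(x[i] - TARGETS[i]) + RHO * total for i in block]


def adam_on_block(x, block, steps, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one block in place; all other coordinates stay frozen."""
    m = [0.0] * len(block)  # optimizer states exist only for the active block
    v = [0.0] * len(block)
    for t in range(1, steps + 1):
        g = block_grad(x, block)
        for j, i in enumerate(block):
            m[j] = beta1 * m[j] + (1 - beta1) * g[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)
            v_hat = v[j] / (1 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def block_coordinate_adam(x0, blocks, epochs=5, steps=100, lr=0.05):
    """Cycle through the blocks, solving each subproblem approximately."""
    x = list(x0)
    for _ in range(epochs):
        for block in blocks:
            adam_on_block(x, block, steps, lr)
    return x


x_start = [0.0] * 4
x_final = block_coordinate_adam(x_start, blocks=[[0, 1], [2, 3]])
print(loss(x_start), loss(x_final))
```

At any time the first and second moment estimates are held for a single block only, which is the source of the memory saving in this framework.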

Parameter efficient finetuning (PEFT). An effective strategy for finetuning LLMs is to train a small number of trainable parameters that are added to the original base model, while keeping the majority of the pretrained parameters frozen. Numerous approaches have been proposed and studied along this line of research. For instance, adapter tuning only finetunes the inserted small modules between layers called adapters [7]. Prompt-tuning / Prefix-tuning [12, 13] attaches additional trainable prefix tokens to the input and/or hidden layers, while keeping the base model unchanged. Interested readers are referred to [6] for a unified framework and a comprehensive comparison of these methods. Another prevalent method for PEFT is to model the incremental update of the weight matrices with low dimensional and parameter efficient structures. One such notable example is the low-rank adaptation (LoRA) [8], which models the increment to the base model as a product of two significantly lower dimensional trainable low-rank matrices. Subsequent research on LoRA has aimed at extending its rank constraints [14, 26], further reducing the number of trainable parameters [10, 11], decreasing memory usage through quantization [5], etc. Presently, LoRA-based methods are commonly employed for finetuning LLMs with limited memory resources.
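To make the saving of the low-rank parameterization concrete, consider a weight matrix of size $d \times k$ whose increment is modeled as the product of a $d \times r$ factor and an $r \times k$ factor. The snippet below counts the trainable parameters in both cases. The sizes $d = k = 4096$ and $r = 8$ are illustrative assumptions and are not the settings used in the experiments of this paper. The function names are hypothetical.

```python
def full_params(d, k):
    """Number of entries of a dense d x k weight matrix."""
    return d * k


def lora_trainable_params(d, k, r):
    """Entries of the two low-rank factors: one d x r and one r x k."""
    return r * (d + k)


d, k, r = 4096, 4096, 8  # illustrative sizes only
dense = full_params(d, k)
low_rank = lora_trainable_params(d, k, r)
print(dense, low_rank, f"{low_rank / dense:.4%}")
```

With these sizes the two factors hold well under one percent of the entries of the dense matrix, which is why the gradient and optimizer-state memory of LoRA is small compared to that of full parameter Adam.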

Memory efficient full parameter finetuning. Though PEFT methods effectively reduce memory consumption by decreasing the number of trainable parameters, they may yield suboptimal performance for downstream tasks compared to full parameter finetuning [27], as PEFT constrains parameter search to a much lower dimensional subspace. To conduct full parameter finetuning of LLMs with limited memory, the work [17] proposes LOMO, which efficiently leverages the BP process to update parameters on the fly in the process of computing stochastic gradients. Consequently, LOMO helps to execute SGD for full parameter finetuning without physically storing the stochastic gradients, significantly reducing memory consumption. However, it is worth emphasizing that SGD typically converges more slowly than Adam, and is often deemed suboptimal compared to Adam for training neural networks. Unfortunately, it is unclear how to extend the idea of LOMO to the Adam optimizer. On another front, MeZO [18] proposes to approximate SGD by using only the forward pass. The idea of MeZO derives from zeroth-order optimization, which utilizes function value differences to approximate the stochastic gradients of the trainable model parameters. Hence, MeZO eliminates the need to perform the backward pass.
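The zeroth-order idea can be sketched with a two-point estimator that uses only function evaluations. The code below is a generic illustration of this principle and is not the MeZO implementation of [18]. The Gaussian perturbation, the step size `eps`, and the name `zo_gradient` are assumptions of the sketch.

```python
import random


def zo_gradient(f, x, eps=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate of f at x.

    Draws a random direction z and returns
    ((f(x + eps*z) - f(x - eps*z)) / (2*eps)) * z together with z.
    Only two evaluations of f are needed; no backward pass is used.
    """
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * zi for xi, zi in zip(x, z)]
    x_minus = [xi - eps * zi for xi, zi in zip(x, z)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * zi for zi in z], z


def quadratic(x):
    """Toy objective with known gradient 2 * x."""
    return sum(xi * xi for xi in x)


point = [1.0, -2.0, 0.5]
estimate, direction = zo_gradient(quadratic, point, rng=random.Random(0))
print(estimate)
```

The scalar factor in the estimate approximates the directional derivative of the objective along `direction`, so a single estimate is noisy and points along one random direction only.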

Compared to existing methods, our $\mathsf{BAdam}$ facilitates full parameter finetuning with limited memory resources. Notably, it can modestly outperform LoRA in terms of downstream performance with less running time, and it shows potential in bridging the performance gap when compared to full parameter finetuning using Adam. Consequently, we believe that $\mathsf{BAdam}$ holds promise for efficient finetuning of LLMs under limited memory constraints and may serve as a viable alternative to LoRA.

5 Conclusion

In this work, we have proposed the $\mathsf{BAdam}$ optimizer, which is built upon the block coordinate optimization framework with the integration of Adam steps as the inner solver. $\mathsf{BAdam}$ offers a memory efficient approach for finetuning large language models. We have conducted instruction-tuning for the Llama 2-7B model on the Alpaca-GPT4 dataset using a single RTX3090-24GB GPU. The results indicate that $\mathsf{BAdam}$ improves both convergence speed and running time compared to LoRA and LOMO. Further downstream performance assessments using MT-bench show that $\mathsf{BAdam}$ modestly surpasses LoRA and more substantially outperforms LOMO. When compared with the Adam optimizer for finetuning RoBERTa-large on the SuperGLUE benchmark, $\mathsf{BAdam}$ has shown its ability to narrow the performance gap with Adam.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[3] Xufeng Cai, Chaobing Song, Stephen Wright, and Jelena Diakonikolas. Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In International Conference on Machine Learning, pages 3469–3494. PMLR, 2023.
[4] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
[5] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024.
[6] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. International Conference on Learning Representations, 2021.
[7] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[8] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
[9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Soroush Abbasi Koohpayegani, KL Navaneet, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Networks as linear combination of low rank random basis. In The Twelfth International Conference on Learning Representations, 2024.
[11] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, 2024.
[12] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
[13] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4582–4597, 2021.
[14] Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReLoRA: High-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations, 2024.
[15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[17] Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning for large language models with limited resources. arXiv preprint arXiv:2306.09782, 2023.
[18] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 2023.
[19] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[20] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.
[21] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
[22] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[23] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109:475–494, 2001.
[24] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.
[25] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[26] Wenhan Xia, Chengwei Qin, and Elad Hazan. Chain of LoRA: Efficient fine-tuning of language models via residual learning. arXiv preprint arXiv:2401.04151, 2024.
[27] Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetuning: The effect of data, model and finetuning method. The Twelfth International Conference on Learning Representations, 2024.
[28] Lei Zhao, Ding Chen, Daoli Zhu, and Xiao Li. Randomized coordinate subgradient method for nonsmooth optimization. arXiv preprint arXiv:2206.14981, 2022.
[29] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2023.
[30] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372, 2024.
