Federated Learning Meets Multiobjective Optimization
Abstract
Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multiobjective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against stateoftheart.
Keywords Pareto optimization, Distributed algorithms, Federated learning, Edge computing, Machine learning, Neural networks.
1 Introduction
Deep learning has achieved impressive successes on a number of domain applications, thanks largely to innovations on algorithmic and architectural design, and equally importantly to the tremendous amount of computational power one can harness through GPUs, computer clusters and dedicated software and hardware. Edge devices, such as smart phones, tablets, routers, car devices, home sensors, etc., due to their ubiquity and moderate computational power, impose new opportunities and challenges for deep learning. On the one hand, edge devices have direct access to privacy sensitive data that users may be reluctant to share (with say data centers), and they are much more powerful than their predecessors, capable of conducting a significant amount of ondevice computations. On the other hand, edge devices are largely heterogeneous in terms of capacity, power, data, availability, communication, memory, etc., posing new challenges beyond conventional inhouse training of machine learning models. Thus, a new paradigm, known as federated learning (FL) (McMahan et al.,, 2017) that aims at harvesting the prospects of edge devices, has recently emerged. Developing new FL algorithms and systems on edge devices has since become a hot research topic in machine learning.
From the beginning of its birth, FL has close ties to conventional distributed optimization. However, FL emerged from the pressing need to address news challenges in the mobile era that existing distributed optimization algorithms were not designed for per se. We mention the following characteristics of FL that are most relevant to our work, and refer to the excellent surveys (Li et al.,, 2019; Yang et al.,, 2019; Kairouz et al.,, 2019) and the references therein for more challenges and applications in FL.

•
NonIID: Each user’s data can be distinctively different from every other user’s, violating the standard iid assumption in statistical learning and posing significant difficulty in formulating the goal in precise mathematical terms (Mohri et al.,, 2019). The distribution of user data is often severely unbalanced.

•
Limited communication: Communication between each user and a central server is constrained by network bandwidth, device status, user participation incentive, etc., demanding a thoughtful balance between computation (on each user device) and communication.

•
Privacy: Protecting user (data) privacy is of uttermost importance in FL. It is thus not possible to share user data (even to a cloud arbitrator), which adds another layer of difficulty in addressing the previous two challenges.

•
Fairness: As argued forcibly in recent work (e.g., Mohri et al., 2019; Li et al., 2020b ), ensuring fairness among users has become another serious goal in FL, as it largely determines users’ willingness to participate and ensures some degree of robustness against malicious user manipulations.

•
Robustness: FL algorithms are eventually deployed in the wild hence subject to malicious attacks. Indeed, adversarial attacks (e.g., Bagdasaryan et al., 2020; Sun et al., 2019; Bhagoji et al., 2019) have been constructed recently to reveal vulnerabilities of FL systems against malicious manipulations at the user side.
In this work, motivated from the last two challenges above, i.e.fairness and robustness, we propose a new algorithm FedMGDA+ that complements and improves existing FL systems. FedMGDA+ is based on multiobjective optimization and is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and most importantly refrains from sacrificing the performance of any participating user. We demonstrate the superior performance of FedMGDA+ under a variety of metrics including accuracy, fairness, and robustness.
We summarize our contributions as follows:

•
In §3, based on the proximal average we provide a novel, unifying and revealing interpretation of existing FL practices.

•
In §4, we summarize some background on multiobjective optimization and point out its connections to existing FL algorithms. We believe this new perspective will yield more fruitful exchanges between the two fields in the future.

•
In §5, we propose FedMGDA+ that complements existing FL systems while taking robustness and fairness explicitly into its algorithmic design. We prove that FedMGDA+ converges to a Pareto stationary solution under mild assumptions.

•
In §6, we perform extensive experiments to validate the competitiveness of FedMGDA+ under a variety of desirable metrics, and to illustrate the respective pros and cons of our and alternative algorithms.
We discuss more related work in §2 and we conclude in §7 with some future directions.
To facilitate reproducibility, we have released our code at: https://github.com/watml/FedMGDA.
2 Related Work
In this section we give a brief review of some recent work that is directly related to ours and put our contributions in context. To start with, McMahan et al., (2017) proposed the first FL algorithm known as “Federated Averaging” (a.k.a., FedAvg), which is a synchronous update scheme that proceeds in several rounds. At each round, the central server sends the current global model to a subset of users, each of which then uses its respective local data to update the received model. Upon receiving the updated local models from users, the server performs aggregation, such as simple averaging, to update the global model. For more discussion on different averaging schemes, see Li et al., (2020). Li et al., 2020a extended FedAvg to better deal with noni.i.d. distribution of data, by adding a “proximal regularizer” to the local loss functions and minimizing the Moreau envelope function for each user. The resulting algorithm FedProx, as pointed out in §3, is a randomized version of the proximal average algorithm in Yu, (2013) and reduces to FedAvg when regularization diminishes.
Analysing FedAvg has been a challenging task due to its flexible updating scheme, partial user participation, and noniid distribution of client data Li et al., 2020a . The first theoretical analysis of FedAvg for strongly convex and smooth problems with iid and noniid data appeared in Stich, (2019) and Li et al., (2020), respectively, where the effect of different sampling and averaging schemes on the convergence rate of FedAvg was also investigated, leading to the conclusion that such effect becomes particularly important when the dataset is unbalanced and noniid distributed. In Huo et al., (2020), FedAvg was analyzed for nonconvex problems, where FedAvg was formulated as a stochastic gradientbased algorithm with biased gradients, and the convergence of FedAvg with decaying step sizes to stationary points was proved. Moreover, Huo et al., (2020) proposed FedMom, a serverside acceleration based on Nesterov’s momentum, and proved again its convergence to stationary points. Lately, Reddi et al., (2020) proposed and analyzed federated versions of several popular adaptive optimizers (e.g. ADAM). They generalize the framework of FedAvg by decoupling the FL update scheme into server optimizer and client optimizer. Interestingly, same as us, Reddi et al., (2020) also observed the importance of learning rate decays on both clients and server.
Recently, an interesting work by Pathak and Wainwright, (2020) demonstrated theoretically that fixed points reached by FedAvg and FedProx (if exist) need not be stationary points of the original optimization problem, even in convex settings and with deterministic updates. To address this issue, they proposed FedSplit to restore the correct fixed points. It still remains open, though, if FedSplit can still converge to the correct fixed points under asynchronous and stochastic user updates, both of which are widely adopted in practice and studied here.
Ensuring fairness among users has become a serious goal in FL since it largely determines users’ willingness to participate in the training process. Mohri et al., (2019) argued that existing FL algorithms can lead to federated models that are biased toward different users. To solve this issue, Mohri et al., (2019) proposed agnostic federated learning (AFL) to improve fairness among users. AFL considers the target distribution as a weighted combination of the user distributions and optimizes the centralized model for the worsecase realization, leading to a saddlepoint optimization problem which was solved by a fast stochastic optimization algorithm. On the other hand, based on fair resource allocation in wireless networks, Li et al., 2020b proposed qfair federated learning (qFFL) to achieve more uniform test accuracy across users. Li et al., 2020b further proposed qFedAvg as a communication efficient algorithm to solve qFFL. However, both AFL and qFedAvg do not explicitly encourage user participation and they suffer from adversarial attacks while our algorithm FedMGDA+ is designed to be fair among participants and robust against both additive and multiplicative attacks.
FedAvg relies on a coordinatewise averaging of local models to update the global model. According to Wang et al., 2020b , in neural network (NN) based models, such coordinatewise averaging might lead to suboptimal results due to the permutation invariance of NN parameters. To address this issue, Yurochkin et al., (2019) proposed probabilistic federated neural matching (PFNM), which is only applicable to fully connected feedforward networks. The recent work (Wang et al., 2020b, ) proposed federated matched averaging (FedMA) as a layerwise extension of PFNM to accommodate CNNs and LSTMs. However, the Bayesian nonparametric mechanism in PFNM and FedMA may be vulnerable to model poisoning attack (Bagdasaryan et al.,, 2020; Bhagoji et al.,, 2019; Wang et al., 2020a, ), while some simple defences, such as norm thresholding and differential privacy, were discussed in Sun et al., (2019). We note that these ideas are complementary to FedMGDA+ and we plan to investigate possible integrations of them in future work.
Lastly, we note that there is significant interest in standardizing the benchmarks, protocols and evaluations in FL, see for instance (Caldas et al.,, 2018; He et al.,, 2020). We have spent significant efforts in adhering to the suggested rules there, by reporting on common datasets, open sourcing our code and including all experimental details.
3 Problem Setup
We recall the federated learning (FL) framework of McMahan et al., (2017) and point out a simple interpretation that seemingly unifies different implementations. We consider FL with $m$ users (edge devices), where the $i$th user is interested in minimizing a function ${f}_{i}:{\mathbb{R}}^{d}\to \mathbb{R},i=1,\mathrm{\dots},m$, defined on a shared model parameter $\mathbf{w}\in {\mathbb{R}}^{d}$. Typically, each user function ${f}_{i}$ also depends on the respective user’s local (private) data ${\mathcal{D}}_{i}$. The main goal in FL is to collectively and efficiently optimize individual objectives $\{{f}_{i}\}$ while meeting challenges such as those mentioned in the Introduction (§1): noniid distribution of user data, limited communication, user privacy, fairness, robustness, etc..
McMahan et al., (2017) proposed FedAvg to optimize the arithmetic average of individual user functions:
$\underset{\mathbf{w}\in {\mathbb{R}}^{d}}{\mathrm{min}}{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{0}(\mathbf{w}),\text{where}{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{0}(\mathbf{w}):={\displaystyle \sum _{i=1}^{m}}{\lambda}_{i}{f}_{i}(\mathbf{w}).$  (1) 
The weights ${\lambda}_{i}$ need to be specified beforehand. Typical choices include the dataset size at each user, the “importance” of each user, or simply uniform, i.e.${\lambda}_{i}\equiv 1/m$. FedAvg works as follows: At each round, a (random) subset of users is selected, each of which performs $k$ epochs of local (full or minibatch) gradient descent:
$\text{for all}i\text{in parallel},{\mathbf{w}}^{i}\leftarrow {\mathbf{w}}^{i}\eta \nabla {f}_{i}({\mathbf{w}}^{i}),$  (2) 
and then the weights are averaged at the server side:
$\mathbf{w}\leftarrow {\displaystyle \sum _{i}}{\lambda}_{i}{\mathbf{w}}^{i},$  (3) 
which is finally broadcast to the users in the next round. The number of local epochs $k$ turns out to be a key factor. Setting $k=1$ amounts to solving (1) by the usual gradient descent algorithm, while setting $k=\mathrm{\infty}$ (and assuming convergence for each local function ${f}_{i}$) amounts to (repeatedly) averaging the respective minimizers of ${f}_{i}$’s. We now give a new interpretation of FedAvg that yields insights on what it optimizes with an intermediate $k$.
Our interpretation is based on the proximal average (Bauschke et al.,, 2008). Recall that the Moreau envelope and proximal map of a convex^{1}^{1}1For nonconvex functions, similar results hold once we address multivaluedness of the proximal map, see Yu et al., (2015). function $f$ is defined respectively as:
${\U0001d5ac}_{f}^{\eta}(\mathbf{w})$  $=\underset{\mathbf{x}}{\mathrm{min}}\frac{1}{2\eta}{\Vert \mathbf{x}\mathbf{w}\Vert}_{2}^{2}+f(\mathbf{x}),$  (4)  
${\U0001d5af}_{f}^{\eta}(\mathbf{w})$  $=\underset{\mathbf{x}}{\text{argmin}}\frac{1}{2\eta}{\Vert \mathbf{x}\mathbf{w}\Vert}_{2}^{2}+f(\mathbf{x}).$  (5) 
Given a set of convex functions $\mathbf{f}=({f}_{1},\mathrm{\dots},{f}_{m})$ and positive weights $\bm{\lambda}=({\lambda}_{1},\mathrm{\dots},{\lambda}_{m})$ that sum to 1, we define the proximal average as the unique function ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}$ such that ${\U0001d5af}_{{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}}^{\eta}={\sum}_{i}{\lambda}_{i}{\U0001d5af}_{{f}_{i}}^{\eta}.$ In other words, the proximal map of the proximal average is the average of proximal maps. More concretely, Bauschke et al., (2008) gave the following explicit, albeit complicated, formula for the proximal average:
${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}(\mathbf{w})$  $=\underset{{\mathbf{w}}_{1},\mathrm{\dots},{\mathbf{w}}_{m}}{\mathrm{min}}{\displaystyle \sum _{i=1}^{m}}{\lambda}_{i}\left[{f}_{i}({\mathbf{w}}_{i})+\frac{1}{2\eta}{\Vert {\mathbf{w}}_{i}\Vert}_{2}^{2}\right]\frac{1}{2\eta}{\Vert \mathbf{w}\Vert}_{2}^{2},$  (6)  
$\text{s.t.}{\displaystyle \sum _{i=1}^{m}}{\lambda}_{i}{\mathbf{w}}_{i}=\mathbf{w}.$  (7) 
From the above formula we can easily derive that
${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{0}(\mathbf{w})$  $:=\underset{\eta \to 0+}{lim}{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}(\mathbf{w})={\displaystyle {\sum}_{i}}{\lambda}_{i}{f}_{i}(\mathbf{w}),$  
${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\mathrm{\infty}}(\mathbf{w})$  $:=\underset{\eta \to \mathrm{\infty}}{lim}{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}(\mathbf{w})=\underset{{\sum}_{i}{\lambda}_{i}{\mathbf{w}}_{i}=\mathbf{w}}{\mathrm{min}}{\displaystyle {\sum}_{i}}{\lambda}_{i}{f}_{i}({\mathbf{w}}_{i}).$ 
Interestingly, we can now interpret FedAvg in two extreme settings as minimizing the proximal average:

•
FedAvg with $k=1$ local step is exactly the same as minimizing the proximal average ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{0}(\mathbf{w})$ with $\eta =0$. This is clear from the objective (1) of FedAvg (as our notation already suggests).

•
FedAvg with $k=\mathrm{\infty}$ local steps is exactly the same as minimizing the proximal average ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\mathrm{\infty}}(\mathbf{w})$ with $\eta =\mathrm{\infty}$. Indeed,
$\left\{\underset{\mathbf{w}}{\mathrm{min}}{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\mathrm{\infty}}(\mathbf{w})\right\}=\underset{{\mathbf{w}}_{1},\mathrm{\dots},{\mathbf{w}}_{m}}{\mathrm{min}}{\displaystyle \sum _{i}}{\lambda}_{i}{f}_{i}({\mathbf{w}}_{i}),$ (8) where the righthand side decouples and hence ${\mathbf{w}}_{i}$ at optimality is a minimizer of ${f}_{i}$ (recall that $\bm{\lambda}\ge 0$).
Therefore, we may interpret FedAvg with an intermediate $k$ as minimizing ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}$ with an intermediate $\eta $. More interestingly, if we apply the PAPG algorithm in Yu, (2013, Algo. 222) to minimize ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}$, we obtain the simple update rule
$\mathbf{w}\leftarrow {\displaystyle {\sum}_{i}}{\lambda}_{i}{\U0001d5af}_{{f}_{i}}^{\eta}(\mathbf{w}),$  (9) 
where the proximal maps are computed in parallel at the user’s side. We note that the recent FedProx algorithm (Li et al., 2020a, ) is essentially a randomized version of (9). Crucially, we do not need to evaluate the complicated formula (6) as the update (9) only requires its proximal map, which by definition is the average of the individual proximal maps (computed by each user separately). Moreover, the difference between the proximal average ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{\eta}$ and the arithmetic average ${\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{0}$ can be uniformly bounded using the Lipschitz constant of each function ${f}_{i}$ (Yu,, 2013). Thus, for small step size $\eta $, FedAvg (with any finite $k$) and FedProx all minimize some approximate form of the arithmetic average in (1).
How to set the weights $\bm{\lambda}$ in FedAvg has been a major challenge. In FL, data is distributed in a highly noniid and unbalanced fashion, so it is not clear if some chosen arithmetic average in (1) would really satisfy one’s actual intention. A second issue with the arithmetic average in (1) is its wellknown nonrobustness against malicious manipulations, which has been exploited in recent adversarial attacks (Bhagoji et al.,, 2019). Instead, Agnostic FL (AFL (Mohri et al.,, 2019)) aims to optimize the worstcase loss:
$\underset{\mathbf{w}}{\mathrm{min}}\underset{\bm{\lambda}\in \mathrm{\Lambda}}{\mathrm{max}}{\U0001d5a0}_{\mathbf{f},\bm{\lambda}}^{0}(\mathbf{w}),$  (10) 
where the set $\mathrm{\Lambda}$ might cover reality better than any specific $\bm{\lambda}$ and provide some minimum guarantee for all users (hence achieving mild fairness). On the other hand, the worstcase loss in (10) is perhaps even more nonrobust against adversarial attacks. For instance, adding a positive constant to some loss ${f}_{i}$ can make it dominate the entire optimization process. The recent work qFedAvg (Li et al., 2020b, ) proposes an ${\mathrm{\ell}}_{q}$ norm interpolation between FedAvg (essentially ${\mathrm{\ell}}_{1}$ norm) and AFL (essentially ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm). By tuning $q$, qFedAvg can achieve better compromise than FedAvg or AFL.
4 Multiobjective Minimization (MoM)
Multiobjective minimization (MoM) refers to the setting where multiple scalar objective functions, possibly incompatible with each other, need to be minimized simultaneously. It is also called vector optimization (Jahn,, 2009) because the objective functions can be combined into a single vectorvalued function. In mathematical terms, MoM can be written as
$\underset{\mathbf{w}\in {\mathbb{R}}^{d}}{\mathrm{min}}\mathbf{f}(\mathbf{w}):=({f}_{1}(\mathbf{w}),{f}_{2}(\mathbf{w}),\mathrm{\dots},{f}_{m}(\mathbf{w})),$  (11) 
where the minimum is defined wrt the partial ordering:
$\mathbf{f}(\mathbf{w})\le \mathbf{f}(\mathbf{z})\iff \forall i=1,\mathrm{\dots},m,{f}_{i}(\mathbf{w})\le {f}_{i}(\mathbf{z}).$  (12) 
(We remind that algebraic operations such as $\le $ and $+$, when applied to a vector with another vector or scalar, are always performed componentwise.) Unlike single objective optimization, with multiple objectives it is possible that
$\mathbf{f}(\mathbf{w})\nleqq \mathbf{f}(\mathbf{z})\text{and}\mathbf{f}(\mathbf{z})\nleqq \mathbf{f}(\mathbf{w}),$  (13) 
in which case we say $\mathbf{w}$ and $\mathbf{z}$ are not comparable.
We call ${\mathbf{w}}^{\ast}$ a Pareto optimal solution of (11) if its objective value $\mathbf{f}({\mathbf{w}}^{\ast})$ is a minimum element (wrt the partial ordering in (12)), or equivalently for any $\mathbf{w}$, $\mathbf{f}(\mathbf{w})\le \mathbf{f}({\mathbf{w}}^{\ast})$ implies $\mathbf{f}(\mathbf{w})=\mathbf{f}({\mathbf{w}}^{\ast})$. In other words, it is not possible to improve any component objective in $\mathbf{f}({\mathbf{w}}^{\ast})$ without compromising some other objective. Similarly, we call ${\mathbf{w}}^{\ast}$ a weakly Pareto optimal solution if there does not exist any $\mathbf{w}$ such that $$, i.e., it is not possible to improve all component objectives in $\mathbf{f}({\mathbf{w}}^{\ast})$. Clearly, any Pareto optimal solution is also weakly Pareto optimal but the converse may not hold.
We point out that the optimal solutions in MoM are usually a set (in general of infinite cardinality) (Mukai,, 1980), and without additional subjective preference information, all Pareto optimal solutions are considered equally good (as they are not comparable against each other). This is fundamentally different from the single objective case.
From now on, for simplicity we assume all objective functions are continuously differentiable but not necessarily convex (to accommodate deep models). Finding a (weakly) Pareto optimal solution in this setting is quite challenging (already so in the single objective case). Instead, we will contend with Pareto stationary solutions, namely those that satisfy an intuitive first order necessary condition:
Definition 1 (Paretostationarity, Mukai, 1980).
We call ${\mathbf{w}}^{\ast}$ Paretostationary iff some convex combination of the gradients $\{\nabla {f}_{i}({\mathbf{w}}^{\ast})\}$ vanishes, i.e.there exists some $\mathbf{\lambda}\ge 0$ such that ${\sum}_{i}{\lambda}_{i}=1$ and ${\sum}_{i}{\lambda}_{i}\nabla {f}_{i}({\mathbf{w}}^{\ast})=\mathrm{\U0001d7ce}$.
Lemma 1 (Mukai, 1980).
Any Pareto optimal solution is Pareto stationary. Conversely, if all functions are convex, then any Pareto stationary solution is weakly Pareto optimal.
Needless to say, the above results reduce to the familiar ones for the single objective case ($m=1$).
There exist many algorithms for finding Pareto stationary solutions. We briefly review three popular ones that are relevant for us, and refer the reader to the excellent monograph (Miettinen,, 1998) for more details.
Weighted approach. Let $\bm{\lambda}\in \mathrm{\Delta}$ (the simplex) and consider the following single, weighted objective:
$\underset{\mathbf{w}}{\mathrm{min}}{\displaystyle \sum _{i=1}^{m}}{\lambda}_{i}{f}_{i}(\mathbf{w}).$  (14) 
This is essentially the approach taken by FedAvg, with any (global) minimizer of (14) being weakly Pareto optimal (in fact, Pareto optimal if all weights ${\lambda}_{i}$ are positive). From Definition 1 it is clear that any stationary solution of the weighted scalar problem (14) is a Pareto stationary solution of the original MoM (11). Note that the scalarization weights $\bm{\lambda}$, once chosen, are fixed throughout. Different $\bm{\lambda}$ leads to different Pareto stationary solutions.
$\u03f5$constraint. Let $\mathit{\u03f5}\in {\mathbb{R}}^{m1}$, $\iota \in \{1,\mathrm{\dots},m\}$ and consider the following constrained scalar problem:
$\underset{\mathbf{w}}{\mathrm{min}}$  ${f}_{\iota}(\mathbf{w})$  (15)  
$\mathrm{s}.\mathrm{t}.$  ${f}_{i}(\mathbf{w})\le {\u03f5}_{i},\forall i\ne \iota .$  (16) 
Assuming the constraints are satisfiable, then any (global) minimizer of (15) is again weakly Pareto optimal. The $\u03f5$constraint approach is closely related to the weighted approach above, through the usual Lagrangian reformulation. Both require fixing an $m1$ dimensional parameter in advance ($\bm{\lambda}$ vs. $\mathit{\u03f5}$), though.
Chebyshev approach. Let $\mathbf{s}\in {\mathbb{R}}^{m}$ and consider the minimax problem (where recall that $\mathrm{\Delta}$ is the simplex constraint):
$\underset{\mathbf{w}}{\mathrm{min}}\underset{\bm{\lambda}\in \mathrm{\Delta}}{\mathrm{max}}{\bm{\lambda}}^{\top}(\mathbf{f}(\mathbf{w})\mathbf{s}).$  (17) 
Again, any (global) minimizer is weakly Pareto optimal. Here $\mathbf{s}$ is a fixed vector that ideally lower bounds $\mathbf{f}$. This is essentially the approach taken by AFL (Mohri et al.,, 2019) with $\mathbf{s}=\mathrm{\U0001d7ce}$.
5 FL as Multiobjective Minimization
Having introduced both FL and MoM, and observed some connections between the two, it is very natural to treat each user function ${f}_{i}$ in FL as a separate objective in MoM and aim to optimize them simultaneously as in (11). This will be the main approach we follow below, which, to the best of our knowledge, has not been formally explored before (despite of the apparent connections that we saw in the previous section, perhaps retrospectively). In particular, we will extend the multiple gradient descent algorithm (Mukai,, 1980) in MoM to FL, draw connections to existing FL algorithms, and prove convergence properties of our extended algorithm FedMGDA+. Very importantly, the notion of Pareto optimality and stationarity immediately enforces fairness among users, as we are discouraged from improving certain users by sacrificing others.
To further motivate our development, let us compare to the objective in AFL (Mohri et al.,, 2019):
$\underset{\mathbf{w}}{\mathrm{min}}\underset{\bm{\lambda}\in \mathrm{\Delta}}{\mathrm{max}}{\bm{\lambda}}^{\top}\mathbf{f}(\mathbf{w})\phantom{\rule{1em}{0ex}}\equiv \phantom{\rule{1.167em}{0ex}}\underset{\mathbf{w}}{\mathrm{min}}\underset{i=1,\mathrm{\dots},m}{\mathrm{max}}{f}_{i}(\mathbf{w}),$  (18) 
where $\mathrm{\Delta}$ denotes the simplex^{2}^{2}2To be precise, AFL restricted $\bm{\lambda}$ to a subset $\mathrm{\Lambda}\subseteq \mathrm{\Delta}$. We simply set $\mathrm{\Lambda}=\mathrm{\Delta}$ to ease the discussion.. By optimizing the worst loss than the average loss in FedAvg, AFL provides some guarantee to all users hence achieving some form of fairness. However, note that AFL’s objective (18) is not robust against adversarial attacks. In fact, if a malicious user artificially “inflates” its loss ${f}_{i}$ (e.g., even by adding/multiplying a constant), it can completely dominate and mislead AFL to solely focus on optimizing its performance. The same issue applies to qFedAvg (Li et al., 2020b, ), albeit with a less dramatic effect if $q$ is small.
AFL’s objective (18) is very similar to the Chebyshev approach in MoM (see Section 4), which inspires us to propose the following iterative algorithm for solving (11):
${\stackrel{~}{\mathbf{w}}}_{t+1}=\underset{\mathbf{w}}{\text{argmin}}\underset{\bm{\lambda}\in \mathrm{\Delta}}{\mathrm{max}}{\bm{\lambda}}^{\top}(\mathbf{f}(\mathbf{w})\mathbf{f}({\stackrel{~}{\mathbf{w}}}_{t})),$  (19) 
where we adaptively “center” the user functions using function values from the previous iteration. When the functions ${f}_{i}$ are smooth, we apply the quadratic bound to obtain:
${\mathbf{w}}_{t+1}=\underset{\mathbf{w}}{\text{argmin}}\underset{\bm{\lambda}\in \mathrm{\Delta}}{\mathrm{max}}{\bm{\lambda}}^{\top}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t})(\mathbf{w}{\mathbf{w}}_{t})+\frac{1}{2\mathrm{\eta}}{\Vert \mathbf{w}{\mathbf{w}}_{t}\Vert}^{2},$  (20) 
where ${J}_{\mathbf{f}}=[\nabla {f}_{1},\mathrm{\dots},\nabla {f}_{m}]\in {\mathbb{R}}^{d\times m}$ is the Jacobian and $\mathrm{\eta}>0$ is the step size. Crucially, note that $\mathbf{f}({\mathbf{w}}_{t})$ does not appear in the above bound (20) since we subtracted it off in (19). Since (20) is convex in $\mathbf{w}$ and concave in $\bm{\lambda}$ we can swap min with max and obtain the dual:
$\underset{\bm{\lambda}\in \mathrm{\Delta}}{\mathrm{max}}$  $\underset{\mathbf{w}}{\mathrm{min}}{\bm{\lambda}}^{\top}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t})(\mathbf{w}{\mathbf{w}}_{t})+\frac{1}{2\mathrm{\eta}}{\Vert \mathbf{w}{\mathbf{w}}_{t}\Vert}^{2}.$  (21) 
Solving $\mathbf{w}$ by setting its derivative to $\mathrm{\U0001d7ce}$ we arrive at:
${\mathbf{w}}_{t+1}={\mathbf{w}}_{t}\mathrm{\eta}{\mathbf{d}}_{t},{\mathbf{d}}_{t}={J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}^{\ast},$  (22)  
$\text{where}{\bm{\lambda}}_{t}^{\ast}=\underset{\bm{\lambda}\in \mathrm{\Delta}}{\text{argmin}}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert}^{2}.$  (23) 
Note that ${\mathbf{d}}_{t}$ is precisely the minimumnorm element in the convex hull of the columns (i.e., gradients) in the Jacobian ${J}_{\mathbf{f}}$, and finding ${\bm{\lambda}}_{t}^{\ast}$ amounts to solving a simple quadratic program. The resulting iterative algorithm in (22) is known as multiple gradient descent algorithm (MGDA), which has been (re)discovered in Mukai, (1980); Fliege and Svaiter, (2000); Désidéri, (2012) and recently applied to multitask learning in Sener and Koltun, (2018); Lin et al., (2019) and to training GANs in Albuquerque et al., (2019). Our concise derivation here reveals some new insights about MGDA, in particular its connection to AFL.
To adapt MGDA to the federated learning setting, we propose the following extensions.
Balancing user average performance and fairness. We observe that the MGDA update in (22) resembles FedAvg, with the crucial difference that MGDA automatically tunes the dual weighting variable $\bm{\lambda}$ in each step while FedAvg presets $\bm{\lambda}$ based on a priori information about the user functions (or simply uniform in lack of such information). Importantly, the direction ${\mathbf{d}}_{t}$ found in MGDA is a common descent direction for all participating objectives:
$\mathbf{f}({\mathbf{w}}_{t+1})$  $\le \mathbf{f}({\mathbf{w}}_{t})+{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t})({\mathbf{w}}_{t+1}{\mathbf{w}}_{t})+\frac{1}{2\mathrm{\eta}}{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t}\Vert}^{2}$  
$\le \mathbf{f}({\mathbf{w}}_{t}),$  (24) 
where the first inequality follows from familiar smoothness assumption on $\mathbf{f}$ while the second inequality follows simply from plugging $\mathbf{w}={\mathbf{w}}_{t}$ in (20) and noting that ${\mathbf{w}}_{t+1}$ by definition can only decrease (20) even more. It is clear that equality is attained iff ${\mathbf{d}}_{t}={J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}^{\ast}=\mathrm{\U0001d7ce}$, i.e., ${\mathbf{w}}_{t}$ is Paretostationary (see Section 4). In other words, MGDA never sacrifices any participating objective to trade for more sizable improvements over some other objective, something FedAvg with a fixed weighting $\bm{\lambda}$ might attempt to do. On the other hand, FedAvg with a fixed weighting $\bm{\lambda}$ may achieve higher average performance under the weighting $\bm{\lambda}$. It is natural to introduce the following tradeoff between average performance and fairness:
$\text{update (}\text{22}\text{) with}{\bm{\lambda}}_{t}^{\ast}=\underset{\bm{\lambda}\in \mathrm{\Delta},{\Vert \bm{\lambda}{\bm{\lambda}}_{0}\Vert}_{\mathrm{\infty}}\le \u03f5}{\text{argmin}}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert}^{2}.$  (25) 
Clearly, setting $\u03f5=0$ recovers FedAvg with a priori weighting ${\bm{\lambda}}_{0}$ while setting $\u03f5=1$ recovers MGDA where the weighting variable $\bm{\lambda}$ is tuned without any restriction to achieve maximal fairness. In practice, with an intermediate $\u03f5\in (0,1)$ we may strike a desirable balance between the two (sometimes) conflicting goals. Moreover, even with the uninformative weighting ${\bm{\lambda}}_{0}=\mathrm{\U0001d7cf}/m$, using an intermediate $\u03f5$ allows us to upper bound the contribution of each user function to the common direction ${\mathbf{d}}_{t}$ hence achieve some form of robustness against malicious manipulations.
Robustness against malicious users through normalization. Existing work (e.g., Bhagoji et al.,, 2019; Xie et al.,, 2019) has demonstrated that the average gradient in FedAvg can be easily manipulated by even a single malicious user. While more robust aggregation strategies are studied recently (see e.g., Blanchard et al.,, 2017; Yin et al.,, 2018; Diakonikolas et al.,, 2019), they do not necessarily maintain the convergence properties of FedMGDA+ (e.g.finding a common descent direction and converging to a Pareto stationary solution). Instead, we propose to simply normalize the gradients from each user to unit length, based on the following considerations: (a) Normalizing the (sub)gradient is common for specialists in nonsmooth and stochastic optimization (Anstreicher and Wolsey,, 2009) and sometimes eases step size tuning. (b) Solving the weights ${\bm{\lambda}}_{t}^{\ast}$ in (22) with normalized gradients still guarantees fairness, i.e., the resulting direction ${\mathbf{d}}_{t}$ is descending for all participating objectives (by a completely similar reasoning as the remark after (5)). (c) Normalization restores robustness against multiplicative “inflation” from any malicious user, which, combined with MGDA’s builtin robustness against additive “inflation” (see Equation 19), offers reasonable robustness guarantees against adversarial attacks.
Balancing communication and ondevice computation. Communication between user devices and the central server is heavily constrained in FL, due to a variety of reasons mentioned in §3. On the other hand, modern edge devices are capable of performing reasonable amount of ondevice computations. Thus, we allow each user device to perform multiple local updates before communicating its update $\mathbf{g}={\mathbf{w}}^{0}{\mathbf{w}}^{\mathrm{\u25a0}}$, namely the difference between the initial ${\mathbf{w}}^{0}$ and the final ${\mathbf{w}}^{\mathrm{\u25a0}}$, to the central server. The server then calls the (extended) MGDA to perform a global update, which will be broadcast to the next round of user devices. We note that similar strategy was already adopted in many existing FL systems (e.g., McMahan et al.,, 2017; Li et al., 2020b, ; Li et al., 2020a, ).
Subsampling to alleviate noniid and enhance throughput. Due to the massive number of edge devices in FL, it is not realistic to expect most devices to participate at each or even most rounds. Consequently, the current practice in FL is to select a (different) subset of user devices to participate in each round (McMahan et al.,, 2017). Moreover, randomly subsampling user devices can also help combat the noniid distribution of userspecific data (e.g., McMahan et al.,, 2017; Li et al.,, 2020). Here we point out an important advantage of our MGDAbased algorithm: its update is along a common descending direction (see (5)), meaning that the objective of any participating user can only decrease. We believe this unique property of MGDA provides strong incentive for users to participate in FL. To our best knowledge, existing FL algorithms do not provide similar algorithmic incentives. Last but not the least, subsampling also solves a degeneracy issue in MGDA: when the number of participating users exceeds the dimension $d$, the Jacobian ${J}_{\mathbf{f}}$ has full rowrank hence (22) achieves Paretostationarity in a single iteration and stops making progress. Subsampling removes this undesirable effect and allows different subsets of users to be continuously optimized.
With the above extensions, we summarize our extended algorithm FedMGDA+ in Algorithm 1, and we prove the following convergence guarantees (precise statements and proofs can be found in Appendix A):
Theorem 1a.
Let each user function ${f}_{i}$ be $L$Lipschitz smooth and $M$Lipschitz continuous, and choose step size ${\mathrm{\eta}}_{t}$ so that ${\sum}_{t}{\mathrm{\eta}}_{t}=\mathrm{\infty}$ and $$, where ${\sigma}_{t}^{2}:=\mathbf{E}{\Vert {\mathbf{d}}_{t}{\widehat{\mathbf{d}}}_{t}\Vert}^{2}$ with
${\mathbf{d}}_{t}$  $:={J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t},{\bm{\lambda}}_{t}=\underset{\bm{\lambda}\in \mathrm{\Delta}}{\text{argmin}}\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert ,$  (26)  
${\widehat{\mathbf{d}}}_{t}$  $:={\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\widehat{\bm{\lambda}}}_{t},{\widehat{\bm{\lambda}}}_{t}=\underset{\bm{\lambda}\in \mathrm{\Delta}}{\text{argmin}}\Vert {\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert .$  (27) 
Then, with $k=r=1$ we have:
$\underset{t=0,\mathrm{\dots},T}{\mathrm{min}}\mathbf{E}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert}^{2}\to 0.$  (28) 
Here $k$ is the number of local updates and $r$ is the number of minibatches in each local update. The convergence rate depends on how quickly the “variance” term ${\sigma}_{t}$ of the stochastic common descent direction ${\widehat{d}}_{t}$ diminishes (if at all), which in turn depends on how aggressively we subsample users or how heterogeneous the users are.
For deterministic gradient updates, we can prove convergence even with more local updates (i.e.$k>1$):
Theorem 1b.
Let each user function ${f}_{i}$ be $L$Lipschitz smooth and $M$Lipschitz continuous. For any number of local updates $k$, if the global step size ${\mathrm{\eta}}_{t}\to 0$ with ${\sum}_{t}{\mathrm{\eta}}_{t}=\mathrm{\infty}$, local learning rate ${\eta}_{t}^{l}\to 0$ and ${\epsilon}_{t}:=\Vert {\mathbf{\lambda}}_{t}{\widehat{\mathbf{\lambda}}}_{t}\Vert \to 0$, then we have:
$\underset{t=0,\mathrm{\dots},T}{\mathrm{min}}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert}^{2}\to 0.$  (29) 
Please refer to Appendix A for the precise statement of the theorem and its proof. We note that one natural approach to bound the deviation ${\epsilon}_{t}$ is by applying the $\u03f5$constrained version of FedMGDA. For example, if ${\Vert \bm{\lambda}{\bm{\lambda}}_{0}\Vert}_{\mathrm{\infty}}\le {\u03f5}_{t}$, and ${\u03f5}_{t}$ is bounded, then ${\epsilon}_{t}\le 2\sqrt{m}{\u03f5}_{t}$ is also bounded. Thus, ${\epsilon}_{t}\to 0$ when ${\u03f5}_{t}\to 0$. Moreover, when $k=1$, we do not need the local learning rate ${\eta}_{t}^{l}$ to decay for convergence; in addition, if ${\epsilon}_{t}\equiv 0$ (e.g. in FedAvg), then our convergence guarantee reduces to the usual one for gradient descent, which is expected since we know FedAvg with $k=1,r=1$ is the same as centralized gradient descent. Lastly, we note that when $k>1$, local learning rate ${\eta}_{t}^{l}$ must vanish in order to obtain convergence. This importance of local learning rate decay is also pointed out in Reddi et al., (2020).
When the functions ${f}_{i}$ are convex, we can derive a finer result:
Theorem 2.
Suppose each user function ${f}_{i}$ is convex and $M$Lipschitz continuous. Suppose at each round FedMGDA+ includes a strongly convex user function whose weight is bounded away from 0. Then, with the choice ${\mathrm{\eta}}_{t}=\frac{2}{c(t+2)}$ and $k=r=1$, we have
$\mathbf{E}{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}\le \frac{4{M}^{2}}{{c}^{2}(t+3)},$  (30) 
and ${\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\to 0$ almost surely, where ${\mathbf{w}}_{t}^{\ast}$ is the nearest Pareto stationary solution to ${\mathbf{w}}_{t}$ and $c$ is some constant.
A slightly stronger result where we also allow some user functions to be nonconvex can be found in Appendix A. The same results hold if the gradient normalization is bounded away from 0 (otherwise we are already close to Pareto stationarity). For $r,k>1$, using a similar argument as in §3, we expect FedMGDA+ to optimize some proxy problem (such as the proximal average), and we leave the thorough theoretical analysis for future work.
We remark that convergence rate for MGDA, even when restricted to the deterministic case, was only derived recently in Fliege et al., (2019). The stochastic case (that we consider here) is much more challenging and our theorems provide one of the first convergence guarantees for FedMGDA+. We wish to emphasize that FedMGDA+ is not just an alternative algorithm for FL practitioners; it can be used as a postprocessing step to enhance existing FL systems or combined with existing FL algorithms (such as FedProx or qFedAvg). This is particularly appealing with nonconvex user functions as MGDA is capable of converging to all Pareto stationary points while approaches such as FedAvg do not necessarily enjoy this property even when we enumerate the weighting ${\bm{\lambda}}_{0}$ (Miettinen,, 1998). Furthermore, it is possible to find multiple or even enumerate all Pareto optimal solutions (i.e.the Pareto front). For instance, we may run FedMGDA+ multiple times with different random seeds or initializations. As shown by Lin et al., (2019), we could also incorporate additional linear constraints in (22) to encode one’s preference and encourage more diverse solutions. However, these techniques become less effective in higher dimensions (i.e.when the number of users is large) and in communication limited settings. Practically, the server may dynamically adjust the linear constraints in (22) to steer the algorithm to a more desirable Pareto stationary solution.
Lastly, we mention that finding the common descent direction (i.e.Line 6 of Algorithm 1) is a standard quadratic programming (QP) problem that is solved only at the server side. For moderate number of (sampled) users, it suffices to employ a generic QP solver while for large number of users we could also solve $\lambda $ efficiently using for instance the conditional gradient algorithm (Sener and Koltun,, 2018), with perstep complexity proportional to the model dimension and the number of participating users. For our experiments below, we used a generic QP sovler and we observed that this overhead is negligible, resulting almost the same overall running time for FedAvg and FedMGDA.
6 Experiments
6.1 Experimental setups
Dataset  Train Clients  Train samples  Test clients  Test samples  Batch size 

CIFAR10 (Krizhevsky,, 2009)  $100$  $50000$  $100$  $10000$  $\{10,\mathrm{\infty}\}$ 
FMNIST (Xiao et al.,, 2017)  $100$  $60000$  $100$  $10000$  $\{10,\mathrm{\infty}\}$ 
FEMNIST (Caldas et al.,, 2018)  $3406$  $709385$  $3406$  $80011$  $\{20,\mathrm{\infty}\}$ 
Shakespeare (Li et al., 2020a, )  $31$  $92959$  $31$  $23255$  $\{10\}$ 
Adult (Dua and Graff,, 2017)  $2$  $32561$  $2$  $16281$  $\{10\}$ 
Layer  Output Shape  $\mathrm{\#}$ of Trainable Parameters  Activation  Hyperparameters 

Input  $(3,32,32)$  $0$  
Conv2d  $(64,28,28)$  $4864$  ReLU  kernel size =$5$; strides=$(1,1)$ 
MaxPool2d  $(64,14,14)$  $0$  pool size=$(2,2)$  
LocalResponseNorm  $(64,14,14)$  $0$  size=$2$  
Conv2d  $(64,10,10)$  $102464$  ReLU  kernel size =$5$; strides=$(1,1)$ 
LocalResponseNorm  $(64,10,10)$  $0$  size=$2$  
MaxPool2d  $(64,5,5)$  $0$  pool size=$(2,2)$  
Flatten  $1600$  $0$  
Dense  $384$  $614784$  ReLU  
Dense  $192$  $73920$  ReLU  
Dense  $10$  $1930$  softmax  
Total  $797962$ 
Layer  Output Shape  $\mathrm{\#}$ of Trainable Parameters  Activation  Hyperparameters 

Input  $(1,28,28)$  $0$  
Conv2d  $(10,24,24)$  $260$  ReLU  kernel size =$5$; strides=$(1,1)$ 
MaxPool2d  $(10,12,12)$  $0$  pool size=$(2,2)$  
Conv2d  $(20,8,8)$  $5020$  ReLU  kernel size =$5$; strides=$(1,1)$ 
MaxPool2d  $(20,4,4)$  $0$  pool size=$(2,2)$  
Dropout2d  $(20,4,4)$  $0$  $p=0.5$  
Flatten  $320$  $0$  
Dense  $50$  $16050$  ReLU  
Dropout  $50$  $0$  $p=0.5$  
Dense  $10$  $510$  softmax  
Total  $21840$ 
Layer  Output Shape  $\mathrm{\#}$ of Trainable Parameters  Activation  Hyperparameters 

Input  $(1,28,28)$  $0$  
Conv2d  $(32,26,26)$  $320$  kernel size =$3$; strides=$(1,1)$  
Conv2d  $(64,24,24)$  $18496$  ReLU  kernel size =$3$; strides=$(1,1)$ 
MaxPool2d  $(64,12,12)$  $0$  pool size=$(2,2)$  
Dropout  $(64,12,12)$  $0$  $p=0.25$  
Flatten  $9216$  $0$  
Dense  $128$  $1179776$  
Dropout  $128$  $0$  $p=0.5$  
Dense  $62$  $7998$  softmax  
Total  $1206590$ 
Name  Parameters 

AFL  ${\gamma}_{\lambda}\in \{0.01,0.1,0.2,0.5\},{\gamma}_{w}\in \{0.01,0.1\}$ 
qFedAvg  $q\in \{0.001,0.01,0.1,0.5,1,2,5,10\}$, $L\in \{0.1,1,10\}$ 
FedMGDA+  $\mathrm{\eta}\in \{0.5,1,1.5,2\}$, and $\text{Decay}\in \{0,\frac{1}{40},\frac{1}{30},\frac{1}{20},\frac{1}{10},\frac{1}{3},\frac{1}{2}\}$ 
FedAvgn  $\mathrm{\eta}\in \{0.5,1,1.5,2\}$, and $\text{Decay}\in \{0,\frac{1}{40},\frac{1}{30},\frac{1}{20},\frac{1}{10},\frac{1}{3},\frac{1}{2}\}$ 
FedProx  $\mu \in \{0.001,0.01,0.1,0.5,1,10\}$ 
MGDAProx  $\mu =0.1$, $\mathrm{\eta}\in \{0.5,1,1.5,2\}$, and $\text{Decay}\in \{0,\frac{1}{40},\frac{1}{30},\frac{1}{20},\frac{1}{10},\frac{1}{5},\frac{1}{3},\frac{1}{2}\}$ 
In this subsection we provide experimental details including dataset descriptions, sampling schemes, model configurations and hyperparameter settings. A quick summary of the datasets that we use can be found in Table 1. We have two parameters in FedMGDA+ to control the total number of local updates in each communication round: $k$, the number of local epochs, and $r=n/b$, the number of updates in each local epoch. Here $n$ is the number of samples at each user (assumed the same for simplicity) while $b$ is the minibatch size for each local update. As observed by, e.g., McMahan et al., (2017) (Table 2), having a larger $k$ is similar as having a smaller $b$ (or equivalently a larger $r$), in terms of total number of local updates. Moreover, $k=1$ with a suitable $b$ usually leads to satisfying performance while very large $k$ can result in plateau or divergence. Thus, in our experiments we fix $k=1$ while vary $b$ to reduce the total number of hyperparameters. This corresponds to a single pass of the local data at each user in every communication round.
6.1.1 CIFAR10 (Krizhevsky,, 2009) and Fashion MNIST (Xiao et al.,, 2017) datasets
In order to create a noni.i.d. dataset, we follow a similar sampling procedure as in McMahan et al., (2017): first we sort all data points according to their classes. Then, they are split into $500$ shards, and each user is randomly assigned $5$ shards of data. By considering $100$ users, this procedure guarantees that no user receives data from more than $5$ classes and the data distribution of each user is different from each other. The local datasets are balanced–all users have the same amount of training samples. The local data is split into train, validation, and test sets with percentage of $80$%, $10$%, and $10$%, respectively. In this way, each user has $400$ data points for training, $50$ for test, and $50$ for validation. We use a CNN model which resembles the one in McMahan et al., (2017), with two convolutional layers followed by three fully connected layers. The details are included in Table 2 for CIFAR10 and in Table 3 for Fashin MNIST. To update the local models at each user using its local data, we apply stochastic gradient descent (SGD) with local batch size $b=10$, local epoch $k=1$, and local learning rate $\eta =0.01$, or $b=400$, $k=1$, and $\eta =0.1$. To model the fact that not all users may participate in each communication round, we employ a parameter $p$ to control the fraction of participating users: $p=0.1$ is the default setting which means that only $10$% of users participate in each communication round.
6.1.2 Federated EMNIST dataset (Caldas et al.,, 2018)
For this experimental setup, we use the same dataset, model, and hyperparameters as Reddi et al., (2020). We use the federated EMNIST dataset of Caldas et al., (2018). The dataset consists of images of digits, and English characters—both lower and upper cases, with 62 classes in total. The images are partitioned by their authors in a way that naturally makes the dataset heterogeneous and unbalanced. We use the model described in Table 4 and the following hyperparameters: local learning rate $\eta =0.1$ and selecting $10$ clients per communication round as recommended. The only difference between our setup and the one in (Reddi et al.,, 2020) is that we use local epoch $k=1$ for all algorithms.
6.1.3 Shakespeare dataset (Li et al., 2020a, )
For experiments on the Shakespeare dataset, we use the same model, data preprocessing and sampling procedure as in qFedAvg paper (Li et al., 2020b, ). The dataset is built from The Complete Works of William Shakespeare, where each role in the play represents one user. Following Li et al., 2020a , we subsample $31$ users to train a neural language model for next character prediction. Each character is embedded in an $8$dimensional space and the sequence length is $80$ characters. The model we use is a twolayer LSTM (with hidden size $256$) followed by one dense layer (McMahan et al.,, 2017; Li et al., 2020a, ). Joint hyperparameters that are shared by all algorithms include: total communication rounds $T=200$, local batch size $b=10$, local epoch $k=1$, and local optimizer being SGD, unless otherwise stated.
6.1.4 Adult dataset (Dua and Graff,, 2017)
Following the setting in AFL (Mohri et al.,, 2019), we split the Adult dataset into two nonoverlapping domains based on the education attribute—phd domain and nonphd domain. The resulting FL setting consists of two users each of which has data from one of the two domains. Further, data is preprocessed as in Li et al., 2020b to have $99$ binary features. We use a logistic regression model for all FL algorithms mentioned in the main paper. Local data is split into train, validation, and test sets with percentage of $80$%, $10$%, and $10$%, respectively. In each round, both users participate and the server aggregates their losses and gradients (or weights). Joint hyperparameters that are shared by all algorithms include: total communication rounds $T=500$, local batch size $b=10$, local epoch $k=1$, local learning rate $\eta =0.01$, and local optimizer being SGD without momentum, unless otherwise stated. Algorithmspecific hyperparameters will be mentioned in the appropriate places below. One important note is that the phd domain has only $413$ samples while the nonphd domain has $32,148$ samples, so the split is very unbalanced while training only on the phd domain yields inferior performance on all domains due to the insufficient sample size.
6.1.5 Hyperparameters
We evaluate the performance of different algorithms with a wide range of hyperparameters, summarized in Table 5. In particular, following Anstreicher and Wolsey, (2009) we tried sublinear $O(1/t)$ and exponential decay $O({\beta}^{t})$ learning rates $\mathrm{\eta}$ on the server, and a fixed local learning rate $\eta $ for client updates. Eventually we settled on decaying ${\mathrm{\eta}}_{t}$ by a factor of $\beta $ every $100$ steps: ${\mathrm{\eta}}_{t}={\beta}^{[\frac{t}{100}]}$, where $\beta ={\text{decay}}^{100/T}$ and $T$ is the total number of communication rounds (with e.g. decay = $1/10$). We note that Reddi et al., (2020) also found exponential decay to be most effective in their experiments. We use grid search to choose suitable local learning rates for all algorithms.
We evaluate our algorithm FedMGDA+ on several public datasets: CIFAR10 (Krizhevsky,, 2009), FMNIST (Xiao et al.,, 2017), Federated EMNIST (Caldas et al.,, 2018), Shakespeare (Li et al., 2020a, ) and Adult (Dua and Graff,, 2017), and compare to existing FL systems including FedAvg (McMahan et al.,, 2017), FedProx (Li et al., 2020a, ), qFedAvg (Li et al., 2020b, ), and AFL^{3}^{3}3Experiments of AFL in the original work (Mohri et al.,, 2019) and later work that compare with it (e.g. (Li et al., 2020b, )) was reported on datasets with very few clients ($2$ or $3$), possibly due to applicability reasons. We followed this convention in our work. (Mohri et al.,, 2019). In addition, from the discussions in §5, one can envision several potential extensions of existing algorithms to improve their performance. So, we also compare to the following extensions: FedAvgn which is FedAvg with gradient normalization, and MGDAProx which is FedMGDA+ with a proximal regularizer added to each user’s loss function.^{4}^{4}4One can also apply the gradient normalization idea to qFedAvg; however, we observed from our experiments that the resulting algorithm is unstable particularly for large $q$ values. We distinguish between FedMGDA+ and FedMGDA which is a vanilla extension of MGDA to FL.
We point out that FL algorithms are to be deployed on smart devices with moderate computational capabilities. Thus, the models we chose to experiment on are mediumsized (see Tables 2, 3 and 4 for details), with similar complexity to the ones in FedAvg, qFedAvg and AFL. Due to space limits we only report some representative results in the main paper, and defer the full set of experiments to Appendix B.
6.2 Experimental results
In this subsection we report experimental results about our proposed algorithm FedMGDA+ and compare it with stateoftheart (SOTA) alternatives under a variety of performance metrics, including accuracy, robustness and fairness. We remind that the accuracy metric is exactly what FedAvg aims to optimize during training, and hence it has some advantage in this metric over other alternative algorithms such as FedMGDA+, AFL, and qFedAvg, which all aim to bring some fairness among users, perhaps at some occasional, and hopefully small, loss of accuracy.
6.2.1 Recovering FedAvg
As mentioned in §5, we can control the balance between the user average performance and fairness by tuning the $\u03f5$constraint in Equation 25. Setting $\u03f5=0$ recovers FedAvg while setting $\u03f5=1$ recovers FedMGDA. To verify this empirically, we run (25) with different $\u03f5$, and report results on CIFAR10 in Figure 1 for both iid and noniid distributions of data (for results on FMNIST, see Section B.1). These results confirm that changing $\u03f5$ from $0$ to $1$ yields an interpolation between FedAvg and FedMGDA, as expected. Since FedAvg essentially optimizes the (uniformly) averaged training loss, it naturally performs the best under this metric (Figure 1 (c) and (d)). Nevertheless, it is interesting to note that some intermediate $\u03f5$ values actually lead to better user accuracy than FedAvg in the noniid setting (Figure 1 (a)).
6.2.2 Robustness
We discussed earlier in §5 that the gradient normalization and MGDA’s builtin robustness allow FedMGDA+ to combat against certain adversarial attacks in practical FL deployment. We now empirically evaluate the robustness of FedMGDA+ against these attacks. We run various FL algorithms in the presence of a single malicious user who aims to manipulate the system by inflating its loss. We consider an adversarial setting where the attacker participates in each communication round and inflates its loss function by (i) adding a bias to it, or (ii) multiplying it by a scaling factor, termed the bias and scaling attack, respectively. In the first experiment, we simulate a bias attack on the Adult dataset by adding a constant bias to the underrepresented user, i.e. the PhD domain, since it’s more natural to expect an attacker to be consisted of a small number of users. In this setup, the worst performance we can get is bounded by training the model using PhD data only. Results under the bias attack are presented in Figure 2 (Left); also see Section B.2 for more results. We observe that AFL and qFedAvg perform slightly better than FedMGDA+ without the attack; however, their performances deteriorate to a level close to the worst case scenario under the attack. In contrast, FedMGDA+ is not affected by the attack with any bias, which empirically supports our claim in §5. Note that we did not include FedAvg in this comparison since from its definition it is clear that FedAvg, like FedMGDA+, is not affected by the bias attack. Figure 2 (Right) shows the results of different algorithms on CIFAR10 with and without an adversarial scaling. As mentioned earlier, qFedAvg with gradient normalization is highly unstable particularly under the scaling attack, so we did not include its result here. From Figure 2 (Right) it is immediate to see that (i) the scaling attack affects all algorithms that do not employ gradient normalization; (ii) qFedAvg is the most affected under this attack; (iii) surprisingly, FedMGDA+ and, to a lesser extent, MGDAProx actually converge to slightly better Pareto solutions, compared to their own results under no scaling attack. The above results empirically verify the robustness of FedMGDA+ under perhaps the most common bias and scaling attacks.
6.2.3 Fairness
Lastly, we compare FedMGDA+ with existing FL algorithms using different notions of fairness on CIFAR10. For the first experiment, we adopt the same fairness metric as (Li et al., 2020b, ), and measure fairness by calculating the variance of users’ test error. We run each algorithm with different hyperparameters, and among the results, we pick the best ones in terms of average accuracy to be shown in Figure 3; full table of results can be found in Section B.3. From this figure, we observe that (i) FedMGDA+ achieves the best average accuracy while its standard deviation is comparable with that of qFedAvg; (ii) FedMGDA+ significantly outperforms FedMGDA, which clearly justifies our proposed modifications in Algorithm 1 to the vanilla MGDA; and (iii) FedMGDA+ outperforms FedAvgn, which uses the same normalization step as FedMGDA+, in terms of average accuracy and standard deviation. These observations confirm the effectiveness of FedMGDA+ in inducing fairness. We perform the same experiment on the Federated EMNIST dataset, and observed similar results, which can be found in Table 6 and Section B.4.
In the next experiment, we show that FedMGDA+ not only yields a fair final solution but also maintains fairness during the entire training process in the sense that, in each round, it refrains from sacrificing the performance of any participating user for the sake of improving the overall performance. To the best of our knowledge, “fairness during training” has not been investigated before, in spite of having great practical implications—it encourages user participation. To examine this fairness, we run several experiments on CIFAR10 and measure the percentage of improved participants in each communication round. Specifically, we measure the training loss before and after each round for all participating users, and report the percentage of those improved or stay unchanged.^{5}^{5}5The percentage of improved users at time $t$ is defined as ${\sum}_{i\in {I}_{t}}\mathbb{I}\{{f}_{i}({\mathbf{w}}_{t+1})\le {f}_{i}({\mathbf{w}}_{t})\}/{I}_{t},$ where ${I}_{t}$ is the selected users at time $t$, and $\mathbb{I}\{A\}$ is the indicator function of an event $A$. Figure 4 shows the percentage of improved participating users in each communication round in terms of training loss for two representative cases; see Section B.5 for full results with different hyperparameters.
We can see that FedMGDA+ consistently outperforms other algorithms in terms of percentage of improved users, which means that by using FedMGDA+, fewer users’ performances get worse after each participation. Furthermore, we notice from Figure 4 (Left) that, with local batch size $b=10$, the percentage of improved users is less than $100$%, which can be explained as follows: for small batch sizes (i.e., $$ where $\mathcal{D}$ represents a local dataset), the received updates from users are not the true gradients of users’ losses given the global model (i.e., ${\mathbf{g}}_{i}\ne \nabla {f}_{i}(\mathbf{w})$); they are noisy estimates of the true gradients. Consequently, the common descent direction calculated by MGDA is noisy and may not always work for all participating users. To remove the effect of this noise, we set $b=\mathcal{D}$ which allows us to recover the true gradients from the users. The results are presented in Figure 4 (Right), which confirms that, when step size decays (less overshooting), the percentage of improved users for FedMGDA+ reaches towards $100$% during training, as is expected.
Algorithm  Average (%)  Std. (%) 

FedMGDA  $85.73\pm 0.05$  $14.79\pm 0.12$ 
FedMGDA+  $87.60\pm 0.20$  $13.68\pm 0.19$ 
MGDAProx  $87.59\pm 0.19$  $13.75\pm 0.18$ 
FedAvg  $84.97\pm 0.44$  $15.25\pm 0.36$ 
FedAvgn  $87.57\pm 0.09$  $13.74\pm 0.11$ 
FedProx  $84.97\pm 0.45$  $15.26\pm 0.35$ 
qFedAvg  $84.97\pm 0.44$  $15.25\pm 0.37$ 
7 Conclusion
We have proposed a novel algorithm FedMGDA+ for federated learning. FedMGDA+ is based on multiobjective optimization and aims to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and complements existing FL systems nicely. Most importantly, FedMGDA+ is robust against additive and multiplicative adversarial manipulations and ensures fairness among all participating users. We established preliminary convergence guarantees for FedMGDA+, pointed out its connections to recent FL algorithms, and conducted extensive experiments to verify its effectiveness. In the future we plan to formally quantify the tradeoff induced by multiple local updates and to establish some privacy guarantee for FedMGDA+.
Acknowledgment
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We gratefully acknowledge funding support from NSERC, the Canada CIFAR AI Chairs Program, and WaterlooHuawei Joint Innovation Lab. We thank NVIDIA Corporation (the data science grant) for donating two Titan V GPUs that enabled in part the computation in this work.
References
 Albuquerque et al., (2019) Albuquerque, I., Monteiro, J., Doan, T., Considine, B., Falk, T., and Mitliagkas, I. (2019). Multiobjective training of Generative Adversarial Networks with multiple discriminators. In ICML.
 Anstreicher and Wolsey, (2009) Anstreicher, K. M. and Wolsey, L. A. (2009). Two “wellknown” properties of subgradient optimization. Mathematical Programming, 120(1):213–220.
 Bagdasaryan et al., (2020) Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. (2020). How To Backdoor Federated Learning. In AISTATS, Proceedings of Machine Learning Research, pages 2938–2948.
 Bauschke et al., (2008) Bauschke, H. H., Goebel, R., Lucet, Y., and Wang, X. (2008). The Proximal Average: Basic Theory. SIAM Journal on Optimization, 19(2):766–785.
 Bhagoji et al., (2019) Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S. (2019). Analyzing Federated Learning through an Adversarial Lens. In ICML, pages 634–643.
 Blanchard et al., (2017) Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. (2017). Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In NeurIPS.
 Caldas et al., (2018) Caldas, S., Wu, P., Li, T., Konecny, J., McMahan, H. B., Smith, V., and Talwalkar, A. (2018). Leaf: A benchmark for federated settings. arXiv:1812.01097.
 Désidéri, (2012) Désidéri, J.A. (2012). Multiplegradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5):313–318.
 Diakonikolas et al., (2019) Diakonikolas, I., Kamath, G., Kane, D., Li, J., Steinhardt, J., and Stewart, A. (2019). Sever: A Robust MetaAlgorithm for Stochastic Optimization. In ICML.
 Dua and Graff, (2017) Dua, D. and Graff, C. (2017). UCI Machine Learning Repository.
 Fliege and Svaiter, (2000) Fliege, J. and Svaiter, B. F. (2000). Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479–494.
 Fliege et al., (2019) Fliege, J., Vaz, A. I. F., and Vicente, L. N. (2019). Complexity of gradient descent for multiobjective optimization. Optimization Methods and Software, 34(5):949–959.
 He et al., (2020) He, C., Li, S., So, J., Zhang, M., Wang, H., Wang, X., Vepakomma, P., Singh, A., Qiu, H., Shen, L., et al. (2020). FedML: A research library and benchmark for federated machine learning. arXiv:2007.13518.
 Huo et al., (2020) Huo, Z., Yang, Q., Gu, B., Carin, L., and Huang, H. (2020). Faster OnDevice Training Using New Federated Momentum Algorithm. arXiv:2002.02090.
 Jahn, (2009) Jahn, J. (2009). Vector optimization. Springer.
 Kairouz et al., (2019) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. (2019). Advances and Open Problems in Federated Learning. arXiv:1912.04977.
 Krizhevsky, (2009) Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto.
 Li et al., (2019) Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2019). Federated learning: Challenges, methods, and future directions. arXiv:1908.07873.
 (19) Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. (2020a). Federated Optimization in Heterogeneous Networks. In MLSys, pages 429–450.
 (20) Li, T., Sanjabi, M., Beirami, A., and Smith, V. (2020b). Fair Resource Allocation in Federated Learning. In ICLR.
 Li et al., (2020) Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. (2020). On the Convergence of FedAvg on NonIID Data. In ICLR.
 Lin et al., (2019) Lin, X., Zhen, H.L., Li, Z., Zhang, Q.F., and Kwong, S. (2019). Pareto MultiTask Learning. In NeurIPS.
 McMahan et al., (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017). CommunicationEfficient Learning of Deep Networks from Decentralized Data. In AISTATS, pages 1273–1282.
 Mercier et al., (2018) Mercier, Q., Poirion, F., and Désidéri, J.A. (2018). A stochastic multiple gradient descent algorithm. European Journal of Operational Research, 271(3):808–817.
 Miettinen, (1998) Miettinen, K. M. (1998). Nonlinear Multiobjective Optimization. Springer.
 Mohri et al., (2019) Mohri, M., Sivek, G., and Suresh, A. T. (2019). Agnostic Federated Learning. In ICML, volume 97, pages 4615–4625.
 Mukai, (1980) Mukai, H. (1980). Algorithms for multicriterion optimization. IEEE Transactions on Automatic Control, 25(2):177–186.
 Pathak and Wainwright, (2020) Pathak, R. and Wainwright, M. J. (2020). FedSplit: An algorithmic framework for fast federated optimization. In NeurIPS.
 Reddi et al., (2020) Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečny, J., Kumar, S., and McMahan, H. B. (2020). Adaptive Federated Optimization. arXiv:2003.00295.
 Sener and Koltun, (2018) Sener, O. and Koltun, V. (2018). MultiTask Learning as MultiObjective Optimization. In NeurIPS.
 Stich, (2019) Stich, S. U. (2019). Local SGD converges fast and communicates little. In ICLR.
 Sun et al., (2019) Sun, Z., Kairouz, P., Suresh, A. T., and McMahan, H. B. (2019). Can You Really Backdoor Federated Learning? arXiv:1911.07963.
 (33) Wang, H., Sreenivasan, K., Rajput, S., Vishwakarma, H., Agarwal, S., Sohn, J.y., Lee, K., and Papailiopoulos, D. (2020a). Attack of the Tails: Yes, You Really Can Backdoor Federated Learning. In NeurIPS, volume 33, pages 16070–16084.
 (34) Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. (2020b). Federated Learning with Matched Averaging. In ICLR.
 Xiao et al., (2017) Xiao, H., Rasul, K., and Vollgraf, R. (2017). FashionMNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747.
 Xie et al., (2019) Xie, C., Koyejo, S., and Gupta, I. (2019). Fall of Empires: Breaking Byzantinetolerant SGD by Inner Product Manipulation. In UAI.
 Yang et al., (2019) Yang, Q., Liu, Y., Chen, T., and Tong, Y. (2019). Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology, 10(2).
 Yin et al., (2018) Yin, D., Chen, Y., Kannan, R., and Bartlett, P. (2018). ByzantineRobust Distributed Learning: Towards Optimal Statistical Rates. In ICML.
 Yu, (2013) Yu, Y. (2013). Better Approximation and Faster Algorithm Using the Proximal Average. In NeurIPS.
 Yu et al., (2015) Yu, Y., Zheng, X., MarchettiBowick, M., and Xing, E. P. (2015). Minimizing Nonconvex NonSeparable Functions. In AISTATS, pages 1107–1115.
 Yurochkin et al., (2019) Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. (2019). Bayesian Nonparametric Federated Learning of Neural Networks. In ICML, pages 7252–7261.
Appendix A Proofs
Theorem 1a (full version).
Suppose each user function ${f}_{i}$ is $L$Lipschitz smooth (i.e., ${\nabla}^{2}{f}_{i}\u2aafL\mathbf{I}$) and $M$Lipschitz continuous. Then, with step size ${\mathrm{\eta}}_{t}\in (0,\frac{1}{2L}]$ we have
$\underset{t=0,\mathrm{\dots},T}{\mathrm{min}}\mathbf{E}[\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert ]\le {\displaystyle \frac{2[\mathbf{f}({\mathbf{w}}_{0})\mathrm{\mathbf{E}\mathbf{f}}({\mathbf{w}}_{T+1})+{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}(M{\sigma}_{t}+L{\mathrm{\eta}}_{t}{\sigma}_{t}^{2})]}{{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}}},$  (31) 
where ${\sigma}_{t}^{2}:=\mathbf{E}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\mathbf{\lambda}}_{t}{\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\widehat{\mathbf{\lambda}}}_{t}\Vert}^{2}$ is the variance of the stochastic common direction. Moreover, if some user function ${f}_{i}$ is bounded from below, and it is possible to choose ${\mathrm{\eta}}_{t}$ so that $$, then the lefthand side in (31) converges to 0.
Proof.
Let ${\bm{\xi}}_{t}:={J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}{\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\widehat{\bm{\lambda}}}_{t}$, where ${\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t}):=[\widehat{\nabla}{f}_{1}({\mathbf{w}}_{t}),\mathrm{\dots},\widehat{\nabla}{f}_{m}({\mathbf{w}}_{t})]$ is the concatenation of stochastic gradients at each user, and
${\bm{\lambda}}_{t}=\underset{\bm{\lambda}\in \mathrm{\Delta}}{\text{argmin}}\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert ,{\widehat{\bm{\lambda}}}_{t}=\underset{\widehat{\bm{\lambda}}\in \mathrm{\Delta}}{\text{argmin}}\Vert {\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t})\widehat{\bm{\lambda}}\Vert ,$  (32) 
where for the latter we also constrain ${\widehat{\lambda}}_{i}=0$ if the $i$th user is not participating in round $t$. Then, applying the quadratic bound and the update rule (we remind that comparison between vector and scalar should be understood as componentwise):
$\mathbf{f}({\mathbf{w}}_{t+1})$  $\le \mathbf{f}({\mathbf{w}}_{t}){\mathrm{\eta}}_{t}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\widehat{\bm{\lambda}}}_{t}+{\displaystyle \frac{L{\mathrm{\eta}}_{t}^{2}}{2}}{\Vert {\widehat{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\widehat{\bm{\lambda}}}_{t}\Vert}^{2}$  (33)  
$\le \mathbf{f}({\mathbf{w}}_{t}){\mathrm{\eta}}_{t}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}+L{\mathrm{\eta}}_{t}^{2}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert}^{2}+{\mathrm{\eta}}_{t}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){\bm{\xi}}_{t}+L{\mathrm{\eta}}_{t}^{2}{\Vert {\bm{\xi}}_{t}\Vert}^{2}$  (34)  
$\le \mathbf{f}({\mathbf{w}}_{t}){\mathrm{\eta}}_{t}(1L{\mathrm{\eta}}_{t}){\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert}^{2}+{\mathrm{\eta}}_{t}M\Vert {\bm{\xi}}_{t}\Vert +L{\mathrm{\eta}}_{t}^{2}{\Vert {\bm{\xi}}_{t}\Vert}^{2},$  (35) 
where we used the Lipschitz continuity $\Vert \nabla {f}_{i}(\mathbf{w})\Vert \le M$ and the firstorder optimality condition of ${\bm{\lambda}}_{t}$ so that
$\forall \bm{\lambda}\in \mathrm{\Delta},\u27e8\bm{\lambda},{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\u27e9\ge \u27e8{\bm{\lambda}}_{t},{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\u27e9.$  (36) 
Letting ${\mathrm{\eta}}_{t}\le \frac{1}{2L}$, taking expectations and rearranging we obtain
$\underset{t=0,\mathrm{\dots},T}{\mathrm{min}}\mathbf{E}[\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert ]\le {\displaystyle \frac{2[\mathbf{f}({\mathbf{w}}_{0})\mathrm{\mathbf{E}\mathbf{f}}({\mathbf{w}}_{T+1})+{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}(M{\sigma}_{t}+L{\mathrm{\eta}}_{t}{\sigma}_{t}^{2})]}{{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}}},$  (37) 
where ${\sigma}_{t}^{2}:=\mathbf{E}{\Vert {\bm{\xi}}_{t}\Vert}^{2}$. ∎
Theorem 1b (full version).
Suppose each user function ${f}_{i}$ is $L$Lipschitz smooth (i.e., ${\nabla}^{2}{f}_{i}\u2aafL\mathbf{I}$) and $M$Lipschitz continuous. Then, for any number of local updates $k$, with global learning rate ${\mathrm{\eta}}_{t}\in (0,\frac{1}{2L}]$, deterministic gradient update and local learning rate ${\eta}_{t}^{l}$, we have
$\underset{t=0,\mathrm{\dots},T}{\mathrm{min}}\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert \le {\displaystyle \frac{2\left[\mathbf{f}({\mathbf{w}}_{0})\mathbf{f}({\mathbf{w}}_{T+1})+{M}^{2}{\mathrm{\eta}}_{t}{\sum}_{t=0}^{T}\left(\left({\epsilon}_{t}\sqrt{m}+{\eta}_{t}^{l}(k1)\right)+L{\mathrm{\eta}}_{t}{\left({\epsilon}_{t}\sqrt{m}+{\eta}_{t}^{l}(k1)\right)}^{2}\right)\right]}{{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}}},$  (38) 
where ${\epsilon}_{t}:=\Vert {\mathbf{\lambda}}_{t}{\stackrel{~}{\mathbf{\lambda}}}_{t}\Vert $ is the deviation between the exact and approximate (dual) weightings. Moreover, if some user function ${f}_{i}$ is bounded from below, then the lefthand side in (38) converges to 0 as long as ${\epsilon}_{t}\to 0$ , ${\eta}_{t}^{l}\to 0$ and ${\mathrm{\eta}}_{t}\to 0$ with ${\sum}_{t}{\mathrm{\eta}}_{t}=\mathrm{\infty}$.
Proof.
Let
${\bm{\lambda}}_{t}=\underset{\bm{\lambda}\in \mathrm{\Delta}}{\text{argmin}}\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert ,{\stackrel{~}{\bm{\lambda}}}_{t}=\underset{\bm{\lambda}\in \mathrm{\Delta}}{\text{argmin}}\Vert {\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t})\bm{\lambda}\Vert ,$  (39) 
and ${\delta}_{t}:={J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}{\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}$, where ${\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t}):=[\stackrel{~}{\nabla}{f}_{1}({\mathbf{w}}_{t}),\mathrm{\dots},\stackrel{~}{\nabla}{f}_{m}({\mathbf{w}}_{t})]$ is the concatenation of accumulated updates $\stackrel{~}{\nabla}{f}_{i}({\mathbf{w}}_{t})$ at each user. Formally, $\stackrel{~}{\nabla}{f}_{i}({\mathbf{w}}_{t}):={\mathbf{w}}_{t}{\mathbf{w}}_{t}^{k}$, which denotes the difference between the initial ${\mathbf{w}}_{t}$ and the final ${\mathbf{w}}_{t}^{k}$ after $k$ local updates, for user $i$. (Note that we have abused the notation ${\mathbf{w}}_{t}$ and ${\mathbf{w}}_{t}^{k}$ a bit for simplicity here, as they do not distinguish user $i$. This is not a big problem since the context is clear.)
Let ${\mathbf{w}}_{t}^{1}:={\mathbf{w}}_{t}\nabla {f}_{i}({\mathbf{w}}_{t})$ and ${\mathbf{w}}_{t}^{j+1}:={\mathbf{w}}_{t}^{j}{\eta}_{t}^{l}\nabla {f}_{i}({\mathbf{w}}_{t}^{j}),j=1,\mathrm{\dots},k1$ be the local optimization steps.
Then,
$\stackrel{~}{\nabla}{f}_{i}({\mathbf{w}}_{t})$  $={\mathbf{w}}_{t}{\mathbf{w}}_{t}^{k}$  (40)  
$=({\mathbf{w}}_{t}{\mathbf{w}}_{t}^{1})+({\mathbf{w}}_{t}^{1}{\mathbf{w}}_{t}^{2})+\mathrm{\dots}+({\mathbf{w}}_{t}^{k1}{\mathbf{w}}_{t}^{k})$  (41)  
$=\nabla {f}_{i}({\mathbf{w}}_{t})+{\eta}_{t}^{l}\nabla {f}_{i}({\mathbf{w}}_{t}^{1})+\mathrm{\dots}+{\eta}_{t}^{l}\nabla {f}_{i}({\mathbf{w}}_{t}^{k1}),$  (42) 
Thus, the difference between $\stackrel{~}{\nabla}{f}_{i}({\mathbf{w}}_{t})$ and gradient $\nabla {f}_{i}({\mathbf{w}}_{t})$ is bounded by:
$\Vert \stackrel{~}{\nabla}{f}_{i}({\mathbf{w}}_{t})\nabla {f}_{i}({\mathbf{w}}_{t})\Vert $  $=\Vert {\eta}_{t}^{l}{\displaystyle \sum _{j=1}^{k1}}\nabla {f}_{i}({\mathbf{w}}_{t}^{j})\Vert $  (44)  
$\le {\eta}_{t}^{l}{\displaystyle \sum _{j=1}^{k1}}\Vert \nabla {f}_{i}({\mathbf{w}}_{t}^{j})\Vert $  (45)  
$\le {\eta}_{t}^{l}(k1)M,$  (46) 
Thus,
$\Vert {\delta}_{t}\Vert $  $=\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}{\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}\Vert $  (47)  
$\le \Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}{J}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}\Vert +\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}{\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}\Vert $  (48)  
$\le {\epsilon}_{t}\sqrt{m}M+{\eta}_{t}^{l}(k1)M,$  (49) 
the last step comes from matrix norm inequality on the first term, and triangular inequality on the second term. Note that $\Vert {\delta}_{t}\Vert $ vanishes when ${\epsilon}_{t}\to 0$ and ${\eta}_{t}^{l}\to 0$.
Then, applying the quadratic upper bound, we have
$\mathbf{f}({\mathbf{w}}_{t+1})$  $\le \mathbf{f}({\mathbf{w}}_{t}){\mathrm{\eta}}_{t}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}+{\displaystyle \frac{L{\mathrm{\eta}}_{t}^{2}}{2}}{\Vert {\stackrel{~}{J}}_{\mathbf{f}}({\mathbf{w}}_{t}){\stackrel{~}{\bm{\lambda}}}_{t}\Vert}^{2}$  (50)  
$=\mathbf{f}({\mathbf{w}}_{t}){\mathrm{\eta}}_{t}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}+L{\mathrm{\eta}}_{t}^{2}{\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert}^{2}+{\mathrm{\eta}}_{t}{J}_{\mathbf{f}}^{\top}({\mathbf{w}}_{t}){\delta}_{t}+L{\mathrm{\eta}}_{t}^{2}{\Vert {\delta}_{t}\Vert}^{2}$  (51)  
$\le \mathbf{f}({\mathbf{w}}_{t}){\mathrm{\eta}}_{t}(1L{\mathrm{\eta}}_{t}){\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert}^{2}+{\mathrm{\eta}}_{t}M\Vert {\delta}_{t}\Vert +L{\mathrm{\eta}}_{t}^{2}{\Vert {\delta}_{t}\Vert}^{2},$  (52) 
Letting ${\mathrm{\eta}}_{t}\le \frac{1}{2L}$, telescoping and rearranging we obtain
$\underset{t=0,\mathrm{\dots},T}{\mathrm{min}}\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert \le {\displaystyle \frac{2[\mathbf{f}({\mathbf{w}}_{0})\mathbf{f}({\mathbf{w}}_{T+1})+{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}(M\Vert {\delta}_{t}\Vert +L{\mathrm{\eta}}_{t}{\Vert {\delta}_{t}\Vert}^{2})]}{{\sum}_{t=0}^{T}{\mathrm{\eta}}_{t}}},$  (53) 
substitute $\Vert {\delta}_{t}\Vert $ with (49), and we get (38).
Finally, if ${\epsilon}_{t}\to 0$ and ${\eta}_{t}^{l}\to 0$, then ${\delta}_{t}\to 0$ and hence the righthand side in (38) $\to 0$ when $T\to \mathrm{\infty}$, in which case the lefthand side ${\mathrm{min}}_{t=0,\mathrm{\dots},T}\Vert {J}_{\mathbf{f}}({\mathbf{w}}_{t}){\bm{\lambda}}_{t}\Vert $ converges to $0$ as well. ∎
Theorem 2 (full version).
Suppose each user function ${f}_{i}$ is $\sigma $strongly convex (i.e. ${\nabla}^{2}{f}_{i}\u2ab0\sigma \mathbf{I}$) and $M$Lipschitz continuous. Suppose at each round $t$ FedMGDA includes some function ${f}_{{v}_{t}}$ such that
${f}_{{v}_{t}}({\mathbf{w}}_{t}){f}_{{v}_{t}}({\mathbf{w}}_{t}^{\ast})\ge \frac{{\mathrm{\ell}}_{t}}{2}{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2},$  (54) 
where ${\mathbf{w}}_{t}^{\ast}$ is the projection of ${\mathbf{w}}_{t}$ to the Pareto stationary set ${W}^{\ast}$ of (11). Assume $\mathbf{E}[{\lambda}_{{v}_{t}}{\mathrm{\ell}}_{t}+{\sigma}_{t}{\mathbf{w}}_{t}]\ge c>0$, then
$\mathbf{E}[{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}]\le {\pi}_{t}(1c{\mathrm{\eta}}_{0})\mathbf{E}[{\Vert {\mathbf{w}}_{0}{\mathbf{w}}_{0}^{\ast}\Vert}^{2}]+{\displaystyle \sum _{s=0}^{t}}{\displaystyle \frac{{\pi}_{t}}{{\pi}_{s}}}{\mathrm{\eta}}_{s}^{2}{M}^{2},$  (55) 
where ${\pi}_{t}={\prod}_{s=1}^{t}{\mathrm{\eta}}_{s}$ and ${\pi}_{0}=1$. In particular,

•
if $$, then $\mathbf{E}[{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}]\to 0$ and ${\mathbf{w}}_{t}$ converges to the Pareto stationarity set ${W}^{\ast}$ almost surely;

•
with the choice ${\mathrm{\eta}}_{t}=\frac{2}{c(t+2)}$ we have
$\mathbf{E}[{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}]\le {\displaystyle \frac{4{M}^{2}}{{c}^{2}(t+3)}}.$ (56)
Proof.
For each user $i$, let us define the function
${\widehat{f}}_{i}(\mathbf{w},I):={I}_{i}{f}_{i}(\mathbf{w}),$  (57) 
where the random variable $I\in {\{0,1\}}^{m}$ indicates which user participates at a particular round. Clearly, we have $\mathbf{E}{\widehat{f}}_{i}(\mathbf{w},I)={f}_{i}(\mathbf{w})\mathbf{E}{I}_{i}$. Therefore, our multiobjective minimization problem is equivalent as:
$\underset{\mathbf{w}}{\mathrm{min}}\{\mathbf{E}{\widehat{f}}_{1}(\mathbf{w},I),\mathrm{\dots},\mathbf{E}{\widehat{f}}_{m}(\mathbf{w},I)\},$  (58) 
since positive scaling does not change Pareto stationarity. (If one prefers, we can also normalize the stochastic functions ${\widehat{f}}_{i}(\mathbf{w},I)$ so that the unbiasedness property $\mathbf{E}{\widehat{f}}_{i}(\mathbf{w},I)={f}_{i}(\mathbf{w})$ holds.)
We now proceed as in Mercier et al., (2018) and provide a slightly sharper analysis. Let us denote ${\mathbf{w}}_{t}^{\ast}$ the projection of ${\mathbf{w}}_{t}$ to the Paretostationary set ${W}^{\ast}$ of (58), i.e.,
${\mathbf{w}}_{t}^{\ast}=\underset{\mathbf{w}\in {W}^{\ast}}{\text{argmin}}\Vert {\mathbf{w}}_{t}\mathbf{w}\Vert .$  (59) 
Then,
${\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}$  $\le {\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}$  (60)  
$={\Vert {\mathbf{w}}_{t}{\mathrm{\eta}}_{t}{\mathbf{d}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}$  (61)  
$={\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}2{\mathrm{\eta}}_{t}\u27e8{\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast},{\mathbf{d}}_{t}\u27e9+{\mathrm{\eta}}_{t}^{2}{\Vert {\mathbf{d}}_{t}\Vert}^{2}.$  (62) 
To bound the middle term, we have from our assumption:
$\exists {v}_{t},{\widehat{f}}_{{v}_{t}}({\mathbf{w}}_{t},{I}_{t}){\widehat{f}}_{{v}_{t}}({\mathbf{w}}_{t}^{\ast},{I}_{t})$  $\ge \frac{{\mathrm{\ell}}_{t}}{2}{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2},$  (63)  
$\forall i,{\widehat{f}}_{i}({\mathbf{w}}_{t},{I}_{t}){\widehat{f}}_{i}({\mathbf{w}}_{t}^{\ast},{I}_{t})$  $\ge 0,$  (64) 
where the second inequality follows from the definition of ${\mathbf{w}}_{t}^{\ast}$. Therefore,
$\u27e8{\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast},{\mathbf{d}}_{t}\u27e9$  $=\u27e8{\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast},{\displaystyle \sum _{i:{I}_{i}=1}}{\lambda}_{i}\nabla {f}_{i}({\mathbf{w}}_{t})\u27e9$  (65)  
$\ge {\displaystyle \sum _{i:{I}_{i}=1}}{\lambda}_{i}\left({f}_{i}({\mathbf{w}}_{t}){f}_{i}({\mathbf{w}}_{t}^{\ast})\right)+\frac{{\sigma}_{t}}{2}{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}$  (66)  
$={\displaystyle \sum _{i}}{\lambda}_{i}\left({\widehat{f}}_{i}({\mathbf{w}}_{t},{I}_{t}){\widehat{f}}_{i}({\mathbf{w}}_{t}^{\ast},{I}_{t})\right)+\frac{{\sigma}_{t}}{2}{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}$  (67)  
$\ge \frac{{\lambda}_{{v}_{t}}{\mathrm{\ell}}_{t}+{\sigma}_{t}}{2}{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}.$  (68) 
Continuing from (62) and taking conditional expectation:
$\mathbf{E}[{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}{\mathbf{w}}_{t}]$  $\le (1{c}_{t}{\mathrm{\eta}}_{t}){\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}+{\mathrm{\eta}}_{t}^{2}{M}^{2},$  (69) 
where ${c}_{t}:=\mathbf{E}[{\lambda}_{{v}_{t}}{\mathrm{\ell}}_{t}+{\sigma}_{t}{\mathbf{w}}_{t}]\ge c>0$. Taking expectation we obtain the familiar recursion:
$\mathbf{E}[{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}]$  $\le (1c{\mathrm{\eta}}_{t})\mathbf{E}[{\Vert {\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\Vert}^{2}]+{\mathrm{\eta}}_{t}^{2}{M}^{2},$  (70) 
from which we derive
$\mathbf{E}[{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}]$  $\le {\pi}_{t}(1c{\mathrm{\eta}}_{0})\mathbf{E}[{\Vert {\mathbf{w}}_{0}{\mathbf{w}}_{0}^{\ast}\Vert}^{2}]+{\displaystyle \sum _{s=0}^{t}}\frac{{\pi}_{t}}{{\pi}_{s}}{\mathrm{\eta}}_{s}^{2}{M}^{2},$  (71) 
where ${\pi}_{t}={\prod}_{s=1}^{t}(1c{\mathrm{\eta}}_{s})$ and ${\pi}_{0}=1$. Since ${\pi}_{t}\to 0\iff {\sum}_{t}{\mathrm{\eta}}_{t}=\mathrm{\infty}$, we know
$\mathbf{E}[{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}]\to 0$  (72) 
if ${\sum}_{t}{\mathrm{\eta}}_{t}=\mathrm{\infty}$ and $$.
Setting ${\mathrm{\eta}}_{t}=\frac{2}{c(t+2)}$ we obtain ${\pi}_{t}=\frac{2}{(t+2)(t+1)}$ and by induction
$\sum _{s=0}^{t}}{\displaystyle \frac{{\pi}_{t}}{{\pi}_{s}}}{\mathrm{\eta}}_{s}^{2}={\displaystyle \frac{4}{{c}^{2}(t+2)(t+1)}}{\displaystyle \sum _{s=0}^{t}}{\displaystyle \frac{s+1}{s+2}}\le {\displaystyle \frac{4}{{c}^{2}(t+4)}},$  (73) 
whence
$\mathbf{E}[{\Vert {\mathbf{w}}_{t+1}{\mathbf{w}}_{t+1}^{\ast}\Vert}^{2}]\le {\displaystyle \frac{4{M}^{2}}{{c}^{2}(t+4)}}.$  (74) 
Using a standard supermartingale argument we can also prove that
${\mathbf{w}}_{t}{\mathbf{w}}_{t}^{\ast}\to 0\text{almost surely}.$  (75) 
The proof is wellknown in stochastic optimization hence omitted (or see Mercier et al., (2018, Theorem 5) for details). ∎
Appendix B Full experimental results
In this section we provide additional results that are deferred from the main paper.
B.1 Recovering FedAvg full results: results on FashionMNIST and CIFAR10
B.2 Robustness full results: bias attack on Adult dataset
Name  Bias  Uniform  PhD  NonPhD 

AFL  $0$  $83.26\pm 0.01$  $77.90\pm 0.00$  $83.32\pm 0.01$ 
AFL  $0.01$  $83.28\pm 0.03$  $76.58\pm 0.27$  $83.36\pm 0.03$ 
AFL  $0.1$  $82.30\pm 0.04$  $74.59\pm 0.00$  $82.39\pm 0.04$ 
AFL  $1$  $81.86\pm 0.05$  $74.25\pm 0.57$  $81.94\pm 0.05$ 
qFedAvg, $q=5$  $0$  $83.26\pm 0.18$  $76.80\pm 0.61$  $83.33\pm 0.19$ 
qFedAvg, $q=5$  $1000$  $83.34\pm 0.04$  $76.57\pm 0.44$  $83.41\pm 0.04$ 
qFedAvg, $q=5$  $5000$  $81.19\pm 0.03$  $74.14\pm 0.41$  $81.27\pm 0.03$ 
qFedAvg, $q=5$  $10000$  $81.07\pm 0.03$  $73.48\pm 0.78$  $81.16\pm 0.02$ 
qFedAvg, $q=2$  $0$  $83.30\pm 0.09$  $76.46\pm 0.56$  $83.38\pm 0.09$ 
qFedAvg, $q=2$  $1000$  $83.33\pm 0.04$  $76.24\pm 0.00$  $83.41\pm 0.04$ 
qFedAvg, $q=2$  $5000$  $83.11\pm 0.03$  $75.69\pm 0.00$  $83.20\pm 0.03$ 
qFedAvg, $q=2$  $10000$  $82.50\pm 0.07$  $75.69\pm 0.00$  $82.58\pm 0.07$ 
qFedAvg, $q=0.1$  $0$  $83.44\pm 0.06$  $76.46\pm 0.56$  $83.52\pm 0.07$ 
qFedAvg, $q=0.1$  $1000$  $83.3$ 