Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data

Xinting Liao¹, Weiming Liu¹, Chaochao Chen¹, Pengyang Zhou¹, Fengyuan Yu¹, Huabin Zhu¹,
Binhui Yao¹,², Tao Wang², Xiaolin Zheng¹, Yanchao Tan³
¹Zhejiang University, ²Midea Group, ³Fuzhou University
{xintingliao, 21831010, zjuccc, zhoupy, fengyuanyu, zhb2000, xlzheng}@zju.edu.cn,
tony.yao@midea.com, tao.wang.seu@gmail.com, yctan@fzu.edu.cn
Chaochao Chen is the corresponding author.

Abstract

Federated learning achieves effective performance in modeling decentralized data. In practice, client data are often not well-labeled, which motivates federated unsupervised learning (FUSL) with non-IID data. However, the performance of existing FUSL methods suffers from insufficient representations, i.e., (1) representation collapse entanglement among local and global models, and (2) inconsistent representation spaces among local models. The former indicates that representation collapse in a local model subsequently impacts the global model and other local models. The latter means that clients model data representations with inconsistent parameters due to the lack of supervision signals. In this work, we propose Fed $\text{U}^{2}$, which enhances the generation of uniform and unified representations in FUSL with non-IID data. Specifically, Fed $\text{U}^{2}$ consists of a flexible uniform regularizer (FUR) and an efficient unified aggregator (EUA). FUR in each client avoids representation collapse by dispersing samples uniformly, and EUA in the server promotes unified representation by constraining consistent client model updating. To extensively validate the performance of Fed $\text{U}^{2}$, we conduct both cross-device and cross-silo evaluation experiments on two benchmark datasets, i.e., CIFAR10 and CIFAR100.

1 Introduction

To meet the demands of privacy regulation, federated learning (FL) [31] is increasingly used to model decentralized data in both academia and industry. This is because FL enables the collaboration of clients with decentralized data, aiming to develop a high-performing global model without the need for data transfer. However, conventional FL work mostly assumes that client data is well-labeled, which is less practical in real-world applications. In this work, we consider the problem of federated unsupervised learning (FUSL) with non-IID data [43, 14], i.e., modeling unified representation among imbalanced, unlabeled, and decentralized data.

Existing centralized unsupervised methods cannot directly adapt to FUSL with non-IID data [44]. To mitigate this, one popular line of work trains self-supervised learning models, e.g., BYOL [8], SimCLR [3], and Simsiam [5], in clients, and aggregates models by accounting for extremely divergent models [44, 45], knowledge distillation [9], or combination with clustering [30]. However, two coupled challenges of FUSL, i.e., CH1: mitigating representation collapse entanglement, and CH2: obtaining unified representation spaces, are not well considered.

The first challenge is that representation collapse [13] in a client subsequently degrades the representations of the global and other local models. Motivated by regularizing the Frobenius norm of representations in centralized self-supervised models [15, 21], FedDecorr [36] tackles representation collapse with global supervision signals in federated supervised learning. But directly applying these methods to FUSL has three limitations. Firstly, they rely on a large batch size [30] to capture reliable distribution statistics, e.g., representation variance. Secondly, regularizing the norm of high-dimensional representations inevitably causes inactivated neurons and suppresses meaningful features [18]. Thirdly, clients cannot eliminate representation collapse entanglement by decorrelating representations in FUSL once they represent data in different representation spaces.

The second challenge refers to optimizing inconsistent client model parameters toward discrepant parameter spaces, which yields less unified representations among local models. Most of the existing FUSL methods aggregate participating models with the ratio of samples, i.e., FedAvg [31]. This not only fails to tackle the client shift from global optimum to local optimum, but also brings sub-optimal results [22, 33]. To mitigate this, FUSL methods maintain consistency by (1) abandoning extremely divergent clients by threshold [45, 44], (2) obtaining global supervised signals via clustering client sub-clusters [30, 6], and (3) scaling angular divergence among client models in a layer-wise way [33]. These methods either fail to adjust clients updated in inconsistent directions, or break the performance coherence among different layers of the whole model, and thus fail to capture unified representations.

To fill this gap, we propose a framework, i.e., Fed $\text{U}^{2}$, to enhance Uniform and Unified representation in FUSL with non-IID data. To tackle CH1, we devise a flexible uniform regularizer (FUR) to prevent sample representation collapse regardless of data distribution and client discrepancies. In each client, FUR minimizes the unbalanced optimal transport divergence between client data and uniform random samples, i.e., samples from the same spherical Gaussian distribution among clients. Thus it not only flexibly disperses local data representations toward an ideal uniform distribution, but also avoids the representation collapse entanglement among clients without leaking privacy. To mitigate CH2, we propose an efficient unified aggregator (EUA) to aggregate a global model that maintains model consistency between the global optimization and different local optimizations. Specifically, EUA formulates model aggregation as a multi-objective optimization based on the model deviation change rates of clients. EUA reduces computation by searching for exact solutions in the dual formulation with the alternating direction method of multipliers. Compared with conventional aggregation methods, we equivalently maintain consistent model updating based on client model deviation change, enhancing unified representations.

In summary, we aim to enhance the representation in FUSL by mitigating representation collapse and unifying representation generalization. (1) We enhance uniform representation by driving data samples toward a spherical Gaussian distribution, which mitigates representation collapse and its subsequent entangled impacts. (2) We enhance unified representation by constraining the consistent updating of different client models. (3) To reach the above goals, we propose Fed $\text{U}^{2}$ with FUR and EUA, which is agnostic and orthogonal to the backbone of self-supervised models. (4) In our empirical studies, we conduct experiments on two benchmark datasets and two evaluation settings, which extensively validate the performance of Fed $\text{U}^{2}$.

2 Related Work

2.1 Federated Unsupervised Learning

To enhance FUSL with non-IID data [24], there are two categories of efforts, i.e., (1) generating global supervised signals, and (2) enhancing unified representation. The former aims at generating global supervised signals via local-global clustering [6], and sharing data representation among clients [41, 43]. But these methods either suffer from randomness in obtaining global supervision [30], or take the risk of leaking privacy [44]. The latter enhances unified representation by adapting existing unsupervised representation methods, and tackling non-IID modeling with divergence-aware model aggregation [44, 45, 33, 17, 30]. FedU [44] and FedEMA [45] enhance the awareness of heterogeneity in federated self-supervised learning by a divergence-aware predictor update rule and adaptive global knowledge interpolation, respectively. However, this kind of work overlooks representation collapse in non-IID clients. Orchestra [30] utilizes local-global clustering derived from K-Fed [6] to guide self-supervised learning. This brings additional cost for clustering and is fragile to random initialization. Moreover, FedX [9] devises a local relational loss to distill the invariance of data samples, and a global relational loss to maintain client inconsistencies. Recently, L-DAWA [33] corrects the FUSL optimization trajectory by measuring and scaling angular divergence among client models in a layer-wise way. However, it is hard to guarantee that the newly aggregated global model is still compatible and performant as a consistent model. In contrast, the proposed Fed $\text{U}^{2}$ enhances uniform and unified representation without prior knowledge of unsupervised models, data distribution, and federated settings.

2.2 Representation Collapse

Representation collapse [15, 21] means that representation vectors are highly correlated and span only a lower-dimensional subspace, which is widely studied in metric learning [34], self-supervised learning [15], and supervised federated learning [36]. In federated supervised learning, FedDecorr [36] finds the dimensional collapse entanglement among server and client models, and decorrelates representations via regularizing the Frobenius norm of batch samples. However, FedDecorr relies on a large batch size and deactivates many neuron parameters, degrading performance once the number of clients increases [18]. To avoid representation collapse entanglement in FUSL, the proposed FUR in Fed $\text{U}^{2}$ regularizes data representations toward a uniform distribution that is the same among clients. In this way, decorrelating representations is not affected by the data sampling. Meanwhile, the data is uniformly dispersed into the same random distribution space, avoiding the entangled collapse impacts of client collaboration.
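As a rough illustration of the definition above, the sketch below checks whether a batch of representations spans a lower-dimensional subspace by inspecting the singular values of its covariance matrix. The function names and the relative threshold `tol` are hypothetical choices for this example and are not part of the proposed method.

```python
import numpy as np

def covariance_spectrum(Z):
    """Singular values of the covariance of representations Z with shape (n, d)."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / (Z.shape[0] - 1)
    return np.linalg.svd(cov, compute_uv=False)

def effective_dims(Z, tol=1e-8):
    """Count singular values above tol times the largest one (hypothetical threshold)."""
    s = covariance_spectrum(Z)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
# Healthy case: 16-dimensional representations that use all dimensions.
healthy = rng.standard_normal((256, 16))
# Collapsed case: 16-dimensional representations confined to a 3-dimensional subspace.
collapsed = rng.standard_normal((256, 3)) @ rng.standard_normal((3, 16))

healthy_dims = effective_dims(healthy)
collapsed_dims = effective_dims(collapsed)
```

In this toy setting the collapsed batch uses far fewer effective dimensions than the embedding size, which is the symptom that decorrelation-based and uniformity-based regularizers try to prevent.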

3 Method

Figure 1: Framework of Fed $\text{U}^{2}$. For clients with an agnostic self-supervised framework, FUR expands non-IID data uniformly to avoid representation collapse for FUSL. EUA in the server maintains a balanced aggregation for all client models, bringing unified representations.

3.1 Federated Unsupervised Learning Formulation

We introduce the FUSL problem formulation and related assumptions in the following. We assume a dataset is decentralized among $K$ clients, i.e., $\mathcal{D}=\cup_{k\in[K]}\mathcal{D}_{k}$. The data of different clients, i.e., $\mathcal{D}_{k}=\{\bm{x}_{k,i}\}_{i=1}^{N_{k}}$, are unlabeled and non-IID in practice. FUSL can be formulated as a global objective that seeks a collaborative aggregation among clients, i.e.,

$$\operatorname{argmin}_{\bm{\theta}}\ \mathcal{L}(\bm{\theta};\bm{p})=\sum_{k=1}^{K}p_{k}\,\mathbb{E}_{\bm{x}\sim\mathcal{D}_{k}}\left[\mathcal{L}_{k}(\bm{\theta};\bm{x})\right], \tag{1}$$

where $\mathcal{L}_{k}(\cdot)$ is the unsupervised model loss at client $k$, $\bm{p}=[p_{1},\dots,p_{K}]$, and $p_{k}$ represents the weight ratio of client $k$. The common aggregation approach is to assign the ratio of the sample amount in client $k$ as the weight ratio, e.g., FedAvg [31]. Nevertheless, a client with a large amount of data will dominate the aggregation, deteriorating the optimization of other clients with inconsistent local optimums [33]. Due to privacy constraints, directly aligning client local optimums with representations is forbidden [44, 45]. Therefore, it is necessary to account for a multi-objective optimal combination, i.e., enforcing consistency between global and local model parameters.
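To make the dominance effect of sample-ratio weighting concrete, the following minimal sketch aggregates toy client parameters with FedAvg-style weights $p_{k}=N_{k}/\sum_{j}N_{j}$. The helper names and the toy client sizes are assumptions made only for illustration.

```python
import numpy as np

def fedavg_weights(num_samples):
    """Weight ratio p_k proportional to the number of samples of client k."""
    n = np.asarray(num_samples, dtype=float)
    return n / n.sum()

def aggregate(client_params, p):
    """Weighted average of client parameter vectors; client_params has shape (K, D)."""
    return np.asarray(p) @ np.asarray(client_params)

# Toy example: one large client and two small clients with a different local optimum.
sizes = [900, 50, 50]
thetas = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0]])

p = fedavg_weights(sizes)
theta_global = aggregate(thetas, p)
```

Here the aggregated parameters stay close to the large client even though two of the three clients agree on a different solution, which is the imbalance that motivates a multi-objective choice of $\bm{p}$.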

3.2 Fed $\text{U}^{2}$ Overview

To address FUSL with non-IID data, i.e., Eq. (1), we propose Fed $\text{U}^{2}$, whose framework overview is depicted in Fig. 1. There are one server and $K$ clients in Fed $\text{U}^{2}$, which share the same self-supervised model, e.g., Simsiam [5], SimCLR [3], and BYOL [8]. Additionally, Fed $\text{U}^{2}$ contains an extra flexible uniform regularizer (FUR) module, which mitigates representation collapse without requiring prior knowledge for FUSL. We first introduce the unsupervised local modeling at each client $k$, and then illustrate the communication between clients and server. For a batch of image data $\bm{X}$ at client $k$, we augment them with two transformations, i.e., $\bm{X}^{v}=T^{v}(\bm{X})$ for $T^{v}\sim\mathcal{T}$, with $\mathcal{T}$ denoting the transformation set and $v\in\{1,2\}$. The feature extractor represents the two views of augmented data with $d$-dimensional $l_{2}$-normalized representations, i.e., $\bm{Z}^{v}=\mathcal{F}_{\bm{\theta}_{k}}(\bm{X}^{v})$. Then for each view of sample representations, we maximize its representation space by approximating a uniform Gaussian distribution in FUR. Meanwhile, we align the two views of feature representations by the predictor module (if available) and the alignment module.

In one communication round, every participating client $k$ uploads its model parameters $\bm{\theta}_{k}$ to the server. Next, the server, with the efficient unified aggregator (EUA) module, first formulates the client model aggregation as a multi-objective optimization based on different client model deviation change rates, and searches for a balanced model combination. The server further restrains client parameters in a consistent space, which not only enhances the consistency between the global optimum and local optimums, but also captures unified representations for data of the same class but different clients. This communication between server and clients iterates until the performance of Fed $\text{U}^{2}$ converges.

3.3 FUR for Mitigating Representation Collapse

Representation collapse is a long-standing issue that appears in several forms, e.g., constant collapse and partial/full dimensional collapse [15, 8]. In federated learning, representation collapse not only degrades the performance of local clients, but also intricately affects the representation of global and local models [36]. Besides, lacking labels, clients map data samples to the space around local optimums, where decorrelating sample representations of limited client data suppresses the capture of useful features [18].

Without ground truth labels, self-supervised learning not only keeps the invariance of the same sample with different augmentations, but also expands the uniformity of different representations to avoid representation collapse [40]. Given a batch of $B$ representations, i.e., $\bm{Z}_{B}^{1}=\{\bm{z}_{i}^{1}\}_{i\in[B]}$ and $\bm{Z}_{B}^{2}=\{\bm{z}_{i}^{2}\}_{i\in[B]}$ , we train the self-supervised model by minimizing the total objective as below:

$$\mathcal{L}=\mathbb{E}_{T^{1},T^{2}\sim\mathcal{T}}\,\ell_{a}\left(\bm{Z}_{B}^{1},\bm{Z}_{B}^{2}\right)+\lambda_{U}\left(\ell_{u}\left(\bm{Z}_{B}^{1}\right)+\ell_{u}\left(\bm{Z}_{B}^{2}\right)\right), \tag{2}$$

where $\ell_{a}$ and $\ell_{u}$ are the alignment term and the uniformity term, respectively, and $\lambda_{U}>0$ is a hyperparameter that balances the two terms. The alignment term pulls the two augmented views of each sample together, so that data samples of the same class are clustered while others remain separable, i.e., $\ell_{a}\left(\bm{Z}_{B}^{1},\bm{Z}_{B}^{2}\right):=\frac{1}{B}\sum_{i\in[B]}\left\|\bm{z}_{i}^{1}-\bm{z}_{i}^{2}\right\|_{2}^{2}.$
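The alignment term can be written in a few lines of NumPy. The sketch below is only an illustration on $l_{2}$-normalized toy vectors; `l2_normalize` and `alignment_loss` are hypothetical helper names.

```python
import numpy as np

def l2_normalize(Z, eps=1e-12):
    """Project each row of Z onto the unit sphere."""
    Z = np.asarray(Z, dtype=float)
    return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), eps)

def alignment_loss(Z1, Z2):
    """Mean squared Euclidean distance between paired views: (1/B) sum_i ||z_i^1 - z_i^2||^2."""
    return float(np.mean(np.sum((Z1 - Z2) ** 2, axis=1)))

Z1 = l2_normalize([[3.0, 4.0], [1.0, 0.0]])
loss_same = alignment_loss(Z1, Z1)       # identical views
loss_opposite = alignment_loss(Z1, -Z1)  # antipodal views on the unit sphere
```

Because the representations lie on the unit sphere, the per-pair distance ranges from 0 for identical views to 4 for antipodal views.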

The key to mitigating representation collapse is to enhance the representation uniformity [40]. To relieve the reliance on prior knowledge of client data, we regularize local sample representations toward a random distribution with high entropy. Specifically, we select samples following the spherical Gaussian distribution, i.e., $\bm{s}\sim\mathcal{N}(0,1)$ s.t. $\|\bm{s}\|=1$, as the prior. In this way, mitigating representation collapse in FUSL not only avoids leaking privacy, but also disentangles the collapse impacts among clients. Then FUR regularizes the divergence between the data representations $\bm{Z}_{B}^{v}$ and a set of random samples following the spherical Gaussian distribution ${\bm{S}_{B}=\{\bm{s}_{i}\}_{i\in[B]}}$:

$$\ell_{u}\left(\bm{Z}_{B}^{v}\right):=\operatorname{Div}(\bm{Z}_{B}^{v},\bm{S}_{B}). \tag{3}$$

Thus the uniformity term $\ell_{u}$ disperses representations uniformly to avoid representation collapse, without the pairwise repulsion used in instance-based contrastive learning.
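A minimal sketch of drawing the prior samples $\bm{S}_{B}$ is given below: Gaussian vectors are normalized onto the unit sphere. Sharing a random seed so that all clients draw from the same distribution without exchanging data is an assumption of this example, as is the helper name `sample_sphere`.

```python
import numpy as np

def sample_sphere(batch_size, dim, seed):
    """Draw Gaussian vectors and normalize them so that every sample has unit norm."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((batch_size, dim))
    return S / np.linalg.norm(S, axis=1, keepdims=True)

# Two hypothetical clients using the same seed obtain the same prior samples.
S_client_a = sample_sphere(batch_size=8, dim=4, seed=0)
S_client_b = sample_sphere(batch_size=8, dim=4, seed=0)
```

Normalized isotropic Gaussian vectors are uniformly distributed on the sphere, which is why they serve as a target for dispersing representations.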

Since the client data is non-IID, strictly constraining sample representations to approach random instances will break the class separation [39]. A more flexible method is to match data samples and random Gaussian samples with arbitrary or proportional masses, i.e., leaving the sample couplings with lower uncertainties unmatched. Unbalanced Optimal Transport (UOT) [1, 35] is one effective solution. UOT computes the transport mapping [26, 29, 28] between two sample masses of different distributions under soft marginal constraints, e.g., an $l_{2}$ penalty between the predicted marginal and the ground-truth marginal. Given marginal constraints $\bm{a}$ and $\bm{b}$ for the data and the Gaussian distribution respectively, we formulate a UOT problem that searches for a coupling matrix $\bm{\pi}$ with minimal distribution divergence:

$$\begin{aligned}\min_{\pi_{i,j}\geq 0}\ell_{u}(\bm{Z}^{v})={}&\operatorname{vec}(\bm{C})^{\top}\operatorname{vec}(\bm{\pi})+\frac{\tau_{a}}{2}\left\|\bm{\Phi}_{\mathrm{r}}\operatorname{vec}(\bm{\pi})-\bm{a}\right\|_{2}^{2}\\&+\frac{\tau_{b}}{2}\left\|\bm{\Phi}_{\mathrm{c}}\operatorname{vec}(\bm{\pi})-\bm{b}\right\|_{2}^{2},\end{aligned} \tag{4}$$

where the cost matrix is $\bm{C}_{ij}=\|\bm{Z}_{i}^{v}-\bm{S}_{j}\|^{2}$, and $\bm{\Phi}_{\mathrm{r}}=\bm{I}_{N}\otimes\mathbbm{1}_{N}^{\top}$ ($\bm{\Phi}_{\mathrm{c}}=\mathbbm{1}_{M}^{\top}\otimes\bm{I}_{M}$) are indicators for row-wise (column-wise) Kronecker multiplication, with $\bm{I}$ denoting the identity matrix.

Optimization. Denoting $\tau_{a}\bm{\Phi}_{\mathrm{r}}^{\top}\bm{\Phi}_{\mathrm{r}}+\tau_{b}\bm{\Phi}_{\mathrm{c}}^{\top}\bm{\Phi}_{\mathrm{c}}=\bm{Q}$ and $\operatorname{vec}(\bm{C})-\tau_{a}\bm{\Phi}_{\mathrm{r}}^{\top}\bm{a}-\tau_{b}\bm{\Phi}_{\mathrm{c}}^{\top}\bm{b}=\bm{w}$, we rewrite Eq. (4) as a positive definite quadratic form:

$$\min_{\pi_{i,j}\geq 0}\ell_{u}(\bm{Z}^{v})=\frac{1}{2}\operatorname{vec}(\bm{\pi})^{\top}\bm{Q}\operatorname{vec}(\bm{\pi})+\bm{w}^{\top}\operatorname{vec}(\bm{\pi})+\bm{\Omega}, \tag{5}$$

where the constant $\bm{\Omega}=\frac{1}{2}(\tau_{a}\bm{a}^{\top}\bm{a}+\tau_{b}\bm{b}^{\top}\bm{b})$. Next we optimize $\bm{\pi}$ via steepest gradient descent as below:

$$\operatorname{vec}\left(\bm{\pi}^{(\text{new})}\right)=\max\left(0,\ \operatorname{vec}\left(\bm{\pi}^{(\text{old})}\right)-\eta^{*}\frac{\partial\ell_{u}(\bm{Z}^{v})}{\partial\operatorname{vec}\left(\bm{\pi}^{(\text{old})}\right)}\right), \tag{6}$$

with $\eta^{*}=\frac{\left(\bm{Q}\operatorname{vec}\left(\bm{\pi}^{(\text{old})}\right)+\bm{w}\right)^{\top}\left(\bm{Q}\operatorname{vec}\left(\bm{\pi}^{(\text{old})}\right)+\bm{w}\right)}{\left(\bm{Q}\operatorname{vec}\left(\bm{\pi}^{(\text{old})}\right)+\bm{w}\right)^{\top}\bm{Q}\left(\bm{Q}\operatorname{vec}\left(\bm{\pi}^{(\text{old})}\right)+\bm{w}\right)}$. Finally, we obtain the uniform UOT divergence by substituting the optimal $\bm{\pi}^{*}$ back into Eq. (4). FUR minimizes the UOT divergence to regularize data samples toward the spherical Gaussian distribution. Note that the spherical Gaussian distribution maximizes its entropy and distributes its samples uniformly. The mapped data representations inherit these properties of the spherical Gaussian distribution, which further mitigates the representation collapse entanglement.
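The optimization in Eqs. (4)-(6) can be sketched in NumPy as follows. This is an illustrative toy implementation rather than the authors' code: it assumes a row-major vectorization of $\bm{\pi}$, uniform marginals $\bm{a}=\bm{b}=\mathbbm{1}/B$, an outer-product initialization, and it keeps the best iterate seen as a safeguard, none of which is specified in the text.

```python
import numpy as np

def uot_divergence(Z, S, a, b, tau_a=0.8, tau_b=0.8, iters=300):
    """Approximately minimize the quadratic UOT objective by projected steepest descent."""
    n, m = Z.shape[0], S.shape[0]
    # Squared Euclidean cost between representations and prior samples.
    C = ((Z[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    # Row-sum and column-sum operators acting on the row-major vec(pi).
    Phi_r = np.kron(np.eye(n), np.ones((1, m)))
    Phi_c = np.kron(np.ones((1, n)), np.eye(m))
    Q = tau_a * Phi_r.T @ Phi_r + tau_b * Phi_c.T @ Phi_c
    w = C.reshape(-1) - tau_a * Phi_r.T @ a - tau_b * Phi_c.T @ b
    omega = 0.5 * (tau_a * a @ a + tau_b * b @ b)

    def objective(v):
        return 0.5 * v @ Q @ v + w @ v + omega

    v = np.outer(a, b).reshape(-1)  # initial coupling (assumption of this sketch)
    best_v, best_val = v, objective(v)
    for _ in range(iters):
        g = Q @ v + w                      # gradient of the quadratic form
        curvature = g @ Q @ g
        if curvature <= 1e-12:
            break
        eta = (g @ g) / curvature          # step size eta* from the text
        v = np.maximum(0.0, v - eta * g)   # projection onto pi >= 0
        val = objective(v)
        if val < best_val:                 # safeguard: keep the best iterate
            best_v, best_val = v, val
    return best_val, best_v.reshape(n, m)

rng = np.random.default_rng(0)
B, d = 4, 3
Z = rng.standard_normal((B, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
S = rng.standard_normal((B, d))
S /= np.linalg.norm(S, axis=1, keepdims=True)
a = np.full(B, 1.0 / B)
b = np.full(B, 1.0 / B)
value, pi = uot_divergence(Z, S, a, b)
```

Forming $\bm{Q}$ explicitly costs $O(B^{4})$ memory, so this dense version is only suitable for very small batches; a practical implementation would apply the row-sum and column-sum operators implicitly.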

3.4 EUA for Generalizing Unified Representation

Due to non-IID client data, clients optimize toward their local optimums with inconsistent model parameters, causing inconsistent or even conflicting model deviations from server to clients. Without the guidance of supervision signals, i.e., data labels, this problem is further exacerbated, since data of the same class but different clients are represented in inconsistent spaces. Thus it is vital to constrain the consistency among client models in parameter spaces, which further guarantees unified representations.

In round $t$ , the impact of global aggregation on $k$ -th local optimization can be measured with the model deviation change rate, i.e.,

$$c_{k}\left(\eta,\bm{d}^{t}\right)=\frac{u_{k}\left(\bm{\theta}^{t}_{g}\right)-u_{k}\left(\bm{\theta}^{t+1}_{g}\right)}{u_{k}\left(\bm{\theta}^{t}_{g}\right)}\approx\eta_{g}\nabla\log u_{k}(\bm{\theta}^{t}_{g})\,\bm{d}^{t}, \tag{7}$$

where $u_{k}(\bm{\theta}^{t}_{g})=\|\bm{\theta}^{t}_{g}-\bm{\theta}^{t}_{k}\|^{2}$ is the model deviation from the server to client $k$ [16, 12], and the global model optimization is $\bm{\theta}_{g}^{t+1}=\bm{\theta}_{g}^{t}-\eta\bm{d}^{t}$ with updating direction $\bm{d}^{t}$ and step size $\eta$ (written as $\eta_{g}$ in Eq. (7)). If inconsistent model deviations are overlooked, the aggregated global model inevitably gets close to a subset of clients while deviating from others. Correspondingly, clients that get closer to the global model see their model deviation change rate increase, while clients that move away see it decrease [12, 32]. Motivated by this, we seek the clients with the worst model deviation change rate, and correct the global optimization with a direction that maximizes the overall worst model deviation change rate. This can be formulated as a multi-objective optimization, which helps mitigate the inconsistencies and conflicts among clients [12, 42], i.e.,

$$\max_{\bm{d}^{t}}\min_{\bm{p}\in\mathbb{S}^{K}}\ \eta_{g}\sum_{k=1}^{K}p_{k}\nabla\log u_{k}(\bm{\theta}^{t}_{g})\,\bm{d}^{t},\quad\text{s.t.}\ \|\bm{d}^{t}\|^{2}\leq 1,\ \bm{p}^{\top}\bm{1}=1,\ p_{k}\geq 0, \tag{8}$$

where $\bm{p}$ denotes the weights for different clients.
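The first-order approximation behind the model deviation change rate in Eq. (7) can be checked numerically. The sketch below uses $\nabla\log u_{k}(\bm{\theta})=2(\bm{\theta}-\bm{\theta}_{k})/\|\bm{\theta}-\bm{\theta}_{k}\|^{2}$, which follows from the definition of $u_{k}$; the toy vectors and function names are assumptions for illustration.

```python
import numpy as np

def deviation(theta_g, theta_k):
    """Model deviation u_k = ||theta_g - theta_k||^2."""
    return float(np.sum((theta_g - theta_k) ** 2))

def grad_log_deviation(theta_g, theta_k):
    """Gradient of log u_k with respect to the global parameters."""
    diff = theta_g - theta_k
    return 2.0 * diff / np.sum(diff ** 2)

def change_rate_exact(theta_g, theta_k, d, eta):
    """Relative decrease of u_k after the global step theta_g - eta * d."""
    u_old = deviation(theta_g, theta_k)
    u_new = deviation(theta_g - eta * d, theta_k)
    return (u_old - u_new) / u_old

def change_rate_linear(theta_g, theta_k, d, eta):
    """First-order approximation eta * <grad log u_k, d>."""
    return eta * float(grad_log_deviation(theta_g, theta_k) @ d)

theta_g = np.array([1.0, 2.0])
theta_k = np.array([0.0, 0.0])
d = np.array([0.6, 0.8])   # unit-norm update direction
eta = 0.01

exact = change_rate_exact(theta_g, theta_k, d, eta)
linear = change_rate_linear(theta_g, theta_k, d, eta)
```

For this quadratic deviation the gap between the two quantities is exactly $\eta^{2}\|\bm{d}^{t}\|^{2}/u_{k}$, so the approximation tightens as the step size shrinks.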

Optimization. Adding the constraints as Lagrange multipliers, Eq. (8) can be rewritten as:

$$\max_{\bm{d}^{t}}\min_{\bm{p}}J=\eta_{g}\left<\sum_{k=1}^{K}p_{k}\nabla\log u_{k}(\bm{\theta}^{t}_{g}),\bm{d}^{t}\right>-\frac{\phi}{2}\left(\|\bm{d}^{t}\|^{2}-1\right). \tag{9}$$

Differentiating Eq. (9) with respect to $\bm{d}^{t}$, we have $\bm{d}^{*}=\frac{\eta_{g}}{\phi}{(\nabla\log{\bm{u}})}^{\top}\bm{p}$, where ${\bm{u}}=({u}_{1}(\bm{\theta}_{g}^{t}),\dots,{u}_{K}(\bm{\theta}_{g}^{t}))$. Substituting it back into Eq. (9), we obtain its strong dual form with dual variable $\bm{p}$:

$$\bm{J}=\min_{\bm{p}}\frac{\eta^{2}_{g}}{2\phi}\left\|{(\nabla\log{\bm{u}})}^{\top}\bm{p}\right\|^{2}=\min_{\bm{p}}\frac{\eta^{2}_{g}}{2\phi}\bm{p}^{\top}\bm{G}\bm{p}, \tag{10}$$

where $\bm{G}=(\nabla\log{\bm{u}})(\nabla\log{\bm{u}})^{\top}$, so that $\bm{p}^{\top}\bm{G}\bm{p}=\|(\nabla\log{\bm{u}})^{\top}\bm{p}\|^{2}$. Then we can rewrite it in an augmented Lagrangian form,

$$\bm{J}=\min_{\bm{p}}\frac{\eta^{2}_{g}}{2\phi}\bm{p}^{\top}\bm{G}\bm{p}+\mu(\bm{p}^{\top}\bm{1}-1)+\frac{\rho}{2}\|\bm{p}^{\top}\bm{1}-1\|^{2}, \tag{11}$$

where $\mu$ denotes the Lagrange multipliers. This can be iteratively solved by alternating direction method of multipliers (ADMM) algorithm [7, 27], i.e., fixing $\mu$ to optimize $\bm{p}$ , and vice versa:

$$\left\{\begin{array}{l}\bm{p}=\max\left(0,\left(\frac{\eta^{2}_{g}}{\phi}\bm{G}+\rho\bm{I}\right)^{-1}(\rho\bm{I}-\mu\bm{I})\right)\\ \mu\leftarrow\mu+\rho(\bm{p}^{\top}\bm{1}-1)\end{array}\right. \tag{12}$$

The ADMM iteration yields an exact solution with low computational complexity, and the global model then updates along $\bm{d}^{*}$ with step size $\eta$.
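The alternating scheme can be illustrated with a small NumPy sketch of the augmented Lagrangian in Eq. (11). It is a hedged toy version, not the authors' implementation: the $\bm{p}$-update below is obtained by setting the gradient of Eq. (11) to zero, which gives the linear system $(\frac{\eta_{g}^{2}}{\phi}\bm{G}+\rho\bm{1}\bm{1}^{\top})\bm{p}=(\rho-\mu)\bm{1}$ and may differ in form from Eq. (12); the rows of `grads` are assumed to hold $\nabla\log u_{k}$, and the system matrix is assumed to be invertible.

```python
import numpy as np

def eua_weights(grads, eta_g=1.0, phi=0.1, rho=1.0, iters=200):
    """Alternate a closed-form p-update with a multiplier update for Eq. (11).

    grads has shape (K, D); row k is assumed to be the gradient of log u_k.
    """
    K = grads.shape[0]
    G = grads @ grads.T                      # (K, K) Gram matrix of client gradients
    c = eta_g ** 2 / phi
    ones = np.ones(K)
    A = c * G + rho * np.outer(ones, ones)   # assumed invertible in this sketch
    mu = 0.0
    p = ones / K
    for _ in range(iters):
        p = np.linalg.solve(A, (rho - mu) * ones)  # stationarity of Eq. (11) in p
        p = np.maximum(0.0, p)                      # keep the weights non-negative
        mu += rho * (p.sum() - 1.0)                 # multiplier update
    return p

def eua_direction(grads, p, eta_g=1.0, phi=0.1):
    """Update direction d* = (eta_g / phi) * grads^T p."""
    return (eta_g / phi) * grads.T @ p

# Two toy clients with orthogonal gradients of different magnitude.
grads = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
p_star = eua_weights(grads)
d_star = eua_direction(grads, p_star)
rates = grads @ d_star   # first-order change rates per client, up to the factor eta_g
```

In this toy case the weights settle at $(0.8, 0.2)$ and both clients receive the same first-order change rate, which matches the balancing behaviour that EUA is designed to achieve.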

Theorem 1 (Optimization consistency of model deviations).

Rethinking the Lagrangian of dual form in Eq. (10),

$$\bm{J}=\min_{\bm{p}}\frac{\eta^{2}_{g}}{2\phi}\left\|{(\nabla\log{\bm{u}})}^{\top}\bm{p}\right\|^{2}+\lambda\bm{p}^{\top}\bm{1}, \tag{13}$$

it holds $\nabla\log({u}_{i}\left(\bm{\theta}^{t}_{g}\right))=\nabla\log({u}_{j}\left(\bm{\theta}^{t}_{g}\right))$, $\forall i\neq j\in[K]$.

Proof.

We provide the proof details in Appendix A.1. ∎

After the convergence of global and local optimization, EUA balances the model deviation change rate among all clients, so that the global aggregation improves all models equivalently. Therefore, all models optimize towards a consistent parameter space, obtaining unified representations.

3.5 Overall Algorithm and Convergence Analysis

We describe the overall algorithm of Fed $\text{U}^{2}$ in Algo. 1. In detail, the server collaborates with clients in steps 1:10. After collecting participating client models in step 8, server uses EUA to reach a consistent model updating and obtain unified representations. The client executes self-supervised modeling in steps 11:21, where FUR enhances uniform representations to avoid collapse entanglement in step 17.

Convergence Analysis. In the following, we make four mild assumptions [23], and provide the generalization bounds of model divergence and overall convergence error.

Assumption 1.

Let $F_{k}(\bm{\theta}_{k})$ be the expected model objective for client $k$, and assume $F_{1},\cdots,F_{K}$ are all $L$-smooth, i.e., for all $\bm{\theta}_{k}$ and $\bm{\theta}_{k}'$, $F_{k}(\bm{\theta}_{k})\leq F_{k}(\bm{\theta}_{k}')+(\bm{\theta}_{k}-\bm{\theta}_{k}')^{\top}\nabla F_{k}(\bm{\theta}_{k}')+\frac{L}{2}\|\bm{\theta}_{k}-\bm{\theta}_{k}'\|_{2}^{2}$.

Assumption 2.

Let $F_{1},\cdots,F_{K}$ be all $\mu$-strongly convex, i.e., for all $\bm{\theta}_{k}$ and $\bm{\theta}_{k}'$, $F_{k}(\bm{\theta}_{k})\geq F_{k}(\bm{\theta}_{k}')+(\bm{\theta}_{k}-\bm{\theta}_{k}')^{\top}\nabla F_{k}(\bm{\theta}_{k}')+\frac{\mu}{2}\|\bm{\theta}_{k}-\bm{\theta}_{k}'\|_{2}^{2}$.

Assumption 3.

Let $\xi^{t}_{k}$ be sampled from the $k$ -th client’s local data uniformly at random. The variance of stochastic gradients in each client is bounded: $\mathbb{E}\left\|\nabla F_{k}\left(\bm{\theta}_{k}^{t},\xi^{t}_{k}\right)-% \nabla F_{k}\left(\bm{\theta}_{k}^{t}\right)\right\|^{2}\leq\sigma_{k}^{2}$ .

Assumption 4.

The expected squared norm of stochastic gradients is uniformly bounded, i.e., $\mathbb{E}\left\|\nabla F_{k}\left(\bm{\theta}_{k}^{t},\xi^{t}_{k}\right)\right\|^{2}\leq V^{2}$ for all $k=1,\cdots,K$ and $t=1,\cdots,T-1$.

Lemma 1 (Bound of Client Model Divergence).

With Assumption 4, if $\eta_{t}$ is non-increasing and $\eta_{t}<2\eta_{t+E}$ (learning rate of the $t$-th round and $E$-th epoch) for all $t\geq 0$, there exists $t_{0}\leq t$ such that $t-t_{0}\leq E-1$ and $\bm{\theta}^{t_{0}}_{k}=\bm{\theta}^{t_{0}}$ for all $k\in[K]$. It follows that

$$\mathbb{E}\left[\sum_{k=1}^{K}p_{k}\|\bm{\theta}^{t}-\bm{\theta}^{t}_{k}\|^{2}\right]\leq 4\eta_{t}^{2}{(E-1)}^{2}V^{2}. \tag{14}$$

Proof.

We provide the proof details in Appendix A.2. ∎

Theorem 2 (Convergence Error Bound).

Let assumptions 1-4 hold, and $L,\mu,\sigma_{k},V$ be defined therein. Let $\kappa=\frac{L}{\mu},\gamma=\max\{8\kappa,E\}$ and the learning rate $\eta_{t}=\frac{2}{\mu(\gamma+t)}$ . The Fed $\text{U}^{2}$ with full client participation satisfies

$$\mathbb{E}\left[F\left({\overline{\bm{\theta}}}^{t}\right)\right]-F^{*}\leq\frac{\kappa}{\gamma+t}\left(\frac{2B}{\mu}+\frac{\mu(\gamma+1)}{2}\|\bm{\theta}^{t}-\bm{\theta}^{*}\|^{2}\right),$$

where $B=4(E-1)^{2}V^{2}+K+2\Gamma$ .

Proof.

We provide the proof details in Appendix A.3. ∎
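For intuition about the step-size schedule in Theorem 2, the short sketch below evaluates $\eta_{t}=\frac{2}{\mu(\gamma+t)}$ and checks numerically that it is non-increasing and satisfies the non-strict condition $\eta_{t}\leq 2\eta_{t+E}$. The constants `L`, `mu`, and `E` are arbitrary toy values chosen for this example.

```python
def learning_rate(t, L, mu, E):
    """Decaying step size eta_t = 2 / (mu * (gamma + t)) with gamma = max(8 * L / mu, E)."""
    kappa = L / mu
    gamma = max(8 * kappa, E)
    return 2.0 / (mu * (gamma + t))

# Toy constants (assumptions for illustration only).
L, mu, E = 4.0, 1.0, 5
etas = [learning_rate(t, L, mu, E) for t in range(200)]

non_increasing = all(etas[t + 1] <= etas[t] for t in range(len(etas) - 1))
bounded_ratio = all(etas[t] <= 2 * etas[t + E] for t in range(len(etas) - E))
```

The ratio condition holds because $\eta_{t}/\eta_{t+E}=1+E/(\gamma+t)$ and $\gamma\geq E$ by construction.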

Algorithm 1 Training procedure of Fed $\text{U}^{2}$

Input: Batch size $B$, communication rounds $T$, number of clients $K$, local steps $E$, dataset $\mathcal{D}=\cup_{k\in[K]}\mathcal{D}_{k}$
Output: Global model $\bm{\theta}^{T}$

1: Server executes():
2:   Initialize $\bm{\theta}^{0}$ with random distribution
3:   for $t=0,1,...,T-1$ do
4:     for $k=1,2,...,K$ in parallel do
5:       Send $\bm{\theta}^{t}$ to client $k$
6:       $\bm{\theta}_{k}^{t+1}\leftarrow$ Client executes($k$, $\bm{\theta}^{t}$)
7:     end for
8:     EUA optimize Eq. (10) for $\bm{p}^{*}$ and update global model $\bm{\theta}^{t+1}$ with optimal direction $\bm{d}^{t}$ in Eq. (9)
9:   end for
10:  return $\bm{\theta}^{T}$
11: Client executes($k$, $\bm{\theta}^{t}$):
12:   Assign global model to the local model $\bm{\theta}_{k}^{t}\leftarrow\bm{\theta}^{t}$
13:   for each local epoch $e=1,2,...,E$ do
14:     for batch of samples $\bm{X}_{k,B}\in\mathcal{D}_{k}$ do
15:       Augment samples $\bm{X}_{k,B}^{v}=T^{v}_{v\in\{1,2\}}(\bm{X}_{k,B})$
16:       Feature extraction $\bm{Z}_{k,B}^{v}\leftarrow\mathcal{F}_{\bm{\theta}_{k}^{e}}(\bm{X}_{k,B}^{v})$
17:       FUR enhances the uniformity of $\bm{Z}_{k,B}^{v}$ by Eq. (4)
18:       Compute total loss in Eq. (2) and update $\bm{\theta}_{k}^{e}$
19:     end for
20:   end for
21:   return $\bm{\theta}_{k}^{E}$ to server

4 Experiments

4.1 Experimental Setups

Datasets. We adopt two benchmark datasets, i.e., CIFAR10 and CIFAR100 [20], to evaluate Fed $\text{U}^{2}$. Both datasets have 50,000 training samples and 10,000 test samples, but differ in the number of classes. Following FedEMA [45], we simulate non-IID data distribution in $K$ clients by assuming class priors follow the Dirichlet distribution parameterized with non-IID degree, i.e., $\alpha$ [11]. A smaller $\alpha$ simulates a more non-IID federated setting. We conduct extensive experiments in both cross-silo ($K=10$) and cross-device ($K=100$) settings to validate performance generalization.
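A Dirichlet-based non-IID split can be sketched as below. This follows one common variant in which, for each class, the proportion of its samples assigned to every client is drawn from $\mathrm{Dir}(\alpha)$; whether this matches the exact protocol of the experiments is an assumption, and the function name is hypothetical.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle the indices of class c, then cut them according to Dirichlet proportions.
        idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices

# Toy labels: 10 classes with 100 samples each, split across 10 clients.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, num_clients=10, alpha=0.1)
client_sizes = [len(p) for p in parts]
```

With a small $\alpha$ such as 0.1, most of each class tends to land on a few clients, so both class composition and client sizes become highly imbalanced.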

Comparison Methods. We compare Fed $\text{U}^{2}$ with three categories of approaches, i.e., (1) combining the existing centralized self-supervised model with FedAvg [31]: FedSimsiam, FedSimCLR, and FedBYOL, (2) the state-of-the-art FUSL methods: FedU [44], FedEMA [45], FedX [9], Orchestra [30], and L-DAWA [33], and (3) adapting existing federated supervised learning models solving representation collapse to FUSL: FedDecorr [36]. Firstly, we evaluate the representation performance of the above methods with their best-performing models in both cross-silo and cross-device settings on CIFAR10 and CIFAR100 ($\alpha=0.1$). Secondly, we study the effectiveness of different methods with the same model, i.e., BYOL, for different non-IID degrees [25], i.e., $\alpha=\{0.1,0.5,5\}$. We evaluate the pre-trained encoder model via the accuracy of KNN [4], standard linear probing [2], and semi-supervised methods (i.e., fine-tuning with 1% and 10% labeled data). For all of the above metrics, higher values indicate better performance.
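As an illustration of KNN-based evaluation of frozen representations, the sketch below classifies test embeddings by majority vote among their nearest training embeddings under cosine similarity. It is a simplified stand-in: the unweighted vote, the value of `k`, and the function name are assumptions and may differ from the protocol of [4].

```python
import numpy as np

def knn_accuracy(train_Z, train_y, test_Z, test_y, k=3):
    """Accuracy of a cosine-similarity k-nearest-neighbour classifier on embeddings."""
    train_Z = train_Z / np.linalg.norm(train_Z, axis=1, keepdims=True)
    test_Z = test_Z / np.linalg.norm(test_Z, axis=1, keepdims=True)
    sims = test_Z @ train_Z.T                      # (num_test, num_train)
    neighbours = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar samples
    votes = train_y[neighbours]
    num_classes = int(train_y.max()) + 1
    preds = np.array([np.bincount(v, minlength=num_classes).argmax() for v in votes])
    return float((preds == test_y).mean())

# Toy embeddings forming two well-separated directions.
train_Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_y = np.array([0, 0, 1, 1])
test_Z = np.array([[1.0, 0.05], [0.05, 1.0]])
test_y = np.array([0, 1])

acc = knn_accuracy(train_Z, train_y, test_Z, test_y, k=3)
```

Since no parameters are trained on top of the encoder, this kind of evaluation reflects the quality of the representation space itself.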

Implementation Details. We conduct image augmentation for SimCLR, BYOL, and SimSiam, following their original papers. We adopt ResNet18 [10] as the encoder module, choose Projector/Predictor architectures as in the original papers, and optimize each model for 5 local epochs per communication round until convergence. For all datasets, we set the batch size to 128 and the embedding dimension to 512. To obtain fair comparisons, we conduct every experiment for each method with its best hyper-parameters, and report the average result of 3 repetitions. We choose Adam [19] as the optimizer for each local model, and SGD [37] for updating the global model. We set the uniformity effect $\lambda_{U}=0.1$, the soft margin constraints $\tau_{a}=\tau_{b}=0.8$, the coefficient of constraints $\phi=0.1$, and the coefficient in ADMM $\rho=1$.

Table 1: Accuracy (%) of linear probing (LP) and fine-tuning (FT) with 1% and 10% labeled data on CIFAR10 and CIFAR100 ($\alpha=0.1$).

Dataset	CIFAR10						CIFAR100
Setting $\alpha=0.1$	Cross-Device (K=100)			Cross-Silo (K=10)			Cross-Device (K=100)			Cross-Silo (K=10)
Method \Evaluation	LP	FT 1%	FT 10%	LP	FT 1%	FT 10%	LP	FT 1%	FT 10%	LP	FT 1%	FT 10%
FedSimsiam	60.49	44.45	70.46	70.61	57.60	69.88	31.91	12.58	37.33	49.81	21.64	43.08
FedDecorr-Simsiam	43.18	35.15	58.68	74.56	65.21	80.10	17.09	5.36	20.60	47.93	20.53	45.21
Fed $\text{U}^{2}$ -Simsiam	68.50	56.43	75.33	84.92	77.11	85.21	35.59	13.08	38.22	56.55	31.42	48.75
FedSimCLR	65.76	51.18	68.33	75.65	62.86	76.15	37.09	11.73	31.97	51.62	19.47	41.60
L-DAWA-SimCLR	65.63	49.66	69.89	75.48	63.4	78.66	41.28	13.52	36.56	51.11	21.07	45.02
FedX-SimCLR	67.33	49.96	70.18	78.29	65.03	79.43	38.11	11.18	33.96	51.67	19.65	42.38
Fed $\text{U}^{2}$ -SimCLR	66.49	51.77	70.76	82.37	69.84	82.39	41.56	14.32	36.90	56.56	26.11	47.99
FedBYOL	61.46	54.36	74.01	83.29	74.04	81.40	28.27	10.43	34.90	48.78	19.79	42.82
FedU-BYOL	60.15	53.53	74.62	82.33	69.24	83.37	28.09	10.46	36.06	58.02	28.38	48.12
FedEMA-BYOL	62.27	54.91	74.76	82.17	71.37	83.78	28.40	10.63	35.62	57.25	30.03	50.33
Orchestra	38.66	41.62	62.97	83.53	78.44	85.40	17.91	6.96	23.40	51.31	26.36	48.85
Fed $\text{U}^{2}$ -BYOL	67.62	54.74	74.93	85.58	78.64	86.24	38.09	13.16	36.87	59.71	34.83	53.87

Table 2: KNN accuracy (%) under different $\alpha$ on CIFAR10 and CIFAR100 in the cross-silo setting.

Dataset	CIFAR10			CIFAR100
Method \ $\alpha$	0.1	0.5	5	0.1	0.5	5
FedBYOL	76.12	77.23	82.71	38.13	43.93	45.30
FedU-BYOL	79.09	79.68	82.75	51.31	51.81	52.05
FedEMA-BYOL	80.32	82.01	82.80	53.18	53.18	53.28
FedDecorr-BYOL	76.76	79.66	81.09	49.87	49.54	52.07
L-DAWA-BYOL	65.40	66.17	82.82	23.09	51.10	52.25
FedX-BYOL	50.94	40.96	41.05	15.83	16.35	16.89
Orchestra	79.25	76.78	76.30	38.52	37.17	34.93
Fed $\text{U}^{2}$ -FUR-BYOL	81.04	82.18	83.45	53.45	53.86	54.34
Fed $\text{U}^{2}$ -EUA-BYOL	80.93	82.14	84.01	53.20	53.72	54.61
Fed $\text{U}^{2}$ -BYOL	81.39	82.21	84.79	53.87	54.07	55.06

4.2 Experimental Results

Representation Performance Comparison. First, following existing FUSL methods [45, 30, 33], we evaluate the pre-trained models in Tab. 1. We group the state-of-the-art methods by their best-performing self-supervised models, i.e., Simsiam, SimCLR, and BYOL. In the first group, FedDecorr performs the worst, especially on the CIFAR100 cross-device task. This indicates that directly avoiding collapse by decorrelating a batch of data representations is unsuitable for FUSL with limited data and severely heterogeneous data distributions. In the second group, L-DAWA performs better than FedX on CIFAR100 but is less competitive on CIFAR10. We conclude that (1) L-DAWA can better control model divergence when clients have inconsistent optima, and (2) L-DAWA fails to obtain discriminative representations since it takes no action against representation collapse. In the third group, Orchestra captures global supervision signals to guide data representation, but its effectiveness suffers from randomness. In general, directly combining an existing self-supervised model with FedAvg cannot tackle FUSL with non-IID data well. The cross-device simulation on CIFAR100 is so challenging that some existing methods fail dramatically. Moreover, Fed $\text{U}^{2}$ is agnostic to the self-supervised model and performs better than existing work in most settings, which validates the benefit of enhancing uniform and unified representations.

Effect of Heterogeneity on Generalization. Next, we report the KNN accuracy of the methods in the cross-silo setting on CIFAR10 and CIFAR100 in Tab. 2 to validate performance generalization. We use the same self-supervised model, i.e., BYOL, so that all FUSL methods are comparable. We observe that: (1) Most FUSL methods improve as the non-IID degree $\alpha$ increases, and the performance gap among methods widens as $\alpha$ decreases. (2) FedEMA-BYOL is not sensitive to the non-IID degree, while Orchestra behaves in the opposite way, with its accuracy dropping as $\alpha$ increases. This suggests that capturing global supervision signals to guide local representation suffers from clustering randomness. (3) Fed $\text{U}^{2}$ performs the best in all settings, even at $\alpha=0.1$, illustrating its performance generalization.
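To make the KNN metric concrete, the following is a minimal sketch of KNN accuracy on frozen features. It is not the evaluation code behind Tab. 2: the function name `knn_accuracy`, the cosine similarity, the majority vote, and `k=5` are all assumptions chosen for illustration, and the features are toy data.

```python
import numpy as np


def knn_accuracy(train_x, train_y, test_x, test_y, k=5):
    """Top-1 accuracy of a cosine-similarity k-NN classifier (illustrative assumption)."""
    # L2-normalize so that dot products equal cosine similarities.
    train_x = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    test_x = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = test_x @ train_x.T                    # shape (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]    # indices of the k most similar samples
    nn_labels = train_y[nn_idx]                  # shape (n_test, k)
    n_classes = int(train_y.max()) + 1
    votes = np.zeros((len(test_x), n_classes), dtype=int)
    for c in range(n_classes):
        votes[:, c] = (nn_labels == c).sum(axis=1)
    pred = votes.argmax(axis=1)                  # majority vote over neighbours
    return float((pred == test_y).mean())


# Toy stand-in for encoder outputs: two well-separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
train_y = np.repeat([0, 1], 50)
test_y = np.repeat([0, 1], 20)
train_x = centers[train_y] + 0.1 * rng.standard_normal((100, 2))
test_x = centers[test_y] + 0.1 * rng.standard_normal((40, 2))
acc = knn_accuracy(train_x, train_y, test_x, test_y, k=5)
print(f"KNN accuracy: {100 * acc:.2f}%")
```

Because the encoder stays frozen, this metric reflects the quality of the learned representation space rather than of any classifier trained on top of it.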

Ablation Studies. In Tab. 2, we also consider two variants of Fed $\text{U}^{2}$ to study the effect of each module: (1) Fed $\text{U}^{2}$ with FUR removed, denoted Fed $\text{U}^{2}$ -FUR, and (2) Fed $\text{U}^{2}$ with EUA removed, denoted Fed $\text{U}^{2}$ -EUA. From Tab. 2, we can see that applying either FUR or EUA alone enhances representations, since both variants outperform the existing FUSL methods. Compared with Fed $\text{U}^{2}$, both variants lose KNN accuracy slightly, validating the effectiveness of tackling the two challenges, i.e., representation collapse entanglement and generating unified representations. Fed $\text{U}^{2}$ -FUR performs better than Fed $\text{U}^{2}$ -EUA when $\alpha=0.1$, but worse when $\alpha=5$.
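The size of the ablation effect can be read off Tab. 2 directly. The short sketch below only does arithmetic on the CIFAR10 columns of Tab. 2; the variable names are illustrative.

```python
# KNN accuracy (%) on CIFAR10 copied from Tab. 2, keyed by alpha.
full = {0.1: 81.39, 0.5: 82.21, 5: 84.79}          # row "Fed U^2 -BYOL"
variant_fur = {0.1: 81.04, 0.5: 82.18, 5: 83.45}   # row "Fed U^2 -FUR-BYOL"
variant_eua = {0.1: 80.93, 0.5: 82.14, 5: 84.01}   # row "Fed U^2 -EUA-BYOL"

# Accuracy lost by each variant relative to the full model, in percentage points.
drops = {
    alpha: (
        round(full[alpha] - variant_fur[alpha], 2),
        round(full[alpha] - variant_eua[alpha], 2),
    )
    for alpha in full
}
for alpha, (d_fur, d_eua) in drops.items():
    print(f"alpha={alpha}: -FUR variant loses {d_fur}, -EUA variant loses {d_eua}")
```

On CIFAR10 the gaps stay below 1.5 percentage points, and the ordering of the two variants flips between $\alpha=0.1$ and $\alpha=5$, which matches the last sentence of the paragraph above.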

4.3 Representation Visualization

Analysis of Representation Collapse Entanglement. To study the representation collapse entanglement caused by non-IID data, we compute the representation covariance matrices of $N$ test data points on CIFAR10 from both the global and local BYOL models of the FUSL methods, i.e., FedDecorr, L-DAWA, FedBYOL, and Fed $\text{U}^{2}$. We then apply singular value decomposition to each covariance matrix and visualize the top-100 singular values in Fig. 3. Both L-DAWA and FedBYOL suffer from severe representation collapse, because they have fewer singular values above 0 than FedDecorr and Fed $\text{U}^{2}$. The representation collapse in the global model and in the local models is consistent, indicating that the collapse impacts are entangled. The singular values of FedDecorr and Fed $\text{U}^{2}$ differ from those decomposed from the covariance matrix of Gaussian random samples, because a fully uniform distribution breaks down the alignment effect and deteriorates clustering. Furthermore, in Fig. 4, we visualize the representations on a 3-D sphere, where the existing FUSL methods leave evident blank space and suffer from collapse entanglement.
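The diagnostic described above can be sketched as follows. This is an illustration on synthetic data, not the script used for Fig. 3: the function name `covariance_spectrum`, the dimensions, and the tolerance are assumptions, and the "collapsed" features are simulated by restricting random features to a low-dimensional subspace.

```python
import numpy as np


def covariance_spectrum(z, top_k=100):
    """Top-k singular values (descending) of the covariance of features z, shape (N, d)."""
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / z.shape[0]                   # (d, d) covariance matrix
    s = np.linalg.svd(cov, compute_uv=False)     # returned in descending order
    return s[:top_k]


rng = np.random.default_rng(0)
n, d, rank = 2000, 128, 8                        # hypothetical sizes
gaussian = rng.standard_normal((n, d))           # reference: isotropic Gaussian samples
# Simulated collapse: features occupy only an 8-dimensional subspace of the 128 dims.
collapsed = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, d))

s_gauss = covariance_spectrum(gaussian)
s_coll = covariance_spectrum(collapsed)

# Count singular values that are non-negligible relative to the largest one.
n_gauss = int((s_gauss > 1e-6 * s_gauss[0]).sum())
n_coll = int((s_coll > 1e-6 * s_coll[0]).sum())
print(n_gauss, n_coll)
```

A spectrum that drops to numerical zero after a few components is the signature of collapse: most embedding dimensions carry no variance.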

Analysis of Unified Representation. We also use t-SNE [38] to plot the 2-D representations of both the global (circle) and local (cross) BYOL models in Fig. 5. We draw three conclusions. First, Fed $\text{U}^{2}$ has clearer cluster boundaries than FedDecorr, validating that directly decomposing the Frobenius norm of representations deteriorates generalization. Second, compared with FedBYOL, the global and local representations of L-DAWA are looser, implying its ineffectiveness in controlling conflicting model deviations. Last, with the effect of EUA, Fed $\text{U}^{2}$ achieves tighter distribution consistency between global and local representations, as well as clearer cluster boundaries.

Hyper-parameter Sensitivity. We study the sensitivity to a highly relevant hyper-parameter, i.e., the weight of the uniformity term $\lambda_{U}\in\{0,0.01,0.05,0.1,0.2,0.5\}$, on CIFAR10 in the cross-silo setting ($\alpha=0.1$) in Fig. 6. We set $\lambda_{U}=0.1$ in experiments since it reaches the highest performance. We leave the analysis of the number of clients $K\in\{5,10,20,50,100\}$ and the local epochs $E\in\{5,10,20,50\}$ to Appendix B.

5 Conclusion

In this work, we propose a FUSL framework, i.e., Fed $\text{U}^{2}$, to enhance Uniform and Unified representation. Fed $\text{U}^{2}$ consists of a flexible uniform regularizer (FUR) and an efficient unified aggregator (EUA). FUR encourages data representations to distribute uniformly in a spherical Gaussian space, mitigating representation collapse and its subsequent entangled impacts. EUA further constrains consistent optimization improvements among different client models, which promotes unified representation. In our empirical studies, we consider both cross-silo and cross-device settings and conduct experiments on the CIFAR10 and CIFAR100 datasets, which extensively validate the superiority of Fed $\text{U}^{2}$.

Acknowledgements. This work was supported by the National Key R&D Program of China (2022YFB4501500, 2022YFB4501504) and the National Natural Science Foundation of China (No.72192823).

References

Benamou [2003] Jean-David Benamou. Numerical resolution of an “unbalanced” mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis, 37(5):851–868, 2003.
Chen et al. [2020a] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020a.
Chen et al. [2020b] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020b.
Chen and He [2021a] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021a.
Chen and He [2021b] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021b.
Dennis et al. [2021] Don Kurian Dennis, Tian Li, and Virginia Smith. Heterogeneity for the win: One-shot federated clustering. In International Conference on Machine Learning, pages 2611–2620. PMLR, 2021.
Eckstein and Yao [2012] Jonathan Eckstein and Wang Yao. Augmented lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32(3):44, 2012.
Grill et al. [2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
Han et al. [2022] Sungwon Han, Sungwon Park, Fangzhao Wu, Sundong Kim, Chuhan Wu, Xing Xie, and Meeyoung Cha. Fedx: Unsupervised federated learning with cross knowledge distillation. In European Conference on Computer Vision, pages 691–707. Springer, 2022.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hsu et al. [2019] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
Hu et al. [2022] Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051, 2022.
Hua et al. [2021] Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9598–9608, 2021.
Jin et al. [2023] Yilun Jin, Yang Liu, Kai Chen, and Qiang Yang. Federated learning without full labels: A survey. arXiv preprint arXiv:2303.14453, 2023.
Jing et al. [2021] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In International Conference on Learning Representations, 2021.
Karimireddy et al. [2020] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132–5143. PMLR, 2020.
Kim et al. [2023a] Hansol Kim, Youngjun Kwak, Minyoung Jung, Jinho Shin, Youngsung Kim, and Changick Kim. Protofl: Unsupervised federated learning via prototypical distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6470–6479, 2023a.
Kim et al. [2023b] Jaeill Kim, Suhyun Kang, Duhun Hwang, Jungwook Shin, and Wonjong Rhee. Vne: An effective method for improving deep representation by manipulating eigenvalue distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3799–3810, 2023b.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Li et al. [2022] Alexander C Li, Alexei A Efros, and Deepak Pathak. Understanding collapse in non-contrastive siamese representation learning. In European Conference on Computer Vision, pages 490–505. Springer, 2022.
Li et al. [2020] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
Li et al. [2019] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2019.
Liao et al. [2023a] Xinting Liao, Chaochao Chen, Weiming Liu, Pengyang Zhou, Huabin Zhu, Shuheng Shen, Weiqiang Wang, Mengling Hu, Yanchao Tan, and Xiaolin Zheng. Joint local relational augmentation and global nash equilibrium for federated learning with non-iid data. In Proceedings of the 31st ACM International Conference on Multimedia, pages 1536–1545, 2023a.
Liao et al. [2023b] Xinting Liao, Weiming Liu, Chaochao Chen, Pengyang Zhou, Huabin Zhu, Yanchao Tan, Jun Wang, and Yue Qi. Hyperfed: Hyperbolic prototypes exploration with consistent aggregation for non-iid data in federated learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 3957–3965, 2023b.
Liu et al. [2021] Weiming Liu, Jiajie Su, Chaochao Chen, and Xiaolin Zheng. Leveraging distribution alignment via stein path for cross-domain cold-start recommendation. Advances in Neural Information Processing Systems, 34:19223–19234, 2021.
Liu et al. [2023a] Weiming Liu, Xiaolin Zheng, Chaochao Chen, Mengling Hu, Xinting Liao, Fan Wang, Yanchao Tan, Dan Meng, and Jun Wang. Differentially private sparse mapping for privacy-preserving cross domain recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 6243–6252, 2023a.
Liu et al. [2023b] Weiming Liu, Xiaolin Zheng, Chaochao Chen, Jiajie Su, Xinting Liao, Mengling Hu, and Yanchao Tan. Joint internal multi-interest exploration and external domain alignment for cross domain sequential recommendation. In Proceedings of the ACM Web Conference 2023, pages 383–394, 2023b.
Liu et al. [2023c] Weiming Liu, Xiaolin Zheng, Jiajie Su, Longfei Zheng, Chaochao Chen, and Mengling Hu. Contrastive proxy kernel stein path alignment for cross-domain cold-start recommendation. IEEE Transactions on Knowledge and Data Engineering, 2023c.
Lubana et al. [2022] Ekdeep Lubana, Chi Ian Tang, Fahim Kawsar, Robert Dick, and Akhil Mathur. Orchestra: Unsupervised federated learning via globally consistent clustering. In International Conference on Machine Learning, pages 14461–14484. PMLR, 2022.
McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
Pan et al. [2023] Zibin Pan, Shuyi Wang, Chi Li, Haijin Wang, Xiaoying Tang, and Junhua Zhao. Fedmdfg: Federated learning with multi-gradient descent and fair guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9364–9371, 2023.
Rehman et al. [2023] Yasar Abbas Ur Rehman, Yan Gao, Pedro Porto Buarque de Gusmao, Mina Alibeigi, Jiajun Shen, and Nicholas D Lane. L-dawa: Layer-wise divergence aware weight aggregation in federated self-supervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16464–16473, 2023.
Roth et al. [2020] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Bjorn Ommer, and Joseph Paul Cohen. Revisiting training strategies and generalization performance in deep metric learning. In International Conference on Machine Learning, pages 8242–8252. PMLR, 2020.
Séjourné et al. [2019] Thibault Séjourné, Jean Feydy, François-Xavier Vialard, Alain Trouvé, and Gabriel Peyré. Sinkhorn divergences for unbalanced optimal transport. arXiv preprint arXiv:1910.12958, 2019.
Shi et al. [2022] Yujun Shi, Jian Liang, Wenqing Zhang, Vincent Tan, and Song Bai. Towards understanding and mitigating dimensional collapse in heterogeneous federated learning. In The Eleventh International Conference on Learning Representations, 2022.
Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
Wang and Liu [2021] Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2021.
Wang and Isola [2020] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
Wu et al. [2021] Y Wu, Z Wang, D Zeng, M Li, Y Shi, and J Hu. Federated contrastive representation learning with feature fusion and neighborhood matching. 2021. URL https://openreview.net/forum.
Yu et al. [2020] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
Zhang et al. [2023] Fengda Zhang, Kun Kuang, Long Chen, Zhaoyang You, Tao Shen, Jun Xiao, Yin Zhang, Chao Wu, Fei Wu, Yueting Zhuang, et al. Federated unsupervised representation learning. Frontiers of Information Technology & Electronic Engineering, 24(8):1181–1193, 2023.
Zhuang et al. [2021] Weiming Zhuang, Xin Gan, Yonggang Wen, Shuai Zhang, and Shuai Yi. Collaborative unsupervised visual representation learning from decentralized data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4912–4921, 2021.
Zhuang et al. [2022] Weiming Zhuang, Yonggang Wen, and Shuai Zhang. Divergence-aware federated self-supervised learning. In International Conference on Learning Representations, 2022.

In the supplemental materials, we provide the theoretical analysis in Appendix A, and the additional experimental details in Appendix B.

Appendix A Theoretical Analysis

A.1 Optimization Consistency of Model Deviation

In the following, we provide the theorem on the optimization consistency of model deviations. When the aggregating weights, i.e., $\bm{p}$, reach their optimum, the model deviation change rates of all clients contribute equally to the global update.

Theorem 3 (Optimization consistency of model deviations).

Rethinking the Lagrangian of the dual form in Eq. (10),

$$\bm{J}=\min_{\bm{p}}\frac{\eta^{2}_{g}}{2\phi}\|{(\nabla\log{\bm{u}})}^{\top}\bm{p}\|^{2}+\lambda_{E}\bm{p}^{\top}\bm{1},\tag{15}$$

it holds that $\nabla\log({u}_{i}(\bm{\theta}^{t}_{g}))=\nabla\log({u}_{j}(\bm{\theta}^{t}_{g}))$, $\forall i\neq j\in[K]$.

Proof.

By differentiating Eq. (15) with respect to $\bm{p}$ and setting the result to zero, we obtain ${(\nabla\log{\bm{u}})}{(\nabla\log{\bm{u}})}^{\top}\bm{p}^{*}=-\frac{\phi\lambda_{E}}{\eta_{g}^{2}}\bm{1}$. Recall that $\bm{d}^{*}=\frac{\eta_{g}}{\phi}{(\nabla\log{\bm{u}})}^{\top}\bm{p}^{*}$. For clients $i\neq j\in[K]$ and ${\eta_{g}}\rightarrow 0$, we finally have a consistent model deviation change rate:

$$\lim_{\eta_{g}\rightarrow 0}c_{i}\left(\eta_{g},\bm{d}^{*}\right)=\nabla\log({u}_{i}(\bm{\theta}^{t}_{g}))\bm{d}^{*}=\nabla\log({u}_{j}(\bm{\theta}^{t}_{g}))\bm{d}^{*}=\lim_{\eta_{g}\rightarrow 0}c_{j}\left(\eta_{g},\bm{d}^{*}\right).\tag{16}$$

∎

Therefore, the global model updates with a direction that balances all model deviation change rates, obtaining consistent parameters for server and client models.
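The consistency claim can be checked numerically on random data. The sketch below is an illustration under stated assumptions, not the EUA implementation: the matrix `G` is a random stand-in whose rows play the role of $\nabla\log u_{i}$, the right-hand side of the stationarity condition is read as a multiple of the all-ones vector $\bm{1}$ from Eq. (15), any constraints on $\bm{p}$ are ignored, and the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim = 5, 20                        # hypothetical: 5 clients, 20 model parameters
eta_g, phi, lam_e = 0.1, 0.1, 0.5     # arbitrary illustrative constants

# Row i stands in for the gradient of log u_i at the global model.
G = rng.standard_normal((K, dim))

# Stationarity condition of Eq. (15): G G^T p* = -(phi * lam_e / eta_g^2) * 1
rhs = -(phi * lam_e / eta_g**2) * np.ones(K)
p_star = np.linalg.solve(G @ G.T, rhs)

# Update direction d* = (eta_g / phi) * G^T p*
d_star = (eta_g / phi) * G.T @ p_star

# Change rate of client i along d* is (row i of G) . d*
rates = G @ d_star
print(rates)
```

Substituting the stationarity condition gives `G @ d_star = -(lam_e / eta_g) * 1`, so every client sees the same change rate along $\bm{d}^{*}$, which is the balance stated after the proof.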

A.2 Bound of Client Model Divergence

In this part, we first introduce mild and general assumptions [23], and then derive the bound on the model update divergence for each client.

Assumption 5.

Let $F_{k}(\bm{\theta})$ be the expected model objective for client $k$, and assume $F_{1},\cdots,F_{K}$ are all $L$-smooth, i.e., for all $\bm{\theta}_{k}$ and $\bm{\theta}^{\prime}_{k}$, $F_{k}(\bm{\theta}_{k})\leq F_{k}(\bm{\theta}^{\prime}_{k})+(\bm{\theta}_{k}-\bm{\theta}^{\prime}_{k})^{\top}\nabla F_{k}(\bm{\theta}^{\prime}_{k})+\frac{L}{2}\|\bm{\theta}_{k}-\bm{\theta}^{\prime}_{k}\|_{2}^{2}$.

Assumption 6.

Assumption 7.

Assumption 8.

Next, we introduce the lemma related to the bound of client model divergence.

Lemma 2 (Bound of Client Model Divergence).

Under Assumption 8, suppose the learning rate $\eta_{t}$ is non-increasing and $\eta_{t}<2\eta_{t+E}$ for all $t\geq 0$, where $t$ indexes rounds and $E$ is the maximal local epoch. Then there exists $t_{0}\leq t$ such that $t-t_{0}\leq E-1$ and $\bm{\theta}^{t_{0}}_{k}=\bm{\theta}^{t_{0}}$ for all $k\in[K]$. It follows that

$$\mathbb{E}\left[\sum_{k=1}^{K}p_{k}\|\bm{\theta}^{t}-\bm{\theta}^{t}_{k}\|^{2}\right]\leq 4\eta_{t}^{2}{(E-1)}^{2}V^{2}.\tag{17}$$

Proof.

Let $E$ be the maximal local epoch. For any round $t>0$, there exists a round $t_{0}\leq t$ with $t-t_{0}\leq E-1$ at which the global model $\bm{\theta}^{t_{0}}$ and each local model $\bm{\theta}_{k}^{t_{0}}$ are the same. Then

\begin{align}
&\mathbb{E}\left[\sum_{k=1}^{K}p_{k}\|\bm{\theta}^{t}-\bm{\theta}^{t}_{k}\|^{2}\right] \notag\\
&=\mathbb{E}\left[\sum_{k=1}^{K}p_{k}\|(\bm{\theta}^{t}_{k}-\bm{\theta}^{t_{0}})-(\bm{\theta}^{t}-\bm{\theta}^{t_{0}})\|^{2}\right] \tag{17a}\\
&\leq\mathbb{E}\sum_{k=1}^{K}p_{k}\|\bm{\theta}^{t}_{k}-\bm{\theta}^{t_{0}}\|^{2} \tag{17b}\\
&=\mathbb{E}\sum_{k=1}^{K}p_{k}\left\|\sum_{\tau=t_{0}}^{t-1}\eta_{\tau}\nabla F_{k}(\bm{\theta}^{\tau}_{k},\xi_{k}^{\tau})\right\|^{2} \tag{17c}\\
&\leq\mathbb{E}\sum_{k=1}^{K}p_{k}(t-t_{0})\sum_{\tau=t_{0}}^{t-1}\eta_{t_{0}}^{2}\left\|\nabla F_{k}(\bm{\theta}^{\tau}_{k},\xi_{k}^{\tau})\right\|^{2} \tag{17d}\\
&\leq 4\eta^{2}_{t}(E-1)^{2}V^{2}, \tag{17e}
\end{align}

where Eq. (17b) holds since $\mathbb{E}(\bm{\theta}^{t}_{k}-\bm{\theta}^{t_{0}})=\bm{\theta}^{t}-\bm{\theta}^{t_{0}}$ and $\mathbb{E}\|X-\mathbb{E}(X)\|^{2}\leq\mathbb{E}\|X\|^{2}$, and Eq. (17d) derives from Jensen's inequality. ∎
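The final step, from (17d) to (17e), is stated without detail. One way to fill it in is sketched below, writing $\tau$ for the summation index over local steps. The sketch assumes, as in [23], that Assumption 8 bounds the expected squared norm of the stochastic gradients by $V^{2}$ (its statement is not reproduced in this appendix) and that the weights satisfy $\sum_{k}p_{k}=1$:

$$\begin{aligned}
\mathbb{E}\sum_{k=1}^{K}p_{k}(t-t_{0})\sum_{\tau=t_{0}}^{t-1}\eta_{t_{0}}^{2}\left\|\nabla F_{k}(\bm{\theta}^{\tau}_{k},\xi_{k}^{\tau})\right\|^{2}
&\leq (t-t_{0})^{2}\,\eta_{t_{0}}^{2}V^{2}\\
&\leq (E-1)^{2}\,\eta_{t_{0}}^{2}V^{2}\\
&\leq 4\eta_{t}^{2}(E-1)^{2}V^{2}.
\end{aligned}$$

The second line uses $t-t_{0}\leq E-1$. The third uses $\eta_{t_{0}}\leq 2\eta_{t_{0}+E}\leq 2\eta_{t}$, which holds because the learning rate is non-increasing and $t<t_{0}+E$.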

A.3 Convergence Error Bound

Definition 1 (Heterogeneity Quantification [23]).

Let $F^{*}$ and $F_{k}^{*}$ be the minimum values of $F$ and $F_{k}$ , respectively. We use the term $\Gamma=F^{*}-\sum_{k=1}^{N}p_{k}F_{k}^{*}$ for quantifying the degree of non-IID. If the data are IID, then $\Gamma$ obviously goes to zero as the number of samples grows. If the data are non-IID, then $\Gamma$ is nonzero, and its magnitude reflects the heterogeneity of the data distribution.
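As a small worked example of this definition, $\Gamma$ can be computed in closed form for quadratic client objectives. The choice $F_{k}(\bm{\theta})=\frac{1}{2}\|\bm{\theta}-\bm{c}_{k}\|^{2}$, the function name `heterogeneity_gamma`, and the centers $\bm{c}_{k}$ are hypothetical and serve only to illustrate how $\Gamma$ behaves.

```python
import numpy as np


def heterogeneity_gamma(centers, weights):
    """Gamma = F* - sum_k p_k F_k* for the toy objectives F_k(theta) = 0.5 * ||theta - c_k||^2."""
    centers = np.asarray(centers, dtype=float)
    weights = np.asarray(weights, dtype=float)
    theta_star = weights @ centers               # minimizer of F = sum_k p_k F_k
    f_star = 0.5 * np.sum(weights * np.sum((centers - theta_star) ** 2, axis=1))
    fk_star = np.zeros(len(weights))             # each F_k attains its minimum 0 at theta = c_k
    return float(f_star - weights @ fk_star)


p = np.array([0.5, 0.5])
gamma_iid = heterogeneity_gamma([[1.0, 1.0], [1.0, 1.0]], p)       # clients share one optimum
gamma_non_iid = heterogeneity_gamma([[0.0, 0.0], [2.0, 2.0]], p)   # client optima disagree
print(gamma_iid, gamma_non_iid)
```

In this toy case $\Gamma$ equals half the weighted variance of the client optima, so it vanishes exactly when all clients agree and grows as their optima move apart.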

Theorem 4 (Convergence Error Bound).

Let Assumptions 5-8 hold, and let $L,\mu,\sigma_{k},V$ be defined therein. Let $\kappa=\frac{L}{\mu}$, $\gamma=\max\{8\kappa,E\}$, and the learning rate $\eta_{t}=\frac{2}{\mu(\gamma+t)}$. Then Fed $\text{U}^{2}$ with full client participation satisfies

$$\mathbb{E}\left[F\left({\overline{\bm{\theta}}}^{t}\right)\right]-F^{*}\leq\frac{\kappa}{\gamma+t}\left(\frac{2B}{\mu}+\frac{\mu(\gamma+1)}{2}\mathbb{E}\|\bm{\theta}^{1}-\bm{\theta}^{*}\|^{2}\right),$$

where $B=4(E-1)^{2}V^{2}+K+2\Gamma$ .

Proof.

By the $L$-smoothness in Assumption 5, we obtain:

$$\begin{aligned}
\mathbb{E}\left[F(\bm{\theta}^{t})-F(\bm{\theta}^{*})\right]
&\leq\mathbb{E}\left[(\bm{\theta}^{t}-\bm{\theta}^{*})^{\top}\nabla F\left(\bm{\theta}^{*}\right)+\frac{L}{2}\left\|\bm{\theta}^{t}-\bm{\theta}^{*}\right\|^{2}\right]\\
&=\mathbb{E}\left[\frac{L}{2}\left\|\bm{\theta}^{t}-\bm{\theta}^{*}\right\|^{2}\right],
\end{aligned}\tag{18}$$

where the equality uses $\nabla F(\bm{\theta}^{*})=\bm{0}$ at the minimizer.

Since the update in EUA is $\bm{\theta}^{t+1}=\bm{\theta}^{t}-\eta_{t}\bm{d}_{t}$ with $\bm{d}_{t}=\bm{d}^{*}_{t}=\frac{\eta_{t}}{\phi}{(\nabla\log{\bm{u}}^{t})}^{\top}\bm{p}^{*}_{t}=\sum_{k=1}^{K}p_{k}\nabla F_{k}(\bm{\theta}^{t}_{k})$, we can write:

$$\begin{aligned}
\|\bm{\theta}^{t+1}-\bm{\theta}^{*}\|^{2}
&=\|\bm{\theta}^{t}-\eta_{t}\bm{d}_{t}-\bm{\theta}^{*}\|^{2}\\
&=\|\bm{\theta}^{t}-\bm{\theta}^{*}\|^{2}-2\left<\bm{\theta}^{t}-\bm{\theta}^{*},\eta_{t}\bm{d}_{t}\right>+\eta_{t}^{2}\|\bm{d}_{t}\|^{2}.
\end{aligned}\tag{19}$$

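Next, we bound the second term. Since $\bm{d}_{t}=\sum_{k=1}^{K}p_{k}\nabla F_{k}(\bm{\theta}^{t}_{k})$ and $\bm{\theta}^{t}-\bm{\theta}^{*}=(\bm{\theta}^{t}-\bm{\theta}^{t}_{k})+(\bm{\theta}^{t}_{k}-\bm{\theta}^{*})$, the inner product splits as

$$\begin{aligned}
\left<\bm{\theta}^{t}-\bm{\theta}^{*},\eta_{t}\bm{d}_{t}\right>
&=\sum_{k=1}^{K}p_{k}\eta_{t}\left<\bm{\theta}^{t}-\bm{\theta}^{t}_{k},\nabla F_{k}(\bm{\theta}_{k}^{t})\right>\\
&\quad+\sum_{k=1}^{K}p_{k}\eta_{t}\left<\bm{\theta}^{t}_{k}-\bm{\theta}^{*},\nabla F_{k}(\bm{\theta}_{k}^{t})\right>.
\end{aligned}\tag{20}$$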

By the Cauchy-Schwarz inequality and the AM-GM inequality, the first term of Eq. (20) satisfies:

$$-2\left<\bm{\theta}^{t}-\bm{\theta}^{t}_{k},\nabla F_{k}(\bm{\theta}_{k}^{t})\right>\leq\frac{1}{\eta_{t}}\|\bm{\theta}^{t}-\bm{\theta}^{t}_{k}\|^{2}+\eta_{t}\|\nabla F_{k}(\bm{\theta}_{k}^{t})\|^{2}.\tag{21}$$

By the $\mu$-strong convexity of $F_{k}(\cdot)$, we have

$$-\left<\bm{\theta}^{t}_{k}-\bm{\theta}^{*},\nabla F_{k}(\bm{\theta}_{k}^{t})\right>\leq-\left(F_{k}(\bm{\theta}^{t}_{k})-F_{k}(\bm{\theta}^{*})\right)-\frac{\mu}{2}\|\bm{\theta}^{t}_{k}-\bm{\theta}^{*}\|^{2}.\tag{22}$$

In Theorem 3, we get ${(\nabla\log{\bm{u}})}{(\nabla\log{\bm{u}})}^{\top}\bm{p}^{*}=-\frac{\phi\lambda_{E}}{\eta_{t}^{2}}\bm{1}$, which indicates that:

$$\begin{aligned}
\|\bm{d}_{t}\|^{2}
&=\frac{\eta_{t}^{2}}{\phi}\|{(\nabla\log{\bm{u}}^{t})}^{\top}\bm{p}\|^{2}\\
&=\lambda_{E}\|\bm{p}^{\top}\|^{2}\\
&\leq K,
\end{aligned}\tag{23}$$

where the last inequality holds due to $\lambda_{E}<1$ and $\|\bm{p}\|\leq K$. By combining Eq. (21)-(23) and Lemma 2, it follows that

$$\begin{aligned}
\|\bm{\theta}^{t+1}-\bm{\theta}^{*}\|^{2}
&=\|\bm{\theta}^{t}-\eta_{t}\bm{d}_{t}-\bm{\theta}^{*}\|^{2}\\
&\leq\|\bm{\theta}^{t}-\bm{\theta}^{*}\|^{2}+\eta_{t}\sum_{k=1}^{K}p_{k}\left(\frac{1}{\eta_{t}}\|\bm{\theta}^{t}-\bm{\theta}^{t}_{k}\|^{2}+\eta_{t}\|\nabla F_{k}(\bm{\theta}_{k}^{t})\|^{2}\right)\\
&\quad+2\eta_{t}\sum_{k=1}^{K}p_{k}\left(-\left(F_{k}(\bm{\theta}^{t}_{k})-F_{k}(\bm{\theta}^{*})\right)-\frac{\mu}{2}\|\bm{\theta}^{t}_{k}-\bm{\theta}^{*}\|^{2}\right)+\eta_{t}^{2}\|\bm{d}_{t}\|^{2}\\
&=(1-\mu\eta_{t})\|\bm{\theta}^{t}-\bm{\theta}^{*}\|^{2}+\sum_{k=1}^{K}p_{k}\|\bm{\theta}^{t}-\bm{\theta}^{t}_{k}\|^{2}+\eta_{t}^{2}K+2\eta_{t}^{2}\Gamma\\
&\leq(1-\mu\eta_{t})\|\bm{\theta}^{t}-\bm{\theta}^{*}\|^{2}+4\eta_{t}^{2}(E-1)^{2}V^{2}+\eta_{t}^{2}K+2\eta_{t}^{2}\Gamma.
\end{aligned}\tag{24}$$

Lastly, letting $D_{t}=\mathbb{E}\|\bm{\theta}^{t}-\bm{\theta}^{*}\|^{2}$, it follows that

$$D_{t+1}\leq(1-\eta_{t}\mu)D_{t}+\eta_{t}^{2}B,\tag{25}$$

where $B=4(E-1)^{2}V^{2}+K+2\Gamma$.

Consider a diminishing stepsize $\eta_{t}=\frac{\beta}{t+\gamma}$ for some $\beta>\frac{1}{\mu}$ and $\gamma>0$ such that $\eta_{1}\leq\min\{\frac{1}{\mu},\frac{1}{4L}\}=\frac{1}{4L}$ and $\eta_{t}\leq 2\eta_{t+E}$. Let $v=\max\{\frac{\beta^{2}B}{\beta\mu-1},(\gamma+1)D_{1}\}$. We prove $D_{t}\leq\frac{v}{\gamma+t}$ by induction. By the definition of $v$, it holds for $t=1$. Assume $D_{t}\leq\frac{v}{\gamma+t}$ holds for some $t$; then we expand as below:

$$\begin{aligned}
D_{t+1}
&\leq\left(1-\eta_{t}\mu\right)D_{t}+\eta_{t}^{2}B\\
&\leq\left(1-\frac{\beta\mu}{t+\gamma}\right)\frac{v}{t+\gamma}+\frac{\beta^{2}B}{(t+\gamma)^{2}}\\
&=\frac{t+\gamma-1}{(t+\gamma)^{2}}v+\left[\frac{\beta^{2}B}{(t+\gamma)^{2}}-\frac{\beta\mu-1}{(t+\gamma)^{2}}v\right]\\
&\leq\frac{v}{t+\gamma+1}.
\end{aligned}\tag{26}$$

Recalling Eq. (18), we finally obtain:

$$\mathbb{E}\left[F(\bm{\theta}^{t})-F(\bm{\theta}^{*})\right]\leq\frac{L}{2}D_{t}\leq\frac{L}{2}\frac{v}{\gamma+t}.\tag{27}$$

Following the specific case of [23], we can choose $\beta=\frac{2}{\mu},\gamma=\max\left\{8\frac{L}{\mu},E\right\}-1$ and denote $\kappa=\frac{L}{\mu}$ , then $\eta_{t}=\frac{2}{\mu}\frac{1}{\gamma+t}$ . One can verify that the choice of $\eta_{t}$ satisfies $\eta_{t}\leq 2\eta_{t+E}$ for $t\geq 1$ . Then, we have

$$\begin{aligned}
v
&=\max\left\{\frac{\beta^{2}B}{\beta\mu-1},(\gamma+1)D_{1}\right\}\\
&\leq\frac{\beta^{2}B}{\beta\mu-1}+(\gamma+1)D_{1}\\
&\leq\frac{4B}{\mu^{2}}+(\gamma+1)D_{1}
\end{aligned}\tag{28}$$

and

$$\mathbb{E}\left[F\left({\overline{\bm{\theta}}}^{t}\right)\right]-F^{*}\leq\frac{L}{2}\frac{v}{\gamma+t}\leq\frac{\kappa}{\gamma+t}\left(\frac{2B}{\mu}+\frac{\mu(\gamma+1)}{2}D_{1}\right).\tag{29}$$

As we can see, Fed $\text{U}^{2}$ converges with an error bound similar to that of FedAvg-like FL models with non-IID data. The difference is that, benefiting from the optimization of EUA, the bound involves a smaller constant $B$. ∎
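The induction behind Eq. (26) can be sanity-checked by iterating recursion (25) in its worst case, i.e., with equality. All constants below ($\mu$, $L$, $E$, $B$, $D_{1}$) are hypothetical values chosen only so that the stepsize conditions hold; they are not taken from the experiments.

```python
mu, L, E = 1.0, 4.0, 5              # hypothetical strong convexity, smoothness, local epochs
B, D1 = 10.0, 3.0                   # hypothetical constant B and initial distance D_1
beta = 2.0 / mu                     # choice of beta following [23]
gamma = max(8.0 * L / mu, E) - 1.0  # choice of gamma following [23]
v = max(beta**2 * B / (beta * mu - 1.0), (gamma + 1.0) * D1)

D = D1
holds = True
for t in range(1, 1001):
    # Invariant claimed by the induction: D_t <= v / (gamma + t)
    holds = holds and (D <= v / (gamma + t) + 1e-12)
    eta = beta / (t + gamma)
    D = (1.0 - eta * mu) * D + eta**2 * B   # recursion (25) taken with equality

print(holds, D, v / (gamma + 1001))
```

With these values, $\eta_{1}=\frac{2}{32}=\frac{1}{4L}$ and $v=96$, and the invariant holds with equality at $t=1$ before the iterates fall below the $\frac{v}{\gamma+t}$ envelope.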

Appendix B Experimental Supplementary

B.1 Hyper-parameter Sensitivity Analysis

In the following, we study the sensitivity of the remaining highly relevant hyper-parameters, i.e., the number of clients and the local epochs. Specifically, we compare Fed $\text{U}^{2}$ -SimCLR and its runner-up method, i.e., FedX-SimCLR, on CIFAR10 ($\alpha=0.1$) by varying the local epochs $E\in\{5,10,20,50\}$ in Fig. 7 and the number of clients $K\in\{5,10,20,50,100\}$ in Fig. 8. We train all models until convergence to obtain fairly comparable results. We observe that: (1) As the number of local epochs increases, each client of FedX-SimCLR obtains a better-performing model, while Fed $\text{U}^{2}$ -SimCLR is insensitive to it. This suggests that Fed $\text{U}^{2}$ balances the client model deviation change rates in EUA, bringing the benefit of quick convergence. (2) The performance of all methods decreases as the number of clients increases, but Fed $\text{U}^{2}$ -SimCLR consistently outperforms FedX-SimCLR. This validates that enhancing uniform and unified representations makes FUSL methods more generalizable to different numbers of participants.

B.2 Enlarged Figures in Visualization

In our main paper, we depict the top-k singular values of the representation covariance matrices in Fig. 3, the corresponding 3-D representations in Fig. 4, and the distribution of data representations of the global model and randomly sampled local models in Fig. 5. The purpose of these figures is to illustrate the representation enhancement of Fed $\text{U}^{2}$. In Fig. 9-11, we enlarge these figures to allow detailed comparisons. In Fig. 11, Fed $\text{U}^{2}$ maintains unified representations between the global and local models, as well as a clearer decision boundary for each class.

