Heterogeneity for the Win: One-Shot Federated Clustering

Don Kurian Dennis    Tian Li    Virginia Smith
Abstract

In this work, we explore the unique challenges—and opportunities—of unsupervised federated learning (FL). We develop and analyze a one-shot federated clustering scheme, k-FED, based on the widely-used Lloyd’s method for k-means clustering. In contrast to many supervised problems, we show that the issue of statistical heterogeneity in federated networks can in fact benefit our analysis. We analyse k-FED under a center separation assumption and compare it to the best known requirements of its centralized counterpart. Our analysis shows that in heterogeneous regimes where the number of clusters per device (k) is smaller than the total number of clusters over the network k, (kk), we can use heterogeneity to our advantage—significantly weakening the cluster separation requirements for k-FED. From a practical viewpoint, k-FED also has many desirable properties: it requires only one round of communication, can run asynchronously, and can handle partial participation or node/network failures. We motivate our analysis with experiments on common FL benchmarks, and highlight the practical utility of one-shot clustering through use-cases in personalized FL and device sampling.

Machine Learning, ICML

1 Introduction

Federated learning (FL) aims to perform machine learning over large, heterogeneous networks of devices such as mobile phones or wearables (McMahan et al., 2017). While significant attention has been given to the problem of supervised learning in such settings, the problem of unsupervised federated learning has been relatively unexplored (Kairouz et al., 2019). In this work, we show that unsupervised learning presents unique opportunities for FL, specifically for the task of clustering data that resides in a federated network.

Clustering is a crucial first step in many learning tasks. In the case of federated learning, clustering has found applications in client-selection (Cho et al., 2020), personalization (Ghosh et al., 2020) and exploratory data analysis. While many works have explored techniques for distributed clustering (Section 2), most do not take into account the unique challenges of federated learning, such as statistical heterogeneity, systems heterogeneity, and stringent communication constraints (Li et al., 2020a)111Privacy, while an important concern for many federated applications, is not the main focus of our work. However, a possible benefit of the one-shot nature of k-FED is that it requires significantly fewer messages to be shared over the network relative to standard iterative techniques such as distributed k-means.. These challenges can complicate analyses, reduce efficiency, and lead to practical issues with stragglers and device failures. In this work, we study communication-efficient distributed clustering in settings where the data is non-identically distributed across the network (i.e., heterogeneous), and devices can join and leave the network abruptly. For such settings, we develop and analyse a one-shot clustering scheme, k-FED, based on the classical Lloyd’s heuristic (Lloyd, 1982) for clustering.

The method we propose, k-FED, requires only one round of communication with a central server. Each device, indexed by z, solves a local k(z)-means problem and then communicates its local cluster means via a message of size O(dk(z)). As we show in Section 3, this allows for device failures, only requiring that there are enough devices available in the network such that k target clusters exist in the data. Moreover, it is possible to cluster points in previously unavailable devices via a simple recomputation at the central server.

Beyond the practical benefits of k-FED, our work is unique in rigorously demonstrating a problem setting where possible benefits of statistical heterogeneity exist for federated learning. In particular, in supervised learning, many works have highlighted detrimental effects of statistical heterogeneity, observing that heterogeneity can lead to poor convergence for federated optimization methods (McMahan et al., 2017; Li et al., 2020b), result in unfair models (Mohri et al., 2019), or necessitate novel forms of personalization (Smith et al., 2017; Mansour et al., 2020). In contrast to these works, we show that for the specific notion of heterogeneity considered herein (provided in Definition 3.2 and motivated by the application of clustering), heterogeneity can in fact have measurable benefits for our approach.

More specifically, similar to many works in clustering (Kumar & Kannan (2010); Awasthi & Sheffet (2012) and references therein), we analyse k-FED under a center-separation assumption; that is, we assume that the mean of the clusters are well separated. We also consider a specific notion of heterogeneity: given a target clustering with k clusters that we wish to recover from the data, we assume that each device contains data from only kk of these target clusters. For instance, for clustering data generated by a mixture of k well separated Gaussians, we assume that each device contains data from kk component Gaussians. In this regime, we show that our separation requirement is similar to that of the centralized counterpart. Further, while the centralized setting requires all pairs of cluster centers to satisfy a Ω(k) center separation requirement, the federated approach can handle a large fraction of cluster pairs only satisfying a weaker Ω(k14) separation requirement. This is the first result we are aware of that analyzes the benefits of heterogeneity in the context of federated clustering.

Contributions. We propose and analyze a one-shot communication scheme for federated clustering. Our proposed method, k-FED, addresses common practical concerns in federated settings, such as high communication costs, stragglers, and device failures. Theoretically, we show that k-FED performs similarly to centralized clustering in regimes where each device only has data from at most k clusters with a similar Ω(k) center separation requirement. Moreover, in contrast to the centralized setting, we show that a large number of cluster pairs need only a Ω(k14) weaker separation assumption in heterogeneous networks, thus allowing a broader class of problems to be solved in this setting compared with centralized clustering. We demonstrate our method through experiments on common FL benchmarks, and explore the applicability of k-FED to problems in personalized federated learning and device sampling. Our work highlights that heterogeneity can have distinct benefits for a subset of problems in federated learning.

2 Background and Related Work

Centralized Clustering. Clustering is one of the most widely-used unsupervised learning tasks, and has been extensively studied in both centralized and distributed settings. Although a variety of clustering methods exist, Lloyd’s heuristic (Lloyd, 1982) remains popular due in part to its simplicity. In Lloyd’s method, we start with an initial set of k centers. We then assign each point to its nearest center and reassign the centers to be the mean of all the points assigned to it, continuing this process till termination. While it is easy to show that this method terminates, it is also known that this process can take superpolynomial time to converge (Arthur & Vassilvitskii, 2006). However, under suitable assumptions and careful choice of the initial centers, it can be shown to converge in polynomial time (Arthur & Vassilvitskii, 2006; Ostrovsky et al., 2013; Kumar & Kannan, 2010; Awasthi & Sheffet, 2012).

The method we propose, k-FED (Section 3.2), is a simple, communication-efficient distributed variant of these classical techniques. k-FED runs a variant of Lloyd’s method for k-means clustering locally on each device, and then performs one round of communication to aggregate and assign clusters. Our work builds on the analysis of a variant of Lloyd’s algorithm developed by Kumar & Kannan (2010) and later improved in Awasthi & Sheffet (2012) for the problem of clustering data from mixture distributions and other related results (e.g., McSherry, 2001; Ostrovsky et al., 2013). These works develop a deterministic framework with no generative assumptions on the data. Our analysis follows this framework and does not make any generative assumptions on the data.

Parallel and Distributed Clustering. Many works have explored parallel or distributed implementations of centralized clustering techniques (Dhillon & Modha, 2002; Tasoulis & Vrahatis, 2004; Datta et al., 2005; Bahmani et al., 2012; Xu et al., 1999). Unlike the one-shot communication scheme explored herein, these methods are typically direct parallel implementations of methods such as Lloyd’s heuristic or DBSCAN (Ester et al., 1996), and require numerous rounds of communication. Another line of work has considered communication-efficient distributed clustering variants that require only one or two rounds of communication (e.g., Kargupta et al., 2001; Januzaj et al., 2004; Feldman et al., 2012; Balcan et al., 2013; Bateni et al., 2014; Bachem et al., 2018). These works are mostly empirical, in that there are no provable guarantees on the approximation quality of the distributed schemes; the works of Balcan et al. (2013); Bateni et al. (2014); Bachem et al. (2018) differ by providing communication-efficient distributed coreset methods for clustering, along with provable approximation guarantees. However, these works do not explore the federated setting or potential benefits of heterogeneity.

Federated Clustering. Several works have explored clustering in the context of supervised FL as a way to better model non-IID data (Smith et al., 2017; Ghosh et al., 2019, 2020; Sattler et al., 2020). These works differ from our own by clustering specifically in terms of devices, focusing on the downstream supervised learning task, and using either iterative (Smith et al., 2017; Ghosh et al., 2020; Sattler et al., 2020) or centralized (Ghosh et al., 2019) clustering schemes. Though not the main of focus of our work, in Section 4 we demonstrate the applicability of one-shot clustering by showing how k-FED can be used as a simple pre-processing step to deliver personalized federated learning—achieving similar or superior performance relative to the recent iterative approach for clustered FL proposed in Ghosh et al. (2020).

More recently, a distributed matrix factorization based clustering approach was explored in Wang & Chang (2020) for the purposes of unsupervised learning. However, while the authors consider the impact of statistical heterogeneity on their convergence guarantees, the focus is not on one-shot clustering or on showing distinct benefits of heterogeneity in their analyses.

3 k-FED: Preliminaries and Main Results

In this section, we begin by discussing some preliminaries and existing results in clustering related to Lloyd-type methods. In Section 3.1, we present the deterministic framework of Awasthi & Sheffet (2012) for centralized clustering, which we build upon. We present our method k-FED and state our theoretical results in Section 3.2. We provide detailed proofs in Appendix A.

3.1 Centralized k-means

In the standard (centralized) k-means problem, we are given a matrix An×d where each row Ai is a data point in d. We are also given a fixed positive integer kn, and our objective is to partition the data points into k disjoint partitions, 𝒯=(T1,,Tk), so as to minimize the k-means cost:

ϕ(𝒯)=j=1kiTjAiμ(Tj)22. (1)

Here we use μ(S) as an operator to indicate the mean of the points indexed by S, i.e., μ(S)=1|S|iSAi. To ease notation, we simplify this as μr:=μ(Tr), when Tr is unambiguous.

Algorithm 1 Local k(z)-means (Awasthi & Sheffet, 2012)
1: Input: On device indexed by z, the matrix of data points A(z), integer k(z);
2: Project A(z) onto the subspace spanned by the top k(z) singular vectors to get A^(z). Run any standard 10-approximation algorithm on the projected data and estimate k(z) centers (ν1,ν2,,νk(z)).
3: Set
Sr{i:A^i(z)νr213A^i(z)νs2, for every s}
and θr(z)μ(Sr)
4: Run Lloyd steps until convergence
Ur(z){i:Ai(z)θr(z)2Ai(z)θs(z)2,s}
and θr(z)μ(Ur(z)).
5: Return: Cluster assignments (U1(z),U2(z),,Uk(z)(z)) and their means Θ(z)=(θ1(z),,θk(z)).

While the k-means problem as stated here does not specify any generative model for the data points Ai, a popular setting to consider is when the data is sampled from a mixture of k-distributions in d-dimensions (kd). For instance, we could imagine the data points as being sampled from a mixture of k Gaussian distributions. This generative model also introduces a notion of a target clustering, 𝒯=(T1,,Tk) where the set Ti contains all points generated by the i-th component distribution. Many distribution dependent results are known for the problem of clustering distributions (see Kumar & Kannan (2010)). In general, they can be stated as: If the means of the distributions are poly(k) standard deviations apart, then we can cluster the data in polynomial time. Kumar & Kannan (2010) introduce a deterministic (distribution independent) framework that encompasses many of these known results. This work was later simplified and improved by Awasthi & Sheffet (2012). We state the main results of this framework here, after stating the notation we use. We emphasis that in our analysis we make no assumptions on how the data is generated; all relevant quantities only depend on the provided data.

Notation. We now introduce several definitions and notations that will be used throughout the paper. Let A denote the spectral norm of a matrix A, defined as A=maxu:u2=1Au2, and let Ai2 denote the 2 norm of a vector Ai. For consistency, we index individual rows of A with i and j. Moreover, when a target clustering T1,,Tk is fixed, we index clusters with r,s, e.g., Ar is the matrix of points indexed by Tr. For notational convenience, we let c(Ai) to denote the cluster index for data point Ai such that, AiTc(Ai). For some set of points M, and another point say x, let dM(x) denote the distance of x to the set M, defined as dM(x)=minyMxy2. Finally, let C be a n×d matrix with each row Ci=μc(Ai). For cluster Tr with nr=|Tr|, we define

Δ~r:=kACnr. (2)

Here the quantity AC/nr can be thought of as a deterministic analogue of the standard deviation; it measures the maximum average variance along any direction. Thus instead of reasoning about the separation between two clusters Tr and Ts in terms of the standard deviation, we will use (Δ~r+Δ~s). In particular, we say that the two clusters Tr and Ts are well separated if for large enough constant c, their means satisfy:

μrμs2c(Δ~r+Δ~s). (3)

Again, we can interpret this as saying that two clusters are well separated if their means are c-standard-deviations apart.222 Any c100 is sufficient for our arguments (see Lemma 5). Using the center separation assumption in (3), Awasthi & Sheffet (2012) show that for a target clustering T1,T2,,Tk satisfying the separation assumption, the variant of Lloyd’s algorithm presented in Algorithm-1 when applied to the centralized clustering problem correctly clusters all but a small fraction of the data points. We state their result formally in Lemma 1, but before that we define a proximity condition, that will be used to precisely characterize the misclassified points.

Definition 3.1.

A point Ai for some iTs is said to satisfy the proximity condition, if for every rs, the projection of Ai onto the line connecting μr and μs, denoted by A¯i satisfies

A¯iμr2A¯iμs2(1nr+1ns)AC.

Thus a point Ai for iTs satisfies the proximity condition if its projection on the line connecting μr and μs is closer to μs by AC(1nr+1ns). We refer to points that do not satisfy the proximity condition as ‘bad points’. We now state the main result from Awasthi & Sheffet (2012) in the following lemma.

Lemma 1 (Awasthi-Sheffet, 2011).

Let 𝒯=(T1,,Tk) be the target clustering. Assume that each pair of clusters Tr and Ts are well separated. Then, after step 2 of Algorithm-1, for every r, it holds that μ(Sr)μr225c1nrAC. Moreover, if the number of bad points is ϵn, then (a) the clustering {U1,U2,,Uk} misclassifies no more than (ϵ+O(1)c4)n points and (b) ϵ<O((c1k)2). Finally, if ϵ=0 then all points are correctly assigned.

When we say misclassify, we mean with respect to 𝒯 and up to a permutation of labels. Lemma 1 tells us that the cluster means, μ(Sr), are not very far away from the target cluster means, μr. Note that there are no distribution dependent terms in this statement; all relevant quantities are defined in terms of the data matrix A and 𝒯.

3.2 k-FED: Method and Main Result

We now turn our attention to clustering data in a federated network. In our setting, we assume that all the devices in the network can communicate with a central server. Our clustering method k-FED, described in Algorithm 2, can be thought of as working in two stages. In the first stage, each device solves a local clustering subproblem and computes the cluster means for this subproblem. In the second stage, the central server accumulates and aggregates the results to compute the final clustering.

Notation. Let A be an n×d data matrix of all the data points in our network. We index individual devices by z[Z] and thus, we denote the data-matrix for any particular device by A(z)n(z)×d, where n(z) is the number of data points on the device. Let nmin=minzn(z). Note that A(z) is some subset of rows of A. Let 𝒯=(T1,,Tk) be a clustering of all the data, referred to as a target clustering. For a fixed 𝒯, let 𝒯(z)=(T1(z),T2(z),,Tk(z)) be subsets of our target clustering that reside on a device z. Note that some Tr(z) could be empty. Let k(z) be the number of non-empty subsets on device z and let k=maxzk(z). Our notion of heterogeneity is formally defined based on the value of k, as described below.

Definition 3.2 (Heterogeneity of Clustering).

In the context of clustering, we say that a federated network with sufficient data is heterogeneous if kk. The lower the ratio between k and k, the more heterogeneity exists in the network.

Intuitively, this definition of heterogeneity states that—in contrast to the data from the k total clusters being partitioned in an IID fashion across the network—the data are partitioned in an non-IID fashion, such that only data from a small number of clusters (at most k) exists on each device. Such non-IID partitioning is reasonable to expect in heterogeneous federated networks with a large number of clusters, since the distribution of data on each device may differ, and it is not possible to actively re-distribute data across the network. For instance, consider identifying interests of mobile phone users based on the interaction data on an application. Here the interaction data is generated by the user on their particular device, and will reflect the tastes of individual. While the total number of ‘tastes’ (clusters) over the entire network could be quite large, a typical user will be interested in only a small number of them. With this definition in mind, we next describe our one-shot clustering method, k-FED, and analyze it in heterogeneous regimes.

Method Description.

Similar to the centralized case (Section 3.1), let C(z) be a n(z)×d matrix of the local cluster means, i.e. of 𝒯(z). Consider a non-empty susbset Tr(z) of cluster Tr on some device and let nr(z)=|Tr(z)|. We assume that there is a constant m0>1, such that nr(z)1m0nr for all r. We will use this quantity to ensure that individual devices have ‘enough’ points. Let,

Δr=kACnr,andλ=k(ACnmin). (4)

In the first step of k-FED (Algorithm-2), each (available) device z[Z] runs Algorithm-1 locally and solves a local clustering problem with their local dataset A(z) and parameter k(z). We assume that k(z) is known. This stage outputs device cluster centers Θ(z)=(θ1(z),,θk(z)(z)) and cluster assignments, U1(z),,Uk(z)(z) for each device z. At this stage, note that even though each device has classified its own points into clusters, we do not yet have a clustering for points across devices. The central server attempts to create this clustering by aggregating the device cluster centers and separating them into k sets, τ1,,τk. These sets induce a clustering of the data on the network as defined here:

Definition 3.3 (k-FED induced clustering).

Let τ1,τ2,,τk be the clustering of device centers returned by Algorithm 2. Define,

Tr={i:Ai(z)Us(z) and θs(z)τr,z[Z],s[k(z)]}.

Then, 𝒯=(T1,,Tk) form a disjoint partition of the entire data, called the k-FED induced clustering.

Algorithm 2 k-FED
1: On each device z[Z], run Algorithm-1 with local data A(z) and k(z) and obtain device cluster centers Θ(z)=(θ1(z),,θk(z)(z)) at the central node.
2: Pick any z[Z] and let MΘ(z).
3: repeat
4: Let θ¯argmaxz[Z],i[k]dM(θiz). That is, the farthest θi(z) from the set M.
5: MM{θ¯}.
6: until there are k points in M, i.e. |M|=k
7: Run one round of Lloyd’s heuristic to cluster points θi(z), z[Z],i[k] into k sets/clusters, (τ1,τ2,,τk). Use points in M as initial centers.
8: Return: the clustering (τ1,τ2,,τk) of the device cluster centers and the corresponding k-FED induced clustering (Definition 3.3).

For our analysis comparing the quality of the k-FED induced clustering, 𝒯, to our target clustering 𝒯, we require two different separation assumptions. We refer to them as active and inactive separation and introduce them through the following two definitions.

Definition 3.4 (Active/Inactive cluster pairs).

A pair of clusters (Tr,Ts) are said to be an active pair if there exists at least one device that contains data points from both Tr and Ts. If no device has data points from both clusters Tr and Ts, we refer to the cluster pair (Tr,Ts) as an inactive pair.

Definition 3.5.

We say that two clusters Tr and Ts satisfies the active separation requirement if, μrμs22cm0(Δr+Δs), for some large enough constant c. Similarly, we say that they satisfy the inactive separation requirement if μrμs210m0(λr+λs).

Intuitively, these notions capture the difficulty in clustering two different types of clusters pairs—active and inactive cluster pairs. If no device has data from both Tr and Ts (i.e. an inactive pair), then the clustering sub-problems individual devices have to solve is easier since they never involve data from both of these clusters simultaneously. Thus the separation requirement for inactive cluster pairs is weaker than that for an active cluster pair. We now state our main theorem, which characterizes the performance of k-FED. We provide a detailed proof in Appendix A.

Theorem 3.1 (Main theorem).

Let 𝒯=(T1,T2,,Tk) be a fixed target clustering of the data on a federated network. Let m0>1 be such that, |Tr(z)|1m0|Tr| for all r,s and for all z[Z]. Assume that each active cluster pairs Tr and Ts satisfy the active separation requirement, i.e.,

μ(Tr)μ(Ts)2cm0(Δr+Δs).

Further, assume that for each inactive cluster pairs Tr,Ts,

μ(Tr)μ(Ts)210m0λ.

Then, at termination of k-FED all but O(1c2)n points are correctly classified. Moreover, if for each device z, the data points A(z) satisfy the proximity condition (Definition 3.1) for its local problem, then all points are classified correctly.

As before, by classified we mean that the clustering 𝒯 produced by k-FED and 𝒯 agree on all but O(1c2)n points, up to permutation of labels of 𝒯. Note that when kk, our active separation requirement is stricter than that required in centralized clustering (Ω(k) vs Ω(k)). Further, as one would expect, as the number of points per cluster on each device decreases, the local clustering becomes harder. This is highlighted by our adverse dependency on m0.

However, in contrast to the general distributed learning framework where each device typically has a random subset of the data, the data residing on the devices in federated networks are typically generated locally and thus the partition of data among the devices is non-identically distributed. Specifically, in practice, the number of subsets of target clusters that reside on a device may be much smaller than the total number of clusters. Thus, as outlined in Definition 3.2, we look at the cases where kk. Observe that in such settings, our active separation requirement reduces to that of the centralized k-means problem (with an additional m0 penalty) and our inactive separation requirement weakens to k1/4. We state this formally in Corollary 1.1.

Corollary 1.1.

Assuming kk, an active cluster pair (Tr,Ts) satisfies the active separation requirement if

μrμs2 cm0k(ACnr+ACns)
=cm0(Δr+Δs).

Similarly, an inactive cluster pair (Tr,Ts) satisfies the inactive separation requirement if

μrμs210m0k14(ACnr+ACns).

Thus in this setting of k<k, k-FED recovers the target partitions in only one round of communication. Moreover, inactive cluster pairs need only satisfy our Ω(k14) separation requirement as opposed to the Ω(k) separation that all cluster pairs need to satisfy in the centralized setting for Lemma 1 to hold. This highlights that there exists a benefit of heterogeneity in the context of running k-FED over federated networks.

Practical benefits of k-FED.

Finally, we highlight several practical benefits of the k-FED method:

  • One-shot: k-FED only requires one round of communication for each device: one outgoing message to send the local clustering results and one incoming message to receive cluster identity information.

  • No network-wide synchronization: Classical parallel implementations of Lloyd’s heuristic and variants (e.g., Dhillon & Modha, 2002), require a network wide synchronization/initialization step. Unlike these methods, each device in k-FED works independently does not require an initialization/synchronization step.

  • New devices/Device Failures: Assuming we have already performed clustering on the current network, for any new device entering the network, either from a previous failure or as a new participant, computing the clustering information can be done without involving any other device in the network. As we show in Theorem 3.2 (below), simply assigning any new local cluster center θi(z) from the new device z, to the nearest device cluster mean in M sufficient. The central server only has to maintain k cluster means μ(τ1),,μ(τk) to perform this update.

Theorem 3.2.

Steps 2-8 of k-FED take O(Zkk2) pairwise distance computations to terminate. Further, after the set M in Step 6 has been computed, new local cluster centers Θ(z) from a yet unseen device z can be correctly assigned in O(kk) distance computations.

As we show in Section 4, these properties of k-FED make it an ideal candidate for being used as an inexpensive heuristic for clustering in federated networks, either for data exploration or as part of a preprocessing step for another algorithm, even in settings where the separation requirements are not formally satisfied.

4 Applications and Experiments

We now present experimental evaluation of k-FED. We first specialize the theory to the special case where data is drawn from a mixture of k Gaussians in Section 4.1 to validate our theory on synthetic data. In Section 4.2, we evaluate k-FED on real datasets—presenting experimental evidence that highlights the benefit of heterogeneity and the communication efficiency of k-FED. We further present two applications of k-FED, in client selection as well as personalization. The dataset details for each experiment can be found in the corresponding section. Implementation of k-FED and experimental setup details can be found at: http://github.com/metastableB/kfed/.

4.1 Separating Mixture of Gaussians

We first specialize our theorem to the case of separating data generated from a mixture of k Gaussians F1,F2,,Fk. Let μr=μ(Fr) be the mean of the mixture component Fr and let w1,w2,,wk be the mixing weights. Finally, let wmin=minrwr be the minimum mixing weight. Let σmax be the maximum variance along any direction among all the component distributions. Assume this data resides over our devices such that no single device has data from more than k<k components. We state the following theorem (proved in Appendix A) that specifies the conditions required for this setup to satisfy our separation assumptions:

Theorem 4.1.

Let the total number of data pints, n=poly(dwmin). Then any active cluster pairs r,s satisfy the active separation requirement with high probability if;

μrμs2ckm0σmaxwminpolylog(dwmin).

Further, an inactive cluster pairs r,s satisfy the inactive separation requirement with high probability if

μrμs2cm0k14σmaxwminpolylog(dwmin).

Finally, with this separation in place, all points satisfy the proximity condition with high probability.

Concretely, in this setup k-FED recovers the target clustering exactly with high probability. To empirically evaluate our theory, we instantiate an simplified instance of the above setup as follows:

Table 1: Clustering accuracy for clustering a mixture of Gaussians. Here for all instances we choose k=k. We can see that the one-shot clustering produced by k-FED agrees with the target clustering with high accuracy, particularly when k is relatively small compared to d.
Parameters Accuracy
(d=100,k=16,m0=5,c=100) 100.00±0.00
(d=100,k=64,m0=5,c=100) 98.82±0.70
(d=300,k=64,m0=5,c=100) 99.27±0.73
(d=300,k=100,m0=5,c=100) 98.40±0.80
(d=300,k=16,m0=5,c=100) 100.00±0.00

Setup. Again consider the Gaussian components F1,,Fk, and define the set of integers Gi={p(i1)×kpi×k}. These sets Gi thus can be used to index the Gaussian components (F(i1)k,,Fik). For each Gi, construct a set of data points Di by sampling poly(dk) samples from each component Fp for pGi. Thus the set Di contains kpoly(dk) samples (wr=1k,r). Pick m0 and for each set of data points Di, distribute the data among m0 devices such that each device receives exactly 1m0poly(dk) samples. We now run k-FED on this setup and measure the quality of the clustering averaged over 10 runs, (shown in Table 1). As one would expect, the clustering produced by k-FED agrees strongly with the target clustering. Note that by construction all devices with data from the same set Gi contain data from the same set of Gaussian components. Further, devices with data from different sets Gi have no common Gaussian component. Thus all cluster pairs within the same set Gi are active cluster pairs and there are k(k2) such pairs. Moreover, any pair (r,s) such that rGi, sGj ij form an inactive cluster pair and there are (k2)k(k2)=O(k2) such pairs. These need only satisfy the weaker inactive separation requirement.

Refer to caption
Refer to caption
Figure 1: Impact of the separation constant c on the clustering accuracy when clustering a mixture of Gaussians. Even for relatively small values of c, for the case of data generated from a mixture of Gaussians, k-FED can recover highly accurate clustering with decreasing variance across runs.

Note that while we prescribe c100 for our arguments to hold, Figure 1 demonstrates that clustering can be recovered even in settings where c is much smaller.

4.2 Empirical Evaluation on Real Data

In this section, we empirically explore k-FED and the related analyses from Section 3. First, we validate our theoretical results, showing that clustering over structured (heterogeneous) partitions can improve clustering performance relative to clustering over random, IID partitioned data. Second, we explore the effect of one-shot clustering relative to more communication-intensive baselines. Finally, we investigate practical applications of one-shot clustering in terms of client sampling and personalized federated learning.

4.2.1 Properties of k-FED

Benefits of Heterogeneity (Def. 3.2). We compare the performance of k-FED on two different partitions of data among devices: (i) one with IID random partitions, and (ii) another with structured partitions. To generate the underlying structured partition for this experiment we use the following heuristic. First, we cluster all the data into k clusters for a range of values of k. For each k, we take the clustering we have as the target clustering 𝒯, and construct the data matrix A and the matrix of centers C. Finally, for each pair of cluster means μr,μs, we compute the quantity μrμs2m0(Δr+Δs), the ratio of the actual separation of the cluster mean to the required active separation. We pick a value of k at which a large number of clusters are reasonably well separated (see Appendix B, Figure 5). We call this our oracle clustering. Now to generate the IID partition for (i), we randomly distribute this data among Z devices. To generate the structured partition for (ii), we divide the data among Z devices such that each device receives only data from a random subset of no more than k clusters. For each value of k, we cluster the data for both cases over the devices using k-FED and compute the k-means cost. Let ϕ denotes the k-means cost of the original oracle clustering. Let ϕ(k) denote the k-means cost when k clusters are assigned to each device. Figure 2 presents the relative cost ratio between the cost change in structured partitions (ϕ(k)ϕ) and random partitions (ϕ(k)ϕ).

We perform this experiment on the FEMNIST and Shakespeare datasets (Caldas et al., 2018) (see Appendix B for details). It can be seen from the results plotted in Figure 2 that clustering on structured splits achieves a cost closer to that of the oracle partition compared to the cost achieved on the IID random partition. We note that the separation achieved in real datasets is much smaller than required even with this careful construction (Appendix B). Even still, our experiments demonstrate that heterogeneity can benefit federated clustering on common benchmarks.

Refer to caption
Refer to caption
Figure 2: The k-means cost under structured partitions (ϕ(k)) is closer to the cost of oracle clustering (ϕ) than that under random partitions (ϕ(k)). As heterogeneity increases (k decreases), the benefits of structured partitions are becoming more significant, with ϕ(k)ϕϕ(k)ϕ.

Communication-Efficiency. One advantage of the proposed method is that it requires only a single round of communication. Given this, it is natural to wonder how the performance of k-FED would compare with other, more communication-intensive clustering baselines. In particular, a common way to solve k-means in distributed settings is to simply parallelize the cluster assignment and cluster mean calculations at each step. Here, we show that for different partitions of the dataset with multiple values of k, our one-shot method k-FED is able to produce similar clustering outputs (in terms of the k-means cost; lower is better) as naive distributed k-means, which requires multiple communication rounds. Here we use the same oracle clustering as the previous experiment to construct our device data.

Refer to caption
Refer to caption
Figure 3: k-FED (using just one communication round) is able can provide similar clustering quality as naive distributed k-means.

4.2.2 Applications of k-Fed

Personalized FL. Compared with fitting a single global model to data across all device, jointly learning personalized (separate but related) models can boost the effective sample size while adapting to the heterogeneity in federated networks (e.g., Smith et al., 2017; Mansour et al., 2020).

Ghosh et al. (2020) recently proposed an algorithm to learn models over federated networks where devices are partitioned into clusters when the clustering information is unavailable. Consider a supervised learning problem that each cluster of devices want to solve and assume the number of clusters k is known. Their method, the Iterative Federated Clustering Algorithm (IFCA), in its first step initializes k models (m1,,mk), one for each cluster. At the start of each round, all k models are sent to the devices. Each device picks the model that minimizes a loss function on its locally available data. The device can be configured to now either compute and transmit the gradient of the loss function of this model or it can perform a few model updates locally and send the updated model to the central server. As the last step of the round, for each model mi i[k], all the devices that picked this model are identified. All these devices are assigned cluster id i. Model mi then is updated by either model averaging or gradient averaging using the information sent by devices in cluster i.

Table 2: Test accuracy of rotated MNIST on three methods. Training personalized models based on the clustering information output by k-FED achieves the same performance of IFCA, without the high computation and communication overhead of IFCA when k=1. For k=2, the performance of k-FED degrades much less when compared to that of IFCA.
Global IFCA k-FED
100 devices (k=1) 95.0 98.0 98.0
200 devices (k=1) 94.5 97.2 97.8
100 devices (k=2) 95.3 95.6 97.1
200 devices (k=2) 94.5 95.1 96.4

We instantiate IFCA on the problem of learning personalized models for clusters. As in (Ghosh et al., 2020), we use the MNIST dataset for this experiment. We construct k=4 clusters by 0,90,180 and 270 degree rotations and distribute them among devices. Note that in the setup for IFCA, each device only contains data from a single cluster (since we are clustering devices and not individual data points). Thus we set k=1 and compare IFCA with a simple k-FED based method: We first perform one-shot clustering to obtain an initial clustering and then we use FedAvg (McMahan et al., 2017) to learn one model per cluster. As a baseline, we also learn a single global model and include it for comparison. As can be seen from the test accuracies in Table 2 (k=1), k-FED is competitive with IFCA. Moreover, k-FED has the additional advantage that once the cluster identities have been assigned, we only need to transmit one model instead of the k models that are transmitted with IFCA.

Since k-FED clusters data, the k-FED + FedAvg approach can also handle cases where there are data from multiple clusters on the same device. Table 2 (k=k=2) shows the test accuracy on such a partition. Here we observe the performance of IFCA degrade when compared to k-FED.

Client Selection. Finally, we demonstrate that the clustering information produced by k-FED is a useful prior for client selection applications (Cho et al., 2020). In practice, cross-device federated optimization algorithms need to tolerate partial device participation (Kairouz et al., 2019). Intuitively, incorporating information from ‘representative’ devices at each communication round may speed up the convergence of learning tasks over federated networks as opposed to randomly sampling devices. When randomly sampling, similar and potentially redundant clients can be selected. A recent device selection method proposes to additionally select the devices with large training losses among those randomly-selected subset of devices (Cho et al., 2020) to help with convergence speed. We combine k-FED with this approach by further filtering out the devices coming from the same clusters. Note that k-FED does not add significant additional overhead to the baseline algorithm as it only requires running one-shot clustering before training. The results are shown in Figure 4. We see that leveraging the underlying structure learnt by k-FED can boost convergence on these realistic federated benchmarks.

Refer to caption
Refer to caption
Figure 4: Additional clustering information provided by k-FED can help achieve faster convergence than recent client selection techniques pow-d (Cho et al., 2020).

Similar to Cho et al. (2020), we also observe that for the experiments in Figure 4, the variance of test performance across all devices has been reduced using client selection strategies favoring more informative (potentially more underrepresented) clients compared with that of random selection. For instance, on FEMNIST, the variance of final test accuracies is reduced by 35% when using k-fed combined with pow-d instead of random selection. This may be useful in scenarios where we wish to impose notions fairness for federated learning (Mohri et al., 2019; Li et al., 2020c).

5 Conclusion

In this work, we provide an example of how heterogeneity in federated networks can be beneficial, by rigorously analyzing the effects of heterogeneity on a simple, one-shot variant of Lloyd’s algorithm for distributed clustering. Our proposed method, k-FED, addresses common practical concerns in federated settings, such as high communication costs, stragglers, and device failures. We believe that other, specific notions of heterogeneity—together with careful analyses—may provide benefits for a plethora of other problems in federated learning, which is an interesting direction of future work.

6 Acknowledgements

This work was supported in part by the National Science Foundation Grant IIS1838017, a Google Faculty Award, a Facebook Faculty Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency.

References

  • Arthur & Vassilvitskii (2006) Arthur, D. and Vassilvitskii, S. How slow is the k-means method? In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, 2006.
  • Awasthi & Sheffet (2012) Awasthi, P. and Sheffet, O. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. 2012.
  • Bachem et al. (2018) Bachem, O., Lucic, M., and Krause, A. Scalable k-means clustering via lightweight coresets. In International Conference on Knowledge Discovery & Data Mining, 2018.
  • Bahmani et al. (2012) Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. Scalable k-means+. Proceedings of the VLDB Endowment, 2012.
  • Balcan et al. (2013) Balcan, M.-F., Ehrlich, S., and Liang, Y. Distributed k-means and k-median clustering on general topologies. In Advances in Neural Information Processing Systems, 2013.
  • Bateni et al. (2014) Bateni, M., Bhaskara, A., Lattanzi, S., and Mirrokni, V. Distributed balanced clustering via mapping coresets. In Advances in Neural Information Processing Systems, 2014.
  • Caldas et al. (2018) Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečnỳ, J., McMahan, H. B., Smith, V., and Talwalkar, A. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
  • Cho et al. (2020) Cho, Y. J., Wang, J., and Joshi, G. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243, 2020.
  • Dasgupta et al. (2007) Dasgupta, A., Hopcroft, J., Kannan, R., and Mitra, P. Spectral clustering with limited independence. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
  • Datta et al. (2005) Datta, S., Giannella, C., Kargupta, H., et al. K-means clustering over peer-to-peer networks. In International Workshop on High Performance and Distributed Mining, 2005.
  • Dhillon & Modha (2002) Dhillon, I. S. and Modha, D. S. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining. 2002.
  • Ester et al. (1996) Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery & Data Mining, 1996.
  • Feldman et al. (2012) Feldman, D., Sugaya, A., and Rus, D. An effective coreset compression algorithm for large scale sensor networks. In International Conference on Information Processing in Sensor Networks, 2012.
  • Ghosh et al. (2019) Ghosh, A., Hong, J., Yin, D., and Ramchandran, K. Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629, 2019.
  • Ghosh et al. (2020) Ghosh, A., Chung, J., Yin, D., and Ramchandran, K. An efficient framework for clustered federated learning. Advances in Neural Information Processing Systems, 2020.
  • Januzaj et al. (2004) Januzaj, E., Kriegel, H.-P., and Pfeifle, M. Dbdc: Density based distributed clustering. In International Conference on Extending Database Technology, 2004.
  • Kairouz et al. (2019) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • Kargupta et al. (2001) Kargupta, H., Huang, W., Sivakumar, K., and Johnson, E. Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 2001.
  • Kumar & Kannan (2010) Kumar, A. and Kannan, R. Clustering with spectral norm and the k-means algorithm. In Annual Symposium on Foundations of Computer Science, 2010.
  • Li et al. (2020a) Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 2020a.
  • Li et al. (2020b) Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020b.
  • Li et al. (2020c) Li, T., Sanjabi, M., Beirami, A., and Smith, V. Fair resource allocation in federated learning. In International Conference on Learning Representations, 2020c.
  • Lloyd (1982) Lloyd, S. Least squares quantization in pcm. IEEE Transactions on Information Theory, 1982.
  • Mansour et al. (2020) Mansour, Y., Mohri, M., Ro, J., and Suresh, A. T. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020.
  • McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 2017.
  • McSherry (2001) McSherry, F. Spectral partitioning of random graphs. In Symposium on Foundations of Computer Science, 2001.
  • Mohri et al. (2019) Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In International Conference on Machine Learning, 2019.
  • Ostrovsky et al. (2013) Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM, 2013.
  • Sattler et al. (2020) Sattler, F., Müller, K.-R., and Samek, W. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems, 2020.
  • Smith et al. (2017) Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017.
  • Tasoulis & Vrahatis (2004) Tasoulis, D. K. and Vrahatis, M. N. Unsupervised distributed clustering. In Parallel and Distributed Computing and Networks, 2004.
  • Wang & Chang (2020) Wang, S. and Chang, T.-H. Federated clustering via matrix factorization models: From model averaging to gradient sharing. arXiv preprint arXiv:2002.04930, 2020.
  • Xu et al. (1999) Xu, X., Jäger, J., and Kriegel, H.-P. A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining. 1999.

Appendix A Proofs

A.1 Proving Theorem 3.1 (Main Theorem)

Before we proceed to proving Theorem 3.1, we first establish a few preliminary results. Let 𝒯=(T1,,Tk) be our target clustering and let Tr(z) be the subset of points of a cluster Tr on device z. For any point, Ai(z) on device z, let c(Ai(z)) denote the index of the cluster it belongs to. That is,

Ai(z)Tc(Ai(z))(z)Tc(Ai(z)).

Also recall the definition of matrix C, the matrix of means. Here the i-th row of C contains the mean of the cluster which contains data points Ai, i.e. Ci=μ(Tc(Ai)). Our first lemma bounds how far the ‘local’ cluster mean μ(Tr(z)) can deviate from μ(Tr).

Lemma 2 (Lemma 5.2 in Kumar & Kannan (2010)).

Let Tr(z) be a subset of Tr on device z. Let μ(Tr(z)) denote the mean of the points indexed by Tr(z). Then,

μ(Tr(z))μ(Tr)2AC|Tr(z)|.
Proof.

Let A(z) be the sub-matrix of A on device z and let C~(z) be the corresponding sub-matrix of our matrix of means C. Let u be an indicator vector for points in Tr(z). Observe that,

|Tr(z)|(μ(Tr(z))μr)2 =(A(z)C~(z))u2
A(z)C~(z)u2
AC|Tr(z)|.

Here, for the last inequality, we note that (A(z)C~(z)) contains a subset of rows of (AC), and therefore A(z)C~(z)AC.

Now consider the local clustering problem on each device z. The device has a data matrix A(z), whose rows are a subset of A. Let T1(z),T2(z),,Tk(z) be subsets of T1,T2,,Tk on this device, such that no more than k of them are non-empty. Construct a matrix C(z), of the same dimensions as A(z) where for each row of A(z), the corresponding row of C(z) contains the mean of the local cluster the point belongs to. That is, the i-th row of C(z) contains μ(Tc(Ai(z))(z)). Using this next lemma, we bound the operator norm of the matrix (A(z)C(z)), in terms of (AC).

Lemma 3.

Let T1(z),T2(z),Tk(z) be subsets of target cluster that reside on a device such that k of them are non-empty. Let A(z) be the corresponding n(z)×d data matrix. Let C(z) be the corresponding matrix of means; that is each row Ci(z)=μ(Tc(Aiz)z). Then,

A(z)C(z)2kAC.
Proof.

Let C~(z) be an n(z)×d matrix where C~i(z)=μ(Tc(Ai(z))). First, consider a unit vector u along the top singular direction and observe that:

C~(z)C(z)2 =r=1k|Tr(z)|((μ(Tr(z))μ(Tr))u)2
r=1k|Tr(z)|μ(Tr(z))μ(Tr)22
(a)kAC2.

Here for inequality (a) we invoke Lemma 2. Also, noting that A(z)C~(z)AC, we get,

A(z)C(z) A(z)C~(z)+C~(z)C(z)
(1+k)AC2kAC.

We prove Theorem 3.1 in four parts:

  1. 1.

    In the first step we show that satisfying the active separation condition is sufficient to satisfy the Awasthi-Sheffet separation condition required for Lemma 1 (Lemma 4).

  2. 2.

    Next we use Lemma 4 to show that the first step of k-FED (Algorithm-1) will find local centers θr(z) that are close to true centers μ(Tr(z)) on device z. We state and prove this in Lemma 5.

  3. 3.

    In next step, we show that the process of picking k initial centers in steps 2-6 of k-FED picks exactly one local cluster center θr(z) for each cluster r. That is, we pick k local centers one corresponding to each target cluster. (Lemma 6)

  4. 4.

    Finally, we argue that with this initialization, the clustering of local cluster centers produced (τ1,,τk) has the property that, all local cluster centers corresponding the to the same cluster (say Tr) will be in the same set (say τr). Moreover, no local cluster center corresponding to any Ts, sr will be in τr. As we argue later, this is sufficient for the induced clustering produced by (τ1,,τk) to agree with our target clustering 𝒯=(T1,T2,) up to permutation of labels and missclassifications incurred at the local clustering stage.

Lemma 4.

Let (Tr,Ts) be cluster pairs such that, μrμs22cm0(Δr+Δs). Let TrzTr and TszTs be large subsets on device z. Then,

μr(z)μs(z)2ck(A(z)C(z)nr(z)+A(z)C(z)ns(z)).
Proof.

(Lemma 4) Using the triangle inequality, we have

μr(z)μs(z)2 μrμs2μr(z)μr2μsμs(z)2
2cm0(Δr+Δs)ACnr(z)ACns(z) (5)

using the active separation assumption. Now, expanding the terms can write the left hand side as

μr(z)μs(z)2 2cm0(kACnr+kACns)ACnr(z)ACns(z)
(2m0nr(z)nr1ck)ckACnr(z)(i)+(2m0ns(z)ns1ck)ckACns(z)(ii).

We first only consider the term (i). According to Lemma 3, AC12kA(z)C(z). Using this we can bound (i) as

(2m0nr(z)nr1ck)ckACnr(z)(2m0nr(z)nr1ck)ckA(z)C(z)nr(z).

Now recall that for large cluster subsets nr(z)1m0nr and thus 2m0nr(z)nr1ck21ck1. This means that we can bound term (i) as,

(2m0nr(z)nr1ck)ckACnr(z)ckA(z)C(z)nr(z).

We get a symmetric expression for term (ii) as well. Using this in equation 5, we get the desired result:

μr(z)μs(z)2ck(A(z)C(z)nr(z)+A(z)C(z)ns(z)).

Since Algorithm-1 is run locally on each device, it is unaffected by the inactive separation condition, as by definition, subsets of only active cluster pairs exist on each device. This implies that Algorithm-1 solves the local clustering problem successfully. Specifically on device z containing data from some cluster Tr, θrz is not too far from μ(Tr(z)). Showing this result is our second step and we state this formally in Lemma 5 below.

Lemma 5.

Let (T1(z),,Tk(z)) be the subsets of (T1,,Tk) on some device z such that no more than k of them are non-empty. Moreover, assume all non-empty subsets are large, i.e. |Tr(z)|1m0|Tr|. Finally, assume that the active separation requirement is satisfied for all active cluster pairs on z. Then, on termination of Algorithm-1, for each non-empty Tr(z), we have

θr(z)μ(Tr(z))225cA(z)C(z)nrz50kcACnrz,

and,

θr(z)μ(Tr)22m0kACnr2m0λ.
Proof.

First note that the local clustering problem with data matrix A(z) and matrix of centers C(z) satisfies the requirements of Lemma 1. Thus it follows that,

θr(z)μ(Tr(z))225cA(z)C(z)nrz.

Now applying Lemma 3 gives us the first statement.

To prove the second statement, we start off with the triangle inequality:

θr(z)μ(Tr)2 θr(z)μ(Tr(z))2+μ(Tr(z))μ(Tr)2
25cA(z)C(z)nrz+ACnrz.

Here for the last inequality we used Lemma 2. Now applying Lemma 3 and taking take c100, we get

θr(z)μ(Tr)2 50kcACnrz+ACnrz
(50c+1k)kACnrz
2kACnrz2m0kACnr
2m0λ.

This means that for a fixed r, all the θr(z) received at the central server from devices z[Z] are ‘close’ to μ(Tr).

The next step is to show that in the k initial centers k-FED picks in steps 2-6, there is exactly one corresponding to each target cluster Ti. We will show later that this is sufficient for the final step of the algorithm to correctly assign local cluster centers to the correct partition.

Lemma 6.

Let 𝒯=(T1,,Tk) be our target clustering. Assume all active cluster pairs and inactive cluster pairs satisfy their separation requirements. Further let nmin4c2knmax. Then at the end of step 6 of k-FED, for every target cluster Tr, there exists an θs(z)M such that θs(z)=μ(Ts(z)) for some z[Z].

Before we proceed to proving this lemma, we state and prove a lower bound on how close a local cluster center θr(z) can be to some cluster mean μ(Ts) for sr:

Lemma 7.

Let θr(z):=μ(Tr(z)). The for any sr, z[Z],

θr(z)θs(z)26m0λ.
Proof.

First, from the triangle inequality note that,

θr(z)θs(z)2μrμs2μrθr(z)2μsθs(z)2.

Using Lemma 5 and our inactive separation assumption we bound the right hand side further as,

μrμs2μrθr(z)2μsθs(z)2 10m0kACnmin4m0kACnr
6m0kACnmin6m0λ,

as desired.

Proof.

(Lemma 6) Let Mt denote the set M in step 2-6 of k-FED, after picking the first t points (1tk). Let us denote the point k-FED selects in iteration t as θt. That is,

θtargmaxz[Z],i[k]dMt1(θi(z)).

We will show that the set Mt contains t points corresponding to t different target clusters at every iteration t. This invariant holds trivially at t=1. Assume the statement first became false at some 1<tk. Let the point θt correspond to a local cluster mean from cluster Tr. Then there must exist some 1t′′<t such that θt′′ also correspond to a local cluster mean from Tr. Further, there must exist some cluster sr such that θs(z)Mt for any z[Z].

Now by definition of dMt1(θt), we have

dMt1(θt) =minθMt1θtθ2
θtθt′′2
(a)θtμ(Tr)2+μ(Tr)θt′′2
(b)4m0kACnr4m0λ. (6)

Here inequality (a) follows from the triangle inequality and (b) follows from Lemma 5.

Now consider θs(z) for any z. Since no other local cluster center from Ts is contained in Mt, from Lemma 7 we conclude that for every θMt1,

θs(z)θ26m0λ.

But this means that dMt1(θt)4m0λ6m0λdMt1(θs(z)) leading to a contradiction based on the definition of θt. This completes our argument.

Now we are ready to prove our main Theorem 3.1.

Proof.

From Lemma 6, we know that the set M at the end of step 6 of k-FED contains exactly one center corresponding to each target clustering. Let the local cluster center θ~rM correspond to the cluster Tr. Observe that for any z[Z],

θrzθ~r2 θrzμr2+μrθ~r2
4m0λ,

using Lemma 7. Further, for any sr,

θszθ~r2 6m0λ.

This means that for every r and z[Z], θrz is closer to the corresponding initial center θ~r than to any other initial center θ~s, sr. Let τr be the set of local cluster centers assigned to θr~. Then it can be seen that τr only contains local cluster centers θr(z) for all devices z, i.e. τr contains all the device cluster centers corresponding to target cluster Tr.

Now consider the definition of k-FED induced clustering (Definition 3.3), where we define

Tr={i:Ai(z)Us(z) and θs(z)τr}.

In this case, we know that only local cluster centers corresponding to cluster Tr is contained in τr. Thus our induced cluster Tr becomes,

Tr={i:Ai(z)Ur(z)}.

Now from Lemma 1 we know that on each device the sets (U1(z),,Uk(z)) and (T1(z),,Tk(z)) only differ on at most O(1c2)n(z). Summing this error over all devices z, we see that our induced clustering (T1,,Tk) and the target clustering (T1,,Tk) differ only on O(1c2)n points. Finally, if all the local points satisfy their respective proximity condition (Definition 3.1), then no points are missclassified. This concludes our proof.

A.2 Running Time of k-FED and Handling New Devices

We now analyze the running time of k-FED steps 2-8. Since step 1 is running Algorithm-1 on individual devices, we do not include the running time of this step as part of our analysis. Note that with the separation assumptions in place, Algorithm-1 will converge in polynomial time. However, as observed in practise, Lloyd like methods typically only take a few iterations to terminate.

Theorem A.1.

Steps 2-8 of k-FED takes O(Zkk2) pairwise distance computations to terminate. Further, after the set M in step 6 has been computed, new local cluster centers Θ(z) from a yet unseen device z can be correctly assigned in O(kk) distance computations.

Proof.

(Theorem 3.2) The proof of the first part follows from a simple step by step analysis. Step 1 can be performed in O(1). Step 2-6 executes exactly k times. At each iteration t, (1tk), we compute the distance of all device cluster centers, of which there are most Zk, to the points in Mt1. Thus at iteration t, this can be implemented with Zkt distance computations. Summing over all t, we see that steps 2-6 can run in O(Zkk2) distance computations. Finally, step 7 requires us to assign all the Zk device cluster centers to one of the k initial points in M. This can be implemented in O(Zkk) distance computations. Thus the overall complexity in terms of pairwise distance computations is O(Zkk2).

The second part of the statement follows from noting that for each θr(z)Θ(z), the nearest point in set M must be the initial center θ~r we picked as was demonstrated in the proof of Theorem 3.1. Thus every θr(z)Θ(z) is assigned to the correct partition τr as required.

A.3 Separating Data from Mixture of Gaussian

We now prove Theorem 4.1. Recall that we are working in the setting where kk. Our proof builds on results from Lemma 6.3, Kumar & Kannan (2010).

Proof.

First consider an active cluster pair r,s. Based on our separation requirement, we have:

μrμs2 2ckm0σmaxwminpolylog(dwmin)
2ckm0σmaxnwminnpolylog(dwmin).

We further simplify the right hand to get,

μrμs2ckm0σmaxn(1wrn+1wsn)polylog(dwmin).

Now note the number of points from each component Fr is very close to wrnr with very high probability. Here wr is the mixing weight of component r and nr is the number of data points. Using this, with high probability we have

μrμs2ckm0σmaxn(1nr+1ns)polylog(dwmin).

Further, it can be shown that AC is O(σmaxnpolylog(dwmin)) with high probability (see (Dasgupta et al., 2007)). Thus we conclude that, with high probability

μrμs2ckm0(ACnr+ACns).

Thus the active separation requirement is satisfied. The proof for the inactive separation condition is similar. Finally, the proximity condition follows from the concentration properties of Gaussians.

Appendix B Experimental Details

B.1 Datasets

For all experiments involving real data, we use the EMNIST, FEMNIST, and Shakespeare datasets. These datasets and their corresponding models are available at the LEAF benchmark: https://leaf.cmu.edu/. For client selection experiments, we manually partition a subset of FEMNIST (first 10 classes) by assigning 2 classes to each device. There are 500 devices in total. Both the number of training samples across all devices and the number of training samples per class within each device follow a power law. We use the natural partition of Shakespeare where each device corresponds to a speaking role in the plays of William Shakespeare. We randomly sample 109 users from the entire dataset. For personalization experiments, following Ghosh et al. (2020), we use a CNN-based model with one hidden layer and 200 hidden units trained with a learning rate of 0.01 and 10 local updates on each device.

B.2 Choosing k Based on Separation

As mentioned in Section 4.2, to create our oracle clustering, we compute the quantity crs=μrμs2m0(Δr+Δs) for each cluster pairs (r,s), for every candidate value of k we are considering. We construct a distribution plot of these crs. An example of such a plot for the MNIST dataset is provided in Figure 5. As can be seen here, for all values of k, the relative separation is quite small. Thus even for this oracle clustering, the actual separation between cluster means is small. To pick a k for our oracle clustering, we pick a fixed value c0 (say 0.5) and then pick the value of k which leads to maximum fraction of cluster pairs (r,s) to have crs>c0.

Refer to caption
Figure 5: Distribution plot of crs, for various values of k on the MNIST dataset. As can be seen, crs<1 for most cluster pairs, indicating that the separation between them is relatively small.