[1910.09036] 1 Introduction

1 Introduction

2 Clustering with Optimal Transport

3 Experiments

4 Conclusion and discussion

References

1 Introduction

2 Clustering with Optimal Transport

3 Experiments

4 Conclusion and discussion

References

Setting and notations.

A discriminative approach to clustering.

Learning a ’clustering-friendly’ representation.

Clustering and optimal transport

Contributions

Cluster assignment as an optimal transport problem

Entropic regularization of optimal transport

Link with soft-assignments in $k$ -means

Solving the clustering problem

Experimental setting

Evaluation

Setting and notations.

A discriminative approach to clustering.

Learning a ’clustering-friendly’ representation.

Clustering and optimal transport

Contributions

Cluster assignment as an optimal transport problem

Entropic regularization of optimal transport

Link with soft-assignments in k-means

Solving the clustering problem

Experimental setting

Evaluation

Link with soft-assignments in $k$ -means

Abstract

Proposition 1.

Proof.

Proposition 2.

Proof.

Abstract

Proposition 1.

Proof.

Proposition 2.

Proof.

Differentiable Deep Clustering with Cluster Size Constraints

Aude Genevay Gabriel Dulac-Arnold Jean-Philippe Vert

CSAIL, MIT Google Brain Google Brain

Clustering is a fundamental unsupervised learning approach. Many clustering algorithms – such as $k$ -means – rely on the euclidean distance as a similarity measure, which is often not the most relevant metric for high dimensional data such as images. Learning a lower-dimensional embedding that can better reflect the geometry of the dataset is therefore instrumental for performance. We propose a new approach for this task where the embedding is performed by a differentiable model such as a deep neural network. By rewriting the $k$ -means clustering algorithm as an optimal transport task, and adding an entropic regularization, we derive a fully differentiable loss function that can be minimized with respect to both the embedding parameters and the cluster parameters via stochastic gradient descent. We show that this new formulation generalizes a recently proposed state-of-the-art method based on soft- $k$ -means by adding constraints on the cluster sizes. Empirical evaluations on image classification benchmarks suggest that compared to state-of-the-art methods, our optimal transport-based approach provide better unsupervised accuracy and does not require a pre-training phase.

Clustering is a fundamental unsupervised learning task, where, given a training set of data $(x_{1},\ldots,x_{n})\subset\mathcal{X}$ and a number of classes $K$ , we aim to partition the training data into $K$ non-overlapping clusters corresponding to different classes of points. We consider an extension of this problem where we additionally want to classify out-of-sample data not in the training set, i.e., we want to infer a function $c:\mathcal{X}\rightarrow[1,K]$ that maps a given point in the data space to a class. Many clustering techniques exist, such as agglomerative clustering, when $\mathcal{X}$ is endowed with a metric, or $k$ -means, when $\mathcal{X}=\mathbb{R}^{d}$ . However, these methods often fail when applied to complex higher-dimentional data, as the usual metric on the dataspace (e.g. euclidean in $\mathbb{R}^{d}$ ) is not meaningful.

For complex data such as images or strings, recent years have witnessed significant progress in learning representations $f_{\theta}:\mathcal{X}\rightarrow\mathbb{R}^{p}$ , where $f_{\theta}$ is typically a deep neural network (DNN) with parameters $\theta$ , and $\mathbb{R}^{p}$ is a low dimensional space that intends to capture the underlying structure of the data (Bengio et al., 2013). The representation $f_{\theta}$ is usually optimized to either solve a supervised task such as image classification using a training set of labeled images, or, in the absence of labels, can be used to summarize the data, e.g., by using (variational) auto-encoders or GANs (Kingma et al., 2014; Goodfellow et al., 2014).

Any such learned representation $f_{\theta}$ can be used in conjunction with any clustering algorithm to cluster the training set mapped to the representation space $(f_{\theta}(x_{1}),\ldots,f_{\theta}(x_{n}))\subset\mathbb{R}^{p}$ . However, there is no guarantee that the representation $f_{\theta}$ is “good” for this clustering task if it has been optimized for another task. In this work, we propose a new approach to learn a representation $f_{\theta}$ well adapted to solve the clustering task, in the absence of labels. This setting has been considered by several authors recently, often under the name of “deep clustering”, and before reviewing existing approaches let us fix some notations.

We consider a dataset of $n$ points $\mathcal{D}_{n}=(x_{1},\dots,x_{n})\in\mathcal{X}^{n}$ where $\mathcal{X}\in\mathbb{R}^{d}$ is the space of input data (e.g., $\mathcal{X}=\mathbb{R}^{32\times 32\times 3}$ for 3-channel $32\times 32$ images). Our goal is to cluster the data into $K$ clusters, which might correspond to the number of classes in a supervised setting. For a dataset $\mathcal{D}_{n}$ we denote by $f_{\theta}(\mathcal{D}_{n})$ the embedded dataset $(f_{\theta}(x_{1}),\dots,f_{\theta}(x_{n}))$ , where $f_{\theta}:\mathcal{X}\rightarrow\mathbb{R}^{p}$ is a deep neural network with parameters $\theta$ and $p<<d$ .

Additionally, we denote by $\Delta_{K}=\left\{z\in\mathbb{R}_{+}^{K}\,:\,\sum_{k=1}^{K}z_{k}=1\right\}$ the probability simplex. For any integer $n\in\mathbb{N}$ , let $\mathbf{1}_{n}\in\mathbb{R}^{n}$ be the $n$ -dimensional vector of ones. Given two vectors $a,b\in\mathbb{R}^{n}$ , where $b_{i}\neq 0$ for $i\in[1,n]$ , we denote by $a\oslash b\in\mathbb{R}^{n}$ the vector with entries $(a\oslash b)_{i}=a_{i}/b_{i}$ . For any vector or matrix $M$ , we denote by $\exp(M)$ the matrix obtained by applying the operation entrywise, e.g., $[\exp(M)]_{ij}=\exp(M_{ij})$ , and by $M^{\top}$ the transpose of $M$ .

The clustering task can be recast as a classification problem, without relying on a representation of the data, focusing directly on the clustering function $c_{\theta}:\mathcal{X}\rightarrow[1,K]$ . This only makes sense in cases where classes are well separated. In that case, the labels $Y=(y_{1},\ldots,y_{n})\in[1,K]^{n}$ of the training set are known, then one could train a supervised model by minimizing over $\theta$ an empirical risk of the form

R(\theta,Y)=\frac{1}{n}\sum_{i=1}^{n}\ell\left(c_{\theta}(x_{i}),y_{i}\right)\,.

Since we are considering an unsupervised setting, $Y$ is not available. A solution is thus to jointly optimize the above criterion over $\theta$ and $Y$ to learn both a class assignment and a classifier:

(\hat{\theta},\hat{Y})\in\mathop{\mathrm{argmin}\,}_{\theta\in\Theta,Y\in[1,K]^{n}}R(\theta,Y).

(1)

An obvious computational difficulty is that this problem involves the discrete variable $Y$ . Besides, some kind of regularization is required in this double optimization task to prevent trivial solutions; adding constraints on $Y$ is crucial to prevent empty or overpopulated clusters. Joulin et al. (2010) propose a convex relaxation of (1) in the case of linear regression with the squared loss $\ell(u,v)=(u-v)^{2}$ for binary problems ( $K=2$ ). In that case, the objective function is quadratic in $Y$ and they use the standard semidefinite programming (SDP) relaxation for the matrix $YY^{\top}$ to approximate a minimum.

A different approach is used by Chang et al. (2017) who recast the clustering problem as a binary classification problem: given two data points, do they belong to the same cluster? The resulting algorithm, Deep Adaptive Clustering (DAC) can be summarized as follows: each data point is mapped to a vector in the unit ball of $\mathbb{R}^{K}$ thanks to $f_{\theta}$ , which represents its probability to belong to each class. These probabilities are then compared with the cosine distance, which defines the similarity of the two points. The points are assumed to belong to the same class if the similarity is above a certain threshold. The parameters $\theta$ are then updated in order to increase similarity between similar points.

However, representations are useful beyond the clustering task, e.g., to extract features or reduce the dimension of a dataset, which is why many methods in the literature rather learn a representation $f_{\theta}:\mathcal{X}\rightarrow\mathbb{R}^{p}$ such that $f_{\theta}(\mathcal{D}_{n})$ becomes easy to cluster. In most cases, the problem consists in minimizing a combined loss made of two terms : (i) a “representation loss” $\ell_{r}(\theta)$ to ensure that the representation space is not degenerate (ii) a “clustering loss” $\ell_{c}(\theta)$ to enforce that the learned representation $f_{\theta}(\mathcal{D})$ is relevant for the clustering task. This yields the following optimization problem:

\min_{\theta}L(\theta)\stackrel{{\scriptstyle\mbox{def.}}}{{=}}\ell_{r}(\theta)+\lambda\ell_{c}(\theta)\,.

(

\mathcal{P}

)

While the choice of the reconstruction loss of an auto-encoder (with encoder $f_{\theta}$ and decoder $g_{\varphi}$ )

\ell_{r}^{ae}(\theta)=\min_{\varphi}\sum_{i=1}^{n}\|x_{i}-g_{\varphi}\left(f_{\theta}(x_{i})\right)\|^{2}\,,

(2)

is the standard for the representation loss, these methods vary mostly in their choices of the clustering losses, the auto-encoder model, and the optimization strategies (in particular to prevent trivial solutions).

Song et al. (2013) are one of the earliest to learn a representation for clustering by tweaking the objective function of a standard auto-encoder. They formulate the problem as minimizing the combined loss ( $\mathcal{P}$ ) with the objective of $k$ -means as the clustering loss:

\ell_{c}^{km}(\theta)=\min_{\mu_{1},\ldots,\mu_{k}\in\mathbb{R}^{p}}\left\{\sum_{i=1}^{n}\min_{j\in[1,K]}\|f_{\theta}(x_{i})-\mu_{j}\|^{2}\right\}\,.

(3)

To optimize the objective function over the encoder parameters $\theta$ , the decoder parameters $\varphi$ and the cluster centers $\mu_{1},\ldots,\mu_{k}$ , they alternate one epoch of stochastic gradient descent over $(\theta,\varphi)$ , with one update of the cluster centers and assignments.

While most state-of-the-art methods rely on clustering objectives that are strongly linked to $k$ -means, Joint Unsupervised LEarning (JULE) (Yang et al., 2016) uses a clustering loss based on “agglomerative clustering”. Starting from clusters consisting of datapoints, the training alternates between a few steps of agglomerative clustering, i.e., merging similar clusters, and a backward pass during which the network parameters are updated to minimize the clustering loss. Although this method has a more flexible geometry, it requires building an affinity graph of the dataset after each update and is thus computationally heavy.

Xie et al. (2016) propose Deep Embedded Clustering (DEC) which starts with a pre-training phase using only the reconstruction loss $\ell_{r}(\theta,\varphi)$ and then improves the clustering ability of the representation by optimizing $f_{\theta}$ in a self-supervised manner. Their clustering loss is the Kullback-Leibler divergence between the soft-assignments $q_{ik}$ of each point $i$ to each cluster $k$ and the square of the soft-assignments, which should push the embedding to favor harder assignments. There are several variants of DEC using more sophisticated auto-encoders and training techniques such as Guo et al. (2017). The DEPICT algorithm (Dizaji et al., 2017) similarly minimize the KL divergence to sharpen their assignments but also introduce a classifier $h_{\beta}$ , that outputs $h_{\beta}(z)$ a probability distribution over the $k$ classes (typically, a neural network with softmax activation at the last layer). Thus the clustering loss corresponds to the possibility to discriminate the data in $k$ different classes.

The clustering loss in Deep Clustering Network (DCN) (Yang et al., 2017) is the objective of $k$ -means in the representation space. However, minimizing the total loss $L$ over $\theta,\varphi,\mu$ (cluster centers) and $\pi$ (cluster assignments) jointly is challenging. Thus, Yang et al. (2017) alternate optimization in $(\theta,\varphi)$ for fixed $(\mu,\pi)$ , which becomes a variant of AE training, and in $(\mu,\pi)$ for fixed $(\theta,\varphi)$ . The Deep $k$ -means (DKM) (Fard et al., 2018) algorithm uses the same loss as DCN but relax the assignment problem by replacing the cluster assignments with soft-assignments in the $k$ -means objective. This results in a clustering loss can be jointly minimized over $\theta$ and $\mu$ , using stochastic gradient descent (SGD), and leads to state-of-the-art performance in deep clustering (Fard et al., 2018). The latter is the approach which is closest to ours, as we also propose a fully differentiable objective based on $k$ -means.

There is a link between $k$ -means clustering and optimal transport, which was first noticed in Pollard (1982) and studied in more details in Canas and Rosasco (2012). Roughly speaking, optimal transport is equivalent to a constrained formulation of $k$ -means in which the cluster sizes are prescribed. This framework makes sense in a setting where the proportion of each class in a dataset is known, but no information is available at the individual level. Cuturi and Doucet (2014) introduced an entropic regularization of that problem which allows for an efficient solver.

Following Cuturi and Doucet (2014) we exploit the connection between optimal transport and $k$ -means, and rely on entropic regularization to derive a fully-differentiable clustering loss that can be used in ( $\mathcal{P}$ ) and directly optimized with SGD. We give an insight on the effect of regularization in the cluster assignment problem, and show that the soft $k$ -means loss introduced by Fard et al. (2018) can be interpreted as an optimal transport loss with only one marginal constraint. The constraints on cluster sizes that naturally occur with optimal transport allow to enforce a prior on cluster sizes without relying to additional terms in the optimization problem. This leads to better clustering performance on benchmark datasets.

Consider $n$ sample points $\left\{x_{1},\ldots,x_{n}\right\}\subset\mathbb{R}^{d}$ embedded in the representation space via $f_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{p}$ , and $K$ clusters in that representation space with centers $\left\{\mu_{1},\ldots,\mu_{K}\right\}\subset\mathbb{R}^{p}$ . We want to assign samples to clusters so that:

(i)

each sample is assigned to exactly one cluster,
(ii)

each cluster $k=1,\ldots,K$ contains exactly $n_{k}$ points,
(iii)

the total distance (in the representation space) between cluster centers and their assigned samples is minimal.

The mathematical formulation of the above problem reads as follows:


$\displaystyle\ell_{c}^{OT}=$	$\displaystyle\!\min_{\pi\in\{0,1/n\}^{n\times K}}$	$\displaystyle\sum_{i=1}^{n}\sum_{k=1}^{K}\\|f_{\theta}(x_{i})-\mu_{k}\\|^{2}\pi_{k,i}$	( $O T$ )
	s.t.	$\displaystyle\pi\mathbf{1}_{K}=\frac{1}{n}\mathbf{1}_{n},$	( $c_{1}$ )
	$\displaystyle\pi^{\top}\mathbf{1}_{n}=w,$		( $c_{2}$ )

where $w=(\frac{n_{1}}{n},\dots,\frac{n_{k}}{n})\in\Delta_{K}$ is the vector of cluster proportions.

The above problem is known as optimal transport between the discrete measure $\alpha\stackrel{{\scriptstyle\mbox{def.}}}{{=}}\frac{1}{n}\sum_{i=1}^{n}\delta_{f_{\theta}(x_{i})}$ and $\beta=\sum_{k=1}^{K}\frac{n_{k}}{n}\delta_{\mu_{k}}$ . If we remove the constraint on cluster sizes ( $c_{1}$ ), it boils down to the objective function of the $k$ -means problem with cluster centers $\left\{\mu_{1},\ldots,\mu_{K}\right\}$ (Pollard, 1982).

Algorithm 1 Sinkhorn’s Algorithm for Reg. OT

1:Parameters

\varepsilon

;

n_{iter}

2:Input

(f_{\theta}(x_{i}))_{i=1\dots n}

;

(\mu_{k})_{k=1\dots K}

;

w

C_{ik}=\left\|{f_{\theta}(x_{i})-\mu_{k}}\right\|^{2}\quad\forall\>i,k

M=\exp(-C/\varepsilon)

5:Initialize

b\leftarrow\mathbf{1}_{K}

6:for

j=1,2,\dots,n_{iter}

a\leftarrow\frac{1}{n}\frac{\mathbf{1}_{n}}{Mb}

b\leftarrow\frac{w}{M^{\top}a}

9:Return

\pi_{ik}=a_{i}M_{ik}b_{k}\quad\forall\>i,k

Solving optimal transport is computationally expensive as it requires solving a large linear program and a common workaround in the literature is to regularize the problem with entropy (Cuturi, 2013). The regularized problem then reads as follows:

The addition of entropy allows to solve the problem with a much faster iterative algorithm, called Sinkhorn’s algorithm, whose iterations are summarized in Algorithm 1. Although this fast solver is the main reason why regularized optimal transport became routinely used in machine learning tasks, recent papers have exploited the fact that it also leads to a differentiable loss, whose gradients can be easily computed with backpropagation through Sinkhorn iterations (Genevay et al., 2018; Salimans et al., 2018).

It is known that a linear program reaches its maximum on the vertices, which is why the Optimal Transport problem is equivalent to its relaxation to the simplex. The addition of entropy will move the solution away from the optimal vertex, towards the center of the constraint polytope thus yielding smoother assignments (Peyré et al., 2019). This is formalized in the proposition below:

Consider the regularized optimal transport problem ( $O T$ ), and the optimal assignment $\pi_{\varepsilon}$ .
When $\varepsilon\rightarrow 0$ :

•

$\pi_{\varepsilon}\rightarrow\pi$ (the solution of ( $O T$ ))
•

$\ell_{c}^{OT_{\varepsilon}}\rightarrow\ell_{c}^{OT}$ ,

When $\varepsilon\rightarrow\infty$ :

•

$\pi_{\varepsilon}\rightarrow\frac{1}{n}\mathbf{1}_{n}w$ (i.e., each point is assigned to all clusters according to global proportions w)
•

$\ell_{c}^{OT_{\varepsilon}}\rightarrow\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}w_{k}\|f_{\theta}(x_{i})-\mu_{k}\|^{2}$ .

The proposition is an adaptation of Theorem 1 in (Genevay et al., 2018) to our clustering setting. ∎

The choice of $\varepsilon$ is a crucial question: when epsilon gets smaller – i.e. when we get closer to ’true’ Optimal Transport – Sinkhorn’s algorithm requires more iterations to converge (see e.g. (Peyré et al., 2019)) meaning that a better approximation of optimal transport comes at a heavy computational price. However, it has recently been proved that approximating optimal transport from samples – which is typically the case in machine learning, it is actually beneficial to use $\varepsilon$ not too small to avoid the curse of dimension from which optimal transport suffers (Genevay et al., 2019).

Algorithm 2 OT-based Deep Clustering

1:Parameters

K

n_{pre-train}

n_{epochs}

, m

2:Input Dataset

(x_{1},\dots,x_{n})

, cluster proportions

w

3:Initialize

f_{\theta}

(encoder) and

g_{\varphi}

(decoder) with random weights

4:Initialize centers

\mu

with

k

-means on embedded images

(f_{\theta}(x_{1}),\dots,f_{\theta}(x_{n}))

5:for

i=1

n_{pre-train}

do (pre-training)

6: for

j=1

n/m

\mathcal{D}_{j}=(x^{(j)}_{1},\dots,x^{(j)}_{m})

batch of size

m

8: Compute loss

\ell_{r}^{ae}(\theta,\varphi)

9: Update

\theta

\varphi

with a gradient step

10:for

i=1

n_{epochs}

do (Training)

11: for

j=1

n/m

12:

\mathcal{D}_{j}=(x^{(j)}_{1},\dots,x^{(j)}_{m})

batch of size

m

13: Compute

\pi(f_{\theta}(\mathcal{D}_{j}),\mu,w)

with Sinkhorn

14: Compute loss

\ell_{r}^{ae}(\theta,\varphi)+\ell_{c}^{OT_{\varepsilon}}(\theta,\mu)

15: Update

\theta

\varphi

and

\mu

with a gradient step

16:for

i=1

n

do (Final Clustering)

17: Assign

x_{i}

k_{i}=\mathop{\mathrm{argmin}\,}_{k}\left\|{f_{\theta}(x_{i})-\mu_{k}}\right\|^{2}

Refer to caption — Figure 1: Accuracy on MNIST (left) and CIFAR10 (right)

The optimal transport formulation includes two marginal constraints, one being that each sample is assigned to exactly one cluster and the other being the cluster size. The latter constraint can be omitted to obtain an objective which is that of $k$ -means. When regularizing the optimal transport problem with only one marginal constraint, we thus get a differentiable $k$ -means objective.

Consider the variant of entropy-regularized Optimal Transport with only one marginal constraint (i.e., no prior on cluster sizes):


	$\displaystyle\!\min_{\pi\in[0,1]^{n\times K}}$		$\displaystyle\sum_{i=1}^{n}\sum_{k=1}^{K}\\|f_{\theta}(x_{i})-\mu_{k}\\|^{2}\pi_{k,i}$
	$\displaystyle\qquad+\varepsilon\pi_{k,i}\left(\log(\pi_{k,i})-1\right)$
	s.t.		$\displaystyle\pi\mathbf{1}_{K}=\frac{1}{n}\mathbf{1}_{n},$

then the optimal assignment $\pi^{*}$ is given by

\pi_{k,i}^{*}=\frac{e^{-\left\|{f_{\theta}(x_{i})-\mu_{k}}\right\|^{2}}/\varepsilon}{n\sum_{k^{\prime}=1}^{K}e^{-\left\|{f_{\theta}(x_{i})-\mu_{k^{\prime}}}\right\|^{2}}/\varepsilon}.

(7)

This is a convex function of $\pi$ with linear constraints. Denoting by $\lambda$ the Lagrange multiplier for the constraint, the Lagrangian is :

$\displaystyle\mathcal{L}(\pi,\lambda)=$	$\displaystyle\sum_{i=1}^{n}\sum_{k=1}^{K}\\|f_{\theta}(x_{i})-\mu_{k}\\|^{2}\pi_{k,i}$	(8)
	$\displaystyle+\varepsilon\pi_{k,i}\left(\log(\pi_{k,i})-1\right)$	(9)
	$\displaystyle+\sum_{i=1}^{n}\lambda_{i}(\sum_{k=1_{K}}\pi_{ik}-\frac{1}{n})$	(10)

The first order conditions of the Lagrangian gives (7). ∎

The solution of that problem corresponds exactly to the differentiable $k$ -means loss introduced by Fard et al. (2018), which the authors motivated by replacing the $\min$ in the $k$ -means objective (3) with the softmin function. Hence Proposition 2 provides a new interpretation of DKM, and shows that the approach we propose below generalizes DKM by adding constraints on the cluster sizes.

Several papers in the literature use the $k$ -means objective as the clustering loss $\ell_{c}$ in ( $\mathcal{P}$ ). We propose to replace this by the objective function of regularized optimal transport, which allows to:

(i)

enforce a prior on the cluster proportions,
(ii)

obtain a loss that is differentiable with respect to both the cluster centers $(\mu_{k})_{k}$ and the embedding parameters $\theta$ .

The clustering problem ( $\mathcal{P}$ ) becomes:

\min_{\theta,\varphi,\mu}\ell_{r}^{ae}(\theta,\varphi)+\ell_{c}^{OT_{\varepsilon}}(\theta,\mu)\,,

(11)

where $\ell_{r}^{ae}$ is the reconstruction loss of the auto-encoder defined in (2) and $\ell_{c}^{OT_{\varepsilon}}$ is the result of the regularized optimal transport problem ( $O T$ ). This function is differentiable with respect to all its variables, and can be minimized with SGD. We outline the full learning procedure in Algorithm 2.

We assess the efficiency of our optimal transport-based clustering loss on two classical benchmark datasets for image classification, MNIST and CIFAR10. We compare it against a simple baseline which consists in learning an embedding with an auto-encoder and then running $k$ -means on the embedded data, and against the state-of-the-art DKM method of Fard et al. (2018) based on the soft $k$ -means loss, which already proved its superiority over existing methods.

For the MNIST dataset, we follow mutiple examples from the litterature by using a fully connected auto-encoder with 500-250-10-250-500 structure and ReLu activation. For the CIFAR dataset, we use as an encoder a standard convolutional network using ReLu activation, with three convolution layers of respective depths 32, 64, 128, respective kernel sizes 5, 5, 3, and common stride of 2, followed by a fully connected layer to a latent space of dimension 10. In both cases we use batches of size $m=300$ . The gradient updates are made with the Adam algorithm and the standard learning rate from TensorFlow (0.001) with step decay, as there should not be parameter tuning in an unsupervised setting. The algorithm used for training is summarized in Algorithm 2. We also run $k$ -means on raw pixels, to give an idea of how much structure is induced in the data by the embedding $f_{\theta}$ . Following usual guidelines regarding regularization for optimal transport, we set $\varepsilon=10^{-2}$ which gives a good enough approximation of optimal transport without requiring too many Sinkhorn iterations. We show in Fig. 2 that the final performance is robust to the choice of $\varepsilon$ as long as it is not too large.

	$k$ -means	AE + $k$ -means	soft $k$ -means	soft $k$ -means (p)	OT	OT (p)
MNIST	0.513	$0.801(\pm 0.025)$	$0.810(\pm 0.033)$	$0.837(\pm 0.032)$	$0.851(\pm 0.032)$	$0.846(\pm 0.040)$
		$[0.765,0.912]$	$[0.677,0.883]$	$[0.781,0.923]$	$[0.771,0.932]$	$[0.759,0.928]$
CIFAR10	0.206	$0.237(\pm 0.005)$	$0.239(\pm 0.008)$	$0.243(\pm 0.010)$	$0.243(\pm 0.007)$	$0.244(\pm 0.009)$
		$[0.230,0.259]$	$[0.227,0.257]$	$[0.227,0.261]$	$[0.232,0.260]$	$[0.230,0.266]$

Table 1: Average accuracy from clustering on CIFAR and MNIST datasets (over 50 runs) with standard deviation and max and min accuracy over the runs (second line). (p) means ‘with pre-training’

After each epoch, we evaluate the different methods by computing the accuracy given by matching clusters to classes through the following formula:

accuracy=\max_{\sigma\in\mathfrak{S}}\sum_{i=1}^{n}\mathbf{1}_{y_{i}=\sigma(k_{i})},

where $y_{i}$ and $k_{i}$ are respectively the class label and the cluster index associated to $x_{i}$ and $\mathfrak{S}$ is the set of permutations of $\{1\dots K\}$ . For the ‘AE + $k$ -means’ method, the cluster assignment is made by running $k$ -means on the embedded data, while for both ‘soft $k$ -means’ and ‘OT’, since these methods also learn the cluster centers $(\mu_{1},\dots,\mu_{K})$ we assign the point $x_{i}$ to cluster $k_{i}$ such that $k_{i}=\mathop{\mathrm{argmin}\,}_{k}\left\|{f_{\theta}(x_{i})-\mu_{k}}\right\|^{2}$ . The optimal matching between clusters and classes is done via the Hungarian algorithm, as in the literature.

The evolution of accuracy during training for all three methods (auto-encoder + kmeans, soft $k$ -means, optimal transport) is plotted in Fig. 1. The curves are averaged over 50 runs. The final accuracies are reported in Table 1 along with the standard deviations. We can seen that the optimal transport loss achieves superior accuracy, but mostly doesn’t need to rely on pre-training to get good performance, contrarily to soft $k$ -means, whose performance is only slightly above the baseline without pre-training. To assert the statistical significance of the superiority of optimal transport in this framework, we further run a Welch’s t-test over the final accuracies in the 50 runs. Without pre-training, we find out that optimal transport is significantly better than soft $k$ -means (p-value of $0.0067$ for CIFAR10 and $10^{-12}$ for MNIST). With pre-training, optimal transport is still significantly better than soft $k$ -means for MNIST at the 10% level (p-value of 0.10) but it’s not the case for CIFAR10 (p-value of 0.33). Note that for both datasets, pre-training did not yield significantly better performance for optimal transport (p-values $>0.2$ ), while it significantly improves soft $k$ -means (p-values $<0.05$ ).

Fig. 2 displays the accuracy after 200 epochs for each method, as a function of $\varepsilon$ . We see that the competitive advantage of OT over soft $k$ -means is robust to the choice of $\varepsilon$ as long as it is not too large. Incidentally, the methods with pre-training are more robust to large values for $\varepsilon$ . Note that these curves are averaged over only 5 runs and thus can not be regarded as statistically significant, they merely serve as a proof of robustness of the method to the chosen parameter.

In this paper we propose a new fully differentiable framework for deep clustering, based on regularized optimal transport, which generalizes the recently proposed approach of Fard et al. (2018) based on soft- $k$ -means. Its main advantage over competing methods is its ability to naturally enforce a prior on class proportions. This significantly improves performance on datasets with well balanced classes, without relying on pre-training of the embedding.

In our experiments we observed a benefit over soft- $k$ -means in situations where the classes are balanced. An interesting direction to explore is to extend the application of our method when the prior knowledge on cluster size is not uniform. This may be relevant in cases, for example, when an expert provides a rough estimate of the proportion of different classes, such as the proportion of cancer cells in an histopathological image. While our formulation lends itself naturally to non-uniform cluster proportions, we observed in preliminary experiments that it performs poorly if no care is taken to make sure that the cluster size constraints ( $w$ in Algorithm 2) is coherent with the set of cluster centers (vectors $\mu_{k}$ in Algorithm 2). We found in particular that this is often poorly achieved by the initialization of the centers via $k$ -means (step 4 in Algorithm 2. For instance, consider the case where we have two clusters – say images of ones and images of twos – and we know that we should have 20 % of the former and 80 % of the latter. However, if the $k$ -means initialization of the centers puts the first center in the middle of twos and the second center in the middle of ones, the algorithm will try to enforce a 80% proportion on ones and 20% proportion on twos. Generally speaking, to ensure that we are enforcing the cluster proportions properly, some sort of matching has to be done before the learning phase between the indexes of the clusters and the indexes of the classes. This could be done in a supervised way, by using an example from each class to initialize the centers. This extension falls in the framework of one-shot learning, with an additional knowledge on class proportions.

Another extension of our method would be relax the strict constraint of cluster proportions to a soft constraint, using for example unbalanced optimal transport with a relaxed version of the Sinkhorn algorithm which penalizes the marginal constraints instead of enforcing them strongly (Chizat et al., 2018). This may be particularly relevant when small batches are considered, as one would not expect the composition of each batch to perfectly reflect the overall composition.

Finally, we note that our formulation of the clustering problem with optimal transport is closely linked to that of Dulac-Arnold et al. (2019), who propose an algorithm to learn a classifier from label proportions in mini-batches. The main difference is that instead of using $f_{\theta}$ to parametrize an embedding, the authors directly use it to predict the probability of belonging to class $k$ . The last layer of $f_{\theta}$ is a softmax, and thus the term $\|f_{\theta}(x_{i})-\mu_{k}\|^{2}$ is replaced by $f_{\theta}(x_{i})_{k}$ . Besides, they loosen the marginal constraint prescribing the clusters proportions by using unbalanced optimal transport. The latter can also be implemented in our proposed method, as it consists in using a variation of Sinkhorn’s algorithm (Chizat et al., 2018). However, the performance that they report for large batch sizes in lower than what we report for the fully unsupervised task in our experiments. This would make our optimal transport approach a good candidate for the learning with proportions problem.

Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.
Canas and Rosasco (2012) Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.
Chang et al. (2017) Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.
Chizat et al. (2018) Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.
Cuturi (2013) Marco Cuturi. Sorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
Cuturi and Doucet (2014) Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
Dizaji et al. (2017) K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5747–5756, Oct 2017. doi: 10.1109/ICCV.2017.612.
Dulac-Arnold et al. (2019) G. Dulac-Arnold, N. Zeghidour, M. Cuturi, L. Beyer, and J.-P. Vert. Deep multi-class learning from label proportions. Technical Report 1905.12909, arXiv, 2019.
Fard et al. (2018) Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Deep $k$ -means: Jointly clustering with $k$ -means and learning representations. arXiv preprint arXiv:1806.10069, 2018.
Genevay et al. (2018) Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 2018.
Genevay et al. (2019) Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa, Japan, 2019.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
Guo et al. (2017) Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pages 1753–1759. AAAI Press, 2017. ISBN 978-0-9992411-0-3.
Joulin et al. (2010) A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
Kingma et al. (2014) Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3581–3589, 2014.
Peyré et al. (2019) Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
Pollard (1982) David Pollard. Quantization and the method of k-means. IEEE Transactions on Information theory, 28(2):199–205, 1982.
Salimans et al. (2018) Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving gans using optimal transport. arXiv preprint arXiv:1803.05573, 2018.
Song et al. (2013) Chunfeng Song, Feng Liu, Yongzhen Huang, Liang Wang, and Tieniu Tan. Auto-encoder based data clustering. In José Ruiz-Shulcloper and Gabriella Sanniti di Baja, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 117–124, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
Xie et al. (2016) Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 478–487. JMLR.org, 2016.
Yang et al. (2017) Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3861–3870, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
Yang et al. (2016) Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.556.

$\displaystyle\ell_{c}^{OT_{\varepsilon}}=$

$\displaystyle\!\min_{\pi\in[0,1]^{n\times K}}$

$\displaystyle\sum_{i=1}^{n}\sum_{k=1}^{K}\|f_{\theta}(x_{i})-\mu_{k}\|^{2}\pi_{k,i}$

$\displaystyle\qquad+\varepsilon\pi_{k,i}\left(\log(\pi_{k,i})-1\right)$

( $OT_{\varepsilon}$ )

$\displaystyle\pi\mathbf{1}_{K}=\frac{1}{n}\mathbf{1}_{n},$

( $c_{1}$ )

$\displaystyle\pi^{\top}\mathbf{1}_{n}=w,$

( $c_{2}$ )