Chapter 2. Clustering

In this chapter, we will start applying the data transformation operations that we learned in the previous chapter and begin finding interesting patterns in the given information, discovering groups of data, or clusters, using clustering techniques.

In this process we will also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library.

The topics we will cover in this chapter are as follows:

  • Learning from data with unsupervised learning, in particular clustering
  • The k-means and k-nearest neighbors techniques
  • Useful helper libraries: matplotlib for plotting and the scikit-learn dataset module for synthetic data generation
  • Two example projects applying these techniques to synthetic datasets

Learning from data - unsupervised learning

In this chapter we will be reviewing two cases of unsupervised learning.

Unsupervised learning basically consists of finding patterns in a given dataset. Normally, little or no prior information is provided, and the procedure should be able to automatically determine how the information is organized and recognize the different structures in the data.

Clustering

One of the simplest operations that can initially be applied to an unlabeled dataset is to try to identify the possible groups formed by the dataset members' common features.

To do so, the dataset can be split into an arbitrary number of segments, where each one can be summarized by a central mass point, or centroid, that represents the points belonging to a determined group or cluster.

In order to define the criterion that assigns different members to the same group, we need to define a concept that represents the distance between data elements, so we can simply say that all class members are closer to their own centroid than to any other centroid.

In the following graphic, we can see the results of a typical clustering algorithm and the representation of the cluster centers:

Sample clustering algorithm output

k-means

k-means is a very popular clustering algorithm and it can be easily implemented. It is very straightforward, and applying it as a first procedure to datasets with good class separation can provide a good a priori understanding of the data.

Mechanics of k-means

k-means tries to divide a set of samples into k disjoint groups or clusters using the mean value of the members as the main indicator. This point is normally called a centroid, referring to the arithmetical entity with the same name, and is represented as a vector in a space of arbitrary dimensions.
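
As a quick hedged illustration (the values below are made up for the example), the centroid of a group of members is just their per-dimension mean:

    import numpy as np

    # Three 2D samples assumed to belong to the same cluster (illustrative values)
    members = np.array([[1.0, 2.0],
                        [2.0, 4.0],
                        [3.0, 6.0]])

    # The centroid is the column-wise (per-dimension) mean of the members
    centroid = members.mean(axis=0)
    print(centroid)  # [ 2.  4.]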

k-means is a naive method: it works by looking for the appropriate centroids, but it doesn't know a priori how many clusters the data contains, so this number has to be provided beforehand.

In order to evaluate how many clusters give a good representation of the provided data, one of the more popular methods is the Elbow method.
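
The chapter does not implement the Elbow method, but, as a hedged sketch, it can be tried with scikit-learn's KMeans helper by computing the inertia for several candidate values of k and looking for the "elbow" where the curve stops dropping sharply (all parameter values here are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets.samples_generator import make_blobs

    # Synthetic data with four well-separated groups (illustrative parameters)
    data, features = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=42)

    # Fit k-means for several values of k and record the inertia (sum of squared distances)
    ks = range(1, 9)
    inertias = []
    for k in ks:
        model = KMeans(n_clusters=k, random_state=42).fit(data)
        inertias.append(model.inertia_)

    # The "elbow" is the value of k after which the inertia stops decreasing sharply
    plt.plot(list(ks), inertias, 'o-')
    plt.xlabel('k')
    plt.ylabel('inertia')
    plt.savefig("elbow.png")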

Algorithm iteration criterion

The criterion and goal of this method is to minimize, over all samples, the sum of squared distances from each cluster member to its cluster's centroid. This is also known as minimization of inertia.

Error minimization criterion for k-means:

\sum_{i=0}^{n} \min_{\mu_j \in C} \left\| x_i - \mu_j \right\|^{2}

k-means algorithm breakdown

The mechanism of the k-means algorithm can be summarized by the following flowchart:

Simplified flowchart of the k-means process

The algorithm can be simplified as follows:

  1. We start with the unclassified samples and take k elements as the starting centroids. For the sake of brevity, there are also simplifications of this algorithm that simply take the first k elements in the element list.
  2. We then calculate the distances between the samples and the chosen starting points, and obtain the first set of calculated centroids (or other representative values). You can see in the illustration how the centroids move toward more common-sense positions.
  3. After the centroids change, their displacement will cause the individual distances to change, and so the cluster memberships may change.
  4. We then recalculate the centroids and repeat the previous steps, as long as the stop condition isn't met.

The stopping conditions could be of various types:

  • Stopping after N iterations: if we choose a very large N, we will have unnecessary rounds of computation; if convergence is slow, we may stop with very unconvincing results while the centroids are not yet stable. This stop condition can also be used as a last resort, in case we have a very long iterative process.
  • Referring to the previous criterion, a possibly better criterion for the convergence of the iterations is to look at the changes of the centroids, be it their total displacement or the total number of cluster element switches. The latter is employed normally, so we will stop the process once there are no more elements changing from their current cluster to another one; a minimal code sketch of the whole procedure follows.
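
The following is a minimal NumPy sketch of the procedure described above; it is not the chapter's TensorFlow implementation (which appears in the full source code later), and the helper name simple_kmeans is ours:

    import numpy as np

    def simple_kmeans(samples, k, max_iters=1000):
        # Step 1: take the first k samples as the starting centroids
        centroids = samples[:k].copy()
        assignments = None
        for _ in range(max_iters):
            # Step 2: squared distance from every sample to every centroid
            distances = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            # Step 3: reassign each sample to its nearest centroid
            new_assignments = distances.argmin(axis=1)
            # Stop condition: no sample changed cluster
            if assignments is not None and np.array_equal(new_assignments, assignments):
                break
            assignments = new_assignments
            # Step 4: recalculate each centroid as the mean of its members
            for j in range(k):
                if np.any(assignments == j):
                    centroids[j] = samples[assignments == j].mean(axis=0)
        return centroids, assignments

Calling simple_kmeans(data, 4) on the blobs dataset generated later in this chapter should return four centroids close to the original generating centers.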

k-means simplified graphic

Pros and cons of k-means

The advantages of this method are:

  • It scales very well (most of the calculations can be run in parallel)
  • It has been used in a very large range of applications

But simplicity also comes at a cost (the no silver bullet rule applies):

  • It requires a priori knowledge (the number of possible clusters should be known beforehand)
  • Outlier values can pull the centroids, as they have the same weight as any other sample
  • As we assume that each group is convex and isotropic, it doesn't work very well with clusters that are not blob-like (for example, elongated or ring-shaped ones)

k-nearest neighbors

k-nearest neighbors (k-nn) is a simple, classical method for clustering that will serve as a good introduction to this class of techniques: it looks at the vicinity of each sample and supposes that each new sample should belong to the class of the already known data points nearest to it.

Mechanics of k-nearest neighbors

k-nn can be implemented in more than one configuration, but in this chapter we will use the semi-supervised approach: we will start from a certain number of already assigned samples, and we will later guess the cluster membership of new samples based on the characteristics of the training set.

Nearest neighbor algorithm

In the previous figure, we can see a breakdown of the algorithm. It can be summarized by the following steps (a minimal code sketch follows the list):

  1. We place the previously known (labeled) samples in our data structures.
  2. We then read the next sample to be classified and calculate the Euclidean distance from the new sample to every sample of the training set.
  3. We decide the class of the new element by selecting the class of the nearest sample by Euclidean distance; the general k-nn method takes a majority vote among the k closest samples.
  4. We repeat the procedure until there are no more remaining samples.
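
Here is a minimal NumPy sketch of these steps for the general, voting version of k-nn (the project at the end of this chapter uses the single nearest neighbor, k=1, in TensorFlow instead); the helper name knn_classify is ours and the inputs are assumed to be NumPy arrays:

    import numpy as np

    def knn_classify(new_sample, train_data, train_labels, k=3):
        # Step 2: Euclidean distance from the new sample to every training sample
        distances = np.sqrt(((train_data - new_sample) ** 2).sum(axis=1))
        # Step 3: take the labels of the k closest samples and let them vote
        nearest = distances.argsort()[:k]
        votes = train_labels[nearest]
        values, counts = np.unique(votes, return_counts=True)
        return values[counts.argmax()]

Step 4 then simply amounts to calling this function for every remaining unclassified sample.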

Pros and cons of k-nn

The advantages of this method are:

  • Simplicity; no need for tuning parameters
  • No formal training; we just need more training examples to improve the model

The disadvantages:

  • It is computationally expensive (all distances between the existing points and every new sample have to be calculated)

Practical examples for useful libraries

We will discuss some useful libraries in the following sections.

matplotlib plotting library

Data plotting is an integral part of the data science discipline. For this reason, we need a very powerful framework to be able to plot our results. For this task, TensorFlow does not implement a general solution, so we will use the matplotlib library.

From the matplotlib site (http://matplotlib.org/), the definition is:

"matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms."

Sample synthetic data plotting

In this example, we will generate two lists of 100 random numbers each (the x and y coordinates of the samples), plot them as points, and save the result in a graphics file:

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt

    with tf.Session() as sess:
        fig, ax = plt.subplots()
        # Plot 100 random (x, y) points drawn from a normal distribution
        ax.plot(tf.random_normal([100]).eval(), tf.random_normal([100]).eval(), 'o')
        ax.set_title('Sample random plot for TensorFlow')
        plt.savefig("result.png")

This is the resulting image:

Sample plot generated with TensorFlow and matplotlib

Tip

In order to see a more general explanation of the matplotlib library, please refer to http://matplotlib.org/.

scikit-learn dataset module

TensorFlow does not currently implement methods for the easy generation of synthetic datasets. For this reason, we'll be using the sklearn library as a helper.

About the scikit-learn library

From its website (http://scikit-learn.org/stable/):

"scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy."

In this example, we will use the dataset module, which deals with the generation and loading of many well-known synthetic, and field extracted, datasets.

Tip

In order to see a more general explanation of the scikit dataset module, please refer to http://scikit-learn.org/stable/datasets/.

Synthetic dataset types

Some of the generated dataset types we'll be using are:

Blob, circle, and moon dataset types

Blobs dataset

This dataset is ideal for testing simple clustering algorithms. It doesn't present problems because the data is grouped coherently and the separation of classes is clear.

Employed method

The following method is employed to generate this dataset:

sklearn.datasets.make_blobs(n_samples=100, n_features=2,  centers=3, cluster_std=1.0, center_box=(-10.0, 10.0),  shuffle=True, random_state=None) 

Here, n_samples is the total number of samples, n_features is the number of columns or features our data has, centers is a list of centers or a number of random centers, cluster_std is the standard deviation of the clusters, center_box is the bounding box for each cluster center when centers are generated at random, shuffle indicates whether we have to shuffle the samples, and random_state is the random seed.
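
As a brief hedged usage example (the parameters are illustrative; the exact call used in this chapter's project appears later), the function returns the sample coordinates together with the group label of each sample:

    from sklearn.datasets.samples_generator import make_blobs

    # 100 two-dimensional samples grouped around 3 random centers (illustrative parameters)
    data, labels = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, random_state=42)
    print(data.shape)    # (100, 2) -> the coordinates of the samples
    print(labels.shape)  # (100,)   -> the cluster index of each sample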

Circle dataset

This is a dataset that has circles within other circles. It is a nonlinearly separable problem, so it needs to be solved by a nonlinear model. This rules out simple algorithms such as k-means. In this chapter we'll try to use it anyway, to make a point.

Employed method

The following method is employed to generate this dataset:

sklearn.datasets.make_circles(n_samples=100,shuffle=True,noise=None, random_state=None,factor=0.8) 

Here, n_samples is the total number of samples, shuffle indicates whether we have to shuffle the samples, noise is the standard deviation of the Gaussian noise added to the circular data, random_state is the random seed, and factor is the scale factor between the circles.

Moon dataset

This is another nonlinear separation problem, but with a different type of class separation, because the classes are not enclosed within one another as they are in the circles' rings.
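
We will not use this dataset in the projects, but, for reference, here is a hedged sketch of how it can be generated with the same scikit-learn dataset module (the parameter values are illustrative):

    from sklearn.datasets.samples_generator import make_moons

    # Two interleaving half-circles with a little Gaussian noise (illustrative parameters)
    data, features = make_moons(n_samples=100, shuffle=True, noise=0.05, random_state=42)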

Project 1 - k-means clustering on synthetic datasets

Dataset description and loading

During this chapter, we'll be using generated datasets that are specially crafted to have special properties. Two of the target properties are the possibility of linear separation of classes and the existence of clearly separated clusters (or not).

Generating the dataset

With the following lines, we create the data structures that will contain all the elements we need to work on the solution, that is:

centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]
data, features = make_blobs(n_samples=200, centers=centers, n_features=2, cluster_std=0.8, shuffle=False, random_state=42)

Graphing the dataset via matplotlib:

    fig, ax = plt.subplots()
    # Plot the generating centers as large circles
    ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
    plt.plot()

Model architecture

The points variable contains the 2D coordinates of the dataset points, the centroids variable will contain the coordinates of the center points of the groups, and the cluster_assignments variable contains the centroid index for each data element.

For example, cluster_assignments[2] = 1 indicates that the data[2] data point belongs to the cluster whose center is centroid 1, and the location of centroid 1 can be found in centroids[1].

points=tf.Variable(data) 
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64)) 
centroids = tf.Variable(tf.slice(points.initialized_value(), [0,0], [K,2])) 

Then we can draw the position of these centroids using matplotlib:

fig, ax = plt.subplots() 
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250) 
plt.show() 

Initial center seeding

Loss function description and optimizer loop

Then we will make N copies of the whole set of centroids and K copies of each point so that, for each of the N x K point-centroid pairs, we can calculate the squared difference along each dimension and sum it:

rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids), reduction_indices=2)

Then we get, for every point, the index of the lowest sum of squares (this will be the index of the centroid or cluster assigned to that point):

best_centroids = tf.argmin(sum_squares, 1) 

Centroids will also be updated with a bucket_mean function, defined in the full source code.
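
For reference, here is that definition, reproduced from the full source code; it averages the points assigned to each cluster using TensorFlow's segmented sums:

    def bucket_mean(data, bucket_ids, num_buckets):
        # Sum of the coordinates of the points assigned to each bucket (cluster)
        total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
        # Number of points assigned to each bucket
        count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
        # The new centroid is the element-wise mean
        return total / count

    means = bucket_mean(points, best_centroids, K)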

Stop condition

The stop condition is that the cluster assignments no longer change:

did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments)) 

Here, we use control_dependencies to ensure that the change check is computed before we apply the centroid and assignment updates:

with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
        centroids.assign(means),
        cluster_assignments.assign(best_centroids))

Results description

We get the following output when the program is executed:

This is a summarizing graphic of the centroid changes after the rounds of iterations, with the original clusters drawn as they were generated.

In the following image, we represent the different stages in the application of the k-means algorithm for this clearly separated case:

Centroid changes per iteration

Full source code

Following is the complete source code:

import tensorflow as tf
import numpy as np
import time

import matplotlib
import matplotlib.pyplot as plt

from sklearn.datasets.samples_generator import make_blobs
from sklearn.datasets.samples_generator import make_circles

DATA_TYPE = 'blobs'
# Number of samples
N = 200
# Number of clusters; if we choose circles, only 2 will be enough
if (DATA_TYPE == 'circle'):
    K = 2
else:
    K = 4

# Maximum number of iterations, if the stop condition is not met
MAX_ITERS = 1000

start = time.time()

# Generate the synthetic dataset
centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]
if (DATA_TYPE == 'circle'):
    data, features = make_circles(n_samples=N, shuffle=True, noise=0.01, factor=0.4)
else:
    data, features = make_blobs(n_samples=N, centers=centers, n_features=2, cluster_std=0.8, shuffle=False, random_state=42)

# Draw the original generating centers
fig, ax = plt.subplots()
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker='o', s=250)
plt.show()

# Draw the centers together with the generated samples, colored by class
fig, ax = plt.subplots()
if (DATA_TYPE == 'blobs'):
    ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker='o', s=250)
    ax.scatter(data.transpose()[0], data.transpose()[1], marker='o', s=100, c=features, cmap=plt.cm.coolwarm)
plt.plot()

# Model variables: the sample points, their cluster assignments, and the centroids
# (initially seeded with the first K points)
points = tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, 2]))

sess = tf.Session()
sess.run(tf.initialize_all_variables())

# Replicate centroids and points so that all N x K squared distances can be computed at once
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids), reduction_indices=2)
best_centroids = tf.argmin(sum_squares, 1)

did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))

def bucket_mean(data, bucket_ids, num_buckets):
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count

means = bucket_mean(points, best_centroids, K)

# Check for changes before applying the centroid and assignment updates
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
        centroids.assign(means),
        cluster_assignments.assign(best_centroids))

changed = True
iters = 0

fig, ax = plt.subplots()
if (DATA_TYPE == 'blobs'):
    colourindexes = [2, 1, 4, 3]
else:
    colourindexes = [2, 1]

# Main loop: iterate until the assignments stop changing or MAX_ITERS is reached
while changed and iters < MAX_ITERS:
    fig, ax = plt.subplots()
    iters += 1
    [changed, _] = sess.run([did_assignments_change, do_updates])
    [centers, assignments] = sess.run([centroids, cluster_assignments])
    ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker='o', s=200, c=assignments, cmap=plt.cm.coolwarm)
    ax.scatter(centers[:, 0], centers[:, 1], marker='^', s=550, c=colourindexes, cmap=plt.cm.plasma)
    ax.set_title('Iteration ' + str(iters))
    plt.savefig("kmeans" + str(iters) + ".png")

ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker='o', s=200, c=assignments, cmap=plt.cm.coolwarm)
plt.show()

end = time.time()
print "Found in %.2f seconds" % (end - start), iters, "iterations"
print "Centroids:"
print centers
print "Cluster assignments:", assignments

This is the simplest case for observing the algorithm mechanics. When the data comes from the real world, the classes are normally not so clearly separated and it is more difficult to label the data samples.

k-means on circle synthetic data

For the circular plot, we observe that this data distribution is not easy to represent with a simple set of means. As the image clearly shows, the two circles either share a centroid position or are really close to each other, so we cannot predict a clear outcome:

Circle type dataset

For this dataset, we are only using two classes to be sure the main drawbacks of this algorithm are understood:

k-means applied to a circular synthetic dataset

As we can see, the initial centers drifted toward the areas with the highest concentration of samples, and so divided the data linearly. This is one of the limitations of the simple models we are employing at this stage. To cope with nonlinearly separable samples, we can try other statistical approaches outside the scope of this chapter, such as density-based spatial clustering of applications with noise (DBSCAN).
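
DBSCAN itself is outside the scope of this chapter, but, as a hedged pointer, a minimal sketch using scikit-learn's implementation on a circles dataset could look like this (eps and min_samples are illustrative values that may need tuning):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets.samples_generator import make_circles

    # The same kind of circular dataset used above (illustrative parameters)
    data, features = make_circles(n_samples=200, shuffle=True, noise=0.01, factor=0.4)

    # DBSCAN groups points by density, so it can separate the two rings without a fixed k
    assignments = DBSCAN(eps=0.2, min_samples=5).fit_predict(data)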

Project 2 - nearest neighbor on synthetic datasets

In this project, we will be loading a dataset with which the previous algorithm (k-means) has problems separating classes.

Dataset generation

The dataset is the same circular classes dataset from the first example with two classes, but this time we will increase the error probability by adding a bit more noise (from 0.01 to 0.12):

data, features = make_circles(n_samples=N, shuffle=True, noise=0.12,factor=0.4)

This is the resulting training data plot:

Model architecture

The variables that will hold the data are simply the original data and a test list, which will hold the calculated classes for the test data:

data, features = make_circles(n_samples=N, shuffle=True, noise=0.12, factor=0.4)
tr_data, tr_features = data[:cut], features[:cut]
te_data, te_features = data[cut:], features[cut:]
test = []

Loss function description

In clustering, the function we will use to optimize is the Euclidean distance, the same as in Chapter 1, Exploring and Transforming Data. It is calculated in the cluster assignment loop, getting the distance from the new point to the existing training points, asking for the index of the minimum, and then using that index to look up the class of the nearest neighbor:

distances = tf.reduce_sum(tf.square(tf.sub(i , tr_data)),reduction_indices=1)
neighbor = tf.arg_min(distances,0)

Stop condition

In this simple example, we will finish once all the elements of the test partition have been visited.

Results description

Here is a graphic of the class assignments for the test data, where we can see the clearly separated classes. We can observe that, at least within this limited dataset scope, this method works better than the non-overlapping, blob-optimizing k-means method.

Full source code

Following is the complete source code:

import tensorflow as tf
import numpy as np
import time

import matplotlib
import matplotlib.pyplot as plt

from sklearn.datasets.samples_generator import make_circles

N = 210
K = 2
# Maximum number of iterations, if the conditions are not met
MAX_ITERS = 1000
# Use 70% of the samples for training and the rest for testing
cut = int(N * 0.7)

start = time.time()

# Generate the noisy circles dataset and split it into train and test partitions
data, features = make_circles(n_samples=N, shuffle=True, noise=0.12, factor=0.4)
tr_data, tr_features = data[:cut], features[:cut]
te_data, te_features = data[cut:], features[cut:]
test = []

# Plot the training samples, colored by class
fig, ax = plt.subplots()
ax.scatter(tr_data.transpose()[0], tr_data.transpose()[1], marker='o', s=100, c=tr_features, cmap=plt.cm.coolwarm)
plt.plot()

sess = tf.Session()
sess.run(tf.initialize_all_variables())

# For every test sample, find the nearest training sample and adopt its class
for i, j in zip(te_data, te_features):
    distances = tf.reduce_sum(tf.square(tf.sub(i, tr_data)), reduction_indices=1)
    neighbor = tf.arg_min(distances, 0)

    test.append(tr_features[sess.run(neighbor)])
print test

# Plot the test samples, colored by the assigned class
fig, ax = plt.subplots()
ax.scatter(te_data.transpose()[0], te_data.transpose()[1], marker='o', s=100, c=test, cmap=plt.cm.coolwarm)
plt.plot()

end = time.time()
print "Found in %.2f seconds" % (end - start)

print "Cluster assignments:", test

Summary

In this chapter, we've seen a simple overview of some of the most basic models we can implement, while trying to be as detailed as possible in the explanations.

From now on, we will be able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations, and so evaluate their advantages and shortcomings, without having to resort to datasets with a greater number of unknown characteristics.

Additionally, we have implemented the first iterative methods and tested convergence, a task that will continue in the following chapters in a similar way, but with finer and much more exact methods.

In the next chapter, we will solve classification problems using linear functions, and for the first time, use previous data from training sets to learn from its characteristics. This is the objective of supervised learning techniques and is more useful in general for a lot of real-life problem-solving.