In this chapter, we will start applying the data transforming operations that we learned in the previous chapter, and will begin finding interesting patterns in some given information, discovering groups of data, or clusters, using clustering techniques.
In this process we will also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to graphically plot our data and model results, this time via the matplotlib library.
The topics we will cover in this chapter are as follows:
In this chapter we will be reviewing two cases of unsupervised learning.
Unsupervised learning basically consists of finding patterns on a previous dataset. Normally, little or no information is given for this technique and the procedure should be able to automatically determine how the information is organized, and recognize the different structures in the data organization.
One of the simplest operations that can be initially to unlabeled dataset is to try to understand the possible groups of the dataset members' common features.
To do so, the dataset can be split into an arbitrary number of segments, where each can be represented as a central mass (centroid) point that represents the points belonging to a determined group or cluster.
In order to define the criteria that assigns the same group to different group members, we need to define a concept that represents the distance between data elements, so we can simply say that all class members are closer to their own centroids than to any other centroid.
In the following graphic, we can see the results of a typical clustering algorithm and the representation of the cluster centers:
Sample clustering algorithm output
k-means is a very popular clustering algorithm and it can be easily implemented. It is very straight forward, and applying it as a first procedure to datasets with good class separation can provide a good a priori understanding of the data.
k-means tries to divide a set of samples in k disjoint groups or clusters using the mean value of the members as the main indicator. This point is normally called a Centroid, referring to the arithmetical entity with the same name, and is represented as a vector in a space of arbitrary dimensions.
k-means is a naive method because it works by looking for the appropriate centroids but doesn't know a priori what the number of clusters is.
In order to get an evaluation of how many clusters give a good representation of the provided data, one of the more popular methods is the Elbow method.
The criterion and goal of this method is to minimize the sum of squared distances from the cluster's member to the actual centroid of all cluster-contained samples. This is also known as Minimization of Inertia.
Error Minimization criteria for k-means
Latex: \sum_{i=0}^{n}\lim_{\mu_j\epsilon C}(\left \| x_{j} - \mu_i \right \|^{2}).
The mechanism of the k-means algorithm can be summarized by the following flowchart:
Simplified flowchart of the k-means process
The algorithm can be simplified as follows:
The stopping conditions could be of various types:
k-means simplified graphic
The advantages of this method are:
But simplicity also comes with a cost (no silver bullet rule applies):
k-nearest neighbors (k-nn) is a simple, classical method for clustering that will serve as a good introduction to this class of techniques, looking at the vicinity of each sample, and supposing that each new sample should pertain to the class of the already known data points.
k-nn can be implemented in more than one of our configurations, but in this chapter we will use the Semi Supervised approach. We will start from a certain number of already assigned samples, and we will later guess the cluster membership based on the characteristics of the train set.
Nearest neighbor algorithm
In the previous figure, we can see a breakdown of the algorithm. It can be summarized by the following steps:
We will discuss some useful libraries in the following sections.
Data plotting is an integral part of data science discipline. For this reason, we need a very powerful framework to be able to plot our results. For this task, we do not have a general solution implemented in TensorFlow, we will use the matplotlib library.
From the matplotlib site (http://matplotlib.org/
), the definition is:
"matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms."
In this example, we will generate a list of 100 random numbers, generate a plot of the samples, and save the results in a graphics file:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
with tf.Session() as sess:
fig, ax = plt.subplots()
ax.plot(tf.random_normal([100]).eval(), tf.random_normal([100] ).eval(),'o')
ax.set_title('Sample random plot for TensorFlow')
plt.savefig("result.png")
This is the resulting image:
Sample plot generated with TensorFlow and matplotlib
In order to see a more general explanation of the scikit dataset module, please refer to: http://matplotlib.org/.
TensorFlow is not currently implementing methods for the easy generation of synthetic datasets. For this reason, we'll be using the sklearn
library as a helper.
From its website (http://scikit-learn.org/stable/):
"scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy."
In this example, we will use the dataset module, which deals with the generation and loading of many well-known synthetic, and field extracted, datasets.
In order to see a more general explanation of the scikit dataset module, please refer to http://scikit-learn.org/stable/datasets/.
Some of the generated dataset types we'll be using are:
Blob, circle, and moon dataset types
This dataset is ideal for testing simple clustering algorithms. It doesn't present problems because the data is grouped coherently and the separation of classes is clear.
The following method is used for employed method:
sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)
Here, n_samples
is the total data numbers, n_features
is the quantity of columns or features our data has, centers
is a list of centers or a number of random centers, cluster_std
is the standard deviation, center_box
is the bounding box for each cluster center when centers are generated at random, shuffle
indicates if we have to shuffle the samples, and random_state
is the random seed.
This is a dataset that has circles within other circles. It is a nonlinear, separable problem, so it needs to be solved by a nonlinear model. This rules out the simple algorithms such as k-means. In this chapter we'll try to use it anyway, to make a point.
The following method is used for employed method:
sklearn.datasets.make_circles(n_samples=100,shuffle=True,noise=None, random_state=None,factor=0.8)
Here, n_samples
is the total data numbers, shuffle
indicates whether we have to shuffle the samples, noise
is the number of random amounts to be applied to the circular data, random_state
is the random seed, and factor
is the scale factor between circles.
During this chapter, we'll be using generated datasets that are specially crafted to have special properties. Two of the target properties are the possibility of linear separation of classes and the existence of clearly separated clusters (or not).
With these lines, we create the data structures that will contain all the elements for working on the solutions, that is:
centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]
data, features = make_blobs (n_samples=200, centers=centers, n_features = 2, cluster_std=0.8, shuffle=False, random_state=42)
Graphing the dataset via matplotlib:
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
plt.plot()
The points variable contains the 2D coordinates of the dataset points, the centroids variable will contain the coordinates of the center points of the groups, and the cluster_assignments
variable contains the centroid index for each data element.
For example, cluster_assignments[2] = 1
indicates that the data[2]
data point pertains to the cluster with the center, centroid 1
. The location of centroid 1
is located in centroids[1]
.
points=tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
centroids = tf.Variable(tf.slice(points.initialized_value(), [0,0], [K,2]))
Then we can draw the position of these centroids using matplotlib:
fig, ax = plt.subplots()
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
plt.show()
Initial center seeding
Then we will do N copies of all centroids, K copies of each point, and N x K copies of every point so we can calculate the distances between each point and every centroid, for each dimension:
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
reduction_indices=2)
Then we perform the sum for all dimensions and get the index of the lowest sum (this will be the index of the centroid or cluster assigned for each point):
best_centroids = tf.argmin(sum_squares, 1)
Centroids will also be updated with a bucket:mean
function, defined in the full source code.
This is the stop condition that the new centroids and assignments don't change:
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))
Here, we use control_dependencies
to calculate whether we need to update the centroids:
with tf.control_dependencies([did_assignments_change]):
do_updates = tf.group(
centroids.assign(means),
cluster_assignments.assign(best_centroids))
We get the following output when the program is executed:
This is a summarizing graphic of the centroid changes after a round of iterations with the original clusters drawn as they were generated from the algorithm.
In the following image, we represent the different stages in the application of the k-means algorithm for this clearly separated case:
Centroid changes per iteration
Following is the complete source code:
import tensorflow as tf
import numpy as np
import time
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets.samples_generator import make_blobs
from sklearn.datasets.samples_generator import make_circles
DATA_TYPE = 'blobs'
# Number of clusters, if we choose circles, only 2 will be enough
if (DATA_TYPE == 'circle'):
K=2
else:
K=4
# Maximum number of iterations, if the conditions are not met
MAX_ITERS = 1000
start = time.time()
centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]
if (DATA_TYPE == 'circle'):
data, features = make_circles(n_samples=200, shuffle=True, noise= 0.01, factor=0.4)
else:
data, features = make_blobs (n_samples=200, centers=centers, n_features = 2, cluster_std=0.8, shuffle=False, random_state=42)
fig, ax = plt.subplots()
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
plt.show()
fig, ax = plt.subplots()
if (DATA_TYPE == 'blobs'):
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
ax.scatter(data.transpose()[0], data.transpose()[1], marker = 'o', s = 100, c = features, cmap=plt.cm.coolwarm )
plt.plot()
points=tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
centroids = tf.Variable(tf.slice(points.initialized_value(), [0,0], [K,2]))
sess = tf.Session()
sess.run(tf.initialize_all_variables())
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
reduction_indices=2)
best_centroids = tf.argmin(sum_squares, 1)
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))
def bucket_mean(data, bucket_ids, num_buckets):
total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
return total / count
means = bucket_mean(points, best_centroids, K)
with tf.control_dependencies([did_assignments_change]):
do_updates = tf.group(
centroids.assign(means),
cluster_assignments.assign(best_centroids))
changed = True
iters = 0
fig, ax = plt.subplots()
if (DATA_TYPE == 'blobs'):
colourindexes=[2,1,4,3]
else:
colourindexes=[2,1]
while changed and iters < MAX_ITERS:
fig, ax = plt.subplots()
iters += 1
[changed, _] = sess.run([did_assignments_change, do_updates])
[centers, assignments] = sess.run([centroids, cluster_assignments])
ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker = 'o', s = 200, c = assignments, cmap=plt.cm.coolwarm )
ax.scatter(centers[:,0],centers[:,1], marker = '^', s = 550, c = colourindexes, cmap=plt.cm.plasma)
ax.set_title('Iteration ' + str(iters))
plt.savefig("kmeans" + str(iters) +".png")
ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker = 'o', s = 200, c = assignments, cmap=plt.cm.coolwarm )
plt.show()
end = time.time()
print ("Found in %.2f seconds" % (end-start)), iters, "iterations"
print "Centroids:"
print centers
print "Cluster assignments:", assignments
This is the simplest case for observing the algorithm mechanics. When the data comes from the real world, the classes are normally not so clearly separated and it is more difficult to label the data samples.
For the circular plot, we observe that this data characterization is not simple to represent by a simple set of means. As the image clearly shows, the two circles either share a Centroid position, or are really close and so we cannot predict a clear outcome:
Circle type dataset
For this dataset, we are only using two classes to be sure the main drawbacks of this algorithm are understood:
k-means applied to a circular synthetic dataset
As we can see, the initial centers drifted toward the areas that had the most concentrated sample numbers, and so divided the data linearly. This is one of the limitations of the simple models we are employing at this stage. To cope with nonlinear separability samples, we can try other statistical approaches outside the scope of this chapter, such as density-based spatial clustering of applications with noise (DBSCAN).
In this project, we will be loading a dataset with which the previous algorithm (k-means) has problems separating classes.
The dataset is the same circular classes dataset from the first example with two classes, but this time we will increase the error probability by adding a bit more noise (from 0.01
to 0.12
):
data, features = make_circles(n_samples=N, shuffle=True, noise=0.12,factor=0.4)
This is the resulting training data plot:
The variables that will sustain data are simply the original data and a test list, which will hold the calculated classes for the test data:
data, features = make_circles(n_samples=N, shuffle=True, noise= 0.12, factor=0.4)
tr_data, tr_features= data[:cut], features[:cut]
te_data,te_features=data[cut:], features[cut:]
test=[]
In clustering, we will use the function to optimize as the Euclidean distance, the same as Chapter 1, Exploring and Transforming Data. It is calculated on the cluster assignment loop, getting the distance from the new point to the existing training points, asking for the index of the minimum, and then using that index to search the class of the nearest neighbor:
distances = tf.reduce_sum(tf.square(tf.sub(i , tr_data)),reduction_indices=1)
neighbor = tf.arg_min(distances,0)
Following is the complete source code:
import tensorflow as tf
import numpy as np
import time
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets.samples_generator import make_circles
N=210
K=2
# Maximum number of iterations, if the conditions are not met
MAX_ITERS = 1000
cut=int(N*0.7)
start = time.time()
data, features = make_circles(n_samples=N, shuffle=True, noise= 0.12, factor=0.4)
tr_data, tr_features= data[:cut], features[:cut]
te_data,te_features=data[cut:], features[cut:]
test=[]
fig, ax = plt.subplots()
ax.scatter(tr_data.transpose()[0], tr_data.transpose()[1], marker = 'o', s = 100, c = tr_features, cmap=plt.cm.coolwarm )
plt.plot()
sess = tf.Session()
sess.run(tf.initialize_all_variables())
for i, j in zip(te_data, te_features):
distances = tf.reduce_sum(tf.square(tf.sub(i , tr_data)),reduction_indices=1)
neighbor = tf.arg_min(distances,0)
test.append(tr_features[sess.run(neighbor)])
print test
fig, ax = plt.subplots()
ax.scatter(te_data.transpose()[0], te_data.transpose()[1], marker = 'o', s = 100, c = test, cmap=plt.cm.coolwarm )
plt.plot()
end = time.time()
print ("Found in %.2f seconds" % (end-start))
print "Cluster assignments:", test
In this chapter, we've seen a simple overview of some of the most basic models we can implement, but tried to be as detailed in the explanation as possible.
From now on, we will be able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate the advantages and shortcomings of them, without having to load models with a greater number of unknown characteristics.
Additionally, we have implemented the first iterative methods and tested convergence, a task that will continue in the following chapters in a similar way, but with finer and much more exact methods.
In the next chapter, we will solve classification problems using linear functions, and for the first time, use previous data from training sets to learn from its characteristics. This is the objective of supervised learning techniques and is more useful in general for a lot of real-life problem-solving.