Chapter 4. Logistic Regression

In the previous chapter, we saw an approach that models a part of reality as a linear function of independent variables plus a bias term, with its parameters fitted by minimizing an error function.

This kind of analysis, however, is only sufficient for a limited set of clearly defined problems, namely those where the expected result is a continuous variable.

But what happens if we are faced with data that has qualitative dependent variables? For example, the presence or absence of a particular feature: does a subject have blond hair? Has the patient had a previous illness?

These are the kinds of problem that we will be working on in this chapter.

Problem description

The kind of problem we aim to solve here is not the prediction of a value based on a continuous function; this time, we want to know the probability that a sample belongs to a particular class.

In this chapter, we will rely on a generalization of the linear model: we still solve a regression, but with the final objective of solving classification problems, where we have to assign labels to, or group, all the elements of an observation set into predefined classes.

Problem description

In the preceding figure, we can see how the old and the new problems compare. The first one (linear regression) can be imagined as a continuum of steadily growing values.

The other is a domain where the output can take just two different values, depending on the x value. In the particular case of the second graphic, we can see a bias toward one of the options at each extreme: on the left there is a bias toward the y value 0, and on the right the bias is toward a value of 1.

This terminology can be a bit tricky: even though we are doing a regression, and are thus looking for a continuous value, in reality the final objective is to build a prediction for a classification problem, with discrete variables.

The key here is to understand that we will obtain the probability of an item belonging to a class, and not a totally discrete value.

Logistic function predecessor - the logit functions

Before we study the logistic function, we will review the original function on which it is based, and which gives it some of its more general properties.

Essentially, when we talk about the logit function, we are working with a function of the probability p of a random variable, more specifically, one corresponding to a Bernoulli distribution.

Bernoulli distribution

Before explaining the theoretical details, it is worth noting that a Bernoulli-distributed random variable is one that:

  • Takes a value of 0 with a failure probability of q = 1 - p
  • Takes a value of 1 with a success probability of p

It can be expressed as follows (for a random variable X with Bernoulli distribution):

[latex]P(X=1) = p, \qquad P(X=0) = 1 - p[/latex]

This is the kind of probability distribution that will represent the probability of occurrence of the events as binary options, just as we want to represent our variables (existence of features, event occurrence, causality of phenomena, and so on).
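
As a quick numerical illustration (a minimal sketch using numpy; the success probability p = 0.7 is just an assumed value), we can draw samples from a Bernoulli distribution and verify that the frequency of ones approaches p:

import numpy as np

p = 0.7                                              # assumed success probability
samples = np.random.binomial(n=1, p=p, size=10000)   # Bernoulli = binomial with n=1
print(samples.mean())                                # empirical frequency of 1s, close to 0.7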

Link function

As we are trying to build a generalized linear model, we want to start from a linear function, and from the dependent variable, obtain a mapping to a probability distribution.

Since the options are binary in nature, the distribution normally chosen is the recently mentioned Bernoulli distribution, and the link function, which leads us toward the logistic function, is the logit function.

Logit function

One of the possible functions that we could use is the natural logarithm of the odds that the variable equals one. This function is called the logit function:

[latex]\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)[/latex]

We can also call the logit function a log-odds function, because we are calculating the log of the odds (p/(1-p)) for a given probability p:

Logit function

So, as we can visually infer, if we replace X with the combination of independent variables, no matter what values they take, from minus infinity to infinity, the response can be scaled to lie between 0 and 1.

The importance of the logit inverse

Suppose that we calculate the inverse of the logit function. This will let us write the following function:

[latex]\operatorname{logit}^{-1}(\alpha) = \frac{1}{1+e^{-\alpha}} = \frac{e^{\alpha}}{1+e^{\alpha}}[/latex]

This function is a sigmoid function.
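
To make the relationship concrete, here is a small numpy sketch (the helper names logit and logit_inverse are our own) showing that the logit maps probabilities in (0, 1) onto the whole real line, while its inverse, the sigmoid, maps any real value back into (0, 1):

import numpy as np

def logit(p):
    # log of the odds p / (1 - p)
    return np.log(p / (1.0 - p))

def logit_inverse(x):
    # sigmoid: maps any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
x = logit(p)                     # values spread over the real line
print(x)
print(logit_inverse(x))          # recovers [0.1, 0.5, 0.9]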

The logistic function

The logistic function will allow us to represent the binary options in our new regression tasks.

In the following figure, you will find the graphical representation of the sigmoid function:

The logistic function

Graphical representation of the logistic function or Sigmoid

Logistic function as a linear modeling generalization

The logistic function σ(t) is defined as follows:

[latex]\sigma(t) = \frac{1}{1+e^{-t}}[/latex]

The normal interpretation of this equation is that t represents a simple independent variable. But we will improve this model, and will assume that t is a linear function of a single explanatory variable x (the case where t is a linear combination of multiple explanatory variables is treated similarly).

We will then express t as the following:

[latex]t = \beta_0 + \beta_1 x[/latex]

Final estimated regression equation

So we start from the following equation:

[latex]\sigma(t) = \frac{1}{1+e^{-t}}[/latex]

With all these elements, we can calculate the regression equation, which will give us the regressed probability:

[latex]\hat p = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}[/latex]

The following graphic shows how the mapping from an arbitrary range is finally transformed to the range [0, 1], which can be interpreted as the probability p of the occurrence of the event being represented:

Final estimated regression equation

What effect will changing the parameters of the linear function have? They are the values that change the central slope and the displacement from zero of the sigmoid function, allowing it to more exactly reduce the error between the regressed values and the real data points.
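
As an illustration of this effect, the following sketch (the parameter values are arbitrary) plots the sigmoid applied to b0 + b1 * t for two different parameter pairs; b1 changes the central slope and b0 the displacement from zero:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-10, 10, 200)

# Two arbitrary parameter settings: b1 scales the input, b0 shifts it
for b0, b1 in [(0.0, 1.0), (2.0, 3.0)]:
    plt.plot(t, sigmoid(b0 + b1 * t), label="b0=%s, b1=%s" % (b0, b1))

plt.legend()
plt.grid(True)
plt.show()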

Properties of the logistic function

Every curve in this function family can be described by the possible objectives it can be applied to. In the case of the logistic function, these are:

  • Model the probability of an event p depending on one or more independent variables. For example, the probability of being awarded a prize, given previous qualifications.
  • Estimate (this is the regression part) the probability p of the event occurring for a given observation, versus the possibility of it not occurring.
  • Predict the effect of the change of independent variables on a binary response.
  • Classify observations by calculating the probability of an item being of a determined class.

Loss function

In the previous section we saw our approximated [latex]\hat p[/latex] function, which models the probability of a sample belonging to a particular class. In order to measure how well we are approximating the solution, we will look for a carefully chosen loss function.

This loss function is expressed as:

[latex]J(\hat p, y) = -y \log(\hat p) - (1-y)\log(1-\hat p)[/latex]

The main property of this loss function is that it does not penalize errors in a linear manner: the penalization factor grows asymptotically as the error increases far beyond 0.5.
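
The following small numpy sketch (the helper name log_loss and the predicted probabilities are our own choices) shows how the penalization grows very quickly when a positive sample (y = 1) is assigned a predicted probability close to zero:

import numpy as np

def log_loss(p_hat, y):
    # cross entropy loss for a single sample
    return -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)

y = 1  # the sample really belongs to the positive class
for p_hat in [0.9, 0.5, 0.1, 0.01]:
    # the loss grows sharply as p_hat approaches 0
    print("%.2f -> %.4f" % (p_hat, log_loss(p_hat, y)))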

Multiclass application - softmax regression

Up until now, we have been classifying for the case of only two classes, or in probabilistic language, event occurrence probabilities, p.

In the case of having more than two classes to decide from, there are two main approaches: one versus one, and one versus all.

  • The first technique (one versus one) consists of calculating many models, one for every pair of classes.
  • The second (one versus all) consists of one set of probabilities, in which we represent the probability of each class against all the others.
  • This second approach is the output format of the softmax regression, a generalization of the logistic regression for n classes.

So, for the training samples, we will be changing from binary labels (y(i) ∈ {0,1}) to vector labels, y(i) ∈ {1,...,K}, where K is the number of classes, and the label y can take on K different values rather than only two.

So, for this particular technique, given a test input x, we want to estimate the probability P(y=k|x) for each value of k=1,...,K. The softmax regression will output a K-dimensional vector (whose elements sum to 1), giving us our K estimated probabilities.
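
As a reference, here is a minimal numpy sketch of the softmax mapping itself (the scores are made-up values for K = 3 classes), turning arbitrary scores into K probabilities that sum to 1:

import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; the result is unchanged
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])   # arbitrary scores for K = 3 classes
probs = softmax(scores)
print(probs)          # three probabilities
print(probs.sum())    # 1.0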

In the following diagram, we represent the probability mappings of the uniclass and multiclass logistic regression:

Multiclass application - softmax regression

Cost function

The cost function of the softmax regression is an adapted cross entropy function, which is not linear, and thus penalizes large differences much more than very small ones.

[latex]J = -\sum_{i} \sum_{c} y_{i,c}\,\log(\hat p_{i,c})[/latex]

Here, c is the class number, i is the individual training sample index, and y_{i,c} is 1 for the expected class of sample i and 0 for the rest.

Expanding this equation, we get the following:

Cost function
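
A minimal numpy sketch of this cost (the predicted probabilities and one hot labels below are made-up values for a batch of two samples and three classes) could look like the following:

import numpy as np

# Made-up predicted probabilities for 2 samples and 3 classes (rows sum to 1)
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
# One hot encoded expected classes for the same 2 samples
y = np.array([[1, 0, 0],
              [0, 1, 0]])

# Cross entropy: for each sample, -sum_c y_c * log(p_c), then averaged over the batch
cost = np.mean(-np.sum(y * np.log(p), axis=1))
print(cost)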

Data normalization for iterative methods

As we will see in the following sections, for logistic regression we will be using the gradient descent method for minimizing a cost function.

Data normalization for iterative methods

This method is very sensitive to the form and the distribution of the feature data.

For this reason, we will be doing some preprocessing in order to get better and faster converging results.

We will leave the theoretical reasons for this to other books, but we will summarize them by saying that with normalization we are smoothing the error surface, allowing the iterative gradient descent to reach the minimum error faster.
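
A minimal sketch of the normalization we will apply later in this chapter, subtracting the mean and dividing by the standard deviation of the feature (the sample ages below are invented), looks like this:

import numpy as np

ages = np.array([20.0, 34.0, 44.0, 55.0, 69.0])   # invented sample of ages
normalized = (ages - ages.mean()) / ages.std()    # zero mean, unit variance
print(normalized)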

One hot representation of outputs

In order to use the softmax function as the regression function, we must use a type of encoding known as one hot. This form of encoding simply transforms the numerical integer value of a variable into an array: a list of values is transformed into a list of arrays, each with a length equal to the maximum value of the list plus one, and each value is represented by placing a one at the index given by the value, leaving the other positions at zero.

For example, this will be the representation of the list [1, 3, 2, 4] in one hot encoding:

[[0 1 0 0 0] 
[0 0 0 1 0] 
[0 0 1 0 0]
[0 0 0 0 1]]
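
A small numpy sketch of this transformation (the helper name one_hot is our own) reproduces the preceding example:

import numpy as np

def one_hot(values):
    # each row has max(values) + 1 columns, with a 1 at the index given by the value
    encoded = np.zeros((len(values), max(values) + 1), dtype=int)
    encoded[np.arange(len(values)), values] = 1
    return encoded

print(one_hot([1, 3, 2, 4]))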

Example 1 - univariate logistic regression

In this first example, we will work on approximating the probability of the presence of heart disease using a univariate logistic regression, the single variable being the patient's age.

Useful libraries and methods

Since version 0.8, TensorFlow has provided a means of generating one hot encodings. The function used for this is tf.one_hot, which has this form:

tf.one_hot(indices, depth, on_value=1, off_value=0, axis=None, dtype=tf.float32, name=None)

This function generates a generalized one hot encoding data structure, for which we can specify the values, the axis of generation, the data type, and so on.

In the resulting tensor, the positions indicated by the indices will take the on_value (default 1), and the others will have the off_value (default 0).

dtype is the data type of the generated tensor; the default is float32.

The depth variable defines how many columns each element will have. Logically, it should be max(indices) + 1, but it can also be smaller, cutting off the higher values.
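
As a hedged usage sketch (mirroring the way the function is called later in this chapter; the indices values are invented), we could write:

import tensorflow as tf

indices = [1, 3, 2, 4]                      # invented example values
one_hot_op = tf.one_hot(indices=indices, depth=5, on_value=1, off_value=0, axis=-1)

with tf.Session() as sess:
    print(sess.run(one_hot_op))             # same matrix as the earlier one hot example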

TensorFlow's softmax implementation

The included method for applying softmax regression in TensorFlow is tf.nn.log_softmax, with the following form:

tf.nn.log_softmax(logits, name=None)

Here, the arguments are:

  • logits: A 2-D tensor with shape [batch_size, num_classes]; must be one of the following types: float32 or float64
  • name: A name for the operation (optional)

This function returns a tensor with the same type and shape as logits.
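
Here is a brief, hedged usage sketch (the logits values are made up); exponentiating the returned log probabilities and summing them gives 1:

import numpy as np
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])   # shape [batch_size=1, num_classes=3]
log_probs = tf.nn.log_softmax(logits)

with tf.Session() as sess:
    result = sess.run(log_probs)
    print(result)                 # log probabilities, same shape as logits
    print(np.exp(result).sum())   # exponentiating and summing gives 1.0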

Dataset description and loading

The first case we will cover is one where we want to fit a logistic regression measuring only one variable, and where the only possible result is binary.

The CHDAGE dataset

For the first simple example, we will use a very simple and well-studied dataset, first known for being published in the book Applied Logistic Regression, Third Edition, by David W. Hosmer Jr., Stanley Lemeshow, and Rodney X. Sturdivant (Wiley).

It lists the age in years (AGE) and the presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical study of risk factors for heart disease. The table also contains an identifier variable (ID) and an age group variable (AGEGRP). The outcome variable is CHD, which is coded with a value of 0 to indicate that CHD is absent, or 1 to indicate that it is present in the individual. In general, any two values could be used, but we have found it most convenient to use zero and one. We refer to this dataset as the CHDAGE data.

CHDAGE dataset format

The CHDAGE dataset is a two-column CSV file that we will download from an external repository.

In Chapter 1, Exploring and Transforming Data, we used native TensorFlow methods for reading the dataset. In this chapter, we will use a complementary and popular library to get the data.

The reason for this new addition is that, given that the dataset only has 100 tuples, it is practical to read it in a single line, and we also get simple but powerful analysis methods for free, provided by the pandas library.

So, in the first stage of this project, we will start by loading an instance of the CHDAGE dataset, then print vital statistics about the data, and then proceed to preprocessing.

After doing some plots of the data, we will build a model composed of an activation function, which will be a softmax function; in the special case where there are only two classes (existence, or not, of the illness), it reduces to a standard logistic regression.

Dataset loading and preprocessing implementation

First, we import the required libraries, and indicate that all our matplotlib programs will be included inline (if we are using Jupyter):

>>> import pandas as pd 
>>> import numpy as np 
>>> %matplotlib inline 
>>> import matplotlib.pyplot as plt 

Then we read the data and ask pandas to check important statistical information about the dataset:

>>> df = pd.read_csv("data/CHD.csv", header=0) 
>>> print df.describe() 
             age        chd
count  100.000000  100.00000
mean    44.380000    0.43000
std     11.721327    0.49757
min     20.000000    0.00000
25%     34.750000    0.00000
50%     44.000000    0.00000
75%     55.000000    1.00000
max     69.000000    1.00000

We then proceed to plot the data to get an initial idea of its distribution:

plt.figure() # Create a new figure 
plt.scatter(df['age'],df['chd']) # Plot a scatter chart of the age vs. CHD data points 

Dataset loading and preprocessing implementation

Model architecture

Here, we describe the sections of the code where we will build the elements of the model, starting from the following variables:

learning_rate = 0.8 #Learning speed 
batch_size = 100 #number of samples for the batch 
display_step = 2 #number of steps before showing progress

Here, we create the initial variables and placeholders for the graph, the univariate x and y float values:

x = tf.placeholder("float", [None, 1]) # Placeholder for the 1D data 
y = tf.placeholder("float", [None, 2]) # Placeholder for the classes (2)

Now we will create linear model variables, which will be modified and updated as the model fitting goes on:

W = tf.Variable(tf.zeros([1, 2])) 
b = tf.Variable(tf.zeros([2]))

And finally, we will build the activation function applying the softmax operation to the linear function:

activation = tf.nn.softmax(tf.matmul(x, W) + b) 

Loss function description and optimizer loop

Here, we just define the cross entropy function as the loss function, and define the optimizer operation, which will be gradient descent. This will be explained in the following chapters; for now, you can see it as a black box that will change the variables until the loss is minimized:

cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(activation), reduction_indices=1)) 
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) 
#Iterate through all the epochs 
for epoch in range(training_epochs): 
        avg_cost = 0. 
        total_batch = 400/batch_size 
        # Loop over all batches 
        for i in range(total_batch): 
            # Transform the array into a one hot format 
            temp=tf.one_hot(indices = df['chd'].values, depth=2, on_value = 1, off_value = 0, axis = -1 , name = "a") 
            batch_xs, batch_ys =(np.transpose([df['age']])-44.38)/11.721327, temp 

            # Fit training using batch data 
            sess.run(optimizer, feed_dict={x: batch_xs.astype(float), y: batch_ys.eval()}) 

            # Compute the average loss, summing the current cost divided by the total number of batches 
            avg_cost += sess.run(cost, feed_dict={x: batch_xs.astype(float), y: batch_ys.eval()})/total_batch 

Stop condition

The process will simply stop once the data has been trained for the number of times indicated by training_epochs.

Results description

This will be the output of the program:

Epoch: 0001 cost= 0.638730764
[ 0.04824295 -0.04824295]
[[-0.17459483  0.17459483]]
Epoch: 0002 cost= 0.589489654
[ 0.08091066 -0.08091066]
[[-0.29231569  0.29231566]]
Epoch: 0003 cost= 0.565953553
[ 0.10427245 -0.10427245]
[[-0.37499282  0.37499279]]
Epoch: 0004 cost= 0.553756475
[ 0.12176144 -0.12176143]
[[-0.43521613  0.4352161 ]]
Epoch: 0005 cost= 0.547019333
[ 0.13527818 -0.13527818]
[[-0.48031801  0.48031798]]

Fitting function representations across epochs

In the following image we represent the progression of the fitting function across the different epochs:

Fitting function representations across epochs

Full source code

Here is the complete source code:

import pandas as pd 
import numpy as np 
get_ipython().magic(u'matplotlib inline') 
import matplotlib.pyplot as plt 
import tensorflow as tf 
 
df = pd.read_csv("data/CHD.csv", header=0) 
# Parameters 
 
learning_rate = 0.2 
training_epochs = 5 
batch_size = 100 
display_step = 1 
sess = tf.Session() 
b=np.zeros((100,2)) 
 
# tf Graph Input 
 
x = tf.placeholder("float", [None, 1]) 
y = tf.placeholder("float", [None, 2]) 
 
# Create model 
# Set model weights 
W = tf.Variable(tf.zeros([1, 2])) 
b = tf.Variable(tf.zeros([2])) 
 
# Construct model 
activation = tf.nn.softmax(tf.matmul(x, W) + b) 
# Minimize error using cross entropy 
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(activation), reduction_indices=1)) # Cross entropy 
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) # Gradient Descent 
 
# Initializing the variables 
init = tf.initialize_all_variables() 
 
# Launch the graph 
 
with tf.Session() as sess: 
    tf.train.write_graph(sess.graph, './graphs','graph.pbtxt') 
    sess.run(init) 
    writer = tf.train.SummaryWriter('./graphs', sess.graph) 
    #Initialize the graph structure 
 
    graphnumber=321 
 
    #Generate a new graph 
    plt.figure(1) 
 
    #Iterate through all the epochs 
    for epoch in range(training_epochs): 
        avg_cost = 0. 
        total_batch = 400/batch_size 
        # Loop over all batches 
 
        for i in range(total_batch): 
            # Transform the array into a one hot format 
 
            temp=tf.one_hot(indices = df['chd'].values, depth=2, on_value = 1, off_value = 0, axis = -1 , name = "a")       
            batch_xs, batch_ys = (np.transpose([df['age']])-44.38)/11.721327, temp 
 
            # Fit training using batch data 
            sess.run(optimizer, feed_dict={x: batch_xs.astype(float), y: batch_ys.eval()}) 
 
            # Compute the average loss, summing the current cost divided by the total number of batches 
            avg_cost += sess.run(cost, feed_dict={x: batch_xs.astype(float), y: batch_ys.eval()})/total_batch 
        # Display logs per epoch step 
 
        if epoch % display_step == 0: 
            print "Epoch:", '%05d' % (epoch+1), "cost=", "{:.8f}".format(avg_cost) 
 
            #Generate a new graph, and add it to the complete graph 
 
            trX = np.linspace(-30, 30, 100) 
            print (b.eval()) 
            print (W.eval()) 
            Wdos=2*W.eval()[0][0]/11.721327 
            bdos=2*b.eval()[0] 
 
            # Generate the probability function 
            trY = np.exp(-(Wdos*trX)+bdos)/(1+np.exp(-(Wdos*trX)+bdos) ) 
 
            # Draw the samples and the probability function, without the normalization 
            plt.subplot(graphnumber) 
            graphnumber=graphnumber+1 
 
            #Plot a scatter draw of the random datapoints 
            plt.scatter((df['age']),df['chd']) 
            plt.plot(trX+44.38,trY) # Plot the fitted probability function over the original age scale 
            plt.grid(True) 
 
        #Plot the final graph 
        plt.savefig("test.svg")  

Graphical representation

Using the TensorBoard utility, we can see the operation chain. Note that in half of the operation graph we define the main global operation (softmax), along with the gradient operations applied to the remaining terms, which are needed to perform the loss function minimization. This is a theme that will be discussed in the following chapters.

Graphical representation

Example 2 - Univariate logistic regression with skflow

In this example, we will explore the univariate examples domain, but this time we will use help from a new library, which eases the model building for us, called skflow.

Useful libraries and methods

In the field of machine learning libraries, there is a great number and variety of alternatives. One of the most well known is sklearn, which we talked about in Chapter 2, Clustering.

Very soon after the release of TensorFlow, a new contribution library appeared, called skflow, with the main objective of emulating the interface and workflow of sklearn, which is much more succinct than working directly in the TensorFlow session environment.

In the following example, we will repeat the analysis of the previous regression, but using the skflow interface.

In this example, we will also see how skflow automatically generates a detailed and very organized graph for the regression model, just by setting a log directory as a parameter.

Dataset description

The dataset loading stage is the same as the previous example, using the pandas library:

import pandas as pd 
 
df = pd.read_csv("data/CHD.csv", header=0) 
print df.describe() 

Model architecture

Here is the code snippet for my_model:

def my_model(X, y): 
    return skflow.models.logistic_regression(X, y) 
 
a = preprocessing.StandardScaler() 
X1 = a.fit_transform(df['age'].astype(float)) 
y1 = df['chd'].values 
classifier = skflow.TensorFlowEstimator(model_fn=my_model, n_classes=2) 

Here we can see a detailed view of the logistic regression stage, with the softmax classifier:

Model architecture

Model architecture

Results description

score = metrics.accuracy_score(df['chd'].astype(float), classifier.predict(X1)) 
print("Accuracy: %f" % score) 

The output result is a respectable (for the simplicity of the model) 74% accuracy:

Accuracy: 0.740000

Full source code

Here is the complete source code:

import tensorflow.contrib.learn as skflow 
from sklearn import datasets, metrics, preprocessing 
import numpy as np 
import pandas as pd 
 
df = pd.read_csv("data/CHD.csv", header=0) 
print df.describe() 
 
def my_model(X, y): 
    return skflow.models.logistic_regression(X, y) 
 
a = preprocessing.StandardScaler() 
 
X1 =a.fit_transform(df['age'].astype(float)) 
 
y1 = df['chd'].values 
 
 
classifier = skflow.TensorFlowEstimator(model_fn=my_model, n_classes=2) 
classifier.fit(X1,y1 , logdir='/tmp/logistic') 
 
 
 
score = metrics.accuracy_score(df['chd'].astype(float), classifier.predict(X1)) 
print("Accuracy: %f" % score) 

Summary

In this chapter, we learned a new modeling technique, the logistic function, and began with a simple approach to the task of classification.

We also learned a new approach for reading text-based data, using the pandas library.

Additionally, we have seen a complementary approach to the classical workflow, working with the skflow library.

In the next chapter, we will start working with more complex architectures, and enter the field where the TensorFlow library excels: the training, testing, and final implementation of neural networks to solve real-world problems.