Chapter 8. Deep Neural Networks

In this chapter, we will review one of the most advanced and most intensively studied fields in Machine Learning: Deep Neural Networks.

Deep neural network definition

This is an area experiencing an explosion of new techniques; every day we hear of successful experiments applying DNNs to new problems in, for example, computer vision, autonomous driving, and speech and text understanding.

In the previous chapters, we used techniques that are related to DNNs, especially in the chapter covering Convolutional Neural Networks.

For practical purposes, we will use the terms Deep Learning and Deep Neural Networks for architectures whose number of layers goes well beyond a couple of similar layers: neural network architectures with tens of layers, or combinations of complex constructs.

Deep network architectures through time

In this section, we will be reviewing the milestone architectures that appeared throughout the history of deep learning, starting with LeNet5.

LeNet 5

The field of neural networks was fairly quiet during the 1980s and the 1990s. There were some efforts, but the architectures were quite simple, and trying more complex approaches required a great deal of computing power that was often not available.

Around 1998, at Bell Labs, during research on the classification of handwritten check digits, Yann LeCun started a new trend by laying the foundations of what is now considered Deep Learning: the Convolutional Neural Networks, which we have already studied in Chapter 5, Simple FeedForward Neural Networks.

In those years, SVMs and other much more rigorously defined techniques were used to tackle these kinds of problems, but the fundamental paper on CNNs showed that neural networks could match or outperform the state-of-the-art methods of the time.


After some more years of hiatus (even though LeCun continued applying his networks to other tasks, such as face and object recognition), the exponential growth of both available structured data and raw processing power allowed teams to grow and tune their models to an extent previously considered impossible, so the complexity of the models could be increased without the risk of waiting months for training.

Computer research teams from a number of technological firms and universities began competing on some very difficult tasks, including image recognition. For one of these challenges, the Imagenet Classification Challenge, the Alexnet architecture was developed:


Alexnet architecture

Main features

Alexnet can be seen as an augmented LeNet5, in the sense that its first layers perform convolution operations. It adds max pooling layers, which were rarely used at the time, and then a series of densely connected layers, ending in an output layer of class probabilities.

The Visual Geometry Group (VGG) model

One of the other main contenders in the image classification challenge was the VGG of the University of Oxford.

The main characteristic of the VGG network architecture is that it reduces the size of the convolutional filters to a simple 3x3 and combines them in sequences.
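To see why sequences of small filters are attractive, a little arithmetic helps. The sketch below is illustrative only: the channel count of 64 and the helper names are assumptions, not values from the VGG paper. It compares three stacked 3x3 layers with a single 7x7 layer, which covers the same window of the input.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 2-D convolutional layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels


def stacked_receptive_field(kernel_size, num_layers):
    """Input window seen by `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # hypothetical channel count, same for input and output

# Three stacked 3x3 layers see the same 7x7 window as one 7x7 layer
field = stacked_receptive_field(3, 3)

stacked = 3 * conv_params(3, channels, channels)
single = conv_params(7, channels, channels)

print("receptive field of the stack:", field)
print("three 3x3 layers, parameters:", stacked)
print("one 7x7 layer, parameters:   ", single)
```

Under these assumptions the stack needs roughly half the parameters of the single large filter, and it also applies a non-linearity after each of its three layers.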

This idea of tiny convolutional kernels broke with the initial ideas of LeNet and its successor Alexnet, which used filters of up to 11x11, much more complex and slower to compute. This change in filter size was the beginning of a trend that is still current:

Main features

Summary of the parameter number per layer in VGG

However, despite this positive change of using a series of small convolution kernels, the total setup amounted to a very large number of parameters (on the order of many millions), so it had to be limited by a number of measures.

The original inception model

After two main research cycles dominated by Alexnet and VGG, Google disrupted the challenges with a very powerful architecture, Inception, which has several iterations.

The first of these iterations started with its own version of a Convolutional Neural Network layer-based architecture, called GoogLeNet, a name reminiscent of the network approach that started it all.

GoogLenet (Inception V1)

Inception module

GoogLeNet was the first iteration of this effort, and as you will see in the following figure, it has a very deep architecture that chains no fewer than nine inception modules, with little or no modification:

Inception original architecture

Despite being so complex, it managed to reduce the number of parameters needed and to increase accuracy compared to Alexnet, which had been released just two years before.

Nevertheless, this complex architecture is easier to understand and to scale because almost all of the structure consists of a fixed arrangement and repetition of the same building blocks.

Batch normalized inception (V2)

The state-of-the-art neural networks of 2015, while improving iteration over iteration, suffered from training instability.

To understand what the problem consisted of, first recall the simple normalization steps that we applied in the previous examples. They basically consisted of centering the values on zero and dividing by the maximum value, or the standard deviation, in order to have a good baseline for the gradients during backpropagation.

What occurs when training on really large datasets is that, after a number of training examples, the different value oscillations begin to amplify the mean parameter value, as in a resonance phenomenon. What we have very simply described is called covariate shift.

Performance comparison with and without Batch Normalization

This is the main reason why Batch Normalization techniques were developed.

Again simplifying the description, the process applies normalization not only to the original input values but also to the output values of each layer, preventing the instabilities that appear between layers before they begin to affect or drift the values.
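The per-layer normalization can be sketched in a few lines of NumPy. This is a minimal illustration of the training-time computation only, not the Inception implementation: the function name, the learnable scale gamma and shift beta, and the toy data are assumptions, and a real layer would also keep running averages for use at inference time.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)

# Hypothetical layer output: 32 examples, 4 features, far from zero mean
activations = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

normalized = batch_norm(activations, gamma=np.ones(4), beta=np.zeros(4))

print("mean per feature:", normalized.mean(axis=0))
print("std per feature: ", normalized.std(axis=0))
```

With gamma set to one and beta to zero, every feature leaves the layer with a mean close to zero and a standard deviation close to one, whatever the scale of the incoming activations.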

This is the main feature that Google shipped in its improved implementation of GoogLeNet, released in February 2015, and it is also called Inception V2.

Inception v3

Fast forward to December 2015, and there is a new iteration of the Inception architecture. The difference of months between releases gives us an idea of the pace of development of the new iterations.

The basic adaptations for this architecture are:

The following diagram illustrates how the improved inception module can be interpreted:

Inception V3 base module

And this is a representation of the whole V3 architecture, with many instances of the common building module:

Inception V3 general diagram

Residual Networks (ResNet)

The Residual Network architecture appeared in December 2015 (at more or less the same time as Inception V3), and it brought a simple but novel idea: use not only the output of each convolutional layer, but also combine the output of the layer with the original input.

In the following diagram, we observe a simplified view of one of the ResNet modules; it clearly shows the sum operation at the end of the Convolutional layer stack, and a final relu operation:

ResNet general architecture

The convolutional part of the module includes a feature reduction from 256 to 64 values, a 3x3 filter layer that maintains the number of features, and then a feature-augmenting 1x1 layer, from 64 to 256 values. In recent developments, ResNet is also used with a depth of fewer than 30 layers, in wider variants.
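The module described above can be sketched with plain NumPy. This is a simplified illustration under stated assumptions: the function names, the random weights, the 8x8 feature map, and the ReLU placement between the inner layers are choices made for the example, and batch normalization is left out.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels.
    x has shape (H, W, C_in), w has shape (C_in, C_out)."""
    return x @ w


def conv3x3(x, w):
    """3x3 convolution, stride 1, zero padding so H and W are preserved.
    x has shape (H, W, C_in), w has shape (3, 3, C_in, C_out)."""
    height, width, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((height, width, w.shape[3]))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + height, dx:dx + width, :] @ w[dy, dx]
    return out


def bottleneck_block(x, w_reduce, w_conv, w_expand):
    """256 -> 64 (1x1), 64 -> 64 (3x3), 64 -> 256 (1x1), plus the shortcut."""
    y = relu(conv1x1(x, w_reduce))
    y = relu(conv3x3(y, w_conv))
    y = conv1x1(y, w_expand)
    return relu(y + x)  # sum with the original input, then the final ReLU


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 256))
w_reduce = rng.normal(scale=0.01, size=(256, 64))
w_conv = rng.normal(scale=0.01, size=(3, 3, 64, 64))
w_expand = rng.normal(scale=0.01, size=(64, 256))

out = bottleneck_block(x, w_reduce, w_conv, w_expand)
print(out.shape)
```

Because the input is added back before the last ReLU, a block whose weights are all zero simply passes relu(x) through, which is one intuition for why very deep stacks of such blocks remain trainable.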

Other deep neural network architectures

There are a large number of recently developed neural network architectures; in fact, the field is so dynamic that a new outstanding architecture appears more or less every year. Some of the most promising neural network architectures are:

  • SqueezeNet: This architecture is an effort at reducing the parameter number and complexity of Alexnet, claiming a 50x parameter number reduction
  • Efficient Neural Network (Enet): Aims to build simpler, low-latency neural networks that need fewer floating point operations and deliver real-time results
  • Fractalnet: Its main characteristics are the implementation of very deep networks, without requiring the residual architecture, organizing the structural layout as a truncated fractal

Example - painting with style - VGG style transfer

In this example, we will work with an implementation of the paper A Neural Algorithm of Artistic Style, by Leon Gatys.


The original code for this exercise was kindly provided by Anish Athalye (

Note that this exercise does not have a training part. We will just load a pretrained coefficient matrix provided by VLFeat, a database of pretrained models, which lets us work with models while avoiding the normally computationally intensive training:

Style transfer main concepts

Useful libraries and methods

  • Loading parameters files with
    • The first useful library that we will be using is the scipy io module, which loads the coefficient data saved in the MATLAB mat format.

  • Usage of the preceding parameter:, mdict=None, appendmat=True, **kwargs) 
  • Returns of the preceding parameter:

    mat_dict : dict :dictionary with variable names as keys, and loaded matrices as values. If the mdict parameter is filled, the results will be assigned to it.
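The parameters listed above match the signature of scipy.io.loadmat, so the sketch below assumes that this is the function meant. It writes a tiny, hypothetical coefficient file with scipy.io.savemat and reads it back; the file name and the variable name kernel are made up for the example.

```python
import os
import tempfile

import numpy as np
import scipy.io

# Hypothetical stand-in for a pretrained coefficient file
weights = {"kernel": np.arange(6, dtype=np.float64).reshape(2, 3)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_weights.mat")
    scipy.io.savemat(path, weights)
    mat_dict = scipy.io.loadmat(path)

# Besides the saved variables, the dict holds metadata keys such as __header__
print(sorted(key for key in mat_dict if not key.startswith("__")))
print(mat_dict["kernel"])
```

The returned dictionary maps variable names to NumPy arrays, which is how the VGG listing later in this chapter reads entries such as data['layers'].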

Dataset description and loading

For the solution of this problem, we will be using a pretrained model, that is, the pretrained coefficients of a VGG neural network trained on the Imagenet dataset.

Dataset preprocessing

Given that the coefficients are given in the loaded parameter matrix, there is not much work to do regarding the initial dataset.

Modeling architecture

The modeling architecture is divided mainly in two parts: the style and the content.

For the generation of the final images, a VGG network without the final fully connected layer is used.

Loss functions

This architecture defines two different loss functions to optimize the two different aspects of the final image, one for the content and one for the style.

Content loss function

The code for content_loss function is as follows:

        # content loss: squared feature difference, averaged over all elements
        content_loss = content_weight * (2 * tf.nn.l2_loss(
                net[CONTENT_LAYER] - content_features[CONTENT_LAYER]) /
                content_features[CONTENT_LAYER].size)
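Since tf.nn.l2_loss returns half the sum of squares, the expression amounts to a weighted mean squared difference between the feature maps of the generated image and those of the content image. The NumPy sketch below assumes that the expression is divided by the number of feature elements, as the style loss in the full listing is; the helper names and toy arrays are hypothetical.

```python
import numpy as np


def l2_loss(t):
    """NumPy equivalent of tf.nn.l2_loss: half the sum of squares."""
    return np.sum(t ** 2) / 2.0


def content_loss(generated, target, content_weight):
    """Weighted mean squared difference between two feature maps."""
    return content_weight * (2.0 * l2_loss(generated - target) / target.size)


# Hypothetical feature maps of shape (batch, height, width, channels)
target = np.zeros((1, 2, 2, 3))
generated = np.ones((1, 2, 2, 3))

loss_value = content_loss(generated, target, content_weight=5.0)
print(loss_value)
```

Every element differs by exactly one here, so the mean squared difference is one and the result equals the weight.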

Style loss function
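In the full listing later in this chapter, style is compared through Gram matrices: for each style layer, the feature map is flattened to one row per pixel, and the channel-by-channel products are summed over all positions. The NumPy sketch below mirrors that computation for a single layer; the function names and the random feature maps are assumptions made for the example.

```python
import numpy as np


def gram_matrix(features):
    """Channel-by-channel correlations of a (1, H, W, C) feature map,
    divided by the number of feature elements."""
    channels = features.shape[3]
    flat = features.reshape(-1, channels)
    return flat.T @ flat / flat.size


def style_layer_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(generated) - gram_matrix(style)
    return np.sum(diff ** 2) / diff.size


rng = np.random.default_rng(0)
style = rng.normal(size=(1, 4, 4, 3))

# Shuffling pixel positions changes the layout but not the Gram matrix
shuffled = style.reshape(-1, 3)[rng.permutation(16)].reshape(1, 4, 4, 3)

print(gram_matrix(style).shape)
print(style_layer_loss(shuffled, style))
print(style_layer_loss(style + 1.0, style))
```

The shuffled map gives a loss of essentially zero, which shows why the Gram matrix captures texture and color statistics while ignoring where things are in the image.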

Loss optimization loop

The code for loss optimization loop is as follows:

        best_loss = float('inf')
        best = None
        with tf.Session() as sess:
            # variables must be initialized before the first evaluation
            sess.run(tf.global_variables_initializer())
            for i in range(iterations):
                last_step = (i == iterations - 1)
                print_progress(i, last=last_step)
                # one optimizer update of the image per iteration
                train_step.run()
                if (checkpoint_iterations and i % checkpoint_iterations == 0) or last_step:
                    this_loss = loss.eval()
                    if this_loss < best_loss:
                        best_loss = this_loss
                        best = image.eval()
                    yield (
                        (None if last_step else i),
                        vgg.unprocess(best.reshape(shape[1:]), mean_pixel)
                    )

Convergency test

In this example, we will just check for the number of indicated iterations (the iterations parameter).

Program execution

To run this program for a good number of iterations (around 1000), we recommend having at least 8 GB of RAM available:

python --content examples/2-content.jpg --styles examples/2-style1.jpg  --checkpoint-iterations=100 --iterations=1000 --checkpoint-output=out%s.jpg --output=outfinal

The results of the preceding command are as follows:

Style transfer steps

The console output is as follows:

Iteration 1/1000
Iteration 2/1000
Iteration 3/1000
Iteration 4/1000
Iteration 999/1000
Iteration 1000/1000
  content loss: 908786
    style loss: 261789
       tv loss: 25639.9
    total loss: 1.19621e+06

Full source code

The code for the main script is as follows:

import os 
import numpy as np 
import scipy.misc 
from stylize import stylize 
import math 
from argparse import ArgumentParser 
# default arguments 
TV_WEIGHT = 1e2 
VGG_PATH = 'imagenet-vgg-verydeep-19.mat' 
def build_parser(): 
    parser = ArgumentParser() 
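    parser.add_argument('--content',
            dest='content', help='content image',
            metavar='CONTENT', required=True)
    parser.add_argument('--styles',
            dest='styles',
            nargs='+', help='one or more style images',
            metavar='STYLE', required=True)
    parser.add_argument('--output',
            dest='output', help='output path',
            metavar='OUTPUT', required=True)
    parser.add_argument('--checkpoint-output',
            dest='checkpoint_output', help='checkpoint output format')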
    parser.add_argument('--iterations', type=int, 
            dest='iterations', help='iterations (default %(default)s)', 
            metavar='ITERATIONS', default=ITERATIONS) 
    parser.add_argument('--width', type=int,
            dest='width', help='output width')
    parser.add_argument('--style-scales', type=float,
            dest='style_scales',
            nargs='+', help='one or more style scales')
    # assumption: flag named after dest='network'
    parser.add_argument('--network',
            dest='network', help='path to network parameters (default %(default)s)',
            metavar='VGG_PATH', default=VGG_PATH)
    parser.add_argument('--content-weight', type=float, 
            dest='content_weight', help='content weight (default %(default)s)', 
            metavar='CONTENT_WEIGHT', default=CONTENT_WEIGHT) 
    parser.add_argument('--style-weight', type=float, 
            dest='style_weight', help='style weight (default %(default)s)', 
            metavar='STYLE_WEIGHT', default=STYLE_WEIGHT) 
    parser.add_argument('--style-blend-weights', type=float, 
            dest='style_blend_weights', help='style blending weights', 
            nargs='+', metavar='STYLE_BLEND_WEIGHT') 
    parser.add_argument('--tv-weight', type=float, 
            dest='tv_weight', help='total variation regularization weight (default %(default)s)', 
            metavar='TV_WEIGHT', default=TV_WEIGHT) 
    parser.add_argument('--learning-rate', type=float, 
            dest='learning_rate', help='learning rate (default %(default)s)', 
            metavar='LEARNING_RATE', default=LEARNING_RATE) 
    # assumption: flag named after dest='initial'
    parser.add_argument('--initial',
            dest='initial', help='initial image')
    parser.add_argument('--print-iterations', type=int,
            dest='print_iterations', help='statistics printing frequency')
    parser.add_argument('--checkpoint-iterations', type=int,
            dest='checkpoint_iterations', help='checkpoint frequency')
    return parser 
def main(): 
    parser = build_parser() 
    options = parser.parse_args() 
    if not os.path.isfile(options.network):
        parser.error("Network %s does not exist. (Did you forget to download it?)" %
                     options.network)
    content_image = imread(options.content)
    style_images = [imread(style) for style in options.styles] 
    width = options.width 
    if width is not None: 
        new_shape = (int(math.floor(float(content_image.shape[0]) / 
                content_image.shape[1] * width)), width) 
        content_image = scipy.misc.imresize(content_image, new_shape) 
    target_shape = content_image.shape 
    for i in range(len(style_images)): 
        style_scale = STYLE_SCALE 
        if options.style_scales is not None: 
            style_scale = options.style_scales[i] 
        style_images[i] = scipy.misc.imresize(style_images[i], style_scale * 
                target_shape[1] / style_images[i].shape[1]) 
    style_blend_weights = options.style_blend_weights
    if style_blend_weights is None:
        # default is equal weights
        style_blend_weights = [1.0/len(style_images) for _ in style_images]
    else:
        # normalize user-supplied weights so that they sum to one
        total_blend_weight = sum(style_blend_weights)
        style_blend_weights = [weight/total_blend_weight
                               for weight in style_blend_weights]
    initial = options.initial 
    if initial is not None: 
        initial = scipy.misc.imresize(imread(initial), content_image.shape[:2]) 
    if options.checkpoint_output and "%s" not in options.checkpoint_output: 
        parser.error("To save intermediate images, the checkpoint output " 
                     "parameter must contain `%s` (e.g. `foo%s.jpg`)") 
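    # arguments follow the order of the stylize() signature
    for iteration, image in stylize(
            options.network, initial, content_image, style_images,
            options.iterations, options.content_weight, options.style_weight,
            style_blend_weights, options.tv_weight, options.learning_rate,
            print_iterations=options.print_iterations,
            checkpoint_iterations=options.checkpoint_iterations):
        output_file = None
        if iteration is not None:
            if options.checkpoint_output:
                output_file = options.checkpoint_output % iteration
        else:
            # iteration is None only for the final image
            output_file = options.output
        if output_file:
            imsave(output_file, image)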
def imread(path): 
    return scipy.misc.imread(path).astype(np.float) 
def imsave(path, img): 
    img = np.clip(img, 0, 255).astype(np.uint8) 
    scipy.misc.imsave(path, img) 
if __name__ == '__main__':
    main()

The code for the stylize module is as follows:

import vgg 
import tensorflow as tf 
import numpy as np 
from sys import stderr 
CONTENT_LAYER = 'relu4_2' 
STYLE_LAYERS = ('relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1') 
# reduce is a builtin in Python 2 only; import it where it is missing
try:
    reduce
except NameError:
    from functools import reduce
def stylize(network, initial, content, styles, iterations, 
        content_weight, style_weight, style_blend_weights, tv_weight, 
        learning_rate, print_iterations=None, checkpoint_iterations=None): 
    """
    Stylize images.

    This function yields tuples (iteration, image); `iteration` is None
    if this is the final image (the last iteration).  Other tuples are yielded
    every `checkpoint_iterations` iterations.

    :rtype: iterator[tuple[int|None,image]]
    """
    shape = (1,) + content.shape 
    style_shapes = [(1,) + style.shape for style in styles] 
    content_features = {} 
    style_features = [{} for _ in styles] 
    # compute content features in feedforward mode 
    g = tf.Graph() 
    with g.as_default(), g.device('/cpu:0'), tf.Session() as sess: 
        image = tf.placeholder('float', shape=shape) 
        net, mean_pixel = vgg.net(network, image)
        content_pre = np.array([vgg.preprocess(content, mean_pixel)]) 
        content_features[CONTENT_LAYER] = net[CONTENT_LAYER].eval( 
                feed_dict={image: content_pre}) 
    # compute style features in feedforward mode 
    for i in range(len(styles)): 
        g = tf.Graph() 
        with g.as_default(), g.device('/cpu:0'), tf.Session() as sess: 
            image = tf.placeholder('float', shape=style_shapes[i]) 
            net, _ = vgg.net(network, image)
            style_pre = np.array([vgg.preprocess(styles[i], mean_pixel)]) 
            for layer in STYLE_LAYERS: 
                features = net[layer].eval(feed_dict={image: style_pre}) 
                features = np.reshape(features, (-1, features.shape[3])) 
                gram = np.matmul(features.T, features) / features.size 
                style_features[i][layer] = gram 
    # make stylized image using backpropagation
    with tf.Graph().as_default():
        if initial is None:
            noise = np.random.normal(size=shape, scale=np.std(content) * 0.1)
            initial = tf.random_normal(shape) * 0.256
        else:
            initial = np.array([vgg.preprocess(initial, mean_pixel)])
            initial = initial.astype('float32')
        image = tf.Variable(initial)
        net, _ = vgg.net(network, image)
        # content loss
        content_loss = content_weight * (2 * tf.nn.l2_loss(
                net[CONTENT_LAYER] - content_features[CONTENT_LAYER]) /
                content_features[CONTENT_LAYER].size)
        # style loss 
        style_loss = 0 
        for i in range(len(styles)): 
            style_losses = [] 
            for style_layer in STYLE_LAYERS: 
                layer = net[style_layer] 
                _, height, width, number = map(lambda i: i.value, layer.get_shape()) 
                size = height * width * number 
                feats = tf.reshape(layer, (-1, number)) 
                gram = tf.matmul(tf.transpose(feats), feats) / size 
                style_gram = style_features[i][style_layer] 
                style_losses.append(2 * tf.nn.l2_loss(gram - style_gram) / style_gram.size) 
            style_loss += style_weight * style_blend_weights[i] * reduce(tf.add, style_losses) 
        # total variation denoising 
        tv_y_size = _tensor_size(image[:,1:,:,:]) 
        tv_x_size = _tensor_size(image[:,:,1:,:]) 
        tv_loss = tv_weight * 2 * (
                (tf.nn.l2_loss(image[:,1:,:,:] - image[:,:shape[1]-1,:,:]) /
                    tv_y_size) +
                (tf.nn.l2_loss(image[:,:,1:,:] - image[:,:,:shape[2]-1,:]) /
                    tv_x_size))
        # overall loss
        loss = content_loss + style_loss + tv_loss 
        # optimizer setup 
        train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss) 
        def print_progress(i, last=False): 
            stderr.write('Iteration %d/%d\n' % (i + 1, iterations)) 
            if last or (print_iterations and i % print_iterations == 0): 
                stderr.write('  content loss: %g\n' % content_loss.eval()) 
                stderr.write('    style loss: %g\n' % style_loss.eval()) 
                stderr.write('       tv loss: %g\n' % tv_loss.eval()) 
                stderr.write('    total loss: %g\n' % loss.eval()) 
        # optimization 
        best_loss = float('inf')
        best = None
        with tf.Session() as sess:
            # variables must be initialized before the first evaluation
            sess.run(tf.global_variables_initializer())
            for i in range(iterations):
                last_step = (i == iterations - 1)
                print_progress(i, last=last_step)
                # one optimizer update of the image per iteration
                train_step.run()
                if (checkpoint_iterations and i % checkpoint_iterations == 0) or last_step:
                    this_loss = loss.eval()
                    if this_loss < best_loss:
                        best_loss = this_loss
                        best = image.eval()
                    yield (
                        (None if last_step else i),
                        vgg.unprocess(best.reshape(shape[1:]), mean_pixel)
                    )
def _tensor_size(tensor): 
    from operator import mul 
    return reduce(mul, (d.value for d in tensor.get_shape()), 1) 

The code for the vgg module is as follows:

import tensorflow as tf
import numpy as np
import scipy.io
def net(data_path, input_image):
    layers = (
        'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1',
        'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2',
        'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3',
        'relu3_3', 'conv3_4', 'relu3_4', 'pool3',
        'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3',
        'relu4_3', 'conv4_4', 'relu4_4', 'pool4',
        'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3',
        'relu5_3', 'conv5_4', 'relu5_4'
    )
    # load the pretrained coefficients from the MATLAB-format file
    data = scipy.io.loadmat(data_path)
    mean = data['normalization'][0][0][0] 
    mean_pixel = np.mean(mean, axis=(0, 1)) 
    weights = data['layers'][0] 
    net = {} 
    current = input_image 
    for i, name in enumerate(layers): 
        kind = name[:4] 
        if kind == 'conv': 
            kernels, bias = weights[i][0][0][0][0] 
            # matconvnet: weights are [width, height, in_channels, out_channels] 
            # tensorflow: weights are [height, width, in_channels, out_channels] 
            kernels = np.transpose(kernels, (1, 0, 2, 3)) 
            bias = bias.reshape(-1) 
            current = _conv_layer(current, kernels, bias) 
        elif kind == 'relu': 
            current = tf.nn.relu(current) 
        elif kind == 'pool': 
            current = _pool_layer(current) 
        net[name] = current 
    assert len(net) == len(layers) 
    return net, mean_pixel 
def _conv_layer(input, weights, bias):
    # 'SAME' padding keeps the spatial size of the feature map
    conv = tf.nn.conv2d(input, tf.constant(weights), strides=(1, 1, 1, 1),
            padding='SAME')
    return tf.nn.bias_add(conv, bias)
def _pool_layer(input):
    return tf.nn.max_pool(input, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1),
            padding='SAME')
def preprocess(image, mean_pixel): 
    return image - mean_pixel 
def unprocess(image, mean_pixel): 
    return image + mean_pixel 


In this chapter, we learned about the different Deep Neural Network architectures.

We learned about building one of the most well known architectures of recent years, VGG, and how to employ it to generate images that translate artistic style.

In the next chapter, we will be using one of the most useful technologies in Machine Learning: Graphics Processing Units. We will review the steps needed to install TensorFlow with GPU support and train models with it, comparing execution times with those obtained when the model runs on the CPU only.