Reviewing what we know about the more traditional neural network models, we observe that the training and prediction phases are normally expressed in a static manner: an input is fed and we get an output, but we don't take into account the sequence in which the events occur. Unlike the prediction models reviewed so far, the predictions of recurrent neural networks depend on the current input vector and also on the values of previous ones.
The topics we will cover in this chapter are as follows:
Knowledge doesn't normally appear from a void. Many new ideas are born as a combination of previous knowledge, so that is a useful behavior to emulate. Traditional neural networks don't include any mechanism for carrying previously seen elements over to the current state.
To implement this concept, we have recurrent neural networks, or RNNs. Recurrent neural networks can be defined as a sequential model of neural networks that has the property of reusing information already given. One of their main assumptions is that the current information depends on previous data. In the following figure, we observe a simplified diagram of the basic element of an RNN, called a cell:
The main information elements of a cell are the input (Xt), a state, and an output (ht). But as we said before, cells do not work independently of what came before, so a cell also stores state information. In the following figure, we show an "unrolled" RNN cell, showing how it goes from the initial state to outputting the final hn value, with some intermediate states in between.
Once we have defined the dynamics of the cell, the next objective is to investigate what makes up an RNN cell. In the most common case of the standard RNN, there is simply a neural network layer that takes the input and the previous state as inputs, applies the tanh operation, and outputs the new state h(t+1).
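The update just described can be sketched in a few lines of NumPy. This is only an illustration, not the TensorFlow implementation: the function name rnn_step, the weight names W_x, W_h, and b, and the sizes are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a standard RNN cell: new state = tanh(W_x x + W_h h + b)."""
    return np.tanh(np.dot(W_x, x_t) + np.dot(W_h, h_prev) + b)

# Hypothetical sizes, chosen only for illustration.
input_size, state_size = 3, 4
rng = np.random.RandomState(0)
W_x = rng.normal(scale=0.1, size=(state_size, input_size))
W_h = rng.normal(scale=0.1, size=(state_size, state_size))
b = np.zeros(state_size)

# Unroll the cell over a short sequence, starting from a zero state.
sequence = rng.normal(size=(5, input_size))
h = np.zeros(state_size)
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)
```

The same weights are reused at every step; only the state h changes as the sequence is consumed.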
This simple setup is able to accumulate information as the time steps pass, but further experimentation showed that, for complex knowledge, the distance within the sequence makes it difficult to relate some contexts. For example, "The architect knows about designing beautiful buildings" seems like a simple structure to remember, but associating "architect" with "buildings" requires the network to keep the context across every element in between, and that distance grows with longer sequences. This also brings the associated issue of exploding and vanishing gradients.
One of the main problems of recurrent neural networks arises in the backpropagation stage. Given their recurrent nature, the errors are backpropagated through a number of steps corresponding to a very deep network. This cascade of gradient calculations can lead to an insignificant value in the last stages or, on the contrary, to ever-increasing and unbounded parameters. These phenomena are called vanishing and exploding gradients, and they are one of the reasons the LSTM architecture was created.
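A deliberately simplified way to see the effect is to treat the backpropagated gradient as a number multiplied by the same factor at every time step. Real networks multiply by matrices and activation derivatives, so the scalar model and the name gradient_scale are assumptions made only for illustration.

```python
def gradient_scale(factor, steps):
    """Size of a gradient after being multiplied by `factor` at each of `steps` time steps."""
    return factor ** steps

# A factor slightly below 1 shrinks the gradient toward zero over 100 steps...
vanishing = gradient_scale(0.9, 100)
# ...while a factor slightly above 1 makes it grow very large.
exploding = gradient_scale(1.1, 100)

print(vanishing, exploding)
```

Even factors this close to 1 produce values several orders of magnitude apart after 100 steps, which is why long sequences are hard for a plain RNN.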
In order to better understand the internal building blocks of the LSTM cell, we will describe the main operational block of the LSTM: the gate operation.
This operation has a multivariate input, and in this block we decide to let some of the inputs go through and block the others. We can think of it as an information filter, which mainly contributes to getting and remembering the needed information elements.
In order to implement this function, we take a multivariate control vector (marked with an arrow), which is connected to a neural network layer with a sigmoid activation function. After applying the control vector and passing it through the sigmoid function, we get a binary-like vector.
We will represent this function with many switch signs:
After defining that binary vector, we multiply the input by the vector so that we filter it, letting only parts of the information get through. We will represent this operation with a triangle pointing in the direction in which the information flows.
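As a minimal sketch of the gate operation, assuming a single sigmoid layer with hand-picked weights (the names gate, W, and b, and all the values, are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(control, data, W, b):
    """Filter `data` element-wise with a sigmoid layer driven by `control`."""
    mask = sigmoid(np.dot(W, control) + b)   # values in (0, 1): the binary-like vector
    return mask * data                       # element-wise filtering of the input

control = np.array([1.0, -1.0])
data = np.array([5.0, 5.0, 5.0])
# Weights chosen by hand so the three switches are open, closed, and half open.
W = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [0.0, 0.0]])
b = np.zeros(3)

filtered = gate(control, data, W, b)
print(filtered)
```

The first element passes almost untouched, the second is almost completely blocked, and the third is halved. In a trained cell, the weights are learned rather than set by hand.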
General LSTM cell structure
In the following picture, we represent the general structure of an LSTM cell. It mainly consists of three of the gate operations mentioned, which protect and control the cell state.
These operations allow the cell both to discard (hopefully unimportant) old state data and to incorporate (hopefully important) new data into the state.
The previous figure tries to show all the concepts involved in the operation of one LSTM cell.
As the inputs we have:
And as outputs, we have the result of combining all the gate operations.
In this section, we will describe, in general terms, the different substeps the information goes through in each loop step of the cell's operation.
In this step, we take the values coming from the short-term state, combined with the input itself, and these values set the values of a binary function, represented by a multivariable sigmoid. Depending on the input and short-term memory values, the sigmoid output will allow or restrict some of the previous knowledge or weights contained in the cell state.
Then it is time to set the filter that will allow or reject the incorporation of new and short-term memory into the semi-permanent cell state.
So in this stage, we determine how much of the new and semi-new information will be incorporated into the new cell state. Additionally, we finally pass the information through the filter we have been configuring, and as a result we have an updated long-term state.
In order to normalize the new and short-term information, we pass it through a neural network with tanh activation, which allows us to feed the new information in a normalized (-1, 1) range.
Now it is the turn of the short-term state. It also uses the new input and the previous short-term state to allow new information to pass, but the input will be the long-term state, passed through a tanh function (again to normalize it to a (-1, 1) range) and multiplied element-wise by the gate output.
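The substeps above can be put together in a short NumPy sketch. It follows the commonly used LSTM formulation and is not the TensorFlow source; the parameter names (W_f, W_i, W_c, W_o and their biases), the sizes, and the random initialization are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new short-term state h and long-term state c."""
    z = np.concatenate([h_prev, x_t])                    # short-term state combined with the input
    forget = sigmoid(np.dot(p["W_f"], z) + p["b_f"])     # what to keep of the old cell state
    accept = sigmoid(np.dot(p["W_i"], z) + p["b_i"])     # how much new information to let in
    candidate = np.tanh(np.dot(p["W_c"], z) + p["b_c"])  # new information, normalized to (-1, 1)
    c = forget * c_prev + accept * candidate             # updated long-term state
    output = sigmoid(np.dot(p["W_o"], z) + p["b_o"])     # filter for the short-term state
    h = output * np.tanh(c)                              # updated short-term state
    return h, c

# Hypothetical sizes and random parameters, for illustration only.
input_size, state_size = 2, 3
rng = np.random.RandomState(0)
params = {}
for name in ("f", "i", "c", "o"):
    params["W_" + name] = rng.normal(scale=0.1, size=(state_size, state_size + input_size))
    params["b_" + name] = np.zeros(state_size)

h, c = np.zeros(state_size), np.zeros(state_size)
for x_t in rng.normal(size=(4, input_size)):
    h, c = lstm_step(x_t, h, c, params)
```

Each of the three sigmoid lines is one gate operation, and the long-term state c is only ever changed by element-wise scaling and addition.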
The field of RNNs is much more general, but in this chapter we will focus on the LSTM type of recurrent neural network cell. There are also other variations of the RNN that are being employed and add advantages to the field, for example:
In this section, we will review the main classes and methods that we can use to build an LSTM layer, which we will use in the examples of the book.
This class is a basic LSTM recurrent network cell with a forget bias and none of the fancy characteristics of other related types, such as peepholes, which allow the cell to take a look at the cell state even at stages where it is not supposed to influence the results.
The following are the main parameters:
In the architectures we will be using for this particular example, we won't be using a single cell to take into account the historical values. In this case, we will be using a stack of connected cells. For this reason, we will be instantiating the MultiRNNCell class.
MultiRNNCell(cells, state_is_tuple=False)
This is the constructor for MultiRNNCell. The main argument of this method is cells, which is a list of the RNNCell instances we want to stack.
This function splits the input along a dimension and then squeezes out the dimension along which it was split. It takes the dimension to cut, the number of ways to split, and the tensor to split. It returns a list of tensors, each with one dimension fewer than the input.
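To make the shapes concrete, here is a NumPy analogue of the operation. It is a sketch of the behavior described above and not the learn.ops implementation; the argument order simply mirrors the description (dimension, number of splits, array).

```python
import numpy as np

def split_squeeze(dim, num_split, array):
    """Split `array` into `num_split` pieces along `dim`, then drop that axis from each piece."""
    return [np.squeeze(piece, axis=dim)
            for piece in np.split(array, num_split, axis=dim)]

# A toy batch shaped (batch, time_steps, features) = (2, 5, 1).
batch = np.arange(2 * 5 * 1).reshape(2, 5, 1)
steps = split_squeeze(1, 5, batch)

print(len(steps), steps[0].shape)
```

The result is one (batch, features) array per time step, which is the list-of-inputs layout an unrolled RNN expects.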
In this example, we will be solving a problem in the domain of regression. The dataset we will be working on is a compendium of many measurements of the power consumption of one home over a period of time. As we could infer, this kind of behavior can easily follow patterns (it increases when the occupants use the microwave to prepare breakfast and turn on computers after waking up, can decrease a bit in the afternoon, then increases at night with all the lights on, and finally decreases to zero from midnight until the next wake-up hour).
So let's try to model this behavior in a sample case.
In this example we will be using the Electricity Load Diagrams Data Sets, from Artur Trindade (site: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014).
This is the description of the original dataset:
Data set has no missing values. Values are in kW of each 15 min. To convert values in kWh values must be divided by 4. Each column represent one client. Some clients were created after 2011. In these cases consumption were considered zero. All time labels report to Portuguese hour. However all days present 96 measures (24*15). Every year in March time change day (which has only 23 hours) the values between 1:00 am and 2:00 am are zero for all points. Every year in October time change day (which has 25 hours) the values between 1:00 am and 2:00 am aggregate the consumption of two hours.
In order to simplify our model description, we took just one client's complete measurements and converted the format to standard CSV. The file is located in the data subfolder of this chapter's code folder.
With these lines of code, we will open and plot the client's data:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("data/elec_load.csv", error_bad_lines=False)
plt.subplot()
plot_test, = plt.plot(df.values[:1500], label='Load')
plt.legend(handles=[plot_test])
If we take a look at this representation (we look at the first 1500 samples), we see an initial transient state, probably from when the measurements were put in place, and then a really clear cycle of high and low consumption levels.
From simple observation, we also see that the cycles are more or less 100 samples long, pretty close to the 96 samples per day this dataset has.
In order to ensure better convergence of the backpropagation methods, we should try to normalize the input data.
So we will apply the classic scaling and centering technique, subtracting approximately the mean value and scaling by approximately the maximum value (the code uses 147.0 and 339.0).
To get the needed values, we use the pandas describe() method.
Load
count 140256.000000
mean 145.332503
std 48.477976
min 0.000000
25% 106.850998
50% 151.428571
75% 177.557604
max 338.218126
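As a small sketch of the centering and scaling step, assuming the constants 147.0 and 339.0 that the chapter's code applies (the helper names normalize and denormalize are illustrative, not part of the original listing):

```python
import numpy as np

MEAN_OFFSET = 147.0   # close to the mean reported by describe()
SCALE = 339.0         # close to the maximum reported by describe()

def normalize(load):
    """Center and scale raw load values."""
    return (np.asarray(load, dtype=float) - MEAN_OFFSET) / SCALE

def denormalize(values):
    """Map normalized values back to the original units."""
    return np.asarray(values, dtype=float) * SCALE + MEAN_OFFSET

# min, mean, and max taken from the describe() output above
sample = np.array([0.0, 145.332503, 338.218126])
normalized = normalize(sample)
print(normalized)
```

With these constants, the whole series stays inside the (-1, 1) range, and denormalize recovers predictions in the original units.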
Here we will succinctly describe the architecture that will try to model the variations on electricity consumption:
The resulting architecture basically consists of a 10-member serially connected LSTM multicell, with a linear regressor at the end, which transforms the output of the linear array of cells into a final real number for a given history of values (in this case, we have to input the last 5 values to predict the next one).
def lstm_model(time_steps, rnn_layers, dense_layers=None):
    def lstm_cells(layers):
        return [tf.nn.rnn_cell.BasicLSTMCell(layer['steps'],state_is_tuple=True)
                for layer in layers]
    def dnn_layers(input_layers, layers):
        return input_layers
    def _lstm_model(X, y):
        stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(lstm_cells(rnn_layers), state_is_tuple=True)
        x_ = learn.ops.split_squeeze(1, time_steps, X)
        output, layers = tf.nn.rnn(stacked_lstm, x_, dtype=dtypes.float32)
        output = dnn_layers(output[-1], dense_layers)
        return learn.models.linear_regression(output, y)
    return _lstm_model
The following figure shows the main blocks, later complemented by the learn module. There you can see the RNN stage, the optimizer, and the final linear regression before the output.
In this picture, we take a look at the RNN stage. There we can observe the cascade of individual LSTM cells, with the input squeeze and all the complementary operations that the learn package adds.
And then we will complete the definition of the model with the regressor:
regressor = learn.TensorFlowEstimator(model_fn=lstm_model(
TIMESTEPS, RNN_LAYERS, DENSE_LAYERS), n_classes=0,
verbose=2, steps=TRAINING_STEPS, optimizer='Adagrad',
learning_rate=0.03, batch_size=BATCH_SIZE)
Here we will run the fit function for the current model:
regressor.fit(X['train'], y['train'], monitors=[validation_monitor], logdir=LOG_DIR)
And we will obtain the following (very good!) error rates. One exercise we could do is to skip normalizing the data and see whether the mean error is the same (NB: it is not; it is much worse).
This is the simple console output we will get:
MSE: 0.001139
And this is the generated loss/mean graphic that tells us how the error is decaying with every iteration:
The following is the complete source code:
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
from tensorflow.python.framework import dtypes
from tensorflow.contrib import learn
import logging
logging.basicConfig(level=logging.INFO)
from sklearn.metrics import mean_squared_error
LOG_DIR = './ops_logs'
TIMESTEPS = 5
RNN_LAYERS = [{'steps': TIMESTEPS}]
DENSE_LAYERS = None
TRAINING_STEPS = 10000
BATCH_SIZE = 100
PRINT_STEPS = TRAINING_STEPS / 100
def lstm_model(time_steps, rnn_layers, dense_layers=None):
    def lstm_cells(layers):
        return [tf.nn.rnn_cell.BasicLSTMCell(layer['steps'],state_is_tuple=True)
                for layer in layers]
    def dnn_layers(input_layers, layers):
        return input_layers
    def _lstm_model(X, y):
        stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(lstm_cells(rnn_layers), state_is_tuple=True)
        x_ = learn.ops.split_squeeze(1, time_steps, X)
        output, layers = tf.nn.rnn(stacked_lstm, x_, dtype=dtypes.float32)
        output = dnn_layers(output[-1], dense_layers)
        return learn.models.linear_regression(output, y)
    return _lstm_model
regressor = learn.TensorFlowEstimator(model_fn=lstm_model(TIMESTEPS, RNN_LAYERS, DENSE_LAYERS), n_classes=0,
verbose=2, steps=TRAINING_STEPS, optimizer='Adagrad',
learning_rate=0.03, batch_size=BATCH_SIZE)
df = pd.read_csv("data/elec_load.csv", error_bad_lines=False)
plt.subplot()
plot_test, = plt.plot(df.values[:1500], label='Load')
plt.legend(handles=[plot_test])
print(df.describe())
array=(df.values- 147.0) /339.0
plt.subplot()
plot_test, = plt.plot(array[:1500], label='Normalized Load')
plt.legend(handles=[plot_test])
listX = []
listy = []
X={}
y={}
for i in range(0,len(array)-6):
    listX.append(array[i:i+5].reshape([5,1]))
    listy.append(array[i+6])
arrayX=np.array(listX)
arrayy=np.array(listy)
X['train']=arrayX[0:12000]
X['test']=arrayX[12000:13000]
X['val']=arrayX[13000:14000]
y['train']=arrayy[0:12000]
y['test']=arrayy[12000:13000]
y['val']=arrayy[13000:14000]
# print y['test'][0]
# print y2['test'][0]
#X1, y2 = generate_data(np.sin, np.linspace(0, 100, 10000), TIMESTEPS, seperate=False)
# create a lstm instance and validation monitor
validation_monitor = learn.monitors.ValidationMonitor(X['val'], y['val'],
every_n_steps=PRINT_STEPS,
early_stopping_rounds=1000)
regressor.fit(X['train'], y['train'], monitors=[validation_monitor], logdir=LOG_DIR)
predicted = regressor.predict(X['test'])
rmse = np.sqrt(((predicted - y['test']) ** 2).mean(axis=0))
score = mean_squared_error(predicted, y['test'])
print ("MSE: %f" % score)
#plot_predicted, = plt.plot(array[:1000], label='predicted')
plt.subplot()
plot_predicted, = plt.plot(predicted, label='predicted')
plot_test, = plt.plot(y['test'], label='test')
plt.legend(handles=[plot_predicted, plot_test])
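The windowing loop in the listing above can be isolated as a small helper. The sketch below, with the hypothetical name make_windows, pairs each run of five values with the value immediately after it, as the text describes. Note that the listing itself takes array[i+6] as the target, which lies two positions past the end of the window, so the two are not identical.

```python
import numpy as np

def make_windows(series, window=5):
    """Pair each run of `window` consecutive values with the value right after it."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window].reshape(window, 1))
        y.append(series[i + window])
    return np.array(X), np.array(y)

# A toy series 0.0 .. 9.0 stands in for the normalized load.
series = np.arange(10, dtype=float)
X, y = make_windows(series)
print(X.shape, y.shape)
```

Each row of X has shape (window, 1), matching the (time_steps, features) layout that the model splits along dimension 1.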
In this example, we will work with a recurrent neural network specialized in character sequences, or the char RNN model.
We will feed this neural network with a series of musical tunes, the Bach Goldberg Variations, expressed in a character based format, and write a sample piece of music based on the learned structures.
Note that this example owes many ideas and concepts to the paper Visualizing and Understanding Recurrent Networks (https://arxiv.org/abs/1506.02078) and the article titled The Unreasonable Effectiveness of Recurrent Neural Networks, available at http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
As we previously saw, Char RNN models work with character sequences. This category of inputs can represent a vast array of possible languages. The following are a few examples:
The input contents of an RNN need a clear and straightforward representation. For this reason, the one-hot representation is chosen: it is convenient for characterizing an output with a limited number of possible outcomes (the number of distinct characters is finite and in the tens), and it can be compared directly with the value of a softmax function.
So the input of the model is a sequence of characters, and the output of the model is a sequence with one array per position. The length of each array is the same as the vocabulary size, so each array position represents the probability of the corresponding character being in this sequence position, given the previously entered characters.
In the following figure, we observe a very simplified model of the setup, with the encoded input word and the model predicting the word TEST as the expected output:
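A minimal sketch of the two representations involved, using the word TEST as the input: the helper names one_hot and softmax and the example logits are assumptions for illustration, not part of the chapter's model code.

```python
import numpy as np

def one_hot(text, vocab):
    """Encode each character as a vector with a single 1 at its vocabulary index."""
    encoded = np.zeros((len(text), len(vocab)))
    for position, char in enumerate(text):
        encoded[position, vocab[char]] = 1.0
    return encoded

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

chars = sorted(set("TEST"))                      # ['E', 'S', 'T']
vocab = {c: i for i, c in enumerate(chars)}
encoded = one_hot("TEST", vocab)

# Hypothetical raw scores for one output position.
probs = softmax(np.array([[2.0, 0.5, 0.1]]))
print(encoded)
print(probs)
```

Because both the one-hot target and the softmax output have one entry per vocabulary character, they can be compared position by position.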
When searching for a format to represent the input data, it is important to choose the simplest one that is still structurally homogeneous, if possible.
Regarding music representation, the ABC format is a suitable candidate because it has a very simple structure and uses a limited number of characters, and it is a subset of the ASCII charset.
An ABC format page has mainly two components: a header and the notes.
The header contains fields such as X:[Reference number], T:[Title], M:[Meter], K:[Key], and C:[Composer].
The notes follow the header, with bars separated by the | character.
There are other elements, but with the following example, you will have an idea of how the format works, even with no music training:
The original sample is as follows:
X:1
T:Notes
M:C
L:1/4
K:C
C, D, E, F,|G, A, B, C|D E F G|A B c d|e f g a|b c' d' e'|f' g' a' b'|]
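To see the two components in a sample like the one above, the following sketch separates the header fields from the bars. It is a simplification written for this example: the name split_abc is hypothetical, it assumes the K: field closes the header, and it treats every | as a plain bar line, whereas full ABC also has other bar symbols.

```python
def split_abc(tune):
    """Split a simple ABC tune into a header dict and a list of bars."""
    header, note_lines = {}, []
    in_notes = False
    for line in tune.strip().splitlines():
        if not in_notes and len(line) > 1 and line[1] == ":":
            header[line[0]] = line[2:].strip()
            if line[0] == "K":          # assumed to be the last header field
                in_notes = True
        else:
            note_lines.append(line)
    pieces = " ".join(note_lines).split("|")
    bars = [p.strip() for p in pieces if p.strip() not in ("", "]")]
    return header, bars

tune = """X:1
T:Notes
M:C
L:1/4
K:C
C, D, E, F,|G, A, B, C|D E F G|A B c d|e f g a|b c' d' e'|f' g' a' b'|]"""

header, bars = split_abc(tune)
print(header)
print(bars)
```

The character-level model in this chapter does not need such parsing, since it consumes the raw text; the sketch is only meant to clarify the format.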
The final representation is as follows:
Bach Goldberg variations:
The Bach Goldberg Variations are a set consisting of an original aria and 30 works based on it, named after a Bach disciple, Johann Gottlieb Goldberg, who was probably their main interpreter.
In the next listing and figure, we will represent the first part of Variation no. 1 so you have an idea of the document structure we will try to emulate:
X:1
T:Variation no. 1
C:J.S.Bach
M:3/4
L:1/16
Q:500
V:2 bass
K:G
[V:1]GFG2- GDEF GAB^c |d^cd2- dABc defd |gfg2- gfed ^ceAG|
[V:2]G,,2B,A, B,2G,2G,,2G,2 |F,,2F,E, F,2D,2F,,2D,2 |E,,2E,D, E,2G,2A,,2^C2|
% (More parts with V:1 and V:2)
In this section, we will learn the new functionalities we will be using in this example.
One very important feature for real-world applications is the ability to save and retrieve whole models. TensorFlow provides this ability through the tf.train.Saver object.
The main methods of this object are the following:
tf.train.Saver(args): This is the constructor. This is a list of the main parameters:
  var_list: This is a list or dictionary of the variables to save, for example, {firstvar: var1, secondvar: var2}. If None, all the objects are saved.
  max_to_keep: This denotes the maximum number of checkpoints to maintain.
  write_version: This is the file format version; currently only 1 is valid.
tf.train.Saver.save: This method runs the ops added by the constructor for saving variables. It requires a current session and all variables to have been initialized. The main parameters are as follows:
  session: This is the session used to save the variables
  save_path: This is the path to the checkpoint filename
  global_step: This is a unique step identifier
This method returns the path where the checkpoint was saved.
tf.train.Saver.restore: This method restores the previously saved variables. The main parameters are as follows:
  session: This is the session in which the variables are to be restored
  save_path: This is a path previously returned by the save method, returned by a call to latest_checkpoint(), or provided directly
Here, we will build, with some sample code, a minimal structure for saving and retrieving two sample variables.
The following is the code to create variables:
# Create some variables.
simplevar = tf.Variable(..., name="simple")
anothervar = tf.Variable(..., name="another")
...
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, save the
# variables to disk.
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # Do some work with the model.
    ..
    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")
The original material for the network training will be the 30 works in the ABC format.
Note that the original ABC file was located at http://www.barfly.dial.pipex.com/Goldbergs.abc.
Then we use the following little program.
For this dataset, we start with the 30 works, and then we generate a list of 1000 instances of them, randomly distributed:
import random
# Split the original file into tunes; element 0 is the text before the first "X:"
works = open('original.txt', 'r').read().split('X:')
for i in range(1000):
    print("X:" + works[random.randint(1, 30)] + "\n_____________________________________\n")
And then we execute the following to get the data set:
python generate_dataset.py > input.txt
The generated dataset needs a bit of information before being useful. First, it needs the definition of the vocabulary.
The first step in the process is to find all the different characters that can be found in the original text in order to be able to dimension and fill the one-hot encoded inputs later.
In the following figure, we represent the different characters found in the ABC music format. Here you can see what's represented in the standard, with normal and special punctuation characters:
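The vocabulary step can be sketched as follows. The helper name build_vocab and the short sample string are assumptions for illustration; the alphabetical tie-breaking is added here only to make the result deterministic.

```python
import collections

def build_vocab(text):
    """Return the distinct characters (most frequent first) and a char-to-index mapping."""
    counter = collections.Counter(text)
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    chars = [char for char, _ in ordered]
    vocab = {char: index for index, char in enumerate(chars)}
    return chars, vocab

chars, vocab = build_vocab("X:1\nK:C\nC D E C|")

# Encode a short string as indices, then decode it back.
ids = [vocab[c] for c in "C D"]
decoded = "".join(chars[i] for i in ids)
print(len(chars), ids, decoded)
```

The length of chars is the vocabulary size, which fixes the width of the one-hot inputs and of the softmax output.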
The model for this RNN is described in the following lines, and it is a multilayer LSTM with initial zero state:
cell_fn = rnn_cell.BasicLSTMCell
cell = cell_fn(args.rnn_size, state_is_tuple=True)
self.cell = cell = rnn_cell.MultiRNNCell([cell] * args.num_layers, state_is_tuple=True)
self.input_data = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
self.targets = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
self.initial_state = cell.zero_state(args.batch_size, tf.float32)
with tf.variable_scope('rnnlm'):
    softmax_w = tf.get_variable("softmax_w", [args.rnn_size, args.vocab_size])
    softmax_b = tf.get_variable("softmax_b", [args.vocab_size])
    with tf.device("/cpu:0"):
        embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
        inputs = tf.split(1, args.seq_length, tf.nn.embedding_lookup(embedding, self.input_data))
        inputs = [tf.squeeze(input_, [1]) for input_ in inputs]
def loop(prev, _):
    prev = tf.matmul(prev, softmax_w) + softmax_b
    prev_symbol = tf.stop_gradient(tf.argmax(prev, 1))
    return tf.nn.embedding_lookup(embedding, prev_symbol)
outputs, last_state = seq2seq.rnn_decoder(inputs, self.initial_state, cell, loop_function=loop if infer else None, scope='rnnlm')
output = tf.reshape(tf.concat(1, outputs), [-1, args.rnn_size])
The loss function is defined by the sequence_loss_by_example function. It computes the cross-entropy between the predicted character distribution and the targets, which is directly related to perplexity, a measure of how well a probability distribution predicts a sample. This measure is used extensively in language models:
self.logits = tf.matmul(output, softmax_w) + softmax_b
self.probs = tf.nn.softmax(self.logits)
loss = seq2seq.sequence_loss_by_example([self.logits],
[tf.reshape(self.targets, [-1])],
[tf.ones([args.batch_size * args.seq_length])],
args.vocab_size)
self.cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
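The relation between the averaged cross-entropy and perplexity can be shown with a small NumPy sketch. The function names and the toy distributions are assumptions for illustration; the model itself relies on the TensorFlow loss above.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-probability assigned to the correct characters."""
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked))

def perplexity(probs, targets):
    """Perplexity is the exponential of the mean cross-entropy."""
    return np.exp(cross_entropy(probs, targets))

vocab_size = 4
targets = np.array([0, 2, 3])

# A model that knows nothing spreads probability evenly over the vocabulary.
uniform = np.full((3, vocab_size), 1.0 / vocab_size)
# A model that puts most of the probability on the correct character.
confident = np.full((3, vocab_size), 0.01)
confident[np.arange(3), targets] = 0.97

print(perplexity(uniform, targets), perplexity(confident, targets))
```

A uniform guess over four characters gives a perplexity equal to the vocabulary size, while a model that concentrates probability on the right character approaches 1.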
In order to run the program, first you run the training script using the following code:
python train.py
Then you run the sample program with the following code:
python sample.py
Configuring a prime of X:1\n, which is a plausible initialization character sequence, we obtain, depending on the depth (recommended 3) and the length (recommended 512) of the RNN, an almost recognizable complete composition.
The following music sheet was obtained by pasting the resulting character sequence into http://www.drawthedots.com/ and applying simple character corrections, based on on-site diagnostics:
The following is the complete source code (train.py):
from __future__ import print_function
import numpy as np
import tensorflow as tf
import argparse
import time
import os
from six.moves import cPickle
from utils import TextLoader
from model import Model

class arguments:
    def __init__(self):
        return

def main():
    args = arguments()
    train(args)

def train(args):
    args.data_dir = 'data/'
    args.save_dir = 'save'
    args.rnn_size = 64
    args.num_layers = 1
    args.batch_size = 50
    args.seq_length = 50
    args.num_epochs = 5
    args.save_every = 1000
    args.grad_clip = 5.
    args.learning_rate = 0.002
    args.decay_rate = 0.97
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
    args.vocab_size = data_loader.vocab_size
    with open(os.path.join(args.save_dir, 'config.pkl'), 'wb') as f:
        cPickle.dump(args, f)
    with open(os.path.join(args.save_dir, 'chars_vocab.pkl'), 'wb') as f:
        cPickle.dump((data_loader.chars, data_loader.vocab), f)
    model = Model(args)
    with tf.Session() as sess:
        tf.initialize_all_variables().run()
        saver = tf.train.Saver(tf.all_variables())
        for e in range(args.num_epochs):
            sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** e)))
            data_loader.reset_batch_pointer()
            state = sess.run(model.initial_state)
            for b in range(data_loader.num_batches):
                start = time.time()
                x, y = data_loader.next_batch()
                feed = {model.input_data: x, model.targets: y}
                for i, (c, h) in enumerate(model.initial_state):
                    feed[c] = state[i].c
                    feed[h] = state[i].h
                train_loss, state, _ = sess.run([model.cost, model.final_state, model.train_op], feed)
                end = time.time()
                print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}" \
                    .format(e * data_loader.num_batches + b,
                            args.num_epochs * data_loader.num_batches,
                            e, train_loss, end - start))
                if (e==args.num_epochs-1 and b == data_loader.num_batches-1): # save for the last result
                    checkpoint_path = os.path.join(args.save_dir, 'model.ckpt')
                    saver.save(sess, checkpoint_path, global_step = e * data_loader.num_batches + b)
                    print("model saved to {}".format(checkpoint_path))

if __name__ == '__main__':
    main()
The following is the complete source code (model.py):
import tensorflow as tf
from tensorflow.python.ops import rnn_cell
from tensorflow.python.ops import seq2seq
import numpy as np

class Model():
    def __init__(self, args, infer=False):
        self.args = args
        if infer: #When we sample, the batch and sequence length are = 1
            args.batch_size = 1
            args.seq_length = 1
        cell_fn = rnn_cell.BasicLSTMCell #Define the internal cell structure
        cell = cell_fn(args.rnn_size, state_is_tuple=True)
        self.cell = cell = rnn_cell.MultiRNNCell([cell] * args.num_layers, state_is_tuple=True)
        #Build the inputs and outputs placeholders, and start with a zero internal values
        self.input_data = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
        self.targets = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
        self.initial_state = cell.zero_state(args.batch_size, tf.float32)
        with tf.variable_scope('rnnlm'):
            softmax_w = tf.get_variable("softmax_w", [args.rnn_size, args.vocab_size]) #Final w
            softmax_b = tf.get_variable("softmax_b", [args.vocab_size]) #Final bias
            with tf.device("/cpu:0"):
                embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
                inputs = tf.split(1, args.seq_length, tf.nn.embedding_lookup(embedding, self.input_data))
                inputs = [tf.squeeze(input_, [1]) for input_ in inputs]

        def loop(prev, _):
            prev = tf.matmul(prev, softmax_w) + softmax_b
            prev_symbol = tf.stop_gradient(tf.argmax(prev, 1))
            return tf.nn.embedding_lookup(embedding, prev_symbol)

        outputs, last_state = seq2seq.rnn_decoder(inputs, self.initial_state, cell, loop_function=loop if infer else None, scope='rnnlm')
        output = tf.reshape(tf.concat(1, outputs), [-1, args.rnn_size])
        self.logits = tf.matmul(output, softmax_w) + softmax_b
        self.probs = tf.nn.softmax(self.logits)
        loss = seq2seq.sequence_loss_by_example([self.logits],
                [tf.reshape(self.targets, [-1])],
                [tf.ones([args.batch_size * args.seq_length])],
                args.vocab_size)
        self.cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
        self.final_state = last_state
        self.lr = tf.Variable(0.0, trainable=False)
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), args.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.lr)
        self.train_op = optimizer.apply_gradients(zip(grads, tvars))

    def sample(self, sess, chars, vocab, num=200, prime='START', sampling_type=1):
        state = sess.run(self.cell.zero_state(1, tf.float32))
        for char in prime[:-1]:
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state:state}
            [state] = sess.run([self.final_state], feed)

        def weighted_pick(weights):
            t = np.cumsum(weights)
            s = np.sum(weights)
            return(int(np.searchsorted(t, np.random.rand(1)*s)))

        ret = prime
        char = prime[-1]
        for n in range(num):
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state:state}
            [probs, state] = sess.run([self.probs, self.final_state], feed)
            p = probs[0]
            sample = weighted_pick(p)
            pred = chars[sample]
            ret += pred
            char = pred
        return ret
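The weighted_pick helper in the listing above is what makes sampling stochastic: it draws a character index with probability proportional to its weight instead of always taking the most likely one. The standalone sketch below shows the same idea; passing an explicit random generator and the example weights are choices made here for reproducibility, not part of the original code.

```python
import numpy as np

def weighted_pick(weights, rng):
    """Draw an index with probability proportional to its weight."""
    cumulative = np.cumsum(weights)
    threshold = rng.random_sample() * cumulative[-1]   # uniform in [0, total weight)
    return int(np.searchsorted(cumulative, threshold))

rng = np.random.RandomState(0)
weights = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output for three characters
picks = [weighted_pick(weights, rng) for _ in range(1000)]
counts = np.bincount(picks, minlength=len(weights))
print(counts)
```

Over many draws, the counts follow the weights, so frequent continuations dominate while rarer ones still appear occasionally, which keeps the generated tune from repeating a single pattern.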
The following is the complete source code (sample.py):
from __future__ import print_function
import numpy as np
import tensorflow as tf
import time
import os
from six.moves import cPickle
from utils import TextLoader
from model import Model
from six import text_type

class arguments: #Generate the arguments class
    save_dir= 'save'
    n=1000
    prime='x:1\n'
    sample=1

def main():
    args = arguments()
    sample(args) #Pass the argument object

def sample(args):
    with open(os.path.join(args.save_dir, 'config.pkl'), 'rb') as f:
        saved_args = cPickle.load(f) #Load the config from the standard file
    with open(os.path.join(args.save_dir, 'chars_vocab.pkl'), 'rb') as f:
        chars, vocab = cPickle.load(f) #Load the vocabulary
    model = Model(saved_args, True) #Rebuild the model
    with tf.Session() as sess:
        tf.initialize_all_variables().run()
        saver = tf.train.Saver(tf.all_variables())
        ckpt = tf.train.get_checkpoint_state(args.save_dir) #Retrieve the chkpoint
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path) #Restore the model
            print(model.sample(sess, chars, vocab, args.n, args.prime, args.sample))
            #Execute the model, generating a n char sequence
            #starting with the prime sequence

if __name__ == '__main__':
    main()
The following is the complete source code (utils.py):
import codecs
import os
import collections
from six.moves import cPickle
import numpy as np

class TextLoader():
    def __init__(self, data_dir, batch_size, seq_length, encoding='utf-8'):
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.encoding = encoding
        input_file = os.path.join(data_dir, "input.txt")
        vocab_file = os.path.join(data_dir, "vocab.pkl")
        tensor_file = os.path.join(data_dir, "data.npy")
        if not (os.path.exists(vocab_file) and os.path.exists(tensor_file)):
            print("reading text file")
            self.preprocess(input_file, vocab_file, tensor_file)
        else:
            print("loading preprocessed files")
            self.load_preprocessed(vocab_file, tensor_file)
        self.create_batches()
        self.reset_batch_pointer()

    def preprocess(self, input_file, vocab_file, tensor_file):
        with codecs.open(input_file, "r", encoding=self.encoding) as f:
            data = f.read()
        counter = collections.Counter(data)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        self.chars, _ = zip(*count_pairs)
        self.vocab_size = len(self.chars)
        self.vocab = dict(zip(self.chars, range(len(self.chars))))
        with open(vocab_file, 'wb') as f:
            cPickle.dump(self.chars, f)
        self.tensor = np.array(list(map(self.vocab.get, data)))
        np.save(tensor_file, self.tensor)

    def load_preprocessed(self, vocab_file, tensor_file):
        with open(vocab_file, 'rb') as f:
            self.chars = cPickle.load(f)
        self.vocab_size = len(self.chars)
        self.vocab = dict(zip(self.chars, range(len(self.chars))))
        self.tensor = np.load(tensor_file)
        self.num_batches = int(self.tensor.size / (self.batch_size *
                                                   self.seq_length))

    def create_batches(self):
        self.num_batches = int(self.tensor.size / (self.batch_size *
                                                   self.seq_length))
        self.tensor = self.tensor[:self.num_batches * self.batch_size * self.seq_length]
        xdata = self.tensor
        ydata = np.copy(self.tensor)
        ydata[:-1] = xdata[1:]
        ydata[-1] = xdata[0]
        self.x_batches = np.split(xdata.reshape(self.batch_size, -1), self.num_batches, 1)
        self.y_batches = np.split(ydata.reshape(self.batch_size, -1), self.num_batches, 1)

    def next_batch(self):
        x, y = self.x_batches[self.pointer], self.y_batches[self.pointer]
        self.pointer += 1
        return x, y

    def reset_batch_pointer(self):
        self.pointer = 0
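The key idea in create_batches is that the target sequence is the input sequence shifted by one character, so the network learns to predict the next character at every position. The following standalone sketch reproduces that logic on a toy array; the function name make_batches and the use of np.roll for the shift are choices made for this illustration.

```python
import numpy as np

def make_batches(tensor, batch_size, seq_length):
    """Build (input, target) batches where each target is the next element of the input."""
    num_batches = tensor.size // (batch_size * seq_length)
    tensor = tensor[:num_batches * batch_size * seq_length]   # drop the incomplete tail
    xdata = tensor
    ydata = np.roll(tensor, -1)     # shift left by one; the last target wraps to the first element
    x_batches = np.split(xdata.reshape(batch_size, -1), num_batches, 1)
    y_batches = np.split(ydata.reshape(batch_size, -1), num_batches, 1)
    return x_batches, y_batches

# 25 toy character indices, batches of 2 sequences with 3 characters each.
x_batches, y_batches = make_batches(np.arange(25), batch_size=2, seq_length=3)
print(x_batches[0])
print(y_batches[0])
```

Each row of a batch continues in the same row of the next batch, which is why the training loop can carry the LSTM state from one batch to the next.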
In this chapter, we reviewed one of the most recent neural network architectures, recurrent neural networks, completing the panorama of the mainstream approaches in the machine learning field.
In the following chapter, we will research the different neural network layer type combinations appearing in state of the art implementations and cover some new interesting experimental models.