Recurrent Neural Networks

In this lab we will experiment with recurrent neural networks. These are a useful type of model for predicting sequences or handling sequences as inputs. We will implement ours in Keras+Tensorflow, but many implementations and variants can be found online. Here are installation instructions for Keras: https://keras.io/#installation, and here are installation instructions for Tensorflow: https://github.com/tensorflow/tensorflow#download-and-setup. You should also be able to run both from a Docker container.

We will take a set of 10,000 image descriptions from the MS-COCO dataset (which contains around 400,000 sentences in total) and make our recurrent network learn how to compose new sentences character by character. You can download the data here: http://www.cs.virginia.edu/~vicente/recognition/captions_train.txt.zip

First, let's import libraries and make sure you have everything properly installed.

In [1]:
import tensorflow as tf
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.layers.wrappers import TimeDistributed
Using TensorFlow backend.
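
If the imports above run without errors you are in good shape. As an extra check (a small optional sketch; the exact version strings will vary with your installation), you can print the installed versions of both libraries:

In [ ]:
# Optional: verify which versions of Tensorflow and Keras are installed.
import tensorflow as tf
import keras
print(tf.__version__)
print(keras.__version__)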

1. Preprocessing the Text

We will first read the sentences and map each character to a unique identifier so that we can treat each sentence as an array of character ids. The code below loads the captions from a text file and places them inside a caption tensor of size numCaptions x maxCaptionLength x charVocabularySize. We also create a second tensor that contains the same sentences shifted by one character (the prediction targets). Each character is mapped to an incremental ID, so we keep two hashmaps to convert from character to id and back.

In [2]:
# Read captions into a python list.
maxSamples = 10000
captions = []
fopen = open('captions_train.txt', 'r')
iterator = 0
for line in fopen:
    if iterator < maxSamples:
        captions.append(line.lower().strip())
        iterator += 1
fopen.close()
    
# Compute a char2id and id2char vocabulary.
char2id = {}
id2char = {}
charIndex = 0
for caption in captions: 
    for char in caption:
        if char not in char2id:
            char2id[char] = charIndex
            id2char[charIndex] = char
            charIndex += 1

# Add a special starting and ending character to the dictionary.
char2id['S'] = charIndex; id2char[charIndex] = 'S'  # Special sentence start character.
char2id['E'] = charIndex + 1; id2char[charIndex + 1] = 'E'  # Special sentence ending character.
            
# Place captions inside tensors.
maxSequenceLength = 1 + max([len(x) for x in captions])
# inputChars has one-hot encodings for every character, for every caption.
inputChars = np.zeros((len(captions), maxSequenceLength, len(char2id)), dtype=np.bool)
# nextChars has one-hot encodings for every character for every caption (shifted by one).
nextChars = np.zeros((len(captions), maxSequenceLength, len(char2id)), dtype=np.bool)
for i in range(0, len(captions)):
    inputChars[i, 0, char2id['S']] = 1
    nextChars[i, 0, char2id[captions[i][0]]] = 1
    for j in range(1, maxSequenceLength):
        if j < len(captions[i]) + 1:
            inputChars[i, j, char2id[captions[i][j - 1]]] = 1
            if j < len(captions[i]):
                nextChars[i, j, char2id[captions[i][j]]] = 1
            else:
                nextChars[i, j, char2id['E']] = 1
        else:
            inputChars[i, j, char2id['E']] = 1
            nextChars[i, j, char2id['E']] = 1

print("input:")
print(inputChars.shape)  # Print the size of the inputCharacters tensor.
print("output:")
print(nextChars.shape)  # Print the size of the nextCharacters tensor.
print("char2id:")
print(char2id)  # Print the character to ids mapping.
input:
(10000, 173, 38)
output:
(10000, 173, 38)
char2id:
{' ': 1, '1': 34, '0': 32, '3': 30, '2': 27, '5': 28, '4': 29, '7': 35, '6': 33, '9': 31, 'E': 37, 'S': 36, 'a': 0, 'c': 6, 'b': 15, 'e': 3, 'd': 9, 'g': 22, 'f': 18, 'i': 17, 'h': 16, 'k': 19, 'j': 25, 'm': 13, 'l': 7, 'o': 11, 'n': 8, 'q': 24, 'p': 14, 's': 20, 'r': 4, 'u': 21, 't': 12, 'w': 10, 'v': 2, 'y': 5, 'x': 26, 'z': 23}
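
As a quick illustration of how the two hashmaps are used (a small sketch; the exact ids depend on the vocabulary printed above), you can map a short string to ids and back:

In [ ]:
# Map a short string to character ids and back using the two hashmaps.
example = 'a cat'
exampleIds = [char2id[c] for c in example]
print(exampleIds)  # e.g. [0, 1, 6, 0, 12] with the mapping printed above.
print(''.join([id2char[i] for i in exampleIds]))  # Recovers 'a cat'.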

Note: To show clearly how inputChars and nextChars store the sequences, let's print a sentence back from its stored format in these two arrays.

In [3]:
trainCaption = inputChars[25, :, :]  # Pick some caption
labelCaption = nextChars[25, :, :]  # Pick what we are trying to predict.

def printCaption(sampleCaption):
    charIds = np.zeros(sampleCaption.shape[0], dtype=np.int32)  # One character id per time step.
    for (idx, elem) in enumerate(sampleCaption):
        charIds[idx] = np.nonzero(elem)[0].squeeze()  # Index of the single nonzero (one-hot) entry.
    print(np.array([id2char[x] for x in charIds]))

printCaption(trainCaption)
printCaption(labelCaption)
['S' 'a' ' ' 'b' 'i' 'c' 'y' 'c' 'l' 'e' ' ' 'i' 's' ' ' 'p' 'a' 'r' 'k'
 'e' 'd' ' ' 'b' 'y' ' ' 'a' ' ' 'b' 'e' 'n' 'c' 'h' ' ' 'a' 't' ' ' 'n'
 'i' 'g' 'h' 't' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E']
['a' ' ' 'b' 'i' 'c' 'y' 'c' 'l' 'e' ' ' 'i' 's' ' ' 'p' 'a' 'r' 'k' 'e'
 'd' ' ' 'b' 'y' ' ' 'a' ' ' 'b' 'e' 'n' 'c' 'h' ' ' 'a' 't' ' ' 'n' 'i'
 'g' 'h' 't' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E'
 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E' 'E']

In the above output, you will notice that the sentences are indeed shifted. This is because we are going to predict the next character at each time step. The first input character is 'S', which marks the start of the sentence, and the corresponding target character is 'a', the first actual character of the sentence. For the later characters in the sentence, the model will also use the "history" of all previous characters to decide what comes next.
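
To make the one-step shift concrete, the short sketch below (reusing trainCaption and labelCaption from the cell above) prints the (input, target) character pair for the first few time steps; you should see each input character paired with the character that follows it:

In [ ]:
# Print (input, target) character pairs for the first few time steps of caption 25.
for t in range(10):
    inChar = id2char[trainCaption[t].argmax()]
    outChar = id2char[labelCaption[t].argmax()]
    print(inChar + ' -> ' + outChar)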

2. Building our model using an LSTM Recurrent Network.

Next we will create a recurrent neural network in Keras that takes an input batch of one-hot encoded characters of size (batch_size, maxSequenceLength, charVocabularySize). The output of this network is a tensor of the same size, (batch_size, maxSequenceLength, charVocabularySize). However, the output does not contain one-hot encodings: it contains a probability distribution (the output of a softmax) over the character vocabulary for every time step in the sequence. We will see in section 4 how to decode a sequence from these distributions; you can simply take the character corresponding to the index with the highest probability at every time step.

In [5]:
print('Building training model...')
hiddenStateSize = 128
hiddenLayerSize = 128
model = Sequential()
# The output of the LSTM layer are the hidden states of the LSTM for every time step. 
model.add(LSTM(hiddenStateSize, return_sequences = True, input_shape=(maxSequenceLength, len(char2id))))
# Two things to notice here:
# 1. The Dense Layer is equivalent to nn.Linear(hiddenStateSize, hiddenLayerSize) in Torch.
#    In Keras, we often do not need to specify the input size of the layer because it gets inferred for us.
# 2. TimeDistributed applies the linear transformation from the Dense layer to every time step
#    of the output of the sequence produced by the LSTM.
model.add(TimeDistributed(Dense(hiddenLayerSize)))
model.add(TimeDistributed(Activation('relu'))) 
model.add(TimeDistributed(Dense(len(char2id))))  # Add another dense layer with the desired output size.
model.add(TimeDistributed(Activation('softmax')))
# We also specify here the optimization we will use, in this case we use RMSprop with learning rate 0.001.
# RMSprop is commonly used for RNNs instead of regular SGD.
# See this blog for info on RMSprop (http://sebastianruder.com/optimizing-gradient-descent/index.html#rmsprop)
# categorical_crossentropy is the same loss used for classification problems using softmax. (nn.ClassNLLCriterion)
model.compile(loss='categorical_crossentropy', optimizer = RMSprop(lr=0.001))

print(model.summary()) # Convenient function to see details about the network model.

# Test a simple prediction on a batch for this model.
print("Sample input Batch size:"),
print(inputChars[0:32, :, :].shape)
print("Sample input Batch labels (nextChars):"),
print(nextChars[0:32, :, :].shape)
outputs = model.predict(inputChars[0:32, :, :])
print("Output Sequence size:"),
print(outputs.shape)
Building training model...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lstm_2 (LSTM)                    (None, 173, 128)      85504       lstm_input_2[0][0]               
____________________________________________________________________________________________________
timedistributed_5 (TimeDistribute(None, 173, 128)      16512       lstm_2[0][0]                     
____________________________________________________________________________________________________
timedistributed_6 (TimeDistribute(None, 173, 128)      0           timedistributed_5[0][0]          
____________________________________________________________________________________________________
timedistributed_7 (TimeDistribute(None, 173, 38)       4902        timedistributed_6[0][0]          
____________________________________________________________________________________________________
timedistributed_8 (TimeDistribute(None, 173, 38)       0           timedistributed_7[0][0]          
====================================================================================================
Total params: 106918
____________________________________________________________________________________________________
None
Sample input Batch size: (32, 173, 38)
Sample input Batch labels (nextChars): (32, 173, 38)
Output Sequence size: (32, 173, 38)
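
As a sanity check on the summary above, the parameter counts can be reproduced by hand with the standard LSTM and Dense formulas (a small sketch using the variables defined earlier):

In [ ]:
# An LSTM has 4 gates, each with an (input + hidden) x hidden weight matrix plus a hidden-sized bias.
vocabularySize = len(char2id)  # 38
lstmParams = 4 * ((vocabularySize + hiddenStateSize) * hiddenStateSize + hiddenStateSize)
dense1Params = hiddenStateSize * hiddenLayerSize + hiddenLayerSize
dense2Params = hiddenLayerSize * vocabularySize + vocabularySize
print(lstmParams)    # 85504, matches the LSTM layer.
print(dense1Params)  # 16512, matches the first Dense layer.
print(dense2Params)  # 4902, matches the second Dense layer.
print(lstmParams + dense1Params + dense2Params)  # 106918 total parameters.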

3. Training the Model

Keras already implements a generic training loop through the model.fit function, but it also provides model.train_on_batch if you want to write the training for loop yourself. For more information about Keras model functionality, see: https://keras.io/models/model/

If you installed Tensorflow with GPU support, this will automatically run on the GPU.

In [ ]:
model.fit(inputChars, nextChars, batch_size = 128, nb_epoch = 10)
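
If you prefer to write the training loop yourself (for example, to print the loss after every batch), a minimal sketch using model.train_on_batch could look like the following; shuffling is omitted for brevity:

In [ ]:
# Hand-written training loop roughly equivalent to the model.fit call above (sketch).
batchSize = 128
numEpochs = 10
numBatches = len(captions) // batchSize
for epoch in range(numEpochs):
    for b in range(numBatches):
        batchInputs = inputChars[b * batchSize:(b + 1) * batchSize]
        batchLabels = nextChars[b * batchSize:(b + 1) * batchSize]
        loss = model.train_on_batch(batchInputs, batchLabels)
    print('epoch ' + str(epoch) + ', last batch loss: ' + str(loss))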

4. Verifying the Model is indeed Learning

Here we input an arbitrary caption from the training set (one-hot encoded), compute the output using the trained model, and decode this output back into a char array. Ideally we should see the same input caption shifted by one character. However, you would need to run the training code for around 24 hours straight to get the model close to that point (it is ok if you only run the training for 10 epochs for the purposes of this lab).

In [ ]:
# Test a simple prediction on a batch for this model.
captionId = 132

inputCaption = inputChars[captionId:captionId+1, :, :]
outputs = model.predict(inputCaption)

printCaption(inputCaption[0])
print([id2char[x.argmax()] for x in outputs[0, :, :]])
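
To get a rough quantitative sense of how close the model is, you can also compare the predicted characters against the targets at every time step (a quick sketch; note that the trailing 'E' padding inflates this number, and the value you get depends on how long you trained):

In [ ]:
# Fraction of time steps where the predicted character matches the target.
predictedIds = outputs[0].argmax(axis=-1)
targetIds = nextChars[captionId].argmax(axis=-1)
print(float(np.mean(predictedIds == targetIds)))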

5. Building the Inference Model.

We verified in the previous section that the model is at least somewhat working on training data. However, we want to be able to create new sentences from this model starting from scratch, using the parameters of the trained model to produce text character by character. Here we build such a model and simply copy the parameters from the trained model above. We show in the following section (section 6) how to produce sentences using this inference_model. Please pay attention to the comments in the code below to see how it differs from the model used at training time.

In [ ]:
# The main difference with the "training model" is that here the input sequence has 
# a length of one, because we will predict character by character.
print('Building Inference model...')
inference_model = Sequential()
# Two differences here:
# 1. The inference model takes only one sample per batch, and its sequence length is always 1.
# 2. The inference model is stateful, meaning it carries its hidden state (its "history")
#    over from one predict call to the next, instead of resetting it for every input.
inference_model.add(LSTM(hiddenStateSize, batch_input_shape=(1, 1, len(char2id)), stateful = True))
# Since the above LSTM does not output sequences, we don't need TimeDistributed anymore.
inference_model.add(Dense(hiddenLayerSize))
inference_model.add(Activation('relu'))
inference_model.add(Dense(len(char2id)))
inference_model.add(Activation('softmax'))

# Copy the weights of the trained network. Both should have the same exact number of parameters (why?).
inference_model.set_weights(model.get_weights())

# Given the start Character 'S' (one-hot encoded), predict the next most likely character.
startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1
nextCharProbabilities = inference_model.predict(startChar)

# print the most probable character that goes next.
print(id2char[nextCharProbabilities.argmax()])

6. Sampling a Complete New Sentence

Now that we have our inference_model working, we can produce new sentences by randomly sampling from the next-character probabilities one step at a time. We rely on the np.random.multinomial function from numpy; please check the documentation and make sure you understand what it does: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multinomial.html
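
As a tiny standalone example of what np.random.multinomial does (a sketch with a made-up probability vector), one draw returns a one-hot count vector whose argmax is the sampled outcome:

In [ ]:
# Draw one sample from a categorical distribution over four outcomes.
probs = np.array([0.1, 0.2, 0.3, 0.4])
draw = np.random.multinomial(1, probs)
print(draw)           # e.g. [0 0 1 0]; which entry is 1 changes from run to run.
print(draw.argmax())  # Index of the sampled outcome.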

In [ ]:
inference_model.reset_states()  # This makes sure the initial hidden state is cleared every time.

startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1

for i in range(0, maxSequenceLength):
    nextCharProbs = inference_model.predict(startChar)
    
    # In theory we could pass nextCharProbs to np.random.multinomial directly, but we cast to
    # float64 and re-normalize so that the probabilities sum to exactly 1.0 (avoids a numpy error).
    nextCharProbs = np.asarray(nextCharProbs).astype('float64')
    nextCharProbs = nextCharProbs / nextCharProbs.sum()

    nextCharId = np.random.multinomial(1, nextCharProbs.squeeze(), 1).argmax()
    print(id2char[nextCharId], end='')  # end='' avoids printing a newline after each character.
    startChar.fill(0)
    startChar[0, 0, nextCharId] = 1

Notice how the model learns to always predict 'E' once it has produced the first 'E', and does not generate any other character after that. In practice we can stop the for loop as soon as we encounter the first 'E' (see the sketch below); this produces sentences of arbitrary length, meaning our model has learned when to finish a sentence. The sentences might not be perfect at this point in training, but the model has probably already learned to produce basic words like "a", "the", "and" or "with"; it will still produce pseudo-words that look like words but are not actual words. Try running the above code several times; the sentences will probably sound funny if you read them. If you keep training the model for longer, they should get better and better.
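
For convenience, here is a sketch of the same sampling loop that stops as soon as 'E' is drawn and accumulates the result into a plain string (essentially the loop above with an early break):

In [ ]:
# Sample one sentence and stop at the first 'E' (sketch).
inference_model.reset_states()
startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1
sentence = ''
for i in range(0, maxSequenceLength):
    nextCharProbs = inference_model.predict(startChar)
    nextCharProbs = np.asarray(nextCharProbs).astype('float64')
    nextCharProbs = nextCharProbs / nextCharProbs.sum()
    nextCharId = np.random.multinomial(1, nextCharProbs.squeeze(), 1).argmax()
    if id2char[nextCharId] == 'E':
        break  # The model decided the sentence is over.
    sentence += id2char[nextCharId]
    startChar.fill(0)
    startChar[0, 0, nextCharId] = 1
print(sentence)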

Lab Questions (8 pts)

  1. In section 3, how long did it take you to train one epoch on average? How long did it take to train for 10 epochs? What was your hardware setup? (0.5pts)

  2. In section 5 we predicted the next character after the starting character 'S' from the output probability distribution. Modify the code to print the 10 most probable characters at the beginning of a sentence. Show the list of characters and their associated probabilities of appearing as the first character of a sentence. (0.5pts)

  3. Print here five sentences that you obtained from section 6, each as a string (not as an array and without the 'E' characters). (0.5pts)

  4. In section 6, what happens if you remove inference_model.reset_states() from the code? Try removing it and running section 6 code multiple times. Why do you get this effect? (0.5pts)

  5. I have trained this model on a GPU for a few thousand epochs (until the loss went down to around 0.17) and obtained the following weight parameters: weights-vicente.hdf5. Try loading these weights into your model and producing five sentences (see load_weights in Keras). Include five sentences here as strings: (2pts)

  6. Keras already includes an example of how to generate text character by character (using Nietzsche's writings as training text) here: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py. Please describe the differences between that model and the model implemented in this lab. (2pts)

  7. Include any thoughts you have about other possible uses of this type of model. (For instance, instead of having a one-hot encoding vector for the starting character 'S' as your input, you could use the output of a convolutional neural network applied to an image as the input -- this is the most popular model for generating image captions these days.) (2pts)

Optional (2pts)

  1. Try to improve the model presented here, for example by changing batch_size, hiddenStateSize, or hiddenLayerSize, or by adding a Dropout layer, a Batch Normalization layer, etc. With the right combination you could reach a much lower loss value much faster.

  2. Train the model in this lab using Nietzsche's writings from the Keras text generation example (you might have to split the text into sentences).

If you find any errors or omissions in this material please contact me at vicente@cs.virginia.edu