Natural Language Generation Lab

In this lab we will experiment with recurrent neural networks (RNNs), a type of model well suited to predicting sequences or taking sequences as inputs. We will use Keras with Tensorflow; follow the installation instructions for each library before starting. Keras is a high-level front-end for Tensorflow that lets you build models from ready-made layers on top of the primitive operations implemented in Tensorflow.

We will take the captions of the training images in the MS-COCO dataset (~400k sentences) and make our recurrent network learn to compose new sentences word by word.

First, let's import libraries and make sure you have everything properly installed.

In [ ]:
import tensorflow as tf
import numpy as np
import random, json, string
import keras
import keras.layers
import keras.models
import keras.optimizers
from keras.layers.wrappers import TimeDistributed
import keras.layers.embeddings
import keras.preprocessing.text
import keras.preprocessing.sequence
import keras.callbacks

1. Loading and Preprocessing the Text

We will first read the sentences from the MS-COCO dataset. The captions file was downloaded from the MS-COCO website and contains ~5 descriptions for each of 80,000 images, for a total of ~400k descriptions.

In [ ]:
mscoco = json.load(open('annotations/captions_train2014.json'))
captionStrings = ['[START] ' + entry['caption'].encode('ascii') for entry in mscoco['annotations']]

print('Number of sentences', len(captionStrings))
print('First sentence in the list', captionStrings[0])

1.1 Defining a word vocabulary: Next, we define a vocabulary and assign a word id to each unique word in this dataset. We keep only the 1000 most common words in these captions. Then we can transform each sentence into an array of word ids. This preprocessing is already implemented in the Keras Tokenizer class:

In [ ]:
vocabularySize = 1000  # vocabulary size.

# Split sentences into words, and define a vocabulary with the most common words.
tokenizer = keras.preprocessing.text.Tokenizer(nb_words = vocabularySize, \
                                               filters = '!"#$%&()*+,-./:;<=>?@\\^_`{|}~\t\n')
tokenizer.fit_on_texts(captionStrings)  # Build the vocabulary; required before texts_to_sequences.

# Convert the sentences into sequences of word ids using our vocabulary.
captionSequences = tokenizer.texts_to_sequences(captionStrings)

# Keep dictionaries that map ids -> words, and words -> ids.
word2id = tokenizer.word_index
id2word = {idx: word for (word, idx) in word2id.items()}
maxSequenceLength = max([len(seq) for seq in captionSequences])  # Find the sentence with most words.

# Print some output to verify the above.
print('Original string', captionStrings[0])
print('Sequence of Word Ids', captionSequences[0])
print('Word Ids back to Words', " ".join([id2word[idx] for idx in captionSequences[0]]))
print('Max Sequence Length', maxSequenceLength)
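
To make the vocabulary step concrete, here is a toy pure-Python sketch of the idea behind it: count words, keep the most frequent ones, and replace each word by its id. It is a simplified illustration and not the Keras implementation; the helper names build_vocab and to_sequences and the toy sentences are made up for this example. Like the Tokenizer, it reserves id 0 and silently drops words that fall outside the vocabulary.

```python
from collections import Counter

def build_vocab(sentences, vocab_size):
    """Map the most frequent words to ids 1..vocab_size-1 (id 0 stays reserved)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked[:vocab_size - 1], start=1)}

def to_sequences(sentences, word2id):
    """Replace each known word with its id; words outside the vocabulary are dropped."""
    return [[word2id[w] for w in s.lower().split() if w in word2id] for s in sentences]

# Hypothetical toy captions, only for illustration.
toy_sentences = ['[START] a cat sits', '[START] a dog sits', '[START] a cat runs']
toy_word2id = build_vocab(toy_sentences, vocab_size=5)
toy_sequences = to_sequences(toy_sentences, toy_word2id)
```

With a vocabulary of only four words, the rare words "dog" and "runs" disappear from the encoded sentences, which is the same thing that happens to rare caption words when vocabularySize is 1000.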

1.2 Padding: Another piece of pre-processing that we might need is padding the sequences with zeroes so that all sequences have the same length and we can put them in a single matrix. This is implemented in Keras using the pad_sequences function.

In [ ]:
# By default it pads with zeroes at the beginning (why would that be preferable?), but we are overriding
# that default behavior by using padding = 'post'.
data = keras.preprocessing.sequence.pad_sequences(captionSequences, 
           maxlen = (maxSequenceLength + 1), padding = 'post', truncating = 'post')

id2word[0] = 'END'
word2id['END'] = 0

# Let's print some output.
print(data.shape)  # This is num_sentences x (maxSequenceLength + 1).
# Let's try converting back the first sequence into words again.
print(" ".join([id2word[idx] for idx in data[0]]))
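
As a rough sketch of what post-padding does, the following pure-Python function mimics the padding = 'post', truncating = 'post' behavior on a few toy sequences. The name pad_post is hypothetical and this is not the Keras implementation, only an illustration of the resulting layout.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-truncate and right-pad every sequence to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                       # drop anything past maxlen
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

# Toy word-id sequences; the longest has 4 words, so we pad to 4 + 1 columns.
toy_padded = pad_post([[1, 2, 3, 4], [1, 2, 4], [1]], maxlen=5)
```

Padding to one more than the longest sequence guarantees that every row, including the longest one, ends with at least one 0, which is the id we map to 'END'.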

2. Building our model using a Recurrent Neural Network.

Next we will use Keras to create a recurrent neural network that takes a batch of word-id sequences of size (batch_size, maxSequenceLength) and outputs a tensor of size (batch_size, maxSequenceLength, vocabularySize). Notice that the output has a different size than the input: it contains a pseudo-probability distribution (the output of a softmax layer) for every time step in the sequence. That is, at each time step it outputs the probability of each word in the vocabulary being the next word.

In [ ]:
print('Building training model...')

# Remember that in libraries like Keras/Tensorflow, you only need to implement the forward pass.
# Here we show how to do that for our model.

# Define the shape of the inputs: batchSize x maxSequenceLength (the padded data minus its last column).
words = keras.layers.Input(batch_shape=(None, maxSequenceLength), name = "input")

# Build a matrix of size vocabularySize x 300 where each row corresponds to a "word embedding" vector.
# This layer will replace each word-id with a word-vector of size 300.
embeddings = keras.layers.embeddings.Embedding(vocabularySize, 300, name = "embeddings")(words)

# Pass the word-vectors to the recurrent (SimpleRNN) layer.
# We are setting the hidden-state size to 512.
# The output will be batchSize x maxSequenceLength x hiddenStateSize
hiddenStates = keras.layers.SimpleRNN(512, return_sequences = True, 
                                      input_shape=(maxSequenceLength, 300), name = "rnn")(embeddings)

# Apply a linear (Dense) layer of size 512 x vocabularySize to the outputs of the RNN at each time step.
denseOutput = TimeDistributed(keras.layers.Dense(vocabularySize), name = "linear")(hiddenStates)
predictions = TimeDistributed(keras.layers.Activation("softmax"), name = "softmax")(denseOutput)                                      

# Build the computational graph by specifying the input, and output of the network.
model = keras.models.Model(input = words, output = predictions)

# Compile the graph so that we have a way to compute gradients.
# We also specify here the type of optimization to perform. For Recurrent Neural Networks, a type of
# optimization called RMSprop is often preferred over standard SGD updates.
model.compile(loss='sparse_categorical_crossentropy', optimizer = keras.optimizers.RMSprop(lr = 0.001))

print(model.summary()) # Convenient function to see details about the network model.

# Sample 10 inputs from the training data and verify everything works.
sample_inputs = data[0:10,:-1]
sample_outputs = model.predict(sample_inputs)
print('input size', sample_inputs.shape)
print('output size', sample_outputs.shape)
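
To see where the (batch_size, maxSequenceLength, vocabularySize) output shape comes from, here is a small numpy sketch of the same forward pass: embedding lookup, a tanh recurrent update, and a softmax at every time step. It is only an illustration under assumed toy sizes with random weights, not how Keras implements its layers; the function name rnn_language_model and all variable names are made up for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_language_model(word_ids, E, Wx, Wh, b, Wo, bo):
    """Embedding -> tanh RNN -> per-step softmax; returns (batch, time, vocab)."""
    batch, steps = word_ids.shape
    h = np.zeros((batch, Wh.shape[0]))              # initial hidden state
    outputs = []
    for t in range(steps):
        x_t = E[word_ids[:, t]]                     # (batch, embed) row lookup
        h = np.tanh(x_t.dot(Wx) + h.dot(Wh) + b)    # (batch, hidden)
        outputs.append(softmax(h.dot(Wo) + bo))     # (batch, vocab)
    return np.stack(outputs, axis=1)

rng = np.random.RandomState(0)
V, D, H = 20, 8, 16                                 # toy vocabulary, embedding, hidden sizes
params = (rng.randn(V, D) * 0.1, rng.randn(D, H) * 0.1, rng.randn(H, H) * 0.1,
          np.zeros(H), rng.randn(H, V) * 0.1, np.zeros(V))
toy_ids = rng.randint(0, V, size=(3, 5))            # 3 sequences of 5 word ids
toy_probs = rnn_language_model(toy_ids, *params)    # shape (3, 5, 20)
```

Each slice toy_probs[i, t] is a distribution over the vocabulary for the word that follows position t in sequence i, which is why the output has one extra dimension compared to the input.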

3. Training the Model

Keras already implements generic training through the model.fit function, but it also provides model.train_on_batch, which we might need to save memory (e.g. if we want to avoid loading the whole dataset into memory at once). For more information about the functionality of Keras models, see the Keras documentation.

If you installed Tensorflow with GPU support, this will automatically run on the GPU!

In [ ]:
inputData = data[:, :-1]  # words 1, 2, 3, ... , (n-1)
outputData = data[:, 1:]  # words 2, 3, 4, ... , (n)

# We have to add an extra dimension if using "sparse_categorical_crossentropy".
# Sparse labels save memory: each label is stored as a single word id instead of a one-hot vector.
outputLabels = np.expand_dims(outputData, -1)

# The labels have to be the same size as the outputs of the network if using "categorical_crossentropy" in Keras,
# so we would have to encode the labels as one-hot vectors. There is a function in Keras to do this.
# from keras.utils.np_utils import to_categorical
# print('Converting labels to one-hot encodings..')
# outputLabels = to_categorical(outputData, nb_classes = vocabularySize)
# outputLabels = np.reshape(outputLabels, (outputData.shape[0], outputData.shape[1], vocabularySize))
# print('Finishing converting labels to one-hot encodings')
# I commented out and abandoned this because it required too much memory!

checkpointer = keras.callbacks.ModelCheckpoint(filepath="my_weights.hdf5", save_weights_only = True, \
                                               save_best_only = True, monitor = 'loss')

model.fit(inputData, outputLabels, batch_size = 256, nb_epoch = 10, callbacks = [checkpointer])

# We could also go batch by batch ourselves, however the above function worked well so let's not go this way.
# trainSize = inputData.shape[0]
# batchSize = 100
# nBatches =  trainSize / batchSize
# for b in range(0, nBatches):
     # Build the batch inputs, and batch labels.
#    batchInputs = np.zeros((batchSize, inputData.shape[1]))
#    batchLabels = np.zeros((batchSize, inputData.shape[1], vocabularySize))
#    for bi in range(0, batchSize):
#        rand_int = random.randint(0, trainSize - 1)
#        batchInputs[bi, :] = inputData[rand_int, :]
#        for s in range(0, inputData.shape[1]):
#            batchLabels[bi, s, outputData[rand_int, s]] = 1
#    model.train_on_batch(batchInputs, batchLabels)
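
The input/label construction above can be checked on a tiny example. The sketch below uses a made-up 2 x 4 matrix of word ids to show how shifting by one position pairs each word with the word that follows it, and then compares how many numbers sparse and one-hot labels would need to store. The sequence length of 50 in the arithmetic is a placeholder assumption, not the actual value for this dataset.

```python
import numpy as np

# Toy padded data: each row is [START], words..., then END (id 0) padding.
toy_data = np.array([[1, 5, 7, 0],
                     [1, 9, 0, 0]])
toy_inputs = toy_data[:, :-1]                   # what the network sees at each step
toy_targets = toy_data[:, 1:]                   # the word it should predict at that step
toy_labels = np.expand_dims(toy_targets, -1)    # (batch, time, 1) integer labels

# Rough label storage, counted in stored numbers (placeholder sizes).
n_sentences, n_steps, vocab = 400000, 50, 1000
sparse_entries = n_sentences * n_steps           # one id per time step
one_hot_entries = n_sentences * n_steps * vocab  # one vector of size vocab per time step
```

At step 0 of the first row the input is 1 ([START]) and the target is 5, at step 1 the input is 5 and the target is 7, and so on. The one-hot encoding stores vocab times more numbers than the sparse one, which is why the one-hot version above ran out of memory.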


4. Building the Inference Model.

Now let's build a model with exactly the same layers as the one we used for training, except that this one takes a single word and outputs the probabilities for the next word. The other modification is that this network keeps the state of the recurrent layer between calls unless we reset it.

In [ ]:
# Same layers as in the model used for training.
words = keras.layers.Input(batch_shape=(1, 1), name = "input")
embeddings = keras.layers.embeddings.Embedding(vocabularySize, 300, name = "embeddings")(words)
# Notice here two differences.
# 1. This RNN does not return sequences, only the hidden state in the last step.
# 2. This RNN is stateful, meaning the hidden state of the previous call to this function will be input
#    as the hidden state of the next call of this function.
#    For more information about this check:
hiddenStates = keras.layers.SimpleRNN(512, stateful = True, \
                                      batch_input_shape=(1, 1, 300), name = "rnn")(embeddings)
# Since the output is only the last hidden state, the output is not a sequence anymore. 
# So we do not need the TimeDistributed wrapper anymore.
denseOutput = keras.layers.Dense(vocabularySize, name = "linear")(hiddenStates)
predictions = keras.layers.Activation("softmax", name = "softmax")(denseOutput)                                      
inference_model = keras.models.Model(input = words, output = predictions)

# Copy the weights of the trained network. Both should have the same exact number of parameters (why?).
inference_model.load_weights('my_weights.hdf5')

# Given the start token '[start]' predict the next most likely word.
startWord = np.zeros((1, 1))
startWord[0, 0] = word2id['[start]']
nextWordProbabilities = inference_model.predict(startWord)

# Find the 10 most probable next words.
top_inds = (-nextWordProbabilities).argsort()[0, :10]
top_probs = np.sort(-nextWordProbabilities)[0, :10]

# Print the most probable next words together with their probabilities.
print([(id2word[w], prob) for (w, prob) in zip(top_inds, -top_probs)])
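
The idea behind a stateful recurrent layer can be sketched in plain numpy: feeding a sequence one element per call, while keeping the hidden state between calls, ends in the same state as processing the whole sequence at once. The class StatefulRNN below is a hypothetical stand-in written for this illustration (random toy weights, tanh update); it is not the Keras layer.

```python
import numpy as np

rng = np.random.RandomState(1)
D, H, T = 4, 6, 5                          # toy embedding size, hidden size, sequence length
Wx, Wh, b = rng.randn(D, H), rng.randn(H, H) * 0.5, np.zeros(H)
inputs = rng.randn(T, D)                   # one sequence of T already-embedded words

def rnn_step(x_t, h_prev):
    """One simple recurrent update: h_t = tanh(x_t Wx + h_prev Wh + b)."""
    return np.tanh(x_t.dot(Wx) + h_prev.dot(Wh) + b)

# Training style: the whole sequence is processed inside one call.
h = np.zeros(H)
for t in range(T):
    h = rnn_step(inputs[t], h)
full_sequence_state = h

# Inference style: one word per call, with the state remembered between calls.
class StatefulRNN(object):
    def __init__(self):
        self.h = np.zeros(H)
    def reset_states(self):
        self.h = np.zeros(H)               # forget everything seen so far
    def predict(self, x_t):
        self.h = rnn_step(x_t, self.h)
        return self.h

stepper = StatefulRNN()
for t in range(T):
    last_state = stepper.predict(inputs[t])
```

Both loops use the same weights Wx, Wh and b, and the sequence length never appears in the weight shapes.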

5. Sampling a Complete New Sentence

Now that we have our inference_model working, we can start producing new sentences by randomly sampling from the next-word probabilities one step at a time. We rely on the np.random.multinomial function from numpy; please check its documentation and make sure you understand what it does.

In [ ]:
inference_model.reset_states()  # This makes sure the initial hidden state is cleared every time.

startWord = np.zeros((1, 1))
startWord[0, 0] = word2id['[start]']

for i in range(0, maxSequenceLength):
    nextWordProbs = inference_model.predict(startWord)
    # In theory we should be able to pass nextWordProbs directly to np.random.multinomial.
    nextWordProbs = np.asarray(nextWordProbs).astype('float64') # Cast first; float32 rounding can make the sum slightly exceed 1.
    nextWordProbs = nextWordProbs / nextWordProbs.sum()  # Re-normalize for float64 to make exactly 1.0.
    nextWordId = np.random.multinomial(1, nextWordProbs.squeeze(), 1).argmax()
    print id2word[nextWordId], # The trailing comma avoids printing a newline character.
    startWord[0, 0] = nextWordId
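
As a small standalone check of the sampling step used above, the sketch below draws from a made-up three-word distribution. A single draw from np.random.multinomial returns a vector of counts with a single 1 in it, so argmax recovers the sampled index; over many draws the counts approach the given probabilities. The probability values are arbitrary toy numbers.

```python
import numpy as np

np.random.seed(0)
probs32 = np.array([0.1, 0.2, 0.7], dtype='float32')   # stand-in for a softmax output
probs = probs32.astype('float64')
probs = probs / probs.sum()                 # renormalize after the cast

one_draw = np.random.multinomial(1, probs)  # counts for a single draw (one-hot)
sampled_id = int(one_draw.argmax())         # index of the sampled word

many_draws = np.random.multinomial(10000, probs)   # counts over 10000 draws
empirical = many_draws / 10000.0            # observed frequencies
```

Sampling this way, instead of always taking the argmax of the probabilities, is what makes every generated sentence different.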

Notice how the model learns to keep predicting 'END' once it has predicted the first 'END', and does not produce any other word after that. We can therefore stop the for loop as soon as we sample 'END'. This produces sentences of arbitrary length, meaning our model has learned when to finish a sentence. At this point in training the sentences might not be perfect: the model has probably learned to produce basic sentences, but it still produces incoherent output from time to time. If you keep training the model for longer, it should keep improving.
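
Stopping at 'END' can be sketched as follows. To keep the example self-contained, the trained network is replaced by toy_next_word_probs, a hypothetical fixed table of next-word probabilities over a five-word vocabulary; in the lab, that role would be played by inference_model.predict. The function name sample_sentence is also made up for this sketch.

```python
import numpy as np

toy_id2word = {0: 'END', 1: '[start]', 2: 'a', 3: 'cat', 4: 'sits'}

def toy_next_word_probs(word_id):
    """Hypothetical stand-in for the trained model: a fixed next-word table."""
    table = {1: [0.0, 0.0, 1.0, 0.0, 0.0],     # [start] -> a
             2: [0.0, 0.0, 0.0, 1.0, 0.0],     # a -> cat
             3: [0.5, 0.0, 0.0, 0.0, 0.5],     # cat -> END or sits
             4: [1.0, 0.0, 0.0, 0.0, 0.0]}     # sits -> END
    return np.array(table[word_id], dtype='float64')

def sample_sentence(next_word_probs, start_id, end_id, max_length):
    """Sample word ids one at a time, stopping at end_id or after max_length steps."""
    words, current = [], start_id
    for _ in range(max_length):
        probs = next_word_probs(current)
        current = int(np.random.multinomial(1, probs / probs.sum()).argmax())
        if current == end_id:
            break                               # the model decided the sentence is over
        words.append(current)
    return words

np.random.seed(0)
toy_ids = sample_sentence(toy_next_word_probs, start_id=1, end_id=0, max_length=10)
toy_sentence = ' '.join(toy_id2word[i] for i in toy_ids)
```

The max_length argument is kept as a safety limit in case 'END' is never sampled.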

Lab Questions (10pts)

  1. In section 3, how long did it take you to train an epoch on average? How long did it take to train for 10 epochs? What was your hardware setup? Did you use a GPU? Which one? (1pts)

  2. In section 2 we show a summary of the training model architecture. Explain why the size of the embedding layer is 300k, the size of the rnn layer is 416256, and the size of the linear layer is 513k. (1pts)

  3. Print here five sentences that you obtained in section 5, as strings: (1pts)

  4. Instead of using a SimpleRNN layer, use an LSTM layer, recompile the model and show how the size of the rnn layer changes. Explain the number of parameters for this layer. (1pts)

  5. So far we have only used a training set for our language model. However, we want to see how good this model is on test data. A common way of measuring the performance of a language model in NLP is a metric called perplexity. Compute the perplexity of the model trained in Lab Question 1, using the MS-COCO validation image captions from the MS-COCO downloads. (3pts)

  6. Train any variation of the original model that produces a lower perplexity on the validation data from mscoco (e.g. LSTM, GRU, more layers, different activation functions, dropout, regularization, different optimization, etc). Feel free to use the parameter validation_data of the fit function to guide your choices. (3pts)

Optional (2pts)

  1. Complete the following Jupyter notebook about LSTMs. If you complete this optional part attach the HTML file with your outputs for this lab, and the HTML output for the notebook in this optional part of your lab. Include here your answer to the following question: Which worked better on the IMDB dataset task in this notebook? GRU? LSTM? SimpleRNN?

If you find any errors or omissions in this material please contact me at