Text-Image Retrieval Lab

In this lab we will experiment with basic image and text retrieval. We will be using a subset of the COCO Dataset http://mscoco.org/, a popular dataset these days for many tasks involving images and text. This is a dataset of images with descriptions written by people. We are using a subset of this dataset for this Lab containing 50k images for training, and 10k images for validation/development. Each image is associated with a single image description (However the full dataset contains 5 image descriptions/captions). The images are resized and center-cropped to 256 by 256 pixels. In this lab we will build a simple retrieval system based on image similarities.

In [7]:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from scipy.misc import imread, imresize
%matplotlib inline

1. Know your Data

When you work with images and vision, it is always helpful to see your data. The first step in any project should be spending some time analyzing what your data looks like. Here we are displaying one image from this dataset however I encourage you to look at a large number of images and captions before proceeding to the next step.

In [8]:
# Load data and show some images.
data = pickle.load(open('mscoco_small.p'))
train_data = data['train']
val_data = data['val']
In [9]:
# Pick an image and show the image.
sampleImageIndex = 290 # Try changing this number and visualizing some other images from the dataset.
plt.imshow(imread('mscoco/%s' % train_data['images'][sampleImageIndex]))
print(sampleImageIndex, train_data['captions'][sampleImageIndex])
(290, u'A herd of cattle that are grazing on some grass. ')

Answer the following questions for this section:

  1. Comment about the set of images in the dataset. How diverse are they? Do they show indoor as well as outdoor scenes? Is the image resolution enough to still understand the image?
  2. Comment about the set of captions in the dataset. How diverse are they? What is their quality? Are they grammatically correct? Are they closely describing the images? Are there any particular patterns that you noticed?

2. Using image pixels directly as features.

This is perhaps the simplest idea to compute the similarity between two images. Use a distance between the raw pixels between the two images. However, it is considered a good idea (not just because of memory space) to scale the images to a low resolution. We resize them here to 16x16x3 and flatten the pixels as a vector. First let's visualize our above image in this resolution below.

In [ ]:
# Pick an image and show the image.
image = imread('mscoco/%s' % train_data['images'][sampleImageIndex])
tiny_image = imresize(image, (16, 16), interp = 'nearest')

Answer the following questions for this section:

  1. Comment on what properties of the image are preserved at this low resolution of 16x16?
  2. Why would be a good idea to compare images pixel by pixel at this resolution rather than at full resolution?
  3. What does interp = 'nearest' mean? What is the effect of changing this parameter to 'bilinear'?

Now we are going to compute a matrix of size 50000 x 768, storing in the rows the pixels at this low resolution for all images in our MSCOCO 50k train set.

In [ ]:
# Compute features for the training set.
train_features = np.zeros((len(train_data['images']), 768), dtype=np.float)  # 768 = 16 * 16 * 3
for (counter, image_id) in enumerate(train_data['images']):
    image = imread('mscoco/%s' % image_id)
    tiny_image = imresize(image, (16, 16), interp = 'nearest')
    train_features[counter, :] = tiny_image.flatten().astype(np.float) / 255
    if (1 + counter) % 10000 == 0:
        print ('Computed features for %d train-images' % (1 + counter))

# Compute features for the validation set.
val_features = np.zeros((len(val_data['images']), 768), dtype=np.float)  # 768 = 16 * 16 * 3
for (counter, image_id) in enumerate(val_data['images']):
    image = imread('mscoco/%s' % image_id)
    tiny_image = imresize(image, (16, 16), interp = 'nearest')
    val_features[counter, :] = tiny_image.flatten().astype(np.float) / 255
    if (1 + counter) % 10000 == 0:
        print ('Computed features for %d val-images' % (1 + counter))

pickle.dump(train_features, open('train_features.p', 'w'))  # Store in case this notebook crashes.
pickle.dump(val_features, open('val_features.p', 'w'))  # Store in case this notebook crashes.

3. Retrieving similar images using a simple distance metric.

We first divide our dataset into train, validation and test. We will then compute distances between images in our validation set and train set. We will use this validation set to decide on how many nearest neighbors to use, what metric to use, and what features to use depending on what works best. We will leave aside the test set until the very end when we have decided on our best set of parameters and only to report our performance. Once we use the test set, we should not make any more changes to the algorithm.

In [ ]:
from scipy.spatial.distance import cdist

# Try changing this image.
sampleTestImageId = 197

# Retrieve the feature vector for this image.
sampleImageFeature = val_features[sampleTestImageId : sampleTestImageId + 1, :]
# Compute distances between this image and the training set of images.
distances = cdist(sampleImageFeature, train_features, 'correlation')
# Compute ids for the closest images in this feature space.
nearestNeighbors = np.argsort(distances[0, :])  # Retrieve the nearest neighbors for this image.

# Show the image and nearest neighbor images.
plt.imshow(imread('mscoco/%s' % val_data['images'][sampleTestImageId])); plt.axis('off')
plt.title('query image:')
fig = plt.figure()
for (i, neighborId) in enumerate(nearestNeighbors[:5]):
    fig.add_subplot(1, 5, i + 1)
    plt.imshow(imread('mscoco/%s' % train_data['images'][neighborId]))

We can also use the above nearest neighbors to return the captions from these neighbors. The assumption is that if the images are similar enough to the query image, then the captions are likely to also be descriptive of the query image.

In [ ]:
print('Query Image Description: ' + val_data['captions'][sampleTestImageId] + '\n')
for (i, neighborId) in enumerate(nearestNeighbors[:5]):
    print('(' + str(i) + ')' + train_data['captions'][neighborId])

4. Measuring Performance on this task.

How can we measure performance for the similarity of our set of returned images? One idea is to use the text returned along with the images to measure how many words in those descriptions match words in the Query Image Description. We could measure this in terms of the number of words matched in the description to the number of words matched in the query.

One such metric is called BLEU and it measures the similarity between two sentences based on how many substrings match subject to a penalty for returning sentences that are too short. You can read details in the paper that proposed this metric here.

We will split the sentences into words using the NTLK library and compute a very simple version of BLEU scores by only computing the number of common words between the reference caption and the top candidate caption.

In [ ]:
from nltk import word_tokenize

reference = [w.lower() for w in word_tokenize(val_data['captions'][sampleTestImageId])]
candidate = [w.lower() for w in word_tokenize(train_data['captions'][nearestNeighbors[0]])]

print ('ref', reference)
print ('cand', candidate)

bleu_score = float(len(set(reference) & set(candidate))) / len(candidate)
print("BLEU-1 score = ", bleu_score)

Lab Questions (5 pts) [Make sure to include your code and intermediate outputs]

  1. In this lab we computed the most similar images for a single test image in our validation set. We computed the bleu score of the top returned "candidate" caption (corresponding to the most similar image) agasint the "reference" caption associated with the query image. Compute the average of this score across all images in the validation set. (1pts). Hint: This task only requires in principle writing a for-loop around the code provided in this lab, however this will likely take very long to compute, note that cdist can also compute efficiently the distance between two sets of vectors at once.

  2. Repeat the above experiment, however this time use a random caption from the training set as the "candidate" caption. How does this number compare to the previous number obtained in step 1? (2pts).

  3. The feature that we used in this dataset is just a vector containing the raw image pixels for each image at a 16x16 resolution. A more robust feature for returning similar images is the Histogram of Oriented Gradients (HOG features). Which uses gradient information (edges) as opposed to color information. Use this feature instead and record here the BLEU score in the entire validation set. Feel free to use the scikit-image package to compute HOG features http://scikit-image.org/docs/dev/auto_examples/plot_hog.html. How does this number compare to the previous two? (2pts)

  4. Put in the table below the numbers obtained in 1, 2, and 3.

  1. Optional (1pt extra): Another global image feature used for retrieval is the GIST image descriptor. Compute the BLEU-1 for this descriptor, feel free to use this package to compute this feature: https://pypi.python.org/pypi/pyleargist. For more details about what this descriptor computes check here: http://people.csail.mit.edu/torralba/code/spatialenvelope/

If you find any errors or omissions in this material please contact me at vicente@virginia.edu