CS 6501-004: Vision & Language

Instructor: Vicente Ordonez (vicente at virginia.edu)

Class Time: Tuesdays and Thursdays 12:30PM - 1:45PM, Thornton Hall E316
Teaching Assistants:
Tianlu (Evelyn) Wang (Office Hour: Wednesday 4 to 5pm) -- tw8cb at virginia dot edu, Rice 430
Xuwang Yin (Office Hour: Monday 4 to 5pm) -- xy4cm at virginia dot edu, Rice 430
Piazza: http://piazza.com/virginia/spring2017/cs6501004

Visual recognition and language understanding are two challenging tasks in artificial intelligence. In this course we will study and acquire the skills to build machine learning and deep learning models that can reason about images and text. In particular, we will study a recent body of research at the intersection of vision and language, including: generating image descriptions using natural language, visual question answering, image retrieval using complex text queries, learning from weakly supervised text, aligning images and text in large data collections, generating images from textual descriptions, video and language, and other related topics. On the technical side we will study models including bag-of-words, n-gram language models, neural language models, probabilistic graphical models (PGMs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional neural networks (ConvNets), and memory networks.

More about this class

Prerequisites: Any one of the following courses: Machine Learning, Computational Visual Recognition, Natural Language Processing. In short, this should not be your introductory course to Machine Learning. It is instead a research-oriented course targeted at PhD students and MS students with research interests. If you think you have taken a course with a strong Machine Learning component, let me know so I can better advise you.

Additional Requirement: In this class you will need a GPU to run the code in some of your assignments and to develop a course project, which will be an important part of your grade. Ideally you should have access to a GTX 1080 GPU or a better graphics card; at a minimum you should have access to a CUDA-capable card with at least 4GB of GPU RAM. Alternatively, you can fulfill this requirement by learning how to use Amazon EC2 instances; G2 and P2 instances have GPUs (more info here). As a student, you can get a limited number of free computing credits from the AWS Educate program, but you have to be careful not to exhaust your budget on these cloud instances. Note: The CS Department also has three machines named cuda1-3, which you can SSH into from any CS server, and five machines with NVIDIA K20s (artemis1-5) in a cluster (instructions). These nodes might be enough for some assignments but probably not for your project. You are encouraged to use them, but they do not fulfill the requirement outlined here.

Grading: 50pts (Project) + 30pts (Labs) + 10pts (Class presentation / participation) + 10pts (paper summaries)
Summary of Hands-on Activities:
1. Image-Text Retrieval Lab: [html] [notebook]
2. Deep Learning Lab: [html] [notebook]
3. Language Generation Lab: [html] [notebook]
4. Visual Recognition Lab: [html] [notebook]
5. PyTorch Lab: [html] [notebook]

Syllabus / Schedule

Date     Topic
Thursday, January 19th Lecture: Introduction [slides]
  • Challenges in Computer Vision
  • Challenges in Natural Language Processing
  • Challenges at the Intersection of Vision and Language
Lab for next week (Due Thursday January 26th) -- Image-Text Retrieval Lab: [html] [notebook]
Tuesday, January 24th Lecture: Introduction to Computer Vision [slides]
  • Basic Image Operations
  • Image Filtering
  • Image Features
Recommended Reading: Szeliski, Chapter 3.1-3.2
Link to online book [here].
Thursday, January 26th Lecture: Introduction to Natural Language Processing [slides]
  • Intro to NLP, Tasks in NLP
  • Why is NLP Hard?
  • Representing words and text
Lab for next week (Due Sunday February 5th) -- Deep Learning Lab: [html] [notebook]
Tuesday, January 31st Lecture: Introduction to Deep Learning [slides]
  • Machine Learning recap
  • Neural Networks
  • Backpropagation
Thursday, February 2nd Lecture: Convolutional Neural Networks I [slides]
  • Convolutional Operator recap
  • Convolutional Layers
  • Convolutional Networks: LeNet
Lab for the week after next (Due Sunday February 19th) -- Language Generation Lab: [html] [notebook]
Tuesday, February 7th Lecture: Convolutional Neural Networks II [slides]
  • Convolutional Operator as Matrix Multiplication
  • The ImageNet Large Scale Visual Recognition Challenge
  • Convolutional Networks: AlexNet, VGG, GoogLeNet, ResNet
Thursday, February 9th Lecture: Recurrent Neural Networks I [slides]
  • Softmax Classifier
  • Recurrent Neural Networks (RNNs)
  • Long Short Term Memory Networks (LSTMs)
Tuesday, February 14th Lecture: Recurrent Neural Networks II / Word Embeddings [slides]
  • Bi-directional RNNs, Multi-layer RNNs
  • RNN Decoding, Beam-search
  • Word Embeddings: SVD, Word2Vec
Thursday, February 16th Lecture: Deep-learning based Visual Recognition [slides]
  • Detection: RCNN, Fast-RCNN, Faster-RCNN
  • Detection: YOLO, SSD
  • Segmentation: Fully Convolutional Networks
Lab for the week after next (Due Friday March 3rd) -- Visual Recognition Lab: [html] [notebook]
Tuesday, February 21st Course Project Proposal Presentations
  • Maximum 3 slides, maximum 5 minutes.
Thursday, February 23rd Student Paper Review: Webly-supervised learning
  • DeViSE: A Deep Visual-Semantic Embedding Model. A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov. NIPS 2013 [pdf]
  • Webly supervised learning of convolutional networks. Xinlei Chen, Abhinav Gupta. ICCV 2015 [link]
Ji, Fandi [slides]
Leandra [slides]
Submit a 1- or 2-page project proposal in PDF (Deadline Monday February 27th).
Tuesday, February 28th Student Paper Review: Generating Image Descriptions
  • Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan CVPR 2015 [pdf]
  • From Captions to Visual Concepts and Back. Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig CVPR 2015 [arxiv]
Jieyu [slides]
Brady, Kerry [slides]
Thursday, March 2nd Student Paper Review: Video and Text
  • Title Generation for User Generated Videos. Kuo-Hao Zeng, Tseng-Hung Chen, Juan Carlos Niebles, Min Sun ECCV 2016 [arxiv]
  • Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. Yukun Zhu, Jamie Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler ICCV 2015 [pdf]
Nick, Tom [slides]
Abhimanyu, Gautam [slides]
Before Spring Break, meet with the instructor to confirm your Final Project Proposal.
Note: If the instructor approved your project without objections on UVA Collab, there is no need to meet; you can start working on your project.
Tuesday, March 7th Spring Break - no classes this day.
Thursday, March 9th Spring Break - no classes this day.
Tuesday, March 14th Student Paper Review: Visual Question Answering
  • Exploring Models and Data for Image Question Answering. Mengye Ren, Ryan Kiros, Richard Zemel NIPS 2015 [pdf]
  • Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. Mateusz Malinowski, Marcus Rohrbach, Mario Fritz ICCV 2015 [pdf]
Yujia, Luyao
Anudeep
Thursday, March 16th ICCV Deadline - no classes this day.
Tuesday, March 21st Student Paper Review: Attention Models and Recurrent Neural Networks
  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Jamie Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio ICML 2015 [pdf]
  • Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. Huijuan Xu, Kate Saenko ECCV 2016 [arxiv]
Siva, Haina [slides]
Arshdeep, Sihang
Thursday, March 23rd Student Paper Review: Memory-augmented Networks
  • End-To-End Memory Networks. Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus NIPS 2015 [arxiv]
  • Dynamic Memory Networks for Visual and Textual Question Answering. Caiming Xiong, Stephen Merity, Richard Socher ICML 2016 [arxiv]
Paola, Vicente [slides]
Erik, Leigh
Tuesday, March 28th Lecture: Vicente
  • Binary Convolutional Neural Networks / Quantized Networks (ECCV 2016)
  • Situation Recognition: Dealing with Sparsity (CVPR 2017)
  • Notes on PyTorch
Thursday, March 30th Student Paper Review: Topic of your choice
  • Stacked Attention Networks for Image Question Answering. Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola CVPR 2016 [pdf]
  • DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Justin Johnson, Andrej Karpathy, Li Fei-Fei CVPR 2016 [pdf]
Wasi, Monica [slides]
Masudur
Tuesday, April 4th Student Paper Review: Text to Image Generation.
  • Generative Adversarial Text to Image Synthesis. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee ICML 2016 [arxiv]
  • StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, Dimitris Metaxas [arxiv]
Cara
Seth, Colin [slides]
Thursday, April 6th Student Paper Review: Referring Expressions.
  • Generation and Comprehension of Unambiguous Object Descriptions. J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, K. Murphy. CVPR 2016 [arxiv]
  • Segmentation from Natural Language Expressions. Ronghang Hu, Marcus Rohrbach, Trevor Darrell. ECCV 2016 [arxiv]
Qingyu, Mengyao
Tianyi [slides]
Thursday, April 6th 11:59pm: Submit a 2-page report on UVA Collab detailing progress on your Project (describe the dataset, experiments, or preliminary results that you have so far, and state clearly what is left to be completed). I will provide feedback but no grade, so use this opportunity to show me your progress. Use this template.
Optional Lab for next week (Due Thursday April 20th) -- PyTorch Lab: [html] [notebook]
Tuesday, April 11th Student Paper Review: Topic of your choice
  • The Amazing Mysteries of the Gutter: Drawing Inferences Between Panels in Comic Book Narratives. Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, Larry Davis. [arxiv] [Tech Review Article]
  • Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han. CVPR 2016 [pdf]
Leo [slides]
Yang, Angyang
Thursday, April 13th Brainstorming Session for New Tasks & Ideas.
Tuesday, April 18th Course recap
  • Joint Image-text Embeddings
  • Image Captioning
  • Visual Question Answering
  • Referring Expressions
  • Text to Image Synthesis
Thursday, April 20th In-class Activity
Tuesday, April 25th Project Presentations
Thursday, April 27th Project Presentations
Tuesday, May 2nd Project Presentations
Project Deadline - Submit Report on UVA Collab (May 2nd 11:59PM EST)
Use the following format for your report: template. Minimum 4 pages, maximum 6 pages.