COMP 646: Deep Learning for Vision and Language | Spring 2022

Instructor: Vicente Ordóñez-Román (vicenteor at
Class Time: Mondays, Wednesdays, and Fridays from 1pm to 1:50pm Central Time (Virtual OR Duncan Hall 1070).

Course Description: Visual recognition and language understanding are two challenging tasks in AI. In this course we will study and acquire the skills to build machine learning and deep learning models that can reason about images and text for generating image descriptions, visual question answering, image retrieval, and other tasks involving both text and images. On the technical side we will leverage models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer networks (e.g. BERT), among others.

Learning Objectives: (a) Develop intuitions about the connections between language and vision, (b) Understand foundational concepts in representation learning for both images and text, (c) Become familiar with state-of-the-art models for tasks in vision and language, (d) Obtain practical experience in the implementation of these models.

Prerequisites: There are no formal prerequisites for this class. However, a basic command of machine learning, deep learning, or computer vision will be useful when taking this class. Students should have knowledge of linear algebra, differential calculus, and basic statistics and probability. Moreover, students are expected to have attained some level of proficiency in Python programming or be willing to learn Python programming. Students are encouraged to complete the following activity before the first lecture: [Primer on Image Processing].

Grading: Assignments: 30% (3 assignments), Class Project: 50%, Quiz: 10%, Class Participation: 10%.


Date Topic  
Mon, Jan 10 Introduction to Vision and Language [pptx] [pdf]
Wed, Jan 12 Machine Learning I: Supervised vs Unsupervised Learning, Linear Classifiers [pptx] [pdf]
Fri, Jan 14 Machine Learning II: Stochastic Gradient Descent / Regularization [pptx] [pdf]
Assignment on Text and Image Classification
Mon, Jan 17 Martin Luther King, Jr. Day (Holiday - No Scheduled Classes)
Wed, Jan 19 Neural Networks I: Multi-layer Perceptrons and Backpropagation
Fri, Jan 21 Practical Session: Neural Networks Building Blocks
Mon, Jan 24 Neural Networks II: Convolutional Neural Networks
Wed, Jan 26 Computer Vision I: Convolutional Neural Network Architectures for Vision: AlexNet, VGG-Net, InceptionNets, ResNets
Fri, Jan 28 Practical Session: Using and Finetuning Convolutional Neural Networks
Assignment on Multimodal Analysis of Movies
Mon, Jan 31 Computer Vision II: Convolutional Neural Networks for Object Detection
Wed, Feb 2 Computer Vision III: Convolutional Neural Networks for Segmentation
Fri, Feb 4 Natural Language Processing I: Bag of Words, N-gram Language Models, Transformer-based Models
Mon, Feb 7 Practical Session: Image, Text Classification and Processing / Optimization and Regularization.
Wed, Feb 9 Natural Language Processing II: Syntax and Morphology / Parsing / Co-reference Resolution
Fri, Feb 11 Spring Recess (No Scheduled Classes)
Assignment on Vision-Language RNNs and Transformers
Mon, Feb 14 Natural Language Processing III: Distributional Semantics and Word Embeddings
Wed, Feb 16 Recurrent Neural Networks: Basics and Neural Image Captioning
Fri, Feb 18 Practical Session: Recurrent Neural Networks for Text Generation
Mon, Feb 21 Transformer Models I: Multi-headed Self-Attention Layers and Pretrained Transformers e.g. BERT, GPT-2, GPT-3
Wed, Feb 23 Transformer Models II: Vision and Language Transformers e.g. UNITER, ViLBERT, VisualBERT
Fri, Feb 25 Transformer Models III: Vision Transformers (ViT) and Contrastive Learning (CLIP)
Mon, Feb 28 CLIP-tology: CLIP-based models and prompt engineering
Wed, Mar 2 Referring Expressions and Visual Grounding
Fri, Mar 4 Visual Question Answering (VQA)
Mon, Mar 7 ECCV Deadline (No Class this Day)
Wed, Mar 9 Text-Image Retrieval (VSE, VSE++, Drill-Down)
Fri, Mar 11 Vision-and-Language Navigation (VLN)
Mon, Mar 14 Spring Break (No Scheduled Classes)
Wed, Mar 16 Spring Break (No Scheduled Classes)
Fri, Mar 18 Spring Break (No Scheduled Classes)
Mon, Mar 21 Multimodal Machine Translation (MMT)
Wed, Mar 23 Practical Session: Deployment of Vision-Language Models I
Fri, Mar 25 Practical Session: Deployment of Vision-Language Models II
Mon, Mar 28 Visually Grounded Dialog
Wed, Mar 30 Video Representations and Video and Language Tasks
Fri, Apr 1 Recent Developments in Vision-Language Transformers
Mon, Apr 4 Language-guided Image Synthesis: Text2Scene, RetrieveGAN, DALL-E, GLIDE
Wed, Apr 6 Explainability in Vision and Language Models
Fri, Apr 8 Analyzing and Mitigating Biases in Vision and Language Models
Mon, Apr 11 Entry-level Categories and Naming
Wed, Apr 13 Open Problems in Vision and Language Research
Fri, Apr 15 Course Recap
Mon, Apr 18 Project Presentations
Wed, Apr 20 Project Presentations
Fri, Apr 22 Project Presentations

Disclaimer: The professor reserves the right to make changes to the syllabus, including assignment due dates. These changes will be announced as early as possible.

COVID-19 Notice: As we continue another year under the ongoing COVID-19 pandemic and new variants of the virus causing this disease, there might be changes that affect the delivery or requirements of this class. As of this writing, the university has declared that classes with more than 50 students, such as this one, should be virtual during the first two weeks of class. Please pay close attention to university officials and the instructor regarding modifications to course delivery or content due to the pandemic. If a student is personally affected by the pandemic, course staff will also make special considerations on a case-by-case basis as allowed -- students, however, should first follow any guidance put forward by official university channels of communication.

Late Submission Policy: No late assignments will be accepted in this class unless the student has procured special accommodations for warranted circumstances or exceptional personal situations. If you think this might apply to you, please contact the instructor directly.

Honor Code and Academic Integrity: "In this course, all students will be held to the standards of the Rice Honor Code, a code that you pledged to honor when you matriculated at this institution. If you are unfamiliar with the details of this code and how it is administered, you should consult the Honor System Handbook at This handbook outlines the University's expectations for the integrity of your academic work, the procedures for resolving alleged violations of those expectations, and the rights and responsibilities of students and faculty members throughout the process." For this class: If an assignment is individual, no collaboration is expected and no two students should submit the same source code. Regardless of circumstances, I will assume that any source code, text, or images submitted alongside reports or projects were authored by the submitting students unless explicitly stated otherwise through appropriate means. Any missing information regarding sources may be regarded as a failure to abide by the academic integrity statement, even if that was not the intent. Please be careful about citing sources and clearly stating what is your original work and what is not in all assignments and projects. In particular, avoid vague statements such as "we built our model based on X"; instead be explicit, e.g. "we downloaded X and modified the encoder so that it can work with videos instead of images by adding three more layers". Vague statements make it difficult to distinguish what you did from what was done by others.

Title IX Support: Rice University cares about your wellbeing and safety. Rice encourages any student who has experienced an incident of harassment, pregnancy discrimination or gender discrimination or relationship, sexual, or other forms of interpersonal violence to seek support through The SAFE Office. Students should be aware when seeking support on campus that most employees, including myself, as the instructor/TA, are required by Title IX to disclose all incidents of non-consensual interpersonal behaviors to Title IX professionals on campus who can act to support that student and meet their needs. For more information, please visit or email

Disability Resource Center: "If you have a documented disability or other condition that may affect academic performance you should: 1) make sure this documentation is on file with the Disability Resource Center (Allen Center, Room 111 / / x5841) to determine the accommodations you need; and 2) talk with me to discuss your accommodation needs."