Vision & Language Applications
Kate Saenko Machine Learning
so far…
General AI: machines that see, talk, act (Vision + Language + Action)
▪ Social media analysis
▪ Security and smart cameras
▪ AI assistants
▪ Helper robots for the elderly
▪ etc…
More Natural Human-Machine Interaction
• Description: "A woman wearing a striped shirt"
• Visual question answering (VQA): "What is in this glass?" Answer: "water"
• Referring expression (REF): "Find a teapot on the table"
• Instruction following / navigation: "Go outside and wait by the car"
• …
[Image: Pepper robot]
Vision & Language problems
• Image captioning: "A baseball game in progress with the batter up to plate"
• Video captioning: "A man is riding a bicycle"
• Visual Question Answering: Q: What is the child standing on? A: skateboard
Vision & Language problems
• Referring expressions: find "window upper right"
• Text-to-clip retrieval from video: find the moment when "girl looks up at the camera and smiles"
…and many others…
Demos
https://www.captionbot.ai/
http://vqa.cloudcv.org/
Today: Vision & Language
● Video captioning—in detail
● Other tasks
● Visual question answering (VQA)
● Video clip search
● Following instructions to navigate
Video Captioning
Applications of video captioning
• Image and video retrieval by content
• Video description service (e.g., "Children are wearing green shirts. They are dancing as they sing the carol.")
• Human-robot interaction
• Video surveillance
Image Captioning, B.D. (before deep learning)
Language: increasingly focused on grounding meaning in perception. Vision: exploit linguistic ontologies to "tell a story" from images.
[Farhadi et al. ECCV'10]: (animal, stand, ground)
[Kulkarni et al. CVPR'11]: "There are one cow and one sky. The golden cow is by the blue sky."
Many early works on image description: Farhadi et al. ECCV'10, Kulkarni et al. CVPR'11, Mitchell et al. EACL'12, Kuznetsova et al. ACL'12 & ACL'13.
Identify objects and attributes, and combine with linguistic knowledge to "tell a story".
Dramatic increase in interest in the past year (8 papers in CVPR'15).
Video Description, B.D. (before deep learning)
[Krishnamurthy et al. AAAI'13] [Yu and Siskind, ACL'13]
● Extract object and action descriptors.
● Learn object, action, and scene classifiers.
● Use language to bias visual interpretation.
● Estimate most likely agents and actions.
● Use a template to generate the sentence.
Others: Guadarrama ICCV'13, Thomason COLING'14, [Rohrbach et al. ICCV'13]
Limitations:
● Narrow domains
● Small grammars
● Template-based sentences
● Several separate features and classifiers
After Deep Learning, A.D.: End-to-End Neural Models based on Recurrent Nets
[Figure: example encoder-decoder architectures]
• CNN encoder → RNN decoder → sentence (image captioning) [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]
• RNN encoder → RNN decoder (English sentence → French sentence) [Sutskever et al. NIPS'14]
• CNN / RNN encoder → RNN decoder → sentence (video captioning) [Venugopalan et al. NAACL'15] [Venugopalan et al. ICCV'15]
• CNN + question → RNN → answer (visual question answering) [Malinowski et al. ICCV'15]
Recurrent Neural Networks (RNNs) can map a vector to a sequence.
Key insight: generate a feature representation of the video and "decode" it to a sentence.
[review] Recurrent Neural Networks
[Figure: an RNN unit takes input x_t and the previous hidden state h_{t-1}, and produces hidden state h_t and output y_t; the same unit is unrolled over time]
RNNs can map an input to an output sequence: Pr(y_t | input, y_0 … y_{t-1}).
Insight: each time step has a layer with the same weights.
Successful in translation and speech.
Problems:
1. Hard to capture long-term dependencies
2. Vanishing gradients (gradients shrink through many layers)
Solution: the Long Short-Term Memory (LSTM) unit
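To make the shared-weights point concrete, here is a minimal sketch (not from the lecture) using PyTorch: one RNN cell is applied at every time step, and its output is projected to vocabulary scores, which is where the Pr(y_t | input, y_0 … y_{t-1}) factorization comes from. All sizes and names are illustrative assumptions.

```python
# Minimal sketch (not from the lecture): one RNN cell reused at every time
# step, so the same weights are applied for all t.
import torch
import torch.nn as nn

input_size, hidden_size, vocab_size = 4096, 512, 10000   # illustrative sizes
cell = nn.RNNCell(input_size, hidden_size)                # shared weights across time
out_proj = nn.Linear(hidden_size, vocab_size)             # scores over the vocabulary

x = torch.randn(8, 5, input_size)                         # (batch, time, feature) toy input
h = torch.zeros(8, hidden_size)                           # initial hidden state h_0
for t in range(x.size(1)):
    h = cell(x[:, t], h)                                  # h_t = f(x_t, h_{t-1})
    log_probs_t = out_proj(h).log_softmax(dim=-1)         # log Pr(y_t | input, y_0..y_{t-1})
```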
LSTM Sequence decoders
[Figure: an LSTM decoder unrolled over time steps t=0…3, mapping inputs to outputs out_t0 … out_t3]
All functions are differentiable, so the full gradient is computed by backpropagating through time; weights are updated using stochastic gradient descent.
Matches state of the art on:
• Speech recognition [Graves & Jaitly ICML'14]
• Machine translation (Eng-Fr) [Sutskever et al. NIPS'14]
• Image description [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]
LSTM Sequence decoders
[Figure: two stacked LSTM layers unrolled over t=0…3, each time step ending in a softmax]
Two LSTM layers: the 2nd layer adds depth in temporal processing. A softmax over the vocabulary predicts the output word at each time step.
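A hedged PyTorch sketch of the decoder just described: two stacked LSTM layers followed by a softmax over the vocabulary at every time step. The layer sizes and vocabulary size are placeholders, not values from the paper.

```python
# Hedged sketch of a two-layer LSTM decoder with a per-step softmax over the
# vocabulary; sizes are placeholders, not values from the paper.
import torch
import torch.nn as nn

class TwoLayerLSTMDecoder(nn.Module):
    def __init__(self, feat_size=4096, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.lstm = nn.LSTM(feat_size, hidden_size, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        # inputs: (batch, time, feat_size) decoder inputs at each time step
        hidden_states, _ = self.lstm(inputs)
        return self.classifier(hidden_states).log_softmax(dim=-1)  # per-step word distribution

decoder = TwoLayerLSTMDecoder()
word_log_probs = decoder(torch.randn(2, 4, 4096))   # toy batch of 2, 4 time steps
```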
Translating Videos to Natural Language
[Figure: CNN features of video frames feed a stack of LSTMs that decodes the sentence "A boy is playing golf"]
[Venugopalan et al. NAACL'15]
Test time: Step 1 (sample frames)
From the input video, sample frames at a rate of 1 in 10 and scale each frame to 227×227.
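A rough sketch of this preprocessing step, assuming OpenCV is used for frame decoding (the original pipeline may differ): keep one frame in ten and resize it to 227×227.

```python
# Rough sketch of the frame-sampling step, assuming OpenCV for decoding:
# keep one frame in ten and resize it to 227x227.
import cv2

def sample_frames(video_path, rate=10, size=(227, 227)):
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                       # end of video
            break
        if index % rate == 0:            # sample frames at 1/10
            frames.append(cv2.resize(frame, size))
        index += 1
    cap.release()
    return frames
```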
[review] Convolutional Neural Networks (CNNs)
Image credit: Maurice Peeman
Successful in semantic visual recognition tasks.
A layer applies linear filters followed by a non-linear function; layers are stacked.
The network learns a hierarchy of features of increasing semantic richness.
Krizhevsky, Sutskever, Hinton 2012: ImageNet classification breakthrough
Test time: Step 2 (feature extraction)
Forward-propagate each sampled frame through the CNN and take the "fc7" activations (the layer just before the classification layer) as a 4096-dimension "feature vector".
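As an illustration with a modern library (torchvision's AlexNet rather than the original Caffe model), the fc7 features can be read off by dropping the final classification layer; everything below is an assumption about tooling, not the authors' code.

```python
# Illustration with torchvision's AlexNet (an assumption about tooling, not
# the original Caffe pipeline): fc7 is the 4096-d activation just before the
# final 1000-way classification layer.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
fc7 = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten(),
                    alexnet.classifier[:-1])          # drop the classification layer

preprocess = T.Compose([T.ToPILImage(), T.Resize((227, 227)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def frame_fc7(frame):
    # frame: HxWx3 uint8 array (BGR-to-RGB conversion omitted for brevity)
    with torch.no_grad():
        return fc7(preprocess(frame).unsqueeze(0)).squeeze(0)   # 4096-d vector
```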
Test time: Step 3 (mean pooling)
Average the fc7 feature vectors across all sampled frames of the input video to obtain a single video-level feature.
Arxiv: http://arxiv.org/abs/1505.00487
Test time: Step 4 (generation)
Feed the mean-pooled feature from the convolutional net into the recurrent net (LSTM decoder), which outputs the caption one word at a time: "A boy is playing golf".
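A toy end-to-end sketch of steps 3 and 4: mean-pool the per-frame features and greedily decode a caption word by word. The decoder interface, vocab object, and token names here are hypothetical stand-ins, not the released model.

```python
# Toy sketch of steps 3-4: mean-pool per-frame features, then greedily decode
# a caption word by word. `decoder` and `vocab` are hypothetical stand-ins.
import torch

def generate_caption(frame_feats, decoder, vocab, max_len=20):
    video_feat = torch.stack(frame_feats).mean(dim=0)        # step 3: mean pooling
    words, state = [], None
    token = vocab.bos_id                                      # start-of-sentence token
    for _ in range(max_len):                                  # step 4: greedy decoding
        logits, state = decoder(video_feat, token, state)     # assumed decoder signature
        token = logits.argmax().item()
        if token == vocab.eos_id:                             # stop at end-of-sentence
            break
        words.append(vocab.id_to_word[token])
    return " ".join(words)                                    # e.g. "a boy is playing golf"
```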
Step 1: CNN pre-training
● Based on AlexNet [Krizhevsky et al. NIPS'12]
● 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]
● Initialize the weights of our network; fc7 gives the 4096-dimension "feature vector".
Step 2: Image-caption training
[Figure: CNN features of an image feed the LSTM decoder, which is trained to output the caption "A man is scaling a cliff"]
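Conceptually, this stage trains the decoder with a cross-entropy loss against the next ground-truth caption word at each time step (teacher forcing). The sketch below illustrates that loss; the decoder call and tensor names are assumptions.

```python
# Illustrative training objective for the image-caption stage: cross-entropy
# against the next ground-truth word at every time step (teacher forcing).
import torch.nn.functional as F

def caption_loss(decoder, image_feat, caption_ids):
    # caption_ids: (batch, T) word indices for captions such as "A man is scaling a cliff"
    logits = decoder(image_feat, caption_ids[:, :-1])   # predict word t+1 from words <= t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids[:, 1:].reshape(-1))
```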
Step 3: Fine-tuning
[Figure: CNN features of video frames feed the LSTM decoder, which outputs "A boy is playing golf"]
1. Video dataset
2. Mean-pooled feature
3. Lower learning rate
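The "lower learning rate" point amounts to reusing the weights from the image-caption stage and shrinking the optimizer's step size; the snippet below is only illustrative, with a stand-in model and a made-up learning rate.

```python
# Illustrative only: reuse the weights from the image-caption stage and
# fine-tune with a smaller learning rate (the value below is made up).
import torch
import torch.nn as nn

decoder = nn.LSTM(4096, 512, num_layers=2, batch_first=True)   # stand-in for the caption model
optimizer = torch.optim.SGD(decoder.parameters(), lr=1e-4, momentum=0.9)  # lower lr for fine-tuning
```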
Experiments: Dataset
Microsoft Research Video Description dataset [Chen & Dolan, ACL'11]
Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
• 1970 YouTube video snippets, 10-30 s each
• typically a single activity
• no dialogues
• 1200 training, 100 validation, 670 test videos
Annotations:
• descriptions in multiple languages
• ~40 English descriptions per video
• descriptions and videos collected on AMT
Sample video and gold descriptions
Video 1:
● A man appears to be plowing a rice field with a plow being pulled by two oxen.
● A team of water buffalo pull a plow through a rice paddy.
● Domesticated livestock are helping a man plow.
● A man leads a team of oxen down a muddy path.
● Two oxen walk through some mud.
● A man is tilling his land with an ox pulled plow.
● Bulls are pulling an object.
● Two oxen are plowing a field.
● The farmer is tilling the soil.
● A man in ploughing the field.
Video 2:
● A man is walking on a rope.
● A man is walking across a rope.
● A man is balancing on a rope.
● A man is balancing on a rope at the beach.
● A man walks on a tightrope at the beach.
● A man is balancing on a volleyball net.
● A man is walking on a rope held by poles.
● A man balanced on a wire.
● The man is balancing on the wire.
● A man is walking on a rope.
● A man is standing in the seashore.
Evaluation
Machine translation metrics:
• BLEU
• METEOR
Human evaluation
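For instance, BLEU can be computed against the multiple reference descriptions available for each video with NLTK (METEOR requires its own scorer); this is a generic example, not the evaluation script used in the papers.

```python
# Generic example (not the papers' evaluation scripts): BLEU of a generated
# caption against the multiple reference descriptions for one video.
from nltk.translate.bleu_score import sentence_bleu

references = [ref.split() for ref in [
    "a man is riding a bicycle",
    "a person rides a bike down the road",
]]
hypothesis = "a man is riding a bike".split()
print(sentence_bleu(references, hypothesis))   # between 0 and 1, higher is better
```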
Results (YouTube corpus, METEOR %):
Mean-Pool (VGG)     27.7
S2VT (randomized)   28.2
S2VT (RGB)          29.2
S2VT (RGB+Flow)     29.8
Movie Corpus – DVS (Descriptive Video Service)
Processed example: "Looking troubled, someone descends the stairs. Someone rushes into the courtyard. She then puts a head scarf on …"
Example outputs (M-VAD Movie Corpus)
MPII-MD: https://youtu.be/XTq0huTXj1M
M-VAD: https://youtu.be/pER0mjzSYaM
Implicit Attention in LSTM
Other Vision & Language Applications
Visual Question Answering
Some questions require reasoning.
Visual Question Answering: Spatial Memory Network
• Based on Memory Networks [Weston2014], [Sukhbaatar’15]
• Store visual features from image regions in memory
S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks, 2015
J. Weston, S. Chopra, and A. Bordes. Memory networks, 2014.
Huijuan Xu, Kate Saenko,
Ask, Attend and Answer: Exploring Question- Guided Spatial Attention for Visual Question Answering, 2015 https://arxiv.org/abs/1511.05234
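A much-simplified sketch of the idea of question-guided spatial attention (not the full Spatial Memory Network): region features act as a memory, each region is scored against the question embedding, and a softmax turns the scores into attention weights.

```python
# Much-simplified sketch of question-guided spatial attention (not the full
# Spatial Memory Network): score each region feature against the question
# embedding and pool the regions with a softmax over those scores.
import torch
import torch.nn.functional as F

def attend(region_feats, question_emb):
    # region_feats: (num_regions, d) CNN features for image regions (the "memory")
    # question_emb: (d,) embedding of the question
    scores = region_feats @ question_emb           # relevance of each region to the question
    attn = F.softmax(scores, dim=0)                # attention weights over regions
    return attn @ region_feats                     # attended visual evidence, shape (d,)
```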
VQA Results
What season does this appear to be? GT: fall Our Model: fall
What color is the stitching on the ball? GT: red Our Model: red
VQA Results
What is the weather?
GT: rainy Our Model: rainy
What color is the fence?
GT: green Our Model: green
Referring Expression Grounding
[Hu et al. CVPR'16] [Hu et al. CVPR'17] [Hu et al. ECCV'18]
Text-based object queries, e.g. "window upper right", "fence left of center door", "lady in black shirt"; the model predicts the referred image region.
Grounding expressions in video
Given a query such as "Person holding the door to the refrigerator open", find the corresponding moment in the video.
Multilevel Language and Vision Integration for Text-to-Clip Retrieval. Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko. AAAI 2019.
Language-based Navigation
Instruction: "Walk into the kitchen and go to the left once you pass the counters. Go straight into the small room with the sink. Stop next to the door."
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko. ACL 2019.
Summary
● a variety of language & vision tasks
● an active research area
References
[1] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
[2] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV).
[3] Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015.
[4] Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.
[5] Sequence to Sequence - Video to Text. Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko. ICCV 2015.
[6] Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. Huijuan Xu, Kate Saenko. 2015. https://arxiv.org/abs/1511.05234