
Announcements Reminder: Class challenge out! Ends December 10th
• Lab this week – go over pset6 solutions, tips for challenge

Vision & Language Applications
Slides adapted from Kate Saenko Machine Learning

so far…

General AI: machines that see, talk, and act (the intersection of Vision, Language, and Action)
▪ Social media analysis
▪ Security and smart
cameras
▪ AI assistants
▪ Helper robots for the elderly
▪ etc…

More Natural Human-Machine Interaction
Example interactions:
• "Find a teapot on the table"
• "What is in this glass?" → "water"
• "A woman wearing a striped shirt"
• "Go outside and wait by the car"
• Description
• Visual question answering (VQA)
• Referring expression (REF)
• Instruction following / navigation
•…
(Image: Pepper robot)

Vision & Language problems
• Image captioning: "A baseball game in progress with the batter up to plate"
• Video captioning: "A man is riding a bicycle"
• Visual Question Answering: Q: "What is the child standing on?" A: "skateboard"

Vision & Language problems
• Referring expressions: find "window upper right"
• Text-to-clip retrieval from video: find the moment when "girl looks up at the camera and smiles"
…and many others…

Demos
https://www.captionbot.ai/
http://vqa.cloudcv.org/

Today: Vision & Language
● Video captioning—in detail
● Other tasks
● Visual question answering (VQA)
● Video clip search
● Following instructions to navigate

Video Captioning

Applications of video captioning
● Image and video retrieval by content
● Video description service (e.g., "Children are wearing green shirts. They are dancing as they sing the carol.")
● Human-robot interaction
● Video surveillance

Image Captioning, B.D. (before deep learning)
Language: Increasingly focused on grounding meaning in perception. Vision: Exploit linguistic ontologies to “tell a story” from images.
[Farhadi et al. ECCV'10]: (animal, stand, ground)
[Kulkarni et al. CVPR'11]: "There are one cow and one sky. The golden cow is by the blue sky."
Many early works on image description: Farhadi et al. ECCV'10, Kulkarni et al. CVPR'11, Mitchell et al. EACL'12, Kuznetsova et al. ACL'12 & ACL'13.
Identify objects and attributes, and combine with linguistic knowledge to “tell a story”.
Dramatic increase in interest since then. (8 papers in CVPR’15)

Video Description, B.D. (before deep learning)
[Krishnamurthy et al. AAAI'13]
[Yu and Siskind, ACL’13]
● Extract object and action descriptors.
● Learn object, action, scene classifiers.
● Use language to bias visual interpretation.
● Estimate most likely agents and actions.
● Template to generate sentence.
Others: Guadarrama ICCV’13, Thomason COLING’14
Limitations:
● Narrow Domains
● Small Grammars
● Template-based sentences
● Rely on several separate features and classifiers
[Rohrbach et al. ICCV'13]

After Deep Learning, A.D.: End-to-End Neural Models based on Recurrent Nets
Encode the input, then decode the output:
• Image captioning: CNN encoder → RNN decoder → sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]
• Video captioning: CNN + RNN encoder → RNN decoder → sentence [Venugopalan et al. ICCV'15]
• VQA: CNN + question → RNN encoder → RNN decoder → answer [Malinowski et al. ICCV'15]

Recurrent Neural Networks (RNNs) can map a vector to a sequence.
• RNN encoder → RNN decoder: English sentence → French sentence [Sutskever et al. NIPS'14]
• CNN encoder → RNN decoder: image → sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]
• CNN + RNN encoder → RNN decoder: video → sentence [Venugopalan et al. NAACL'15]
Key insight: generate a feature representation of the video and "decode" it to a sentence.

[review] Recurrent Neural Networks
RNNs can map an input to an output sequence: at each time step, a cell with the same weights reads the input x_t and the previous hidden state h_(t-1), and produces a new hidden state h_t and an output y_t, modeling Pr(y_t | input, y_0 … y_(t-1)). Successful in translation and speech.
Problems:
1. Hard to capture long-term dependencies
2. Vanishing gradients (gradients shrink as they pass back through many time steps)
Solution: the Long Short-Term Memory (LSTM) unit
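To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN cell; the weight names (W_xh, W_hh, W_hy) and shapes are our own illustrative assumptions, not the lecture's notation:

```python
import numpy as np

# Minimal vanilla RNN cell: the SAME weights are reused at every time step.
def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                           # output logits for this step
    return h_t, y_t

# Unroll over a sequence: the hidden state carries information forward in time.
def rnn_forward(xs, h0, params):
    h, ys = h0, []
    for x_t in xs:
        h, y = rnn_step(x_t, h, *params)
        ys.append(y)
    return ys
```

The shared-weight unrolling is exactly why gradients can vanish: backpropagation through time multiplies by the same Jacobian at every step.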

LSTM Sequence decoders
All functions are differentiable, so the full gradient is computed by backpropagating through time; weights are updated using stochastic gradient descent.
Matches state of the art on:
• Speech recognition [Graves & Jaitly ICML'14]
• Machine translation (Eng-Fr) [Sutskever et al. NIPS'14]
• Image description [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]
(Figure: an LSTM unrolled over time steps t=0…t=3, reading an input and emitting an output at each step.)

LSTM Sequence decoders
Two LSTM layers: the second layer adds depth in temporal processing. A softmax over the vocabulary predicts the output word at each time step.
(Figure: a two-layer LSTM unrolled for t=0…t=3, with a softmax over the vocabulary at each step.)
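A minimal sketch of such a two-layer decoder with a vocabulary softmax, written in PyTorch; the sizes, names, and the use of nn.LSTM/nn.Embedding are illustrative assumptions rather than the lecture's exact model:

```python
import torch
import torch.nn as nn

# Two-layer LSTM decoder sketch: embeds the previous word, runs a stacked LSTM
# (second layer = depth in temporal processing), and scores the vocabulary.
class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=500, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # logits; softmax lives in the loss

    def forward(self, tokens, state=None):
        x = self.embed(tokens)          # (batch, time, embed_dim)
        h, state = self.lstm(x, state)  # (batch, time, hidden_dim)
        return self.out(h), state       # per-step distribution over the vocabulary
```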

Translating Videos to Natural Language
(Figure: CNN frame features feed a stack of two LSTMs that emit the sentence "A boy is playing golf", one word per step.)
[Venugopalan et al. NAACL'15]

Test time: Step 1
(a) Input video: sample frames at a rate of 1 in 10.
(b) Scale each frame to 227×227 before passing it through the CNN.
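A sketch of this sampling-and-scaling step, assuming OpenCV (cv2) is available; the helper name is ours, and it simply makes the slide's 1-in-10 recipe explicit:

```python
import cv2

# Test-time step 1: keep every 10th frame and scale to the 227x227 CNN input size.
def sample_frames(video_path, every=10, size=(227, 227)):
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every == 0:
            frames.append(cv2.resize(frame, size))
        i += 1
    cap.release()
    return frames
```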

[review] Convolutional Neural Networks (CNNs)
Image Credit: Maurice Peeman
Successful in semantic visual recognition tasks.
Each layer applies linear filters followed by a nonlinearity; stacking layers learns a hierarchy of features of increasing semantic richness.
Krizhevsky, Sutskever, Hinton 2012
ImageNet classification breakthrough

Test time: Step 2 Feature extraction
Forward propagate each sampled frame through the CNN.
Output: "fc7" features, the 4096-dimensional "feature vector" (activations just before the classification layer).
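A sketch of fc7 extraction, using torchvision's AlexNet as a stand-in for the lecture's Caffe model; the layer slicing and the weights enum are assumptions that hold for recent torchvision versions:

```python
import torch
import torchvision.models as models

# Build an "fc7 extractor": everything up to, but not including, the final
# classification layer, so the output is the 4096-d penultimate activation.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
fc7_extractor = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],  # drop the 1000-way classifier
)

with torch.no_grad():
    frame_batch = torch.randn(16, 3, 227, 227)  # preprocessed frames (dummy here)
    fc7 = fc7_extractor(frame_batch)            # (16, 4096) "fc7" feature vectors
```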

Test time: Step 3 Mean pooling
Average the per-frame fc7 features (one CNN pass per frame) across all frames to produce a single fixed-length feature for the whole video.
Arxiv: http://arxiv.org/abs/1505.00487
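Continuing the sketch above, mean pooling is a single average over the frame axis:

```python
# Collapse per-frame fc7 features (n_frames, 4096) into one fixed-length
# video descriptor (4096,) by averaging over time.
video_feature = fc7.mean(dim=0)
```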

Test time: Step 4 Generation
Input video → convolutional net → recurrent net → output sentence.
The LSTM decoder generates the description one word at a time, e.g., "A boy is playing golf".
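A greedy-decoding sketch built on the CaptionDecoder above; conditioning the decoder by projecting the video feature into its initial hidden state is our own simplifying choice, not necessarily how the original model injects the feature:

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, decoder, feat_dim=4096, hidden_dim=1000, num_layers=2):
        super().__init__()
        self.decoder = decoder
        # Map the mean-pooled video feature to the LSTM's initial (h, c) state.
        self.init_h = nn.Linear(feat_dim, num_layers * hidden_dim)
        self.init_c = nn.Linear(feat_dim, num_layers * hidden_dim)
        self.num_layers, self.hidden_dim = num_layers, hidden_dim

    def greedy_decode(self, video_feature, bos_id, eos_id, max_len=20):
        f = video_feature.view(1, -1)
        h = self.init_h(f).view(self.num_layers, 1, self.hidden_dim)
        c = self.init_c(f).view(self.num_layers, 1, self.hidden_dim)
        token, words, state = torch.tensor([[bos_id]]), [], (h, c)
        for _ in range(max_len):
            logits, state = self.decoder(token, state)
            token = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next word
            if token.item() == eos_id:
                break
            words.append(token.item())
        return words

# Usage (vocab size and BOS/EOS ids are placeholders):
captioner = VideoCaptioner(CaptionDecoder(vocab_size=10000))
word_ids = captioner.greedy_decode(video_feature, bos_id=1, eos_id=2)
```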

Step1: CNN pre-training
● Based on AlexNet [Krizhevsky et al. NIPS'12]
● Pre-trained on 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]
● Used to initialize the weights of our network
CNN output: fc7, the 4096-dimensional "feature vector"

Step2: Image-Caption training
Train the CNN + two-layer LSTM decoder on image-caption pairs, e.g., generating "A man is scaling a cliff" word by word.

Step3: Fine-tuning
1. Switch to a video dataset
2. Input the mean-pooled frame feature
3. Use a lower learning rate (see the sketch below)
(Figure: the same CNN + LSTM pipeline, now generating "A boy is playing golf".)
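A sketch of the lower learning rate via optimizer parameter groups, reusing fc7_extractor and CaptionDecoder from the sketches above; the specific rates are illustrative assumptions, not values from the lecture:

```python
import torch

# Fine-tuning: continue training the pre-trained CNN gently while the decoder
# trains at a higher rate.
decoder = CaptionDecoder(vocab_size=10000)
optimizer = torch.optim.SGD(
    [
        {"params": fc7_extractor.parameters(), "lr": 1e-4},  # pre-trained CNN: low LR
        {"params": decoder.parameters(), "lr": 1e-3},        # decoder: higher LR
    ],
    momentum=0.9,
)
```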

Experiments: Dataset
Microsoft Research Video Description dataset [Chen & Dolan, ACL’11] Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
1,970 YouTube video snippets, 10-30s each
● typically a single activity
● no dialogue
● 1,200 training, 100 validation, 670 test
Annotations:
● descriptions in multiple languages
● ~40 English descriptions per video
● descriptions and videos collected on AMT

Sample video and gold descriptions
● A man appears to be plowing a rice field with a plow being pulled by two oxen.
● A team of water buffalo pull a plow through a rice paddy.
● Domesticated livestock are helping a man plow.
● A man leads a team of oxen down a muddy path.
● Two oxen walk through some mud.
● A man is tilling his land with an ox pulled plow.
● Bulls are pulling an object.
● Two oxen are plowing a field.
● The farmer is tilling the soil.
● A man in ploughing the field.
● A man is walking on a rope.
● A man is walking across a rope.
● A man is balancing on a rope.
● A man is balancing on a rope at the beach.
● A man walks on a tightrope at the beach.
● A man is balancing on a volleyball net.
● A man is walking on a rope held by poles.
● A man balanced on a wire.
● The man is balancing on the wire.
● A man is walking on a rope.
● A man is standing in the seashore.

Evaluation
Machine translation metrics:
● BLEU
● METEOR
Human evaluation
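For instance, BLEU against the multiple human references per video can be computed with NLTK; the reference sentences below come from the sample descriptions above, while the candidate and the bigram weighting are our own toy choices:

```python
from nltk.translate.bleu_score import sentence_bleu

# Score one generated caption against several human references for the video.
references = [
    "a man is tilling his land with an ox pulled plow".split(),
    "two oxen are plowing a field".split(),
]
candidate = "a man is plowing a field with oxen".split()
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))  # BLEU up to bigrams
```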

Results (YouTube), METEOR scores:

Model              METEOR
Mean-Pool (VGG)    27.7
S2VT (randomized)  28.2
S2VT (RGB)         29.2
S2VT (RGB+Flow)    29.8

Example outputs

Movie Corpus – DVS (Descriptive Video Service)
Processed: Looking troubled, someone descends the stairs.
Someone rushes into the courtyard. She then puts a head scarf on …

Examples (M-VAD Movie Corpus)
MPII-MD: https://youtu.be/XTq0huTXj1M M-VAD: https://youtu.be/pER0mjzSYaM

Implicit Attention in LSTM

Other Vision & Language Applications

Visual Question Answering
Some questions require reasoning.

Visual Question Answering: Spatial Memory Network
• Based on Memory Networks [Weston et al. 2014], [Sukhbaatar et al. 2015]
• Stores visual features from image regions in memory and attends to them guided by the question
S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-End Memory Networks, 2015.
J. Weston, S. Chopra, and A. Bordes. Memory Networks, 2014.
Huijuan Xu and Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, 2015. https://arxiv.org/abs/1511.05234
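A minimal sketch of the core idea: score each image region stored in memory against the question and pool a weighted sum. The bilinear scoring and all dimensions here are our simplification, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

# Question-guided spatial attention over region features held in memory.
def attend(region_feats, question_vec, W):
    # region_feats: (num_regions, d_v); question_vec: (d_q,); W: (d_v, d_q)
    scores = region_feats @ W @ question_vec  # one relevance score per region
    weights = F.softmax(scores, dim=0)        # attention distribution over regions
    return weights @ region_feats             # attended visual evidence, (d_v,)

regions = torch.randn(49, 512)   # e.g., a 7x7 grid of CNN region features (dummy)
question = torch.randn(256)      # encoded question vector (dummy)
W = torch.randn(512, 256)
attended = attend(regions, question, W)  # fed onward to predict the answer
```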

VQA Results
What season does this appear to be? GT: fall Our Model: fall
What color is the stitching on the ball? GT: red Our Model: red

VQA Results
What is the weather?
GT: rainy Our Model: rainy
What color is the fence?
GT: green Our Model: green

Referring Expression Grounding
[Hu et al CVPR16] [Hu et al CVPR17] [Hu et al ECCV18]
Text-based object queries, e.g.: "window upper right", "fenced window left of center door", "lady in black shirt"

Grounding expressions in video
Given a query, e.g. "Person holding the door to the refrigerator open", find the corresponding moment in the video.
Multilevel Language and Vision Integration for Text-to-Clip Retrieval. Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko. AAAI 2019.

Language based Navigation
Instruction: Walk into the kitchen and go to the left once you pass the counters. Go straight into the small room with the sink. Stop next to the door.
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko. ACL 2019.

Summary
● a wide variety of language & vision tasks
● an active research area
References
[1] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
[2] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV).
[3] Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015
[4] Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.
[5] Sequence to Sequence – Video to Text. Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko. ICCV 2015.
[6] Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. Huijuan Xu, Kate Saenko. 2015. https://arxiv.org/abs/1511.05234

Next Class
Applications II: Machine Learning Ethics:
Ethics in ML; population bias in machine learning, fairness, transparency, accountability; de-biasing image captioning models