MULTIMEDIA RETRIEVAL
Semester 1, 2022
Large Scale Retrieval
Image/Video Annotation
Semantic Gap
Bag of Visual Words Model
Video Google
Semantic Gap
Content-based retrieval uses low-level features (e.g. color, texture, shape).
Human understanding works at the level of semantics: objects and meaningful attributes.
The mismatch between the two is the semantic gap.
CBIR: Semantic Gap
Query: “Find me pictures of tiger”
Annotation Task
Training images with annotations:
• tiger, cat, grass
• hippo, bull, mouth, walk
• flower, coralberry, leaves, plant
Test image to annotate: ocean? lion? building? sky?
Annotation Approaches
• Word co-occurrence model
• Machine translation model
• Statistical models
• Refinement strategies
Co-occurrence Model
• An image is divided into parts
• The image's annotation words are inherited by each part
• The parts are vector quantized into clusters
• P(word | cluster) is estimated statistically for every word and every cluster (see the sketch below)
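A minimal Python sketch of this estimation step, assuming the parts have already been vector quantized; all names are illustrative rather than taken from the original paper:

```python
from collections import defaultdict

def estimate_word_given_cluster(part_clusters, part_words):
    """part_clusters[i]: cluster id of part i; part_words[i]: words inherited by part i."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[cluster][word]
    totals = defaultdict(int)                       # total word count per cluster
    for cluster, words in zip(part_clusters, part_words):
        for word in words:
            counts[cluster][word] += 1
            totals[cluster] += 1
    # P(word | cluster) = co-occurrence count / total word count in that cluster
    return {c: {w: n / totals[c] for w, n in wc.items()} for c, wc in counts.items()}
```

A new image is then annotated by quantizing its parts and ranking words by their average P(word | cluster) over the image's clusters.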
Co-occurrence Model
Y. Mori, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
Machine Translation Model
Analogy with statistical machine translation: aligning an English sentence with its Chinese translation, e.g. "I love multimedia computing technology very much." ↔ "我非常喜欢多媒体计算技术." In the same way, image regions (blobs) are aligned with annotation words such as "tree" and "tiger".
P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, ECCV 2002.
Blob Representation
Visual features per region:
• Region size
• Position
• Oriented energy (12 filters)
• Simple shape features
Tokenization
Words → word tokens.
Image segments, each represented by 30 features (size, position, color, texture and shape), are clustered with k-means; the best (nearest) cluster for each blob → blob tokens.
Experimental data: 160 CDs from the Corel data set, 100 images in each.
10 runs; in each run:
• 80 randomly selected CDs → ~6,000 training images, ~2,000 test images
• 150-200 word tokens, 500 blob tokens
Segmentation (using normalized cuts) took about a month.
Annotation
The annotation model is the conditional probability $p(w \mid b)$ of a word given a blob. The likelihood function marginalizes over alignments $a$:
$$p(w \mid b) = \sum_{a} p(w, a \mid b)$$
The event $a_j = i$ means that the $j$-th word in the possible translation translates the $i$-th blob.
Estimated using Expectation Maximization (EM), alternating between:
• predicting correspondences from translation probabilities (E-step)
• predicting translation probabilities from correspondences (M-step), as sketched below
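A hedged sketch of this EM alternation in the style of IBM Model 1, assuming each training image is given as a pair of blob-token ids and word-token ids; it illustrates the idea rather than reproducing the paper's exact procedure:

```python
import numpy as np

def em_translation(pairs, n_words, n_blobs, iters=20):
    """pairs: list of (blob_token_ids, word_token_ids), one pair per training image."""
    t = np.full((n_words, n_blobs), 1.0 / n_words)  # t[w, b] approximates p(w | b)
    for _ in range(iters):
        counts = np.zeros_like(t)
        for blobs, words in pairs:
            blobs = np.asarray(blobs)
            for w in words:
                p = t[w, blobs]                      # E-step: soft correspondences
                np.add.at(counts, (w, blobs), p / p.sum())
        # M-step: re-estimate translation probabilities from expected counts
        t = counts / counts.sum(axis=0, keepdims=True).clip(min=1e-12)
    return t
```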
Large Scale Annotation/Tagging
Scale of popular media-sharing sites (figures as presented, circa 2009-2010):
• 5+ billion photos (Sep 2010): 160 years to view them all (1 s per image); 3,000+ uploads/minute; 2% of Internet users visit (2009); daily time on site: 4.7 minutes (2009)
• 400 million (2010): 2007 bandwidth = the entire Internet in 2000; 3B+ views per day (2010); 1,920 years to view them all (1 s per image); ~138M uploads/minute; 24% of Internet users visit (2009); daily time on site: 30 minutes (2009)
• ~20 hours of video uploaded/minute (2009): 2,000 years to see it all; 20% of Internet users visit (2009); daily time on site: 23 minutes (2009)
• 60 billion (Dec 2010)
The following slides are from Xian-Sheng Hua, "Image and Video Tagging in the Internet Era".
Characteristics of Internet Multimedia
• Huge amount of data
• Consumed frequently
• Increasing very rapidly
• Affect (emotion) highly involved
• Very large variance
• Connected to each other
Variety of Internet Multimedia Applications
Recommendation
Authoring/Editing
Copy Detection
Summarization
Visualization
Advertising
Categorization
Media on Mobiles
Evolution of Media Tagging
• 1970s-80s: manual labeling
• 2001: surroundings as tags
• 2003: data-driven automated annotation
• 2005: large-scale manual labeling
• 2006-2010: tag processing
• 2010-2015 and beyond: image/video captioning
(Automatic) Annotation ≈ Concept Detection ≈ (Automatic) Tagging ≈ (Automatic) Labeling
Annotations/tags ≈ concepts
Approaches
Automated Annotation – 1st Paradigm
A typical strategy: individual concept detection, i.e. annotating multiple concepts separately.
Each detector maps low-level features to a binary decision (-1 / +1) for one concept, e.g. Outdoor, Face, Person, People-Marching, Road, Walking-Running.
To Exploit Label Correlations
Example: √ Person, √ Street, √ Building, × Mountain, √ Walking/Running → √? Marching (the presence of correlated concepts suggests People-Marching).
Automated Annotation – 2nd Paradigm
Another typical strategy: fusion-based annotation via Context-Based Concept Fusion (CBCF).
Individual detectors (e.g. for Person, People-Marching, Walking-Running) first produce scores that form a concept model vector; a concept-fusion step then refines the binary decisions (-1 / +1) using concept correlations.
Automated Annotation – 3rd Paradigm
Integrated concept detection: Correlative Multi-Label Learning (CML).
A single model maps low-level features to the binary labels (-1 / +1) of all concepts (Outdoor, Face, Person, People-Marching, Road, Walking-Running) at once, modeling the concepts and their correlations jointly.
CML Roadmap
Multi-label annotation:
• 1st paradigm, individual detectors: no correlations
• 2nd paradigm, fusion based: has correlations, but uses a second step
• 3rd paradigm, integrated: models concepts and correlations in one step
G.-J. Qi, et al., Correlative Multi-Label Video Annotation, ACM Multimedia 2007.
How to Model Concept Correlations
How can we model concepts and the correlations among them in a single step?
By converting correlations into features: construct a new feature vector that captures both
• the characteristics of the concepts, and
• the correlations among concepts.
A rough sketch of such a joint feature vector is given below.
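As a rough illustration (not the exact construction used in the CML paper), a joint feature map over an image feature vector x and a label vector y in {-1, +1}^K can stack per-concept terms with pairwise label-correlation terms; a linear model over this vector then scores all label assignments jointly:

```python
import numpy as np
from itertools import combinations

def joint_feature(x, y):
    """x: d-dim low-level feature vector; y: K-dim label vector with entries in {-1, +1}."""
    per_concept = [yk * x for yk in y]  # captures each concept's relation to the features
    pairwise = [float(y[j] * y[k])      # captures correlations between concept pairs
                for j, k in combinations(range(len(y)), 2)]
    return np.concatenate(per_concept + [np.array(pairwise)])
```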
Correlative Multi-Label Video Annotation
Experiments
• TRECVID 2005 dataset (170 hours)
• 39 concepts (LSCOM-Lite)
• Training (65%), validation (16%), testing (19%)
• CML (MAP = 0.290) improves over IndSVM (MAP = 0.246) by about 17% and over CBCF (MAP = 0.253) by about 15%
Internet Media
Example: user tags on photos from the WWW2009 conference in Madrid, Spain:
www, www2009, madrid, spain, w3c, futuro, future, workshop, congreso, Don Quixote, palacio, municipal, consortium, consorcio, cervantes, Sancho, 20, aniversario, España, Vinton, …
Social tags
Good, but:
• Ambiguous
• Incomplete
• No relevance information
Two directions to improve tag quality:
• During tagging: tag recommendation
• After tagging: tag refinement/ranking
Social tags for online images are better than automatic annotation in terms of both scalability and accuracy. However, the most relevant tag is often NOT at the top position in an image's tag list.
This phenomenon is widespread on social media websites such as Flickr: fewer than 10% of images have their most relevant tag at the top position in their tag list.
This has significantly limited the performance of tag-based image search and other applications.
For example, when we search for “bird” on Flickr.
What we are going to do:
Rank the tags according to their relevance to the image.
To improve:
• Tag-based search
• Image annotation (automatic tagging)
• Group recommendation
Towards Storytelling
Image and video captioning
Figure from a paper in ACM MM 2016.
Bag-of-words (BoW) Model
A document is represented as the collection of the words it contains.
Analogously, an image can be represented as a collection of the objects it contains.
What features could characterize an object well?
Slide credit: R. Jin, Content-Based Image Retrieval.
Bag-of-Visual-Words (BoVW)
Slides credit: M. Bressan and L. Fei-Fei
Interest Point Detection
Local features have been shown to be effective for representing images.
They are image patterns that differ from their immediate neighborhood; they can be points, edges, or small patches.
Such local features are called key points or interest points of an image.
Interest Point Detection
Example: key points detected in an image by a corner detector.
Interest Point Detection
The detection of interest points needs to be robust to various geometric transformations.
(Figure: original image vs. scaling + rotation + translation vs. projection.)
Interest Point Detection
The detection of interest points also needs to be robust to imaging conditions, e.g. lighting and blurring.
Descriptor
Represent each detected key point:
• take measurements from a region centered on an interest point, e.g. texture, shape, …
• each descriptor is a vector of fixed length, e.g. the SIFT descriptor is a 128-dimensional vector
Descriptor
The descriptor should also be robust under different image transformations: corresponding points should have similar descriptors.
Keypoint + Local Descriptors
SIFT (Scale-Invariant Feature Transform) [Lowe '04]: a feature detector plus a descriptor.
• Detector: difference of Gaussians, with edge responses eliminated
• Descriptor: calculated in a 16×16 window; 128 dimensions = (4 × 4 cells) × 8 orientation bins
A minimal extraction sketch follows.
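The sketch assumes opencv-python 4.4 or later (where SIFT is in the main module); the file name is a placeholder:

```python
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # descriptors: one 128-d vector per key point
```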
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
Difference of Gaussian
Subtract a Gaussian-blurred/smoothed version of a signal from a less blurred one (see the sketch below).
http://en.wikipedia.org/wiki/Difference_of_Gaussians
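A quick illustration: blur the image at two scales and subtract. The sigma values here are illustrative; SIFT uses a fixed ratio between successive scales.

```python
import cv2
import numpy as np

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
fine = cv2.GaussianBlur(img, (0, 0), sigmaX=1.0)    # finer scale
coarse = cv2.GaussianBlur(img, (0, 0), sigmaX=1.6)  # coarser scale
dog = fine - coarse  # band-pass response; extrema over space and scale locate key points
```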
Image Representation
Bag-of-features representation: an example.
From the original image, key points are detected and a descriptor is computed for each (5-dimensional in this toy example), so the image becomes a set of descriptor vectors.
How to measure similarity between two images? Count the number of matches: if the distance between two descriptor vectors is smaller than a threshold, we get one match. A naive matching sketch follows.
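A brute-force sketch of this matching rule; the threshold value is illustrative, and real systems use ratio tests and approximate nearest-neighbor search instead:

```python
import numpy as np

def count_matches(desc_a, desc_b, threshold=100.0):
    """desc_a: (n, d) descriptors of image A; desc_b: (m, d) descriptors of image B."""
    matches = 0
    for d in desc_a:
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every descriptor in B
        if dists.min() < threshold:                 # nearest one close enough: one match
            matches += 1
    return matches
```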
(Figures: a dissimilar image pair with 1 matched point; a similar pair with 5 matched points.)
Direct matching is computationally expensive, requiring a linear scan of the entire database.
Example: matching a query image against a database of 1 million images at 0.1 s per image pair takes $10^6 \times 0.1\,\mathrm{s} = 10^5\,\mathrm{s} \approx 28$ hours, i.e. more than a day to answer a single query.
BoVW Model
(Figure: visual words b1, b2, b3, b4, …)
Group key points into visual words and represent images by histograms of visual words, as sketched below.
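A minimal sketch of building such a histogram, assuming a vocabulary of cluster centers is already available (names are illustrative):

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """descriptors: (n, d) image descriptors; vocabulary: (k, d) visual-word centers."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)                    # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)              # L1-normalized histogram
```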
BoVW for Product Tagging
http://www.sccs.swarthmore.edu/users/09/btomasi1/tagging-products.html
BoVW Model
• Generate a "visual vocabulary"
• Represent each key point by its nearest "visual word"
• Represent an image by a "bag of visual words"
• Text retrieval techniques can then be applied directly
Step 1: Dataset
Example: three images, imgA with 2 key points, imgB with 3, imgC with 2.
imglist.txt lists each image with its key-point count:
imgB.jpg 3; imgC.jpg 2; imgA.jpg 2
esp.feature stores the key-point features in the same order: imgB key points 1-3, imgC key points 1-2, imgA key points 1-2.
Step 2: Key Point Quantization
Represent each image by a bag of visual words:
1. Construct the visual vocabulary: cluster all key points into 10,000 clusters; each cluster center is a visual word.
2. Map each key point to a visual word: find the nearest cluster center for each key point (nearest-neighbor search).
A sketch of both sub-steps follows.
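A hedged sketch of this step with scikit-learn; the file names are placeholders, and MiniBatchKMeans stands in for plain k-means purely for speed at 10,000 clusters:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

all_descriptors = np.load("esp_features.npy")  # placeholder: stacked key-point features

kmeans = MiniBatchKMeans(n_clusters=10_000, random_state=0).fit(all_descriptors)
np.save("centers.npy", kmeans.cluster_centers_)  # keep the centers for query-time use

word_ids = kmeans.predict(all_descriptors)       # nearest visual word per key point
```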
K-Means Clustering
Given a set of $n$ unlabeled examples $D = \{x_1, x_2, \ldots, x_n\}$ in a $d$-dimensional feature space, and a number of clusters $K$:
Find a partition of $D$ into $K$ non-empty disjoint subsets $D_1, \ldots, D_K$ with $\bigcup_{j=1}^{K} D_j = D$ and $D_i \cap D_j = \emptyset$ for $i \neq j$, so that the points in each subset are coherent according to a certain criterion, e.g. minimizing the squared distance of vectors to their cluster mean:
$$\min \sum_{j=1}^{K} \sum_{x \in D_j} \lVert x - m_j \rVert^2, \qquad m_j = \frac{1}{|D_j|} \sum_{x \in D_j} x$$
Step 2: Key Point Quantization
Toy example (the data from Step 1): cluster the 7 key points into 3 clusters; the cluster centers are cnt1, cnt2, cnt3, and each center is a visual word w1, w2, w3. Then find the nearest center for each key point.
Step 2: Key Point Quantization
imgA.jpg: 1st key point → w2, 2nd key point → w1
imgB.jpg: 1st key point → w3, 2nd key point → w3, 3rd key point → w2
imgC.jpg: 1st key point → w3, 2nd key point → w2

Bag-of-words representation:
imgA.jpg: w2 w1
imgB.jpg: w3 w3 w2
imgC.jpg: w3 w2
Step 2: Key Point Quantization
In this step, you need to save:
• the cluster centers to a file; you will use these later for quantizing the key points of query images
• the bag-of-words representation of each image, in "trec" format:
imgA.jpg: w2 w1
imgB.jpg: w3 w3 w2
imgC.jpg: w3 w2
Step 3: Build index
Refer to the indexing process of textual information retrieval.
IR libraries:
• Lucene: http://lucene.apache.org/
• Lemur: http://www.lemurproject.org/
• LIRE: http://www.semanticmetadata.net/lire/ (also http://www.lire-project.net/)
A library-agnostic sketch of the core idea follows.
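The core structure is an inverted index from visual words to the images containing them, shown here with the toy data from Step 2 (real libraries add TF-IDF weighting and compression):

```python
from collections import defaultdict

docs = {
    "imgA.jpg": ["w2", "w1"],
    "imgB.jpg": ["w3", "w3", "w2"],
    "imgC.jpg": ["w3", "w2"],
}

inverted = defaultdict(dict)  # visual word -> {image: term frequency}
for img, words in docs.items():
    for w in words:
        inverted[w][img] = inverted[w].get(img, 0) + 1

print(dict(inverted["w3"]))   # {'imgB.jpg': 2, 'imgC.jpg': 1}
```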
Step 4: Extract key points for a query
Step 5: Generate BoVW for a query
Map each query key point to the cluster ID (visual word) of its nearest cluster center: the 1st key point to its mapped cluster ID, the 2nd key point to its mapped cluster ID, and so on, as sketched below.
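A sketch of query-time quantization, reusing the centers saved in Step 2; the file names are placeholders, and a KD-tree or the fitted estimator's predict method would be used at scale:

```python
import numpy as np

centers = np.load("centers.npy")        # visual vocabulary saved in Step 2
query_desc = np.load("query_desc.npy")  # placeholder: descriptors extracted in Step 4

dists = np.linalg.norm(query_desc[:, None, :] - centers[None, :, :], axis=2)
query_words = dists.argmin(axis=1)      # mapped cluster ID (visual word) per key point
```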
Step 6: Retrieval
Refer to the retrieval process of textual information retrieval: the query's bag of visual words is matched against the index built in Step 3.
Video Google
http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, ICCV 2003.
(Figure: an example where the object is occluded.)
Problem with bag-of-features
The intrinsic matching scheme performed by BoF is weak:
• for a "small" visual dictionary: too many false matches
• for a "large" visual dictionary: many true matches are missed
There is no good trade-off between "small" and "large": either the Voronoi cells are too big, or the cells cannot absorb the descriptor noise; the intrinsic approximate nearest-neighbor search of BoF is not sufficient.
www.ens-lyon.fr/LIP/Arenaire/ERVision/search_large.ppt
With 20K visual words: false matches. With 200K visual words: good matches are missed.
Need to Know
Large scale retrieval
Semantic gap
Image/video annotation/tagging/captioning
Bag of visual words model