
MULTIMEDIA RETRIEVAL
Semester 1, 2022
Large Scale Retrieval
 Image/Video Annotation
 Semantic Gap


 Bag of Visual Words Model
 Video Google
School of Computer Science

Semantic Gap
 Content-based retrieval uses low-level features
 Human understanding relies on semantics: objects and meaningful attributes
CBIR: Semantic Gap
Query: “Find me pictures of a tiger”

Annotation Task
 Training
tiger cat grass
hippo, bull, mouth, walk
flower, coralberry, leaves, plant
ocean? Lion? Building? Sky?
Annotation Approaches
 Word co-occurrence model
 Machine translation model
 Statistical models
 Refinement strategies

Co-occurrence Model
 An image is divided into parts
 Each part inherits the words annotating the whole image
 Parts are vector quantized to form clusters
 P(word | cluster) is estimated statistically for every word and every cluster
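The estimation step above amounts to simple co-occurrence counting. A minimal sketch, assuming toy data (the cluster ids and word lists below are hypothetical, not from the paper):

```python
from collections import Counter, defaultdict

def estimate_word_given_cluster(parts):
    """Estimate P(word | cluster) by co-occurrence counting.

    parts: list of (cluster_id, inherited_words) pairs, one per image part,
    where each part inherits all words annotating its source image.
    """
    counts = defaultdict(Counter)          # counts[c][w] = n(w, c)
    for cluster, words in parts:
        for w in words:
            counts[cluster][w] += 1
    return {c: {w: n / sum(wc.values()) for w, n in wc.items()}
            for c, wc in counts.items()}

# Toy data: two parts fell into cluster 0, one into cluster 1.
parts = [(0, ["tiger", "grass"]), (0, ["tiger"]), (1, ["grass"])]
p = estimate_word_given_cluster(parts)
# p[0]["tiger"] = 2/3: "tiger" accounts for 2 of the 3 word occurrences in cluster 0
```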
Co-occurrence Model
Y. Mori, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.


Machine Translation Model
I love multimedia computing technology very much. ↔ 我非常喜欢多媒体计算技术. (the same sentence in English and Chinese)
Tree Tiger
P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, ECCV 2002.
Blob Representation
 Visual features:
 Region size
 Position
 Oriented energy (12 filters)
 Simple shape features

Tokenization
 Words → word tokens
 Image segments → blob tokens:
 each segment is represented by 30 features (size, position, color, texture, and shape)
 k-means clusters these feature vectors
 the best cluster for each blob gives its blob token
 160 CDs from the Corel data set, 100 images in each
 10 experimental sets, each with 80 randomly selected CDs
 ~6,000 training images, ~2,000 test images
 150–200 word tokens
 500 blob tokens
 Segmentation (using normalized cuts) took about a month

Annotation
 Conditional probability p(w | b) of word w given blob b
 The likelihood sums over all possible alignments a:
 P(w | b) = Σ_a P(w, a | b)
 The event a_j = i means that the j-th word in the possible translation translates the i-th blob
 Trained using Expectation Maximization, alternating between:
 predicting correspondences (alignments) from translation probabilities
 predicting translation probabilities from correspondences
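The EM alternation above can be sketched in the style of IBM Model 1. This is a toy version under stated assumptions (uniform initialization, made-up word and blob tokens), not the paper's exact procedure:

```python
from collections import defaultdict

def train_translation_em(images, iterations=20):
    """IBM-Model-1-style EM for word-blob translation probabilities.

    images: list of (words, blobs) pairs.  Returns t where t[w][b] ~ P(w | b).
    """
    t = defaultdict(dict)
    for words, blobs in images:               # uniform initialization
        for w in words:
            for b in blobs:
                t[w][b] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts n(w, b)
        total = defaultdict(float)            # expected counts n(b)
        for words, blobs in images:
            for w in words:
                # E-step: soft alignment of word w over this image's blobs
                z = sum(t[w][b] for b in blobs)
                for b in blobs:
                    c = t[w][b] / z
                    count[(w, b)] += c
                    total[b] += c
        for (w, b), c in count.items():       # M-step: renormalize per blob
            t[w][b] = c / total[b]
    return t

# "grass" always co-occurs with blob b2, so EM aligns them strongly.
images = [(["tiger", "grass"], ["b1", "b2"]), (["grass"], ["b2"])]
t = train_translation_em(images)
```

Each M-step renormalizes the expected counts per blob, so the learned t[·][b] values form a proper distribution over words for each blob.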

Large Scale Annotation/Tagging
5+ billion (Sep 2010)
• 160 years to view all of them (1s per image)
• 3,000+ uploads/minute
• 2% Internet users visit (2009)
• Daily time on site: 4.7 minutes (2009)
400 million (2010)
• 2007 bandwidth = entire Internet in 2000
• 3B+ views per day (2010)
• 1,920 years to view all of them (1s per image)
• ~138M uploads/minute
• 24% Internet users visit (2009)
• Daily time on site: 30 minutes (2009)
• 2,000 years to see all of them
• ~20 hours uploaded/minute (09)
• 20% Internet users visit (2009)
• Daily time on site: 23 minutes (2009)
60 billion (Dec 2010 )
Following slides are from Xian- , Image and Video Tagging in the Internet Era

Characteristics of Internet Multimedia
Huge amount of data
Consumed frequently
Increasing very rapidly
Affection highly involved
Very large variance
Connected to each other
Variety of Internet Multimedia Applications
Recommendation
Authoring/Editing
Copy Detection
Summarization
Visualization
Advertising
Categorization
Media on Mobiles

Evolution of Media Tagging
1970s–80s: Manual labeling
2001: Surrounding text as tags
2003: Data-driven automated annotation
2005: Large-scale manual labeling
2006–2010: Tag processing
2010–2015–…: Image/video captioning
(Automatic) Annotation ≈ Concept Detection ≈ (Automatic) Tagging ≈ (Automatic) Labeling ≈ Concepts
Approaches

Automated Annotation – 1st Paradigm
 A typical strategy – Individual Concept Detection
 Annotate multiple concepts separately
Low-Level Features → {Outdoor, Face, Person, People-Marching, Road, Walking-Running}, each detector outputting -1 / +1 independently
To Exploit Label Correlations
√ Person √ Street
√ Building
× Mountain
√ Walking/Running
√? Marching

Automated Annotation – 2nd Paradigm
 Another typical strategy – Fusion-Based
 Context-Based Concept Fusion (CBCF)
Per-concept scores (e.g., Person, People-Marching, Walking-Running) → Concept Model Vector → Concept Fusion → refined -1 / +1 decisions

Automated Annotation – 3rd Paradigm
 Integrated Concept Detection  Correlative Multi-Label Learning (CML)
Low-Level Features → {Outdoor, Face, Person, People-Marching, Road, Walking-Running} detected jointly, outputting -1 / +1 for all concepts in one model
CML Roadmap
Multi-Label Annotation
 1st Paradigm – Individual Detectors: no correlations
 2nd Paradigm – Fusion-Based: has correlations, but uses a second step
 3rd Paradigm – Integrated: models concepts and correlations in one step
G.-J. Qi, et al., Correlative Multi-Label Video Annotation, ACM Multimedia 2007.

How to Model Concept Correlations
 How to model concepts and the correlations among concepts in a single step?
 Convert correlations into features: construct a new feature vector that captures both
 the characteristics of the individual concepts, and
 the correlations among concepts
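One way to make this concrete is a joint feature map over the input and the full label vector. The sketch below is a simplified illustration of the idea, not the exact feature map of the CML paper; the function name and indicator layout are my own:

```python
import itertools
import numpy as np

def joint_features(x, y):
    """Simplified CML-style joint feature map (illustrative).

    One block per concept (x if that concept is present, zeros otherwise),
    plus, for every concept pair, 4 indicators encoding which of the four
    (+1/-1, +1/-1) label patterns the pair takes.  A linear model over this
    vector can weight both individual concepts and their correlations.
    """
    x = np.asarray(x, dtype=float)
    concept_part = np.concatenate(
        [(x if y_k > 0 else np.zeros_like(x)) for y_k in y])
    pair_part = []
    for j, k in itertools.combinations(range(len(y)), 2):
        for sj, sk in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
            pair_part.append(1.0 if (y[j] == sj and y[k] == sk) else 0.0)
    return np.concatenate([concept_part, np.array(pair_part)])

# 2-dim low-level feature, 3 concepts labeled (+1, -1, +1):
phi = joint_features([0.2, 0.8], [1, -1, 1])
```

Dimension: K·d for the concept blocks plus 4·K(K−1)/2 for the pairwise indicators, so learning happens over concepts and correlations in one step rather than two.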

Correlative Multi-Label Video Annotation
 Experiments
 TRECVID 2005 dataset (170 hours)
 39 concepts (LSCOM-Lite)
 Training (65%), Validation (16%), Testing (19%)
 CML (MAP=0.290) improves IndSVM (MAP=0.246) by 17% and CBCF (MAP=0.253) by 14%

Internet Media
www, www2009, madrid, spain
www2009, w3c, futuro, future, workshop, congreso w3c, , Don, Quixote palacio, municipal, Madrid, consortium, consorcio
cervantes, Sancho, … 20, aniversario, España, Spain, Vinton, …
Social tags
 Good, but
 Ambiguous
 Incomplete
 No relevance information
Two directions to improve tag quality:
• During tagging – Tag Recommendation
• After tagging – Tag Refinement/Ranking

Social tags for online images are better than automatic annotation in terms of both scalability and accuracy.
However, the most relevant tag is often NOT at the top position in the tag list of a social image.
This phenomenon is widespread on social media websites such as Flickr:
fewer than 10% of images have their most relevant tag at the top position in their tag list.

This significantly limits the performance of tag-based image search and other applications.
For example, consider searching for “bird” on Flickr.
What we are going to do:
Rank the tags according to their relevance to the image.
This improves:
 Tag-based search
 Image annotation (automatic tagging)
 Group recommendation

Towards Storytelling
 Image and video captioning
Figure from , ACM MM 2016.
Bag-of-Words (BoW) Model
A document ↔ a collection of the words in the document
An image ↔ a collection of the objects in the image
What features could characterize an object well?
R. Jin, Content-based Image Retrieval.

Bag-of-Visual-Words (BoVW)
Slides credit: M. Bressan and L. Fei-Fei
Interest Point Detection
 Local features have been shown to be effective for representing images
 They are image patterns that differ from their immediate neighborhood
 They can be points, edges, or small patches
 We call such local features key points or interest points of an image

Interest Point Detection
 An image example with key points detected by a corner detector
Interest Point Detection
 The detection of interest points needs to be robust to various geometric transformations
Original | Scaling + Rotation + Translation | Projection

Interest Point Detection
 The detection of interest points also needs to be robust to imaging conditions, e.g., lighting and blurring
Descriptor
 Represents each detected key point
 Takes measurements from a region centered on an interest point, e.g., texture, shape, …
 Each descriptor is a vector of fixed length
 E.g., a SIFT descriptor is a vector of 128 dimensions

Descriptor
 The descriptor should also be robust under different image transformations:
corresponding points should have similar descriptors
Keypoint + Local Descriptors
 SIFT (Scale Invariant Feature Transform) [Lowe’04]
 Detector: Difference of Gaussians, with elimination of edge responses
 Descriptor: calculated in a 16×16 window; 128 dimensions = (4 × 4 cells) × 8 orientation bins

http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
Difference of Gaussians
 Subtract a more blurred version of a signal from a less blurred one
http://en.wikipedia.org/wiki/Difference_of_Gaussians
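The idea can be sketched in one dimension. A minimal illustration (the sigma values and test signal are chosen for demonstration only):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    xs = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-xs**2 / (2 * sigma**2))
    return k / k.sum()

def difference_of_gaussians(signal, sigma1=1.0, sigma2=2.0):
    """Subtract a more-blurred copy of the signal from a less-blurred one;
    the response peaks where the signal changes at roughly that scale."""
    r = int(3 * max(sigma1, sigma2))
    g1 = np.convolve(signal, gaussian_kernel(sigma1, r), mode="same")
    g2 = np.convolve(signal, gaussian_kernel(sigma2, r), mode="same")
    return g1 - g2

# A step edge at index 30: the DoG extremum lands at the edge,
# while the response stays near zero in the flat regions.
step = np.concatenate([np.zeros(30), np.ones(30)])
dog = difference_of_gaussians(step)
```

SIFT applies the same subtraction between adjacent levels of a 2-D Gaussian scale space and then searches for extrema across position and scale.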

Image Representation
Bag-of-features representation: an example
Original image → detected key points → descriptors of the key points
Each descriptor is 5-dimensional, e.g.:
(22, 0, 19, 23, 1), (66, 103, 45, 6, 38), (232, 44, 0, 11, 48), (29, 55, 129, 0, 1), …
How to measure similarity between two images?
Count the number of matches!
If the distance between two descriptor vectors is smaller than a threshold, we get one match
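The match-counting scheme can be sketched as follows. This is a minimal greedy one-to-one matcher with an illustrative threshold, assuming small 5-dimensional toy descriptors:

```python
import numpy as np

def count_matches(desc_a, desc_b, threshold=10.0):
    """Count matching descriptor pairs between two images.

    A pair matches when the Euclidean distance between the two vectors is
    below the threshold; each descriptor in desc_b is used at most once.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    used, matches = set(), 0
    for a in desc_a:
        d = np.linalg.norm(desc_b - a, axis=1)   # distances to all of B
        j = int(np.argmin(d))
        if d[j] < threshold and j not in used:
            used.add(j)
            matches += 1
    return matches

# Toy 5-dim descriptors: only the first pair is close enough to match.
A = [[22, 0, 19, 23, 1], [66, 103, 45, 6, 38]]
B = [[21, 1, 20, 22, 0], [200, 5, 5, 5, 5]]
```

Real systems replace the exact nearest-neighbor loop with approximate search, since the pairwise comparison above is exactly the expensive step criticized on the next slide.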

Matched points: 1
Matched points: 5
 Computationally expensive
 Requires a linear scan of the entire database
 Example: match a query image against a database of 1 million images
 0.1 seconds to compute the match between two images
 It takes more than one day to answer a single query
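The "more than one day" figure follows from simple arithmetic:

```python
# Worked arithmetic behind the "more than one day" claim.
n_images = 1_000_000          # database size
seconds_per_match = 0.1       # time to match the query against one image
total_seconds = n_images * seconds_per_match   # 100,000 s
total_hours = total_seconds / 3600             # ~27.8 hours
total_days = total_hours / 24                  # ~1.16 days
```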

BoVW Model
Group key points into visual words (b1, b2, b3, b4, …)
Represent images by histograms of visual words
BoVW for Product Tagging
http://www.sccs.swarthmore.edu/users/09/btomasi1/tagging-products.html

BoVW Model
 Generate “visual vocabulary”
 Represent each key point by its nearest “visual
 Represent an image by “a bag of visual words”
 Text retrieval technique can be applied directly.
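Once key points are quantized, an image becomes a histogram and standard text-retrieval similarity applies. A minimal sketch, assuming the word ids below already come from a nearest-cluster assignment:

```python
import numpy as np

def bovw_histogram(word_ids, vocab_size):
    """Represent an image as a histogram over the visual vocabulary."""
    return np.bincount(word_ids, minlength=vocab_size).astype(float)

def cosine_similarity(h1, h2):
    """Standard text-retrieval similarity, applied to visual-word histograms."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

# Two toy images whose key points were quantized to word ids 0..2.
h1 = bovw_histogram([2, 2, 1], vocab_size=3)   # counts per visual word
h2 = bovw_histogram([2, 1], vocab_size=3)
sim = cosine_similarity(h1, h2)
```

In practice the histograms are usually tf-idf weighted before comparison, exactly as in text retrieval.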
Step 1: Dataset
 Example: three images imgA, imgB, imgC
 imgA: 2 key points; imgB: 3 key points; imgC: 2 key points
 imglist.txt lists each image with its number of key points:
imgB.jpg 3
imgC.jpg 2
imgA.jpg 2
 esp.feature stores the key points in the same order:
imgB-key point 1, imgB-key point 2, imgB-key point 3, imgC-key point 1, imgC-key point 2, imgA-key point 1, imgA-key point 2

Step 2: Key Point Quantization
 Represent each image by a bag of visual words:
 Construct the visual vocabulary
 Cluster all the key points into 10,000 clusters
 Each cluster center is a visual word
 Map each key point to a visual word
 Find the nearest cluster center for each key point (nearest neighbor search)
K-Means Clustering
 A set of n unlabeled examples D = {x1, x2, …, xn} in d-dimensional feature space
 Number of clusters: K
 Objective: find a partition of D into K non-empty disjoint subsets,
 D = D1 ∪ D2 ∪ … ∪ DK, with Di ∩ Dj = ∅ for i ≠ j
 so that the points in each subset are coherent according to a certain criterion,
 e.g., minimize the squared distance of the vectors to their cluster means:
 min Σ_{j=1..K} Σ_{x∈Dj} ‖x − mj‖², where mj = (1/|Dj|) Σ_{x∈Dj} x

Step 2: Key Point Quantization
 Cluster the 7 key points into 3 clusters
 The cluster centers are cnt1, cnt2, cnt3
 Each center is a visual word: w1, w2, w3
 Find the nearest center for each key point

Step 2: Key Point Quantization
 imgA.jpg
 1st key point  w2
 2nd key point  w1
 imgB.jpg
 1st key point  w3
 2nd key point  w3
 3rd key point  w2
 imgC.jpg
 1st key point  w3
 2nd key point  w2
Bag-of-words Rep.
imgA.jpg: w2 w1 imgB.jpg: w3 w3 w2 imgC.jpg: w3 w2
Step 2: Key Point Quantization
 In this step, you need to save:
 the cluster centers to a file — you will use them later to quantize the key points of query images
 the bag-of-words representation of each image in “trec” format
Bag-of-words Rep.
imgA.jpg: w2 w1
imgB.jpg: w3 w3 w2
imgC.jpg: w3 w2

Step 3: Build Index
 Refer to the indexing process of textual information retrieval
 IR libraries:
 Lucene: http://lucene.apache.org/
 Lemur: http://www.lemurproject.org/
 LIRE: http://www.semanticmetadata.net/lire/ and http://www.lire-project.net/
Step 4: Extract key points for a query

Step 5: Generate BoVW for a query
The mapped cluster ID for the 1st key point
The mapped cluster ID for the 2nd key point

Step 6: Retrieval
 Refer to the retrieval process of textual information retrieval
Video Google
http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, ICCV 2003.

Occluded !!!

Problem with bag-of-features
 The intrinsic matching scheme performed by BoF is weak:
 for a “small” visual dictionary: too many false matches
 for a “large” visual dictionary: many true matches are missed
 There is no good trade-off between “small” and “large”:
 either the Voronoi cells are too big,
 or the cells cannot absorb the descriptor noise
 The intrinsic approximate nearest neighbor search of BoF is not sufficient
www.ens-lyon.fr/LIP/Arenaire/ERVision/search_large.ppt

20K visual words: false matches
200K visual words: good matches missed

Need to Know
 Large scale retrieval
 Semantic gap
 Image/video annotation/tagging/captioning
 Bag of visual words model
