MULTIMEDIA RETRIEVAL
Semester 1, 2022
Information Summarization
Text summarization
Video Summarization
Applications
LifeLogging
Scene summarization
StoryImaging
School of Computer Science
Information Deluge
Approximately 3.5 trillion photos have been taken since Daguerre captured Boulevard du Temple 174 years ago
http://blog.1000memories.com/94‐number‐of‐photos‐ever‐taken‐digital‐and‐analog‐in‐shoebox
Information Deluge
6 billion (Aug 2011)
• 192 years to view all of them (1s per image)
• 3,000+ uploads/minute
• 2% of Internet users visit (2009)
• Daily time on site: 4.7 minutes (2009)
690 million (Mar 2012)
• 3,450 years to see all of them
• 48 hours uploaded/minute (2012)
• 20% of Internet users visit (2009)
• Daily time on site: 23 minutes (2009)
• 2007 bandwidth = entire Internet in 2000
• 3B+ views per day (2012)
100 billion (middle of 2011)
• 3,200 years to view all of them (1s per image)
• ~200M uploads/day; ~6B/month (2012)
• 800+M users (Dec 2011)
• Daily time on site: 30 minutes (2009)
Quantum TV DVR that records up to 12 channels at once http://www.engadget.com/2014/04/01/verizon-fios-media-server-quantum-tv/
Summarization
Distill the essence
Provide a compact yet informative representation of a video
Crucial for effective and efficient access to video content
http://summly.com/index.html
Founded by 17-year-old Nick D'Aloisio; acquired by Yahoo on 26/03/2013
For a reported US$30 million!!! http://www.smh.com.au/digital-life/digital-life-news/teens-multimilliondollar-yahoo-payday-before-18th-birthday-20130326-2gqvg.html
Text Summarization
Human summarization and abstracting
What professional abstractors do
“To take an original article, understand it and pack it neatly into a nutshell without loss of substance or clarity presents a challenge which many have felt worth taking up for the joys of achievement alone. These are the characteristics of an art form” – Ashworth.
Text Summarization
Indicative, informative, and critical summaries
Extracts (representative paragraphs/sentences/phrases)
Abstracts: “a concise summary of the central subject matter of a document”
Dimensions
Single-document vs. multi-document
Query-specific vs. query-independent
Dragomir R. Radev, Text summarization, SIGIR 2004 Tutorial.
TextRank
Identify important words or sentences
Formulate the problem with a graph-based ranking model
Keyword extraction & sentence extraction
R. Mihalcea and P. Tarau, TextRank: Bringing Order into Texts, EMNLP 2004.
PageRank Revisit
S(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
S(Vi): score of vertex Vi
Vi: a vertex
In(Vi): the set of vertices that point to Vi (predecessors)
Out(Vi): the set of vertices that Vi points to (successors)
d: damping factor, the probability of jumping from a given vertex to another random vertex
Random surfer model: d = 0.85 (PageRank)
Graph Construction
Vertex
Smallest text units (e.g., keyword, sentence)
Different types of keywords (e.g., noun, verb)
Edge
Keyword: co-occurrence within a sliding window
Sentence: similarity between sentences
Knowledge-based: WordNet
Data-driven: Google Distance
Empirical: overlap (over common tokens/words)
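A minimal sketch of the keyword variant described above, using networkx's PageRank as the ranking step; the token list, window size and top-k value are illustrative assumptions rather than the TextRank authors' settings.

```python
# Keyword extraction sketch: co-occurrence graph + PageRank (illustrative only).
import networkx as nx

def keyword_scores(tokens, window=2, d=0.85):
    """Build an undirected co-occurrence graph over tokens and rank them with PageRank."""
    graph = nx.Graph()
    graph.add_nodes_from(set(tokens))
    for i in range(len(tokens)):
        # Edge: two tokens co-occur within a sliding window of `window` tokens.
        for other in tokens[i + 1:i + window]:
            if other != tokens[i]:
                graph.add_edge(tokens[i], other)
    return nx.pagerank(graph, alpha=d)   # alpha plays the role of the damping factor d

tokens = ["compatibility", "systems", "linear", "constraints", "natural",
          "numbers", "criteria", "compatibility", "systems", "equations"]
scores = keyword_scores(tokens)
print(sorted(scores, key=scores.get, reverse=True)[:5])   # top-5 keyword candidates
```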
Sample on Keyword Extraction
Sample on Keyword Extraction
Quantitative Result
Sample on Sentence Extraction
Sample on Sentence Extraction
TextRank goes beyond simple sentence "connectivity" in a text
Sentence 15 would not be identified as "important" based on its number of connections alone
But it is identified as "important" by TextRank
Humans also judge the sentence as "important"
Quantitative Result
LexRank
Sentence level
Cosine similarity between sentences
http://141.211.245.18/demos/lexrank/lexrankmead.html
http://clair.si.umich.edu/demos/lexrank/
G. Erkan, D. Radev, LexRank: Graph-based lexical centrality as salience in text summarization, Journal of Artificial Intelligence Research, 2004.
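An illustrative sketch of the sentence-similarity graph LexRank operates on, with scikit-learn TF-IDF vectors and cosine similarity; the three sentences are made-up examples.

```python
# Pairwise cosine similarity between sentence vectors (edge weights of the LexRank graph).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply on Monday.",
]
tfidf = TfidfVectorizer().fit_transform(sentences)   # one TF-IDF vector per sentence
sim = cosine_similarity(tfidf)                       # sim[i, j] = similarity of sentences i and j
print(sim.round(2))
```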
Video Summarization Problem
Representativeness Maximized
Redundancy Minimized
Presentation
Keyframe/Storyboard
Skim
The Problem
Related Work
Clustering based
K-means, graph cuts, … …
Learning based
Important vs unimportant
Reconstruction based
Curve fitting
Data fitting
Different features
Semantics such as who, what, where, when
Video Abstraction
R. Lienhart, S. Pfeiffer, and W. Effelsberg, Video abstracting, Communications of the ACM 40(12): 54–62, 1997.
Static Video Summarization
(Among many others)
[Chen et al., 09] story‐structure
[Avila et al., 11] Vsumm
[Makedonas et al., 09] graph connectivity
[Cong et al., 12] sparse dictionary
[Furini et al., 10] STIMO
[Guan et al., 12] keypoint‐based
Limitation
Utilizing global visual features
Color and texture computed over the entire frame
Subtle yet important details could be swallowed by global features
Local Descriptors
Local keypoint features
Distinctive representation capacity (e.g. invariant to location, scale and rotation, and robust to affine transformation).
Played a significant role in many application domains of visual content analysis
Object recognition
Landmark recognition
Image classification
…
Local Descriptors
Scale Invariant Feature Transform (SIFT)
Speeded Up Robust Features (SURF)
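A minimal OpenCV sketch of extracting one such local descriptor (SIFT); a random image stands in for a real video frame, and opencv-python 4.4+ is assumed.

```python
# Detect SIFT keypoints and compute their 128-D descriptors on a (synthetic) greyscale frame.
import cv2
import numpy as np

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)   # stand-in for a video frame
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(frame, None)
print(len(keypoints), "keypoints detected")
```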
Problem Formulation
What makes up a video
Video frame vs video shot vs video story
A video shot depicts a scene
An object can be characterized by a number of keypoints
What contributes to redundancy
Redundancy exists among adjacent frames
Removing overlapped objects could reduce redundancy
Keyframe selection is to identify a number of frames which:
Best cover the keypoints
Share minimal redundancy
Keyframe Selection
The global pool is separated into two sets, Kcovered and Kuncovered. At the beginning, Kuncovered contains all keypoints in K and Kcovered is empty
For frame fi, denote its keypoint set as FPi,
Coverage
C(fi) = |FPi ∩ Kuncovered|: the cardinality of the intersection between FPi and Kuncovered
Redundancy
R(fi) = |FPi ∩ Kcovered|: how many keypoints of fi are already in Kcovered
Keyframe Selection
The influence of frame fi is calculated as a balance of C(fi) and R(fi) controlled by alpha (set to 1 empirically in the experiments)
At the end of each iteration, the frame with the highest influence value and positive coverage will be selected as a keyframe, and Kcovered and Kuncovered will be updated
The iteration repeats until the whole keypoint pool K is covered, or a predefined coverage percentage STOP of the pool K is reached.
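A sketch of the greedy selection loop, under one explicit assumption: influence is computed as C(fi) − α·R(fi), since the slides only say that influence balances coverage and redundancy via α. The frame keypoint sets and the STOP threshold are illustrative.

```python
# Greedy keyframe selection by keypoint coverage vs. redundancy (illustrative sketch).

def select_keyframes(frame_keypoints, alpha=1.0, stop=0.95):
    """frame_keypoints: dict mapping frame id -> set of keypoint (chain) ids."""
    pool = set().union(*frame_keypoints.values())      # global keypoint pool K
    covered, uncovered = set(), set(pool)
    keyframes = []
    while len(covered) < stop * len(pool):
        best, best_influence = None, None
        for fid, kps in frame_keypoints.items():
            coverage = len(kps & uncovered)            # C(fi)
            redundancy = len(kps & covered)            # R(fi)
            influence = coverage - alpha * redundancy  # assumed form of the influence score
            if coverage > 0 and (best is None or influence > best_influence):
                best, best_influence = fid, influence
        if best is None:                               # no frame adds new coverage
            break
        keyframes.append(best)
        covered |= frame_keypoints[best]
        uncovered -= frame_keypoints[best]
    return keyframes

frames = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {5, 6}, 3: {1, 6, 7}}
print(select_keyframes(frames))
```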
Toy Example
Keypoint Matching
Keyframe Selection
Keypoint Pool Construction
Inter-window Keypoint Chaining
Constrain the pairing within a temporal window of size W without losing the discriminative power of keypoint matching
Intra-window keypoint chaining makes the matching more reliable
Keyframe Selection
Keypoint Pool Construction
Each keypoint either belongs to a chain of matched keypoints or becomes a singleton without any connection
Each chain is represented by its HEAD keypoint
Chains with the number of keypoints greater than T (set to 10) are kept
Samples Results
Sample Result 1
Sample Result 2
Sample Result 3
Impact of α
Keypoint Matching
Computationally expensive
Thousands of keypoints per frame
Matching candidate keypoints within a certain radius R (set to 100)
The RANdom SAmple Consensus (RANSAC) algorithm is invoked iteratively to enforce geometric consistency among keypoint matches
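A generic OpenCV sketch of keypoint matching with a RANSAC geometric-consistency check (here via a homography); it is not the lecture's exact pipeline, and the two synthetic frames merely simulate adjacent video frames.

```python
# SIFT matching between two frames, filtered by Lowe's ratio test and RANSAC.
import cv2
import numpy as np

img1 = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
img2 = cv2.warpAffine(img1, np.float32([[1, 0, 15], [0, 1, 10]]), (640, 480))  # shifted copy

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

good = []
for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):      # 2 nearest neighbours per keypoint
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])                                # Lowe's ratio test

# RANSAC keeps only matches consistent with a single homography (geometric consistency).
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(int(mask.sum()), "geometrically consistent matches out of", len(good))
```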
Video Summarization Framework
Utilizing both global and local visual features
Scene Identification
A video consists of multiple scenes and the frames of each scene are visually similar, though the frames of the same scene may scatter in the video
Represent each video frame with the CEDD (Color and Edge Directivity Descriptor) feature, a histogram characterizing both color and texture
Perform frame clustering with K-Means
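A sketch of the clustering step with scikit-learn's K-Means; CEDD extraction is abstracted away, so random 144-D histograms stand in for real frame features, and k = 6 is an assumed scene count.

```python
# Scene identification by clustering per-frame global features.
import numpy as np
from sklearn.cluster import KMeans

frame_features = np.random.rand(500, 144)   # hypothetical: 500 frames x 144-D CEDD histograms
k = 6                                       # assumed number of scenes
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(frame_features)

# Frames sharing a label form one scene, even if they are scattered across the video.
scenes = {c: np.where(labels == c)[0] for c in range(k)}
print({c: len(idx) for c, idx in scenes.items()})
```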
Keyframe Selection
Within each cluster (i.e. scene)
1. Represent each frame with local keypoints
2. Generate a keypoint pool
3. Select the frames that cover the pool best (maximum coverage and minimum redundancy)
4. Combine the keyframes from all scenes into the summary
Keypoint Filtering with Saliency
Fast Solution – Keypoint Forest
Randomized kd-tree
1. Gather all keypoints from all frames
2. Split the data along different features that have the greatest variance to generate a few trees
3. Matching a keypoint against the trees to find the best match
More details
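A sketch of approximate matching with randomized kd-trees through OpenCV's FLANN wrapper; the descriptors are random stand-ins, and the parameter values (5 trees, 50 checks, 0.75 ratio) are illustrative.

```python
# Approximate nearest-neighbour keypoint matching with a small forest of randomized kd-trees.
import cv2
import numpy as np

des_query = np.random.rand(1000, 128).astype(np.float32)    # stand-in SIFT descriptors
des_train = np.random.rand(5000, 128).astype(np.float32)

FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)  # 5 randomized kd-trees
search_params = dict(checks=50)                             # leaves visited per query

flann = cv2.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(des_query, des_train, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "accepted matches")
```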
Fast Solution – Keypoint Forest
Randomized kd-tree performance
Appropriate number of trees (e.g. 5)
Keypoint matching accuracy can be above 90%
Keypoint matching can be 100 times faster
Previously 0.5 second/frame -> now 0.01 second/frame
No noticeable impact on the keyframe selection
Local Visual Word Model
Group neighbouring keypoints into local visual words to accommodate variation of the same keypoint appearing across different frames
Simple mutual neighbourhood relationship
Calculate Influence
For GlobalSim(), does j range over the whole sequence or only over the selected keyframes?
Experiments
Dataset 1
50 videos from the Open Video Project (OVP), http://www.open-video.org/
1 to 4 minutes each
Dataset 2
50 YouTube videos, 1 to 10 minutes each
Sample Result
NASA 25th Anniversary Show Segment 03
There are 8 frames of pilot shots in our result, covering 6 out of 7 pilots mentioned in the story.
This indicates that our approach focuses more on local details compared to other global-feature based approaches
Impact of Clustering
Impact of Saliency Map
Sample Result 1
Sample Result 2
Sample Result 3
Summarization with Different Lengths
Bag-of-Importance (BoI) Model
Part I: Motivations
Part II: Methodology
Part III: Evaluations
Motivations
Propose a paradigm for video summarization
Identify the invariant and repeatable patterns
Capture the essence of the visual patterns
Eliminate the redundancy
Capture the discriminative details
Characterize individual features for video summarization
Identify Repetition
Eliminate Redundancy
Feature Learning
Learn the Dictionary by Sparse Pursuit
Transform the local features into sparse space
Weight the learned features
Project the raw features to an anchor point in the transformed space
Anchor points assemble the repetitions
Identify the Bag-of-Importance
Derive the distribution of the weight coefficients
The most repeatable learned features have the highest P value
We further borrow the TF-IDF concept to reweight
The "common words" are suppressed (like stop words)
The discriminative words are weighted with a higher value
Video summarization by BoI
We calculate the representativeness score for each frame by aggregating the important codes inside the frame
We generate the representativeness curve; representative frames are detected by identifying the top-K local maxima
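A sketch of this last step, assuming the per-frame scores are already available (random values stand in here): smooth the representativeness curve and keep the top-K local maxima.

```python
# Pick representative frames as the top-K local maxima of a (smoothed) score curve.
import numpy as np
from scipy.signal import find_peaks

scores = np.random.rand(300)                                   # per-frame representativeness scores
smoothed = np.convolve(scores, np.ones(9) / 9, mode="same")    # light moving-average smoothing

peaks, _ = find_peaks(smoothed)                                # all local maxima
k = 5
top_k = peaks[np.argsort(smoothed[peaks])[::-1][:k]]           # keep the K highest maxima
print(sorted(top_k.tolist()))                                  # indices of the selected frames
```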
Evaluations
Annotated Videos from Open Video Project
www.openvideo.org
YouTube videos
F-score: F_β = (1 + β²) × precision × recall / (β² × precision + recall)
β controls the balance between precision and recall.
The F-score can be interpreted as a weighted average of precision and recall, where a score reaches its best value at 1 and worst at 0.
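The same F-score written as a small helper, evaluated on illustrative precision/recall values.

```python
# F_beta score as defined above.
def f_beta(precision, recall, beta=1.0):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(round(f_beta(0.8, 0.6, beta=1.0), 3))   # 0.686
```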
Evaluations at a short length level
Metrics: Iso-Content Distortion, Iso-Content Distance
Methods compared: DSVS (λ=0.15), DSVS (λ=0.5), BoIVS (λ=0.15), BoIVS (λ=0.5)
DSVS: [Cong et al., 12] sparse dictionary
BoIVS: our proposed method
Evaluations at a long length level
OVP: the summaries provided by the Open Video Project (the service provider)
DT: [Mundur, 2006]; STIMO: [Furini et al., 2010]; VSUMM: [Avila et al., 2011]; DSVS: [Cong, 2012]; BoIVS: proposed by us
Impact of various factors
Introduce a new perspective into video summarization
Utilize local features for video summarization at a finer level
Introduce a new BoI framework for video summarization
Promising future for exploiting the value of local features
Deep Features
M. Ma, et al., Exploring the Influence of Feature Representation for Dictionary Selection based Video Summarization, ICIP 2017.
Deep Features
Deep Features
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
H. Yang, et al., Unsupervised Extraction of Video Highlights via Robust Recurrent Auto-encoders, ICCV 2015.
Auto-encoder
http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
Auto-encoder
https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c
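A minimal (non-recurrent) autoencoder sketch in PyTorch, just to make the reconstruction idea concrete; the cited highlight-extraction work uses a robust recurrent auto-encoder over video segments, which this does not reproduce, and the feature dimensions and batch are illustrative.

```python
# Tiny autoencoder: compress features to a bottleneck and reconstruct them;
# the reconstruction error is the cue that auto-encoder-based highlight methods build on.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim_in=1024, dim_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.decoder = nn.Linear(dim_hidden, dim_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(32, 1024)                      # hypothetical batch of segment features
loss = nn.functional.mse_loss(model(x), x)     # reconstruction error
loss.backward()                                # would drive training in a real setup
print(float(loss))
```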
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
LSTM for VS
K. Zhang, et al., Video Summarization with Long Short-term Memory, ECCV 2016.
LSTM for VS
LSTM for VS
Hierarchical Structure-Adaptive RNN for Video Summarization
B. Zhao, et al., Hierarchical Structure-Adaptive RNN for Video Summarization, CVPR 2018.
HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization (CVPR 2018)
HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization (CVPR 2018)
Presentation
http://rp-www.cs.usyd.edu.au/~ggua5470/keyframe-demo/
T. Mei, et al., Video collage: presenting a video sequence using a single image, The Visual Computer 25(1): 39-51 (2009)
Presentation
Video Synopsis by BriefCam
PicPac Stop Motion
picpac.tv
Demo video screenshot
http://picpac.tv/
Applications
Summarizing LifeLog
Microsoft SenseCam
Hyowon Lee, Alan F. Smeaton, Noel E. O'Connor and Gareth J.F. Jones, Adaptive Visual Summary of LifeLog Photos for Personal Information, International Workshop on Adaptive Information Retrieval, 2006.
Applications
PhotoSynth
http://phototour.cs.washington.edu/
Applications
PhotoSynth
http://photosynth.net
How PhotoSynth can connect the world’s images
http://www.ted.com/talks/blaise_aguera_y_arcas_ demos_photosynth
Photo Tourism
http://phototour.cs.washington.edu/
Applications
StoryImaging
G. Guan, Z. Wang, X.-S. Hua, and D. Feng, StoryImaging: a media-rich presentation system for textual stories, ACM MM 2011.
Beyond Search: Event-Driven Summarization for Web Videos, TOMCCAP 2011 (Ngo et al.)
Undirected Graph
• NDK (near-duplicate keyframes) -> key-shots -> graph
• Rank the key-shots
– Informativeness scores
– Chronological order
• Key-shot tagging
– Tag filtering
– Tag propagation
• Random walk
• Summarization
– Trade-off between the sum of relevance and the time interval
More on Summarisation
Multi-document summarisation
Multi-video summarisation
Multi-modal summarisation
Query-based summarisation
eXtreme summarisation
Domain-specific summarisation
…
Need to Know
Text summarization
Video summarization problem
Categories of existing solutions
A new perspective into video summarization with local features
Applications