MULTIMEDIA RETRIEVAL
Semester 1, 2022
Information Summarization
Text summarization
Video Summarization
Applications
LifeLogging
Scene summarization
StoryImaging
School of Computer Science
Information Deluge
Approximately 3.5 trillion photos have been taken since Daguerre captured Boulevard du Temple 174 years ago
http://blog.1000memories.com/94‐number‐of‐photos‐ever‐taken‐digital‐and‐analog‐in‐shoebox
Information Deluge
6 billion (Aug 2011)
• 192 years to view all of them (1s per image)
• 3,000+ uploads/minute
• 2% of Internet users visit (2009)
• Daily time on site: 4.7 minutes (2009)
690 million (Mar 2012)
• 3,450 years to see all of them
• 48 hours uploaded/minute (2012)
• 20% of Internet users visit (2009)
• Daily time on site: 23 minutes (2009)
• 2007 bandwidth = entire Internet in 2000
• 3B+ views per day (2012)
100 billion (middle of 2011)
• 3,200 years to view all of them (1s per image)
• ~200M uploads/day; ~6B/month (2012)
• 800+M users (Dec 2011)
• Daily time on site: 30 minutes (2009)
Quantum TV DVR that records up to 12 channels at once http://www.engadget.com/2014/04/01/verizon-fios-media-server-quantum-tv/
Summarization
Distill the essence
Provide a compact yet informative representation of a video
Crucial for effective and efficient access to video content
http://summly.com/index.html
Founded by 17-year-old Nick D'Aloisio; acquired by Yahoo on 26/03/2013
For a reported US$30 million!!! http://www.smh.com.au/digital-life/digital-life-news/teens-multimilliondollar-yahoo-payday-before-18th-birthday-20130326-2gqvg.html
Text Summarization
Human summarization and abstracting
What professional abstractors do
“To take an original article, understand it and pack it neatly into a nutshell without loss of substance or clarity presents a challenge which many have felt worth taking up for the joys of achievement alone. These are the characteristics of an art form” – Ashworth.
Text Summarization
Indicative, informative, and critical summaries
Extracts (representative paragraphs/sentences/phrases)
Abstracts: “a concise summary of the central subject matter of a document”
Dimensions
Single-document vs. multi-document
Query-specific vs. query-independent
Dragomir R. Radev, Text summarization, SIGIR 2004 Tutorial.
TextRank
Identify important words or sentences
Formulate the problem with a graph-based ranking model
Keyword extraction & sentence extraction
R. Mihalcea and P. Tarau, TextRank: Bringing Order into Texts, EMNLP 2004.
PageRank Revisit
S(Vi) = (1 − d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
S(Vi): score of vertex Vi
Vi: a vertex
In(Vi): the set of vertices that point to Vi (predecessors)
Out(Vi): the set of vertices that Vi points to (successors)
d: damping factor, the probability of jumping from a given vertex to another random vertex
Random surfer model: d = 0.85 (PageRank)
Graph Construction
Vertex
Smallest text units (e.g., keyword, sentence)
Different types of keywords (e.g., noun, verb)
Edge
Keyword: co-occurrence within a sliding window
Sentence: similarity between sentences
Knowledge-based: WordNet
Data-driven: Google Distance
Empirical: overlap (over common tokens/words)
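A minimal sketch of the keyword variant described above, using networkx's PageRank as the ranking step; the token list, window size and top-k value are illustrative assumptions rather than the TextRank authors' settings.

```python
# Keyword extraction sketch: co-occurrence graph + PageRank (illustrative only).
import networkx as nx

def keyword_scores(tokens, window=2, d=0.85):
    """Build an undirected co-occurrence graph over tokens and rank them with PageRank."""
    graph = nx.Graph()
    graph.add_nodes_from(set(tokens))
    for i in range(len(tokens)):
        # Edge: two tokens co-occur within a sliding window of `window` tokens.
        for other in tokens[i + 1:i + window]:
            if other != tokens[i]:
                graph.add_edge(tokens[i], other)
    return nx.pagerank(graph, alpha=d)   # alpha plays the role of the damping factor d

tokens = ["compatibility", "systems", "linear", "constraints", "natural",
          "numbers", "criteria", "compatibility", "systems", "equations"]
scores = keyword_scores(tokens)
print(sorted(scores, key=scores.get, reverse=True)[:5])   # top-5 keyword candidates
```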
Sample on Keyword Extraction
Sample on Keyword Extraction
Quantitative Result
Sample on Sentence Extraction
Sample on Sentence Extraction
TextRank goes beyond simple sentence "connectivity" in a text
Sentence 15 would not be identified as "important" based on its number of connections alone
But it is identified as "important" by TextRank
Humans also judge the sentence as "important"
Quantitative Result
LexRank
Sentence level
Cosine similarity between sentences
http://141.211.245.18/demos/lexrank/lexrankmead.html
http://clair.si.umich.edu/demos/lexrank/
G. Erkan, D. Radev, LexRank: Graph-based lexical centrality as salience in text summarization, Journal of Artificial Intelligence Research, 2004.
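An illustrative sketch of the sentence-similarity graph LexRank operates on, with scikit-learn TF-IDF vectors and cosine similarity; the three sentences are made-up examples.

```python
# Pairwise cosine similarity between sentence vectors (edge weights of the LexRank graph).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply on Monday.",
]
tfidf = TfidfVectorizer().fit_transform(sentences)   # one TF-IDF vector per sentence
sim = cosine_similarity(tfidf)                       # sim[i, j] = similarity of sentences i and j
print(sim.round(2))
```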
Video Summarization Problem
Representativeness Maximized
Redundancy Minimized
Presentation
Keyframe/Storyboard
Skim
The Problem
Related Work
Clustering based
K-means, graph cuts, … …
Learning based
Important vs unimportant
Reconstruction based
Curve fitting
Data fitting
Different features
Semantics such as who, what, where, when
Video Abstraction
R. Lienhart, S. Pfeiffer, and W. Effelsberg, Video abstracting, Communications of the ACM 40(12): 54–62, 1997.
Static Video Summarization
(Among many others)
[Chen et al., 09] story‐structure
[Avila et al., 11] Vsumm
[Makedonas et al., 09] graph connectivity
[Cong et al., 12] sparse dictionary
[Furini et al., 10] STIMO
[Guan et al., 12] keypoint‐based
Limitation
Utilizing global visual features
Color and texture computed over the entire frame
Subtle yet important details could be swallowed by global features
Local Descriptors
Local keypoint features
Distinctive representation capacity (e.g. invariant to location, scale and rotation, and robust to affine transformation).
Played a significant role in many application domains of visual content analysis
Object recognition
Landmark recognition
Image classification
…
Local Descriptors
Scale Invariant Feature Transform (SIFT)
Speeded Up Robust Features (SURF)
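A minimal OpenCV sketch of extracting one such local descriptor (SIFT); a random image stands in for a real video frame, and opencv-python 4.4+ is assumed.

```python
# Detect SIFT keypoints and compute their 128-D descriptors on a (synthetic) greyscale frame.
import cv2
import numpy as np

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)   # stand-in for a video frame
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(frame, None)
print(len(keypoints), "keypoints detected")
```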
Problem Formulation
What makes up a video
Video frame vs video shot vs video story
A video shot depicts a scene
An object can be characterized by a number of keypoints
What contributes to redundancy
Redundancy exists among adjacent frames
Removing overlapped objects could reduce redundancy
Keyframe selection is to identify a number of frames which:
Best cover the keypoints
Share minimal redundancy
Keyframe Selection
The global pool is separated into two sets, Kcovered and Kuncovered. At the beginning, Kuncovered contains all keypoints in K and Kcovered is empty
For frame fi, denote its keypoint set as FPi,
Coverage
C(fi) = |FPi ∩ Kuncovered|: the cardinality of the intersection between FPi and Kuncovered
Redundancy
R(fi) = |FPi ∩ Kcovered|: how many keypoints of fi are already in Kcovered
Keyframe Selection
The influence of frame fi is calculated as a balance of C(fi) and R(fi) controlled by alpha (set to 1 empirically in the experiments)
At the end of each iteration, the frame with the highest influence value and positive coverage will be selected as a keyframe, and Kcovered and Kuncovered will be updated
The iteration repeats until the whole keypoint pool K is covered, or a predefined coverage percentage STOP of the pool K is reached.
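A sketch of the greedy selection loop, under one explicit assumption: influence is computed as C(fi) − α·R(fi), since the slides only say that influence balances coverage and redundancy via α. The frame keypoint sets and the STOP threshold are illustrative.

```python
# Greedy keyframe selection by keypoint coverage vs. redundancy (illustrative sketch).

def select_keyframes(frame_keypoints, alpha=1.0, stop=0.95):
    """frame_keypoints: dict mapping frame id -> set of keypoint (chain) ids."""
    pool = set().union(*frame_keypoints.values())      # global keypoint pool K
    covered, uncovered = set(), set(pool)
    keyframes = []
    while len(covered) < stop * len(pool):
        best, best_influence = None, None
        for fid, kps in frame_keypoints.items():
            coverage = len(kps & uncovered)            # C(fi)
            redundancy = len(kps & covered)            # R(fi)
            influence = coverage - alpha * redundancy  # assumed form of the influence score
            if coverage > 0 and (best is None or influence > best_influence):
                best, best_influence = fid, influence
        if best is None:                               # no frame adds new coverage
            break
        keyframes.append(best)
        covered |= frame_keypoints[best]
        uncovered -= frame_keypoints[best]
    return keyframes

frames = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {5, 6}, 3: {1, 6, 7}}
print(select_keyframes(frames))
```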
Toy Example
Keypoint Matching
Keyframe Selection
Keypoint Pool Construction
Inter-window Keypoint Chaining
Constrain the pairing within a temporal window of size W without losing the discriminative power of keypoint matching
Intra-window keypoint chaining makes the matching more reliable
Keyframe Selection
Keypoint Pool Construction
Each keypoint either belongs to a chain of matched keypoints or becomes a singleton without any connection
Each chain is represented by its HEAD keypoint
Chains with the number of keypoints greater than T (set to 10) are kept
Samples Results
Sample Result 1
Sample Result 2
Sample Result 3
Impact of α
Keypoint Matching
Computationally expensive
Thousands of keypoints per frame
Matching candidate keypoints within a certain radius R (set to 100)
The RANdom SAmple Consensus (RANSAC) algorithm is invoked iteratively to enforce geometric consistency among keypoint matches
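A generic OpenCV sketch of keypoint matching with a RANSAC geometric-consistency check (here via a homography); it is not the lecture's exact pipeline, and the two synthetic frames merely simulate adjacent video frames.

```python
# SIFT matching between two frames, filtered by Lowe's ratio test and RANSAC.
import cv2
import numpy as np

img1 = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
img2 = cv2.warpAffine(img1, np.float32([[1, 0, 15], [0, 1, 10]]), (640, 480))  # shifted copy

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

good = []
for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):      # 2 nearest neighbours per keypoint
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])                                # Lowe's ratio test

# RANSAC keeps only matches consistent with a single homography (geometric consistency).
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(int(mask.sum()), "geometrically consistent matches out of", len(good))
```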
Video Summarization Framework
Utilizing both global and local visual features
Scene Identification
A video consists of multiple scenes and the frames of each scene are visually similar, though the frames of the same scene may scatter in the video
Represent each video frame with the CEDD (Color and Edge Directivity Descriptor) feature, a histogram characterizing both color and texture
Perform frame clustering with K-Means
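A sketch of the clustering step with scikit-learn's K-Means; CEDD extraction is abstracted away, so random 144-D histograms stand in for real frame features, and k = 6 is an assumed scene count.

```python
# Scene identification by clustering per-frame global features.
import numpy as np
from sklearn.cluster import KMeans

frame_features = np.random.rand(500, 144)   # hypothetical: 500 frames x 144-D CEDD histograms
k = 6                                       # assumed number of scenes
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(frame_features)

# Frames sharing a label form one scene, even if they are scattered across the video.
scenes = {c: np.where(labels == c)[0] for c in range(k)}
print({c: len(idx) for c, idx in scenes.items()})
```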
Keyframe Selection
Within each cluster (i.e. scene)
1. Represent each frame with local keypoints
2. Generate a keypoint pool
3. Select the frames that cover the pool best (maximum coverage and minimum redundancy)
4. Combine the keyframes from all scenes into the summary
Keypoint Filtering with Saliency
Fast Solution – Keypoint Forest
Randomized kd-tree
1. Gather all keypoints from all frames
2. Split the data along different features that have the greatest variance to generate a few trees
3. Matching a keypoint against the trees to find the best match
More details
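A sketch of approximate matching with randomized kd-trees through OpenCV's FLANN wrapper; the descriptors are random stand-ins, and the parameter values (5 trees, 50 checks, 0.75 ratio) are illustrative.

```python
# Approximate nearest-neighbour keypoint matching with a small forest of randomized kd-trees.
import cv2
import numpy as np

des_query = np.random.rand(1000, 128).astype(np.float32)    # stand-in SIFT descriptors
des_train = np.random.rand(5000, 128).astype(np.float32)

FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)  # 5 randomized kd-trees
search_params = dict(checks=50)                             # leaves visited per query

flann = cv2.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(des_query, des_train, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "accepted matches")
```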
Fast Solution – Keypoint Forest
Randomized kd-tree performance
Appropriate number of trees (e.g. 5)
Keypoint matching accuracy can be above 90%
Keypoint matching can be 100 times faster
Previously 0.5 second/frame -> now 0.01 second/frame
No noticeable impact on the keyframe selection
Local Visual Word Model
Group neighbouring keypoints into local visual words to accommodate variation of the same keypoint appearing across different frames
Simple mutual neighbourhood relationship
Calculate Influence
For GlobalSim(), does j range over the whole sequence or only over the selected keyframes?
Experiments
Dataset 1
50 videos from the Open Video Project (OVP), http://www.open-video.org/
1 to 4 minutes each
Dataset 2
50 YouTube videos, 1 to 10 minutes each
Sample Result
NASA 25th Anniversary Show Segment 03
There are 8 frames of pilot shots in our result, covering 6 out of 7 pilots mentioned in the story.
This indicates that our approach focuses more on local details compared to other global-feature based approaches
Impact of Clustering
Impact of Saliency Map
Sample Result 1
Sample Result 2
Sample Result 3
Summarization with Different Lengths
Bag-of-Importance (BoI) Model
Part I: Motivations
Part II: Methodology
Part III: Evaluations
Motivations
Propose a paradigm for video summarization
Identify the invariant and repeatable patterns
Capture the essence of the visual patterns
Eliminate the redundancy
Capture the discriminative details
Characterize individual features for video summarization
Identify Repetition
Eliminate Redundancy
Feature Learning
Learn the Dictionary by Sparse Pursuit
Transform the local features into sparse space
Weight the learned features
Project the raw features to an anchor point in the transformed space
Anchor points assemble the repetitions
Identify the Bag-of-Importance
Derive the distribution of the weight coefficients
The most repeatable learned features have the highest P value
We further borrow the TF-IDF concept to reweight
The "common words" are suppressed (like stop words)
The discriminative words are weighted with a higher value
Video summarization by BoI
We calculate the representativeness score for each frame by aggregating the important codes inside the frame
We generate the representativeness curve; representative frames are detected by identifying the top-K local maxima
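A sketch of this last step, assuming the per-frame scores are already available (random values stand in here): smooth the representativeness curve and keep the top-K local maxima.

```python
# Pick representative frames as the top-K local maxima of a (smoothed) score curve.
import numpy as np
from scipy.signal import find_peaks

scores = np.random.rand(300)                                   # per-frame representativeness scores
smoothed = np.convolve(scores, np.ones(9) / 9, mode="same")    # light moving-average smoothing

peaks, _ = find_peaks(smoothed)                                # all local maxima
k = 5
top_k = peaks[np.argsort(smoothed[peaks])[::-1][:k]]           # keep the K highest maxima
print(sorted(top_k.tolist()))                                  # indices of the selected frames
```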
Evaluations
Annotated Videos from Open Video Project
www.openvideo.org
YouTube videos
F-score: F_β = (1 + β²) × precision × recall / (β² × precision + recall)
β controls the balance between precision and recall.
The F-score can be interpreted as a weighted average of precision and recall, where a score reaches its best value at 1 and worst at 0.
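The same F-score written as a small helper, evaluated on illustrative precision/recall values.

```python
# F_beta score as defined above.
def f_beta(precision, recall, beta=1.0):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(round(f_beta(0.8, 0.6, beta=1.0), 3))   # 0.686
```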
Evaluations at a short length level
Metrics: Iso-Content Distortion, Iso-Content Distance
Methods compared: DSVS (λ=0.15), DSVS (λ=0.5), BoIVS (λ=0.15), BoIVS (λ=0.5)
DSVS: [Cong et al., 12] sparse dictionary
BoIVS: our proposed method
Evaluations at a long length level
OVP: the summaries provided by the Open Video Project (the service provider)
DT: [Mundur, 2006]; STIMO: [Furini et al., 2010]; VSUMM: [Avila et al., 2011]; DSVS: [Cong, 2012]; BoIVS: proposed by us
Impact of various factors
Introduce a new perspective into video summarization
Utilize local features for video summarization at a finer level
Introduce a new BoI framework for video summarization
Promising future for exploiting the value of local features
Deep Features
M. Ma, et al., Exploring the Influence of Feature Representation for Dictionary Selection based Video Summarization, ICIP 2017.
Deep Features
Deep Features
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
H. Yang, et al., Unsupervised Extraction of Video Highlights via Robust Recurrent Auto-encoders, ICCV 2015.
Auto-encoder
http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
Auto-encoder
https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c
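A minimal (non-recurrent) autoencoder sketch in PyTorch, just to make the reconstruction idea concrete; the cited highlight-extraction work uses a robust recurrent auto-encoder over video segments, which this does not reproduce, and the feature dimensions and batch are illustrative.

```python
# Tiny autoencoder: compress features to a bottleneck and reconstruct them;
# the reconstruction error is the cue that auto-encoder-based highlight methods build on.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim_in=1024, dim_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.decoder = nn.Linear(dim_hidden, dim_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(32, 1024)                      # hypothetical batch of segment features
loss = nn.functional.mse_loss(model(x), x)     # reconstruction error
loss.backward()                                # would drive training in a real setup
print(float(loss))
```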
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
Recurrent Auto-Encoder for Unsupervised Highlight Extraction
LSTM for VS
K. Zhang, et al., Video Summarization with Long Short-term Memory, ECCV 2016.
LSTM for VS
LSTM for VS
Hierarchical Structure-Adaptive RNN for Video Summarization
B. Zhao, et al., Hierarchical Structure-Adaptive RNN for Video Summarization, CVPR 2018.
HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization (CVPR 2018)
HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization (CVPR 2018)
Presentation
http://rp-www.cs.usyd.edu.au/~ggua5470/keyframe-demo/
T. Mei, et al., Video collage: presenting a video sequence using a single image, The Visual Computer 25(1): 39-51 (2009)
Presentation
Video Synopsis by BriefCam
PicPac Stop Motion
picpac.tv
Demo video screenshot
http://picpac.tv/
Applications
Summarizing LifeLog
Microsoft SenseCam
Hyowon Lee, Alan F. Smeaton, Noel E. O'Connor and Gareth J.F. Jones, Adaptive Visual Summary of LifeLog Photos for Personal Information, International Workshop on Adaptive Information Retrieval, 2006.
Applications
PhotoSynth
http://phototour.cs.washington.edu/
Applications
PhotoSynth
http://photosynth.net
How PhotoSynth can connect the world’s images
http://www.ted.com/talks/blaise_aguera_y_arcas_ demos_photosynth
Photo Tourism
http://phototour.cs.washington.edu/
Applications
StoryImaging
G. Guan, Z. Wang, X.-S. Hua, and D. Feng, StoryImaging: a media-rich presentation system for textual stories, ACM MM 2011.
Beyond Search: Event-Driven Summarization for Web Videos, TOMCCAP 2011 (Ngo et al.)
Undirected Graph
• NDK (near-duplicate keyframes) -> key-shots -> graph
• Rank the key-shots
– Informativeness scores
– Chronological order
• Key-shot tagging
– Tag filtering
– Tag propagation
• Random walk
• Summarization
– Trade-off between the sum of relevance and the time interval
More on Summarisation
Multi-document summarisation
Multi-video summarisation
Multi-modal summarisation
Query-based summarisation
eXtreme summarisation
Domain-specific summarisation
…
Need to Know
Text summarization
Video summarization problem
Categories of existing solutions
A new perspective into video summarization with local features
Applications