Unsupervised Learning III: Anomaly Detection
Machine Learning
Anomaly detection
• What is anomaly detection?
• Methods:
– Density estimation
– Detection by reconstruction
– One-class SVM
What is an anomaly?
Anomaly Detection is
• An unsupervised learning problem (data unlabeled)
• About identifying new or unknown data or signals that the machine learning system was not aware of during training
Example 1
[Figure: data points, most labeled “Normal”, with a few outliers labeled “Novel”]
So what seems to be the problem?
It’s a 2-Class problem. “Normal” vs. “Novel”
The Problem is
That “All positive examples are alike but each negative example is negative in its own way”.
One-class recognition
• Suppose we want to build a classifier that recognizes anomalous activities in an airport
• How can we collect training data?
– We can easily assemble videos of normal airport activities like walking, checking in, etc., as positive examples.
• What about negative examples ?
– The negative examples are… all other activities!
• So the negative examples come from an unknown number of negative classes.
Importance of Anomaly Detection
Ozone Depletion History
• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Real World Anomalies
• Credit Card Fraud
– An abnormally high purchase made on a credit card
• Cyber Intrusions
– A web server involved in ftp traffic
Fraud Detection
• Fraud detection refers to detection of criminal activities occurring in commercial organizations
– Malicious users might be the actual customers of the organization or might be posing as a customer (also known as identity theft).
• Types of fraud
– Credit card fraud
– Insurance claim fraud
– Mobile/cell phone fraud
– Insider trading
• Challenges
– Fast and accurate real-time detection
– Misclassification cost is very high
Healthcare Informatics
• Detect anomalous patient records
– Indicate disease outbreaks, instrumentation errors, etc.
• Key Challenges
– Only normal labels available
– Misclassification cost is very high
– Data can be complex: spatio-temporal
[Article figure: disease outbreaks from 2006 to today preventable by vaccinations]
Industrial Damage Detection
• Industrial damage detection refers to detection of different faults and failures in complex industrial systems, structural damages, intrusions in electronic security systems, suspicious events in video surveillance, abnormal energy consumption, etc.
– Example: Aircraft Safety
• Anomalous Aircraft (Engine) / Fleet Usage
• Anomalies in engine combustion data
• Total aircraft health and usage management
• Key Challenges
– Data is extremely large, noisy and unlabelled
– Most applications exhibit temporal behavior
– Detecting anomalous events typically requires immediate intervention
Image Processing
• Detecting outliers in an image monitored over time
• Detecting anomalous regions within an image
• Used in
– mammography image analysis
– video surveillance
– satellite image analysis
• Key Challenges
– Detecting collective anomalies
– Data sets are very large
[Figure: satellite image with an anomalous region marked “Anomaly”]
Video Surveillance
https://arxiv.org/pdf/1606.08455.pdf
Density Estimation Method
Anomaly Detection
Anomaly detection example
Aircraft engine features:
x1 = heat generated
x2 = vibration intensity
Dataset: {x(1), x(2), …, x(m)}
New engine: x_test
[Plot: training engines in the (x1 = heat, x2 = vibration) plane]
Density estimation
Dataset: {x(1), x(2), …, x(m)}
Model p(x) from the data.
Is x_test anomalous? Flag an anomaly if p(x_test) < ε.
[Plot: estimated density over the (heat, vibration) plane]
Anomaly detection example
Fraud detection:
x(i) = features of user i’s activities
Model p(x) from data.
Identify unusual users by checking which have p(x) < ε.
Manufacturing
Monitoring computers in a data center:
x(i) = features of machine i
x1 = memory use, x2 = number of disk accesses/sec,
x3 = CPU load, x4 = CPU load/network traffic, …
Example density estimation method: Multivariate Gaussian (Normal) distribution
Parameters: μ ∈ ℝⁿ, Σ ∈ ℝⁿˣⁿ
p(x; μ, Σ) = (2π)^(−n/2) |Σ|^(−1/2) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
Parameter fitting: Given training set {x(1), …, x(m)},
μ = (1/m) ∑ᵢ x(i),  Σ = (1/m) ∑ᵢ (x(i) − μ)(x(i) − μ)ᵀ
Anomaly detection with the multivariate Gaussian
1. Fit model p(x) by setting μ and Σ as above.
2. Given a new example x, compute p(x; μ, Σ). Flag an anomaly if p(x) < ε.
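A minimal NumPy sketch of these two steps. The two-feature engine data, the feature values, and the choice of ε as the 2nd-percentile training density are illustrative assumptions, not from the slides:

```python
import numpy as np

def fit_gaussian(X):
    # Maximum-likelihood fit with 1/m normalization, as on the slide
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / X.shape[0]
    return mu, Sigma

def log_pdf(X, mu, Sigma):
    # log p(x; mu, Sigma) for each row of X (log-space avoids underflow)
    n = mu.size
    Xc = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(Sigma), Xc)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Toy "engine" data: x1 = heat, x2 = vibration (synthetic)
rng = np.random.default_rng(0)
X = rng.normal([5.0, 3.0], [1.0, 0.5], size=(500, 2))
mu, Sigma = fit_gaussian(X)

# Hypothetical threshold: 2nd percentile of the training densities
eps = np.exp(np.percentile(log_pdf(X, mu, Sigma), 2))

x_new = np.array([[9.0, 0.0]])        # hot, low-vibration engine
is_anomaly = np.exp(log_pdf(x_new, mu, Sigma)) < eps
```

In practice ε is chosen on a labeled cross-validation set, as the evaluation slides below discuss.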
Evaluation
Anomaly Detection
Evaluating an anomaly detection model
When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.
Assume we have some labeled data, of anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous).
Training set: x(1), …, x(m) (assume normal examples/not anomalous)
Cross validation set: (x_cv(1), y_cv(1)), …, (x_cv(m_cv), y_cv(m_cv))
Test set: (x_test(1), y_test(1)), …, (x_test(m_test), y_test(m_test))
Aircraft engines motivating example
10000 good (normal) engines
20 flawed engines (anomalous)
Training set: 6000 good engines
CV: 2000 good engines (y = 0), 10 anomalous (y = 1)
Test: 2000 good engines (y = 0), 10 anomalous (y = 1)
Alternative:
Training set: 6000 good engines
CV: 4000 good engines (y = 0), 10 anomalous (y = 1)
Test: 4000 good engines (y = 0), 10 anomalous (y = 1)
Algorithm evaluation
Fit model p(x) on training set {x(1), …, x(m)}
On a cross validation/test example (x, y), predict
y = 1 if p(x) < ε (anomaly)
y = 0 if p(x) ≥ ε (normal)
Possible evaluation metrics:
– True positive, false positive, false negative, true negative
– Precision/Recall
– F1-score
Can also use cross validation set to choose parameter ε
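A small sketch of choosing ε by maximizing F1 on the cross-validation set. The densities and labels below are made-up toy values:

```python
import numpy as np

def f1_for_threshold(p_cv, y_cv, eps):
    # Flag an anomaly (predict y = 1) when p(x) < eps
    pred = (p_cv < eps).astype(int)
    tp = np.sum((pred == 1) & (y_cv == 1))
    fp = np.sum((pred == 1) & (y_cv == 0))
    fn = np.sum((pred == 0) & (y_cv == 1))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy CV densities p(x_cv) and labels (1 = anomalous)
p_cv = np.array([0.9, 0.8, 0.7, 0.01, 0.02, 0.6])
y_cv = np.array([0,   0,   0,   1,    1,    0  ])

# Sweep candidate thresholds and keep the best-F1 epsilon
best_eps = max(np.linspace(p_cv.min(), p_cv.max(), 100),
               key=lambda e: f1_for_threshold(p_cv, y_cv, e))
```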
Anomaly detection
• Very small number of positive examples (y = 1)
• Large number of negative (y = 0) examples
• Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
vs.
Supervised learning
• Large number of positive and negative examples.
• Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.
Anomaly detection
• Fraud detection
• Manufacturing (e.g. aircraft engines)
• Monitoring machines in a data center
vs.
Supervised learning
• Email spam classification
• Weather prediction (sunny/rainy/etc.)
• Cancer classification
Anomaly Detection by Reconstruction
Recall: Computing PCA decomposition for images
Stack the mean-subtracted face images as the columns of X (num. pixels × num. face images).
svd(X,0) gives X = U S Vᵀ
Covariance matrix X Xᵀ = U S Vᵀ V S Uᵀ = U S² Uᵀ
We can reconstruct some new image x from the mean image and the leading eigenvectors (columns of U):
x ≈ mean img + ∑ᵢ wᵢ uᵢ,  where wᵢ = uᵢᵀ (x − mean img)
Anomaly Detection? An image like the training faces reconstructs with small error; a large reconstruction error signals an anomaly.
Online Detection of Unusual
Events in Videos via Dynamic
Sparse Coding
Bin Zhao, Li Fei-Fei, Eric Xing
Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, USA, June 2011
Goal: Detect Unusual Events in Videos
• Example unusual event: entering subway via exit
• Videos are described as spatio-temporal features
Dictionary-based Anomaly Detection
• Learn a dictionary of bases corresponding to usual events:
– a usual event should be reconstructible from a small number of such bases, and
– the reconstruction weights should change smoothly over space/time across actions in such events.
– an unusual event is either not reconstructible from the dictionary of usual events with small error, or
– needs a large number of bases, in a temporally/spatially non-smooth fashion.
• Must: Learn a good dictionary of bases representing usual events
• Must: Update the dictionary online to adapt to changing content of the video
Algorithm: Look for High Reconstruction Error
Reconstruction error: For a usual event, this term should be small, due to the assumption that the learned dictionary represents knowledge in the previously seen video data.
Algorithm
Results on YouTube Videos
One-Class SVM
Anomaly Detection
Support Vector Method for Novelty Detection
Bernhard Schölkopf, Robert C. Williamson, Alex Smola, John Shawe-Taylor, John Platt
Problem Formulation
• Suppose we are given a training sample drawn from an underlying distribution P
• We want to estimate a “simple” subset S ⊂ X such that for a test point x drawn from the distribution P, Pr(x ∉ S) = ν, ν ∈ (0,1]
• We approach the problem by trying to estimate a function f which is positive on S and negative on the complement
One-Class SVM
• Given a feature space F (via kernel k )
• Define the class to be
C_w = {x | f_w(x) ≥ ρ}
• (w, ρ) are respectively a weight vector and an offset parameterizing a hyperplane in F
“Hey, Just a second”
If we use hyperplanes and offsets, doesn’t that mean we separate the “positive” sample? But separate from what?
From the Origin
43
OCSVM
[Figure: mapped data Φ(x) separated from the origin by the hyperplane ⟨w, Φ(x)⟩ = ρ; distance from the origin ρ/||w||, slack shown as ξ/||w||]
OCSVM
To separate the data set from the origin we solve the following quadratic program:
min_{w∈F, ξ∈ℝᵐ, ρ∈ℝ}  ½||w||² + (1/(νm)) ∑ᵢ ξᵢ − ρ
subject to
⟨w, Φ(xᵢ)⟩ ≥ ρ − ξᵢ
ξᵢ ≥ 0
with 0 < ν ≤ 1.
OCSVM
ν ∈ (0,1] serves as a penalizer, like “C” in the 2-class SVM.
To separate the data set from the origin we solve the following quadratic program:
min_{w∈F, ξ∈ℝᵐ, ρ∈ℝ}  ½||w||² + (1/(νm)) ∑ᵢ ξᵢ − ρ
subject to
⟨w, Φ(xᵢ)⟩ ≥ ρ − ξᵢ
ξᵢ ≥ 0
Notice that no “y”s are incorporated in the constraints since there are no labels.
OCSVM
• The decision is therefore:
f(x) = sgn(⟨w, Φ(x)⟩ − ρ)
• Since the slack variables ξᵢ are penalized in the objective function, we can expect that if w and ρ solve the problem then f will equal 1 for most examples in the training set, while ||w|| still stays small
OCSVM
Using multipliers αᵢ, βᵢ ≥ 0 we get the Lagrangian:
L(w, ξ, ρ, α, β) = ½||w||² + (1/(νm)) ∑ᵢ ξᵢ − ρ − ∑ᵢ αᵢ(⟨w, Φ(xᵢ)⟩ − ρ + ξᵢ) − ∑ᵢ βᵢ ξᵢ
OCSVM
Setting the derivatives of L w.r.t. w, ξ, ρ to 0 yields:
1) w = ∑ᵢ αᵢ Φ(xᵢ)
2) αᵢ = 1/(νm) − βᵢ ≤ 1/(νm),  ∑ᵢ αᵢ = 1
OCSVM
Eq. 1 transforms f(x) into a kernel expansion:
f(x) = sgn( ∑ᵢ αᵢ k(xᵢ, x) − ρ )
Substituting eq. 1 & 2 into L yields the dual problem:
min_{α∈ℝᵐ}  ½ ∑ᵢⱼ αᵢ αⱼ k(xᵢ, xⱼ)
subject to  0 ≤ αᵢ ≤ 1/(νm),  ∑ᵢ αᵢ = 1
OCSVM
The offset ρ can be recovered by exploiting that for any αᵢ with 0 < αᵢ < 1/(νm), the corresponding pattern xᵢ satisfies:
ρ = ⟨w, Φ(xᵢ)⟩ = ∑ⱼ αⱼ k(xⱼ, xᵢ)
Example
sklearn.svm.OneClassSVM
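A short usage sketch of sklearn.svm.OneClassSVM. The toy data and the kernel, gamma, and ν values are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training set contains only "normal" events (a cluster around (4, 4))
X_train = rng.normal(0.0, 1.0, size=(200, 2)) + [4.0, 4.0]

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.05).fit(X_train)

# predict: +1 = inside the learned region ("normal"), -1 = outside ("novel")
pred = clf.predict(np.array([[4.0, 4.0], [20.0, -10.0]]))
```

With an RBF kernel, points far from all training data get kernel values near zero, so the decision function sgn(∑ᵢ αᵢ k(xᵢ, x) − ρ) is negative there.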
OneClassSVM - Shortcomings
• Implicitly assumes that the “negative” data lies around the origin.
• Completely ignores “negative” data even when some such data exists.