LUDWIG- MAXIMILIANS- UNIVERSITÄT MÜNCHEN
INSTITUTE FOR INFORMATICS
DATABASE SYSTEMS GROUP
The 2010 SIAM International Conference on Data Mining
Copyright By PowCoder代写 加微信 powcoder
Outlier Detection Techniques
Hans- , Peer Kröger,
Ludwig-Maximilians-Universität München Munich, Germany http://www.dbs.ifi.lmu.de
Tutorial Notes: SIAM SDM 2010, Columbus, Ohio
DATABASE SYSTEMS GROUP
General Issues
1. Pleasefeelfreetoaskquestionsatanytimeduringthe presentation
2. Aimofthetutorial:getthebigpicture
– NOT in terms of a long list of methods and algorithms
– BUT in terms of the basic algorithmic approaches
– Sample algorithms for these basic approaches will be sketched
• The selection of the presented algorithms is somewhat arbitrary
• Please don’t mind if your favorite algorithm is missing
• Anyway you should be able to classify any other algorithm not covered here by means of which of the basic approaches is implemented
3. Therevisedversionoftutorialnoteswillsoonbeavailable on our websites
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
Definition of Hawkins [Hawkins 1980]:
Statistics-based intuition
What is an outlier?
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”
– Normal data objects follow a “generating mechanism”, e.g. some given statistical process
– Abnormal objects deviate from this generating mechanism
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Example: Hadlum vs. Hadlum (1949) [Barnett 1978]
• ThebirthofachildtoMrs. Hadlum happened 349 days after Mr. Hadlum left for military service.
• Averagehumangestation period is 280 days (40 weeks).
• Statistically,349daysisan outlier.
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Example: Hadlum vs. Hadlum (1949) [Barnett 1978]
− blue: statistical basis (13634 observations of gestation periods)
− green: assumed underlying Gaussian process
− Very low probability for the birth of Mrs. Hadlums child for being generated by this process
− red: assumption of Mr. Hadlum (another Gaussian process responsible for the observed birth, where the gestation period responsible)
− Under this assumption the gestation period has an average duration and highest-possible probability
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Sample applications of outlier detection
– Fraud detection
• Purchasingbehaviorofacreditcardownerusuallychangeswhenthe card is stolen
• Abnormalbuyingpatternscancharacterizecreditcardabuse – Medicine
• Unusualsymptomsortestresultsmayindicatepotentialhealthproblems of a patient
• Whetheraparticulartestresultisabnormalmaydependonother characteristics of the patients (e.g. gender, age, …)
– Public health
• Theoccurrenceofaparticulardisease,e.g.tetanus,scatteredacross various hospitals of a city indicate problems with the corresponding vaccination program in that city
• Whetheranoccurrenceisabnormaldependsondifferentaspectslike frequency, spatial correlation, etc.
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Sample applications of outlier detection (cont.)
– Sports statistics
• Inmanysports,variousparametersarerecordedforplayersinorderto evaluate the players’ performances
• Outstanding(inapositiveaswellasanegativesense)playersmaybe identified as having abnormal parameter values
• Sometimes,playersshowabnormalvaluesonlyonasubsetoraspecial combination of the recorded parameters
– Detecting measurement errors
• Dataderivedfromsensors(e.g.inagivenscientificexperiment)may contain measurement errors
• Abnormalvaluescouldprovideanindicationofameasurementerror
• Removingsucherrorscanbeimportantinotherdatamininganddata analysis tasks
• “Oneperson‘snoisecouldbeanotherperson‘ssignal.” –…
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Discussion of the basic intuition based on Hawkins – Data is usually multivariate, i.e., multi-dimensional
=> basic model is univariate, i.e., 1-dimensional
– There is usually more than one generating mechanism/statistical process underlying the data
=> basic model assumes only one “normal” generating mechanism
– Anomalies may represent a different class (generating mechanism) of objects, so there may be a large class of similar objects that are the outliers
=> basic model assumes that outliers are rare observations
• Consequence: a lot of models and approaches have evolved in the past years in order to exceed these assumptions and it is not easy to keep track with this evolution.
• New models often involve typical, new, though usually hidden assumptions and restrictions.
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• General application scenarios
– Supervised scenario
• Insomeapplications,trainingdatawithnormalandabnormaldata objects are provided
• Theremaybemultiplenormaland/orabnormalclasses
• Often,theclassificationproblemishighlyunbalanced
– Semi-supervised Scenario
• Insomeapplications,onlytrainingdataforthenormalclass(es)(oronly the abnormal class(es)) are provided
– Unsupervised Scenario
• Inmostapplicationstherearenotrainingdataavailable
• In this tutorial, we focus on the unsupervised scenario
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Are outliers just a side product of some clustering algorithms?
– Many clustering algorithms do not assign all points to clusters but account for noise objects
– Look for outliers by applying one of those algorithms and retrieve the noise set
– Problem:
• Clusteringalgorithmsareoptimizedtofindclustersratherthanoutliers
• Accuracyofoutlierdetectiondependsonhowgoodtheclustering algorithm captures the structure of clusters
• Asetofmanyabnormaldataobjectsthataresimilartoeachotherwould be recognized as a cluster rather than as noise/outliers
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• We will focus on three different classification approaches
– Global versus local outlier detection
Considers the set of reference objects relative to which each point’s “outlierness” is judged
– Labeling versus scoring outliers Considers the output of an algorithm
– Modeling properties
Considers the concepts based on which “outlierness” is modeled
NOTE: we focus on models and methods for Euclidean data but many of those can be also used for other data types (because they only require a distance measure)
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Global versus local approaches
– Considers the resolution of the reference set w.r.t. which the “outlierness” of a particular data object is determined
– Global approaches
• The reference set contains all other data objects
• Basicassumption:thereisonlyonenormalmechanism
• Basic problem: other outliers are also in the reference set and may falsify the results
– Local approaches
• The reference contains a (small) subset of data objects
• Noassumptiononthenumberofnormalmechanisms
• Basicproblem:howtochooseaproperreferenceset
– NOTE: Some approaches are somewhat in between
• Theresolutionofthereferencesetisvariede.g.fromonlyasingleobject (local) to the entire database (global) automatically or by a user-defined input parameter
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Labeling versus scoring
– Considers the output of an outlier detection algorithm
– Labeling approaches • Binaryoutput
• Dataobjectsarelabeledeitherasnormaloroutlier – Scoring approaches
• Continuousoutput
• Foreachobjectanoutlierscoreiscomputed(e.g.theprobabilityfor being an outlier)
• Dataobjectscanbesortedaccordingtotheirscores – Notes
• Manyscoringapproachesfocusondeterminingthetop-noutliers (parameter n is usually given by the user)
• Scoringapproachescanusuallyalsoproducebinaryoutputifnecessary (e.g. by defining a suitable threshold on the scoring values)
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
• Approaches classified by the properties of the underlying modeling approach
– Model-based Approaches
• Rational
– Apply a model to represent normal data points – Outliers are points that do not fit to that model
• Sampleapproaches
– Probabilistic tests based on statistical models – Depth-based approaches
– Deviation-based approaches
– Some subspace outlier detection approaches
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Introduction
– Proximity-based Approaches
• Rational
– Examine the spatial proximity of each object in the data space
– If the proximity of an object considerably deviates from the proximity of other objects it is considered an outlier
• Sampleapproaches
– Distance-based approaches
– Density-based approaches
– Some subspace outlier detection approaches
– Angle-based approaches
• Rational
– Examine the spectrum of pairwise angles between a given point and all other points
– Outliers are points that have a spectrum featuring high fluctuation
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
1. Introduction √
2. StatisticalTests
3. Depth-basedApproaches
4. Deviation-basedApproaches
5. Distance-basedApproaches 6. Density-basedApproaches
7. High-dimensionalApproaches 8. Summary
statistical model
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
model based on spatial proximity
adaptation of different models to a special problem
DATABASE SYSTEMS GROUP
Statistical Tests
• General idea
– Given a certain kind of statistical distribution (e.g., Gaussian)
– Compute the parameters assuming all data points have been generated by such a statistical distribution (e.g., mean and standard deviation)
– Outliers are points that have a low probability to be generated by the overall distribution (e.g., deviate more than 3 times the standard deviation from the mean)
• Basic assumption
– Normal data objects follow a (known) distribution and occur in a high probability region of this model
– Outliers deviate strongly from this distribution
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Statistical Tests
• A huge number of different tests are available differing in
– Type of data distribution (e.g. Gaussian)
– Number of variables, i.e., dimensions of the data objects (univariate/multivariate)
– Number of distributions (mixture models)
– Parametric versus non-parametric (e.g. histogram-based)
• Example on the following slides – Gaussian distribution
– Multivariate
– Parametric
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Statistical Tests
• Probability density function of a multivariate normal distribution
N(x) = 1 e−(x−μ)T Σ−1(x−μ) 2
– Σ is the covariance matrix from the mean
– μ is the mean value of all points (usually data is normalized such that
– MDist(x,μ)=(x−μ)TΣ−1(x−μ) istheMahalanobisdistanceof point x to μ
– MDist follows a χ2-distribution with d degrees of freedom (d = data dimensionality)
– All points x, with MDist(x,μ) > χ2(0,975) [≈ 3.σ] Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Statistical Tests
• Visualization (2D) [Tan et al. 2006]
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Statistical Tests
• Problems
– Curse of dimensionality
• Thelargerthedegreeoffreedom,themoresimilartheMDistvaluesfor all points
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
x-axis: observed MDist values y-axis: frequency of observation
DATABASE SYSTEMS GROUP
Statistical Tests
• Problems (cont.)
– Robustness
• Meanandstandarddeviationareverysensitivetooutliers
• Thesevaluesarecomputedforthecompletedataset(includingpotential outliers)
• TheMDistisusedtodetermineoutliersalthoughtheMDistvaluesare influenced by these outliers
=> Minimum Covariance Determinant [Rousseeuw and Leroy 1987] minimizes the influence of outliers on the Mahalanobis distance
• Discussion
– Data distribution is fixed
– Low flexibility (no mixture model)
– Global method
– Outputs a label but can also output a score
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
1. Introduction √
2. Statistical Tests √
3. Depth-basedApproaches
4. Deviation-basedApproaches
5. Distance-basedApproaches 6. Density-basedApproaches
7. High-dimensionalApproaches 8. Summary
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Depth-based Approaches
• General idea
– Search for outliers at the border of
the data space but independent of
statistical distributions
– Organize data objects in
convex hull layers
– Outliers are objects on outer layers
• Basic assumption
– Outliers are located at the border of the data space – Normal objects are in the center of the data space
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
Picture taken from [Johnson et al. 1998]
DATABASE SYSTEMS GROUP
Depth-based Approaches
• Model [Tukey 1977]
– Points on the convex hull of the full data space have depth = 1
– Points on the convex hull of the data set after removing all points with depth = 1 have depth = 2
– Points having a depth ≤ k are reported as outliers
Picture taken from [Preparata and Shamos 1988]
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Depth-based Approaches
• Sample algorithms
– ISODEPTH [Ruts and Rousseeuw 1996] – FDC [Johnson et al. 1998]
• Discussion
– Similar idea like classical statistical approaches (k = 1 distributions) but independent from the chosen kind of distribution
– Convex hull computation is usually only efficient in 2D / 3D spaces
– Originally outputs a label but can be extended for scoring easily (take depth as scoring value)
– Uses a global reference set for outlier detection
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
1. Introduction √
2. Statistical Tests √
3. Depth-based Approaches √
4. Deviation-basedApproaches
5. Distance-basedApproaches 6. Density-basedApproaches
7. High-dimensionalApproaches 8. Summary
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Deviation-based Approaches
• General idea
– Given a set of data points (local group or global set)
– Outliers are points that do not fit to the general characteristics of that set, i.e., the variance of the set is minimized when removing the outliers
• Basic assumption
– Outliers are the outermost points of the data set
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Deviation-based Approaches
• Model [Arning et al. 1996]
– Given a smoothing factor SF(I) that computes for each I ⊆ DB how
much the variance of DB is decreased when I is removed from DB
– With equal decrease in variance, a smaller exception set is better
– The outliers are the elements of the exception set E ⊆ DB for which the following holds:
SF(E) ≥ SF(I) for all I ⊆ DB
– Similar idea like classical statistical approaches (k = 1 distributions)
• Discussion:
but independent from the chosen kind of distribution
– Naïve solution is in O(2n) for n data objects
– Heuristics like random sampling or best first search are applied
– Applicable to any data type (depends on the definition of SF)
– Originally designed as a global method
– Outputs a labeling
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
1. Introduction √
2. Statistical Tests √
3. Depth-based Approaches √
4. Deviation-based Approaches √
5. Distance-basedApproaches 6. Density-basedApproaches
7. High-dimensionalApproaches 8. Summary
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Distance-based Approaches
• General Idea
– Judge a point based on the distance(s) to its neighbors – Several variants proposed
• Basic Assumption
– Normal data objects have a dense neighborhood
– Outliers are far apart from their neighbors, i.e., have a less dense neighborhood
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Distance-based Approaches
• DB(ε,π)-Outliers
– Basic model [Knorr and Ng 1997]
• Givenaradiusεandapercentageπ
• Apointpisconsideredanoutlierifatmostπpercentofallotherpoints have a distance to p less than ε
OutlierSet(ε,π)={p|Card({q∈DB|dist(p,q)<ε}) ≤π} Card(DB)
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
range-query with radius ε
DATABASE SYSTEMS GROUP
Distance-based Approaches
– Algorithms
• Index-based[KnorrandNg1998]
– Compute distance range join using spatial index structure
– Exclude point from further consideration if its ε-neighborhood contains more than Card(DB) . π points
• Nested-loopbased[KnorrandNg1998]
– Divide buffer in two parts
– Use second part to scan/compare all points with the points from the first part
• Grid-based[KnorrandNg1998]
– Build grid such that any two points from the same grid cell have a distance of
at most ε to each other
– Points need only compared with points from neighboring cells
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Distance-based Approaches
– Deriving intensional knowledge [Knorr and Ng 1999]
• ReliesontheDB(ε,π)-outliermodel
• Findtheminimalsubset(s)ofattributesthatexplainsthe“outlierness”ofa point, i.e., in which the point is still an outlier
– Identified outliers
– Derived intensional knowledge (sketch)
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Distance-based Approaches
• Outlier scoring based on kNN distances
– General models
• TakethekNNdistanceofapointasitsoutlierscore[Ramaswamyetal2000]
• Aggregate the distances of a point to all its 1NN, 2NN, ..., kNN as an outlier score [Angiulli and Pizzuti 2002]
– Algorithms
• Generalapproaches – Nested-Loop
» Naïve approach:
For each object: compute kNNs with a sequential scan
» Enhancement: use index structures for kNN queries – Partition-based
» Partition data into micro clusters
» Aggregate information for each partition (e.g. minimum bounding rectangles)
» Allows to prune micro clusters that cannot qualify when searching for the kNNs of a particular point
Kriegel/Kröger/Zimek: Outlier Detection Techniques (SDM 2010)
DATABASE SYSTEMS GROUP
Distance-based Approaches
– Sample Algorithms (computing top-n outliers)
• Nested-Loop[Ramaswamyetal2000]
– Simple NL algorithm with index support for kNN queries
– Partition-based algorithm (based on a clustering algorithm that has linear time complexity)
– Algorithm for the simple kNN-distance model
• Linearization[AngiulliandPizzuti2002]
– Linearization of a multi-dimensional data set using space-fill curves – 1D representation is partitioned into micro clusters
– Algorithm for the average kNN-distance model
• ORCA[BayandSchwabacher2003]
– NL algorithm with randomization and simple pruning
– Pruning: if a point has a score greater than the top-n outlier so far (cut-off), remove this point from further consideration
=> non-outliers are pruned
=> works good on randomized data (can be done in linear ti
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com