CS代写 database data mining algorithm Clustering and Density-based Anomaly Detection

Clustering and Density-based Anomaly Detection
COMP90073 Security Analytics
, CIS Semester 2, 2021

Outline
• Anomalydetectionwithclustering
• Density-BasedSpatialClustering(DBSCAN) • LocalOutlierFactor(LOF)
COMP90073 Security Analytics © University of Melbourne 2021

Using Clustering for Anomaly Detection
• Advantages:
– Theycandetectanomalieswithoutrequiringanylabelleddata.
– Theyworkformanydatatypes.
– Clusterscanberegardedassummariesofthedata.
– Oncetheclustersareobtained,clustering-basedmethodsneedonly compare any object against the clusters to determine whether the object is an anomaly.
– Testprocessistypicallyfastandefficientbecausethenumberofclusters is usually small compared to the total number of objects small.
• Weakness:
– Theireffectivenessdependshighlyontheclusteringmethodused.Such
methods may not be optimized for outlier detection.
– Theyareoftencostlyforlargedatasets,whichcanserveasabottleneck.
COMP90073 Security Analytics © University of Melbourne 2021

Clustering-based Anomaly Detection
• k-meansclustering
COMP90073 Security Analytics © University of Melbourne 2021

Clustering-based Anomaly Detection
• k-meansclustering
k=3
+ Cluster center
COMP90073 Security Analytics © University of Melbourne 2021

Clustering-based Anomaly Detection
• k-meansclustering
k=3
+ Cluster center
COMP90073 Security Analytics © University of Melbourne 2021

Clustering-based Anomaly Detection
• Assignananomalyscoretoeachobject according to the distance between the object and the centre of closest cluster.
• 𝐴𝑛𝑜𝑚𝑎𝑙𝑦 𝑠𝑐𝑜𝑟𝑒(𝑝 ) = “#$%(‘!,)”)
! # ∑% “#$%(‘%,)&)
• Anomalies(a,b,c)arefarfromtheclustersto which they are closest (with respect to the cluster centres).
$
COMP90073 Security Analytics © University of Melbourne 2021

Limitations of k-means: Differing Size
Original Points K-means (3 Clusters)
COMP90073 Security Analytics © University of Melbourne 2021

Limitations of k-means: Differing Density
Original Points K-means (3 Clusters)
COMP90073 Security Analytics © University of Melbourne 2021

Limitations of k-means: Non Globular Shape
Original Points K-means (2 Clusters)
COMP90073 Security Analytics © University of Melbourne 2021

Density based Clustering
• Density-basedclustering:Modelclustersasdenseregionsinthedataspace, separated by sparse regions, which can discover clusters of non-spherical shape and avoid outliers.
COMP90073 Security Analytics © University of Melbourne 2021

Density-Based Spatial Clustering
of Applications with Noise (DBSCAN) [2]
• Objective:Identifydenseregions,whichcanbemeasuredbythenumberof objects close to a given point.
– Findscoreobjects,thatis,objectsthathavedenseneighbourhoods.It connects core objects and their neighbourhoods to form dense regions as clusters.
• ImportantQuestions:
– Howdowemeasuredensity? – Whatisadenseregion?
Eps = 1cm MinPts = 4
Parameters:
• Densityatpointp:NumberofpointswithinacircleofradiusEps
• DenseRegion:AclusterwithradiusEpsthatcontainsatleastMinPtspoints
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN – Concepts
DBSCAN defines different classes of points:
• Corepoint:ApointwithatleastMinPtspointswithinitsEps-neighbourhood
(including itself).
• Borderpoint:ApointwithfewerpointsthanMinPtsintheEps-neighbourhood,
but is in the neighbourhood of a core point.
• Anomaly(outlier)point:apointwhichisneithercorenorborder. • E.g.,MinPts=4
Border
P2 P3
P4
P1
Anomaly Core
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN – Core, Border, and Anomaly
Original data Point types: core, border and anomaly
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN – Concepts
• DirectlyDensity-reachable:Pointqisdirectly density-reachable from p (w.r.t. Eps and MinPts) if p is a core point, and q is within the Eps- neighbourhood of p.
• (Indirectly)Density-reachable:Pointqisdensity- reachable from p (w.r.t. Eps and MinPts) if there is a chain of points q1…qn, q1 = p, qn = q, such that qi+1 is directly density-reachable from q1.
• Density-connected:Pointqisdensity-connectedto a point p (w.r.t. Eps and MinPts) if there is a point
o such that both p and q are density reachable from o (w.r.t. Eps and MinPts).
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp w.r.t. Eps and MinPts (e.g., 5)
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and MinPts (e.g., 5)
• Ifpisacorepoint,createacluster
Core point
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and MinPts (e.g., 5)
• Ifpisacorepoint,createacluster
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and MinPts (e.g., 5)
• Ifpisacorepoint,createacluster
• Ifpisaboarderpoint,nopointsaredensity- reachable from p and DBSCAN visits the next point of the database
Border point
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and Mint
• Ifpisacorepoint,createacluster
• Ifpisaboarderpoint,nopointsaredensity- reachable from p and DBSCAN visits the next point of the database
• Repeattheabovestepsuntilalldatapoints have been visited
Anomaly
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and Mint
• Ifpisacorepoint,createacluster
• Ifpisaboarderpoint,nopointsaredensity- reachable from p and DBSCAN visits the next point of the database
• Repeattheabovestepsuntilalldatapoints have been visited
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and Mint
• Ifpisacorepoint,createacluster
• Ifpisaboarderpoint,nopointsaredensity- reachable from p and DBSCAN visits the next point of the database
• Repeattheabovestepsuntilalldatapoints have been visited
COMP90073 Security Analytics © University of Melbourne 2021

DBSCAN Algorithm
• Randomlyselectanunvisitedobjectp
• Retrieveallpointsdensity-reachablefromp
w.r.t. Eps and Mint
• Ifpisacorepoint,createacluster
• Ifpisaboarderpoint,nopointsaredensity- reachable from p and DBSCAN visits the next point of the database
• Repeattheabovestepsuntilalldatapoints have been visited
Computational Complexity:
• 𝑂 𝑛, , where n is the number of samples. • Ifaspatialindexisused,𝑂(𝑛log𝑛).
COMP90073 Security Analytics © University of Melbourne 2021

Determining Eps and MinPts
• Ideaisthatforpointsinacluster, their kth nearest neighbours are at roughly the same distance
• Noise points have the kth nearest neighbour at farther distance
• So,plotsorteddistanceofevery point to its kth nearest neighbour
• Findthedistancedwherethereisa “knee” in the curve
– Eps=d,MinPts=k
• Demo: https://www.naftaliharris.com/blog/vi sualizing-dbscan-clustering/
Eps=7~10 MinPts=4
COMP90073 Security Analytics © University of Melbourne 2021

Advantages and Disadvantages of DBSCAN
• Advantages:
– ResistanttoNoise
– Canhandleclustersofdifferentshapesandsizes
• Disadvantages:
– Varyingdensities
Original data
– = 5 Eps = 3.5 – High-dimensionaldata
MinPts=4, Eps=9.75
Eps = 3
COMP90073 Security Analytics © University of Melbourne 2021

Using Clustering for Anomaly Detection
• Advantages:
– Theycandetectanomalywithoutrequiringanylabelleddata.
– Theyworkformanydatatypes.
– Clusterscanberegardedassummariesofthedata.
– Oncetheclustersareobtained,clustering-basedmethodsneedonly compare any object against the clusters to determine whether the object is an anomaly.
– Testprocessistypicallyfastandefficientbecausethenumberofclusters is usually small compared to the total number of objects small.
• Weakness:
– Theireffectivenessdependshighlyontheclusteringmethodused.Such
methods may not be optimized for anomaly detection.
– Theyareoftencostlyforlargedatasets,whichcanserveasabottleneck.
COMP90073 Security Analytics © University of Melbourne 2021

Local Proximity-based Outliers
In the following figure which of the following instances are anomalies?
• 𝑜!? • 𝑜”? • 𝑜#? • 𝑜$?
COMP90073 Security Analytics © University of Melbourne 2021

Local Outlier Factor (LOF)
• Objective:Quantifytherelativedensityaboutaparticulardatapoint.
• Intuition:Theanomaliesshouldbemoreisolatedcomparedto“normal”data
points.
• LOFusestherelativedensityofanobjectagainstitsneighbourstoindicatethe degree to which an object is an anomaly.
p
COMP90073 Security Analytics © University of Melbourne 2021

K-Distance
• 𝑘𝑑𝑖𝑠𝑡: distance between 𝑝 and its 𝑘%- NN
• Meta-heuristic:The𝑘𝑑𝑖𝑠𝑡givesusanotionof“volume” • Themoreisolatedapointis,thelargerits𝑘𝑑𝑖𝑠𝑡
p
COMP90073 Security Analytics © University of Melbourne 2021

Reachability Distance
• ReachabilityDistanceof𝑝withrespectto𝑜:
𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡. 𝑝, 𝑜 = max{𝑘𝑑𝑖𝑠𝑡 𝑜 , 𝑑𝑖𝑠𝑡 𝑝, 𝑜 }
Not symmetric
• Intuition:“Doyourcloseneighboursseeyouasoneoftheircloseneighbours”
q
o
p
COMP90073 Security Analytics © University of Melbourne 2021

Local Reachability Density
• LocalReachabilityDensityof𝑝:
𝑙𝑟𝑑. 𝑝 = 𝑘1 @ 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡.(𝑝, 𝑜)
/∈1((,*)
k nearest neighbours of p
• Intuition:Howfarwehavetotravelfromourpointtoreachthenextpointor
cluster of points.
– Theloweritis,thelessdenseitis,thelongerwehavetotravel.
23
COMP90073 Security Analytics © University of Melbourne 2021

LOF
• LOFofanobject𝑝istheaverageoftheratiooflocalreachabilityof𝑝andthose of 𝑜′s k-nearest neighbours
• Theanomaliesarecomingfromlessdensearea,sotheratioishigherfor anomalies
𝐿𝑂𝐹. 𝑝 =1@ 𝑙𝑟𝑑. 𝑜 𝑘 /∈1((,*) 𝑙𝑟𝑑. 𝑝
p
COMP90073 Security Analytics © University of Melbourne 2021

Interpretation of LOF Score
• Thelowerthelocalreachabilitydensityof𝑝,andthehigherthelocal reachability density of the kNN of 𝑝, the higher LOF
– 𝐿𝑂𝐹. 𝑝 ~1: Comparable density to neighbours, – 𝐿𝑂𝐹. 𝑝 < 1: Higher density than neighbours – 𝐿𝑂𝐹. 𝑝 > 1: Lower density than neighbours
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
• Considerthefollowing4datapoints: a(0, 0), b(0, 1), c(1, 1), d(3, 0)
• CalculatetheLOFforeachpointandshowthetop1outlier,setk=2anduse Manhattan Distance.
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step 1: Calculate all the distances between each two data points • There are 4 data points:
a(0, 0), b(0, 1), c(1, 1), d(3, 0)
dist(a, b) = 1 dist(a, c) = 2 dist(a, d) = 3 dist(b, c) = 1 dist(b, d) = 3+1=4 dist(c, d) = 2+1=3
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step 2: Calculate distk (o), distance between o and its k-th NN (k-th nearest neighbour)
k=2:
dist2 (a) = dist(a, c) = 2 (c is the 2nd nearest neighbour) dist2 (b) = dist(b, a) = 1 (a/c is the 2nd nearest neighbour) dist2 (c) = dist(c, a) = 2 (a is the 2nd nearest neighbour) dist2 (d) = dist(d, a) = 3 (a/c is the 2nd nearest neighbour)
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step 3: Calculate all the Nk (p), k-distance neighborhood of p, Nk (p) = {p’| p’ in D, dist(p, p’) ≤ distk (p)}
N2 (a) = {b, c} N2 (b) = {a, c} N2 (c) = {b, a} N2 (d) = {a, c}
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step4:Calculateallthe𝑙𝑟𝑑. 𝑝
For example:
𝑙𝑟𝑑. 𝑎 = 𝑁,(𝑎)
𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡 𝑎,𝑏 +𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡(𝑎,𝑐)
• 𝑁,(𝑎) = {𝑏,𝑐} =2
• 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡 𝑎,𝑏 = max 𝑑𝑖𝑠𝑡, 𝑏 𝑑𝑖𝑠𝑡 𝑏,𝑎 = max 1,1 = 1 • 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡 𝑎,𝑐 = max 𝑑𝑖𝑠𝑡, 𝑐 𝑑𝑖𝑠𝑡 𝑐,𝑎 = max 2,2 = 2
𝑙𝑟𝑑.𝑎= 2 =0.67 1+2
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step4:Calculateallthe𝑙𝑟𝑑. 𝑝 Similarly,
•𝑙𝑟𝑑%𝑏= •𝑙𝑟𝑑%𝑐= •𝑙𝑟𝑑%𝑑=
&!(()
*+,-./012 (,, 4*+,-./012((,-)
&!(-)
*+,-./012 -,( 4*+,-./012(-,,)
&!(/)
*+,-./012 /,, 4*+,-./012(/,-)
=” =0.5 “4”
=” =0.67 !4″
=” =0.33 #4#
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step5:calculateallthe𝐿𝑂𝐹. 𝑝
• 𝐿𝑂𝐹, 𝑎 = 𝑙𝑟𝑑, 𝑏 + 𝑙𝑟𝑑, 𝑐 × 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑎,𝑏 + 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑎,𝑐 =
0.5+0.67 × 1+2 =3.51
• 𝐿𝑂𝐹, 𝑏 = 𝑙𝑟𝑑, 𝑎 +𝑙𝑟𝑑, 𝑐 × 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑏,𝑎 +𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑏,𝑐 = 0.67+0.67 × 2+2 =5.36
• 𝐿𝑂𝐹, 𝑐 = 𝑙𝑟𝑑, 𝑏 + 𝑙𝑟𝑑, 𝑎 × 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑐,𝑏 + 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑐,𝑎 = 0.5+0.67 × 1+2 =3.51
• 𝐿𝑂𝐹, 𝑑 = 𝑙𝑟𝑑, 𝑎 + 𝑙𝑟𝑑, 𝑐 × 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑑,𝑎 + 𝑟𝑒𝑎𝑐h𝑑𝑖𝑠𝑡, 𝑑,𝑐 = 0.67+0.67 × 3+3 =8.04
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Example
Step6:Sortallthe𝐿𝑂𝐹. 𝑝 The sorted order is:
• 𝐿𝑂𝐹, 𝑑 =8.04 • 𝐿𝑂𝐹, 𝑑 =5.36 • 𝐿𝑂𝐹, 𝑑 =3.51 • 𝐿𝑂𝐹, 𝑑 =3.51
Obviously, top 1 anomaly is point d
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Properties
• LOFcapturesalocalanomalywhoselocaldensityisrelativelylowcomparing to the local densities of its kNN
• Outputsascoring(assignsanLOFvaluetoeachpoint)
• Choiceofkspecifiesthereferenceset
• Originallyimplementsalocalapproach(resolutiondependsontheuser’s choice for k)
COMP90073 Security Analytics © University of Melbourne 2021

Cluster-based Local Outlier Factor (CBLOF)[1]
1) Findclustersinadataset(usingk-means)
2) Sortthemaccordingtodecreasingsize.
– Any cluster that contains at least a percentage (e.g., 90%) of the data set is considered a “large cluster.” The remaining clusters are referred to as “small clusters.”
3) Toeachdatapoint,assignacluster-basedlocal outlier factor (CBLOF), which computed as the product of the cluster’s size and the similarity between the point and the cluster.
– Forapointbelongingtoasmallcluster,its CBLOF is calculated as the product of the size of the small cluster and the similarity between the point and the closest large cluster.
COMP90073 Security Analytics © University of Melbourne 2021

Summary
• Whataretheadvantagesofclusteringforanomalydetection? • Howdistanceanddensitybasedclusteringperformdifferently? • Howtoidentifylocalanomalies?
• Howtoidentifygroupanomalies?
Next: Anomaly Detection in Evolving Data Streams
COMP90073 Security Analytics © University of Melbourne 2021

References
1. , , Jian Pei, “Data Mining: Concepts and Techniques”, 3rd ed, 2012. Chapters 10.4 and 12
2. , Hans- , Jörg Sander , , “A density- based algorithm for discovering clusters in large spatial databases with noise.”, KDD, 1996.
3. Density-Based Clustering http://www.cse.buffalo.edu/faculty/azhang/cse601/density-based.ppt
COMP90073 Security Analytics © University of Melbourne 2021