Anomaly Detection in Evolving Data Streams
COMP90073 Security Analytics
, CIS Semester 2, 2021
Outline
• Introductiontodatastreams • Windowingtechniques
• HS-Trees
• IncrementalLOF(iLOF)
– Memory-efficientiLOF(MiLOF)
COMP90073 Security Analytics © University of Melbourne 2021
Data Streams
Data stream is a sequence of data points, which is continues, unbounded, and nonstationary.
• StreamliningAnalysis
– Largevolumeofdata
– Short/real-timeresponse
– Limitedmemory
– Energy/communicationconstraints
COMP90073 Security Analytics © University of Melbourne 2021
Batch Learning vs. Incremental Learning
Batch Leaning: Data points are stored until they can be analysed at the end of a monitoring period. Batch learning methods
• Canbecomputationallyefficient
• Theiraccuracyisheavilydependentonagoodchoiceforthetrainingperiod
and the quality of the training data
• Cannotbeappliedinstreamingenvironments,wherethemeasurementsarrive as a continuous stream of data
• Cannotbeusedinresourceconstraintdevicestobufferallthemeasurements
• Cannotidentifyanomalouspointsastheyoccur
• Cannotadapttochangesintheenvironment(e.g.,drift)
Incremental Learning: Data points are (usually) analysed once and there is no need to buffer the data. Incremental methods
• Startwithasetofinitialparametersfortheselectedmodelandtheybecomes more accurate as more data arrives
COMP90073 Security Analytics © University of Melbourne 2021
Different Windowing Techniques for Data Streams
• Landmarkwindows:Afixedpointinthedata stream is defined as a landmark and processing is done over data points between the landmark and the present data point.
• Dampedwindows:Aweightisassignedtoeach data point in such a way that the old data points are given smaller weights. Therefore, the most recent data points are used with higher weights.
COMP90073 Security Analytics © University of Melbourne 2021
Different Windowing Techniques for Data Streams
• Slidingwindows:Aslidingwindowsizewis considered in this technique. It processes the last w data points in the stream, while older data points are discarded.
• Adaptivewindows:Thewindowsizewwould change as the data stream evolves. In this technique, the more the data points evolve, the smaller w becomes. In contrast, if data points remain constant, the value of w increases.
COMP90073 Security Analytics © University of Melbourne 2021
Streaming Half Space (HS)-Trees [1]
A fast one-class anomaly detector for evolving data streams. • Arandomtreemodel
• Buildstreestructurewithoutdata
• Detectsanomaliesinonepass
• Adaptstodistributionchangesbyregularmodelupdates • Updates model in constant time 𝑂 𝑡(h + 𝜓)
• Requires constant amount of memory 𝑂 𝑡2!
𝑡: number of trees, h: depth of tree, 𝜓: window size
COMP90073 Security Analytics © University of Melbourne 2021
Half Space-Tree
An HS-Tree is a full binary tree, which all leaves are at the same depth h.
• Randomlyselectanattributed
• Bisectsthespaceintotwohalf-spaces,usingthemid-pointofd(assumethat attributes’ ranges are normalised to [0, 1])
• Continueexpansionuntilthemaximumdepthhofallnodesisreached. • Employs mass as a measure to rank anomalies.
COMP90073 Security Analytics © University of Melbourne 2021
Separating the Regions: HS-Trees
COMP90073 Security Analytics © University of Melbourne 2021
Ranking by Mass
Anomaly
Normal
COMP90073 Security Analytics © University of Melbourne 2021
Stream Processing
• Dividedatastreamintofixed-sizewindows:W1,W2,…,Wn
• Eachwindowisafixednumberofsequenceddatainstances • Initial Learning: Train model M1 using instances in W1
• SubsequentLearningandAnomalyScoring
For each window Wk (where k >1)
Train model Mk using instances in Wk Test instances in Wk using model Mk-1
Next window
COMP90073 Security Analytics © University of Melbourne 2021
Count-based Windows
• Letwindowsize=3
• Initialstage
• W1:referencewindow
– TrainHS-Treesandupdatemassr
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
A
Mass profile of reference window
Mass profile of latest window
C
DE
FG
B
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Train using x1
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Train using x2
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Train using x3
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Training is complete
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021
Count-based Windows
• Letwindowsize=3
• Initialstage
• W1:referencewindow
– TrainHS-Treesandupdatemassr • W2:latestwindow
– InstancesinW2fortrainingHS-Trees(massl) – InstancesinW2fortestingHS-Trees(massr)
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Train/Test using x4
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Train/Test using x5
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021
Streaming HS-Trees – Example
Train/Test using x6
B
C
A
DE
X6
FG
COMP90073 Security Analytics © University of Melbourne 2021
Count-based Windows
When all instances in W2 are processed
• Modelupdateoccurs
• W2becomesthenewreferencewindow
– Transferallmasslvaluestomassrvalues
– Resetallmasslvaluestozero • W3becomesthelatestwindow
– InstancesinW3fortrainingHS-Trees(massl)
– InstancesinW3fortestingHS-Trees(massr) And so on…
COMP90073 Security Analytics © University of Melbourne 2021
Model Update
COMP90073 Security Analytics © University of Melbourne 2021
Model Update
COMP90073 Security Analytics © University of Melbourne 2021
Anomaly Score in HS-Tree
• The final score for 𝑥 is the sum of scores obtained from each HS-Tree in the ensemble
𝑎𝑛𝑜𝑚𝑎𝑙𝑦 𝑠𝑐𝑜𝑟𝑒 𝑥 = 5 𝑆𝑐𝑜𝑟𝑒(𝑥, 𝑡%) “∈$
𝑆𝑐𝑜𝑟𝑒 𝑥, 𝑡% = 𝑁𝑜𝑑𝑒&×2′()*!
r value Depth of at node node
COMP90073 Security Analytics © University of Melbourne 2021
LOF – Revision
Advantages of LOF for anomaly detection:
• Detectsanomaliesregardlessthedatadistributionofnormalbehaviour,sinceit does not make any assumptions about the distributions of data records.
• Detectsanomalieswithrespecttodensityoftheirneighbouringdatarecords; not to the global model.
• DirectlyapplyingLOFtodatastreamswouldbeextremelycomputationally inefficient and/or very often may lead to incorrect prediction results.
COMP90073 Security Analytics © University of Melbourne 2021
Extending LOF to Data Streams
(i) Periodic LOF. Apply LOF algorithm on the entire data set periodically (e.g., after every data block of 1000 data records) or after all the data records are inserted.
• Themajorproblemofthisapproachisinabilitytodetectanomaliesrelatedto the beginning of new behaviour that initially appear within the inserted block.
COMP90073 Security Analytics © University of Melbourne 2021
Extending LOF to Data Streams
(ii) Iterated LOF: Re-apply the static LOF algorithm every time a new data record is inserted into the dataset.
• ThisstaticLOFalgorithmdoesnotsufferfromthepreviousproblems,butis extremely computationally expensive.
• Increases LOF’s time complexity to 𝑂(𝑛+ log 𝑛), where 𝑛 is the current number of data records in the data set
COMP90073 Security Analytics © University of Melbourne 2021
Incremental LOF (iLOF) [2]
Objectives:
• Theresultoftheincrementalalgorithmmustbeequivalenttotheresultofthe “batch”.
• TimecomplexityofincrementalLOFalgorithmhastobecomparabletothe static LOF algorithm 𝑂(𝑛 log 𝑛).
Step 1 – Insertion:
• Insertionofnewrecord,computek-dist,reachdist,IrdandLOFvaluesofa
new point
• Maintenance,updatek-dist,reachdist,IrdandLOFvaluesforaffectedexisting
points.
Step 2 – Deletion: Delete certain data records (e.g., due to their obsoleteness).
• Maintenance,updatek-dist,reachdist,IrdandLOFvaluesforaffectedexisting points.
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 1 (Maintenance)
• Updatingk-dist:Insertionofthepointnmaydecreasethek-distanceofcertain neighbouring points, and it can happen only to those points that have the new point n in their k-neighbourhood (e.g., 2-neighbourhood).
• ReverseNearestNeighbour(RNN):Findalltheobjectsforwhichthenew point n is their (k-)nearest neighbour.
kNN
RNN
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 1 (Maintenance)
• Updatingreachdistk:Whenk-distance(p)changesforapointp, reachdistk(o,p) will be affected only for points o that are in k-neighbourhood of the point p.
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 1 (Maintenance)
• Updatinglrd:Irdvalueofapointpisaffectedif:
– Thek-neighbourhoodofthepointpchanges,
– Reachdist from point p to one of its k-neighbours changes.
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 1 (Maintenance)
• UpdatingLOFValues:LOFvaluesofanexistingpointpshouldbeupdatedif – Ird(p)isupdated,or
– Ird(p)ofoneofitsk-neighbourspchanges
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 2 (Deletion)
• Updating k-dist: The deletion of each record pc from the dataset influences the k-distances of its RNN.
– k-neighbourhood increases for each data record pj that is in reverse k- nearest neighbourhood of pc. The new k-distance for pj becomes equal to its distance to its new k-th nearest neighbour.
pz
py
px
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 2 (Deletion)
• Updating k-dist: The deletion of each record pc from the dataset influences the k-distances of its RNN.
– k-neighbourhood increases for each data record pj that is in reverse k- nearest neighbourhood of pc. The new k-distance for pj becomes equal to its distance to its new k-th nearest neighbour.
pz
COMP90073 Security Analytics © University of Melbourne 2021
iLOF – Step 2 (Deletion)
• Updatingreachdist:Thereachabilitydistancesfrompj’snearestneighbours need to be updated.
• Updatinglrd:Irdvalueneedstobeupdatedfor
– Allpointspj,whichk-distanceisupdated.
– Allpointspi, whichisink-NNofpj andpj isink-NNofpi.
• UpdatingLOFValues:LOFvalueisupdatedfor – Allpointspj,whichIrdvalueisupdated
– All points pi, which is in RNN of pj
COMP90073 Security Analytics © University of Melbourne 2021
Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021
Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021
Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021
Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021
Memory Efficient Incremental Local Outlier (MiLOF) Detection [3]
Objective: Assign an LOF value to a point 𝑝”, under the constraint that the available memory stores only a fraction 𝑚 ≪ 𝑛 of the 𝑛 points that have been observed up to time 𝑇.
• Needtochooseastrategytosummarizethepreviousdatapointssothatthe LOF values of new points can be calculated.
MiLOF Phases:
• Summarization
• Merging
• RevisedInsertion
COMP90073 Security Analytics © University of Melbourne 2021
Three Phases of MiLOF – Framework
Compute outlier score of new point using the latest data points and latest merged clustering model
COMP90073 Security Analytics © University of Melbourne 2021
Three Phases of MiLOF
Phase 1 – Summarization:
Build a summary over the past data points along with their corresponding values (k-dist, lrd and LOF), and deleting them from memory.
• Everybucketdatapointsaresummarizedandclustercentresaregenerated using k-means clustering
Notations:
• 𝐶:pointsarrivingattime𝑇
• Partition𝐶into𝑚clusters𝐶={𝐶,∪𝐶+∪⋯∪𝐶-},withclustercentres𝑉= {𝑣,,𝑣+,…,𝑣-}
COMP90073 Security Analytics © University of Melbourne 2021
MiLOF Measures
• k-dist of a cluster centre 𝑣% ∈ 𝑉
𝑘𝑑𝑖𝑠𝑡(𝑣%) = ∑.∈/” 𝑘𝑑𝑖𝑠𝑡 (𝑝)
|𝐶%|
• lrd of a cluster centre 𝑣% ∈ 𝑉
𝑙𝑟𝑑0(𝑣%) = ∑.∈/” 𝑙𝑟𝑑0(𝑝)
|𝐶%|
• LOF of a cluster centre 𝑣% ∈ 𝑉
𝐿𝑂𝐹0(𝑣%) = ∑.∈/” 𝐿𝑂𝐹0(𝑝) |𝐶%|
Number of points in 𝐶#
COMP90073 Security Analytics © University of Melbourne 2021
Three Phases of MiLOF
Phase 2 – Merging:
Merge the clusters with existing clusters to maintain a single set of cluster centres by the anomaly detection framework after each step.
• Usingaweightedclusteringalgorithm(weightedk-means)andclusterthe cluster centres
• Clustercentre’sweightisequaltothenumberofdatapointsinthatcluster Phase 3 – Revised Insertion:
• ComputeLOFvalueofthenewincomingdata point p, w.r.t. both the recent data points and cluster centres.
– If a cluster centre is the ith NN of p, we stop searching for the rest of the nearest neighbours.
• Updatethe𝑘𝑑𝑖𝑠𝑡,reachdist,lrdandLOFvaluesfor the existing data points
COMP90073 Security Analytics © University of Melbourne 2021
Summary
• Whataredifferentwindowingtechniquesfordatastreams?
• Howtoapplytreebasedanomalydetectionmethodstodatastreams?
• HowtoextendLOFforincrementallearningwhilemaintainingitsperformance?
Next: Anomaly Detection Using Support Vector Machine
COMP90073 Security Analytics © University of Melbourne 2021
References
1. Tan, Ting, Liu, “Fast Anomaly Detection for Streaming Data”, International Joint Conference on Artificial Intelligence (IJCAI), 2011
– https://github.com/yli96/HSTree
2. , , Latecki, “Incremental Local Outlier Detection for Data Streams”, IEEE Symposium on Computational Intelligence and Data Mining, 2007
3. , , . Bezdek, , , “Fast Memory Efficient Local Outlier Detection in Data Streams”, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2016
COMP90073 Security Analytics © University of Melbourne 2021