
An Introduction to Anomaly Detection
COMP90073 Security Analytics
CIS, Semester 2, 2021

Outline
• Using machine learning in cybersecurity
• Basics of machine learning
• Introduction to anomaly detection
• Isolation Forest (iForest)
COMP90073 Security Analytics © University of Melbourne 2021

Why Machine Learning and Security?

Conventional Cybersecurity System

Adaptive Defense System for Cybersecurity
Data capturing tools → Data preprocessing → Feature extraction → Analysing engine → Decision process

Cyber Security Solutions
• Proactive:
Maintain the overall security of a system, even if individual components of the system have been compromised by an attack, e.g., Privacy Preserving Data Mining (PPDM).
• Reactive:
Identify any unauthorised attempt to access, manipulate, modify, or destroy information, or to use a computer system remotely to spam, hack, or modify other computers, e.g., an Intrusion Detection System (IDS).

Intrusion Detection Systems (IDS)
• Signature (Misuse) Detection: Measures the similarity between input events and the signatures of known intrusions
• Anomaly Detection: Triggers alarms when the detected object behaves significantly differently from the predefined normal patterns

Overview of ML – Revision
Input to a machine learning system consists of instances/measurements of individual entities/objects, e.g., a network packet.
• Attribute (aka feature, explanatory variable): a component of the instance, e.g., source IP, destination IP, source port, destination port, etc.
• Label (aka response, dependent variable): an outcome, which may be categorical, numeric, etc., e.g., attack vs. legitimate traffic
• Model: a discovered relationship between attributes and/or the label

Supervised and Unsupervised Learning – Revision
• Supervised learning
– Teach the computer how to do something (by example), then let it use its new-found knowledge to do it
– Labelled data: for given inputs, provide the expected output ("the answer")
– Infer a function mapping from inputs to outputs
• Unsupervised learning
– Let the computer learn how to do something
– Determine structure and patterns in data
– Unlabelled data: don't give the computer "the answer"

Model Validation – Revision
• Holdout: Train a classifier over a fixed training dataset, and evaluate it over a fixed held-out test dataset
• Random Subsampling: Perform holdout over multiple iterations, randomly selecting the training and test data (maintaining a fixed size for each dataset) on each iteration
• Leave-One-Out: Choose each data point in turn as the test case, with the rest as training data
• M-fold Cross-Validation: Partition the data into M (approximately) equal-size partitions; choose each partition in turn for testing and the remaining M−1 partitions for training
Choose a validation method that is efficient, and minimises bias and variance in evaluation.
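The M-fold procedure can be sketched in a few lines (a minimal illustration; the fold count and the `fit`/`score` callables are placeholders, not part of the lecture):

```python
import numpy as np

def m_fold_indices(n, m, seed=0):
    """Shuffle indices 0..n-1 and partition them into m (approximately) equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), m)

def cross_validate(X, y, m, fit, score):
    """Train on M-1 folds, evaluate on the held-out fold, and average the scores."""
    folds = m_fold_indices(len(X), m)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        results.append(score(model, X[test_idx], y[test_idx]))
    return float(np.mean(results))
```

Every point is tested exactly once, which is what keeps the bias/variance trade-off of M-fold better behaved than a single holdout split.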

Generalisation Problem – Revision

Performance Evaluation – Revision
• Confusion Matrix

                 Actual Cat   Actual Dog
Predicted Cat    4 (TP)       3 (FP)
Predicted Dog    2 (FN)       6 (TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (4 + 6) / (4 + 6 + 3 + 2) ≅ 67%
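The accuracy above can be checked directly:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts from the cat-vs-dog confusion matrix above
acc = accuracy(tp=4, tn=6, fp=3, fn=2)
print(f"{acc:.0%}")  # → 67%
```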

Limitation of Accuracy – Revision
• Anomaly detection example:
– Number of negative examples = 9990
– Number of positive examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any positive examples

Performance Evaluation – Revision

• Recall: TP / (TP + FN)
• Precision: TP / (TP + FP)
• F-Score: (1 + β²) · Precision · Recall / (β² · Precision + Recall)
[Figure: confusion matrix with Actual classes as columns and Predicted classes as rows, showing TP, FP, FN, TN]
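Applied to the cat-vs-dog counts from the accuracy slide (TP = 4, FP = 3, FN = 2), the metrics work out as follows:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)

def f_score(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(precision(4, 3), 3), round(recall(4, 2), 3), round(f_score(4, 3, 2), 3))
# → 0.571 0.667 0.615
```

Setting β > 1 weights recall more heavily, which is often the right bias in intrusion detection, where a missed attack is costlier than a false alarm.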

Performance Evaluation – Revision
ROC (Receiver Operating Characteristic) Curve
• The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
• The performance of each classifier is represented as a point on the ROC curve
• (TPR, FPR):
– (0, 0): declare everything to be the negative class
– (1, 1): declare everything to be the positive class
– (1, 0): ideal
• Diagonal line: random guessing
– Below the diagonal line: the prediction is the opposite of the true class
[Figure: ROC curve with False Positive Rate on the x-axis and True Positive Rate on the y-axis]
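Each point on the curve comes from thresholding a classifier's score; a small sketch (the data and thresholds are illustrative):

```python
import numpy as np

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    y = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    tp = int(np.sum(pred & (y == 1)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    tn = int(np.sum(~pred & (y == 0)))
    return fp / (fp + tn), tp / (tp + fn)

# Sweeping the threshold traces the curve: a very low threshold declares
# everything positive, a very high one declares everything negative.
y = [1, 1, 0, 0]
s = [0.9, 0.8, 0.4, 0.2]
```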

Anomaly (Outlier) Detection
• An anomaly is defined as a pattern in data that does not conform to the expected behaviour; in applications such patterns are also called outliers, aberrations, contaminants, surprises, etc.

Types of Anomalies
• Global (Point) Anomalies: A data object is a global outlier if it deviates significantly from the rest of the data set. To detect global anomalies, a critical issue is to find an appropriate measurement of deviation with respect to the application in question.
• Contextual (Conditional) Anomalies: A data object is a contextual anomaly if it deviates significantly with respect to a specific context of the object. In contextual anomaly detection, the context has to be specified as part of the problem definition.
• Collective Anomalies: A subset of data objects forms a collective anomaly if the objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be anomalies.

Anomaly, Noise and Novelty Detection
• Noise vs. Anomaly:
– Noise is a random error or variance in an instance variable.
– In general, noise is not interesting in data analysis, including anomaly detection.
– Anomalies are interesting because they are suspected of not being generated by the same mechanism as the rest of the data.
• Novelty vs. Anomaly:
– In evolving data sets, novel patterns may initially appear as anomalies.
– Once new patterns are confirmed, they are usually incorporated into the model of normal behaviour, so that follow-up instances are no longer treated as anomalies.
[Figure: scatter plot distinguishing an anomaly, noise, and an emerging novelty]

Anomaly Detection Schemes
General Steps
• Build a profile of the "normal" behaviour
– The profile can consist of patterns or summary statistics for the overall population
• Use the "normal" profile to detect anomalies
– Anomalies are observations whose characteristics differ significantly from the normal profile
Methods
1. Extreme Value Analysis
2. Proximity-Based
3. Model-Based

1. Extreme Value Analysis
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test (e.g., z = (x − μ)/σ) that depends on
– The data distribution
– The parameters of the distribution (e.g., mean, variance)
– The number of expected anomalies (confidence limit)
Limitations
• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For high-dimensional data, it may be difficult to estimate the true distribution
• Can be used as a final step for interpreting the outputs of other anomaly detection methods
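The z-test above in code (the 3σ cut-off is a common convention, not something the lecture prescribes):

```python
import numpy as np

def z_score_outliers(x, z_max=3.0):
    """Flag values whose standardised score |x - mean| / std exceeds z_max.
    Assumes a single attribute whose values are roughly normally distributed."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > z_max
```

Note the single-attribute limitation from the slide: this test says nothing about multivariate structure.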

2. Proximity-Based
• Data is represented as a vector of features.
• Assumes that an anomaly's proximity to its neighbourhood deviates significantly from the proximity of most other objects to theirs.
• Three major approaches
2.1 Nearest-neighbour based
2.2 Density based
2.3 Clustering based

2.1 Nearest-Neighbour Based
• Compute the distance between every pair of data points
• There are various ways to define anomalies:
– Data points for which there are fewer than k neighbouring points within a distance D
– The top n data points whose distance to their kth nearest neighbour is greatest
– The top n data points whose average distance to their k nearest neighbours is greatest
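The second definition (top-n by distance to the kth nearest neighbour) in a brute-force sketch:

```python
import numpy as np

def knn_distance_scores(X, k):
    """Anomaly score = distance to the k-th nearest neighbour (larger = more anomalous)."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # all pairwise distances
    return np.sort(d, axis=1)[:, k]  # column 0 is the zero self-distance

def top_n_anomalies(X, k, n):
    """Indices of the n points with the greatest k-NN distance."""
    return np.argsort(knn_distance_scores(X, k))[::-1][:n]
```

The full pairwise matrix is O(N²) in time and memory; spatial indexes (k-d trees, ball trees) are the usual fix at scale.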

2.2 Density-based
• Estimates the density of objects (using proximity measures between objects).
• Objects in regions of low density are relatively distant from their neighbours, and can be considered anomalous.
• A more sophisticated approach accommodates the fact that data sets can have regions of widely differing densities:
– It classifies a point as an outlier only if its local density is significantly lower than that of most of its neighbours.
[Figure: points o1 and o2 with the distance from each to its nearest neighbour, illustrating regions of differing density]
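The local-density idea above (the basis of methods such as LOF) can be sketched with a simple density estimate; the neighbourhood size k and the ratio form are our illustrative choices:

```python
import numpy as np

def density(X, k):
    """Local density: inverse of the mean distance to the k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 1.0 / np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self at column 0

def relative_density(X, k):
    """Ratio of a point's density to the mean density of its k nearest neighbours;
    values well below 1 mark a point as locally sparse, i.e. a candidate outlier."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    dens = density(X, k)
    return dens / dens[nn].mean(axis=1)
```

Because the score is relative to the neighbourhood, a point on the edge of a sparse-but-legitimate cluster is not penalised the way a raw distance score would penalise it.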

2.3 Clustering-Based
• Cluster the data into groups of different density
• Choose points in small clusters as candidate anomalies
• Compute the distance between candidate points and non-candidate clusters:
– If candidate points are far from all other non-candidate points, they are anomalies
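One way to realise these steps with k-means (scikit-learn is our library choice, and the cluster count and thresholds are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_anomalies(X, n_clusters=3, small_frac=0.1, far_dist=3.0):
    """Flag members of small clusters that lie far from every large-cluster centroid."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    big_centroids = km.cluster_centers_[sizes >= small_frac * len(X)]
    flags = np.zeros(len(X), dtype=bool)
    for cid in np.where(sizes < small_frac * len(X))[0]:
        members = km.labels_ == cid
        # distance from each small-cluster member to its nearest large centroid
        d = np.linalg.norm(X[members][:, None] - big_centroids[None], axis=-1).min(axis=1)
        flags[members] = d > far_dist
    return flags
```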

3.1 Classification-Based Methods
Idea: Train a classification model that can distinguish “normal” data from anomalies
• Consider a training set that contains samples labelled as "normal" and others labelled as "anomaly"
– But the training set is typically heavily imbalanced: the number of "normal" samples likely far exceeds the number of anomaly samples
• Handle the imbalanced distribution by:
– Oversampling positives and/or undersampling negatives
– Cost-sensitive learning
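A minimal random-oversampling sketch (cost-sensitive learning would instead reweight errors in the loss function; the helper name is ours):

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    need = counts.max() - counts.min()
    rng = np.random.default_rng(seed)
    extra = rng.choice(np.where(y == minority)[0], size=need, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

Oversampling must happen after the train/test split, otherwise duplicated anomalies leak into the test folds and inflate the evaluation.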

3.2 One-Class Model
• One-class model: a classifier is built to describe only the normal class
– Learn the decision boundary of the normal class using classification methods such as one-class SVM
– Any samples that do not belong to the normal class (i.e., fall outside the decision boundary) are declared anomalies
– Advantage: can detect new anomalies that may not appear close to any anomalous objects in the training set
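A sketch using scikit-learn's `OneClassSVM` (the library and the ν value are our choices; as the slide says, training uses normal data only):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))  # training set: "normal" class only

ocsvm = OneClassSVM(kernel="rbf", nu=0.05)       # nu bounds the training outlier fraction
ocsvm.fit(X_normal)

X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
labels = ocsvm.predict(X_test)                   # +1 = inside the boundary, -1 = anomaly
```

The point far outside the learned boundary is flagged even though no anomaly resembling it was ever seen, which is exactly the advantage noted above.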

Output of Anomaly Detection
• Scoring Techniques:
– Assign an anomaly score to each instance in the test data, reflecting the degree to which that instance is considered an anomaly.
– The output is a ranked list of anomalies.
– An analyst may choose either to analyse the top few anomalies or to use a cut-off threshold (or a domain-specific threshold) to select the anomalies.
• Labelling Techniques:
– Assign a label (normal or anomalous) to each test instance.
– Limit the analyst to the binary label (though this can be controlled indirectly through parameter choices within each technique).
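Converting scores into labels with a cut-off is a one-liner; the contamination fraction is an analyst's choice, not something the technique dictates:

```python
import numpy as np

def label_by_threshold(scores, contamination=0.05):
    """Label the top `contamination` fraction of anomaly scores as anomalous."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.quantile(scores, 1.0 - contamination)
    return scores >= cutoff
```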

Challenges of Machine Learning in Security
• Modelling data with skewed class distributions (class imbalance)
• Sheer volume of heterogeneous network data
• Difficult to assess the performance of the system, given the vast space of possible anomalies and the lack of labels
• The cost of errors in an IDS is huge
• A large false-alarm rate degrades confidence in the system
• Lack of interpretability
• Anomalies may be undetectable at one level of granularity or abstraction but easy to detect at another
• Evolving patterns (concept drift)

Isolation Forest (iForest) [3]

Isolation Tree (iTree)
• Objective: isolate anomalies rather than profile normal instances
• Isolation: separating an instance from the rest of the instances
To achieve this, we take advantage of two quantitative properties of anomalies:
i. They are the minority, consisting of fewer instances, and
ii. They have attribute values that are very different from those of normal instances
• Isolation Tree (iTree) intuition: because of their susceptibility to isolation, anomalies are isolated closer to the root of the tree, whereas normal points are isolated at the deeper end of the tree.

iTree Intuition
• Anomalies are more susceptible to isolation under random partitioning
Figure. Identifying normal vs. abnormal observations: (a) a normal point xi requires twelve random partitions to be isolated; (b) an anomaly xo requires only four.

iTree Intuition
• Anomalies are more susceptible to isolation and hence have short path lengths
[Figure: average path length vs. number of trees (log scale); the anomaly's average path length converges to a much smaller value than the normal point's]

Isolation Forest
• For each tree:
– Take a (sub-)sample of the data
– Randomly select a dimension
– Randomly pick a value in that dimension
– Draw a straight line through the data at that value and split the data
– Repeat until the tree is complete (every point isolated)
• Generate multiple trees → a forest
• Anomalies are isolated in only a few steps; normal points require many more
[Figure: a 2-D sample (axes Y1, Y2) repeatedly split by random axis-parallel cuts in iTree 1, iTree 2, iTree 3; an anomaly is separated after a few cuts, while a normal point is isolated only deep in the tree]
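The whole procedure fits in a short, self-contained sketch (full-sample trees and a fixed height limit are our simplifications; the score uses the formula on the next slide). scikit-learn's `sklearn.ensemble.IsolationForest` provides a production implementation with the same two parameters:

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful BST search on n points (normalising term)."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_itree(X, depth=0, height_limit=8):
    """Recursively split on a random dimension at a random value; leaves store their size."""
    if len(X) <= 1 or depth >= height_limit:
        return len(X)
    dim = random.randrange(len(X[0]))
    lo = min(p[dim] for p in X)
    hi = max(p[dim] for p in X)
    if lo == hi:
        return len(X)
    split = random.uniform(lo, hi)
    left = [p for p in X if p[dim] < split]
    right = [p for p in X if p[dim] >= split]
    return (dim, split,
            build_itree(left, depth + 1, height_limit),
            build_itree(right, depth + 1, height_limit))

def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, adjusted by c(size) for unresolved leaves."""
    if not isinstance(node, tuple):
        return depth + c(node)
    dim, split, left, right = node
    return path_length(x, left if x[dim] < split else right, depth + 1)

def iforest_score(x, trees, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)): near 1 = anomaly, well below 0.5 = normal."""
    e_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-e_h / c(n))

random.seed(0)
data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)] + [[10.0, 10.0]]
forest = [build_itree(data) for _ in range(100)]
s_out = iforest_score([10.0, 10.0], forest, len(data))
s_in = iforest_score([0.0, 0.0], forest, len(data))
```

With more trees the average path length converges quickly, mirroring the convergence plot on the earlier slide.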

Anomaly Detection with iForest
• Isolation Forest score:
s(x, n) = 2^(−E(h(x)) / c(n))
• Where:
– h(x) is the path length of observation x from the root node,
– E(h(x)) is the average of h(x) over a collection of isolation trees,
– n is the number of data points,
– c(n) = 2H(n − 1) − 2(n − 1)/n, where H(i) = ln(i) + γ and γ ≈ 0.5772 is Euler's constant.
• 0 < s ≤ 1:
– s → 1: samples are definitely anomalies,
– s ≪ 0.5: samples are quite safe to be regarded as normal,
– s ≈ 0.5 for all samples: the entire sample does not really have any distinct anomaly.

iForest Score – Case Study
[Figure: example data set coloured by Isolation Forest score]

Advantages of iForest
• Requires only two parameters: the number of trees to build and the sub-sampling size
• Converges quickly with a very small number of trees, and only requires a small sub-sampling size to achieve high detection performance with high efficiency
• The isolation characteristic of iTrees enables them to build partial models and exploit sub-sampling to an extent that is not feasible in existing methods
• Uses no distance or density measures to detect anomalies
• Has linear time complexity with a low constant and a low memory requirement
• Scales up to handle extremely large, high-dimensional data sets

Summary
• What is anomaly detection, and what are the different types of anomalies?
• How can we evaluate the performance of anomaly detection techniques?
• How is anomaly detection different from other machine learning problems?
• How does the iForest algorithm operate, and what are its advantages?
Next: Clustering and Density-based Anomaly Detection

References
1. Data Mining and Machine Learning in Security, Chapters 1, 3.
2. Machine Learning and Security, Chapter 1.
3. F. T. Liu, K. M. Ting and Z.-H. Zhou, "Isolation Forest", IEEE International Conference on Data Mining, 2008.
• 0<𝑠≤1 – 𝑠 → 1, then samples are definitely anomalies, – 𝑠≪0.5,thensamplesarequitesafetoberegardedasnormal, – 𝑠=0.5,thentheentiresampledoesnotreallyhaveanydistinctanomaly. COMP90073 Security Analytics © University of Melbourne 2021 iForest Score – Case Study Isolation Forest Score COMP90073 Security Analytics © University of Melbourne 2021 Advantages of iForest • Requires two parameters, the number of trees to build and the sub-sampling size • Convergesquicklywithaverysmallnumberoftrees,anditonlyrequiresa small sub-sampling size to achieve high detection performance with high efficiency • TheisolationcharacteristicofiTreesenablesthemtobuildpartialmodelsand exploit sub-sampling to an extent that is not feasible in existing methods. • Utilizesnodistanceordensitymeasurestodetectanomalies. • Hasalineartimecomplexitywithalowconstantandalowmemory requirement. • Scalesuptohandleextremelylargeandhigh-dimensionaldatasets COMP90073 Security Analytics © University of Melbourne 2021 Summary • Whatisanomalydetectionandwhataredifferenttypesofanomalies? • Howwecanevaluationtheperformanceofanomalydetectiontechniques? • Howanomalydetectionisdifferentfromothermachinelearningproblems? • HowdoestheiForestalgorithmoperates,andwhatareitsadvantagesofthis method? Next: Clustering and Density-based Anomaly Detection COMP90073 Security Analytics © University of Melbourne 2021 References 1. Data Mining and Machine Learning in Security, Chapters 1,3. 2. Machine Learning and Security, Chapter 1. 3. Liu, Ting, Zhi- , “Isolation Forest”, IEEE International Conference on Data Mining, 2008. COMP90073 Security Analytics © University of Melbourne 2021