CS代写 COMP90073 Security Analytics

An Introduction to Anomaly Detection
COMP90073 Security Analytics
, CIS Semester 2, 2021

• Usingmachinelearningincybersecurity • Basicsofmachinelearning
• Introductiontoanomalydetection
• IsolationForest(iForest)
COMP90073 Security Analytics © University of Melbourne 2021

Why Machine Learning and Security?
COMP90073 Security Analytics © University of Melbourne 2021

Conventional Cybersecurity System
COMP90073 Security Analytics © University of Melbourne 2021

Adaptive Defense System for Cybersecurity
Data capturing tools Data preprocessing Feature extraction Analyzing Engine
Decision Process
COMP90073 Security Analytics © University of Melbourne 2021

Cyber Security Solutions
• Proactive:
Maintain the overall security of a system, even if individual components of the system have been compromised by an attack, i.e., Privacy Preserving Data Mining (PPDM).
• Reactive:
Identify any unauthorized attempt to access, manipulate, modify, or destroy information or to use a computer system remotely to spam, hack, or modify other computers, i.e., Intrusion Detection System (IDS).
COMP90073 Security Analytics © University of Melbourne 2021

Intrusion Detection Systems (IDS)
• Signature(Misuse)Detection:Measuresthesimilaritybetweeninputevents and the signatures of known intrusions
• AnomalyDetection:Triggersalarmswhenthedetectedobjectbehaves significantly differently from the predefined normal patterns
COMP90073 Security Analytics © University of Melbourne 2021

Overview of ML – Revision
Input to a machine learning system can consist of instance/measurements about individual entities/objects, e.g., a network packet.
• Attribute(akaFeature,explanatoryvariable):componentoftheinstances source IP, destination IP, source port, destination port, etc.
• Label(akaResponse,dependentvariable):anoutcomethatiscategorical, numeric, etc. attack vs. legitimate traffic
• Models:discoveredrelationshipbetweenattributesand/orlabel
COMP90073 Security Analytics © University of Melbourne 2021

Supervised and Unsupervised Learning – Revision
• Supervisedlearning
– Teachthecomputerhowtodosomething(byexample),thenletit
– Useitsnew-foundknowledgetodoit
– Labelleddata:forgiveninputs,providetheexpectedoutput(“theanswer”) – Inferafunctionmappingfrominputstooutputs
• Unsupervisedlearning
– Letthecomputerlearnhowtodosomething
– Determinestructureandpatternsindata
– Unlabelleddata:Don’tgivethecomputer“theanswer”
COMP90073 Security Analytics © University of Melbourne 2021

Model Validation – Revision
• Holdout:Trainaclassifieroverafixedtrainingdataset,andevaluateitovera fixed held-out test dataset
• RandomSubsampling:Performholdoutovermultipleiterations,randomly selecting the training and test data (maintaining a fixed size for each dataset) on each iteration
• Leave-One-Out:Chooseeachdatapointastestcaseandtherestastraining data
• M-foldCross-Validation:PartitionthedataintoM(approximately)equalsize partitions, and choose each partition for testing and the remaining M-1 partitions for training
Chose a validation model that is efficient, and minimises bias and variance in evaluation.
COMP90073 Security Analytics © University of Melbourne 2021

Generalisation Problem – Revision
COMP90073 Security Analytics © University of Melbourne 2021

Performance Evaluation – Revision
• ConfusionMatrix
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 𝑇𝑃 + 𝑇𝑁
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁
= 4 + 6 ≅ 67% 4+6+3+2
COMP90073 Security Analytics © University of Melbourne 2021

Limitation of Accuracy – Revision
• Anomalydetection
– Numberofnegativeexamples=9990 – Numberofpositiveexamples=10
• Ifmodelpredictseverythingtobeclass0,accuracyis9990/10000=99.9%
– Accuracyismisleadingbecausemodeldoesnotdetectanypositive examples
COMP90073 Security Analytics © University of Melbourne 2021

Performance Evaluation – Revision
𝑇𝑃 𝑇𝑃 + 𝐹𝑁
Precision:
𝑇𝑃 𝑇𝑃 + 𝐹𝑃
TP FP FN TN
(1 + 𝛽!) 𝑃𝑒𝑟×𝑅𝑒𝑐 𝑅𝑒𝑐 + 𝛽!𝑃𝑒𝑟
COMP90073 Security Analytics © University of Melbourne 2021

Performance Evaluation – Revision
ROC (Receiver Operating Characteristic) Curve
• ROCcurveplotsTPR(onthey-axis)against FPR (on the x-axis)
• Performanceofeachclassifierrepresentedasa point on the ROC curve
• (TPR,FPR):
– (0,0):declareeverythingtobenegative
– (1,1):declareeverythingtobepositive class
– (1,0):ideal
• Diagonalline:
– Randomguessing
– Belowdiagonalline:
• prediction is opposite of the true class
False Positive Rate
COMP90073 Security Analytics © University of Melbourne 2021
True Positive Rate

Anomaly (Outlier) Detection
• Ananomalyisdefinedasapatternindatathatdoesnotconformtothe expected behaviours, including outliers, abbreviations, contaminants, and surprise, etc., in applications.
COMP90073 Security Analytics © University of Melbourne 2021

Types of Anomalies
• Global (Point) Anomalies: A data object is a global outlier if it deviates significantly from the rest of the data set. To detect global anomalies, a critical issue is to find an appropriate measurement of deviation with respect to the application in question.
• Contextual (Conditional) Anomalies: A data object is a contextual anomaly if it deviates significantly with respect to a specific context of the object. In contextual anomaly detection, the context has to be specified as part of the problem definition.
• Collective Anomalies: A subset of data objects forms a collective anomaly if the objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be anomalies.
COMP90073 Security Analytics © University of Melbourne 2021

Anomaly, Noise and Novelty Detection
• Noisevs.Anomaly:
– Noiseisarandomerrororvarianceinaninstance
– Ingeneral,noiseisnotinterestingindataanalysis, including anomaly detection.
– Anomaliesareinterestingbecausetheyare suspected of not being generated by the same mechanisms as the rest of the data.
• Noveltyvs.Anomaly:
– Inevolvingdatasets,novelpatternsmayinitially
appear as anomalies.
– Oncenewpatternsareconfirmed,theyareusually incorporated into the model of normal behaviour so that follow-up instances are not treated as anomalies anymore.
COMP90073 Security Analytics © University of Melbourne 2021

Anomaly Detection Schemes
General Steps
• Buildaprofileofthe“normal”behaviour
– Profilecanbepatternsorsummarystatisticsfortheoverallpopulation
• Usethe“normal”profiletodetectanomalies
– Anomaliesareobservationswhosecharacteristicsdiffersignificantlyfrom
the normal profile
1. Extreme Value Analysis 2. Proximity-Based
3. Model-based
COMP90073 Security Analytics © University of Melbourne 2021

1. Extreme Value Analysis
• Assumeaparametricmodeldescribingthedistributionofthedata(e.g.,normal distribution)
• Applyastatisticaltest(e.g.,𝑧=(𝑥−𝜇)/𝜎)thatdependson – Datadistribution
– Parameterofdistribution(e.g.,mean,variance) – Numberofexpectedanomalies(confidencelimit)
Limitations
• Mostofthetestsareforasingleattribute
• Inmanycases,datadistributionmaynotbeknown
• Forhighdimensionaldata,itmaybedifficulttoestimatethetruedistribution
– Canbeusedasfinalstepsforinterpretingoutputsofotheranomaly detection methods
COMP90073 Security Analytics © University of Melbourne 2021

2. Proximity-Based
• Dataisrepresentedasavectoroffeatures.
• Assumestheproximityofananomalytoitsneighbourhoodsignificantly deviates from the proximity of the object to most of the other objects in the dataset.
• Threemajorapproaches
2.1 Nearest-neighbour based 2.2 Density based
2.3 Clustering based
COMP90073 Security Analytics © University of Melbourne 2021

2.1 Nearest-Neighbour Based
• Computethedistancebetweeneverypairofdatapoints • There are various ways to define anomalies:
– Datapointsforwhichtherearefewerthankneighbouringpointswithina distance D
– The top n data points whose distance to the kth nearest neighbour is greatest
– The top n data points whose average distance to the kth nearest neighbours is greatest
COMP90073 Security Analytics © University of Melbourne 2021

2.2 Density-based
• Estimatesthedensityofobjects (using proximity measures between objects).
• Objectsthatareinregionsoflow density are relative distant from their neighbours, and can be considered anomalous.
• Amoresophisticatedapproach accommodates the fact that data sets can have regions of widely differing densities.
– Classifiesapointasanoutlier only if it has a local density significantly less than that of most of its neighbours.
Distance from o1 to nearest neighbour
Distance from o2 to nearest neighbour
COMP90073 Security Analytics © University of Melbourne 2021

2.3 Clustering-Based
• Clusterthedataintogroupsof different density
• Choosepointsinsmallclusteras candidate anomalies
• Computethedistancebetween candidate points and non-candidate clusters.
– Ifcandidatepointsarefarfrom all other non-candidate points, they are anomalies
COMP90073 Security Analytics © University of Melbourne 2021

3.1 Classification-Based Methods
Idea: Train a classification model that can distinguish “normal” data from anomalies
• Consideratrainingsetthatcontainssampleslabelledas“normal”andothers labelled as “anomaly”
– But,thetrainingsetistypicallyheavilybiased:numberof“normal”samples likely far exceeds number of anomaly samples
• Handletheimbalanceddistribution
– Oversamplingpositivesand/orundersamplingnegatives – Cost-sensitivelearning
COMP90073 Security Analytics © University of Melbourne 2021

3.2 One-Class Model
• One-classmodel:Aclassifierisbuilttodescribeonlythenormalclass
– Learnthedecisionboundaryofthenormalclassusingclassification
methods such as one-class SVM
– Anysamplesthatdonotbelongtothenormalclass(notwithinthedecision boundary) are declared as anomalies
– Advantage:candetectnewanomaliesthatmaynotappearclosetoany anomalous objects in the training set
COMP90073 Security Analytics © University of Melbourne 2021

Output of Anomaly Detection
• ScoringTechniques:
– Assignananomalyscoretoeachinstanceinthetestdatadependingon
the degree to which that instance is considered an anomaly.
– Theoutputisarankedlistofanomalies.
– Ananalystmaychoosetoeitheranalysetopfewanomaliesoruseacut- off threshold (or domain specific threshold) to select the anomalies.
• LabellingTechniques:
– Assignalabel(normaloranomalous)toeachtestinstance.
– Limit the analysts to the binary label, (though this can be controlled indirectly through parameter choices within each technique).
COMP90073 Security Analytics © University of Melbourne 2021

Challenges of Machine Learning in Security
• Modellingdatawithskewedclassdistributions(classimbalance)
• Sheervolumeandheterogeneousnetworkdata
• Difficulttoassesstheperformanceofthesystem,giventhevastpossibilitiesof anomalies and lack of label
• CostoferrorinIDSishuge
• Largefalsealarmratedegradesconfidenceinthesystem
• Lackofinterpretability
• Anomaliesmaybeundetectableatonelevelofgranularityorabstractionbut easy to detect at another level
• Evolvingpatterns(conceptdrift)
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Forest (iForest) [3]
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Tree (iTree)
• Objective:Isolatesanomaliesratherthanprofilesnormalinstances
• Isolation:Separatinganinstancefromtherestoftheinstances
To achieve this, we takes advantage of two anomalies’ quantitative properties:
i. They are the minority consisting of fewer instances, and
ii. They have attribute-values that are very different from those of normal instances
• IsolationTree(iTree)Intuition:Becauseoftheirsusceptibilitytoisolation, anomalies are isolated closer to the root of the tree; whereas normal points are isolated at the deeper end of the tree.
COMP90073 Security Analytics © University of Melbourne 2021

iTree Intuition
• Anomalies are more susceptible to isolation under random partitioning
(a) xi requires only 12 partitions (b) x0 requires only 4 partitions Figure. Identifying normal vs. abnormal observations
COMP90073 Security Analytics © University of Melbourne 2021

iTree Intuition
• Anomalies are more susceptible to isolation and hence have short path lengths
No. of tree (log scale)
COMP90073 Security Analytics © University of Melbourne 2021
Average path length

Isolation Forest
• Foreachtree:
• Get a sample of the data
– Randomlyselectadimension
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Forest
• Foreachtree:
• Get a sample of the data
– Randomlyselectadimension
– Randomlypickavalueinthat dimension
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Forest
• Foreachtree:
• Get a sample of the data
– Randomlyselectadimension – Randomlypickavalueinthat
– Drawastraightlinethroughthe data at that value and split data
– Repeatuntiltreeiscomplete
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Forest
• Foreachtree:
• Get a sample of the data
– Randomlyselectadimension – Randomlypickavalueinthat
– Drawastraightlinethroughthe data at that value and split data
– Repeatuntiltreeiscomplete • Generatemultipletrees→forest
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Forest
• Foreachtree:
• Get a sample of the data
– Randomlyselectadimension – Randomlypickavalueinthat
– Drawastraightlinethroughthe data at that value and split data
– Repeatuntiltreeiscomplete
• Generatemultipletrees→forest
• Anomalieswillbeisolatedinonlya few steps
COMP90073 Security Analytics © University of Melbourne 2021

Isolation Forest
• Foreachtree:
• Get a sample of the data
– Randomlyselectadimension – Randomlypickavalueinthat
– Drawastraightlinethroughthe data at that value and split data
– Repeatuntiltreeiscomplete
• Generatemultipletrees→forest
• Anomalieswillbeisolatedinonlya few steps
• Nominalpointsinmore
COMP90073 Security Analytics © University of Melbourne 2021

Anomaly Detection with iForest
• IsolationForestscore:
!”($(%)) 𝑠𝑥,𝑛=2 ‘(()
– h(𝑥) is the path length of observation 𝑥 from the root node,
– E(h 𝑥 ) is the average of h(𝑥) from a collection of isolation trees – nisthenumberofdatapoints
– 𝑐 𝑛 =2𝐻 𝑛−1 −(! “#$ ),whereEuler’sconstant !
– 𝑠 → 1, then samples are definitely anomalies,
– 𝑠≪0.5,thensamplesarequitesafetoberegardedasnormal,
– 𝑠=0.5,thentheentiresampledoesnotreallyhaveanydistinctanomaly.
COMP90073 Security Analytics © University of Melbourne 2021

Advantages of iForest
• Requires two parameters, the number of trees to build and the sub-sampling size
• Convergesquicklywithaverysmallnumberoftrees,anditonlyrequiresa small sub-sampling size to achieve high detection performance with high efficiency
• TheisolationcharacteristicofiTreesenablesthemtobuildpartialmodelsand exploit sub-sampling to an extent that is not feasible in existing methods.
• Utilizesnodistanceordensitymeasurestodetectanomalies.
• Hasalineartimecomplexitywithalowconstantandalowmemory
requirement.
• Scalesuptohandleextremelylargeandhigh-dimensionaldatasets
COMP90073 Security Analytics © University of Melbourne 2021

• Whatisanomalydetectionandwhataredifferenttypesofanomalies?
• Howwecanevaluationtheperformanceofanomalydetectiontechniques?
• Howanomalydetectionisdifferentfromothermachinelearningproblems?
• HowdoestheiForestalgorithmoperates,andwhatareitsadvantagesofthis method?
Next: Clustering and Density-based Anomaly Detection
COMP90073 Security Analytics © University of Melbourne 2021

References
1. Data Mining and Machine Learning in Security, Chapters 1,3.
2. Machine Learning and Security, Chapter 1.
3. Liu, Ting, Zhi- , “Isolation Forest”, IEEE International Conference on Data Mining, 2008.
COMP90073 Security Analytics © University of Melbourne 2021

程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts