CS代考 COMP90073 Security Analytics

Contrast Data Mining: Methods and Applications
COMP90073 Security Analytics
, CIS Semester 2, 2021

• Introduction to Contrast Data Mining
• FP-Growth
• Applications of contrast mining in network traffic analysis and anomaly detection
COMP90073 Security Analytics © University of Melbourne 2021

Contrast Data Mining – What is it? [1]
Contrast – “To compare or appraise in respect to differences” ( Dictionary)
Contrast data mining – The mining of patterns and models contrasting two or more datasets/conditions.
“Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more”
, Get Fuzzy, 2001
COMP90073 Security Analytics © University of Melbourne 2021

What can be Contrasted?
• Objects at different time periods
– “Compare traffic patterns from yesterday with today’s”
• Objects for different spatial locations
– “Find the distinguishing features of location x for human DNA, versus
location x for mouse DNA” • Object positions in a ranking
– “Find the differences between high- and low-income earners” • Objects across different classes
– “Find the differences between people with brown hair, versus those with blonde hair”
• Objects within a class
– “Within the academic profession, there are no rich people”
– “Within computer science, most scientific articles come from USA or Europe”
COMP90073 Security Analytics © University of Melbourne 2021

Characteristics of Contrast Data Mining
• Applied to multivariate data
• Objects may be relational, sequential, graphs, models, classifiers,
combinations of these
• Representation of contrasts is important. Needs to be
– Interpretable, non redundant, potentially actionable
– Tractable to compute
• Quality of contrasts is also important. Need
– Statistical significance, which can be measured in multiple ways – Ability to rank contrasts is desirable, especially for classification
COMP90073 Security Analytics © University of Melbourne 2021

How is Contrast Data Mining Used ? [1]
• Reporting significant changes/differences
– “Young children with diabetes have a greater risk of hospital
admission, compared to the rest of the population” • Alerting, notification and monitoring
– “Tell me when the dissimilarity index falls below 0.3” • Building one/multi-class classifiers
– Many different techniques
– Also used for weighting and ranking instances • Constructing synthetic instances
– Good for rare classes
COMP90073 Security Analytics © University of Melbourne 2021

Example – Network Traffic Analysis
• Extracting knowledge from the massive volumes of network traffic is an important task in network and security management
• Network flows that are ranked by anomaly detection systems often contain thousands of records. Analysts often check only the first few pages
• Having a concise and meaningful report of network traffic is more desirable
• An appropriate report can help managers to reduce the time and cost of security analysis and make smart decisions
COMP90073 Security Analytics © University of Melbourne 2021

Example – Network Traffic Analysis
• Summarization: A good summarization is trade-off between two metrics: Compaction gain and information loss
T2 88.34.224.2
T3 12.190.19.23
T4 98.198.66.23
T5 192.168.22.4
T6 192.168.22.4
T7 67.118.25.23
T8 192.168.22.4
T9 98.198.66.23
51989 100.10.20.4
2234 100.10.20.4
27643 100.10.20.4
5002 100.10.20.3
5001 100.10.20.3
44532 100.10.20.3
2765 100.10.20.4
5003 100.10.20.5
80 tcp APRS
80 tcp APRS
80 tcp APRS
21 tcp A-RSF
21 tcp A-RSF
21 tcp A-RS
113 tcp APRS
21 tcp A-RSF
[2,20] [220,500]
[2,20] [220,500]
[2,20] [42,200]
[2,20] [42,200]
[40,68] [220,500]
[40,68] [42,200]
[2,20] [504,1200]
[2,20] [220,500]
Table1: Dataset of network flows
src IP sPort des IP dPort pro flags packets bytes
T1 12.190.84.122 32178 100.10.20.4 80 tcp APRS [2,20] [504,1200]
COMP90073 Security Analytics © University of Melbourne 2021

Example – Network Traffic Analysis
• Summarization: A good summarization is trade-off between two metrics: Compaction gain and information loss
Table1: Dataset of network flows
T2 T4 T6 T8
src IP sPort des IP dPort pro flags packets bytes
12.190.84.122 32178 100.10.20.4 80 tcp APRS [2,20] [504,1200]
88.34.224.2 51989 100.10.20.4 80 tcp APRS [2,20] [220,500]
12.190.19.23 2234 100.10.20.4 80 tcp APRS [2,20] [220,500]
98.198.66.23 27643 100.10.20.4 80 tcp APRS [2,20] [42,200]
192.168.22.4 5002 100.10.20.3 21 tcp A-RSF [2,20] [42,200]
192.168.22.4 5001 100.10.20.3 21 tcp A-RSF [40,68] [220,500]
67.118.25.23 44532 100.10.20.3 21 tcp A-RS [40,68] [42,200]
192.168.22.4 2765 100.10.20.4 113 tcp APRS [2,20] [504,1200]
T9 98.198.66.23 5003 100.10.20.5 21 tcp A-RSF [2,20] [220,500]
src IP sPort des IP dPort pro flags packets bytes
Table 2: Summarization by clustering
*** *** 100.10.20.3 21 tcp *** *** ***
*** *** 100.10.20.4 *** tcp APRS [2,20] ***
COMP90073 Security Analytics © University of Melbourne 2021

Example – Reporting Significant Differences Between Multiple Datasets
12.190.84.122
100.10.20.4
[504,1200]
88.34.224.2
100.10.20.4
12.190.19.23
100.10.20.4
98.198.66.23
100.10.20.4
192.168.22.4
100.10.20.3
192.168.22.4
100.10.20.3
67.118.25.23
100.10.20.3
98.198.66.23
100.10.20.5
12.190.84.122
100.10.20.4
[504,1200]
88.34.224.2
100.10.20.4
12.190.19.23
100.10.20.4
98.198.66.23
100.10.20.10
192.168.22.4
100.10.20.10
192.168.22.4
100.10.20.3
67.118.25.23
100.10.20.3
98.198.99.23
100.10.20.20
[1200,1500]
COMP90073 Security Analytics © University of Melbourne 2021
Differences

Example – Reporting Significant Differences Between Multiple Datasets
98.198.66.23
100.10.20.10
192.168.22.4
100.10.20.10
98.198.99.23
100.10.20.20
[1200,1500]
• We can use contrast pattern mining for finding important changes.
• Contrast pattern mining finds patterns whose support differs significantly from one dataset to another.
COMP90073 Security Analytics © University of Melbourne 2021

Definitions
• Itemset: A collection of one or more items
– k-itemset: An itemset that contains k items
• Count (X, D): The number of transactions in dataset D containing pattern X
• Support (X, D): The percentage of transactions in dataset D containing
𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑋, 𝐷 = 𝐶𝑜𝑢𝑛𝑡 𝑋, 𝐷 |𝐷|
Frequent Itemset: An itemset whose support is greater than or equal to a minsup threshold
𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑋, 𝐷 ≥ 𝑚𝑖𝑛𝑠𝑢𝑝
COMP90073 Security Analytics © University of Melbourne 2021

Apriori Algorithm – Revision
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
– Prune candidate itemsets containing subsets of length k that are infrequent
– Count the support of each candidate by scanning the database
– Eliminate candidates that are infrequent, leaving only those that are
– Generate length (k+1) candidate itemsets from length k frequent itemsets
Apriori Principle:
• If an itemset is infrequent, then all of its superset must also be infrequent
COMP90073 Security Analytics © University of Melbourne 2021

Illustrating Apriori Principle – Revision
If AB is infrequent all its super sets are infrequent and hence patterns ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE are all infrequent.
ABC ABD ABE
Found to be Infrequent
Pruned supersets
COMP90073 Security Analytics © University of Melbourne 2021

Bottleneck of Apriori
• It is costly to handle a huge number of candidate sets
• If there are 104 frequent 1-itemsts, the Apriori algorithm will need to
generate more than 107 2-itemsets and test their frequencies.
• Mining long patterns needs many passes of scanning and generates lots
of candidates.
• It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?
COMP90073 Security Analytics © University of Melbourne 2021

Overview of FP-Growth [2]
• Find frequent single items, and partition the database based on each such item
• Recursively grow frequent patterns by doing the above for each portioned databased
• To facilitate efficient processing, compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
– Highly compacted, but complete for frequent pattern mining – Avoid candidate generation
– Avoid costly repeated database scans
COMP90073 Security Analytics © University of Melbourne 2021

FP-Tree Definition
• FP-tree is a frequent pattern tree, and defined as below:
• One root labeled as “null”, a set of item prefix sub-trees as the children of
the root, and a frequent-item header table.
• Each node in the item prefix sub-trees has three fields:
– Item-name: Registers which item this node represents,
– Count: The number of transactions represented by the portion of the
path reaching this node,
– Node-link: Links to the next node in the FP-tree carrying the same item-name, or null if there is none.
COMP90073 Security Analytics © University of Melbourne 2021

FP-Tree Definition
• Each entry in the frequent-item header table has two fields, – Item-name,
– Item support count, and
– Head of node-link: Points to the first node in the FP-tree carrying the item-name.
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
STEP 1: Scan the transaction database for the first time, find frequent items (single item patterns) and order them into a list in frequency descending order. In the format of (item-name, support).
TID List of item _IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3, l6
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
STEP 1: Scan the transaction database for the first time, find frequent items (single itempatterns)andorderthemintoalistinfrequency descending order. In the format of (item-name, support).
STEP 2: For each transaction, order its frequent items according to the order; Scan database the second time, construct FP-tree by putting each frequency ordered transaction onto it.
TID List of Item _IDs
Ordered Frequent Items
T100 I1, I2, I5 I2, I1, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3, l6 T700 I1, I3
T800 I1, I2, I3, I5
I2, I1, I4
I2, I1, I3, I5
T900 I1, I2, I3 I2, I1, I3
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
• STEP2:ConstructFP-tree
Ordered Frequent Items
I2, I1, I5
I2, I1, I4
I2, I1, I3, I5
I2, I1, I3
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
• STEP2:ConstructFP-tree
Ordered Frequent Items
I2, I1, I5
I2, I1, I4
I2, I1, I3, I5
I2, I1, I3
I1:1 I3:1 I3:1
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
• STEP2:ConstructFP-tree
Ordered Frequent Items
I2, I1, I5
I2, I1, I4
I2, I1, I3, I5
I2, I1, I3
I1:1 I3:2 I3:1
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
• STEP2:ConstructFP-tree
Ordered Frequent Items
I2, I1, I5
I2, I1, I4
I2, I1, I3, I5
I2, I1, I3
I1:2 I3:2 I3:2
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
• STEP2:ConstructFP-tree
Ordered Frequent Items
I2, I1, I5
I2, I1, I4
I2, I1, I3, I5
I2, I1, I3
I2:6 I1:3 I4:1
I1:2 I3:2 I3:2
COMP90073 Security Analytics © University of Melbourne 2021

FP-tree: Construction and Design
• STEP2:ConstructFP-tree
Ordered Frequent Items
I2, I1, I5
I2, I1, I4
I2, I1, I3, I5
I2, I1, I3
I2:7 I1:4 I4:1
I1:2 I3:2 I3:2
COMP90073 Security Analytics © University of Melbourne 2021

Mining Frequent Patterns Using FP-tree
Starting the processing from the end of list L:
Construct conditional pattern base for each item in L
Construct conditional FP-tree from each conditional pattern base
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far.
– If the conditional FP-tree contains a single path, simply enumerate all the patterns
COMP90073 Security Analytics © University of Melbourne 2021

Step 1: Construct Conditional Pattern Base
• Starting at the bottom of frequent-item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item
• Accumulate all of transformed prefix paths of that item to form a conditional pattern base
COMP90073 Security Analytics © University of Melbourne 2021

Step 2: Construct Conditional FP-tree
• For each pattern base
– Accumulate the count for each item in the base
– Construct the conditional FP-tree for the frequent items of the pattern base
• Minsup=2
COMP90073 Security Analytics © University of Melbourne 2021

程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts