CSC 480: Introduction to Data Mining

Fall 2018

Assignment 2: Data Mining for Cybersecurity

In this assignment, rather than applying machine learning algorithms to nicely
curated data sets such as those found in the UCI Repository, you will be dealing
with real-world data. In particular, you will be using network traffic data
generated by mobile apps on monitored smartphones, collected by Dr. Zhen Liu
and her research group. Dr. Liu and her group collected both active flows, where
users chose to share their data while using an app, and passive flows, where apps
were launched on the phone and traffic was collected while the phone was not in use.

The goal of your assignment is to solve two different problems:

1) Classify the active flow data according to the mobile app that generated
it (e.g., QQ, WeChat, Facebook).

2) Distinguish active from passive flows. For the passive flow detection
experiment, randomly sample a fixed number of examples (such as 20,000)
from the active data and combine them with the passive data.
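The sampling step in problem 2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not a prescribed method. The function name build_detection_set, the "active"/"passive" label strings, and the assumption that each flow is already loaded as one list of feature values are all hypothetical.

```python
import random


def build_detection_set(active_rows, passive_rows, n_active=20000, seed=0):
    """Sample active flows, add all passive flows, and label each row.

    Returns a shuffled list of (features, label) pairs, where label is
    "active" or "passive" (hypothetical label names).
    """
    rng = random.Random(seed)             # fixed seed keeps the sample reproducible
    k = min(n_active, len(active_rows))   # guard against a small active set
    sampled = rng.sample(active_rows, k)  # sampling without replacement
    combined = [(row, "active") for row in sampled]
    combined += [(row, "passive") for row in passive_rows]
    rng.shuffle(combined)                 # avoid an ordered block of one class
    return combined
```

Fixing the seed means the same 20,000 active flows are used every time, so results from different classifiers stay comparable.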

Mobile traffic classification is the foundation for QoS (Quality of Service)
provisioning, bandwidth allocation, and traffic shaping. For example, highly
interactive applications (such as audio or video chat) require high QoS to avoid
packet loss and provide a good user experience, whereas applications that do not
need interactive communication (such as browsers) can be assigned low QoS for
the traffic they generate.

Mobile applications generate background traffic when the end-user is not actively
using the app. If this background traffic could be accurately identified, network
operators could de-prioritise this traffic and free up network bandwidth for
priority network traffic.

The data was collected in the form of network flows and is available at
https://wangruoyu.github.io/mobilegt/. It is described below.


Data description:

Active data files:

(1) biFeatureData: data characterized by the bi-flow feature set

(2) uniFeatureData: data characterized by the uni-flow feature set

The two files describe the flows with different feature sets. Try your algorithms
on both data sets and find out which feature set works better.

Passive data files:

(1) Bipassivedata: data characterized by the bi-flow feature set

(2) unipassivedata: data characterized by the uni-flow feature set
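Before running any classifier on these files, it may help to summarise each one: how many flows it holds, how many features, and how the classes are distributed. The sketch below uses only the standard library and assumes a hypothetical layout of comma-separated rows with the class label in the last column. The actual file format should be checked against the data site. The names summarise_flows and imbalance_ratio are illustrative.

```python
import csv
from collections import Counter


def summarise_flows(lines, label_column=-1):
    """Return (n_rows, n_features, class_counts) for an iterable of CSV lines.

    Assumes one flow per row with the class label in `label_column`
    (hypothetical layout; adjust to the real files).
    """
    counts = Counter()
    n_rows = 0
    n_features = 0
    for row in csv.reader(lines):
        if not row:                 # skip blank lines
            continue
        n_rows += 1
        n_features = len(row) - 1   # every column except the label
        counts[row[label_column]] += 1
    return n_rows, n_features, counts


def imbalance_ratio(counts):
    """Ratio of the largest class size to the smallest class size."""
    return max(counts.values()) / min(counts.values())
```

An open file object can be passed directly as lines. A large imbalance ratio suggests that class-imbalance techniques and a metric other than plain accuracy are worth considering.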

In this assignment, your goal is to run various classifiers, such as Decision
Trees, Neural Networks, Naïve Bayes, SVMs, k-NN, Bagging, Boosting, and Random
Forests, and to combine them with feature-selection methods, class-imbalance
approaches, outlier-detection methods, and any other data filtering techniques
you see fit, in order to obtain good results on this data. Prior to running your
experiments, take time to study your data set and see what kinds of filters would
be useful to apply. Do not simply try as many as you can. Instead, attempt to
understand the data and reason about which approaches may work best given its
characteristics. If you have any questions about the data, please contact our
expert-in-residence, Zhen Liu, at jeannylz@yahoo.com. She will be able to help you.
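One possible way to combine a feature-selection method, a class-imbalance approach, and a classifier is sketched below with scikit-learn. It is only an illustration under stated assumptions, not the expected solution: the choice of ANOVA F-score selection, class weighting, a Random Forest, and macro-averaged F1 is arbitrary, and the function name evaluate_pipeline and its parameters are hypothetical. X is assumed to be a numeric feature matrix and y the class labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_pipeline(X, y, k_features=10, n_splits=5, seed=0):
    """Return the mean macro-F1 of a select-then-classify pipeline under CV."""
    X = np.asarray(X)
    k = min(k_features, X.shape[1])  # cannot select more features than exist
    pipeline = make_pipeline(
        StandardScaler(),                      # put features on a common scale
        SelectKBest(score_func=f_classif, k=k),  # keep the k highest-scoring features
        RandomForestClassifier(
            n_estimators=100,
            class_weight="balanced",           # reweight rare classes
            random_state=seed,
        ),
    )
    # Stratified folds keep the class proportions similar in every split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1_macro")
    return scores.mean()
```

Because scaling and feature selection sit inside the pipeline, they are refitted on the training folds only, which avoids leaking information from the held-out fold. Calling the same function on the bi-flow and uni-flow versions of the data gives one simple way to compare the two feature sets.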
