CSC 480: Introduction to Data Mining
Fall 2018
Assignment 2: Data Mining for Cybersecurity
In this assignment, rather than using machine learning algorithms on nicely
curated data sets such as those found in the UCI Repository, you will be dealing
with real-world data. In particular, you will be using network traffic data
generated by mobile apps on monitored smart phones collected by Dr. Zhen Liu
and her research group. Dr. Liu and her group collected both active flows where
users chose to share their data when using an app and passive flows where apps
were launched on the phone and traffic collected while the phone was not used.
The goal of your assignment is to solve two different problems:
1) Classify the active flow data according to the mobile app that generated
it (e.g., QQ, WeChat, facebook, etc.).
2) Classify active from passive flows (Randomly sample a fixed number of
examples (such as 20,000) from active data, and combine them with
passive data for the passive flow detection experiment).
Mobile traffic classification is the foundation for QoS (Quality of Service)
provision, bandwidth allocation and traffic shaping etc. For example, the high
interactive applications (audio chat or video chat) require high QoS to avoid losing
packets, so as to provide good user experience. Whereas some other applications
(such as Browsers) do not need interactive communication, we could assign low
QoS for the traffic generated by these applications
Mobile applications generate background traffic when the end-user is not actively
using the app. If this background traffic could be accurately identified, network
operators could de-prioritise this traffic and free up network bandwidth for
priority network traffic.
The data was collected in the form of Network Flows. The data is available at:
https://wangruoyu.github.io/mobilegt/. The data is described below:
https://wangruoyu.github.io/mobilegt/
Data description:
Active data files:
(1) biFeatureData: data are characterized by the bi-flow feature set
(2) uniFeatureData:data are characterized by the uni-flow feature set
They are characterized by different feature sets. You could try the algorithms on
the two data sets and find out which feature set is better.
Passive data files
(1) Bipassivedata: data are characterized by the bi-flow feature set
(2) unipassivedata:data are characterized by the uni-flow feature set
In this assignment, your goal is to run various classifiers and try to combine
feature-selection methods, class-imbalance approaches, outlier detection
methods, and other such data filtering techniques that you see fit with algorithms
such as Decision Trees, Neural Networks, Naïve Bayes, SVMs, k-NN, Bagging,
Boosting, Random Forests, etc. to try to find a way to obtain good results on this
data. Prior to running your experiments, take time to study your data set and see
what kind of filters would be useful to apply to the data. Don’t just try plenty of
them. Instead, attempt to understand the data and reason about what
approaches may be best given their characteristics. If you have any questions
about the data, please contact our expert-in-residence, Zhen Liu at
jeannylz@yahoo.com. She will be able to help you.
mailto:jeannylz@yahoo.com