Tips for Assessment Item 1
CMP3036M/CMP9063M Data Science 2016 – 2017
Bowei Chen
School of Computer Science
University of Lincoln
Datasets
Note that the variables in the datasets are deliberately provided without meaning, to ensure that
the analysis students do is based entirely on data mining algorithms and not on intuitive understanding.
As some students missed the lecture and workshop sessions in which I introduced the datasets, further
descriptions of and tips about the datasets are given below:
– ds_training.csv
This dataset is used for descriptive statistical analysis and for training and selecting the best
predictive model. It contains 371 variables (i.e., columns): ID, TARGET and 369 other variables.
Each row represents an observation, and each observation is multidimensional across the different
variables (not including the ID variable). The ID variable is unique for every observation and serves
only as an index. The TARGET variable is the label, or response variable, which takes the value
0 or 1. The remaining 369 variables are the candidate inputs for the predictive model.
– ds_test.csv
This dataset is used to make predictions once the best predictive model has been selected using
ds_training.csv. It contains 370 variables: ID and 369 other variables (there is no TARGET
column). The values of the ID variable are different from those in ds_training.csv because these
are different observations. The 369 variables can be used as input to the selected predictive model
to predict the values of the TARGET variable. Each row should receive exactly one prediction,
and the predicted value should be either 0 or 1.
– ds_sample_submission.csv
This is an example data file that shows the format in which the predictions of your model must
be submitted. The values of the ID variable are the same as in ds_test.csv, and your predictions
should go in the TARGET column. You must make sure that your data solution file follows the
same content format as ds_sample_submission.csv and that it is named
ds_submission_YourStudentID.csv, e.g., ds_submission_12345000.csv. Remember – if you do not
submit your data solution file in the required format, you will be given a mark of
'zero' for report section III: Predictive Models. (For more details, please check the briefing
document and the CRG.) A minimal loading-and-submission sketch in R is given after this list.
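As a starting point, here is a minimal R sketch of loading the two datasets and writing a correctly
named submission file. It assumes the CSV files sit in the working directory and uses the placeholder
student ID 12345000 from the example above; the all-zero predictions are there only to illustrate the
required format.

    # Load the datasets (assumed to be in the working directory).
    ds.train <- read.csv("ds_training.csv")
    ds.test  <- read.csv("ds_test.csv")

    dim(ds.train)            # 371 columns: ID, TARGET and 369 inputs
    table(ds.train$TARGET)   # distribution of the 0/1 labels

    # Dummy all-zero predictions, used here only to illustrate the format;
    # replace them with the output of your selected model.
    submission <- data.frame(ID = ds.test$ID, TARGET = 0)
    write.csv(submission, "ds_submission_12345000.csv", row.names = FALSE)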
Model Selection and Evaluation
Below are some suggestions on model and/or variable selection and evaluation. Your solution is not
limited to these suggestions, nor does it need to follow all of them; they are intended for those
who have no idea where to start! Your solution may also include advanced topics or models
that have not been introduced in lectures.
– Check the model selection methods studied in Semester A Week 10:
• You don't have to use all the model selection methods! Cross-validation (CV) methods are
preferred for this assignment, but you could also look at other types of methods.
• There is no specific order in which different model selection methods must be applied. The key
point is to select the model with the highest AUC value! (A cross-validation sketch using the
caret package follows this item.)
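As a hedged sketch only (the 5 folds and the rpart model are arbitrary illustrative choices, and
ds.train is the data frame from the loading sketch above), cross-validation with AUC as the selection
metric might look like this with caret:

    library(caret)

    # caret's twoClassSummary needs factor labels that are valid R names.
    ds.train$TARGET <- factor(ds.train$TARGET, labels = c("no", "yes"))

    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)  # reports ROC (AUC)

    fit <- train(TARGET ~ . - ID, data = ds.train,
                 method = "rpart",    # example model; try others as well
                 metric = "ROC",      # i.e., select by AUC
                 trControl = ctrl)

    fit$results   # compare the tuning candidates by their ROC column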
– Think about how to select the input variables for the predictive model:
• Consider how to measure the importance of the input variables, for example the correlation
between each input variable and the TARGET variable, or the entropy-based information gain
of each input variable. You can google plenty of documentation and implementations, e.g., the
rank.correlation() and information.gain() functions in the FSelector package.
• Rank the input variables and drop those that are not important. (A sketch using FSelector
follows this item.)
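For instance, a minimal sketch with FSelector, assuming the ds.train data frame from the loading
sketch; the cutoff of 30 variables is an arbitrary choice for illustration:

    library(FSelector)

    inputs <- ds.train[, setdiff(names(ds.train), "ID")]  # drop the ID column

    # Score every input variable by entropy-based information gain.
    weights <- information.gain(TARGET ~ ., data = inputs)

    # Keep the 30 highest-scoring variables (the cutoff is arbitrary).
    top.vars <- cutoff.k(weights, 30)
    reduced  <- ds.train[, c(top.vars, "TARGET")]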
– Below are some useful R packages. Some of these packages and their relevant built-in functions
have been introduced in the workshop slides, and you can also google them for more details (a short
example combining rpart and pROC follows the list):
• leaps
• FSelector
• caret
• rpart
• pROC
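To show how two of these packages fit together, here is a minimal sketch that fits an rpart tree on
the reduced data frame from the previous sketch and computes its AUC with pROC. The in-sample
AUC is purely illustrative; for model selection, the AUC should come from cross-validation or
held-out data.

    library(rpart)
    library(pROC)

    tree.fit <- rpart(TARGET ~ ., data = reduced, method = "class")
    probs    <- predict(tree.fit, reduced, type = "prob")[, 2]  # P(TARGET = "yes")

    roc.obj <- roc(reduced$TARGET, probs)
    auc(roc.obj)   # higher is better when comparing candidate models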