STAT318/462 — Data Mining
Dr Gábor Erdélyi
University of Canterbury, Christchurch
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
G. Erdélyi, University of Canterbury, 2021
Organizational
Lecturer Term 3: Dr Gábor Erdélyi.
Office Hours: Erskine 704, Tuesdays 10am-11am or by appointment.
Course Co-ordinator/Lecturer Term 4: Dr Varvara Vetrova.
Lectures (Echo360 recorded):
Wednesdays 11am-12pm E7 Lecture Theatre;
Thursdays 1-2pm E5 Lecture Theatre.
Lecture slides: on LEARN before lectures; Lecture notes: on LEARN after lectures.
Organizational
Tutors:
Baburaj
Weekly labs/help sessions (start in week 2):
Mondays 3-4pm 212 (Zoom livestream);
Tuesdays 1-2pm 212;
Wednesdays 12-1pm 035 Lab 2;
Wednesdays 3-4pm 035 Lab 2 (Zoom livestream).
Organizational
STAT318 Assessment:
3 assignments: 56% (18%, 18% and 20%); Final exam (2hrs): 44%.
STAT462 Assessment:
3 assignments: 56% (16%, 16% and 24%); Final exam (2hrs): 44%.
You must get at least 40% of the marks in the exam to pass the course.
• A student who narrowly fails to achieve 40% in the exam, but who performs very well in the other assessment, may be eligible for a pass in the course.
• You may do the assignments by yourself or with one other person from the same cohort (300-level students cannot work with 400-level students on the assignments). If you hand in a joint assignment, you will each be given the same mark.
Course Textbook
G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning, with applications in R.
Available online free (pdf):
http://www-bcf.usc.edu/~gareth/ISL/
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Available online free (pdf):
http://web.stanford.edu/~hastie/ElemStatLearn/
The Elements of Statistical Learning is an excellent book on the subject, but it requires a level of mathematical sophistication that is beyond the scope of this course. Students with a sufficient background in mathematics may find it useful for technical details that are not covered in this course.
An Introduction to Statistical Learning (ISL) is a simplified version of The Elements of Statistical Learning. The authors have removed much of the mathematics and kept things simple to make the subject accessible to a wider audience. They have done an excellent job, and this course is pitched at the same level as the ISL text. The course follows ISL closely, but we do not have enough time to cover all of the topics in ISL.
Course Objectives
1 Introduce statistical learning and data mining.
2 Introduce techniques for classification, regression, clustering and association analysis.
3 Understand the basic concepts and underlying assumptions of each technique and determine when they might be useful.
4 Implement various statistical learning techniques using R and real-world data.
1. Data Science, Machine Learning, Statistical Learning and Data Mining have much overlap. I will refer to the subject as statistical learning and data mining, but trendier names like ‘data science’ and ‘machine learning’ would also be appropriate for many topics in this course.
3. We will not be covering the technical/mathematical details of statistical learning methods in this course. Covering these details requires a level of mathematical sophistication that was not asked for in the course prerequisites. At times we will delve into the details, but only when it is essential to better understand a method.
4. We will be using existing R functions rather than writing our own programs in R. This course does not assume prior knowledge of R, and introductory R labs will be given.
Data Mining and Statistical Learning Problems
Identify the main risk factors for prostate cancer.
Predict whether someone will have a heart attack based on their diet, demographic and other clinical measurements.
Establish a relationship between salary and socioeconomic factors in survey data.
Classify emails as spam or ham.
Recommend new products to consumers based on previous purchases.
Identify handwritten zip codes.
Determine public sentiment from social media feeds.
There are many interesting statistical learning problems (some far more interesting than those listed above). The point here is that statistical learning problems come in many forms: making predictions, looking for relationships, making inferences, classifying things, looking for interesting associations, and so on. There is no single best method for tackling problems like these, so we will consider a variety of methods for each type of problem.
Knowledge Discovery (Data Science?)
[Figure: the knowledge discovery process, with stages labelled Data, Preprocessing, Data Mining (Machine Learning; Statistical Learning; …), Patterns and Knowledge.]
Data storage and preprocessing (aka data wrangling) is a subject in its own right (we have a data wrangling course DATA201/422). This includes
• Cleaning (e.g. veracity: the accuracy and trustworthiness of the data)
• Integration (e.g. combining data from multiple sources/data types)
• Reduction (e.g. removing redundancy)
• Transformation (e.g. algorithm readable form)
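The four steps listed above can be illustrated on a toy example. This is only a minimal sketch: the records, field names and region codes are made up, and plain Python is used (rather than R) purely to keep the example self-contained. It is not a recommended wrangling workflow.

```python
# Hypothetical records from two made-up sources, keyed by customer id.
purchases = [
    {"id": 1, "spend": 120.0},
    {"id": 2, "spend": None},   # missing value
    {"id": 3, "spend": 75.5},
    {"id": 3, "spend": 75.5},   # exact duplicate
]
profiles = {1: {"region": "north"}, 2: {"region": "south"}, 3: {"region": "south"}}

# Cleaning: drop records with a missing measurement.
cleaned = [r for r in purchases if r["spend"] is not None]

# Integration: attach the profile information from the second source.
integrated = [{**r, **profiles[r["id"]]} for r in cleaned]

# Reduction: remove redundant (duplicate) records, keeping first occurrences.
seen = set()
reduced = []
for r in integrated:
    key = (r["id"], r["spend"], r["region"])
    if key not in seen:
        seen.add(key)
        reduced.append(r)

# Transformation: encode the categorical region as a number
# (an "algorithm readable" form). The codes are arbitrary.
region_codes = {"north": 0, "south": 1}
transformed = [(r["spend"], region_codes[r["region"]]) for r in reduced]

print(transformed)
```

Real data sets rarely separate into four tidy steps like this; the order and the choices made (for example, dropping rather than imputing missing values) depend on the problem.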
Clearly, data wrangling is an incredibly important step in any learning process because when training a model, rubbish in equals rubbish out. This class considers fairly benign data sets that do not require much (if any) wrangling. This is possibly an unrealistic situation in practice, but means we can focus on understanding the algorithms/methods that are used in statistical learning. This course focuses on the last three steps of knowledge discovery.
Supervised Learning
Response (outcome, target, dependent) variable Y and a vector of p predictor (input, feature, independent) variables X.
We have a training data set of n observations (examples, instances) of the form
$\{(x_i, y_i) : i = 1, 2, \ldots, n\}$,
where $x_i$ is a vector of $p$ predictor variables and $y_i$ is the response value for the $i$th observation.
If Y is quantitative, we have a regression problem. If Y takes values in a finite (unordered) set, we have a classification problem.
Consider trying to predict the Sales of a product based on the advertising budgets of three different media: TV, Radio, Newspaper.
• Response: Y = Sales
• Predictors: X1 = TV, X2 = Radio and X3 = Newspaper
• Predictor vector: X = (X1, X2, X3) = (TV, Radio, Newspaper).
This is a regression problem because Y is quantitative. If we discretize Sales into Low and High, for example,
$$Y = \begin{cases} \text{Low} & \text{if Sales} < 10\text{k}, \\ \text{High} & \text{otherwise}, \end{cases}$$
we would have a classification problem, where the goal is to predict whether sales are high or low (binary classification problem). Both regression and classification problems are considered in this course.
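The switch from a regression problem to a classification problem described above can be sketched in a few lines. The observations below are made up for illustration (they are not taken from any real advertising data), the function name `discretise_sales` is hypothetical, and Python is used only to keep the sketch self-contained. Sales is assumed to be recorded in thousands, so the 10k threshold becomes 10.

```python
# Hypothetical training set {(x_i, y_i)}: each x_i = (TV, Radio, Newspaper)
# advertising budget and each y_i = Sales (in thousands). Values are made up.
training_data = [
    ((230.0, 38.0, 69.0), 22.0),
    ((44.0, 39.0, 45.0), 10.0),
    ((17.0, 46.0, 69.0), 9.0),
    ((151.0, 41.0, 58.0), 18.0),
]


def discretise_sales(sales, threshold=10.0):
    """Return 'Low' if sales < threshold, and 'High' otherwise."""
    return "Low" if sales < threshold else "High"


# Regression view: the response is the quantitative Sales value.
regression_targets = [y for _, y in training_data]

# Classification view: the response is the Low/High label.
classification_targets = [discretise_sales(y) for _, y in training_data]

print(regression_targets)
print(classification_targets)
```

The predictors are identical in both views; only the type of the response changes, and with it the kind of method we would use.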
In supervised learning we:
1 Predict Outcomes: Use an observed x to predict an unobserved y;
2 Make Inferences: Understand how each predictor variable affects the response;
3 Quantify Uncertainty: Assess the quality of any predictions and/or inferences made.
1. It is usually difficult (or impossible) and/or expensive to measure the response variable directly and relatively easy to measure the predictors. Hence, we fit a model to training data and use it to make predictions. For example, if I spend $50k on TV advertising, what are my expected Sales?
2. Should I reduce the amount I spend on TV advertising?
3. How confident am I about the predictions made by my model?
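The three questions above can be made concrete with a small worked example. This is a minimal sketch under stated assumptions: the five data points are made up, a simple linear regression of Sales on TV budget is used as the model (one possible choice, not the only one), and Python is used only to keep the example self-contained.

```python
import math

# Hypothetical data: TV budget (in $1000s) and Sales for five markets (made up).
tv = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [3.0, 5.0, 6.0, 8.0, 10.0]
n = len(tv)

# Fit Sales = b0 + b1 * TV by ordinary least squares.
mean_x = sum(tv) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in tv)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tv, sales))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# 1. Predict outcomes: expected Sales for a $50k TV budget.
predicted = b0 + b1 * 50.0

# 2. Make inferences: b1 estimates the change in Sales associated with
#    an extra $1k of TV spend.
# 3. Quantify uncertainty: the residual standard error summarises how far
#    observed Sales typically fall from the fitted line.
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(tv, sales))
rse = math.sqrt(rss / (n - 2))

print(f"slope={b1:.3f}, prediction={predicted:.2f}, RSE={rse:.3f}")
```

With so few (invented) observations the numbers themselves mean nothing; the point is only that one fitted model supports all three tasks.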
Unsupervised Learning
No response variable, just a set of predictor variables.
The objective of unsupervised learning is harder to define (and somewhat subjective):
1 find natural groupings (clusters) in data;
2 find interesting associations in data;
3 find a subset of predictors (or a linear combination of predictors) that collectively explain most of the variation in the data.
This is a challenging situation because there is often no way of telling how well you are doing!
The only unsupervised learning techniques considered in this course are cluster analysis and association analysis. We will not be considering, for example, dimensionality reduction and feature subset selection. Principal components analysis (PCA) is a popular dimensionality reduction method (see section 10.2 in the course textbook if you’re interested), but we will not cover it here (it is covered in other courses we offer, for example, STAT315).
Unsupervised Learning: How many clusters?
[Figure: scatter plot of an unlabelled data set.]
The number of natural clusters in data is subjective. This data set could have 2, 4 or 6 clusters; it really depends on how we define a cluster. Different algorithms will tend to find different clusterings in data because their cluster definitions are different. More about this later in the course.
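To see how the answer depends on the cluster definition, consider the following minimal sketch. It assumes a deliberately simple rule that is not taken from the slides: sort one-dimensional points and start a new cluster whenever the gap to the previous point exceeds a threshold (in one dimension this gives the same groups as single-linkage clustering cut at that distance). The data and the function name `gap_clusters` are made up, and Python is used only to keep the example self-contained.

```python
def gap_clusters(points, max_gap):
    """Group 1-D points (non-empty list), starting a new cluster whenever
    the gap between neighbouring sorted points exceeds max_gap."""
    ordered = sorted(points)
    clusters = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_gap:
            clusters.append([current])
        else:
            clusters[-1].append(current)
    return clusters


# Hypothetical 1-D data with structure at three different scales (made up).
data = [0.0, 0.2, 1.0, 1.2, 2.7, 2.9, 6.9, 7.1, 7.9, 8.1, 9.6, 9.8]

# Each threshold is a different definition of what counts as a cluster.
counts = {max_gap: len(gap_clusters(data, max_gap)) for max_gap in (2.0, 1.0, 0.5)}
print(counts)
```

The same twelve points give 2, 4 or 6 clusters as the threshold shrinks from 2.0 to 1.0 to 0.5, and nothing in the data itself says which threshold is the right one.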