
Applied Data Analysis — Introduction
Dr. Lan Du and Dr. Ming Liu
Faculty of Information Technology, Monash University, Australia Week 1
Lan&Ming (Monash) FIT5149 1 / 60

Outline
1 An Overview of Statistical (Machine) Learning
2 About the Unit
3 What Is Statistical Learning?
4 Assessing Model Accuracy

An Overview of Statistical (Machine) Learning
Data Science Skills
Figure: from https://imgur.com/hoyFT4t

An Overview of Statistical (Machine) Learning
Data Mining
• Data mining is the process of automatically extracting information from large data sets
• These data sets are usually so large that manually examining them is impractical
• The data sets can be structured (e.g., a database) or unstructured (e.g., free-form text in documents)
– Text data mining uses natural language processing to extract information from large text collections
– Quantitative data mining extracts information from numerical data
– It is also possible to integrate quantitative and qualitative information sources

An Overview of Statistical (Machine) Learning
Business applications of data mining
Data Mining
• Data mining permits businesses to exploit the information present in the large data sets they collect in the course of their business
• Typical business applications:
– in medical patient management, data mining identifies patients likely to benefit from a new drug or therapy
– in customer relationship management, data mining identifies customers likely to be receptive to a new advertising campaign
– in financial management, data mining can help predict the credit-worthiness of new customers
– in load capacity management, data mining predicts the fraction of customers with airline reservations who will actually turn up for the flight
– in market basket and affinity analysis, data mining identifies pairs of products likely or unlikely to be bought together, which can help design advertising campaigns

An Overview of Statistical (Machine) Learning
Data Mining
• Diverse range of data mining tasks:
– software packages exist for standard tasks, e.g., affinity analysis
– but specialised data mining applications require highly-skilled experts to design and construct them
• Data mining is often computationally intensive and involves advanced algorithms and data structures
• Data mining may involve huge data sets too large to store on a single computer
– often requires large clusters or cloud computing services

An Overview of Statistical (Machine) Learning
Machine Learning
• Machine learning is a branch of Artificial Intelligence that studies methods for automatically learning from data
• It focuses on generalisation and prediction
– the typical goal is to predict the properties of yet-unseen cases
⇒ hence the training-set/test-set methodology, which lets us estimate accuracy on novel test data
• Data mining can use machine learning, but it doesn't have to:
– E.g., “who is the phone system’s biggest user?” doesn’t necessarily involve machine learning
– E.g., “which customers are likely to increase their phone usage next year?” does involve machine learning
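The training-set/test-set methodology can be sketched in a few lines of Python (a minimal, hypothetical helper for illustration; the unit itself works in R):

```python
import random

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle the data, then hold out a fraction as a test set."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# 100 toy labelled observations (x, y)
data = [(x, 2 * x + 1) for x in range(100)]
train, test = train_test_split(data)
```

A model is then fitted on `train` only; its accuracy on the held-out `test` set estimates how well it generalises to novel cases.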

An Overview of Statistical (Machine) Learning
Statistical Modelling
• Probability theory is the branch of mathematics concerned with random phenomena and with systems whose structure and/or state is only partially known
⇒ probability theory is a mathematical foundation of machine learning
• Statistics is the science of the collection, organisation and interpretation of data
– A statistic is a function of a data set (usually numerically valued) intended to summarise the data (e.g., the average, or mean, of a set of numbers)
• A statistical model is a mathematical statement of the relationship between variables that have a random component
– many machine learning algorithms are based on statistical models
– statistical models also play a central role in natural language processing

An Overview of Statistical (Machine) Learning
Statistics vs Machine Learning
• Statistics and machine learning often use the same statistical models
⇒ very strong cross-fertilisation between the fields
• Machine learning often involves data sets that are orders of magnitude larger than those in standard statistics problems
– Machine learning is concerned with algorithmic and data-structure issues that statistics doesn't deal with
• Statistics tends to focus on hypothesis testing, while machine learning focuses on prediction
– Hypothesis testing: Does coffee cause cancer?
– Prediction: Which patients are likely to die of cancer?

An Overview of Statistical (Machine) Learning
The world of Machine Learning
Figure: from https://vas3k.com/blog/machine_learning/

An Overview of Statistical (Machine) Learning
Our Focus

An Overview of Statistical (Machine) Learning
Supervised Learning
Supervised training data contains the labels y that we want to predict
Figure: from https://learncuriously.wordpress.com/2018/12/22/machine-learning-vs-human-learning-part-1/

An Overview of Statistical (Machine) Learning
Regression: Wage Data [Chapters 3 and 7]
We wish to understand the association between an employee’s wage and their age, education and calendar year
Left: wage as a function of age; wage rises with age until about 60, then declines. The blue curve shows an estimate of the average wage for a given age
– Regression: using this curve to predict someone’s wage, a continuous value
Centre: wage as a function of year
Right: boxplots showing wage as a function of education
[Figure: Wage plotted against Age, Year, and Education Level]

An Overview of Statistical (Machine) Learning
Classification: Stock Market [Chapter 4]
Stock market data on daily movements over a 5-year period
Aim: predict if the market is going to be Up or Down (categorical label)
Left: the 648 days on which the market increased the next day, and the 602 days on which it decreased
Centre: using the previous 2 days’ percentage changes
Right: using the previous 3 days’ percentage changes
[Figure: boxplots of the percentage change in the S&P (yesterday, two days previous, three days previous), split by today’s direction (Down/Up)]

An Overview of Statistical (Machine) Learning
Classification: Stock Market
A quadratic discriminant analysis model is fitted to the subset of the market data from 2001–2004 (the training set)
The fitted model predicts the probability of a stock-market decrease using the 2005 data (the test set)
It correctly predicts the direction of movement in the market 60% of the time

An Overview of Statistical (Machine) Learning
Unsupervised Learning
Unsupervised training data does not contain the labels y that we want to predict.
Figure: from https://learncuriously.wordpress.com/2018/12/22/machine-learning-vs-human-learning-part-1/

An Overview of Statistical (Machine) Learning
Clustering Gene Expression Data
The NCI60 data set: 6,830 gene expression measurements for 64 cancer cell lines
Goal: to determine whether there are groups among the cell lines
Left: the gene expression data in two-dimensional space
Right: the same data, coloured according to 14 types of cancer
[Figure: the 64 cell lines plotted in two dimensions (Z1, Z2), uncoloured (left) and coloured by cancer type (right)]

An Overview of Statistical (Machine) Learning
Semi-supervised Learning
Semi-supervised training data partially identifies the labels y, or identifies the labels on some but not all of the training examples.


About the Unit
Outline
1 An Overview of Statistical (Machine) Learning
2 About the Unit
3 What Is Statistical Learning?
4 Assessing Model Accuracy

About the Unit
Unit Outcomes
Synopsis: This unit aims to provide students with the analytical and data-modelling skills needed for the role of a data scientist or business analyst. Students will be introduced to established and contemporary machine learning techniques for data analysis and presentation, using widely available analysis software. They will look at a number of characteristic problems/data sets and analyse them with appropriate machine learning and statistical algorithms, including regression, classification and clustering. The unit focuses on understanding the analytical problems, the machine learning models, and the basic modelling theory. Students will need to interpret the results and assess the suitability of the algorithms.
Note that
• Overlap with FIT5201 and FIT5197 on the standard regression/classification methods.

About the Unit
Unit Outcomes
Students are expected to
• Analyse data sets with a range of statistical, graphical and machine-learning tools;
• Evaluate the limitations, appropriateness and benefits of data analytics methods for given tasks;
• Design solutions to real-world problems with data analytics techniques;
• Assess the results of an analysis;
• Communicate the results of an analysis to both specialist and broad audiences.
For example, apply a statistical learning model to a toy dataset.

About the Unit
Books
R and Data Mining: Examples and Case Studies, Yanchang Zhao, RDataMining.com

About the Unit
Unit Contents, Assessments, and Programming Environment
Summary of unit contents and assessments
https://lms.monash.edu/local/preview/index.php?courseid=78152&unitcode=FIT5149
It is a very bad idea to leave the assessments until the last minute.
Programming
• R
– Hands-on Programming with R: https://rstudio-education.github.io/hopr/
– R for Data Science (please register an account with your Monash University email address to obtain access)
– rpubs.com
• Jupyter Notebook / R Markdown
• Google it!

About the Unit
Lecture/Tutorial Schedule

About the Unit
Teaching Staff and Consultation Time
Teaching Staff
• Dr. Lan Du: lan.du@monash.edu
• Dr. Ming Liu: grayming.liu@monash.edu
• Dr. Minh Le: minh.le@monash.edu
• Dr. Mohammad Haqqani: mohammad.haqqani@monash.edu
• Dr. Tam Vo: tam.vo@monash.edu
• Dr. Penny Zhang: penny.zhang@monash.edu
• Mr. Dan Nguyen: dan.nguyen2@monash.edu
• Mr. David Tan: David.Tan@monash.edu
• Mr. Xiaocheng Jin: jin.jin2@monash.edu
Consultation Time
• Consultation will start from Week 2
• Consultation times can be found in Moodle.

What Is Statistical Learning?
Outline
1 An Overview of Statistical (Machine) Learning
2 About the Unit
3 What Is Statistical Learning?
4 Assessing Model Accuracy

What Is Statistical Learning?
A Simple Example of Statistical Learning
Suppose you are a consultant
Task/Question: how to improve the sales of a particular product
The Advertising data: sales, in thousands of dollars, as a function of the TV, Radio and Newspaper advertising budgets for 200 different markets
The client cannot directly increase sales of the product
The client can control the advertising expenditure in each of the three media
If there is an association between advertising and sales, then the client can adjust the advertising budgets, thereby indirectly increasing sales

What Is Statistical Learning?
Motivation
Our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets
A simple least-squares fit of sales to each variable separately
Each blue line represents a simple model that can be used to predict sales using TV, Radio, or Newspaper
[Figure: scatterplots of Sales against the TV, Radio and Newspaper budgets, each with a least-squares fit (blue line)]

What Is Statistical Learning?
Another example: f with one variable
Income versus years of education for 30 individuals
One might be able to predict income using years of education
However, the function f is unknown
One must estimate f based on the observed points
Income is a simulated data set, so f is known and is shown by the blue curve
The vertical lines are the errors ε [why are some points above or below the curve?]
[Figure: Income versus Years of Education; the blue curve is the true f and the vertical lines are the errors ε]

What Is Statistical Learning?
Another example: f with more than one variable
Income as a function of years of education and seniority
f is a two-dimensional surface (the true underlying relationship)
It must be estimated based on the observed data
Income = f (Years, Seniority)
[Figure: three-dimensional plot of Income as a function of Years of Education and Seniority]

What Is Statistical Learning?
Statistical Learning
Suppose that we observe a quantitative response Y and p different predictors X1, X2, . . . , Xp
Assume some relationship between Y and X = (X1, X2, . . . , Xp):
Y = f(X) + ε
• Here X = (X1, X2, . . . , Xp) is an observation (input)
• with features X1, X2, . . . , Xp
• Y is the output or response variable
f is a fixed but unknown function of X1, X2, . . . , Xp
ε is a random error term, independent of X, with E[ε] = 0
f represents the systematic information that X provides about Y
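The model Y = f(X) + ε can be made concrete with a small simulation; the particular f and noise distribution below are invented purely for illustration:

```python
import random

def f(x):
    """A 'true' systematic relationship, known only because we simulate."""
    return 3.0 + 2.0 * x

def simulate(n, noise_sd=1.0, seed=42):
    """Draw n observations (x, y) with y = f(x) + eps, where E[eps] = 0."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

xs, ys = simulate(500)
```

Since E[ε] = 0, the residuals y − f(x) average out to roughly zero over many draws.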

What Is Statistical Learning?
Example: Advertising Budget

What Is Statistical Learning?
Statistical Learning
Y = f(X1, X2, . . . , Xp) + ε
Ŷ = f̂(X1, X2, . . . , Xp)
Y ≈ Ŷ
In essence, statistical learning refers to a set of approaches for estimating f
Statistical learning: a set of tools for understanding data
In this lecture
• we outline some of the key concepts that arise in estimating f
• and tools for evaluating the estimates obtained

What Is Statistical Learning?
Statistical Learning: Prediction
Y = f(X1, X2, . . . , Xp) + ε, or Y = f(X) + ε
Prediction: to find an estimator Ŷ = f̂(X)
• f̂ is treated as a black box; we are not concerned with its exact form
• We are interested in accurate predictions for Y
• The accuracy of Ŷ as a prediction of Y depends on two quantities:
– the reducible error, |f − f̂|
– we can potentially improve the accuracy by using a better learning technique to estimate f
– the irreducible error: Y is also a function of ε, and this error cannot be predicted using X
– no matter how well we estimate f, we cannot reduce the error introduced by ε

E[(Y − Ŷ)²] = E[(f(X) + ε − f̂(X))²] = [f(X) − f̂(X)]² + Var(ε)

where [f(X) − f̂(X)]² is the reducible error and Var(ε) is the irreducible error
• E[(Y − Ŷ)²] is the average value of the squared difference between the prediction and the true value
• Var(ε) is the variance associated with the error term ε
• E[(Y − Ŷ)²] ≥ Var(ε)
• The focus of this unit is on minimising the reducible error
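The split into reducible and irreducible error can be checked numerically. The true f, the imperfect estimate f̂, and Var(ε) = 1 below are all invented for this sketch:

```python
import random

def f(x):                     # true function (illustrative)
    return 3.0 + 2.0 * x

def f_hat(x):                 # a deliberately imperfect estimate of f
    return 2.5 + 2.1 * x

def expected_squared_error(x0, noise_sd=1.0, trials=100_000, seed=0):
    """Monte-Carlo estimate of E[(Y - f_hat(x0))^2] at a fixed point x0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = f(x0) + rng.gauss(0.0, noise_sd)   # Y = f(X) + eps
        total += (y - f_hat(x0)) ** 2
    return total / trials

x0 = 4.0
reducible = (f(x0) - f_hat(x0)) ** 2   # [f(x0) - f_hat(x0)]^2
irreducible = 1.0 ** 2                 # Var(eps)
estimate = expected_squared_error(x0)
```

`estimate` comes out close to `reducible + irreducible`, and no choice of f̂ can push it below Var(ε).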

What Is Statistical Learning?
Statistical Learning: Inference
Y = f(X1, X2, . . . , Xp) + ε, or Y = f(X) + ε
Inference: to understand the way Y is affected by X1, X2, . . . , Xp
We wish to estimate f, but our goal is not necessarily to make predictions
We want to understand the relationship between X and Y
How does Y change as a function of X1, X2, . . . , Xp?
f̂ cannot be a black box; we need to know its exact form
We want to answer the following questions:
• Which predictors are associated with the response?
• What is the relationship between the response and each predictor?
• Can the relationship between Y and each predictor be adequately summarised using a linear equation, or is the relationship more complicated?

What Is Statistical Learning?
Inference vs Prediction: Predictive Modelling of House Prices
log(price) = β0 + β1 × rooms + β2 × bathrooms + β3 × sqft_living + β4 × sqft_lot

What Is Statistical Learning?
How to Estimate f ?
Training data: n different data points (or observations) {x1, . . . , xn}
For i = 1, . . . , n, xi = (xi1, . . . , xip)
xij is the jth predictor of observation i, for i = 1, . . . , n and j = 1, . . . , p
yi is the response variable for the ith observation
The training set is {(x1, y1), (x2, y2), . . . , (xn, yn)}
Goal: apply a statistical learning method to the training data to estimate the unknown function f
We want to find f̂ such that Y ≈ f̂(X) for any (X, Y)
Most statistical learning methods can be categorised as
• parametric methods
• non-parametric methods

What Is Statistical Learning?
Parametric Methods
Two-step model-based approach: reduces the problem of estimating f down to one of estimating a set of parameters:
1 We make an assumption about the functional form, or shape, of f:
f(X) = β0 + β1X1 + β2X2 + . . . + βpXp
• This is a linear model
• Instead of estimating an entirely arbitrary p-dimensional function f(X),
• one only needs to estimate the p + 1 coefficients β0, . . . , βp
2 After a model has been selected, we need a procedure that uses the training data to fit or train the model
• we need to estimate the parameters β0, . . . , βp so that
Y ≈ β̂0 + β̂1X1 + β̂2X2 + . . . + β̂pXp
• The most common approach to fitting the model is referred to as (ordinary) least squares
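For a single predictor, ordinary least squares has a closed form. A bare-bones Python sketch (not the unit's R workflow), checked on noise-free toy data:

```python
def least_squares_fit(xs, ys):
    """Ordinary least squares for the one-predictor model y ≈ b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Sanity check: data generated from y = 1 + 3x is recovered exactly
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 3.0 * x for x in xs]
b0, b1 = least_squares_fit(xs, ys)
```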

What Is Statistical Learning?
Parametric Methods – Advantages vs Disadvantages
[Figure: Income as a function of Years of Education and Seniority in the Income data set, shown with a fitted linear surface]
income ≈ β0 + β1 × education + β2 × seniority
The entire fitting problem reduces to estimating β0, β1 and β2
The linear fit captures the positive relationship between years of education and income

What Is Statistical Learning?
Non-parametric Methods
Non-parametric methods do not make explicit assumptions about the functional form of f, so they have the potential to accurately fit a wider range of possible shapes for f
Disadvantages:
• a large number of observations is required in order to obtain an accurate estimate of f
• risk of overfitting
[Figure: two non-parametric fits to the Income data; one with the right level of smoothness, and one with a lower level of smoothness that overfits]
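As a concrete non-parametric example, K-nearest-neighbour regression assumes nothing about the shape of f; it simply averages the responses of the nearest training points. The toy data here are invented for the sketch:

```python
def knn_regress(x0, xs, ys, k=3):
    """Predict y at x0 as the mean response of the k nearest training points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

xs = list(range(10))           # training inputs 0..9
ys = [x * x for x in xs]       # a curved relationship, unknown to the method
pred = knn_regress(4.0, xs, ys, k=3)   # averages the responses at x = 4, 3, 5
```

A small k gives a flexible but noisy fit (overfitting risk), and accuracy depends on having many observations near x0, matching the disadvantages listed above.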

What Is Statistical Learning?
Trade-Off Between Accuracy and Model Interpretability
Tasks that put more emphasis on inference: restrictive models are much more interpretable and therefore preferred.
Tasks that focus on prediction accuracy: flexible models can be a better choice.
[Figure: trade-off between flexibility (x-axis) and interpretability (y-axis); Subset Selection and the Lasso (low flexibility, high interpretability), then Least Squares, Generalized Additive Models and Trees, then Bagging, Boosting and Support Vector Machines (high flexibility, low interpretability)]

Assessing Model Accuracy
Outline
1 An Overview of Statistical (Machine) Learning
2 About the Unit
3 What Is Statistical Learning?
4 Assessing Model Accuracy

Assessing Model Accuracy
Selecting a statistical learning procedure
No free lunch in statistics: No one method dominates all others over all possible data sets
It is an important task to decide for any given set of data which method produces the best results
Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice

Assessing Model Accuracy
Measuring the Quality of Fit
We need some way to measure how well a method’s predictions actually match the observed data
In the regression setting, the most commonly used measure is the mean squared error (MSE):

MSE = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))²

This is the training MSE, computed on the data used to fit the model
• We are really interested in the accuracy of the predictions we obtain when we apply our method to previously unseen test data
Test MSE:
• computed on previously unseen observations that were not used to train the statistical learning method
• the average squared prediction error over these test observations
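The gap between training and test MSE is easy to demonstrate: a rule that memorises the training responses achieves zero training MSE, but its test MSE cannot beat the noise. The data-generating process here is invented for illustration:

```python
import random

def mse(predict, data):
    """Mean squared error (1/n) * sum (y_i - f_hat(x_i))^2 over (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

rng = random.Random(1)

def draw(n, noise_sd=1.0):
    """Simulated observations with y = 2x + 1 + eps."""
    out = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        out.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, noise_sd)))
    return out

train, test = draw(300), draw(300)

def memoriser(x):
    """Predict the response of the nearest training point (maximally flexible)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

train_mse = mse(memoriser, train)   # exactly 0: it reproduces the training set
test_mse = mse(memoriser, test)     # stays around 2 * Var(eps) on fresh data
```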

Assessing Model Accuracy
[Figure: left, simulated data with three fitted curves; right, MSE as a function of flexibility]
Data simulated from f, shown in black
Three estimates of f are shown:
• the linear regression line (orange curve), and
• two smoothing spline fits (blue and green curves)
Training MSE (grey curve), test MSE (red curve), and the minimum possible test MSE over all methods (dashed line)

Assessing Model Accuracy
[Figure: left, data simulated from a nearly linear f with fitted curves; right, MSE as a function of flexibility]
The true f is much closer to linear, shown in black
Linear regression provides a very good fit to these data
Training MSE (grey curve), test MSE (red curve), and the minimum possible test MSE over all methods (dashed line)

Assessing Model Accuracy
[Figure: left, data simulated from a highly non-linear f with fitted curves; right, MSE as a function of flexibility]
f is far from linear
Linear regression provides a very poor fit to these data
Training MSE (grey curve), test MSE (red curve), and the minimum possible test MSE over all methods (dashed line)

Assessing Model Accuracy
The Bias-Variance Trade-Off
The U-shape of the test MSE is due to two competing properties of learning methods
The expected test MSE at a point x0 decomposes into three sources of error:
• the variance of f̂
• the squared bias of f̂
• the variance of the error term ε

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)

where the left-hand side is the expected test MSE, the first two terms on the right are the reducible error, and Var(ε) is the irreducible error
If we had many training sets, each would yield its own fitted coefficients and hence its own f̂; writing f̄(x0) = E[f̂(x0)] for the average prediction,
Var(f̂(x0)) = E[(f̂(x0) − f̄(x0))²]
Bias(f̂(x0)) = f̄(x0) − f(x0)
• Bias is the difference between the average prediction of our model and the correct value which we are trying to predict
We want both low variance and low bias
The expected test MSE can never lie below Var(ε)
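The quantities f̄, variance and bias can be estimated by refitting a model on many independent training sets. The rigid "predict the mean" model and the quadratic true f below are invented to make the high-bias/low-variance case visible:

```python
import random

rng = random.Random(0)
NOISE_SD = 1.0
X0 = 2.0                            # the evaluation point x0

def true_f(x):
    return x * x                    # the (simulated) true f

def fit_constant_model(train):
    """A very inflexible model: always predict the mean training response."""
    ybar = sum(y for _, y in train) / len(train)
    return lambda x: ybar

def draw_training_set(n=50):
    xs = [rng.uniform(0, 4) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, NOISE_SD)) for x in xs]

# f_hat(x0) recorded over many independent training sets
preds = [fit_constant_model(draw_training_set())(X0) for _ in range(2000)]
f_bar = sum(preds) / len(preds)                          # f_bar = E[f_hat(x0)]
variance = sum((p - f_bar) ** 2 for p in preds) / len(preds)
bias = f_bar - true_f(X0)                                # f_bar(x0) - f(x0)
```

The rigid model barely changes between training sets (small variance) but is systematically off at x0 (large bias); a flexible model would trade these the other way.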

Assessing Model Accuracy
Variance refers to the amount by which f̂ would change if we estimated it using a different training data set:
Var(f̂(x0)) = E[(f̂(x0) − f̄(x0))²], where the expectation is taken over training sets
If a method has high variance, then small changes in the training data can result in large changes in f̂
In general, more flexible statistical methods have higher variance
Bias refers to the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model:
Bias(f̂(x0)) = f̄(x0) − f(x0)
It is unlikely that any real-life problem truly has a simple linear relationship
⇒ linear regression typically results in high bias
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.

Assessing Model Accuracy
Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve), each as a function of flexibility
The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE
[Figure: these four curves shown for three different simulated data sets]

Assessing Model Accuracy
The challenge lies in finding a method for which both the variance and the squared bias are low.
This trade-off is one of the most important recurring themes in this unit

Assessing Model Accuracy
In classification, the yi are qualitative
Training observations: {(x1, y1), . . . , (xn, yn)}
We quantify the accuracy of our estimate f̂ by means of the training error rate:
the proportion of mistakes (misclassifications) made when we apply f̂ to the training observations,

(1/n) Σ_{i=1}^{n} I(yi ≠ ŷi)

ŷi is the predicted class label for the ith observation using f̂
I(yi ≠ ŷi) is an indicator variable that equals 1 if yi ≠ ŷi and 0 if yi = ŷi
If I(yi ≠ ŷi) = 0, the ith observation was classified correctly; otherwise it was misclassified
The formula therefore gives the fraction of incorrect classifications

Assessing Model Accuracy
Training error rate:

(1/n) Σ_{i=1}^{n} I(yi ≠ ŷi)

Test error rate associated with a set of m test observations:

Ave(I(y0 ≠ ŷ0)) = (1/m) Σ_{j=1}^{m} I(yj ≠ ŷj)

A good classifier is one for which the test error is smallest
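Both error rates are the same computation on different data: the fraction of indicator terms I(y ≠ ŷ) that equal 1. A sketch with made-up labels:

```python
def error_rate(y_true, y_pred):
    """(1/n) * sum of I(y_i != yhat_i): the fraction of misclassifications."""
    return sum(1 for y, yhat in zip(y_true, y_pred) if y != yhat) / len(y_true)

# Hypothetical market directions and a classifier's predictions
y_true = ["Up", "Up", "Down", "Up", "Down"]
y_pred = ["Up", "Down", "Down", "Up", "Up"]
rate = error_rate(y_true, y_pred)    # 2 mistakes out of 5
```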

Assessing Model Accuracy
The Bayes Classifier
The test error rate is minimised, on average, by a very simple classifier that assigns each observation to its most likely class, given its predictor values
We simply assign a test observation with predictor vector x0 to the class j for which
Pr(Y = j | X = x0)
is largest
This very simple classifier is called the Bayes classifier
With two classes, class one and class two, it predicts class one if Pr(Y = 1 | X = x0) > 0.5
The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate
The Bayes error rate is given by
1 − E[ max_j Pr(Y = j | X) ]
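When the conditional distribution Pr(Y = j | X = x0) is known, the Bayes classifier is one line: pick the argmax. The probabilities below are hypothetical:

```python
def bayes_classify(cond_probs):
    """Assign x0 to the class j with the largest Pr(Y = j | X = x0)."""
    return max(cond_probs, key=cond_probs.get)

# Hypothetical conditional distribution at some point x0
probs = {"orange": 0.3, "blue": 0.7}
label = bayes_classify(probs)                    # "blue", since 0.7 > 0.5
error_at_x0 = 1 - max(probs.values())            # 1 - max_j Pr(Y = j | X = x0)
```

Averaging `error_at_x0` over the distribution of X gives the Bayes error rate 1 − E[max_j Pr(Y = j | X)].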

Assessing Model Accuracy
A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange.
The purple dashed line represents the Bayes decision boundary
The orange background grid indicates the region in which a test observation will be assigned to the orange class,
The blue background grid indicates the region in which a test observation will be assigned to the blue class

Assessing Model Accuracy
K-Nearest Neighbors
Bayes classifier is good in theory
For real data, we do not know the conditional distribution of Y given X
Many approaches attempt to estimate the conditional distribution of Y given X,
And then classify a given observation to the class with highest estimated probability
One such method is the K-nearest neighbours (KNN) classifier
Given a positive integer K and a test observation x0:
1 The KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0
2 It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j:

Pr(Y = j | X = x0) = (1/K) Σ_{i ∈ N0} I(yi = j)

3 Finally, KNN applies the Bayes rule and classifies the test observation x0 to the class with the largest estimated probability
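The three steps can be written directly; the five training points here are invented for the sketch:

```python
from collections import Counter
import math

def knn_classify(x0, train, k=3):
    """Step 1: find the k nearest training points; step 2: count class
    fractions in that neighbourhood; step 3: return the most frequent class."""
    neighbourhood = sorted(train, key=lambda p: math.dist(p[0], x0))[:k]
    counts = Counter(label for _, label in neighbourhood)
    return counts.most_common(1)[0][0]

train = [((0.0, 0.0), "blue"), ((0.1, 0.2), "blue"), ((0.2, 0.1), "blue"),
         ((1.0, 1.0), "orange"), ((0.9, 1.1), "orange")]
label = knn_classify((0.0, 0.1), train, k=3)     # the 3 nearest are all blue
```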

Assessing Model Accuracy
The KNN approach, using K = 3
The KNN decision boundary for this example is shown in black
The blue grid indicates the region in which a test observation will be assigned to the blue class,
And the orange grid indicates the region in which it will be assigned to the orange class.

Assessing Model Accuracy
The choice of K has a drastic effect on the KNN classifier obtained
Purple dashed curve: the Bayes decision boundary
Black curve: KNN with K = 10
The KNN and Bayes decision boundaries are very similar

Assessing Model Accuracy
A comparison of the KNN decision boundaries
Solid black curves obtained using K = 1 and K = 100
Purple dashed curve: the Bayes decision boundary
With K = 1, the decision boundary is overly flexible
With K = 100, it is not sufficiently flexible
[Figure: KNN decision boundaries for K = 1 (left) and K = 100 (right), with the Bayes boundary in purple]

Assessing Model Accuracy
The KNN training error rate (blue; 200 training observations) and test error rate (orange; 5,000 test observations), shown as the level of flexibility (measured by 1/K) increases, or equivalently as the number of neighbours K decreases
The black dashed line indicates the Bayes error rate
The jumpiness of the curves is due to the small size of the training data set

Assessing Model Accuracy
In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method.
The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.
Later we return to this topic and discuss various methods for estimating test error rates
And thereby choosing the optimal level of flexibility for a given statistical learning method.

Assessing Model Accuracy
Summary
What we learned
• supervised vs unsupervised learning
• inference vs prediction
• prediction accuracy vs interpretability
• the bias-variance trade-off
To do
• Read Chapters 1 and 2 of “An Introduction to Statistical Learning”
• Run the lab in Chapter 2
• Attempt the exercise questions in Chapter 2, and understand them
• Set up the programming environment if you choose to use your own computer
Acknowledgement
• Figures used in these slides are from the book “An Introduction to Statistical Learning with Applications in R”
• Some slides were adapted from Prof Mark John’s slides on “Introduction to machine learning”