
CSC 311: Introduction to Machine Learning
Lecture 2 – Decision Trees & Bias-Variance Decomposition
University of Toronto, Fall 2021
Intro ML (UofT) CSC311-Lec2 1 / 57

Announcement: HW1 released

Decision Trees
I Simple but powerful learning algorithm
I Used widely in Kaggle competitions
I Lets us motivate concepts from information theory (entropy, mutual information, etc.)
Bias-variance decomposition
I Concept to motivate combining different classifiers.
Ideas we will need in today’s lecture
I Trees [from algorithms]
I Expectations, marginalization, chain rule [from probability]
Intro ML (UofT) CSC311-Lec2

Lemons or Oranges
Scenario: You run a sorting facility for citrus fruits
Binary classification: lemons or oranges
Features measured by sensor on conveyor belt: height and width
Intro ML (UofT) CSC311-Lec2 3 / 57

Decision Trees
Make predictions by splitting on features according to a tree structure.
[Figure: an example decision tree for the citrus data; each internal node asks a Yes/No question about a feature.]
Intro ML (UofT) CSC311-Lec2 4 / 57

Decision Trees
Make predictions by splitting on features according to a tree structure.
Intro ML (UofT) CSC311-Lec2 5 / 57

Decision Trees—Continuous Features
Split continuous features by checking whether that feature is greater than or less than some threshold.
Decision boundary is made up of axis-aligned planes.
Intro ML (UofT) CSC311-Lec2 6 / 57

Decision Trees
[Figure: an example decision tree with Yes/No branches.]
Internal nodes test a feature
Branching is determined by the feature value
Leaf nodes are outputs (predictions)
Question: What are the hyperparameters of this model?
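As a concrete (made-up) illustration of this structure, here is a minimal sketch of how prediction with such a tree works; the feature names, thresholds, and leaf labels are illustrative assumptions, not the values in the figure.

```python
# Sketch of decision-tree prediction: internal nodes test a feature against a
# threshold, branches follow the test outcome, and leaves output a class.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # feature tested at an internal node
        self.threshold = threshold  # split threshold for that feature
        self.left = left            # subtree for feature value > threshold ("Yes")
        self.right = right          # subtree for feature value <= threshold ("No")
        self.label = label          # prediction stored at a leaf

def predict(node, x):
    """Walk from the root to a leaf, following the feature tests."""
    if node.label is not None:            # reached a leaf: return its output
        return node.label
    if x[node.feature] > node.threshold:  # "Yes" branch
        return predict(node.left, x)
    return predict(node.right, x)         # "No" branch

# A tiny hand-built tree for the citrus example (illustrative numbers only).
tree = Node(feature="width", threshold=6.5,
            left=Node(label="orange"),
            right=Node(feature="height", threshold=9.5,
                       left=Node(label="lemon"),
                       right=Node(label="orange")))

print(predict(tree, {"width": 7.0, "height": 8.0}))   # -> orange
print(predict(tree, {"width": 6.0, "height": 10.0}))  # -> lemon
```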
Intro ML (UofT) CSC311-Lec2

Decision Trees—Classification and Regression
Each path from root to a leaf defines a region Rm of input space
Let {(x^(m_1), t^(m_1)), …, (x^(m_k), t^(m_k))} be the training examples that fall into R_m
There are four regions (m = 1, …, 4) in the figure on the right; k denotes the number of training examples that fall into R_m
Regression tree:
I continuous output
I leaf value y_m typically set to the mean value in {t^(m_1), …, t^(m_k)}
Classification tree (we will focus on this):
I discrete output
I leaf value y_m typically set to the most common value in {t^(m_1), …, t^(m_k)}
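A tiny sketch of how the leaf value y_m is computed from the targets that land in region R_m; the target values below are made up for illustration.

```python
# Leaf values: mean target for regression trees, majority class for classification trees.
from collections import Counter

def regression_leaf(targets):
    """Regression tree: leaf value is the mean target in the region."""
    return sum(targets) / len(targets)

def classification_leaf(targets):
    """Classification tree: leaf value is the most common class in the region."""
    return Counter(targets).most_common(1)[0][0]

print(regression_leaf([1.0, 2.0, 4.0]))                    # 2.333...
print(classification_leaf(["orange", "lemon", "orange"]))  # orange
```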
Intro ML (UofT) CSC311-Lec2 8 / 57

Decision Trees—Discrete Features
Example: should we eat at this restaurant?
Intro ML (UofT) CSC311-Lec2 9 / 57

Decision Trees—Discrete Features
Split discrete features into a partition of possible values.
Intro ML (UofT) CSC311-Lec2 10 / 57

Learning Decision Trees
Decision trees are universal function approximators.
I For any training set we can construct a decision tree that has exactly one leaf for every training point, but it probably won’t generalize.
I Example – If all D features were binary, and we had N = 2^D unique training examples, a full binary tree would have one leaf per example.
Finding the smallest decision tree that correctly classifies a training set is NP-complete.
I If you are interested, check: Hyafil & Rivest ’76.
So, how do we construct a useful decision tree?
Intro ML (UofT) CSC311-Lec2 11 / 57

Learning Decision Trees
Resort to a greedy heuristic:
I Start with the whole training set and an empty decision tree.
I Pick a feature and candidate split that would most reduce a loss.
I Split on that feature and recurse on subpartitions.
What is a loss?
I When learning a model, we use a scalar number to assess whether we’re on track
I Scalar value: low is good, high is bad
Which loss should we use?
I Let’s see if misclassification rate is a good loss.
Intro ML (UofT) CSC311-Lec2 12 / 57

Choosing a Good Split
Consider the following data. Let’s split on width.
Intro ML (UofT) CSC311-Lec2 13 / 57

Choosing a Good Split
Recall: classify by majority.
A and B have the same misclassification rate, so which is the best split? Vote!
Intro ML (UofT) CSC311-Lec2 14 / 57

The lecture in review
The story thus far
I To learn a decision tree, we’re going to implement a greedy heuristic
I The heuristic requires us to specify a rule to guide each splitting decision
I What rule should we use?
Next: Use ideas from probability theory
Intro ML (UofT) CSC311-Lec2 15 / 57

Probability in review
Three concepts you should page into memory for the next fifteen minutes:
Expectation: E_x[f(x)] = Σ_{x∈X} p(x) f(x)
Chain rule of probabilities: p(y|x) p(x) = p(x, y)
Marginalization of joint probabilities: p(x) = Σ_y p(x, y)
Intro ML (UofT) CSC311-Lec2

Choosing a Good Split
A feels like a better split, because the left-hand region is very certain about whether the fruit is an orange.
Can we quantify this?
Intro ML (UofT) CSC311-Lec2 17 / 57

Choosing a Good Split
How can we quantify uncertainty in prediction for a given leaf node?
I If all examples in leaf have same class: good, low uncertainty
I If each class has same amount of examples in leaf: bad, high uncertainty
Idea: Use counts at leaves to define probability distributions; use a probabilistic notion of uncertainty to decide splits.
A brief detour through information theory…
Intro ML (UofT) CSC311-Lec2

Entropy – Quantifying uncertainty
You may have encountered the term entropy as quantifying the state of chaos in chemical and physical systems.
In statistics, it is a property of a random variable.
The entropy of a discrete random variable is a number that quantifies
the uncertainty inherent in its possible outcomes.
The mathematical definition of entropy that we give in a few slides may
seem arbitrary, but it can be motivated axiomatically.
I If you’re interested, check, e.g., Elements of Information Theory by Cover and Thomas.
To explain entropy, consider flipping two different coins…
Intro ML (UofT) CSC311-Lec2 19 / 57

We Flip Two Different Coins
Each coin is a binary random variable with outcomes Heads (1) or Tails (0)
Sequence 1:
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 … ?
Sequence 2:
0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 … ?
(UofT) CSC311-Lec2

Quantifying Uncertainty
The entropy of a loaded coin with probability p of heads is given by
−p log2(p) − (1 − p) log2(1 − p)
For example:
−(8/9) log2(8/9) − (1/9) log2(1/9) ≈ 1/2
−(4/9) log2(4/9) − (5/9) log2(5/9) ≈ 0.99
Notice: the coin whose outcomes are more certain has a lower entropy.
In the extreme case p = 0 or p = 1, we were certain of the outcome before observing, so we gain no information by observing it, i.e., the entropy is 0 (using the convention 0 log2 0 = 0).
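A small sketch that evaluates this formula for the two coins above and for the extreme case:

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a coin with probability p of heads, with 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(round(binary_entropy(8/9), 2))  # ~0.50: mostly-heads coin, low uncertainty
print(round(binary_entropy(4/9), 2))  # ~0.99: nearly fair coin, high uncertainty
print(binary_entropy(1.0))            # 0.0: outcome is certain
```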
Intro ML (UofT) CSC311-Lec2 21 / 57

Quantifying Uncertainty
Can also think of entropy as the expected information content of a random draw from a probability distribution.
[Figure: the entropy −p log2(p) − (1 − p) log2(1 − p), in bits, plotted against the probability p of heads; it peaks at 1 bit at p = 0.5 and is 0 at p = 0 and p = 1.]
Shannon showed: you cannot store the outcome of a random draw using fewer expected bits than the entropy without losing information.
So the units of entropy are bits; a fair coin flip has 1 bit of entropy.
Intro ML (UofT) CSC311-Lec2

More generally, the entropy of a discrete random variable Y is given by
H(Y) = −Σ_{y∈Y} p(y) log2 p(y)
“High Entropy”:
I Variable has a uniform-like distribution over many outcomes
I Flat histogram
I Values sampled from it are less predictable
“Low Entropy”:
I Distribution is concentrated on only a few outcomes
I Histogram is concentrated in a few areas
I Values sampled from it are more predictable
[Slide credit: ]
Intro ML (UofT) CSC311-Lec2 23 / 57

Suppose we observe partial information X about a random variable Y
I For example, X = sign(Y).
We want to work towards a definition of the expected amount of information that will be conveyed about Y by observing X.
I Or equivalently, the expected reduction in our uncertainty about Y after observing X.
Intro ML (UofT) CSC311-Lec2 24 / 57

Entropy of a Joint Distribution
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}
                Cloudy    Not Cloudy
Raining         24/100    1/100
Not Raining     25/100    50/100

H(X, Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y)
        = −(24/100) log2(24/100) − (1/100) log2(1/100) − (25/100) log2(25/100) − (50/100) log2(50/100)
        ≈ 1.56 bits
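A sketch that recomputes H(X, Y) directly from the joint table above:

```python
import math

# Joint distribution p(x, y) from the table: rows = raining / not raining,
# columns = cloudy / not cloudy.
p_xy = {
    ("raining", "cloudy"): 24/100, ("raining", "not cloudy"): 1/100,
    ("not raining", "cloudy"): 25/100, ("not raining", "not cloudy"): 50/100,
}

H_XY = -sum(p * math.log2(p) for p in p_xy.values())
print(round(H_XY, 2))  # ~1.56 bits
```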
Intro ML (UofT) CSC311-Lec2

Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}
                Cloudy    Not Cloudy
Raining         24/100    1/100
Not Raining     25/100    50/100

What is the entropy of cloudiness Y, given that it is raining?

H(Y | X = raining) = −Σ_{y∈Y} p(y | raining) log2 p(y | raining)
                   = −(24/25) log2(24/25) − (1/25) log2(1/25)
                   ≈ 0.24 bits

We used: p(y|x) = p(x, y) / p(x), and p(x) = Σ_y p(x, y) (sum in a row)
Intro ML (UofT)
CSC311-Lec2

Conditional Entropy
                Cloudy    Not Cloudy
Raining         24/100    1/100
Not Raining     25/100    50/100

The expected conditional entropy:

H(Y | X) = E_x[H(Y | X = x)]
         = Σ_{x∈X} p(x) H(Y | X = x)
         = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(y|x)
CSC311-Lec2 27 / 57

Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}
                Cloudy    Not Cloudy
Raining         24/100    1/100
Not Raining     25/100    50/100

What is the entropy of cloudiness, given the knowledge of whether or not it is raining?

H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x)
         = (1/4) H(cloudy | is raining) + (3/4) H(cloudy | not raining)
         ≈ 0.75 bits
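A sketch that recomputes H(Y | X = raining) and the expected conditional entropy H(Y | X) from the same joint table (the table values are the only inputs):

```python
import math

p_xy = {
    ("raining", "cloudy"): 24/100, ("raining", "not cloudy"): 1/100,
    ("not raining", "cloudy"): 25/100, ("not raining", "not cloudy"): 50/100,
}

# Marginal p(x): sum the joint over y (sum across a row of the table).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

def H_Y_given(x):
    """H(Y | X = x) = -sum_y p(y|x) log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
    return -sum((p / p_x[x]) * math.log2(p / p_x[x])
                for (xx, y), p in p_xy.items() if xx == x and p > 0)

print(round(H_Y_given("raining"), 2))      # ~0.24 bits
print(round(H_Y_given("not raining"), 2))  # ~0.92 bits

H_Y_given_X = sum(p_x[x] * H_Y_given(x) for x in p_x)  # expected conditional entropy
print(round(H_Y_given_X, 2))               # ~0.75 bits
```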
Intro ML (UofT)
CSC311-Lec2

Conditional Entropy
Some useful properties:
I H is always non-negative
I Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
I If X and Y are independent, then X does not affect our uncertainty about Y: H(Y|X) = H(Y)
I But knowing Y makes our knowledge of Y certain: H(Y|Y) = 0
I By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y)
Intro ML (UofT) CSC311-Lec2 29 / 57

Information Gain
                Cloudy    Not Cloudy
Raining         24/100    1/100
Not Raining     25/100    50/100
How much more certain am I about whether it’s cloudy if I’m told whether it is raining? My uncertainty in Y minus my expected uncertainty that would remain in Y after seeing X.
This is called the information gain IG(Y |X) in Y due to X, or the mutual information of Y and X
IG(Y |X) = H(Y ) − H(Y |X)
If X is completely uninformative about Y: IG(Y|X) = 0
If X is completely informative about Y: IG(Y|X) = H(Y)
Intro ML (UofT) CSC311-Lec2
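Putting the pieces together for the rain/cloud example, a sketch that computes IG(Y|X) from the joint table; the resulting ≈ 0.25 bits is computed here from the table and is not a number stated on the slides.

```python
import math

p_xy = {
    ("raining", "cloudy"): 24/100, ("raining", "not cloudy"): 1/100,
    ("not raining", "cloudy"): 25/100, ("not raining", "not cloudy"): 50/100,
}

def entropy(dist):
    """Entropy (bits) of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_Y = entropy(p_y)  # uncertainty about cloudiness before observing X
H_Y_given_X = sum(p_x[x] * entropy({y: p / p_x[x]
                                    for (xx, y), p in p_xy.items() if xx == x})
                  for x in p_x)  # expected remaining uncertainty after observing X

print(round(H_Y - H_Y_given_X, 2))  # IG(Y|X) ~0.25 bits
```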

The lecture in review
Previously: Use ideas from probability theory to come up with a rule
Goal: What rule should we use?
The story thus far
I Entropy H(Y ) [bits]: characterizes the uncertainty in a draw of a random variable
I Conditional Entropy H(Y |X) [bits] : characterizes the uncertainty in a draw of Y after observing X
Intro ML (UofT) CSC311-Lec2 33 / 57

Revisiting Our Original Example
Information gain measures the informativeness of a variable, which is exactly what we desire in a decision tree split!
The information gain of a split: how much information (over the training set) about the class label Y is gained by knowing which side of a split you’re on.
Intro ML (UofT) CSC311-Lec2 34 / 57

Revisiting Our Original Example
What is the information gain of split B? Not terribly informative…
Root entropy of class outcome: H(Y) = −(2/7) log2(2/7) − (5/7) log2(5/7) ≈ 0.86
Leaf conditional entropy of class outcome: H(Y | left) ≈ 0.81, H(Y | right) ≈ 0.92
IG(split B) ≈ 0.86 − (4/7 · 0.81 + 3/7 · 0.92) ≈ 0.006
Intro ML (UofT) CSC311-Lec2 35 / 57

Revisiting Our Original Example
What is the information gain of split A? Very informative!
Root entropy of class outcome: H(Y) = −(2/7) log2(2/7) − (5/7) log2(5/7) ≈ 0.86
Leaf conditional entropy of class outcome: H(Y | left) = 0, H(Y | right) ≈ 0.97
IG(split A) ≈ 0.86 − (2/7 · 0 + 5/7 · 0.97) ≈ 0.17 !!
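A sketch reproducing both numbers from per-side class counts; the counts ([2, 0] and [3, 2] for split A, [3, 1] and [2, 1] for split B) are inferred from the stated leaf entropies rather than read off the figure.

```python
import math

def entropy(counts):
    """Entropy (bits) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(left_counts, right_counts):
    """IG of a binary split = root entropy minus weighted leaf entropies."""
    root = [l + r for l, r in zip(left_counts, right_counts)]
    n_l, n_r, n = sum(left_counts), sum(right_counts), sum(root)
    return entropy(root) - (n_l / n) * entropy(left_counts) - (n_r / n) * entropy(right_counts)

# Class counts on each side of the split, inferred from the stated entropies.
print(round(info_gain([2, 0], [3, 2]), 2))   # split A: ~0.17
print(round(info_gain([3, 1], [2, 1]), 3))   # split B: ~0.006
```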
Intro ML (UofT) CSC311-Lec2 36 / 57

Constructing Decision Trees
At each level, one must choose:
1. Which feature to split.
2. Possibly where to split it.
Choose them based on how much information we would gain from the decision! (choose the feature that gives the highest gain)
Intro ML (UofT) CSC311-Lec2 37 / 57

Decision Tree Construction Algorithm
Simple, greedy, recursive approach, builds up tree node-by-node
1. pick a feature to split at a non-terminal node
2. split examples into groups based on feature value
3. for each group:
I if no examples – return majority from parent
I else if all examples in same class – return class
I else loop to step 1
Terminates when all leaves contain only examples in the same class or are empty.
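Below is a minimal sketch of this greedy recursion for discrete features (continuous thresholds are discussed later); the feature names and toy data are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, labels, feature):
    """IG of splitting on a discrete feature: H(Y) - H(Y | feature)."""
    h = entropy(labels)
    for value in set(x[feature] for x in data):
        subset = [t for x, t in zip(data, labels) if x[feature] == value]
        h -= len(subset) / len(labels) * entropy(subset)
    return h

def build_tree(data, labels, features, parent_majority=None):
    if not labels:                              # no examples: return majority from parent
        return parent_majority
    if len(set(labels)) == 1 or not features:   # pure leaf (or nothing left to split on)
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(data, labels, f))  # greedy choice
    majority = Counter(labels).most_common(1)[0][0]
    tree = {"feature": best, "children": {}}
    for value in set(x[best] for x in data):    # recurse on each sub-partition
        idx = [i for i, x in enumerate(data) if x[best] == value]
        tree["children"][value] = build_tree([data[i] for i in idx],
                                             [labels[i] for i in idx],
                                             [f for f in features if f != best],
                                             majority)
    return tree

# Toy data (illustrative): predict whether to wait from two binary features.
data = [{"hungry": 1, "raining": 0}, {"hungry": 1, "raining": 1},
        {"hungry": 0, "raining": 0}, {"hungry": 0, "raining": 1}]
labels = ["wait", "wait", "leave", "leave"]
print(build_tree(data, labels, ["hungry", "raining"]))
```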
Questions for discussion:
I How do you choose the feature to split on?
I How do you choose the threshold for each feature?
Intro ML (UofT) CSC311-Lec2 38 / 57

Back to Our Example
[from: Russell & Norvig]
Intro ML (UofT) CSC311-Lec2 39 / 57

Feature Selection
IG(Y | X) = H(Y) − H(Y | X)
IG(Type) = 1 − [ (2/12) H(Y | Fr.) + (2/12) H(Y | It.) + (4/12) H(Y | Thai) + (4/12) H(Y | Bur.) ] = 0
IG(Patrons) = 1 − [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541
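A quick check of the IG(Patrons) number for this multi-way split, using the positive/negative counts implied by the H(·) terms above: (0, 2) for None, (4, 0) for Some, (2, 4) for Full.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# Positive/negative counts per Patrons value (None, Some, Full).
branches = [(0, 2), (4, 0), (2, 4)]
n_total = sum(a + b for a, b in branches)       # 12 examples, 6 positive / 6 negative
H_root = entropy([6, 6])                        # = 1 bit
H_cond = sum((a + b) / n_total * entropy([a, b]) for a, b in branches)
print(round(H_root - H_cond, 3))                # IG(Patrons) ~0.541
```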
Intro ML (UofT) CSC311-Lec2 40 / 57

Which Tree is Better? Vote!
Intro ML (UofT) CSC311-Lec2 41 / 57

What Makes a Good Tree?
Not too small: need to handle important but possibly subtle distinctions in data
Not too big:
I Computational efficiency (avoid redundant, spurious attributes)
I Avoid over-fitting training examples
I Human interpretability
“Occam’s Razor”: find the simplest hypothesis that fits the observations
I Useful principle, but hard to formalize (how to define simplicity?)
I See Domingos, 1999, “The role of Occam’s razor in knowledge discovery”
We desire small trees with informative nodes near the root
Intro ML (UofT) CSC311-Lec2 42 / 57

Decision Tree Miscellany
I You have exponentially less data at lower levels
I Too big of a tree can overfit the data
I Greedy algorithms don’t necessarily yield the global optimum
Handling continuous attributes
I Split based on a threshold, chosen to maximize information gain
Decision trees can also be used for regression on real-valued outputs. Choose splits to minimize squared error, rather than maximize information gain.
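A sketch of both ideas on made-up data: picking a threshold for a continuous feature by maximizing information gain (classification) or by minimizing squared error (regression), with candidate thresholds taken midway between sorted feature values.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold_classification(x, y):
    """Pick the threshold on feature x that maximizes information gain."""
    xs = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # midpoints between values
    def gain(thr):
        left = [t for v, t in zip(x, y) if v <= thr]
        right = [t for v, t in zip(x, y) if v > thr]
        return (entropy(y)
                - (len(left) / len(y)) * entropy(left)
                - (len(right) / len(y)) * entropy(right))
    return max(candidates, key=gain)

def best_threshold_regression(x, y):
    """Pick the threshold that minimizes total squared error around the two leaf means."""
    xs = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def sse(thr):
        total = 0.0
        for side in ([t for v, t in zip(x, y) if v <= thr],
                     [t for v, t in zip(x, y) if v > thr]):
            mean = sum(side) / len(side)
            total += sum((t - mean) ** 2 for t in side)
        return total
    return min(candidates, key=sse)

widths = [5.5, 6.0, 6.2, 7.0, 7.4, 8.1]
fruits = ["lemon", "lemon", "lemon", "orange", "orange", "orange"]
print(best_threshold_classification(widths, fruits))                      # 6.6
print(best_threshold_regression(widths, [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]))  # 6.6
```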
Intro ML (UofT) CSC311-Lec2 43 / 57

Comparison to some other classifiers
Advantages of decision trees over KNNs
Simple to deal with discrete features, missing values, and poorly scaled data
Fast at test time
More interpretable
Advantages of KNNs over decision trees
Few hyperparameters
Can incorporate interesting distance measures (e.g. shape contexts)
Intro ML (UofT) CSC311-Lec2

We’ve seen many classification algorithms.
We can combine multiple classifiers into an ensemble, which is a set of predictors whose individual decisions are combined in some way to classify new examples
I E.g., (possibly weighted) majority vote
For this to be nontrivial, the classifiers must differ somehow, e.g.
I Different algorithm
I Different choice of hyperparameters
I Trained on different data
I Trained with different weighting of the training examples
Next lecture, we will study some specific ensembling techniques.
Intro ML (UofT) CSC311-Lec2

Today, we deepen our understanding of generalization through a bias-variance decomposition.
I This will help us understand ensembling methods.
Intro ML (UofT) CSC311-Lec2 46 / 57

Bias-Variance Decomposition
Recall that overly simple models underfit the data, and overly complex models overfit.
We can quantify this effect in terms of the bias/variance decomposition.
I Bias and variance of what?
Intro ML (UofT) CSC311-Lec2 47 / 57

Bias-Variance Decomposition: Basic Setup
Suppose the training set D consists of pairs (x_i, t_i) sampled independent and identically distributed (i.i.d.) from a single data generating distribution p_sample.
Pick a fixed query point x (denoted with a green x).
Consider an experiment where we sample lots of training sets independently from p_sample.
Intro ML (UofT) CSC311-Lec2

Bias-Variance Decomposition: Basic Setup
Let’s run our learning algorithm on each training set, and compute its prediction y at the query point x.
We can view y as a random variable, where the randomness comes from the choice of training set.
The classification accuracy is determined by the distribution of y.
Intro ML (UofT) CSC311-Lec2 49 / 57

Bias-Variance Decomposition: Basic Setup
Here is the analogous setup for regression:
Since y is a random variable, we can talk about its expectation, variance, etc.
Intro ML (UofT) CSC311-Lec2 50 / 57

Bias-Variance Decomposition: Basic Setup
Recap of basic setup:
I Fix a query point x.
I Repeat:
I Sample a random training dataset D i.i.d. from the data generating distribution p_sample.
I Run the learning algorithm on D to get a prediction y at x.
I Sample the (true) target from the conditional distribution p(t|x).
I Compute the loss L(y, t).
Notice: y is independent of t. (Why?)
This gives a distribution over the loss at x, with expectation E[L(y, t) | x].
For each query point x, the expected loss is different. We are interested in minimizing the expectation of this with respect to x ∼ psample.
Intro ML (UofT) CSC311-Lec2 51 / 57

Bayes Optimality
For now, focus on squared error loss, L(y, t) = (1/2)(y − t)².
A first step: suppose we knew the conditional distribution p(t | x). What
value y should we predict?
I Here, we are treating t as a random variable and choosing y.
Claim: y* = E[t | x] is the best possible prediction.
Proof:
E[(y − t)² | x] = E[y² − 2yt + t² | x]
                = y² − 2y E[t | x] + E[t² | x]
                = y² − 2y E[t | x] + E[t | x]² + Var[t | x]
                = y² − 2y y* + y*² + Var[t | x]
                = (y − y*)² + Var[t | x]
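A quick numerical sanity check of the claim on a made-up discrete p(t | x): grid-searching over y recovers y* = E[t | x].

```python
# Check that y* = E[t | x] minimizes E[(y - t)^2 | x] for a toy conditional distribution.
ts = [0.0, 1.0, 3.0]          # possible target values given x (illustrative)
ps = [0.2, 0.5, 0.3]          # their conditional probabilities p(t | x)

def expected_sq_error(y):
    return sum(p * (y - t) ** 2 for p, t in zip(ps, ts))

y_star = sum(p * t for p, t in zip(ps, ts))       # E[t | x] = 1.4
candidates = [i / 100 for i in range(-100, 401)]  # grid search over y in [-1, 4]
best = min(candidates, key=expected_sq_error)
print(y_star, best)                               # both ~1.4
```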
(UofT) CSC311-Lec2 52 / 57

Bayes Optimality
E[(y − t)² | x] = (y − y*)² + Var[t | x]
The first term is nonnegative, and can be made 0 by setting y = y∗.
The second term corresponds to the inherent unpredictability, or noise, of the targets, and is called the Bayes error.
I This is the best we can ever hope to do with any learning algorithm. An algorithm that achieves it is Bayes optimal.
I Notice that this term doesn’t depend on y.
This process of choosing a single value y∗ based on p(t | x) is an example
of decision theory.
Intro ML (UofT) CSC311-Lec2 53 / 57

Bayes Optimality
Now return to treating y as a random variable (where the randomness comes from the choice of dataset).
We can decompose the expected loss (suppressing the conditioning on x for clarity):
E[(y − t)²] = E[(y − y*)²] + Var(t)
            = E[y*² − 2 y* y + y²] + Var(t)
            = y*² − 2 y* E[y] + E[y²] + Var(t)
            = y*² − 2 y* E[y] + E[y]² + Var(y) + Var(t)
            = (y* − E[y])²  +  Var(y)   +  Var(t)
                  bias          variance    Bayes error
CSC311-Lec2

Bayes Optimality
E[(y − t)²] = (y* − E[y])²  +  Var(y)   +  Var(t)
                  bias          variance    Bayes error
We just split the expected loss into three terms:
I bias: how wrong the expected prediction is (corresponds to underfitting)
I variance: the amount of variability in the predictions (corresponds to overfitting)
I Bayes error: the inherent unpredictability of the targets
Even though this analysis only applies to squared error, we often loosely use “bias” and “variance” as synonyms for “underfitting” and “overfitting”.
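A simulation sketch of this decomposition at a single query point; the data-generating process and the (deliberately biased) learner below are made up for illustration, and the check is that bias² + variance + Bayes error matches a Monte Carlo estimate of E[(y − t)²].

```python
import random

random.seed(0)

NOISE_SD = 0.5                   # Bayes error = Var(t | x) = NOISE_SD**2 = 0.25
x_query = 1.0
y_star = 2.0 * x_query + 1.0     # E[t | x] for the true function t = 2x + 1 + noise

def sample_dataset(n=10):
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]
    ts = [2.0 * x + 1.0 + random.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ts

def fit_and_predict(xs, ts, x):
    """A deliberately simple (and biased) learner: least squares through the origin."""
    w = sum(a * t for a, t in zip(xs, ts)) / sum(a * a for a in xs)
    return w * x

runs = 20000
ys, losses = [], []
for _ in range(runs):
    xs, ts = sample_dataset()                              # new training set each run
    y = fit_and_predict(xs, ts, x_query)                   # prediction at the query point
    t = 2.0 * x_query + 1.0 + random.gauss(0.0, NOISE_SD)  # fresh target t ~ p(t | x)
    ys.append(y)
    losses.append((y - t) ** 2)

mean_y = sum(ys) / runs
bias_sq = (y_star - mean_y) ** 2
variance = sum((y - mean_y) ** 2 for y in ys) / runs
bayes = NOISE_SD ** 2
print(round(bias_sq, 3), round(variance, 3), round(bayes, 3))
# The next two numbers should agree up to Monte Carlo error.
print(round(bias_sq + variance + bayes, 3), round(sum(losses) / runs, 3))
```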
Intro ML (UofT) CSC311-Lec2 55 / 57

Bias and Variance
Throwing darts = predictions for each draw of a dataset
Be careful, what doesn’t this capture?
I We average over points x from the data distribution.
Intro ML (UofT) CSC311-Lec2 56 / 57
