Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Three Segments
Outline
● Supervised Learning
● Decision Trees
● Model Selection and Optimization
Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Segment 1 of 3: Supervised Learning
Machine Learning in General (mostly slide 4 from Mtg 22)
● Types of ML
○ Replicate a pattern given by a supervision signal
○ Discover new patterns (unsupervised learning)
○ Learn through trial and error by interacting with environment and receiving
a reinforcement signal (reward; learn a Markov Decision Process model)
● Supervised machine learning types:
○ Classification, e.g., Naive Bayes
○ Sequence prediction, e.g., HMM
○ Regression (Mtg 34)
Supervised Statistical Machine Learning
● Given a set of independent and identically distributed (i.i.d.) training
examples (x(1){1:m}, y(1)) … (x(n){1:m}, y(n))
○ Assume each pair was generated by an unknown function y=f(x)
○ Discover a function y=h(x) where h approximates f
○ The labeled data (x(1){1:m}, y(1)) … (x(n){1:m}, y(n)) represents the
ground truth
● Test the generalization ability of h on labeled examples that are not in
the training set
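The setup above can be sketched in a few lines of Python. This is an illustrative example, not from the slides: the "unknown" function f, the data sizes, and the choice of a 1-nearest-neighbour hypothesis h are all assumptions made for the demo.

```python
import random

random.seed(0)

# Hypothetical target: the "unknown" function f labels a point positive
# when its two attributes sum to more than 1
def f(x):
    return int(x[0] + x[1] > 1.0)

# i.i.d. labeled examples (x, y), each pair generated by y = f(x)
xs = [(random.random(), random.random()) for _ in range(200)]
examples = [(x, f(x)) for x in xs]
train, test = examples[:150], examples[150:]

# One simple hypothesis h approximating f: 1-nearest neighbour over the
# training set (any supervised learner could stand in here)
def h(x):
    nearest = min(train, key=lambda ex: (ex[0][0] - x[0]) ** 2
                                      + (ex[0][1] - x[1]) ** 2)
    return nearest[1]

# Generalization is tested on labeled examples NOT in the training set
test_accuracy = sum(h(x) == y for x, y in test) / len(test)
print(test_accuracy)
```

Because h is evaluated only on held-out pairs, the printed accuracy estimates how well h approximates f beyond the sample it was fit to.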
Bias Variance Tradeoff
● AIMA defines bias in terms of the selected hypothesis space (e.g.,
linear functions versus sinusoidal)
● AIMA defines variance as arising from the choice of training data
(variance across possible training sets)
Another View of Bias Variance Tradeoff
● Expected prediction error (EPE) for a new observation with value x is given by:
EPE = E[(Y − f̂(x))²] = Var(ε) + Bias² + Var(f̂(x))
● Var(ε) is the irreducible error (noise) apart from bias and estimation variance
● Bias is the result of misspecifying the statistical model f
● Estimation variance is the result of using a sample to estimate f
● Modeling goals:
○ Explanatory modeling attempts to minimize bias, i.e., to find a theoretical
explanation that holds, e.g., across categories of datasets
○ Predictive modeling aims for empirical precision (minimize bias and estimation variance)
Galit Shmueli, 2010, To Explain or Predict, Statistical Science
Ockham’s Razor, and Underfitting versus Overfitting
● Ockham’s razor: choose the simplest model that “works”
● Underfitting: the model fails to find a pattern that exists in the data
○ Not common; training usually continues until a good fit is achieved
○ Solution: choose a more complex model, and find more data to enable
learning the more complex model
● Overfitting: the model finds a pattern that exists in the sample, but the
pattern does not generalize across samples
○ Fairly common
○ Solution: simplify the model, and sample the data more rigorously
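The overfitting case can be made concrete with a small sketch (all names, sizes, and the 25% noise rate are illustrative choices, not from the slides): a 1-nearest-neighbour model memorizes the training sample, noise included, while a simpler threshold rule cannot.

```python
import random

random.seed(1)

# True concept for this illustration: x > 0.5, with 25% of labels flipped
def noisy_label(x):
    y = int(x > 0.5)
    return 1 - y if random.random() < 0.25 else y

train = [(x, noisy_label(x)) for x in (random.random() for _ in range(200))]
test = [(x, noisy_label(x)) for x in (random.random() for _ in range(1000))]

def h_simple(x):
    # simple threshold rule: too coarse to fit the label noise
    return int(x > 0.5)

def h_memorize(x):
    # 1-nearest neighbour: effectively memorizes the training sample,
    # including its noisy labels
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def accuracy(h, data):
    return sum(h(x) == y for x, y in data) / len(data)

print(accuracy(h_memorize, train))  # perfect fit to the sample: 1.0
print(accuracy(h_memorize, test))   # drops on fresh data: overfitting
print(accuracy(h_simple, test))     # the simpler rule generalizes better
```

The memorizing model's pattern exists in the sample but does not generalize across samples, which is exactly the overfitting scenario described above.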
Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Segment 2 of 3: Decision Trees
Decision Trees
● Supervised learning:
○ Usually of a classifier
○ Can learn a regression
○ CART (classification and regression trees) can learn either one
● Basic idea, using a binary decision goal (e.g., ±C) and n binary attributes A:
○ From your training data, find the one attribute (e.g., A1) that best splits the data into
two disjoint sets corresponding as closely as possible to +C and −C
○ If each set is “pure” (has only +C or only −C), the algorithm terminates
○ Else, for each of the 2 sets, find the attribute from A\{A1} that best divides each set
into two maximally pure classes
○ Iterate
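One greedy step of this procedure can be sketched as follows. This is a sketch under stated assumptions: it scores splits by weighted misclassification impurity (the slides later use entropy-based information gain instead), and the data, attribute names, and helper names are all invented for the demo.

```python
def impurity(labels):
    # fraction of minority labels in a set of 0/1 classes; 0 means "pure"
    if not labels:
        return 0.0
    pos = sum(labels)
    return min(pos, len(labels) - pos) / len(labels)

def best_split(examples, attributes):
    # examples: list of (binary-attribute dict, 0/1 class) pairs; pick the
    # attribute whose two subsets are, on weighted average, most pure
    def weighted_impurity(a):
        left = [y for x, y in examples if x[a]]
        right = [y for x, y in examples if not x[a]]
        n = len(examples)
        return (len(left) / n) * impurity(left) + \
               (len(right) / n) * impurity(right)
    return min(attributes, key=weighted_impurity)

# hypothetical toy data: A1 perfectly separates the classes, A2 does not
data = [({'A1': 1, 'A2': 1}, 1), ({'A1': 1, 'A2': 0}, 1),
        ({'A1': 0, 'A2': 1}, 0), ({'A1': 0, 'A2': 0}, 0)]
print(best_split(data, ['A1', 'A2']))  # A1
```

Iterating this selection on each resulting subset, with the chosen attribute removed, yields the full tree-growing loop described above.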
Decision Tree Example: the Data
● Attributes: Alt(ernate) nearby, open Fri & Sat, takes Res(ervations), . . .
● Attribute that nearly evenly splits the data into Yes/No: Pat(rons)
Decision Trees Learn Decision Rules
● A path from root to leaf is a decision rule
○ E.g., if Patrons == None → No
● This tree has 13 leaves (paths, or decision
rules)
○ The root attribute (Patrons) has 3 values
■ None (n=2)
■ Some (n=4)
■ Full (n=6)
● A tree with fewer paths would be more
compact (simpler), thus preferred
Decision Trees Learn Decision Rules
● Green = yes, red = no
● Patrons is better than Type as a first way to split the data
○ Type: All the sets for each value of type have 50% positive, 50% negative
○ Patrons: Two of the sets have 100% in one class, and the other has a 2/4 (yes/no) split
Role of Informativeness of Attributes
● This tree has 8 leaves (paths, or decision rules)
○ This tree is more compact
○ Assuming they generalize equally well,
this tree is preferable
● Patrons is more informative than Type
because two of its three children are
“pure”, accounting for 50% of the data
● Once Patrons is chosen, then the next attribute should be the
one that is the most informative (most pure classes)
● How can we rank attributes from most to least informative?
Entropy
● Entropy measures the uncertainty of a random variable; the more uncertainty
in the variable, the more information is gained by observing its value
● Recall the use of mutual information for feature selection for NB classifiers
(Mtg 22, slide 24)
○ MI was presented there as a measure of how dependent X and Y are
○ It can also be seen as a measure of how much information X and Y convey about
each other: I(X;Y) = H(X) + H(Y) − H(X,Y)
Information Gain
● Measures the expected reduction in entropy from splitting on each attribute
● Start with a measure of the entropy of the dataset with respect to the
desired classes C = {c1, c2, . . . cn}
○ One class: H(C) = 0
○ Two equal classes: H(C) = 1
○ Three classes with p(c1 ) = ⅙, p(c2 ) = ⅓, p(c3 )=½: H(C) = 1.46
○ Three equal classes: H(C) = 1.58
○ Ten equal classes: H(C) = 3.32
● Information gain of an attribute A: Gain(A) = H(C) − Σv (|Ev|/|E|) · H(Cv),
where Ev is the subset of examples with value v for A
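The entropy values listed above can be checked with a short script (a sketch; the helper name H is my own choice):

```python
import math

def H(probs):
    # Shannon entropy in bits; terms with p = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(H([1.0]), 2))            # one class: 0.0
print(round(H([1/2, 1/2]), 2))       # two equal classes: 1.0
print(round(H([1/6, 1/3, 1/2]), 2))  # 1.46
print(round(H([1/3, 1/3, 1/3]), 2))  # three equal classes: 1.58
print(round(H([1/10] * 10), 2))      # ten equal classes: 3.32
```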
Information Gain of the Patrons Attribute
● Each value v of Patrons (None, Some, Full) applies to |Ev|/N of the data,
where B(q) is the entropy of a Boolean variable that is true with probability q:
Gain(Patrons) = B(6/12) − [2/12 · B(0/2) + 4/12 · B(4/4) + 6/12 · B(2/6)] ≈ 0.541 bits
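This computation can be checked numerically. A sketch, assuming the standard AIMA restaurant dataset counts (6 positive and 6 negative examples; Patrons splits them into None 0+/2−, Some 4+/0−, Full 2+/4−):

```python
import math

def B(q):
    # entropy (bits) of a Boolean variable that is true with probability q
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

prior = B(6 / 12)  # entropy of the whole dataset: 6 Yes, 6 No
remainder = (2/12) * B(0/2) + (4/12) * B(4/4) + (6/12) * B(2/6)
gain_patrons = prior - remainder
print(round(gain_patrons, 3))  # 0.541
```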
Decision Tree Algorithm
● Nodes consist of tests on attributes
● Edges correspond to attribute values
● Leaves consist of output values
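The learning algorithm (whose pseudocode figure appears in AIMA 19.3) can be sketched as follows. The tree representation as (attribute, branches) tuples and all helper names are my own choices, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def plurality(labels):
    # most common output value (ties broken arbitrarily)
    return Counter(labels).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples=()):
    # examples: list of (attribute-dict, label) pairs
    labels = [y for _, y in examples]
    if not examples:                 # no examples left: use parent plurality
        return plurality([y for _, y in parent_examples])
    if len(set(labels)) == 1:        # pure set: leaf holds the output value
        return labels[0]
    if not attributes:               # attributes exhausted: plurality leaf
        return plurality(labels)

    def remainder(a):
        # expected entropy remaining after testing attribute a
        total = 0.0
        for v in {x[a] for x, _ in examples}:
            sub = [y for x, y in examples if x[a] == v]
            total += len(sub) / len(examples) * entropy(sub)
        return total

    a = min(attributes, key=remainder)     # most informative attribute
    branches = {}
    for v in {x[a] for x, _ in examples}:  # one edge per attribute value
        sub = [(x, y) for x, y in examples if x[a] == v]
        branches[v] = learn_tree(sub, [b for b in attributes if b != a],
                                 examples)
    return (a, branches)                   # internal node: a test on a

def predict(tree, x):
    while isinstance(tree, tuple):
        a, branches = tree
        tree = branches[x[a]]
    return tree

# Toy usage: attribute A perfectly separates the classes, so it is chosen
data = [({'A': 0, 'B': 0}, 'n'), ({'A': 0, 'B': 1}, 'n'),
        ({'A': 1, 'B': 0}, 'y'), ({'A': 1, 'B': 1}, 'y')]
tree = learn_tree(data, ['A', 'B'])
print(predict(tree, {'A': 1, 'B': 1}))  # y
```

Minimizing the remaining entropy is equivalent to maximizing information gain, since the prior entropy H(C) is the same for every candidate attribute.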
Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Segment 3 of 3: Model Selection and
Optimization
Finding a Hypothesis Function for a Dataset
● Choose a model (meaning type of model)
○ In this context, choosing a model means choosing a hypothesis space,
e.g., linear function, polynomial function
○ In other contexts, model can mean model + hyperparameters (e.g., degree-2
polynomial), or a specific model (e.g., y = 5x² + 3x + 2)
● Optimize (or train the model)
○ Find the best hypothesis (instantiated model)
○ Training relies on a training set and a smaller validation (or dev) set for
developing the model
Optimization: Choosing Model Hyperparameters
● Training set error tends to decrease as complexity of model increases
● Because training error can in many cases be reduced nearly to zero, the
validation set serves as a check on overfitting
[Figure: validation-set error versus model complexity for two learned models,
a decision tree and a convolutional neural network]
Loss Functions: An Objective to Minimize
● Error rate alone can mask differences among error types (e.g., all errors in one class)
● A loss function can be used as a training objective to minimize error for all
classes
● Most generally, loss should take x into account: L(x, y, ŷ)
● Usually, x is ignored: L(y, ŷ)
Generalization Loss versus Empirical Loss
● If the set ε of all possible pairs (x, y) is known, then generalization loss
can be used (e.g., for simulated data):
GenLossL(h) = Σ(x,y)∈ε L(y, h(x)) P(x, y)
● Otherwise, empirical loss for a dataset E is used, where N = |E| (this is
far more typical):
EmpLossL,E(h) = (1/N) Σ(x,y)∈E L(y, h(x))
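A minimal sketch of empirical loss over a dataset; the dataset values, the candidate hypothesis, and both loss functions are illustrative assumptions:

```python
def empirical_loss(h, E, L):
    # EmpLoss_{L,E}(h) = (1/N) * sum over (x, y) in E of L(y, h(x))
    return sum(L(y, h(x)) for x, y in E) / len(E)

# Illustrative dataset E and candidate hypothesis h (all values made up)
E = [(1, 2.1), (2, 3.9), (3, 6.2)]
h = lambda x: 2 * x

squared = lambda y, y_hat: (y - y_hat) ** 2   # squared-error loss
zero_one = lambda y, y_hat: int(y != y_hat)   # 0/1 loss

print(round(empirical_loss(h, E, squared), 3))  # 0.02
print(empirical_loss(h, E, zero_one))           # 1.0: no exact matches
```

The same hypothesis scores very differently under the two loss functions, which is why the choice of training objective matters.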
Regularization
● Regularization aims to minimize the complexity of the model
● Choice of regularization function depends on the hypothesis space
○ For example, regularization for a linear regression (weights on each
attribute) often uses one of the two following regularizers
■ Shrink all small weights to zero (eliminates attributes; Lasso or L1)
■ Prevent very large weights (smooths the weights, uses all features; Ridge or
L2)
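The two penalty terms themselves are easy to write down (a sketch; function names and the weight vector are illustrative):

```python
def l1_penalty(weights):
    # Lasso-style regularizer: sum of |w|; during training it drives small
    # weights to exactly zero, effectively eliminating attributes
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Ridge-style regularizer: sum of w^2; it penalizes very large weights,
    # smoothing the weights while keeping all features
    return sum(w * w for w in weights)

w = [0.5, 3.0, -4.0]
print(l1_penalty(w))  # 7.5
print(l2_penalty(w))  # 25.25
```

Note how L2 punishes the large weight −4.0 (16.0 of the 25.25) far more than L1 does, which is why it smooths rather than sparsifies.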
Cost Combines Model Selection and Optimization
● The cost of a hypothesis can be considered as the sum of the loss and
the regularization: Cost(h) = EmpLoss(h) + λ · Complexity(h)
○ Lower loss is equivalent to reducing all the error types
○ Lower regularization term is equivalent to a simpler model
● Cost formalizes Ockham’s razor, in an empirical way
○ Ockham: “A plurality of entities should not be posited without necessity”
○ If necessity is interpreted as empirically predictive, then cost directly
formalizes Ockham’s razor
○ If necessity is interpreted as providing a more useful explanation with
respect to a theory of the world, then cost is only part of the story
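The trade formalized by cost can be illustrated numerically. A sketch under stated assumptions: complexity is taken to be the number of nonzero weights (one simple choice), and λ, the losses, and the weight vectors are all made up:

```python
lam = 0.1  # regularization strength (illustrative value)

def cost(emp_loss, weights):
    # Cost(h) = EmpLoss(h) + lam * Complexity(h); here Complexity(h) is
    # the number of nonzero weights
    complexity = sum(1 for w in weights if w != 0)
    return emp_loss + lam * complexity

# A slightly worse fit with one parameter vs. a better fit with three
simple_h = cost(0.20, [1.5, 0.0, 0.0])
complex_h = cost(0.15, [1.2, 0.4, -0.3])
print(round(simple_h, 2), round(complex_h, 2))  # 0.3 0.45
```

Here the simpler hypothesis wins on cost despite its higher loss, which is Ockham's razor expressed empirically.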
Summary, One
● Applying supervised machine learning to a classification or regression problem
requires i.i.d. labeled examples split into training, validation, and test data
○ In contrast, reinforcement learning requires a reward signal from the
environment for learning and for testing. It cannot learn from labeled
data, but can learn through interaction with a simulated environment
○ Training includes model selection (choosing a hypothesis space that
avoids overfitting or underfitting) and optimization (finding the
hypothesis in that space that minimizes loss)
● The learned classifier or regression model is a hypothesis or function whose
input consists of examples represented as attributes (x(i){1:m}), and whose
output is a hypothesized y(i)
Summary, Two
● Statistical machine learning (predictive modeling) aims to minimize bias (from
the choice of model class) and estimation variance
● Optimizing a loss function rather than raw error rate minimizes error over all
output classes
● Depending on the choice of model class, regularization can be used to avoid
overfitting by eliminating some attributes (L1 regularization) or by evening out
the influence of all attributes (L2 regularization)
● By Ockham’s razor, a final choice of learned model h should minimize Cost(h),
which is the sum of Loss (prediction performance) and the Regularization
(model simplicity)