Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Three Segments
Outline
● Supervised Learning
● Decision Trees
● Model Selection and Optimization
Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Segment 1 of 3: Supervised Learning
Machine Learning in General (mostly slide 4 from Mtg 22)
● Types of ML
○ Replicate a pattern given by a supervision signal
○ Discover new patterns (unsupervised learning)
○ Learn through trial and error by interacting with environment and receiving
a reinforcement signal (reward; learn a Markov Decision Process model)
● Supervised machine learning types:
○ Classification, e.g., Naive Bayes
○ Sequence prediction, e.g., HMM
○ Regression (Mtg 34)
Supervised Statistical Machine Learning
● Given a set of independent and identically distributed (i.i.d.) training
examples (x(1){1:m}, y(1)) … (x(n){1:m}, y(n))
○ Assume each pair was generated by an unknown function y=f(x)
○ Discover a function y=h(x) where h approximates f
○ The labeled data (x(1){1:m}, y(1)) … (x(n){1:m}, y(n)) represents the
ground truth
● Test the generalization ability of h on labeled examples that are not in
the training set
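The setup above can be sketched in a few lines of Python. This is an illustrative example, not from the slides: the "unknown" function f, the data sizes, and the choice of a 1-nearest-neighbour hypothesis h are all assumptions made for the demo.

```python
import random

random.seed(0)

# Hypothetical target: the "unknown" function f labels a point positive
# when its two attributes sum to more than 1
def f(x):
    return int(x[0] + x[1] > 1.0)

# i.i.d. labeled examples (x, y), each pair generated by y = f(x)
xs = [(random.random(), random.random()) for _ in range(200)]
examples = [(x, f(x)) for x in xs]
train, test = examples[:150], examples[150:]

# One simple hypothesis h approximating f: 1-nearest neighbour over the
# training set (any supervised learner could stand in here)
def h(x):
    nearest = min(train, key=lambda ex: (ex[0][0] - x[0]) ** 2
                                      + (ex[0][1] - x[1]) ** 2)
    return nearest[1]

# Generalization is tested on labeled examples NOT in the training set
test_accuracy = sum(h(x) == y for x, y in test) / len(test)
print(test_accuracy)
```

Because h is evaluated only on held-out pairs, the printed accuracy estimates how well h approximates f beyond the sample it was fit to.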
Bias Variance Tradeoff
● AIMA defines bias in terms of the selected hypothesis space (e.g.,
linear functions versus sinusoidal)
● AIMA defines variance as arising from the choice of training data
(variance across possible training sets)
Another View of Bias Variance Tradeoff
● Expected prediction error (EPE) for a new observation with value x is given by:
EPE = E[(Y − f̂(x))²] = Var(ε) + Bias² + Var(f̂(x))
● Var(ε) is the irreducible error (noise) apart from bias and estimation variance
● Bias is the result of misspecifying the statistical model f
● Estimation variance is the result of using a sample to estimate f
● Modeling goals:
○ Explanatory modeling attempts to minimize bias, i.e., to find a theoretical
explanation that holds, e.g., across categories of datasets
○ Predictive modeling aims for empirical precision (minimize bias and estimation variance)
Galit Shmueli, 2010, To Explain or Predict, Statistical Science
Ockham’s Razor, and Underfitting versus Overfitting
● Ockham’s razor: choose the simplest model that “works”
● Underfitting: the model fails to find a pattern that exists in the data
○ Not common; training usually continues until a good fit is achieved
○ Solution: choose a more complex model, and find more data to enable
learning the more complex model
● Overfitting: the model finds a pattern that exists in the sample, but the
pattern does not generalize across samples
○ Fairly common
○ Solution: simplify the model, and sample the data more rigorously
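The overfitting case can be made concrete with a small sketch (all names, sizes, and the 25% noise rate are illustrative choices, not from the slides): a 1-nearest-neighbour model memorizes the training sample, noise included, while a simpler threshold rule cannot.

```python
import random

random.seed(1)

# True concept for this illustration: x > 0.5, with 25% of labels flipped
def noisy_label(x):
    y = int(x > 0.5)
    return 1 - y if random.random() < 0.25 else y

train = [(x, noisy_label(x)) for x in (random.random() for _ in range(200))]
test = [(x, noisy_label(x)) for x in (random.random() for _ in range(1000))]

def h_simple(x):
    # simple threshold rule: too coarse to fit the label noise
    return int(x > 0.5)

def h_memorize(x):
    # 1-nearest neighbour: effectively memorizes the training sample,
    # including its noisy labels
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def accuracy(h, data):
    return sum(h(x) == y for x, y in data) / len(data)

print(accuracy(h_memorize, train))  # perfect fit to the sample: 1.0
print(accuracy(h_memorize, test))   # drops on fresh data: overfitting
print(accuracy(h_simple, test))     # the simpler rule generalizes better
```

The memorizing model's pattern exists in the sample but does not generalize across samples, which is exactly the overfitting scenario described above.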
Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Segment 2 of 3: Decision Trees
Decision Trees
● Supervised learning:
○ Usually of a classifier
○ Can learn a regression
○ CART (classification and regression trees) can learn either one
● Basic idea, using a binary decision goal (e.g., ±C) and n binary attributes A:
○ From your training data, find the one attribute (e.g., A1) that best splits the data into
two disjoint sets corresponding as closely as possible to +C and −C
○ If each set is “pure” (has only +C or only −C), the algorithm terminates
○ Else, for each of the 2 sets, find the attribute from A\{A1} that best divides each set
into two maximally pure classes
○ Iterate
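One greedy step of this procedure can be sketched as follows. This is a sketch under stated assumptions: it scores splits by weighted misclassification impurity (the slides later use entropy-based information gain instead), and the data, attribute names, and helper names are all invented for the demo.

```python
def impurity(labels):
    # fraction of minority labels in a set of 0/1 classes; 0 means "pure"
    if not labels:
        return 0.0
    pos = sum(labels)
    return min(pos, len(labels) - pos) / len(labels)

def best_split(examples, attributes):
    # examples: list of (binary-attribute dict, 0/1 class) pairs; pick the
    # attribute whose two subsets are, on weighted average, most pure
    def weighted_impurity(a):
        left = [y for x, y in examples if x[a]]
        right = [y for x, y in examples if not x[a]]
        n = len(examples)
        return (len(left) / n) * impurity(left) + \
               (len(right) / n) * impurity(right)
    return min(attributes, key=weighted_impurity)

# hypothetical toy data: A1 perfectly separates the classes, A2 does not
data = [({'A1': 1, 'A2': 1}, 1), ({'A1': 1, 'A2': 0}, 1),
        ({'A1': 0, 'A2': 1}, 0), ({'A1': 0, 'A2': 0}, 0)]
print(best_split(data, ['A1', 'A2']))  # A1
```

Iterating this selection on each resulting subset, with the chosen attribute removed, yields the full tree-growing loop described above.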
Decision Tree Example: the Data
● Attributes: Alt(ernate) nearby, open Fri & Sat, takes Res(ervations), . . .
● Attribute that nearly evenly splits the data into Yes/No: Pat(rons)
Decision Trees Learn Decision Rules
● A path from root to leaf is a decision rule
○ E.g., if Patrons == None → No
● This tree has 13 leaves (paths, or decision
rules)
○ The root attribute (Patrons) has 3 values
■ None (n=2)
■ Some (n=4)
■ Full (n=6)
● A tree with fewer paths would be more
compact (simpler), thus preferred
Decision Trees Learn Decision Rules
● Green = yes, red = no
● Patrons is better than Type as a first way to split the data
○ Type: All the sets for each value of type have 50% positive, 50% negative
○ Patrons: Two of the sets have 100% in one class, and the other has a 2/4 (yes/no) split
Role of Informativeness of Attributes
● This tree has 8 leaves (paths, or decision rules)
○ This tree is more compact
○ Assuming they generalize equally well,
this tree is preferable
● Patrons is more informative than Type
because two of its three children are
“pure”, accounting for 50% of the data
● Once Patrons is chosen, then the next attribute should be the
one that is the most informative (most pure classes)
● How can we rank attributes from most to least informative?
Entropy
● Entropy measures the uncertainty of a random variable; the more uncertainty
in the variable, the more information is gained by observing its value
● Recall the use of mutual information for feature selection for NB classifiers
(Mtg 22, slide 24)
○ MI was presented there as a measure of how dependent X and Y are
○ It can also be seen as a measure of how much information X and Y convey about
each other: I(X;Y) = H(X) + H(Y) − H(X,Y)
Information Gain
● Measures the expected reduction in entropy from splitting on each attribute
● Start with a measure of the entropy of the dataset with respect to the
desired classes C = {c1, c2, . . . cn}
○ One class: H(C) = 0
○ Two equal classes: H(C) = 1
○ Three classes with p(c1 ) = ⅙, p(c2 ) = ⅓, p(c3 )=½: H(C) = 1.46
○ Three equal classes: H(C) = 1.58
○ Ten equal classes: H(C) = 3.32
● Information gain of an attribute A: Gain(A) = H(C) − Σv (|Ev|/|E|) · H(Cv),
where Ev is the subset of examples with value v for A
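The entropy values listed above can be checked with a short script (a sketch; the helper name H is my own choice):

```python
import math

def H(probs):
    # Shannon entropy in bits; terms with p = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(H([1.0]), 2))            # one class: 0.0
print(round(H([1/2, 1/2]), 2))       # two equal classes: 1.0
print(round(H([1/6, 1/3, 1/2]), 2))  # 1.46
print(round(H([1/3, 1/3, 1/3]), 2))  # three equal classes: 1.58
print(round(H([1/10] * 10), 2))      # ten equal classes: 3.32
```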
Information Gain of the Patrons Attribute
● Each value v of Patrons (None, Some, Full) applies to |Ev|/N of the data,
where B(q) is the entropy of a Boolean variable that is true with probability q:
Gain(Patrons) = B(6/12) − [2/12 · B(0/2) + 4/12 · B(4/4) + 6/12 · B(2/6)] ≈ 0.541 bits
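This computation can be checked numerically. A sketch, assuming the standard AIMA restaurant dataset counts (6 positive and 6 negative examples; Patrons splits them into None 0+/2−, Some 4+/0−, Full 2+/4−):

```python
import math

def B(q):
    # entropy (bits) of a Boolean variable that is true with probability q
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

prior = B(6 / 12)  # entropy of the whole dataset: 6 Yes, 6 No
remainder = (2/12) * B(0/2) + (4/12) * B(4/4) + (6/12) * B(2/6)
gain_patrons = prior - remainder
print(round(gain_patrons, 3))  # 0.541
```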
Decision Tree Algorithm
● Nodes consist of tests on attributes
● Edges correspond to attribute values
● Leaves consist of output values
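The learning algorithm (whose pseudocode figure appears in AIMA 19.3) can be sketched as follows. The tree representation as (attribute, branches) tuples and all helper names are my own choices, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def plurality(labels):
    # most common output value (ties broken arbitrarily)
    return Counter(labels).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples=()):
    # examples: list of (attribute-dict, label) pairs
    labels = [y for _, y in examples]
    if not examples:                 # no examples left: use parent plurality
        return plurality([y for _, y in parent_examples])
    if len(set(labels)) == 1:        # pure set: leaf holds the output value
        return labels[0]
    if not attributes:               # attributes exhausted: plurality leaf
        return plurality(labels)

    def remainder(a):
        # expected entropy remaining after testing attribute a
        total = 0.0
        for v in {x[a] for x, _ in examples}:
            sub = [y for x, y in examples if x[a] == v]
            total += len(sub) / len(examples) * entropy(sub)
        return total

    a = min(attributes, key=remainder)     # most informative attribute
    branches = {}
    for v in {x[a] for x, _ in examples}:  # one edge per attribute value
        sub = [(x, y) for x, y in examples if x[a] == v]
        branches[v] = learn_tree(sub, [b for b in attributes if b != a],
                                 examples)
    return (a, branches)                   # internal node: a test on a

def predict(tree, x):
    while isinstance(tree, tuple):
        a, branches = tree
        tree = branches[x[a]]
    return tree

# Toy usage: attribute A perfectly separates the classes, so it is chosen
data = [({'A': 0, 'B': 0}, 'n'), ({'A': 0, 'B': 1}, 'n'),
        ({'A': 1, 'B': 0}, 'y'), ({'A': 1, 'B': 1}, 'y')]
tree = learn_tree(data, ['A', 'B'])
print(predict(tree, {'A': 1, 'B': 1}))  # y
```

Minimizing the remaining entropy is equivalent to maximizing information gain, since the prior entropy H(C) is the same for every candidate attribute.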
Introduction to Machine Learning:
Learning from Examples
AIMA 19.1 – 19.4
CMPSC 442
Week 11, Meeting 33, Segment 3 of 3: Model Selection and
Optimization
Finding a Hypothesis Function for a Dataset
● Choose a model (meaning type of model)
○ In this context, choosing a model means choosing a hypothesis space,
e.g., linear function, polynomial function
○ In other contexts, model can mean model + hyperparameters (e.g., degree-2
polynomial), or a specific model (e.g., y = 5x² + 3x + 2)
● Optimize (or train the model)
○ Find the best hypothesis (instantiated model)
○ Training relies on a training set and a smaller validation (or dev) set for
developing the model
Optimization: Choosing Model Hyperparameters
● Training set error tends to decrease as complexity of model increases
● Because training error can in many cases be reduced nearly to zero, the
validation set serves as a check on overfitting
[Figure: validation-set error versus model complexity for two learned models,
a decision tree and a convolutional neural network]
Loss Functions: An Objective to Minimize
● Error rate alone can mask differences among error types (e.g., all errors in one class)
● A loss function can be used as a training objective to minimize error for all
classes
● Most generally, loss should take x into account: L(x, y, ŷ)
● Usually, x is ignored: L(y, ŷ)
Generalization Loss versus Empirical Loss
● If the set ε of all possible pairs (x, y) is known, then generalization loss
can be used (e.g., for simulated data):
GenLossL(h) = Σ(x,y)∈ε L(y, h(x)) P(x, y)
● Otherwise, empirical loss for a dataset E is used, where N = |E| (this is
far more typical):
EmpLossL,E(h) = (1/N) Σ(x,y)∈E L(y, h(x))
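A minimal sketch of empirical loss over a dataset; the dataset values, the candidate hypothesis, and both loss functions are illustrative assumptions:

```python
def empirical_loss(h, E, L):
    # EmpLoss_{L,E}(h) = (1/N) * sum over (x, y) in E of L(y, h(x))
    return sum(L(y, h(x)) for x, y in E) / len(E)

# Illustrative dataset E and candidate hypothesis h (all values made up)
E = [(1, 2.1), (2, 3.9), (3, 6.2)]
h = lambda x: 2 * x

squared = lambda y, y_hat: (y - y_hat) ** 2   # squared-error loss
zero_one = lambda y, y_hat: int(y != y_hat)   # 0/1 loss

print(round(empirical_loss(h, E, squared), 3))  # 0.02
print(empirical_loss(h, E, zero_one))           # 1.0: no exact matches
```

The same hypothesis scores very differently under the two loss functions, which is why the choice of training objective matters.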
Regularization
● Regularization aims to minimize the complexity of the model
● Choice of regularization function depends on the hypothesis space
○ For example, regularization for a linear regression (weights on each
attribute) often uses one of the two following regularizers
■ Shrink all small weights to zero (eliminates attributes; Lasso or L1)
■ Prevent very large weights (smooths the weights, uses all features; Ridge or
L2)
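The two penalty terms themselves are easy to write down (a sketch; function names and the weight vector are illustrative):

```python
def l1_penalty(weights):
    # Lasso-style regularizer: sum of |w|; during training it drives small
    # weights to exactly zero, effectively eliminating attributes
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Ridge-style regularizer: sum of w^2; it penalizes very large weights,
    # smoothing the weights while keeping all features
    return sum(w * w for w in weights)

w = [0.5, 3.0, -4.0]
print(l1_penalty(w))  # 7.5
print(l2_penalty(w))  # 25.25
```

Note how L2 punishes the large weight −4.0 (16.0 of the 25.25) far more than L1 does, which is why it smooths rather than sparsifies.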
Cost Combines Model Selection and Optimization
● The cost of a hypothesis can be considered as the sum of the loss and
the regularization: Cost(h) = EmpLoss(h) + λ · Complexity(h)
○ Lower loss is equivalent to reducing all the error types
○ Lower regularization term is equivalent to a simpler model
● Cost formalizes Ockham’s razor, in an empirical way
○ Ockham: “A plurality of entities should not be posited without necessity”
○ If necessity is interpreted as empirically predictive, then cost directly
formalizes Ockham’s razor
○ If necessity is interpreted as providing a more useful explanation with
respect to a theory of the world, then cost is only part of the story
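The trade formalized by cost can be illustrated numerically. A sketch under stated assumptions: complexity is taken to be the number of nonzero weights (one simple choice), and λ, the losses, and the weight vectors are all made up:

```python
lam = 0.1  # regularization strength (illustrative value)

def cost(emp_loss, weights):
    # Cost(h) = EmpLoss(h) + lam * Complexity(h); here Complexity(h) is
    # the number of nonzero weights
    complexity = sum(1 for w in weights if w != 0)
    return emp_loss + lam * complexity

# A slightly worse fit with one parameter vs. a better fit with three
simple_h = cost(0.20, [1.5, 0.0, 0.0])
complex_h = cost(0.15, [1.2, 0.4, -0.3])
print(round(simple_h, 2), round(complex_h, 2))  # 0.3 0.45
```

Here the simpler hypothesis wins on cost despite its higher loss, which is Ockham's razor expressed empirically.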
Summary, One
● Applying supervised machine learning to a classification or regression problem
requires i.i.d. labeled examples split into training, validation, and test data
○ In contrast, reinforcement learning requires a reward signal from the
environment for learning and for testing. It cannot learn from labeled
data, but can learn through interaction with a simulated environment
○ Training includes model selection (choosing a hypothesis space that
avoids overfitting or underfitting) and optimization (finding the
hypothesis in that space that minimizes loss)
● The learned classifier or regression model is a hypothesis or function whose
input consists of examples represented as attributes (x(i){1:m}), and whose
output is a hypothesized y(i)
Summary, Two
● Statistical machine learning (predictive modeling) aims to minimize bias (from
the choice of model class) and estimation variance
● Optimizing a loss function rather than raw error rate minimizes error over all
output classes
● Depending on the choice of model class, regularization can be used to avoid
overfitting by eliminating some attributes (L1 regularization) or by evening out
the influence of all attributes (L2 regularization)
● By Ockham’s razor, a final choice of learned model h should minimize Cost(h),
which is the sum of Loss (prediction performance) and the Regularization
(model simplicity)