Machine Learning: Decision Trees
CSci 5512: Artificial Intelligence II
Instructor:
March 15, 2022

Announcements
HW3 posted today (due Tue, Mar 29)

What is Machine Learning?
“Machine learning is programming computers to optimize a performance criterion using example data or past experience” –

10 Companies Using Machine Learning in Cool Ways



Learning is essential for unknown environments
When designer lacks omniscience
Learning is useful as a system construction method
Expose the agent to reality rather than trying to write it down
Modifies agent’s decision mechanisms to improve performance

Learning Element
Design of learning element is dictated by
Which component is to be improved
What prior knowledge the agent already has
What representation is used for the data and the component
What feedback is available to learn from

Machine Learning Pipeline
https://medium.com/microsoftazure/how-to-accelerate-devops-with-machine-learning-lifecycle-management-2ca4c86387a0

Dataset: collection of data (numbers, images, words, etc.)
Training data: data used to train a machine learning algorithm (i.e., learn a model)
Validation data: data used to measure performance of a model (for different hyperparameters)
Test data: unseen data used to measure performance of final model
https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
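To make the three splits concrete, here is a minimal sketch (not part of the course materials) that shuffles a labelled dataset and partitions it into training, validation, and test sets; the 60/20/20 split is an arbitrary illustrative choice.

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the examples and partition them into train/validation/test subsets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy dataset of (x, label) pairs
examples = [(i, i % 2) for i in range(10)]
train, val, test = train_val_test_split(examples)
print(len(train), len(val), len(test))  # 6 2 2
```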

Hyperparameters: algorithm parameters user chooses before training (they control learning)
Loss/cost function: function to measure performance of algorithm
Generalization: how well a model predicts on unseen data
Features: attributes (characteristics) of the data
Labels/target values: class, output (dependent variable) value
Representation: how data is encoded (images, numbers, graphs, etc.)
Error rate: probability predicted class is incorrect
Supervised: learning from a labeled dataset
Unsupervised: learning structure of data (no labels/target values)
Reinforcement: learning via interacting with environment (trial and error)

Applications
Type of data: vectors, time-series, sequences, etc.
Domain: text, image, speech, videos, social networks, finance, biology, climate, healthcare, etc.
Type of problem: regression, classification, anomaly detection, ranking, etc.
Models and Methods
Models: assumptions, loss functions on representations
Learning algorithms: training models based on data
Representation: native features vs. learned representations
Generalization in batch learning
Regret in online learning

Supervised Learning
Learning a mapping from input to output
Utilizes a labelled dataset
Input-output (features, labels or target values) pairs
Supervisor tells the algorithm the correct answer
Classification, regression, anomaly detection, etc.
Example applications:
Movie predictions
Stock price prediction
Face recognition

Supervised Learning
Learn a function from examples
f is the unknown target function
An example is a pair (x, f(x))
x is often a vector and f(x) is a label (cat/no cat) or a target value (price of a car)
Problem: find a hypothesis (function) h ≈ f
Given a dataset of examples (xi , f (xi )), i = 1, . . . , N
Simplified model of real learning:
Ignores prior knowledge
Assumes a deterministic, observable “environment”
Assumes examples are given
Assumes that the agent wants to learn f – why?

Supervised Learning (cont.)
Construct h to approximate/agree with f on the dataset
h is consistent if it agrees with f on all examples
Occam’s razor: Trade-off between consistency and simplicity

Classification
Assume: A fixed (unknown) distribution on R^d × {−1, +1}
Given: A dataset X = {(x1, y1), . . . , (xN, yN)} of N samples from the distribution
Problem: Find a function h : R^d → {−1, +1} that has “low” error rate, i.e., L(h) = P(h(x) ≠ y) is low
Let C be the set of functions over which f is searched for
“Bias” determines the set C
A learning algorithm is the search algorithm in C
For Multiclass problems, (x, y) ∈ R^d × {1, . . . , c}
For Regression problems, (x, y) ∈ R^d × R
For Multi-dimensional Regression problems, (x, y) ∈ R^d × R^k
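The error rate L(h) is defined with respect to the unknown distribution, so in practice it is estimated on held-out samples. A minimal sketch (my own illustration, not from the slides):

```python
def empirical_error_rate(h, samples):
    """Fraction of labelled samples on which hypothesis h disagrees with the label."""
    mistakes = sum(1 for x, y in samples if h(x) != y)
    return mistakes / len(samples)

# Hypothetical hypothesis: predict +1 when the first feature is positive
h = lambda x: +1 if x[0] > 0 else -1
samples = [((0.5, 1.0), +1), ((-0.2, 0.3), -1), ((0.1, -0.4), -1)]
print(empirical_error_rate(h, samples))  # 0.333... (one mistake out of three)
```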

Classification
Example: Credit scoring
Differentiating between low-risk and high-risk customers from their income and savings
Discriminant:
IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
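The discriminant translates directly into code; a sketch, where the threshold values for θ1 and θ2 are made up for illustration:

```python
THETA_1 = 30_000  # income threshold (hypothetical value)
THETA_2 = 10_000  # savings threshold (hypothetical value)

def credit_risk(income, savings):
    """IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA_1 and savings > THETA_2 else "high-risk"

print(credit_risk(45_000, 12_000))  # low-risk
print(credit_risk(45_000, 5_000))   # high-risk
```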

Classification Applications
Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
Character recognition: Different handwriting styles
Speech recognition: Temporal dependency
Medical diagnosis: From symptoms to illnesses
Biometrics: Recognition/authentication using physical and/or behavioral characteristics: Face, iris, signature, etc.
Outlier/novelty detection

Regression
Example: Price of a used car
Features x: car attributes (e.g., mileage)
Target value y: price
y = h(x|θ)
model: h(), parameters: θ
x: mileage
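A minimal sketch of the model y = h(x|θ) for this example: a linear hypothesis in the mileage, with made-up parameter values.

```python
def h(x, theta):
    """Linear model: predicted price = theta[0] + theta[1] * mileage."""
    return theta[0] + theta[1] * x

theta = (20_000.0, -0.10)  # hypothetical parameters: base price, price drop per mile
print(h(50_000, theta))    # 15000.0
```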

Supervised Learning: Uses
Prediction of future cases: Use the rule to predict the output for future inputs
Knowledge extraction: The rule is easy to understand
Compression: The rule is simpler than the data it explains
Outlier detection: Exceptions that are not covered by the rule, e.g., fraud

Unsupervised Learning
Learning “what normally happens”
No output
Clustering: Grouping similar instances
Example applications:
Customer segmentation
Image compression: Color quantization
Bioinformatics: Learning motifs

Reinforcement Learning
Learning via interacting with environment (i.e., trial and error)
Consists of observing environment state, taking action, receiving reward
No supervisor to tell you correct answer
Only relative rewards give a clue about action quality
Example applications:
Game playing (e.g., Mario)
Robot search and rescue
Path planning
Drug discovery

Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
Classification of examples is positive (T) or negative (F)

Decision Trees
One possible representation for hypotheses

Expressiveness
Decision trees can express any function of the input attributes
For Boolean functions, truth table row → path to leaf (see the XOR sketch below)
The basic trade-off
There is a consistent decision tree for any training set
Unless f is nondeterministic in x
One path to leaf for each example
But it probably won’t generalize to new examples
Prefer to find more compact decision tree
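For instance, XOR of two Boolean attributes can be written as a tree with one path per truth-table row (an illustrative sketch, not from the slides):

```python
def xor_tree(a, b):
    """A XOR B as a decision tree: test A at the root, then test B on each branch.
    Each leaf corresponds to one row of the truth table."""
    if a:
        return False if b else True   # A=T branch
    else:
        return True if b else False   # A=F branch

for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))
```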

Hypothesis Spaces
How many distinct decision trees with n Boolean attributes?
= Number of Boolean functions
= Number of distinct truth tables with 2^n rows = 2^(2^n)
6 Boolean attributes give 18,446,744,073,709,551,616 trees
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out ⇒ 3^n conjunctive hypotheses
More expressive hypothesis space
Increases chance that target function can be expressed
Increases number of hypotheses consistent with training set ⇒ may get worse prediction
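The counts above can be checked directly; a quick sketch:

```python
n = 6
num_boolean_functions = 2 ** (2 ** n)  # one function per labelling of the 2^n truth-table rows
num_conjunctions = 3 ** n              # each attribute: positive, negative, or absent
print(num_boolean_functions)           # 18446744073709551616
print(num_conjunctions)                # 729
```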

Decision Tree Learning
Aim: Find a small tree consistent with the training examples
Method: (recursively) choose “most significant” attribute as root of (sub)tree
Idea: A good attribute splits the examples (ideally) into “pure” subsets
Patrons? is a better choice – gives information about the classification
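A minimal sketch of the recursive procedure (the helper names and the dict-based tree representation are mine, not from the course materials; a real chooser would maximize information gain, defined on the next slide):

```python
from collections import Counter

def plurality_value(examples):
    """Most common label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples, choose_attribute):
    """Recursively pick the 'most significant' attribute, split the examples
    on its values, and build a subtree for each subset."""
    if not examples:
        return plurality_value(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1:          # pure subset: stop
        return labels.pop()
    if not attributes:            # no attributes left: majority vote
        return plurality_value(examples)
    a = choose_attribute(attributes, examples)
    tree = {a: {}}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        tree[a][v] = learn_tree(subset, [b for b in attributes if b != a],
                                examples, choose_attribute)
    return tree

# Tiny usage with a naive chooser (always the first attribute)
data = [({"Patrons": "Some"}, True), ({"Patrons": "None"}, False), ({"Patrons": "Full"}, False)]
print(learn_tree(data, ["Patrons"], data, lambda attrs, ex: attrs[0]))
```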

Information Gain
Entropy of r.v. X with distribution ⟨P1, . . . , Pk⟩ is
H(X) = H(⟨P1, . . . , Pk⟩) = Σ_{i=1}^{k} −Pi log2 Pi
An attribute splits the examples E into m subsets Ej
Each Ej should need less information to complete the classification
Let Ej have pj positive and nj negative examples
The conditional entropy (with p, n the total positive/negative counts):
H(X|E) = Σ_{j=1}^{m} (pj + nj)/(p + n) · H(⟨pj/(pj + nj), nj/(pj + nj)⟩)
Information Gain: IG(X|E) = H(X) − H(X|E)
Expected reduction in entropy
How informative is an attribute for classification
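A sketch of these two quantities in code, for Boolean classification where each subset Ej is summarized by its counts (pj, nj) (the function names and interface are my own):

```python
from math import log2

def entropy(ps):
    """H(<p1,...,pk>) = sum_i -p_i log2 p_i (terms with p_i = 0 contribute 0)."""
    return sum(-p * log2(p) for p in ps if p > 0)

def information_gain(p, n, subsets):
    """IG = H(<p/(p+n), n/(p+n)>) minus the weighted entropy of the subsets.
    `subsets` is a list of (pj, nj) counts, one entry per attribute value."""
    before = entropy([p / (p + n), n / (p + n)])
    after = sum((pj + nj) / (p + n) * entropy([pj / (pj + nj), nj / (pj + nj)])
                for pj, nj in subsets)
    return before - after

# e.g., the Patrons split on the next slide: information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]) ≈ 0.5409
```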

Information Gain Example
[Figure: candidate splits on Patrons? (None, Some, Full) and Type? (French, Italian, Thai, Burger)]
p = 6 positive and n = 6 negative data points
H(X) = −6/12 log2 6/12 − 6/12 log2 6/12 = 1
Now decide which attribute to use to split the data points
Patrons:
H(X|None) = −0/2 log2 0/2 − 2/2 log2 2/2 = 0
H(X|Some) = −4/4 log2 4/4 − 0/4 log2 0/4 = 0
H(X|Full) = −2/6 log2 2/6 − 4/6 log2 4/6 = 0.9183
H(X) − H(X|Patrons) = 1 − 6/12(0.9183) = 0.5409
Type:
H(X|French) = H(X|Italian) = −1/2 log2 1/2 − 1/2 log2 1/2 = 1
H(X|Thai) = H(X|Burger) = −2/4 log2 2/4 − 2/4 log2 2/4 = 1
H(X) − H(X|Type) = 1 − 2/12(1) − 2/12(1) − 4/12(1) − 4/12(1) = 0
Choose the attribute that maximizes information gain: Patrons
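As a numerical check on this slide, a small self-contained computation (the helper B and the count lists are my own shorthand):

```python
from math import log2

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q == 0.0 or q == 1.0:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

p, n = 6, 6                    # 6 positive and 6 negative examples
H = B(p / (p + n))             # 1.0

# Patrons: None (0+, 2-), Some (4+, 0-), Full (2+, 4-)
patrons = [(0, 2), (4, 0), (2, 4)]
rem = sum((pj + nj) / (p + n) * B(pj / (pj + nj)) for pj, nj in patrons)
print(H - rem)                 # ~0.5409

# Type: French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)
types = [(1, 1), (1, 1), (2, 2), (2, 2)]
rem = sum((pj + nj) / (p + n) * B(pj / (pj + nj)) for pj, nj in types)
print(H - rem)                 # ~0.0
```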

A Simple Decision Tree
Decision tree learned from the 12 examples:
Substantially simpler than “true” tree
A more complex hypothesis isn’t justified by small amount of data

The Original Tree
The original (complex) tree with same performance on the training set

Generalization and Overfitting
May generate a large tree, e.g., if no pattern is found
Overfitting: Can happen in all types of learners
Decision tree pruning: Remove irrelevant attributes
Information gain is small after splitting
Significance test with null hypothesis of no underlying pattern

Performance Measurement
How do we know that h ≈ f ?
1 Use theorems of computational/statistical learning theory
2 Try h on a new test set of examples
Learning curve = % correct on test set as a function of training set size
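A sketch of how such a curve is produced: train on increasingly large prefixes of the training data and record test accuracy each time. The `train_fn` and `accuracy_fn` arguments are placeholders for whatever learner and metric are being studied.

```python
from collections import Counter

def learning_curve(train_fn, accuracy_fn, train_data, test_data, sizes):
    """% correct on the test set as a function of training-set size."""
    return [(m, accuracy_fn(train_fn(train_data[:m]), test_data)) for m in sizes]

# Toy usage: a majority-class "learner" on a synthetic dataset
train_fn = lambda data: Counter(y for _, y in data).most_common(1)[0][0]
accuracy_fn = lambda label, data: sum(y == label for _, y in data) / len(data)
train_data = [(i, i % 2) for i in range(20)]
test_data = [(i, i % 2) for i in range(20, 30)]
print(learning_curve(train_fn, accuracy_fn, train_data, test_data, [2, 5, 10, 20]))
```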

Performance Measurement (cont.)
Learning curve depends on
Realizable (can express target function) vs. non-realizable
Redundant expressiveness (e.g., loads of irrelevant attributes)
Non-realizability can be due to:
Missing attributes
Restricted hypothesis class (e.g., thresholded linear function)
Non-deterministic dependency
