
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 12, 2015
Today:
• What is machine learning?
• Decision tree learning
• Course logistics
Readings:
• “The Discipline of ML”
• Mitchell, Chapter 3
• Bishop, Chapter 14.4
Machine Learning:
Study of algorithms that
• improve their performance P
• at some task T
• with experience E
well-defined learning task: <P, T, E>

Learning to Predict Emergency C-Sections
[Sims et al., 2000]
9714 patient records, each with 215 features
Learning to classify text documents
spam vs. not spam

Learning to detect objects in images
(Prof. H. Schneiderman)
Example training images for each orientation
Learn to classify the word a person is thinking about, based on fMRI brain activity

Learning prosthetic control from neural implant
[R. Kass, L. Castellanos, A. Schwartz]
Machine Learning – Practice
Mining Databases
Text analysis
Speech Recognition
Control learning
Object recognition
• Support Vector Machines
• Bayesian networks
• Hidden Markov models
• Deep neural networks
• Reinforcement learning
• …

Machine Learning – Theory
Other theories for
• Reinforcement skill learning
• Semi-supervised learning
• Active student querying
• …
PAC Learning Theory (supervised concept learning) relates:
• # of training examples (m)
• error rate (ε)
• representational complexity (|H|)
• failure probability (δ)
… also relating:
• # of mistakes during learning
• learner’s query strategy
• convergence rate
• asymptotic performance
• bias, variance
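As a reminder (a standard bound from PAC theory, not written out in the text above): for a consistent learner over a finite hypothesis space H, the quantities above are related by

m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

i.e., with at least m training examples, with probability at least 1 − δ every h ∈ H that is consistent with the data has true error at most ε.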

Machine Learning in Computer Science
• Machine learning already the preferred approach to
– Speech recognition, Natural language processing
– Computer vision
– Medical outcomes analysis
– Robot control
– …
[Diagram: ML applications as a growing subset of all software applications]
• This ML niche is growing
– Improved machine learning algorithms
– Increased volume of online data
– Increased demand for self-customizing software
Tom’s prediction: ML will be fastest-growing part of CS this century
[Diagram: Machine learning at the intersection of Computer Science, Statistics, Economics and Organizational Behavior, Evolution, Animal Learning (Cognitive Science, Psychology, Neuroscience), and Adaptive Control Theory]

What You’ll Learn in This Course
• The primary Machine Learning algorithms
– Logistic regression, Bayesian methods, HMM’s, SVM’s, reinforcement learning, decision tree learning, boosting, unsupervised clustering, …
• How to use them on real data – text, image, structured data
– your own project
• Underlying statistical and computational theory
• Enough to read and understand ML research papers
Course logistics

Machine Learning 10-601
website: www.cs.cmu.edu/~ninamf/courses/601sp15
Faculty
• Maria Balcan
• Tom Mitchell
TA’s
• Travis Dick
• Kirstin Early
• Ahmed Hefny
• Micol Marchetti-Bowick
• Willie Neiswanger
• Abu Saparov
Course assistant
• Sharon Cavlovich
See webpage for
• Office hours
• Syllabus details
• Recitation sessions
• Grading policy
• Honesty policy
• Late homework policy
• Piazza pointers
• …
Highlights of Course Logistics
On the wait list?
• Hang in there for the first few weeks
Homework 1
• Available now, due Friday
Grading:
• 30% homeworks (~5-6)
• 20% course project
• 25% first midterm (March 2)
• 25% final midterm (April 29)
Academic integrity:
• Cheating → fail class, be expelled from CMU
Late homework:
• full credit when due
• half credit next 48 hrs
• zero credit after that
• we’ll delete your lowest HW score
• must turn in at least n-1 of the n homeworks, even if late
Being present at exams:
• You must be there – plan now.
• Two in-class exams, no other final

Maria-Florina Balcan: Nina
• Foundations for Modern Machine Learning

E.g., interactive, distributed, life-long learning
Theoretical Computer Science, especially connections between learning theory & other fields

[Diagram: connections between Machine Learning Theory and Approximation Algorithms, Control Theory, Game Theory, Mechanism Design, Discrete Optimization, and Matroid Theory]
Travis Dick
• When can we learn many concepts from mostly unlabeled data by exploiting relationships between concepts?
• Currently: geometric relationships

Kirstin Early
• Analyzing and predicting energy consumption
• Reduce costs/usage and help people make informed decisions
Predicting energy costs from features of home and occupant behavior
Energy disaggregation: decomposing the total electric signal into individual appliances
Ahmed Hefny
• How can we learn to track and predict the state of a dynamical system only from noisy observations?
• Can we exploit supervised learning methods to devise a flexible, local-minima-free approach?
[Figure: observations of an oscillating pendulum and the extracted 2D state trajectory]

Micol Marchetti-Bowick
How can we use machine learning for biological and medical research?
• Using genotype data to build personalized models that can predict clinical outcomes
• Integrating data from multiple sources to perform cancer subtype analysis
• Structured sparse regression models for genome-wide association studies
[Figure: gene expression data with dendrogram; sample weights based on genetic relatedness]
Willie Neiswanger
• If we want to apply machine learning algorithms to BIG datasets…
• How can we develop parallel, low-communication machine learning algorithms?
• Such as embarrassingly parallel algorithms, where machines work independently, without communication.

Abu Saparov
• How can knowledge about the world help computers understand natural language?
• What kinds of machine learning tools are needed to understand sentences?
“Carolyn ate the cake with a fork.” “Carolyn ate the cake with vanilla.”
person_eats_food(consumer: Carolyn, food: cake, instrument: fork)
person_eats_food(consumer: Carolyn, food: cake, topping: vanilla)
Tom Mitchell
How can we build never-ending learners?
Case study: the never-ending language learner (NELL) runs 24×7 to learn to read the web
see http://rtw.ml.cmu.edu
[Plots: # of NELL beliefs vs. time (5 years); reading accuracy (mean avg. precision of top 1000 beliefs) vs. time (5 years)]

Function Approximation and Decision tree learning
Function approximation
Problem Setting:
• Set of possible instances X
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }

Input:
• Training examples { <x(i), y(i)> } of unknown target function f   (superscript (i) indexes the ith training example)

Output:
• Hypothesis h ∈ H that best approximates target function f

Simple Training Data Set
Day Outlook Temperature Humidity Wind PlayTennis?
A Decision tree for
f: <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node: tests one discrete-valued attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predicts Y (or P(Y | X ∈ leaf))

Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector
– e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
– Y = 1 if we play tennis on this day, else 0
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree
– trees sort x to a leaf, which assigns y
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector x = <x1, x2, ..., xn>
• Unknown target function f : X → Y
– Y is discrete-valued
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree

Input:
• Training examples { <x(i), y(i)> } of unknown target function f
Output:
• Hypothesis h ∈ H that best approximates target function f

Decision Trees
Suppose X = <X1, ..., Xn>
where the Xi are boolean-valued variables
How would you represent Y = X2 ∧ X5 ?   Y = X2 ∨ X5 ?
How would you represent Y = X2 ∧ X5 ∨ X3 ∧ X4 ∧ (¬X1) ?
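One possible answer for Y = X2 ∨ X5, as a sketch (my own illustration, not from the slides): a tree that tests X2 at the root and tests X5 only on the X2 = 0 branch. In Python, with a tree encoded as nested dicts (an encoding assumed here purely for illustration):

# Decision tree for Y = X2 v X5.
# Internal node: {"attr": name, "branches": {value: subtree}}; leaf: the predicted label.
tree = {"attr": "X2",
        "branches": {1: 1,                                           # X2 = 1  ->  Y = 1
                     0: {"attr": "X5", "branches": {1: 1, 0: 0}}}}   # else Y = X5

def classify(tree, x):
    """Sort instance x (a dict of attribute values) down the tree to a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]
    return tree

print(classify(tree, {"X2": 0, "X5": 1}))   # -> 1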

Top-down induction of decision trees [ID3, C4.5, Quinlan]: greedily grow the tree, starting from node = Root
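The body of the main loop appeared graphically on the slide and is not reproduced above. Below is a minimal sketch of the greedy top-down procedure in the spirit of ID3 (my own illustration, not Quinlan's exact algorithm); it reuses the nested-dict tree encoding from the previous sketch and assumes an information_gain helper like the one sketched after the information-gain slide below.

def majority_label(labels):
    # most common label in the sample (used for leaves)
    return max(set(labels), key=labels.count)

def id3(rows, labels, attributes):
    """Greedily grow a tree: pick the attribute with highest information gain,
    split the sample on its values, and recurse on each branch."""
    if len(set(labels)) == 1:          # all examples share one label -> leaf
        return labels[0]
    if not attributes:                 # no attributes left to test -> majority leaf
        return majority_label(labels)
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for v in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        sub_rows = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        branches[v] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return {"attr": best, "branches": branches}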
Sample Entropy
H(S) = − p⊕ log2 p⊕ − p⊖ log2 p⊖
where p⊕ and p⊖ are the proportions of positive and negative examples in sample S

Entropy
Entropy H(X) of a random variable X
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under most efficient code)
Why? Information theory:
• Most efficient possible code assigns -log2 P(X=i) bits
to encode the message X=i
• So, the expected number of bits to code one random X is:
H(X) = − Σi=1..n P(X=i) log2 P(X=i),  where n = # of possible values for X
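The coding-length view above translates directly into a small computation. A minimal sketch (not from the slides; the function name and the use of collections.Counter are my own choices):

from collections import Counter
from math import log2

def entropy(labels):
    """Sample entropy H(S): expected number of bits to encode one label,
    under the empirical distribution of the labels in the sample."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())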
Entropy
Entropy H(X) of a random variable X:
H(X) = − Σi P(X=i) log2 P(X=i)
Specific conditional entropy H(X|Y=v) of X given Y=v:
H(X|Y=v) = − Σi P(X=i | Y=v) log2 P(X=i | Y=v)
Conditional entropy H(X|Y) of X given Y:
H(X|Y) = Σv P(Y=v) H(X|Y=v)
Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

Information Gain is the mutual information between input attribute A and target variable Y.
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) − H_S(Y | A),  where H_S denotes entropy estimated from sample S
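Putting the last two definitions together in code (a sketch building on the entropy helper above; rows are assumed to be dicts mapping attribute names to values):

def conditional_entropy(rows, labels, attribute):
    """H(Y | A): entropy of the labels within each value of the attribute,
    weighted by how often that value occurs in the sample."""
    n = len(rows)
    h = 0.0
    for v in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == v]
        h += (len(subset) / n) * entropy(subset)
    return h

def information_gain(rows, labels, attribute):
    """Gain(S, A) = H(Y) - H(Y | A): expected reduction in entropy of Y
    from sorting the sample on attribute A."""
    return entropy(labels) - conditional_entropy(rows, labels, attribute)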
Simple Training Data Set
Day Outlook Temperature Humidity Wind PlayTennis?
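(The rows of this table appeared as an image on the original slide and are not reproduced here. As a reminder of how the numbers work out, recalling the PlayTennis data from Mitchell, Chapter 3, which has 9 positive and 5 negative examples, the entropy of the full sample is)

H(S) \;=\; -\tfrac{9}{14}\log_2\tfrac{9}{14} \;-\; \tfrac{5}{14}\log_2\tfrac{5}{14} \;\approx\; 0.940 \text{ bits}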

Final Decision Tree for
f: <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node: tests one discrete-valued attribute Xi
Each branch from a node: selects one value for Xi
Each leaf node: predicts Y
Which Tree Should We Output?
• ID3 performs a heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
Occam’s razor: prefer the simplest hypothesis that fits the data

Why Prefer Short Hypotheses? (Occam’s Razor)
Argument in favor:
• Fewer short hypotheses than long ones
→ a short hypothesis that fits the data is less likely to be a statistical coincidence
→ highly probable that a sufficiently complex hypothesis will fit the data
Argument opposed:
• Also fewer hypotheses with a prime number of nodes and attributes beginning with “Z”
• What’s so special about “short” hypotheses?

Overfitting
Consider a hypothesis h and its • Error rate over training data: • True error rate over all data:
We say h overfits the training data if Amount of overfitting =

Avoiding overfitting by post-pruning:
• Split the data into a training set and a validation set
• Create a tree that classifies the training set correctly
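The remaining steps of the procedure were shown graphically on the slides. A minimal sketch of the idea, as a simplified bottom-up variant of reduced-error pruning (my own illustration, reusing classify and majority_label from the earlier sketches, and assuming every attribute value in the validation data also appears in the tree):

def accuracy(tree, rows, labels):
    # fraction of examples the (sub)tree classifies correctly
    return sum(classify(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

def prune(tree, train_rows, train_labels, val_rows, val_labels):
    """Prune each branch first, then replace this subtree with a majority-label
    leaf whenever that does not reduce accuracy on the validation examples
    that reach this node."""
    if not isinstance(tree, dict) or not train_labels:
        return tree
    attr = tree["attr"]
    for v in list(tree["branches"]):
        tr = [(r, y) for r, y in zip(train_rows, train_labels) if r[attr] == v]
        va = [(r, y) for r, y in zip(val_rows, val_labels) if r[attr] == v]
        tree["branches"][v] = prune(tree["branches"][v],
                                    [r for r, _ in tr], [y for _, y in tr],
                                    [r for r, _ in va], [y for _, y in va])
    if not val_rows:
        return tree
    leaf = majority_label(train_labels)
    leaf_acc = sum(y == leaf for y in val_labels) / len(val_labels)
    return leaf if leaf_acc >= accuracy(tree, val_rows, val_labels) else tree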

You should know:
• Well-posed function approximation problems:
– Instance space, X
– Sample of labeled training data { <x(i), y(i)> }
– Hypothesis space, H = { f : X → Y }
• Learning is a search/optimization problem over H
– Various objective functions
• minimize training error (0-1 loss)
• among hypotheses that minimize training error, select smallest (?)
• Decision tree learning
– Greedy top-down learning of decision trees (ID3, C4.5, …)
– Overfitting and tree/rule post-pruning
– Extensions…
Questions to think about (1)
• ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Questions to think about (2)
• Consider target function f: <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?
Questions to think about (3)
• Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?

Questions to think about (4)
• What is the relationship between learning decision trees and learning IF-THEN rules?