
Introduction to Machine Learning and Data Mining


COMP9417 Machine Learning and Data Mining

Last revision: 28 Feb 2018

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 1 / 94

Acknowledgements

Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/

Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook

Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook

Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml

Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html

Material derived from slides for the course
“Machine Learning” by A. Srinivasan
BITS Pilani, Goa, India (2016)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 2 / 94


Aims

Aims

This lecture will introduce you to machine learning, giving an overview of
some of the key ideas and approaches we will cover in the course.
Following it you should be able to describe some of the main concepts and
outline some of the main techniques that are used in machine learning:

• categories of learning (supervised learning, unsupervised learning, etc.)
• widely used machine learning algorithms and techniques
• batch vs. online settings
• parametric vs. non-parametric approaches
• generalisation in machine learning
• training, validation and testing phases in applications
• limits on learning

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 3 / 94

Aims

What we will cover

• core algorithms and model types in machine learning
• foundational concepts regarding learning from data
• relevant theory to inform and generalise understanding
• practical applications

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 4 / 94

Aims

What we will NOT cover

• lots of probability and statistics
• lots of neural nets and deep learning
• “big data”
• commercial and business aspects of “analytics”
• ethical aspects of AI and ML

although all of these are interesting and important topics

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 5 / 94

Why Study Machine Learning?

Some history

One can imagine that after the machine had been in operation
for some time, the instructions would have been altered out of
recognition, but nevertheless still be such that one would have to
admit that the machine was still doing very worthwhile
calculations. Possibly it might still be getting results of the type
desired when the machine was first set up, but in a much more
efficient manner. In such a case one would have to admit that
the progress of the machine had not been foreseen when its
original instructions were put in. It would be like a pupil who had
learnt much from his master, but had added much more by his
own work.

From A. M. Turing’s lecture to the London Mathematical Society. (1947)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 6 / 94

Why Study Machine Learning?

Some definitions

The field of machine learning is concerned with the question of
how to construct computer programs that automatically improve
from experience.

“Machine Learning”. T. Mitchell (1997)

Machine learning, then, is about making computers modify or
adapt their actions (whether these actions are making
predictions, or controlling a robot) so that these actions get
more accurate, where accuracy is measured by how well the
chosen actions reflect the correct ones.

“Machine Learning”. S. Marsland (2015)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 7 / 94

Why Study Machine Learning?

Some definitions

Machine learning is the systematic study of algorithms and
systems that improve their knowledge or performance with
experience.

“Machine Learning”. P. Flach (2012)

The term machine learning refers to the automated detection of
meaningful patterns in data.

“Understanding Machine Learning”. S. Shalev-Shwartz and S. Ben-David (2014)

Data mining is the extraction of implicit, previously unknown,
and potentially useful information from data.

“Data Mining”. I. Witten et al. (2016)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 8 / 94

Why Study Machine Learning?

Machine Learning is . . .

Trying to get programs to work in a reasonable way to predict stuff.

R. Kohn (2015)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 9 / 94

Why Study Machine Learning?

How is Machine Learning different from . . .

Machine learning comes originally from Artificial Intelligence (AI), where
the motivation is to build intelligent agents, capable of acting
autonomously. Learning is a characteristic of intelligence, so to be
successful an agent must ultimately be able to learn, apply, understand
and communicate what it has learned.

These are not requirements in:

• statistics — the results are typically mathematical models for humans
• data mining — the results are typically models of “insight” for humans

These criteria are often also necessary, but not always sufficient, for
machine learning.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 10 / 94

Why Study Machine Learning?

Machine Learning for Human-level Artificial Intelligence

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 11 / 94

Why Study Machine Learning?

ML for human-level AI, right ?

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 12 / 94

Why Study Machine Learning?

ML for human-level AI, right ? Not so fast . . .

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 13 / 94

Categories of Machine Learning

Supervised and unsupervised learning

The most widely used categories of machine learning algorithms are:

• Supervised learning – output class (or label) is given

• Unsupervised learning – no output class is given

There are also hybrids, such as semi-supervised learning, and alternative
strategies to acquire data, such as reinforcement learning and active
learning.

Note: output class can be real-valued or discrete, scalar, vector, or other
structure . . .

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 14 / 94

Categories of Machine Learning

Supervised and unsupervised learning

Supervised learning tends to dominate in applications.

Why ?

Generally, because it is much easier to define the problem and develop an
error measure (loss function) to evaluate different algorithms, parameter
settings, data transformations, etc. for supervised learning than for
unsupervised learning.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 15 / 94

Categories of Machine Learning

Supervised and unsupervised learning

Unfortunately . . .

In the real world it is often difficult to obtain good labelled data in
sufficient quantities

So in such cases unsupervised learning is really what you want . . .

but currently, finding good unsupervised learning algorithms for complex
machine learning tasks remains a research challenge.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 16 / 94

Categories of Machine Learning Models: the output of machine learning

Machine learning models

Machine learning models can be distinguished according to their main
intuition, for example:

• Geometric models use intuitions from geometry such as separating
(hyper-)planes, linear transformations and distance metrics.

• Probabilistic models view learning as a process of reducing
uncertainty, modelled by means of probability distributions.

• Logical models are defined in terms of easily interpretable logical
expressions.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 17 / 94

Categories of Machine Learning Models: the output of machine learning

Machine learning models

Alternatively, can be characterised by algorithmic properties:

• Regression models predict a numeric output
• Classification models predict a discrete class value
• Neural networks learn based on a biological analogy
• Local models predict in the local region of a query instance
• Tree-based models partition the data to make predictions
• Ensembles learn multiple models and combine their predictions

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 18 / 94

Categories of Machine Learning Supervised learning

Linear regression

Given 2 real-valued variables X1, X2, labelled with a real-valued variable
Y , find “line of best fit” that captures the dependency of Y on X1, X2.

Learning here is by minimizing MSE, i.e., the average of the squared
vertical distances of values of Y from the learned function Ŷ = f̂(X).
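
As a concrete, made-up illustration of least-squares fitting (not from the original slides, and assuming numpy is available), the following sketch fits a linear function of X1 and X2 to noisy data and reports its MSE:

    import numpy as np

    # Toy data: 100 examples with two features X1, X2 and a noisy linear target Y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

    # Add a column of ones for the intercept and solve the least-squares problem,
    # i.e. minimise the average squared distance between y and the fitted values.
    Xb = np.hstack([np.ones((100, 1)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    y_hat = Xb @ w
    mse = np.mean((y - y_hat) ** 2)
    print(w, mse)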

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 19 / 94

Categories of Machine Learning Supervised learning

Linear regression

A question: is it possible to do better than the line of best fit?

Maybe. Linear regression assumes that the (xi, yi) examples in the data are
“generated” by the true (but unknown) function Y = f(X).

So any training set is a sample from the true distribution, with E(Y) = f(X).

But what if f is non-linear ?

We may be able to reduce the mean squared error (MSE) value Σ_i (yi − ŷi)² by trying a
different function.

Can “decompose” MSE to aid in selecting a better function.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 20 / 94

Categories of Machine Learning Supervised learning

Nearest Neighbour

Nearest Neighbour is a regression or classification algorithm that predicts, for some
query, the output value of the nearest data point in the training data.
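
A minimal 1-nearest-neighbour sketch (assuming numpy; the data are invented for the example):

    import numpy as np

    def nearest_neighbour_predict(X_train, y_train, x_query):
        """Return the label of the training point closest to x_query (1-NN)."""
        distances = np.linalg.norm(X_train - x_query, axis=1)
        return y_train[np.argmin(distances)]

    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
    y_train = np.array([0, 0, 1])
    print(nearest_neighbour_predict(X_train, y_train, np.array([2.5, 2.5])))  # -> 1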

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 21 / 94

Categories of Machine Learning Supervised learning

Classification

[Figure: three snapshots of Customer103 at times t0, t1, …, tn, each described by
attributes such as Own House: Yes, Years of credit: 9, Loan balance, Income, Other
delinquent accts and Max billing cycles late; the target attribute Profitable customer?
is unknown in the early snapshots and labelled No in the final one.]

If OtherDelinquentAccounts > 2, and NumberDelinquentBillingCycles > 1
Then ProfitableCustomer = No

If OtherDelinquentAccounts = 0, and (Income > 30k OR YearsOfCredit > 3)
Then ProfitableCustomer = Yes

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 22 / 94

Categories of Machine Learning Supervised learning

Assassinating spam e-mail

SpamAssassin is a widely used open-source spam filter. It calculates a
score for an incoming e-mail, based on a number of built-in rules or ‘tests’
in SpamAssassin’s terminology, and adds a ‘junk’ flag and a summary
report to the e-mail’s headers if the score is 5 or more.

-0.1 RCVD_IN_MXRATE_WL RBL: MXRate recommends allowing

[123.45.6.789 listed in sub.mxrate.net]

0.6 HTML_IMAGE_RATIO_02 BODY: HTML has a low ratio of text to image area

1.2 TVD_FW_GRAPHIC_NAME_MID BODY: TVD_FW_GRAPHIC_NAME_MID

0.0 HTML_MESSAGE BODY: HTML included in message

0.6 HTML_FONx_FACE_BAD BODY: HTML font face is not a word

1.4 SARE_GIF_ATTACH FULL: Email has a inline gif

0.1 BOUNCE_MESSAGE MTA bounce message

0.1 ANY_BOUNCE_MESSAGE Message is some kind of bounce message

1.4 AWL AWL: From: address is in the auto white-list

From left to right you see the score attached to a particular test, the test
identifier, and a short description including a reference to the relevant part
of the e-mail. As you see, scores for individual tests can be negative
(indicating evidence suggesting the e-mail is ham rather than spam) as well
as positive. The overall score of 5.3 suggests the e-mail might be spam.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 23 / 94

Categories of Machine Learning Supervised learning

Linear classification

Suppose we have only two tests and four training e-mails, one of which is
spam. Both tests succeed for the spam e-mail; for one ham e-mail neither
test succeeds, for another the first test succeeds and the second doesn’t,
and for the third ham e-mail the first test fails and the second succeeds.

It is easy to see that assigning both tests a weight of 4 correctly ‘classifies’
these four e-mails into spam and ham. In the mathematical notation
introduced above we could describe this classifier as 4x1 + 4x2 > 5 or
(4, 4) · (x1, x2) > 5.

In fact, any weight between 2.5 and 5 will ensure that the threshold of 5 is
only exceeded when both tests succeed. We could even consider assigning
different weights to the tests – as long as each weight is less than 5 and
their sum exceeds 5 – although it is hard to see how this could be justified
by the training data.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 24 / 94

Categories of Machine Learning Supervised learning

Spam filtering as a classification task

The columns marked x1 and x2 indicate the results of two tests on four
different e-mails. The fourth column indicates which of the e-mails are
spam. The right-most column demonstrates that by thresholding the
function 4x1 + 4x2 at 5, we can separate spam from ham.

E-mail   x1   x2   Spam?   4x1 + 4x2
  1       1    1     1         8
  2       0    0     0         0
  3       1    0     0         4
  4       0    1     0         4
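
The same thresholding, as a small sketch in code (all numbers are those from the table above):

    # (x1, x2, spam?) for the four training e-mails from the table above.
    emails = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
    weights, threshold = (4, 4), 5

    for x1, x2, spam in emails:
        score = weights[0] * x1 + weights[1] * x2
        predicted_spam = score > threshold
        print(x1, x2, score, predicted_spam, bool(spam))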

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 25 / 94

Categories of Machine Learning Supervised learning

Linear classification in two dimensions

[Figure: linearly separable data, with positives (+) on one side of a straight decision
boundary and negatives (–) on the other; the weight vector w is perpendicular to the
boundary, and x0, x1, x2 are points on it.]

• straight line separates positives from negatives
• defined by w · xi = t
• w is perpendicular to decision boundary
• w points in direction of positives
• t is the decision threshold

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 26 / 94

Categories of Machine Learning Supervised learning

Linear classification in two dimensions

Note: xi points to a point on the decision boundary. In particular, x0
points in the same direction as w, from which it follows that
w · x0 = ||w|| ||x0|| = t (where ||x|| denotes the length of the vector x).

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 27 / 94

Categories of Machine Learning Supervised learning

Homogeneous coordinates

It is sometimes convenient to simplify notation further by introducing an
extra constant ‘variable’ x0 = 1, the weight of which is fixed to w0 = −t.

The extended data point is then x◦ = (1, x1, . . . , xn) and the extended
weight vector is w◦ = (−t, w1, . . . , wn), leading to the decision rule
w◦ · x◦ > 0 and the decision boundary w◦ · x◦ = 0.

Thanks to these so-called homogeneous coordinates the decision boundary
passes through the origin of the extended coordinate system, at the
expense of needing an additional dimension.

Note: this doesn’t really affect the data, as all data points and the ‘real’
decision boundary live in the plane x0 = 1.
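
A minimal sketch of this construction in code (assuming numpy; the weights and threshold are the illustrative values from the spam example):

    import numpy as np

    w, t = np.array([4.0, 4.0]), 5.0          # original weights and threshold
    w_ext = np.concatenate(([-t], w))          # extended weight vector (-t, w1, ..., wn)

    def classify(x):
        x_ext = np.concatenate(([1.0], x))     # extended data point (1, x1, ..., xn)
        return w_ext @ x_ext > 0               # decision rule w° · x° > 0

    print(classify(np.array([1.0, 1.0])))      # True:  4 + 4 - 5 > 0
    print(classify(np.array([1.0, 0.0])))      # False: 4 - 5 < 0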

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 28 / 94

Categories of Machine Learning Supervised learning

Machine learning for spam filtering

[Diagram: e-mails are converted by the SpamAssassin tests into data; a linear classifier
with weights maps the data to a Spam? decision; the weights are produced by a
'Learn weights' step from training data.]

At the top we see how SpamAssassin approaches the spam e-mail
classification task: the text of each e-mail is converted into a data point
by means of SpamAssassin’s built-in tests, and a linear classifier is applied
to obtain a ‘spam or ham’ decision. At the bottom (in blue) we see the bit
that is done by machine learning.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 29 / 94

Categories of Machine Learning Supervised learning

A Bayesian classifier I

Bayesian spam filters maintain a vocabulary of words and phrases –
potential spam or ham indicators – for which statistics are collected from a
training set.

• For instance, suppose that the word ‘Viagra’ occurred in four spam
e-mails and in one ham e-mail. If we then encounter a new e-mail
that contains the word ‘Viagra’, we might reason that the odds that
this e-mail is spam are 4:1, or the probability of it being spam is 0.80
and the probability of it being ham is 0.20.

• The situation is slightly more subtle because we have to take into
account the prevalence of spam. Suppose that I receive on average
one spam e-mail for every six ham e-mails. This means that I would
estimate the odds of an unseen e-mail being spam as 1:6, i.e.,
non-negligible but not very high either.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 30 / 94

Categories of Machine Learning Supervised learning

A Bayesian classifier II

• If I then learn that the e-mail contains the word ‘Viagra’, which
occurs four times as often in spam as in ham, I need to combine these
two odds. As we shall see later, Bayes’ rule tells us that we should
simply multiply them: 1:6 times 4:1 is 4:6, corresponding to a spam
probability of 0.4.

In this way you are combining two independent pieces of evidence, one
concerning the prevalence of spam, and the other concerning the
occurrence of the word ‘Viagra’, pulling in opposite directions.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 31 / 94

Categories of Machine Learning Supervised learning

A Bayesian classifier III

The nice thing about this ‘Bayesian’ classification scheme is that it can be
repeated if you have further evidence. For instance, suppose that the odds
in favour of spam associated with the phrase ‘blue pill’ is estimated at 3:1,
and suppose our e-mail contains both ‘Viagra’ and ‘blue pill’, then the
combined odds are 4:1 times 3:1 is 12:1, which is ample to outweigh the
1:6 odds associated with the low prevalence of spam (total odds are 2:1, or
a spam probability of 0.67, up from 0.40 without the ‘blue pill’).
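
The same odds arithmetic as a back-of-the-envelope sketch (only the numbers quoted above are used):

    def odds_to_prob(odds):
        return odds / (1.0 + odds)

    prior_odds = 1 / 6            # one spam e-mail for every six ham e-mails
    viagra_odds = 4 / 1           # 'Viagra' occurs four times as often in spam
    blue_pill_odds = 3 / 1        # odds in favour of spam for 'blue pill'

    print(odds_to_prob(prior_odds * viagra_odds))                    # ~0.40
    print(odds_to_prob(prior_odds * viagra_odds * blue_pill_odds))   # ~0.67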

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 32 / 94

Categories of Machine Learning Supervised learning

A rule-based classifier

• if the e-mail contains the word ‘Viagra’ then estimate the odds of
spam as 4:1;

• otherwise, if it contains the phrase ‘blue pill’ then estimate the odds
of spam as 3:1;

• otherwise, estimate the odds of spam as 1:6.

The first rule covers all e-mails containing the word ‘Viagra’, regardless of
whether they contain the phrase ‘blue pill’, so no overcounting occurs.
The second rule only covers e-mails containing the phrase ‘blue pill’ but
not the word ‘Viagra’, by virtue of the ‘otherwise’ clause. The third rule
covers all remaining e-mails: those which contain neither ‘Viagra’ nor ‘blue pill’.
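
The same rule list written as a minimal sketch in code; because the first matching rule wins, no evidence is counted twice:

    def spam_odds(email_text):
        """First matching rule wins, so each e-mail is covered exactly once."""
        if 'Viagra' in email_text:
            return 4 / 1
        elif 'blue pill' in email_text:
            return 3 / 1
        else:
            return 1 / 6

    print(spam_odds('cheap Viagra and blue pill'))  # 4.0, first rule fires
    print(spam_odds('blue pill offer'))             # 3.0
    print(spam_odds('meeting at noon'))             # 0.1666...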

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 33 / 94

The ingredients of machine learning

How machine learning helps to solve a task

[Diagram: domain objects are described by features to produce data; the task (red box)
requires a model that maps this data to outputs; a learning algorithm solves the
learning problem (blue box) of producing that model from training data.]

An overview of how machine learning is used to address a given task. A
task (red box) requires an appropriate mapping – a model – from data
described by features to outputs. Obtaining such a mapping from training
data is what constitutes a learning problem (blue box).

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 34 / 94

The ingredients of machine learning

Some terminology I

Tasks are addressed by models, whereas learning problems are solved by
learning algorithms that produce models.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 35 / 94

The ingredients of machine learning

Some terminology II

Machine learning is concerned with using the right features to build the
right models that achieve the right tasks.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 36 / 94

The ingredients of machine learning

Some terminology III

Models lend the machine learning field diversity, but tasks and features
give it unity.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 37 / 94

The ingredients of machine learning

Some terminology IV

Does the algorithm require all training data to be present before the start
of learning ? If yes, then it is categorised as a batch learning algorithm.

If, however, it can continue to learn as new data arrives, it is an online
learning algorithm.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 38 / 94

The ingredients of machine learning

Some terminology V

If the model has a fixed number of parameters it is categorised as
parametric.

Otherwise, if the number of parameters grows with the amount of training
data it is categorised as non-parametric.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 39 / 94

The ingredients of machine learning Geometric models

Basic linear classifier I

[Figure: positive (+) and negative (–) examples with their respective centres of mass
p and n; the decision boundary passes through (p + n)/2 and is perpendicular to
w = p – n.]

The basic linear classifier constructs a decision boundary by half-way
intersecting the line between the positive and negative centres of mass.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 40 / 94

The ingredients of machine learning Geometric models

Basic linear classifier II

The basic linear classifier is described by the equation w · x = t, with
w = p − n; the decision threshold can be found by noting that (p + n)/2
is on the decision boundary, and hence
t = (p − n) · (p + n)/2 = (||p||² − ||n||²)/2, where ||x|| denotes the
length of vector x.
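
A minimal sketch of the basic linear classifier (assuming numpy; the example data are invented):

    import numpy as np

    def basic_linear_classifier(X_pos, X_neg):
        p, n = X_pos.mean(axis=0), X_neg.mean(axis=0)    # centres of mass
        w = p - n
        t = (np.dot(p, p) - np.dot(n, n)) / 2.0          # (||p||^2 - ||n||^2) / 2
        return w, t

    X_pos = np.array([[2.0, 2.0], [3.0, 3.0]])
    X_neg = np.array([[0.0, 0.0], [1.0, 0.0]])
    w, t = basic_linear_classifier(X_pos, X_neg)
    print(w, t, np.dot(w, np.array([2.5, 2.0])) > t)     # classify a new point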

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 41 / 94

The ingredients of machine learning Geometric models

Neural Networks I

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 42 / 94

The ingredients of machine learning Geometric models

Neural Networks II

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 43 / 94

The ingredients of machine learning Geometric models

Neural Networks III

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 44 / 94

The ingredients of machine learning Geometric models

Neural Networks IV

[Figure: a neural network with a 30×32 sensor input retina feeding 4 hidden units,
which feed 30 output units whose labels range from Sharp Left through Straight Ahead
to Sharp Right.]

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 45 / 94

The ingredients of machine learning Geometric models

Support vector machine

[Figure: the linearly separable data from before; the maximum-margin decision boundary
with weight vector w, the margin indicated by dotted lines, and the support vectors
circled.]

The decision boundary learned by a support vector machine from the
linearly separable data from before. The decision boundary maximises the
margin, which is indicated by the dotted lines. The circled data points are
the support vectors.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 46 / 94

The ingredients of machine learning Probabilistic models

A simple probabilistic model

‘Viagra’ and ‘lottery’ are two Boolean features; Y is the class variable,
with values ‘spam’ and ‘ham’. In each row the most likely class is
indicated in bold.
Viagra lottery P (Y = spam|Viagra, lottery) P (Y = ham|Viagra, lottery)

0 0 0.31 0.69
0 1 0.65 0.35
1 0 0.80 0.20
1 1 0.40 0.60

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 47 / 94

The ingredients of machine learning Probabilistic models

Decision rule

Assuming that X and Y are the only variables we know and care about,
the posterior distribution P (Y |X) helps us to answer many questions of
interest.

• For instance, to classify a new e-mail we determine whether the words
‘Viagra’ and ‘lottery’ occur in it, look up the corresponding
probability P (Y = spam|Viagra, lottery), and predict spam if this
probability exceeds 0.5 and ham otherwise.

• Such a recipe to predict a value of Y on the basis of the values of X
and the posterior distribution P (Y |X) is called a decision rule.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 48 / 94

The ingredients of machine learning Probabilistic models

Missing values I

Suppose we skimmed an e-mail and noticed that it contains the word
‘lottery’ but we haven’t looked closely enough to determine whether it
uses the word ‘Viagra’. This means that we don’t know whether to use the
second or the fourth row of the table to make a prediction. This is a problem, as we
would predict spam if the e-mail did not contain the word ‘Viagra’ (second row)
and ham if it did (fourth row).

The solution is to average these two rows, using the probability of ‘Viagra’
occurring in any e-mail (spam or not):

P(Y | lottery) = P(Y | Viagra = 0, lottery) P(Viagra = 0)
               + P(Y | Viagra = 1, lottery) P(Viagra = 1)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 49 / 94

The ingredients of machine learning Probabilistic models

Missing values II

For instance, suppose for the sake of argument that one in ten e-mails
contain the word ‘Viagra’, then P (Viagra = 1) = 0.10 and
P (Viagra = 0) = 0.90. Using the above formula, we obtain
P (Y = spam|lottery = 1) = 0.65 · 0.90 + 0.40 · 0.10 = 0.625 and
P (Y = ham|lottery = 1) = 0.35 · 0.90 + 0.60 · 0.10 = 0.375. Because the
occurrence of ‘Viagra’ in any e-mail is relatively rare, the resulting
distribution deviates only a little from the second row in the data.
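
The same marginalisation as a small sketch in code (all probabilities are those given above):

    # Posterior probabilities of spam from the table, indexed by (viagra, lottery).
    p_spam = {(0, 0): 0.31, (0, 1): 0.65, (1, 0): 0.80, (1, 1): 0.40}
    p_viagra = {0: 0.90, 1: 0.10}     # marginal probability of 'Viagra' occurring

    # 'lottery' is observed (= 1), 'Viagra' is unobserved: average over its values.
    p_spam_given_lottery = sum(p_spam[(v, 1)] * p_viagra[v] for v in (0, 1))
    print(p_spam_given_lottery)        # 0.65*0.90 + 0.40*0.10 = 0.625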

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 50 / 94

The ingredients of machine learning Probabilistic models

Likelihood ratio

As a matter of fact, statisticians work very often with different conditional
probabilities, given by the likelihood function P (X|Y ).
• I like to think of these as thought experiments: if somebody were to
send me a spam e-mail, how likely would it be that it contains exactly
the words of the e-mail I’m looking at? And how likely if it were a
ham e-mail instead?

• What really matters is not the magnitude of these likelihoods, but
their ratio: how much more likely is it to observe this combination of
words in a spam e-mail than it is in a non-spam e-mail.

• For instance, suppose that for a particular e-mail described by X we
have P(X|Y = spam) = 3.5 · 10⁻⁵ and P(X|Y = ham) = 7.4 · 10⁻⁶,
then observing X in a spam e-mail is nearly five times more likely
than it is in a ham e-mail.

• This suggests the following decision rule: predict spam if the
likelihood ratio is larger than 1 and ham otherwise.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 51 / 94

The ingredients of machine learning Probabilistic models

When to use likelihoods

Use likelihoods if you want to ignore the prior distribution or assume it
uniform, and posterior probabilities otherwise.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 52 / 94

The ingredients of machine learning Probabilistic models

Posterior odds

P(Y = spam | Viagra = 0, lottery = 0) / P(Y = ham | Viagra = 0, lottery = 0) = 0.31/0.69 = 0.45

P(Y = spam | Viagra = 1, lottery = 1) / P(Y = ham | Viagra = 1, lottery = 1) = 0.40/0.60 = 0.67

P(Y = spam | Viagra = 0, lottery = 1) / P(Y = ham | Viagra = 0, lottery = 1) = 0.65/0.35 = 1.9

P(Y = spam | Viagra = 1, lottery = 0) / P(Y = ham | Viagra = 1, lottery = 0) = 0.80/0.20 = 4.0

Using a MAP decision rule we predict ham in the top two cases and spam
in the bottom two. Given that the full posterior distribution is all there is
to know about the domain in a statistical sense, these predictions are the
best we can do: they are Bayes-optimal.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 53 / 94

The ingredients of machine learning Probabilistic models

Example marginal likelihoods

Y P (Viagra = 1|Y ) P (Viagra = 0|Y )
spam 0.40 0.60
ham 0.12 0.88

Y P (lottery = 1|Y ) P (lottery = 0|Y )
spam 0.21 0.79
ham 0.13 0.87

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 54 / 94

The ingredients of machine learning Probabilistic models

Using marginal likelihoods

Using the marginal likelihoods from before, we can approximate the
likelihood ratios (the previously calculated odds from the full posterior
distribution are shown in brackets):

P(Viagra = 0 | Y = spam) / P(Viagra = 0 | Y = ham) · P(lottery = 0 | Y = spam) / P(lottery = 0 | Y = ham)
    = (0.60/0.88) · (0.79/0.87) = 0.62   (0.45)

P(Viagra = 0 | Y = spam) / P(Viagra = 0 | Y = ham) · P(lottery = 1 | Y = spam) / P(lottery = 1 | Y = ham)
    = (0.60/0.88) · (0.21/0.13) = 1.1   (1.9)

P(Viagra = 1 | Y = spam) / P(Viagra = 1 | Y = ham) · P(lottery = 0 | Y = spam) / P(lottery = 0 | Y = ham)
    = (0.40/0.12) · (0.79/0.87) = 3.0   (4.0)

P(Viagra = 1 | Y = spam) / P(Viagra = 1 | Y = ham) · P(lottery = 1 | Y = spam) / P(lottery = 1 | Y = ham)
    = (0.40/0.12) · (0.21/0.13) = 5.4   (0.67)

We see that, using a maximum likelihood decision rule, our very simple
model arrives at the Bayes-optimal prediction in the first three cases, but
not in the fourth (‘Viagra’ and ‘lottery’ both present), where the marginal
likelihoods are actually very misleading.
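
A minimal sketch of this factorised likelihood-ratio calculation, using the marginal likelihood tables from the previous slide:

    # Marginal likelihoods P(feature = value | class) from the previous slide.
    lik_viagra = {'spam': {1: 0.40, 0: 0.60}, 'ham': {1: 0.12, 0: 0.88}}
    lik_lottery = {'spam': {1: 0.21, 0: 0.79}, 'ham': {1: 0.13, 0: 0.87}}

    def likelihood_ratio(viagra, lottery):
        """Approximate P(X|spam)/P(X|ham) as a product of marginal ratios."""
        return ((lik_viagra['spam'][viagra] / lik_viagra['ham'][viagra])
                * (lik_lottery['spam'][lottery] / lik_lottery['ham'][lottery]))

    for v in (0, 1):
        for l in (0, 1):
            print(v, l, round(likelihood_ratio(v, l), 2))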

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 55 / 94

The ingredients of machine learning Logical models

A classification tree I

[Figure: a classification tree that splits on ‘Viagra’ at the root; for ‘Viagra’ = 0 it
splits further on ‘lottery’. From left to right the leaves have class distributions
spam: 20 / ham: 40, spam: 10 / ham: 5, and spam: 20 / ham: 5. Alongside, the
corresponding rectangular partition of the instance space.]

A classification tree combining two Boolean features.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 56 / 94

The ingredients of machine learning Logical models

A classification tree II

Each internal node or split is labelled with a feature, and each edge
emanating from a split is labelled with a feature value.

Each leaf therefore corresponds to a unique combination of feature values.

Also indicated in each leaf is the class distribution derived from the
training set.

A classification tree partitions the instance space into rectangular regions,
one for each leaf. We can clearly see that the majority of ham lives in the
lower left-hand corner.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 57 / 94

The ingredients of machine learning Logical models

Labelling a classification tree

• The leaves of the classification tree could be labelled, from left to
right, as ham – spam – spam, employing a simple decision rule called
majority class.

• Alternatively, we could label them with the proportion of spam e-mail
occurring in each leaf: from left to right, 1/3, 2/3, and 4/5.

• Or, if our task was a regression task, we could label the leaves with
predicted real values or even linear functions of some other,
real-valued features.
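
A small sketch of the first two labelling schemes, using the leaf counts from the tree above:

    # (spam count, ham count) for the three leaves, from left to right.
    leaves = [(20, 40), (10, 5), (20, 5)]

    majority_labels = ['spam' if s > h else 'ham' for s, h in leaves]
    spam_proportions = [s / (s + h) for s, h in leaves]

    print(majority_labels)     # ['ham', 'spam', 'spam']
    print(spam_proportions)    # [0.333..., 0.666..., 0.8]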

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 58 / 94

The ingredients of machine learning Logical models

A complete classification tree

[Figure: a classification tree that splits on ‘Viagra’ at the root and on ‘lottery’ in
both branches, giving one leaf per combination of feature values, with class
distributions spam: 20 / ham: 40, spam: 10 / ham: 5, spam: 20 / ham: 4 and
spam: 0 / ham: 1. Alongside, the corresponding instance-space partition.]

A complete classification tree built from two Boolean features. The
corresponding instance space partition is the finest partition that can be
achieved with those two features.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 59 / 94

The ingredients of machine learning Features: the workhorses of machine learning

Two uses of features

Suppose we want to approximate y = cosπx on the interval −1 ≤ x ≤ 1.
A linear approximation is not much use here, since the best fit would be
y = 0. However, if we split the x-axis in two intervals, −1 ≤ x < 0 and 0 ≤ x ≤ 1, we could find
reasonable linear approximations on each interval. We can achieve this by using x both as a
splitting feature and as a regression variable.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 60 / 94

The ingredients of machine learning Features: the workhorses of machine learning

A small regression tree

[Figure: a regression tree with a single split on x; if x < 0 the leaf predicts ŷ = 2x + 1, if x ≥ 0 it
predicts ŷ = −2x + 1. At right, the function y = cos πx on the interval −1 ≤ x ≤ 1 and its
piecewise linear approximation by the regression tree.]

A regression tree combining a one-split feature tree with linear regression models in the leaves.
Note: x is used as both a splitting feature and a regression variable.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 61 / 94

The ingredients of machine learning Feature construction and transformation

Class-sensitive discretisation I

[Figure: two histograms of the body-weight data described on the next slide, one with eleven
fixed 10 kg intervals and one after class-sensitive discretisation into a smaller number of
unequal-width intervals.]

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 62 / 94

The ingredients of machine learning Feature construction and transformation

Class-sensitive discretisation II

Artificial data depicting a histogram of body weight measurements of people with (blue) and
without (red) diabetes, with eleven fixed intervals of 10 kilograms width each. By joining the
first and second, third and fourth, fifth and sixth, and the eighth, ninth and tenth intervals, we
obtain a discretisation such that the proportion of diabetes cases increases from left to right.
This discretisation makes the feature more useful in predicting diabetes.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 63 / 94

The ingredients of machine learning Feature construction and transformation

The kernel trick

Let x1 = (x1, y1) and x2 = (x2, y2) be two data points, and consider the mapping
(x, y) ↦ (x², y², √2 xy) to a three-dimensional feature space. The points in feature space
corresponding to x1 and x2 are x′1 = (x1², y1², √2 x1y1) and x′2 = (x2², y2², √2 x2y2). The dot
product of these two feature vectors is

x′1 · x′2 = x1²x2² + y1²y2² + 2 x1y1 x2y2 = (x1x2 + y1y2)² = (x1 · x2)²

That is, by squaring the dot product in the original space we obtain the dot product in the new
space without actually constructing the feature vectors! A function that calculates the dot
product in feature space directly from the vectors in the original space is called a kernel – here
the kernel is κ(x1, x2) = (x1 · x2)².

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 64 / 94

The ingredients of machine learning Feature construction and transformation

Non-linearly separable data

[Figure: a scatter plot of two-class data, one class inside and one outside a circle around the
origin, and the same data after the transformation (x, y) ↦ (x², y²), where it becomes roughly
linearly separable.]

A linear classifier would perform poorly on this data. By transforming the original (x, y) data
into (x′, y′) = (x², y²), the data becomes more ‘linear’, and a linear decision boundary
x′ + y′ = 3 separates the data fairly well. In the original space this corresponds to a circle with
radius √3 around the origin.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 65 / 94

The ingredients of machine learning Representation is important

Where do features come from ?

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 66 / 94

The ingredients of machine learning Representation is important

Feature engineering

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 67 / 94
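
Tying together the feature transformations and the kernel trick above, a minimal numeric check (assuming numpy; the two points are invented) that the kernel κ(x1, x2) = (x1 · x2)² equals the dot product in the mapped feature space:

    import numpy as np

    def phi(p):
        """Map (x, y) to the feature space (x^2, y^2, sqrt(2)*x*y)."""
        x, y = p
        return np.array([x**2, y**2, np.sqrt(2) * x * y])

    def kernel(p1, p2):
        return np.dot(p1, p2) ** 2           # kappa(x1, x2) = (x1 . x2)^2

    x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(np.dot(phi(x1), phi(x2)))          # dot product in feature space
    print(kernel(x1, x2))                    # same value, without constructing phi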
The ingredients of machine learning Representation is important

Can features be learned ?

Yes, to some extent. For example, in the intermediate layers of a convolutional neural network1.

1Image downloaded 28/2/18 from http://devblogs.nvidia.com/deep-learning-nutshell-core-concepts/

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 68 / 94

The ingredients of machine learning Data is important

Where does data come from ?

AlphaZero played around 44 million chess games against an expert (itself).

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 69 / 94

The ingredients of machine learning Tasks: the problems that can be solved with machine learning

Tasks for machine learning

The most common machine learning tasks are predictive, in the sense that they concern
predicting a target variable from features.

• Binary and multi-class classification: categorical target
• Regression: numerical target
• Clustering: hidden target
• Dimensionality reduction: intrinsic structure

Exploratory or descriptive tasks are concerned with exploiting underlying structure in the data.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 70 / 94

The ingredients of machine learning Tasks: the problems that can be solved with machine learning

Measuring similarity I

If our e-mails are described by word-occurrence features as in the text classification example,
the similarity of e-mails would be measured in terms of the words they have in common. For
instance, we could take the number of common words in two e-mails and divide it by the
number of words occurring in either e-mail (this measure is called the Jaccard coefficient).

Suppose that one e-mail contains 42 (different) words and another contains 112 words, and the
two e-mails have 23 words in common, then their similarity would be
23 / (42 + 112 − 23) = 23/131 = 0.18.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 71 / 94
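
A minimal sketch of the Jaccard coefficient on word sets (the two example e-mails are invented; the comment restates the 23/131 calculation above):

    def jaccard(words_a, words_b):
        """Similarity = |intersection| / |union| of the two word sets."""
        a, b = set(words_a), set(words_b)
        return len(a & b) / len(a | b)

    # With 42 and 112 distinct words of which 23 are shared:
    # 23 / (42 + 112 - 23) = 23/131, roughly 0.18, as in the example above.
    print(jaccard('the quick brown fox'.split(), 'the lazy brown dog'.split()))  # 2/6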
The ingredients of machine learning Tasks: the problems that can be solved with machine learning

Measuring similarity II

We can then cluster our e-mails into groups, such that the average similarity of an e-mail to the
other e-mails in its group is much larger than the average similarity to e-mails from other groups.

While it wouldn’t be realistic to expect that this would result in two nicely separated clusters
corresponding to spam and ham – there’s no magic here – the clusters may reveal some
interesting and useful structure in the data. It may be possible to identify a particular kind of
spam in this way, if that subgroup uses a vocabulary, or language, not found in other messages.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 72 / 94

The ingredients of machine learning Tasks: the problems that can be solved with machine learning

Clustering

We want to cluster similarly expressed genes in cancer samples.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 73 / 94

The ingredients of machine learning Tasks: the problems that can be solved with machine learning

How many clusters ? I

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 74 / 94

The ingredients of machine learning Tasks: the problems that can be solved with machine learning

How many clusters ? II

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 75 / 94

The ingredients of machine learning Looking for structure

Looking for structure I

Consider the following matrix:

1 0 1 0
0 2 2 2
0 0 0 1
1 2 3 2
1 0 1 1
0 2 2 3

Imagine these represent ratings by six different people (in rows), on a scale of 0 to 3, of four
different films – say The Shawshank Redemption, The Usual Suspects, The Godfather, and The
Big Lebowski (in columns, from left to right).

The Godfather seems to be the most popular of the four with an average rating of 1.5, and
The Shawshank Redemption is the least appreciated with an average rating of 0.5.

Can you see any structure in this matrix?

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 76 / 94

The ingredients of machine learning Looking for structure

Looking for structure II

1 0 1 0       1 0 0
0 2 2 2       0 1 0       1 0 0       1 0 1 0
0 0 0 1   =   0 0 1   ×   0 2 0   ×   0 1 1 1
1 2 3 2       1 1 0       0 0 1       0 0 0 1
1 0 1 1       1 0 1
0 2 2 3       0 1 1

• The right-most matrix associates films (in columns) with genres (in rows): The Shawshank
Redemption and The Usual Suspects belong to two different genres, say drama and crime, The
Godfather belongs to both, and The Big Lebowski is a crime film and also introduces a new
genre (say comedy).

• The tall, 6-by-3 matrix then expresses people’s preferences in terms of genres.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 77 / 94

The ingredients of machine learning Looking for structure

Looking for structure III

• Finally, the middle matrix states that the crime genre is twice as important as the other two
genres in terms of determining people’s preferences.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 78 / 94
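
A minimal sketch (assuming numpy) checking that the three matrices above multiply back to the original ratings matrix:

    import numpy as np

    ratings = np.array([[1, 0, 1, 0],
                        [0, 2, 2, 2],
                        [0, 0, 0, 1],
                        [1, 2, 3, 2],
                        [1, 0, 1, 1],
                        [0, 2, 2, 3]])

    people_genres = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                              [1, 1, 0], [1, 0, 1], [0, 1, 1]])
    genre_weights = np.diag([1, 2, 1])                     # crime counts double
    genres_films = np.array([[1, 0, 1, 0],                 # drama
                             [0, 1, 1, 1],                 # crime
                             [0, 0, 0, 1]])                # comedy

    print(np.array_equal(people_genres @ genre_weights @ genres_films, ratings))  # True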
Algorithm-independent aspects of machine learning The philosophical problem

Deduction: derive specific consequences from general theories
Induction: derive general theories from specific observations

Deduction is well-founded (mathematical logic).

Induction is (philosophically) problematic – induction is useful since it often seems to work – an
inductive argument !

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 79 / 94

Algorithm-independent aspects of machine learning Generalisation - the key objective of machine learning

What we are really interested in is generalising from the sample of data in our training set. This
can be stated as:

The inductive learning hypothesis
Any hypothesis found to approximate the target (true) function well over a sufficiently large set
of training examples will also approximate the target function well over other unobserved
examples.

A corollary of this is that it is necessary to make some assumptions about the type of target
function in a task for an algorithm to go beyond the data, i.e., generalise or learn.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 80 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

Cross-validation I

There are certain parameters that need to be estimated during learning. We use the data, but
NOT the training set, OR the test set. Instead, we use a separate validation or development set.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 81 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

Cross-validation II

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 82 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

Cross-validation III

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 83 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

Cross-validation IV

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 84 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

Contingency table I

Two-class prediction case:

                     Predicted: Yes           Predicted: No
Actual: Yes          True Positive (TP)       False Negative (FN)
Actual: No           False Positive (FP)      True Negative (TN)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 85 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

Contingency table II

Classification Accuracy on a sample of labelled pairs (x, c(x)), given a learned classification
model that predicts, for each instance x, a class value ĉ(x):

acc = (1 / |Test|) Σ_{x ∈ Test} I[ĉ(x) = c(x)]

where Test is a test set and I[] is the indicator function, which is 1 iff its argument evaluates to
true, and 0 otherwise.

Classification Error is 1 − acc.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 86 / 94
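
A minimal sketch of this accuracy computation (the labels and predictions are invented for the example):

    def accuracy(true_labels, predicted_labels):
        """Fraction of test instances for which the prediction matches the label."""
        correct = sum(1 for c, c_hat in zip(true_labels, predicted_labels) if c == c_hat)
        return correct / len(true_labels)

    y_true = ['spam', 'ham', 'ham', 'spam', 'ham']
    y_pred = ['spam', 'ham', 'spam', 'spam', 'ham']
    print(accuracy(y_true, y_pred))        # 0.8, so classification error = 0.2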
Algorithm-independent aspects of machine learning Evaluating performance on a task

Overfitting

Imagine you are preparing for your Machine Learning 101 exam. Helpfully, your professor has
made previous exam papers and their worked answers available online. You begin by trying to
answer the questions from previous papers and comparing your answers with the model answers
provided.

Unfortunately, you get carried away and spend all your time on memorising the model answers
to all past questions. Now, if the upcoming exam completely consists of past questions, you are
certain to do very well. But if the new exam asks different questions about the same material,
you would be ill-prepared and get a much lower mark than with a more traditional preparation.

In this case, one could say that you were overfitting the past exam papers and that the
knowledge gained didn’t generalise to future exam questions.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 87 / 94

Algorithm-independent aspects of machine learning Evaluating performance on a task

The Bias-Variance Decomposition

Error (particularly MSE) can be shown to have two components:

Bias: error due to mismatch between the target function and the functions learnable by the
algorithm that was applied to the task

Variance: error due to variation between training sets sampled from the distribution on data
generated by the target function

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 88 / 94

Algorithm-independent aspects of machine learning The limits of machine learning

Inductive Bias

All models are wrong, but some models are useful.

Box & Draper (1987)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 89 / 94

Algorithm-independent aspects of machine learning The limits of machine learning

Inductive Bias

Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.

“Inductive bias” is the combination of assumptions and restrictions placed on the models and
algorithms used to solve a learning problem. Essentially, it amounts to assuming that the
algorithm and model combination you are using to solve the learning problem is appropriate
for the task.

Success in machine learning requires understanding the inductive bias of algorithms and models,
and choosing them appropriately for the task2.

2Even true for “deep learning”, but watch Andrew Ng’s talk on this at
http://www.youtube.com/watch?v=F1ka6a13S9I&t=48s.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 90 / 94

Algorithm-independent aspects of machine learning The limits of machine learning

No Free Lunch

Uniformly averaged over all target functions, the expected off-training-set error for all learning
algorithms is the same.

Wolpert (1996)

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 91 / 94

Algorithm-independent aspects of machine learning The limits of machine learning

No Free Lunch

Owing to the “No Free Lunch” Theorem, there is no universally best machine learning algorithm.

Some learning algorithms perform better than others on certain tasks since their assumptions
about the type of target function are more appropriate for the learning task in that application.
On some other tasks, those assumptions, and hence the algorithms, may perform much worse
than others. These assumptions form the inductive bias of the learning algorithm.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 92 / 94

Algorithm-independent aspects of machine learning The limits of machine learning

Ethics of machine learning

Machine learning algorithms are now widely available and increasingly used on “big data” in
commercial, medical, legal and scientific applications.

These applications bring many benefits, but as they become more widespread they will start to
impact most people as part of daily life.

In some of these applications the ethical and moral implications of the use of machine learning
must be considered. For example, can the use of machine learning be biased or unfair because
the training data does not adequately represent all of the data to which the model may be
applied ?

Here is a recent paper discussing some of these issues and giving some recommendations3.

3Zook et al. (2017) Ten simple rules for responsible big data research. PLoS Comput Biol 13(3):
e1005399. http://doi.org/10.1371/journal.pcbi.1005399

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 93 / 94

Summary

Summary

The purpose of this introductory lecture was, from a high level, to survey the landscape that we
will explore in this course. We have motivated the subject matter, outlined what parts of
machine learning will be addressed in this course, and introduced some key ideas and
terminology.

Throughout this lecture we have deliberately avoided details like programming languages,
machine learning tools and software libraries. These change very rapidly, although the core
technical ideas and challenges remain.

Following this lecture you should have a clear idea of the course scope. In the remaining
lectures we will expand on the topics we have just introduced and go into more detail. You will
also get to work on practical applications of what we have covered.

COMP9417 ML & DM Intro to ML & DM Semester 1, 2018 94 / 94