MFIN6201
Week 10 – An Overview of Machine Learning Models
Leo Liu
April 15, 2020
Outline
• What is it?
• From linear regression to linear classifiers
• Decision Trees – The building block
• Random Forest and Boosting Trees
• Neural Networks
• An easy application of NN to NLP: Word2vec
• K-Means clustering
• An application of NLP and K-Means clustering: Classifying products and services in the US economy
• Statistical Modeling: A story of two cultures
Machine Learning – What is it?
• A sub-field of artificial intelligence that relies heavily on statistics
• ML gives computers the capability to learn without being explicitly programmed, and to perform certain tasks based on the learned knowledge.
• There is always some form of optimisation problem to solve
Types of ML
1. Supervised Learning – the algorithm needs to be trained on labelled data before it becomes useful
2. Unsupervised Learning – no labelled training data is needed
ML is everywhere now
• YouTube knows which videos to show on your homepage based on your past watching history
• Google sends you ads based on your preferences
• TikTok (or DouYin) knows what to show on your feed to capture your interest
• Email spam detection
• Self-driving Cars
• ….
Supervised Learning
Supervised learning models normally solve a prediction problem. Consider an OLS regression (OLS can also be an ML model…!): suppose we want to predict Y using X
Y = β0 + β1X + ε
To predict Y, we need to know the values of β0 and β1. We therefore run the regression on the “training” sample (in-sample), where we observe both Y and X, to estimate β0 and β1. We can then predict Y given X:
Ŷ = β̂0 + β̂1X
The sample we use for prediction is normally called the test sample (out-of-sample); a sketch of this workflow follows below.
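A minimal sketch of this train/predict workflow, assuming scikit-learn and numpy are available; the data here is simulated and the coefficients are made up for illustration.

```python
# Minimal sketch: fit OLS on a training sample, predict on a test sample.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)  # simulated: beta0=1, beta1=2

X_train, X_test = X[:150], X[150:]          # in-sample vs. out-of-sample split
y_train, y_test = y[:150], y[150:]

ols = LinearRegression().fit(X_train, y_train)   # estimate beta0, beta1 on the training sample
y_hat = ols.predict(X_test)                      # predict Y given X in the test sample
print(ols.intercept_, ols.coef_, np.mean((y_hat - y_test) ** 2))  # out-of-sample MSE
```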
OLS as an ML Model
The objective function of OLS is to minimise the MSE (mean squared error); that is, for an OLS regression:
obj(θ) = (Ŷ − Y)²
How does OLS achieve this goal? It chooses the parameters β0 and β1 to minimise the in-sample squared errors (ε²).
Notice that we do not care about endogeneity, causality, etc. For prediction purposes, we only care about the performance of the model, not the underlying mechanism.
Well, is OLS a good prediction model?
We know that OLS, or any linear regression, assumes a linear relationship between variables, but this is not a good description of the world…! Consider a prediction problem: we want to predict whether a person likes computer games by using his/her age, gender, occupation, etc. as inputs
• Age has a non-linear effect. For example, we may find that only people below the age of 20 like computer games
• The age effect often interacts with other variables such as gender. For example, only young boys may like computer games
• Such non-linear and complex interactions among variables make linear regression a second-class citizen in the ML world
OLS as an ML model
To improve the performance of OLS as a predictor, we consider a few ways to fix two problems, namely under-fitting and over-fitting.
We can clearly see the under-fitting problem in picture 1. An obvious way to capture the non-linearity of the underlying model is to add polynomial or interaction terms to the regression, which could lead to a good model as in picture 2. However, it can also lead to an over-fitting problem as in picture 3 (see the sketch below).
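An illustrative sketch of under- vs. over-fitting by varying the polynomial degree; the data and degrees are made up for illustration and only numpy is assumed.

```python
# Degree 1 under-fits, a moderate degree fits well, a high degree over-fits.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)   # non-linear truth + noise

for degree in (1, 3, 10):                  # roughly picture 1, 2, 3
    coefs = np.polyfit(x, y, degree)       # OLS with polynomial terms
    fit = np.polyval(coefs, x)
    print(degree, round(np.mean((fit - y) ** 2), 4))   # in-sample MSE keeps falling
                                                        # even once the fit over-fits
```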
OLS as an ML model: over-fitting the outliers
Consider a dataset with outliers: the outlying data point pulls the fitted line towards itself because OLS tries to fit it.
Lasso and Ridge regression
The solution to the over-fitting problem is to add a regularization term to the objective function.
Lasso: obj(θ) = (Ŷ − Y)² + λ|β|
Ridge: obj(θ) = (Ŷ − Y)² + λβ²
Both of these penalty terms prevent the regression from fitting the data with a huge β (over-guessing).
* λ is the regularization parameter that determines the size of the penalty. I omit the details of how to choose λ here.
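A sketch of the two regularised regressions with scikit-learn, where alpha plays the role of λ; the alpha values are arbitrary for illustration, not tuned, and the data is simulated.

```python
# L1 (Lasso) shrinks some coefficients exactly to 0; L2 (Ridge) shrinks all towards 0.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))              # 10 candidate predictors
y = 2.0 * X[:, 0] + rng.normal(size=100)    # only the first one actually matters

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)          # penalty lambda * |beta|
ridge = Ridge(alpha=1.0).fit(X, y)          # penalty lambda * beta^2

print(np.round(ols.coef_, 2))
print(np.round(lasso.coef_, 2))             # mostly zeros except the first coefficient
print(np.round(ridge.coef_, 2))
```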
OLS as an ML model
Another way to see why merely minimising the MSE can be problematic for prediction is the bias vs. variance tradeoff.
Here, if the training sample is not a good representation of the test sample (e.g. it uses different sample selection criteria), an OLS regression may achieve low bias in-sample but high variance out-of-sample by over-guessing.
Decision Trees
The prediction models we have discussed so far rely on functional forms, and this is their major drawback.
• Predictions rely on the appropriateness of the functional form, e.g. a linear function
• Most of the time, the functional form does not describe reality…! e.g. Y is not a linear function of X…!
Before I introduce more advanced models, let me introduce the building block behind all of the “non-parametric” models.
Decision Trees
Suppose we have data on people’s status (student or not), their age, income, their credit ratings, and whether they have a computer or not (Yes / No). We can build an initial tree from our guesses; our goal here is to correctly predict whether a person has a computer.
For a student with low income and a fair credit rating, what’s your prediction on whether he has a computer using this tree?
Decision Trees
We can then generate many trees and select the best tree
How to select the best tree?
Well, we just need to optimise the objective function
Decision Trees
As in linear models, we need to define a loss function and a regularization term.
As a simple example, we can define the objective function in this way:
obj(θ) = (Ŷ − Y)² + λN
where N is the maximum depth of the tree or the maximum number of splits (see the sketch below).
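A sketch of a regression tree whose complexity is limited, assuming scikit-learn. Note that scikit-learn does not literally add a λN penalty; capping max_depth plays a similar regularising role, and the data is simulated.

```python
# A shallow tree (few splits) vs. an unrestricted tree that can over-fit.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(scale=0.3, size=300)  # step function + noise

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)    # regularised: few splits
deep = DecisionTreeRegressor(max_depth=None).fit(X, y)    # unlimited splits: can over-fit

print(shallow.predict([[3.0], [7.0]]))        # roughly 1 and 3, the underlying pattern
print(shallow.get_depth(), deep.get_depth())  # the deep tree uses many more splits
```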
Tree Ensembles
To further improve the model, instead of using just one tree, we can use many trees, each of which is small and has only a small amount of predictive power.
The idea is simple: instead of betting on one tree structure, you bet on many trees to extract diversification benefits (the errors from individual trees tend to offset each other).
I introduce only the two most popular methods to grow the trees: Random Forest (Breiman, 2001) and Gradient Boosting (Friedman, 2001).
Random Forest
As the name suggests, we can grow trees by generating them based on some random elements, specifically, random vectors.
1. We can train the tree parameters on bootstrapped samples. Basically, we repeat the method of finding the best tree structure on independent sub-samples. For N bootstrapped samples, we grow N trees.
2. We can generate trees by random input selection, e.g. tree 1 has x1 and x2, tree 2 has x3 and x4
Because the process is random, we can ensure low correlation among the trees (to get the diversification benefits). A sketch follows below.
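A sketch combining both sources of randomness above, assuming scikit-learn; the data is simulated and the parameter values are illustrative.

```python
# Each tree is fit on a bootstrap sample (bootstrap=True) and each split only
# considers a random subset of inputs (max_features).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4))                       # inputs x1..x4
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(
    n_estimators=200,     # grow 200 trees on 200 bootstrapped samples
    max_features=2,       # each split only sees 2 randomly chosen inputs
    bootstrap=True,
).fit(X, y)

print(forest.predict(X[:3]))      # average of 200 weakly correlated trees
```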
Boosting Tree
The more popular tree ensemble method nowadays is boosting. I introduce only the most popular boosting method, namely gradient boosting.
Instead of randomly generating trees, we can generate trees stepwise; at each step we try to optimise the loss function.
Gradient boosting
1. The boosting method starts from a base learner, which is normally the mean of Y. Your predictor then takes the form F0(x) = argmin_c Σ L(yi, c), which is normally ȳ
2. We calculate the residuals, εi = yi − ȳ
3. We try to find a tree with parameters a (inputs, split points, etc.) that can best offset the residuals. Mathematically, a1 = argmin_a Σ (εi − tree(a))² if our loss function is squared errors
4. We add the tree to our base learner to form a new learner F1(x) = F0(x) + tree(a1)
5. We return to step 2 to calculate new residuals, εi = yi − F1(xi)
6. We then find a tree with parameters a2 that best offsets the new residuals
7. We add the new tree to the learner we have so far: F2(x) = F1(x) + tree(a2)
8. Repeating for M steps, we obtain a learner (predictor) FM(x) with M trees. A minimal sketch of this loop follows below.
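A minimal sketch of the loop above for squared-error loss, with shallow scikit-learn trees standing in for tree(a); no learning rate is used, to stay close to the steps on the slide, and the data is simulated.

```python
# Gradient boosting by repeatedly fitting small trees to the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=400)

F = np.full_like(y, y.mean())          # step 1: F0(x) = y-bar
trees = []
for m in range(50):                    # M = 50 boosting steps
    resid = y - F                      # steps 2/5: residuals of the current learner
    tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)  # step 3: fit the residuals
    trees.append(tree)
    F = F + tree.predict(X)            # steps 4/7: add the tree to the learner

print(np.mean((F - y) ** 2))           # in-sample MSE after M trees
```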
Regression Tree (CART)
You can immediately see the benefit of the tree structure
• By splitting groups based on continuous variables, we can uncover non-linear relationships
• By computing prediction scores as the sum of scores from different variables, we can uncover interaction effects among variables
The majority of data mining competitions are won by some variant of the regression tree. The most popular regression tree model is XGBoost, which uses Gradient Boosting (a usage sketch follows below).
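A sketch of XGBoost’s scikit-learn-style interface, assuming the xgboost Python package is installed; the parameter values are illustrative, not tuned, and the data is simulated.

```python
# Boosted regression trees with XGBoost: train in-sample, predict out-of-sample.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=200,     # number of boosted trees (M)
    max_depth=3,          # depth limit per tree (regularization)
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
)
model.fit(X[:400], y[:400])                  # training sample
pred = model.predict(X[400:])                # test sample
print(np.mean((pred - y[400:]) ** 2))        # out-of-sample MSE
```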
Neural Networks
• The idea is very old. It was first proposed in 1943 by Warren McCulloch and Walter Pitts.
• The idea was inspired by how the brain works.
Neural Networks
• Each neuron contains one algorithm; the number of layers can be very large, e.g. AlphaGo has 84 layers
Neural Networks – examples
Let’s consider the simplest case: one input and one layer with one neuron containing a linear predictor
1. the input is x
2. the neuron contains the algorithm y = 2x
3. the predicted value is then y = 2 if the input is x = 1
We can now expand this to a network with multiple neurons:
1. neuron 1 contains the algorithm y = 2x
2. neuron 2 contains the algorithm y = 3x
3. neuron 3 contains the algorithm y = 4x
4. neuron 4 contains the algorithm y = 5x
Let’s say each neuron has an input weight of 1/4; the predicted value given x = 1 would then be 1/4 × 2 + 1/4 × 3 + 1/4 × 4 + 1/4 × 5, which is 3.5 (a small sketch follows below).
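A tiny sketch of this forward pass in Python, using the slopes and weights from the slide.

```python
# Four linear "neurons" combined with equal input weights of 1/4 each.
slopes = [2, 3, 4, 5]                 # each neuron computes y = slope * x
weights = [0.25, 0.25, 0.25, 0.25]    # input weight of each neuron

def predict(x):
    # weighted sum of the neurons' outputs
    return sum(w * (s * x) for w, s in zip(weights, slopes))

print(predict(1))   # 3.5, as on the slide
```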
Neural Networks – Training through backpropagation
Unlike other ML models, which are trained by tuning the parameters inside the algorithm, neural networks are trained by tuning the input weights of each neuron. They are normally trained through backpropagation (Rumelhart, Hinton, and Williams, 1986).
Let me illustrate this with a hypothetical example. Say we have 4 neurons in 1 layer, as in the previous slide’s example, and the following training data:
y     x
2     1
4     1.5
6     3.5
Neural Networks – Training through backpropagation
1. First we train on the first row: the squared error of the first neuron is 0, the lowest, so its weight increases by 0.15 and the others decrease by 0.05 each. We now have weights: 0.40, 0.20, 0.20, 0.20
2. Next we train on the second row: the squared error of the second neuron is 0.25, the lowest, so its weight increases by 0.15 and the others decrease by 0.05 each. We now have weights: 0.35, 0.35, 0.15, 0.15
3. Finally we train on the third row: the squared error of the first neuron is 1, the lowest, so its weight increases by 0.15 and the others decrease by 0.05 each. We now have weights: 0.50, 0.30, 0.10, 0.10
The predicted value given x = 1 is then 0.5 × 2 + 0.3 × 3 + 0.1 × 4 + 0.1 × 5 = 2.8, which is close to the value the training data gives at x = 1, namely 2. A sketch of this toy updating scheme follows below.
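A sketch that reproduces the slide’s simplified “winner gains 0.15, others lose 0.05” scheme; this is a toy illustration, not real gradient-based backpropagation.

```python
# Toy weight-update loop: after each training row, the neuron with the smallest
# squared error gains 0.15 weight and each other neuron loses 0.05.
slopes = [2, 3, 4, 5]                         # neuron k computes slopes[k] * x
weights = [0.25, 0.25, 0.25, 0.25]
data = [(1.0, 2.0), (1.5, 4.0), (3.5, 6.0)]   # (x, y) rows from the slide

for x, y in data:
    errors = [(s * x - y) ** 2 for s in slopes]      # squared error of each neuron
    best = errors.index(min(errors))                 # winner of this round
    weights = [w + 0.15 if k == best else w - 0.05
               for k, w in enumerate(weights)]

print([round(w, 2) for w in weights])                         # [0.5, 0.3, 0.1, 0.1]
print(round(sum(w * s * 1 for w, s in zip(weights, slopes)), 2))  # prediction at x = 1 -> 2.8
```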
Word2vec
A simple and popular implementation of neural networks in Natural Language Processing is Word2vec, an algorithm for learning word representations. Consider the following two sentences:
1. President meets the media in Washington, D.C.
2. Donald Trump greets the press in the capital city
The two sentences have almost the same meaning. However, this is very difficult for a computer to pick up, as they use different words. Ideally, we would like the computer to know that Trump is the president, that “meet” and “greet” have similar meanings, and that Washington is the capital city.
You could build a dictionary for this, but it is extremely difficult to be exhaustive. People have tried for centuries without luck.
Word2vec
The idea of Word2vec is simple: we look at a large amount of text, and if two words (say A and B) are always surrounded by the same words, A and B must have very similar meanings. For example,
1. Have a good day!
2. Have a great day!
Since “good” and “great” are often surrounded by similar words, the computer can learn that they have similar meanings.
Of course, we need a large training dataset to achieve this (I am talking about billions of words). A usage sketch follows below.
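A sketch of training word vectors, assuming the gensim library (4.x parameter names such as vector_size); a real model needs a huge corpus, and this toy one only demonstrates the interface.

```python
# Learn word vectors from co-occurring context words with gensim's Word2Vec.
from gensim.models import Word2Vec

corpus = [
    ["have", "a", "good", "day"],
    ["have", "a", "great", "day"],
    ["president", "meets", "the", "media"],
    ["trump", "greets", "the", "press"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimension N of the word vectors
    window=2,         # how many surrounding words to look at
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # skip-gram variant
)

print(model.wv["good"].shape)                 # a 50-dimensional vector
print(model.wv.similarity("good", "great"))   # similarity learned from shared context
```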
Word2vec – How?
We can train a shallow neural network to project each word into a vector space using the words that surround it.
We project each word using a projection layer, which contains N neurons; each neuron is a lookup table that converts a word to its coordinate along one direction of the N-dimensional space.
The output is an N-dimensional vector specifying the coordinates of the word in the vector space. A small sketch follows below.
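A small sketch of the lookup-table view of the projection layer; the embedding matrix here is random rather than trained, purely to illustrate the lookup, and the vocabulary is made up.

```python
# Row i of an embedding matrix is the N-dimensional vector for word i.
import numpy as np

vocab = {"president": 0, "meets": 1, "media": 2, "washington": 3}
N = 5                                            # dimension of the vector space
rng = np.random.default_rng(7)
embedding = rng.normal(size=(len(vocab), N))     # one lookup entry per word

word = "president"
vector = embedding[vocab[word]]                  # the projection layer's output
print(vector.shape)                              # (5,) -> an N-dimensional coordinate
```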
Unsupervised Learning
Normally, unsupervised learning models solve classification (grouping) problems or perform dimension reduction.
• Classify news items into topics
• Customer segmentations
• Insurance fraud detection
• Delivery service optimisation (where should I set up a post office / delivery center…?)
K-Means Clustering Algorithm
Let me introduce one popular unsupervised ML algorithm – K-means.
It is a powerful, simple, and very popular clustering algorithm. Consider the problem below: one wants to group the data points shown.
K-means
To group these data points using K-means, we need to initialise centroids (imaginary centres for our groups) and set K (the number of groups). Here, we initialise K = 3 centroids.
K-means
Apparently, these centroids do not fit the data too well. Good centroids should minimise the summed squared distance, as in
min Σ_{j=1,…,J} min_{k=1,…,K} ||xj − ck||²
where xj are the blue points we want to cluster, ck are the centroids (green stars); we have J blue points and K centroids.
K-means
We iteratively change the positions of the centroids to find the optimum; each blue point is assigned to the centroid closest to it. A minimal sketch of this loop follows below.
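A minimal numpy sketch of the iterative assign/update loop; the data is simulated as three blobs and a fixed number of iterations is used for simplicity.

```python
# K-means loop: assign each point to its nearest centroid, then move each
# centroid to the mean of its assigned points.
import numpy as np

rng = np.random.default_rng(8)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])   # three blobs (J = 150)

K = 3
centroids = points[rng.choice(len(points), size=K, replace=False)]  # initialise K centroids

for _ in range(10):
    # distance of every point to every centroid, shape (J, K)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                         # nearest-centroid assignment
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(K)])

print(np.round(centroids, 2))   # close to the true blob centres
```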
Statistical Modeling – Two cultures
We rely exclusively on data models: statistical analysis usually starts with “Assume the data generating process is …”
Algorithmic models, in contrast, make no assumptions about the data distribution but instead try to model the data itself.
Data models normally have no power in explaining the behavior of the data.
Statistical Modeling – Two cultures
Data models are easier to interpret, but are they relevant? Algorithmic models are accurate and powerful, but how do we make economic sense of them?
As a statistician, you should be open to a wide range of tools.
Homework
Please write a paragraph explaining why linear models can be poor predictors and what remedies you can apply.
References
Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29:1189–232.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning representations by back-propagating errors. Nature 323:533–6.