
Slide 1

BUSINESS SCHOOL

 Discipline of Business Analytics

QBUS6850 Team

2

Topics covered

• Logistic regression
• Intuition of regularization
• Regularized linear regressions: Ridge, LASSO, Elastic Net
• Feature extraction

References

• Bishop (2006), Sections 3.1.4 and 4.3.2
• Hastie et al. (2001), Chapters 3.4 and 7.7-7.10
• James et al. (2014), Sections 4.3 and 6.2

3

Learning Objectives

• Understand the intuition of regularization
• Review how Ridge regression, LASSO regression and Elastic Net work
• Understand the differences between the various regularized regressions
• Understand the difference between regression and classification
• Be able to calculate the predicted probability with logistic regression
• Understand different types of features and feature extraction
• Be able to extract features from text data
• Understand cross validation and be able to conduct cross validation

4

Classification

• Linear regression was used as an example to show the machine learning workflow
• The major ingredients are data, a model, and a criterion (objective). What are they in linear regression?
• In regression, the target t was a continuous numeric value; however, in many applications we wish to predict a class instead of an amount
• When the target is categorical, the problem is called classification
• For classification, how shall we choose a model? How shall we design a criterion to measure the “error” between the observation and the model prediction?
• We will look at logistic regression as an example

5

Classification

Supervised learning with categorical response: classification. Two classes. Can we use linear regression for classification?

[Figure: linear fit of "Default or not?" (target 1 or 0) against Annual Income, with the 0.5 level marked. Higher income: less likely to default. Lower income: more likely to default.]

If f(x, β) ≥ 0.5, predict t = 1;
If f(x, β) < 0.5, predict t = 0.

The income threshold is $50,000.

Presenter notes. Question: can we change the target values "0 and 1" to "1 and …" or "1 and -1"?

6

Classification

Supervised learning with categorical response. Two classes. Linear regression for classification?

[Figure: the same fit of "Default or not?" against Annual Income with one added outlier.]

One outlier significantly changes the income threshold (from $50,000 to $65,000). Do we really need to change the classification rule/boundary?

7

Logistic Regression

8

Re-coding Target

• In fact, there is no numeric target value in classification. We manually encode it as 1 (class A) or 0 (class B).
• Can we re-code "Class A" as (1, 0) and "Class B" as (0, 1)? That is, for each case we have two target values, or the target is a vector t = (t1, t2) [recall our notation in Lecture 2, where m = 2].
• Hence we shall have two models, f_A(x, β) → t1 and f_B(x, β) → t2.
• How shall we measure the error between them? [We need a learning criterion or objective.]
• For our coding (1, 0) and (0, 1) for classes A and B, we have t1 + t2 = 1, with 0 ≤ t1 ≤ 1 and 0 ≤ t2 ≤ 1.
• Can we say t = (t1, t2) is a Bernoulli distribution with parameter t1?
• Hence each training target (class A or class B) becomes an "extreme" Bernoulli distribution, either (1, 0) or (0, 1).

9

Developing a New Objective

• Each training target (class A or class B) becomes an "extreme" Bernoulli distribution, either (1, 0) or (0, 1).
• The two models f_A(x, β) → t1 and f_B(x, β) → t2 shall aim to predict the Bernoulli parameter; that is, f_A(x, β) should be the probability that case x is class A, and f_B(x, β) the probability that case x is class B.
• Three conditions: 0 ≤ f_A(x, β) ≤ 1, 0 ≤ f_B(x, β) ≤ 1, and f_A(x, β) + f_B(x, β) = 1.
• That is, (f_A(x, β), f_B(x, β)) is a Bernoulli distribution for each case x too.
• Suppose we already have models satisfying the above conditions. The question now becomes: how do we tell whether the Bernoulli (f_A(x, β), f_B(x, β)) is close to Bernoulli (1, 0) [if x is class A] or to Bernoulli (0, 1) [if x is class B]?

10

Cross Entropy Objective

• Our simple example has demonstrated that simply measuring the squared error
  (f_A(x, β) − t1)² + (f_B(x, β) − t2)²
  is not a good way.
• Although a Bernoulli distribution is represented by a 2D vector, such vectors are special: both components lie between 0 and 1, and they sum to 1.
• To measure the "distance" between distributions, we use either the so-called Kullback-Leibler divergence or the so-called cross entropy. For the two Bernoulli distributions (f_A(x, β), f_B(x, β)) and t = (t1, t2), the cross entropy is defined as
  −t1 log f_A(x, β) − t2 log f_B(x, β)
• For all the data we have
  L(β) = −(1/N) Σ_{n=1}^{N} [ t_{n1} log f_A(x_n, β) + t_{n2} log f_B(x_n, β) ]

11

Clarification

• Don't confuse the following things. The two classes A and B:
  - can be labelled as 1 and 0 respectively, so we use the target value t = 1 or 0;
  - can be encoded as (1, 0) and (0, 1) respectively, regarded as a one-hot code or a Bernoulli distribution. We can then focus on the first component, 1 or 0 respectively [this is not a label, but a probability].
• In both cases, we can simply use one target variable (not a vector) t, which takes the value 1 (for class A) or 0 (for class B).
• Similarly, we only need to focus on f_A(x, β), because f_B(x, β) = 1 − f_A(x, β). We simply write f(x, β) for f_A(x, β), i.e., we need only one model.
• Finally, the loss is defined as
  L(β) = −(1/N) Σ_{n=1}^{N} [ t_n log f(x_n, β) + (1 − t_n) log(1 − f(x_n, β)) ]
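To make the cross-entropy loss concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture code files) that evaluates L(β) for a handful of made-up targets and predicted class-A probabilities:

import numpy as np

# Targets t_n (1 = class A, 0 = class B) and model outputs f(x_n, beta),
# i.e. predicted probabilities of class A. Toy numbers for illustration only.
t = np.array([1, 0, 1, 1, 0])
f = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Cross-entropy loss L(beta) = -(1/N) * sum[ t*log(f) + (1-t)*log(1-f) ]
L = -np.mean(t * np.log(f) + (1 - t) * np.log(1 - f))
print(L)   # about 0.26 for these numbers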
12

Logistic Function

How can we make a model satisfy 0 ≤ f(x, β) ≤ 1?

The logistic function, also called the sigmoid function:

σ(z) = 1 / (1 + e^{−z}) = e^{z} / (e^{z} + 1),   z ∈ (−∞, +∞),   σ(z) ∈ (0, 1)

import numpy as np
import matplotlib.pyplot as plt

x_list = np.linspace(-6, 6, 100)
y_list = 1 / (1 + np.exp(-x_list))
plt.plot(x_list, y_list)

13

Logistic Regression

Regression + logistic function:

f(x, β) = σ(xᵀβ) = 1 / (1 + e^{−xᵀβ})

If xᵀβ ≥ 0, then f(x, β) ≥ 0.5: predict class A;
If xᵀβ < 0, then f(x, β) < 0.5: predict class B.

Question: why is this function better? Think about the "outlier" a few slides before.

14

Output Interpretation

f(x, β) tells us the estimated probability of a given input x being class A, parameterized by β:

P(t = A | x, β) := f(x, β)
P(t = B | x, β) := 1 − P(t = A | x, β) = 1 − f(x, β)

Suppose one customer x_i has an annual income of $120,000 and f(x_i, β) = 0.1 = 10%. The risk management team of the bank can then tell that this customer has a 10% probability of defaulting (class A). This information is crucial for the bank's decision making!

15

Decision Boundary

f(x, β) = σ(xᵀβ) = σ(β0 + β1·x1 + β2·x2) = σ(−5 + x1 + x2) = 0.5

If x1 + x2 − 5 ≥ 0, predict class A (t = 1);
If x1 + x2 − 5 < 0, predict class B (t = 0).

How many features do we have? Are the data labelled or not?

[Figure: points in the (x1, x2) plane, labelled t = 1 and t = 0, separated by the line x1 + x2 = 5 crossing each axis at 5. x1 and x2 are the 1st and 2nd feature vectors.]

16

Non-linear Decision Boundary

f(x, β) = σ(xᵀβ) = σ(β0 + β1·x1² + β2·x2²) = σ(−5 + x1² + x2²) = 0.5

If x1² + x2² − 5 ≥ 0, predict t = 1;
If x1² + x2² − 5 < 0, predict t = 0.

What does the decision boundary look like?

[Figure: points in the (x1, x2) plane separated by the circle x1² + x2² = 5.]

17

Loss Function (Formal)

N: number of training examples
d: number of features
x: "input" variable; features
t: "output" or "target" variable, t ∈ {0, 1}. Note: t takes the value 0 or 1, not a value in between.

D = {(x1, t1), (x2, t2), (x3, t3), …, (xN, tN)}

Once again, collect all the inputs into a matrix X of size N × (d + 1), and define the parameter vector β = (β0, β1, β2, …, βd)ᵀ.

Loss(f(x_n, β), t_n) = −log f(x_n, β)         if t_n = 1
                       −log(1 − f(x_n, β))    if t_n = 0

L(β) = (1/N) Σ_{n=1}^{N} Loss(f(x_n, β), t_n)

18

Logistic Regression Loss Function

f(x, β) = σ(xᵀβ) = 1 / (1 + e^{−xᵀβ})

21

Loss Function: Compact Representation

The piecewise loss

Loss(f(x_n, β), t_n) = −log f(x_n, β)         if t_n = 1
                       −log(1 − f(x_n, β))    if t_n = 0

in L(β) = (1/N) Σ_{n=1}^{N} Loss(f(x_n, β), t_n) can be written compactly as

L(β) = −(1/N) Σ_{n=1}^{N} [ t_n log f(x_n, β) + (1 − t_n) log(1 − f(x_n, β)) ]

This loss function can also be derived statistically using a methodology called Maximum Likelihood Estimation (MLE).

22

Logistic Regression Summary

• Logistic regression is a special case of Generalized Linear Models (GLM).
• The logit or sigmoid function is a link function. Many respectable numerical packages, e.g., sklearn.linear_model, contain GLM implementations which include logistic regression.
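Since the slides point to sklearn.linear_model as one such implementation, the following is a minimal sketch, with made-up income data, of fitting a logistic regression and reading off the predicted class probabilities; it is an illustration I have added, not one of the lecture's example files:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: annual income (in $1000s) and default indicator (1 = default).
X = np.array([[30], [35], [45], [60], [80], [120]])
t = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression()
model.fit(X, t)

# Predicted probability of each class for a new customer, and the class label.
x_new = np.array([[50]])
print(model.predict_proba(x_new))   # columns ordered as in model.classes_
print(model.predict(x_new))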
23

Regularization Intuition (QBUS6810)

24

What Have We Learnt?

Supervised learning with continuous response: regression, with single or multiple features.

[Figure: Revenue (700-1400) against Age (0-16). A quadratic fit f(x, β) = β0 + β1·x1 + β2·x1² is "just right"; a quartic fit f(x, β) = β0 + β1·x1 + β2·x1² + β3·x1³ + β4·x1⁴ is overfitting.]

25

Regularization

Can we penalize the parameters β3 and β4 to be close to 0, so that the model

f(x, β) = β0 + β1·x1 + β2·x1² + β3·x1³ + β4·x1⁴

is approximately

f(x, β) = β0 + β1·x1 + β2·x1²?

Add penalty terms to the loss:

L(β) = (1/(2N)) Σ_{n=1}^{N} (t_n − f(x_n, β))² + λβ3² + λβ4²

If λ were very large, e.g., 10,000, then the parameters β3 and β4 would be heavily penalized, i.e., forced close to 0.

26

Regularized Linear Regressions (QBUS6810)

27

Ridge Regression

The ridge regression estimator is the minimiser of the cost function with a quadratic regularization term:

L(β) = (1/(2N)) Σ_{n=1}^{N} (t_n − f(x_n, β))² + (λ/(2N)) Σ_{j=1}^{d} βj²

• λ ≥ 0 is a regularization parameter which regulates the tradeoff (i.e., regulates model complexity).
• The penalty term does not include the intercept term β0.
• With λ = 0, we have the ordinary linear regression cost function. λ = 0 corresponds to the greatest complexity (bias is at a minimum, but variance is high).

28

• The penalty term penalises the departure of the regression parameters from zero, i.e., shrinks them toward zero.
• Ridge regression cannot zero out a specific coefficient: the model either ends up including all the coefficients, or none of them.
• Small λ: no or low regularization. Can fit a high-order polynomial or other complex model.
• Large λ: high regularization. β will be small. If λ is very, very large, the model becomes a horizontal line through the data.

[Figures: Revenue against Age fitted with a small λ (flexible curve) and with a large λ (nearly flat line).]

29

Learning Curve

L(β) = (1/(2N)) Σ_{n=1}^{N} (t_n − f(x_n, β))² + (λ/(2N)) Σ_{j=1}^{d} βj²   (this loss function is used to estimate the model)

L_train(β) = (1/(2N_tr)) Σ_{n=1}^{N_tr} (t_n − f(x_n, β))²
L_v(β) = (1/(2N_v)) Σ_{n=1}^{N_v} (t_n − f(x_n, β))²

[Figure: loss against λ. Small λ: tends to overfit, high variance. Large λ: tends to underfit, high bias. The best model sits at the λ with the smallest validation loss L_v(β).]

Presenter notes: when λ is large, β becomes smaller, so the training loss gets larger. The validation loss, however, first goes down and then gets larger. The best model is the one with the smallest validation error.

30

How to Choose λ?

Test a large number of different λ values, e.g., 10,000 values between 0.0001 and 100, denoted by λj (j = 1, 2, 3, …, 10000). For each λj, minimise L(β) and record the validation loss; use the model with the smallest validation loss for the test set.

λ        Validation loss
0.0001   estimated
0.001    estimated
0.01     estimated
0.02     estimated
0.04     estimated (smallest)
…        …
100      estimated
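As a concrete version of the λ search in the table above, the sketch below (my own illustration of the workflow, not Lecture03_Example01.py) fits scikit-learn's Ridge over a grid of λ values on a train/validation split and keeps the value with the smallest validation error; note that scikit-learn calls the regularization parameter alpha:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
t = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=150)

X_tr, X_v, t_tr, t_v = train_test_split(X, t, test_size=0.3, random_state=0)

lambdas = np.logspace(-4, 2, 100)          # candidate values between 0.0001 and 100
val_losses = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_tr, t_tr)
    val_losses.append(mean_squared_error(t_v, model.predict(X_v)))

best_lambda = lambdas[int(np.argmin(val_losses))]
print(best_lambda)                          # the λ with the smallest validation loss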
31

Ridge Regression Gradient Descent

• Start from some random values for all βi;
• Keep updating all βi (simultaneously) to decrease the loss function value L(β);
• Repeat until reaching the minimum (convergence).

Update simultaneously (partial derivative calculation omitted):

β0 := β0 − α ∂L(β)/∂β0 = β0 − α (1/N) Σ_{n=1}^{N} (β0 + β1·x_{n1} + β2·x_{n2} + ⋯ + βd·x_{nd} − t_n)

β1 := β1 − α ∂L(β)/∂β1 = β1 − α [ (1/N) Σ_{n=1}^{N} (β0 + β1·x_{n1} + β2·x_{n2} + ⋯ + βd·x_{nd} − t_n)·x_{n1} + (λ/N)·β1 ]

⋮

βd := βd − α ∂L(β)/∂βd = βd − α [ (1/N) Σ_{n=1}^{N} (β0 + β1·x_{n1} + β2·x_{n2} + ⋯ + βd·x_{nd} − t_n)·x_{nd} + (λ/N)·βd ]

In vector form:

β := β − α ∂L(β)/∂β = β − α [ (1/N) Xᵀ(f(X, β) − t) + (λ/N)·β ]

32

The LASSO

Least absolute shrinkage and selection operator:

L(β) = (1/(2N)) Σ_{n=1}^{N} (t_n − f(x_n, β))² + (λ/(2N)) Σ_{j=1}^{d} |βj|

• LASSO does both parameter shrinkage and variable selection automatically.
• Some coefficients are forced to zero as λ increases (effectively a subset selection).

33

Elastic Net

Elastic Net is a regularized regression method that linearly combines the penalties of the LASSO and ridge methods:

L(β) = (1/(2N)) Σ_{n=1}^{N} (t_n − f(x_n, β))² + (λ1/(2N)) Σ_{j=1}^{d} |βj| + (λ2/(2N)) Σ_{j=1}^{d} βj²

Note that, due to the shared L1/L2 regularisation, Elastic Net does not prune features as aggressively as LASSO. In practice it often performs well when used for regression prediction.

See Lecture03_Example01.py

34

CV with Regularization

• Suppose we have 150 data points and wish to find a better ridge regression, i.e., to find an appropriate λ for the ridge regression model.
• Under K-fold CV, we are going to test a large number of different λ values, e.g., 10,000 values between 0.0001 and 100, denoted by λj (j = 1, 2, 3, …, 10000).
• Divide (randomly) the 150 data points into 5 groups, each with 30 data points.

CV with Regularization

• We then run 5-fold cross validation with the following steps for each λj.
• For each λj, output the mean validation error (L_v(β1) + L_v(β2) + … + L_v(β5))/5 over the validation sets, and select the λ that generates the least error, say λ151.
• Then build the final model with λ151 by minimising

L(β) = (1/(2N)) Σ_{n=1}^{N} (t_n − f(x_n, β))² + (λ151/(2N)) Σ_{j=1}^{d} βj²

This process can be used with LASSO and Elastic Net as well.

35

Appropriate K in CV?

• The special case K = N is known as leave-one-out (LOO) cross-validation.
• With K = N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but it can have high variance because the N "training sets" are so similar to one another. The computational burden is also considerable, requiring N applications of the learning method for LOO.
• On the other hand, with K = 5 say, cross-validation has lower variance, while bias could be a problem, depending on how the performance of the learning method varies with the size of the training set.
• Overall, five-fold or ten-fold cross-validation is recommended as a good compromise.
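A compact way to run the 5-fold search described above in scikit-learn is sketched below (again my own illustration, with randomly generated data); cross_val_score returns one score per fold, and the same loop works for Lasso or ElasticNet by swapping the estimator:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))                      # 150 data points, as in the slide
t = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=150)

lambdas = np.logspace(-4, 2, 50)                   # candidate λ (alpha) values
mean_cv_errors = []
for lam in lambdas:
    scores = cross_val_score(Ridge(alpha=lam), X, t,
                             scoring="neg_mean_squared_error", cv=5)
    mean_cv_errors.append(-scores.mean())          # mean validation MSE over the 5 folds

best_lambda = lambdas[int(np.argmin(mean_cv_errors))]
final_model = Ridge(alpha=best_lambda).fit(X, t)   # refit on all data with the chosen λ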
36

Feature Extraction and Representation

Processing Features

37

How do the real data look?

• All we have assumed so far is that data come to us in good shape, mostly in numeric format and possibly in categorical form.
• In Python machine learning, we normally organise data into a matrix (or multidimensional arrays).
• However, data coming from application domains can be in many different forms or categories.
• For business applications, we may have data in the form of text (natural language), or in media such as audio and video.
• Numeric data are easy to deal with and can be sent to a machine learning algorithm straightaway.

38

Categorical Features

39

• In raw data, categorical features are represented by strings, for example “Red”, “Green”, “Blue”, “Yellow”, and “White”.
• They are not suitable for machine learning as they are; we need to engineer them into numbers.
• Label representation: for example, map “Red” to 4, “Green” to 3, “Blue” to 2, “Yellow” to 1, and “White” to 0.
• One-hot encoding: use scikit-learn to make this transform, or pandas’ get_dummies. See Lecture03_Example02.py

Transforming Categorical Features in scikit-learn

40

• Encoding categorical features
• Converting a categorical feature to a one-hot coding with OneHotEncoder
• Loading features from dicts
• The number of features increases (e.g., from 3 to 9, or from 2 to 4) after one-hot encoding

(from the scikit-learn documentation)

Ordinal Features

41

• In raw data, ordinal features are represented by strings, for example “Strongly Agree”, “Agree”, “Neutral”, “Disagree”, and “Strongly Disagree”.
• The order information is important for modelling. Most of the time, we can encode such a feature with integers, such as “Strongly Agree” to 5, “Agree” to 4, “Neutral” to 3, “Disagree” to 2, and “Strongly Disagree” to 1.
• LabelEncoder in scikit-learn can be used to transform such non-numerical labels into numerical labels (from the scikit-learn documentation).

Bucketized Feature (in TensorFlow)

42

• Sometimes it is more meaningful to convert numbers into numerical ranges.
• Thus we engineer some numeric features into categorical features.
• Then convert them to one-hot codings or labels.

Feature Hashing

43

• If a categorical feature has a huge number of different values, then the one-hot coding is a long (sparse) vector.
• Instead of using a long 0-1 vector, we use a hash function which calculates a hash code. We force the different input values into a smaller set of categories.
• Collision: both “kitchenware” and “sports” may be mapped to the same value.

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)   # raw_X: an iterable of iterables of strings
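To tie the last few slides together, here is a small sketch I have added (separate from Lecture03_Example02.py) showing the three encodings on made-up data: one-hot encoding via pandas get_dummies or OneHotEncoder, integer codes via LabelEncoder, and hashing via FeatureHasher:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colours = pd.DataFrame({"colour": ["Red", "Green", "Blue", "Green"]})

# One-hot encoding: pandas get_dummies or scikit-learn's OneHotEncoder.
onehot_df = pd.get_dummies(colours["colour"])
onehot_sk = OneHotEncoder().fit_transform(colours[["colour"]]).toarray()

# Label encoding: strings to integer codes (note: codes are assigned
# alphabetically, so an explicit mapping is safer for ordinal features).
codes = LabelEncoder().fit_transform(["Agree", "Neutral", "Disagree"])

# Feature hashing: map many category values into a fixed number of columns.
hashed = FeatureHasher(n_features=8, input_type="string").transform(
    [["kitchenware"], ["sports"]])

print(onehot_df.shape, onehot_sk.shape, codes, hashed.shape)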
44

Bag-of-Words

• In business intelligence, we analyse texts such as business plans, business reports, even news.
• Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers.
• The BoW (bag-of-words) representation of text describes the occurrence of words within a document.
• For example, suppose we have a vocabulary of 1000 words, and a document in which “ours” appears 3 times, “competition” appears 1 time and “managers” appears 10 times.
• We represent this document as a vector of dimension 1000 (each component of the vector corresponds to a word in the vocabulary), with 3 in the position corresponding to “ours”, 1 in the position corresponding to “competition” and 10 in the position corresponding to “managers”.
• All other positions have the value 0, so the vector looks like

x = (0, …, 0, 3, 0, …, 0, 1, 0, …, 10, 0, …, 0)ᵀ ∈ ℝ^1000,

which is sparse. Why?

45

Text Feature Extraction

• Scikit-learn can extract numerical features from text content, such as counting the occurrences of tokens in each document.
• A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
• See Lecture03_Example03.py

46

tf-idf Term Weighting

• tf-idf(t, d) is defined as

tf-idf(t, d) = tf(t, d) × idf(t)

• tf(t, d) (term frequency): the frequency of a term t in a document d, i.e., the number of occurrences of t in d.
• idf(t) (inverse document frequency): a measure of how much information the word t provides, that is, whether the term is common or rare across all documents:

idf(t) = log( total number of documents / number of documents in the corpus containing term t )

• There are other definitions for these two “frequencies”.
• How is this done in scikit-learn? See Lecture03_Example04.py

47

Embedding Representation

• Both one-hot encoding and BoW may produce feature representations which are of large dimension and sparse.
• High dimension results in the so-called curse of dimensionality problem in many machine learning algorithms.
• As those representations are actually sparse, a natural question is whether we can find a compact format for them.
• So-called embedding learning, or dimensionality reduction, can achieve this goal.
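As a quick illustration of both the counting (BoW) and tf-idf representations, here is a toy example of my own, separate from Lecture03_Example03.py and Lecture03_Example04.py:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "our managers welcome the competition",
    "the competition hires new managers",
    "our revenue grows",
]

# Bag-of-words: one row per document, one column per vocabulary word.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(X_counts.toarray())

# tf-idf: reweights the counts so that common words get less weight.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)
print(X_tfidf.toarray().round(2))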

48

Principal Component Analysis (PCA)

• Principal component analysis seeks a space of lower dimensionality, known as the principal subspace, such that the orthogonal projection of the data points onto this subspace maximizes the variance of the projected points.

[Figure: 2D data points (red) projected orthogonally onto a line, the principal subspace (magenta); the projections are shown as green dots.]

• How can we find this line? Can we do this by linear regression?

49

Principal Component Analysis (PCA)

• Objective: given a set of d measurements on N individuals, we aim to determine r ≤ d orthogonal (uncorrelated) variables, called principal components, defined as linear combinations of the original ones.
• The PCs are uncorrelated and have decreasing variance.
• Synthesis: information (dimensionality) reduction.
• Interpretation: express the original data in terms of a reduced number of underlying variables (factors).
• Score the individual profiles with a summary score.
• Obtain multivariate displays (scatterplots) of the units in two or three dimensions.
• The first component is designed to capture as much of the variability in the data as possible, and each succeeding component in turn extracts as much of the residual variability as possible.

50

PCA: The Algorithm

• Given a set of data D = {x1, x2, …, xN}. Suppose they have been centralised, i.e., the mean has been removed. Collect them in a data matrix X of size N × d.
• Calculate the variance matrix S = (1/N) XᵀX, of size d × d.
• Conduct the eigen-decomposition of S such that S = AΛAᵀ, where AᵀA = I_d and Λ = diag(λ1, λ2, …, λd).
• The first r (r ≤ d) principal components of X are given by Z_r = X A_r, where A_r is the matrix of the first r columns of A. Z_r has size N × r.
• Each row of Z_r (in r new factors/features) is a new representation of the given data, i.e., of the corresponding row of X (in d attributes/features).

51

PCA: Selecting r

• Consider the share of the total variance absorbed by the first r components Z_r:

Q_r = ( Σ_{h=1}^{r} λ_h ) / ( Σ_{h=1}^{d} λ_h )

Select r so that, for example, Q_r ≥ 0.95.

• Kaiser criterion: compute the average eigenvalue

λ̄ = (1/d) Σ_{h=1}^{d} λ_h

and select the first r components for which λ_h > λ̄. Note: if the variables are standardised, λ̄ = 1.

Python Example
(Lecture03_Example05.py)
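Alongside Lecture03_Example05.py, the following self-contained NumPy sketch (with randomly generated data) implements the algorithm above together with the Q_r criterion for choosing r:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy data, N = 200, d = 5

X = X - X.mean(axis=0)                 # centralise: remove the mean of each column
S = (X.T @ X) / X.shape[0]             # variance (covariance) matrix, d x d
eigvals, A = np.linalg.eigh(S)         # eigen-decomposition, S = A diag(eigvals) A^T

# eigh returns eigenvalues in ascending order; sort them in decreasing order.
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# Share of total variance Q_r absorbed by the first r components.
Q = np.cumsum(eigvals) / eigvals.sum()
r = int(np.argmax(Q >= 0.95)) + 1      # smallest r with Q_r >= 0.95

Z_r = X @ A[:, :r]                     # first r principal components, N x r
print(r, Q.round(3))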

52
