Interpreting Learning Rate Cat. Features Logistic Reg. Non-Linear Relationships Multinomial SVM
Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 7: Error-based Learning Sections 7.4, 7.5
John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy
Interpreting Multivariable Linear Regression Models
Setting the Learning Rate Using Weight Decay
Handling Categorical Descriptive Features
Handling Categorical Target Features: Logistic Regression
Modeling Non-linear Relationships
Multinomial Logistic Regression
Support Vector Machines
Interpreting Multivariable Linear Regression Models
The weights used by linear regression models indicate the effect of each descriptive feature on the predictions returned by the model.
Both the sign and the magnitude of the weight provide information on how the descriptive feature affects the predictions of the model.
Table: Weights and standard errors for each feature in the office rentals model.

Descriptive Feature   Weight     Standard Error   t-statistic   p-value
SIZE                  0.6270     0.0545           11.504        <0.0001
FLOOR                 -0.1781    2.7042           -0.066        0.949
BROADBAND RATE        0.071396   0.2969           0.240         0.816
It is tempting to infer the relative importance of the different descriptive features in the model from the magnitude of the weights.
However, direct comparison of the weights tells us little about their relative importance.
A better way to determine the importance of each descriptive feature in the model is to perform a statistical significance test.
The statistical significance test we use to analyze the importance of a descriptive feature d [j] in a linear regression model is the t-test.
The null hypothesis for this test is that the feature does not have a significant impact on the model. The test statistic we calculate is called the t-statistic.
The standard error for the overall model is calculated as

se = sqrt( ( Σ_{i=1}^{n} (t_i − M(d_i))² ) / (n − 2) )    (1)

A standard error calculation is then done for a descriptive feature as follows:

se(d[j]) = se / sqrt( Σ_{i=1}^{n} (d_i[j] − d̄[j])² )    (2)

The t-statistic for this test is calculated as follows:

t = w[j] / se(d[j])    (3)
Using a standard t-statistic look-up table, we can then determine the p-value associated with this test (this is a two-tailed t-test with degrees of freedom set to the number of instances in the training set minus 2).
If the p-value is less than the required significance level, typically 0.05, we reject the null hypothesis and say that the descriptive feature has a significant impact on the model; otherwise we say that it does not.
Table: Weights and standard errors for each feature in the office rentals model.

Descriptive Feature   Weight     Standard Error   t-statistic   p-value
SIZE                  0.6270     0.0545           11.504        <0.0001
FLOOR                 -0.1781    2.7042           -0.066        0.949
BROADBAND RATE        0.071396   0.2969           0.240         0.816
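The t-statistic calculation above can be sketched in code. This is a minimal sketch, not the book's implementation: it reproduces the t-statistics from the table, and the critical value 1.96 is a rough large-sample approximation of the two-tailed 0.05 threshold (an exact test would look up the t-distribution with n − 2 degrees of freedom).

```python
def t_statistic(weight, se_feature):
    """t-statistic for the null hypothesis that a weight is zero."""
    return weight / se_feature

# Weights and standard errors from the office rentals model table.
features = {
    "SIZE":           (0.6270,   0.0545),
    "FLOOR":          (-0.1781,  2.7042),
    "BROADBAND RATE": (0.071396, 0.2969),
}

# Large-sample approximation of the two-tailed 0.05 critical value.
CRITICAL = 1.96

for name, (w, se) in features.items():
    t = t_statistic(w, se)
    print(f"{name}: t = {t:.3f}, significant = {abs(t) > CRITICAL}")
```

Running this recovers the t-statistics in the table: only SIZE (t = 11.504) exceeds the critical value, matching the p-values shown.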
Setting the Learning Rate Using Weight Decay
Learning rate decay allows the learning rate to start at a large value and then decay over time according to a predefined schedule.
A good approach is to use the following decay schedule:
α_τ = α_0 × ( c / (c + τ) )    (4)

where α_0 is the initial learning rate, τ is the current training iteration, and c is a constant that controls how quickly the learning rate decays.
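The decay schedule can be sketched as a small function. This is a sketch for illustration; the settings α_0 = 0.18 and c = 10 are taken from the first decay figure below.

```python
def decayed_learning_rate(alpha_0, c, tau):
    """Learning rate decay schedule: alpha_tau = alpha_0 * c / (c + tau)."""
    return alpha_0 * c / (c + tau)

# With alpha_0 = 0.18 and c = 10, the learning rate starts at 0.18
# and has halved by iteration 10.
for tau in (0, 5, 10, 100):
    print(tau, round(decayed_learning_rate(0.18, 10, tau), 4))
```

Larger values of c keep the learning rate high for longer, which is why the second figure below (c = 100) shows a longer, more oscillatory journey across the error surface.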
Figure: (a) The journey across the error surface for the office rentals prediction problem when learning rate decay is used (α_0 = 0.18, c = 10); (b) a plot of the changing sum of squared error values during this journey.
Figure: (a) The journey across the error surface for the office rentals prediction problem when learning rate decay is used (α_0 = 0.25, c = 100); (b) a plot of the changing sum of squared error values during this journey.
Handling Categorical Descriptive Features
The basic structure of the multivariable linear regression model allows for only continuous descriptive features, so we need a way to handle categorical descriptive features.
The most common approach to handling categorical features uses a transformation that converts a single categorical descriptive feature into a number of continuous descriptive feature values that can encode the levels of the categorical feature.
For example, the ENERGY RATING descriptive feature would be converted into three new continuous descriptive features, as it has three distinct levels: 'A', 'B', and 'C'.
Table: The office rentals dataset adjusted to handle the categorical ENERGY RATING descriptive feature in linear regression models.

ID   SIZE   FLOOR   BROADBAND   ENERGY     ENERGY     ENERGY     RENTAL
                    RATE        RATING A   RATING B   RATING C   PRICE
1    500    4       8           0          0          1          320
2    550    7       50          1          0          0          380
3    620    9       7           1          0          0          400
4    630    5       24          0          1          0          390
5    665    8       100         0          0          1          385
6    700    4       8           0          1          0          410
7    770    10      7           0          1          0          480
8    880    12      50          1          0          0          600
9    920    14      8           0          0          1          570
10   1000   9       24          0          1          0          620
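The transformation applied in the table above is commonly called one-hot encoding: each level of the categorical feature becomes its own 0/1 indicator feature. A minimal sketch:

```python
def one_hot(value, levels):
    """Convert one categorical value into a list of 0/1 indicator features."""
    return [1 if value == level else 0 for level in levels]

ENERGY_LEVELS = ["A", "B", "C"]

# Rental 1 has ENERGY RATING 'C', matching the row (0, 0, 1) in the table.
print(one_hot("C", ENERGY_LEVELS))  # [0, 0, 1]
print(one_hot("A", ENERGY_LEVELS))  # [1, 0, 0]
```

Exactly one of the new features is 1 for each instance, so the encoded features remain continuous-valued inputs that the linear regression model can weight individually.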
Returning to our example, the regression equation for this RENTAL PRICE model would change to
RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR
             + w[3] × BROADBAND RATE
             + w[4] × ENERGY RATING A
             + w[5] × ENERGY RATING B
             + w[6] × ENERGY RATING C
where the newly added categorical features allow the original ENERGY RATING feature to be included.
Handling Categorical Target Features: Logistic Regression
Table: A dataset listing features for a number of generators.
ID RPM VIBRATION STATUS
1 568 585 good
2 586 565 good
3 609 536 good
4 616 492 good
5 632 465 good
6 652 528 good
7 655 496 good
8 660 471 good
9 688 408 good
10 696 399 good
11 708 387 good
12 701 434 good
13 715 506 good
14 732 485 good
15 731 395 good
16 749 398 good
17 759 512 good
18 773 431 good
19 782 456 good
20 797 476 good
21 794 421 good
22 824 452 good
23 835 441 good
24 862 372 good
25 879 340 good
26 892 370 good
27 913 373 good
28 933 330 good
ID   RPM   VIBRATION   STATUS
29   562   309         faulty
30   578   346         faulty
31   593   357         faulty
32   626   341         faulty
33   635   252         faulty
34   658   235         faulty
35   663   299         faulty
36   677   223         faulty
37   685   303         faulty
38   698   197         faulty
39   699   311         faulty
40   712   257         faulty
41   722   193         faulty
42   735   259         faulty
43   738   314         faulty
44   753   113         faulty
45   767   286         faulty
46   771   264         faulty
47   780   137         faulty
48   784   131         faulty
49   798   132         faulty
50   820   152         faulty
51   834   157         faulty
52   858   163         faulty
53   888   91          faulty
54   891   156         faulty
55   911   79          faulty
56   939   99          faulty
Figure: A scatter plot of the RPM and VIBRATION descriptive features from the generators dataset shown in Table 4 [18], where 'good' generators are shown as crosses and 'faulty' generators are shown as triangles.

Figure: A scatter plot of the RPM and VIBRATION descriptive features from the generators dataset shown in Table 4 [18]. A decision boundary separating 'good' generators (crosses) from 'faulty' generators (triangles) is also shown.
As the decision boundary is a linear separator, it can be defined using the equation of a line:
VIBRATION = 830 − 0.667 × RPM (5) or
830 − 0.667 × RPM − VIBRATION = 0 (6)
Applying Equation (6)[21] to the instance RPM = 810, VIBRATION = 495, which is above the decision boundary, gives the following result:
830 − 0.667 × 810 − 495 = −205.27
By contrast, if we apply Equation (6)[21] to the instance RPM = 650 and VIBRATION = 240, which is below the decision boundary, we get
830 − 0.667 × 650 − 240 = 156.45
All the data points above the decision boundary will result in a negative value when plugged into the decision boundary equation, while all data points below the decision boundary will result in a positive value.
Reverting to our previous notation we have:

Mw(d) = 1 if w · d ≥ 0, and 0 otherwise    (7)
The surface defined by this rule is known as a decision surface.
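The hard-threshold decision rule can be sketched directly. This is an illustrative sketch, assuming the input vector is extended with a leading 1 so the intercept of Equation (6) becomes the bias weight:

```python
def dot(w, d):
    """Dot product w . d of a weight vector and an input vector."""
    return sum(wi * di for wi, di in zip(w, d))

def hard_threshold_predict(w, d):
    """Mw(d) = 1 if w . d >= 0, else 0."""
    return 1 if dot(w, d) >= 0 else 0

# Weights encoding the boundary 830 - 0.667*RPM - VIBRATION = 0,
# with d = [1, RPM, VIBRATION] (the leading 1 is the bias input).
w = [830, -0.667, -1]
print(hard_threshold_predict(w, [1, 650, 240]))  # below the boundary -> 1
print(hard_threshold_predict(w, [1, 810, 495]))  # above the boundary -> 0
```

The two instances are the ones worked through above: (650, 240) gives 156.45 and so predicts 1, while (810, 495) gives -205.27 and so predicts 0.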
Figure: (a) A surface showing the value of Equation (6)[21] for all values of RPM and VIBRATION. The decision boundary given in Equation (6)[21] is highlighted. (b) The same surface linearly thresholded at zero to operate as a predictor.
The hard decision boundary given in Equation (7)[24] is discontinuous, so it is not differentiable, which means we cannot calculate the gradient of the error surface.
Furthermore, the model always makes completely confident predictions of 0 or 1, whereas a little more subtlety is desirable.
We address these issues by using a more sophisticated threshold function that is continuous, and therefore differentiable, and that allows for the subtlety desired: the logistic function
The logistic function:

Logistic(x) = 1 / (1 + e^(−x))    (8)

where x is a numeric value and e is Euler's number, approximately equal to 2.7183.
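A minimal sketch of the logistic function, showing the behavior visible in the plot below: it passes through 0.5 at x = 0 and saturates smoothly toward 0 and 1.

```python
import math

def logistic(x):
    """Logistic(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0))    # 0.5: the decision threshold sits at x = 0
print(logistic(10))   # close to 1
print(logistic(-10))  # close to 0
```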
Figure: A plot of the logistic function for x in the range [−10, 10].
To build a logistic regression model, we simply pass the output of the basic linear regression model through the logistic function
Mw(d) = Logistic(w · d)
A note on training logistic regression models:
1 Before we train a logistic regression model we map the binary target feature levels to 0 or 1.
2 The error of the model on each instance is then the difference between the target feature (0 or 1) and the value of the prediction, which lies in the range [0, 1].
Mw(⟨RPM, VIBRATION⟩) = 1 / (1 + e^(−(−0.4077 + 4.1697 × RPM + 6.0460 × VIBRATION)))
Figure: The decision surface for the example logistic regression model.
P(t = 'faulty' | d) = Mw(d)
P(t = 'good' | d) = 1 − Mw(d)
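The probability reading of the model output can be sketched as follows. This is a sketch, assuming the slides' convention that 'faulty' is mapped to 1, that the descriptive features have been normalized to [−1, 1] before prediction, and that the input vector carries a leading 1 for the bias; the instance values below are hypothetical normalized inputs.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w, d):
    """Mw(d) = Logistic(w . d), read as P(t = 'faulty' | d)."""
    return logistic(sum(wi * di for wi, di in zip(w, d)))

# Final model weights from the slides (features assumed normalized).
w = [-0.4077, 4.1697, 6.0460]

# A hypothetical normalized instance <1 (bias), RPM, VIBRATION>.
d = [1, 0.2, 0.2]
p_faulty = predict(w, d)
p_good = 1 - p_faulty
print(round(p_faulty, 3), round(p_good, 3))
```

Because the two probabilities must sum to 1, a single model output suffices for a binary target.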
Figure: A selection of the logistic regression models developed during the gradient descent process for the machinery dataset from Table 4 [18]. The bottom-right panel shows the sum of squared error values generated during the gradient descent process.
To repurpose the gradient descent algorithm for training logistic regression models, the only change that needs to be made is in the weight update rule.
See p. 360 in the book for details of how to derive the new weight update rule.
The new weight update rule is:
w[j] ← w[j] + α × ((ti − Mw(di)) × Mw(di) × (1 − Mw(di)) × di[j])
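A single-instance sketch of this update rule (not the book's batch implementation): one update moves the prediction toward the target, since the gradient term (t − Mw(d)) × Mw(d) × (1 − Mw(d)) × d[j] pushes each weight in the right direction.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_weights(w, d, t, alpha):
    """One logistic regression gradient descent update for instance d
    (d[0] = 1 for the bias) with target t in {0, 1}."""
    pred = logistic(sum(wi * di for wi, di in zip(w, d)))
    # w[j] <- w[j] + alpha * (t - pred) * pred * (1 - pred) * d[j]
    return [wj + alpha * (t - pred) * pred * (1 - pred) * dj
            for wj, dj in zip(w, d)]

# A single update nudges the prediction toward the target t = 1.
w = [0.0, 0.0, 0.0]
d = [1, 0.5, -0.3]
before = logistic(sum(wi * di for wi, di in zip(w, d)))
w = update_weights(w, d, t=1, alpha=0.1)
after = logistic(sum(wi * di for wi, di in zip(w, d)))
print(before, after)  # after > before
```

The extra Mw(di) × (1 − Mw(di)) factor, compared with the linear regression rule, is the derivative of the logistic function.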
Table: The extended generators dataset.

ID   RPM    VIBRATION   STATUS
10   653    554         faulty
11   679    516         faulty
12   688    524         faulty
13   684    450         faulty
14   699    512         faulty
15   703    505         faulty
16   717    377         faulty
17   740    377         faulty
18   749    501         faulty
19   756    492         faulty
20   752    381         faulty
21   762    508         faulty
22   781    474         faulty
23   781    480         faulty
24   804    460         faulty
25   828    346         faulty
26   830    366         faulty
27   864    344         faulty
28   882    403         faulty
29   891    338         faulty
30   921    362         faulty
31   941    301         faulty
32   965    336         faulty
33   976    297         faulty
34   994    287         faulty
35   501    463         good
36   526    443         good
37   536    412         good
38   564    394         good
39   584    398         good
40   602    398         good
41   610    428         good
42   638    389         good
43   652    394         good
44   659    336         good
45   662    364         good
46   672    308         good
47   691    248         good
48   694    401         good
49   718    313         good
50   720    410         good
51   723    389         good
52   744    227         good
53   741    397         good
54   770    200         good
55   764    370         good
56   790    248         good
57   786    344         good
58   792    290         good
59   818    268         good
60   845    232         good
61   867    195         good
62   878    168         good
63   895    218         good
64   916    221         good
65   950    156         good
66   956    174         good
67   973    134         good
68   1002   121         good
Figure: A scatter plot of the extended generators dataset given in Table 35 [35], which results in instances with the different target levels overlapping with each other. 'good' generators are shown as crosses, and 'faulty' generators are shown as triangles.
For logistic regression models we recommend that descriptive feature values always be normalized.
In this example, before the training process begins, both descriptive features are normalized to the range [−1, 1].
For this example, let's assume that α = 0.02.

Initial Weights: w[0] = -2.9465, w[1] = -1.0147, w[2] = -2.1610
Iteration 1 (first four training instances shown):

ID   TARGET   Pred.    Error    Squared   errorDelta   errorDelta   errorDelta
     LEVEL                      Error     (D, w[0])    (D, w[1])    (D, w[2])
1    1        0.5570   0.4430   0.1963    0.1093       -0.1093      0.1093
2    1        0.5168   0.4832   0.2335    0.1207       -0.1116      0.1159
3    1        0.4469   0.5531   0.3059    0.1367       -0.1134      0.1197
4    1        0.4629   0.5371   0.2885    0.1335       -0.1033      0.1244
...
Sum                             24.4738   2.7031       -0.7015      1.6493
Sum of squared errors (Sum/2): 12.2369
w[j] ← w[j] + α × ((ti − Mw(di)) × Mw(di) × (1 − Mw(di)) × di[j])

New weights (after Iteration 1): w[0] = -2.8924, w[1] = -1.0287, w[2] = -2.1940
Iteration 2 (first four training instances shown):

ID   TARGET   Pred.    Error    Squared   errorDelta   errorDelta   errorDelta
     LEVEL                      Error     (D, w[0])    (D, w[1])    (D, w[2])
1    1        0.5817   0.4183   0.1749    0.1018       -0.1018      0.1018
2    1        0.5414   0.4586   0.2103    0.1139       -0.1053      0.1094
3    1        0.4704   0.5296   0.2805    0.1319       -0.1094      0.1155
4    1        0.4867   0.5133   0.2635    0.1282       -0.0992      0.1194
...
Sum                             24.0524   2.7236       -0.6646      1.6484
Sum of squared errors (Sum/2): 12.0262
w[j] ← w[j] + α × ((ti − Mw(di)) × Mw(di) × (1 − Mw(di)) × di[j])

New weights (after Iteration 2): w[0] = -2.8380, w[1] = -1.0416, w[2] = -2.2271
Figure: A selection of the logistic regression models developed during the gradient descent process for the extended generators dataset in Table 35 [35]. The bottom-right panel shows the sum of squared error values generated during the gradient descent process.
The final model found is:
Mw(⟨RPM, VIBRATION⟩) = 1 / (1 + e^(−(−0.4077 + 4.1697 × RPM + 6.0460 × VIBRATION)))
Modeling Non-linear Relationships
Table: A dataset describing grass growth on Irish farms during July 2012.
ID   RAIN    GROWTH
1    -       14.016
2    -       10.834
3    -       13.026
4    -       11.019
5    -       4.162
6    -       14.167
7    -       10.190
8    -       13.525
9    -       13.899
10   -       13.949
11   -       8.643
12   3.754   11.420
13   2.809   13.847
14   1.809   13.757
15   4.114   9.101
16   2.834   13.923
17   3.872   10.795
18   2.174   14.307
19   4.353   8.059
20   3.684   12.041
21   2.140   14.641
22   2.783   14.138
23   3.960   10.307
24   3.592   12.069
25   3.451   12.335
26   1.197   10.806
27   0.723   7.822
28   1.958   14.010
29   2.366   14.088
30   1.530   12.701
31   0.847   9.012
32   3.843   10.885
33   0.976   9.876
Figure: A scatter plot of the RAIN and GROWTH features from the grass growth dataset.
The best linear model we can learn for this data is: GROWTH = 13.510 − 0.667 × RAIN
Figure: A simple linear regression model trained to capture the relationship between the grass growth and rainfall.
In order to handle non-linear relationships, we transform the data rather than the model, using a set of basis functions:

Mw(d) = Σ_{k=0}^{b} w[k] × φ_k(d)    (10)

where φ_0 to φ_b are a series of basis functions that each transform the input vector d in a different way.
The advantage of this is that, except for introducing the mechanism of basis functions, we do not need to make any other changes to the approach we have presented so far.
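A minimal sketch of polynomial basis functions, using a hypothetical set of weights for illustration (not the trained grass growth model):

```python
def polynomial_bases(degree):
    """Basis functions phi_0..phi_degree for one feature: phi_k(x) = x**k."""
    return [lambda x, k=k: x ** k for k in range(degree + 1)]

def predict(w, d, bases):
    """Mw(d) = sum_k w[k] * phi_k(d) for a single descriptive feature d."""
    return sum(wk * phi(d) for wk, phi in zip(w, bases))

# A hypothetical second-order model: 1 + 2*RAIN - 0.5*RAIN^2.
bases = polynomial_bases(2)
print(predict([1.0, 2.0, -0.5], 3.0, bases))  # 1 + 6 - 4.5 = 2.5
```

Because the basis functions only transform the inputs, the model remains linear in the weights, so the same gradient descent training procedure applies unchanged.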
The relationship between rainfall and grass growth in the grass growth dataset can be accurately represented as a second-order polynomial through the following model:

GROWTH = w[0] × φ_0(RAIN) + w[1] × φ_1(RAIN) + w[2] × φ_2(RAIN)

where φ_0(RAIN) = 1, φ_1(RAIN) = RAIN, and φ_2(RAIN) = RAIN².