Interpreting Learning Rate Cat. Features Logistic Reg. Non-Linear Relationships Multinomial SVM
Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 7: Error-based Learning Sections 7.4, 7.5
John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy
Interpreting Multivariable Linear Regression Models
Setting the Learning Rate Using Weight Decay
Handling Categorical Descriptive Features
Handling Categorical Target Features: Logistic Regression
Modeling Non-linear Relationships
Multinomial Logistic Regression
Support Vector Machines
Interpreting Multivariable Linear Regression Models
The weights used by linear regression models indicate the effect of each descriptive feature on the predictions returned by the model.
Both the sign and the magnitude of the weight provide information on how the descriptive feature affects the predictions of the model.
Table: Weights and standard errors for each feature in the office rentals model.

Descriptive Feature   Weight     Standard Error   t-statistic   p-value
SIZE                  0.6270     0.0545           11.504        <0.0001
FLOOR                 -0.1781    2.7042           -0.066        0.949
BROADBAND RATE        0.071396   0.2969           0.240         0.816
It is tempting to infer the relative importance of the different descriptive features in the model from the magnitude of the weights.
However, direct comparison of the weights tells us little about their relative importance.
A better way to determine the importance of each descriptive feature in the model is to perform a statistical significance test.
The statistical significance test we use to analyze the importance of a descriptive feature d [j] in a linear regression model is the t-test.
The null hypothesis for this test is that the feature does not have a significant impact on the model. The test statistic we calculate is called the t-statistic.
The standard error for the overall model is calculated as

se = sqrt( ( Σ_{i=1}^{n} (t_i − M(d_i))² ) / (n − 2) )    (1)

A standard error calculation is then done for a descriptive feature as follows:

se(d[j]) = se / sqrt( Σ_{i=1}^{n} (d_i[j] − d̄[j])² )    (2)

The t-statistic for this test is calculated as follows:

t = w[j] / se(d[j])    (3)
Using a standard t-statistic look-up table, we can then determine the p-value associated with this test (this is a two-tailed t-test with degrees of freedom set to the number of instances in the training set minus 2).
If the p-value is less than the required significance level, typically 0.05, we reject the null hypothesis and say that the descriptive feature has a significant impact on the model; otherwise we say that it does not.
Table: Weights and standard errors for each feature in the office rentals model.

Descriptive Feature   Weight     Standard Error   t-statistic   p-value
SIZE                  0.6270     0.0545           11.504        <0.0001
FLOOR                 -0.1781    2.7042           -0.066        0.949
BROADBAND RATE        0.071396   0.2969           0.240         0.816
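The t-statistic calculation above can be sketched in code. This is a minimal sketch, not the book's implementation: it reproduces the t-statistics from the table, and the critical value 1.96 is a rough large-sample approximation of the two-tailed 0.05 threshold (an exact test would look up the t-distribution with n − 2 degrees of freedom).

```python
def t_statistic(weight, se_feature):
    """t-statistic for the null hypothesis that a weight is zero."""
    return weight / se_feature

# Weights and standard errors from the office rentals model table.
features = {
    "SIZE":           (0.6270,   0.0545),
    "FLOOR":          (-0.1781,  2.7042),
    "BROADBAND RATE": (0.071396, 0.2969),
}

# Large-sample approximation of the two-tailed 0.05 critical value.
CRITICAL = 1.96

for name, (w, se) in features.items():
    t = t_statistic(w, se)
    print(f"{name}: t = {t:.3f}, significant = {abs(t) > CRITICAL}")
```

Running this recovers the t-statistics in the table: only SIZE (t = 11.504) exceeds the critical value, matching the p-values shown.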
Setting the Learning Rate Using Weight Decay
Learning rate decay allows the learning rate to start at a large value and then decay over time according to a predefined schedule.
A good approach is to use the following decay schedule:
α_τ = α_0 × ( c / (c + τ) )    (4)

where α_0 is the initial learning rate, τ is the current training iteration, and c is a constant that controls how quickly the learning rate decays.
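The decay schedule can be sketched as a small function. This is a sketch for illustration; the settings α_0 = 0.18 and c = 10 are taken from the first decay figure below.

```python
def decayed_learning_rate(alpha_0, c, tau):
    """Learning rate decay schedule: alpha_tau = alpha_0 * c / (c + tau)."""
    return alpha_0 * c / (c + tau)

# With alpha_0 = 0.18 and c = 10, the learning rate starts at 0.18
# and has halved by iteration 10.
for tau in (0, 5, 10, 100):
    print(tau, round(decayed_learning_rate(0.18, 10, tau), 4))
```

Larger values of c keep the learning rate high for longer, which is why the second figure below (c = 100) shows a longer, more oscillatory journey across the error surface.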
Figure: (a) The journey across the error surface for the office rentals prediction problem when learning rate decay is used (α_0 = 0.18, c = 10); (b) a plot of the changing sum of squared error values during this journey.
Figure: (a) The journey across the error surface for the office rentals prediction problem when learning rate decay is used (α_0 = 0.25, c = 100); (b) a plot of the changing sum of squared error values during this journey.
Handling Categorical Descriptive Features
The basic structure of the multivariable linear regression model allows for only continuous descriptive features, so we need a way to handle categorical descriptive features.
The most common approach to handling categorical features uses a transformation that converts a single categorical descriptive feature into a number of continuous descriptive feature values that can encode the levels of the categorical feature.
For example, the ENERGY RATING descriptive feature would be converted into three new continuous descriptive features, as it has three distinct levels: 'A', 'B', and 'C'.
Table: The office rentals dataset adjusted to handle the categorical ENERGY RATING descriptive feature in linear regression models.

ID   SIZE   FLOOR   BROADBAND   ENERGY     ENERGY     ENERGY     RENTAL
                    RATE        RATING A   RATING B   RATING C   PRICE
1    500    4       8           0          0          1          320
2    550    7       50          1          0          0          380
3    620    9       7           1          0          0          400
4    630    5       24          0          1          0          390
5    665    8       100         0          0          1          385
6    700    4       8           0          1          0          410
7    770    10      7           0          1          0          480
8    880    12      50          1          0          0          600
9    920    14      8           0          0          1          570
10   1000   9       24          0          1          0          620
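The transformation applied in the table above is commonly called one-hot encoding: each level of the categorical feature becomes its own 0/1 indicator feature. A minimal sketch:

```python
def one_hot(value, levels):
    """Convert one categorical value into a list of 0/1 indicator features."""
    return [1 if value == level else 0 for level in levels]

ENERGY_LEVELS = ["A", "B", "C"]

# Rental 1 has ENERGY RATING 'C', matching the row (0, 0, 1) in the table.
print(one_hot("C", ENERGY_LEVELS))  # [0, 0, 1]
print(one_hot("A", ENERGY_LEVELS))  # [1, 0, 0]
```

Exactly one of the new features is 1 for each instance, so the encoded features remain continuous-valued inputs that the linear regression model can weight individually.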
Returning to our example, the regression equation for this RENTAL PRICE model would change to
RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR
             + w[3] × BROADBAND RATE
             + w[4] × ENERGY RATING A
             + w[5] × ENERGY RATING B
             + w[6] × ENERGY RATING C
where the newly added categorical features allow the original ENERGY RATING feature to be included.
Handling Categorical Target Features: Logistic Regression
Table: A dataset listing features for a number of generators.
ID RPM VIBRATION STATUS
1 568 585 good
2 586 565 good
3 609 536 good
4 616 492 good
5 632 465 good
6 652 528 good
7 655 496 good
8 660 471 good
9 688 408 good
10 696 399 good
11 708 387 good
12 701 434 good
13 715 506 good
14 732 485 good
15 731 395 good
16 749 398 good
17 759 512 good
18 773 431 good
19 782 456 good
20 797 476 good
21 794 421 good
22 824 452 good
23 835 441 good
24 862 372 good
25 879 340 good
26 892 370 good
27 913 373 good
28 933 330 good
ID   RPM   VIBRATION   STATUS
29   562   309         faulty
30   578   346         faulty
31   593   357         faulty
32   626   341         faulty
33   635   252         faulty
34   658   235         faulty
35   663   299         faulty
36   677   223         faulty
37   685   303         faulty
38   698   197         faulty
39   699   311         faulty
40   712   257         faulty
41   722   193         faulty
42   735   259         faulty
43   738   314         faulty
44   753   113         faulty
45   767   286         faulty
46   771   264         faulty
47   780   137         faulty
48   784   131         faulty
49   798   132         faulty
50   820   152         faulty
51   834   157         faulty
52   858   163         faulty
53   888   91          faulty
54   891   156         faulty
55   911   79          faulty
56   939   99          faulty
Figure: A scatter plot of the RPM and VIBRATION descriptive features from the generators dataset shown in Table 4 [18], where 'good' generators are shown as crosses and 'faulty' generators are shown as triangles.

Figure: A scatter plot of the RPM and VIBRATION descriptive features from the generators dataset shown in Table 4 [18]. A decision boundary separating 'good' generators (crosses) from 'faulty' generators (triangles) is also shown.
As the decision boundary is a linear separator, it can be defined using the equation of a line:
VIBRATION = 830 − 0.667 × RPM (5) or
830 − 0.667 × RPM − VIBRATION = 0 (6)
Applying Equation (6)[21] to the instance RPM = 810, VIBRATION = 495, which is above the decision boundary, gives the following result:
830 − 0.667 × 810 − 495 = −205.27
By contrast, if we apply Equation (6)[21] to the instance RPM = 650 and VIBRATION = 240, which is below the decision boundary, we get
830 − 0.667 × 650 − 240 = 156.45
All the data points above the decision boundary will result in a negative value when plugged into the decision boundary equation, while all data points below the decision boundary will result in a positive value.
Reverting to our previous notation we have:

Mw(d) = 1 if w · d ≥ 0, and 0 otherwise    (7)
The surface defined by this rule is known as a decision surface.
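The hard-threshold decision rule can be sketched directly. This is an illustrative sketch, assuming the input vector is extended with a leading 1 so the intercept of Equation (6) becomes the bias weight:

```python
def dot(w, d):
    """Dot product w . d of a weight vector and an input vector."""
    return sum(wi * di for wi, di in zip(w, d))

def hard_threshold_predict(w, d):
    """Mw(d) = 1 if w . d >= 0, else 0."""
    return 1 if dot(w, d) >= 0 else 0

# Weights encoding the boundary 830 - 0.667*RPM - VIBRATION = 0,
# with d = [1, RPM, VIBRATION] (the leading 1 is the bias input).
w = [830, -0.667, -1]
print(hard_threshold_predict(w, [1, 650, 240]))  # below the boundary -> 1
print(hard_threshold_predict(w, [1, 810, 495]))  # above the boundary -> 0
```

The two instances are the ones worked through above: (650, 240) gives 156.45 and so predicts 1, while (810, 495) gives -205.27 and so predicts 0.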
Figure: (a) A surface showing the value of Equation (6)[21] for all values of RPM and VIBRATION. The decision boundary given in Equation (6)[21] is highlighted. (b) The same surface linearly thresholded at zero to operate as a predictor.
The hard decision boundary given in Equation (7)[24] is discontinuous, so it is not differentiable, which means we cannot calculate the gradient of the error surface.
Furthermore, the model always makes completely confident predictions of 0 or 1, whereas a little more subtlety is desirable.
We address these issues by using a more sophisticated threshold function that is continuous, and therefore differentiable, and that allows for the subtlety desired: the logistic function
The logistic function:

Logistic(x) = 1 / (1 + e^(−x))    (8)

where x is a numeric value and e is Euler's number, approximately equal to 2.7183.
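A minimal sketch of the logistic function, showing the behavior visible in the plot below: it passes through 0.5 at x = 0 and saturates smoothly toward 0 and 1.

```python
import math

def logistic(x):
    """Logistic(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0))    # 0.5: the decision threshold sits at x = 0
print(logistic(10))   # close to 1
print(logistic(-10))  # close to 0
```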
Figure: A plot of the logistic function for x in the range [−10, 10].
To build a logistic regression model, we simply pass the output of the basic linear regression model through the logistic function
Mw(d) = Logistic(w · d)
A note on training logistic regression models:
1 Before we train a logistic regression model we map the binary target feature levels to 0 or 1.
2 The error of the model on each instance is then the difference between the target feature (0 or 1) and the value of the prediction, which lies in the range [0, 1].
Mw(⟨RPM, VIBRATION⟩) = 1 / (1 + e^(−(−0.4077 + 4.1697 × RPM + 6.0460 × VIBRATION)))
Figure: The decision surface for the example logistic regression model.
P(t = 'faulty' | d) = Mw(d)
P(t = 'good' | d) = 1 − Mw(d)
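The probability reading of the model output can be sketched as follows. This is a sketch, assuming the slides' convention that 'faulty' is mapped to 1, that the descriptive features have been normalized to [−1, 1] before prediction, and that the input vector carries a leading 1 for the bias; the instance values below are hypothetical normalized inputs.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w, d):
    """Mw(d) = Logistic(w . d), read as P(t = 'faulty' | d)."""
    return logistic(sum(wi * di for wi, di in zip(w, d)))

# Final model weights from the slides (features assumed normalized).
w = [-0.4077, 4.1697, 6.0460]

# A hypothetical normalized instance <1 (bias), RPM, VIBRATION>.
d = [1, 0.2, 0.2]
p_faulty = predict(w, d)
p_good = 1 - p_faulty
print(round(p_faulty, 3), round(p_good, 3))
```

Because the two probabilities must sum to 1, a single model output suffices for a binary target.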
Figure: A selection of the logistic regression models developed during the gradient descent process for the machinery dataset from Table 4 [18]. The bottom-right panel shows the sum of squared error values generated during the gradient descent process.
To repurpose the gradient descent algorithm for training logistic regression models, the only change that needs to be made is in the weight update rule.
See p. 360 in the book for details of how to derive the new weight update rule.
The new weight update rule is:
w[j] ← w[j] + α × ((ti − Mw(di)) × Mw(di) × (1 − Mw(di)) × di[j])
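A single-instance sketch of this update rule (not the book's batch implementation): one update moves the prediction toward the target, since the gradient term (t − Mw(d)) × Mw(d) × (1 − Mw(d)) × d[j] pushes each weight in the right direction.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_weights(w, d, t, alpha):
    """One logistic regression gradient descent update for instance d
    (d[0] = 1 for the bias) with target t in {0, 1}."""
    pred = logistic(sum(wi * di for wi, di in zip(w, d)))
    # w[j] <- w[j] + alpha * (t - pred) * pred * (1 - pred) * d[j]
    return [wj + alpha * (t - pred) * pred * (1 - pred) * dj
            for wj, dj in zip(w, d)]

# A single update nudges the prediction toward the target t = 1.
w = [0.0, 0.0, 0.0]
d = [1, 0.5, -0.3]
before = logistic(sum(wi * di for wi, di in zip(w, d)))
w = update_weights(w, d, t=1, alpha=0.1)
after = logistic(sum(wi * di for wi, di in zip(w, d)))
print(before, after)  # after > before
```

The extra Mw(di) × (1 − Mw(di)) factor, compared with the linear regression rule, is the derivative of the logistic function.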
Table: The extended generators dataset.

ID   RPM    VIBRATION   STATUS
10   653    554         faulty
11   679    516         faulty
12   688    524         faulty
13   684    450         faulty
14   699    512         faulty
15   703    505         faulty
16   717    377         faulty
17   740    377         faulty
18   749    501         faulty
19   756    492         faulty
20   752    381         faulty
21   762    508         faulty
22   781    474         faulty
23   781    480         faulty
24   804    460         faulty
25   828    346         faulty
26   830    366         faulty
27   864    344         faulty
28   882    403         faulty
29   891    338         faulty
30   921    362         faulty
31   941    301         faulty
32   965    336         faulty
33   976    297         faulty
34   994    287         faulty
35   501    463         good
36   526    443         good
37   536    412         good
38   564    394         good
39   584    398         good
40   602    398         good
41   610    428         good
42   638    389         good
43   652    394         good
44   659    336         good
45   662    364         good
46   672    308         good
47   691    248         good
48   694    401         good
49   718    313         good
50   720    410         good
51   723    389         good
52   744    227         good
53   741    397         good
54   770    200         good
55   764    370         good
56   790    248         good
57   786    344         good
58   792    290         good
59   818    268         good
60   845    232         good
61   867    195         good
62   878    168         good
63   895    218         good
64   916    221         good
65   950    156         good
66   956    174         good
67   973    134         good
68   1002   121         good
Figure: A scatter plot of the extended generators dataset given in Table 35 [35], which results in instances with the different target levels overlapping with each other. 'good' generators are shown as crosses, and 'faulty' generators are shown as triangles.
For logistic regression models we recommend that descriptive feature values always be normalized.
In this example, before the training process begins, both descriptive features are normalized to the range [−1, 1].
For this example, let's assume that α = 0.02.

Initial Weights: w[0] = -2.9465, w[1] = -1.0147, w[2] = -2.1610
Iteration 1 (first four training instances shown):

ID   TARGET   Pred.    Error    Squared   errorDelta   errorDelta   errorDelta
     LEVEL                      Error     (D, w[0])    (D, w[1])    (D, w[2])
1    1        0.5570   0.4430   0.1963    0.1093       -0.1093      0.1093
2    1        0.5168   0.4832   0.2335    0.1207       -0.1116      0.1159
3    1        0.4469   0.5531   0.3059    0.1367       -0.1134      0.1197
4    1        0.4629   0.5371   0.2885    0.1335       -0.1033      0.1244
...
Sum                             24.4738   2.7031       -0.7015      1.6493
Sum of squared errors (Sum/2): 12.2369
w[j] ← w[j] + α × ((ti − Mw(di)) × Mw(di) × (1 − Mw(di)) × di[j])

New weights (after Iteration 1): w[0] = -2.8924, w[1] = -1.0287, w[2] = -2.1940
Iteration 2 (first four training instances shown):

ID   TARGET   Pred.    Error    Squared   errorDelta   errorDelta   errorDelta
     LEVEL                      Error     (D, w[0])    (D, w[1])    (D, w[2])
1    1        0.5817   0.4183   0.1749    0.1018       -0.1018      0.1018
2    1        0.5414   0.4586   0.2103    0.1139       -0.1053      0.1094
3    1        0.4704   0.5296   0.2805    0.1319       -0.1094      0.1155
4    1        0.4867   0.5133   0.2635    0.1282       -0.0992      0.1194
...
Sum                             24.0524   2.7236       -0.6646      1.6484
Sum of squared errors (Sum/2): 12.0262
w[j] ← w[j] + α × ((ti − Mw(di)) × Mw(di) × (1 − Mw(di)) × di[j])

New weights (after Iteration 2): w[0] = -2.8380, w[1] = -1.0416, w[2] = -2.2271
Figure: A selection of the logistic regression models developed during the gradient descent process for the extended generators dataset in Table 35 [35]. The bottom-right panel shows the sum of squared error values generated during the gradient descent process.
The final model found is:
Mw(⟨RPM, VIBRATION⟩) = 1 / (1 + e^(−(−0.4077 + 4.1697 × RPM + 6.0460 × VIBRATION)))
Modeling Non-linear Relationships
Table: A dataset describing grass growth on Irish farms during July 2012.
ID   RAIN    GROWTH
1    -       14.016
2    -       10.834
3    -       13.026
4    -       11.019
5    -       4.162
6    -       14.167
7    -       10.190
8    -       13.525
9    -       13.899
10   -       13.949
11   -       8.643
12   3.754   11.420
13   2.809   13.847
14   1.809   13.757
15   4.114   9.101
16   2.834   13.923
17   3.872   10.795
18   2.174   14.307
19   4.353   8.059
20   3.684   12.041
21   2.140   14.641
22   2.783   14.138
23   3.960   10.307
24   3.592   12.069
25   3.451   12.335
26   1.197   10.806
27   0.723   7.822
28   1.958   14.010
29   2.366   14.088
30   1.530   12.701
31   0.847   9.012
32   3.843   10.885
33   0.976   9.876
Figure: A scatter plot of the RAIN and GROWTH features from the grass growth dataset.
The best linear model we can learn for this data is: GROWTH = 13.510 − 0.667 × RAIN
Figure: A simple linear regression model trained to capture the relationship between the grass growth and rainfall.
In order to handle non-linear relationships, we transform the data rather than the model, using a set of basis functions:

Mw(d) = Σ_{k=0}^{b} w[k] × φ_k(d)    (10)

where φ_0 to φ_b are a series of basis functions that each transform the input vector d in a different way.
The advantage of this is that, except for introducing the mechanism of basis functions, we do not need to make any other changes to the approach we have presented so far.
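A minimal sketch of polynomial basis functions, using a hypothetical set of weights for illustration (not the trained grass growth model):

```python
def polynomial_bases(degree):
    """Basis functions phi_0..phi_degree for one feature: phi_k(x) = x**k."""
    return [lambda x, k=k: x ** k for k in range(degree + 1)]

def predict(w, d, bases):
    """Mw(d) = sum_k w[k] * phi_k(d) for a single descriptive feature d."""
    return sum(wk * phi(d) for wk, phi in zip(w, bases))

# A hypothetical second-order model: 1 + 2*RAIN - 0.5*RAIN^2.
bases = polynomial_bases(2)
print(predict([1.0, 2.0, -0.5], 3.0, bases))  # 1 + 6 - 4.5 = 2.5
```

Because the basis functions only transform the inputs, the model remains linear in the weights, so the same gradient descent training procedure applies unchanged.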
The relationship between rainfall and grass growth in the grass growth dataset can be accurately represented as a second-order polynomial through the following model:

GROWTH = w[0] × φ_0(RAIN) + w[1] × φ_1(RAIN) + w[2] × φ_2(RAIN)

where φ_0(RAIN) = 1, φ_1(RAIN) = RAIN, and φ_2(RAIN) = RAIN².