Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 7: Error-based Learning (Sections 7.1, 7.2, 7.3)
John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy
Outline

Big Idea

Fundamentals
    Simple Linear Regression
    Measuring Error
    Error Surfaces

Standard Approach: Multivariate Linear Regression with Gradient Descent
    Multivariate Linear Regression
    Gradient Descent
    Choosing Learning Rates & Initial Weights
    A Worked Example

Summary
Big Idea
A parameterised prediction model is initialised with a set of random parameters, and an error function is used to judge how well this initial model performs when making predictions for instances in a training dataset.
Based on the value of the error function, the parameters are iteratively adjusted to create a more and more accurate model.
Fundamentals
Simple Linear Regression
Table: The office rentals dataset: a dataset that includes office rental prices and a number of descriptive features for 10 Dublin city-centre offices.
ID  SIZE   FLOOR  BROADBAND RATE  ENERGY RATING  RENTAL PRICE
 1    500      4               8              C           320
 2    550      7              50              A           380
 3    620      9               7              A           400
 4    630      5              24              B           390
 5    665      8             100              C           385
 6    700      4               8              B           410
 7    770     10               7              B           480
 8    880     12              50              A           600
 9    920     14               8              C           570
10  1,000      9              24              B           620
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset.
From the scatter plot it appears that there is a linear relationship between SIZE and RENTAL PRICE.
The equation of a line can be written as:
y = mx + b (1)
The scatter plot below shows the same data as the previous figure, with a simple linear model added to capture the relationship between office sizes and office rental prices.
This model is:
RENTAL PRICE = 6.47 + 0.62 × SIZE
RENTAL PRICE = 6.47 + 0.62 × SIZE
Using this model, determine the expected rental price of a 730 square foot office:
RENTAL PRICE = 6.47 + 0.62 × 730 = 459.07
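As a quick sanity check, the same calculation in a few lines of Python (a minimal sketch; the weights are those of the model above, and the helper name is my own):

w0, w1 = 6.47, 0.62  # weights of the simple linear model above

def predict_rental_price(size):
    # RENTAL PRICE = w[0] + w[1] * SIZE
    return w0 + w1 * size

print(predict_rental_price(730))  # approximately 459.07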
More generally, we can write this simple linear regression model in terms of a weight vector w, where w[0] is the intercept and w[1] is the slope:

Mw(d) = w[0] + w[1] × d[1]   (2)
Measuring Error
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset. A collection of possible simple linear regression models capturing the relationship between these two features are also shown. For all models w[0] is set to 6.47. From top to bottom the models use 0.4, 0.5, 0.62, 0.7 and 0.8 respectively for w[1].
Figure: A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset showing a candidate prediction model (with w[0] = 6.47 and w[1] = 0.62) and the resulting errors.
The fit of a candidate model is measured using the sum of squared errors loss function, L2:

L2(Mw, D) = 1/2 Σ (i=1 to n) (ti − Mw(di[1]))²   (3)

          = 1/2 Σ (i=1 to n) (ti − (w[0] + w[1] × di[1]))²   (4)
Table: Calculating the sum of squared errors for the candidate model (with w[0] = 6.47 and w[1] = 0.62) making predictions for the office rentals dataset.
ID  RENTAL PRICE  Model Prediction   Error  Squared Error
 1           320            316.79    3.21          10.32
 2           380            347.82   32.18       1,035.62
 3           400            391.26    8.74          76.32
 4           390            397.47   -7.47          55.80
 5           385            419.19  -34.19       1,169.13
 6           410            440.91  -30.91         955.73
 7           480            484.36   -4.36          19.01
 8           600            552.63   47.37       2,243.90
 9           570            577.46   -7.46          55.59
10           620            627.11   -7.11          50.51
                                      Sum        5,671.64
                 Sum of squared errors (Sum/2)   2,835.82
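These error calculations are easy to reproduce in code. Below is a minimal sketch (data transcribed from the tables above). Note that the table's predictions appear to have been computed with the unrounded slope, approximately 0.6206, which the slides round to 0.62; the code uses the higher-precision value so the result matches the table:

sizes  = [500, 550, 620, 630, 665, 700, 770, 880, 920, 1000]
prices = [320, 380, 400, 390, 385, 410, 480, 600, 570, 620]

w0, w1 = 6.47, 0.6206   # candidate model; w1 is rounded to 0.62 in the slides

# Squared error for each instance: (t_i - (w0 + w1 * d_i[1]))^2
squared_errors = [(t - (w0 + w1 * s)) ** 2 for s, t in zip(sizes, prices)]

# Equation (4): L2 = 1/2 * sum of squared errors
print(sum(squared_errors) / 2)   # approximately 2,836.0 (the table reports 2,835.82)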
Error Surfaces
For every possible combination of weights, w[0] and w[1], there is a corresponding sum of squared errors value. Joined together, these values make a surface.
Figure: (a) A 3D surface plot and (b) a contour plot of the error surface generated by plotting the sum of squared errors value for the office rentals training set for each possible combination of values for w[0] (from the range [−10, 20]) and w[1] (from the range [−2, 3]).
The x-y plane is known as a weight space and the surface is known as an error surface.
The model that best fits the training data is the model corresponding to the lowest point on the error surface.
Using Equation (4) we can formally define this point on the error surface as the point at which:

∂/∂w[0] 1/2 Σ (i=1 to n) (ti − (w[0] + w[1] × di[1]))² = 0   (5)

and

∂/∂w[1] 1/2 Σ (i=1 to n) (ti − (w[0] + w[1] × di[1]))² = 0   (6)

There are a number of different ways to find this point.
We will describe a guided search approach known as the gradient descent algorithm.
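For this simple case, one alternative is to solve the least squares problem directly, which gives a useful cross-check on the model quoted earlier. A minimal numpy sketch (assuming numpy is available; the data is from the office rentals table):

import numpy as np

sizes  = np.array([500, 550, 620, 630, 665, 700, 770, 880, 920, 1000], dtype=float)
prices = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# Design matrix with a leading column of 1s for the intercept weight w[0]
X = np.column_stack([np.ones_like(sizes), sizes])

# Solve for the weights that minimise the sum of squared errors
w, residuals, rank, sv = np.linalg.lstsq(X, prices, rcond=None)
print(w)   # approximately [6.47, 0.62] -- the model quoted earlier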
Standard Approach: Multivariate Linear Regression with Gradient Descent
Multivariate Linear Regression
Recall the office rentals dataset shown earlier, which includes the descriptive features SIZE, FLOOR, BROADBAND RATE, and ENERGY RATING for 10 Dublin city-centre offices.
We can define a multivariate linear regression model as:
Mw(d) = w[0] + w[1] × d[1] + ··· + w[m] × d[m]   (7)

      = w[0] + Σ (j=1 to m) w[j] × d[j]   (8)
We can make Equation (8) look a little neater by inventing a dummy descriptive feature, d[0], that is always equal to 1:
Mw(d) = w[0] × d[0] + w[1] × d[1] + ··· + w[m] × d[m]   (9)

      = Σ (j=0 to m) w[j] × d[j]   (10)

      = w · d   (11)
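In code, the dummy feature makes a prediction a single dot product. A small sketch (the helper name is my own; the weights are the best-fit values quoted later in this section):

def predict(w, d):
    # Mw(d) = w . d, where d[0] is the dummy feature and is always 1
    return sum(w_j * d_j for w_j, d_j in zip(w, d))

w = [-0.1513, 0.6270, -0.1781, 0.0714]   # [intercept, SIZE, FLOOR, BROADBAND RATE]
d = [1, 500, 4, 8]                       # instance 1 with the dummy feature prepended
print(predict(w, d))                     # predicted RENTAL PRICE for instance 1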
The definition of the sum of squared errors loss function, L2, given in Equation (4) changes only very slightly to reflect the new regression equation:
L2(Mw, D) = 1/2 Σ (i=1 to n) (ti − Mw(di))²   (12)

          = 1/2 Σ (i=1 to n) (ti − (w · di))²   (13)
This multivariate model allows us to include all but one of the descriptive features in the office rentals dataset (the categorical ENERGY RATING feature is excluded) in a regression model to predict office rental prices.
The resulting multivariate regression model equation is:
RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR + w[3] × BROADBAND RATE
We will see in the next section how the best-fit set of weights for this equation is found, but for now we will set:
w[0] = −0.1513, w[1] = 0.6270, w[2] = −0.1781, w[3] = 0.0714.
This means that the model is rewritten as:
RENTAL PRICE = −0.1513 + 0.6270 × SIZE − 0.1781 × FLOOR + 0.0714 × BROADBAND RATE
Using this model we can, for example, predict the expected rental price of a 690 square foot office on the 11th floor of a building with a broadband rate of 50 Mb per second as:
RENTAL PRICE = −0.1513 + 0.6270 × 690 − 0.1781 × 11 + 0.0714 × 50
             = 434.0896
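The same prediction as a dot product over the weight and feature vectors (a sketch using the values above):

w = [-0.1513, 0.6270, -0.1781, 0.0714]   # [intercept, SIZE, FLOOR, BROADBAND RATE]
d = [1, 690, 11, 50]                     # dummy feature, SIZE, FLOOR, BROADBAND RATE
print(sum(w_j * d_j for w_j, d_j in zip(w, d)))   # approximately 434.0896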
Gradient Descent
Figure: (a) A 3D surface plot and (b) a contour plot of the same error surface. The lines indicate the path that the gradient descent algorithm would take across this error surface from different starting positions to the global minimum – marked as the white dot in the centre.
The figures below show the journey across the error surface that is taken by the gradient descent algorithm when training the simple version of the office rentals example, involving just SIZE and RENTAL PRICE.
Figure: (a) A 3D surface plot and (b) a contour plot of the error surface for the office rentals dataset showing the path that the gradient descent algorithm takes towards the best fit model.
Figure: A selection of the simple linear regression models developed during the gradient descent process for the office rentals dataset. The final panel shows the sum of squared error values generated during the gradient descent process.
Require: set of training instances D
Require: a learning rate α that controls how quickly the algorithm converges
Require: a function, errorDelta, that determines the direction in which to adjust a given weight, w[j], so as to move down the slope of an error surface determined by the dataset, D
Require: a convergence criterion that indicates that the algorithm has completed

1: w ← random starting point in the weight space
2: repeat
3:     for each w[j] in w do
4:         w[j] ← w[j] + α × errorDelta(D, w[j])
5:     end for
6: until convergence occurs

The gradient descent algorithm for training multivariate linear regression models.
The most important part of the gradient descent algorithm is Line 4, on which the weights are updated.
w[j] ← w[j] + α × errorDelta(D, w[j])
Each weight is considered independently and for each one a small adjustment is made by adding a small delta value to the current weight, w[j].
This adjustment should ensure that the change in the weight leads to a move downwards on the error surface.
Imagine for a moment that our training dataset, D, contains just one training instance: (d, t).
The gradient of the error surface is given as the partial derivative of L2 with respect to each weight, w[j]:
∂/∂w[j] L2(Mw, D) = ∂/∂w[j] 1/2 (t − Mw(d))²   (14)

                  = (t − Mw(d)) × ∂/∂w[j] (t − Mw(d))   (15)

                  = (t − Mw(d)) × ∂/∂w[j] (t − (w · d))   (16)

                  = (t − Mw(d)) × (−d[j])   (17)
Adjusting the calculation to take into account multiple training instances:

∂/∂w[j] L2(Mw, D) = Σ (i=1 to n) ((ti − Mw(di)) × (−di[j]))

We use this equation to define the errorDelta in our gradient descent algorithm. Moving down the slope of the error surface means moving against this gradient, so the weight update rule becomes:

w[j] ← w[j] + α × Σ (i=1 to n) ((ti − Mw(di)) × di[j])

where the summation term is errorDelta(D, w[j]).
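Putting the pseudocode and the errorDelta definition together gives a compact trainer. This is a minimal sketch, not the book's reference implementation; the function names are mine, and the data and hyperparameters are those of the worked example later in this section:

# Batch gradient descent for multivariate linear regression.
# Each instance is ([d[0]=1, SIZE, FLOOR, BROADBAND RATE], RENTAL PRICE).
data = [
    ([1, 500, 4, 8], 320),   ([1, 550, 7, 50], 380),
    ([1, 620, 9, 7], 400),   ([1, 630, 5, 24], 390),
    ([1, 665, 8, 100], 385), ([1, 700, 4, 8], 410),
    ([1, 770, 10, 7], 480),  ([1, 880, 12, 50], 600),
    ([1, 920, 14, 8], 570),  ([1, 1000, 9, 24], 620),
]

def predict(w, d):
    return sum(w_j * d_j for w_j, d_j in zip(w, d))

def error_delta(data, w, j):
    # sum over i of (t_i - Mw(d_i)) * d_i[j]
    return sum((t - predict(w, d)) * d[j] for d, t in data)

def gradient_descent(data, w, alpha, iterations):
    for _ in range(iterations):
        # compute all deltas from the current weights, then update together
        deltas = [error_delta(data, w, j) for j in range(len(w))]
        w = [w_j + alpha * delta_j for w_j, delta_j in zip(w, deltas)]
    return w

w = gradient_descent(data, w=[-0.146, 0.185, -0.044, 0.119],
                     alpha=0.00000002, iterations=100)
print(w)   # should end up close to [-0.1513, 0.6270, -0.1781, 0.0714]

Note that the deltas for all weights are computed from the current model before any weight is changed, matching how the errorDelta columns in the worked-example tables below are all derived from the same set of predictions.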
Choosing Learning Rates & Initial Weights
The learning rate, α, determines the size of the adjustment made to each weight at each step in the process.
Unfortunately, choosing learning rates is not a well-defined science, and most practitioners use rules of thumb and trial and error.
Figure: Plots of the journeys made across the error surface for the simple office rentals prediction problem for different learning rates: (a) a very small learning rate (0.002), (b) a medium learning rate (0.08), and (c) a very large learning rate (0.18).
A typical range for learning rates is [0.00001, 10].
Based on empirical evidence, choosing random initial weights uniformly from the range [−0.2, 0.2] tends to work well.
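As a sketch of these rules of thumb (the helper name is illustrative):

import random

def random_initial_weights(m):
    # one weight per descriptive feature plus the intercept, drawn from [-0.2, 0.2]
    return [random.uniform(-0.2, 0.2) for _ in range(m + 1)]

print(random_initial_weights(3))   # e.g. [0.113, -0.054, 0.181, -0.002]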
A Worked Example
We are now in a position to build a linear regression model that uses all of the continuous descriptive features in the office rentals dataset.
The general structure of the model is:
RENTAL PRICE = w[0] + w[1] × SIZE + w[2] × FLOOR + w[3] × BROADBAND RATE
For this example we again use the office rentals dataset shown earlier.
For this example let's assume that α = 0.00000002 and the following initial weights:

w[0]: -0.146, w[1]: 0.185, w[2]: -0.044, w[3]: 0.119
Iteration 1

                                                       errorDelta(D, w[j])
ID  RENTAL PRICE  Prediction   Error  Squared Error     w[0]        w[1]      w[2]      w[3]
 1           320       93.26  226.74      51411.08    226.74   113370.05    906.96   1813.92
 2           380      107.41  272.59      74307.70    272.59   149926.92   1908.16  13629.72
 3           400      115.15  284.85      81138.96    284.85   176606.39   2563.64   1993.94
 4           390      119.21  270.79      73327.67    270.79   170598.22   1353.95   6498.98
 5           385      134.64  250.36      62682.22    250.36   166492.17   2002.91  25036.42
 6           410      130.31  279.69      78226.32    279.69   195782.78   1118.76   2237.52
 7           480      142.89  337.11     113639.88    337.11   259570.96   3371.05   2359.74
 8           600      168.32  431.68     186348.45    431.68   379879.24   5180.17  21584.05
 9           570      170.63  399.37     159499.37    399.37   367423.83   5591.23   3194.99
10           620      187.58  432.42     186989.95    432.42   432423.35   3891.81  10378.16
                                Sum      1067571.59   3185.61  2412073.90  27888.65  88727.43
                Sum of squared errors (Sum/2): 533785.80
w[j] ← w[j] + α × errorDelta(D, w[j])

Initial Weights: w[0]: -0.146, w[1]: 0.185, w[2]: -0.044, w[3]: 0.119

Example: w[1] ← 0.185 + 0.00000002 × 2,412,073.90 = 0.23324148

New Weights (Iteration 1): w[0]: -0.146, w[1]: 0.233, w[2]: -0.043, w[3]: 0.121
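The update of all four weights can be checked with a short sketch (the errorDelta values are the column sums from the Iteration 1 table):

alpha  = 0.00000002
w      = [-0.146, 0.185, -0.044, 0.119]              # initial weights
deltas = [3185.61, 2412073.90, 27888.65, 88727.43]   # errorDelta(D, w[j])

w = [w_j + alpha * delta_j for w_j, delta_j in zip(w, deltas)]
print([round(w_j, 3) for w_j in w])   # [-0.146, 0.233, -0.043, 0.121]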
Iteration 2

                                                       errorDelta(D, w[j])
ID  RENTAL PRICE  Prediction   Error  Squared Error     w[0]        w[1]      w[2]      w[3]
 1           320      117.40  202.60      41047.92    202.60   101301.44    810.41   1620.82
 2           380      134.03  245.97      60500.69    245.97   135282.89   1721.78  12298.44
 3           400      145.08  254.92      64985.12    254.92   158051.51   2294.30   1784.45
 4           390      149.65  240.35      57769.68    240.35   151422.55   1201.77   5768.48
 5           385      166.90  218.10      47568.31    218.10   145037.57   1744.81  21810.16
 6           410      164.10  245.90      60468.86    245.90   172132.91    983.62   1967.23
 7           480      180.06  299.94      89964.69    299.94   230954.68   2999.41   2099.59
 8           600      210.87  389.13     151424.47    389.13   342437.01   4669.60  19456.65
 9           570      215.03  354.97     126003.34    354.97   326571.94   4969.57   2839.76
10           620      187.58  432.42     186989.95    432.42   432423.35   3891.81  10378.16
                                Sum       886723.04   2884.32  2195615.84  25287.08  80023.74
                Sum of squared errors (Sum/2): 443361.52
w[j] ← w[j] + α × errorDelta(D, w[j])

Initial Weights (Iteration 2): w[0]: -0.146, w[1]: 0.233, w[2]: -0.043, w[3]: 0.121

Exercise: w[1] ← ?, with α = 0.00000002

New Weights (Iteration 2): w[0]: ?, w[1]: ?, w[2]: ?, w[3]: ?
w[j] ← w[j] + α × errorDelta(D, w[j])

Initial Weights (Iteration 2): w[0]: -0.146, w[1]: 0.233, w[2]: -0.043, w[3]: 0.121

Solution: w[1] ← 0.233 + 0.00000002 × 2,195,615.84 = 0.27691232

New Weights (Iteration 2): w[0]: -0.145, w[1]: 0.277, w[2]: -0.043, w[3]: 0.123
The algorithm then keeps iteratively applying the weight update rule until it converges on a stable set of weights beyond which little improvement in model accuracy is possible.
After 100 iterations the final values for the weights are:

w[0] = −0.1513, w[1] = 0.6270, w[2] = −0.1781, w[3] = 0.0714

which results in a sum of squared errors value of 2,913.5.