
Machine Learning and Data Mining in Business
Week 7 Conceptual Exercises
Question 1
In this exercise, we’ll apply the theory of multivariate optimisation to the linear regression method. We can derive the same results using matrix calculus.


In the least squares method for regression, we minimise the cost function
n p 2 
J(β)=􏰐 yi −β0 −􏰐βjxij . i=1 j=1
Our objective in this exercise is to show that the solution is β̂ = (X⊤X)⁻¹X⊤y, where

X = [ 1  x_{11}  x_{12}  …  x_{1p}
      1  x_{21}  x_{22}  …  x_{2p}
      ⋮    ⋮      ⋮      ⋱    ⋮
      1  x_{n1}  x_{n2}  …  x_{np} ],
under the assumption that the columns of X are linearly independent.
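As a quick illustration, the following NumPy sketch on simulated data (the sample size, number of predictors, and seed are arbitrary choices) checks that the cost function written with sums gives the same number as the matrix form (y − Xβ)⊤(y − Xβ).

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column of ones
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# Sum form: J(beta) = sum_i ( y_i - beta_0 - sum_j beta_j x_ij )^2
J_sum = sum(
    (y[i] - beta[0] - sum(beta[j] * X[i, j] for j in range(1, p + 1))) ** 2
    for i in range(n)
)

# Matrix form: J(beta) = (y - X beta)^T (y - X beta)
residual = y - X @ beta
J_matrix = residual @ residual

print(np.isclose(J_sum, J_matrix))  # True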
(a) Find the partial derivatives of J(β).
Solution: The partial derivatives are

∂J/∂β_0 = −2 Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} ),

∂J/∂β_1 = −2 Σ_{i=1}^{n} x_{i1} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} ),

⋮

∂J/∂β_p = −2 Σ_{i=1}^{n} x_{ip} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} ).
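These formulas can be sanity-checked numerically on simulated data (arbitrary dimensions and seed): the analytical partial derivatives should agree with central finite differences of J.

import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # column 0 holds the intercept
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

def J(b):
    r = y - X @ b
    return r @ r

# Partial derivatives from part (a); X[:, 0] is a column of ones, so j = 0 gives dJ/dbeta_0
grad_analytic = np.array([
    -2 * np.sum(X[:, j] * (y - X @ beta)) for j in range(p + 1)
])

# Central finite differences of J
eps = 1e-6
grad_fd = np.array([
    (J(beta + eps * e) - J(beta - eps * e)) / (2 * eps) for e in np.eye(p + 1)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True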

(b) Derive the first-order condition.
Solution: We set the partial derivatives to zero to obtain the first-order necessary condition as the system of linear equations,
Σ_{i=1}^{n} ( y_i − β̂_0 − Σ_{j=1}^{p} β̂_j x_{ij} ) = 0,

Σ_{i=1}^{n} x_{i1} ( y_i − β̂_0 − Σ_{j=1}^{p} β̂_j x_{ij} ) = 0,

⋮

Σ_{i=1}^{n} x_{ip} ( y_i − β̂_0 − Σ_{j=1}^{p} β̂_j x_{ij} ) = 0,

where β̂_0, β̂_1, …, β̂_p is the least squares solution.
(c) Let ŷ be an n-dimensional vector of fitted values with entries

ŷ_i = β̂_0 + Σ_{j=1}^{p} β̂_j x_{ij}.

Show that ŷ = Xβ̂.
Solution: From the definition of the matrix-vector product, Xβ̂ is an n-dimensional vector whose element i equals the inner product of the i-th row of X with β̂. Writing

x_{i,:} = (1, x_{i1}, …, x_{ip})

for the i-th row of X, we have that

ŷ_i = x_{i,:}⊤ β̂ = β̂_0 + Σ_{j=1}^{p} β̂_j x_{ij},

as we wanted to show.
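On simulated inputs (arbitrary dimensions and seed), the entry-wise definition of the fitted values and the matrix-vector product give the same vector:

import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_hat = rng.normal(size=p + 1)  # stands in for the least squares estimate

# Entry-wise: y_hat_i = beta_0 + sum_j beta_j x_ij
y_hat_loop = np.array([
    beta_hat[0] + sum(beta_hat[j] * X[i, j] for j in range(1, p + 1))
    for i in range(n)
])

# Matrix form: y_hat = X beta_hat
y_hat_matrix = X @ beta_hat

print(np.allclose(y_hat_loop, y_hat_matrix))  # True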
(d) Each equation in part (b) involves an inner product. Using the previous item as a hint, write down what the inner product is.
Solution: Define x_{:,j} = (x_{1j}, x_{2j}, …, x_{nj}) as the vector of values of predictor j over the entire sample. This is the (j+1)-th column of X.

From the definition of the inner product,

Σ_{i=1}^{n} x_{ik} ( y_i − β̂_0 − Σ_{j=1}^{p} β̂_j x_{ij} ) = x_{:,k}⊤(y − Xβ̂),

with the convention that x_{i0} = 1 for all i, so the first equation in part (b) corresponds to k = 0.
(e) Write the first-order condition in matrix notation.
Solution: Based on the previous part, the system becomes:
x_{:,0}⊤(y − Xβ̂) = 0,

x_{:,1}⊤(y − Xβ̂) = 0,

⋮

x_{:,p}⊤(y − Xβ̂) = 0.
Using the definition of matrix-vector multiplication, this system is equivalent to

X⊤(y − Xβ̂) = 0,

since X = [x_{:,0}  x_{:,1}  …  x_{:,p}].
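As an illustration on simulated data (arbitrary dimensions and seed), the matrix first-order condition can be checked numerically: at a least squares solution, here obtained from np.linalg.lstsq, the vector X⊤(y − Xβ̂) is zero up to floating-point error.

import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Least squares fit from NumPy (QR/SVD based)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# First-order condition: X^T (y - X beta_hat) = 0
print(np.allclose(X.T @ (y - X @ beta_hat), 0.0, atol=1e-8))  # True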
(f) Solve the system to obtain the formula for the OLS estimator.
Solution: From the previous step, the OLS estimator solves the linear system of equations
X⊤X β̂ = X⊤y.

Note that X⊤X is a (p+1) × (p+1) square matrix. If this matrix is invertible (which holds if there’s no perfect multicollinearity), we can multiply both sides by (X⊤X)⁻¹ to obtain the formula

β̂ = (X⊤X)⁻¹X⊤y.
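Below is a minimal sketch on simulated data (dimensions, noise level, and seed are arbitrary) that computes the estimator from the normal equations and compares it with NumPy’s least squares solver. In practice it is numerically preferable to solve X⊤Xβ̂ = X⊤y rather than to form the inverse explicitly.

import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + 0.1 * rng.normal(size=n)  # simulated response

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)  # solves X^T X beta = X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # NumPy's built-in solver

print(np.allclose(beta_normal_eq, beta_lstsq))  # True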
(g) Derive the Hessian of J(β).
Solution: Earlier, we derived
∂J/∂β_j = −2 Σ_{i=1}^{n} x_{ij} ( y_i − β_0 − Σ_{l=1}^{p} β_l x_{il} ).

Based on this equation, the second-order partial derivatives are

∂²J/(∂β_j ∂β_k) = 2 Σ_{i=1}^{n} x_{ij} x_{ik} = 2 x_{:,j}⊤ x_{:,k}.
The Hessian is the matrix of second-order partial derivatives. From the definition of the Hessian and the definition of matrix multiplication,

∇²J(β) = 2(X⊤X).
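The constant Hessian can also be checked numerically on simulated data (arbitrary dimensions and seed) by differencing the gradient in matrix form, −2X⊤(y − Xβ), which follows from stacking the partial derivatives above.

import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

def grad(b):
    return -2 * X.T @ (y - X @ b)  # gradient of J in matrix form

# Column j of the Hessian approximated by central differences of the gradient
eps = 1e-6
H_fd = np.column_stack([
    (grad(beta + eps * e) - grad(beta - eps * e)) / (2 * eps) for e in np.eye(p + 1)
])

print(np.allclose(H_fd, 2 * X.T @ X, atol=1e-5))  # True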
(h) We will now show that the cost function is convex. We do this here for completeness; you are not required to know how to prove this step.
A twice-differentiable function is convex if and only if the Hessian matrix is positive semidefinite. If the Hessian is positive definite, then the function is strictly convex.
We say that a p × p matrix A is positive definite when v⊤Av > 0 for every p-dimensional vector v ≠ 0. If v⊤Av ≥ 0 for every v ≠ 0, then we say that the matrix is positive semidefinite.
Show that ∇2J(β) is positive definite.
Solution: It’s sufficient to show that X⊤X is positive definite. Define u = Xv. We have that
v⊤(X⊤X)v = u⊤u = Σ_j u_j² ≥ 0.
Therefore, X⊤X is positive semidefinite. Furthermore, because the columns of X are linearly independent, u = Xv is the zero vector if and only if v is the zero vector; this follows directly from the definition of linear independence. Hence, for every v ≠ 0 we have u ≠ 0 and v⊤(X⊤X)v = u⊤u > 0, so X⊤X is positive definite.
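For a simulated design matrix with linearly independent columns (which holds with probability one for the Gaussian draws below), the quadratic form equals ‖Xv‖² and the eigenvalues of X⊤X are strictly positive, consistent with positive definiteness.

import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Quadratic form: v^T (X^T X) v = ||X v||^2 > 0 for v != 0
v = rng.normal(size=p + 1)
quad_form = v @ (X.T @ X) @ v
print(np.isclose(quad_form, np.sum((X @ v) ** 2)), quad_form > 0)  # True True

# All eigenvalues of the symmetric matrix X^T X are strictly positive
eigvals = np.linalg.eigvalsh(X.T @ X)
print(np.all(eigvals > 0))  # True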
(i) Write down the gradient of the cost function in matrix notation.

Solution: Stacking the partial derivatives from part (a) and using the inner products from parts (d) and (e),

∇J(β) = −2X⊤(y − Xβ).

(j) Suppose that we want to use gradient descent to optimise the cost function. How would we update the parameters at each iteration?

Solution: At each iteration, we update the parameters according to

β^{(k+1)} = β^{(k)} − α ∇J(β^{(k)}),

for a learning rate α. Using the gradient from the previous part, the updating scheme is

β^{(k+1)} = β^{(k)} + 2αX⊤(y − Xβ^{(k)}).
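A small gradient descent sketch on simulated data follows; the learning rate, iteration count, data dimensions, and seed are arbitrary choices, with the learning rate assumed small enough for convergence. The iterates should approach the closed-form OLS solution.

import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # target solution

alpha = 1e-3                 # learning rate (assumed sufficiently small)
beta = np.zeros(p + 1)       # arbitrary starting value
for _ in range(5000):
    beta = beta + 2 * alpha * X.T @ (y - X @ beta)  # beta - alpha * grad J(beta)

print(np.allclose(beta, beta_ols, atol=1e-6))  # True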
(k) Show that Newton’s method would converge to the solution in one step for this problem, regardless of the starting value. Why is this the case?
Solution: Newton’s method is based on the iterative scheme

β^{(k+1)} = β^{(k)} − [∇²J(β^{(k)})]⁻¹ ∇J(β^{(k)}),

where k denotes the iteration.
Using the gradient and Hessian from the previous parts,
β^{(k+1)} = β^{(k)} − (1/2)(X⊤X)⁻¹ (−2X⊤(y − Xβ^{(k)}))

          = β^{(k)} + (X⊤X)⁻¹X⊤y − (X⊤X)⁻¹X⊤Xβ^{(k)}

          = (X⊤X)⁻¹X⊤y.
The reason for this is that Newton’s method is based on a quadratic approximation of the objective function; in this problem the objective function is already quadratic, so the approximation is exact and a single Newton step jumps straight to the minimiser.
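This one-step convergence is easy to see numerically on simulated data (the dimensions, seed, and starting value below are arbitrary): a single Newton step from any starting point returns the OLS solution.

import numpy as np

rng = np.random.default_rng(9)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

beta0 = 10.0 * rng.normal(size=p + 1)           # arbitrary starting value
grad = -2 * X.T @ (y - X @ beta0)               # gradient at the starting value
hessian = 2 * X.T @ X                           # constant Hessian
beta1 = beta0 - np.linalg.solve(hessian, grad)  # one Newton step

print(np.allclose(beta1, beta_ols))  # True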
