CSC 445/545 – Summer 2021
Curve Fitting and Regression
Bill Bird
Department of Computer Science, University of Victoria
July 3, 2021
Aperitif: Means and Medians (1)
Let $x_1, x_2, \ldots, x_n$ be a collection of real numbers. Suppose we want to choose a single value $v$ (which may or may not equal some $x_i$) that can be used to encapsulate or represent the entire set of data.
Aperitif: Means and Medians (2)
Interpretation 1: The squared error between a particular value $y$ and a value $x_i$ is $(y - x_i)^2$. Suppose we want to choose $v$ to be the value that minimizes the sum of the squared errors across all $x_i$. That is, we want $v$ to be the minimizer of the function
$$f(y) = \sum_{i=1}^{n} (y - x_i)^2$$
Exercise: Prove that the minimum of $f$ will always be attained at the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Aperitif: Means and Medians (3)
This is an unconstrained optimization problem, and since the function
$$f(y) = \sum_{i=1}^{n} (y - x_i)^2$$
is differentiable everywhere, we can find the minimum by setting the derivative to zero:
$$f'(y) = \sum_{i=1}^{n} 2(y - x_i) = 2n \cdot y - 2 \sum_{i=1}^{n} x_i$$
$$0 = 2n \cdot y - 2 \sum_{i=1}^{n} x_i$$
$$2n \cdot y = 2 \sum_{i=1}^{n} x_i$$
$$y = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$$
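As a quick numerical check of this result, here is a minimal sketch (assuming NumPy and SciPy are available; the sample data is arbitrary and not from these slides) that minimizes $f$ directly and compares the minimizer to the mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Arbitrary sample data.
x = np.array([2.0, 3.5, 5.0, 7.5, 11.0])

# f(y) = sum of squared errors between y and each x_i.
f = lambda y: np.sum((y - x) ** 2)

result = minimize_scalar(f)
print(result.x)   # numerical minimizer of f
print(x.mean())   # the mean; agrees to within solver tolerance
```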
Aperitif: Means and Medians (4)
Interpretation 2: The absolute error between a particular value $y$ and a value $x_i$ is $|y - x_i|$. Suppose we want to choose $v$ to be the value that minimizes the sum of the absolute errors across all $x_i$. That is, we want $v$ to be the minimizer of the function
$$f(y) = \sum_{i=1}^{n} |y - x_i|$$
Exercise: Prove that the minimum of $f$ will always be attained at the median $\tilde{x}$.
Aperitif: Means and Medians (5)
The function
$$f(y) = \sum_{i=1}^{n} |y - x_i|$$
is continuous, but the derivative does not exist at $y = x_i$.
Recall that the derivative of the absolute value operator $|x|$ is the sign function $\operatorname{sign}(x)$ (for $x \neq 0$). Since the sign function has a discontinuity at $x = 0$, the derivative
$$f'(y) = \sum_{i=1}^{n} \operatorname{sign}(y - x_i)$$
is not continuous at $y = x_i$.
Aperitif: Means and Medians (6)
Suppose, without loss of generality, that the values $x_1, x_2, \ldots, x_n$ are in ascending order. This implies that
$$\tilde{x} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$
and it also implies that
$$f'(y) = \sum_{i=1}^{n} \operatorname{sign}(y - x_i) = \begin{cases} -n & \text{if } y < x_1 \\ 2i - n & \text{if } x_i < y < x_{i+1} \\ n & \text{if } y > x_n \end{cases}$$
Aperitif: Means and Medians (7)
The formulation
$$f'(y) = \begin{cases} -n & \text{if } y < x_1 \\ 2i - n & \text{if } x_i < y < x_{i+1} \\ n & \text{if } y > x_n \end{cases}$$
implies that $f'(y)$ is monotonically increasing (where it is defined), and that
$$f'(y) \leq 0 \text{ if } y < \tilde{x}, \qquad f'(y) \geq 0 \text{ if } y > \tilde{x}$$
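This sign behaviour is easy to verify numerically. A minimal sketch (assuming NumPy; the sample data is arbitrary):

```python
import numpy as np

# Arbitrary sorted sample data (n = 5, so the median is the third value).
x = np.array([2.0, 3.5, 5.0, 7.5, 11.0])
med = np.median(x)

# f'(y) = sum of sign(y - x_i), evaluated away from the points x_i.
fprime = lambda y: np.sum(np.sign(y - x))

print(fprime(med - 0.5))  # -1: non-positive below the median
print(fprime(med + 0.5))  # +1: non-negative above the median
```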
Aperitif: Means and Medians (8)
Consider the set
S = {13.9, 16.7, 14.7, 15.4, 13.7, 13.4, 9.8, 13.7},
which contains 8 temperature observations (in degrees Celsius) from various weather stations in Victoria at 10:06am on April 20th, 2021.
Question: Suppose you are asked to give a single temperature value representing the temperature in Victoria at 10:06am. How would you compute it?
Aperitif: Means and Medians (9)
The mean of S is 13.9 and the median is 13.8. The elements of S are generally clustered around the mean and median, so either statistic might be appropriate for the task.
However, it is relatively common for observational data to contain outliers (e.g. due to sensor malfunctions or unusual localized conditions).
Aperitif: Means and Medians (10)
Consider the set T formed by adding the element 30 to S.
T = {13.9, 16.7, 14.7, 15.4, 13.7, 13.4, 9.8, 13.7, 30}.
The element 30 is clearly an outlier, although 30 is a plausible temperature value in general (so it would not necessarily be inherently suspicious in a temperature dataset).
Contrast the different means and medians of each set.
Set   Mean   Median
S     13.9   13.8
T     15.7   13.9
Aperitif: Means and Medians (11)
Set   Mean   Median
S     13.9   13.8
T     15.7   13.9
Notice that the outlier has a much stronger influence over the mean than the median. This is consistent with the interpretation of the median as minimizing absolute error and the mean as minimizing squared error, since the squared error of the outlier is much larger than the squared errors of the other values.
In cases where anomalous data points are understood to be outliers (e.g. a severe measurement error rather than a genuine extreme event), the median can provide a better way to capture the characteristics of the data set.
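This behaviour is easy to reproduce. A minimal sketch using Python's standard statistics module on the two sets above:

```python
from statistics import mean, median

S = [13.9, 16.7, 14.7, 15.4, 13.7, 13.4, 9.8, 13.7]
T = S + [30]  # add the outlier

# The outlier moves the mean by almost 2 degrees, the median by only 0.1.
print(round(mean(S), 1), median(S))  # 13.9 13.8
print(round(mean(T), 1), median(T))  # 15.7 13.9
```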
Interpolating Polynomials (1)
[Figure: a scatter plot of points in the plane]
Consider a collection $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of points in the plane.
Interpolating Polynomials (2)
[Figure: the same scatter plot of points]
Task: Find a polynomial $p(x)$ of degree $n-1$ such that $p(x_i) = y_i$ for $i \in \{1, 2, \ldots, n\}$.
Interpolating Polynomials (3)
[Figure: the points with an interpolating polynomial drawn through them]
There are a variety of reasons this might be useful, but the result is generally known as an interpolating polynomial.
Interpolating Polynomials (4)
Theorem: If $x_i \neq x_j$ for $i \neq j$ (that is, if the $x$ coordinates of each point are distinct), then there is a unique polynomial of degree at most $n - 1$ which interpolates the collection
$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.$$
The necessary condition that the $x$ coordinates be distinct is somewhat intuitive, since if $x_i = x_j$ for $i \neq j$, it isn't possible to have both $p(x_i) = y_i$ and $p(x_j) = y_j$ (assuming points $i$ and $j$ differ in their $y$-coordinates).
Interpolating Polynomials (5)
[Figure: the points with an interpolated curve]
There are other interpolation mechanisms (e.g. linear or spline interpolation) that might be better depending on the context. A numerical analysis course like CSC 349a covers interpolation in more detail.
Interpolating Polynomials (6)
For the set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the interpolating polynomial will have the form
$$p(x) = c_0 + c_1 x + c_2 x^2 + \ldots + c_{n-1} x^{n-1},$$
and, assuming the necessary distinctness condition from the theorem above is met, the desired properties of $c_0, c_1, \ldots, c_{n-1}$ are
$$\begin{aligned}
p(x_1) &= c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_{n-1} x_1^{n-1} = y_1 \\
p(x_2) &= c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_{n-1} x_2^{n-1} = y_2 \\
&\vdots \\
p(x_n) &= c_0 + c_1 x_n + c_2 x_n^2 + \ldots + c_{n-1} x_n^{n-1} = y_n
\end{aligned}$$
Interpolating Polynomials (7)
In other words, we want
$$\begin{pmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^{n-1} \\
1 & x_2 & x_2^2 & \ldots & x_2^{n-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \ldots & x_n^{n-1}
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{pmatrix}
=
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
and since the values of each $x_i$ and $y_i$ are known, the coefficients $c_i$ can be obtained by solving the equation above.
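As a sketch of how this system can be solved in practice (assuming NumPy; the points are the six-point set used later in these slides), build the Vandermonde matrix and solve for the coefficients:

```python
import numpy as np

# Points with distinct x coordinates.
pts = [(1, 1), (3, 3), (6, 2), (7, 1), (8, 5), (11, 5)]
x = np.array([p[0] for p in pts], dtype=float)
y = np.array([p[1] for p in pts], dtype=float)

n = len(pts)
# Vandermonde matrix: row i is (1, x_i, x_i^2, ..., x_i^(n-1)).
A = np.vander(x, n, increasing=True)

# Solve Ac = y for the coefficients c_0, ..., c_{n-1}.
c = np.linalg.solve(A, y)

# Verify: p(x_i) reproduces each y_i.
print(np.allclose(A @ c, y))  # True
```

(In practice the Vandermonde matrix can be badly conditioned for large $n$, which is one reason other bases and interpolation schemes exist.)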
Interpolating Polynomials (8)
[Figure: the points with an interpolating curve]
In general, it is possible to construct interpolating functions using a variety of basis functions (not just the polynomial basis $1, x, x^2, \ldots, x^{n-1}$).
Curve Fitting (1)
[Figure: a larger point set with a roughly quadratic trend]
Now suppose that S contains a large number of points and that we believe there is an underlying trend (e.g. that the relationship between y and x is more-or-less quadratic).
Curve Fitting (2)
[Figure: the point set with a fitted degree 2 polynomial]
Task: Fit a low-degree polynomial (degree 2 pictured above) to the data.
Curve Fitting (3)
[Figure: the point set with a fitted degree 2 polynomial]
It might be clear that this task requires some kind of optimization, since we cannot expect any quadratic function to exactly fit such a large number of points, and therefore want to choose a function which minimizes the error.
Curve Fitting (4)
[Figure: the point set with a fitted degree 2 polynomial]
Although the formulation for interpolating polynomials is not directly applicable, we can use it to design a linear program for curve fitting.
Curve Fitting (5)
[Figure: the point set with a fitted degree 2 polynomial]
There is also no reason that we should limit ourselves to quadratic polynomials (or polynomials in general).
Approximation Error (1)
[Figure: six points plotted in the plane]
Consider the set of points
$$S = \{(1, 1), (3, 3), (6, 2), (7, 1), (8, 5), (11, 5)\}$$
Approximation Error (2)
[Figure: the six points of S]
Suppose we want to fit a quadratic (degree 2) polynomial to these 6 points. It is likely that some or all of the points do not actually lie on the chosen function.
Approximation Error (3)
[Figure: the six points with a fitted quadratic]
The error between each point $(x_i, y_i)$ and the approximated value $p(x_i)$ is
$$\varepsilon_i = p(x_i) - y_i$$
Approximation Error (4)
[Figure: the fitted quadratic with individual error values labeled]
Since we cannot expect every error value to be zero, the polynomial coefficients should be chosen to minimize some expression of the total error.
Approximation Error (6)
Question: What does it mean to minimize the total error?
Measurement of Error (1)
The discussion of error minimization is usually framed in terms of vector norms. Let
$$E = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)$$
be the vector whose components are the individual error terms $\varepsilon_i$. Note that
$$E = (p(x_1), p(x_2), \ldots, p(x_n)) - (y_1, y_2, \ldots, y_n).$$
A perfect approximation would result in $E = 0$. Otherwise, the length (or norm) of $E$ provides a measurement of the overall error of an approximation. The choice of norm then controls how the measurement changes as error terms change.
Measurement of Error (2)
We are going to focus on a family of norm functions called Lp norms, which includes the most commonly used norms (the L1, L2 and L∞ norms).
Although it is possible to define the total error using an arbitrary function, there are reasons that a formal norm (having certain algebraic criteria, like a triangle inequality) is generally the best choice.
Higher-level linear algebra courses discuss these implications further.
Measurement of Error (3)
For $p > 0$, the Lp norm of a vector $v = (v_1, v_2, \ldots, v_n)$ is given by
$$\|v\|_p = \left( |v_1|^p + |v_2|^p + \ldots + |v_n|^p \right)^{1/p}$$
There are ways to extend this definition for p = 0 as well (but the obvious extension, using the formula above, does not meet the mathematical criteria to be a norm).
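A direct transcription of this definition may help. A minimal sketch (assuming NumPy; for $p \geq 1$ it agrees with numpy.linalg.norm):

```python
import numpy as np

def lp_norm(v, p):
    """The Lp norm (|v_1|^p + ... + |v_n|^p)^(1/p), for p > 0."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

v = np.array([3.0, -4.0])
print(lp_norm(v, 1))              # 7.0 (L1 norm)
print(lp_norm(v, 2))              # 5.0 (L2 norm)
print(np.linalg.norm(v, np.inf))  # 4.0 (L-infinity norm, via NumPy)
```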
Measurement of Error (4)
The L1-norm of a vector $v = (v_1, v_2, \ldots, v_n)$ is given by
$$\|v\|_1 = |v_1| + |v_2| + \ldots + |v_n|.$$
If total approximation error is measured with the L1 norm, an approximation with minimum total error will be one which minimizes the sum of the magnitudes of the individual error terms.
Measurement of Error (5)
The L2-norm of a vector $v = (v_1, v_2, \ldots, v_n)$ is given by
$$\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}.$$
(The absolute value operator on each $v_i$ term is unnecessary since squares are never negative.)
The L2 norm is the familiar Euclidean length. For use as an objective function in a minimization problem, it is sufficient to minimize the sum of squares (instead of taking the square root). This is equivalent to the objective of least squares regression.
Measurement of Error (6)
The L∞-norm of a vector $v = (v_1, v_2, \ldots, v_n)$ is the result of taking the limit of the Lp norm formula as $p \to \infty$:
$$\|v\|_\infty = \max(|v_1|, |v_2|, \ldots, |v_n|).$$
If the L∞ norm is used to measure total error, the approximation with minimum total error will minimize the largest individual error magnitude.
Optimization for Curve Fitting (1)
Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $x_i \neq x_j$ for all $i \neq j$. Consider the problem of finding a polynomial
$$p(x) = c_0 + c_1 x + c_2 x^2 + \ldots + c_n x^n$$
of degree $n$ which minimizes the total approximation error among points in $S$.
Optimization for Curve Fitting (2)
Minimizing the L1-norm: The coefficients $c_0, c_1, \ldots, c_n$ are the solution to the following (unconstrained) minimization problem:
$$\min. \sum_{i=1}^{m} \left| y_i - \sum_{j=0}^{n} c_j x_i^j \right|$$
Optimization for Curve Fitting (3)
The formulation
$$\min. \sum_{i=1}^{m} \left| \sum_{j=0}^{n} c_j x_i^j - y_i \right|$$
can be rewritten in constrained form as
$$\begin{aligned}
\min.\ & |\varepsilon_1| + |\varepsilon_2| + \ldots + |\varepsilon_m| \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 = y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 = y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m = y_m
\end{aligned}$$
where equality constraints are used to create an explicit variable $\varepsilon_i$ for each error term.
Optimization for Curve Fitting (4)
Minimizing the L2-norm: The coefficients $c_0, c_1, \ldots, c_n$ are the solution to the following (unconstrained) minimization problem
$$\min. \sum_{i=1}^{m} \left( y_i - \sum_{j=0}^{n} c_j x_i^j \right)^2$$
or, similar to the previous case, written in constrained form as
$$\begin{aligned}
\min.\ & \varepsilon_1^2 + \varepsilon_2^2 + \ldots + \varepsilon_m^2 \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 = y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 = y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m = y_m
\end{aligned}$$
Optimization for Curve Fitting (5)
Minimizing the L∞-norm: The coefficients $c_0, c_1, \ldots, c_n$ are the solution to the following (unconstrained) minimization problem
$$\min. \max_{i=1,2,\ldots,m} \left| y_i - \sum_{j=0}^{n} c_j x_i^j \right|$$
or, in constrained form,
$$\begin{aligned}
\min.\ & \max(|\varepsilon_1|, |\varepsilon_2|, \ldots, |\varepsilon_m|) \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 = y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 = y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m = y_m
\end{aligned}$$
Optimization for Curve Fitting (6)
Notice that, in all three constrained formulations, the constraint system is linear but the objective function is not.
It turns out that the L1 and L∞ cases can be written as linear programs (by finding a linear expression of the absolute value and maximum operators).
The L2 case can be solved as a quadratic program (for polynomials of arbitrary degree). It can also be solved analytically with linear algebra for the case n = 1 (linear regression).
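To make the L∞ claim concrete, here is one possible linearization (a sketch only; this is not the formulation developed on the following slides, which treat the L1 case): introduce a single variable $t$ that bounds every residual from above and below, and minimize $t$. The code assumes scipy.optimize.linprog and reuses the six-point data set with a line ($n = 1$):

```python
import numpy as np
from scipy.optimize import linprog

pts = [(1, 1), (3, 3), (6, 2), (7, 1), (8, 5), (11, 5)]
x = np.array([p[0] for p in pts], dtype=float)
y = np.array([p[1] for p in pts], dtype=float)

n = 1                                      # polynomial degree
A = np.vander(x, n + 1, increasing=True)   # rows (1, x_i, ..., x_i^n)
m, k = A.shape

# Variables: (c_0, ..., c_n, t).  Objective: minimize t.
obj = np.zeros(k + 1)
obj[-1] = 1.0

# p(x_i) - y_i <= t  and  y_i - p(x_i) <= t, as A_ub z <= b_ub.
A_ub = np.block([[A, -np.ones((m, 1))], [-A, -np.ones((m, 1))]])
b_ub = np.concatenate([y, -y])

bounds = [(None, None)] * k + [(0, None)]  # c free, t >= 0
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:k], res.x[-1])  # coefficients and the minimax error t
```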
L1 Norms via Linear Programming (1)
Task: Rewrite the L1 regression problem
$$\begin{aligned}
\min.\ & |\varepsilon_1| + |\varepsilon_2| + \ldots + |\varepsilon_m| \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 = y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 = y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m = y_m
\end{aligned}$$
as a linear program.
L1 Norms via Linear Programming (2)
$$\begin{aligned}
\min.\ & |\varepsilon_1| + |\varepsilon_2| + \ldots + |\varepsilon_m| \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 = y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 = y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m = y_m
\end{aligned}$$
The only obstacle is the absolute value operator in the objective function (the constraints are already linear, albeit not in standard form).
L1 Norms via Linear Programming (3)
$$\begin{aligned}
\min.\ & \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_m \\
\text{s.t.}\ & \varepsilon_1 = |c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - y_1| \\
& \varepsilon_2 = |c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - y_2| \\
& \vdots \\
& \varepsilon_m = |c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - y_m|
\end{aligned}$$
We can rewrite the problem to move the absolute value operators into the constraints (and force each $\varepsilon_i$ to be non-negative as a result).
L1 Norms via Linear Programming (4)
Claim: The constraint $p \geq |q|$ is equivalent to the pair
$$p \geq q, \qquad p \geq -q$$
and furthermore, an optimization problem like
$$\begin{aligned}
\min.\ & p \\
\text{s.t.}\ & p = |q| \\
& p \geq 0
\end{aligned}$$
(with $p$ subject to no other constraints) is equivalent to
$$\begin{aligned}
\min.\ & p \\
\text{s.t.}\ & p \geq q \\
& p \geq -q \\
& p \geq 0
\end{aligned}$$
L1 Norms via Linear Programming (5)
Justification: The constraint $p \geq |q|$ is satisfied if and only if $p$ is greater than or equal to the larger of $q$ and $-q$ (at most one of which is positive).
In the optimization problem
$$\begin{aligned}
\min.\ & p \\
\text{s.t.}\ & p \geq |q| \\
& p \geq 0
\end{aligned}$$
the minimization of $p$ (by the objective function) will force $p$ to take the value $|q|$ (as long as no other constraints act on $p$).
L1 Norms via Linear Programming (6)
$$\begin{aligned}
\min.\ & \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_m \\
\text{s.t.}\ & \varepsilon_1 = |c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - y_1| \\
& \varepsilon_2 = |c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - y_2| \\
& \vdots \\
& \varepsilon_m = |c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - y_m|
\end{aligned}$$
Using the transformation on the previous slides, we can convert the non-linear optimization problem above into a linear program.
L1 Norms via Linear Programming (7)
$$\begin{aligned}
\min.\ & \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_m \\
\text{s.t.}\ & \varepsilon_1 \geq c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - y_1 \\
& \varepsilon_1 \geq -c_0 - c_1 x_1 - c_2 x_1^2 - \ldots - c_n x_1^n + y_1 \\
& \varepsilon_2 \geq c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - y_2 \\
& \varepsilon_2 \geq -c_0 - c_1 x_2 - c_2 x_2^2 - \ldots - c_n x_2^n + y_2 \\
& \vdots \\
& \varepsilon_m \geq c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - y_m \\
& \varepsilon_m \geq -c_0 - c_1 x_m - c_2 x_m^2 - \ldots - c_n x_m^n + y_m
\end{aligned}$$
The direct result of the transformation (not in standard form) is shown above.
L1 Norms via Linear Programming (8)
$$\begin{aligned}
\min.\ & \varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_m \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 \leq y_1 \\
& -c_0 - c_1 x_1 - c_2 x_1^2 - \ldots - c_n x_1^n - \varepsilon_1 \leq -y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 \leq y_2 \\
& -c_0 - c_1 x_2 - c_2 x_2^2 - \ldots - c_n x_2^n - \varepsilon_2 \leq -y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m \leq y_m \\
& -c_0 - c_1 x_m - c_2 x_m^2 - \ldots - c_n x_m^n - \varepsilon_m \leq -y_m
\end{aligned}$$
Putting the LP into standard form (and moving all of the terms involving the optimization variables $c_0, c_1, \ldots, c_n, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m$ to the left hand sides of the constraints) yields the formulation above.
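This LP can be constructed mechanically. A minimal sketch (assuming scipy.optimize.linprog; any LP solver would do), using the six-point data set from earlier and a line ($n = 1$), with the $c_j$ left free and the $\varepsilon_i$ forced non-negative:

```python
import numpy as np
from scipy.optimize import linprog

pts = [(1, 1), (3, 3), (6, 2), (7, 1), (8, 5), (11, 5)]
x = np.array([p[0] for p in pts], dtype=float)
y = np.array([p[1] for p in pts], dtype=float)

n = 1
A = np.vander(x, n + 1, increasing=True)
m, k = A.shape
I = np.eye(m)

# Variables: (c_0, ..., c_n, eps_1, ..., eps_m).
obj = np.concatenate([np.zeros(k), np.ones(m)])  # minimize sum of eps_i

#  A c - eps <= y   and   -A c - eps <= -y
A_ub = np.block([[A, -I], [-A, -I]])
b_ub = np.concatenate([y, -y])

bounds = [(None, None)] * k + [(0, None)] * m    # c free, eps >= 0
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:k])  # L1 regression coefficients c_0, c_1
```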
L2 Regression via Linear Algebra (1)
Task: Find a way to solve the L2 regression problem
$$\begin{aligned}
\min.\ & \varepsilon_1^2 + \varepsilon_2^2 + \ldots + \varepsilon_m^2 \\
\text{s.t.}\ & c_0 + c_1 x_1 + c_2 x_1^2 + \ldots + c_n x_1^n - \varepsilon_1 = y_1 \\
& c_0 + c_1 x_2 + c_2 x_2^2 + \ldots + c_n x_2^n - \varepsilon_2 = y_2 \\
& \vdots \\
& c_0 + c_1 x_m + c_2 x_m^2 + \ldots + c_n x_m^n - \varepsilon_m = y_m
\end{aligned}$$
using linear algebra (not quadratic programming).
L2 Regression via Linear Algebra (2)
The unconstrained form of the L2 regression problem is
$$\min. \sum_{i=1}^{m} \left( y_i - \sum_{j=0}^{n} c_j x_i^j \right)^2$$
The objective function
$$F(c_0, c_1, \ldots, c_n) = \sum_{i=1}^{m} \left( y_i - \sum_{j=0}^{n} c_j x_i^j \right)^2$$
can be minimized by finding points where the gradient
$$\nabla F(c_0, c_1, \ldots, c_n) = \left( \frac{\partial F}{\partial c_0}, \ldots, \frac{\partial F}{\partial c_n} \right)$$
equals zero.
L2 Regression via Linear Algebra (3)
The elements of the gradient are the partial derivatives of $F$ with respect to each coefficient $c_k$:
$$\frac{\partial F}{\partial c_k} = \frac{\partial}{\partial c_k} \sum_{i=1}^{m} \left( y_i - \sum_{j=0}^{n} c_j x_i^j \right)^2 = -2 \sum_{i=1}^{m} x_i^k \left( y_i - \sum_{j=0}^{n} c_j x_i^j \right)$$
L2 Regression via Linear Algebra (4)
Setting the partial derivative with respect to $c_k$ to zero and rearranging gives
$$\begin{aligned}
-2 \sum_{i=1}^{m} x_i^k \left( y_i - \sum_{j=0}^{n} c_j x_i^j \right) &= 0 \\
-2 \sum_{i=1}^{m} x_i^k y_i + 2 \sum_{i=1}^{m} x_i^k \sum_{j=0}^{n} c_j x_i^j &= 0 \\
2 \sum_{i=1}^{m} x_i^k \sum_{j=0}^{n} c_j x_i^j &= 2 \sum_{i=1}^{m} x_i^k y_i \\
\sum_{i=1}^{m} \sum_{j=0}^{n} x_i^k x_i^j c_j &= \sum_{i=1}^{m} x_i^k y_i
\end{aligned}$$
The collection of these equations, one for each $c_k$, gives us a system of $n+1$ equations in $n+1$ unknowns.
L2 Regression via Linear Algebra (5)
Define an $m \times (n+1)$ matrix $A$ similarly to the polynomial interpolation matrix:
$$A = \begin{pmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^n \\
1 & x_2 & x_2^2 & \ldots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_m & x_m^2 & \ldots & x_m^n
\end{pmatrix}$$
L2 Regression via Linear Algebra (6)
Observation 1: The sum
$$\sum_{i=1}^{m} x_i^k y_i$$
is the $k$th entry of the product
$$A^T y = \begin{pmatrix}
1 & 1 & 1 & \ldots & 1 \\
x_1 & x_2 & x_3 & \ldots & x_m \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^n & x_2^n & x_3^n & \ldots & x_m^n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$
L2 Regression via Linear Algebra (7)
Observation 2: The sum
$$\sum_{i=1}^{m} \sum_{j=0}^{n} x_i^k x_i^j c_j$$
is the $k$th entry of the product
$$A^T A c = \begin{pmatrix}
1 & 1 & 1 & \ldots & 1 \\
x_1 & x_2 & x_3 & \ldots & x_m \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^n & x_2^n & x_3^n & \ldots & x_m^n
\end{pmatrix}
\begin{pmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^n \\
1 & x_2 & x_2^2 & \ldots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_m & x_m^2 & \ldots & x_m^n
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{pmatrix}$$
L2 Regression via Linear Algebra (8)
The matrix $Q = A^T A$ is $(n+1) \times (n+1)$, and the entry $q_{kj}$ at row $k$ and column $j$ of $Q$ is
$$q_{kj} = \sum_{i=1}^{m} x_i^k x_i^j.$$
Therefore, each entry $r_k$ of the product $R = A^T A c = Qc$ is
$$r_k = \sum_{j=0}^{n} q_{kj} c_j = \sum_{j=0}^{n} \left( \sum_{i=1}^{m} x_i^k x_i^j \right) c_j = \sum_{i=1}^{m} \sum_{j=0}^{n} x_i^k x_i^j c_j$$
(Note that the zero-based indexing of the sum over $j$ is a notational convenience.)
L2 Regression via Linear Algebra (9)
Combining the observations from the previous slides, the points $(c_0, c_1, \ldots, c_n)$ such that $\nabla F(c_0, c_1, \ldots, c_n) = 0$ are precisely the solutions to the equation
$$A^T A c = A^T y$$
(with the same $A$, $c$ and $y$ as defined earlier).
Therefore, if the matrix $A^T A$ is invertible, the coefficient vector $c$ is equal to $(A^T A)^{-1} A^T y$ (and can be computed either directly with the inverse or by solving the linear system).
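A minimal NumPy sketch of this procedure, solving the normal equations $A^T A c = A^T y$ rather than forming the inverse explicitly (which is the numerically preferable route), and cross-checking against NumPy's built-in least squares solver:

```python
import numpy as np

# Six-point data set from earlier; least squares fit of a line (n = 1).
pts = [(1, 1), (3, 3), (6, 2), (7, 1), (8, 5), (11, 5)]
x = np.array([p[0] for p in pts], dtype=float)
y = np.array([p[1] for p in pts], dtype=float)

n = 1
A = np.vander(x, n + 1, increasing=True)  # m x (n+1) design matrix

# Normal equations: (A^T A) c = A^T y.
c = np.linalg.solve(A.T @ A, A.T @ y)
print(c)  # least squares coefficients c_0, c_1

# Cross-check against NumPy's least squares routine.
print(np.linalg.lstsq(A, y, rcond=None)[0])
```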
L2 Regression via Linear Algebra (10)
Question: When is $A^T A$ invertible?
L2 Regression via Linear Algebra (11)
Lemma 1: If $m \geq n+1$ and there are at least $m$ distinct $x$ values, $A$ has rank $n+1$. That is, the columns
$$\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix},
\cdots,
\begin{pmatrix} x_1^n \\ x_2^n \\ \vdots \\ x_m^n \end{pmatrix}$$
are linearly independent.
L2 Regression via Linear Algebra (12)
Lemma 2: The rank of $A^T A$ is equal to the rank of $A$ (for any matrix $A$, not just the one used for L2 regression).
Proof: If $A$ is $m \times (n+1)$, then $A^T A$ is $(n+1) \times (n+1)$. Since both matrices have the same number of columns, the rank of each can be written as
$$\operatorname{rank}(A) = (n+1) - \operatorname{nullity}(A)$$
$$\operatorname{rank}(A^T A) = (n+1) - \operatorname{nullity}(A^T A)$$
We can show that the null space of $A$ is identical to the null space of $A^T A$, which implies $\operatorname{nullity}(A) = \operatorname{nullity}(A^T A)$ and therefore proves that $\operatorname{rank}(A) = \operatorname{rank}(A^T A)$.
L2 Regression via Linear Algebra (13)
Part I: $\operatorname{Null}(A) \subseteq \operatorname{Null}(A^T A)$:
Let $x \in \operatorname{Null}(A)$. Then $Ax = 0$ and $(A^T A)x = A^T(Ax) = 0$, so $x \in \operatorname{Null}(A^T A)$.
L2 Regression via Linear Algebra (14)
Part II: $\operatorname{Null}(A^T A) \subseteq \operatorname{Null}(A)$ (proof courtesy of Wikipedia):
Let $x \in \operatorname{Null}(A^T A)$. Then $(A^T A)x = 0$. Multiplying both sides by $x^T$ gives
$$\begin{aligned}
x^T A^T A x &= x^T 0 \\
(x^T A^T) A x &= 0 \quad \text{(a scalar)} \\
(Ax)^T Ax &= 0
\end{aligned}$$
The last equality implies that the length of the vector $Ax$ is zero, which requires that $Ax = 0$.
L2 Regression via Linear Algebra (15)
Lemma 1 states that $A$ has rank $n+1$, and Lemma 2 implies that the product $A^T A$ also has rank $n+1$.
Since $A^T A$ is $(n+1) \times (n+1)$, it therefore has full rank and is invertible.
Comparison of Regression Methods (1)
[Figure: a scatter plot of data points]
Consider the point set above.
Comparison of Regression Methods (2)
[Figure: the point set with both regression lines]
The results of linear ($n = 1$) regression using both the L1 norm (solid red line) and the L2 norm (dashed green line) are shown above.
Comparison of Regression Methods (3)
[Figure: the point set with both regression lines]
L2 regression (i.e. least squares regression) is generally accepted as the default method for fitting a line to a set of data, and can be viewed as the two-dimensional analogue of the mean of a set of one-dimensional values.
Comparison of Regression Methods (4)
[Figure: the point set with both regression lines]
(As we saw at the beginning, the mean of a collection of data is the value which minimizes the squared error among all data points.)
Comparison of Regression Methods (5)
[Figure: the point set with both regression lines]
L1 regression, by minimizing the absolute error, is analogous to the median.
Comparison of Regression Methods (6)
[Figure: the point set with an added outlier and refitted regression lines]
Consider the result of adding a single extra data point, deliberately chosen to be an outlier.
Comparison of Regression Methods (7)
[Figure: the point set with the outlier and both regression lines]
Notice that the addition of an outlier had a much more dramatic effect on the L2 regression result than the L1 regression result.
Comparison of Regression Methods (8)
[Figure: the point set with the outlier and both regression lines]
Since L2 regression uses the squares of errors, points with large errors can have disproportionate contributions to the result.
Comparison of Regression Methods (9)
[Figure: the point set with the outlier and both regression lines]
Similar to how the median of a set of data can be more robust than the mean if outliers are present, L1 regression can be preferable in cases where outliers might skew the results of L2 regression.
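The outlier experiment can be reproduced end to end by combining the two fitting procedures developed earlier. A self-contained sketch (assuming NumPy and SciPy; the outlier coordinates below are illustrative, not necessarily the ones used in the figures):

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(x, y, n=1):
    """L1 fit via the LP formulation from the earlier slides."""
    A = np.vander(x, n + 1, increasing=True)
    m, k = A.shape
    obj = np.concatenate([np.zeros(k), np.ones(m)])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * m
    return linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:k]

def l2_fit(x, y, n=1):
    """L2 (least squares) fit via the normal equations."""
    A = np.vander(x, n + 1, increasing=True)
    return np.linalg.solve(A.T @ A, A.T @ y)

x = np.array([1.0, 3.0, 6.0, 7.0, 8.0, 11.0])
y = np.array([1.0, 3.0, 2.0, 1.0, 5.0, 5.0])
print(l1_fit(x, y), l2_fit(x, y))   # fits without the outlier

# Add a single outlier and refit: the L2 coefficients move
# substantially more than the L1 coefficients.
x_o = np.append(x, 6.0)
y_o = np.append(y, 20.0)
print(l1_fit(x_o, y_o), l2_fit(x_o, y_o))
```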