
WiSe 2021/22
Machine Learning 1/1-X
Lecture 9 Regression

• Classification vs. Regression
• Regression:
  • Square Loss
  • Least Square Regression
  • Ridge Regression
  • Kernel Ridge Regression
  • Gaussian Processes
  • Robust Regression

Classification vs. Regression
Classification: f : R^d → {−1, 1}        Regression: f : R^d → R

Why Regression?
Example: Modeling physical systems
Source: Sauceda et al. J. Chem. Phys. 150, 114102 (2019)
• Useful models for molecular dynamics / chemical reactions.
• The quantity to predict (energy) is intrinsically real-valued.
• From the energy, one can derive quantities such as forces on different atoms.

Why Regression?
Example: Energy Forecasting
Source: https://www.enertiv.com/resources/faq/what-is-energy-forecasting
• Schedule energy-consuming tasks intelligently to minimize costs.
• Related tasks: demand forecasting, etc.

Recap: Optimal Classifier
Example of data generating assumption:
• Assume our data is generated for each class ω_j according to the multivariate Gaussian distribution
  p(x | ω_j) = N(μ_j, Σ)
  and with class priors P(ω_j).
Resulting optimal classifier:
• The Bayes optimal classifier is derived as
  argmax_j { P(ω_j | x) } = argmax_j { x^T Σ^{-1} μ_j − (1/2) μ_j^T Σ^{-1} μ_j + log P(ω_j) }
  which is a linear classifier.
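As an illustration, this discriminant can be evaluated directly. The following NumPy sketch assumes estimated class means, a shared covariance matrix, and class priors are given (all names are illustrative, not from the lecture):

import numpy as np

def bayes_classify(x, means, cov, priors):
    # Evaluate x^T Sigma^{-1} mu_j - 1/2 mu_j^T Sigma^{-1} mu_j + log P(omega_j)
    # for every class j and return the argmax (the Bayes optimal decision).
    cov_inv = np.linalg.inv(cov)
    scores = [x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(p)
              for mu, p in zip(means, priors)]
    return int(np.argmax(scores))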

Building an Optimal Regressor
Example of data generating assumption:
• Let x and t denote the input and the output (target), respectively. Assume our data comes from the joint distribution
  P(x, t) = N( [μ_x; μ_t], [Σ_xx, Σ_xt; Σ_tx, Σ_tt] )
Resulting optimal regressor:
• Using identities of multivariate Gaussians, we get the conditional expectation:
  E[t | x] = Σ_tx Σ_xx^{-1} x + (μ_t − Σ_tx Σ_xx^{-1} μ_x)
  This is the best possible regressor in the sense of expectation, and again a linear model.
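A minimal NumPy sketch of this optimal regressor, assuming the parameters of the joint Gaussian (means and covariance blocks) are known; the function name and arguments are illustrative:

import numpy as np

def conditional_mean(x, mu_x, mu_t, Sigma_xx, Sigma_tx):
    # E[t|x] = Sigma_tx Sigma_xx^{-1} x + (mu_t - Sigma_tx Sigma_xx^{-1} mu_x),
    # i.e. a linear model A x + c in the input x.
    A = Sigma_tx @ np.linalg.inv(Sigma_xx)
    return A @ x + (mu_t - A @ mu_x)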

Objective-based learning
• In real-world applications, data generating distributions are not known, or are difficult to estimate.
• For practical purposes, we consider instead a class of models of limited complexity (e.g. linear)
  y = w^T x + b
  with θ = (w, b), and find the model in that class that minimizes the error on the given dataset D (empirical error):
  θ̂ = argmin_{θ ∈ Θ} R_emp(θ, D)
• An example of such an approach in the context of classification: the perceptron.

Recap: The Perceptron
• Proposed by F. Rosenblatt in 1958.
• Classifier that perfectly separates the training data (if the data is linearly separable).
• Trained using a simple and cheap iterative procedure.
The perceptron can also be seen as a gradient descent of the error function
  E(w, b) = (1/N) Σ_{k=1}^N max(0, −y_k t_k)
where y_k = w^T x_k + b is the model output, t_k ∈ {−1, 1} is the label, and max(0, −y_k t_k) is called the hinge loss.
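A rough sketch of this view of the perceptron, implemented as stochastic gradient descent on max(0, −y_k t_k) with y_k = w^T x_k + b (array shapes and the learning-rate choice are assumptions for illustration):

import numpy as np

def perceptron(X, t, lr=1.0, epochs=100):
    # X: array of shape (N, d); t: labels in {-1, +1} of shape (N,)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_k, t_k in zip(X, t):
            if t_k * (x_k @ w + b) <= 0:   # hinge loss is active (nonzero gradient)
                w += lr * t_k * x_k        # negative gradient of max(0, -y t) w.r.t. w
                b += lr * t_k
    return w, b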

Hinge Loss vs. Square Loss
• Use a loss function designed for regression, e.g. the square loss:
  E(w, b) = (1/N) Σ_{k=1}^N (y_k − t_k)^2
• The square loss (y_k − t_k)^2 is a popular way of measuring the error of a model and optimizing it. [Gauss 1809; Legendre 1805]
C. F. Gauss (1777–1855)
A. M. Legendre (1752–1833)

Least Square Regression
Initial Trick
To simplify the derivations, replace the bias by adding a constant dimension to the data and a corresponding weight:
  w^T x + b = [x, 1]^T [w, b]
Therefore, the error function
  E(w, b) = (1/N) Σ_{k=1}^N (w^T x_k + b − t_k)^2
can be rewritten more compactly as:
  E(w) = (1/N) Σ_{k=1}^N (w^T x_k − t_k)^2

Least Square Regression
The error function to minimize can be developed as:
  E(w) = (1/N) Σ_k (w^T x_k − t_k)^2
       = (1/N) Σ_k w^T x_k x_k^T w − (2/N) Σ_k w^T x_k t_k + cst.
       = (1/N) w^T X X^T w − (2/N) w^T X t + cst.
where X = (x_1 | … | x_N) is a matrix of size d × N containing the data and t = (t_1, …, t_N) is a vector of size N containing the targets.

Least Square Regression
Because X X^T is positive semi-definite, the error E(w) is a convex function of w, and a necessary condition for a minimum of the function is that ∇E(w) = 0:
  ∇E(w) = ∇( (1/N) w^T X X^T w − (2/N) w^T X t + cst. )
        = (2/N) X X^T w − (2/N) X t = 0
Rearranging terms and multiplying by (X X^T)^{-1} on both sides, we get the closed-form solution:
  w = (X X^T)^{-1} X t
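A minimal NumPy sketch of this closed-form solution, with the bias absorbed via the constant-dimension trick from the previous slide (X is assumed to be the d × N data matrix and t the length-N target vector, as above):

import numpy as np

def least_squares(X, t):
    Xa = np.vstack([X, np.ones((1, X.shape[1]))])  # append the constant dimension
    w = np.linalg.solve(Xa @ Xa.T, Xa @ t)         # solves (X X^T) w = X t
    return w                                       # last entry of w is the bias b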

Least Square Regression, Dilemma
Questions:
• Are variations in the training data signal or noise?
• How well will the model generalize to new data?
• Can we learn models subject to the constraint that they are simple (e.g. of limited slope)?

Recap: Structural Risk Minimization (SRM)
Structural risk minimization (Vapnik and Chervonenkis, 1974) is an approach to measure complexity and perform model selection.
• Structure the space of solutions into a nesting of increasingly large regions.
• If two solutions fit the data, prefer the solution that also belongs to the smaller regions.

Ridge Regression
• Implement the SRM principle by further restricting the class of functions the model can be selected from, i.e. min E(w) s.t. ‖w‖² ≤ C.
• This objective can be minimized using Lagrange multipliers:
  ∇_w L(w, λ) = ∇_w ( (1/N) w^T X X^T w − (2/N) w^T X t + λ · (‖w‖² − C) ) = 0
  which leads to:
  w = (X X^T + Nλ I)^{-1} X t
  where λ is chosen to minimize the error subject to the constraint being satisfied. This regularized least squares is called ridge regression.
• In practice, instead of specifying the parameter C, it is common to directly treat λ as the hyperparameter.
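The closed form with the regularizer added is only a one-line change. A sketch, again assuming X is d × N and t has length N:

import numpy as np

def ridge(X, t, lam):
    d, N = X.shape
    # w = (X X^T + N*lambda*I)^{-1} X t
    return np.linalg.solve(X @ X.T + N * lam * np.eye(d), X @ t)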

Ridge Regression
[Figure: predicted functions for λ = 0, λ = 5, λ = 25]
• The higher the parameter λ, the flatter the predicted function.
• High values of λ are desirable for noisy high-dimensional data.

From Linear to Nonlinear Regression
• If the function to predict is nonlinear, nonlinearly transform the input data via some feature map Φ : R^d → R^h, and solve the problem linearly in the feature space.
• The feature map needs to be chosen appropriately.

Kernel Ridge Regression
• Idea: Redefine the prediction function as y = w^T Φ(x) with w ∈ R^h, and the objective to minimize as
  E(w) = (1/N) Σ_{k=1}^N (w^T Φ(x_k) − t_k)^2   subject to   ‖w‖² ≤ C.
• The solution to this optimization problem is given by
  w = (Φ(X) Φ(X)^T + λI)^{-1} Φ(X) t
  where Φ(X) = (Φ(x_1) | … | Φ(x_N)), and for an appropriate choice of parameter λ.
• A new data point x is then predicted as
  y = w^T Φ(x) = Φ(x)^T w = Φ(x)^T (Φ(X) Φ(X)^T + λI)^{-1} Φ(X) t

Kernel Ridge Regression
From:
  y = Φ(x)^T (Φ(X) Φ(X)^T + λI)^{-1} Φ(X) t
and using the identity (Φ(X) Φ(X)^T + λI)^{-1} Φ(X) = Φ(X) (Φ(X)^T Φ(X) + λI)^{-1}, we arrive at:
  y = k(x, X) (K + λI)^{-1} t
where K = Φ(X)^T Φ(X) is the kernel (Gram) matrix with K_ij = k(x_i, x_j), and k(x, X) = Φ(x)^T Φ(X) is the row vector of kernel evaluations between x and the training points.
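A sketch of kernel ridge regression using a Gaussian kernel as an example (the kernel choice and the bandwidth parameter gamma are assumptions; any positive-definite kernel could be plugged in):

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # A: d x N, B: d x M  ->  N x M matrix of kernel values exp(-gamma * ||a - b||^2)
    sq_dists = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-gamma * sq_dists)

def krr_fit(X, t, lam, gamma=1.0):
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), t)   # alpha = (K + lambda I)^{-1} t

def krr_predict(Xnew, X, alpha, gamma=1.0):
    return gaussian_kernel(Xnew, X, gamma) @ alpha            # y = k(x, X) alpha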

Kernel Ridge Regression
Observation:
• Predictions of the kernel ridge regression model:
  y(x) = k(x, X) (K + λI)^{-1} t
  can be rewritten as a weighted sum of kernel basis functions:
  y(x) = Σ_i k(x, x_i) · α_i   where   α = (K + λI)^{-1} t.
• The choice of the kernel function strongly influences the overall shape of the prediction function.

[Figure: Gaussian kernel ridge regression on toy one-dimensional data. The contribution of each kernel basis function is shown as dotted lines, and the overall model is shown as a solid line.]

Kernel Ridge Regression
Effect of the kernel (type of kernel and scale/degree) on the learned decision function:
[Figure: learned functions for Gaussian, Laplacian, and polynomial kernels at different scales/degrees]

Gaussian Process
• Think of regression outputs as being drawn from some joint distribution p_θ(y_1, y_2, …, y_N),
• with the joint distribution being Gaussian, with covariance structure determined by the inputs x_1, …, x_N, i.e.
  p_θ(y_1, …, y_N) = N(0, Σ)   with   Σ_ij = k(x_i, x_j) + σ² δ_ij,
• where k can be any kernel function (e.g. Gaussian, Laplacian, polynomial, etc.) and σ² is the intrinsic observation noise.
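A small sketch of what this prior looks like in code: build Σ on a one-dimensional grid with a Gaussian kernel (an illustrative choice) and draw a few sample functions from N(0, Σ), as on the next slide:

import numpy as np

def gp_prior_samples(x_grid, gamma=1.0, sigma=0.1, n_samples=5):
    sq_dists = (x_grid[:, None] - x_grid[None, :]) ** 2
    Sigma = np.exp(-gamma * sq_dists) + sigma ** 2 * np.eye(len(x_grid))  # k(x_i, x_j) + sigma^2 delta_ij
    return np.random.multivariate_normal(np.zeros(len(x_grid)), Sigma, n_samples)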

Gaussian Process
Examples of 5 samples y ∼ p_θ(y) drawn from the joint distribution of targets (taking input points on a grid).
• p_θ(y) can be interpreted as a 'prior' distribution on predictions.
• Targets are correlated locally in input space.
• There is some i.i.d. noise that simulates observation noise.

Gaussian Process
• Distinguish observed data (X, y) and unobserved data (X*, y*).
• They are governed by the same Gaussian distribution (now given in block matrices):
  p_θ([y; y*]) = N( [0; 0], [Σ_yy, Σ_yy*; Σ_y*y, Σ_y*y*] )
  with
  Σ_yy = k(X, X) + σ²I        Σ_yy* = k(X, X*)
  Σ_y*y = k(X*, X)            Σ_y*y* = k(X*, X*) + σ²I
• Predict new data points by conditioning on the observed target values: p_θ(y* | y).
• Any conditional of a Gaussian distribution is also Gaussian (of different mean and covariance); these parameters can be obtained in closed form (cf. the Matrix Cookbook).

Gaussian Process
• Using the conditional formulas for multivariate Gaussians, we get for a collection of future observations y* at input locations X* the expectation:
  E[y* | y = t] = Σ_y*y Σ_yy^{-1} t = k(X*, X) (K + σ²I)^{-1} t
  This is the same as kernel ridge regression (with λ = σ²)! The regularization parameter λ can be interpreted as a noise assumption about the labels.
• Similarly, we can compute the covariance of the conditioned distribution:
  Cov[y* | y = t] = Σ_y*y* − Σ_y*y Σ_yy^{-1} Σ_yy*
                  = (k(X*, X*) + σ²I) − k(X*, X) (K + σ²I)^{-1} k(X, X*)
  The latter provides additional information about the local uncertainty of the predictive model.
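A minimal sketch of these two formulas, reusing a Gaussian kernel for illustration (kernel choice, gamma, and sigma are assumptions; X is d × N, Xstar is d × M):

import numpy as np

def kernel(A, B, gamma=1.0):
    return np.exp(-gamma * ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0))

def gp_posterior(Xstar, X, t, sigma=0.1, gamma=1.0):
    K = kernel(X, X, gamma)
    Ks = kernel(Xstar, X, gamma)                          # k(X*, X)
    Kinv = np.linalg.inv(K + sigma ** 2 * np.eye(K.shape[0]))
    mean = Ks @ Kinv @ t                                  # E[y* | y = t]
    cov = (kernel(Xstar, Xstar, gamma) + sigma ** 2 * np.eye(Ks.shape[0])
           - Ks @ Kinv @ Ks.T)                            # Cov[y* | y = t]
    return mean, cov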

Gaussian Process
Examples of 5 samples y* ∼ p_θ(y* | y) where we have conditioned on some observed data (X, y) (black dots).
• Can be interpreted as a posterior distribution.
• All samples from the posterior fit the data.
• Not only the expectation (best prediction), but also the variance (predictive uncertainty) can be computed.

Predictive Uncertainty
Classification:
• (Kernel) logistic regression maps each data point to a class probability:
  P(y = 1 | x) = σ(w^T Φ(x)),   P(y = −1 | x) = 1 − P(y = 1 | x)
  where σ is the logistic sigmoid function.
• The probability score readily indicates the confidence the model has in assigning a class to a given data point (e.g. low confidence if the probability is near 0.5).
Regression:
• (Kernel) regression maps data to real-valued scores following f(x) = w^T Φ(x).
• No confidence measure is attached to the prediction by default.
• A Gaussian process enhances these models with an estimate of variance.

Robust Regression
The square loss (y − t)² is not robust to possible outliers in the target values (e.g. sensor failure, human mistake in data collection).
Robust loss functions:
• Absolute loss: |y − t|
• Epsilon-insensitive loss: max(0, |y − t| − ε)
• Huber loss: h(y − t) where
  h(r) = r²           if |r| ≤ c
  h(r) = 2c|r| − c²   otherwise.
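The three robust losses can be written in a few lines of NumPy; a sketch (ε and c are hyperparameters to be chosen):

import numpy as np

def absolute_loss(y, t):
    return np.abs(y - t)

def eps_insensitive_loss(y, t, eps=0.1):
    return np.maximum(0.0, np.abs(y - t) - eps)

def huber_loss(y, t, c=1.0):
    r = np.abs(y - t)
    return np.where(r <= c, r ** 2, 2 * c * r - c ** 2)  # quadratic near 0, linear in the tails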

Regression, Further Topics
• Neural networks for regression.
• Dealing with different levels of noise (heteroskedasticity).
• Enhanced error models (mixture density networks), ML2.
• Structured output learning, time series predictions, ML2.

Summary
• Regression addresses the problem of making real-valued predictions (different from classification).
• Basic regression model: least squares regression. Admits a closed-form solution.
• Because least squares regression can strongly overfit (and is undefined when d > N), one needs to regularize it (ridge regression).
• A nonlinear regression model can be obtained by mapping the data into some feature space and kernelizing the model and the predictions (kernel ridge regression).
• Gaussian processes extend kernel ridge regression by providing, along with the prediction, an estimate of variance.
