Multivariate Data Analysis 2019/2020 — Course Work —
Exploring kernel regression
1 Framework and motivation
The objective of this course work is to explore the numerical implementation and the properties of simple kernel regression models. To deepen our understanding of this class of models, we shall in particular design our own implementation; we will thus not rely on any existing packages or toolboxes that perform kernel regression.
You are free to use the programming language of your choice. As we already discussed, R and Python are perfectly adapted (and recommended) languages for the completion of this project. In order to make your pieces of code and experiments more readable, you may use a Jupyter Notebook environment; a version of Anaconda with R and Python kernels is available on all the desktops in the Computer Laboratories of the School; you may also use your own laptop, since all these pieces of software are open source. To help you start your study, two Jupyter notebooks are available on Learning Central, one in R and one in Python, namely KernelRegX1_R and KernelRegX1_Py. The PDFs of these notebooks are also available, in case you would like to simply copy-paste the code and use a different programming environment.
As a remark, these notebooks use functions allowing for the simulation of realisations of Gaussian random vectors of the type 𝐱 ∼ 𝒩_𝑑(𝝁, 𝚺) for a given mean vector 𝝁 and a given covariance matrix 𝚺. We could nevertheless rely only on functions allowing for the simulation of independent and identically distributed standard real Gaussian random variables (i.e., independent 𝒩(0, 1) random variables). Indeed, from the Standardisation Procedure and the decomposition 𝚺 = 𝐋𝐋^𝑇, we have

𝐱 = 𝐋𝝃 + 𝝁 in distribution, with 𝝃 ∼ 𝒩_𝑑(𝟎, 𝕀_𝑑),

so that the components of 𝝃 are independent 𝒩(0, 1) random variables.
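As an illustration (not required for the report itself), the decomposition above can be sketched in Python with NumPy; the mean vector and covariance matrix below are arbitrary illustrative choices, and the Cholesky factorisation plays the role of the factor 𝐋.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])            # mean vector (illustrative)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # covariance matrix (illustrative, SPD)

L = np.linalg.cholesky(Sigma)         # decomposition Sigma = L @ L.T

# simulate n realisations of x = L xi + mu, with xi ~ N(0, I_d)
n = 100_000
xi = rng.standard_normal((2, n))      # i.i.d. N(0, 1) entries
x = L @ xi + mu[:, None]

# the empirical mean and covariance should be close to mu and Sigma
print(x.mean(axis=1))
print(np.cov(x))
```

Only a generator of independent 𝒩(0, 1) variables is needed; the correlation structure is entirely induced by the matrix 𝐋.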
2 About the report
You have to write a short report presenting the analysis and the experiments you have carried out. Ideally, you may try to write your report in LaTeX. For this purpose, you might use the .tex source of this document as a base model: see the file MDA_CourseWork2020.tex on Learning Central (you may of course use a different font size and reduce the margins). If you are not familiar with LaTeX, this could be an opportunity for you to learn it; otherwise you might use Microsoft Word, for instance. There is no minimal or maximal page limit; nevertheless, this work is a “small course work”: I do not expect you to write a thesis (around 8-10 pages, including figures, should be fine). Try to be concise, precise and pertinent; pay attention to the quality of your report in both substance and form (i.e., content, organisation, quality of the writing and of the graphics, overall presentation and clarity of the discussion).
Your report is due, in PDF format, no later than Thursday, March 26, 2020. You should submit your report through Learning Central, in the Assessment tab. You can do this project alone, in a pair, or in a group of three, as you wish. Do not forget to put your name(s) and student number(s) at the beginning of your report, and submit only one report per group (if you do your project in a group, use the name and student number of one member of the group when submitting your report on Learning Central). If you encounter any problem, please feel free to contact me by email at gauthierb@cardiff.ac.uk.
3 Expected work
You should try to reproduce the type of experiments that are performed in the examples of Section 3.3 of the lecture notes, while explaining and commenting on your developments. At first, I encourage you to focus mainly on the interpolation/approximation of functions from R to R. Afterwards, if you want to, you may explore the interpolation/approximation of functions from R^d to R, but this is not mandatory; there is already a lot to do in the “R-to-R case”.
What you need to do:
(a) Study the provided pieces of code: try to understand the operations they perform, how they work and the way they are related to the theoretical devel- opments we have studied during the lectures.
(b) Build the confidence region related to a given Gaussian process regression model (see the notebooks and the lecture notes).
(c) Illustrate the impact of the choice of the kernel and of the kernel parameters; a list of kernels you might potentially use is provided at the end of this document.
(d) Perform the maximum-likelihood estimation of the covariance kernel parameter(s) on an example (see the lecture notes).
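As a starting point for items (b) and (c), the core computations of a kernel (Gaussian process) regression model can be sketched as follows; this is a minimal illustration, not the required implementation: the Gaussian kernel, the parameter value, the sample locations and the target function are all arbitrary choices made for this sketch.

```python
import numpy as np

def gauss_kernel(x, xp, rho=30.0):
    """Gaussian kernel K(x, x') = exp(-rho * |x - x'|^2), vectorised."""
    return np.exp(-rho * (x[:, None] - xp[None, :]) ** 2)

# illustrative dataset: noise-free observations of a test function
x_obs = np.linspace(0.0, 1.0, 8)
z_obs = np.sin(2 * np.pi * x_obs)
x_new = np.linspace(0.0, 1.0, 200)   # prediction grid

K = gauss_kernel(x_obs, x_obs)       # kernel matrix at the sample locations
k_new = gauss_kernel(x_new, x_obs)   # cross-kernel between new and observed points

alpha = np.linalg.solve(K, z_obs)    # K^{-1} z
mean = k_new @ alpha                 # predictive (posterior) mean

# pointwise posterior variance: K(x, x) - k(x)^T K^{-1} k(x)
var = gauss_kernel(x_new, x_new).diagonal() - np.einsum(
    'ij,ji->i', k_new, np.linalg.solve(K, k_new.T))
var = np.clip(var, 0.0, None)        # guard against small negative round-off

# pointwise 95% confidence band around the posterior mean
upper = mean + 1.96 * np.sqrt(var)
lower = mean - 1.96 * np.sqrt(var)
```

The predictive mean interpolates the observations, and the variance (hence the width of the band) vanishes at the sample locations; plotting `mean`, `upper` and `lower` against `x_new` gives the type of confidence-region figure shown in the notebooks.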
What you could do to go further:
(e) For a fixed dataset (i.e., a fixed set of sample locations and a fixed response vector), try to illustrate what happens when, in the kernel regression formulas, the kernel matrix 𝐊 is replaced by 𝐊 + 𝜎²𝕀 for various values of 𝜎² > 0. This is related to the model described in Question 3 of the exercise sheet for Chapter 3, i.e., the “model with observation noise”.
(f) Illustrate the link between linear regression and kernel regression with finite-dimensional kernels. This is related to Question 2 of the exercise sheet for Chapter 3.
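As a hint for item (e), only one line of the computation changes; a minimal sketch (the kernel, its parameter and the dataset are again illustrative choices, not prescribed by the assignment):

```python
import numpy as np

def gauss_kernel(x, xp, rho=30.0):
    """Gaussian kernel K(x, x') = exp(-rho * |x - x'|^2)."""
    return np.exp(-rho * (x[:, None] - xp[None, :]) ** 2)

# illustrative fixed dataset
x_obs = np.linspace(0.0, 1.0, 8)
z_obs = np.sin(2 * np.pi * x_obs)
x_new = np.linspace(0.0, 1.0, 200)

K = gauss_kernel(x_obs, x_obs)
k_new = gauss_kernel(x_new, x_obs)

for sigma2 in (0.0, 1e-3, 1e-1):
    # replace K by K + sigma^2 I in the kernel-regression formula
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(x_obs)), z_obs)
    mean = k_new @ alpha
    # for sigma2 > 0 the model smooths the data instead of interpolating them
```

Plotting `mean` for the various values of `sigma2` makes the transition from exact interpolation to smoothing visible.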
(g) Try to perform the interpolation/approximation of a dataset corresponding to the observation of a function from R^2 to R.
(h) Perform any follow-up or complementary analysis or experiment that you judge interesting and meaningful.
4 Additional information
Here are a few (symmetric positive-semidefinite) kernels that you might use.

• Gaussian kernel (X ⊂ R^d): 𝐾(𝑥, 𝑥′) = 𝑒^{−𝜌‖𝑥−𝑥′‖²} (parameter 𝜌 > 0).

• Laplace kernel (X ⊂ R^d): 𝐾(𝑥, 𝑥′) = 𝑒^{−𝜌‖𝑥−𝑥′‖} (parameter 𝜌 > 0).

• Matérn-3/2 (X ⊂ R^d): 𝐾(𝑥, 𝑥′) = (1 + 𝑑)𝑒^{−𝑑}, with 𝑑 = 𝜌√3‖𝑥 − 𝑥′‖ (parameter 𝜌 > 0).

• Matérn-5/2 (X ⊂ R^d): 𝐾(𝑥, 𝑥′) = (1 + 𝑑 + 𝑑²/3)𝑒^{−𝑑}, with 𝑑 = 𝜌√5‖𝑥 − 𝑥′‖ (parameter 𝜌 > 0).

• Modified-Brownian kernel (X ⊂ R₊, i.e., the nonnegative reals): 𝐾(𝑥, 𝑥′) = 1 + min(𝑥, 𝑥′).
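The isotropic kernels above depend on 𝑥 and 𝑥′ only through the distance r = ‖𝑥 − 𝑥′‖, so they can be implemented as functions of r; a sketch in Python (the function names are ours, not fixed by the assignment):

```python
import numpy as np

def gaussian(r, rho):
    """Gaussian kernel as a function of the distance r = ||x - x'||."""
    return np.exp(-rho * r ** 2)

def laplace(r, rho):
    return np.exp(-rho * r)

def matern32(r, rho):
    d = rho * np.sqrt(3.0) * r
    return (1.0 + d) * np.exp(-d)

def matern52(r, rho):
    d = rho * np.sqrt(5.0) * r
    return (1.0 + d + d ** 2 / 3.0) * np.exp(-d)

def modified_brownian(x, xp):
    """Defined directly on the points, for x, x' >= 0."""
    return 1.0 + np.minimum(x, xp)
```

All four isotropic kernels equal 1 at r = 0 and decrease with the distance; the parameter 𝜌 controls how fast the correlation decays.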
Infinitely many other symmetric positive-semidefinite kernels exist; if you find a kernel that you would like to use (while searching online, for instance), you are of course strongly encouraged to do so. You may in addition notice, for instance, that if 𝐾 ∈ R^{X×X} is a symmetric positive-semidefinite kernel, then for every function 𝑓 ∈ R^X, so is the kernel (𝑥, 𝑥′) ↦ 𝑓(𝑥)𝐾(𝑥, 𝑥′)𝑓(𝑥′). Symmetric positive-semidefinite kernels may also be summed or multiplied together (and tensor products of kernels can be used to define kernels on product spaces).
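As a quick numerical sanity check (not a proof), one may verify that the kernel matrices built from such combinations have nonnegative eigenvalues up to round-off; a sketch, with arbitrary illustrative choices of points, parameters and scaling function:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 30)
R = np.abs(x[:, None] - x[None, :])   # matrix of pairwise distances

K1 = np.exp(-5.0 * R ** 2)            # Gaussian kernel matrix
K2 = np.exp(-2.0 * R)                 # Laplace kernel matrix

f = np.sin(3.0 * x) + 2.0             # an arbitrary function f
K3 = f[:, None] * K1 * f[None, :]     # matrix of f(x) K(x, x') f(x')

# the sum, the (element-wise) product and the scaling by f should all
# remain symmetric positive-semidefinite
for K in (K1 + K2, K1 * K2, K3):
    print(np.linalg.eigvalsh(K).min() >= -1e-8)   # True, up to round-off
```

Such checks are useful when experimenting with hand-built kernel combinations before plugging them into the regression formulas.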
The figures you include in your report should be numbered and captioned; a similar remark also holds for tables, if you intend to use any. Here is an example describing how to include a figure; see Figure 1.
The last example provided in Chapter 3 of the lecture notes is based on a 50-point dataset that has been extracted from the volcano dataset provided by the R package datasets; you may notice that the elevation component of the full volcano dataset has been sample-standardised to adapt the dataset to our kernel-regression setting. The extracted 50-point dataset is named DataVolcano50.txt and is available on Learning Central. It consists of the (𝑥, 𝑦, 𝑧) coordinates of 50 points in [0, 1]² × R; we then aim at predicting 𝑧 as a function of (𝑥, 𝑦). After having placed the file DataVolcano50.txt in the same folder as your notebook, you may load it by proceeding as indicated hereinafter.
Figure 1: Caption of the figure.
##### R
Volcano50 <- read.table('DataVolcano50.txt', sep = ',',
                        header = TRUE)
Volcano50 <- data.matrix(Volcano50)  # dataframe to matrix
colnames(Volcano50) <- NULL          # remove column names
print(Volcano50)
##### Python
import numpy as np

Volcano50 = np.genfromtxt('DataVolcano50.txt',
                          delimiter=',', dtype=float)
### remove the header row
Volcano50 = np.delete(Volcano50, 0, axis=0)
print(Volcano50)
You may include some pieces of code in your report (while discussing them and explaining what they do). You do not, however, have to include all your code, but only the pieces you judge interesting and worth discussing. Screen captures might be used, but LaTeX also offers various tools to include pieces of code in a document, like the listings package, for instance.