Project
Part 3: A dimension reduction algorithm
Andreas Artemiou
Due date: 4 May 2018 by 23:59 in Learning Central
1 General purpose of the exercise
Write a function (or a set of functions) that performs dimension reduction on a dataset in a univariate regression or classification setting.
1.1 Procedure of the algorithm
First of all we assume that we have p predictors X = (X1, …, Xp) and a response variable Y. We collect n observations (Yi, Xi) for i = 1, …, n. We run the following steps:
1. Standardize the predictors to create the standardized variables
\[
Z_i = \hat{\Sigma}^{-1/2}\,(X_i - \bar{X}),
\]
where $\bar{X}$ is the $p$-dimensional mean of the predictors and $\hat{\Sigma}$ is the $p \times p$ estimate of the covariance matrix of the predictors, $\Sigma = \mathrm{var}(X)$.
2. If Y is continuous, slice the Y variable into H slices, ideally having approximately equal numbers of observations in each slice (a short slicing sketch in R is given after this list). If Y is discrete, you can use each value of Y as a different slice, and therefore the value of H is predefined by the number of different values Y takes.
3. For each slice, find the observations that fall in the slice and then calculate the $p$-dimensional mean vector of the standardized predictors in that slice, denoted by $\hat{m}_j$ where $j = 1, \dots, H$, that is
\[
\hat{m}_j = \frac{1}{n_j} \sum_{i \in \text{slice}_j} Z_i,
\]
where $n_j$ is the number of observations in the $j$th slice.
4. Using the calculated means within each slice, calculate the $p \times p$ candidate matrix
\[
\hat{V} = \sum_{j=1}^{H} p_j\, \hat{m}_j \hat{m}_j^{T},
\]
where $p_j$ is the proportion of points in slice $j$, that is $p_j = n_j/n$.
5. Perform an eigenvalue decomposition of the matrix $\hat{V}$ to estimate the $d$ eigenvectors $\eta_1, \dots, \eta_d$ corresponding to the largest $d$ eigenvalues.
6. (Until now we performed dimension reduction on the standardized data.) To move back to the scale of the original predictors we calculate $\beta_k = \hat{\Sigma}^{-1/2}\eta_k$, $k = 1, \dots, d$.
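As an illustration of the slicing step (step 2 above), here is a minimal sketch in R; it assumes a continuous response and quantile-based cut points, and the helper name make_slices is purely illustrative, not part of the required code.

# Minimal sketch of the slicing step for a continuous response.
make_slices <- function(y, H = 10) {
  # Quantile-based cut points give roughly equal numbers of
  # observations per slice; unique() guards against tied values.
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = H + 1)))
  cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

set.seed(1)
slice_id <- make_slices(rnorm(100), H = 10)
table(slice_id)   # counts per slice should all be close to 10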
1.2 Code needed
The above is the algorithm we have to run. I suggest that you write a function called “sdr”. It needs a few arguments:
- “x” will be an n × p matrix where each row is an observation and each column is a predictor.
- “y” will be an n dimensional vector describing all the responses, and there should be a correspondence with the predictors: the first entry of vector “y” is the response associated with the first row of matrix “x”, the second entry of “y” corresponds to the second row of “x”, and so on.
- “H” will be the number of slices and takes a default value 10.
- “d” will be the number of eigenvectors that need to be estimated at the end.
This function will return “B” which is a p×d matrix that has in each column one of the vectors βk.
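A possible skeleton for “sdr” in R, following the steps of Section 1.1, is sketched below. Treat it as a hedged outline rather than the required implementation: the internal variable names and the default shown for “d” are my own, it uses quantile-based slicing for a continuous response, and a discrete response would instead form the slices from the distinct values of “y”.

# Sketch of the sdr function (one possible implementation).
sdr <- function(x, y, H = 10, d = 2) {
  x <- as.matrix(x)
  n <- nrow(x); p <- ncol(x)

  # Step 1: standardize the predictors, Z = Sigma^{-1/2} (X - Xbar)
  xbar  <- colMeans(x)
  Sigma <- cov(x)
  eig   <- eigen(Sigma, symmetric = TRUE)
  Sig_inv_sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)
  z <- sweep(x, 2, xbar) %*% Sig_inv_sqrt

  # Step 2: slice the response (quantile slices for a continuous y)
  breaks   <- unique(quantile(y, probs = seq(0, 1, length.out = H + 1)))
  slice_id <- cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)

  # Steps 3-4: slice means of Z and the candidate matrix V
  V <- matrix(0, p, p)
  for (j in unique(slice_id)) {
    idx <- which(slice_id == j)
    mj  <- colMeans(z[idx, , drop = FALSE])
    V   <- V + (length(idx) / n) * tcrossprod(mj)
  }

  # Step 5: eigenvectors of V for the d largest eigenvalues
  eta <- eigen(V, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]

  # Step 6: back to the original scale, beta_k = Sigma^{-1/2} eta_k
  Sig_inv_sqrt %*% eta
}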
You will also need to write two more functions. One function will be called “datacr” and it will create the data. This function will have the following parameters:
1. “n” which is the number of observations and takes a default value of 100.
2. “p” the dimension of the predictors, which takes a default value of 10.
3. “si” a multiplier in front of the error term that acts as the standard deviation and takes a default value of 0.2.
4. “model” the model to be used to generate the data (one of the three models described below).
5. “depend” which takes logical values TRUE or FALSE and denotes the form of the predictors’ covariance matrix to be used when creating the data. By default it is set to FALSE and the identity matrix is used.
6. “v” if “depend” is TRUE, the value of “v” is used to construct the covariance matrix of the predictors.
This function will return the data created, that is
- “x” the n × p matrix containing the predictors,
- “y” the n dimensional vector containing the response,
- “trueb” the p × d matrix containing the true vectors of the reduced dimension
The results of this function will be fed into the function “sdr” to run the algorithm.
We need to clarify a few things here. We will allow our algorithm to use the following 3 models:
\[
\begin{aligned}
\text{Model I:} \quad & Y = X_1 + X_2 + \sigma\varepsilon,\\
\text{Model II:} \quad & Y = \frac{X_1}{0.5 + (X_2 + 1)^2} + \sigma\varepsilon,\\
\text{Model III:} \quad & Y = X_1\,(X_1 + X_2 + 1) + \sigma\varepsilon,
\end{aligned}
\]
where σ is defined by the argument “si” in our function, ε ∼ N(0, 1), and X = (X1, …, Xp) ∼ Np(0, Σ). Here Σ is the p × p identity matrix if the argument “depend” is FALSE, or a matrix that has value v^|i−j| in the (i, j) entry, where v is defined by the “v” argument of the function “datacr”.
One final thing we need to clarify is how to find the values of “d” and “trueb” for each of the models. (I will write a short description here and I will also explain it in class.) For model I, d = 1, and for models II and III, d = 2. For model I “trueb” is a p-dimensional vector since d = 1. This vector will have the value 1 in its first two entries and then p − 2 values of 0, that is, it will look like (1, 1, 0, …, 0). For model II “trueb” will be a p × 2 matrix where all the entries will be 0 except entries (1, 1) and (2, 2), which will have a value of 1. Finally, for model III “trueb” will again be a p × 2 matrix that has all entries 0 except entries (1, 1), (1, 2) and (2, 2), which are 1.
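A minimal sketch of “datacr” along these lines is given below. It follows the three models and the “trueb” definitions above; the assumption that “model” is coded as 1, 2 or 3 is my own choice rather than a requirement.

# Sketch of the datacr function (one possible implementation).
datacr <- function(n = 100, p = 10, si = 0.2, model = 1,
                   depend = FALSE, v = 0.8) {
  # Covariance of the predictors: identity, or v^|i-j| when depend = TRUE
  Sigma <- if (depend) v^abs(outer(1:p, 1:p, "-")) else diag(p)

  # Draw X ~ N_p(0, Sigma) via a Cholesky factor, plus the error term
  x   <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
  eps <- rnorm(n)

  # Response and true directions for the chosen model
  if (model == 1) {
    y <- x[, 1] + x[, 2] + si * eps
    trueb <- matrix(0, p, 1); trueb[1:2, 1] <- 1
  } else if (model == 2) {
    y <- x[, 1] / (0.5 + (x[, 2] + 1)^2) + si * eps
    trueb <- matrix(0, p, 2); trueb[1, 1] <- 1; trueb[2, 2] <- 1
  } else {
    y <- x[, 1] * (x[, 1] + x[, 2] + 1) + si * eps
    trueb <- matrix(0, p, 2)
    trueb[1, 1] <- 1; trueb[1, 2] <- 1; trueb[2, 2] <- 1
  }

  list(x = x, y = y, trueb = trueb)
}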
The last function will be used to evaluate the quality of our estimation, that is, how close the estimated matrix “B” is to the “trueb” we expected to get from our model. We will need to calculate what is known as the trace correlation, which takes values between 0 and 1; the closer it is to 1, the closer “B” and “trueb” are. If we denote “B” by $\hat{B}$ and “trueb” by $B$, then the trace correlation is
\[
r(k) = \frac{1}{k}\,\mathrm{trace}\!\left(P_B P_{\hat{B}}\right),
\]
where $k$ is the number of columns in $B$ and $P_B = B(B^T B)^{-1} B^T$ (and similarly for $P_{\hat{B}}$). Therefore this function, which we will call “trcor”, will take as arguments:
1. “v1” the true matrix
2. “v2” the estimated matrix
and it will return the trace correlation r(k) between “v1” and “v2”. (In our experiment “v1” is “trueb” and “v2” is “B”, but we will write the function as generally as possible.)
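A direct translation of the r(k) formula above into R could look like the sketch below (again an outline, not the required implementation).

# Sketch of the trcor function: trace correlation between the column
# spaces of two matrices.
trcor <- function(v1, v2) {
  v1 <- as.matrix(v1); v2 <- as.matrix(v2)
  k  <- ncol(v1)
  # Projection matrices onto the column spaces of v1 and v2
  P1 <- v1 %*% solve(t(v1) %*% v1) %*% t(v1)
  P2 <- v2 %*% solve(t(v2) %*% v2) %*% t(v2)
  sum(diag(P1 %*% P2)) / k
}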
2 Statistical Analysis
You need to write a function called “runsim” which will repeat an experiment “nsim” times (this will be an argument to the function). “runsim” will run the following experiment:
1. create the data using function “datacr”
2. run the dimension reduction algorithm using function “sdr”
3. find the distance between “trueb” and the estimated “B”
and it will repeat it “nsim” times. In addition to “nsim”, the function “runsim” will have all the arguments necessary to run the other functions. At the end the function will return all “nsim” distances you calculated in a vector. If “nsim” is equal to 100 then you get a vector with 100 distances.
The standard values of the parameters will be the following: “nsim” to be 500, “n” to be 100, “p” to be 10, “H” to be 10, “si” to be 0.2 and “v” to be 0.8.
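One way to structure “runsim” is sketched below; it relies on the “datacr”, “sdr” and “trcor” sketches above and follows the values of d stated earlier (d = 1 for model I, d = 2 for models II and III).

# Sketch of the runsim function (one possible implementation).
runsim <- function(nsim = 500, n = 100, p = 10, H = 10, si = 0.2,
                   model = 1, depend = FALSE, v = 0.8) {
  # One direction for model I, two for models II and III
  d <- if (model == 1) 1 else 2
  replicate(nsim, {
    dat <- datacr(n = n, p = p, si = si, model = model,
                  depend = depend, v = v)
    B   <- sdr(dat$x, dat$y, H = H, d = d)
    trcor(dat$trueb, B)
  })
}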
You then have to run the following analysis when “depend” is FALSE:
- Keeping all other parameters at the standard values, compare the performance of the algorithm on the three models for different values of “p”. For “p” use values 10, 20 and 30.
- Keeping all other parameters at the standard values, compare the performance of the algorithm on the three models for different values of “H”. For “H” use values 5, 10 and 20.
- Keeping all other parameters at the standard values, compare the performance of the algorithm on the three models for different values of “si”. For “si” use values 0.2, 0.5 and 1.
- Lastly, set “depend” to TRUE, keep all other parameters at the standard values, and compare the performance of the algorithm on the three models for different values of “v”. For “v” use values 0.2, 0.5 and 0.8.
The easiest way to present your results will be to put them in a table for each case (that is, create 4 different tables). Each cell will correspond to one condition, and you need to report the mean value and the standard deviation of the “nsim” distances you get in each condition. Put the standard deviation in brackets so that each cell looks like this: “mean (st.dev)”. (I will explain more on this in class.)
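As a hypothetical illustration of how one cell of such a table could be produced with the sketches above (here model I with p = 20 and the other parameters at their standard values):

res  <- runsim(nsim = 500, n = 100, p = 20, H = 10, si = 0.2, model = 1)
cell <- sprintf("%.3f (%.3f)", mean(res), sd(res))
cell   # one table cell of the form "mean (st.dev)"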
Write a short report (3-4 pages) describing how each parameter affects the accuracy of the dimension reduction algorithm, commenting on what you see in each table.
2.1 Deliverables
There are two deliverables (which need to be submitted in Learning Central):
1. The complete code in R with your student id at the top, and with comments written as necessary for the marker to understand what you are trying to do.
2. A document where you present the results.

2.2 Some guidelines
This is a very specific exercise. There is not much room to deviate, as the algorithm is given to you in detail. The main difference between submissions will be how efficient your code is.
Your code should adhere to good programming principles: it should run correctly, it should include comments, and it should run as fast as possible. What you give me should include, as comments, very specific instructions on what to do (in the sense of what commands I need to run) to get the results for each of the questions.
Your document needs to present the results in a table. You are free to use any other visualization tool as well. If you feel you will benefit from presenting a graphic, that will be great.
2.3 Marking Scheme
This is on how you will be marked:
- Creating the sdr function (25 points)
- Creating the datacr function (15 points)
- Creating the trcor function (5 points)
- Creating the runsim function (15 points)
- Creating the report and commenting appropriately about the results (20 points)
- Using correct programming principles in your code (20 points)

Partial credit will be given for partially good performance. Late assignments receive a 0.
2.4 Group work
This is group work done in pairs. Part of your evaluation will be on how you communicate as a team. I do not mind how you split the work (as long as you both agree with the split). Make sure you set clear tasks early and stay on schedule. Set important meetings well beforehand, so that you know well ahead of time when the deadline is and what is due by then. It is important that all members of the group stay on schedule with their tasks and discuss things between them at meetings.
If a group fails to communicate appropriately or if one person is not following the group schedule, it is important that you let me know as soon as possible in writing. If we can’t resolve the issue, then the members of the group will not necessarily get the same grade on the project.