Description
Data
Click Download data in the left sidebar and save the compressed (zipped) file machine_learning_01_data.zip to a local folder.
Unpack it to get the data file test_sample.csv.
Create a variable dataPath containing the path to the local folder where you saved the data file test_sample.csv.
Note that in R a path is specified with the forward slash “/”, e.g.
dataPath<-"C:/User1/LocalFolder"
Also note that the variable dataPath does not end with a forward slash “/”.
Read the data.
data <- read.table(paste(dataPath,"test_sample.csv",sep="/"),header=T)
Column Y contains the dependent variable, and X1, X2,...,X491 are columns of regressors.
As in Lecture 1, the variable Y was simulated using all predictors X1, X2,...,X491 with coefficients selected at random from the interval [1,3]. All predictors are simulated independently.
This means that all predictors are significant, and their theoretical relative importance is proportional to the value of the corresponding slope β.
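The simulation mechanism described above can be sketched as follows. The sample size n, the noise term, and the seed are illustrative assumptions for demonstration, not values from the course data:

```r
# Illustrative sketch of how data of this structure could be simulated
# (n, the noise, and the seed are assumptions, not course values).
set.seed(1)
n <- 1000; p <- 491
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
beta <- runif(p, min = 1, max = 3)   # slopes drawn at random from [1, 3]
Y <- X %*% beta + rnorm(n)           # all predictors contribute to Y
sim <- data.frame(Y = Y, X)
```

Because every slope lies in [1,3], every predictor genuinely contributes to Y, and a predictor with a larger slope contributes more.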
Task
As we know, the determination coefficients of the sequence of nested models mj based on predictors X1,…,Xj, 2≤j≤491, increase monotonically as j grows.
Follow the two steps below.
Step 1.
Fit linear regression models mj
Y = β0 + β1X1 + β2X2 + … + βjXj + ϵ
with increasing number j, 2≤j≤491, of regressors and calculate the determination coefficient for each of these models.
Find the smallest number of regressors for which the determination coefficient of the linear model is greater than 0.9 (90%). Denote it N.orig.
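Step 1 can be sketched as below. To keep the example self-contained and fast it runs on a small simulated data set of the same structure (p = 20 predictors is an assumption for demonstration); with the course data, use the `data` object read above and p = 491:

```r
# Sketch of Step 1 on small simulated data of the same structure
# (p = 20 for speed; with the course data use `data` and p = 491).
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
dat <- data.frame(Y = X %*% runif(p, 1, 3) + rnorm(n), X)

# Fit nested models m_j, j = 2..p, recording R^2 for each.
r.squared <- rep(NA, p)
for (j in 2:p) {
  mj <- lm(Y ~ ., data = dat[, c("Y", paste0("X", 1:j))])
  r.squared[j] <- summary(mj)$r.squared
}
N.orig <- min(which(r.squared > 0.9))  # smallest j with R^2 > 0.9
```

Refitting the model from scratch for every j is simple but slow for 491 regressors; it is still feasible, just expect the loop to take a while on the full data.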
Step 2.
Apply the method of PCA regression (function prcomp()), using the principal components (factors) as meta-features, and select the smallest number of them making the determination coefficient greater than 90%.
Denote this number N.PCA, i.e. N.PCA is the smallest number of PCA factors sufficient for the given level of the determination coefficient.
Define the model dimensionality reduction as the difference N.orig - N.PCA.
Enter the model dimensionality reduction and the determination coefficient of the model with the N.PCA selected most important meta-features in the corresponding fields of the Quiz tab.
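Step 2 can be sketched as below. Again it runs self-contained on small simulated data (p = 20, same seed as the Step 1 sketch); with the course data use `data`, p = 491, and N.orig from Step 1. Note one assumption: ordering the factors by their absolute correlation with Y is one plausible reading of "most important meta-features" (prcomp() itself orders components only by variance explained):

```r
# Self-contained sketch of Step 2 on simulated data of the same
# structure (p = 20 for speed; use `data` and p = 491 for the task).
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
dat <- data.frame(Y = X %*% runif(p, 1, 3) + rnorm(n), X)

# N.orig from Step 1: smallest j with R^2 of nested model m_j above 0.9.
r2 <- sapply(2:p, function(j)
  summary(lm(Y ~ ., data = dat[, c("Y", paste0("X", 1:j))]))$r.squared)
N.orig <- which(r2 > 0.9)[1] + 1       # +1 because j starts at 2

# PCA regression: principal components of the regressors as
# meta-features, ordered here by correlation with Y (an assumption
# about what "most important" means).
factors <- prcomp(dat[, -1])$x
ord <- order(abs(cor(factors, dat$Y)), decreasing = TRUE)
r2.pca <- sapply(1:p, function(k)
  summary(lm(dat$Y ~ factors[, ord[1:k], drop = FALSE]))$r.squared)
N.PCA <- which(r2.pca > 0.9)[1]        # smallest sufficient number
reduction <- N.orig - N.PCA            # model dimensionality reduction
r2.pca[N.PCA]                          # R^2 of the N.PCA-factor model
```

The two quiz answers correspond to `reduction` and `r2.pca[N.PCA]` computed on the course data.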