HDDA Tutorial: Dimension Reduction : Solutions
Department of Econometrics and Business Statistics, Monash University
Tutorial 10
1. Load the socioeconomic data on U.S. States (used in the lecture on principal components). Isolate the 5 numeric variables and scale this data.

library(tidyverse)
States <- readRDS('StateSE.rds')
States %>%
  select_if(is.numeric) %>%  #Keep the 5 numeric variables
  scale -> States_Scaled     #Standardise to mean 0 and variance 1
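As a quick sanity check (an addition to these solutions, not part of the original answer), the standardised matrix should have column means of zero and column standard deviations of one:

#Sanity check (added): confirm the standardisation
round(colMeans(States_Scaled), 10) #All (approximately) zero
apply(States_Scaled, 2, sd)        #All one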
2. Carry out the singular value decomposition on the standardised data.
States_SVD<-svd(States_Scaled)
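As an optional cross-check (an addition, not part of the original solution), the singular values from the SVD should match the principal component standard deviations reported by prcomp once divided by the square root of n - 1:

#Cross-check (added): prcomp on the already-scaled data carries the same
#information as the SVD; sdev equals the singular values over sqrt(n-1)
all.equal(prcomp(States_Scaled)$sdev,
          States_SVD$d / sqrt(nrow(States_Scaled) - 1))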
3. Construct a correlations biplot using ggplot. Before doing so, look at the help file for pc.biplot for additional instructions on how this biplot is constructed.
#Set scale and find sample size
scale <- 1
n <- nrow(States)
#Singular values can be put into a diagonal matrix using diag
#Use %*% for matrix multiplication
#Note that the observations are multiplied by root n
PC_obs <- States_SVD$u %*% diag((States_SVD$d)^(1 - scale)) * sqrt(n)
#Note that the variables are divided by root n
PC_var <- States_SVD$v %*% diag((States_SVD$d)^scale) / sqrt(n)
#Create dataframe for observations with PCs
df_Obs <- tibble(Label = pull(States, StateAbb), #Extract state abbreviation
                 PC1 = PC_obs[, 1],              #Extract first PC
                 PC2 = PC_obs[, 2])              #Extract second PC
#Create dataframe for variables
df_Vars <- tibble(Label = colnames(States_Scaled), #Extract variable names
                  PC1 = PC_var[, 1],               #Extract first loading vector
                  PC2 = PC_var[, 2])               #Extract second loading vector
ggplot(data = df_Vars, aes(x = PC1, y = PC2, label = Label)) +
  geom_text(data = df_Obs) +                      #Observations
  geom_text(color = 'red', nudge_y = -0.2) +      #Variables (offset using nudge_y)
  geom_segment(xend = 0, yend = 0, color = 'red', #Add arrows
               arrow = arrow(ends = "first",
                             length = unit(0.1, "inches"))) #Make tip smaller
[Biplot output: states plotted by their abbreviations in black, with the variable loadings (Murder, Illiteracy, Income, ...) shown in red as arrows from the origin.]
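For comparison (an addition to these solutions), base R can produce essentially the same plot through biplot.princomp, whose help file the question refers to; setting pc.biplot = TRUE requests the correlations (Gabriel) biplot, with observations scaled up and variables scaled down by the square root of n:

#Base-R sketch of the same correlations biplot (added for comparison)
biplot(princomp(States_Scaled), pc.biplot = TRUE,
       xlabs = pull(States, StateAbb)) #Label points by state abbreviation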
4. Using the spectral theorem, prove that Principal Components are uncorrelated by construction.
Consider the n × p matrix C = YV, where Y is the data matrix and the columns of V are the eigenvectors of the covariance matrix S. The matrix C will be n × p itself. The entry in the ith row and jth column of C is obtained by multiplying the ith row of Y by the jth column of V. The ith row of Y contains the values of all variables for observation i, while the jth column of V contains the weights for principal component j. This implies that the entry in the ith row and jth column of C is the value of principal component j for observation i.
The matrix C is essentially a data matrix, but for the principal components. As such, the variance-covariance matrix of the principal components is found by taking $\frac{1}{n-1}C'C$. Some matrix algebra shows

$$
\begin{aligned}
\frac{1}{n-1}C'C &= \frac{1}{n-1}(YV)'(YV) && (1) \\
&= \frac{1}{n-1}V'Y'YV && (2) \\
&= V'\left(\frac{1}{n-1}Y'Y\right)V && (3) \\
&= V'SV && (4)
\end{aligned}
$$

Using the eigenvalue decomposition, $S = V\Lambda V'$, and substituting,

$$
\begin{aligned}
\frac{1}{n-1}C'C &= V'SV && (5) \\
&= V'V\Lambda V'V && (6) \\
&= \Lambda && (7)
\end{aligned}
$$
The above holds since $V'V = I$. Since $\Lambda$ is a diagonal matrix, the off-diagonal covariances are all 0 and the principal components are uncorrelated.
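The result can also be checked numerically (an added illustration, reusing the objects from Questions 1 and 2): the sample covariance matrix of the principal component scores is diagonal, with the eigenvalues of S on the diagonal.

#Numerical check (added): covariance matrix of the PC scores is diagonal
S <- cov(States_Scaled)   #Sample covariance matrix of the scaled data
V <- eigen(S)$vectors     #Columns are the eigenvectors of S
C <- States_Scaled %*% V  #Principal component scores
round(cov(C), 10)         #Off-diagonal elements are all zero
eigen(S)$values           #Equal to the diagonal of cov(C)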