FES 844b / STAT 660b 2003
FES758 / STAT660b
Multivariate Statistics
Homework #2
Principal Components Analysis
Due : Tuesday, 2/14/17 11:59pm on CANVAS
Be sure to check out sample programs in SAS and R at the bottom of the syllabus page
http://reuningscherer.net/stat660/
Answers should be complete and concise. You should turn in typed solutions. If you are
working in a group, you may turn in one problem set per group (list all group members). If you
want to insert equations into MS Word, use Insert > Object > Microsoft Equation Editor 3.0.
You may use any statistics program for calculations that you wish.
SAMPLE DATA SET
The example below is JUST FOR YOUR PRACTICE.
NOTHING TO TURN IN HERE!
The data set AirPollution.xls is an Excel file that contains weather/pollution
measurements on 42 consecutive days at one site in Los Angeles. Each day,
measurements were taken at precisely 12 noon. There are seven variables :
Wind
Solar Radiation
Carbon Monoxide
Nitrogen Oxide
Nitrogen Dioxide
Ozone
Hydrogen Chloride
Your goal is to see if these measurements can be summarized in fewer than seven
dimensions.
1). Compute the correlation matrix between all variables (SAS and SPSS will provide
this for you as part of the PCA procedure; in SPSS, click on DESCRIPTIVES; in R, use
the cor() function). Comment on relationships you do/do not observe.
Correlation Matrix
                Wind  Radiation       CO       NO      NO2       O3       HC
Wind           1.000      -.101    -.194    -.270    -.110    -.254     .156
Radiation      -.101      1.000     .183    -.074     .116     .319     .052
CO             -.194       .183    1.000     .502     .557     .411     .166
NO             -.270      -.074     .502    1.000     .297    -.134     .235
NO2            -.110       .116     .557     .297    1.000     .167     .448
O3             -.254       .319     .411    -.134     .167    1.000     .154
HC              .156       .052     .166     .235     .448     .154    1.000
There are some moderate relationships, mostly between CO and the other oxygen-containing
compounds: the largest correlations are CO with NO2 (.557) and CO with NO (.502).
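As a minimal sketch (in Python with NumPy, which is an assumption here; the course samples are in SAS and R), the correlation matrix above can be typed in and its largest off-diagonal entries ranked. The helper name strongest_pairs is illustrative only.

```python
import numpy as np

# Correlation matrix copied from the SPSS output above.
labels = ["Wind", "Radiation", "CO", "NO", "NO2", "O3", "HC"]
R = np.array([
    [1.000, -.101, -.194, -.270, -.110, -.254,  .156],
    [-.101, 1.000,  .183, -.074,  .116,  .319,  .052],
    [-.194,  .183, 1.000,  .502,  .557,  .411,  .166],
    [-.270, -.074,  .502, 1.000,  .297, -.134,  .235],
    [-.110,  .116,  .557,  .297, 1.000,  .167,  .448],
    [-.254,  .319,  .411, -.134,  .167, 1.000,  .154],
    [ .156,  .052,  .166,  .235,  .448,  .154, 1.000],
])

def strongest_pairs(R, labels, top=4):
    """Return the `top` variable pairs with the largest absolute correlation."""
    rows, cols = np.triu_indices(len(labels), k=1)   # upper triangle, no diagonal
    order = np.argsort(-np.abs(R[rows, cols]))[:top]
    return [(labels[rows[i]], labels[cols[i]], float(R[rows[i], cols[i]]))
            for i in order]

pairs = strongest_pairs(R, labels)
for a, b, r in pairs:
    print(f"{a:>9} - {b:<9} r = {r:+.3f}")
```

With your own data you would compute R from the raw observations (for example with np.corrcoef(X, rowvar=False)) instead of typing it in.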
2). Perform principal components analysis using the correlation matrix (standardized
variables). Think about how many principal components to retain. To make this
decision look at
Total variance explained by a given number of principal components
The ‘eigenvalue > 1’ criterion
The ‘scree plot elbow’ method
Parallel Analysis : for the air pollution data, the first five threshold values for the
Allen and Longman methods are provided below (based on n=42 observations,
p=7 variables ) :
eigenval LONGMAN ALLEN
1 1.77411 1.78971
2 1.44221 1.52097
3 1.22756 1.32395
4 1.02647 1.16865
5 0.89682 1.03550
As you make this decision, keep in mind that the number of observations is
somewhat small relative to the number of variables.
Here are SPSS results :
Total Variance Explained
             Initial Eigenvalues                        Extraction Sums of Squared Loadings
Component    Total    % of Variance    Cumulative %     Total    % of Variance    Cumulative %
1            2.337    33.383           33.383           2.337    33.383           33.383
2            1.386    19.800           53.183           1.386    19.800           53.183
3            1.204    17.201           70.384           1.204    17.201           70.384
4             .727    10.387           80.771            .727    10.387           80.771
5             .653     9.335           90.106            .653     9.335           90.106
6             .537     7.667           97.773            .537     7.667           97.773
7             .156     2.227           100.000           .156     2.227           100.000
Extraction Method: Principal Component Analysis.
An 80% threshold would argue for 4 components. The eigenvalue greater than 1 rule
would argue for 3 components.
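The two counts above can be checked directly from the eigenvalues in the table. This is a minimal Python/NumPy sketch (the language choice is an assumption, not part of the course materials); the 80% cutoff is the one used in the answer above, and the variable names are illustrative.

```python
import numpy as np

# Eigenvalues copied from the "Total" column of the SPSS table above.
eigenvalues = np.array([2.337, 1.386, 1.204, 0.727, 0.653, 0.537, 0.156])

# Percent of total variance per component, and the running total.
pct = 100 * eigenvalues / eigenvalues.sum()
cum_pct = np.cumsum(pct)

# Smallest number of components whose cumulative share reaches 80%.
n_for_80 = int(np.argmax(cum_pct >= 80.0)) + 1

# 'Eigenvalue > 1' rule: count eigenvalues above 1.
n_eigen_gt_1 = int((eigenvalues > 1.0).sum())

print(np.round(cum_pct, 1))
print("components for 80%:", n_for_80)
print("eigenvalues > 1:   ", n_eigen_gt_1)
```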
The scree plot is somewhat double-jointed, with elbows at two and four, which would
argue for retaining one or three components.
For Parallel analysis, you can use SAS, R, or the SPSS Macro online. In MINITAB, do the
following using the data provided in the table above.
1) Copy the data above into MINITAB.
2) Make two more variables: one which has the eigenvalues calculated by MINITAB (copy
from output screen), and one which is a counter for the eigenvalue number (here from 1
to 7); see below.
3) Under Graph > Plot, input three Y,X combinations: Eigenvalues vs. Counter, Longman vs.
Counter, Allen vs. Counter. Under Frame, choose multiple graphs and indicate that the
plots should be overlaid. Indicate that Symbols and Connect (i.e. lines) should be
displayed for each plot. Use Edit Attributes to change colors, plot characters, etc.
[Figure: Scree Plot. x-axis: Component Number (1 to 7); y-axis: Eigenvalue (0.0 to 2.5).]
Both parallel methods suggest retaining one principal component. Be aware that since
the number of observations is small (only 42), the parallel procedures will more easily
reject components with borderline eigenvalues. Also keep in mind that the parallel
procedure assumes the variables have a normal distribution, which is a bit questionable
here. I decide to keep three components since, with three components, I can explain 70%
of the variability in the data (only 33% of variability with one component; however,
keeping one component is also a reasonable decision).
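The parallel analysis comparison can also be done without a plot. The sketch below (plain Python, an assumed choice of language) compares the first five SPSS eigenvalues with the Longman and Allen thresholds listed earlier and counts leading components that exceed their threshold. The function name n_retained and the stop-at-first-failure rule are assumptions made for illustration.

```python
# First five eigenvalues from the SPSS table, and the thresholds given above.
eigenvalues = [2.337, 1.386, 1.204, 0.727, 0.653]
longman = [1.77411, 1.44221, 1.22756, 1.02647, 0.89682]
allen = [1.78971, 1.52097, 1.32395, 1.16865, 1.03550]

def n_retained(eigs, thresholds):
    """Count leading components whose eigenvalue exceeds its threshold,
    stopping at the first component that fails."""
    count = 0
    for eig, cutoff in zip(eigs, thresholds):
        if eig <= cutoff:
            break
        count += 1
    return count

keep_longman = n_retained(eigenvalues, longman)
keep_allen = n_retained(eigenvalues, allen)
print("Longman:", keep_longman, "Allen:", keep_allen)
```

Only the first eigenvalue (2.337) is above both thresholds; the second (1.386) falls below 1.44221 and 1.52097.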
3). For principal components you decide to retain, examine the loadings (principal
components) and think about an interpretation for each component.
Variable PC1 PC2 PC3
Wind 0.237 -0.278 0.643
Radiatio -0.206 0.527 0.224
CO -0.551 0.007 -0.114
NO -0.378 -0.435 -0.407
NO2 -0.498 -0.200 0.197
O3 -0.325 0.567 0.160
HC -0.319 -0.308 0.541
Component 1 is mostly CO and NO2. Component 2 is Radiation, NO, and Ozone.
Component 3 is Wind, NO, and HC. The division is not exact. However, with three
components, I can explain 70% of the variability.
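One way to make this reading of the loadings reproducible is to list, for each component, the variables whose absolute loading exceeds a cutoff. The sketch below uses Python/NumPy (an assumed language) and a cutoff of 0.4, which is an arbitrary illustrative choice and not a rule from the course. "Radiation" is spelled out here, where the output table shows the truncated label.

```python
import numpy as np

# Loadings copied from the table above (columns: PC1, PC2, PC3).
variables = ["Wind", "Radiation", "CO", "NO", "NO2", "O3", "HC"]
loadings = np.array([
    [ 0.237, -0.278,  0.643],
    [-0.206,  0.527,  0.224],
    [-0.551,  0.007, -0.114],
    [-0.378, -0.435, -0.407],
    [-0.498, -0.200,  0.197],
    [-0.325,  0.567,  0.160],
    [-0.319, -0.308,  0.541],
])

cutoff = 0.4  # assumed threshold for calling a loading "large"

# Variables with |loading| >= cutoff, per component.
dominant = {
    f"PC{j + 1}": [v for v, w in zip(variables, loadings[:, j]) if abs(w) >= cutoff]
    for j in range(loadings.shape[1])
}

# Each column is a unit-length eigenvector, so its squared loadings sum to about 1.
squared_norms = (loadings ** 2).sum(axis=0)

for name, members in dominant.items():
    print(name, members)
print(np.round(squared_norms, 3))
```

The sign of a whole column is arbitrary, so only the relative signs within a component matter for interpretation.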
4). Write a paragraph summarizing your findings, and your opinions about the
effectiveness of using principal components on this data.
Not being a weather expert, I can’t say much about the interpretation of the components
beyond what was stated above. Given the relatively small number of observations,
principal components analysis was not entirely successful. Interpretation of the
components is somewhat difficult.
[Figure: parallel analysis plot. Eigenvalues and the Allen and Longman thresholds plotted
against Counter (1 to 7); y-axis: Eigenvalues (0 to 2).]
HOMEWORK ASSIGNMENT
PLEASE turn in the following answers for YOUR DATASET!
If PCA is not appropriate for your data, use ONE of the
datasets online (either DrugAttitudes.xls or
NASAunderstory.xls described on the following pages).
List your name and a one-sentence reminder of which
dataset you are using.
1). First, discuss whether your data seems to have a multivariate normal distribution.
Make univariate plots (boxplots, normal quantile plots as appropriate). Then make
transformations as appropriate. You do NOT need to turn all this in, but describe what
you did. THEN make a chi-square quantile plot of the data. Turn in your chi-square
quantile plot as appropriate and comment on what you see.
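A chi-square quantile plot pairs the ordered squared Mahalanobis distances of the observations with chi-square quantiles (degrees of freedom equal to the number of variables); roughly straight-line agreement is consistent with multivariate normality. The sketch below is one possible implementation in Python with NumPy and SciPy (an assumed choice; SAS or R work equally well). The function name, the (i - 0.5)/n plotting positions, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2

def chisq_quantile_points(X):
    """Return (chi-square quantiles, sorted squared Mahalanobis distances)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # sample covariance (n - 1 divisor)
    # Squared Mahalanobis distance of each row from the mean vector.
    d2 = np.einsum("ij,jk,ik->i", centered, S_inv, centered)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    return chi2.ppf(probs, df=p), np.sort(d2)

# Synthetic stand-in data (NOT a course dataset): 42 rows, 7 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(42, 7))
quantiles, distances = chisq_quantile_points(X)
print(np.round(quantiles[:3], 2), np.round(distances[:3], 2))
```

Plot distances against quantiles with any plotting tool; points far above the line at the upper end are candidate multivariate outliers.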
2). Compute the correlation matrix between all variables (SAS and SPSS will provide
this for you as part of the PCA procedure; in SPSS, click on DESCRIPTIVES; in R,
use the cor() function). Comment on relationships you do/do not observe. Do you
think PCA will work well?
3). Perform principal components analysis using the correlation matrix (standardized
variables). Think about how many principal components to retain. To make this
decision look at
Total variance explained by a given number of principal components
The ‘eigenvalue > 1’ criterion
The ‘scree plot elbow’ method (turn in the scree plot)
Parallel Analysis : think about whether this is appropriate based on what you
discover in number 1.
4). For principal components you decide to retain, examine the loadings (principal
components) and think about an interpretation for each retained component if possible.
5). Make a score plot of the scores for at least two pairs of component scores (one and
two, one and three, two and three, etc). Discuss any trends/groupings you observe. As
a bonus, try to make a 95% Confidence Ellipse for two of your components. You
might want to also try making a bi-plot if you’re using R.
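Component scores are the standardized data multiplied by the eigenvectors of the correlation matrix. The sketch below (Python/NumPy, an assumed choice) computes scores and the half-axes of a 95% ellipse for a pair of components. It takes one common reading of "confidence ellipse": a normal-theory ellipse expected to contain about 95% of the score points, with axes along the component directions because the scores are uncorrelated. The function names and the synthetic data are illustrative only.

```python
import math
import numpy as np

def pca_scores(X):
    """PCA on the correlation matrix: return scores, eigenvalues, eigenvectors."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Z @ eigvecs, eigvals, eigvecs

def ellipse_half_axes(eigvals, i=0, j=1, level=0.95):
    """Half-axis lengths of the normal-theory ellipse for components i and j."""
    c = -2.0 * math.log(1.0 - level)   # chi-square quantile with 2 df (closed form)
    return math.sqrt(c * eigvals[i]), math.sqrt(c * eigvals[j])

# Synthetic stand-in data (NOT a course dataset): 60 rows, 5 correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 5))
scores, eigvals, eigvecs = pca_scores(X)
a, b = ellipse_half_axes(eigvals)

# Points inside the ellipse centered at the origin of the PC1/PC2 score plot.
inside = (scores[:, 0] / a) ** 2 + (scores[:, 1] / b) ** 2 <= 1.0
print(f"half-axes: {a:.2f}, {b:.2f}; share inside: {inside.mean():.2f}")
```

Points outside the ellipse on the score plot are worth checking as possible multivariate outliers.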
6). Write a paragraph summarizing your findings, and your opinions about the
effectiveness of using principal components on this data. Include evidence based on
scatterplots of linearity in higher dimensional space, note any multivariate outliers in
your score plot, comment on sample size relative to number of variables, etc.
LOANER DATASETS
(if PCA is not appropriate for your data)
The data set DrugAttitudes.xls is an Excel file that contains attitudes of 38 people
measured on 20 variables relating to drugs. Each question was measured on a 5 point
scale where 1=Strongly Agree and 5 = Strongly Disagree. The variables were
legal All drugs should be made legal and freely available.
dangerous As a general rule of thumb, most drugs are dangerous and should be used only with medical authorization.
regret Drugs can cause people to say or do things they might later regret.
unnatural Drugs are basically an “unnatural” way to enjoy life.
notuse Even if my best friend gave me some hash, I probably wouldn’t use it.
psycho Experimenting with drugs is dangerous if a person has any psychological problems.
trip I see nothing wrong with taking an LSD trip.
stoned I admire people who like to get stoned.
calm I wish I could get hold of some pills to calm me down whenever I get “up tight”.
high I would welcome the opportunity to get high on drugs.
noaspirin I’d have to be pretty sick before I’d take any drug including an aspirin.
relationship If people use drugs together, their relationships will be improved.
drugscene In spite of what the establishment says, the drug scene is really “where it’s at”.
caregivers People who regularly take drugs should not be given positions of responsibility for young children.
experience People who make drug legislation should really have personal experience with drugs.
fun People who use drugs are more fun to be with than those who don’t use drugs.
stupid Pep pills are a stupid way of keeping alert when there’s important work to be done.
lessalcohol Smoking marijuana is less harmful than drinking alcohol.
sideeffects Students should be told about the harmful side effects of certain drugs.
dope Taking any kind of dope is a pretty dumb idea.
Your goal is to see if these measurements can be summarized in fewer than 20
dimensions. NOTE that one variable may get imported as a text variable – this
might cause you problems.
Superior National Forest understory data.
Thirty-two quaking aspen and thirty-one black spruce
sites were studied. The dominant species in the site
constituted 80-95% of the total tree density and basal
area. For each plot, a two-meter diameter subplot was
defined and the percent of ground coverage by plants
under one meter in height was determined by species.
This example examines the percentage cover in each plot
of the 30 most prevalent understory species. The goal is
to use PCA to examine if there are groups of species that
tend to exhibit similar patterns of variation.