FES 752b / STAT 660b
Multivariate Statistics
Homework #4 : Discriminant Analysis
Due: Friday, 3/10/2017, 11:59 pm on CANVAS
HOMEWORK ASSIGNMENT
PLEASE turn in the following answers for YOUR DATASET! If
Discriminant Analysis is not appropriate for your data, use one of
the two loaner datasets described on the next page.
List names/emails of everyone who worked on this assignment.
NOTE – some questions below will not be appropriate if you have only two groups, or if you only have two
possible discriminating variables. Modify assignment as appropriate!
1. Evaluate the assumptions implicit to Discriminant Analysis for your data – multivariate
normality WITHIN each group (i.e. chi-square quantile plots) and similarity of covariance
matrices (look at Box’s M or just look at raw standard deviations/covariance matrices).
Comment on what you find. Comment on whether you think transformations might help your
data to meet the assumptions of DA. If you think they might, make some transformations and
find out!
2. Perform stepwise discriminant analysis on your data. Comment on which model seems the
best. Use quadratic discriminant analysis if appropriate.
3. Comment on whether there is statistical evidence that the multivariate group means are
different.
4. How many discriminant functions are significant? What is the relative discriminating power of
each function?
5. Use classification, both regular and leave-one-out, to evaluate the discriminating ability of your
functions.
6. Provide some evidence as to which of your original variables are the ‘best’ discriminators
amongst your groups (look at standardized discriminant coefficients).
7. Make a score plot for the first two or three DA function scores. Comment on what you see.
8. Bonus (and optional) – try kernel smoothing or k-nearest neighbors and get the admiration of
your professor and TA (and some extra credit)! You’ll have to use SAS or R for this.
LOANER DATASETS (choose one)
(if Discriminant Analysis is not appropriate for your data)
1) alzheim.xls and alzheim.sas (actually a program that makes the datafile). Data and sas
file are from Michael Friendly at York University. This file contains data from a study by
Robert Hart concerning the ability of memory tests to distinguish between patients with mild
Alzheimer’s disease, patients with depression, and controls. The grouping variable is called GROUP.
The other variables are
subj = ‘Subject ID’
trecall = ‘Total recall’
hirecall= ‘HI Imagery Recall’
lirecall= ‘LO Imagery Recall’
tunrem = ‘Unreminded Memory’
hiunrem = ‘HI Imagery Unreminded Memory’
liunrem = ‘LO Imagery Unreminded Memory’
store = ‘Storage’
recog = ‘Recognition Memory’;
NOTE that trecall is just the sum of hirecall and lirecall (i.e. you can’t use all three of
these variables!)
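A quick way to verify this linear dependence before fitting anything is a check along these lines (a sketch; the file name alzheim.csv is hypothetical, so adjust it to however you export the Excel file):

#Sketch: confirm trecall is exactly hirecall + lirecall, so at most two of
#these three variables can enter the discriminant analysis together
alz = read.csv("alzheim.csv")   #hypothetical CSV export of alzheim.xls
all(alz$trecall == alz$hirecall + alz$lirecall)   #TRUE means trecall is redundant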
2) The file RemoteSensing.csv contains data from satellite imagery. The goal is to use the
amount of light reflectance in four bandwidths to predict crop. The variables are as follows:
Crop : Either corn, soybeans, cotton, or clover. This is the grouping variable (i.e. you
are trying to discriminate amongst these crops).
X1 : amount of light reflectance in the first of four bandwidths
X2 : amount of light reflectance in the second of four bandwidths
X3 : etc.
X4 : etc.
SAMPLE DATA SET
The example below is JUST FOR YOUR PRACTICE. NOTHING TO
TURN IN HERE!
IRIS.xls contains data analyzed initially by R.A. Fisher (grandfather of statistics). It consists
of four different measurements taken on three different species of iris, 50 replications per
species. The variables are species (group variable), petal length and width, and sepal length
and width.
SAS SOLUTION
1. Evaluate the assumptions implicit to Discriminant Analysis for your data – multivariate
normality WITHIN each group (i.e. chi-square quantile plots) and similarity of covariance
matrices (look at Box’s M or just look at raw standard deviations). Comment on what you find.
Comment on whether you think transformations might help your data to meet the assumptions
of DA. If you think they might, make some transformations and find out!
*Get data from online;
filename foo url 'http://reuningscherer.net/stat660/data/iris.csv';
PROC IMPORT OUT= WORK.iris
DATAFILE= foo
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
*check multivariate normality within each group;
*INCLUDE THE MACRO;
filename foo2 url "http://reuningscherer.net/stat660/sas/multnorm.sas.txt";
%include foo2;
* make subgroup of the data;
data setosa; set iris; if var8='Setosa'; run;
%multnorm(data=setosa, var=sepallen sepalwid petallen petalwid)
data versicolor; set iris; if var8='Versic'; run;
%multnorm(data=versicolor, var=sepallen sepalwid petallen petalwid)
data virginica; set iris; if var8='Virgin'; run;
%multnorm(data=virginica, var=sepallen sepalwid petallen petalwid)
Chi-square quantile plots look fine for each group.
2. Perform stepwise discriminant analysis on your data (unless using R, in which case, fit a few
different models for different sets of variables). Comment on which model seems the best.
Use quadratic discriminant analysis if appropriate.
proc stepdisc data= iris;
class var8;
var sepallen sepalwid petallen petalwid;
run;
* make plots to evaluate equality of covariance matrices;
Proc GPlot DATA=iris;
PLOT sepallen*sepalwid = var8;
PLOT petallen*petalwid = var8;
Symbol1 V=Dot H=1 I=None C=Red;
Symbol2 V=Star H=1 I=None C=Blue;
Symbol3 V=Square H=1 I=None C=Green;
Run;
*run quadratic DA and get output to answer following questions;
proc discrim data=iris pool=no out=outiris crossvalidate manova canonical;
class var8;
var sepallen sepalwid petallen petalwid;
run;
The SAS output shows that all four variables are added in this process, so the model with all
four predictors seems best.
Plots (not shown here) suggest that there may be inequality of covariance matrices, so try
quadratic discriminant analysis. Also, when DA is performed, the log determinants of the group
covariance matrices are 5.3, 7.5, and 9.5, which also suggests inequality of covariance matrices.
3. Comment on whether there is statistical evidence that the multivariate group means are
different.
The MANOVA option in proc discrim gives this output. Wilks’ lambda is significant, so we
conclude that the multivariate group means are not all the same.
Multivariate Statistics and F Approximations
S=2 M=0.5 N=71
Statistic Value F Value Num DF Den DF Pr > F
Wilks' Lambda            0.02343863    199.15      8     288   <.0001
Pillai's Trace           1.19189883     53.47      8     290   <.0001
Hotelling-Lawley Trace  32.47732024    582.20      8   203.4   <.0001
Roy's Greatest Root     32.19192920   1166.96      4     145   <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

4. How many discriminant functions are significant?

The CANONICAL option in proc discrim provides this output. There are 3 groups and 4
variables, so there are min(3-1, 4) = 2 discriminant functions. The output shows that both are
significant.

Canonical Discriminant Analysis

                     Adjusted      Approximate    Squared
       Canonical     Canonical     Standard       Canonical
       Correlation   Correlation   Error          Correlation
   1    0.984821      0.984508      0.002468       0.969872
   2    0.471197      0.461445      0.063734       0.222027

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
Test of H0: The canonical correlations in the current row and all that follow are zero

     Eigenvalue  Difference  Proportion  Cumulative  Likelihood Ratio  Approx F Value  Num DF  Den DF  Pr > F
   1    32.1919     31.9065      0.9912      0.9912      0.02343863        199.15           8     288  <.0001
   2     0.2854                  0.0088      1.0000      0.77797337         13.79           3     145  <.0001

5. Use classification, both regular and leave-one-out, to evaluate the discriminating ability of
your functions.

Classification Summary for Calibration Data: MYLIB.IRIS
Resubstitution Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species1

From Species1    Setosa  Versicolor  Virginica   Total
Setosa               50           0          0      50
                 100.00        0.00       0.00  100.00
Versicolor            0          48          2      50
                   0.00       96.00       4.00  100.00
Virginica             0           1         49      50
                   0.00        2.00      98.00  100.00
Total                50          49         51     150
                  33.33       32.67      34.00  100.00
Priors          0.33333     0.33333    0.33333

Error Count Estimates for Species1
         Setosa  Versicolor  Virginica   Total
Rate     0.0000      0.0400     0.0200  0.0200
Priors   0.3333      0.3333     0.3333

Cross-validation Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species1

From Species1    Setosa  Versicolor  Virginica   Total
Setosa               50           0          0      50
                 100.00        0.00       0.00  100.00
Versicolor            0          47          3      50
                   0.00       94.00       6.00  100.00
Virginica             0           1         49      50
                   0.00        2.00      98.00  100.00
Total                50          48         52     150
                  33.33       32.00      34.67  100.00
Priors          0.33333     0.33333    0.33333

Error Count Estimates for Species1
         Setosa  Versicolor  Virginica   Total
Rate     0.0000      0.0600     0.0200  0.0267
Priors   0.3333      0.3333     0.3333

Classification makes 3 errors for resubstitution and 4 errors for cross-validation – so the error
rate is 2-3%, very good!

6. Provide some evidence as to which of your original variables are the ‘best’ discriminators
amongst your groups.

The DISCRIM Procedure
Canonical Discriminant Analysis

Total-Sample Standardized Canonical Coefficients

Variable  Label             Can1          Can2
sepallen  sepallen  -0.686779533   0.019958173
sepalwid  sepalwid  -0.668825075   0.943441829
petallen  petallen   3.885795047  -1.645118866
petalwid  petalwid   2.142238715   2.164135931

Standardized coefficients are shown above. The largest coefficient for DA function one is for
petal length, followed by petal width. Petal variables still have the largest coefficients for the
second DA function – SO, petal characteristics are the best discriminators.

7. Make a score plot for the first two or three DA functions. Comment on what you see.

PROC GPLOT DATA=OUTIRIS;
PLOT CAN2*CAN1=SPECIES1;
RUN;

Discrimination is mostly in the direction of DA function 1, but there is some separation of
Versicolor from Virginica/Setosa in the direction of DA function 2.

8. Bonus (and optional) – try kernel smoothing or k-nearest neighbors and get the admiration of
your professor and TA! You’ll have to use SAS or R for this.

Examples of this are given online in irisdiscrim.sas.
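In R, a minimal k-nearest-neighbors sketch uses knn.cv() from the class package, which does leave-one-out cross-validation directly (this assumes the iris data frame as read in the R SOLUTION below; k = 5 is an arbitrary illustrative choice, not a tuned value):

#Sketch: k-nearest-neighbor classification with leave-one-out cross-validation
library(class)
pred = knn.cv(train = iris[, 1:4], cl = factor(iris$speciesnum), k = 5)
table(factor(iris$speciesnum), pred)   #cross-validated confusion matrix
mean(factor(iris$speciesnum) != pred)  #leave-one-out error rate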
R SOLUTION

#get chi-square quantile plot function
source("http://www.reuningscherer.net/STAT660/R/CSQPlot.r.txt")

#get data
iris=read.csv("http://www.reuningscherer.net/stat660/data/iris.csv", header=T)
attach(iris)
iris[1:5,]
#  sepallen sepalwid petallen petalwid speciesnum Species
#1       50       33       14        2          1  Setosa
#2       46       34       14        3          1  Setosa
#3       46       36       10        2          1  Setosa
#4       51       33       17        5          1  Setosa
#5       55       35       13        2          1  Setosa

#### define names for later ###
vars=names(iris[,1:4])
#[1] "sepallen" "sepalwid" "petallen" "petalwid"
levs=levels(iris[,6])
#[1] "Setosa"     "Versicolor" "Virginica"

1) Use some method to assess the multivariate normality of the data in each group.

Multivariate chi-square quantile plots:

par(mfrow=c(1,3))
CSQPlot(iris[iris[,"speciesnum"]==1,1:4],label=levs[1])
CSQPlot(iris[iris[,"speciesnum"]==2,1:4],label=levs[2])
CSQPlot(iris[iris[,"speciesnum"]==3,1:4],label=levs[3])

[Chi-square quantile plots for Setosa, Versicolor, and Virginica, each showing the data with
95% confidence limits]

Univariate normal quantile plots:

###### univariate qqplot (plot for each variable) ########
par(mfrow=c(1,3))
for (i in 1:3) {qqnorm(iris[iris[,"speciesnum"]==i,1],pch=19,cex=1.5,col="red",
  main=paste(vars[1]," (",levs[i],")",sep=""));qqline(iris[iris[,"speciesnum"]==i,1],col="blue")}
for (i in 1:3) {qqnorm(iris[iris[,"speciesnum"]==i,2],pch=19,cex=1.5,col="red",
  main=paste(vars[2]," (",levs[i],")",sep=""));qqline(iris[iris[,"speciesnum"]==i,2],col="blue")}
for (i in 1:3) {qqnorm(iris[iris[,"speciesnum"]==i,3],pch=19,cex=1.5,col="red",
  main=paste(vars[3]," (",levs[i],")",sep=""));qqline(iris[iris[,"speciesnum"]==i,3],col="blue")}
for (i in 1:3) {qqnorm(iris[iris[,"speciesnum"]==i,4],pch=19,cex=1.5,col="red",
  main=paste(vars[4]," (",levs[i],")",sep=""));qqline(iris[iris[,"speciesnum"]==i,4],col="blue")}

[Normal quantile plots of sepallen, sepalwid, petallen, and petalwid, one panel per species]

2. Perform stepwise discriminant analysis – are all variables good discriminators? (If using R,
fit a few different models for different sets of variables.)

First, using lda() from the MASS library, include all variables.

library(MASS)
library(mda)
dafit=lda(iris[,1:4],grouping=iris$speciesnum)
names(dafit)
#[1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"       "call"
dafit

Call:
lda(iris[, 1:4], grouping = iris$speciesnum)

Prior probabilities of groups:
        1         2         3
0.3333333 0.3333333 0.3333333

Group means:
  sepallen sepalwid petallen petalwid
1    50.06    34.28    14.62     2.46
2    59.36    27.70    42.60    13.26
3    65.88    29.74    55.52    20.26

Coefficients of linear discriminants:
                 LD1          LD2
sepallen  0.08293776 -0.002410215
sepalwid  0.15344731 -0.216452123
petallen -0.22012117  0.093192121
petalwid -0.28104603 -0.283918785

Proportion of trace:
   LD1    LD2
0.9912 0.0088

Second, I also use a different function, fda() from the mda library, which seems to have more
things to investigate.
dat2=iris[,-6]
dafit2=fda(speciesnum~sepallen+sepalwid+petallen+petalwid, data=dat2)
dafit2
names(dafit2)
# [1] "percent.explained" "values"    "means"     "theta.mod" "dimension"
# [6] "prior"             "fit"       "call"      "terms"     "confusion"

plot(dafit2$fit$fitted.values,col=dat2[,5],pch=as.character(dat2[,5]),
  xlab="Discriminant 1",ylab="Discriminant 2",main="Fit 2")

[Score plot "Fit 2": Discriminant 2 vs. Discriminant 1, points labeled 1, 2, 3 by group]

dafit2$percent.explained
#       v1        v2
# 99.12126 100.00000

dafit2$confusion
#        true
#object    1  2  3
#      1  50  0  0
#      2   0 48  1
#      3   0  2 49
#attr(,"error")
#[1] 0.02

2% error rate for classification.

dafit2$fit
#$fitted.values
#          [,1]          [,2]
#1  0.867024742 -1.0032169523
#1  0.962623940 -0.8042260629
#1  1.295925970 -0.8386326481
#1  0.874459690 -0.6592606363
#1  1.142980809 -0.9666945115
#1  0.568193386 -1.0680887910
#1  0.972454582 -0.9796668565
#1  1.055926338 -0.9814942255
#1  0.773914918 -0.9886858633
#1  1.140585114 -0.4472974019
#1  0.605276336 -1.0747132036
#.
#.
#.
#$coefficients
#                 [,1]          [,2]
#Intercept -1.78670201 -2.1820223424
#sepallen   0.01055527 -0.0097317871
#sepalwid   0.08431929  0.0430136701
#petallen  -0.05454111  0.0008459637
#petalwid   0.05350100  0.1170500707

This time we can pick a different subset of variables, here excluding the last variable.

dafit3=fda(speciesnum~sepallen+sepalwid+petallen, data=dat2)
confusion(dafit3)
        true
object    1  2  3
     1   50  0  0
     2    0 48  3
     3    0  2 47
attr(,"error")
[1] 0.03333333
plot(dafit3$fit$fitted.values,col=dat2[,5],pch=as.character(dat2[,5]),
  xlab="Discriminant 1",ylab="Discriminant 2",main="Fit 3")

[Score plot "Fit 3": Discriminant 2 vs. Discriminant 1, points labeled 1, 2, 3 by group]

The error rate increased from 0.02 to 0.03, and the distinction between groups 2 and 3 seems
worse with these three variables. You can try other subsets of variables in the same way, as
sketched below.
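For example, a minimal loop (reusing the fda() and confusion() pieces shown above; note this reports resubstitution error only, not cross-validated error):

#Sketch: drop each variable in turn and report the resubstitution error rate
for (drop in 1:4) {
  f = fda(speciesnum ~ ., data = dat2[, c(setdiff(1:4, drop), 5)])
  cat("without", vars[drop], ": error =", attr(f$confusion, "error"), "\n")
}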
3. Comment on whether the covariance matrices seem similar between groups. Look at sample variances and/or Box’s M
Statistic.
Base R does not include Box’s M test (and, unfortunately, the test is very sensitive to violations
of normality, leading to rejection in most typical cases; a hand-coded version is sketched below,
after the boxplots). Instead, we can use bartlett.test(), which tests the null hypothesis that the
variances in each of the groups (samples) are the same.
bartlett.test(sepallen+sepalwid+petallen+petalwid ~ speciesnum, data=dat2)
Bartlett test of homogeneity of variances

data:  sepallen + sepalwid + petallen + petalwid by speciesnum
Bartlett's K-squared = 18.1169, df = 2, p-value = 0.0001164
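Note that this formula actually tests the variance of the sum of the four variables. To test each variable separately, a quick sketch using the vars names defined earlier:

#Sketch: Bartlett's test of equal group variances, one variable at a time
for (v in vars) print(bartlett.test(dat2[[v]] ~ dat2$speciesnum))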
Either way, we reject the null hypothesis of equal variances. Let’s investigate the covariance matrices directly:
cov(dat2[dat2$speciesnum==1,1:4])
sepallen sepalwid petallen petalwid
sepallen 12.424898 9.921633 1.6355102 1.0330612
sepalwid 9.921633 14.368980 1.1697959 0.9297959
petallen 1.635510 1.169796 3.0159184 0.6069388
petalwid 1.033061 0.929796 0.6069388 1.1106122
cov(dat2[dat2$speciesnum==2,1:4])
sepallen sepalwid petallen petalwid
sepallen 26.643265 8.518367 18.289796 5.577959
sepalwid 8.518367 9.846939 8.265306 4.120408
petallen 18.289796 8.265306 22.081633 7.310204
petalwid 5.577959 4.120408 7.310204 3.910612
cov(dat2[dat2$speciesnum==3,1:4])
sepallen sepalwid petallen petalwid
sepallen 40.434286 9.376327 30.328980 4.909388
sepalwid 9.376327 10.400408 7.137959 4.762857
petallen 30.328980 7.137959 30.458776 4.882449
petalwid 4.909388 4.762857 4.882449 7.543265
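One quick numeric summary of these three matrices is the natural log of each determinant (a sketch; compare the log determinants of roughly 5.3, 7.5, and 9.5 that SAS reported for these same groups in the quadratic DA above):

#Sketch: log determinant of each group's covariance matrix; widely differing
#values suggest unequal covariance matrices (and hence quadratic DA)
sapply(1:3, function(g) log(det(cov(dat2[dat2$speciesnum == g, 1:4]))))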
For sepallen, the variance of the first group is 12.42, while that of the third group is 40.43 (petalwid also has
a larger variance in the third group). We can visualize the variability (and the means) of each variable using boxplots:
par(mfrow=c(2,2))
for (i in 1:4) boxplot(iris[,i]~as.factor(iris[,5]),ylab=vars[i],main=vars[i])
[Boxplots of sepallen, sepalwid, petallen, and petalwid by species (Setosa, Versicolor, Virginica)]
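As promised above, Box's M can also be computed by hand in base R (a sketch of the standard chi-square approximation; contributed packages such as biotools and heplots provide a ready-made boxM() as well):

#Sketch: Box's M test for equality of group covariance matrices,
#using the usual chi-square approximation
boxM.manual = function(X, grp) {
  grp = as.factor(grp); k = nlevels(grp); p = ncol(X); N = nrow(X)
  ns = as.vector(table(grp))
  covs = lapply(levels(grp), function(g) cov(X[grp == g, , drop = FALSE]))
  #pooled covariance matrix
  Sp = Reduce(`+`, Map(function(S, n) (n - 1) * S, covs, ns)) / (N - k)
  M = (N - k)*log(det(Sp)) - sum((ns - 1)*sapply(covs, function(S) log(det(S))))
  c1 = (sum(1/(ns - 1)) - 1/(N - k)) * (2*p^2 + 3*p - 1) / (6*(p + 1)*(k - 1))
  stat = M*(1 - c1); df = p*(p + 1)*(k - 1)/2
  c(chi.square = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}
boxM.manual(dat2[, 1:4], dat2$speciesnum)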
Test the equality of means in each group by univariate and multivariate tests:
mano.fit=manova(as.matrix(iris[,1:4])~as.factor(iris$speciesnum))
summary.aov(mano.fit)
#Response sepallen :
#                            Df Sum Sq Mean Sq F value    Pr(>F)
#as.factor(iris$speciesnum)   2 6321.2  3160.6  119.26 < 2.2e-16 ***
#Residuals                  147 3895.6    26.5
#
#Response sepalwid :
#                            Df  Sum Sq Mean Sq F value    Pr(>F)
#as.factor(iris$speciesnum)   2 1134.49  567.25   49.16 < 2.2e-16 ***
#Residuals                  147 1696.20   11.54
#
#Response petallen :
#                            Df Sum Sq Mean Sq F value    Pr(>F)
#as.factor(iris$speciesnum)   2  43710   21855  1180.2 < 2.2e-16 ***
#Residuals                  147   2722      19
#
#Response petalwid :
#                            Df Sum Sq Mean Sq F value    Pr(>F)
#as.factor(iris$speciesnum)   2 8041.3  4020.7     960 < 2.2e-16 ***
#Residuals                  147  615.7     4.2
mano.fit
summary.manova(mano.fit,test="Wilks")
#                            Df  Wilks approx F num Df den Df    Pr(>F)
#as.factor(iris$speciesnum)   2  0.023  199.145      8    288 < 2.2e-16 ***
#Residuals                  147
#---

4. If covariance matrices do not seem similar, try using unequal covariance matrices to do the
discrimination. Comment on how your results change. (A quadratic DA sketch is given below.)
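In R, unequal covariance matrices correspond to quadratic discriminant analysis; a minimal sketch with qda() from the MASS library (already loaded above), using CV=TRUE for leave-one-out cross-validation:

#Sketch: quadratic DA with leave-one-out cross-validation
qdafit = qda(iris[,1:4], grouping = iris$speciesnum, CV = TRUE)
table(factor(iris$speciesnum), qdafit$class)   #cross-validated confusion matrix
mean(factor(iris$speciesnum) != qdafit$class)  #leave-one-out error rate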
6. How many discriminant functions are significant?

R doesn’t have a formal Wilks’ lambda test for the significance of each discriminant function.
However, investigating (comparing) the discriminant function scores for each group is also
helpful in deciding whether a given discriminant function is significant.

Fitted2=dafit2$fit$fitted.values
colnames(Fitted2)=paste("Dis",1:ncol(Fitted2))
par(mfrow=c(1,2))
for (i in 1:2) boxplot(Fitted2[,i]~as.factor(iris[,5]),main=paste(i,"th Discriminant"))

[Boxplots of the first and second discriminant function scores by species]

Both discriminant functions seem to separate the groups pretty well. Also check the differences
using univariate ANOVA:

mano.fit2=manova(as.matrix(Fitted2[,1:2])~as.factor(iris$speciesnum))
summary.aov(mano.fit2)
# Response 1 :
#                            Df Sum Sq Mean Sq F value    Pr(>F)
#as.factor(iris$speciesnum)   2 70.242  35.121  326.94 < 2.2e-16 ***
#Residuals                  147 15.791   0.107
#
# Response 2 :
#                            Df Sum Sq Mean Sq F value    Pr(>F)
#as.factor(iris$speciesnum)   2 78.251  39.125  396.61 < 2.2e-16 ***
#Residuals                  147 14.502   0.099

summary.manova(mano.fit2,test="Wilks")
#                            Df Wilks approx F num Df den Df    Pr(>F)
#as.factor(iris$speciesnum)   2  0.02   403.82      4    292 < 2.2e-16 ***
#Residuals                  147

The ANOVA results show that the differences in both discriminant function scores among the
three groups are significant.

SPSS Steps for Discriminant Analysis

Iris data used as example

1. Evaluate the assumptions implicit to Discriminant Analysis for your data – multivariate
normality WITHIN each group (i.e. chi-square quantile plots) and similarity of covariance
matrices (look at Box’s M or just look at raw standard deviations). Comment on what you find.
Comment on whether you think transformations might help your data to meet the assumptions
of DA. If you think they might, make some transformations and find out!

To evaluate univariate normality for each group

Boxplots: Use Graph -> Boxplot. There are a number of options that will work. In the dialogue
box, choose Simple and click on “summaries of groups of cases”. Then list the variables of
interest (here we have 4 variables of interest: sepal length, sepal width, petal length, and petal
width) and the category axis (species number or species).

[SPSS boxplot of sepalwid for the three species, with one outlying case (44) flagged]

Normal Quantile Plots: Unfortunately, there is no easy way to make separate normal quantile
plots for each species or grouping in SPSS. Go to Minitab (or one of the other programs) for
this.

In MINITAB: Use Graph -> Probability Plot. Choose Simple and select the variables that you
want to graph. Click on the “Multiple Graphs” button and select the “By Variables” tab. When
there, select the option that you prefer (there are two presented) and enter your grouping
variable (in this case species). I chose the “by variables with groups in separate panels” option.

[Minitab normal probability plots of sepalwid with 95% CI, paneled by species. Setosa: mean
34.28, StDev 3.791, AD 0.491, p-value 0.210. Versicolor: mean 27.7, StDev 3.138, AD 0.560,
p-value 0.141. Virginica: mean 29.74, StDev 3.225, AD 0.618, p-value 0.102. N = 50 each.]

To make any necessary transformations

Transformations: Use Transform -> Compute. Select the Target Variable (a new variable you
created) and the appropriate numeric expression (e.g., log, natural log, etc.).

To evaluate multivariate normality: Chi-square Quantile Plots

Like normal quantile plots, it is unfortunately not possible to create chi-square quantile plots by
group in SPSS. Use one of the other programs.

To evaluate similarity of covariance matrices

Run Discriminant Analysis. Use Analyze -> Classify -> Discriminant. Enter the independent
variables and the grouping variable (the grouping variable must be numeric rather than text, so
here use speciesnum rather than Species).