ETF3500/ETF5500 HDDA
Part A Multiple Choice (10 Marks)
The following questions are multiple choice. You MUST provide a brief explanation for your answer. If no explanation is given, you will receive ZERO marks for the question.
1. Suppose we have data on p variables, where p > 2. Let δij be the distance between observation i and observation j in p-dimensional space. Let B be the matrix obtained after subtracting row means and column means from the matrix with entries given by −½δij². Suppose that the first two principal components are plotted as a scatterplot. Let dij denote the distance between the first two principal components for observation i and the first two principal components for observation j in the 2-dimensional space of this plot. The quantity δij² − dij², summed over all pairs of observations, is guaranteed to be minimised when:
(a) Both δij and dij are Euclidean distances for all i, j.
(b) Both δij and dij are Manhattan distances for all i, j.
(c) The eigenvalues of B are all non-negative.
(d) The eigenvalues of B are all non-negative, and dij are Euclidean distances for all i, j.
The answer is (a). The principal coordinates produced in multidimensional scaling will minimise the quantity in the question (the strain) as long as the eigenvalues of B are non-negative and dij are Euclidean distances. This is guaranteed to occur when δij is Euclidean but can also occur for non-Euclidean distances (which might suggest the answer is (d)). However, the key to getting the right answer is to understand that the principal components are only equivalent to the principal coordinates when the higher-dimensional distances δij are also Euclidean. Students should at least demonstrate an understanding that PCA and MDS are equivalent when all distances are Euclidean and that MDS minimises strain.
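This equivalence is easy to check numerically. A minimal sketch in base R (simulated data, not part of the exam): classical MDS applied to Euclidean distances (cmdscale) reproduces the principal component scores from prcomp up to an arbitrary sign flip in each column.

# Classical MDS on Euclidean distances vs. principal component scores
set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)   # 50 observations, p = 4 variables

pc  <- prcomp(X, center = TRUE, scale. = FALSE)    # principal components
mds <- cmdscale(dist(X), k = 2)                    # principal coordinates from Euclidean delta_ij

# Agreement up to a per-column sign flip; the difference is numerically zero
max(abs(abs(pc$x[, 1:2]) - abs(mds)))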
2. Which of the following statements is true?
(a) The average linkage method is an example of non-hierarchical clustering while k-means clustering is
an example of hierarchical clustering.
(b) The average linkage method is an example of hierarchical clustering while k-means clustering is an
example of non-hierarchical clustering.
(c) The average linkage method and k-means clustering are both examples of hierarchical clustering.
(d) The average linkage method and k-means clustering are both examples of non-hierarchical clustering.
The answer is (b). In average linkage, the nearest clusters are sequentially merged, with the distance between clusters defined as the average of the distances between all pairs of points across the two clusters. At the end, a solution with every possible number of clusters is available and an appropriate number of clusters can be selected using the dendrogram. In k-means clustering, the number of clusters is specified a priori, centroids are found, and clusters are formed by allocating each observation to the nearest centroid. Only a single solution is available, with the pre-specified number of clusters. Any correct answer that demonstrates that students understand the distinction between hierarchical and non-hierarchical methods of clustering should receive full marks.
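A short sketch in base R (simulated two-group data, purely illustrative) showing the practical difference: average linkage returns a full dendrogram that is cut afterwards, while k-means must be told the number of clusters up front.

set.seed(1)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))

# Hierarchical: average linkage builds a full dendrogram; the number of
# clusters is chosen afterwards by cutting the tree.
hc      <- hclust(dist(X), method = "average")
cl_hier <- cutree(hc, k = 2)

# Non-hierarchical: k-means needs the number of clusters up front and
# returns only that single partition.
cl_km <- kmeans(X, centers = 2)$cluster

table(cl_hier, cl_km)   # the two methods recover essentially the same groups here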
3. Which of the following statements is true?
(a) The covariances between the Principal Components are given by the eigenvalues of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvectors of the variance-covariance matrix of the data.
(b) The covariances between the Principal Components are given by the eigenvectors of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvalues of the variance-covariance matrix of the data.
(c) The variances of the Principal Components are given by the eigenvalues of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvectors of the variance-covariance matrix of the data.
(d) The variances of the Principal Components are given by the eigenvectors of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvalues of the variance-covariance matrix of the data.
The answer is (c). By construction the principal components are uncorrelated, which rules out (a) and (b). Also, there are only p eigenvalues, whereas each of the p principal components requires its own set of p weights, so the weights used to form the Principal Components cannot be given by the eigenvalues, ruling out (d). Any correct answer that demonstrates some understanding of what eigenvalues and eigenvectors are should be rewarded with full marks.
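A quick numerical check in base R (simulated data, not from the exam) of the relationship in (c): the eigenvalues of the covariance matrix equal the variances of the principal components, and the eigenvectors are the weights used to form them.

set.seed(1)
X  <- matrix(rnorm(100 * 3), ncol = 3)
ev <- eigen(cov(X))
pc <- prcomp(X)

ev$values                 # eigenvalues ...
pc$sdev^2                 # ... equal the variances of the principal components
abs(ev$vectors)           # eigenvectors (up to sign) ...
abs(pc$rotation)          # ... equal the weights used to form the components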
4. When the data are measured as nominal variables, which of the following methods are sensible to use?
(a) Cluster Analysis and Principal Components Analysis.
(b) Principal Components Analysis and Multi Dimensional Scaling.
(c) Cluster Analysis and Multi Dimensional Scaling.
(d) Cluster Analysis, Multi Dimensional Scaling and Principal Components Analysis.
The answer is (c). Both cluster analysis and MDS take a distance matrix as their input, and a distance matrix can be defined even for nominal data, for example using the Jaccard dissimilarity metric. Principal Components Analysis, on the other hand, is based on explaining variance, which is only defined for metric data.
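A small sketch in base R illustrating the point above. The binary purchase indicators below are hypothetical; a Jaccard-type distance matrix computed from them can be passed straight to MDS and to hierarchical clustering.

# Hypothetical binary indicators for 6 customers and 4 products
X <- rbind(c(1, 0, 1, 1),
           c(1, 0, 0, 1),
           c(0, 1, 1, 0),
           c(0, 1, 0, 0),
           c(1, 1, 1, 0),
           c(0, 0, 1, 1))

d <- dist(X, method = "binary")   # Jaccard-type dissimilarity in base R

cmdscale(d, k = 2)                # multidimensional scaling on the distance matrix
hclust(d, method = "average")     # hierarchical clustering on the same matrix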
5. Consider the biplot below which comes from a Principal Components Analysis with four variables: A, B, C and D. Which of the following statements is most likely to be correct?
(a) The correlation between variable A and variable B is close to 0. The correlation between variable C and variable D is close to -1.
(b) The correlation between variable A and variable B is close to -1. The correlation between variable C and variable D is close to 0.
(c) The correlation between variable A and variable B is close to 1. The correlation between variable C and variable D is close to -1.
(d) The correlation between variable A and variable B is close to 1. The correlation between variable C and variable D is close to 0.
[Figure: biplot of the first two principal components, with arrows for the four variables A, B, C and D and points for the individual observations; axis scales and observation labels omitted.]
The answer is (d). In a biplot, vectors pointing in the same direction are indicative of a strong positive association (correlation close to 1), vectors pointing in the opposite direction are indicative of a strong negative association (correlation close to -1) while vectors at right angles to one another are indicative of a weak association (correlation close to 0).
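A constructed example in base R (simulated data, not the exam's dataset) that reproduces this geometry. A and B are built to be almost perfectly correlated, and C and D to be uncorrelated; D is tied to the A/B direction so that two components capture nearly all the variance and the biplot angles are a faithful guide to the correlations.

set.seed(1)
n  <- 200
f1 <- rnorm(n); f2 <- rnorm(n)
A <- f1 + rnorm(n, sd = 0.1)
B <- f1 + rnorm(n, sd = 0.1)       # corr(A, B) close to 1
C <- f2 + rnorm(n, sd = 0.1)
D <- f1 + rnorm(n, sd = 0.1)       # corr(C, D) close to 0

cor(cbind(A, B, C, D))             # check the correlations
biplot(prcomp(cbind(A, B, C, D), scale. = TRUE))
# A, B and D arrows nearly coincide; the C arrow sits roughly at right angles to them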
Part B Exploratory Factor Analysis (10 Marks)
Let y be a p × 1 vector of random variables where E(yi) = 0 for all i = 1,2,…,p. Assume that y follows the factor model:
y = Bf + ε,    (1)
where
• f is an r × 1 vector of random, unobservable common factors,
• B is a fixed p × r matrix of factor loadings,
• ε is a p × 1 vector of unobserved idiosyncratic errors.
Also assume
A.1 Each common factor has a mean of 0 and variance of 1 (i.e. E(fj) = 0 and Var(fj) = E(fj²) = 1 for all j = 1, 2, …, r),
A.2 Each idiosyncratic error has a mean of 0 and a variance of ψi (i.e. E(εi) = 0 and Var(εi) = E(εi²) = ψi for all i = 1, 2, …, p),
A.3 All common factors and idiosyncratic errors are uncorrelated with one another (i.e. Cov(fj,εi) = 0 for all
j = 1,…,r and i = 1,…,p).
Please answer the following questions. Useful rules of matrix algebra can be found in Appendix I.
1. Explain how E(yy′) gives the variance covariance matrix of y.
Since the data have mean zero, all variances are given by E(yi²) and all covariances are given by E(yiyj). Using matrix multiplication we can see that the variances sit on the diagonal of E(yy′) and the covariances sit on the off-diagonals of E(yy′).
E(yy′) is the expectation of the outer product of the p × 1 vector y with itself, i.e. the p × p matrix whose (i, j) entry is yiyj. Taking expectations element by element gives

           | E(y1²)     E(y1y2)    …   E(y1yp) |
E(yy′)  =  | E(y2y1)    E(y2²)     …   E(y2yp) |
           |    ⋮          ⋮        ⋱     ⋮     |
           | E(ypy1)    E(ypy2)    …   E(yp²)  |

           | var(y1)       cov(y1, y2)   …   cov(y1, yp) |
        =  | cov(y2, y1)   var(y2)       …   cov(y2, yp) |
           |    ⋮              ⋮          ⋱       ⋮       |
           | cov(yp, y1)   cov(yp, y2)   …   var(yp)     |

The key to this question is to understand how matrix multiplication works and the structure of a variance covariance matrix.
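A numerical analogue in base R (simulated, mean-centred data, not part of the exam solution): the sample counterpart of E(yy′), namely (1/n) Y′Y, reproduces the sample variance-covariance matrix up to the divisor n versus n − 1.

set.seed(1)
n <- 5000
Y <- scale(matrix(rnorm(n * 3), ncol = 3), center = TRUE, scale = FALSE)  # mean-zero data

crossprod(Y) / n          # sample analogue of E(yy')
cov(Y) * (n - 1) / n      # sample variance-covariance matrix with divisor n: identical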
2. Using the assumptions of the factor model, show that the variance covariance matrix of y is equivalent to BB′ + Ψ where

        | β11  β12  …  β1r |                | ψ1  0   …  0  |
    B = | β21  β22  …  β2r |     and    Ψ = | 0   ψ2  …  0  |        (2)
        |  ⋮    ⋮   ⋱   ⋮  |                | ⋮    ⋮   ⋱  ⋮  |
        | βp1  βp2  …  βpr |                | 0   0   …  ψp |

Grinding through the math:

E(yy′) = E((Bf + ε)(Bf + ε)′)                       Substitute y = Bf + ε
       = E((Bf + ε)((Bf)′ + ε′))                    Take transpose inside brackets
       = E((Bf + ε)(f′B′ + ε′))                     Transpose of a product, so switch the order
       = E(Bff′B′ + Bfε′ + εf′B′ + εε′)             Expand
       = E(Bff′B′) + E(Bfε′) + E(εf′B′) + E(εε′)    Expected value of each part
       = BE(ff′)B′ + BE(fε′) + E(εf′)B′ + E(εε′)    Fixed terms come out of the expected value
       = BB′ + Ψ
The term E(ff′) is simply the variance covariance matrix of the factors. Since the factors have variance 1 and are uncorrelated, this is the identity matrix, so BE(ff′)B′ reduces to BB′. The term E(εε′) is the variance covariance matrix of the idiosyncratic errors, which have variances given by ψi and are uncorrelated, so this term can be replaced by Ψ. Finally, since the factors and errors are uncorrelated, both E(εf′) and E(fε′) are matrices of zeros and drop out.
This was the more difficult question on this exam. Partial marks were awarded leniently for students who were on the right track but did not quite get the right answer.
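A numeric illustration in R of Σ = BB′ + Ψ using simulated data (the loadings, number of factors and sample size below are arbitrary choices, not from the exam): fitting the model with factanal and checking that the fitted loadings and uniquenesses approximately reproduce the correlation matrix of the observed variables.

set.seed(1)
n <- 500
f <- matrix(rnorm(n * 2), ncol = 2)                     # r = 2 common factors
B <- matrix(c(0.8, 0.7, 0.6, 0, 0, 0,
              0,   0,   0,   0.8, 0.7, 0.6), ncol = 2)  # p = 6 loadings
y <- f %*% t(B) + matrix(rnorm(n * 6, sd = 0.5), ncol = 6)

fit     <- factanal(y, factors = 2)
implied <- fit$loadings %*% t(fit$loadings) + diag(fit$uniquenesses)  # BB' + Psi

round(implied - cor(y), 2)   # close to zero: the factor model reproduces the correlations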
3. Suppose the factors f are rotated to a new set of factors f̃ using the rotation matrix G, so that f̃ = Gf.

(a) Find the variance covariance matrix of f̃ where G is a matrix that performs an orthogonal rotation. Are the factors f̃ correlated with one another?

Var-Cov(f̃) = E(f̃f̃′)         From the answer to question 1
           = E(Gf(Gf)′)       Substitution
           = E(Gff′G′)        Transpose rule
           = GE(ff′)G′        Rotation matrix is fixed
           = GG′              Since E(ff′) = I, for the same reason as above

Since G is an orthogonal rotation matrix, G′ = G⁻¹, so GG′ = I and Var-Cov(f̃) = I. This implies the rotated factors are uncorrelated.

(b) How does your answer to the previous question change if G is a matrix that performs an oblique rotation?

For an oblique rotation G′ ≠ G⁻¹, so GG′ need not equal the identity matrix, and as a result the rotated factors can be correlated.
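A minimal numeric illustration in R of parts (a) and (b): for an orthogonal rotation matrix GG′ = I, so the rotated factors stay uncorrelated, whereas for a non-orthogonal (oblique) transformation GG′ is not the identity. The particular matrices below are arbitrary examples chosen for the sketch.

theta  <- pi / 6
G_orth <- rbind(c(cos(theta), -sin(theta)),
                c(sin(theta),  cos(theta)))   # orthogonal rotation
G_obl  <- rbind(c(1.0, 0.4),
                c(0.0, 1.0))                   # not orthogonal (oblique transformation)

round(G_orth %*% t(G_orth), 10)   # identity: rotated factors remain uncorrelated
G_obl %*% t(G_obl)                # not the identity: rotated factors are correlated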
Part C Correspondence Analysis (10 Marks)
Correspondence Analysis was carried out using the Laundry dataset covered during the unit. Each of the 332 observations corresponds to a different customer. The first variable is the brand of laundry detergent purchased and has 11 levels (Omo, Radiant, Surf, R.M. Gow, Drive, Other Unilever, Spree, Fab, Dynamo, Cold Power, Bushland). The second variable is employment status of the customer and has 6 levels (Full Time, Part Time, Home Duties, Student, Retired, Unemployed). Below a plot is provided with employment status represented by blue circles and brands represented by red triangles. The R output obtained from running the summary function after carrying out correspondence analysis is provided below the plot.
[Figure: correspondence analysis map, Dimension 1 (44.2%) against Dimension 2 (22.3%), with employment statuses (e.g. Unemployed, Home duties) plotted as blue circles and brands (e.g. Other Unilever, Bushland, Cold Power) plotted as red triangles.]

Principal inertias (eigenvalues):

 dim    value      %    cum%   scree plot
 1      0.192557   44.2  44.2  ∗∗∗∗∗∗∗∗∗∗∗
 2      0.097339   22.3  66.5  ∗∗∗∗∗∗
 3      0.065014   14.9  81.5  ∗∗∗∗
 4      0.062967   14.5  95.9  ∗∗∗∗
 5      0.017858    4.1 100.0  ∗
        --------  -----
 Total: 0.435735  100.0
Please answer the following questions:
1. Name two brands closely associated with one another.
Any two brands that are close on the map, for example Fab and R.M. Gow.
2. Name a brand closely associated with Students.
Cold Power
3. Name an employment status closely associated with Spree.
4. Explain what is meant by the concept of inertia.
Inertia is closely related to the chi-square test for independence, with inertia equal to the test statistic from this test divided by the sample size. The formula for inertia is

    inertia = Σr Σc (orc − erc)² / erc
where r and c denote the rows and columns of the cross tab, o denotes observed joint probabilities and e denotes expected joint probabilities under the assumption of independence. Values of inertia close to 0 imply that the observed and expected probabilities are close to one another and that the dependence between the two categorical variables is weak, while large values for inertia imply strong dependence between the two categorical variables.
Correspondence analysis is based on the eigenvalue decomposition of the matrices MM′ and M′M, where the elements of M are given by

    mrc = (orc − erc) / √erc

The sum of these eigenvalues is given by tr(MM′) = tr(M′M), which can be shown to be equivalent to the inertia. In this way the largest eigenvalue explains the largest proportion of inertia, the second largest eigenvalue explains the second largest proportion of inertia, and so on. By reporting the proportion of inertia explained by a two- or three-dimensional plot, a researcher is able to evaluate the quality of a correspondence analysis.
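A sketch in base R using a hypothetical cross-tab (not the Laundry data), verifying that the inertia equals the chi-square statistic divided by n and also the sum of the eigenvalues of M′M.

tab <- matrix(c(20, 30, 10,
                25, 10, 35), nrow = 3)      # hypothetical counts
n   <- sum(tab)

o <- tab / n                                # observed joint proportions
e <- outer(rowSums(o), colSums(o))          # expected proportions under independence

inertia <- sum((o - e)^2 / e)
M       <- (o - e) / sqrt(e)

c(inertia,
  chisq.test(tab)$statistic / n,            # chi-square statistic divided by n
  sum(eigen(t(M) %*% M)$values))            # trace of M'M = sum of its eigenvalues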
5. What proportion of inertia is explained by the fourth dimension of the correspondence analysis on its own?
14.5%, the value in the % column for the fourth dimension of the output above.
6. How much inertia is explained by the plot?
Since the plot is 2-dimensional, it is the cumulative percentage of the first two dimensions, which is 66.5%.
Part D Discriminant Analysis (10 Marks)
Write a half to one page description of Discriminant Analysis. Some issues that you may choose to discuss are:
• The objectives of discriminant analysis.
• The types of data (e.g. metric/non-metric etc.) that can be used for Discriminant Analysis.
• Examples of problems in business, marketing or any other discipline in statistics that can be solved using discriminant analysis.
• The difference between classification via the Maximum Likelihood rule and Bayes’ rule and the circumstances under which it is appropriate to use each rule.
• Fisher’s linear discriminant and its relationship to Bayes’ rule for classification.
• The difference between linear discriminant analysis and quadratic discriminant analysis, and the circum-
stances under which it is appropriate to use each method.
• A discussion of multiclass discriminant analysis.
• Ways to validate discriminant analysis.
• Anything else that may be relevant.
(10 Marks)
Students’ answers were ranked relative to one another. As a rough guide, students received a mark for each correct fact about Discriminant Analysis, and if the explanation was particularly good they received two marks. The answer below is of a much higher standard than can be produced under exam conditions, but is indicative of how the question could be answered.
The objective of Discriminant Analysis is to develop a mathematical rule that can be used to predict the group membership of different observations. For example, customers may belong to one of two groups such as purchasers and non-purchasers, and it may be possible to predict the group that they belong to using demographic information such as age, gender and income. The group membership can be called the dependent variable while the predictors are often called independent variables. Typically in discriminant analysis there is a subset of data (sometimes called training data) for which both the dependent and independent variables are observed, while for the remaining data only the independent variables are observed. The dependent variable in discriminant analysis is non-metric (usually binary) while the independent variables can be either metric or non-metric; however, some methods in Discriminant Analysis assume multivariate normality of the independent variables.
Under both the Maximum Likelihood rule and Bayes’ rule, the independent variables are assumed to be normally distributed, but with means and in some cases variances that vary conditional on group membership. Let p(x|y = 0) be the density of the independent variables for the first group and p(x|y = 1) be the density of the independent variables for the second group. The means and variances of these densities can be estimated using training data. If the aim is to predict the group membership of x∗, then under the Maximum Likelihood rule we classify this observation into the first group if p(x∗|y = 0) > p(x∗|y = 1) and into the second group otherwise. Under Bayes’ rule for classification we classify the observation into the first group if p(y = 0|x∗) > p(y = 1|x∗) and into the second group otherwise. Here p(y = 0|x) and p(y = 1|x) are found using Bayes’ theorem. This requires the specification of the unconditional probabilities Pr(y = 1) and Pr(y = 0), which can be estimated using the sample proportions of the dependent variable.
If the densities are correctly specified then the Bayes rule for classification is optimal in the sense that it minimises misclassification error. The maximum likelihood rule is not optimal in this sense, unless Pr(y = 1) = Pr(y = 0) = 0.5, in which case it is equivalent to Bayes’ rule. This condition is approximately satisfied if the groups are roughly equal in size.
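A sketch using MASS::lda on simulated two-group data (group sizes and means chosen arbitrarily for illustration): the prior argument is what separates the two rules, since the default priors are the sample proportions (Bayes’ rule), while forcing equal priors corresponds to the maximum likelihood rule.

library(MASS)

set.seed(1)
n1 <- 150; n2 <- 50                                   # deliberately unequal group sizes
x  <- rbind(matrix(rnorm(n1 * 2, mean = 0),   ncol = 2),
            matrix(rnorm(n2 * 2, mean = 1.5), ncol = 2))
dat <- data.frame(y = factor(rep(c(0, 1), c(n1, n2))), x1 = x[, 1], x2 = x[, 2])

fit_bayes <- lda(y ~ x1 + x2, data = dat)                       # priors = sample proportions (Bayes' rule)
fit_ml    <- lda(y ~ x1 + x2, data = dat, prior = c(0.5, 0.5))  # equal priors (maximum likelihood rule)

# The two rules can classify borderline observations differently
table(Bayes = predict(fit_bayes)$class, ML = predict(fit_ml)$class)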
Fisher’s linear discriminant on the other hand does not make any assumptions about normality of the independent variables. Instead it attempts to find a linear classifier that best ‘separates’ the two groups. Let di = w′xi be the discriminant evaluated for the ith observation and suppose this is computed for all observations. Let t∗ be the test statistic from a two sample t-test comparing the discriminants corresponding to y = 0 with the discriminants corresponding to y = 1. Fisher’s discriminant selects the weights w that maximise t∗.
For binary classification under the assumption of equal variance covariance matrices across the two groups, Fisher’s discriminant and the Bayes’ rule for classification will be equivalent, implying that the Bayes’ rule gives a linear classifier. Both of these cases are referred to as ‘Linear Discriminant Analysis’. If Bayes’ rule is used under the assumption of unequal variance covariance matrices across the groups, this leads to a quadratic classifier and is known as Quadratic Discriminant Analysis. Although all Discriminant Analysis methods generalise to more than two groups, Fisher’s discriminant analysis will no longer coincide with the Bayes’ rule for classification when there are more than two classes.
To validate discriminant analysis, one should first check whether any assumptions made, such as normality and equal variance-covariance matrices, are valid (the latter can be done using a Box M test). Beyond that, it is often a good idea to validate the performance of discriminant analysis by the method of cross-validation. Here one observation is left out of the training data, a discriminant is found using the remaining observations and then used to predict the observation that had been left out. This is repeated for all observations.
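A sketch of this validation step in R: MASS::lda with CV = TRUE performs exactly this leave-one-out cross-validation, and a quadratic fit is included for comparison. The data below are simulated and purely illustrative.

library(MASS)

set.seed(1)
x   <- rbind(matrix(rnorm(200, mean = 0),   ncol = 2),
             matrix(rnorm(200, mean = 1.5), ncol = 2))
dat <- data.frame(y = factor(rep(c(0, 1), each = 100)), x1 = x[, 1], x2 = x[, 2])

cv_lda <- lda(y ~ x1 + x2, data = dat, CV = TRUE)    # leave-one-out cross-validation
table(actual = dat$y, predicted = cv_lda$class)      # cross-validated confusion matrix
mean(cv_lda$class == dat$y)                          # cross-validated hit rate

fit_qda <- qda(y ~ x1 + x2, data = dat)              # QDA: group-specific covariance matrices
mean(predict(fit_qda)$class == dat$y)                # in-sample hit rate (optimistic)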
Part E Structural Equation Modelling (10 Marks)
Consider a study where the objective is to understand how the perceived Ease of using Wikipedia and the perceived Quality of Wikipedia articles influence the Usage of Wikipedia. The variables Ease, Quality, Usage are unobservable latent factors that cannot be measured directly. Instead nine survey questions (denoted Q1-Q9) are developed to measure these latent factors. The following Structural Equation Model is proposed:
Measurement Equations

Q1 = λ1 Ease + ξ1        (3)
Q2 = λ2 Ease + ξ2        (4)
Q3 = λ3 Ease + ξ3        (5)
Q4 = λ4 Quality + ξ4     (6)
Q5 = λ5 Quality + ξ5     (7)

Structural Equation

Non-zero correlations
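Purely for illustration (and not part of the exam solution), a hedged sketch of how a measurement model of this kind could be written in the R package lavaan. The assignment of Q6-Q9 to Quality and Usage and the form of the structural equation are assumptions made for the sketch only, since only the Ease items and part of the Quality block appear above, and wiki_survey is a hypothetical data frame.

library(lavaan)

model <- '
  # measurement equations (loadings lambda estimated freely)
  Ease    =~ Q1 + Q2 + Q3
  Quality =~ Q4 + Q5 + Q6      # assumed split of the remaining items
  Usage   =~ Q7 + Q8 + Q9      # assumed split of the remaining items

  # structural equation (assumed form)
  Usage ~ Ease + Quality
'
# fit <- sem(model, data = wiki_survey)   # wiki_survey: hypothetical survey data frame
# summary(fit, standardized = TRUE)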