ETF3500/ETF5500 High Dimensional Data Analysis (R)
Instructions to Students
Please answer ALL questions
General Comments: Most people received either a credit, distinction or high distinction for their overall grade. This is largely down to high assignment marks, but marks in the exam were also quite good. Well done to all of you; you were a strong class this year.
Part A: Multiple Choice (10 Marks)
General Comments: Some people failed to give justifications. Sorry, but the instructions were clear and I also told you about this in lectures.
The following questions are all multiple choice. There is one correct choice per question. Please do not answer on the question sheet but write your answer in the booklet provided. For each question you will receive one mark for the correct choice. You will receive one additional mark for a brief justification of your answer. If you fail to provide any justification you will receive zero for the entire question (even if you have selected the right choice).
1. Let X1, X2, . . . , Xp be p variables and let C1, C2, . . . , Cp be their principal components. Also assume that the covariance between X1 and X2 is positive. Which of the following statements is true?
(a) The variance of the first principal component (Var(C1)) is larger than the variance of the first variable (Var(X1)). Also, the correlation between the first two principal components (Cor(C1,C2)) is larger than the correlation between the first two variables (Cor(X1,X2)).
(b) The variance of the first principal component (Var(C1)) is smaller than the variance of the first variable (Var(X1)). Also, the correlation between the first two principal components (Cor(C1,C2)) is larger than the correlation between the first two variables (Cor(X1,X2)).
(c) The variance of the first principal component (Var(C1)) is smaller than the variance of the first variable (Var(X1)). Also, the correlation between the first two principal components (Cor(C1,C2)) is smaller than the correlation between the first two variables (Cor(X1,X2)).
(d) The variance of the first principal component (Var(C1)) is larger than the variance of the first variable (Var(X1)). Also, the correlation between the first two principal components (Cor(C1,C2)) is smaller than the correlation between the first two variables (Cor(X1,X2)).
The first principal component is the linear combination with the largest possible variance, so it must have a larger variance than the first variable (the fact that the covariance between X1 and X2 is positive rules out the possibility that the first PC equals the first variable). PCs are uncorrelated, meaning that the correlation between PC1 and PC2 is 0 and is thus smaller than the correlation between X1 and X2 (which is positive, since their covariance is assumed positive). The answer is D.
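Both facts can be checked numerically. Below is a minimal R sketch (simulated data, not part of the original answer) illustrating them:

set.seed(1)
x1 <- rnorm(500)
x2 <- 0.6 * x1 + rnorm(500)          # gives Cov(x1, x2) > 0, as in the question
x3 <- rnorm(500)
pc <- prcomp(cbind(x1, x2, x3))      # PCA on the covariance matrix (no scaling)
var(pc$x[, 1]) > var(x1)             # TRUE: Var(C1) exceeds Var(X1)
cor(pc$x[, 1], pc$x[, 2])            # numerically zero: PCs are uncorrelated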
2. Let F be an n × m matrix with n ̸= m. Also let F′ be the transpose of F. Which of the following matrix multiplications are conformable (i.e. possible)?
(a) FF
(b) F′F
(c) F′F, but only if F is a rotation matrix
(d) None of the above.
Matrix multiplication is possible when the number of columns from the first matrix is equal to the number of rows in the second matrix. Since F′ is m×n and F is n×m, F′F satisfies this condition. There is no requirement for F to be a rotation matrix. The answer is B.
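This is easy to verify in R. A short sketch (with hypothetical dimensions n = 4 and m = 2; the matrix is named Fm rather than F to avoid masking R's built-in shorthand for FALSE):

Fm <- matrix(rnorm(8), nrow = 4, ncol = 2)   # an n x m matrix with n = 4, m = 2
dim(t(Fm) %*% Fm)                            # 2 2: F'F is m x m, so conformable
# Fm %*% Fm would throw "non-conformable arguments" since m != n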
3. Consider a three-cluster solution (with clusters labelled A, B and C) and a two-cluster solution (with clusters labelled X and Y). Both solutions are from the same algorithm. Let i and j be two observations that both belong to Cluster A in the three-cluster solution. Which of the following statements is true?
(a) If i belongs to cluster X then j is guaranteed to be in cluster Y but only if hierarchical clustering has been used.
(b) If i belongs to cluster X then j is guaranteed to be in cluster Y but only if non-hierarchical clustering has been used.
(c) If i belongs to cluster X then j is guaranteed to be in cluster X but only if hierarchical clustering has been used.
(d) If i belongs to cluster X then j is guaranteed to be in cluster X but only if non-hierarchical clustering has been used.
In hierarchical clustering entire clusters are merged at each stage. If i belongs to cluster X this implies that either A has been merged with another cluster to form cluster X or A is cluster X. In either case j stays with i. This is not guaranteed in non-hierarchical clustering. The answer is C.
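A small R sketch (using the built-in iris data purely for illustration) showing this nesting property:

hc    <- hclust(dist(iris[, 1:4]))   # hierarchical clustering on any numeric data
three <- cutree(hc, k = 3)           # three-cluster solution (A, B, C)
two   <- cutree(hc, k = 2)           # two-cluster solution (X, Y)
table(three, two)                    # every row has a single non-zero column:
                                     # each 3-cluster group sits inside one 2-cluster group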
4. Consider three clustering solutions, Solution 1, Solution 2 and Solution 3. Solution 1 is obtained by randomly allocating each observation into one of four groups. Solution 2 is obtained by k-means clustering with k = 4. Solution 3 is identical to Solution 1. Which of the following is true?
(a) The (unadjusted) Rand index between Solution 1 and Solution 2 is greater than or equal to 0. The adjusted Rand index between Solution 1 and Solution 3 is equal to 0.
(b) The (unadjusted) Rand index between Solution 1 and Solution 2 is greater than or equal to 0. The adjusted Rand index between Solution 1 and Solution 3 is equal to 1.
(c) There is insufficient information to determine whether the (unadjusted) Rand index between Solution 1 and Solution 2 is greater than or equal to 0. The adjusted Rand index between Solution 1 and Solution 3 is equal to 0.
(d) There is insufficient information to determine whether the (unadjusted) Rand index between Solution 1 and Solution 2 is greater than or equal to 0. The adjusted Rand index between Solution 1 and Solution 3 is equal to 1.
The (unadjusted) Rand index is always between 0 and 1; it can be interpreted as a probability. Therefore the Rand index between Solution 1 and Solution 2 (or indeed any two solutions) must be greater than or equal to 0. The adjusted Rand index (as well as the unadjusted Rand index) is 1 for two identical solutions (even if these were drawn randomly). The answer is B.
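A short R sketch of this (simulated solutions; assumes the mclust package is installed for its adjustedRandIndex function):

library(mclust)                                   # for adjustedRandIndex()
set.seed(1)
sol1 <- sample(1:4, 100, replace = TRUE)          # Solution 1: random allocation
sol2 <- kmeans(matrix(rnorm(200), ncol = 2), 4)$cluster   # Solution 2: k-means, k = 4
sol3 <- sol1                                      # Solution 3: identical to Solution 1
adjustedRandIndex(sol1, sol3)                     # exactly 1 for identical partitions
adjustedRandIndex(sol1, sol2)                     # close to 0 for unrelated partitions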
5. Which of the following statements is true?
(a) Distances can be defined for metric data but not for non-metric data
(b) Distances can be defined for non-metric data but not for metric data
(c) Distances can be defined for both non-metric data and metric data
(d) Only Euclidean distance can be defined for both non-metric data and metric data
Euclidean distance is just one example of a distance that can be used for metric data, while Jaccard distance is an example of a distance metric for non-metric data. It is thus incorrect to say that only Euclidean distance applies to both and the correct answer is therefore C.
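A brief R illustration of both cases using the built-in dist() function (toy data):

metric_data <- matrix(rnorm(20), nrow = 5)
dist(metric_data, method = "euclidean")          # a distance for metric data
dist(metric_data, method = "manhattan")          # another metric-data distance
binary_data <- matrix(rbinom(20, 1, 0.5), nrow = 5)
dist(binary_data, method = "binary")             # Jaccard distance for non-metric (binary) data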
Part B: Multivariate Analysis of Variance (MANOVA) (10 Marks)
The easiest question on the paper. Most people answered the first four parts correctly; the final part was answered less well. Some of you interpreted this as the difference between MANOVA and ANOVA, the former being for testing the difference in means across multiple variables. This is not wrong and you were given some marks for answers along these lines. However the question specifically refers to Roy's greatest root, so those who provided more information, as in the answer below, got full marks.
Consider a company that sells two major products, shoes and apparel (two variables). Suppose this company trials one of three different marketing strategies in each store (three groups) and would like to use MANOVA to see if any of these marketing strategies is effective.
Let μ1 denote the expected sales of shoes and apparel (in a vector) for Group 1. Let μ2 and μ3 be defined similarly for Group 2 and Group 3 respectively. Also let Σ1, Σ2 and Σ3 denote the variance covariance matrices for Group 1, Group 2 and Group 3 respectively. To answer questions 1 and 2, let

$$\mu_1 = \begin{pmatrix} 2 \\ 2 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix},\; \mu_3 = \begin{pmatrix} 1 \\ 1 \end{pmatrix},\; \Sigma_1 = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix},\; \Sigma_2 = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix},\; \Sigma_3 = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}$$
1. Is the usual null hypothesis in MANOVA violated here? Explain. (2 Marks)
The usual null hypothesis in MANOVA is that all means are equal. Although μ2 = μ3, μ1 is not equal to these two. The null is violated.
2. Is the assumption of the homogeneity of variance covariance matrices violated here? Explain. (2 Marks)
The assumption is that all variance covariance matrices are equal. This is the case here; it is not relevant that the covariances do not equal the variances, or that the two variances differ from one another. The assumption is satisfied.
To answer questions 3 and 4, let

$$\mu_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 1 \\ 2 \end{pmatrix},\; \mu_3 = \begin{pmatrix} 1 \\ 2 \end{pmatrix},\; \Sigma_1 = \begin{pmatrix} 4 & 2 \\ 2 & 4 \end{pmatrix},\; \Sigma_2 = \begin{pmatrix} 4 & 2 \\ 2 & 4 \end{pmatrix},\; \Sigma_3 = \begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix}$$
3. Is the usual null hypothesis in MANOVA violated here? Explain. (2 Marks)
The usual null hypothesis in MANOVA is that all means are equal. This is the case here – it is irrelevant that expected shoes sales are different to expected sales of apparel. The null is not violated.
4. Is the assumption of the homogeneity of variance covariance matrices violated here? Explain. (2 Marks)
The assumption is that all variance covariance matrices are equal. This is not the case here: the covariance for Group 3 is different from that of Group 1 and Group 2. The assumption is not satisfied.
5. Explain the connection between MANOVA based on Roy's greatest root test statistic and the F-statistic used in univariate ANOVA. (2 Marks)
Suppose we take a weighted linear combination of all variables (in the example, shoes and apparel). This results in a single variable. The F-statistic used in univariate ANOVA can be used to test for significant differences between the groups on this linear combination. Now suppose that the weights are chosen so that this F-statistic is as large as possible. The resulting maximised statistic is Roy's greatest root. The distribution of this statistic will not, however, be the usual F-distribution.
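A minimal R sketch of such a test (invented data whose group means mirror Questions 1 and 2, with variances simplified to 1; the variable names are illustrative, not from the exam):

set.seed(1)
strategy <- factor(rep(1:3, each = 30))                 # three marketing strategies
shoes    <- rnorm(90, mean = c(2, 1, 1)[strategy])      # group means as in Questions 1-2
apparel  <- rnorm(90, mean = c(2, 1, 1)[strategy])
fit <- manova(cbind(shoes, apparel) ~ strategy)
summary(fit, test = "Roy")     # Roy's greatest root (reported with an F approximation)
summary.aov(fit)               # the two univariate ANOVA F-tests, for comparison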
Part C: Correspondence Analysis (CA) (10 Marks)
General comments: Again a question where most students did well, particularly on Questions 1–4.
Consider the following outputs from a correspondence analysis (Example A), which are needed to answer Questions 1, 2, 3 and 4. Labels on the plot are not needed to answer these questions and are omitted.
Cross Tab (Example A): a 26 × 12 table of letter counts (rows a to z) across twelve book samples, with columns labelled by book and author: thrdg(bck), drf(mchnr), lstw(clrk), estwn(bck), fta(hmngw), saf7(flkn), saf6(flkn), prof(clrk), is(hmngwy), pndr3(hlt), as(michnr), pndr2(hlt). [Full counts omitted.]
Figure 1: Representation of Correspondence Analysis for Example A. Axes: Dimension 1 (40.9%) and Dimension 2 (19.7%).
Summary output (Example A):
## Principal inertias (eigenvalues):
##  dim    value      %   cum%  scree plot
##   1   0.007664   40.9   40.9  **********
##   2   0.003688   19.7   60.6  *****
##   3   0.002411   12.9   73.5  ***
##   4   0.001383    7.4   80.8  **
##   5   0.001002    5.3   86.2  *
##   6   0.000723    3.9   90.1  *
##   7   0.000659    3.5   93.6  *
##   8   0.000455    2.4   96.0  *
##   9   0.000374    2.0   98.0
##  10   0.000263    1.4   99.4
##  11   0.000113    0.6  100.0
##      --------  -----
## Total: 0.018735 100.0
1. What does the value 0.018735 represent (look near the bottom of the summary output)? What does it measure? (2 Marks)
The value 0.018735 is the total inertia. It measures the strength of dependence between the two categorical variables in the analysis. More precisely it measures the discrepancy between the observed joint probabilities in a cross tab and the expected probabilities under independence.
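A small R sketch of this connection (the cross tab below is invented for illustration; the last line assumes the ca package is installed): total inertia is the Pearson chi-squared statistic divided by the grand total of the table.

tab <- matrix(c(20, 10,  5,
                15, 25, 10), nrow = 2, byrow = TRUE)   # toy cross tab (invented counts)
chisq.test(tab)$statistic / sum(tab)                   # total inertia = chi-squared / n
# library(ca); summary(ca(tab))                        # reports the same value as "Total" inertia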
2. What is the proportion of inertia explained by the third dimension on its own? (1 Mark)
From the output this is 12.9%.
3. What is the proportion of inertia explained by the first four dimensions all together? (1 Mark)
From the output this is 80.8%
Now consider the following outputs for a different dataset (Example B) which are needed to answer Question 4. Labels on the plot are not needed to answer the question and are omitted.
Cross Tab (Example B): a two-way table of counts with six column categories (labelled 1 to 6). [Full counts omitted.]
Figure 2: Representation of Correspondence Analysis for Example B. Axes: Dimension 1 (63.2%) and Dimension 2 (34.6%).
Summary output (Example B):
## Principal inertias (eigenvalues):
##  dim    value      %   cum%  scree plot
##   1   0.156941   63.2   63.2  ****************
##   2   0.085922   34.6   97.8  *********
##   3   0.004850    2.0   99.7
##   4   0.000652    0.3  100.0
##   5   8.7e-05     0.0  100.0
##      --------  -----
## Total: 0.248452 100.0
4. Is the plot in Figure 1 a more accurate representation of the cross tab in Example A or is the plot in Figure 2 a more accurate representation of the cross tab in Example B? Justify your answer. (2 Marks)
The quality of a CA representation can be measured by the proportion of explained inertia. Since these are two-dimensional plots, we are interested in the proportion of inertia explained by the first two dimensions together. For Example A this is 60.6%; for Example B it is 97.8%. Therefore Figure 2 is a more accurate representation of Cross Tab B than Figure 1 is of Cross Tab A.
5. Describe how Correspondence Analysis can be used for text data with an example from business, marketing or any other applied discipline. (2 Marks)
Consider the case where we have online reviews of different products (e.g. smartphones). The brands of phone (e.g. Apple, Samsung, HTC) are one categorical variable while the words used in a review (e.g. thin, screen, fast, apps) are another categorical variable. A cross tab of word frequencies can be constructed and correspondence analysis can be carried out.
This analysis allows for three comparisons. Phones can be compared to phones, phones can be compared to
words and words can be compared to words. In all cases proximity on the CA representation is indicative of a strong association.
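A minimal R sketch of such an analysis (the brand-by-word counts below are invented purely for illustration; assumes the ca package is installed):

library(ca)
words <- matrix(c(40, 12,  8, 30,
                  22, 35, 15, 10,
                   9, 18, 40,  6), nrow = 3, byrow = TRUE,
                dimnames = list(c("Apple", "Samsung", "HTC"),
                                c("thin", "screen", "fast", "apps")))
plot(ca(words))   # brands plot near words with which they are strongly associated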
6. Name two shortcomings (i.e. limitations) of the approach you discussed in your answer to Question 5 above. (2 Marks)
A shortcoming of this approach is that combinations of words, e.g. "large screen", may be more important and meaningful than single words. If a large number of words are considered, the plot may be cluttered and difficult to read. Associations between words may be spurious and do not imply causation. Also, any dimension reduction technique necessarily involves information loss.
Part D: Multidimensional Scaling (10 Marks)
General comments: Given that marks were generally high, I graded this question somewhat strictly. For each distinct point made (either the ones below or something you thought of yourselves) I gave a full mark if the point was made in a way that showed the student clearly understood the concepts, and half a mark for explanations that were mostly correct but contained some errors. A common misunderstanding is that many of you thought that non-metric multidimensional scaling applies to non-metric data. In fact non-metric multidimensional scaling is used when the distances between data points are non-metric. Apart from that I would say that most of you have a very sound understanding of MDS.
Write a half to one page description of Multidimensional Scaling (MDS). Some issues that you should discuss are:
• The objective of MDS.
• The types of data (e.g. metric/non-metric etc.) that can be used for MDS.
• The role that distance plays in MDS.
• Examples of problems in business, marketing or any other discipline in statistics that can be solved
using MDS.
• The criterion minimised or maximised by Classical MDS
• Validating the quality of a Classical MDS Solution.
• The circumstances in which Classical MDS is similar to Principal Components Analysis (PCA).
• The difference between Classical MDS and the Sammon Mapping.
• The difference between metric and non-metric MDS.
• Anything else that you may think is relevant.
(10 Marks)
The objective of MDS is to represent the distances between multivariate observations in a low-dimensional (usually 2-dimensional) plot. MDS can be used for any data as long as distances between observations can be computed. For instance, if data are metric then Euclidean or Manhattan distance can be used; if they are non-metric then Jaccard distance can be used. In fact, distances can even be elicited directly from a survey: customers may be asked to rate how similar or dissimilar products are to one another, or may be asked to group products in such a way that a distance metric can be derived from these groupings.
One potential application of MDS in marketing may involve data where each observation is a brand, and multiple attributes of the brand (e.g. price, physical characteristics, availability) are variables. In this case MDS can create a perceptual map showing which brands have similar attributes and which brands have differing attributes. This can be used to identify competitors or gaps in the market. As a caveat, it is important to note that a "gap" identified by MDS may exist simply because that combination of attributes is not desirable to customers.
Let δij represent the distance between observation i and observation j in the original data and let dij represent the distance between observation i and observation j in the lower-dimensional MDS solution. Classical MDS minimises strain, defined as

$$\sum_{i<j} \left( \delta_{ij}^2 - d_{ij}^2 \right),$$

where the sum is taken over all pairwise combinations of observations.
Like all dimension reduction techniques, MDS involves some loss of information. This can be quantified using goodness of fit measures based on the eigenvalues of a transformed version of the matrix of pairwise distances. For instance, for a 2-dimensional solution this may be the sum of the absolute values of the first two eigenvalues relative to the sum of the absolute values of all eigenvalues. An alternative goodness of fit measure replaces the absolute value of each eigenvalue with the maximum of that eigenvalue and 0.
Classical MDS is best suited to the case where the input distances are Euclidean distances. In this case, all eigenvalues are positive, strain is guaranteed to be minimised, and both traditional goodness of fit measures are equal. Furthermore, in this case, the MDS solution is equivalent to a scatterplot of principal components.
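A brief R sketch (using the built-in iris data for illustration) of both the PCA equivalence and the eigenvalue-based goodness of fit measures described above:

X  <- as.matrix(iris[, 1:4])
md <- cmdscale(dist(X), k = 2, eig = TRUE)   # classical MDS on Euclidean distances
pc <- prcomp(X)
max(abs(abs(md$points) - abs(pc$x[, 1:2])))  # essentially zero: same configuration up to sign
md$GOF                                       # the two eigenvalue-based goodness-of-fit measures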
Alternative criteria exist for MDS, including the Sammon mapping. Rather than minimise strain, the Sammon mapping minimises stress, defined as

$$\sum_{i<j} \frac{\left( \delta_{ij} - d_{ij} \right)^2}{\delta_{ij}}.$$

By downweighting pairs of observations that are far apart from one another in the original space, the Sammon mapping preserves the local structure of the data.
Non-metric MDS does not refer to non-metric data but to the case where the distances themselves are non-metric, i.e. only the ranks of the distances are known. For instance, it may be known that A is closest to B and A is furthest from C, with the distance between B and C somewhere in between, but no number can be attached to these distances.
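A compact R sketch of the three variants discussed in this answer (built-in swiss data for illustration; assumes the MASS package for sammon() and isoMDS()):

library(MASS)                   # for sammon() and isoMDS()
d <- dist(swiss)                # Euclidean distances between Swiss provinces
cmdscale(d, k = 2)              # classical MDS: minimises strain
sammon(d, k = 2)$points         # Sammon mapping: minimises stress, preserving local structure
isoMDS(d, k = 2)$points         # non-metric MDS: uses only the rank order of the distances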
Part E: Discriminant Analysis (10 Marks)
As expected, this was the question most students struggled with. However this question was very similar to the last tutorial question; I only changed the distribution fr