ETF3500/ETF5500 High Dimensional Data Analysis Exam
Semester Two 2016 Examination Period
Faculty of Business and Economics
EXAM CODES: ETF3500/ETF5500
TITLE OF PAPER: SURVEY DATA ANALYSIS
EXAM DURATION: 2 hours writing time
READING TIME: 10 minutes
THIS PAPER IS FOR STUDENTS STUDYING AT: (tick where applicable)
During an exam, you must not have in your possession any item/material that has not been authorised for your exam. This includes books, notes, paper, electronic device/s, mobile phone, smart watch/device, calculator, pencil case, or writing on any part of your body. Any authorised items are listed below. Items/materials on your desk, chair, in your clothing or otherwise on your person will be deemed to be in your possession.
No examination materials are to be removed from the room. This includes retaining, copying, memorising or noting down content of exam material for personal use or to share with any other person by any means following your exam.
Failure to comply with the above instructions, or attempting to cheat or cheating in an exam is a discipline offence under Part 7 of the Monash University (Council) Regulations.
AUTHORISED MATERIALS
OPEN BOOK: YES
CALCULATORS: YES. If yes, only a HP 10bII+ calculator is permitted.
SPECIFICALLY PERMITTED ITEMS: if yes, items permitted are:
Candidates must complete this section if required to write answers within this paper
STUDENT ID: __ __ __ __ __ __ __ __ DESK NUMBER: __ __ __ __ __
Page 1 of 11
Part A Multiple Choice (10 Marks)
The following questions are multiple choice. You MUST provide a brief explanation for your answer. If no explanation is given, you will receive ZERO marks for the question.
1. Suppose we have data on p variables, where p > 2. Let δij be the distance between observation i and observation j in p-dimensional space. Let B be the matrix obtained after subtracting row means and column means from the matrix with entries given by −½δij². Suppose that the first two principal components are plotted as a scatterplot. Let dij denote the distance between the first two principal components for observation i and the first two principal components for observation j in the 2-dimensional space of this plot. The quantity δij² − dij² is guaranteed to be minimised when:
(a) Both δij and dij are Euclidean distances for all i, j.
(b) Both δij and dij are Manhattan distances for all i, j.
(c) The eigenvalues of B are all non-negative.
(d) The eigenvalues of B are all non-negative, and dij are Euclidean distances for all i, j.
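The double-centring construction in question 1 can be sketched numerically. The following Python sketch is illustrative only (the 5 × 3 data matrix is made up, not from the paper): when δij are Euclidean distances, double-centring −½δij² yields the centred Gram matrix, so B is positive semi-definite.

```python
import numpy as np

# Hypothetical data: 5 observations on p = 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Squared Euclidean distances delta_ij^2 between all pairs of observations.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Subtract row means and column means (and add back the grand mean)
# from the matrix with entries -1/2 * delta_ij^2 to obtain B.
A = -0.5 * D2
B = A - A.mean(axis=0) - A.mean(axis=1)[:, None] + A.mean()

# For Euclidean distances, B equals Xc Xc' (Xc = centred X), which is
# positive semi-definite: all eigenvalues non-negative up to rounding.
eigvals = np.linalg.eigvalsh(B)
print(eigvals.min() > -1e-10)  # True
```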
2. Which of the following statements is true?
(a) The average linkage method is an example of non-hierarchical clustering while k-means clustering is
an example of hierarchical clustering.
(b) The average linkage method is an example of hierarchical clustering while k-means clustering is an
example of non-hierarchical clustering.
(c) The average linkage method and k-means clustering are both examples of hierarchical clustering.
(d) The average linkage method and k-means clustering are both examples of non-hierarchical clustering.
(2 Marks)
3. Which of the following statements is true?
(a) The covariances between the Principal Components are given by the eigenvalues of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvectors of the variance-covariance matrix of the data.
(b) The covariances between the Principal Components are given by the eigenvectors of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvalues of the variance-covariance matrix of the data.
(c) The variances of the Principal Components are given by the eigenvalues of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvectors of the variance-covariance matrix of the data.
(d) The variances of the Principal Components are given by the eigenvectors of the variance-covariance matrix of the data, and the weights used to form the principal components are given by the eigenvalues of the variance-covariance matrix of the data.
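The relationship between principal components and the eigen decomposition in question 3 can be verified numerically. This is an illustrative Python sketch with simulated data (not from the paper): the variance of each principal component equals the corresponding eigenvalue of the variance-covariance matrix, and the eigenvectors supply the weights.

```python
import numpy as np

# Simulated data: 200 observations on 4 variables (made-up example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
S = np.cov(X, rowvar=False)              # variance-covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
scores = (X - X.mean(axis=0)) @ eigvecs  # principal component scores

# The variance of each PC equals the corresponding eigenvalue, and the
# covariance matrix of the scores is diagonal (PCs are uncorrelated).
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))          # True
print(np.allclose(np.cov(scores, rowvar=False),
                  np.diag(eigvals)))                             # True
```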
4. When the data are measured as nominal variables, which of the following methods are sensible to use?
(a) Cluster Analysis and Principal Components Analysis.
(b) Principal Components Analysis and Multi Dimensional Scaling.
(c) Cluster Analysis and Multi Dimensional Scaling.
(d) Cluster Analysis, Multi Dimensional Scaling and Principal Components Analysis.
5. Consider the biplot below which comes from a Principal Components Analysis with four variables: A, B, C and D. Which of the following statements is most likely to be correct?
(a) The correlation between variable A and variable B is close to 0. The correlation between variable C and variable D is close to -1.
(b) The correlation between variable A and variable B is close to -1. The correlation between variable C and variable D is close to 0.
(c) The correlation between variable A and variable B is close to 1. The correlation between variable C and variable D is close to -1.
(d) The correlation between variable A and variable B is close to 1. The correlation between variable C and variable D is close to 0.
[Biplot from the Principal Components Analysis: observation scores on the first two principal components, with arrows for variables A, B, C and D. Axis tick labels omitted.]
Part B Exploratory Factor Analysis (10 Marks)
Let y be a p × 1 vector of random variables where E(yi) = 0 for all i = 1,2,…,p. Assume that y follows the factor model:
y = Bf + ε,   (1)

where
• f is an r × 1 vector of random, unobservable common factors,
• B is a fixed p × r matrix of factor loadings,
• ε is a p × 1 vector of unobserved idiosyncratic errors.
Also assume
A.1 Each common factor has a mean of 0 and a variance of 1 (i.e. E(fj) = 0 and Var(fj) = E(fj²) = 1 for all j = 1, 2, …, r),
A.2 Each idiosyncratic error has a mean of 0 and a variance of ψi (i.e. E(εi) = 0 and Var(εi) = E(εi²) = ψi for all i = 1, 2, …, p),
A.3 All common factors and idiosyncratic errors are uncorrelated with one another (i.e. Cov(fj, εi) = 0 for all j = 1, …, r and i = 1, …, p).
Please answer the following questions. Useful rules of matrix algebra can be found in Appendix I.
1. Explain how E(yy′) gives the variance-covariance matrix of y.
2. Using the assumptions of the factor model, show that the variance-covariance matrix of y is equivalent to BB′ + Ψ, where

        [ β11 β12 ... β1r ]             [ ψ1  0  ...  0 ]
    B = [ β21 β22 ... β2r ]   and   Ψ = [  0  ψ2 ...  0 ]   (2)
        [  :   :       :  ]             [  :   :      :  ]
        [ βp1 βp2 ... βpr ]             [  0   0 ... ψp ]

3. Suppose the factors f are rotated to a new set of factors f̃ using the rotation matrix G so that f̃ = Gf.
(a) Find the variance-covariance matrix of f̃ where G is a matrix that performs an orthogonal rotation. Are the factors f̃ correlated with one another?
(b) How does your answer to the previous question change if G is a matrix that performs an oblique rotation?
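The orthogonal-rotation case in question 3(a) can be checked numerically: for orthogonal G, Var(f̃) = GG′ = I, and the implied variance-covariance matrix BB′ + Ψ is unchanged when the loadings are rotated to BG′. An illustrative Python sketch (the specific loadings and ψ values are made-up assumptions):

```python
import numpy as np

# Hypothetical loadings (p = 4, r = 2) and idiosyncratic variances.
B = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.7],
              [0.2, 0.9]])
Psi = np.diag([0.3, 0.4, 0.5, 0.2])
Sigma = B @ B.T + Psi            # implied Var(y) under the factor model

# An orthogonal rotation G (rotation by 30 degrees), so G G' = I.
t = np.pi / 6
G = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

# Rotated factors f~ = G f have Var(f~) = G G' = I (still uncorrelated),
# and the rotated loadings B G' leave Var(y) unchanged.
print(np.allclose(G @ G.T, np.eye(2)))                     # True
print(np.allclose((B @ G.T) @ (B @ G.T).T + Psi, Sigma))   # True
```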
Part C Correspondence Analysis (10 Marks)
Correspondence Analysis was carried out using the Laundry dataset covered during the unit. Each of the 332 observations corresponds to a different customer. The first variable is the brand of laundry detergent purchased and has 11 levels (Omo, Radiant, Surf, R.M. Gow, Drive, Other Unilever, Spree, Fab, Dynamo, Cold Power, Bushland). The second variable is employment status of the customer and has 6 levels (Full Time, Part Time, Home Duties, Student, Retired, Unemployed). Below a plot is provided with employment status represented by blue circles and brands represented by red triangles. The R output obtained from running the summary function after carrying out correspondence analysis is provided below the plot.
inertias (eigenvalues):
 value      %     cum%   scree plot
 0.192557   44.2   44.2  ***********
 0.097339   22.3   66.5  ******
 0.065014   14.9   81.5  ****
 0.062967   14.5   95.9  ****
 0.017858    4.1  100.0  *
 --------  -----
 0.435735  100.0

[Correspondence analysis plot: Dimension 1 (44.2%) against Dimension 2 (22.3%). Employment statuses shown as blue circles (including Unemployed and Home duties) and brands as red triangles (including Other Unilever, Bushland and Cold Power). Axis tick labels omitted.]
Please answer the following questions:
1. Name two brands closely associated with one another.
2. Name a brand closely associated with Students.
3. Name an employment status closely associated with Spree.
4. Explain what is meant by the concept of inertia.
5. What proportion of inertia is explained by the fourth dimension of the correspondence analysis on its own?
6. How much inertia is explained by the plot?
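The percentage and cumulative columns of the inertia table can be reproduced from the raw principal inertias. A short Python sketch using the eigenvalues printed in the summary output above:

```python
# Principal inertias from the correspondence analysis summary output;
# total inertia is their sum.
inertias = [0.192557, 0.097339, 0.065014, 0.062967, 0.017858]
total = sum(inertias)
share = [v / total for v in inertias]

print(round(total, 6))                         # 0.435735
print(round(share[3] * 100, 1))                # 14.5 (fourth dimension)
# A two-dimensional plot explains the first two shares together:
print(round((share[0] + share[1]) * 100, 1))   # 66.5
```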
Part D Discriminant Analysis (10 Marks)
Write a half to one page description of Discriminant Analysis. Some issues that you may choose to discuss are:
• The objectives of discriminant analysis.
• The types of data (e.g. metric/non-metric etc.) that can be used for Discriminant Analysis.
• Examples of problems in business, marketing or any other discipline that can be solved using discriminant analysis.
• The difference between classification via the Maximum Likelihood rule and Bayes’ rule and the circumstances under which it is appropriate to use each rule.
• Fisher’s linear discriminant and its relationship to Bayes’ rule for classification.
• The difference between linear discriminant analysis and quadratic discriminant analysis, and the circumstances under which it is appropriate to use each method.
• A discussion of multiclass discriminant analysis.
• Ways to validate discriminant analysis.
• Anything else that may be relevant.
(10 Marks)
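As background for the bullets on Fisher’s linear discriminant and its link to Bayes’ rule, a minimal two-class sketch in Python (the data points are made up for illustration): w solves Sw w = m1 − m0, and with equal priors and a common covariance the midpoint cut-off corresponds to the Bayes classification rule.

```python
import numpy as np

# Two small, well-separated hypothetical classes (2 variables each).
X0 = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3], [0.1, -0.2]])
X1 = np.array([[2.0, 1.5], [2.2, 1.4], [1.9, 1.7], [2.1, 1.6]])

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Pooled within-class scatter matrix.
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

# Fisher's discriminant direction and midpoint cut-off (equal priors).
w = np.linalg.solve(Sw, m1 - m0)
c = w @ (m0 + m1) / 2

# Classify by projecting onto w: above the cut-off -> class 1.
# Here both classes are classified perfectly.
print((X0 @ w > c).mean(), (X1 @ w > c).mean())  # 0.0 1.0
```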
Part E Structural Equation Modelling (10 Marks)
Consider a study where the objective is to understand how the perceived Ease of using Wikipedia and the perceived Quality of Wikipedia articles influence the Usage of Wikipedia. The variables Ease, Quality and Usage are unobservable latent factors that cannot be measured directly. Instead, nine survey questions (denoted Q1-Q9) are developed to measure these latent factors. The following Structural Equation Model is proposed:
Measurement Equations

Q1 = λ1 Ease + ξ1    (3)
Q2 = λ2 Ease + ξ2    (4)
Q3 = λ3 Ease + ξ3    (5)
Q4 = λ4 Quality + ξ4    (6)
Q5 = λ5 Quality + ξ5    (7)
Q6 = λ6 Quality + ξ6    (8)
Q7 = λ7 Usage + ξ7    (9)
Q8 = λ8 Usage + ξ8    (10)
Q9 = λ9 Usage + ξ9    (11)

Structural Equation

Usage = β1 Ease + β2 Quality + ε    (12)

Non-zero correlations

cor(ξ3, ξ4) ≠ 0    (13)
cor(ξ7, ξ8) ≠ 0    (14)
cor(Ease, Quality) ≠ 0    (15)
The output from the lavaan package after fitting this model is given in Appendix II at the end of the exam paper.
1. Draw a path diagram for the model above.
2. Is the effect of Quality on Usage statistically significant at the 5% level? Explain which part of the output you used to determine this.
3. Is the effect of Ease on Usage statistically significant at the 5% level? Explain which part of the output you used to determine this.
4. Does the proposed model fit the data well? Answer with respect to the following two quantities.
(a) The comparative fit index (CFI).
(b) The Root mean square error of approximation (RMSEA).
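As a cross-check on the output in Appendix II, the CFI and TLI can be recomputed from the user-model and baseline chi-square statistics using their standard definitions; this Python sketch recovers the reported values of 0.984 and 0.973.

```python
# Chi-square statistics from the lavaan output in Appendix II:
# user model 60.528 on 22 df, baseline model 2399.995 on 36 df.
chi2_m, df_m = 60.528, 22
chi2_b, df_b = 2399.995, 36

# Comparative Fit Index and Tucker-Lewis Index (standard formulas).
cfi = 1 - (chi2_m - df_m) / (chi2_b - df_b)
tli = (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1)

print(round(cfi, 3))  # 0.984
print(round(tli, 3))  # 0.973
```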
Appendix I: Rules of Matrix Algebra

Matrix Algebra
Let W, X, Y and Z be matrices and assume all matrix multiplication below is conformable. The following rules of matrix algebra can be used:
(W + Z)′ = W′ + Z′    (16)
(WZ)′ = Z′W′    (17)
(W + Z)(X + Y) = WX + WY + ZX + ZY    (18)
Matrices and Expected Values
Assume that X is a matrix of random variables and let E(X) denote the matrix obtained by taking the expected value of each element in X. The following results can be used:
E(W + Z) = E(W) + E(Z)    (19)

When W is fixed and Z is random,

E(WZ) = W E(Z)    (20)

and if W is random and Z is fixed,

E(WZ) = E(W) Z    (21)

Neither of the previous two rules applies if W and Z are both random.
Appendix II: R Output for Structural Equation Modelling Question
lavaan (0.5-20) converged normally after 45 iterations

  Number of observations
  Estimator                                         ML
  Minimum Function Test Statistic               60.528
  Degrees of freedom                                22
  P-value (Chi-square)                           0.000

Model test baseline model:

  Minimum Function Test Statistic             2399.995
  Degrees of freedom                                36
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.984
  Tucker-Lewis Index (TLI)                       0.973

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -8192.533
  Loglikelihood unrestricted model (H1)      -8162.269

  Number of free parameters                         23
  Akaike (AIC)                               16431.066
  Bayesian (BIC)                             16538.052
  Sample-size adjusted Bayesian (BIC)        16465.016

Root Mean Square Error of Approximation:

  RMSEA                                          0.048
  90 Percent Confidence Interval           0.033  0.062
  P-value RMSEA <= 0.05                          0.586

Standardized Root Mean Square Residual:

  SRMR

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  Z-value  P(>|z|)
  Ease =~
    Q1                1.000
    Q2                1.469    0.154    9.553    0.000
    Q3                0.701    0.092    7.590    0.000
  Quality =~
    Q4                1.000
    Q5                0.936    0.038   24.597    0.000
    Q6                0.828    0.038   21.600    0.000
  Usage =~
    Q7                1.000
    Q8                0.821    0.047   17.323    0.000
    Q9                1.446    0.111   13.047    0.000

Regressions:
                   Estimate  Std.Err  Z-value  P(>|z|)
  Usage ~
    Ease             -0.029    0.084   -0.351    0.726
    Quality           0.564    0.059    9.525    0.000

Covariances:
                   Estimate  Std.Err  Z-value  P(>|z|)
  Ease ~~
    Quality           0.163    0.021    7.629    0.000
  Q3 ~~
    Q4               -0.055    0.017   -3.179    0.001
  Q7 ~~
    Q8                0.222    0.040    5.492    0.000

Variances:
                   Estimate  Std.Err  Z-value  P(>|z|)
    Q1                0.411    0.028   14.486    0.000
    Q2                0.222    0.042    5.313    0.000
    Q3                0.724    0.039   18.566    0.000
    Q4                0.179    0.018    9.783    0.000
    Q5                0.241    0.019   12.936    0.000
    Q6                0.350    0.021   16.328    0.000
    Q7                0.622    0.050   12.533    0.000
    Q8                0.752    0.048   15.791    0.000
    Q9                0.279    0.079    3.508    0.000
    Ease              0.200    0.030    6.678    0.000
    Quality           0.564    0.040   14.160    0.000
    Usage             0.391    0.043    9.047    0.000