UNIT CODE: ETF3500-ETF5500
UNIT TITLE: High Dimensional Data Analysis
ASSESSMENT DURATION: 2 hours 40 minutes (includes reading, downloading, and uploading time)
Semester Two 2020
Exam – Alternative Assessment Task
STUDENT ID:
SURNAME:
GIVEN NAME:
This is an individual assessment task.
This is an open book exam.
All responses must be included in the RMARKDOWN template document available on Moodle, and then rendered into a pdf document.
ALL STUDENTS are required to answer questions A, B, C and F. ETF3500 students are required to answer question D. ETF5500 students are required to answer question E.
This assessment accounts for 50% of the total mark in the unit.
Any model of calculator is allowed.
Upon completion of this assessment task, please upload the pdf document to Moodle using the assignment submission link.
Your submission must occur within 2 hours and 40 minutes of the official commencement of this assessment task (Australian Eastern Daylight Time).
Please read the next page carefully and sign and date the Student Statement before commencing the assessment task.
Intentional plagiarism or collusion amounts to cheating under Part 7 of the Monash University (Council) Regulations.
Plagiarism: Plagiarism means taking and using another person’s ideas or manner of expressing them and passing them off as one’s own. For example, by failing to give appropriate acknowledgment. The material used can be from any source (staff, students or the internet, published and unpublished works).
Collusion: Collusion means unauthorised collaboration with another person on assessable written, oral or practical work and includes paying another person to complete all or part of the work.
Where there are reasonable grounds for believing that intentional plagiarism or collusion has occurred, this will be reported to the Associate Dean (Education) or delegate, who may disallow the work concerned by prohibiting assessment or refer the matter to the Faculty Discipline Panel for a hearing.
Student Statement:
I have read the university’s Student Academic Integrity Policy and Procedures.
I understand the consequences of engaging in plagiarism and collusion as described in Part 7 of the Monash University (Council) Regulations
https://www.monash.edu/legal/legislation/current-statute-regulations-and-related-resolutions
I have taken proper care to safeguard this work and made all reasonable efforts to ensure it could not be copied.
I have not used any unauthorised materials in the completion of this assessment task.
No part of this assessment has been previously submitted as part of another unit/course.
I acknowledge and agree that the assessor of this assessment task may, for the purposes of assessment, reproduce the assessment and:
i. provide it to another member of faculty and any external marker; and/or
ii. submit it to a text-matching software; and/or
iii. submit it to a text-matching software which may then retain a copy of the assessment on its database for the purpose of future plagiarism checking.
I certify that I have not plagiarised the work of others or participated in unauthorised collaboration when preparing this assessment.
Signature: (Type your full name) Date:
Privacy Statement
The information on this form is collected for the primary purpose of assessing your assessment and ensuring the academic integrity requirements of the University are met. Other purposes of collection include recording your plagiarism and collusion declaration, attending to the course and administrative matters and statistical analyses. If you choose not to complete all the questions on this form it may not be possible for Monash University to assess your assessment task. You have a right to access personal information that Monash University holds about you, subject to any exceptions in relevant legislation. If you wish to seek access to your personal information or inquire about the handling of your personal information, please contact the University Privacy Officer: privacyofficer@adm.monash.edu.au
MARKS ALLOCATED TO QUESTIONS WITHIN THIS ASSESSMENT TASK
Question          A     B     C     D/E   F     TOTAL
Allocated Marks   10    10    15    10    5     50

Office Use Only
Mark received     A     B     C     D/E   F     TOTAL
Second marking
The exam
This exam uses simulated data that emulates SOME features of the Australian labour market. By now, you should have access to your dataset, which is generated according to your student ID. The dataset provides 9 attributes on 500 individuals who speak English as their first language. The following variables are provided in the dataset:
surname: Surname of the individual.
income: Yearly income (dollars).
experience: Work experience (years).
age: Age of the individual (years).
gender: Gender of the individual.
sector: Industry of work.
second_language: Second language spoken by the individual.
education_years: Total number of tertiary education years.
siblings: Total number of siblings.
Based on this information you must answer the questions below. Code to produce each of the R outputs in your answers must be provided.
A Standardisation and Distance (10 Marks)
The following question only requires you to use the variables income, experience and age.
1. Standardise income, experience and age by centering (subtracting the mean) and scaling (dividing by the standard deviation) using the scale function. Print out the first 5 observations. (1 Mark)
2. From your answer to Question 1, what is the standardised value of income for the first observation (Nichols) in your data? (1 Mark)
3. The government proposes a universal basic income, meaning that $10000 is added to every income. Create a variable NewIncome which is equal to income plus 10000 (NewIncome is only to be used for Question A). (1 Mark)
4. Find the Euclidean distance between the first and second observations (Nichols and Fisher) using income, experience and age as the variables. Do NOT standardise the data. (1 Mark)
5. Find the Euclidean distance between the first and second observations (Nichols and Fisher) using NewIncome, experience and age as the variables. Do NOT standardise the data. (1 Mark)
6. Are the answers to Question 4 and Question 5 the same? Why or why not? (1 Mark)
7. Consider that you are working for a business that streams movies. You have access to data on a list of movies that each customer has seen. How could you use this data to define a distance between two different customers? (2 Marks)
8. For the example in the previous question, describe how collaborative filtering can be used to make recommendations of movies to customers. (2 Marks)
B Principal Components Analysis (10 Marks)
1. Carry out Principal Components Analysis on the data using all numeric variables. (2 Marks)
2. Did you standardise the variables? Why or why not? (2 Marks)
3. What is the weight on number of siblings for the 4th principal component? (1 Mark)
4. What is the standard deviation of the 3rd principal component? (1 Mark)
5. Make a distance biplot. (1 Mark)
6. Pick two variables that, according to the biplot, are highly positively correlated with one another. If there are no such variables for your dataset, then describe what you would be looking for in the biplot to indicate that two variables are positively correlated. (1 Mark)
7. Pick two variables that according to the biplot are uncorrelated. If there are no such variables for your dataset, then describe what you would be looking for in the biplot to indicate that two variables are uncorrelated. (1 Mark)
8. What proportion of overall variation in the data is explained by the biplot? (1 Mark)
C Multidimensional Scaling (15 Marks)
1. Using only those observations for which second_language is French, carry out classical multidimensional scaling. Find a two-dimensional representation and use the standardised values of income, experience, age, education_years and siblings as the variables. (4 Marks)
2. Plot a 2-dimensional representation of this data. Rather than plot the observations as points use the individuals’ surnames. (3 Marks)
3. Name two individuals (by surname) who are similar according to your plot in Question 2, and two individuals (by surname) who are different. If you were unable to generate the plot in Question 2, then describe how you would answer this question. (1 Mark)
4. Produce the same plot as in Question 2 using the Sammon mapping. (3 Marks)
5. Are your conclusions in Question 3 robust to using a different multidimensional scaling method? If you were unable to generate the plot in Question 2 and/or Question 4, then describe how you would answer this question. (1 Mark)
6. Describe the differences between classical multidimensional scaling and the Sammon mapping. (3 Marks)
D Correspondence analysis (ETF3500 students only) (10 Marks)
1. Construct a contingency table between the sector and second_language variables. (1 Mark)
2. Using the contingency table in point 1, perform correspondence analysis on the sector and second_language variables and visualise the results. (2 Marks)
3. Based on the results in point 2, which sector is most associated with people who speak Spanish as a second language? (1 Mark)
4. Based on the results in point 2, how much inertia is explained by the first dimension? (1 Mark)
5. Repeat point 2, but this time, only consider those individuals whose income is greater than 100000 and age is greater than 25. (2 Marks)
6. Based on the results in point 5, how much inertia is explained by the second dimension? (1 Mark)
7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two exercises CA helps explain a larger amount of inertia. (2 Marks)
E Correspondence analysis (ETF5500 students only) (10 Marks)
1. Using only individuals whose gender is Female and whose income is less than $200000, construct a contingency table between the sector and second_language variables. (1 Mark)
2. Using the contingency table in point 1, perform correspondence analysis on the sector and second_language variables and visualise the results. (1 Mark)
3. Based on the results in point 2, which sector is most associated with people who speak Spanish as a second language? (1 Mark)
4. Based on the results in point 2, how much inertia is explained by the first dimension? (1 Mark)
5. Repeat point 2, but this time, only consider those individuals whose gender is Male and whose income is less than $200000. (1 Mark)
6. Based on the results in point 5, how much inertia is explained by the second dimension? (1 Mark)
7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two figures CA helps explain a larger amount of inertia. (1 Mark)
8. Discuss the differences or similarities between the results obtained in points 2 and 5. For example, are the associations between sector and second_language consistent? (1 Mark)
9. In your own words, describe the role that the singular value decomposition (SVD) of a matrix plays in correspondence analysis. (2 Marks)
F Factor Modelling (5 Marks)
1. Fit a 2-factor model to the numerical variables in the dataset (set rotation = "none"). (1 Mark)
2. For each of the two factors, list the variables whose factor loadings are greater than 0.1 in absolute value. (1 Mark)
3. Provide a plot that visualises the association between factors and variables. (1 Mark)
4. Fit a 2-factor model to the numerical variables in the dataset, but now setting rotation = "promax". (1 Mark)
5. Discuss the differences between the two factor modelling approaches used in Questions 1 and 4. (1 Mark)
END OF EXAMINATION