Practice Exam
A dataset provides 9 attributes on 500 individuals who speak English as their first language. The following variables are provided in the dataset:
surname: Surname of the individual.
income: Yearly income (dollars).
experience: Work experience (years).
age: Age of the individual (years).
gender: Gender of the individual.
sector: Industry of work.
second_language: Second language spoken by the individual.
education_years: Total number of tertiary education years.
siblings: Total number of siblings.
The data is stored in the data file Revision.csv in the Week 12 section on Moodle. Use this data to do the following:
1. Create a scatter plot of experience versus income, and colour the points according to sector.
2. Using only the individuals with an income greater than $120K, create a scatter plot of experience versus education_years. Use surname labels instead of points, and colour the labels according to sector.
3. Name (by surname) the two most similar individuals working in the Agriculture sector. Use the Euclidean distance as the measure of dissimilarity.
4. Assume that the government collects 30% of the income from each individual as taxes. Add a new variable tax_paid equal to the tax contribution.
5. Discuss the differences between the Manhattan and Euclidean distances.
6. Conduct principal component analysis. Use only numeric variables and remember to standardise the data. Report the weights of the first principal component.
7. Conduct principal component analysis. Use only numeric variables and do NOT standardise the data. Report the weights of the first principal component.
8. Are the results in questions 6 and 7 different? If so, why?
9. Using the results in question 6, what is the weight of experience on the second principal component?
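As a study aid for questions 6 to 8, the following is a minimal sketch of how standardisation changes the weights of the first principal component. It assumes Python with NumPy rather than the course software, and it uses hypothetical simulated data, not Revision.csv; the helper name first_pc_weights and the simulated income and experience values are illustrative assumptions only.

```python
import numpy as np

def first_pc_weights(X, standardise):
    """Return the weights and variance of the first principal component."""
    Xc = X - X.mean(axis=0)                   # centre each column
    if standardise:
        Xc = Xc / X.std(axis=0, ddof=1)       # scale each column to unit sample variance
    S = np.cov(Xc, rowvar=False)              # covariance (or correlation) matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    w = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    if w[np.argmax(np.abs(w))] < 0:           # the sign is arbitrary, so fix it
        w = -w
    return w, eigvals[-1]

# Hypothetical data (NOT Revision.csv): income in dollars, experience in years.
rng = np.random.default_rng(0)
experience = rng.uniform(0, 40, size=500)
income = 40_000 + 2_000 * experience + rng.normal(0, 15_000, size=500)
X = np.column_stack([income, experience])

w_raw, _ = first_pc_weights(X, standardise=False)
w_std, _ = first_pc_weights(X, standardise=True)
print("unstandardised weights:", np.round(w_raw, 4))
print("standardised weights:  ", np.round(w_std, 4))
```

In this simulated example the unstandardised weights load almost entirely on income, because income is measured in dollars and so has by far the largest variance, while the standardised weights are balanced across the two variables.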
Let yi = (yi1, yi2)′ denote the data vector of two attributes for observation i, where i = 1, …, n and n = 100. You know that the sample mean of the data is the zero vector. You do not know the sample covariance matrix S, but you know that its largest eigenvalue is λ1 = 3, with associated eigenvector w1 = ( √3 , 1 )′, and that its second largest eigenvalue is λ2 = 1.
Use this information to answer the questions below.
1. Identify which of the two options below provides the eigenvector associated with λ2. Provide an explanation for your answer.
−√2 −1 Option1: w2= √2 ; Option2:w2= √2 .
For the remaining questions, use the selected option as the specified value for w2.
2. Using your answer to the previous question compute the sample covariance matrix S.
3. What is the sample correlation between the variables yi1 and yi2?
4. Determine if the eigenvectors w1 and w2 are solutions to the PCA optimisation problem
$$\max_{w} \; w' S w \quad \text{s.t.} \quad w' w = 1$$
5. Provide an expression for the second principal component of y.
6. In your own words discuss what an index variable is and how PCA allows you to construct this type of variable.
7. Assume that you construct an index variable equal to the first principal component. For the observations y1 = (1.5, −1.2)′, y2 = (1.3, 0.5)′ and y3 = (2, −1)′, compute the corresponding index values.
8. In your own words, describe how the eigenvalue problem is connected to principal component analysis.
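As a study aid for the questions above, the following is a minimal sketch of how a covariance matrix can be rebuilt from its eigenvalues and unit-length eigenvectors, and how principal component scores follow from it. It assumes Python with NumPy, and every number in it (lam, W and Y) is a hypothetical illustration, not a value from the question.

```python
import numpy as np

# Hypothetical eigenpairs for illustration only (NOT the values in the question).
lam = np.array([4.0, 1.0])               # eigenvalues, largest first
W = np.array([[0.6, -0.8],
              [0.8,  0.6]])              # columns are the eigenvectors w1 and w2

# Eigenvectors used in PCA must have unit length and be mutually orthogonal.
assert np.allclose(W.T @ W, np.eye(2))

# Spectral decomposition: S = lam1 * w1 w1' + lam2 * w2 w2'
S = W @ np.diag(lam) @ W.T

# Correlation implied by the covariance matrix.
corr = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])

# Principal component scores for hypothetical mean-zero observations (rows of Y).
Y = np.array([[1.0, 2.0],
              [-0.5, 0.3]])
scores = Y @ W                           # column j holds principal component j

print(S)
print(round(corr, 4))
print(scores[:, 0])                      # first principal component (index values)
```

An eigenvector is only defined up to scale and sign, so a candidate vector should be checked for orthogonality to w1 and rescaled to unit length before it is used in the decomposition.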
A bank collects data on multiple attributes of their customers. Amongst these attributes are the variables age, job and education. The categorical variable job has the categories admin., technician, blue-collar, management, student, unemployed, entrepreneur and housemaid. The categorical variable education has the categories no_highschool, high.school, professional.course and university.degree.
Using this information, we apply correspondence analysis to study the dependence between the variables job and education. We conduct the study separately in two age groups: young adults (age<30) and older adults (age>50). The results are presented below.
Correspondence analysis in young adults
[Figure: correspondence analysis map of the job and education categories for young adults. Axes: Dimension 1 (61.2%) and Dimension 2 (25.2%).]
## Principal inertias (eigenvalues):
##
##  dim    value      %   cum%   scree plot
##  1      0.524682  61.2  61.2  ***************
##  2      0.215697  25.2  86.4  ******
##  3      0.116877  13.6 100.0  ***
##         -------- -----
##  Total: 0.857256 100.0
Correspondence analysis in older adults
[Figure: correspondence analysis map of the job and education categories for older adults. Axes: Dimension 1 (66.3%) and Dimension 2 (24.3%).]
## Principal inertias (eigenvalues):
##
##  dim    value      %   cum%   scree plot
##  1      0.341441  66.3  66.3  *****************
##  2      0.125191  24.3  90.7  ******
##  3      0.048129   9.3 100.0  **
##         -------- -----
##  Total: 0.514762 100.0
## [1] 451 4
## [1] 674 4
Use the correspondence analysis results above to answer the following questions.
1. For older adults, provide examples of education levels that are similar and education levels that are dissimilar.
2. For young adults, provide examples of jobs that are similar and jobs that are dissimilar.
3. For older adults, discuss the association between jobs and education. Is the association between jobs and education in line with your intuition?
4. For older adults, discuss how much inertia the two dimensions presented in the figure explain.
5. For young adults, how much inertia is explained by the second dimension?
6. For young adults, what occupation is most associated with high school education?
7. Are there any similarities or differences in the results for young adults and older adults?
8. In your own words, describe what inertia is and its role in correspondence analysis.
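As a study aid for the inertia questions, the following is a minimal sketch of how principal inertias arise in correspondence analysis and how they convert to percentages. It assumes Python with NumPy rather than the course software. The contingency table N is hypothetical; only the array named reported uses numbers from the young adults output above.

```python
import numpy as np

# Hypothetical 3x3 contingency table (rows: jobs, columns: education levels).
N = np.array([[30.0, 10.0,  5.0],
              [10.0, 25.0, 10.0],
              [ 5.0, 10.0, 35.0]])
n = N.sum()
P = N / n                                 # correspondence matrix
r = P.sum(axis=1)                         # row masses
c = P.sum(axis=0)                         # column masses
E = np.outer(r, c)                        # expected proportions under independence
R = (P - E) / np.sqrt(E)                  # standardised residuals

# Principal inertias are the squared singular values of the residual matrix.
sv = np.linalg.svd(R, compute_uv=False)
principal_inertias = sv ** 2
total_inertia = principal_inertias.sum()

# Total inertia equals the Pearson chi-square statistic divided by n.
chi2 = n * ((P - E) ** 2 / E).sum()
print(round(total_inertia, 6), round(chi2 / n, 6))

# Percentages from the principal inertias reported for young adults.
reported = np.array([0.524682, 0.215697, 0.116877])
pct = 100 * reported / reported.sum()
print(np.round(pct, 1), np.round(np.cumsum(pct), 1))
```

The percentage attached to each dimension is its principal inertia divided by the total inertia, so the share explained by a two-dimensional map is the sum of the first two percentages.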