
Answer sheet for Part E of the exam
Question 1
library(tidyverse)
data <- as_tibble(read.csv('Revision.csv', header = TRUE))
data %>%
  ggplot(aes(x = experience, y = income, color = sector)) +
  geom_point()

[Figure: scatter plot of income against experience, points colored by sector (Accommodation, Administrative_Support, Agriculture, Construction, Education, Finance_Insurance, Health, Manufacturing, Not_stated, Other, PublicAdministration_Safety, Retail, Scientific_Technical, Transport_Postal_Warehousing, Wholesale).]
Question 2
data %>%
  filter(income > 120000) %>%
  ggplot(aes(x = experience, y = education_years, label = surname, color = sector)) +
  geom_text(size = 2)

[Figure: text plot of surnames positioned by experience (x) and education_years (y), colored by sector, for individuals with income above 120000.]
Question 3
data %>%
  filter(sector == 'Agriculture') %>%
  select_if(is.numeric) %>%
  dist() -> d
data %>%
  filter(sector == 'Agriculture') %>%
  pull(surname) -> attributes(d)$Labels
as.matrix(d) == min(d)
## (output abridged) Logical matrix with rows and columns labeled by surname:
## every entry is FALSE except the symmetric (Bass, Harper) pair, which is TRUE,
## identifying the smallest pairwise distance.
The two most dissimilar individuals are Underwood and Gallegos. The two most similar individuals are Bass and Harper.
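The check for the most dissimilar pair is not shown in the extracted answer; a minimal sketch of that step, mirroring the min() comparison above, would be:

# Locate the largest pairwise distance (most dissimilar individuals).
as.matrix(d) == max(d)
# which() returns the row/column indices of the pair directly:
which(as.matrix(d) == max(d), arr.ind = TRUE)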
Question 4
data %>%
  mutate(tax_paid = 0.3 * income) -> data
head(data)
## # A tibble: 6 x 10
##   surname  income experience   age gender sector second_language education_years
## 1 Nichols 126320.          6    30 Male   Manuf~ Hindi                        2
## 2 Fisher  175883.         14    38 Female Scien~ German                       5
## 3 Mcbride  76911.          0    24 Female Trans~ Spanish                      5
## 4 Noble   115962.          9    33 Male   Educa~ Spanish                      1
## 5 Park     85264.          4    29 Female Educa~ Hindi                        0
## 6 Ponce   105280.         18    44 Male   Other  Spanish                      0
## # … with 2 more variables: siblings, tax_paid
Question 5
The Euclidean distance measures the length of the segment between two points in high-dimensional space. In high dimensions, this distance can be computed using a high-dimensional version of the Pythagorean theorem. The Manhattan distance, on the other hand, does not measure the length of the segment between two points. Instead, it is computed as the sum of the absolute differences between the coordinates of the points. The Manhattan distance is therefore a more practical measure for walking distances between locations in a city with rectangular blocks, hence its name, an analogy to the street grid of Manhattan.
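As a minimal illustration (with two made-up points), both distances can be computed in R with dist():

# Two hypothetical points in 3-dimensional space.
p <- rbind(c(0, 0, 0),
           c(3, 4, 12))
# Euclidean: sqrt(3^2 + 4^2 + 12^2) = 13.
dist(p, method = "euclidean")
# Manhattan: |3| + |4| + |12| = 19.
dist(p, method = "manhattan")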
Question 6
data %>%
  select_if(is.numeric) %>%
  scale %>%
  prcomp -> pca1
pca1$rotation[,1]
##          income        tax_paid      experience             age education_years
##      -0.5046699      -0.5046699      -0.3099180      -0.3071195      -0.4146027
Question 7
data %>%
  select_if(is.numeric) %>%
  prcomp -> pca2
pca2$rotation[,1]
##          income        tax_paid      experience             age education_years
##   -9.578263e-01   -2.873479e-01   -7.940285e-05   -8.111826e-05   -6.447295e-05
##        siblings
##    1.927985e-05
Question 8
Yes, the matrices with the factor weights are different. We know that in PCA we construct linear combinations of the variables that maximize the variance, and that the result is sensitive to the units of measurement. Because the variables income and tax_paid have variances of much larger magnitude in the non-standardized version of the exercise, these variables absorb most of the variability of the linear combinations and as such receive much larger weights.
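A quick way to see this, using the data object defined above: compare the variances of the numeric columns. Income (and hence tax_paid = 0.3 * income) dwarfs the remaining variables, which explains the tiny unstandardized weights on experience, age, education_years, and siblings.

# Variance of each numeric column; income and tax_paid dominate by several
# orders of magnitude, so the unstandardized first component aligns with them.
data %>%
  select_if(is.numeric) %>%
  sapply(var)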
Question 9
The weight of experience on the second principal component is
pca1$rotation['experience',2]
## [1] 0.5836477

Question 10
1. To identify which of these two vectors is an eigenvector for S we must remember from lecture 10 that eigenvectors of symmetric matrices are orthogonal to each other. We must then check which of the two options is orthogonal to $w_1$. For option 1 we have $w_1'w_2 = -\frac{\sqrt{6}}{4} + \frac{\sqrt{2}}{4} \neq 0$, so option one does not provide an orthogonal vector. For option 2 we have $w_1'w_2 = -\frac{\sqrt{3}}{4} + \frac{\sqrt{3}}{4} = 0$, so we identify that the second option provides the second eigenvector for S.
2. Now that we know the two eigenvectors and eigenvalues of S we can use the spectral theorem to compute S. From the spectral theorem we know
$$S = W\Lambda W' = \begin{pmatrix} \frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \frac{1}{2} & \frac{\sqrt{3}}{2} \end{pmatrix}\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \frac{\sqrt{3}}{2} & \frac{1}{2} \\ -\frac{1}{2} & \frac{\sqrt{3}}{2} \end{pmatrix} = \begin{pmatrix} \frac{3\sqrt{3}}{2} & -\frac{1}{2} \\ \frac{3}{2} & \frac{\sqrt{3}}{2} \end{pmatrix}\begin{pmatrix} \frac{\sqrt{3}}{2} & \frac{1}{2} \\ -\frac{1}{2} & \frac{\sqrt{3}}{2} \end{pmatrix} = \begin{pmatrix} \frac{10}{4} & \frac{2\sqrt{3}}{4} \\ \frac{2\sqrt{3}}{4} & \frac{6}{4} \end{pmatrix}$$
3. This question asks about the correlation of $y_{i1}$ and $y_{i2}$. Because these are single observations, their correlation is indeterminate. However, we can calculate the correlation between the variables $y_1$ and $y_2$. The sample correlation between the two variables can be calculated as $\rho = \frac{\text{cov}(y_1,y_2)}{\sqrt{V(y_1)V(y_2)}}$, where $\text{cov}(y_1,y_2)$ denotes the covariance, while $V(y_1)$ and $V(y_2)$ denote the variances of $y_1$ and $y_2$, respectively. Because we already have access to S, we know that $\text{cov}(y_1,y_2) = \frac{2\sqrt{3}}{4}$, $V(y_1) = \frac{10}{4}$ and $V(y_2) = \frac{6}{4}$. When we replace each of these terms into the correlation formula, we get
$$\rho = \frac{2\sqrt{3}/4}{\sqrt{(10/4)(6/4)}} = \frac{2\sqrt{3}}{\sqrt{60}} = 0.4472$$
4. We know that in order to solve the PCA optimization problem the vectors $w_1$ and $w_2$ must satisfy two conditions. First, they must be eigenvectors of S, and as we already know, they are. Second, they must be unit vectors, so that $w_1'w_1 = 1$ and $w_2'w_2 = 1$. After computing these two inner products we can confirm that these two vectors are indeed solutions to the PCA optimization problem. (The sketch after this list checks these calculations numerically.)
5. We know that the second principal component of y is the linear combination constructed using the eigenvector associated with the second largest eigenvalue. In this case we know that such an eigenvector is $w_2$. Thus, the second principal component can be expressed as $c_{i2} = -\frac{1}{2}y_{i1} + \frac{\sqrt{3}}{2}y_{i2}$.
6. An index is a variable that summarises most of the information (variation) in a set of variables. Examples of index variables include the SP500 index which summarises the information of stock prices in the U.S., and the human development index which measures the quality of life across different countries and which summarises information in many economic, education and health variables. PCA allows you to construct an index variable as an optimal linear combination of the variables in the data. This optimal combination is obtained by maximizing the variance of the resulting linear combination.
7. Because we are assuming that the index and the first principal component are equivalent, to compute the value of the index we must first write an expression for the first principal component. Using a similar logic to that in question 6, the first principal component can be expressed as $c_{i1} = \frac{\sqrt{3}}{2}y_{i1} + \frac{1}{2}y_{i2}$. Then, the index value for the first observation is $i_1 = \frac{\sqrt{3}}{2}(1.5) + \frac{1}{2}(-1.2) = 0.699$. Following a similar process, the index values for the second and third observations are $i_2 = 1.3758$ and $i_3 = 1.2321$.
8. The objective in PCA is to construct the linear combination of the variables of the dataset that has the largest variance. Thus, PCA is in fact an optimisation problem that takes as input the variance-covariance matrix of the data. As it turns out, the solution to this optimisation problem can be found by solving the eigenvalue problem of the variance-covariance matrix. Thus we can show that to find a solution to PCA we must solve an eigenvalue problem.
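The matrix results in items 1 to 4 and 7 can be checked numerically. A minimal sketch in base R, using only the quantities given in the question ($w_1$, $w_2$, the eigenvalues 3 and 1, and the first observation (1.5, -1.2)):

# Eigenvectors from the question.
w1 <- c(sqrt(3)/2, 1/2)
w2 <- c(-1/2, sqrt(3)/2)

# Items 1 and 4: orthogonality and unit length.
sum(w1 * w2)                 # 0  -> orthogonal
sum(w1 * w1); sum(w2 * w2)   # both 1 -> unit vectors

# Item 2: reconstruct S via the spectral theorem, S = W Lambda W'.
W      <- cbind(w1, w2)
Lambda <- diag(c(3, 1))
S      <- W %*% Lambda %*% t(W)
S                            # (1/4) * matrix(c(10, 2*sqrt(3), 2*sqrt(3), 6), 2, 2)

# Item 3: correlation implied by S.
S[1, 2] / sqrt(S[1, 1] * S[2, 2])   # 0.4472

# Item 7: index value (first PC score) for observation (1.5, -1.2).
sum(w1 * c(1.5, -1.2))              # 0.699

# Item 8: solving the eigenvalue problem of S recovers the weights.
eigen(S)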

Question 11
1. The second figure indicates that the closest education levels are university.degree and high.school, as their corresponding points are close to each other. On the other hand, professional.course seems to be the education level that is most dissimilar to the other levels.
2. An example of jobs that are similar are admin and management jobs. A second example would be housemaid and blue-collar jobs. An example of jobs that are dissimilar are blue-collar and management jobs as their corresponding points are the furthest apart. A second example of dissimilar jobs would be technician and management.
3. The association of jobs to level of education shows some interesting patterns. For instance, management and admin jobs are more associated with university.degree levels of education than with any other level of education. Also, more manual labor jobs such as blue-collar and housemaid are more associated with education levels lower than high-school education. Interestingly, entrepreneur jobs are more associated with high-school education than with university.degree. These results are mostly in line with intuition. Manual labor jobs (housemaid, blue-collar) require fewer years of academic education. On the other hand, admin and management jobs often require tertiary levels of education, hence why they are more associated with university.degree. Finally, entrepreneurship does not necessarily require university.degree levels of education.
4. For older adults, the two dimensions presented in the figure explain a total of 90.6% of inertia.
5. For young adults, the second dimension explains a total of 25.2% of inertia.
6. For young adults, the occupations student and entrepreneur are the most associated with high-school levels of education.
7. Generally speaking, the patterns are quite similar. However, we can observe a couple of differences. First, in young adults the job housemaid seems to be closer to high.school levels of education than in older adults. Second, the association of technician with professional.course is much stronger in older adults than in young adults. Finally, the occupation student does not feature in the analysis for older adults, which indicates that none of the observations had this category as a job. Similarly, the occupation retired does not feature in the results for young adults, which indicates that none of the individuals in this age group were retired.
8. Inertia is a measure of dependence between two categorical variables. It is closely linked to the chi-squared statistic, which indicates how far two variables are from being independent. The larger the amount of inertia, the more dependent the two categorical variables are. Correspondence analysis constructs a 2D plot, where the coordinates of the points for the two categorical variables are chosen to maximise the amount of inertia replicated in the visualisation.
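To make the link between inertia and the chi-squared statistic concrete, a minimal sketch with a small, made-up job-by-education contingency table (total inertia is the chi-squared statistic divided by the table total):

# Hypothetical contingency table of job (rows) by education (columns).
tab <- matrix(c(20,  5,  2,
                 4, 15, 10,
                 3,  8, 25),
              nrow = 3, byrow = TRUE,
              dimnames = list(job       = c("blue-collar", "admin", "management"),
                              education = c("high.school", "professional.course",
                                            "university.degree")))

chi2 <- chisq.test(tab)$statistic
chi2 / sum(tab)   # total inertia: larger values indicate stronger dependence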