---
title: "High Dimensional Data Analysis"
output: pdf_document
---

```{r setup, include=FALSE}

```

```{r, echo=FALSE, eval=TRUE, message=FALSE}
library(MASS)
library(ca)
library(knitr)
library(kableExtra)
library(dplyr)
library(stats)
library(broom)
library(tidyverse)
```

\newpage

# A Standardisation and Distance **(10 Marks)**

*The following question only requires you to use the variables `income`, `experience` and `age`.*

*1. Standardise `income`, `experience` and `age` by centering (subtracting the mean) and scaling (dividing by the standard deviation) using the `scale` function. Print out the first 5 observations.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```
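As a rough illustration of the centering and scaling described in Question 1, the sketch below applies `scale` to a small made-up data frame. The data frame `toy` and its values are assumptions for illustration only; they are not the assignment data.

```r
# Toy data standing in for the assignment dataset (all values are made up).
toy <- data.frame(
  income     = c(52000, 61000, 48000, 75000, 58000, 66000),
  experience = c(5, 12, 3, 20, 8, 15),
  age        = c(28, 40, 25, 51, 33, 45)
)

# scale() subtracts each column's mean and divides by its standard deviation.
toy_std <- scale(toy[, c("income", "experience", "age")])

# Print the first 5 standardised observations.
head(toy_std, 5)

# Sanity check: every standardised column has mean 0 and standard deviation 1.
col_means <- colMeans(toy_std)
col_sds   <- apply(toy_std, 2, sd)
```

The result of `scale` is a matrix, so a single standardised value can be read off by row and column name, for example `toy_std[1, "income"]`.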

*2. From your answer to Q1, what is the standardised value of `income` for the first observation (Nichols) in your data?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*3. The government proposes a universal basic income meaning that $10000 is added to every income. Create a variable `NewIncome` which is equal to `income` plus 10000 (**`NewIncome` is only to be used for question A**).* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*4. Find the Euclidean Distance between the first and second observation (Nichols and Fisher) using `income`, `experience` and `age` as the variables. Do NOT standardise the data.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE
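
One possible way to compute a Euclidean distance between two observations is sketched below on made-up values. The data frame `toy` and the row labels `obs1` and `obs2` are hypothetical; the sketch only shows that `dist` agrees with the formula computed by hand.

```r
# Two made-up observations (not the assignment data).
toy <- data.frame(
  income     = c(52000, 61000),
  experience = c(5, 12),
  age        = c(28, 40),
  row.names  = c("obs1", "obs2")
)
vars <- c("income", "experience", "age")

# Euclidean distance by hand: square root of the summed squared differences.
x1 <- unlist(toy["obs1", vars])
x2 <- unlist(toy["obs2", vars])
d_manual <- sqrt(sum((x1 - x2)^2))

# The same distance using dist(), whose default method is "euclidean".
d_builtin <- as.numeric(dist(toy[, vars]))
```

Because the data are not standardised here, the variable measured in the largest units (`income` in this toy example) dominates the distance.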

*5. Find the Euclidean Distance between the first and second observation (Nichols and Fisher) using `NewIncome`, `experience` and `age` as the variables. Do NOT standardise the data.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*6. Are the answers to Question 4 and Question 5 the same? Why or why not?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*7. Consider that you are working for a business that streams movies. You have access to data on a list of movies that each customer has seen. How could you use this data to define a distance between two different customers?* **(2 Marks)**

INCLUDE YOUR ANSWER HERE

*8. For the example in the previous question, describe how collaborative filtering can be used to make recommendations of movies to customers.* **(2 Marks)**

INCLUDE YOUR ANSWER HERE

\newpage

# B Principal Components Analysis **(10 Marks)**

*1. Carry out Principal Components Analysis on the data using all numeric variables.* **(2 Marks)**

```{r}
#INCLUDE YOUR R CODE HERE
```
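A minimal sketch of principal components analysis with `prcomp` follows, assuming a simulated data frame `toy` in place of the assignment data. Whether to standardise (`scale. = TRUE`) is a choice the analyst must justify; it is switched on here only because the simulated variables are in very different units.

```r
# Simulated stand-in for the assignment data (values are random, not real).
set.seed(1)
toy <- data.frame(
  income     = rnorm(30, mean = 60000, sd = 15000),
  experience = rnorm(30, mean = 10, sd = 4),
  age        = rnorm(30, mean = 40, sd = 10),
  siblings   = rpois(30, lambda = 2)
)

# Keep only the numeric columns, then run PCA on standardised variables.
num_cols <- sapply(toy, is.numeric)
pca <- prcomp(toy[, num_cols], scale. = TRUE)

pca$rotation                                  # weights of each variable on each PC
pca$sdev                                      # standard deviation of each PC
prop_var <- pca$sdev^2 / sum(pca$sdev^2)      # proportion of variance per PC
cumsum(prop_var)                              # cumulative proportion explained
```

In this sketch, a single weight can be looked up by name, for example `pca$rotation["siblings", "PC4"]`, and `summary(pca)` reports the same proportions of variance in tabular form.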

*2. Did you standardise the variables? Why or why not?* **(2 Marks)**

INCLUDE YOUR ANSWER HERE

*3. What is the weight on number of siblings for the 4th principal component?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*4. What is the standard deviation of the 3rd principal component?* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

INCLUDE YOUR ANSWER HERE

*5. Make a distance biplot.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*6. Pick two variables that according to the biplot are highly positively correlated with one another. If there are no such variables for your dataset, then describe what you would be looking for in the biplot to indicate that two variables are positively correlated.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*7. Pick two variables that according to the biplot are uncorrelated. If there are no such variables for your dataset, then describe what you would be looking for in the biplot to indicate that two variables are uncorrelated.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*8. What proportion of overall variation in the data is explained by the biplot?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

\newpage

# C Multidimensional Scaling **(15 Marks)**

*1. Using only those observations for which `second_language` is French, carry out classical multidimensional scaling. Find a two-dimensional representation and use standardised values of `income`, `experience`, `age`, `education_years` and `siblings` as the variables.* **(4 Marks)**

```{r}
#INCLUDE YOUR R CODE HERE
```
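The steps in Question 1 (filter, standardise, compute distances, scale to two dimensions) can be sketched as follows. The data frame `toy`, its surnames and all of its values are invented for illustration and do not come from the assignment data.

```r
# Made-up stand-in for the assignment data.
toy <- data.frame(
  surname         = c("Adams", "Baker", "Clark", "Davis", "Evans", "Frank"),
  second_language = c("French", "French", "Spanish", "French", "French", "French"),
  income          = c(52000, 61000, 48000, 75000, 58000, 66000),
  experience      = c(5, 12, 3, 20, 8, 15),
  age             = c(28, 40, 25, 51, 33, 45),
  education_years = c(12, 16, 14, 18, 15, 13),
  siblings        = c(1, 0, 2, 3, 1, 2)
)

# Keep only observations whose second language is French.
french <- toy[toy$second_language == "French", ]

# Standardise the chosen variables and compute Euclidean distances.
vars <- c("income", "experience", "age", "education_years", "siblings")
X <- scale(french[, vars])
d <- dist(X)

# Classical multidimensional scaling into k = 2 dimensions.
mds <- cmdscale(d, k = 2)
rownames(mds) <- french$surname
mds
```

Each row of `mds` holds the two coordinates of one individual, so the matrix can be passed to a plotting function together with the surnames as labels.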

*2. Plot a 2-dimensional representation of this data. Rather than plot the observations as points, use the individuals’ surnames.* **(3 Marks)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*3. Name two individuals (by surname) who are similar according to your plot in Question 2, and two individuals (by surname) who are different. If you were unable to generate the plot in Question 2, then describe how you would answer this question.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*4. Plot the same plot as in Question 2 using the Sammon mapping.* **(3 Marks)**

```{r}
#INCLUDE YOUR R CODE HERE
```
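A minimal sketch of the Sammon mapping using `sammon` from the `MASS` package is given below, with observations drawn as text labels rather than points. The simulated matrix `X` and the surnames attached to it are assumptions made purely for illustration.

```r
library(MASS)

# Simulated standardised data for 8 hypothetical individuals and 5 variables.
set.seed(42)
X <- scale(matrix(rnorm(8 * 5), nrow = 8))
surnames <- c("Adams", "Baker", "Clark", "Davis",
              "Evans", "Frank", "Green", "Hayes")

# Euclidean distances between individuals.
d <- dist(X)

# Sammon mapping into 2 dimensions; by default it starts from the
# classical MDS solution. trace = FALSE suppresses the iteration log.
sam <- sammon(d, k = 2, trace = FALSE)

# Draw an empty plot, then place each surname at its coordinates.
plot(sam$points, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(sam$points, labels = surnames)
```

The fitted object also stores the final value of the Sammon stress in `sam$stress`.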

*5. Are your conclusions in Question 3 robust to using a different multidimensional scaling method? If you were unable to generate the plot in Question 2 and/or Question 4, then describe how you would answer this question.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*6. Describe the differences between classical multidimensional scaling and the Sammon mapping.* **(3 Marks)**

INCLUDE YOUR ANSWER HERE

\newpage

# D Correspondence analysis (ETF3500 students only) **(10 Marks)**

*1. Construct a contingency table between the `sector` and `second_language` variables.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*2. Using the contingency table in point 1, perform correspondence analysis on the `sector` and `second_language` variables and visualise the results.* **(2 Marks)**

```{r}
#INCLUDE YOUR R CODE HERE
```
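To illustrate what a correspondence analysis of a contingency table computes, the sketch below carries out the calculation by hand in base R on made-up categories. This is a swapped-in, manual version for illustration; the `ca` package loaded at the top of this document provides a `ca` function that takes a contingency table directly. The data frame `toy` and its categories are hypothetical.

```r
# Made-up categorical data (not the assignment data).
toy <- data.frame(
  sector = c("Finance", "Health", "Tech", "Finance", "Tech", "Health",
             "Tech", "Finance", "Health", "Tech", "Finance", "Health"),
  second_language = c("French", "Spanish", "Spanish", "French", "German", "Spanish",
                      "German", "Spanish", "French", "German", "French", "German")
)

# Contingency table of sector by second language.
tab <- table(toy$sector, toy$second_language)

# Correspondence matrix and the row and column masses.
P <- unclass(tab) / sum(tab)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardised residuals from independence, then their SVD.
S <- diag(1 / sqrt(row_mass)) %*% (P - row_mass %o% col_mass) %*% diag(1 / sqrt(col_mass))
dec <- svd(S)

# Principal inertias and the share of total inertia per dimension.
inertia <- dec$d^2
prop_inertia <- inertia / sum(inertia)

# Principal coordinates of rows and columns on the first two dimensions.
row_pc <- (diag(1 / sqrt(row_mass)) %*% dec$u %*% diag(dec$d))[, 1:2]
col_pc <- (diag(1 / sqrt(col_mass)) %*% dec$v %*% diag(dec$d))[, 1:2]

# Symmetric map: both sets of categories drawn as text labels.
plot(rbind(row_pc, col_pc), type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(row_pc, labels = rownames(tab))
text(col_pc, labels = colnames(tab), font = 3)
```

In this construction the total inertia equals the Pearson chi-squared statistic of the table divided by the number of observations, which gives a quick consistency check on the calculation.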

*3. Based on the results in point 2, which sector is most associated with people who speak Spanish as a second language?* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

INCLUDE YOUR ANSWER HERE

*4. Based on the results in point 2, how much inertia is explained by the first dimension?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*5. Repeat point 2, but this time, only consider those individuals whose `income` is greater than 100000 and `age` is greater than 25.* **(2 Marks)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*6. Based on the results in point 5, how much inertia is explained by the second dimension?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two exercises CA helps explain a larger amount of inertia.* **(2 Marks)**

INCLUDE YOUR ANSWER HERE

\newpage

# E Correspondence analysis (ETF5500 students only) **(10 Marks)**

*1. Using only individuals whose `gender` is Female and whose `income` is less than $200000, construct a contingency table between the `sector` and `second_language` variables.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*2. Using the contingency table in point 1, perform correspondence analysis on the `sector` and `second_language` variables and visualise the results.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*3. Based on the results in point 2, which sector is most associated with people who speak Spanish as a second language?* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

INCLUDE YOUR ANSWER HERE

*4. Based on the results in point 2, how much inertia is explained by the first dimension?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*5. Repeat point 2, but this time, only consider those individuals whose `gender` is Male and whose `income` is less than $200000.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*6. Based on the results in point 5, how much inertia is explained by the second dimension?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two figures CA helps explain a larger amount of inertia.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*8. Discuss the differences or similarities between the results obtained in points 2 and 5. For example, are the associations between `sector` and `second_language` consistent?* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*9. In your own words, describe the role that the singular value decomposition (SVD) of a matrix plays in correspondence analysis.* **(2 Marks)**

INCLUDE YOUR ANSWER HERE

\newpage

# F Factor Modelling **(5 Marks)**

*1. Fit a 2-factor model to the numerical variables in the dataset (set `rotation = "none"`).* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*2. For each of the two factors, list the variables whose factor loadings are greater than 0.1 in absolute value.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE

*3. Provide a plot that visualises the association between factors and variables.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*4. Fit a 2-factor model to the numerical variables in the dataset, but now setting `rotation = "promax"`.* **(1 Mark)**

```{r}
#INCLUDE YOUR R CODE HERE
```

*5. Discuss the differences between the two factor modelling approaches used in questions 1 and 4.* **(1 Mark)**

INCLUDE YOUR ANSWER HERE