If you use R, copy and paste your code and output into Word using Courier font (this makes it easier to read). For each problem that asks for a written response (not just a calculation), be sure to explain and interpret the results.
2. Principal Component Analysis
The head of an airport is looking to determine issues related to efficiencies and operations at the airport. However, they are uncertain where to look.
Given the datasets “airport_cancellations.csv” and “airport_operations.csv”, conduct the following analysis.
a) First you will need to merge this data together – see the merge function in R to create a combined dataset. Consider what you will need to merge the data on. What happened after you merged the data? Anything interesting? Why?
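One way to see what a merge does is to keep unmatched rows and flag where each row came from. The assignment points to the merge function in R; the sketch below uses Python with pandas purely as an illustration. The key columns (airport, year), the other column names, and all values are made-up assumptions, not taken from the actual CSV files.

```python
import pandas as pd

# Hypothetical toy rows; the real CSVs may use different column names.
cancellations = pd.DataFrame({
    "airport": ["AAA", "AAA", "BBB"],
    "year": [2004, 2005, 2004],
    "cancellations": [20, 25, 7],
})
operations = pd.DataFrame({
    "airport": ["AAA", "BBB", "BBB"],
    "year": [2004, 2004, 2005],
    "avg_delay": [10.5, 8.0, 9.5],
})

# An outer merge keeps every row from both tables; indicator=True adds a
# "_merge" column saying whether a row matched in both tables or only one.
merged = pd.merge(
    cancellations, operations,
    on=["airport", "year"], how="outer", indicator=True,
)
match_counts = merged["_merge"].value_counts().to_dict()

print(merged)
print(match_counts)
```

Comparing the row counts before and after the merge, and looking at the rows flagged as left_only or right_only, is a quick way to notice keys that exist in one file but not the other.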
b) Are there any outliers or missing observations that need to be dealt with? How will you deal with them?
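A minimal sketch of one possible screening step, in Python with pandas: count missing values per column and flag outliers with the 1.5 x IQR rule. The IQR rule is only one convention among several, and the column name avg_delay and its values are hypothetical.

```python
import pandas as pd

# Hypothetical column with one missing value and one extreme value.
df = pd.DataFrame({"avg_delay": [10.0, 11.0, 9.0, 12.0, 10.0, 95.0, None]})

# Missing observations per column.
missing_per_column = df.isna().sum()

# Flag values more than 1.5 * IQR outside the quartiles.
col = df["avg_delay"].dropna()
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outlier_mask = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
outliers = col[outlier_mask]

print(missing_per_column)
print(outliers)
```

Whether a flagged value should be dropped, capped, or kept depends on whether it is a data error or a genuine extreme observation, which is part of what the question asks you to argue.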
c) Conduct a principal component analysis. After the PCA, how many dimensions are necessary? Explain how you determined this.
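As an illustration of how the number of dimensions might be judged, the sketch below runs a PCA by eigendecomposition of the correlation matrix in Python with NumPy, then reports the cumulative variance explained and the count of eigenvalues above 1 (the Kaiser rule). The data are synthetic and the Kaiser rule is just one common heuristic; a scree plot or a variance threshold are equally defensible choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): two hidden drivers produce six columns.
latent = rng.normal(size=(200, 2))
weights = rng.normal(size=(2, 6))
X = latent @ weights + 0.3 * rng.normal(size=(200, 6))

# Standardize so every variable contributes on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via the eigendecomposition of the correlation matrix.
corr = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# eigh returns ascending order; sort components from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)
n_kaiser = int((eigenvalues > 1).sum())  # Kaiser rule: eigenvalue > 1

print("eigenvalues:", np.round(eigenvalues, 3))
print("cumulative variance explained:", np.round(cumulative, 3))
print("components with eigenvalue > 1:", n_kaiser)
```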
d) Are there any overlapping loadings? Explain what you will do with the overlapping loadings.
e) Name the dimensions you are left with.
f) Is this analysis reasonable? Are there any issues with your final dimension list?
g) Create new variables in your dataset and compute values for each PCA dimension. Then conduct a correlation analysis between the variables and each dimension. Why is this interesting?
h) Is there a correlation between the PCA values? Is this surprising?
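Parts g) and h) can be explored with a small sketch like the one below (Python with NumPy, synthetic data as an assumption). It computes a score for each observation on each component, correlates the original variables with those scores, and then correlates the scores with one another. For an unrotated PCA on standardized data, the variable-to-score correlations coincide with the loadings, and the scores are uncorrelated with each other by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption): four columns, two correlated pairs.
X = rng.normal(size=(150, 4))
X[:, 1] += 0.8 * X[:, 0]
X[:, 3] += 0.6 * X[:, 2]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# New variables: one score per observation per component.
scores = Z @ eigenvectors

# Correlation between each original variable (rows) and each component (columns).
var_score_corr = np.corrcoef(Z, scores, rowvar=False)[:4, 4:]

# Loadings: eigenvectors scaled by the square root of their eigenvalues.
loadings = eigenvectors * np.sqrt(eigenvalues)

# Correlation among the component scores themselves.
score_corr = np.corrcoef(scores, rowvar=False)

print(np.round(var_score_corr, 3))
print(np.round(score_corr, 3))
```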
3. Factor Analysis
Sports analytics are very popular. The dataset fifa.csv contains information about soccer (football) players taken from FIFA19. It would be interesting to see whether the ratings it uses can be summarized by a simple set of factors.
a) Identify the columns that are the best candidates for analysis as factors.
b) Conduct a factor analysis using these columns (if you get an error, reduce the number of factors in the function until the error disappears).
c) Justify your choice of rotation method. What does it tell you?
d) Are there any correlations between the factors?
e) Create the diagram of the factors (by hand or in PowerPoint); be sure to label everything.
f) Split the dataset into two parts (left-footed players and right-footed players). Conduct a factor analysis on each part and explain any differences you see between right-footed and left-footed players. Note that you do not need to draw a diagram for this one.
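To make the role of rotation in part c) more concrete, here is a hedged sketch of a varimax rotation in Python with NumPy. Varimax is only one possible rotation (it is orthogonal, so it keeps the factors uncorrelated; oblique rotations do not), and the loading matrix below is invented for illustration rather than estimated from fifa.csv. The function name varimax and its parameters are this sketch's own.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix toward simple structure."""
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion, solved via SVD.
        target = rotated**3 - rotated @ np.diag(np.sum(rotated**2, axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        current = s.sum()
        if previous != 0.0 and current / previous < 1 + tol:
            break
        previous = current
    return loadings @ rotation, rotation

def varimax_criterion(loadings):
    # Sum over factors of the variance of the squared loadings.
    return float(np.sum(np.var(loadings**2, axis=0)))

# Invented loadings: a clean two-factor pattern, deliberately blurred by 30 degrees.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.9, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
theta = np.deg2rad(30)
blur = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ blur

rotated, rotation = varimax(unrotated)

print(np.round(unrotated, 2))
print(np.round(rotated, 2))
```

The rotation leaves each variable's communality (the row sum of squared loadings) unchanged; it only redistributes the loadings so that each variable tends to load mainly on one factor, which is what makes the factors easier to name.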
4. K-Nearest Neighbors (KNN)
A supervisor wishes to conduct a classification analysis on the breast_cancer.csv dataset to see if new observations can be properly classified.
a) Examine the dataset and explain why KNN might be used, discuss the benefits of this algorithm, and discuss the drawbacks.
b) Create a correlation plot of the relevant variables, and make an initial assessment of the relationships between each of the variables and the classifier.
c) Conduct a KNN classification using all of the relevant columns.
i) Discuss your initial approach, including how you will set this up and what the steps are, including the number of observations in your training set and the number of observations in your test set.
ii) Provide a measure of accuracy (since this is a straight classification, you only need a simple assessment).
iii) Reduce the number of variables from the original 10. How would you decide which variables might be better to use in this assessment?
iv) Conduct a KNN analysis using these variables, and determine what would be a reasonable KNN model, considering the lowest number of columns needed with the most reasonable accuracy. Defend your position.
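The setup steps in part c) can be sketched as follows in Python with NumPy: split the observations into training and test sets, standardize using training statistics only, classify each test point by majority vote among its k nearest training points, and report simple accuracy. The data are synthetic two-class points, not breast_cancer.csv, and the 70/30 split and k = 5 are arbitrary assumptions of this sketch.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    # Euclidean distance from every test point to every training point.
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(42)

# Synthetic two-class data (an assumption), 100 observations per class.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then use 70% of observations for training and 30% for testing.
idx = rng.permutation(len(y))
n_train = int(0.7 * len(y))
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Standardize with training-set statistics so the test set stays unseen.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std

predictions = knn_predict(X_train_s, y_train, X_test_s, k=5)
accuracy = float((predictions == y_test).mean())
print("test accuracy:", round(accuracy, 3))
```

Repeating the same procedure on different subsets of columns, and comparing the resulting accuracies, is one straightforward way to approach parts iii) and iv).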
5. Decision Tree
People analytics is becoming an increasingly popular and demanding area of data mining. The Head of Human Resources is looking to identify reasons for attrition (people leaving, voluntarily or otherwise). The task is to develop a decision tree to determine this. You will need to merge three datasets (employee_survey_data.csv, general_data.csv, manager_survey_data.csv) to accomplish this task.
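A three-way merge can be done by merging the tables one pair at a time. The sketch below shows this in Python with pandas; the shared key EmployeeID and the other column names and values are assumptions made for illustration and should be checked against the actual files.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature versions of the three files.
general = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Attrition": ["No", "Yes", "No"],
})
employee_survey = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "JobSatisfaction": [4, 3, None],
})
manager_survey = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "PerformanceRating": [3, 4, 3],
})

# Fold the list of tables into one by repeatedly merging on the shared key.
combined = reduce(
    lambda left, right: pd.merge(left, right, on="EmployeeID", how="inner"),
    [general, employee_survey, manager_survey],
)

print(combined)
print(combined.isna().sum())
```

Checking the row count and the missing-value counts right after the merge makes it easier to spot employees who are absent from one of the surveys.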
a) Examine the dataset and determine if there are any issues. Produce histograms and summary stats for the appropriate columns and identify any issues.
b) Determine and list any columns that are not needed in a first “full” model.
c) After examining the data, which methods (algorithms) do you think are appropriate?
d) Develop the decision tree model and compare the methods based on accuracy.
e) Provide a chart of your final model.
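For intuition about what a decision tree does at each node, the sketch below finds the best single split on one numeric column by minimizing weighted Gini impurity, in Python with NumPy. Gini is one common splitting criterion; the assignment does not say which one to use, and the feature name years_at_company and its values are hypothetical. A full tree applies this search recursively over all columns.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 means perfectly pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions**2))

def best_split(x, y):
    """Return the threshold on x that minimizes weighted Gini impurity."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_impurity = None, gini(y)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold fits between identical values
        threshold = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

# Hypothetical toy data: short-tenure employees left (1), long-tenure stayed (0).
years_at_company = np.array([1, 2, 3, 10, 11, 12])
left_company = np.array([1, 1, 1, 0, 0, 0])

threshold, impurity = best_split(years_at_company, left_company)
print(threshold, impurity)
```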