PRACTICE SHORT ANSWER TEST
Please note the following rules for this assessment:
1. You are expected to answer all questions.
2. All questions carry equal marks.
3. Answer each question in the space provided on this paper.
4. Any information required is either printed in the text or is available as a file on Blackboard.
5. You may use any written or online materials you wish to help you answer the questions.
6. By taking this test, you are agreeing that you will work alone and will not consult anyone for help. Breaches of this rule will be taken very seriously.
Student number:…………………………………………………
QUESTION 1
The above figure shows a qqnorm plot of the residuals after a linear regression analysis.
Describe the plot as fully as you can to someone unfamiliar with qqnorm plots, explaining:
· whether the residuals meet the assumptions of simple linear regression;
· how you reached this conclusion; and
· what action you might take, if any.
[10 marks]
QUESTION 2
Briefly explain the following statistical or data science concepts and terms [2 marks each]:
A. Jack-knife
B. Feature engineering
C. Winsorizing
D. Bayesian
E. Outlier
QUESTION 3
Explain what the following lines of R code do. The object “dog” is a dataframe which contains the variables “weight”, “food_intake” and “hunger_index”. [2 marks each]
A. str(dog)
B. hist(dog$weight)
C. lines(lowess(dog$food_intake,dog$hunger_index,f=.1),col="red")
D. plot(dog$food_intake,dog$weight)
E. mean(weight,na.rm=TRUE)
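For practice, the lines in this question can be run against a small made-up data frame. The sketch below is only an assumption about what "dog" might look like: the values, the seed and the single missing weight are invented for illustration and are not course data.

```r
# Hypothetical stand-in for the "dog" data frame (all values are invented).
set.seed(42)
dog <- data.frame(
  weight       = c(rnorm(19, mean = 20, sd = 5), NA),  # one missing value on purpose
  food_intake  = runif(20, min = 200, max = 600),
  hunger_index = runif(20, min = 0, max = 10)
)

# Quick look at the first few rows before trying lines A to E.
head(dog)
```

Running each of A to E in turn against this object, and reading any messages R prints, is one way to check a written explanation.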
QUESTION 4
Import the file “Satellite.xls” into R and do the following:
A. Scale but do not center the variable for band 1 and then calculate its mean and standard deviation. Write the answers here using 4 decimal places [5 marks]:
B. Paste below a histogram created in R of your scaled variable for band 1, requesting 30 bins (bars) shaded in the colour green. Label the histogram as fully as possible [5 marks].
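When checking an answer to part A, it may help to see what scale() does when centering is switched off. According to the R documentation, with center = FALSE the values are divided by their root mean square, sqrt(sum(x^2)/(n-1)), rather than by the standard deviation. The sketch below illustrates this on made-up numbers; "band1" here is a hypothetical vector, not the contents of Satellite.xls.

```r
# Hypothetical data standing in for band 1 (not the real Satellite.xls values).
set.seed(1)
band1 <- rnorm(50, mean = 60, sd = 10)

# Scale without centering; as.numeric() drops the matrix attributes scale() adds.
band1_scaled <- as.numeric(scale(band1, center = FALSE))

# Root mean square as described in ?scale for the center = FALSE case.
rms <- sqrt(sum(band1^2) / (length(band1) - 1))

# The two routes should agree to numerical precision.
stopifnot(isTRUE(all.equal(band1_scaled, band1 / rms)))

# Summary statistics reported to 4 decimal places.
round(mean(band1_scaled), 4)
round(sd(band1_scaled), 4)
```

Because the variable is not centered, the mean of the scaled values is not expected to be zero and the standard deviation is not expected to be exactly one.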
QUESTION 5
Answer the following questions [2 marks each]:
A. What form of random sampling does the Bootstrap use?
B. Who put forward the method of multiple working hypotheses?
C. Why is the Yeo-Johnson transformation more generally applicable than the Box-Cox transformation?
D. Who created the tidyverse?
E. Who proposed that 0.05 was an acceptable level for significance testing?
QUESTION 6
Write a paragraph to interpret as fully as possible the output from R given below. Explain the likely research question, what analysis has been undertaken and what the results mean. The dataset AirPoll contains air pollution data (HC, NOX and SOX), and includes the variable MORT meaning “mortality”. [10 marks]
lm(formula = MORT ~ PREC + JANT + JULT + OVR65 + POPN + EDUC +
    HOUS + DENS + NONW + WWDRK + POOR + HC + NOX + SOX + HUMID,
    data = AirPoll)
Residuals:
    Min      1Q  Median      3Q     Max
-68.066 -18.017   0.912  19.224  86.961
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.764e+03  4.373e+02   4.034 0.000215 ***
PREC         1.905e+00  9.237e-01   2.063 0.045071 *
JANT        -1.938e+00  1.108e+00  -1.748 0.087413 .
JULT        -3.100e+00  1.902e+00  -1.630 0.110159
OVR65       -9.065e+00  8.486e+00  -1.068 0.291230
POPN        -1.068e+02  6.978e+01  -1.531 0.132952
EDUC        -1.716e+01  1.186e+01  -1.447 0.155085
HOUS        -6.511e-01  1.768e+00  -0.368 0.714393
DENS         3.600e-03  4.027e-03   0.894 0.376147
NONW         4.460e+00  1.327e+00   3.360 0.001618 **
WWDRK       -1.871e-01  1.662e+00  -0.113 0.910883
POOR        -1.676e-01  3.227e+00  -0.052 0.958807
HC          -6.721e-01  4.910e-01  -1.369 0.177985
NOX          1.340e+00  1.006e+00   1.333 0.189506
SOX          8.625e-02  1.475e-01   0.585 0.561745
HUMID        1.068e-01  1.169e+00   0.091 0.927644
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 34.93 on 44 degrees of freedom
Multiple R-squared: 0.7649, Adjusted R-squared: 0.6847
F-statistic: 9.542 on 15 and 44 DF, p-value: 2.193e-09
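Several of the figures printed in this output can be cross-checked against one another, which may help when reading it. The sketch below is a rough consistency check only: it uses the rounded values from the printout, so results should match the output approximately rather than exactly. The sample size n is inferred from the degrees of freedom and is not stated in the question.

```r
# Residual df = n - (number of predictors) - 1, so n can be inferred.
df_resid <- 44
n_pred   <- 15
n        <- df_resid + n_pred + 1

# t value for PREC is the estimate divided by its standard error.
t_prec <- 1.905e+00 / 9.237e-01

# Two-sided p-value from the t distribution on the residual df.
p_prec <- 2 * pt(-abs(t_prec), df = df_resid)

# Adjusted R-squared penalises R-squared for the number of predictors.
r2     <- 0.7649
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

c(n = n, t_prec = t_prec, p_prec = p_prec, adj_r2 = adj_r2)
```

The same arithmetic applies to every row of the coefficient table, so any single row can be checked in this way.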
QUESTION 7
Explain the purpose of each of the following five lines of R code [2 marks per line of code]:
library(caret)
ES.group <- read.csv("ES_students.csv",header = TRUE)
train.control <- trainControl(method="cv", number=6)
ES.model <- train(student.mark ~ hours.worked + chocolate.eaten + wine.consumed, data=ES.group, trControl=train.control, method="lm")
summary(ES.model)
QUESTION 8
The very messy data set “Messy_loans_20.xlsx” contains one variable called “ratio between assets and liabilities” and another called “predicted ratio between assets and liabilities”. This prediction was derived from a linear model. To assess how good the model was, calculate the root mean squared error (RMSE) between these two variables using R. To show that you have done this correctly, paste your code below and include the value of RMSE using 4 decimal places. [10 marks]
QUESTION 9
A student is working on a dataframe called Species which contains the variables Weight, Diet and Age. She has made at least five mistakes in the R code she has written:
Species <- read.csv("Species.csv",header = TRUE)
View(species)
Species.sample <- subset(Species, Age < 9, select= Weight:Diet)
Now plot the data:
plot(Species.sample$Weight, Species.sample$Diet)
lines(lowess(Species_sample$Weight,Species.sample$Diet,f=.1),col="blue")
abline(lm(Species.sample$Diet#Species.sample$Weight), col="green")
histo(Species.sample$Weight,Col="biege")
Find five of the mistakes and briefly explain how to correct them. [2 marks each]
QUESTION 10
A researcher has watched a group of elephants at a waterhole and recorded how long each animal spends drinking. Write a single line of R code to calculate the (sample) standard deviation of the total time spent drinking, which is currently recorded in a dataframe as two separate variables, HOURS and MINUTES. The code must give the answer in minutes. Assume that the variables are contained within a dataframe called DRINKING. [10 marks]
END OF PAPER