Lab session 1: Solutions
RMIT University-MATH1302
02/03/2022
library(mosaic) # favstats()
library(ggplot2)
library(tidyverse) # summarise()
library(car) #ncvTest()
library(randtests) # runs.test()
Question 1:
Matt belongs to a golf club in his neighbourhood. He divides the year into three golf seasons: summer (June-September), winter (November-March), and shoulder (October, April, and May). He believes that he plays his best golf during the summer (because he has more time and the course isn’t crowded) and shoulder (because the course isn’t crowded) seasons, and his worst golf during the winter (because all of the part-year residents show up, the course is crowded, play is slow, and he gets frustrated). Data from the last year are shown in the following table:
Summer (1): 83, 85, 85, 87, 90, 88, 88, 84, 91, 90
Shoulder (2): 91, 87, 84, 87, 85, 86, 83
Winter (3): 94, 91, 87, 85, 87, 91, 92, 86
The aim of this question is to investigate the effectiveness of the three seasons on his performance.
a) Create a data frame called ‘Golf’ in R to store the above data. Use the str() command to display the structure of the data.
Create the data frame. Note that seasons will be treated as a categorical variable (1 = Summer, 2 = Shoulder, 3 = Winter).
Golf<- data.frame(seasons = c(1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3),
observations = c(83,85,85,87,90,88,88,84,91,90,
91,87,84,87,85,86,83,
94,91,87,85,87,91,92,86))
head(Golf)
## seasons observations
## 1 1 83
## 2 1 85
## 3 1 85
## 4 1 87
## 5 1 90
## 6 1 88
Export the data to the desktop
write.csv(Golf,"/Users/abdulrahmanalamri/Desktop/Data/Golf.csv", row.names = TRUE) # to Export DataFrame as a CSV file from RStudio to desktop
Display the structure of the data.
str(Golf)
## 'data.frame': 25 obs. of 2 variables:
## $ seasons : num 1 1 1 1 1 1 1 1 1 1 ...
## $ observations: num 83 85 85 87 90 88 88 84 91 90 ...
b) Treat the season as a categorical variable (1 = Summer, 2 = Shoulder, 3 = Winter). Check that the season variable has been converted into a factor.
Golf$seasons <- as.factor(Golf$seasons) # convert the treatment variable into a factor
str(Golf)
## 'data.frame': 25 obs. of 2 variables:
## $ seasons : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ observations: num 83 85 85 87 90 88 88 84 91 90 ...
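Optionally, the season codes can be given descriptive labels. The sketch below is an assumption, not part of the original solution, and it stores the relabelled factor in a new object (season_labels) so that the 1/2/3 coding used in the rest of the output is unchanged.
# Hypothetical sketch: attach descriptive labels to the season codes
season_labels <- factor(Golf$seasons, levels = c("1", "2", "3"),
                        labels = c("Summer", "Shoulder", "Winter"))
table(season_labels)  # counts per season, now with readable names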
c) Calculate all the descriptive statistics.
Golf %>%
  group_by(seasons) %>%
  summarise(n = n(),
            mean = mean(observations, na.rm = TRUE),
            sd = sd(observations, na.rm = TRUE),
            stderr = sd/sqrt(n),
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median = median(observations, na.rm = TRUE),
            min = min(observations, na.rm = TRUE),
            max = max(observations, na.rm = TRUE),
            IQR = IQR(observations, na.rm = TRUE))
## # A tibble: 3 x 11
## seasons n mean sd stderr LCL UCL median min max IQR
##
## 1 1 10 87.1 2.77 0.875 85.1 89.1 87.5 83 91 4.5
## 2 2 7 86.1 2.61 0.986 83.7 88.6 86 83 91 2.5
## 3 3 8 89.1 3.27 1.16 86.4 91.9 89 85 94 4.5
Alternatively: using mosaic package
favstats(~ observations | seasons, data = Golf)
## seasons min Q1 median Q3 max mean sd n missing
## 1 1 83 85.00 87.5 89.50 91 87.10000 2.766867 10 0
## 2 2 83 84.50 86.0 87.00 91 86.14286 2.609506 7 0
## 3 3 85 86.75 89.0 91.25 94 89.12500 3.270539 8 0
d) Prepare a boxplot that compares the seasons.
You can graphically check whether there is a difference between the distributions using a box plot.
boxplot(observations ~ seasons, data = Golf)
Note that you can include jittered points using the ggplot2 package.
ggplot(Golf, aes(x = seasons, y = observations, fill = seasons)) +
geom_boxplot() +
geom_jitter(shape = 15,
            color = "steelblue",
            position = position_jitter(0.21)) +
theme_classic() +
labs(title = "Comparative Box Plot")
e) Comment on the general distribution of the data using your boxplot as your guide. Have you identified any outliers?
From the boxplot, we can see that the three seasons do not appear to have very different effects on Matt’s performance. However, there are some outliers, especially in the Shoulder season (2).
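To back up the visual impression, here is a small sketch (not part of the original solution) that lists the points flagged as outliers by the usual 1.5 × IQR boxplot rule within each season:
# Outliers per season according to boxplot.stats() (1.5 * IQR rule)
tapply(Golf$observations, Golf$seasons, function(x) boxplot.stats(x)$out)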
f) What is the best statistical test to compare outcomes among more than two independent groups?
Analysis of variance (ANOVA) is used to analyze the differences among means in more than two independent samples.
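As a minimal illustration of what ANOVA does (a hand computation, not part of the original solution), the between- and within-group sums of squares can be computed directly; the results should match the ANOVA table produced in part (g).
# One-way ANOVA decomposition by hand, using the Golf data created above
grand_mean <- mean(Golf$observations)
# Between-group (treatment) sum of squares
ss_trt <- sum(tapply(Golf$observations, Golf$seasons,
                     function(x) length(x) * (mean(x) - grand_mean)^2))
# Within-group (error) sum of squares
ss_err <- sum(tapply(Golf$observations, Golf$seasons,
                     function(x) sum((x - mean(x))^2)))
df_trt <- nlevels(Golf$seasons) - 1           # 3 seasons - 1
df_err <- nrow(Golf) - nlevels(Golf$seasons)  # 25 observations - 3 seasons
f_stat <- (ss_trt / df_trt) / (ss_err / df_err)
p_val  <- pf(f_stat, df_trt, df_err, lower.tail = FALSE)
round(c(SS_seasons = ss_trt, SS_error = ss_err, F = f_stat, p = p_val), 4)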
g) Does the data indicate that his opinion is correct? Use \(\alpha = 0.05\).
We are going to state the hypotheses about the season effects and test their significance:
\(H_0: \tau_1= \tau_2= \tau_3= 0\).
\(H_a: \text{at least one } \tau_i \neq 0\)
where \(\tau_i\) are the treatment effects of the three seasons on Matt’s performance.
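For reference (standard one-way ANOVA notation, not stated explicitly above), these hypotheses refer to the effects model
\[ y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, 3, \quad j = 1, \dots, n_i, \]
where \(\mu\) is the overall mean, \(\tau_i\) is the effect of season \(i\), and \(\varepsilon_{ij} \sim N(0, \sigma^2)\) are the random errors.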
We can run the one-way ANOVA test with the aov() command.
model <- aov(formula = observations ~ seasons, data=Golf)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## seasons 2 35.61 17.804 2.121 0.144
## Residuals 22 184.63 8.392
Alternatively:
model <- lm(formula = observations ~ seasons, data=Golf)
anova(model)
## Analysis of Variance Table
## Response: observations
## Df Sum Sq Mean Sq F value Pr(>F)
## seasons 2 35.608 17.8039 2.1214 0.1437
## Residuals 22 184.632 8.3924
The p-value for the seasons factor (\(p=0.1437\)) is greater than the usual threshold of \(\alpha=0.05\), so we fail to reject the null hypothesis and conclude that there are no differences in Matt’s performance over the different seasons. Thus, the data do not support his opinion.
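If the p-value is needed programmatically (a small sketch, not part of the original solution), it can be pulled out of the ANOVA table of the lm fit above:
# Extract the p-value for the seasons effect from the ANOVA table
anova(model)[["Pr(>F)"]][1]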
h) Construct the confidence intervals. What do the results tell us?
summary(model)
## lm(formula = observations ~ seasons, data = Golf)
## Residuals:
## Min 1Q Median 3Q Max
## -4.125 -2.125 -0.100 1.875 4.875
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.1000 0.9161 95.077 <2e-16 ***
## seasons2 -0.9571 1.4276 -0.670 0.510
## seasons3 2.0250 1.3741 1.474 0.155
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.897 on 22 degrees of freedom
## Multiple R-squared: 0.1617, Adjusted R-squared: 0.08547
## F-statistic: 2.121 on 2 and 22 DF, p-value: 0.1437
confint(model)
## 2.5 % 97.5 %
## (Intercept) 85.2001274 88.999873
## seasons2 -3.9178805 2.003595
## seasons3 -0.8248089 4.874809
We can check whether \(0\) is contained in the \(95\%\) confidence intervals of the seasonal effect estimates. Since the \(95\%\) confidence intervals for seasons2 and seasons3 both contain \(0\), we do not reject the null hypothesis at the significance level of \(0.05\), which means there is no evidence of differences in Matt’s performance over the different seasons, so the data do not support his opinion.
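As a sketch of what confint() is doing (an illustration, not part of the original solution), the 95% interval for the seasons3 coefficient can be reproduced by hand as estimate ± t × standard error:
# Reproduce the confint() interval for the seasons3 coefficient
est   <- coef(summary(model))["seasons3", "Estimate"]
se    <- coef(summary(model))["seasons3", "Std. Error"]
tcrit <- qt(1 - 0.05 / 2, df = df.residual(model))
c(lower = est - tcrit * se, upper = est + tcrit * se)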
i) Analyse the residuals from this experiment and comment on model adequacy.
par(mfrow=c(2,2))
plot(model)
The normal probability plot shows that the residuals follow an approximately straight line and that there are no outliers, suggesting that the residuals are normally distributed. The residuals versus fitted plot displays points with no recognisable pattern that fall randomly above and below zero, suggesting that the residuals are randomly distributed with constant variance (homoscedastic).
To clarify the meaning of the plots (a sketch for drawing each panel separately follows this list):
1. Residuals vs Fitted This plot shows if residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication you don’t have non-linear relationships.
2. Normal Q-Q This plot shows if residuals are normally distributed. Do residuals follow a straight line well or do they deviate severely? It’s good if residuals are lined well on the straight dashed line.
3. Scale-Location It’s also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.
4. Residuals vs Leverage This plot helps us find influential cases (i.e., observations), if any. Not all outliers are influential in linear regression analysis: even if an observation has an extreme value, it might not change the fitted regression line much, and the results would be similar whether we include or exclude it. On the other hand, some cases can be very influential even though their values look to be within a reasonable range; they go against the trend of the majority of the cases, and excluding them would noticeably alter the results. Unlike the other plots, patterns are not what matters here. We watch for points in the upper right or lower right corners, outside the dashed Cook’s distance lines; cases outside the Cook’s distance bands (i.e., with high Cook’s distance scores) are influential, and the regression results would change if we excluded them.
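Here is the short sketch mentioned above (not part of the original solution) drawing each diagnostic panel separately via the which argument of plot.lm():
par(mfrow = c(2, 2))
plot(model, which = 1)  # Residuals vs Fitted
plot(model, which = 2)  # Normal Q-Q
plot(model, which = 3)  # Scale-Location
plot(model, which = 5)  # Residuals vs Leverage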
Check assumptions statistically:
Residual Test:
a. Equality of variances - homogeneity
\(H_0\): variances are equal
\(H_1\): at least one variance is different
ncvTest(model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.4618596, Df = 1, p = 0.49676
The corresponding p-value (\(0.49676\)) is greater than \(\alpha = 0.05\). Therefore, we fail to reject the null hypothesis, which means there is no evidence against constant variance: the homogeneity assumption is satisfied.
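As an additional cross-check of equal variances (not part of the original solution), Bartlett's test from base R can be applied to the same grouping:
# Bartlett's test of homogeneity of variances across the three seasons
bartlett.test(observations ~ seasons, data = Golf)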
b. Normality of Residual
\(H_0\): Errors are normally distributed
\(H_1\): Errors are not normally distributed
y = rstudent(model)
shapiro.test(y)
## Shapiro-Wilk normality test
## data: y
## W = 0.94796, p-value = 0.2254
The p-value (\(0.2254\)) is larger than the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis. There is no evidence against normality, so the residuals are consistent with a normal distribution.
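A quick visual cross-check of normality (a sketch, reusing the studentized residuals y computed above):
# Normal Q-Q plot of the studentized residuals
qqnorm(y)
qqline(y)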
c. Autocorrelation:
\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated
acf(model$residuals)
The ACF plot shows that no lags exceed the horizontal dashed bounds. Therefore, we fail to reject the null hypothesis, which suggests there is no autocorrelation in the residuals.
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1450783 1.565905 0.144
## Alternative hypothesis: rho != 0
The p-value (\(0.144\)) is greater than the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis. There is no evidence of autocorrelation in the residuals.
d. Randomness
\(H_0\): Errors are random
\(H_1\): Errors are not random
runs.test(rstudent(model))
## Runs Test
## data: rstudent(model)
## statistic = -0.83485, runs = 11, n1 = 12, n2 = 12, n = 24, p-value =
## alternative hypothesis: nonrandomness
The corresponding p-value is much greater than the significance level \(\alpha = 0.05\). Therefore, we fail to reject the null hypothesis, which means the residuals appear to be random and the independence assumption is satisfied.
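As a visual companion to the runs test (a sketch, not part of the original solution), plotting the studentized residuals in observation order should show a random scatter around zero:
# Studentized residuals in run (observation) order
plot(rstudent(model), type = "b", ylab = "Studentized residuals")
abline(h = 0, lty = 2)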