---
title: "R Notebook"
output:
  html_document:
    df_print: paged
---
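```{r include=FALSE}
# Install the required packages only if they are not already available
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(car)) install.packages("car")
if(!require(dplyr)) install.packages("dplyr")
library(dplyr)
library(tidyverse)
library(car)
# Keep the first 1500 rows of the ggplot2 diamonds data
dmnds <- diamonds %>%
  slice(1:1500)
```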
# Question [45 marks]
The **dmnds** data set that is loaded into memory contains the prices and other attributes of 1500 diamonds.
1. Create a copy of the **dmnds** dataset that contains the following variables: [5 marks]
* cut – quality of the cut
* price – price in US dollars
2. Perform exploratory data analysis to reveal how many diamonds are in each *cut* category, how many *price* values are missing, and the minimum, maximum, mean, median and standard deviation of *price* within each *cut* category.
3. Draw boxplots that show how the collected *price* values vary across the *cut* categories.
4. Investigate whether there is a statistically significant difference in *price* between the *cut* categories, and state your findings by interpreting the resulting p value.
5. Investigate further which *cut* categories have statistically significant differences between their means, and state your findings by interpreting the resulting p values.
# Solutions
1.
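```{r}
# Approach 1 (base R, kept for reference): pick the columns by position.
# cut_col <- dmnds[,2]
# price_col <- dmnds[,7]
# dmnds_copy <- data.frame(cut=cut_col, price=price_col)
# print(dmnds_copy)

# Approach 2 (dplyr): select the columns by name.
dmnds_copy <- dmnds %>%
  select(cut, price)
print(dmnds_copy)
# Approach 1 gives the same result as approach 2.
```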
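The solution chunk notes that the two approaches are equivalent. A minimal sketch of how that could be checked, assuming the standard column order of `ggplot2::diamonds` (cut in position 2, price in position 7); the names `by_position` and `by_name` are illustrative only:

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

dmnds <- diamonds %>% slice(1:1500)

# Select the two columns by position and by name
by_position <- dmnds[, c(2, 7)]
by_name <- dmnds %>% select(cut, price)

# Compare column names and contents of the two copies
same_names <- identical(names(by_position), names(by_name))
same_values <- isTRUE(all.equal(as.data.frame(by_position),
                                as.data.frame(by_name)))
print(c(same_names = same_names, same_values = same_values))
```

Selecting by name is usually the more robust choice, because it does not depend on the column order staying fixed.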
2.
```{r}
# Summary statistics of price per cut category
dmnds_copy %>%
  group_by(cut) %>%
  summarise(
    # how many diamonds are in each cut category
    cut_categories = n(),
    # how many *price* values are missing
    price_missing = sum(is.na(price)),
    # minimum per each *cut* category
    min_categories = min(price, na.rm=TRUE),
    # maximum per each *cut* category
    max_categories = max(price, na.rm=TRUE),
    # mean per each *cut* category
    mean_categories = mean(price, na.rm=TRUE),
    # median per each *cut* category
    median_categories = median(price, na.rm=TRUE),
    # standard deviation per each *cut* category
    sd_categories = sd(price, na.rm=TRUE)
  )
```
3.
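```{r}
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(plotly)) install.packages("plotly")
library(ggplot2)
library(plotly)
# Base R alternative: boxplot(dmnds_copy$price ~ dmnds_copy$cut)
# Static boxplot of price per cut category
price_boxplot <- dmnds_copy %>%
  ggplot(aes(x=cut, y=price)) +
  geom_boxplot()
price_boxplot
# Interactive version of the same plot
plotly::ggplotly(price_boxplot)
```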
4.
H0: The mean price is the same across all *cut* categories.
HA: At least one mean differs from the others.
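```{r}
# An earlier attempt used a single t-test, which does not apply here
# because cut has more than two categories:
# dmnds_cut_pvalue <- t.test(cut_categories,dmnds$price)$p.value
# print(dmnds_cut_pvalue)
# One-way ANOVA of price on cut
anova_test <- aov(formula=price ~ as.factor(cut), data=dmnds_copy)
summary(anova_test)
```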
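To make the reported p value less of a black box, the sketch below recomputes the one-way ANOVA F statistic by hand from the between-group and within-group sums of squares and compares it with the `aov()` output. It is an illustration only: the helper names (`ss_between`, `ss_within`, `f_manual`, `p_manual`) are not part of the original solution, and it assumes there are no missing prices.

```r
library(ggplot2)  # only needed for the diamonds data

dmnds_copy <- diamonds[1:1500, c("cut", "price")]

# F statistic and p value as reported by aov()
aov_table <- summary(aov(price ~ cut, data = dmnds_copy))[[1]]
f_aov <- aov_table[["F value"]][1]
p_aov <- aov_table[["Pr(>F)"]][1]

# Per-group size, mean and variance of price
group_n <- tapply(dmnds_copy$price, dmnds_copy$cut, length)
group_mean <- tapply(dmnds_copy$price, dmnds_copy$cut, mean)
group_var <- tapply(dmnds_copy$price, dmnds_copy$cut, var)

k <- length(group_n)  # number of cut categories
n <- sum(group_n)     # total number of diamonds
grand_mean <- mean(dmnds_copy$price)

# Between-group and within-group sums of squares
ss_between <- sum(group_n * (group_mean - grand_mean)^2)
ss_within <- sum((group_n - 1) * group_var)

# F is the ratio of the two mean squares; the p value is the upper tail
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(c(f_aov = f_aov, f_manual = f_manual))
print(c(p_aov = p_aov, p_manual = p_manual))
```

The two pairs of values should agree up to floating-point error, which shows that the p value is the probability of observing an F statistic at least this large if H0 were true.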
From the results shown above, the test is significant at the 5% level because p = 0.00123 < 0.05. We are therefore very confident that HA is to be preferred to H0, i.e. at least one mean is different from the others.
Having concluded that at least one pair of means differs, but not knowing which one, we need to use t-tests with Bonferroni correction to compare each pair of means with each other, i.e. multiple comparisons.
5.
```{r}
pairwise.t.test(x = dmnds_copy$price, g = as.factor(dmnds_copy$cut),
                p.adjust.method = "bonferroni")
```
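The Bonferroni correction used above multiplies each unadjusted p value by the number of comparisons and caps the result at 1. The following sketch, with illustrative variable names (`raw_p`, `adjusted_p`, `manual_p`), checks this against the output of `pairwise.t.test()`:

```r
library(ggplot2)  # only needed for the diamonds data

dmnds_copy <- diamonds[1:1500, c("cut", "price")]

# Unadjusted and Bonferroni-adjusted pairwise p values
raw_p <- pairwise.t.test(dmnds_copy$price, dmnds_copy$cut,
                         p.adjust.method = "none")$p.value
adjusted_p <- pairwise.t.test(dmnds_copy$price, dmnds_copy$cut,
                              p.adjust.method = "bonferroni")$p.value

# Number of pairwise comparisons between the cut categories
n_pairs <- choose(nlevels(dmnds_copy$cut), 2)

# Manual Bonferroni adjustment: multiply by the number of comparisons, cap at 1
manual_p <- pmin(raw_p * n_pairs, 1)

print(all.equal(manual_p, adjusted_p))
```

Because every p value is inflated by the number of comparisons, the correction is conservative: a pair is declared different only when the evidence survives that inflation.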
Conclusions:
1. For the "Fair-Good" and "Fair-Very Good" comparisons of mean *price*, the tests are significant at the 5% level, and indeed at the 1% level, because p = 0.00039 < 1% and p = 0.00629 < 1% respectively. Therefore, we are very confident that HA is to be preferred to H0, i.e. very confident that mean *price* differs between Fair and Good and between Fair and Very Good.
2. In the other comparisons, the test is highly significant at the 5% level because 0.1% < p < 1%. There is considerable evidence for rejecting H0 in favor of HA, i.e. considerable evidence of a difference in mean *price* between the other pairs of *cut* categories.