Statistics Lab Key
Read in the infant mortality data from indicatorinfantmort35.csv
mort = read_csv("data/indicatorinfantmort35.csv")
## Rows: 197 Columns: 255
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770,…
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
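To quiet this message on later reads, one option is readr's show_col_types argument (a sketch; same file as above):
mort = read_csv("data/indicatorinfantmort35.csv", show_col_types = FALSE)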
Compute the correlation between the 1980, 1990, 2000, and 2010 mortality data. No need to save this in an object. Just display the result to the screen.
sub = mort %>%
  select(`1980`, `1990`, `2000`, `2010`)
cor(sub)
## 1980 1990 2000 2010
## 1980 1.0000000 0.9601540 0.8888411 NA
## 1990 0.9601540 1.0000000 0.9613842 NA
## 2000 0.8888411 0.9613842 1.0000000 NA
## 2010 NA NA NA 1
cor(sub, use = "complete.obs")
## 1980 1990 2000 2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
psych::corr.test(sub, use = "complete.obs")
## Call:psych::corr.test(x = sub, use = "complete.obs")
## Correlation matrix
## 1980 1990 2000 2010
## 1980 1.00 0.96 0.89 0.85
## 1990 0.96 1.00 0.96 0.92
## 2000 0.89 0.96 1.00 0.99
## 2010 0.85 0.92 0.99 1.00
## Sample Size
## 1980 1990 2000 2010
## 1980 197 197 197 193
## 1990 197 197 197 193
## 2000 197 197 197 193
## 2010 193 193 193 193
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## 1980 1990 2000 2010
## 1980 0 0 0 0
## 1990 0 0 0 0
## 2000 0 0 0 0
## 2010 0 0 0 0
## To see confidence intervals of the correlations, print with the short=FALSE option
psych::corr.test(sub)
## Call:psych::corr.test(x = sub)
## Correlation matrix
## 1980 1990 2000 2010
## 1980 1.00 0.96 0.89 0.85
## 1990 0.96 1.00 0.96 0.92
## 2000 0.89 0.96 1.00 0.99
## 2010 0.85 0.92 0.99 1.00
## Sample Size
## 1980 1990 2000 2010
## 1980 197 197 197 193
## 1990 197 197 197 193
## 2000 197 197 197 193
## 2010 193 193 193 193
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## 1980 1990 2000 2010
## 1980 0 0 0 0
## 1990 0 0 0 0
## 2000 0 0 0 0
## 2010 0 0 0 0
## To see confidence intervals of the correlations, print with the short=FALSE option
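As the message suggests, printing with short = FALSE displays the confidence intervals (a sketch):
print(psych::corr.test(sub, use = "complete.obs"), short = FALSE)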
cor(mort[,c("1980", "1990", "2000", "2010")])
## 1980 1990 2000 2010
## 1980 1.0000000 0.9601540 0.8888411 NA
## 1990 0.9601540 1.0000000 0.9613842 NA
## 2000 0.8888411 0.9613842 1.0000000 NA
## 2010 NA NA NA 1
What's going on?! It seems we have NAs in the 2010 column.
summary(mort[,c("1980", "1990", "2000", "2010")])
## 1980 1990 2000 2010
## Min. :0.07049 Min. :0.04922 Min. :0.03232 Min. :0.03173
## 1st Qu.:0.18649 1st Qu.:0.14942 1st Qu.:0.09336 1st Qu.:0.08170
## Median :0.56665 Median :0.34315 Median :0.21037 Median :0.14339
## Mean :0.94731 Mean :0.70911 Mean :0.55004 Mean :0.39601
## 3rd Qu.:1.47828 3rd Qu.:0.99579 3rd Qu.:0.71003 3rd Qu.:0.49919
## Max. :3.49035 Max. :3.28983 Max. :2.81186 Max. :2.02857
## NA's :4
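A quick way to confirm where the missing values are (a sketch using base R; the counts follow from the summary above):
colSums(is.na(mort[, c("1980", "1990", "2000", "2010")]))
## 1980 1990 2000 2010
##    0    0    0    4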
Both approaches are equivalent:
cor(mort[,c("1980", "1990", "2000", "2010")], use = "complete.obs")
## 1980 1990 2000 2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
cor(mort %>% select("1980", "1990", "2000", "2010"), use = "complete.obs")
## 1980 1990 2000 2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
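An alternative is pairwise deletion, which uses all 197 countries for pairs among 1980/1990/2000 and the 193 complete pairs for anything involving 2010 (a sketch; the values combine the two matrices shown above):
cor(sub, use = "pairwise.complete.obs")
##           1980      1990      2000      2010
## 1980 1.0000000 0.9601540 0.8888411 0.8468284
## 1990 0.9601540 1.0000000 0.9613842 0.9247192
## 2000 0.8888411 0.9613842 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000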
Compute the correlation between the Myanmar, China, and United States mortality data. Store this correlation matrix in an object called country_cor
Extract the Myanmar-US correlation from the correlation matrix.
mort_subs = mort %>%
  filter(country %in% c("China", "Myanmar", "United States")) %>%
  arrange(country)
long = mort_subs %>%
  gather(key = "year", value = "death", -country) %>%
  filter(!is.na(death))
long = long %>%
spread(key = country, value = death)
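gather() and spread() are superseded in current tidyr; an equivalent chain with the pivot functions would be (a sketch producing the same long object):
long = mort_subs %>%
  pivot_longer(-country, names_to = "year", values_to = "death") %>%
  filter(!is.na(death)) %>%
  pivot_wider(names_from = country, values_from = death)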
sub = long %>%
  select(Myanmar, China, `United States`)
cor(sub, use = "complete.obs")
## Myanmar China United States
## Myanmar 1.0000000 0.9743646 0.6292625
## China 0.9743646 1.0000000 0.6909263
## United States 0.6292625 0.6909263 1.0000000
mort_subs = mort_subs %>%
select(-country)
mort_subs = t(mort_subs)
country_cor = cor(mort_subs,
                  use = "complete.obs")
Run the following to see that the order is China, Myanmar, US:
mort %>% filter(country %in% c("Myanmar", "China", "United States"))
## # A tibble: 3 × 255
## country `1760` `1761` `1762` `1763` `1764` `1765` `1766` `1767` `1768` `1769`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 China NA NA NA NA NA NA NA NA NA NA
## 2 Myanmar NA NA NA NA NA NA NA NA NA NA
## 3 United … NA NA NA NA NA NA NA NA NA NA
## # … with 244 more variables: `1770` <dbl>, `1771` <dbl>, `1772` <dbl>, …
Extract the Myanmar-US correlation
country_cor[2, 3]
## [1] 0.6292625
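Since the gather/spread version of sub has named columns, extracting by name avoids positional mistakes (a sketch; the value matches the matrix shown earlier):
cor(sub, use = "complete.obs")["Myanmar", "United States"]
## [1] 0.6292625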
Is there a difference between mortality information from 1990 and 2000? Run a t-test and a Wilcoxon rank sum test to assess this. Hint: to extract the column of information for 1990, use mort$`1990`
t.test(mort$"1990", mort$"2000")
## Welch Two Sample t-test
## data: mort$"1990" and mort$"2000"
## t = 2.146, df = 386.23, p-value = 0.03249
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01333621 0.30479998
## sample estimates:
## mean of x mean of y
## 0.7091080 0.5500399
t.test(mort$`1990`, mort$`2000`)
## Welch Two Sample t-test
## data: mort$`1990` and mort$`2000`
## t = 2.146, df = 386.23, p-value = 0.03249
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01333621 0.30479998
## sample estimates:
## mean of x mean of y
## 0.7091080 0.5500399
t.test(mort$`1990`, mort$`2000`, paired = TRUE)
## Paired t-test
## data: mort$`1990` and mort$`2000`
## t = 10.025, df = 196, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1277769 0.1903593
## sample estimates:
## mean of the differences
## 0.1590681
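Note the paired t-test is equivalent to a one-sample t-test on the differences, reproducing t = 10.025 and df = 196 (a sketch):
t.test(mort$`1990` - mort$`2000`)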
wilcox.test(mort$"1990", mort$"2000")
## Wilcoxon rank sum test with continuity correction
## data: mort$"1990" and mort$"2000"
## W = 23006, p-value = 0.001445
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(mort$"1990", mort$"2000", paired = TRUE)
## Wilcoxon signed rank test with continuity correction
## data: mort$"1990" and mort$"2000"
## V = 17997, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Read in the Kaggle cars auction dataset from data/kaggleCarAuction.csv with the argument col_types = cols(VehBCost = col_double()):
cars = read_csv("data/kaggleCarAuction.csv",
col_types = cols(VehBCost = col_double()))
Fit a linear regression model with vehicle cost (VehBCost) as the outcome and vehicle age (VehicleAge) and whether it’s an online sale (IsOnlineSale) as predictors as well as their interaction. Save the model fit in an object called lmfit_cars and display the summary table.
lmfit_cars = lm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge:IsOnlineSale, data = cars)
Both are equivalent:
lmfit_cars = lm(VehBCost ~ VehicleAge*IsOnlineSale, data = cars)
tidy(lmfit_cars)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
##   <chr>                   <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept) 8063. 16.6 486. 0
## 2 VehicleAge -321. 3.67 -87.4 0
## 3 IsOnlineSale 514. 108. 4.77 0.00000180
## 4 VehicleAge:IsOnlineSale -55.4 25.6 -2.17 0.0303
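To see what the interaction implies, predicted costs for a hypothetical 3-year-old vehicle, offline versus online (a sketch; assumes IsOnlineSale is coded 0/1, as the coefficient name suggests):
predict(lmfit_cars, newdata = data.frame(VehicleAge = 3, IsOnlineSale = c(0, 1)))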
Create a variable called expensive in the cars data that indicates if the vehicle cost is over $10,000. Use a chi-squared test to assess if there is a relationship between a car being expensive and it being labeled as a "bad buy" (IsBadBuy).
cars = cars %>%
mutate(expensive = VehBCost > 10000)
tab = table(cars$expensive, cars$IsBadBuy)
chisq.test(tab)
## Pearson's Chi-squared test with Yates' continuity correction
## data: tab
## X-squared = 1.8152, df = 1, p-value = 0.1779
fisher.test(tab)
## Fisher's Exact Test for Count Data
## data: tab
## p-value = 0.179
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8156428 1.0362735
## sample estimates:
## odds ratio
## 0.9205426
prop.test(tab)
## 2-sample test for equality of proportions with continuity correction
## data: tab
## X-squared = 1.8152, df = 1, p-value = 0.1779
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.020853549 0.003506105
## sample estimates:
## prop 1 prop 2
## 0.8766766 0.8853503
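A quick sanity check on the counts feeding these tests (a sketch):
addmargins(tab)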
Fit a logistic regression model where the outcome is "bad buy" status (IsBadBuy) and the predictors are the expensive status and vehicle age (VehicleAge). Save the model fit in an object called logfit_cars and display the summary table. Use summary, or tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE) for odds ratios, or tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE) for log odds ratios.
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial)
All three calls are equivalent:
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial())
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = "binomial")
summary(logfit_cars)
## Call:
## glm(formula = IsBadBuy ~ expensive + VehicleAge, family = "binomial",
##     data = cars)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9091 -0.5489 -0.4800 -0.3648 2.4859
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.249490 0.033244 -97.745 <2e-16 ***
## expensiveTRUE -0.080387 0.060696 -1.324 0.185
## VehicleAge 0.286605 0.006476 44.257 <2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 54421 on 72982 degrees of freedom
## Residual deviance: 52442 on 72980 degrees of freedom
## AIC: 52448
## Number of Fisher Scoring iterations: 5
tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept) 0.0388 0.0332 -97.7 0 0.0363 0.0414
## 2 expensiveTRUE 0.923 0.0607 -1.32 0.185 0.818 1.04
## 3 VehicleAge 1.33 0.00648 44.3 0 1.32 1.35
tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept) -3.25 0.0332 -97.7 0 -3.31 -3.18
## 2 expensiveTRUE -0.0804 0.0607 -1.32 0.185 -0.201 0.0370
## 3 VehicleAge 0.287 0.00648 44.3 0 0.274 0.299
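The exponentiate = TRUE output can be reproduced by hand, since odds ratios are exponentiated log odds ratios (a sketch):
exp(coef(logfit_cars))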
Randomly sample 10,000 observations (rows) with replacement from the cars dataset and store this new dataset in an object called cars_subsample. Hint: sample_n function
set.seed(1)
cars_subsample = cars %>%
  sample_n(size = 10000, replace = TRUE)
Or another way:
rows = sample(nrow(cars), size = 10000, replace = TRUE)
cars_subsample = cars[rows,]
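Note that sample_n() is superseded in current dplyr; slice_sample() is the modern equivalent (a sketch):
cars_subsample = cars %>%
  slice_sample(n = 10000, replace = TRUE)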
Fit the same logistic regression model as in problem 6 above with this sample and display the summary table using the tidy function. How do the results compare? Call this model logfit_cars_sub:
logfit_cars_sub = glm(IsBadBuy ~ expensive + VehicleAge, data = cars_subsample, family = binomial)
summary(logfit_cars_sub)
## Call:
## glm(formula = IsBadBuy ~ expensive + VehicleAge, family = binomial,
##     data = cars)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9091 -0.5489 -0.4800 -0.3648 2.4859
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.249490 0.033244 -97.745 <2e-16 ***
## expensiveTRUE -0.080387 0.060696 -1.324 0.185
## VehicleAge 0.286605 0.006476 44.257 <2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 54421 on 72982 degrees of freedom
## Residual deviance: 52442 on 72980 degrees of freedom
## AIC: 52448
## Number of Fisher Scoring iterations: 5
tidy(logfit_cars_sub, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept) 0.0388 0.0332 -97.7 0 0.0363 0.0414
## 2 expensiveTRUE 0.923 0.0607 -1.32 0.185 0.818 1.04
## 3 VehicleAge 1.33 0.00648 44.3 0 1.32 1.35
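One way to compare the two fits side by side; with a random subsample, the estimates should be similar but with larger standard errors (a sketch):
cbind(full = coef(logfit_cars), subsample = coef(logfit_cars_sub))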