FS21 STT180 Exam 1 Practice (codes):
The data frame animals is used in the next three questions.
animals <- data.frame(type = c("cat", "dog", "dog", "cat", "rabbit"),
                      ageInYears = c(15.4, 9.1, 1.5, 4.2, 5.5),
                      weightInPounds = c(11.1, 56.8, 23.2, 8.9, 2.1),
                      everInShelter = c("Yes", "No", "No", "No", "Yes"))
1. Consider the data frame animals. How would you have R return the type of animal that weighs
the most?
a. animals %>% max(type, weightInPounds)
b. animals[max(weightInPounds)]
c. with(animals, type[weightInPounds == max(weightInPounds)])
d. apply(animals, 1, max)
e. tapply(animals$type, animals$weightInPounds, max)
2. Consider the data frame animals. How would you compute the mean age and mean weight of
the animals in the data frame?
a. apply(animals(weightInPounds, ageInYears), 2, mean)
b. apply(animals[,c(2,3)], 2, mean)
c. tapply(animals, weightInPounds, ageInYears, mean)
d. with(animals, weightInPounds, ageInYears, mean)
e. animals %>% summarize(mean(weightInPounds, ageInYears))
3. Consider the data frame animals. How would you compute the mean age and mean weight of
the animals in the data frame using dplyr functions?
a. animals %>% filter(mean(ageInYears), mean(weightInPounds))
b. animals %>%filter(ageInYears, weightInPounds) %>% summarize(meanW =
mean(weightInPounds), meanA = mean(ageInYears)) %>% group_by(type)
c. animals %>% arrange(mean(weightInPounds), mean(ageInYears))
d. animals %>% summarize(meanW = mean(weightInPounds), meanA =
mean(ageInYears))
4. Consider the data frame animals. Write the code to add a column called WeightInKg that gives
weights in kilograms rather than pounds. (There are approximately 2.2 pounds in a kilogram.)
______________________________________________________________________
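The base R idioms behind the questions above (logical indexing inside with(), apply() over rows or columns, and adding a derived column with $) can be tried out on a small made-up data frame. The following is only a minimal sketch: the data frame plants and all of its columns and values are hypothetical and are not part of the exam data.

```r
# Hypothetical data, invented for illustration only
plants <- data.frame(species = c("fern", "ivy", "fern", "cactus"),
                     heightInInches = c(12, 30, 8, 20),
                     ageInWeeks = c(10, 52, 6, 80))

# with() evaluates the expression using the columns of plants;
# the logical vector picks out the row(s) with the largest height
tallest <- with(plants, species[heightInInches == max(heightInInches)])

# apply() with MARGIN = 2 works column by column (one mean per column);
# MARGIN = 1 instead gives one mean per row
meansByColumn <- apply(plants[, c(2, 3)], 2, mean)
meansByRow <- apply(plants[, c(2, 3)], 1, mean)

# Assigning to a new name with $ adds a column (1 inch = 2.54 cm)
plants$heightInCm <- plants$heightInInches * 2.54

print(tallest)
print(meansByColumn)
print(meansByRow)
print(plants)
```

Comparing meansByColumn with meansByRow shows why the MARGIN argument matters: the same call shape returns two values in one case and one value per row in the other.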
For the next eight questions you will work with a data set containing batting statistics for major
league baseball players. It was wrangled from data in the Lahman package (tables Batting and
People) and has a limited number of variables. The data frame is called BattingStats (it is not one
of the datasets built into R). The .csv file is uploaded along with this practice exam for your
reference. Here is the structure of that data frame.
str(BattingStats)
'data.frame': 89322 obs. of 8 variables:
 $ PlayerName    : chr "David Aardsma" "David Aardsma" "Don Aase" "Andy Abad" …
 $ Team          : chr "CHN" "BOS" "NYN" "OAK" …
 $ BattingAverage: num 0 0 0 0 0 0 0 0 0 0 …
 $ HomeRuns      : int 0 0 0 0 0 0 0 0 0 0 …
 $ RBIs          : int 0 0 0 0 0 0 0 0 0 0 …
 $ Year          : int 2006 2008 1989 2001 2006 2010 1910 1950 1952 2003 …
 $ League        : chr "NL" "AL" "NL" "AL" …
 $ BA            : Factor w/ 2 levels "High","Low": 2 2 2 2 2 2 2 2 2 2 …
5. Which of the following would produce a data frame containing only players whose
BattingAverage is between 0.250 and 0.310?
A. BattingStats %>% filter(BattingAverage > 0.25 | BattingAverage <0.31)
B. BattingStats %>% mutate(BattingAverage > 0.25 | BattingAverage <0.31)
C. BattingStats %>% filter(BattingAverage > 0.25 & BattingAverage <0.31)
D. BattingStats %>% mutate(BattingAverage > 0.25 & BattingAverage <0.31)
E. BattingStats %>% summarize(BattingAverage > 0.25 | BattingAverage <0.31)
F. BattingStats %>% summarize(BattingAverage > 0.25 & BattingAverage <0.31)
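The difference between the | (OR) and & (AND) operators in a range condition can be seen on a plain numeric vector. This is a minimal sketch with hypothetical values; the vector x is invented for illustration and has nothing to do with the BattingStats data.

```r
# Hypothetical values, invented for illustration only
x <- c(0.10, 0.26, 0.30, 0.35)

# | (OR): every value is either above 0.25 or below 0.31, so nothing is excluded
eitherCondition <- x > 0.25 | x < 0.31

# & (AND): both conditions must hold, so only values inside the interval remain
bothConditions <- x > 0.25 & x < 0.31

print(x[eitherCondition])
print(x[bothConditions])
```

The same logical vectors are what a row-filtering step receives, so checking them on a short vector first is a quick way to confirm a condition does what is intended.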
6. Which of the following would produce a data frame containing the mean and median batting
average, and number of players computed separately for each team, like the one that is
partially displayed here?
# A tibble: 47 x 4
Team meanBA medBA numPlayers
1 ANA 0.221 0.25 337
2 ARI 0.200 0.231 727
3 ATL 0.200 0.222 1951
4 BAL 0.211 0.234 2380
5 BLA 0.247 0.266 62
6 BOS 0.217 0.239 4130
7 BRO 0.213 0.233 1921
8 BSN 0.210 0.227 1764
9 CAL 0.217 0.238 1315
10 CHA 0.212 0.233 4111
# … with 37 more rows
a. BattingStats %>% filter(Team) %>% summarize(meanBA =
mean(BattingAverage,na.rm = TRUE), medBA = median(BattingAverage,
na.rm = TRUE), numPlayers = n())
b. BattingStats %>% group_by(Team) %>% summarize(meanBA =
mean(BattingAverage,na.rm = TRUE), medBA = median(BattingAverage,
na.rm = TRUE), numPlayers = n())
c. BattingStats %>% mutate(Team) %>% summarize(meanBA =
mean(BattingAverage,na.rm = TRUE), medBA = median(BattingAverage,
na.rm = TRUE), numPlayers = n())
d. BattingStats %>% summarize(meanBA = mean(BattingAverage, na.rm =
TRUE),medBA = median(BattingAverage, na.rm = TRUE), numPlayers = n())
%>% group_by(Team)
e. BattingStats %>% summarize(meanBA = mean(BattingAverage, na.rm =
TRUE), medBA = median(BattingAverage, na.rm=TRUE))
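The effect of placing group_by() before summarize() can be checked on a tiny data frame. The sketch below assumes the dplyr package is installed; the data frame scores and its columns group and value are hypothetical names made up for this illustration.

```r
library(dplyr)  # assumes the dplyr package is installed

# Hypothetical data, invented for illustration only
scores <- data.frame(group = c("A", "A", "B", "B", "B"),
                     value = c(1, 3, 2, NA, 6))

# Without group_by(): one summary row for the whole data frame
overall <- scores %>%
  summarize(meanValue = mean(value, na.rm = TRUE), n = n())

# With group_by() before summarize(): one summary row per group
byGroup <- scores %>%
  group_by(group) %>%
  summarize(meanValue = mean(value, na.rm = TRUE),
            medValue = median(value, na.rm = TRUE),
            n = n())

print(overall)
print(byGroup)
```

Note that n() counts rows, so a row whose value is NA still contributes to the count even though na.rm = TRUE removes it from the mean and median.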
7. Which of the following returns the number of missing observations in the BattingAverage
column of the data frame BattingStats?
a. with(BattingStats, is.na(BattingAverage))
b. with(BattingStats, is.na(sum((BattingAverage))))
c. with(BattingStats, BattingAverage[is.na(BattingAverage)])
d. with(BattingStats, sum(is.na(BattingAverage)))
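How is.na() and sum() interact depends on the order in which they are nested. A minimal sketch on a hypothetical vector v (invented values, not exam data) makes the intermediate results visible.

```r
# Hypothetical values, invented for illustration only
v <- c(0.2, NA, 0.3, NA, 0.1)

# is.na() returns one logical value per element
flags <- is.na(v)

# sum() of a logical vector counts the TRUEs (TRUE is 1, FALSE is 0)
countOfFlags <- sum(flags)

# Reversing the nesting gives a single logical: sum(v) is NA, so is.na() is TRUE
naOfSum <- is.na(sum(v))

# Indexing with the flags returns the missing entries themselves, not a count
missingEntries <- v[flags]

print(flags)
print(countOfFlags)
print(naOfSum)
print(missingEntries)
```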
8. Which of the following returns the number of different teams represented in the data set?
a. length(unique(BattingStats$Team))
b. length(BattingStats$Team)
c. number(BattingStats$Team)
d. unique(BattingStats$Team)
9. Give one line of R code that returns the vector of data in the column BattingAverage.
__________________________________________________________________________________
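Pulling a single column out of a data frame and counting its distinct values are two small base R steps that are easy to test in isolation. The sketch below uses a hypothetical data frame results with made-up columns team and wins; none of these names come from the exam data.

```r
# Hypothetical data, invented for illustration only
results <- data.frame(team = c("X", "Y", "X", "Z", "Y"),
                      wins = c(3, 5, 2, 7, 1))

# $ extracts one column as a plain vector (results[["wins"]] is equivalent)
winsVector <- results$wins

# unique() keeps one copy of each value; length() then counts them
distinctTeams <- unique(results$team)
nDistinct <- length(distinctTeams)

# length() on the raw column counts rows, duplicates included
nRows <- length(results$team)

print(winsVector)
print(distinctTeams)
print(nDistinct)
print(nRows)
```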
10. Which of the following would compute the mean batting average separately for each year, for
players with more than 20 home runs, producing output like this?
# A tibble: 97 x 2
Year meanBA
1 1911 0.3
2 1915 0.285
3 1919 0.322
4 1920 0.376
5 1921 0.342
# … with 92 more rows
a. BattingStats %>%
group_by(HomeRuns > 20) %>%
filter(Year) %>%
summarize(meanBA = mean(BattingAverage, na.rm = TRUE))
b. BattingStats %>%
filter(HomeRuns > 20) %>%
group_by(Year) %>%
summarize(meanBA = mean(BattingAverage, na.rm = TRUE))
c. BattingStats %>%
group_by(HomeRuns > 20) %>%
mutate(Year) %>%
summarize(meanBA = mean(BattingAverage, na.rm = TRUE))
d. BattingStats %>%
filter(HomeRuns > 20)
%>%group_by(Year)
%>%summarize(meanBA = mean(BattingAverage, na.rm = TRUE))
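Where the pipe operator sits relative to a line break affects whether R can read the code at all. The sketch below checks this with parse(), which only examines syntax and evaluates nothing, so no package needs to be loaded; the names x, f, and g are placeholders that do not have to exist.

```r
# Pipe at the END of each line: the parser knows the expression continues
pipeAtEnd <- "x %>%\n  f() %>%\n  g()"

# Pipe at the START of a line: the first line is already a complete
# expression, so the next line begins with an operator that has no left side
pipeAtStart <- "x\n  %>% f()\n  %>% g()"

# try() captures a syntax error instead of stopping the script
endResult <- try(parse(text = pipeAtEnd), silent = TRUE)
startResult <- try(parse(text = pipeAtStart), silent = TRUE)

print(inherits(endResult, "try-error"))
print(inherits(startResult, "try-error"))
```

This describes how top-level code is read line by line; inside parentheses R ignores line breaks, so the behavior there can differ.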
11. Determine if the dplyr and base R sets of code produce the same result. Mark TRUE if the
results are the same, and FALSE if the dplyr and base R results are different.
dplyr:
BattingStats %>%
  filter(HomeRuns > 60) %>%
  group_by(BA) %>%
  summarize(n())
Base R:
Stats60 <- subset(BattingStats, HomeRuns > 60)
table(Stats60$BA)
TRUE FALSE
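When comparing a count made with table() against a grouped count, it helps to know how table() treats factor levels that no longer occur after subsetting. The following base R sketch uses a hypothetical factor rating and a hypothetical numeric vector score, both invented for illustration.

```r
# Hypothetical data, invented for illustration only
rating <- factor(c("High", "Low", "High", "Low"), levels = c("High", "Low"))
score <- c(70, 10, 65, 20)

# Subsetting a factor keeps all of its levels, even ones with no remaining rows
kept <- rating[score > 60]

# table() reports every level, including those with a count of zero
countsAllLevels <- table(kept)

# droplevels() removes unused levels before counting
countsUsedLevels <- table(droplevels(kept))

print(countsAllLevels)
print(countsUsedLevels)
```

By default, dplyr's group_by() drops groups that contain no rows (controlled by its .drop argument), and summarize() returns a tibble rather than a table object, so both the set of rows and the type of the result are worth checking when deciding whether two outputs match.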
12. Determine if the dplyr and base R sets of code produce the same result. Mark TRUE if the
results are the same, and FALSE if the dplyr and base R results are different.
dplyr:
BattingStats %>%
  filter(!is.na(BattingAverage) &
         !is.na(HomeRuns)) %>%
  summarise(AvBA = mean(BattingAverage, na.rm = TRUE),
            AveHR = mean(HomeRuns, na.rm = TRUE))
Base R:
BattingStats4 <- BattingStats[!is.na(BattingAverage) & !is.na(HomeRuns), ]
apply(BattingStats4[, c(3, 4)], 1, mean)
TRUE FALSE
13. Determine if the dplyr and base R sets of code produce the same result. The dataset is the
Carseats data from the ISLR package. Mark TRUE if the results are the same, and FALSE if the
dplyr and base R results are different.
dplyr:
Carseats %>%
  filter(ShelveLoc == "Medium") %>%
  summarise(TotalCarseats = n(),
            Mean_sale = mean(Sales, na.rm = TRUE),
            StdDev_sale = sd(Sales, na.rm = TRUE))
Base R:
CarseatsM <- subset(Carseats, Carseats$ShelveLoc == "Medium")
data.frame(TotalCarseats = length(CarseatsM$ShelveLoc),
           Mean_sale = mean(CarseatsM$Sales, na.rm = TRUE),
           StdDev_sale = sd(CarseatsM$Sales, na.rm = TRUE))
TRUE FALSE
14. Determine if the dplyr and base R sets of code produce the same result. The dataset is the
Credit data from the ISLR package. Mark TRUE if the results are the same, and FALSE if the
dplyr and base R results are different.
dplyr:
Credit %>%
  group_by(Married) %>%
  summarise(Count = n(),
            Average_rating = mean(Rating, na.rm = TRUE),
            StdDev_rating = sd(Rating, na.rm = TRUE))
Base R:
CreditM <- subset(Credit, Credit$Married == "Yes")
data.frame(Married = c("No", "Yes"),
           Count = c(length(Credit$Married), length(CreditM$Married)),
           Average_rating = c(mean(Credit$Rating, na.rm = TRUE), mean(CreditM$Rating, na.rm = TRUE)),
           StdDev_rating = c(sd(Credit$Rating, na.rm = TRUE), sd(CreditM$Rating, na.rm = TRUE)))
TRUE FALSE
15. Determine if the dplyr and base R sets of code produce the same result. Mark TRUE if the
results are the same, and FALSE if the dplyr and base R results are different.
dplyr Base R
TRUE FALSE
Here is a glimpse of the Credit dataset from the ISLR package.
16. Based on the dataset Credit from the ISLR package, what code needs to be added to the
following code chunk to get the displayed table?
Credit %>%
  summarise(AveNumofCards = mean(Cards, na.rm = TRUE),
            Ave_CreditLimit = mean(Limit, na.rm = TRUE),
            StDev_CreditLimit = sd(Limit, na.rm = TRUE))
17. Based on the dataset Credit from the ISLR package, write one line of code to find how many
cards a person can have.