STAT S4206/S5206 Midterm
Gabriel
5/21/2021
Part I: Instructions
The STAT S4206/S5206 midterm is open notes, open book(s), and open computer; online resources are allowed.
Students are not allowed to communicate with any other people regarding the exam with the exception of
the instructor (Gabriel Young) and the TA (Andrew Davison). This includes emailing fellow students, using
WeChat and other similar forms of communication. If there is any suspicion of one or more students cheating,
further investigation will take place. If students do not follow the guidelines, they will receive a zero on the
exam and potentially face more severe consequences. The exam will be posted on Friday, 05/21/2021 at
11:00AM (ET). Students are required to submit both the .pdf (or .html) and .rmd files on Canvas by Sunday,
05/23/2021 at 12:00PM (ET). Students will be given the whole 49-hour time frame to complete and upload
their knitted file.
A few more recommendations follow:
• Don’t forget to submit both the correct .rmd file and at least one of your .html or .pdf files.
• Save your .rmd regularly to avoid any problems if your computer crashes.
• Please ensure your output is tidy. Do not print pages and pages of data. Doing so will result in points
deducted.
• Please stop working on the exam at least 30 minutes before its deadline to make sure your RMarkdown file
knits properly.
• If you have a question, please include both the instructor and TA in the email thread. Don’t forget
that both Gabriel and Andrew are in the US so expect a delay in their response times if you are in a
different timezone.
Part I: Warm-up
Run the following code:
warm_up <- read.csv("Warm_up.csv")
head(warm_up,4)
## V1 V2 V3 V4 V5 V6
## 1 NA 0.46620043 -1.8923489 NA 0.6408445 -1.7308123
## 2 0.1848492 0.95466641 1.2928042 -0.9007918 -1.6013778 -0.4983996
## 3 1.5878453 -0.94720635 -0.6182543 0.1499151 -0.7778154 -0.3540769
## 4 -1.1303757 0.03856309 1.0409383 1.1264128 -1.6473925 NA
## V7 V8 V9 V10
## 1 NA 1.023567236 -0.02165721 0.7705831
## 2 NA -0.003508363 0.98448042 -0.7702532
## 3 0.04548124 -1.274958744 -0.68091825 -1.3121963
## 4 -0.46466759 1.366271568 -0.26052233 -0.4345667
dim(warm_up)
## [1] 1000 10
Question 1: (10 pts)
Perform the same task as the following chunk without using a loop and in one line of code.
my_out <- NULL
for (j in 1:10) {
  keep_vec <- NULL
  for (i in 1:1000) {
    if (!is.na(warm_up[i, j])) {
      keep_vec <- c(keep_vec, i)
    }
  }
  keep_column <- warm_up[keep_vec, j]
  my_out[j] <- round(mean(keep_column), 4)
}
my_out
## [1] 0.0647 0.0191 0.0693 0.0342 0.0367 0.0012 -0.0783 0.0333 -0.0723
## [10] 0.0409
### Solution goes here --------------------
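As a hedged illustration of the loop-free idea, the sketch below computes rounded column means while skipping missing values on a small made-up data frame. The object `toy` and its columns are hypothetical and are not taken from Warm_up.csv. Note that `colMeans()` returns a named vector, whereas the loop above builds an unnamed one.

```r
# Hypothetical toy data frame (NOT the exam's Warm_up.csv)
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, 2, 4)
)

# Column means that ignore NA values, rounded to 4 decimals, in one line
toy_means <- round(colMeans(toy, na.rm = TRUE), 4)
toy_means
```

Wrapping the call in `unname()` would drop the column names if an unnamed vector is preferred.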
Part II: Data Cleaning and Graphics
Run the following code:
data <- read.csv("PlayerProfiles.csv")
names(data) <- c("Name","Country","Age","DOB","Nickname","PDC_Ranking","Tour_Card","Career_Earnings")
head(data,4)
## Name Country Age DOB Nickname PDC_Ranking
## 1 Michael van Gerwen Netherlands 32 4/25/1989 Mighty Mike 1
## 2 Peter Wright Scotland 51 3/10/1970 Snake Bite 2
## 3 Gerwyn Price Wales 36 3/7/1985 The Iceman 3
## 4 Adrian Lewis England 36 1/21/1985 Jackpot 13
## Tour_Card Career_Earnings
## 1 Yes £8,321,167
## 2 Yes £3,469,888
## 3 Yes £1,497,803
## 4 Yes £3,137,634
dim(data)
## [1] 96 8
Question 2: (5 pts)
There appear to be some repeated cases in the dataframe (data). Identify the number of repeated cases in
this dataset. There are many ways to solve this problem, e.g., you can look at repeated names or nicknames
and then figure out how to extract this information from the dataframe. I personally used a loop for this
problem.
### Solution goes here --------------------
Question 3: (5 pts)
Create a new dataframe called new_data that excludes all repeats. For partial credit, you can manually
identify the repeats and remove these cases with a basic subsetting command. Display the head and dimension
of new_data. I personally used a loop for this problem.
### Solution goes here --------------------
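One possible alternative to a loop, sketched here on made-up data, is base R's `duplicated()`, which flags every row that repeats an earlier row. The data frame `toy_players` and its values are hypothetical, and the sketch assumes that a "repeat" means a row identical to an earlier one in all columns.

```r
# Hypothetical toy data (NOT PlayerProfiles.csv)
toy_players <- data.frame(
  Name     = c("Ann", "Bob", "Ann", "Cat", "Bob"),
  Nickname = c("Ace", "Bolt", "Ace", "Claw", "Bolt"),
  stringsAsFactors = FALSE
)

# TRUE for each row that is an exact copy of an earlier row
is_repeat <- duplicated(toy_players)

# Number of repeated cases
n_repeats <- sum(is_repeat)

# Keep only the first occurrence of each case
toy_unique <- toy_players[!is_repeat, ]
dim(toy_unique)
```

If repeats should be judged on a single column only, `duplicated(toy_players$Name)` would be the analogous call.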
Question 4: (10 pts)
Use base R character string functions to convert the Career_Earnings variable to numeric mode. For
example, the string £535,131 should be converted to 535131. Your converted variable should be appended
to your dataframe and named Career_Earnings_NUM. You can use the original data or new_data to
solve this problem. Display the head of the numeric vector Career_Earnings_NUM.
### Solution goes here --------------------
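A minimal sketch of this kind of conversion, assuming the strings contain only a currency symbol, digits, and comma separators: `gsub()` strips everything that is not a digit or a decimal point, and `as.numeric()` converts the result. The vector `earnings_chr` is a hypothetical stand-in for the real column.

```r
# Hypothetical character vector in the same format as Career_Earnings
# ("\u00a3" is the pound sign)
earnings_chr <- c("\u00a3535,131", "\u00a38,321,167")

# Remove every character that is not a digit or a decimal point,
# then convert the cleaned strings to numeric
earnings_num <- as.numeric(gsub("[^0-9.]", "", earnings_chr))
earnings_num
```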
Question 5: (10 pts)
Using relevant R functions, identify the 5 most frequent countries measured in this dataset. Display the
country and its frequency. Note that England should have the highest frequency. You can use the original
data or new_data to solve this problem. Also create a barplot displaying the 5 most frequent countries.
### Solution goes here --------------------
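The following sketch shows one way to tabulate and rank categories on made-up data: `table()` counts each level, `sort()` orders the counts, and `head()` keeps the top few. The vector `countries` is hypothetical, and the top 2 (rather than 5) are kept only because the toy data is small.

```r
# Hypothetical country labels (NOT the exam data)
countries <- c("England", "England", "Wales", "England",
               "Scotland", "Wales", "Netherlands")

# Count each country, sort from most to least frequent, keep the top 2
top_counts <- head(sort(table(countries), decreasing = TRUE), 2)
top_counts

# Bar plot of the retained counts
barplot(top_counts, xlab = "Country", ylab = "Frequency",
        main = "Most frequent countries (toy data)")
```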
Question 6: (5 pts)
Create a new variable in the dataframe named England, which reads “England” if the case belongs to
England and “NotEngland” otherwise. You can use the original data or new_data to solve this problem.
Display the head of the England variable.
### Solution goes here --------------------
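A two-level indicator of this kind can be built with the vectorized `ifelse()`. The sketch below uses a hypothetical vector `country` rather than the exam data.

```r
# Hypothetical country labels (NOT the exam data)
country <- c("England", "Wales", "England", "Scotland")

# "England" where the label matches, "NotEngland" everywhere else
england_flag <- ifelse(country == "England", "England", "NotEngland")
england_flag
```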
Question 7: (5 pts)
Using Base R, construct a scatter plot of log(Career_Earnings_NUM) versus PDC_Ranking, split
by England. Make sure to include a legend and label your axes appropriately.
If you could not solve question 4, construct a base R scatter plot of PDC_Ranking versus Age, split by
England. Make sure to include a legend and label your axes appropriately.
### Solution goes here --------------------
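For reference, a grouped base R scatter plot with a legend can be sketched as below. All values are made up, and the names `toy_rank`, `toy_earn`, and `toy_group` are hypothetical; only the plotting pattern (color by group, then `legend()`) is the point.

```r
# Hypothetical toy values (NOT the exam data)
toy_rank  <- c(1, 2, 3, 4, 5, 6)
toy_earn  <- c(9e6, 3e6, 2e6, 1e6, 5e5, 2e5)
toy_group <- c("England", "NotEngland", "England",
               "NotEngland", "England", "NotEngland")

# One color per point, chosen by group membership
point_cols <- ifelse(toy_group == "England", "blue", "red")

plot(toy_rank, log(toy_earn), col = point_cols, pch = 19,
     xlab = "Ranking", ylab = "log(Earnings)",
     main = "Toy scatter plot split by group")
legend("topright", legend = c("England", "NotEngland"),
       col = c("blue", "red"), pch = 19)
```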
Question 8: (5 pts)
Using ggplot, construct a scatter plot of log(Career_Earnings_NUM) versus PDC_Ranking, split by
England. Make sure to include a legend and label your axes appropriately.
If you could not solve question 4, construct a ggplot scatter plot of PDC_Ranking versus Age, split by
England. Make sure to include a legend and label your axes appropriately.
### Solution goes here --------------------
Part II: Nonparametric Procedures
Question 9: (10 pts)
Compute the two conditional probabilities (or proportions):
I) Given the player represents England, what is the probability that their PDC_Ranking is in the
upper 25%?
p1 = P(High Rank | England)
II) Given the player does not represent England, what is the probability that their PDC_Ranking
is in the upper 25%?
p2 = P(High Rank | Not England)
### Solution goes here --------------------
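A conditional proportion is simply the mean of a logical vector within a subgroup. The sketch below uses made-up data and assumes, as one possible reading, that "upper 25%" means a ranking at or below the 25th percentile of the ranking values (rank 1 being the best). The names `toy_rank`, `toy_group`, and `high` are hypothetical.

```r
# Hypothetical toy data (NOT the exam data)
toy_rank  <- 1:8
toy_group <- c("England", "England", "NotEngland", "England",
               "NotEngland", "England", "NotEngland", "NotEngland")

# Assumption: "high rank" = ranking at or below the 25th percentile
cutoff <- quantile(toy_rank, 0.25)
high   <- toy_rank <= cutoff

# Conditional proportions: mean of the indicator within each group
p1 <- mean(high[toy_group == "England"])
p2 <- mean(high[toy_group == "NotEngland"])
c(p1 = p1, p2 = p2)
```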
Question 10: (20 pts)
Consider testing the null/alternative pair:
H0 : p1 − p2 = 0 versus HA : p1 − p2 ≠ 0
Run a bootstrap procedure to test the above hypothesis. To accomplish this task, construct a 95% bootstrap
interval on the parameter θ = p1 − p2 and check if 0 falls in the interval.
Note for comparison, you can look up the two-sample proportions z-test, which is the analogous parametric
procedure. The final results should be similar. Also note that the bootstrap hypothesis test introduced above
is two-tailed, which only shows whether the proportions are statistically different. A one-tailed test might be
more appropriate, but this approach is not required for the midterm.
### Solution goes here --------------------
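The general shape of a percentile bootstrap interval for θ = p1 − p2 can be sketched as follows. This is an illustration under stated assumptions, not the required solution: the data frame `toy`, the helper `diff_prop`, the choice of B = 2000 resamples, and the use of the percentile interval (rather than another bootstrap interval) are all assumptions.

```r
set.seed(1)

# Hypothetical toy data: 20 cases per group with a logical "high rank" flag
toy <- data.frame(
  group = rep(c("England", "NotEngland"), each = 20),
  high  = c(rep(c(TRUE, FALSE), times = c(8, 12)),
            rep(c(TRUE, FALSE), times = c(3, 17)))
)

# Statistic of interest: difference in conditional proportions
diff_prop <- function(d) {
  mean(d$high[d$group == "England"]) - mean(d$high[d$group == "NotEngland"])
}

theta_hat <- diff_prop(toy)

# Resample rows with replacement and recompute the statistic B times
B <- 2000
theta_boot <- replicate(B, diff_prop(toy[sample(nrow(toy), replace = TRUE), ]))

# 95% percentile bootstrap interval and the check for 0
ci <- quantile(theta_boot, c(0.025, 0.975), na.rm = TRUE)
zero_in_ci <- unname(ci[1] <= 0 && 0 <= ci[2])
ci
zero_in_ci
```

If 0 lies outside the interval, the two-sided null hypothesis would be rejected at the 5% level under this procedure.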
Part III: Writing an R function
Question 11: (10 pts)
Write a function named my_description that creates a written description of a particular dart player. The
function should have two inputs, (1) the case as an integer and (2) the data frame (new_data or data). For
example, player 3 is Gerwyn Price, i.e., my_description(case=3,df=new_data). Your function’s
output should read similar to:
“Gerwyn Price, aka The Iceman, is a 36 year old dart player from Wales. The Iceman is currently ranked
number 3 and his career earnings total 1497803 pounds.”
Try to make this exercise fun. You can write the above expression however you want but your function must
generalize to multiple cases. You can also include extra information if desired. To check your function, make
sure to display your function’s output for Gerwyn Price.
### Solution goes here --------------------
Question 12: (5 pts)
Use a vectorized operation or an apply function to compute all dart players’ descriptions. Show the head of
your resulting vector.
### Solution goes here --------------------
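The pattern of writing a row-wise description function and then applying it to every case can be sketched as below. The function name `describe_player` and the two-row data frame `toy` are hypothetical; the sentence is deliberately shorter than the one requested above, so this is an illustration of `paste0()` plus `sapply()` rather than a complete answer.

```r
# Hypothetical helper: build a one-sentence description for one row of df
describe_player <- function(case, df) {
  p <- df[case, ]
  paste0(p$Name, ", aka ", p$Nickname, ", is a ", p$Age,
         " year old dart player from ", p$Country, ".")
}

# Small toy data frame with the same column names as the exam data
toy <- data.frame(
  Name     = c("Gerwyn Price", "Peter Wright"),
  Nickname = c("The Iceman", "Snake Bite"),
  Age      = c(36, 51),
  Country  = c("Wales", "Scotland"),
  stringsAsFactors = FALSE
)

# Apply the function to every row index
all_desc <- sapply(seq_len(nrow(toy)), describe_player, df = toy)
head(all_desc)
```

Because `paste0()` is itself vectorized, calling `describe_player(seq_len(nrow(toy)), toy)` directly would give the same vector without `sapply()`.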