STAT S4206/S5206 Midterm

Gabriel

5/21/2021

Part I: Instructions

The STAT S4206/S5206 midterm is open notes, open book(s), and open computer; online resources are allowed.
Students are not allowed to communicate with anyone about the exam except
the instructor (Gabriel Young) and the TA (Andrew Davison). This includes emailing fellow students, using
WeChat, and other similar forms of communication. If there is any suspicion that one or more students cheated,
further investigation will take place. Students who do not follow the guidelines will receive a zero on the
exam and potentially face more severe consequences. The exam will be posted on Friday, 05/21/2021 at
11:00AM (ET). Students are required to submit both the .pdf (or .html) and .rmd files on Canvas by Sunday,
05/23/2021 at 12:00PM (ET). Students will be given the whole 49-hour time frame to complete and upload
their knitted file.

A few more recommendations follow:

• Don’t forget to submit both the correct .rmd file and at least one of your .html or .pdf files.
• Save your .rmd regularly to avoid any problems if your computer crashes.
• Please ensure your output is tidy. Do not print pages and pages of data. Doing so will result in points deducted.
• Please stop working on the exam at least 30 min before its deadline to make sure your RMarkdown file knits properly.
• If you have a question, please include both the instructor and TA in the email thread. Don’t forget that both Gabriel and Andrew are in the US, so expect a delay in their response times if you are in a different timezone.

Part I: Warm-up

Run the following code:
warm_up <- read.csv("Warm_up.csv")
head(warm_up,4)
##           V1          V2         V3         V4         V5         V6
## 1         NA  0.46620043 -1.8923489         NA  0.6408445 -1.7308123
## 2  0.1848492  0.95466641  1.2928042 -0.9007918 -1.6013778 -0.4983996
## 3  1.5878453 -0.94720635 -0.6182543  0.1499151 -0.7778154 -0.3540769
## 4 -1.1303757  0.03856309  1.0409383  1.1264128 -1.6473925         NA
##            V7           V8          V9        V10
## 1          NA  1.023567236 -0.02165721  0.7705831
## 2          NA -0.003508363  0.98448042 -0.7702532
## 3  0.04548124 -1.274958744 -0.68091825 -1.3121963
## 4 -0.46466759  1.366271568 -0.26052233 -0.4345667
dim(warm_up)
## [1] 1000 10

Question 1: (10 pts)

Perform the same task as the following chunk without using a loop and in one line of code.

my_out <- NULL
for (j in 1:10) {
  keep_vec <- NULL
  for (i in 1:1000) {
    if (!is.na(warm_up[i,j])) {
      keep_vec <- c(keep_vec,i)
    }
  }
  keep_column <- warm_up[keep_vec,j]
  my_out[j] <- round(mean(keep_column),4)
}
my_out
##  [1]  0.0647  0.0191  0.0693  0.0342  0.0367  0.0012 -0.0783  0.0333 -0.0723
## [10]  0.0409

### Solution goes here --------------------

Part II: Data Cleaning and Graphics

Run the following code:
data <- read.csv("PlayerProfiles.csv")
names(data) <- c("Name","Country","Age","DOB","Nickname","PDC_Ranking","Tour_Card","Career_Earnings")
head(data,4)
##                 Name     Country Age       DOB    Nickname PDC_Ranking
## 1 Michael van Gerwen Netherlands  32 4/25/1989 Mighty Mike           1
## 2       Peter Wright    Scotland  51 3/10/1970  Snake Bite           2
## 3       Gerwyn Price       Wales  36  3/7/1985  The Iceman           3
## 4       Adrian Lewis     England  36 1/21/1985     Jackpot          13
##   Tour_Card Career_Earnings
## 1       Yes      £8,321,167
## 2       Yes      £3,469,888
## 3       Yes      £1,497,803
## 4       Yes      £3,137,634
dim(data)
## [1] 96 8

Question 2: (5 pts)

There appear to be some repeated cases in the dataframe (data). Identify the number of repeated cases in this dataset. There are many ways to solve this problem, e.g., you can look at repeated names or nicknames and then figure out how to extract this information from the dataframe. I personally used a loop for this problem.
### Solution goes here --------------------

Question 3: (5 pts)

Create a new dataframe called new_data that excludes all repeats. For partial credit, you can manually identify the repeats and remove these cases with a basic subsetting command. Display the head and dimension of new_data. I personally used a loop for this problem.

### Solution goes here --------------------

Question 4: (10 pts)

Use base R character string functions to convert the Career_Earnings variable into a numeric mode. For example, the string £535,131 should be converted to 535131. Your converted variable should be appended to your dataframe and named Career_Earnings_NUM. You can use the original data or new_data to solve this problem. Display the head of the numeric vector Career_Earnings_NUM.

### Solution goes here --------------------

Question 5: (10 pts)

Using relevant R functions, identify the 5 most frequent countries in this dataset. Display each country and its frequency. Note that England should have the highest frequency. You can use the original data or new_data to solve this problem. Also create a barplot displaying the 5 most frequent countries.

### Solution goes here --------------------

Question 6: (5 pts)

Create a new variable in the dataframe named England, which reads “England” if the case belongs to England and “NotEngland” otherwise. You can use the original data or new_data to solve this problem. Display the head of the England variable.

### Solution goes here --------------------

Question 7: (5 pts)

Using base R, construct a scatter plot of log(Career_Earnings_NUM) versus PDC_Ranking, split by England. Make sure to include a legend and label your axes appropriately. If you could not solve question 4, construct a base R scatter plot of PDC_Ranking versus Age, split by England. Make sure to include a legend and label your axes appropriately.
### Solution goes here --------------------

Question 8: (5 pts)

Using ggplot, construct a scatter plot of log(Career_Earnings_NUM) versus PDC_Ranking, split by England. Make sure to include a legend and label your axes appropriately. If you could not solve question 4, construct a ggplot scatter plot of PDC_Ranking versus Age, split by England. Make sure to include a legend and label your axes appropriately.

### Solution goes here --------------------

Part II: Nonparametric Procedures

Question 9: (10 pts)

Compute the two conditional probabilities (or proportions):

I) Given the respondent represents England, what is the probability that their PDC_Ranking is in the upper 25%?

$p_1 = P(\text{High Rank} \mid \text{England})$

II) Given the respondent does not represent England, what is the probability that their PDC_Ranking is in the upper 25%?

$p_2 = P(\text{High Rank} \mid \text{Not England})$

### Solution goes here --------------------

Question 10: (20 pts)

Consider testing the null/alternative pair:

$H_0: p_1 - p_2 = 0$ versus $H_A: p_1 - p_2 \neq 0$

Run a bootstrap procedure to test the above hypothesis. To accomplish this task, construct a 95% bootstrap interval on the parameter $\theta = p_1 - p_2$ and check if 0 falls in the interval. For comparison, you can look up the two-sample proportions z-test, which is the analogous parametric procedure. The final results should be similar. Also note that the bootstrap hypothesis test introduced above is two-tailed, which only shows that the proportions are statistically different. A one-tailed test might be more appropriate, but this approach is not required for the midterm.

### Solution goes here --------------------

Part III: Writing an R function

Question 11: (10 pts)

Write a function named my_description that creates a written description of a particular dart player. The function should have two inputs: (1) the case as an integer and (2) the data frame (new_data or data). For example, respondent 3 is Gerwyn Price, i.e., my_description(case=3,df=new_data).
Your function’s output should read similar to:

“Gerwyn Price aka The Iceman, is a 36 year old dart player from Wales. The Iceman is currently ranked number 3 and his career earnings total 1497803 pounds.”

Try to make this exercise fun. You can write the above expression however you want, but your function must generalize to multiple cases. You can also include extra information if desired. To check your function, make sure to display your function’s output for Gerwyn Price.

### Solution goes here --------------------

Question 12: (5 pts)

Use a vectorized operation or an apply function to compute all dart players’ descriptions. Show the head of your resulting vector.

### Solution goes here --------------------
