Homework 4
Stats 20 Lec 1 and 2 Fall 2020
General Guidelines
Please use R Markdown for your submission. Include the following files:
• Your .Rmd file.
• The compiled/knitted HTML document.
• Your .bib file (if needed).
Name your .Rmd file with the convention 123456789_stats20_hw0.Rmd, where 123456789 is replaced with your UID and hw0 is updated to the actual homework number. Include your first and last name and UID in your submission as well. When you knit to HTML, the HTML file will inherit the same naming convention.
The knitted document should be clear, well-formatted, and contain all relevant R code, output, and explanations. R code style should follow the Tidyverse style guide: https://style.tidyverse.org/.
Note: All questions on this homework should be done using only functions or syntax discussed in Chapters 1–6 of the lecture notes. No credit will be given for use of outside functions.
Basic Questions
Collaboration on basic questions must adhere to Level 0 collaboration described in the Stats 20 Collaboration Policy.
Question 1
The objective of this question is to give practice with writing functions involving matrices.
For any matrix A, the transpose of A, denoted $A^T$ (or sometimes $A'$), is the matrix whose rows are the columns of A and whose columns are the rows of A. For example:
$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$
The t() function returns the transpose of an input matrix.
Write a function called my_t() that returns the transpose of a matrix without the t() function. Account for vector or matrix input. The output of my_t(x) and t(x) should be identical for any vector or matrix x.
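As a quick illustration of why vector input needs separate handling, the sketch below shows the dimensions that the built-in t() returns for a plain vector and for a matrix. The object names v and m are made up for this example.

```r
# A plain vector has no dim attribute; t() treats it as a column
# vector and returns a matrix with a single row.
v <- 1:3
dim(v)       # NULL
dim(t(v))    # 1 3

# A 2 x 3 matrix becomes a 3 x 2 matrix.
m <- matrix(1:6, nrow = 2)
dim(t(m))    # 3 2
```

A check such as identical(my_t(v), t(v)) is one way to compare the two outputs, including their attributes.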
Question 2
The objective of this question is to show how R can be applied in the context of linear regression using matrices.
(a)
Given n pairs of values $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the sample (Pearson) correlation coefficient r is defined by
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$
The cor() function can input two vector arguments of the same length and output the correlation coefficient
between them.
Write a function called my_cor() that computes the correlation between two numeric vectors x and y without the cor(), cov(), var(), or sd() functions. Include a character argument called use that specifies whether to use all observations (use = "everything") or only pairwise complete observations (use = "pairwise.complete.obs") by removing a pair $(x_i, y_i)$ if either $x_i$ or $y_i$ is missing. The output of my_cor(x, y) and cor(x, y) should be identical for any numeric vectors x and y.
Hint: The cor() function also allows for matrix inputs x and y and computes the correlation matrix between columns of x and columns of y. Your my_cor() function does not need to include this functionality.
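To make the two settings of use concrete, the sketch below shows how the built-in cor() behaves on a small vector pair with one missing value. The data are invented purely for illustration.

```r
# Toy data (invented): the third pair has a missing x value.
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 8)

# Default use = "everything": the missing value propagates.
r_all <- cor(x, y)                                    # NA

# Pairwise complete: the pair (NA, 6) is dropped before computing r.
r_pair <- cor(x, y, use = "pairwise.complete.obs")    # 1, up to rounding
```

The remaining pairs lie exactly on the line y = 2x, which is why the second result is 1.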
(b)
Assume there is a linear relationship between variables x and y. The least squares regression line is a linear model that predicts y from x. The equation of the regression line is denoted by y = a + bx.
The coefficients of the regression line are computed by the formulas
$$b = r \frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b\bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the respective means and $s_x$ and $s_y$ are the respective standard deviations for the data vectors x and y.
Write a function called linreg() that inputs two numeric vectors x and y and outputs a numeric vector of
length 2 that corresponds to the coefficients a and b of the least squares regression line that predicts y from x.
(c)
The heights and weights of six self-identified women are given below.
Height (inches)    61   62   63   64   66   68
Weight (pounds)   104  110  125  141  160  170
Assume there is a linear relationship between height and weight. We want to fit a linear regression model that predicts weight from height. Use your linreg() function from (b) to find the equation of the regression line y = a + bx.
(d)
Write a function called linreg_mat() that inputs two numeric vectors x and y and outputs the expression $(X^T X)^{-1} X^T y$, where X is the design matrix given by
$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},$$
where $x_1, x_2, \ldots, x_n$ are the observed data values of the predictor variable x (n is the sample size).
Hint: Add a column of 1's to the predictor (explanatory) vector to create the design matrix X.
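The hint can be sketched on a toy predictor as follows. This only builds the design matrix, not the regression coefficients; the vector x here is invented, and cbind() is assumed to be among the functions covered in the lecture notes.

```r
# Toy predictor values (invented for illustration).
x <- c(2, 4, 6)

# cbind() recycles the scalar 1 down the first column.
X <- cbind(1, x)

dim(X)    # 3 2
X[, 1]    # 1 1 1
```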
(e)
Use your linreg_mat() function from (d) on the height and weight data in (c). Compare your answer with your results from part (c).
(f)
Interpret the slope coefficient in the context of the data.
Question 3
The objective of this question is to introduce how to write infix operators and give further practice writing functions with loops and matrices.
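From the Wikipedia article on matrix multiplication (https://en.wikipedia.org/wiki/Matrix_multiplication): If A is an m × n matrix and B is an n × p matrix,
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \end{pmatrix},$$
the matrix product C = AB is defined to be the m × p matrix
$$C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mp} \end{pmatrix}$$
such that
$$c_{ij} = a_{i1} b_{1j} + \cdots + a_{in} b_{nj} = \sum_{k=1}^{n} a_{ik} b_{kj},$$
for i = 1,…,m and j = 1,…,p.
(a)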
Write an infix operator function called %m% that inputs two numeric matrix arguments A and B and outputs the matrix product of A and B. The body of %m% cannot use %*%. The output of A %m% B and A %*% B should be identical for any numeric matrices A and B.
Hint 1: To create an infix operator, the name of the function must be contained within backticks. That is, assign the function to the name `%m%`.
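The backtick syntax in Hint 1 can be seen on a deliberately unrelated toy operator. The name %+% and its behavior are hypothetical and chosen only to show the mechanics of defining and calling an infix function.

```r
# Define a toy infix operator (hypothetical name) that pastes
# its two arguments together with a space.
`%+%` <- function(a, b) {
  paste(a, b)
}

# Once defined, the operator is used between its two arguments.
result <- "matrix" %+% "product"
result    # "matrix product"
```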
Hint 2: Make sure to check that the matrix arguments are conformable.
(b)
Verify that your %m% operator in (a) works on
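$$X = \begin{pmatrix} 6 & 5 & 4 \\ 3 & 2 & 1 \end{pmatrix} \quad \text{and} \quad Y = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}.$$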
Hint: You may use %*% to check that your function outputs the correct product.
Question 4
The objective of this question is to show how to apply matrix multiplication to matrix exponentiation and give further practice with writing functions with loops and matrices.
For nonnegative integers k, the power of a square matrix A is the matrix product of k copies of A. For example, if A is an n × n matrix, then
$$A^0 = I_n, \quad A^1 = A, \quad A^2 = AA, \quad \ldots, \quad A^k = \underbrace{AA \cdots A}_{k \text{ copies}},$$
where $I_n$ is the n × n identity matrix.
(a)
Use your matrix multiplication infix operator %m% from Question 3 to write an infix operator function called %^% such that A %^% k outputs the kth power of a square numeric matrix A. The body of %^% cannot use %*%.
(b)
Consider the matrix Z defined to be
$$Z = \begin{pmatrix} 0.2 & 0.7 & 0.1 \\ 0.6 & 0.2 & 0.2 \\ 0.4 & 0.1 & 0.5 \end{pmatrix}.$$
Use your %^% operator in (a) to compute $Z^0$, $Z^5$, $Z^{50}$, and $Z^{500}$.
Note: Notice that the values in each row of Z sum to 1. Matrix Z is an example of a stochastic matrix, which describes transition probabilities between states of a Markov chain. To learn more about Markov chains, take Stats 102C.
Question 5
The objective of this question is to give practice with logical indices and writing functions with different types of output.
Write a function called my_which() that returns the indices of TRUE values in a logical vector or matrix without the which() function. Include an optional logical argument called arr.ind with a default of FALSE to optionally return two-dimensional indices of TRUE values in a logical matrix. The output of my_which(x) and which(x) should be identical for any logical vector or matrix x.
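For reference, the sketch below shows the two output types that the built-in which() produces on a small logical matrix. The matrix m is invented for illustration.

```r
# A 2 x 2 logical matrix with TRUE on the diagonal (invented example).
m <- matrix(c(TRUE, FALSE, FALSE, TRUE), nrow = 2)

# Default: an integer vector of column-major (linear) positions.
idx <- which(m)                     # 1 4

# arr.ind = TRUE: a matrix with one row per TRUE value and
# columns named "row" and "col".
res <- which(m, arr.ind = TRUE)
colnames(res)                       # "row" "col"
```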
Question 6
The objective of this question is to give practice with using vectorization with factors.
Download the mlb.RData file (see footnote 1) from CCLE and save it to your working directory. Then run the command load("mlb.RData") to load data from the 2018 season of Major League Baseball. This command will create 6 objects in your workspace:
• league: A factor with 2 levels indicating if the player plays for a team in the American League (AL) or the National League (NL).
• team: A factor with 30 levels indicating which Major League Baseball team the player is a member of.
• pos: A factor with 9 levels indicating the primary position of the player.
• ab: An integer vector indicating the number of "at bats" the player had.
• hit: An integer vector indicating the number of "hits" the player had.
• hr: An integer vector indicating the number of "home runs" the player had.
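Per-group summaries over a factor can be sketched on toy data as below. This assumes tapply() and table() are among the functions covered in Chapters 1–6, which the assignment does not state explicitly; the vectors grp and val are invented and unrelated to the mlb data.

```r
# Toy data (invented): a grouping factor and a numeric vector.
grp <- factor(c("a", "b", "a", "b", "a"))
val <- c(3, 7, 5, 1, 4)

# Apply a function to val separately within each level of grp.
group_max <- tapply(val, grp, max)    # a = 5, b = 7

# Count the observations in each level.
group_n <- table(grp)                 # a = 3, b = 2
```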
(a)
i.
Using one command, find the maximum number of hits from each team.
ii.
Using one command, find the number of players on each team.
iii.
Using one command and no subsetting, find the number of players on each team with at least one home run.
(b)
Definition: The batting average of a player (or a team) is the number of hits (hit) divided by the number of at bats (ab).
i.
Using one command, find the highest batting average for each team among players with at least 100 at bats.
ii.
Using one command, find the batting average for each team.
(c)
Using one command, find the average number of home runs for each position in each league. Which position
has the largest difference between leagues?
(d)
Using one command, compute the median number of players for each position on any team.
Footnote 1: The data found in mlb.RData was obtained from the Lahman R package on CRAN, with modifications.
Intermediate Questions
Collaboration on intermediate questions must adhere to Level 1 collaboration described in the Stats 20 Collaboration Policy.
Question 7
The objective of this question is to give practice writing a function with different outputs for different types of inputs and give further practice with loops and matrices.
(a)
Write a function called my_row() that returns a matrix of integers indicating the row number (i.e., the ijth element is equal to i) without the row() function. The output of my_row(x) and row(x) should be identical for any matrix x.
(b)
Write a function called my_col() that returns a matrix of integers indicating the column number (i.e., the ijth element is equal to j) without the col() function. The output of my_col(x) and col(x) should be identical for any matrix x.
(c)
Write a function called my_diag() that returns the same output as diag() without the diag() function. Account for different types of inputs (scalars, vectors, matrices). Include optional arguments nrow and ncol that specify the dimensions of the output matrix when the input object is a vector. The output of my_diag(x) and diag(x) should be identical for any vector or matrix x.
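The three input types behave differently in the built-in diag(), as the sketch below shows. The object names are made up for this example.

```r
# Scalar input: an identity matrix of that size.
from_scalar <- diag(2)                        # 2 x 2 identity

# Vector input: a square matrix with the vector on the diagonal.
from_vector <- diag(c(4, 5))                  # 2 x 2, diagonal 4, 5

# Matrix input: the diagonal entries are extracted as a vector.
from_matrix <- diag(matrix(1:4, nrow = 2))    # 1 4
```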
Hint: In addition to the lecture notes, you may use the missing() function, if necessary. When used inside the body of a function, the missing() function returns TRUE if a formal argument of the function is missing (i.e., not specified) and has no default value. For more guidance on how to use the missing() function, read
the “Formal Arguments” document in the Required Reading on CCLE.
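A minimal sketch of missing() in use is shown below. The function describe_n() and its arguments are hypothetical and serve only to show the mechanics.

```r
# Hypothetical function: n has no default, so missing(n) reports
# whether the caller supplied it.
describe_n <- function(x, n) {
  if (missing(n)) {
    "n was not supplied"
  } else {
    "n was supplied"
  }
}

without_n <- describe_n(1:3)          # "n was not supplied"
with_n <- describe_n(1:3, n = 2)      # "n was supplied"
```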
Advanced Questions
Collaboration on advanced questions must adhere to Level 1 collaboration described in the Stats 20 Collaboration Policy.
Note: Advanced Questions are intended for further enrichment and a deeper challenge, so they will not count against your grade if they are not completed or attempted.
Question 8
Watch this video for an explanation of how ranked choice voting works: https://youtu.be/Rgo-eJ-D__s (For anyone who needs it, a transcript can be found here: https://bit.ly/32G4bWK)
Download the votes.RData file from CCLE and save it to your working directory. Then run the following command to load the votes object in your workspace:
load("votes.RData")
Note: Do not print the entire votes object. It is extremely bad practice/style to output more than about 10
rows of a matrix (or data frame).
(a)
Pseudocode (or outline) a function that inputs a matrix of ranked choice votes and returns a matrix of results.
(b)
Write a function called tally_rcv() that inputs a matrix of ranked choice votes and returns a matrix of results.
• Your tally_rcv() function must be able to handle any number of choices/candidates and any number of voters.
• The column names of the output matrix must be the names of the choices/candidates.
• The row names of the output matrix must correspond to the appropriate round numbers.
• If there is a tie for last place in a round, you must eliminate candidates in the following manner:
– If the tie occurs in the first round, eliminate the tied candidate who comes last alphabetically by first name.
– In any subsequent round:
∗ If the sum of the tied candidates' votes is less than the number of votes for the next lowest candidate, eliminate both of the tied candidates.
∗ Otherwise, eliminate the tied candidate with the least votes in the previous round.
Hint 1: In addition to the lecture notes, you may use the order() and/or rank() functions, if necessary. The order() function inputs a vector and outputs the indices of the input vector that will return the sorted values. The rank() function inputs a vector and outputs the relative rank of each element.
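The difference between the two functions in Hint 1 can be seen on a small vector. The values below are invented for illustration.

```r
# Toy vote counts (invented).
x <- c(30, 10, 20)

# order(): positions of x that would put it in increasing order.
ord <- order(x)            # 2 3 1
x[ord]                     # 10 20 30

# rank(): the relative rank of each element, in place.
rk <- rank(x)              # 3 1 2

# By default, tied values share the average of their ranks.
rk_ties <- rank(c(5, 5, 1))    # 2.5 2.5 1.0
```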
Hint 2: It may be helpful to consider eliminating a candidate by setting all of their ranks to Inf.
(c)
It is Election Day in Pawnee! Use your tally_rcv() function on the votes data and print the results using the function knitr::kable(). The knitr::kable() function will print the output matrix in a well-formatted table after knitting your file.