HW2 Due Feb 4
NOTE: Everything you need to do this assignment is here, in your class notes, or was covered in discussion or lecture.
• DO NOT look for solutions online.
• DO NOT collaborate with anyone inside (or outside) of this class.
• Work INDEPENDENTLY on this assignment.
• EVERYTHING you submit MUST be 100% your own original work. Any student suspected of plagiarizing any portion of this assignment, in whole or in part, will be immediately referred to the Dean of Students' office without warning.
Homework Requirements
1. Draw a flowchart of the main function OR write pseudocode that is complete, clear, and concise enough for a person to implement the function accurately in any programming language they are proficient in.
2. Write the function(s) that accurately implement the algorithm(s) as described or requested.
3. Include error handling to ensure your function(s) work properly.
Note: These requirements apply only to the *_impute() functions in Problem 2.
1: Short Answers
(a) Please provide three examples in which data might naturally be presented in a messy (non-tidy) way. For each example, include the context, observations, and variables that would be recorded. Provide a small sample dataset (10 observations) for each example. You are allowed to search online for context, but you MUST cite your sources.
(b) For each of the three examples in (a), describe how the data might be better presented in a tidy way. Please use tidyverse functions to reorganize the small sample datasets in (a) into tidy format.
2: Teacher’s Gradebook
One scenario which naturally creates non-tidy data is a teacher’s gradebook. For example:
UID        Homework_1  Homework_2  …  Homework_10  Exam_1  Exam_2  Exam_3  Section
123456787  70          90          …  80           76      88      70      A
123456788  91          85          …  73           90      100     80      A
123456789  60          71          …  78           88      85      73      A
(a) Create a simulated dataset in R called gradebook that represents a possible gradebook in the basic format given above:
• Each row of the gradebook should contain measurements for a single student.
• Each column should contain scores for individual assignments.
• The last column should be “Section.”
The simulated gradebook should contain the grades for at least 150 students (80 in section A and 70 in section B) and scores for 13 assignments. Set the seed for simulating your dataset to be your UID.
(b) Randomly replace 10% of the scores in the Homework_10 and Exam_3 columns with NA values. For each section, print one student with an NA value in Homework_10, one with an NA value in Exam_3, and one with no NA value in either column. You will use these six students in the rest of the problems to demonstrate your results.
Imputation is the process of replacing missing values by estimated values. The simplest (far from preferred) method to impute values is to replace missing values by the most typical (or “average”) value.
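As a small illustration of the idea only (not a solution to any part of this problem), the sketch below imputes a single toy vector. The helper name impute_center() and the scores are made up for this example and are not part of the assignment.

```r
# Toy score vector with two missing values (made-up data, not from the gradebook).
scores <- c(70, NA, 90, 85, NA, 60)

# Hypothetical helper: replace NA entries with the mean or median
# of the observed (non-missing) entries of a numeric vector.
impute_center <- function(x, center = c("mean", "median")) {
  center <- match.arg(center)
  if (!is.numeric(x)) stop("x must be numeric")
  if (all(is.na(x))) stop("x has no observed values to impute from")
  fill <- if (center == "mean") {
    mean(x, na.rm = TRUE)
  } else {
    median(x, na.rm = TRUE)
  }
  x[is.na(x)] <- fill
  x
}

imputed_mean <- impute_center(scores, "mean")      # NAs become 76.25
imputed_median <- impute_center(scores, "median")  # NAs become 77.5
```

Here the observed values are 70, 90, 85, and 60, so the mean fill is 76.25 and the median fill is 77.5. The observed entries are left unchanged.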
(c) Write a function messy_impute() that will impute missing values in a data frame (or tibble) that is organized in the same non-tidy way as gradebook. You may present your pseudocode or flowchart here.
The messy_impute() function should have the following optional arguments:
• The center argument specifies whether to impute using the mean or the median.
• The margin argument specifies one of two ways to impute values:
◦ Impute the missing values using the center of the observed (non-missing) values in the column.
◦ Impute the missing values using the center of the observed values in the row.
• The range argument specifies the columns or rows used to compute the typical value. It can be given as names or numeric indices.
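To make the two margin options concrete, here is a minimal sketch on a toy matrix. It only shows how a column center differs from a row center; it is not messy_impute(), and the object names (toy, by_col, by_row) and the scores are made up for illustration.

```r
# Toy 3 x 2 score matrix with one missing value per column (made-up data).
toy <- matrix(c(70, NA, 60,
                90, 85, NA),
              nrow = 3,
              dimnames = list(c("s1", "s2", "s3"),
                              c("Homework_1", "Homework_2")))

# Centers of the observed values along each margin.
col_centers <- colMeans(toy, na.rm = TRUE)  # one value per assignment
row_centers <- rowMeans(toy, na.rm = TRUE)  # one value per student

# Row/column positions of the missing cells.
idx <- which(is.na(toy), arr.ind = TRUE)

# Column margin: fill each NA with the center of its column.
by_col <- toy
by_col[idx] <- col_centers[idx[, "col"]]

# Row margin: fill each NA with the center of its row.
by_row <- toy
by_row[idx] <- row_centers[idx[, "row"]]
```

With these numbers, the missing Homework_1 score for s2 becomes 65 under the column margin (the mean of 70 and 60) but 85 under the row margin (the student's only observed score).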
(d) Using the gradebook variable, without reshaping or tidying, impute the missing homework and exam scores with messy_impute() using both the mean and the median of the observed homework and exam scores within sections A and B, respectively.
(e) Using the gradebook variable, without reshaping or tidying, impute each missing homework and exam score with messy_impute() using both the mean and the median of the individual student’s observed homework and exam scores.
(f) Transform the gradebook variable into tidy format. Call the transformed variable gradebook_tidy.
(g) Write a function tidy_impute() that will impute missing values from a specified column in a tibble that is organized in the same tidy way as gradebook_tidy. The tidy_impute() function should have optional arguments that impute values in the same ways as the messy_impute() function. You may present your pseudocode or flowchart here.
(h) Using the gradebook_tidy variable, impute the missing homework and exam scores with tidy_impute() using both the mean and the median of the observed homework and exam scores within sections A and B, respectively.
(i) Using the gradebook_tidy variable, impute each missing homework and exam score with tidy_impute() using both the mean and the median of the individual student’s observed homework and exam scores.