Assignment 1
Please submit a document with your answers to the problems in terms of graphs, tables and text, as required.
• Mark clearly what problem and sub-problem you are answering
• Do NOT just paste in a dump of your R console output
• Do not install packages as part of your code without checking if they already exist
For each problem, submit a text file with a .R extension that contains the R code necessary to reproduce your results: you can double-check that your code executes correctly by running source("filename.R") from a clean R session.
Marks will be deducted for code that doesn’t run or doesn’t reproduce the values reported in your document. Up to 5 marks will be awarded for writing clear and reusable code (pay attention to indentation, choice of variable identifiers, comments, error checking, etc.) and for a well-presented report.
Overall, you are expected to submit 5 files via Learn (1 Word or pdf document, 4 .R files).
Alternatively, if you are already familiar with Rmarkdown, you can submit just 1 .Rmd file.
Problem 1 (25 points)
You have been provided with electronic health record data from a study cohort. Three CSV (Comma-Separated Values) files are provided on Learn.
The first file, cohort.csv, is a cohort description file with fields:
• id = study identifier
• yob = year of birth
• age = age at measurement
• bp = systolic blood pressure
• albumin = last known albuminuric status (factor)
• diabetes = diabetes status
The second file, lab01.csv, is provided by a laboratory after measuring various biochemistry levels in the cohort blood samples. Notice that a separate lab identifier is used to anonymise results from the cohort. The year of birth is also provided as a check that it aligns between the two merged sets.
• LABID = lab identifier
• yob = year of birth
• urea = blood urea
• creatinine = serum creatinine
• glucose = random blood glucose
To link the two data files together, a third linker file linker.csv is provided. The linker file includes a LABID identifier and the corresponding cohort id for each person in the cohort.
(a) Using all three files provided on Learn, load and merge them to create a single data.table-based dataset, cohort.dt. This will be used in your analysis. Perform assertion checks to ensure that all identifiers in cohort.csv have been accounted for in the final table and that any validation fields are consistent between sets. After the assertions are complete, drop the identifier that originated from the lab dataset (LABID). Ensure that a single yob field remains. Ensure that the albumin field is converted to a factor and the ordering of the factor is 1=“normo”, 2=“micro”, 3=“macro”. (5 points)
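A minimal sketch of one possible approach, assuming the field names described above (the merge keys and suffix handling may need adjusting to the actual files):

library(data.table)

cohort <- fread("cohort.csv")
lab    <- fread("lab01.csv")
linker <- fread("linker.csv")

# attach the cohort id to the lab results, then merge with the cohort table;
# the duplicated yob column gets a ".lab" suffix for the consistency check
cohort.dt <- merge(cohort,
                   merge(lab, linker, by = "LABID"),
                   by = "id", suffixes = c("", ".lab"))

# assertion checks: every cohort id is present, and yob agrees between sets
stopifnot(all(cohort$id %in% cohort.dt$id))
stopifnot(all(cohort.dt$yob == cohort.dt$yob.lab))

# drop the lab identifier and the duplicate yob, keeping a single yob field
cohort.dt[, c("LABID", "yob.lab") := NULL]

# albumin as a factor with the required level ordering
cohort.dt[, albumin := factor(albumin, levels = c("normo", "micro", "macro"))]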
(b) Since we are provided with a complete set of year of birth data, evaluate whether any missing age fields can be calculated. Based on the age/year offset being consistent for the remainder of the cohort, update any missing age fields. (3 points)
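One way to do this, assuming the ages were recorded in a single measurement year so that age + yob is constant across the cohort:

# derive the age/year offset implied by the complete rows
offset <- unique(cohort.dt[!is.na(age), age + yob])
stopifnot(length(offset) == 1)   # assert the offset is indeed consistent

# fill in the missing ages from the year of birth
cohort.dt[is.na(age), age := offset - yob]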
(c) Create a separate data table to provide a summary of the data in the cohort.dt table. This should contain a row for each column in the cohort.dt data table. Provide the following columns in your summary table, noting that some of these columns will need to contain character strings as they include brackets and multiple numbers:
• Variable name
• Median (Interquartile Range) for continuous non-categorical data and N (%) for categorical
• Total N
• Missing N (%)
• Create a function that takes a vector of numeric or integer values and returns a character string of the median and interquartile range. This must be in the form “12 (11,13)”. For numeric vectors, the returned values should be rounded to two decimal places. The function should accept a boolean parameter impute.to.mean to impute any missing values to the mean value of the vector. Set the default value of impute.to.mean to FALSE and apply the function to all cohort.dt columns with continuous values. Use the result to populate the Median (IQR) field in your summary table (a minimal sketch of such a function follows this list). (10 points)
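A minimal sketch of such a function, under the specification above:

summary.num <- function(x, impute.to.mean = FALSE) {
  stopifnot(is.numeric(x))              # covers both numeric and integer vectors
  if (impute.to.mean)
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  q <- quantile(x, probs = c(0.5, 0.25, 0.75), na.rm = TRUE)
  if (!is.integer(x))
    q <- round(q, 2)                    # two decimal places for numeric vectors
  sprintf("%s (%s,%s)", q[1], q[2], q[3])
}

# e.g. summary.num(cohort.dt$bp) or summary.num(cohort.dt$glucose, impute.to.mean = TRUE)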
(d) Categorical fields in the summary table should be split into a row per category. For example, the albumin field should show three rows: one for normo, one for micro and one for macro. The number of missing values reported is repeated for each row. The N (%) for each row is the number in the category and the overall percentage in that category from the full cohort.dt table (including those with missing albumin information).
• Create a function that takes a vector of categorical values as input. The name of the category being calculated should also be passed to the function as parameter use.str, e.g. summary.cat(x, use.str = “normo”). The function must return a string in the form “199 (49.75)”, where 199 is the total number of non-missing rows of the factor matching the use.str parameter and 49.75 is the overall percentage of this category based on the N values of the input vector (including missings), rounded to two decimal places.
• Use the function to populate the three categorical albumin rows, each time passing the relevant factor level to be calculated (a sketch of such a function follows below). Combine the results in your summary data table so that it is formatted as follows (with the ? values filled in):
(7 points)
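A sketch of the categorical summary function; note that the percentage denominator is the full length of the input vector, missings included:

summary.cat <- function(x, use.str) {
  n.match <- sum(x == use.str, na.rm = TRUE)     # non-missing matches
  pct <- round(100 * n.match / length(x), 2)     # denominator includes NAs
  sprintf("%d (%.2f)", n.match, pct)
}

# e.g. summary.cat(cohort.dt$albumin, use.str = "normo")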
Problem 2 (25 points)
Files ltegfr1.csv and ltegfr2.csv (available on Learn) contain a longitudinal dataset with records on 250 patients. For each subject, eGFR (estimated glomerular filtration rate, a measure of kidney function) was collected at irregularly spaced time points: the variable “fu.years” contains the follow-up time (that is, the distance from baseline to the date when each eGFR measurement was taken, expressed in years).
(a) Convert the files to data tables and merge in an appropriate way into a single data table, then order the observations according to subject identifier and follow-up time, and add an assertion that the ordering is correct. (3 points)
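One possible sketch, assuming the two files share a subject identifier column (here assumed to be called id) alongside fu.years; adjust the merge keys to the actual files:

library(data.table)

ltegfr1 <- fread("ltegfr1.csv")   # fread returns a data table directly
ltegfr2 <- fread("ltegfr2.csv")
egfr.dt <- merge(ltegfr1, ltegfr2, by = c("id", "fu.years"))

# order by subject identifier, then follow-up time
setkey(egfr.dt, id, fu.years)

# assert that the ordering is correct
stopifnot(!is.unsorted(egfr.dt$id))
stopifnot(egfr.dt[, !is.unsorted(fu.years), by = id][, all(V1)])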
(b) Compute the average eGFR and length of follow-up for each patient, then tabulate the number of patients with average eGFR in the following ranges: (0, 15], (15, 30], (30, 60], (60, 90], (90, ∞). Count and report the number of patients with a missing average eGFR. Again, ensure that the table ordering is restored to subject identifier and follow-up time.
For each patient with at least 3 eGFR readings, fit a linear regression model for the eGFR measurements as a function of time, and store it in the data table. Count how many patients have a slope < -3, in [-3, 0), in [0, 3], and > 3. (4 points)
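A sketch of both parts, under the assumption that the eGFR column in the merged table is called egfr:

# per-patient average eGFR and length of follow-up
avg.dt <- egfr.dt[, .(avg.egfr = mean(egfr, na.rm = TRUE),
                      fu.length = max(fu.years)), by = id]

# tabulate the eGFR ranges and count missing averages
table(cut(avg.dt$avg.egfr, breaks = c(0, 15, 30, 60, 90, Inf)))
sum(is.na(avg.dt$avg.egfr))

# slopes from per-patient regressions (at least 3 readings each);
# the bin boundaries in the text are mixed, so check slopes equal to exactly 0 or 3
slopes <- egfr.dt[, if (.N >= 3) .(slope = coef(lm(egfr ~ fu.years))[2]), by = id]
table(cut(slopes$slope, breaks = c(-Inf, -3, 0, 3, Inf), right = FALSE))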
(c) For patients with average eGFR in the (0,15] range, collect in a data table their identifier, sex, age at baseline, average eGFR, time of last eGFR reading and number of eGFR measurements taken. (9 points)
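A sketch, assuming sex and baseline age are available in the merged table (the column names sex and age are assumptions):

ids.low <- avg.dt[avg.egfr > 0 & avg.egfr <= 15, id]
low.dt <- egfr.dt[id %in% ids.low,
                  .(sex = sex[1],
                    baseline.age = age[1],
                    avg.egfr = mean(egfr, na.rm = TRUE),
                    last.fu = max(fu.years),
                    n.measurements = .N),
                  by = id]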
(d) For patients 3, 37, 162 and 223 (one at a time):
• Plot the patient’s eGFR measurements as a function of time.
• Fit a linear regression model and add the regression line to the plot.
• Report the 95% confidence interval for the regression coefficients of the fitted model.
• Using a different colour, plot a second regression line computed after removing the extreme eGFR values (one each of the highest and the lowest value). (9 points)
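A base-graphics sketch of the per-patient loop, under the same egfr column-name assumption as above:

for (pat in c(3, 37, 162, 223)) {
  pdat <- egfr.dt[id == pat]
  plot(pdat$fu.years, pdat$egfr,
       xlab = "Follow-up (years)", ylab = "eGFR", main = paste("Patient", pat))
  fit <- lm(egfr ~ fu.years, data = pdat)
  abline(fit)
  print(confint(fit, level = 0.95))   # 95% CI for the coefficients

  # second fit after dropping one highest and one lowest eGFR value
  trimmed <- pdat[-c(which.max(egfr), which.min(egfr))]
  abline(lm(egfr ~ fu.years, data = trimmed), col = "red")
}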
Problem 3 (25 points)
The MDRD4 and CKD-EPI equations are two different ways of estimating the glomerular filtration rate (eGFR) in adults, where:
• Scr is serum creatinine (in mmol/L)
• κ is 0.7 for females and 0.9 for males
• α is -0.329 for females and -0.411 for males
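For reference, a commonly published form of the two equations is sketched below; treat this as an assumption, since the exact variant and the creatinine unit conversion (these forms take Scr in mg/dL) should be taken from the original handout. Bracketed exponents are indicators equal to 1 when the condition holds and 0 otherwise:

\mathrm{eGFR}_{\mathrm{MDRD4}} = 175 \cdot \mathrm{Scr}^{-1.154} \cdot \mathrm{Age}^{-0.203} \cdot 0.742^{[\mathrm{female}]} \cdot 1.212^{[\mathrm{Black}]}

\mathrm{eGFR}_{\mathrm{CKD\text{-}EPI}} = 141 \cdot \min(\mathrm{Scr}/\kappa, 1)^{\alpha} \cdot \max(\mathrm{Scr}/\kappa, 1)^{-1.209} \cdot 0.993^{\mathrm{Age}} \cdot 1.018^{[\mathrm{female}]} \cdot 1.159^{[\mathrm{Black}]}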
(a) Write an R6 class Calcegfr that accepts 4 vectors of serum creatinine (scr), age (age), sex (sex) and ethnicity (ethnic). The initialisation method of the class must validate that the vector lengths match and store them for use by the methods. You can assume that sex is a factor variable with levels “Male” and “Female”, and that ethnic is a factor variable with levels “Black” and “Other”. The class must then implement two separate public methods, one for each formula, each storing its result as a vector of the same length as the input:
> egfr.mdrd4 <- function() {
+   scr <- self$scr
+   age <- self$age
+   sex <- self$sex
+   ethnicity <- self$ethnicity
+   # Remainder of function code to
+   # return algorithm result
+ }
>
> egfr.ckdepi <- function() {
+   scr <- self$scr
+   age <- self$age
+   sex <- self$sex
+   ethnicity <- self$ethnicity
+   # Remainder of function code to
+   # return algorithm result
+ }
(14 points)
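A minimal skeleton of how the class could be structured, assuming the R6 package (the equation code itself is left to the methods, as in the template above):

library(R6)

Calcegfr <- R6Class("Calcegfr",
  public = list(
    scr = NULL, age = NULL, sex = NULL, ethnicity = NULL,

    initialize = function(scr, age, sex, ethnic) {
      # validate that all four input vectors have matching lengths
      n <- length(scr)
      if (length(age) != n || length(sex) != n || length(ethnic) != n)
        stop("scr, age, sex and ethnic must all have the same length")
      self$scr <- scr
      self$age <- age
      self$sex <- sex
      self$ethnicity <- ethnic
    },

    egfr.mdrd4 = function() {
      # MDRD4 equation code here; returns a vector of length(self$scr)
    },

    egfr.ckdepi = function() {
      # CKD-EPI equation code here; returns a vector of length(self$scr)
    }
  )
)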
(b) For the scr2.csv dataset available on Learn, instantiate an object of your new class and compute the eGFR according to the two equations. Report (rounded to the second decimal place) mean and standard deviation of the two eGFR vectors and their Pearson correlation coefficient. Also report the same quantities according to strata of MDRD4 eGFR: 0-60, 60-90 and > 90. (5 points)
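A sketch, assuming scr2.csv has columns scr, age, sex and ethnic (the column names are assumptions):

scr2 <- read.csv("scr2.csv", stringsAsFactors = TRUE)
calc <- Calcegfr$new(scr2$scr, scr2$age, scr2$sex, scr2$ethnic)
mdrd4  <- calc$egfr.mdrd4()
ckdepi <- calc$egfr.ckdepi()

# overall mean, standard deviation and correlation, to two decimal places
round(c(mean = mean(mdrd4, na.rm = TRUE), sd = sd(mdrd4, na.rm = TRUE)), 2)
round(c(mean = mean(ckdepi, na.rm = TRUE), sd = sd(ckdepi, na.rm = TRUE)), 2)
round(cor(mdrd4, ckdepi, use = "complete.obs"), 2)

# the same quantities within strata of MDRD4 eGFR
strata <- cut(mdrd4, breaks = c(0, 60, 90, Inf))
tapply(ckdepi, strata,
       function(x) round(c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)), 2))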
(c) Produce a scatter plot of the two eGFR vectors, and add vertical and horizontal lines corresponding to median, first and third quartiles. Is the relationship between the two eGFR equations linear? Justify your answer. (6 points)
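A base-graphics sketch of the plot, continuing from the vectors above:

plot(mdrd4, ckdepi, xlab = "MDRD4 eGFR", ylab = "CKD-EPI eGFR")
abline(v = quantile(mdrd4,  c(0.25, 0.5, 0.75), na.rm = TRUE), lty = 2)
abline(h = quantile(ckdepi, c(0.25, 0.5, 0.75), na.rm = TRUE), lty = 2)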
Problem 4 (20 points)
The built-in infert dataset contains data from a study of secondary infertility (that is, the inability to get pregnant for a woman who has previously been pregnant, recorded in column “case”). Import the dataset, convert it to a data table and perform the following:
(a) Fit a logistic regression model (M1) to predict secondary infertility using age and parity (number of previous successful pregnancies) as predictors. Use the deviance to judge the goodness of fit of the model and report a p-value to 3 significant figures. (4 points)
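A sketch of the model fit and the deviance-based goodness-of-fit test:

library(data.table)
infert.dt <- as.data.table(infert)   # infert ships with base R (datasets package)

M1 <- glm(case ~ age + parity, data = infert.dt, family = binomial)

# goodness of fit: residual deviance against its chi-squared distribution
signif(pchisq(M1$deviance, df = M1$df.residual, lower.tail = FALSE), 3)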
(b) Fit a second model (M2) by adding the number of spontaneous abortions to the set of predictors used in model M1. Report the odds ratio and 95% confidence interval for the spontaneous abortions variable. Perform a likelihood ratio test to compare model M2 to model M1, and report the p-value for the test. (6 points)
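A sketch, using the spontaneous column of infert:

M2 <- glm(case ~ age + parity + spontaneous, data = infert.dt, family = binomial)

# odds ratio and 95% confidence interval for spontaneous abortions
exp(coef(M2)["spontaneous"])
exp(confint(M2, parm = "spontaneous"))

# likelihood ratio test comparing M2 to M1
anova(M1, M2, test = "LRT")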
(c) Implement a function that computes the binomial log-likelihood according to the formula:

\ell(y, \hat{y}) = \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]
The function should have the following signature: loglik.binom <- function(y.obs, y.pred)
where y.obs is a vector of observed outcomes (with values either 0 or 1 to represent controls and cases, respectively), and y.pred is a vector of fitted probabilities learnt from a logistic regression model. Use function loglik.binom() to compute deviance and null deviance for model M2. (6 points)
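A sketch of the function and its use on M2 (fitted() returns the in-sample predicted probabilities):

loglik.binom <- function(y.obs, y.pred) {
  stopifnot(length(y.obs) == length(y.pred))
  sum(y.obs * log(y.pred) + (1 - y.obs) * log(1 - y.pred))
}

# deviance = -2 * log-likelihood of the fitted probabilities;
# the null deviance uses the overall proportion of cases as the prediction
-2 * loglik.binom(infert.dt$case, fitted(M2))
-2 * loglik.binom(infert.dt$case, rep(mean(infert.dt$case), nrow(infert.dt)))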
(d) Using functions glm.cv() and predict.cv() from Lab 3, perform 10-fold cross-validation for model M2 (set the random seed to 1 before creating the folds). To evaluate the predictive performance of model M2, use function loglik.binom() to compute the log-likelihood of the predicted probabilities for each test fold. Report the sum of the test log-likelihoods over all folds. (4 points)
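glm.cv() and predict.cv() are the Lab 3 helpers and their exact signatures are not reproduced here; a base-R sketch of the same cross-validation procedure (the fold-creation scheme is an assumption about their behaviour):

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(infert.dt)))

test.loglik <- sapply(1:10, function(k) {
  fit <- glm(case ~ age + parity + spontaneous,
             data = infert.dt[folds != k], family = binomial)
  pred <- predict(fit, newdata = infert.dt[folds == k], type = "response")
  loglik.binom(infert.dt[folds == k, case], pred)
})

sum(test.loglik)   # sum of the test log-likelihoods over all folds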