IRDR0004 Coursework 1
Computer-Based R Exercise & Report
IRDR0004 Coursework Assessment
Instructions: You are required to submit a report answering all questions any time before
17:00 (UK) on November 29, 2021. Please enter your candidate number (see
Portico>My Studies>Examinations) and the module code (IRDR0004) on the first page. Please
do not include your name in your submission; we will perform anonymous marking to the
best of our ability.
There are two ways to present your work. You can choose either of the two and it will not
affect your grade, although we recommend using RMarkdown to present your
answers:
a) You could enter all written answers, including scripts, outputs displayed in the R
Console and graphical plots, using RMarkdown to generate your report in Word or PDF
format (see video tutorials on Moodle for how to use RMarkdown). Make sure to submit your
RMarkdown script FULLY and NEATLY annotated with the question numbers.
b) Alternatively, you could write your answers manually in Word, accompanied by
screenshots of your R code and its output as requested. Make sure to submit your working
R code as script files for EACH QUESTION.
Failure to comply with the above instructions, or any documentation missing (i.e.,
RMarkdown file or script) from the submission will lead to an automatic deduction of grades.
The report must be submitted through the UK TURNITIN system via Moodle. Bear in mind
that any detected plagiarism will result in zero marks for all involved and disciplinary
procedure will be followed as per UCL policy.
You are provided with a dataset to which you must apply the stated algorithms to
personalise the records based on your UCL ID number, so that each student has a unique
coursework.
Please attempt to answer all questions. Questions with higher difficulty yield more points.
FAQs
1. What do I need to submit?
You must submit two documents: (1) if you choose RMarkdown, you will need to submit the
RMarkdown script (.rmd) with the code used to answer the questions and generate all
outputs, including the report itself, and the full report in Word/PDF format (this document
should be generated from your RMarkdown script) containing the analysis code and all written
answers with their corresponding R console outputs and plots; or (2) alternatively, if you
choose to enter the answers manually in Word, you will need to submit your final report
containing the code screenshots, answers, outputs and plots, and a zipped file containing the
R script for each question.
2. When asked to provide code for certain questions, must I write the full function, or just
type the test code we are supposed to use?
You must provide the code in full, including the optional arguments.
3. I was wondering if it is ok if I can just type down my answers in Word instead of using
RMarkdown?
Yes, you can, but we highly recommend using RMarkdown as it will be easier and
neater to present your answers. You can watch the video posted on Moodle, which provides
an in-depth explanation of how you can create your report and what is expected of you in
terms of quality. Please note that the way you present will not affect your grade.
4. What is the maximum word limit, or a max number of words per question?
There is no word limit. For the statistical questions, when writing your answers (especially
for null and alternative hypotheses, as well as when giving interpretations), keep them as
concise as possible and avoid waffling.
5. Should I treat all questions as separate entities?
Yes, all major questions are self-contained and therefore must be treated separately.
However, the sub-questions within a major question (e.g., Question 1 (major) and
1a, 1b, and 1c (sub-questions)) must be answered carefully, as an answer
given to a previous sub-question can lead to its follow-up. For instance, make sure to provide
the correct answer to Question 2 (b) because any incorrect answer in 2 (b) can potentially
lead to a follow-up error in 2 (c).
6. How do I choose the right test?
Make sure to read and understand the questions whilst bearing in mind the assumptions about
the data at hand: its sample size and whether it is assumed to be normal or non-normal.
The key advice is to know the difference between parametric and non-parametric tests and to
know what is meant by a null and an alternative hypothesis.
Question 1
Cornwall and Devon are areas in the Southwest region of England known to have topsoil
heavily contaminated with arsenic. You are tasked with assessing the burden of environmental
contamination in these areas.
Instructions: Your full UCL student ID number represents the total number of topsoil samples
taken from residential garden soils across the region, and the last 4 digits of your UCL ID
represent the number of topsoil samples with elevated arsenic concentrations exceeding the
UK soil acceptable limits. Use this information to answer questions 1a and 1b.
Note: Your student ID number contains eight digits, and it should look something like
these examples: 18020105 or 19012500. Using 19012500 as a motivating example to explain
the above instruction: 19012500 (full ID) will represent the total number of topsoil samples,
and 2500 (last four digits) is the number of samples exceeding the acceptable limits.
If the last four digits of your ID begin with a zero (for instance, 0105 from 18020105), you
can choose to use the last three (105) or five digits (20105) instead to arrive at a number not
starting with 0.
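The digit-splitting rule in the note above can be sketched in R as follows. This is only an illustration using the example ID 18020105 from the text; the variable names are hypothetical, and the fallback shown here takes the last three digits (the text also allows the last five).

```r
# Illustrative only: split an eight-digit UCL ID into the two counts
# described in the instructions. Variable names are hypothetical.
id <- "18020105"                       # example ID from the note

total_samples <- as.integer(id)        # full ID = total number of samples
last_four     <- substr(id, 5, 8)      # "0105"

# If the last four digits begin with a zero, fall back to the last three.
# (Assumption: the last three digits do not themselves start with zero.)
elevated_samples <- if (startsWith(last_four, "0")) {
  as.integer(substr(id, 6, 8))         # 105
} else {
  as.integer(last_four)
}

print(c(total = total_samples, elevated = elevated_samples))
```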
a. What is the prevalence of soil samples detected to exceed the UK acceptable limits
(express as %)? [1]
b. Develop a function called soilArsenic_Prevalence() in R that calculates the
prevalence of arsenic contamination. The function must express the result as a
percentage. [1]
15 random garden soil samples were studied for arsenic concentrations. To obtain the soil
arsenic value (mg/kg) for each garden, multiply its factor by the last 3 digits of your UCL ID number.
Garden ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Factor 0.07 0.41 0.73 0.28 0.25 0.34 0.39 0.26 0.16 0.33 0.30 0.66 0.56 0.17 0.48
Soil arsenic (mg/kg)
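A minimal sketch of filling in the empty "Soil arsenic (mg/kg)" row, assuming the intended calculation is factor multiplied by the last three digits of the ID. The value 105 comes from the example ID 18020105; the data frame and column names are hypothetical.

```r
# Factor values copied from the table above (gardens 1 to 15).
factor_values <- c(0.07, 0.41, 0.73, 0.28, 0.25, 0.34, 0.39, 0.26,
                   0.16, 0.33, 0.30, 0.66, 0.56, 0.17, 0.48)

last_three <- 105   # last three digits of the example ID 18020105

# Hypothetical data frame holding the personalised values.
soil <- data.frame(garden_id = 1:15, factor = factor_values)
soil$arsenic_mg_kg <- soil$factor * last_three   # assumed rule: factor x digits

print(soil)
```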
c. Calculate the summary statistics for the soil arsenic (mg/kg) and provide an
interpretation of these descriptive measures. [4]
d. What are the best approaches for visualising the above data? Provide a justification
for your answer. Write R code to generate the appropriate plots. [4]
10 marks
Question 2
An England-wide campaign was launched to target residential gardens to bring contamination
levels of arsenic below the acceptable limit of 32 mg/kg. A geographical study design
was used to compare the environmental soil arsenic contamination levels (mg/kg) from 30, 46 and
32 gardens selected from West Midlands, East Midlands, and East of England,
respectively.
a. Use “Question_2.csv” to create a variable called “soilAs” to answer the following
questions 2b, 2c, 2d and 2e.
Use your full UCL ID number in the set.seed() function to begin creating a personalised
column called “soilAs”. Generate random values from a uniform distribution using the
function runif() with the following parameters specified (n = 111, min = 1 and max = 5),
and then subtract the generated values from the variable called “Sample” to create
“soilAs”. [1]
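The personalisation step described in 2a can be sketched as below. To keep the snippet self-contained, a placeholder data frame stands in for the real file (in practice it would come from read.csv("Question_2.csv")); only the column name “Sample” and the runif() parameters are taken from the text, and 19012500 is the example ID used earlier, not a real one.

```r
# Placeholder for the real data; values here are arbitrary stand-ins.
q2 <- data.frame(Sample = seq(10, 65, length.out = 111))

ucl_id <- 19012500                     # example ID; replace with your own
set.seed(ucl_id)                       # makes the random draw reproducible

# Draw 111 values from Uniform(1, 5), as specified in the question.
adjustment <- runif(n = 111, min = 1, max = 5)

# Subtract the generated values from "Sample" to create "soilAs".
q2$soilAs <- q2$Sample - adjustment

head(q2)
```

Re-running the snippet with the same seed reproduces the same “soilAs” column, which is the point of calling set.seed() first.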
b. State the appropriate hypothesis for comparing the distributions across the three regions.
[1]
c. What is the best methodology for testing this hypothesis? State the correct statistical test
and provide a justification for choosing it. [4]
d. Write down the correct R code to compute the statistical test and p-value. [2]
e. Are there any differences across the three regions, and what conclusion can you draw
from this analysis? [2]
10 marks
Question 3
100 patients were admitted to Charing Cross Hospital. Upon admission, their condition was
critical, as it turned out they were symptomatic cases of COVID-19. The patients’
symptoms were treated immediately and monitored round the clock on a 3-hourly basis until their
condition became stable after a week. Blood samples were taken on a 3-hourly basis to
monitor viral loads of infection. These were examined on the spot and a week later to see if
there was a reduction (an indicator that a patient is recovering well).
The lab readings for viral loads from the serologic analysis are stored in “Question_3.csv”. If
you multiply the viral load readings by the last 2 digits of your UCL ID, the values
become standardised.
On a patient-level, you want to assess whether these patients are recovering well.
a. What is the hypothesis for determining whether patients are making a recovery? [2]
b. Briefly discuss some of the issues with the records in “Question_3.csv” and suggest what
can be done to mitigate such issues. Apply the appropriate data cleaning to derive the
desired format to answer 3c. [10]
c. Use the most appropriate statistical methodology for analysing this data and provide a
justification for this approach. Write out the full R script for analysing the data and
performing the statistical test. [5]
d. What conclusions can you draw regarding this cohort of patients? Provide a full
interpretation. [3]
20 marks
Question 4
A study was launched to assess the mean Body Mass Index (BMI) of inhabitants of villages
across Zambia, Zimbabwe, and Malawi, to determine the impact on BMI of environmental levels of
aridity (i.e., dryness) in the villages, and of food shortages experienced by the farmers who
supply foodstuffs to those villages.
Use the dataset ‘Question_4.csv’ to examine the relationship between village-level BMI and
Aridity index. It contains the following independent variables: Farmers affected by food
shortages (categorical with 0 = “Affected” and 1 = “Not Affected”) and Aridity Index
(continuous, whereby a high value means higher dryness and vice versa). The dependent
variable is village-level BMI, estimated as a mean.
To apply the personalization, use the following steps:
● Use your full UCL ID number in the set.seed() function to ensure your data is reproducible and
personalised
● Create a variable called “correction” from a normal distribution with n = 7,201, mean =
0 and standard deviation = 1.5 using the rnorm() function
● Add “correction” to the “EstimatedBMI” variable to create the observed
“EstimatedBMI_new”
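The three personalisation steps above can be sketched as follows. A placeholder data frame replaces the real file so the snippet runs on its own; it assumes ‘Question_4.csv’ has 7,201 rows (implied by n) and a column named “EstimatedBMI”. The ID 19012500 is the example from Question 1, and the placeholder BMI value is arbitrary.

```r
# Placeholder for read.csv("Question_4.csv"); the constant 22 is an
# arbitrary stand-in, not a value from the real dataset.
q4 <- data.frame(EstimatedBMI = rep(22, 7201))

ucl_id <- 19012500                     # example ID; replace with your own
set.seed(ucl_id)                       # step 1: reproducible, personalised draw

# Step 2: one correction per row from Normal(mean = 0, sd = 1.5).
correction <- rnorm(n = 7201, mean = 0, sd = 1.5)

# Step 3: add the correction to the original BMI estimates.
q4$EstimatedBMI_new <- q4$EstimatedBMI + correction

summary(q4$EstimatedBMI_new)
```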
a. Create the additional column for EstimatedBMI_new and use it in the multivariable linear
regression model.
Write the code to perform a multivariable linear regression model in R using the
estimated BMI at village-level as the dependent variable against presence of food
shortage and aridity index as the independent variables.
Show the FULL results for model output and include the 95% confidence intervals. [6]
b. Provide a FULL interpretation of the regression coefficients for presence of food shortage
and aridity index. Include the 95% confidence intervals and state whether each relationship is
statistically significant or not. [10]
c. Construct the multivariable linear regression model. What are the predicted levels of
mean BMI in villages where there are farms with no food shortages, and the aridity index
is 5.5? [4]
d. In your opinion, is this a good, poor or invalid model? Justify your answer. [5]
25 marks
Question 5
A subset of the villages from Zambia which were impacted by food shortage were selected to
assess the direct impacts of environmental levels of aridity in a village on village-level
estimated BMI.
Use the data “Question_5.csv”. To apply the personalization, use the following steps:
● Use your full UCL ID number in the set.seed() function
● Create a variable called “correction” from a normal distribution with n = 506, mean = 0
and standard deviation = 0.1053 using the rnorm() function
● Add “correction” to the “EstimatedBMI” variable to create the observed
“EstimatedBMI_new”
a. Create the additional column for EstimatedBMI_new in the dataset and describe its
overall relationship with the aridity index. Is there anything peculiar about these two variables? [5]
b. Use a univariable regression model to assess the relationship between village-level BMI
and aridity index. You may consider applying a transformation function; explain
why this might be needed. Also, use the appropriate parameters to construct the regression
model. [15]
c. Provide the appropriate interpretation of the regression parameter for aridity index. [5]
d. Use a non-linear regression model that includes a quadratic term and compare its
performance with the model in 5b.
In your opinion, which model performed better? Justify your answer. [10]
35 marks
Question 6
Select the study design accordingly to answer this question. There are broadly 4 different
study design types listed as Pilot, Ecological, Cross-sectional, and Longitudinal.
0 – 1 = Pilot
2 – 3 = Ecological study
4 – 6 = Cross-sectional study
7 – 9 = Longitudinal study
Instructions: Use your UCL student ID number to select two study designs to answer 6a.
Using this ID number (18020155) as a motivating example, the fourth and sixth digits should
each fall in one of the defined ranges for the different study design types. For instance, the
fourth digit in the above ID is ‘2’, so select Ecological study. The sixth digit is ‘1’, so select
Pilot study.
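The digit-to-design lookup can be sketched in R as below, using the example ID 18020155 from the instructions. The function name design_for_digit() is hypothetical, and the sketch only maps a single digit to a design; it does not handle the case where two digits give the same design.

```r
# Hypothetical helper: map one digit (0 to 9) to a study design type,
# following the ranges listed above.
design_for_digit <- function(d) {
  stopifnot(d %in% 0:9)
  if (d <= 1) {
    "Pilot"
  } else if (d <= 3) {
    "Ecological"
  } else if (d <= 6) {
    "Cross-sectional"
  } else {
    "Longitudinal"
  }
}

id     <- "18020155"                           # example ID from the text
digits <- as.integer(strsplit(id, "")[[1]])    # 1 8 0 2 0 1 5 5

fourth_design <- design_for_digit(digits[4])   # digit 2
sixth_design  <- design_for_digit(digits[6])   # digit 1

print(c(fourth = fourth_design, sixth = sixth_design))
```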
a. Use the fourth and sixth digits of your UCL ID number to select two study design types and
discuss five differences between them (if both digits give the same study design, move to the
next digit). Construct a table to contrast the selected study design types. [5]
b. Use the seventh digit of your UCL ID number to select a study design. Write a short
proposal of 250 words outlining a quantitative study that explores the following
topic:
“Impact of surface water floods and risk of physical injuries in rural communities
near large water bodies (i.e., rivers and lakes).” [10]
c. Discuss five problems that can arise from the type of study chosen in question 6b. [10]
25 marks