Department of Biochemical Engineering
BENG0091 Stochastic Calculus & Uncertainty Analysis Coursework 2
Please read the guidelines before starting the work.
Guidelines
– You need to provide all MATLAB, Python or equivalent code that you have developed as part of your submission to Turnitin. This is compulsory. Include clarifications/comments in your code whenever you feel appropriate.
– You need to submit one version of the code that is executable. Unless the code executes locally and reproduces the results in your report, it will not receive full marks.
o One option is to have the code in the submitted document in a state where we can copy it off your submission and execute it. A few tips to assist you in the process: line numbers left in the code often create an issue with executability. Python code embedded in LaTeX can also create problems with executability. Prior to submission, please check that you can copy the code back from the document you plan to submit and execute it.
o If you do not want to worry about the Turnitin version being executable or not, you can additionally choose to use Datalore as suggested. A detailed video on how to use it is available in Moodle. Please note that submitting via Datalore is optional.
– Your submission (excluding the space taken up by your code) should be no more than 15 pages and contain no more than 15 Figures. Clarity is expected in the Text, in your Figures, and in your code. A single figure/image cannot comprise 10 illegible plots; please use your judgement when preparing your report.
– Please make sure that you give the answer to each section or question in its respective slot. For example, a correct answer to section (a) provided as a response to section (b) will not be considered for marking.
– You need to develop your own code. You are not allowed to use pre-existing toolboxes to conduct stochastic simulations, for example. However, the use of standard Python packages such as pandas or NumPy is acceptable. Regarding random number generators (r.n.g.), you are only allowed to use the uniform r.n.g. available in the programming language you chose (MATLAB, Python etc.). Uniqueness of your scripts will be assessed and will contribute to your mark.
– To achieve full marks in each question, your methodology needs to be correctly implemented and your code needs to be original (i.e. your own work).
– You will be allowed to submit your work multiple times until the deadline. The Turnitin submission will be made available weeks before the deadline. Please note that it is your responsibility to ensure that the submission is made on time. Late submissions, SORAs and ECs will be handled by the Admin Team, not your tutors.
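Because the guidelines restrict random number generation to the uniform generator, draws from any other distribution have to be constructed from uniform draws. The following is a minimal sketch of two common constructions, assuming Python's `random.Random.random()` is the permitted uniform source. The function names are illustrative and are not prescribed by this brief. The Box-Muller transform and CDF inversion are only two of several valid options.

```python
import math
import random


def normal_from_uniform(mean, std, rng):
    """Draw one normal variate via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + std * z


def triangular_from_uniform(low, mode, high, rng):
    """Draw one triangular variate by inverting its CDF."""
    u = rng.random()
    cut = (mode - low) / (high - low)  # CDF value at the mode
    if u < cut:
        return low + math.sqrt(u * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))


# Seeded only so that the sketch is reproducible.
rng = random.Random(0)
normal_draws = [normal_from_uniform(0.0, 1.0, rng) for _ in range(20000)]
tri_draws = [triangular_from_uniform(-1.0, 0.0, 1.0, rng) for _ in range(20000)]
```

Checking the sample mean and variance of the draws against the theoretical values is a quick way to validate a hand-written sampler before using it in a simulation.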
Coursework 2 Brief:
Tuberculosis (TB) is one of the leading infectious disease killers in the world. According to the World Health Organisation (WHO), a total of 1.5 million people died from TB in 2020. Worldwide, TB is the 13th leading cause of death, and the second leading infectious killer after COVID-19. TB is present in all countries and age groups, but it is curable and preventable. Globally, close to one in two TB-affected households face costs higher than 20% of their household income. The world did not reach the milestone of 0% of TB patients and their households facing catastrophic costs as a result of TB disease by 2020 (https://www.who.int/news-room/fact-sheets/detail/tuberculosis). This indicates a clear relationship between socioeconomic parameters and disease treatment and management. It has been shown that patients who have been diagnosed with and treated for TB are susceptible to future pulmonary complications, including other lung diseases and accelerated lung ageing (https://doi.org/10.1016/j.ijid.2020.02.032). These relationships demonstrate the complex landscape of treating the disease, its long-term post-treatment effects, and the socioeconomic factors that play a role in having access to good treatment. Elucidating the nature of these relationships can inform worldwide disease treatment and prevention programmes and help save millions of lives. As a data scientist, you have been asked to look into this further. You are provided with a dataset containing demographic information collected from some key locations around the world and are asked to find which (if any) of these factors would be predictors for the prevalence of residual susceptibility to future complications.
The dataset you have been provided with (TB_demographics.csv) has 3047 data rows, each row corresponding to data collected from one specific region in the world. Data were collected for 19 different factors: column 1 gives the incidence of residual impacts due to TB per capita (per 100,000 people), which you are asked to predict (Y) with your model (Column ID: TARGET_residualsRatePerCapita). The remaining 18 columns represent different types of demographic information collected from each of these regions, which are the input factors (𝑍𝑖) to your model. The details of these input factors, as named in the dataset, are given in Table 1:
Table 1: Dataset input factors summary
𝑍𝑖    Column ID in dataset          Input factor
𝑍1    incidenceRatePerCapita        Mean per capita (100,000) TB diagnoses
𝑍2    popEst2015                    Population of region
𝑍3    MarriedPerCapita              Residents who are married (per capita)
𝑍4    NoHS18_24PerCapita            Residents aged 18-24 highest education attained: less than high school (per capita)
𝑍5    HS18_24PerCapita              Residents aged 18-24 highest education attained: high school diploma (per capita)
𝑍6    BachDeg18_24PerCapita         Residents aged 18-24 highest education attained: bachelor’s degree (per capita)
𝑍7    HS25_OverPerCapita            Residents aged 25 and over highest education attained: high school diploma (per capita)
𝑍8    BachDeg25_OverPerCapita       Residents aged 25 and over highest education attained: bachelor’s degree (per capita)
𝑍9    Employed16_OverPerCapita      Residents aged 16 and over employed (per capita)
𝑍10   Unemployed16_OverPerCapita    Residents aged 16 and over unemployed (per capita)
𝑍11   PrivateCoveragePerCapita      Residents with private health coverage (per capita)
𝑍12   EmpPrivCoveragePerCapita      Residents with employer-provided private health coverage (per capita)
𝑍13   PublicCoveragePerCapita       Residents with government-provided health coverage (per capita)
𝑍14   PublicCoverageAlonePerCapita  Residents with government-provided health coverage alone (per capita)
𝑍15   MarriedHouseholdsPerCapita    Married households (per capita)
𝑍16   avgResidualsPerCapitaPerYear  Average number of people suffering from residual effects from any disease
𝑍17   povertyPerCapita              Poverty score of regions given per capita
𝑍18   AvgHouseholdSizePerCapita     Average household size per capita
You will build a multiple linear regression model taking all the data you have been provided into account, using the following equation:
$$Y = b_0 + \sum_{i=1}^{18} b_i \cdot Z_i$$
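For a single region, the regression equation is an intercept plus a dot product between the coefficients and the input factors. The following is a minimal sketch in plain Python. The function name `predict` and all numeric values are hypothetical, and they are not taken from TB_demographics.csv.

```python
def predict(b0, b, z):
    """Evaluate Y = b0 + sum_i b_i * Z_i for one row of input factors."""
    if len(b) != len(z):
        raise ValueError("b and z must have the same number of entries")
    return b0 + sum(bi * zi for bi, zi in zip(b, z))


# Hypothetical toy values with three factors instead of the full 18.
b_demo = [0.5, -1.0, 2.0]
rows_demo = [[1.0, 2.0, 3.0],
             [0.0, 1.0, 0.5]]
y_demo = [predict(10.0, b_demo, row) for row in rows_demo]
```

Keeping the model as a small function makes it straightforward to reuse the same code for the uncertainty propagation and the sensitivity analysis.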
All your input factors (𝑍𝑖) will be associated with a certain degree of uncertainty arising from the nature of the way the data was collected. These uncertainties are represented by random errors. Furthermore, you have been given the following information in Table 2 concerning the systematic uncertainty associated with some of the regression coefficients. Unless listed in the table below, you can assume all other regression coefficients not to have any uncertainty associated with them. You know that there is no correlation between the parameter uncertainties for the following regression coefficients and the input factor uncertainties.
Table 2: Summary of standard systematic errors for a subset of the regression coefficients
Regression coefficient    Distribution of systematic errors    Systematic Uncertainty (br) % value
b2                        Normal                               2
b7                        Normal                               1
b9
b10
b16
b17
                          Triangular
Q1. You are asked to check the quality of your dataset by identifying and eliminating any outliers. For this purpose, you will investigate all your input variables and output variable (i.e., each of the 19 columns) separately and determine the outliers, if any, in each column. [5 marks]
You are asked to follow a very strict approach: if any data row has at least one variable that is identified as an outlier, you will exclude that row from the analysis. [5 marks]
Explain your decisions stating all the underlying assumptions you have made and state the size of the new dataset you end up with, and the new population properties for each column of variables. [5 marks]
[Total marks available for Q1: 15]
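The strict row-exclusion rule in Q1 can be sketched as follows. The brief does not prescribe an outlier criterion, so this sketch assumes Tukey's interquartile-range fences with the conventional multiplier k = 1.5. Other criteria (for example a z-score threshold) would slot into the same structure. The helper names and the toy data are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def iqr_fences(column, k=1.5):
    """Return (lower, upper) Tukey fences for one column of values."""
    s = sorted(column)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outlier_rows(rows, k=1.5):
    """Keep only rows in which every column lies inside its own fences."""
    columns = list(zip(*rows))
    fences = [iqr_fences(col, k) for col in columns]
    return [
        row for row in rows
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, fences))
    ]


# Hypothetical toy data: two columns, with one extreme value in each column.
rows_demo = [(1.0, 10.0), (2.0, 11.0), (3.0, 12.0), (4.0, 13.0),
             (5.0, 14.0), (100.0, 12.0), (3.0, -50.0)]
clean_demo = drop_outlier_rows(rows_demo)
```

In this sketch the fences are computed once from the original data. Recomputing them after each removal is a different design choice and would generally remove more rows, so whichever variant is used should be stated among the assumptions.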
Q2. You are then asked to determine the uncertainty around the output variable, i.e., the predicted incidence of residual impacts due to TB per capita as given by the multiple linear regression model. For this purpose, you will use the Monte Carlo Method for uncertainty propagation to determine the expanded uncertainty using your dataset. [5 marks]
Make sure to demonstrate that your calculation of the expanded uncertainty has converged. Justify any assumptions you make in your analysis and discuss your results. If you have used the standard MCM, state the number of iterations that would be sufficient to achieve convergence. If you are using adaptive MCM, state your criterion for convergence and at how many iterations that has been reached. [5 marks]
In this case, does it suffice to report expanded uncertainty within a confidence interval? Do the results imply that the coverage interval needs to be calculated? If yes, report the probabilistically symmetric coverage interval. If no, justify your reasoning. [10 marks]
Correct implementation of the codes and their originality: [15 marks]
Interpretation and discussion of your results, presentation of assumptions: [10 marks]
[Total marks available for Q2: 45 marks]
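The Monte Carlo propagation asked for in Q2 can be sketched for a single prediction as follows. This is an illustration under stated assumptions, not a reference solution. It assumes normally distributed errors generated from uniform draws by the Box-Muller transform, and it treats each coefficient uncertainty as a relative standard uncertainty in percent. It uses a hypothetical two-factor model with made-up numbers, and all function names are illustrative.

```python
import math
import random


def box_muller(rng):
    """Standard normal variate built from two uniform draws."""
    u1 = 1.0 - rng.random()  # in (0, 1], keeps the logarithm finite
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * rng.random())


def mcm(b0, b, z, u_z, u_b_pct, trials, rng):
    """Propagate input and coefficient uncertainty through Y = b0 + sum b_i Z_i.

    u_z[i]      standard uncertainty of Z_i (absolute, assumed normal)
    u_b_pct[i]  standard uncertainty of b_i in percent of b_i (assumed normal)
    Returns the sorted list of simulated Y values.
    """
    ys = []
    for _ in range(trials):
        y = b0
        for bi, zi, uz, ub in zip(b, z, u_z, u_b_pct):
            bi_s = bi * (1.0 + ub / 100.0 * box_muller(rng))
            zi_s = zi + uz * box_muller(rng)
            y += bi_s * zi_s
        ys.append(y)
    ys.sort()
    return ys


def summarise(ys, coverage=0.95):
    """Mean, standard uncertainty and probabilistically symmetric interval."""
    n = len(ys)
    mean = sum(ys) / n
    std = math.sqrt(sum((y - mean) ** 2 for y in ys) / (n - 1))
    tail = (1.0 - coverage) / 2.0
    low = ys[int(tail * (n - 1))]
    high = ys[int(round((1.0 - tail) * (n - 1)))]
    return mean, std, (low, high)


# Hypothetical two-factor example; none of these numbers come from the dataset.
rng = random.Random(1)
ys_demo = mcm(b0=1.0, b=[2.0, -1.0], z=[3.0, 4.0],
              u_z=[0.1, 0.2], u_b_pct=[2.0, 0.0], trials=50000, rng=rng)
mean_demo, std_demo, interval_demo = summarise(ys_demo)
```

An expanded uncertainty follows as a coverage factor times `std_demo` (a factor of 2 is a common assumption for roughly 95% coverage when the output is close to normal). Comparing that symmetric band with `interval_demo` is one way to judge whether the coverage interval needs to be reported separately. Convergence can be examined by repeating the run with increasing `trials` and checking that the standard uncertainty and interval endpoints stabilise within a chosen tolerance.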
Q3. The initial challenge you were tasked with was to identify which (if any) of the demographic factors have the largest impact on the incidence rate of residual disease impacts due to TB per capita. For this question, assume that all of the uncertainties around your input factors and the regression coefficients stated in Table 2 follow a uniform distribution. Perform a Sensitivity Analysis by applying the Elementary Effects Method to the multiple linear regression model, assuming an appropriate range of variation for each variable. Apply the Elementary Effects Method using the original sampling strategy proposed by Morris (refer to lecture notes) and justify/show the convergence of your results. [5 marks]
Based on your findings, which demographic factors play an important role in predicting the uncertainty around residual effects of TB manifesting at a later stage in treated individuals? [5 marks]
Correct implementation of code and its originality for the Sensitivity Analysis by applying the Elementary Effects Method: [15 marks]
Interpretation and discussion of results, stating assumptions: [15 marks]
[Total marks available for Q3: 40 marks]
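The Elementary Effects Method with Morris-style trajectory sampling, as requested in Q3, can be sketched as follows. The sketch assumes a p-level grid on the unit hypercube with p even and step Δ = p / (2(p − 1)), and it moves one factor at a time by ±Δ in random order. The function names, the toy model and the factor ranges are hypothetical, and the lecture notes remain the authority on the exact sampling scheme expected.

```python
import math
import random


def morris_trajectory(k, p, rng):
    """Build one trajectory of k + 1 points on a p-level grid in [0, 1]^k.

    Exactly one factor changes by +/- delta between consecutive points.
    """
    delta = p / (2.0 * (p - 1))
    base_levels = [i / (p - 1) for i in range(p // 2)]  # x + delta stays <= 1
    point, signs = [], []
    for _ in range(k):
        x = base_levels[int(rng.random() * len(base_levels))]
        s = 1 if rng.random() < 0.5 else -1
        point.append(x if s == 1 else x + delta)  # start high when stepping down
        signs.append(s)
    order = list(range(k))
    for i in range(k - 1, 0, -1):  # Fisher-Yates shuffle from uniform draws
        j = int(rng.random() * (i + 1))
        order[i], order[j] = order[j], order[i]
    points, moves = [list(point)], []
    for i in order:
        point = list(point)
        point[i] += signs[i] * delta
        points.append(point)
        moves.append((i, signs[i]))
    return points, moves, delta


def morris_screening(model, lows, highs, r, p, rng):
    """Return (mu_star, sigma) of the elementary effects over r trajectories."""
    k = len(lows)
    effects = [[] for _ in range(k)]
    for _ in range(r):
        points, moves, delta = morris_trajectory(k, p, rng)
        # Map each unit-cube point onto the assumed physical ranges.
        ys = [model([lo + u * (hi - lo) for u, lo, hi in zip(pt, lows, highs)])
              for pt in points]
        for step, (i, s) in enumerate(moves):
            effects[i].append((ys[step + 1] - ys[step]) / (s * delta))
    mu = [sum(ee) / r for ee in effects]
    mu_star = [sum(abs(e) for e in ee) / r for ee in effects]
    sigma = [math.sqrt(sum((e - m) ** 2 for e in ee) / (r - 1))
             for ee, m in zip(effects, mu)]
    return mu_star, sigma


def toy_model(x):
    """Hypothetical linear test function; the third factor has no influence."""
    return 1.0 + 2.0 * x[0] - 3.0 * x[1] + 0.0 * x[2]


mu_star_demo, sigma_demo = morris_screening(
    toy_model, lows=[0.0, 0.0, 0.0], highs=[1.0, 2.0, 1.0],
    r=20, p=4, rng=random.Random(2))
```

For this purely linear toy model the effects are constant, so `sigma_demo` is zero up to rounding and `mu_star_demo` equals each coefficient multiplied by the width of its range. Once regression coefficients are varied alongside the inputs, the products b_i · Z_i introduce interactions and non-zero sigma values should be expected. Convergence can be examined by increasing `r` until the ranking of the factors by mu_star stops changing.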