
# Initialize Otter
import otter
grader = otter.Notebook("project2.ipynb")

Project 2: Cardiovascular Disease: Causes, Treatment, and Prevention¶

In this project, you will investigate one of the major causes of death in the world: cardiovascular disease!

Logistics¶
Deadline. This project is due at 11:59pm on Friday, Dec. 02. It’s much better to be early than late, so start working now.

Checkpoint. For full credit, you must complete 2 checkpoints. For checkpoint 1, you must complete the questions up until the end of Part 2, pass all public autograders, and submit them by 11:59pm on Tuesday Nov. 15. For checkpoint 2, you must complete the questions up until the end of Part 3, pass all public autograders, and submit them by 11:59pm on Tuesday Nov. 22.

Partners. You may work with one or two other partners. Your partner(s) must be enrolled in the same lab as you are. Only one of you is required to submit the project. On Gradescope, the person who submits should also designate their partner(s) so that all of you receive credit.

Rules. Don't share your code with anybody but your partner(s). You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.

Support. You are not alone! Come to office hours, post on Piazza, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Piazza post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or CAs for help. You can find contact information for the staff on the Piazza staff page.

Tests. Passing the tests for a question does not mean that you answered the question correctly. Tests usually only check that your syntax is correct or tables have the right column names. However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work!

Advice. Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells.

All of the concepts necessary for this project are found in the textbook. If you are stuck on a particular problem, reading through the relevant textbook section often will help clarify the concept.

To get started, load datascience, numpy, and plots.

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')

In the following analysis, we will investigate the world’s most dangerous killer: Cardiovascular Disease. Your investigation will take you across decades of medical research, and you’ll look at multiple causes and effects across four different studies.

Here is a roadmap for this project:

In Part 1, we’ll investigate the major causes of death in the world during the past century (from 1900 to 2015).
In Part 2, we’ll look at data from the Framingham Heart Study, an observational study into cardiovascular health.
In Part 3, we'll examine the effect that hormone replacement therapy has on the risk of coronary heart disease for post-menopausal women, using data from the Nurses' Health Study and the Heart and Estrogen-Progestin Replacement Study.
In Part 4, we move in a slightly different direction and look at Covid & Diet

Part 1: Causes of Death¶

In order to get a better idea of how we can most effectively prevent deaths, we need to first figure out what the major causes of death are. Run the following cell to read in and view the causes_of_death table, which documents the death rate for major causes of death over the last century (1900 to 2015).

causes_of_death = Table.read_table('causes_of_death.csv')
causes_of_death.show(5)

Year Cause Age Adjusted Death Rate
2015 Heart Disease 168.5
2015 Cancer 158.5
2015 Stroke 37.6
2015 Accidents 43.2
2015 Influenza and Pneumonia 15.2

… (575 rows omitted)

Each entry in the column Age Adjusted Death Rate is a death rate for a specific Year and Cause of death.

If we look at unadjusted data, the age distributions of each sample will influence death rates. In an older population, we would expect death rates to be higher for all causes since old age is associated with higher risk of death. To compare death rates without worrying about differences in the demographics of our populations, we adjust the data for age.

The Age Adjusted specification in the death rate column tells us that the values shown are the death rates that would have existed if the population under study in a specific year had the same age distribution as the “standard” population, a baseline.

You aren’t responsible for knowing how to do this adjustment, but should understand why we adjust for age and what the consequences of working with unadjusted data would be.
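Although you aren't asked to perform the adjustment yourself, the idea can be sketched in a few lines. The rates and weights below are made-up values for illustration, not data from this project:

```python
import numpy as np

# Hypothetical age-specific death rates (per 100,000) for two age groups,
# and the share of each group in a "standard" reference population.
observed_rates = np.array([50.0, 800.0])   # younger group, older group
standard_weights = np.array([0.7, 0.3])    # standard population's age mix

# The age-adjusted rate is the weighted average of the age-specific rates,
# using the standard population's age distribution as the weights.
age_adjusted_rate = np.sum(observed_rates * standard_weights)
print(age_adjusted_rate)  # 275.0
```

Because every year's rates are re-weighted by the same standard age mix, differences between years can't be explained by one year's population simply being older than another's.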

Question 1: What are all the different causes of death in this dataset? Assign an array of all the unique causes of death to all_unique_causes.

all_unique_causes = causes_of_death.group("Cause").column("Cause")
sorted(all_unique_causes)

['Accidents', 'Cancer', 'Heart Disease', 'Influenza and Pneumonia', 'Stroke']

grader.check("q1_1")

Question 2: We would like to plot the death rate for each disease over time. To do so, we must create a table with one column for each cause and one row for each year.

Create a table called causes_for_plotting. It should have one column called Year, and then a column with age-adjusted death rates for each of the causes you found in Question 1. There should be as many of these columns in causes_for_plotting as there are causes in Question 1.

Hint: Use pivot, and think about how the first function might be useful in getting the Age Adjusted Death Rate for each cause and year combination.

# This function may be useful for Question 2.
def first(x):
    return x.item(0)

causes_for_plotting = causes_of_death.pivot("Cause", "Year", values="Age Adjusted Death Rate", collect=first)
causes_for_plotting.show(5)

Year Accidents Cancer Heart Disease Influenza and Pneumonia Stroke
1900 90.3 114.8 265.4 297.5 244.2
1901 109.3 118.1 272.6 312.9 243.6
1902 93.6 119.7 285.2 219.3 237.8
1903 106.9 125.2 304.5 251.1 244.6
1904 112.8 127.9 331.5 291.2 255.2

… (111 rows omitted)

Let’s take a look at how age-adjusted death rates have changed across different causes over time. Run the cell below to compare Heart Disease (a chronic disease) and Influenza and Pneumonia (infectious diseases).

causes_for_plotting.select('Year', 'Heart Disease', 'Influenza and Pneumonia').plot('Year')

Question 3: Beginning in 1900 and continuing until 1950, we observe that death rates for Influenza and Pneumonia decrease while death rates for Heart Disease increase. What might have caused this shift?

Assign disease_trend_explanation to an array of integers that correspond to possible explanations for these trends.

1. People are living longer, allowing more time for chronic conditions to develop.
2. A cure has not been discovered for influenza, so people are still dying at high rates from the flu.
3. Improvements in sanitation, hygiene, and nutrition have reduced the transmission of viruses and bacteria that cause infectious diseases.
4. People are more active, putting them at lower risk for conditions like heart disease and diabetes.
5. Widespread adoption of vaccinations has reduced rates of infectious disease.
6. The medical community has become more aware of chronic conditions, leading to more people being diagnosed with heart disease.

Hint: Consider what contributes to the development of these diseases. What decreases the transmission of infections? Why do we see more lifestyle-related conditions like heart disease?

disease_trend_explanation = make_array(1, 3, 5, 6)
disease_trend_explanation

array([1, 3, 5, 6])

grader.check("q1_3")

This phenomenon is known as the epidemiological transition – in developed countries, the severity of infectious disease has decreased, but chronic disease has become more widespread. Coronary heart disease (CHD) is one of the deadliest chronic diseases to emerge in the past century, and more healthcare resources have been invested in studying it.

Run the cell below to see what a plot of the data would have looked like had you been living in 1950. CHD was the leading cause of death and had killed millions of people without warning. It had become twice as lethal in just a few decades and people didn’t understand why this was happening.

# Do not change this line
causes_for_plotting.where('Year', are.below_or_equal_to(1950)).plot('Year')

The view from 2016 looks a lot less scary, however, since we know it eventually went down. The decline in CHD deaths is one of the greatest public health triumphs of the last half century. That decline represents many millions of saved lives, and it was not inevitable. The Framingham Heart Study, in particular, was the first to discover the associations between heart disease and risk factors like smoking, high cholesterol, high blood pressure, obesity, and lack of exercise.

# Do not change this line
causes_for_plotting.plot('Year')

Let's examine the graph above. You'll see that starting in the 1960s, the death rate due to heart disease steadily declines. Up until then, the effects of smoking, blood pressure, and diet on the cardiovascular system were unknown to researchers. Once these factors started to be noticed, doctors were able to recommend lifestyle changes to at-risk patients to prevent heart attacks and other heart problems.

Note, however, that the death rate for heart disease is still higher than the death rates of all other causes. Even though the death rate is starkly decreasing, there’s still a lot we don’t understand about the causes (both direct and indirect) of heart disease.

Part 2: The Framingham Heart Study¶

The Framingham Heart Study is an observational study of cardiovascular health. The initial study followed over 5,000 volunteers for several decades, and followup studies even looked at their descendants. In this section, we’ll investigate some of the study’s key findings about cholesterol and heart disease.

Run the cell below to examine data for 3842 subjects from the first wave of the study, collected in 1956.

framingham = Table.read_table('framingham.csv')
framingham

AGE SYSBP DIABP TOTCHOL CURSMOKE DIABETES GLUCOSE DEATH ANYCHD
39 106 70 195 0 0 77 0 1
46 121 81 250 0 0 76 0 0
48 127.5 80 245 1 0 70 0 0
61 150 95 225 1 0 103 1 0
46 130 84 285 1 0 85 0 0
43 180 110 228 0 0 99 0 1
63 138 71 205 0 0 85 0 1
45 100 71 313 1 0 78 0 0
52 141.5 89 260 0 0 79 0 0
43 162 107 225 1 0 88 0 0

… (3832 rows omitted)

Each row contains data from one subject. The first seven columns describe the subject at the time of their initial medical exam at the start of the study. The last column, ANYCHD, tells us whether the subject developed some form of heart disease at any point after the start of the study.

You may have noticed that the table contains fewer rows than subjects in the original study – we are excluding subjects who already had heart disease or had missing data.

Section 1: Diabetes and the Population¶

Before we begin our investigation into cholesterol, we’ll first look at some limitations of this dataset. In particular, we will investigate ways in which this is or isn’t a representative sample of the population by examining the number of subjects with diabetes.

According to the CDC, the prevalence of diagnosed diabetes (i.e., the percentage of the population who have it) in the U.S. around this time was 0.93%. We are going to conduct a hypothesis test with the following null and alternative hypotheses:

Null Hypothesis: The probability that a participant within the Framingham Study has diabetes is equivalent to the prevalence of diagnosed diabetes within the population. (i.e., any difference is due to chance).

Alternative Hypothesis: The probability that a participant within the Framingham Study has diabetes is different than the prevalence of diagnosed diabetes within the population.

We are going to use the absolute distance between the observed prevalence and the true population prevalence as our test statistic. The column DIABETES in the framingham table contains a 1 for subjects with diabetes and a 0 for those without.

Question 1: What is the observed value of the statistic in the data from the Framingham Study? You should convert prevalence values to proportions before calculating the statistic!

observed_diabetes_distance = abs(framingham.where("DIABETES", are.equal_to(1)).num_rows / framingham.num_rows - 0.0093)
observed_diabetes_distance

0.01802951587714732

grader.check("q2_1_1")

Question 2: Define the function diabetes_statistic which should return exactly one simulated statistic of the absolute distance between the observed prevalence and the true population prevalence under the null hypothesis. Make sure that your simulated sample is the same size as your original sample.

Hint: The array diabetes_proportions contains the proportions of the population without and with diabetes, respectively.

diabetes_proportions = make_array(0.9907, 0.0093)

def diabetes_statistic():
    # Simulate a sample the same size as the Framingham data, not size 1.
    simulated_stat = sample_proportions(framingham.num_rows, diabetes_proportions)
    return abs(simulated_stat.item(1) - diabetes_proportions.item(1))

grader.check("q2_1_2")

Question 3: Complete the following code to simulate 5000 values of the statistic under the null hypothesis.

diabetes_simulated_stats = make_array()

for i in np.arange(5000):
    simulated_stat = abs(sample_proportions(3842, diabetes_proportions).item(1) - 0.0093)
    diabetes_simulated_stats = np.append(diabetes_simulated_stats, simulated_stat)

diabetes_simulated_stats

array([ 0.00201213, 0.00137153, 0.00123129, …, 0.00345377,
0.00059068, 0.00137153])

grader.check("q2_1_3")

Question 4: Run the following cell to generate a histogram of the simulated values of your statistic, along with the observed value.

If you're not sure whether your histogram is correct, think about how we're generating the sample statistics under the null, and what those statistics should look like.

Make sure to run the cell that draws the histogram, since it will be graded.

Table().with_column('Simulated distance to true prevalence', diabetes_simulated_stats).hist()
plots.scatter(observed_diabetes_distance, 0, color='red', s=30);


Question 5: Based on the histogram above, should you reject the null hypothesis?

According to the histogram, we should reject the null hypothesis: the observed distance falls far outside the range of the simulated statistics, so the empirical p-value is essentially zero.
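The visual judgment can be backed by an empirical p-value: the fraction of simulated statistics at least as extreme as the observed one. The snippet below is a self-contained sketch that rebuilds stand-ins for the simulated statistics and the observed distance using numpy's binomial sampler (equivalent to drawing with sample_proportions), rather than reusing the notebook's variables:

```python
import numpy as np

np.random.seed(0)
n, p = 3842, 0.0093  # sample size and null prevalence from the cells above

# Stand-in for diabetes_simulated_stats: distances simulated under the null.
simulated = np.abs(np.random.binomial(n, p, size=5000) / n - p)

# Stand-in for observed_diabetes_distance (about 0.018 in the data above).
observed = 0.018

# Empirical p-value: fraction of simulated statistics >= the observed one.
p_value = np.count_nonzero(simulated >= observed) / len(simulated)
print(p_value)  # 0.0 here, far below any conventional cutoff, so reject the null
```

A p-value this small says a sample with the Framingham prevalence would essentially never arise by chance from a population where only 0.93% have diagnosed diabetes.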

Question 6: Why might there be a difference between the population and the sample from the Framingham Study? Assuming that all these statements are true – what are possible explanations for the higher diabetes prevalence in the Framingham population?

Assign the name framingham_diabetes_explanations to an array of the following explanations that are consistent with the trends we observe in the data and our hypothesis test results.

1. Diabetes was under-diagnosed in the population (i.e., there were a lot of people in the population who had diabetes but weren't diagnosed). By contrast, the Framingham participants were less likely to go undiagnosed because they had regular medical examinations as part of the study.
2. The relatively wealthy population in Framingham ate a luxurious diet high in sugar (high-sugar diets are a known cause of diabetes).
3. The Framingham Study subjects were older on average than the general population, and therefore more likely to have diabetes.

framingham_diabetes_explanations = make_array(1, 2, 3)
framingham_diabetes_explanations

array([1, 2, 3])

grader.check("q2_1_6")

In real-world studies, getting a truly representative random sample of the population is often incredibly difficult. Even just to accurately represent all Americans, a truly random sample would need to examine people across geographical, socioeconomic, community, and class lines (just to name a few). For a study like this, scientists would also need to make sure the medical exams were standardized and consistent across the different people being examined. In other words, there’s a tradeoff between taking a more representative random sample and the cost of collecting more information from each person in the sample.

The Framingham study collected high-quality medical data from its subjects, even if the subjects may not be a perfect representation of the population of all Americans. This is a common issue that data scientists face: while the available data aren’t perfect, they’re the best we have. The Framingham study is generally considered the best in its class, so we’ll continue working with it while keeping its limitations in mind.

(For more on representation in medical study samples, you can read these recent articles from NPR and Scientific American).

Section 2: Cholesterol and Heart Disease¶

In the remainder of this question, we are going to examine one of the main findings of the Framingham study: an association between serum cholesterol (i.e., how much cholesterol is in someone’s blood) and whether or not that person develops heart disease.

We’ll use the following null and alternative hypotheses:

Null Hypothesis: In the population, the distribution of cholesterol levels among those who get heart disease is the same as the distribution of cholesterol levels among those who do not.

Alternative Hypothesis: The cholesterol levels of people in the population who get heart disease are higher, on average, than the cholesterol levels of people who do not.
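These hypotheses fit the A/B (permutation) testing framework: under the null, the heart-disease labels are exchangeable, so we can shuffle them and recompute the difference in group means. The sketch below uses made-up cholesterol values and labels for illustration only, not the Framingham data; the real test would use the TOTCHOL and ANYCHD columns:

```python
import numpy as np

np.random.seed(42)

# Made-up cholesterol values and CHD labels, for illustration only.
chol = np.array([180., 250., 220., 300., 190., 260., 210., 280.])
chd = np.array([0, 1, 0, 1, 0, 1, 0, 1])

def mean_difference(values, labels):
    # Test statistic: mean cholesterol of the CHD group minus the non-CHD group.
    return values[labels == 1].mean() - values[labels == 0].mean()

observed = mean_difference(chol, chd)

# Under the null the labels are exchangeable, so shuffle them many times.
shuffled_stats = np.array([
    mean_difference(chol, np.random.permutation(chd))
    for _ in range(5000)
])

# One-sided p-value, matching the "higher on average" alternative.
p_value = np.count_nonzero(shuffled_stats >= observed) / len(shuffled_stats)
print(observed, p_value)
```

The one-sided comparison (`>=`) matches the directional alternative above; a two-sided alternative would instead compare absolute values of the statistic.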

Question 1: From the provided null and alternative hypotheses, does it seem reasonable to use A/B testing to determine which model is more consistent with the data? Assign the variable ab_reasonable to True if it seems reasonable and False otherwise.
