Myelodysplastic syndromes
For this problem set, you are tasked with analyzing exome sequencing data from 100 MDS patients.
The data are provided in MDSExome.xlsx.
Expectation:
Carefully follow the description of the analysis to be performed. If you believe there is more than one way to solve a problem, pick the one closest to the instructions.
Your answer will be graded based on correctness first, and adherence to the instructions second.
To give you some challenges, some parts of the code will be left blank. You are also not allowed to change any of the code provided.
You may send me a message with a screenshot on Teams if you run into errors you cannot resolve. Each of you can do this 3 times per programming problem set, so use your quota wisely.
For questions about concepts, please add your answer in the same comment cell as the question itself. For example,
Example Q0: Explain the symptoms of MDS
Answer: blah blah
For questions about coding, add your code in the code cell provided. Look for the FILL HERE sign.
If the coding question asks you to print or plot something, always annotate your answer. For example, use print('number of rows in the data:', data.shape[0]) rather than print(data.shape[0])
Add axis labels, title, and legend to the graph as appropriate
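As a minimal sketch of the two annotation rules above, the following uses a small made-up table (the columns `x` and `y` are hypothetical and unrelated to the MDS data). It prints a labeled value and builds a plot with axis labels, a title, and a legend.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy table, not the MDS data
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

# Annotated output: the label says what the number means
print("number of rows in the toy data:", toy.shape[0])

# Annotated plot: axis labels, title, and legend
fig, ax = plt.subplots()
ax.plot(toy["x"], toy["y"], marker="o", label="toy measurements")
ax.set_xlabel("x (arbitrary units)")
ax.set_ylabel("y (arbitrary units)")
ax.set_title("Toy example of an annotated plot")
ax.legend()
```

In a notebook cell, follow this with plt.show() as the templates in this problem set do.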
Q1: Import packages that you need here
*FILL HERE*
In the code below, the data were loaded without specifying index_col because none of the columns can serve as a unique index for the rows.
Q2: If we want to create a unique index for the data in each row, what would be some possible ideas?
data = pd.read_excel('MDSExome.xlsx')
data.head()
Q3: First, let’s examine the data
Print the size of this data (number of rows and columns)
Print the number of missing values in each column
*FILL HERE*
Several values are missing from the Nucleotide Position, Amino Acid Position, and Amino Acid Change columns.
The code below extracts these rows into a DataFrame, rows_with_missing.
Q4: What are the causes for these missing values?
rows_with_missing = data.loc[pd.isna(data[['Nucleotide Position', 'Amino Acid Position', 'Amino Acid Change']]).any(axis = 1), :]
rows_with_missing.head(10)
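To see how the row selection above works step by step, here is a minimal sketch on a made-up table (the column names `a`, `b`, and `c` are hypothetical). It builds the boolean mask first, then collapses it per row with any(axis=1).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names are made up for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
    "c": [10, 20, 30, 40],
})

# Boolean table: True where a value is missing in the chosen columns
missing_mask = pd.isna(toy[["a", "b"]])

# any(axis=1) reduces each row to a single True/False:
# True if at least one of the chosen columns is missing in that row
row_has_missing = missing_mask.any(axis=1)

# Keep only the flagged rows, with all columns
rows_with_missing_toy = toy.loc[row_has_missing, :]
print("rows with at least one missing value:")
print(rows_with_missing_toy)
```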
For Population AF (allele frequency in the human population, estimated from the 1000 Genomes Project), values are missing because some mutations identified here are not well documented in public human mutation databases.
One possibility is that these mutations are rare. We can test this idea in various ways. For example, we can use VAF (variant allele frequency) identified in these 100 patients as a proxy for how rare the mutations are.
Q5: Visualize the distribution of VAF between rows with Population AF and rows without
Use seaborn’s violinplot. We first create a new data column that contains True/False values indicating whether Population AF is present, to aid the plotting.
Use matplotlib’s hist and overlay the two histograms onto the same plot. Set density parameter to show the density, not the count.
data['Has Population AF'] = *FILL HERE*
## Violin plot
_ = sns.violinplot(data = *FILL HERE*, x = *FILL HERE*, y = *FILL HERE*)
plt.show()
## Histogram
plt.hist(data['VAF'].loc[data['Has Population AF']], *FILL HERE*)
*FILL HERE*
plt.show()
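The overlay technique itself can be sketched on synthetic numbers (the two groups below are randomly generated and are not the VAF data; the bin edges and transparency are arbitrary choices). Sharing bin edges and setting density=True makes two groups of different sizes comparable on one axis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic groups for illustration only, with different sizes
rng = np.random.default_rng(2)
values_a = rng.normal(0.3, 0.05, size=300)
values_b = rng.normal(0.5, 0.10, size=150)

# Shared bin edges so the bars of both groups line up
bins = np.linspace(0.0, 1.0, 41)

fig, ax = plt.subplots()
# density=True rescales each histogram so its total area is 1;
# alpha < 1 keeps the overlapping region visible
dens_a, _, _ = ax.hist(values_a, bins=bins, density=True, alpha=0.5, label="group A")
dens_b, _, _ = ax.hist(values_b, bins=bins, density=True, alpha=0.5, label="group B")
ax.set_xlabel("value")
ax.set_ylabel("density")
ax.set_title("Overlaid density histograms (synthetic data)")
ax.legend()
```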
Q6: Use an appropriate non-parametric test to determine whether the VAF of mutations with Population AF is significantly higher than the VAF of mutations without Population AF
Print the test statistic and p-value
What is your conclusion from Q5 and Q6 about the hypothesis that mutations without Population AF are rare mutations?
*FILL HERE*
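One non-parametric option for comparing two independent groups is the Mann-Whitney U test; treating it as the intended test here is an assumption. The sketch below runs a one-sided version with scipy on synthetic groups (not the MDS data), where the first group is generated to be larger on average.

```python
import numpy as np
from scipy import stats

# Synthetic groups for illustration: group_a tends to be larger than group_b
rng = np.random.default_rng(0)
group_a = rng.beta(5, 5, size=200)  # mean 0.5
group_b = rng.beta(2, 8, size=200)  # mean 0.2

# One-sided alternative: values in the first sample tend to be greater
# than values in the second sample
statistic, p_value = stats.mannwhitneyu(group_a, group_b, alternative="greater")
print("Mann-Whitney U statistic:", statistic)
print("one-sided p-value:", p_value)
```

The order of the two samples matters with alternative="greater": the sample expected to be higher goes first.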
To use a t-test, we should first test whether the VAF data are normally distributed.
Q7: Test whether the VAF data are normally distributed
Print the test statistic and p-value from normaltest
Plot a histogram of the VAF values
What is your conclusion? Does the histogram agree with the normaltest result?
*FILL HERE*
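As a reference for reading normaltest output, here is a small sketch using scipy.stats.normaltest on two synthetic samples (neither is the VAF data): one drawn from a normal distribution and one from a strongly skewed exponential distribution. A small p-value is evidence against normality.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only
rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=500),
    "exponential": rng.exponential(scale=1.0, size=500),
}

# normaltest combines skewness and kurtosis into one statistic;
# the null hypothesis is that the sample comes from a normal distribution
results = {}
for name, sample in samples.items():
    statistic, p_value = stats.normaltest(sample)
    results[name] = (statistic, p_value)
    print(name, "sample: statistic =", statistic, ", p-value =", p_value)
```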
Next, let’s identify frequently mutated genes in these patients, because they might be related to MDS.
Q8: Show the mutation frequency for each gene in this dataset
Use value_counts
What are the top 3 mutated genes?
Are they known to be involved in MDS? Provide some literature evidence
*FILL HERE*
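For orientation, this is how value_counts behaves on a tiny made-up Series (the labels `G1`, `G2`, `G3` are hypothetical placeholders, not genes from the dataset). Results are sorted from most to least frequent, and normalize=True returns fractions instead of counts.

```python
import pandas as pd

# Hypothetical labels for illustration only
genes = pd.Series(["G1", "G2", "G1", "G3", "G1", "G2"], name="Gene")

# Counts per label, sorted in descending order by default
counts = genes.value_counts()
print("count per label:")
print(counts)
print("most frequent label:", counts.index[0])

# Same information as fractions of the total
fractions = genes.value_counts(normalize=True)
print("fraction per label:")
print(fractions)
```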
Q9: Use pie charts to show the frequency of values in the following columns
Predicted Impact
Variant Type
Hint: You can use the output from value_counts as input for pie plot
## First pie chart
*FILL HERE*
plt.show()
## Second pie chart
*FILL HERE*
plt.show()
Let’s explore whether VAF is correlated with Population AF
Q10: Visualize the relationship between VAF and Population AF
Use seaborn’s lmplot
*FILL HERE*
Q11: Calculate Pearson’s and Spearman’s correlation between VAF and Population AF
Note: Be careful that some functions do not allow missing values
*FILL HERE*
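The note about missing values can be illustrated on a made-up table (columns `u` and `v` are hypothetical). The scipy functions expect complete pairs, so one approach, sketched here, is to drop incomplete rows first. pandas' Series.corr skips incomplete pairs on its own.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy table with one missing value in each column
toy = pd.DataFrame({
    "u": [0.1, 0.2, np.nan, 0.4, 0.5, 0.7],
    "v": [1.0, 1.9, 3.2, np.nan, 5.1, 6.8],
})

# Keep only rows where both values are present
complete = toy.dropna(subset=["u", "v"])
print("complete pairs:", len(complete))

pearson_r, pearson_p = stats.pearsonr(complete["u"], complete["v"])
spearman_rho, spearman_p = stats.spearmanr(complete["u"], complete["v"])
print("Pearson r:", pearson_r, "p-value:", pearson_p)
print("Spearman rho:", spearman_rho, "p-value:", spearman_p)

# pandas excludes incomplete pairs automatically
pandas_r = toy["u"].corr(toy["v"], method="pearson")
print("Pearson r from pandas:", pandas_r)
```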
We can guess that Variant Type likely contributed to the HIGH, MODERATE, LOW, and MODIFIER predicted impact. But, let’s check the data to be sure
Q12: Generate a table summarizing the relationship between Variant Type and Predicted Impact
Use crosstab
What do you find?
*FILL HERE*
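As a quick illustration of what crosstab produces, the sketch below tabulates two made-up categorical columns (`kind` and `level` are hypothetical names, not columns of the dataset). Each cell counts the rows with that combination, and margins=True adds row and column totals.

```python
import pandas as pd

# Hypothetical categorical columns for illustration only
toy = pd.DataFrame({
    "kind": ["A", "A", "B", "B", "B", "C"],
    "level": ["high", "low", "low", "low", "high", "low"],
})

# Rows are values of the first argument, columns are values of the second
table = pd.crosstab(toy["kind"], toy["level"])
print("counts per combination:")
print(table)

# margins=True appends an "All" row and column with totals
table_with_totals = pd.crosstab(toy["kind"], toy["level"], margins=True)
print("counts with totals:")
print(table_with_totals)
```

Combinations that never occur appear as 0, which makes one-to-one relationships between two columns easy to spot.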
Finally, we want to summarize the mutation profile for each patient
Q13: Generate a table summarizing the number of mutations for each patient
Below is a code template that does this in a roundabout way. Fill in the missing parts.
all_patients = pd.unique(*FILL HERE*)
mutation_count = pd.DataFrame(0, index = all_patients, columns = ['Number of Mutations'])
for patient in all_patients:
    mutation_count.loc[patient, 'Number of Mutations'] = *FILL HERE*
mutation_count = mutation_count.sort_values('Number of Mutations', ascending = False)
mutation_count.head()
Q14: Repeat Q13 with one line of code
mutation_count = pd.DataFrame(*FILL HERE*, index = all_patients, columns = ['Number of Mutations'])
mutation_count.head()
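The contrast between a loop and a vectorized count can be sketched on a made-up column (the `sample` column and its labels are hypothetical). This is one possible pattern under that assumption, not necessarily the intended solution.

```python
import pandas as pd

# Hypothetical identifiers for illustration only
toy = pd.DataFrame({"sample": ["s1", "s2", "s1", "s3", "s1", "s2"]})

# Roundabout way: visit each identifier and count its matching rows
ids = pd.unique(toy["sample"])
loop_counts = pd.DataFrame(0, index=ids, columns=["n"])
for sample_id in ids:
    loop_counts.loc[sample_id, "n"] = (toy["sample"] == sample_id).sum()

# Vectorized way: let pandas count every identifier in one call
vector_counts = toy["sample"].value_counts()

print("counts from the loop:")
print(loop_counts)
print("counts from value_counts:")
print(vector_counts)
```

Because value_counts orders its result by frequency, reindex it by the identifier list before comparing it position by position with the loop output.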
Q15: Generate a table summarizing the frequency of each Variant Type for each patient
Use crosstab
*FILL HERE*
Bonus question:
Come up with a visualization, a table, or a statistical test that says something interesting about this dataset or MDS.
*FILL HERE*