GNBF5030 Homework 3 (Due on Monday 23/11/2020)
For the purpose of practice, do not use any packages that were not introduced in class.
Question 1: Data frame, subsetting, function, and file I/O
Read the states.txt file into a data frame as described.
a. Use logical subsetting to extract a numeric vector called murder_lowincome containing murder rates for just those states with per capita incomes less than the median per capita income. Similarly, extract a vector called murder_highincome containing murder rates for just those states with greater than (or equal to) the median per capita income. Run a two-sample t.test() to determine whether the mean murder rates are different between these two groups.
b. Use the [ , ] syntax to extract a new data frame called states_name_pop containing only the columns for name and population. Extract another data frame called states_gradincome_high containing all columns, and all rows where the income is greater than the median of income or where hs_grad is greater than the median of hs_grad. (There are 35 such states listed in the table.)
c. Create a table called states_by_region_income, where the rows are ordered first by region and second by income. (Hint: You may want to use the order() function here. The order() function can take multiple parameters, as in order(vec1, vec2), which are considered in turn when determining the ordering. See more examples here (https://stats.idre.ucla.edu/r/faq/how-can-i-sort-my-data-in-r/).)
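The multi-key behaviour of order() mentioned in the hint can be seen on a small example. This is only a sketch on made-up data: the data frame toy and its columns group and score are hypothetical and are not taken from states.txt.

```r
# Hypothetical toy data; the names are illustrative only.
toy <- data.frame(
  group = c("b", "a", "b", "a"),
  score = c(10, 30, 5, 20),
  stringsAsFactors = FALSE
)

# order() returns row indices: rows are sorted by group first,
# and ties within a group are broken by score.
idx <- order(toy$group, toy$score)

# Use the index vector in the row position of [ , ] to reorder all columns.
toy_sorted <- toy[idx, ]
print(toy_sorted)
```

Because order() returns row indices rather than sorted values, the same index vector reorders every column of the data frame at once.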
d. To “normalize” a vector of numbers, first subtract the mean from each number and then divide each by the standard deviation of the numbers. Write a function called normalize_mean_sd() that takes a numeric vector and returns the normalized version. The function should work even if any values are NA (the normalized version of NA should simply be NA). Use a vector like sample <- c(3.2, 5.1, 2.4, 1.6, NA, 7.9) to test your function.
e. Use apply(), together with the function normalize_mean_sd() that you created, to normalize the data columns in states.txt. Then write the new table to a new file called states_normalized.txt.
Question 2: R Plotting
The data file ozone.csv was obtained from the supplementary data of Biostatistics: A Methodology for the Health Sciences (http://faculty.washington.edu/heagerty/Books/Biostatistics/index-data.html), describing the weather conditions in New York City in 1973. A full description is available here (http://faculty.washington.edu/heagerty/Books/Biostatistics/DATA/ozonedoc.txt).
a. Make a scatter plot of Solar Radiation against Ozone, a histogram of Wind Speed, and a box plot of Ozone level per month. They should look as follows.
Hint: For the histogram, see the breaks and freq arguments to create 20 bins and display density rather than frequency. For the box plot, try the rainbow function; the colors are not necessarily the same as in the figure below. The las argument changes the label orientation. See ?par. Look at the arguments to boxplot to see how to change the names printed under each box.
b. Create a layout with three columns using the par function. Then plot Ozone versus Solar Radiation, Wind Speed, and Temperature on separate graphs. Use different colors and plotting characters on each plot. Finally, save the plot to a PDF file.
HINT: Create the graph first in RStudio. When you're happy with it, re-run the code preceded by the pdf function to save to a file. Don't forget to use dev.off() to close the file.
c. Temperature and Ozone level seem to be correlated. However, there are some observations that do not seem to fit the trend, especially those with Ozone level > 100.
Modify the plot so that these outlier observations are in a different colour.
Add a legend to help interpret the plot.
HINT: You can break down the problem into the following steps:
Create a blank plot
Identify observations with ozone > 100
Plot the corresponding Temperature and Ozone values for these in red
Identify observations with ozone < 100
Plot the corresponding Temperature and Ozone values for these in orange
Question 3: Hypothesis testing
The gene expression data collected by Golub et al. (1999) are among the classics in bioinformatics. The data are stored in golub.txt, containing gene expression values of 3051 genes (rows) from 38 leukemia patients (columns). Twenty-seven patients (columns 1 to 27) are diagnosed with acute lymphoblastic leukemia (ALL) and eleven (columns 28 to 38) with acute myeloid leukemia (AML). The tumor class of ALL is 0 (negative), while the tumor class of AML is 1 (positive). The important gene CD33 is one of the investigated genes. Its expression values are in row 808 of the golub data. Suppose that the normality of the ALL and AML expression values has been examined. Use an appropriate test to test the equality of the mean CD33 expression in the two groups. State the null hypothesis, the p-value, and your conclusion.
Question 4: Categorical tests
A study in 1986 (Erosion of dental enamel among competitive swimmers at a gas-chlorinated swimming pool, Centerwall et al.) was carried out to see if exposure to acid (via swimming in the club pool) is associated with the erosion of dental enamel. One of the surveys was made of 49 club members with erosion and 235 without. The data are summarized below. What statistical test(s) will you use on this data set in order to answer the question being studied? Perform the test(s) in R and interpret your results.

                                   With erosion   Without erosion
Swimming time per week >= 6 hrs         32              118
Swimming time per week < 6 hrs          17              117