CMDA 3654 Assignment 3
Submission deadline: Mar 6th, 2020 by 11:59 pm (start upload by 11:30 pm)
Submission format: upload Lastname_Firstname_HW_3.pdf (PDF or HTML document) in Canvas
Use R Markdown to create the PDF or HTML
1. (10 points) Data visualization. Consider the babynames data from assignment 1.
(a) Create a subset of the data with female babies named “Mary” from 1880-2014.
(b) Create a subset of the data with female babies named “Sophia” from 1880-2014.
(c) Construct a plot of the proportion of female babies named “Mary” from 1880-2014. On the same plot, add/overlay a plot of the proportion of female babies named “Sophia” from 1880-2014.
(d) Briefly describe your interpretation of the plot in (c).
Hint: You can call geom_line() twice to overlay two or more plots in R, i.e., two or more y-series with the same x-series. Run the following code for illustration.
library(ggplot2)
x = (1:100)/10
y1 = sin(x)
y2 = cos(x)
data = data.frame(x, y1, y2)
# The color labels inside aes() become legend entries, one per series
ggplot(data, aes(x=x, y=y1, color="sin")) + geom_line() +
  geom_line(aes(x=x, y=y2, color="cos"))
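For the subsetting in parts (a) and (b), one possible approach is base R's subset(). The sketch below uses a small made-up data frame, toy, as a stand-in for the real data; the column names (year, sex, name, prop) and all values are assumptions for illustration, so check them against the data you loaded in assignment 1.

```r
# Toy stand-in for the babynames data (all values are made up for illustration)
toy <- data.frame(
  year = c(1880, 1880, 1881, 1881, 2015),
  sex  = c("F", "M", "F", "F", "F"),
  name = c("Mary", "Mary", "Mary", "Sophia", "Mary"),
  prop = c(0.072, 0.0002, 0.070, 0.001, 0.001),
  stringsAsFactors = FALSE
)

# Keep only female babies named "Mary" within the requested year range
mary_f <- subset(toy, name == "Mary" & sex == "F" & year >= 1880 & year <= 2014)
print(mary_f)
```

The same pattern with a different name gives the second subset; each subset can then be passed to its own geom_line() layer.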
2. (15 points) Web scraping. Extract the data table from the safe routes website at http://apps.saferoutesinfo.org/legislation_funding%20/state_apportionment.cfm, and analyze the data to answer the following questions:
(a) Identify the top 5 states that received the most funds in 2010.
(b) Construct a plot of the data set with years on the x-axis, and total funding received by all states on the y-axis.
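Scraped tables often arrive with dollar amounts stored as text such as "$1,000,000"; whether that is the case here is an assumption you should verify on the table you extract. The sketch below shows one way to convert such columns to numbers, rank states for one year, and total each year. The data frame funds, its state labels, and its amounts are all hypothetical.

```r
# Hypothetical stand-in for a scraped table: one row per state, one column per year
funds <- data.frame(
  State  = c("A", "B", "C", "D", "E", "F"),
  `2009` = c("$1,000,000", "$2,500,000", "$900,000",
             "$3,100,000", "$1,750,000", "$2,000,000"),
  `2010` = c("$1,200,000", "$2,400,000", "$950,000",
             "$3,000,000", "$1,800,000", "$2,100,000"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Strip "$" and "," so the text can be converted to numeric
to_number <- function(x) as.numeric(gsub("[$,]", "", x))
year_cols <- setdiff(names(funds), "State")
funds[year_cols] <- lapply(funds[year_cols], to_number)

# States ordered by 2010 funding, largest first; keep the first five
top5_2010 <- head(funds[order(-funds[["2010"]]), "State"], 5)

# Total funding across all states, one value per year
totals <- colSums(funds[year_cols])
print(top5_2010)
print(totals)
```

The named vector totals has one entry per year, which is the shape needed for a year-versus-total plot.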
3. (15 points) Statistical learning intuition. Consider the Iris data set from assignment 2. Construct the following plots in R.
(a) Plot of Petal Length (x-axis) vs Petal Width (y-axis). Briefly describe the relation between petal length and petal width as you observe from the plot.
(b) Plot of Petal Length (x-axis) vs Petal Width (y-axis), with different colors for the different classes of plants.

(c) Plot of Sepal Length (x-axis) vs Sepal Width (y-axis), with different colors for the different classes of plants.
(d) Observing the plots in (b) and (c), if you had to distinguish between classes by using either petal dimensions or sepal dimensions, which one would you choose — petals or sepals, and why?
4. (20 points) Linear regression with synthetic data. Load the synthetic dataset posted in Canvas. This dataset was generated using the following model:
Y = 100 + 0.3X1 + 0.4X2 + 0.5Z1 - 0.5Z2
Of course, as statisticians, in a real-world scenario, we do not know what the true model is.
(a) Fit simple linear regression of (i) y on x1 (ii) y on x2 (iii) y on x3. Describe how good these models are in the following respects: (1) explaining the variation of y, (2) matching the true coefficients, and (3) determining the dependence of y on x1, x2, and x3.
(b) Fit multiple linear regression of (iv) y on x1, x2. How does model (iv) compare against models (i) and (ii) in terms of: (1) explaining the variation of y, (2) matching the true coefficients, and (3) determining the dependence of y on x1 and x2? Explain why you get different slope coefficients for x1 and x2 under model (iv) compared to models (i) and (ii).
(c) Fit multiple linear regression of (v) y on x1, x2, x3. How does model (v) compare against models (i), (ii), and (iii) in terms of: (1) explaining the variation of y, (2) matching the true coefficients, and (3) determining the dependence of y on x1, x2, and x3?
(d) Among the five models that you have used, which is the best model, and why?
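To build intuition for why a slope can change between a simple and a multiple regression, here is a small simulation on separately generated data (not the Canvas dataset). The sample size, coefficients, and the correlation between the two predictors are all assumptions chosen for illustration.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)       # x2 is correlated with x1 by construction
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)      # true slopes: 2 for x1, 3 for x2

simple   <- lm(y ~ x1)                    # leaves x2 out
multiple <- lm(y ~ x1 + x2)               # includes both predictors

# In the simple model the x1 slope also absorbs the effect of the omitted x2,
# so it is expected to land near 2 + 3 * 0.8 = 4.4 rather than near 2.
print(coef(simple)["x1"])
print(coef(multiple)["x1"])

# R-squared cannot decrease when a predictor is added to a least-squares fit
print(c(simple = summary(simple)$r.squared,
        multiple = summary(multiple)$r.squared))
```

When the predictors are uncorrelated, the two x1 slopes are expected to agree much more closely; changing 0.8 to 0 in the simulation shows this.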
5. (20 points) Linear regression. Load the anscombe dataset in R. (Hint: data.anscombe = anscombe)
(a) Fit linear regression of (i) y1 on x1 (ii) y2 on x2 (iii) y3 on x3 and (iv) y4 on x4. Write down the four fitted regression lines.
(b) Construct the following plots (i) y1 vs x1 (ii) y2 vs x2 (iii) y3 vs x3 and (iv) y4 vs x4.
(c) Use your judgement and describe the discrepancy between the plots and the regression lines.
(d) For each of the four cases in the anscombe dataset, explain whether a linear regression model is appropriate.
Hint: Look at regression diagnostics such as a plot of residuals against x, a plot of leverage (or influence), etc.
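The diagnostics named in the hint can be computed with functions from base R's stats package. The sketch below uses the built-in cars dataset rather than anscombe, and the 2p/n leverage cutoff is a common rule of thumb, not a requirement of this assignment.

```r
# Simple linear regression on the built-in cars data (stopping distance vs speed)
fit <- lm(dist ~ speed, data = cars)

res  <- resid(fit)            # residuals
lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation on the fit

# Residuals against the predictor: look for curvature or uneven spread
plot(cars$speed, res, xlab = "speed", ylab = "residual")
abline(h = 0, lty = 2)

# Flag observations whose leverage exceeds the rule-of-thumb cutoff 2p/n
p <- length(coef(fit))
n <- nrow(cars)
high_lev <- which(lev > 2 * p / n)
print(high_lev)
print(head(sort(cook, decreasing = TRUE), 3))
```

Calling plot(fit) on a fitted lm object also produces a standard set of diagnostic plots, including residuals versus fitted values and residuals versus leverage.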
6. (20 points) Linear regression. Load the mtcars data in R. Consider mpg to be the response variable, and all other variables as features.
(a) Compute the correlation coefficient between mpg and all other features in the dataset. What are the two features most strongly correlated with mpg?
(Hint: a strong correlation can be either positive or negative, use abs(x) to obtain the absolute value of a number x.)
(b) Fit two simple linear regression models: model 1 using the strongest feature from (a) and model 2 using the second strongest feature from (a). Report the linear regression formula (i.e., report the line equation) and the value of R^2 from the two models. If you had to choose between these two models, which one would you choose and why?
(c) Fit a multiple linear regression model with all features. Which features are significant in this model? What is the value of R^2 in this model?
(d) Using stepAIC, identify the best subset of features. Fit a multiple linear regression model using the best subset of features. Write down the regression formula and R^2 for this model. Are any of the features from (a) included in this model? Do they have the same coefficients as they had in model 1 or model 2 from (b)? If the coefficient values have changed, explain why.
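The workflow in this question (rank features by absolute correlation, fit a full model, then select a subset with stepAIC) can be sketched on a different built-in dataset. The example below uses swiss with Fertility as the response; the choice of dataset and of direction = "both" are assumptions for illustration, and stepAIC comes from the MASS package.

```r
library(MASS)  # provides stepAIC

# Correlation of every other column with the response, ranked by absolute value
cors   <- cor(swiss)[, "Fertility"]
cors   <- cors[names(cors) != "Fertility"]
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)

# Full model with all features, then AIC-based stepwise selection
full <- lm(Fertility ~ ., data = swiss)
best <- stepAIC(full, direction = "both", trace = FALSE)

print(formula(best))               # features kept by the selection
print(summary(best)$r.squared)     # R^2 of the selected model
print(coef(best))                  # compare with simple-regression slopes
```

Sorting abs(cors) keeps the feature names attached, so the strongest features can be read off directly, while the signs remain available in cors.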
Assignment instructions:
1. Honor code: The Virginia Tech honor pledge for assignments is as follows:
“I have neither given nor received unauthorized assistance on this assignment.”
The pledge is to be written out on all graded assignments at the university and signed by the student. Type up your name to sign.
2. Submit your assignment to Canvas as a document (preferably PDF) generated using R Markdown, clearly marked with the student's name and assignment number, e.g., Sengupta_Srijan_HW3.pdf. Your submission should include R code and answers to problems.
3. Late submission: 10 points off for late submission within 24 hours of deadline, 20 points off for late submission within 48 hours of deadline. Late assignments beyond 48 hours will not be accepted. Check Canvas regularly for assignments and submission dates.
4. You are free to discuss assignment problems with your classmates, but submitted work (answers and code) must be your own work. Students are not allowed to copy computer code or answers from each other, and must write their own code and answers.