## Packages

```{r message=FALSE}
# add packages you need for this assignment
library(tidyverse) # includes tibbles, ggplot2, dplyr, and more.
```

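In addition, I’d like to ask `R` to favor fixed notation over scientific notation when printing numbers (`scipen` is a penalty applied to scientific notation; it does not set the number of digits):
```{r}
options(scipen=2)
```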

# Part I: Tests for a cybersecurity data set

Let’s revisit cybersecurity breach report data downloaded 2015-02-26 from the US Health and Human Services.
From the *Office for Civil Rights* of the *U.S. Department of Health and Human Services*, I obtained the following information:

“As required by section 13402(e)(4) of the HITECH Act, the Secretary must post a list of breaches of unsecured protected health information affecting 500 or more individuals.

“Since October 2009 organizations in the U.S. that store data on human health are required to report any incident that compromises the confidentiality of 500 or more patients / human subjects (45 C.F.R. 164.408). These reports are publicly available. Our data set was downloaded from the Office for Civil Rights of the U.S. Department of Health and Human Services, 2015-02-26.”

Load this data set and store it as `cyberData`, using the following code:
```{r}
cyberData <- read.csv(url("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/HHSCyberSecurityBreaches.csv"))
```

As you know, this data set contains *all* reports regarding health information data breaches from 2009 to 2015. Let's pretend this is just a *sample* from the population of *all data breaches*, related or not to health information.

### Question 1.

Compare the number of individuals affected by data breaches (column `Individuals.Affected`) in two states, Arkansas (`State=="AR"`) and California (`State=="CA"`). This can be done by performing a test of difference in means, for example. Repeat the same test for another pair of states, California ("CA") and Illinois ("IL").

*Please note, in order to answer this question completely, you will need to run several lines of code, extract subsets of the data appropriately, run a statistical hypothesis test, and interpret the results. Draw a conclusion. Partial answers to the question are insufficient.*

```{r}
# Your code here.
#
```

> Your answer / discussion / interpretation here.
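
As a rough illustration of the workflow Question 1 describes (subset, test, interpret), here is a minimal sketch on made-up numbers. The data frame `toy` and its values are hypothetical and are not taken from `cyberData`. Welch's two-sample t-test via `t.test` is one reasonable choice for a difference in means, not the only one.

```r
# Hypothetical toy data: NOT the real breach counts.
toy <- data.frame(
  State = c("AR", "AR", "AR", "AR", "CA", "CA", "CA", "CA", "CA"),
  Individuals.Affected = c(800, 1200, 950, 3000, 1500, 700, 52000, 2300, 9100)
)

# Extract the two groups to compare.
ar <- toy$Individuals.Affected[toy$State == "AR"]
ca <- toy$Individuals.Affected[toy$State == "CA"]

# Welch two-sample t-test (t.test does not assume equal variances by default).
# H0: the two population means are equal.
res <- t.test(ar, ca)
res$p.value   # compare with the chosen significance level, e.g. 0.05
res$conf.int  # confidence interval for the difference in means
```

Counts of this kind are often strongly right-skewed, so it may be worth looking at a histogram, and possibly a log transform or a rank-based test such as `wilcox.test`, before relying on the t-test alone.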

### Question 2.

Explore the variable `Type.of.Breach` collected in this data set:

* What proportion of data entries in `cyberData` have `Type.of.Breach == "Hacking/IT Incident"`?

```{r}
# Your code here.
#
```

> Answer.

* What are all the different values of `Type.of.Breach` reported in the data set? How many are hacking/IT incidents?

```{r}
# Your code here.
#
# [Hint: make a table using the `table` function.]
# Here is a simple example:
ex <- c("type a", "type b", "type a", "type a", "c")
table(ex)
# MAKE SURE TO REMOVE THE EXAMPLE CODE FROM YOUR HW SUBMISSION!
```

> Your answer here: what do you see??

* What type of breach is reported in the 748th row of `cyberData`? How about the 349th row? Was row 349 counted in the proportion of Hacking/IT incident breaches you computed above? Why or why not?

```{r}
# Your code here.
#
# [Hint: use table again!]
# Here is a simple example:
# compare this to the above code:
ex <- c("type a", "type b", "type a", "type a, type b", "c")
table(ex)
# and also compare to this:
tb. <- strsplit(ex, ', ')
table(unlist(tb.))
```

> Your answer here!

* Perform a hypothesis test on whether there is a difference in proportion of Hacking/IT incidents between the state of Illinois and the state of California. Write your conclusion interpreting the results of the statistical test.

```{r}
# Your code here. Hint: use the prop.test function.
```

> Your answer here!
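
For orientation, here is a minimal sketch of the two computations this question relies on: a proportion taken as the mean of a logical vector, and a two-sample test of proportions with `prop.test`. The vector `types` and the counts in `hacks` and `trials` are hypothetical placeholders, not values from `cyberData`.

```r
# Hypothetical breach types: NOT the real records.
types <- c("Theft", "Hacking/IT Incident", "Loss", "Theft", "Hacking/IT Incident")

# Proportion of exact matches: the mean of a logical vector.
prop_hack <- mean(types == "Hacking/IT Incident")   # 2 of 5, i.e. 0.4

# Hypothetical counts of hacking incidents and of all reports per state.
hacks  <- c(IL = 12, CA = 30)
trials <- c(IL = 100, CA = 120)

# Two-sample test of equal proportions (H0: the two proportions are equal).
res <- prop.test(x = hacks, n = trials)
res$estimate  # sample proportions for each group
res$p.value   # compare with the chosen significance level, e.g. 0.05
```

Note that `==` only counts entries whose value is exactly the given string. Entries that list several breach types in one field would need a different match, for example `grepl` with `fixed = TRUE`.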

-----

# Part II: Review of basic concepts in statistical learning

You will spend some time thinking of some real-life applications for statistical learning.

### Question 3.
Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

> Your answer here!

### Question 4.
Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

> Your answer here!

### Question 5.
Describe three real-life applications in which cluster analysis might be useful.

> Your answer here!

### Question 6.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

> Your answer here!

# Part III: Simple and Multiple Linear Regression

Load the `Boston` data set:

```{r}
# import packages
library(MASS)
# load data
data(Boston)
```

### Question 7.

Construct three simple linear regressions of `medv`, one on each of `crim`, `dis`, and `age`. Based on the output of each model, answer the following questions:

* Is there a relationship between the predictor and the response?

* How strong is the relationship between the predictor and the response?

* Is the relationship between the predictor and the response positive or negative?

* Based on the *RSE* and *$R^2$*, which of the simple linear regression models would you choose? Explain your choice.

```{r}
# Your code here.
```

> Your answer here!
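
A minimal sketch of how the quantities named above can be pulled out of fitted models, using the built-in `mtcars` data as a stand-in for `Boston` (an assumption made purely for illustration; the variables `mpg`, `wt`, and `qsec` have nothing to do with the assignment data).

```r
# Built-in mtcars data stands in for Boston here (illustration only).
fit_wt   <- lm(mpg ~ wt,   data = mtcars)
fit_qsec <- lm(mpg ~ qsec, data = mtcars)

s_wt   <- summary(fit_wt)
s_qsec <- summary(fit_qsec)

# Slope estimate, standard error, t statistic and p-value for the predictor.
# The sign of the estimate gives the direction of the relationship.
coef(s_wt)["wt", ]

# Residual standard error (sigma) and R^2 for each model, side by side.
comparison <- data.frame(
  predictor = c("wt", "qsec"),
  RSE = c(s_wt$sigma, s_qsec$sigma),
  R2  = c(s_wt$r.squared, s_qsec$r.squared)
)
comparison
```

A lower RSE and a higher $R^2$ both point to a closer fit, and for single-predictor models of the same response on the same rows the two criteria rank the models identically.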

### Question 8.

Use all the other features/attributes as predictors of `medv` to construct a multiple linear regression model.

* Interpret the coefficients of all the attributes. Which attributes are insignificant?

* Remove the insignificant attributes and construct a new linear regression model.

* Does the new model improve on the *RSE* and *$R^2$*?

```{r}
# Your code here.
```

> Your answer here!
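
The fit, prune, refit pattern described above can be sketched on simulated data. Everything here (`sim`, `x1` to `x3`, the 5% cutoff) is a hypothetical setup chosen for illustration, not the `Boston` analysis itself.

```r
set.seed(1)  # simulated data, purely illustrative
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 3 * sim$x2 + rnorm(n)  # x3 has no true effect

# Full model: "y ~ ." uses every other column as a predictor.
full <- lm(y ~ ., data = sim)

# p-values of the slope terms (drop the intercept row).
pvals <- coef(summary(full))[-1, "Pr(>|t|)"]
keep  <- names(pvals)[pvals < 0.05]

# Refit with only the predictors significant at the 5% level.
reduced <- lm(reformulate(keep, response = "y"), data = sim)

# Compare residual standard error and R^2 across the two fits.
c(RSE_full = summary(full)$sigma, RSE_reduced = summary(reduced)$sigma)
c(R2_full = summary(full)$r.squared, R2_reduced = summary(reduced)$r.squared)
```

Because the reduced model is nested in the full one, its $R^2$ cannot exceed the full model's, so the adjusted $R^2$ (`summary(fit)$adj.r.squared`) and the RSE are usually the more informative comparison.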