title: ‘Exam 2: Practice Problems’
author: ” and Wu”
date: “Fall 2021”
output: html_document
Copyright By PowCoder代写 加微信 powcoder
“`{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
## Fitting and Interpreting a Linear Regression Model: the `cars` Data Set
The `cars` data set (note: this is distinct from the `mtcars` data set!) contains data on stopping distances for cars driving at different speeds.
data(cars)
head(cars)
As you can see, the data set has just two columns: `speed` and `dist`, corresponding to speed (in miles per hour) and stopping distance (in feet), respectively.
Note that this data was gathered in the 1920s. Modern cars can go a lot faster and stop far more effectively!
__Part a: plotting the data__
Create a scatter plot of the data, showing stopping distance as a function of speed (i.e., distance on the y-axis and speed on the x-axis).
Do you notice a trend? Discuss (a sentence or two is plenty).
plot(cars$speed, cars$dist)
It appears that stopping distance tends to increase as the speed increases. The increase appears to be approximately linear.
__Part b: fitting linear regression__
Use `lm` to fit a linear regression model that predicts stopping distance from speed (and an intercept term).
That is, fit a model like `dist = beta0 + beta1*speed`.
m1 <- lm( dist ~ 1 + speed, cars ) Use the resulting slope and intercept terms to create the scatter plot from Part a, but this time add a line, __in blue__, indicating our fitted model (i.e., add a line with slope and intercept given by your estimated coefficients). plot(cars$speed, cars$dist) abline(coef=m1$coefficients, col='blue') Do you notice anything about your model? Is the model a good fit for the data? Why or why not? Two or three sentences is plenty here. The model seems to generally match the trend of the data. It does seem like the bulk of the residuals are negative in the "middle" of the plot. Examine the model object returned by `lm`. Should we or should we not reject the null hypothesis that the `speed` variable has a non-zero coefficient? m1 <- lm( dist ~ 1 + speed, cars ) summary(m1) The p-values associated to both the intercept term and the `speed` coefficient indicate that we should reject the null that these coefficients are zero, at least at the $\alpha=0.05$ level. __Part c: accounting for nonlinearity__ Let's see if we can improve our model. We know from physics class that kinetic energy grows like the square of the speed. Since stopping amounts to getting rid of kinetic energy, it stands to reason that stopping distance might be better predicted by the square of the speed, rather than the speed itself. It's not exactly clear in the data that such a trend exists, but let's try fitting a different model and see what happens. Fit the model `dist = beta0 + beta1*speed^2` to the `cars` data. m2 <- lm( dist ~ 1 + I(speed^2), cars ); Plot stopping distance as a function of speed again and again add the regression line __in blue__ from Part c. Then add another line (a curve, really, I guess), __in red__, indicating the prediction of this new model. That is, the predicted distance as a linear function of *squared* speed. __Hint:__ the speed values in the data range from 4 to 25. You may find it useful to create a vector `x` containing a sequence of appropriately-spaced points from 4 to 25 and evaluate your model at those `x` values. __Another hint:__ this is the rare problem where it's probably actually easier to use `ggplot2`, but if you prefer to do everything in R, don't forget about the `lines` function, which might be helpul here. plot(cars$speed, cars$dist) abline(coef=m1$coefficients, col='blue'); lines( cars$speed, m2$coefficients[1] + m2$coefficients[2]*cars$speed^2, col='red' ) ## PCA for data exploration: the Pima Indian data revisited Let's return to the Pima Indian data set that we saw in a previous discussion section. This data set contains biometric data and diabetes diagnoses from a collection of Pima Indian women. The data is described in detail in the `MASS` library documentation. library(MASS) If you have not installed `MASS` (highly unlikely, since it is included by default in most R installations), you'll get an error running the above code block. You can install the library by running `install.packages(MASS)` and then the above should run without a hitch. The following code downloads a slightly modified version of the data set from the course website. The data frame `pima.x` contains the biometric data (e.g., blood glucose levels, age, etc.). Each row corresponds to a patient in the study. The single-column data frame stored in `pima.y` encodes a binary variable (called `type`) indicating whether or not each patient has diabetes. The `i`-th row of `pima.x` corresponds to the `i`th entry in `pima.y`, and these jointly correspond to the `i`-th row of the `Pima.te` data frame in the `MASS` library. download.file('https://kdlevin-uwstat.github.io/STAT340-Fall2021/Exam2/pimaX.csv', 'pimaX.csv') download.file('https://kdlevin-uwstat.github.io/STAT340-Fall2021/Exam2/pimaY.csv', 'pimaY.csv') pima.x <- read.csv('pimaX.csv') pima.y <- read.csv('pimaY.csv') __Part a: plotting to spot correlations__ Create a pairs plot that shows how the seven biometric variables vary with one another. Do you see any obvious correlations between the variables? A sentence or two describing what you see is plenty, here. __Note:__ Pair plots can get quite crowded once they involve more than three or four variables. You may find it helpful, after identifying an interesting pair of variables, to make a full-sized scatter plot of just these two variables to make it easier to see exactly what's going on. pairs(pima.x) A linear relation between `skin` and `bmi` certainly jumps out. `bp` and `bmi`; `age` and `npreg` also appear to have some clear positive correlation, though not as strong on that between `skin` and `bmi`. __Part b: adding diabetes status to the plot__ Use the data in `pima.y` to create the same plot as above, but now color the points according to diabetes status, with diabetic patients colored red and non-diabetic patients colored blue. pima_cols <- c('blue', 'red')[unclass(pima.y$type)] pairs(pima.x, col=pima_cols) Suppose that you were required to choose only two variables from among the seven biometric variables in the data set (i.e., two of the seven columns of `pima.x`) with which to predict diabetes status. Which two would you choose to maximize your predictive accuracy? __Note:__ There is no need to do any additional coding or math, just interpret the plot and explain your answer in a few sentences. There is no single right answer, here. Clear explanation and well-motivated reasoning is enough for full credit. The `glu`-`bp`, `glu`-`skin` and `glu`-`bmi` plots appear to exhibit the best separation between the two populations. These three plots exhibit clear distinct point clouds of diabetic and non-diabetic observations. __Part c: dimensionality reduction with PCA__ Use `prcomp` to extract the first two principle components of the data. Don't forget to use the `scale` argument to ensure that the variables are all on the same scale. pima.pr <- prcomp(pima.x, scale=TRUE, rank.=2) Interpret the loadings in these first two components. Which variables appear to be corrlated based on the loadings? The first principle component indicates that all seven biometric variables are positively correlated (they all have the same sign on PC1), with `skin`, `bmi`, `age` and `bp` being especially highly correlated with one another. The second component indicates that there is additional correlation between `npreg` and `age` and between `skin` and `bmi`, with these correlations being in opposite directions. __Part d: creating a biplot__ Visualize the data projected onto the first two principal components using `biplot` (or, if you prefer, you may use the `ggbiplot` package). Do you see anything interesting? Which variables tend to behave similarly? Do you find this surprising? Unsuprising? biplot(pima.pr) The directions of the loading vectors largely agree with our observations above-- very clear correlations between `age` and `npreg`, `bp` and `glu`, and between `skin` and `bmi`. It is perhaps surprising to see `age` not being so strongly correlated with, e.g., `bp` and `bmi`, given what we know about aging, but this is likely because the correlation between `age` and `npreg` is simply much larger (and thus swamps the other correlations, which we observed above are certainly not zero). __Part e: PCA and classification__ Use the output of `prcomp` to create a scatter plot of the data projected onto the first two principal component directions, with the data colored according to diabetes status. As in our plot in Part b, color points corresponding to diabetic patients red, and color non-diabetic patients blue. __Note:__ it would be great to just do this with `biplot`, but `biplot` does not have a simple way to pass information for coloring individual points in the plot-- the `col` argument specifies colors for the points and the loading vectors. If you are using `ggbiplot`, the support for this is better, and you may use that instead of extracting the components from the `prcomp` output if you wish. __Hint:__ you want the `x` attribute of the object returned by `prcomp`. plot( pima.pr$x[,1], pima.pr$x[,2], col=pima_cols ) Do you see any clear structure in the data? How well do you think we would do if we tried to use PCA coordinates to predict diabetes status? Going back to your explanation from Part a, would you prefer to use these two PCA components or the two variables you picked in Part a if your goal was to accurately predict diabetes status? Diabetic subjects tend to be on the left of the plot, but there is not a particularly clear separation. Intuitively, it seems that simply using, say, `bp` and `glu` would give better predictive performance. The plot below shows that this gives a much better separation of the populations (but still far from perfect, of course). plot( pima.x$glu, pima.x$bp, col=pima_cols ) ## Estimating the success parameter of a Bernoulli The following code downloads a CSV file called `binary.csv` and reads it into a data frame `data` with one column, `X`, which is a vector of binary observations (i.e., every entry is zero or one). download.file(destfile='binary.csv', url='https://kdlevin-uwstat.github.io/STAT340-Fall2021/Exam2/binary.csv') data <- read.csv('binary.csv' ) head(data) Let us suppose that these observations were drawn iid froma Bernoulli distribution with unknown success probability $p$, and we are interested in estimating $p$. __Part a__ Use Monte Carlo to construct a 90% (__note:__ 90%, not 95%) confidence interval for the parameter $p$. phat = mean(data$X) NMC <- 500; reps <- rep( NA, NMC ); n <- nrow(data); for ( i in 1:NMC ) { fake_data <- rbinom(n=n, size=1, prob=phat); reps[i] <- mean(fake_data); sdhat <- sd( reps ); zscore <- qnorm( 0.95 ); CI90 <- c( phat - zscore*sdhat, phat+zscore*sdhat ); __Part b__ Now, use the same method to construct a 95% confidence interval for $p$. zscore <- qnorm( 0.975 ); CI95 <- c( phat - zscore*sdhat, phat+zscore*sdhat ); Is this CI wider or narrower than the one in Part a? Is this what you would expect? Why or why not? The 95% CI is wider, which makes sense-- if we are to "catch" the parameter $p$ with higher probability (i.e., more often), we need a wider interval. __Part c__ Now, using the same data, construct a 95% (__note:__ 95% now, like in part b, *not* 90%) CLT-based confidence interval for $p$. n <- nrow(data) sdhat <- sd( data$X )/sqrt(n); zscore <- qnorm( 0.975); CLTCI95 <- c( phat- zscore*sdhat, phat+zscore*sdhat ); __Part d__ We said in lecture that in general, these two approaches should yield fairly similar results. Is that the case here? Compare the two confidence intervals (e.g., which one is narrower, if any). A sentence or two is fine. The CLT-based CI is slightly wider, but only by about 0.02 (at least when I ran this code-- you will see some random variation if/when you recompile this, of course). 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com