Exam 2: Practice Problems
Fitting and Interpreting a Linear Regression Model: the cars Data Set
The cars data set (note: this is distinct from the mtcars data set!) contains data on stopping distances for cars driving at different speeds.
data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
As you can see, the data set has just two columns: speed and dist, corresponding to speed (in miles per hour) and stopping distance (in feet), respectively. Note that this data was gathered in the 1920s. Modern cars can go a lot faster and stop far more effectively!
Part a: plotting the data
Create a scatter plot of the data, showing stopping distance as a function of speed (i.e., distance on the y-axis and speed on the x-axis). Do you notice a trend? Discuss (a sentence or two is plenty).
#TODO: plotting code goes here.
TODO: brief discussion here.
Part b: fitting linear regression
Use lm to fit a linear regression model that predicts stopping distance from speed (and an intercept term). That is, fit a model like dist = beta0 + beta1*speed.
# TODO: regression code goes here.
Use the resulting slope and intercept terms to recreate the scatter plot from Part a, but this time add a line, in blue, indicating our fitted model (i.e., add a line with slope and intercept given by your estimated coefficients).
# TODO: plotting code goes here.
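One possible sketch of the fit-and-overlay step, using only base R. The name `fit` and the axis labels are chosen here for illustration and are not prescribed by the problem.

```r
# Fit dist = beta0 + beta1 * speed on the built-in cars data.
data(cars)
fit <- lm(dist ~ speed, data = cars)   # `fit` is an illustrative name
coef(fit)                              # named vector: (Intercept), speed

# Scatter plot of the data with the fitted line overlaid in blue.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(a = coef(fit)[1], b = coef(fit)[2], col = "blue")
```

Passing the intercept and slope to `abline` explicitly mirrors the wording of the problem; `abline(fit, col = "blue")` draws the same line.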
Do you notice anything about your model? Is the model a good fit for the data? Why or why not? Two or three sentences is plenty here.
TODO: discussion/explanation goes here.
Examine the output produced by lm. Should we or should we not reject the null hypothesis that the coefficient of the speed variable is zero?
#TODO: code goes here (if you need to extract model information not obtained above)
TODO: discussion/explanation goes here.
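If the printed summary is hard to read, the coefficient table can be pulled out directly. A minimal sketch, assuming the model is refit here under the illustrative name `fit` and that a 0.05 significance level is used (the problem does not fix a level):

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# summary() returns a matrix with one row per coefficient and the columns
# Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(fit)$coefficients
coef_table

# Two-sided p-value for H0: the speed coefficient equals zero.
p_speed <- coef_table["speed", "Pr(>|t|)"]
p_speed < 0.05   # TRUE means we reject H0 at the (assumed) 0.05 level
```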
Part c: accounting for nonlinearity
Let’s see if we can improve our model. We know from physics class that kinetic energy grows like the square of the speed. Since stopping amounts to getting rid of kinetic energy, it stands to reason that stopping distance might be better predicted by the square of the speed, rather than the speed itself. It’s not exactly clear in the data that such a trend exists, but let’s try fitting a different model and see what happens.
Fit the model dist = beta0 + beta1*speed^2 to the cars data.
# TODO: regression code goes here.
Plot stopping distance as a function of speed again, and again add the blue regression line from Part b. Then add another line (a curve, really), in red, showing the predictions of this new model, that is, the predicted distance as a linear function of squared speed.
Hint: the speed values in the data range from 4 to 25. You may find it useful to create a vector x containing a sequence of appropriately-spaced points from 4 to 25 and evaluate your model at those x values.
Another hint: this is the rare problem where it’s probably actually easier to use ggplot2, but if you prefer to do everything in base R, don’t forget about the lines function, which might be helpful here.
# TODO: plotting code goes here.
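A base-R sketch of one way to fit and draw both models together. The names `fit_lin`, `fit_quad`, `x`, and `pred_quad` are illustrative, and the grid of 200 points is an arbitrary choice.

```r
data(cars)
fit_lin  <- lm(dist ~ speed, data = cars)
# I() protects ^ so that it means arithmetic squaring inside the formula.
fit_quad <- lm(dist ~ I(speed^2), data = cars)

# Evaluate the squared-speed model on a fine grid covering the observed speeds.
x <- seq(4, 25, length.out = 200)
pred_quad <- predict(fit_quad, newdata = data.frame(speed = x))

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit_lin, col = "blue")      # straight line from the linear model
lines(x, pred_quad, col = "red")   # curve from the squared-speed model
```

Without `I()`, the formula `dist ~ speed^2` is interpreted with formula operators and reduces to `dist ~ speed`, which would silently refit the linear model.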
PCA for data exploration: the Pima Indian data revisited
Let’s return to the Pima Indian data set that we saw in a previous discussion section. This data set contains biometric data and diabetes diagnoses from a collection of Pima Indian women. The data is described in detail in the MASS library documentation.
library(MASS)
If you have not installed MASS (highly unlikely, since it is included by default in most R installations), you’ll get an error running the above code block. You can install the library by running install.packages("MASS") (note the quotes around the package name), after which the above should run without a hitch.
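As a small convenience, the check and the install can be combined. This is only a sketch; it assumes a CRAN mirror is configured in case an install is actually needed.

```r
# Install MASS only if it is not already available, then load it.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
library(MASS)
```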
The following code downloads a slightly modified version of the data set from the course website. The data frame pima.x contains the biometric data (e.g., blood glucose levels, age, etc.). Each row corresponds to a patient in the study. The single-column data frame stored in pima.y encodes a binary variable (called type) indicating whether or not each patient has diabetes. The i-th row of pima.x corresponds to the i-th entry in pima.y, and these jointly correspond to the i-th row of the Pima.te data frame in the MASS library.
download.file('https://kdlevin-uwstat.github.io/STAT340-Fall2021/Exam2/pimaX.csv', 'pimaX.csv')
download.file('https://kdlevin-uwstat.github.io/STAT340-Fall2021/Exam2/pimaY.csv', 'pimaY.csv')
pima.x <- read.csv('pimaX.csv')
pima.y <- read.csv('pimaY.csv')
Part a: plotting to spot correlations
Create a pairs plot that shows how the seven biometric variables vary with one another. Do you see any obvious correlations between the variables? A sentence or two describing what you see is plenty here.
Note: Pairs plots can get quite crowded once they involve more than three or four variables. You may find it helpful, after identifying an interesting pair of variables, to make a full-sized scatter plot of just these two variables to make it easier to see exactly what’s going on.
# TODO: code goes here.
TODO: discussion/explanation goes here.
Part b: adding diabetes status to the plot
Use the data in pima.y to create the same plot as above, but now color the points according to diabetes status, with diabetic patients colored red and non-diabetic patients colored blue.
# TODO: code goes here.
Suppose that you were required to choose only two variables from among the seven biometric variables in the data set (i.e., two of the seven columns of pima.x) with which to predict diabetes status. Which two would you choose to maximize your predictive accuracy?
Note: There is no need to do any additional coding or math; just interpret the plot and explain your answer in a few sentences. There is no single right answer here. Clear explanation and well-motivated reasoning are enough for full credit.
TODO: discussion/explanation goes here.
Part c: dimensionality reduction with PCA
Use prcomp to extract the first two principal components of the data. Don’t forget to use the scale argument to ensure that the variables are all on the same scale.
# TODO: code goes here
Interpret the loadings in these first two components. Which variables appear to be correlated based on the loadings?
#TODO: code goes here if you didn't display the loadings above.
TODO: discussion/explanation goes here.
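A minimal sketch of the prcomp call and of where the loadings live. To stay self-contained it uses the seven numeric columns of MASS::Pima.te as a stand-in for pima.x (an assumption: the downloaded file is described above only as a slightly modified version of that data), and the names `X` and `pca` are illustrative.

```r
library(MASS)

# Stand-in for pima.x: the seven biometric columns of Pima.te (assumption).
X <- Pima.te[, c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")]

# The full argument name is `scale.`; TRUE standardizes each variable
# to unit variance before the components are computed.
pca <- prcomp(X, scale. = TRUE)

# Loadings of the first two principal components (one row per variable).
pca$rotation[, 1:2]

# Standard deviation and share of variance explained by PC1 and PC2.
summary(pca)$importance[, 1:2]
```

Variables whose loadings on a component have the same sign and similar magnitude tend to move together along that component, which is the pattern to look for when reading the table.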
Part d: creating a biplot
Visualize the data projected onto the first two principal components using biplot (or, if you prefer, you may use the ggbiplot package). Do you see anything interesting? Which variables tend to behave similarly? Do you find this surprising? Unsurprising?
#TODO: code goes here.
TODO: discussion/explanation goes here.
Part e: PCA and classification
Use the output of prcomp to create a scatter plot of the data projected onto the first two principal component directions, with the data colored according to diabetes status. As in our plot in Part b, color points corresponding to diabetic patients red, and color non-diabetic patients blue.
Note: it would be great to just do this with biplot, but biplot does not have a simple way to pass information for coloring individual points in the plot. Its col argument only sets one color for all the points and one for the loading vectors. If you are using ggbiplot, the support for this is better, and you may use that instead of extracting the components from the prcomp output if you wish.
Hint: you want the x element of the object returned by prcomp (accessed with $x).
# TODO: code goes here.
Do you see any clear structure in the data? How well do you think we would do if we tried to use PCA coordinates to predict diabetes status? Going back to your explanation from Part b, would you prefer to use these two PCA components or the two variables you picked in Part b if your goal was to accurately predict diabetes status? Why?
TODO: discussion/explanation goes here.
Estimating the success parameter of a Bernoulli
The following code downloads a CSV file called binary.csv and reads it into a data frame data with one column, X, which is a vector of binary observations (i.e., every entry is zero or one).
download.file(destfile='binary.csv', url='https://kdlevin-uwstat.github.io/STAT340-Fall2021/Exam2/binary.csv')
data <- read.csv('binary.csv')
head(data)
Let us suppose that these observations were drawn iid from a Bernoulli distribution with unknown success probability \(p\), and we are interested in estimating \(p\).
Use Monte Carlo to construct a 90% (note: 90%, not 95%) confidence interval for the parameter \(p\).
# TODO: code goes here.
Now, use the same method to construct a 95% confidence interval for \(p\). Is this CI wider or narrower than the 90% interval you just constructed? Is this what you would expect? Why or why not?
TODO: brief discussion/explanation goes here.
Now, using the same data, construct a 95% (note: 95% now, as in the previous step, not 90%) CLT-based confidence interval for \(p\).
#TODO: code goes here.
We said in lecture that in general, these two approaches should yield fairly similar results. Is that the case here? Compare the two confidence intervals (e.g., which one is narrower, if any). A sentence or two is fine.
TODO: brief discussion/explanation goes here.
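The two interval constructions can be sketched side by side as follows. This is only one reading of "Monte Carlo" here (plug in the estimate of \(p\), simulate many replicate data sets from it, and take quantiles of the re-estimated values), and it may differ in detail from the version given in lecture. To stay self-contained, the vector `x` is simulated rather than read from binary.csv; the sample size 200, the true probability 0.3, and the number of replicates are all arbitrary assumptions.

```r
set.seed(1)
x <- rbinom(200, size = 1, prob = 0.3)   # simulated stand-in for data$X
n <- length(x)
phat <- mean(x)                          # point estimate of p

# Monte Carlo: draw n_rep data sets of size n from Bernoulli(phat) and
# re-estimate p on each (a Binomial(n, phat) count divided by n).
n_rep <- 2000
reps <- rbinom(n_rep, size = n, prob = phat) / n
ci90_mc <- quantile(reps, probs = c(0.05, 0.95), names = FALSE)
ci95_mc <- quantile(reps, probs = c(0.025, 0.975), names = FALSE)

# CLT-based interval: phat +/- z * sqrt(phat * (1 - phat) / n).
se <- sqrt(phat * (1 - phat) / n)
ci95_clt <- phat + c(-1, 1) * qnorm(0.975) * se

ci90_mc
ci95_mc
ci95_clt
```

Because both Monte Carlo intervals are read off the same replicates, the 95% interval can never be narrower than the 90% one. Replacing `x` with `data$X` applies the same steps to the downloaded data.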