
Assignment 1
MAST90083 Computational Statistics and Data Mining
Due time: 5PM, Monday September 16th. You must submit your report via LMS.
1 Data Analysis
Gross domestic product is a standard measure of the size of an economy; it’s the total value of all goods and services bought and sold in a country over the course of a year. It’s not a perfect measure of prosperity, but it is a very common one, and many important questions in economics turn on what leads GDP to grow faster or slower. One common idea is that poorer economies, those with lower initial GDPs, should grow faster than richer ones. The reasoning behind this “catching up” is that poor economies can copy technologies and procedures from richer ones, but already-developed countries can only grow as technology advances. A second, separate idea is that countries can boost their growth rate by under-valuing their currency, making the goods and services they export cheaper. Our dataset “uval.csv” contains the following variables:
• Country, in a three-letter code.
• Year (in five-year increments).
• Per-capita GDP, in dollars per person per year.
• Average percentage growth rate in GDP over the next five years.
• An index of currency under-valuation. The index is 0 if the currency is neither over- nor under-valued, positive if under-valued, negative if it is over-valued.
Note that not all countries have data for all years. However, there are no missing values in the data table.
1. Linearly regress the growth rate on the under-valuation index and the log of GDP. Report the coefficients and their standard errors. Do the coefficients support the idea of catching up? Do they support the idea that under-valuing a currency boosts economic growth?
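For concreteness, a minimal fitting sketch is below. The column names (growth, underval, gdp) are assumptions; check them against names(uval) in the actual file.

uval <- read.csv("uval.csv")

# Growth regressed on the under-valuation index and log per-capita GDP
# (column names are assumed; adjust to match the CSV)
fit1 <- lm(growth ~ underval + log(gdp), data = uval)

# Coefficient estimates and their standard errors
summary(fit1)$coefficients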

2. Repeat the linear regression but add the country and the year as covariates. Use factor(year), not year, in the regression formula. (A fitting sketch follows the sub-questions below.)
(a) Report the coefficients for log GDP and undervaluation, and their standard errors.
(b) Explain why it is more appropriate to use factor(year) in the formula than just year.
(c) Plot the coefficients on year versus time.
(d) Does this expanded model support the idea of catching up? Of undervaluation boosting growth?
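A fitting sketch for the expanded model, with the same assumed column names as above; extracting the year coefficients by their name prefix is one possible approach, not the only one.

fit2 <- lm(growth ~ underval + log(gdp) + country + factor(year), data = uval)

# Coefficients and standard errors for log GDP and under-valuation, for (a)
summary(fit2)$coefficients[c("log(gdp)", "underval"), ]

# Year coefficients plotted against time, for (c); their names look like
# "factor(year)1965", so strip the prefix to recover the years
year.coefs <- coef(fit2)[grep("^factor\\(year\\)", names(coef(fit2)))]
plot(as.numeric(sub("factor\\(year\\)", "", names(year.coefs))), year.coefs,
     xlab = "Year", ylab = "Estimated year coefficient")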
3. Does adding in year and country as covariates improve the predictive ability of a linear model which includes log GDP and under-valuation?
(a) What are the R^2 and the adjusted R^2 of the two models?
(b) Use leave-one-out cross-validation to find the mean squared errors of the two models. Which one actually predicts better, and by how much? (A LOOCV sketch follows this list.)
(c) Explain why using 5-fold cross-validation would be hard here.
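One way to organize the leave-one-out computation in (b) is sketched below. The response name growth and the other column names are carried over from the assumptions above, and the sketch assumes every country and year still appears among the training rows after a single row is dropped (otherwise predict() will fail for that fold).

# Leave-one-out cross-validation for a linear-model formula
loocv.mse <- function(formula, data) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$growth[i] - predict(fit, newdata = data[i, ])  # response assumed to be "growth"
  })
  mean(errs^2)
}

loocv.mse(growth ~ underval + log(gdp), uval)
loocv.mse(growth ~ underval + log(gdp) + country + factor(year), uval)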
4. Kernel regression. Use kernel regression, as implemented in the np package, to non-parametrically regress growth on log GDP, under-valuation, country, and year (treating year as a categorical variable). Hint: read chapter four of Shalizi carefully. In particular, try setting tol to about 10^-3 and ftol to about 10^-4 in the npreg command, and allow several minutes for it to run. (A minimal call sketch follows the sub-questions below.)
(a) Give the coefficients of the kernel regression, or explain why you cannot.
(b) Plot the predicted values of the kernel regression, for each country and year, against the predicted values of the linear model.
(c) Plot the residuals of the kernel regression against its predicted values. Should these points be scattered around a flat line, if the model is right? Are they?
(d) The npreg function reports a cross-validated estimate of the mean squared error for the model it fits. What is that? Does the kernel regression predict better or worse than the linear model with the same variables?
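A minimal call sketch, again assuming the column names used above; wrapping country and year in factor() is one way to make them categorical for np, and the fval component of the bandwidth object holds the cross-validation score used to select the bandwidths.

library(np)

# tol and ftol loosen the bandwidth search, as the hint suggests;
# expect this to take several minutes
np.fit <- npreg(growth ~ log(gdp) + underval + factor(country) + factor(year),
                data = uval, tol = 1e-3, ftol = 1e-4)

np.pred <- fitted(np.fit)       # predicted values, for the plots in (b) and (c)
np.resid <- residuals(np.fit)   # residuals, for the plot in (c)
np.fit$bws$fval                 # cross-validated MSE from the bandwidth search, for (d)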
2 Kernel regression and varying smoothness
Starter code for this problem is in starter.R. That code will generate a data set to be used for this problem and will also provide a true mean function μ(x). The resulting data frame has an x column (your predictor) and a y column (your response).
1. Plot y versus x. Overlay the true mean function μ(x) using the curve function in R. What do you notice for x < 4π and x > 4π?
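A plotting sketch, assuming starter.R produces a data frame called df and a mean function called mu; substitute whatever names the starter code actually uses.

source("starter.R")

plot(y ~ x, data = df, pch = 16, cex = 0.5)
curve(mu(x), add = TRUE, col = "red", lwd = 2)   # overlay the true mean function
abline(v = 4 * pi, lty = 2)                      # boundary at x = 4*pi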

2. Using the np library in R, fit a kernel regression on each of the following datasets:
(a) Only those data points with x < 4π.
(b) Only those data points with x > 4π.
(c) All the data points.
3. For each of these regressions, what is the optimal bandwidth? How does the optimal bandwidth for the overall data set compare to the optimal bandwidth for each of the halves?
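A fitting sketch for the three data sets (df is again the assumed name of the starter-code data frame); the selected bandwidth lives in the bws component of each fit.

library(np)

fit.left  <- npreg(y ~ x, data = df[df$x < 4 * pi, ])
fit.right <- npreg(y ~ x, data = df[df$x > 4 * pi, ])
fit.all   <- npreg(y ~ x, data = df)

# Cross-validation-selected bandwidth for each fit
c(left = fit.left$bws$bw, right = fit.right$bws$bw, all = fit.all$bws$bw)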
4. For each of the three selected bandwidths, make a plot showing:
• The true mean μ(x).
• The data points.
• The kernel regression predictions, with the bandwidth specified to be the selected bandwidth.
• The 95% confidence band for the regression curve μ using resampling of residuals.
• The 95% confidence band for the regression curve μ using resampling of cases.
The result should be three plots, each tuned to one of the selected bandwidths. Give these plots clear titles to distinguish them.
How do these three plots differ? In particular, how well do the regressions trained on the left and right halves do on each half of the data set? How well does the fit using the bandwidth selected on the overall data set do on each half? (Be specific about the types of problems that occur.) What lesson might this tell us about functions of varying smoothness and kernel regression, if any? (A code sketch for one of the three plots follows.)
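A sketch of one of the three plots (the one tuned to the bandwidth selected on the full data set), reusing df, mu, and fit.all from the sketches above. Holding the bandwidth fixed via npregbw with bandwidth.compute = FALSE is one way to do it, and B is kept modest only so the example runs in reasonable time; treat this as a starting point rather than a finished solution.

library(np)

B <- 200
bw <- fit.all$bws$bw                                   # bandwidth to hold fixed
x.grid <- seq(min(df$x), max(df$x), length.out = 200)
resids <- df$y - fitted(fit.all)                       # residuals of the full fit

# Refit the kernel regression on a new data frame with the fixed bandwidth
# and evaluate it on the grid
refit <- function(d) {
  bw.obj <- npregbw(xdat = d$x, ydat = d$y, bws = bw, bandwidth.compute = FALSE)
  npreg(bws = bw.obj, txdat = d$x, tydat = d$y, exdat = x.grid)$mean
}

# Resampling residuals: keep x fixed, add resampled residuals to fitted values
resid.curves <- replicate(B, refit(data.frame(
  x = df$x, y = fitted(fit.all) + sample(resids, replace = TRUE))))

# Resampling cases: resample whole (x, y) rows with replacement
case.curves <- replicate(B, refit(df[sample(nrow(df), replace = TRUE), ]))

resid.band <- apply(resid.curves, 1, quantile, probs = c(0.025, 0.975))
case.band  <- apply(case.curves,  1, quantile, probs = c(0.025, 0.975))

plot(y ~ x, data = df, pch = 16, cex = 0.4,
     main = "Bandwidth selected on the full data set")
curve(mu(x), add = TRUE, lwd = 2)                           # true mean
lines(x.grid, refit(df), col = "blue", lwd = 2)             # kernel regression fit
matlines(x.grid, t(resid.band), col = "blue", lty = 2)      # residual-resampling band
matlines(x.grid, t(case.band), col = "darkgreen", lty = 3)  # case-resampling band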
3 Theoretical questions
1. Exercise 1.2 in Shalizi
2. Exercise 1.4 in Shalizi
3. Exercise 7.4 in ESL