STAT 206 Final: Due Wednesday, December 8, 5PM
General instructions: The final must be completed as a PDF file. Be sure to include your name in the file.
Give the commands to answer each question in its own code block, which will also produce plots that will
be automatically embedded in the output file. Each answer must be supported by written statements as
well as any code used. You are free to consult the textbook and recommended texts, lecture slides, resources
provided through the class website, books and papers in the library, or legitimate online resources, though all
use of these sources must be acknowledged in your work. (Websites which compile course materials are not
legitimate online resources.) You are not allowed to discuss the content of the exams with anyone other than
the instructors; in particular, you may not discuss the content of the exam with other students in the course.
Any questions you have must be directed to .
Office hours via Zoom will be as follows:
• James
– December 6, Monday 1:00 p.m. – 2:00 p.m.
– December 8, Wednesday 8:00 a.m. – 9:00 a.m.
• Brad
– December 7, Tuesday 12:20 p.m. – 1:20 p.m.
The final is worth 100 total points.
• 10 points for good-faith effort at every part.
• 10 points for clean, well-formatted, easily readable code.
• 80 points based on correctness of the questions below.
Part I – Rescaled Epanechnikov kernel
The rescaled Epanechnikov kernel is a symmetric density function given by
\[
f(x) =
\begin{cases}
\frac{3}{4}\,(1 - x^2) & \text{for } |x| \le 1, \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
\]
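The formula in (1) can be written as an R function and checked numerically. This is a minimal sketch, not a substitute for the calculus argument requested below; the name `epan` is an illustrative choice and does not come from the exam.

```r
# Rescaled Epanechnikov density from equation (1).
# `epan` is a hypothetical helper name used only for illustration.
epan <- function(x) ifelse(abs(x) <= 1, 0.75 * (1 - x^2), 0)

# Numerical sanity check: a density should integrate to 1 over its support.
total <- integrate(epan, lower = -1, upper = 1)$value
print(total)
```

Because `epan` is vectorized through `ifelse`, it can be passed directly to plotting helpers such as `curve`.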
1. (4 points) Check that the above formula is indeed a density function (using calculus).
2. (4 points) Produce a plot of this density function. Set the X-axis limits to be −2 and 2. Label your
axes properly and give a title to your plot.
3. (6 points) Devroye and Györfi give the following algorithm for simulation from this distribution.
Generate iid random variables U1, U2, U3 ∼ U(−1, 1). If |U3| ≥ |U2| and |U3| ≥ |U1|, deliver U2,
otherwise deliver U3. Write a program that implements this algorithm in R. Using your program,
generate 1000 values from this distribution. Display a histogram of these values.
4. (6 points) Construct kernel density estimates from your 1000 generated values using the Gaussian and
Epanechnikov kernels. How do these compare to the true density?
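The three-uniform rule quoted in question 3 can be sketched in vectorized form as follows. The helper name `repan`, the seed, and the plotting choices are assumptions made for illustration; this is one possible layout, not a model answer.

```r
# Draw n values with the three-uniform rule from question 3.
# `repan` is a hypothetical helper name, not part of the exam.
repan <- function(n) {
  u1 <- runif(n, -1, 1)
  u2 <- runif(n, -1, 1)
  u3 <- runif(n, -1, 1)
  # Deliver U2 when |U3| is at least as large as both |U1| and |U2|;
  # otherwise deliver U3.
  ifelse(abs(u3) >= abs(u2) & abs(u3) >= abs(u1), u2, u3)
}

set.seed(206)  # arbitrary seed, for reproducibility only
draws <- repan(1000)

# Histogram on the density scale with the density from equation (1) overlaid.
hist(draws, breaks = 30, freq = FALSE, xlim = c(-2, 2),
     main = "Simulated draws", xlab = "x")
curve(ifelse(abs(x) <= 1, 0.75 * (1 - x^2), 0), from = -2, to = 2, add = TRUE)

# Kernel density estimates with default bandwidths (an assumption; the
# exam does not specify a bandwidth).
lines(density(draws, kernel = "gaussian"), lty = 2)
lines(density(draws, kernel = "epanechnikov"), lty = 3)
```

The density in (1) has mean 0 and variance 1/5, so `mean(draws)` and `var(draws)` give a quick check that the sampler behaves as expected.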
Part II – Metropolis Hastings
Suppose we have observed data y1, y2, . . . , y200 sampled independently and identically distributed from the
mixture distribution
\[
\delta\, N(7,\, 0.5^2) + (1 - \delta)\, N(10,\, 0.5^2).
\]
5. (5 points) Simulate 200 realizations from the mixture distribution above with δ = 0.7.
6. (2 points) Draw a histogram of the data that also includes the true density. How close is the histogram
to the true density?
7. (5 points) Now assume δ is unknown with a Uniform(0,1) prior distribution for δ. Implement an
independence Metropolis Hastings sampler with a Uniform(0,1) proposal.
8. (5 points) Implement a random walk Metropolis Hastings sampler where the proposal δ∗ = δ^(t) + ε
with ε ∼ Uniform(−1, 1).
9. (3 points) Explain why the independence Metropolis Hastings sampler from (7) is better than the
random walk Metropolis Hastings sampler from (8).
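One way the two samplers could be organized is sketched below. It assumes the Uniform(0,1) prior stated in question 7, so the log posterior equals the log-likelihood on (0, 1) and is treated as minus infinity elsewhere. The names `log_post` and `run_mh`, the chain length, the starting value, and the seed are all illustrative assumptions.

```r
set.seed(206)  # arbitrary seed
n <- 200
delta_true <- 0.7

# Simulate from delta * N(7, 0.5^2) + (1 - delta) * N(10, 0.5^2).
z <- rbinom(n, 1, delta_true)
y <- ifelse(z == 1, rnorm(n, 7, 0.5), rnorm(n, 10, 0.5))

# Log posterior of delta up to a constant under a Uniform(0,1) prior.
log_post <- function(delta) {
  if (delta <= 0 || delta >= 1) return(-Inf)
  sum(log(delta * dnorm(y, 7, 0.5) + (1 - delta) * dnorm(y, 10, 0.5)))
}

# Metropolis-Hastings loop. Both proposals used below have densities that
# cancel in the acceptance ratio (constant for the independence proposal,
# symmetric for the random walk), so only the posterior ratio remains.
run_mh <- function(propose, n_iter = 5000, init = 0.5) {
  chain <- numeric(n_iter)
  current <- init
  for (t in seq_len(n_iter)) {
    candidate <- propose(current)
    if (log(runif(1)) < log_post(candidate) - log_post(current)) {
      current <- candidate
    }
    chain[t] <- current
  }
  chain
}

indep_chain <- run_mh(function(d) runif(1))             # independence sampler
rw_chain    <- run_mh(function(d) d + runif(1, -1, 1))  # random walk sampler
```

The proportion of accepted moves in each chain can be estimated with `mean(diff(indep_chain) != 0)` and `mean(diff(rw_chain) != 0)`, which is one input to a comparison of the two samplers.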
Part III – Metropolitan Statistical Areas
For data-collection purposes, urban areas of the United States are divided into several hundred “Metropolitan
Statistical Areas” based on patterns of residence and commuting; these cut across the boundaries of legal
cities and even states. In the last decade, the U.S. Bureau of Economic Analysis has begun to estimate “gross
metropolitan products” for these areas. More recently, it has been claimed that these gross metropolitan
products show a simple quantitative regularity, called “supra-linear power-law scaling”. If Y is the gross
metropolitan product in dollars, and N is the number of people in the city, then, the claim goes,
\[
Y \approx c N^{b}
\]
where the exponent b > 1 and the scale factor c > 0. If this model holds with an exponent b < 1, there is
said to be “sub-linear scaling”.
The data set gmp-2006.csv, which can be found on Canvas, contains the following variables for each
metropolitan statistical area, in 2006:
  1. Its name;
  2. Its per-capita gross metropolitan product (dollars per person per year);
  3. Its population (number of persons);
  4. The proportion of the city’s economy derived from each of four industries: finance, professional and
     technical services, information and communications technologies, and management services.
Some variables are missing for some cities. Since not all variables are used in all problems, deleting all rows
with incomplete data is a bad idea.
10. (3 points) A metropolitan area’s gross per capita product is P = Y/N. Show that if Y ≈ cN^b holds,
then log P ≈ β0 + β1 log N. Find equations for β0 and β1 in terms of c and b.
11. Use lm to linearly regress log per capita product, log P, on log population, log N.
    a. (2 points) Explain how you would translate your estimated coefficients into estimates of c and b.
    b. (2 points) What are the estimated coefficients? Explain whether or not your point estimates
       support the idea of supra-linear scaling.
    c. (2 points) Report the MSE of this model under 5-fold cross-validation.
12. (5 points) Fit a loess smoother to log P and log N. Choose the span parameter carefully. What is the
MSE under cross-validation?
13. a. (3 points) Under the model from (11), what are the predicted per-capita GMPs of (i) Michigan
       City-La Porte, IN, (ii) Madison, WI, and (iii) Minneapolis-St. Paul-Bloomington, MN-WI?
    b. (3 points) Under the model from (12), what are the per-capita GMPs of those three cities?
14. (6 points) In the previous problems, you reported point estimates and point predictions without any
measure of uncertainty. Using an appropriate technique, give 90% confidence intervals for β0 and β1,
and for your predictions from (13).
15. (3 points) Plot P against N, adding to the plot both the estimated power law from (11) and the
nonparametric curve from (12). Comment on the difference in shapes. Also comment on which model
seems to predict better.
16. Part of the idea of supra-linear scaling is that increasing N should lead to a more-than-proportional
increase in Y, no matter what N is.
    a. (2 points) Under the model from (11), what is the predicted change in log P for a 10% increase
       in population for cities the size of (i) Michigan City-La Porte, (ii) Madison, and
       (iii) Minneapolis-St. Paul-Bloomington?
    b. (3 points) Repeat the previous problem, but make predictions under the model from (12).
    c. (2 points) Do the non-parametric estimates support the idea of supra-linear scaling?
17. (4 points) Based on all your analyses so far, what can you conclude about the idea of supra-linear
scaling? Is it well-supported by these data, or do they undermine it, or is the situation more ambiguous?
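The log-log regression and 5-fold cross-validation workflow can be rehearsed on synthetic data before touching the real file. Everything in the sketch below is made up for illustration: the layout and column names of gmp-2006.csv are not shown here, so the populations, the values `b_true` and `c_true`, the noise level, and the seed are assumptions, and the numbers it produces say nothing about the actual data.

```r
set.seed(206)  # arbitrary seed

# Synthetic stand-in for the real data set (all values are hypothetical).
b_true <- 1.12
c_true <- 5000
N <- round(exp(runif(300, log(5e4), log(2e7))))     # made-up populations
Y <- c_true * N^b_true * exp(rnorm(300, sd = 0.2))  # made-up gross products
P <- Y / N                                          # per-capita product

# Regress log per-capita product on log population.
dat <- data.frame(logP = log(P), logN = log(N))
fit <- lm(logP ~ logN, data = dat)
beta <- coef(fit)

# If Y = c * N^b exactly, then log P = log c + (b - 1) * log N,
# so the fitted intercept and slope map back to c and b as follows.
c_hat <- exp(beta[[1]])
b_hat <- beta[[2]] + 1

# 5-fold cross-validated MSE on the log scale.
fold <- sample(rep(1:5, length.out = nrow(dat)))
cv_mse <- mean(sapply(1:5, function(k) {
  m <- lm(logP ~ logN, data = dat[fold != k, ])
  held_out <- dat[fold == k, ]
  mean((held_out$logP - predict(m, newdata = held_out))^2)
}))
```

The same fold assignment can be reused when cross-validating a `loess` fit, so that the two MSE values are computed on identical held-out sets.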