Homework 1
Due Friday, 9/28 by 5pm
Turn in to UVA Collab as a PDF
Instructions: Complete the following homework. Your answers should be turned in to Collab as a single typed PDF document. Homework submitted as anything but a PDF will not be accepted. Code should be provided in a separate R file.
1. R simulation validation of error decomposition (please turn in your R code in a separate .R file for this question)
In this exercise, you will simulate 𝑧 from the known function (and minimizer of squared error)
𝔼[𝑧 | 𝑥, 𝑦] = 𝑥 + 5𝑦 + 4 sin(𝑥) + 10 sin(𝑦) + 0.5𝑥² + 0.5𝑦²,
with 𝑧 = 𝔼[𝑧 | 𝑥, 𝑦] + 𝜖, where 𝜖 ∼ 𝑁(𝜇 = 0, sd = 8).
- Sampling 𝑥 and 𝑦 both from a uniform distribution with support [0, 10], calculate the empirical bias/variance decomposition when using a linear model for prediction. You should consider the decomposition under a variety of sample sizes.
- How does the standard deviation of the noise term affect your results?
- Now extend your model to include the nonlinear transformations that appear in the known function. Calculate the variability of your estimator. You should do this under a variety of sample sizes.
- Discuss the effect on the variability under varying standard deviations of the noise term.
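One possible way to organize the simulation for this question is sketched below. It is only an illustration under stated assumptions: the helper names (f_true, simulate_data, decompose), the 10×10 grid of test points, the sample sizes, and the number of replications are all arbitrary choices, not requirements of the assignment. The idea is to refit the model on many independent training sets and, at fixed test points, compare the average prediction with the known conditional mean (squared bias) and measure the spread of the predictions (variance).

```r
# Sketch: empirical bias/variance decomposition at fixed test points.
# All helper names and tuning choices here are hypothetical.
set.seed(1)

# Known conditional mean E[z | x, y] from the problem statement.
f_true <- function(x, y) {
  x + 5 * y + 4 * sin(x) + 10 * sin(y) + 0.5 * x^2 + 0.5 * y^2
}

# Draw n training points with x, y ~ Uniform(0, 10) and Gaussian noise.
simulate_data <- function(n, noise_sd = 8) {
  x <- runif(n, 0, 10)
  y <- runif(n, 0, 10)
  data.frame(x = x, y = y, z = f_true(x, y) + rnorm(n, 0, noise_sd))
}

# Fixed grid of test points at which the decomposition is evaluated.
test_pts <- expand.grid(x = seq(0.5, 9.5, by = 1), y = seq(0.5, 9.5, by = 1))
f_test <- f_true(test_pts$x, test_pts$y)

# Refit the model on n_rep independent training sets and decompose the error.
decompose <- function(n, noise_sd = 8, n_rep = 200, formula = z ~ x + y) {
  preds <- replicate(n_rep, {
    fit <- lm(formula, data = simulate_data(n, noise_sd))
    predict(fit, newdata = test_pts)
  })                                   # rows: test points, columns: replications
  avg_pred <- rowMeans(preds)
  c(n = n,
    bias2 = mean((avg_pred - f_test)^2),      # squared bias, averaged over the grid
    variance = mean(apply(preds, 1, var)),    # prediction variance, averaged over the grid
    noise = noise_sd^2)                       # irreducible error
}

results <- t(sapply(c(50, 200, 1000), decompose))
print(round(results, 2))
```

The same function can be reused for the later parts by changing its arguments, for example decompose(200, noise_sd = 2) for a different noise level, or decompose(200, formula = z ~ x + y + sin(x) + sin(y) + I(x^2) + I(y^2)) for a model that includes the nonlinear terms.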
2. Data exercise (this data was prepared by Sham Kakade; I have modified it to make it easier to load into R). This data set is modified from the Yelp Recruiting Competition on Kaggle (https://www.kaggle.com/c/yelp-recruiting). The files of interest are:
- sparse_matrix_stars.Rdata, which is a very sparse matrix of features. To load it, simply call load("sparse_matrix_stars.Rdata"). The object is stored as a sparse matrix, but you will not have to rework your code to handle a sparse matrix rather than an ordinary matrix.
- Star_features.txt: this file contains the features that are represented by the columns of the sparse_matrix_stars file. The features are unigrams, bigrams, and trigrams from the text of the reviews.
- Star_labels: this file contains the ratings for each review in the dataset.
- Upvote_data.csv is a file that contains the feature matrix for the upvote data.
- Upvote_features is a file that describes the features corresponding to the upvote_data file. These features are not restricted to the text of the review.
- Upvote_labels contains the number of upvotes for each review.
b. Use the Lasso (the function you want to use is glmnet, with alpha=1) to build two functions, one that predicts the star rating given the review, and one that predicts the number of upvotes. For this homework, you are welcome to use the original data (somebody wrote a nice script to convert the data from JSON format to CSV files) to make use of more features, but this is not required.
- You should look over a wide range of lambda to pick a good regularization parameter.
- You should split your data into a training and a test set so that you can examine out-of-sample performance.
c. Discuss the models. Do they seem reasonable? Is this the only subset of parameters that could possibly predict well?
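The workflow in part b (fit the Lasso over a wide range of lambda, then check out-of-sample performance on held-out data) might look roughly like the sketch below. It runs on synthetic data rather than the homework files: the dimensions, the true coefficients, the 80/20 split, and the lambda grid are all illustrative assumptions, and the only thing carried over from the assignment is the call to glmnet with alpha = 1. It assumes the glmnet and Matrix packages are installed.

```r
library(Matrix)   # sparse matrix support
library(glmnet)   # Lasso via glmnet(..., alpha = 1)
set.seed(1)

# Synthetic stand-in for the homework data (assumption: a sparse numeric
# feature matrix and a numeric label vector; sizes are arbitrary).
n <- 500; p <- 200
X <- rsparsematrix(n, p, density = 0.05)
beta <- c(rep(2, 5), rep(0, p - 5))          # only 5 features matter
y <- as.vector(X %*% beta) + rnorm(n)

# Hold out 20% of the rows as a test set.
test_idx <- sample(n, size = 0.2 * n)
X_train <- X[-test_idx, ]; y_train <- y[-test_idx]
X_test  <- X[test_idx, ];  y_test  <- y[test_idx]

# Lasso path over a wide, explicit lambda grid (alpha = 1 selects the Lasso).
lambda_grid <- 10^seq(1, -4, length.out = 60)
fit <- glmnet(X_train, y_train, alpha = 1, lambda = lambda_grid)

# Out-of-sample mean squared error for every lambda on the fitted path.
pred <- predict(fit, newx = X_test)          # one column per lambda
test_mse <- colMeans((pred - y_test)^2)
best <- which.min(test_mse)
cat("best lambda:", fit$lambda[best], " test MSE:", test_mse[best], "\n")
cat("nonzero coefficients:", fit$df[best], "\n")
```

Choosing lambda directly on the test set, as done here for brevity, makes the reported test error optimistic. A cleaner variant would pick lambda with cv.glmnet on the training rows only and use the test set once at the end.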
3. In this problem you will write a simulation that will compare LDA and logistic regression. You will simulate from two normal distributions. To do this you need to:
- Randomly sample a 30×30 covariance matrix. (This is actually not as easy as it may seem. Think about the requirements of a covariance matrix. How can you accomplish this?)
- Sample two multivariate normal populations with two different means. You should pick means that are separated, though you don’t want to make the problem so easy for the algorithms that you can’t compare their risks (so not too separated). You should try a range of sample sizes.
- Compare and contrast LDA and logistic regression. Under what circumstances does each algorithm perform better than the other?
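The steps above can be sketched as follows. This is one possible setup, not the intended solution: building the covariance as A Aᵀ plus a small diagonal term is just one of several ways to obtain a symmetric positive definite matrix, and the mean separation delta, the sample sizes, and the helper names (draw, compare) are arbitrary illustrative choices. The sketch uses lda() and mvrnorm() from the MASS package and glm() for logistic regression.

```r
library(MASS)   # provides lda() and mvrnorm()
set.seed(1)

p <- 30

# One way to get a valid covariance matrix: tcrossprod(A) = A %*% t(A) is
# symmetric positive semidefinite for any A, and adding a small multiple of
# the identity makes it strictly positive definite.
A <- matrix(rnorm(p * p), p, p)
Sigma <- tcrossprod(A) / p + diag(0.1, p)

# Two class means; the separation delta is an arbitrary illustrative choice.
delta <- 0.3
mu0 <- rep(0, p)
mu1 <- rep(delta, p)

# Draw n points per class and return them as one data frame.
draw <- function(n) {
  X <- rbind(mvrnorm(n, mu0, Sigma), mvrnorm(n, mu1, Sigma))
  data.frame(class = factor(rep(c(0, 1), each = n)), X)
}

# Fit both classifiers on a training sample and estimate test error rates.
compare <- function(n_train, n_test = 2000) {
  train <- draw(n_train)
  test <- draw(n_test)
  truth <- as.character(test$class)

  lda_fit <- lda(class ~ ., data = train)
  lda_pred <- as.character(predict(lda_fit, test)$class)

  # Small samples can be perfectly separable, which makes glm() warn.
  glm_fit <- suppressWarnings(glm(class ~ ., data = train, family = binomial))
  glm_prob <- suppressWarnings(predict(glm_fit, test, type = "response"))
  glm_pred <- ifelse(glm_prob > 0.5, "1", "0")

  c(n_train = n_train,
    lda_error = mean(lda_pred != truth),
    logistic_error = mean(glm_pred != truth))
}

results <- t(sapply(c(50, 100, 500), compare))
print(results)
```

Here n_train is the number of training points per class. Varying delta, the sample sizes, or the data-generating distribution (for example, unequal covariances or non-normal data) gives the kind of contrast the last bullet asks about.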