Columbia SPS 5335 Summer 2020 HW 3
This homework is intended to give you practice applying and comparing some of the more sophisticated machine learning algorithms that we have studied, using slightly more realistic data. Since you will be using R library implementations of the algorithms, the code needed to answer these problems should be quite succinct.
• Reminder: Please do this assignment individually. While you are encouraged to discuss general questions that arise with your colleagues (and with the course staff), or get help with debugging specific problems, you must write any code on your own. DO NOT DIRECTLY COPY CODE WRITTEN BY OTHER STUDENTS OR COPY/PASTE CODE YOU FIND ON THE INTERNET WITHOUT CITING IT. If you consult other resources (which is OK), you should still try to rewrite the code yourself, but regardless, please just cite any sources that you use.
library(dslabs)
library(AmesHousing)
A: (40 pts) Regularized Regression
We will explore regularized regression using the Ames data, which records the characteristics and sale prices of 2,930 properties in Ames, Iowa (De Cock, 2011). The data has been cleaned and processed a bit and is available by calling make_ames() from the AmesHousing R package. This is a realistic, complicated data set that gives you an opportunity to use some of the more sophisticated methods that we have studied to predict sale prices from the other variables.
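As a starting point, loading the data might look like the sketch below. It assumes the AmesHousing package is installed; the object name `ames` is an arbitrary choice for this sketch, and `Sale_Price` is the sale price column in the processed data.

```r
# Load the processed Ames data (assumes the AmesHousing package is installed)
library(AmesHousing)

ames <- make_ames()        # `ames` is an arbitrary name chosen for this sketch

dim(ames)                  # number of properties (rows) and variables (columns)
summary(ames$Sale_Price)   # Sale_Price is the outcome to predict
```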
A1: (2 pts) Split the data 80/20 into a training set and test set that you can use to evaluate your models.
A2: (8 pts) Exploratory data analysis: Make one or two figures that clearly summarize the most important characteristics and relationships among the variables. Based on what you observe, you may choose to do some manual selection of the variables to use in the rest of your modeling and work with a smaller set of variables.
A3: (5 pts) As a starting point, fit a regular multiple regression model to predict house prices, using forward variable selection.
A4: (5 pts) Fit a ridge regression model to predict house prices.
A5: (5 pts) Fit a lasso model to predict house prices.
A6: (5 pts) What are the most important variables in each of your models of house prices? Are they the same or different across the different models?
A7: (10 pts) If you were implementing a system to predict house prices, which of the methods explored do you believe would be the best? Briefly justify your choice.
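The 80/20 split asked for in A1 can be done in several ways. The sketch below shows one base-R approach on a toy data frame; the names `dat`, `train_idx`, `train`, and `test`, as well as the seed, are illustrative assumptions rather than a required solution.

```r
# Toy data frame standing in for the real data (illustrative only)
set.seed(1)                              # arbitrary seed, for reproducibility
dat <- data.frame(x = rnorm(100), y = rnorm(100))

# Draw 80% of the row indices at random for the training set
n <- nrow(dat)
train_idx <- sample(n, size = floor(0.8 * n))

train <- dat[train_idx, ]                # rows selected for training
test  <- dat[-train_idx, ]               # all remaining rows form the test set

c(train = nrow(train), test = nrow(test))
```

Setting a seed before sampling makes the split reproducible, which helps when comparing several models on the same held-out data.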
B: (50 pts) Classification
We will explore building some models to classify biopsy image data as either benign or malignant, using the brca dataset from the dslabs package. Fine-needle aspirate (FNA) biopsies were taken from 569 patients (212 with cancer, and 357 with benign fibrocystic breast masses). The biopsies were spread onto slides, processed, and stained to produce images of cell nuclei. The data comes from the paper “Computerized Breast Cytologic Diagnosis” by Wolberg et al. (1995) and has been widely used as a test case for different machine learning methods.
Ordinarily, a pathologist would look at the images from the slides to judge whether the biopsy is likely to be from a tumor or a benign mass. Here, a computer program was first used to automatically extract 30 different relatively simple geometric features from the images, such as the average radius of nuclei, area, symmetry, etc. We will attempt to use these features to classify the samples as cancer or benign, using machine learning methods.
The data is found in the brca object from the dslabs package. This object is just an R list: the features are in the brca$x matrix, and the labels (benign or malignant) that we will be predicting are stored in brca$y, a 1-d vector.
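A quick way to confirm this structure is sketched below, assuming the dslabs package is installed. It uses only base R inspection functions.

```r
# Inspect the brca object (assumes the dslabs package is installed)
library(dslabs)

str(brca, max.level = 1)   # top-level structure of the list
dim(brca$x)                # samples (rows) by features (columns)
table(brca$y)              # number of samples in each class
```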
B1. (2 pts) Split the data into a training set (80%) and test set (20%), which we will hold out for final evaluation of our models. Also, generate a scaled version of the test / training data.
B2. (8 pts) Exploratory data analysis: Generate one or two plots that summarize the characteristics or relationships that you believe will be most important to keep in mind as you construct your models. Please make sure that your plots are clear and legible (you may need to explore a variety of different plots, but then decide on one or two that clearly show the characteristics that are important). Briefly summarize what you observe.
B3. (30 pts) Train a classifier for the brca data using each of the following algorithms. If an algorithm has hyperparameters, select them using cross-validation. You can (and should) use the built-in R packages that implement these algorithms. For each algorithm, generate the confusion matrix, and report your classification performance (sensitivity, specificity, and F1 score) on the held-out test data.
a. (5 pts) KNN
b. (5 pts) Logistic Regression
c. (5 pts) Decision Trees (for the decision tree, please also show a picture of the resulting tree!)
d. (5 pts) Random Forests
e. (5 pts) SVM
f. (5 pts) Neural nets (I suggest using the R neuralnet package; you can choose which variables to use based on your experience with the other models.)
B4. (10 pts) Which of these algorithms do you believe would be best for this problem? Briefly justify your choice.
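The performance measures requested in B3 all follow from the four cells of the confusion matrix. The sketch below computes them in base R on made-up label vectors (not taken from brca), and assumes that malignant ("M") is treated as the positive class; the variable names are illustrative.

```r
# Toy true and predicted labels (made up for illustration; not from brca)
truth <- factor(c("M", "M", "M", "M", "B", "B", "B", "B", "B", "B"),
                levels = c("B", "M"))
pred  <- factor(c("M", "M", "M", "B", "B", "B", "B", "B", "M", "M"),
                levels = c("B", "M"))

# Confusion matrix: rows are predictions, columns are the true labels
cm <- table(Predicted = pred, Actual = truth)
cm

# Cell counts, assuming "M" (malignant) is the positive class
TP <- cm["M", "M"]   # malignant, predicted malignant
FP <- cm["M", "B"]   # benign, predicted malignant
FN <- cm["B", "M"]   # malignant, predicted benign
TN <- cm["B", "B"]   # benign, predicted benign

sensitivity <- TP / (TP + FN)   # fraction of malignant cases detected
specificity <- TN / (TN + FP)   # fraction of benign cases correctly cleared
precision   <- TP / (TP + FP)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(sensitivity = sensitivity, specificity = specificity, f1 = f1), 3)
```

Packages such as caret report the same quantities through `confusionMatrix()`; if you use one, check which class it treats as positive, since that changes which number is the sensitivity and which is the specificity.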
C: (30 pts) Dimensionality Reduction & Clustering
This question explores applying PCA and clustering to a gene expression dataset, the tissue_gene_expression dataset from dslabs, which contains measurements of 500 different genes from 189 samples of various tissues.
C1: (5 pts) Compute PCA on the data.
C2: (5 pts) Make a scree plot (elbow plot). Describe what you observe.
C3: (5 pts) Plot the PCA biplot for the samples. Color the samples by the tissue labels.
C4: (5 pts) Describe what you observe in the PCA biplot.
C5: (5 pts) Cluster the samples using hierarchical clustering, using both correlation-based distance and Euclidean distance. Which is more appropriate?
C6: (5 pts) Plot the data matrix as a heatmap, clustering both the rows and columns and displaying the dendrograms (you can use the built-in R heatmap function; the ComplexHeatmap package also provides some very nice tools).
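For the hierarchical clustering question, the two distance measures differ only in how the distance object is built. The sketch below illustrates this in base R on a small simulated matrix (not the gene expression data). One common definition of correlation-based distance, assumed here, is one minus the Pearson correlation between samples; the average linkage and the variable names are also illustrative choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility

# Simulated matrix: 6 samples (rows) by 20 features (columns); illustrative only
x <- matrix(rnorm(6 * 20), nrow = 6,
            dimnames = list(paste0("s", 1:6), NULL))

# Euclidean distance between samples (dist() works on rows)
d_euc <- dist(x)

# Correlation-based distance: cor() works on columns, so transpose first
d_cor <- as.dist(1 - cor(t(x)))

# Hierarchical clustering with each distance (average linkage assumed)
hc_euc <- hclust(d_euc, method = "average")
hc_cor <- hclust(d_cor, method = "average")

# Cut each tree into two groups to compare the resulting assignments
cutree(hc_euc, k = 2)
cutree(hc_cor, k = 2)
```

Calling `plot()` on either `hclust` object draws the corresponding dendrogram.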