
Homework 6: Multiclass, Trees, and Gradient Boosting
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LaTeX, LyX, or MathJax via IPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better.
1 Reformulations of Multiclass Hinge Loss
1.1 Multiclass setting review
Consider the multiclass output space Y = {1, . . . , k}. Suppose we have a base hypothesis space H = {h : X × Y → R} from which we select a compatibility score function. Then our final multiclass hypothesis space is F = {f(x) = arg max_{y∈Y} h(x, y) | h ∈ H}. Since functions in F map into Y, our action space A and output space Y are the same. Nevertheless, we will write our class-sensitive loss function as ∆ : Y × A → R, even though Y = A. We do this to indicate that the true class goes in the first slot of the function, while the prediction (i.e. the action) goes in the second slot. This is important because we do not assume that ∆(y, y′) = ∆(y′, y). It would not be unusual to have this asymmetry in practice. For example, false alarms may be much less costly than no alarm when indeed something is going wrong.
In the spirit of empirical risk minimization, we would like to find f ∈ F minimizing the empirical cost-sensitive loss:

min_{f∈F} Σ_{i=1}^n ∆(yi, f(xi)),
possibly with some regularization. But this is clearly intractable, since we already know binary classification is intractable and that’s a special case of this formulation. In lecture we proposed an alternative, tractable objective function: the multiclass SVM based on the convex multiclass hinge loss.
1.2 Two versions of multiclass hinge loss (or generalized hinge loss)
In lecture, we defined the margin of the compatibility score function h on the ith example (xi,yi) for class y as
mi,y(h) = h(xi, yi) − h(xi, y).
We also gave a formulation of a multiclass SVM objective function, where the loss on an individual example (xi, yi) was
l1(h, (xi, yi)) = max_{y∈Y−{yi}} (max[0, ∆(yi, y) − mi,y(h)]).
There’s an alternative formulation, called the generalized hinge loss in SSBD Section 17.2. There they define
l2(h, (xi, yi)) = max_{y∈Y} [∆(yi, y) + h(xi, y) − h(xi, yi)].

1. Show that if ∆(y, y) = 0 for all y ∈ Y, then l2(h, (xi, yi)) = l1(h, (xi, yi)). [Hint: Note that max_{y∈Y} φ(y) = max{φ(yi), max_{y∈Y−{yi}} φ(y)}.]

2. In the context of the generalized hinge loss, we’ve said that ∆(yi, y) is like the “target margin” between the score for the true class yi and the score for class y. Suppose that for our compatibility function h, all target margins are reached or exceeded on xi. That is,

mi,y(h) = h(xi, yi) − h(xi, y) ≥ ∆(yi, y), for all y ∈ Y − {yi}.

Assume that ∆(yi, y) > 0 for all y ≠ yi and ∆(yi, y) = 0 for y = yi.
(a) Show that under the conditions above, l1(h, (xi, yi)) = l2(h, (xi, yi)) = 0.
(b) Show that under the conditions above, we make the correct prediction on xi. That is,
f(xi) = arg max_{y∈Y} h(xi, y) = yi.
2 SGD for Multiclass Linear SVM
Suppose our output space and our action space are given as follows: Y = A = {1, . . . , k}. We are given a non-negative class-sensitive loss function ∆ : Y × A → [0, ∞) and a class-sensitive feature mapping Ψ : X × Y → R^d. Our prediction function f : X → Y is given by
fw(x) = arg max_{y∈Y} ⟨w, Ψ(x, y)⟩.
For training data (x1, y1), . . . , (xn, yn) ∈ X ×Y, let J(w) be the l2-regularized empirical risk function for the multiclass hinge loss. We can write this as
J(w) = λ∥w∥² + (1/n) Σ_{i=1}^n max_{y∈Y} [∆(yi, y) + ⟨w, Ψ(xi, y) − Ψ(xi, yi)⟩],

for some λ > 0.
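To make the objective concrete, here is a minimal sketch (not part of the assignment) that simply evaluates J(w) for a supplied feature map and loss. The names multiclass_hinge_objective, Psi, and Delta are illustrative placeholders and do not come from the skeleton code.

```python
import numpy as np

def multiclass_hinge_objective(w, X, y, Psi, Delta, lam, n_classes):
    """Evaluate J(w) = lam*||w||^2 + (1/n) * sum_i max_y [Delta(y_i, y) + <w, Psi(x_i, y) - Psi(x_i, y_i)>].

    Psi(x, y) is assumed to return a numpy vector of the same dimension as w,
    and Delta(y_true, y) a non-negative number.
    """
    n = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        total += max(
            Delta(yi, yy) + w @ (Psi(xi, yy) - Psi(xi, yi))
            for yy in range(n_classes)
        )
    return lam * (w @ w) + total / n
```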
1. [Optional] Show that J(w) is a convex function of w. You may use any of the rules about convex functions described in our notes on Convex Optimization, in previous assignments, or in the Boyd and Vandenberghe book, though you should cite the general facts you are using. [Hint: If f1, . . . , fm : Rn → R are convex, then their pointwise maximum f(x) = max {f1(x), . . . , fm(x)} is also convex.]
2. Since J(w) is convex, it has a subgradient at every point. Give an expression for a subgradient of J(w). You may use any standard results about subgradients, including the result from an earlier homework about subgradients of the pointwise maxima of functions. (Hint: It may be helpful to refer to ŷi = arg max_{y∈Y} [∆(yi, y) + ⟨w, Ψ(xi, y) − Ψ(xi, yi)⟩].)
3. Give an expression for the stochastic subgradient based on the point (xi,yi).
4. Give an expression for a minibatch subgradient, based on the points (xi, yi), . . . , (xi+m−1, yi+m−1).
3 [Optional] Hinge Loss is a Special Case of Generalized Hinge Loss
Let Y = {−1, 1}. Let ∆(y, ŷ) = 1(y ≠ ŷ). If g(x) is the score function in our binary classification setting, then define our compatibility function as

h(x, 1) = g(x)/2 and h(x, −1) = −g(x)/2.
Show that for this choice of h, the multiclass hinge loss reduces to hinge loss:
l(h, (x, y)) = max_{y′∈Y} [∆(y, y′) + h(x, y′) − h(x, y)] = max{0, 1 − yg(x)}.
4 Multiclass Classification – Implementation
In this problem we will work on a simple three-class classification example, similar to the one given in lecture. The data is generated and plotted for you in the skeleton code.
4.1 One-vs-All (also known as One-vs-Rest)
In this problem we will implement one-vs-all multiclass classification. Our approach will assume we have a binary base classifier that returns a score, and we will predict the class that has the highest score.
1. Complete the class OneVsAllClassifier in the skeleton code. Following the OneVsAllClassifier code is a cell that extracts the results of the fit and plots the decision region. Include these results in your submission.
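As a point of reference, below is a minimal, hedged sketch of the one-vs-all pattern. It does not follow the skeleton’s exact interface: the class name OneVsAllSketch and the choice of sklearn’s LinearSVC as the base binary scorer are assumptions made only for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumed base binary classifier with a decision_function


class OneVsAllSketch:
    """Illustrative one-vs-rest wrapper: one binary scorer per class, predict the argmax score."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.estimators = [LinearSVC() for _ in range(n_classes)]

    def fit(self, X, y):
        # Train classifier c to separate class c (label 1) from all other classes (label 0).
        for c, est in enumerate(self.estimators):
            est.fit(X, (y == c).astype(int))
        return self

    def decision_function(self, X):
        # (n_samples, n_classes) matrix of scores, one column per binary classifier.
        return np.column_stack([est.decision_function(X) for est in self.estimators])

    def predict(self, X):
        return np.argmax(self.decision_function(X), axis=1)
```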
4.2 Multiclass SVM
In this question, we will implement stochastic subgradient descent for the linear multiclass SVM, as described in lecture and in this problem set. We will use the class-sensitive feature mapping approach with the “multivector construction”, as described in our multiclass classification lecture and in SSBD Section 17.2.1.
1. Complete the skeleton code for multiclass SVM. Following the multiclass SVM implementation, we have included another block of test code. Make sure to include the results from these tests in your assignment, along with your code.
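For reference, a minimal sketch of the multivector construction is given below. The function name psi_multivector is hypothetical, and the skeleton code may represent Ψ differently (e.g. as a matrix of per-class weight blocks rather than a flattened vector).

```python
import numpy as np

def psi_multivector(x, y, n_classes):
    """Multivector construction: place the feature vector x in the block for class y,
    zeros elsewhere, so that <w, Psi(x, y)> = <w_y, x> where w_y is the y-th block of w."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros(n_classes * x.size)
    psi[y * x.size:(y + 1) * x.size] = x
    return psi

# Example with d = 2 features and k = 3 classes:
# psi_multivector([1.0, -2.0], 1, 3) -> array([0., 0., 1., -2., 0., 0.])
```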
5 [Optional] Audio Classification
In this problem, we will work on the urban sound dataset URBANSOUND8K from the Center for Urban Science and Progress (CUSP) at NYU. (You should download the data from that link.) We will first extract features from raw audio data using the LibROSA package, and then we will train multiclass classifiers to classify the sounds into 10 sound classes. The URBANSOUND8K dataset contains 8732 labeled sound excerpts, broken into 10 folds. Use folds 1 and 2 for training, and folds 3 and 4 for validation.
1. In LibROSA, there are many functions for visualizing audio waves and spectra, such as display.waveplot() and display.specshow(). Load a random audio file from each class as a floating point time series with librosa.load(), and plot their waves and linear-frequency power spectrograms. If you are interested, you can also play the audio in the notebook with the functions display() and Audio() in IPython.display.

2. Mel-frequency cepstral coefficients (MFCC) are a commonly used feature for sound processing. We will use MFCC and its first and second differences (like discrete derivatives) as our features for classification. First, use the function feature.mfcc() from LibROSA to extract MFCC features from each audio sample. (The first MFCC coefficient is typically discarded in sound analysis, but you do not need to discard it; you can test whether this helps in the optional problem below.) Next, use the function feature.delta() to calculate the first and second differences of the MFCC. Finally, combine these features and normalize each feature to zero mean and unit variance. (A sketch of one possible extraction pipeline appears after this list.)

3. Train a linear multiclass SVM on your training set. Evaluate your results on the validation set in terms of 0/1 error and generate a confusion table. Compare the results to a one-vs-all classifier using a binary linear SVM as the base classifier. For each model, you may use your code from the previous problem, or you may use another implementation (e.g. from sklearn).

4. [More Optional] Compare results to any other multiclass classification methods of your choice.

5. [More Optional] Try different feature sets and see how they affect performance.
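A minimal sketch of one possible feature-extraction pipeline is shown below. It summarizes each clip by the per-coefficient means over frames, which is only one reasonable choice; the function name mfcc_features and the value n_mfcc=25 are illustrative, not requirements of the assignment.

```python
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=25):
    """Load one clip and return a fixed-length feature vector:
    per-coefficient means of the MFCCs and of their first and second differences."""
    y, sr = librosa.load(path)                        # float time series and sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)                  # first difference
    d2 = librosa.feature.delta(mfcc, order=2)         # second difference
    feats = np.concatenate([mfcc, d1, d2], axis=0)    # shape (3 * n_mfcc, n_frames)
    return feats.mean(axis=1)                         # one summary vector per clip

# After stacking the feature vectors for all training clips into a matrix, standardize
# each column to zero mean and unit variance (e.g. with sklearn.preprocessing.StandardScaler),
# fitting the scaler on the training folds only.
```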
6 [Optional] Decision Tree Implementation
In this problem we’ll implement decision trees for both classification and regression. The strategy will be to implement a generic class, called Decision_Tree, which we’ll supply with the loss function we want to use to make node splitting decisions, as well as the estimator we’ll use to come up with the prediction associated with each leaf node. For classification, this prediction could be a vector of probabilities, but for simplicity we’ll just consider hard classifications here. We’ll work with the classification and regression data sets from previous assignments.
1. [Optional] Complete the class Decision_Tree, given in the skeleton code. The intended implementation is as follows: Each object of type Decision_Tree represents a single node of the tree. The depth of that node is represented by the variable self.depth, with the root node having depth 0. The main job of the fit function is to decide, given the data provided, how to split the node or whether it should remain a leaf node. If the node will split, then the splitting feature and splitting value are recorded, and the left and right subtrees are fit on the relevant portions of the data. Thus tree-building is a recursive procedure. We should have as many Decision_Tree objects as there are nodes in the tree. We will not implement pruning here. Some additional details are given in the skeleton code.
2. [Optional] Complete either the compute_entropy or compute_gini functions. Run the code provided that builds trees for the two-dimensional classification data. Include the results. For debugging, you may want to compare results with sklearn’s decision tree. For visualization, you’ll need to install graphviz.
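For reference, minimal versions of these impurity computations might look like the following sketch; the exact signatures in the skeleton code may differ (for instance, they may take class counts or probabilities rather than a label array).

```python
import numpy as np

def compute_entropy(labels):
    """Entropy (in nats) of the empirical class distribution at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def compute_gini(labels):
    """Gini impurity of the empirical class distribution at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```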
3. [Optional] Complete the function mean_absolute_deviation_around_median (MAE). Use the code provided to fit the Regression_Tree to the krr dataset using both the MAE loss and median predictions. Include the plots for the 6 fits.
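A minimal sketch of this loss is below, assuming it receives a one-dimensional array of target values; the skeleton’s signature may differ.

```python
import numpy as np

def mean_absolute_deviation_around_median(y):
    """Mean absolute deviation of y around its median; this is the splitting criterion
    that pairs naturally with median predictions at the leaves."""
    y = np.asarray(y, dtype=float)
    return np.mean(np.abs(y - np.median(y)))
```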
7 Gradient Boosting Machines
Recall the general gradient boosting algorithm,¹ for a given loss function l and a hypothesis space F of regression functions (i.e. functions mapping from the input space to R):
1. Initialize f0(x) = 0.

2. For m = 1 to M:

(a) Compute:

gm = (∂/∂f(xj) Σ_{i=1}^n l(yi, f(xi)) |_{f(xi)=fm−1(xi), i=1,…,n})_{j=1}^n.

(b) Fit regression model to −gm:

hm = arg min_{h∈F} Σ_{i=1}^n ((−gm)_i − h(xi))².

(c) Choose fixed step size νm = ν ∈ (0, 1], or take

νm = arg min_{ν>0} Σ_{i=1}^n l(yi, fm−1(xi) + ν hm(xi)).

(d) Take the step:

fm(x) = fm−1(x) + νm hm(x).

3. Return fM.
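To make the recipe concrete, here is a minimal sketch of the loop for square loss with a fixed step size ν, using sklearn’s DecisionTreeRegressor as the base regression model. The function names and hyperparameter values are illustrative only, not the assignment’s required interface.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_square_loss(X, y, n_rounds=100, step=0.1, max_depth=3):
    """Gradient boosting with square loss l(yhat, y) = (1/2)(yhat - y)^2 and fixed step size.
    For this loss, the negative gradient -g_m is exactly the residual y - f_{m-1}(x)."""
    y = np.asarray(y, dtype=float)
    f_pred = np.zeros_like(y)                       # f_0 = 0
    trees = []
    for _ in range(n_rounds):
        residual = y - f_pred                       # -g_m
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)  # h_m
        trees.append(tree)
        f_pred = f_pred + step * tree.predict(X)    # f_m = f_{m-1} + nu * h_m
    return trees

def predict_gbm(trees, X, step=0.1):
    """Evaluate f_M(x) = sum_m nu * h_m(x); 'step' must match the value used when fitting."""
    return step * sum(tree.predict(X) for tree in trees)
```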
In this problem we’ll derive two special cases of the general gradient boosting framework: l2-Boosting and BinomialBoost.
1. Consider the regression framework, where Y = R. Suppose our loss function is given by

l(ŷ, y) = (1/2)(ŷ − y)²,
and at the beginning of the m’th round of gradient boosting, we have the function fm−1(x).
Show that the hm chosen as the next basis function is given by

hm = arg min_{h∈F} Σ_{i=1}^n [(yi − fm−1(xi)) − h(xi)]².

In other words, at each stage we find the base prediction function hm ∈ F that is the best fit to the residuals from the previous stage. [Hint: Once you understand what’s going on, this is a pretty easy problem.]

2. Now let’s consider the classification framework, where Y = {−1, 1}. In lecture, we noted that AdaBoost corresponds to forward stagewise additive modeling with the exponential loss, and that the exponential loss is not very robust to outliers (i.e. outliers can have a large effect on the final prediction function). Instead, let’s consider the logistic loss

l(m) = ln(1 + e^{−m}),

where m = yf(x) is the margin. Similar to what we did in the l2-Boosting question, write an expression for hm as an argmin over F.

¹Besides the lecture slides, you can find an accessible discussion of this approach in http://www.saedsayad.com/docs/gbm2.pdf, in one of the original references http://statweb.stanford.edu/~jhf/ftp/trebst.pdf, and in this review paper http://web.stanford.edu/~hastie/Papers/buehlmann.pdf.
8 Gradient Boosting Implementation
This method goes by many names, including gradient boosting machines (GBM), generalized boosting models (GBM), AnyBoost, and gradient boosted regression trees (GBRT), among others. Although one of the nice aspects of gradient boosting is that it can be applied to any problem with a subdifferentiable loss function, here we’ll keep things simple and consider the standard regression setting with square loss.
1. Complete the gradient_boosting class. As the base regression algorithm, you may use sklearn’s regression tree. You should use the square loss for the tree splitting rule and the mean function for the leaf prediction rule. Run the code provided to build gradient boosting models on the classification and regression data sets, and include the plots generated. Note that we are using square loss to fit the classification data, as well as the regression data.
2. [Optional] Repeat the previous runs on the classification data set, but use a different classification loss, such as logistic loss or hinge loss. Include the new code and plots of your results. Note that you should still use the same regression tree settings for the base regression algorithm.