
1. Consider a linear binary classifier that operates on two features, and assume we have the four training samples
(x1 = [50,1]T,y1 = −1), (x2 = [20,5]T,y2 = −1), (x3 = [80,3]T,y3 = +1), (x4 = [60,5]T,y4 = +1).
For the following, “sketch” means that you are not asked for exact results but reasonable approximations and trends without doing any calculations.
(a) (2 marks) Sketch the training data in a 2D diagram with different markers for the two label values y = +1 and y = −1.
(b) (2 marks) Sketch the decision boundary for a linear classifier given the training data.
(c) (4 marks) Sketch the decision boundary and margins of the linear hard margin support vector machine (SVM) classifier for this training data. Identify the support vectors in your sketch.
Now we pre-process the training data through scaling and shifting, i.e., we generate
x′i = s ⊙ xi + o,  i = 1, 2, 3, 4,
where ⊙ denotes element-wise multiplication, and s and o are chosen such that the average of each feature is 0 and the variance of each feature is 1. The pre-processed training data is
(x′1 = [−0.1, −1.3]T, y1 = −1), (x′2 = [−1.3, 0.8]T, y2 = −1), (x′3 = [1.1, −0.3]T, y3 = +1), (x′4 = [0.3, 0.8]T, y4 = +1).
(d) (2 marks) Sketch the pre-processed training data in a 2D diagram with different markers for the two label values y = +1 and y = −1.
(e) (2 marks) Sketch the decision boundary and margins of the linear hard margin SVM classifier for the pre-processed training data.
(f) (2 marks) Compare the SVM classifiers for the original data (problem (c) above) and the pre-processed data (problem (e) above). Would you recommend applying pre-processing before classification? Why or why not?
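The following sketch (not part of the exam) can be used to sanity-check the trends asked for above. It assumes the standardization uses the per-feature sample standard deviation (ddof = 1), which reproduces the pre-processed values given in the problem, and it approximates the hard-margin SVM with scikit-learn's SVC using a very large C; numpy and scikit-learn are assumed to be available.

```python
# Sanity-check sketch for Question 1 (not part of the exam).
# Assumptions: standardization uses the sample standard deviation (ddof=1),
# and a linear SVC with a very large C stands in for the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[50.0, 1.0], [20.0, 5.0], [80.0, 3.0], [60.0, 5.0]])
y = np.array([-1, -1, +1, +1])

# Standardize each feature: x' = s * x + o with s = 1/std and o = -mean/std.
mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
X_std = (X - mean) / std
print("standardized data:\n", np.round(X_std, 1))  # matches the values above

for name, data in [("original", X), ("standardized", X_std)]:
    clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
    clf.fit(data, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"{name}: w = {np.round(w, 3)}, b = {b:.3f}, "
          f"support vector indices = {clf.support_.tolist()}")
```

Comparing the reported support vectors and weight vectors for the original and standardized data gives a numerical counterpart to the comparison requested in (f).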

2. Consider the optimization of weights in a neural network via gradient descent. We use the notation as in the textbook, with w^(l)_{i,j} denoting an individual weight between two nodes i and j at layer l, and w containing all weights of the network. To update the weights, the partial derivatives ∂Ein(w)/∂w^(l)_{i,j} of the in-sample error Ein(w) are required.
Now consider the application of the augmented in-sample error
Eaug(w) = Ein(w) + (λ/N) Σ_{i,j,l} (w^(l)_{i,j})² / (w0² + (w^(l)_{i,j})²),
where a so-called penalty term is added, N is the number of training samples, and the parameters satisfy λ ≥ 0 and w0 ≥ 0.
(a) (2 marks) What is usually the purpose of adding a penalty term to Ein(w)?
(b) (2 marks) Through what process would suitable values for λ and w0 be selected?
(c) (2 marks) What is the benefit of using λ/N as above versus just using λ?
(d) (2 marks) Compute the partial derivative of the penalty term with respect to w^(l)_{i,j}.
(e) (3 marks) Argue that the penalty term causes weights with |w^(l)_{i,j}| ≪ w0 to shrink faster than weights for which |w^(l)_{i,j}| ≫ w0.
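As a quick numerical illustration for parts (d) and (e) (not part of the exam), the sketch below evaluates the derivative of the penalty for a single weight w; the values chosen for λ, N, and w0 are arbitrary assumptions used only to show the trend.

```python
# Sketch for Question 2 (d)-(e): derivative of the weight-elimination penalty
# for a single weight w. lambda_, N, and w0 are illustrative assumptions.
lambda_, N, w0 = 0.1, 100, 1.0

def penalty_grad(w):
    # d/dw [ (lambda/N) * w^2 / (w0^2 + w^2) ]
    #   = (lambda/N) * 2 * w * w0^2 / (w0^2 + w^2)^2
    return (lambda_ / N) * 2 * w * w0**2 / (w0**2 + w**2) ** 2

for w in [0.01, 0.1, 10.0, 100.0]:
    g = penalty_grad(w)
    # grad/w is the relative shrink rate a gradient step applies to w
    print(f"w = {w:7.2f}: grad = {g:.3e}, grad/w = {g / w:.3e}")
```

The relative rate grad/w stays roughly constant for |w| ≪ w0 (the penalty then behaves like ordinary weight decay) and collapses for |w| ≫ w0, which is the behaviour part (e) asks you to argue.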
3. To learn a binary target function f, you have a data set of N = 1,000 samples {(xn,yn),1 ≤ n ≤ N}, and a learning model with M = 100 hypotheses. You split the data set into two subsets of size Ntrain = 800 and Ntest = 200 for training and testing, respectively. Based on the training set, you determine the final hypothesis g.
You can now upper bound the out-of-sample error Eout(g) = Pr[g(x) ≠ f(x)] using either the in-sample error
Ein(g) = (1/Ntrain) Σ_{n=1}^{Ntrain} 1{g(xn) ≠ f(xn)}
or the test error
Etest(g) = (1/Ntest) Σ_{n=1}^{Ntest} 1{g(xn) ≠ f(xn)},
where 1{·} is the indicator function.
(a) (5 marks) Determine whether Ein(g) or Etest(g) provides the better bound for Eout(g) for a 2% error tolerance.
(b) (2 marks) Reserving more samples for testing would make Etest(g) a better estimate of Eout(g). What is the disadvantage of doing this?
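One way (an assumption, not stated in the problem) to read the 2% tolerance in part (a) is as the deviation ε = 0.02 in the Hoeffding bound; the sketch below then compares the standard bounds from the textbook, a union bound over the M hypotheses for Ein(g) and a single-hypothesis bound for Etest(g).

```python
# Sketch for Question 3 (a) (not part of the exam): Hoeffding-style bounds on
# the probability that each estimate deviates from Eout(g) by more than eps.
# Assumes eps = 0.02 is the "2% error tolerance" and uses the standard bounds
# 2*M*exp(-2*eps^2*Ntrain) for Ein (union bound over M hypotheses) and
# 2*exp(-2*eps^2*Ntest) for Etest (g is fixed before the test set is used).
import math

M, Ntrain, Ntest, eps = 100, 800, 200, 0.02

p_in = 2 * M * math.exp(-2 * eps**2 * Ntrain)
p_test = 2 * math.exp(-2 * eps**2 * Ntest)
print(f"Pr[|Ein(g)   - Eout(g)| > {eps}] <= {p_in:.3f}")
print(f"Pr[|Etest(g) - Eout(g)| > {eps}] <= {p_test:.3f}")
```

Whichever bound is smaller at this tolerance indicates the tighter guarantee; the sketch only automates the comparison and does not replace the argument asked for in (a).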