1. Consider a linear binary classifier that operates on two features, and assume we have the four
training samples

(x_1 = [50, 1]^T, y_1 = −1),   (x_2 = [20, 5]^T, y_2 = −1),
(x_3 = [80, 3]^T, y_3 = +1),   (x_4 = [60, 5]^T, y_4 = +1).

For the following, “sketch” means that you are not asked for exact results; reasonable
approximations and trends, obtained without doing any calculations, are sufficient.

(a) (2 marks) Sketch the training data in a 2D diagram with different markers for the two
label values y = +1 and y = −1.

(b) (2 marks) Sketch the decision boundary for a linear classifier given the training data.

(c) (4 marks) Sketch the decision boundary and margins of the linear hard margin support
vector machine (SVM) classifier for this training data. Identify the support vectors in your
sketch.
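If you want to sanity-check your sketch for part (c) numerically (not required by the question), a minimal sketch in Python, assuming scikit-learn is available, is the following; a very large C makes the soft-margin solver behave like the hard-margin SVM on this separable data.

```python
import numpy as np
from sklearn.svm import SVC

# The four training samples from the question.
X = np.array([[50, 1], [20, 5], [80, 3], [60, 5]], dtype=float)
y = np.array([-1, -1, +1, +1])

# Large C approximates the hard-margin SVM (the data is linearly separable).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("weight vector:", clf.coef_[0])
print("bias:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
```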

Now we pre-process the training data through scaling and shifting, i.e., we generate

x′_i = s ⊙ x_i + o,   i = 1, 2, 3, 4,

where ⊙ denotes element-wise multiplication, such that the average of each feature is 0 and the
variance of each feature is 1. The pre-processed training data is

(x′_1 = [−0.1, −1.3]^T, y_1 = −1),   (x′_2 = [−1.3, 0.8]^T, y_2 = −1),
(x′_3 = [1.1, −0.3]^T, y_3 = +1),   (x′_4 = [0.3, 0.8]^T, y_4 = +1).
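As an aside, the quoted values follow from standardizing each feature with its sample standard deviation (ddof = 1); a minimal numpy sketch:

```python
import numpy as np

X = np.array([[50, 1], [20, 5], [80, 3], [60, 5]], dtype=float)

# Per-feature mean and sample standard deviation (ddof=1 reproduces the
# values quoted above to one decimal place).
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)

# Scale s and offset o such that x' = s (.) x + o has zero mean, unit variance.
s = 1.0 / sigma
o = -mu / sigma
X_prime = s * X + o

print(np.round(X_prime, 1))
# [[-0.1 -1.3]
#  [-1.3  0.8]
#  [ 1.1 -0.3]
#  [ 0.3  0.8]]
```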

(d) (2 marks) Sketch the pre-processed training data in a 2D diagram with different markers
for the two label values y = +1 and y = −1.

(e) (2 marks) Sketch the decision boundary and margins of the linear hard margin SVM
classifier for the pre-processed training data.

(f) (2 marks) Compare the SVM classifiers for the original data (problem (c) above) and the
pre-processed data (problem (e)). Would you recommend applying pre-processing before
classification, and why or why not?
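For parts (e) and (f), the same hard-margin fit can be repeated on the pre-processed data and compared with the fit on the raw data; again a sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

# Pre-processed samples as given above.
X_prime = np.array([[-0.1, -1.3], [-1.3, 0.8], [1.1, -0.3], [0.3, 0.8]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X_prime, y)

# Margin width is 2 / ||w||; compare it and the support vectors with the
# fit on the raw data when answering part (f).
w = clf.coef_[0]
print("margin width:", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```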


2. Consider the optimization of weights in a neural network via gradient descent. We use the
notation as in the textbook, with w^(ℓ)_{i,j} denoting an individual weight between two nodes i and
j at layer ℓ, and w containing all weights of the network. To update the weights, the partial
derivatives

∂Ein(w) / ∂w^(ℓ)_{i,j}

of the in-sample error Ein(w) are required.

Now consider the application of the augmented in-sample error

Eaug(w) = Ein(w) + (λ/N) Σ_{i,j,ℓ} (w^(ℓ)_{i,j})^2 / (w0^2 + (w^(ℓ)_{i,j})^2),

where a so-called penalty term is added. Here N is the number of training samples, and λ ≥ 0
and w0 ≥ 0 are parameters.

(a) (2 marks) What is usually the purpose of adding a penalty term to Ein(w)?

(b) (2 marks) Through what process would suitable values for λ and w0 be selected?

(c) (2 marks) What is the benefit of using λ/N as above versus just using λ?

(d) (2 marks) Compute the partial derivative of the penalty term with respect to w^(ℓ)_{i,j}.

(e) (3 marks) Argue that the penalty term causes weights with |w^(ℓ)_{i,j}| ≪ w0 to shrink faster
than weights for which |w^(ℓ)_{i,j}| ≫ w0.
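As a numerical illustration for parts (d) and (e) (a sketch only, not a substitute for the analytic derivation; the λ/N factor and the sum over weights are omitted, so a single weight w is considered):

```python
def penalty(w, w0):
    """Weight-elimination penalty for one weight, without the lambda/N factor."""
    return w**2 / (w0**2 + w**2)

def penalty_grad(w, w0):
    """Derivative of the penalty with respect to w: 2*w*w0^2 / (w0^2 + w^2)^2."""
    return 2 * w * w0**2 / (w0**2 + w**2) ** 2

w0 = 1.0
for w in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Relative shrinkage per gradient step is grad/w: it is large when |w| << w0
    # and vanishes when |w| >> w0.
    print(f"w={w:7.2f}  grad={penalty_grad(w, w0):.6f}  grad/w={penalty_grad(w, w0) / w:.6f}")
```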

3. To learn a binary target function f, you have a data set of N = 1,000 samples {(x_n, y_n), 1 ≤
n ≤ N}, and a learning model with M = 100 hypotheses. You split the data set into two
subsets of size Ntrain = 800 and Ntest = 200 for training and testing, respectively. Based on the
training set, you determine the final hypothesis g.

You can now upper bound the out-of-sample error Eout(g) = Pr[g(x) ≠ f(x)] using either
the in-sample error Ein(g) = (1/Ntrain) Σ_{n=1}^{Ntrain} 1{g(x_n) ≠ f(x_n)} (sum over the
training samples) or the test error Etest(g) = (1/Ntest) Σ_{n=1}^{Ntest} 1{g(x_n) ≠ f(x_n)}
(sum over the test samples), where 1{·} is the indicator function.

(a) (5 marks) Determine whether Ein(g) or Etest(g) provides the better bound for Eout(g),
assuming an error tolerance of 2%.

(b) (2 marks) Reserving more samples for testing would make Etest(g) a better estimate for
Eout(g). What is the disadvantage of doing this?
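For part (a), one standard way to compare the two estimates is the Hoeffding inequality, with a union bound over the M hypotheses for Ein(g) and a single fixed hypothesis for Etest(g). A sketch assuming this form of the bound (the resulting values are only upper bounds on probabilities and may exceed 1):

```python
import math

M, N_train, N_test = 100, 800, 200
eps = 0.02  # 2% error tolerance

# Hoeffding + union bound over the M hypotheses used during training,
# Hoeffding alone for the single hypothesis g evaluated on the test set.
p_in = 2 * M * math.exp(-2 * eps**2 * N_train)
p_test = 2 * math.exp(-2 * eps**2 * N_test)

print(f"P[|Ein(g)   - Eout(g)| > {eps}] <= {p_in:.3f}")
print(f"P[|Etest(g) - Eout(g)| > {eps}] <= {p_test:.3f}")
```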
