CMPUT 397
Worksheet 11: Constructing Features for Prediction
March 30, 2021
1. Consider the following two functions.
(a) Design features for each function so that it can be approximated as a linear function of those features. Can you design features that make the approximation exact?
(b) Can you design a single set of features that allows you to represent both functions?
[Figure: plots of the two functions to be approximated]
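For intuition, here is a minimal sketch of the idea (the target function below is hypothetical, since the actual problem is defined by the plotted functions above): if a function happens to be, say, $f(s) = 3|s - 0.5| + 1$, then the hand-designed features $x(s) = (1, |s - 0.5|)$ make it exactly linear in the features, with $f(s) = w^\top x(s)$ for $w = (1, 3)$.

```python
# Minimal sketch: a hand-designed feature set that makes a (hypothetical)
# target function exactly linear in the features.
import numpy as np

def features(s):
    # Bias term plus a "distance from 0.5" term.
    return np.array([1.0, abs(s - 0.5)])

w = np.array([1.0, 3.0])              # weights that reproduce f exactly
for s in [0.0, 0.25, 0.5, 0.9]:
    f_true = 3 * abs(s - 0.5) + 1     # hypothetical target function
    f_hat = w @ features(s)           # linear in the features
    print(s, f_true, f_hat)           # the two values match for every s
```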
2. Consider a neural network where the input column vector $s$ is mapped by the input-weight matrix $A$ and activation function $g$ to the feature vector $x \doteq g(\psi)$, where $\psi = As$. The feature vector is then mapped linearly by the output-weight matrix $B$ to the output vector $\hat{y} \doteq Bx$.
Here, for a vector $c$, the $i$th element is denoted $c_i$, and for a matrix $C$, the element in the $i$th row and $j$th column is denoted $C_{i,j}$.
Recall the following gradients:
$$\frac{\partial \hat{y}_k}{\partial B_{k,j}} = x_j, \qquad
\frac{\partial \hat{y}_k}{\partial A_{i,j}} = B_{k,i}\,\frac{\partial x_i}{\partial A_{i,j}} = B_{k,i}\,\frac{\partial g(\psi_i)}{\partial \psi_i}\, s_j.$$
(a) What are these derivatives specifically for the ReLU activation $g$?
(b) We talked about carefully initializing the weights of the neural network; for example, each weight can be sampled from a Gaussian distribution. Imagine instead that you decided to initialize all the weights to zero. Why would this be a problem? Hint: consider the derivatives in (a).
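For experimentation, here is a small numerical sketch of the forward pass and the first gradient above. The sizes are arbitrary, tanh stands in for the generic activation $g$ (the worksheet fixes neither), and $\partial \hat{y}_k / \partial B_{k,j} = x_j$ is checked by a finite difference.

```python
# Numerical sketch of the forward pass y_hat = B g(A s) and a finite-difference
# check of d y_hat[k] / d B[k, j] = x[j]. Sizes and the tanh activation are
# arbitrary illustration choices.
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=3)                 # input column vector s
A = rng.normal(size=(4, 3))            # input-weight matrix A
B = rng.normal(size=(2, 4))            # output-weight matrix B
g = np.tanh                            # example activation standing in for g

psi = A @ s                            # psi = A s
x = g(psi)                             # feature vector x = g(psi)
y_hat = B @ x                          # output vector y_hat = B x

k, j, eps = 0, 2, 1e-6
B_pert = B.copy()
B_pert[k, j] += eps                    # perturb a single output weight
numeric = ((B_pert @ x)[k] - y_hat[k]) / eps
print(numeric, x[j])                   # the two values agree
```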
3. Consider a problem with the state space $S = \{0, 0.01, 0.02, \dots, 1\}$. Assume the true value function is
$$v_\pi(s) = 4\,|s - 0.5|,$$
which is visualized below. We decide to create features with state aggregation, and choose to aggregate into two bins: $[0, 0.5]$ and $(0.5, 1]$.
(a) What are the possible feature vectors for this state aggregation?
(b) Imagine you minimize $\overline{VE}(w) = \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}(s, w))^2$ with a uniform weighting $d(s) = \frac{1}{101}$ for all $s \in S$. What vector $w$ is found?
(c) Now, if the agent puts all of the weighting on the range $[0, 0.25]$ (i.e., $d(s) = 0$ for all $s \in (0.25, 1]$), then what vector $w$ is found by minimizing $\overline{VE}$?
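For experimentation, here is a minimal sketch of this setup: the 101 states, the true values, one-hot state-aggregation features, and the weighted value error for a candidate weight vector (the candidate $w = (1, 1)$ below is arbitrary, not the answer to (b) or (c)).

```python
# Sketch of the setup: state-aggregation features over two bins and the
# weighted value error for an arbitrary candidate weight vector.
import numpy as np

states = np.round(np.arange(0, 1.0001, 0.01), 2)   # S = {0, 0.01, ..., 1}
v_pi = 4 * np.abs(states - 0.5)                    # true value function

def x(s):
    # One-hot feature vector: first bin is [0, 0.5], second bin is (0.5, 1].
    return np.array([1.0, 0.0]) if s <= 0.5 else np.array([0.0, 1.0])

def ve(w, d):
    # Weighted value error: sum_s d(s) * (v_pi(s) - x(s)^T w)^2.
    v_hat = np.array([x(s) @ w for s in states])
    return np.sum(d * (v_pi - v_hat) ** 2)

d_uniform = np.full(len(states), 1 / 101)          # weighting for part (b)
print(ve(np.array([1.0, 1.0]), d_uniform))         # VE of one candidate w
```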
4. Consider the following general SGD update rule with a general target $U_t$:
$$w_{t+1} = w_t + \alpha\,[U_t - \hat{v}(S_t, w_t)]\,\nabla \hat{v}(S_t, w_t).$$
Assume we are using linear function approximation, i.e., $\hat{v}(S, w) = x(S)^\top w$.
(a) What happens to the update if we scale the features by a constant and use the new features $\tilde{x}(S) = 2x(S)$? Why might this be a problem?
(b) In general we want a stepsize that is invariant to the magnitude of the feature vector $x(S)$, where the magnitude is measured by the inner product $x(S)^\top x(S)$. The book suggests the following stepsize:
$$\alpha = \frac{1}{\tau\, x(S)^\top x(S)}.$$
What is $x(S)^\top x(S)$ when using tile coding with 10 tilings? If $\tau = 1000$, what is $\alpha$ when we use tile coding with 10 tilings?
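As a concrete illustration, the sketch below performs one semi-gradient update with linear features and computes the suggested stepsize. The binary feature vector is made up to mimic tile coding (one active tile per tiling); the target $U$ and the sizes are arbitrary.

```python
# One linear semi-gradient update with the magnitude-invariant stepsize.
# The binary feature vector mimics tile coding: one active tile per tiling.
import numpy as np

num_features, num_tilings, tau = 50, 10, 1000
rng = np.random.default_rng(0)

x = np.zeros(num_features)
active = rng.choice(num_features, size=num_tilings, replace=False)
x[active] = 1.0                        # one active feature per tiling

alpha = 1.0 / (tau * (x @ x))          # the book's suggested stepsize
w = np.zeros(num_features)
U = 1.7                                # an arbitrary target U_t
v_hat = x @ w                          # linear value estimate x(S)^T w
w = w + alpha * (U - v_hat) * x        # gradient of x^T w w.r.t. w is x
print(x @ x, alpha)                    # inner product and resulting stepsize
```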