EECS 189 Introduction to Machine Learning Fall 2020 HW12
This homework is due Tuesday, November 24 at 11:59 p.m.

1 Getting Started

Read through this page carefully. You may typeset your homework in LaTeX or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:
1. Submit a PDF of your writeup, with an appendix for your code, to the appropriate assignment on Gradescope. If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.
2. If there is code, submit all code needed to reproduce your results.
3. If there is a test set, submit your test set evaluation results.
After you’ve submitted your homework, watch out for the self-grade form.
(a) Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
(b) Please copy the following statement and sign next to it. We just want to make it extra clear so that no one inadvertently cheats.
I certify that all solutions are entirely in my words and that I have not looked at another student's solutions nor have I looked at any online solutions to any of these problems. I have credited all external sources in this write up.

2 Decision Trees and Random Forests
In this problem, you will implement decision trees, random forests, and boosted trees for classification on two datasets:
1. Titanic Dataset: predict Titanic survivors
2. Spam Dataset: predict if a message is spam
In lecture, you were given a basic introduction to decision trees and how such trees are learned from training data. You were also introduced to random forests. Feel free to research different decision-tree training techniques online.
Data format for Spam
The preprocessed spam dataset given to you as part of the homework in spam data.mat consists of 11,029 email messages, from which 32 features have been extracted as follows:
• 25 features giving the frequency (count) of words in a given message which match the following words: pain, private, bank, money, drug, spam, prescription, creative, height, featured, differ, width, other, energy, business, message, volumes, revision, path, meter, memo, planning, pleased, record, out.
• 7 features giving the frequency (count) of characters in the email that match the following characters: ;, $, #, !, (, [, &.
The dataset consists of a training set of size 5172 and a test set of size 5857. NOTE: You should NOT use any software package for decision trees for Part (b).
(a) Before applying the decision-tree learning algorithm to the Titanic dataset, we will first preprocess the dataset. In the real world, pre-processing the data is a very important step since real-life data can be quite imperfect. However, to make this problem easier, we have provided some code to preprocess the data. Look at the code in Jupyter Notebook Part (a) and describe how we deal with the following problems:
• Some data points are missing class labels;
• Some features are not numerical values;
• Some data points are missing some features.
Data Processing for Titanic
Here is a brief overview of the fields in the Titanic dataset.
(a) survived – 1 is survived; 0 is not. This is the class label.
(b) pclass – Measure of socioeconomic status: 1 is upper, 2 is middle, 3 is lower.
(c) sex – Male/Female
(d) age – Fractional if less than 1.
(e) sibsp – Number of siblings/spouses aboard the Titanic
(f) parch – Number of parents/children aboard the Titanic
(g) ticket – Ticket number
(h) fare – Fare.
(i) cabin – Cabin number.
(j) embarked – Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
(b) In Jupyter Notebook Part (b), implement the information gain (i.e., the entropy of the parent node minus the weighted sum of the entropies of the child nodes) and the Gini purification (i.e., the Gini impurity of the parent node minus the weighted sum of the Gini impurities of the child nodes) splitting rules for greedy decision-tree learning.
Note: The sample implementation assumes that all features are continuous. You may convert all your features to be continuous or augment the implementation to handle discrete features.
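For concreteness, here is a minimal sketch (not the notebook's starter code; the helper names are made up) of how the two splitting criteria can be computed from label arrays:

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_c p_c log2(p_c) over the class proportions p_c
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    # G(S) = 1 - sum_c p_c^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent_labels, left_labels, right_labels, impurity=entropy):
    # (impurity of parent) - (weighted sum of child impurities)
    n = len(parent_labels)
    weighted_children = (len(left_labels) / n) * impurity(left_labels) \
                      + (len(right_labels) / n) * impurity(right_labels)
    return impurity(parent_labels) - weighted_children
```

Passing `impurity=entropy` gives the information gain and `impurity=gini_impurity` gives the Gini purification of a candidate split.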
(c) In Jupyter Notebook Part (c), train a shallow decision tree on the Titanic and Spam datasets (for example, a depth-3 tree, although you may choose any depth that looks good) and visualize your tree (include your visualization results in your submission). For each non-leaf node, include the feature name and the split rule; for each leaf node, include the class your decision tree would assign. You may use sklearn in this problem.
In Jupyter Notebook Part (c), we provide you a code snippet to draw sklearn's tree using pydot and graphviz. If it is hard for you to install these dependencies, you will need to draw the diagram by hand.
(d) From this point forward, you are allowed to use sklearn.tree.* and the classes we have imported for you in the starter code. You are NOT allowed to use other functions from sklearn. Fill in code in Jupyter Notebook Part (d) for bagged trees: for each tree up to n, sample with replacement from the original training set until you have as many samples as the training set. Fit a decision tree for each sampling.
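As a rough sketch of this bagging procedure (assuming numpy arrays X, y and the permitted sklearn.tree classes; the function names here are hypothetical, not the starter code's):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=100, rng=np.random.default_rng(0)):
    # For each tree, draw a bootstrap sample (with replacement, same size as
    # the training set) and fit an independent decision tree to it.
    trees = []
    n = len(y)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # bootstrap indices
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    # Majority vote over the individual trees (assumes integer class labels).
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```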
(e) In Jupyter Notebook Part (e), apply bagged trees to the Titanic and Spam datasets. Find and state the most common splits made at the root node of the trees. For example:
(a) (“thanks”) < 4 (15 trees)
(b) (“nigeria”) ≥ 1 (5 trees)
(f) Read the code in Jupyter Notebook Part (f) for random forests, which works as follows: again, for each tree in the forest, sample with replacement from the original training set until you have as many samples as the training set. Learn a decision tree for each sample, this time using a randomly sampled subset of the features (instead of the full set of features) to find the best split on the data. Let m denote the number of features to subsample. You do not need to answer anything for this part; you only need to go through the code written in Jupyter Notebook Part (f).
(g) In Jupyter Notebook Part (g), apply random forests to the Titanic and Spam datasets. Find and state the most common splits made at the root node of the trees.
(h) Implement the AdaBoost algorithm for a boosted random forest as follows: this time, we will build the trees sequentially. We will collect one sampling at a time, and after each new tree is fit we will change the weights on the data so that later trees focus their attention on the more challenging data points in the training set. Let $w \in \mathbb{R}^N$ denote the probability vector over the data (initially uniform), where N denotes the number of data points. To start off, as before, sample with replacement from the original training set according to w until you have as many samples as the training set. Fit a decision tree to this sampling, again using a randomly sampled subset of the features. Compute the weight for tree j based on its weighted accuracy:
$$a_j = \frac{1}{2} \log \frac{1 - e_j}{e_j},$$
where $e_j$ is the weighted error:
$$e_j = \frac{\sum_{i=1}^{N} I_j(x_i)\, w_i}{\sum_{i=1}^{N} w_i}$$
and $I_j(x_i)$ is an indicator for data point i being incorrectly classified by this learned tree. Then update the weights as follows:
$$w_i \leftarrow \begin{cases} w_i \exp(a_j) & \text{if } I_j(x_i) = 1 \\ w_i \exp(-a_j) & \text{otherwise.} \end{cases}$$
Repeat until you have M trees.
Predict by first calculating the score z(x, c) for a data sample x and class label c:
$$z(x, c) = \sum_{j=1}^{M} a_j I_j(x, c),$$
where $I_j(x, c)$ is now an indicator variable for whether tree j predicts class label c for data x. Then, the class with the highest weighted votes is the prediction (classification result):
$$\hat{y} = \arg\max_c z(x, c).$$
Fill in code in Jupyter Notebook Part (h) for boosted random forests. How are the trees being weighted? Describe qualitatively what this algorithm is doing. What does it mean when $a_j < 0$, and how does the algorithm handle such trees? Apply boosted trees to the Titanic and Spam datasets.
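A minimal sketch of the boosting loop described above, assuming binary {0, 1} labels and sklearn's DecisionTreeClassifier as the base learner (the function name and hyperparameter values are made up, and the notebook's actual scaffolding may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_boosted_trees(X, y, M=50, n_subfeatures=8, rng=np.random.default_rng(0)):
    # w is the sampling distribution over the N data points (initially uniform).
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    trees, alphas, feature_sets = [], [], []
    for _ in range(M):
        idx = rng.choice(n, size=n, replace=True, p=w)       # sample according to w
        feats = rng.choice(d, size=n_subfeatures, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
        wrong = (tree.predict(X[:, feats]) != y)              # indicator I_j(x_i)
        e = np.clip(np.sum(w * wrong) / np.sum(w), 1e-12, 1 - 1e-12)  # weighted error e_j
        a = 0.5 * np.log((1 - e) / e)                         # tree weight a_j
        w = w * np.exp(np.where(wrong, a, -a))                # reweight the data
        w = w / w.sum()                                       # renormalize to a distribution
        trees.append(tree); alphas.append(a); feature_sets.append(feats)
    return trees, alphas, feature_sets
```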
(i) Run the code in Jupyter Notebook Part (i) and summarize the performance evaluation of: a constant classifier, a single decision tree, a single decision tree (sklearn implementation), bagged trees, random forests, and boosted trees. For each of the 2 datasets, report your training and validation accuracies. You should use 3-fold cross validation, i.e., splitting the dataset into 3 parts. You should be reporting 32 numbers (2 datasets × 4 classifiers × (3 + 1) for 3 cross-validation accuracies and 1 training accuracy). Describe qualitatively which types of trees and forests performed best. Detail any parameters that worked well for you. In addition, for each of the 2 datasets, train your best model and submit your predictions on the test data to Gradescope. Your best Titanic classifier should exceed 73% accuracy and your best Spam classifier should exceed 76% accuracy for full points.
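Since generic sklearn utilities are off-limits, a manual 3-fold split along these lines may be helpful (a sketch only; `fit_fn` and `predict_fn` are hypothetical placeholders for whichever classifier is being evaluated):

```python
import numpy as np

def three_fold_cv(X, y, fit_fn, predict_fn, rng=np.random.default_rng(0)):
    # Shuffle once, split into 3 folds, train on 2 folds and validate on the third.
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 3)
    val_accs = []
    for k in range(3):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(3) if j != k])
        model = fit_fn(X[train], y[train])
        val_accs.append(np.mean(predict_fn(model, X[val]) == y[val]))
    return val_accs
```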
(j) For the spam dataset only: go through Jupyter Notebook Part (j). Given the examples shown there, describe what kind of data is the most challenging to classify and which is the easiest. Alternatively, you may come up with your own procedure for determining which data points are easy or hard to classify.
3 NLP and embeddings
Common Natural Language Processing (NLP) tasks involve spell-checking (easy), information extraction (medium), and machine translation (hard). A common first step in any NLP task is the representation of words. A vector representation of a word is called an embedding. Initially, words were represented as one-hot vectors in the same way that we described representing other categorical data. If the size of the dictionary is |V| (for reference, there are about 30K distinct words in the complete works of Shakespeare), the word school would be represented as
$(0, 0, \dots, 0, 1, 0, \dots, 0),$
where the 1 happens at the index corresponding to the word school. A document d can be summarized as the mean of the vectors corresponding to all its words. This is called the bag-of-words model because it throws away the information of the relationship between words (similar to the relation between an image and its histogram). When we have a vector representation for a document, we can check how similar two documents d, d′ are by computing the cosine of the angle between them.
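A tiny sketch of this bag-of-words representation and the cosine-similarity check (the toy vocabulary and documents are made up):

```python
import numpy as np

vocab = ["school", "energy", "bank", "money"]          # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(document):
    # Mean of the one-hot vectors of the document's (in-vocabulary) words.
    v = np.zeros(len(vocab))
    words = [w for w in document.lower().split() if w in word_to_idx]
    for w in words:
        v[word_to_idx[w]] += 1.0
    return v / max(len(words), 1)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

d1 = bag_of_words("bank money money")
d2 = bag_of_words("money energy")
print(cosine(d1, d2))
```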
We know words like energy and energize are related, but one-hot encoding hides all similarity between words (the inner product between any two different one-hot representations is zero). A different way to create embeddings was proposed in the 1950s: words with similar meanings appear in similar contexts. This connection between meaning and distribution is called the distributional hypothesis. One of the early workers in the field, J. R. Firth, stated this idea sharply: “you shall know a word by the company it keeps.” In this problem, we will study a method for learning vector semantics (i.e., embeddings learned under the distributional hypothesis) called word2vec.
word2vec learns dense encodings (short vectors) for words in a self-supervised fashion: all the algorithm needs is running text. We assume that every word w in the vocabulary has a pair of embeddings φ(w), ψ(w) ∈ Rd , where d is the fixed length of the learned vectors (a hyperparameter usually ranging from 50 to 200). Suppose we have another word w′ with embeddings φ(w′) and ψ(w′). Then word2vec operates under the claim that the probability that w′ is a context word for w is
$$P(w' \text{ is a context word for } w) = \frac{1}{1 + e^{-\psi(w')^\top \phi(w)}}.$$
To be a context word means that w′ tends to appear close to w in written text. This will be clearer soon.
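In other words, the claimed probability is just a logistic (sigmoid) function of the inner product of the two embeddings; a short sketch with placeholder random vectors makes this concrete:

```python
import numpy as np

d = 50                                   # embedding dimension (hyperparameter)
phi_w = np.random.randn(d)               # center embedding phi(w), placeholder values
psi_wp = np.random.randn(d)              # context embedding psi(w'), placeholder values

# P(w' is a context word for w) = sigmoid(psi(w')^T phi(w))
p_context = 1.0 / (1.0 + np.exp(-psi_wp @ phi_w))
print(p_context)
```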
(a) Suppose we have a large dataset $\{(w_i, w_i', t_i)\}_{i=1}^{N}$ of pairs of words $w_i, w_i'$ together with a label $t_i$ such that $t_i = 1$ when $w_i'$ is a context word for $w_i$ and $t_i = 0$ when it is not. Assuming all pairs of words are independently drawn (completely unfounded!), give an expression for the log likelihood $\ell(t_1, \dots, t_N \mid w_1, w_1', \dots, w_N, w_N';\; \phi(w_1), \psi(w_1'), \dots, \phi(w_N), \psi(w_N'))$.
(b) Observe that the log likelihood uses two embeddings φ(wi), ψ(wi) per word wi: φ(wi) is used when wi is the center word, and ψ(wi) is used when wi is a context word for another center
word. We can try to learn the embeddings by maximizing the likelihood, and we can do that by running SGD on the negative log likelihood. Provide an expression for the SGD updates to the weights φ(wi), ψ(wi) based on the log likelihood.
(c) We know how to update the vectors φ(w), ψ(w) for a word w based on the logistic loss as we did in the previous part. But where do we get the training set? We don’t need labeled text, just written text. Consider the sentence
“Concerned with unprecedented volatility, investors shun contracts and favor stocks.”
Choose investors as the center word. We create a window of length L (a hyperparameter) around the center word. This gives us 2L pairs (investors, w′, 1). For instance, for L = 2, w′ ∈ {unprecedented, volatility, shun, contracts}. By moving the center word through a corpus of text, we generate a large dataset of labeled pairs. Note, however, that this procedure would only yield positive examples. To learn effectively, we also need some negative examples for self-supervision. Negative examples are usually generated by choosing words at random from the dictionary according to their probability of occurrence in the text. (Why? Because that is what their frequency would be in positive examples, so we want to match that in negative examples for balance.) In the jupyter notebook, complete the code for the updates to the word embeddings. Training the embeddings may take up to one hour. Note that the strings we are using come from transcripts of the European parliament. Source: http://www.statmt.org/europarl/.
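A possible sketch of this pair-generation procedure (window of length L plus frequency-weighted negative samples); the notebook's actual code may organize this differently:

```python
import numpy as np
from collections import Counter

def make_pairs(tokens, L=2, negatives_per_positive=1, rng=np.random.default_rng(0)):
    # Positive pairs: (center, context, 1) for every word within L of the center.
    # Negative pairs: (center, random word drawn by unigram frequency, 0).
    counts = Counter(tokens)
    vocab = list(counts)
    freqs = np.array([counts[w] for w in vocab], dtype=float)
    freqs /= freqs.sum()
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j == i:
                continue
            pairs.append((center, tokens[j], 1))
            for _ in range(negatives_per_positive):
                pairs.append((center, rng.choice(vocab, p=freqs), 0))
    return pairs

sentence = "concerned with unprecedented volatility investors shun contracts and favor stocks"
print(make_pairs(sentence.split(), L=2)[:6])
```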
(d) The code of this section generates the list of the five words which are closest (in angle) to an input word. Browse the values of the vocabulary and find two words whose closest words sound mostly reasonable to you. Then find another two words whose closest words you deem semantically objectionable. In both cases, state the words you used, their closest words, and your thoughts about their relatedness. Hint: The input we are using to learn the embeddings comes from transcripts from the European parliament. It’s probably wiser to choose words common in diplomatic settings.
(e) Look at the way in which we trained our vectors in part C of the jupyter notebook. There is no clearly distinguishable training set and validation set. Propose a way in which we can convert our data into a train+validation set and suggest a metric for checking the quality of our embeddings.
This is an open-ended question, but it is important for you to be able to think in this way in practice. Try.
(f) Suppose we have learned d-dimensional embeddings for a vocabulary V. Then we have |V| × d matrices Φ and Ψ containing the center and context embeddings, respectively, for our words. Suppose we add a new word w′ to our vocabulary. We want to learn the embeddings of w′ while keeping the existing embeddings unmodified. In the jupyter notebook, you will see that we extracted all positive examples for a word w in the vocabulary V and we claimed that a new word w′ has exactly the same positive examples (this is like doing search-and-replace in a text). Complete the code of this section to train the new center embedding and report the 5 words with embeddings closest to the embedding for w′. Do the results surprise you?
(g) Realistic semantic embeddings take over 10 hours to train. We will keep exploring embeddings by loading previously-trained embeddings. We downloaded the 50-dimensional GloVe (another algorithm) vectors from https://nlp.stanford.edu/projects/glove/ and loaded them in the jupyter notebook. The code for this part plots in 2D the PCA projections of the vectors for several words. Choose 10-20 words and plot the 2D projections of the embeddings. What do you notice? Which words are close to each other? Which words are farther away?
(h) It has been found that GloVe vectors can be used to solve some analogies. Given three words, w, w′, w′′, we compute φ(w) − φ(w′) + φ(w′′), and we observe that the resulting vector can sometimes give the answer to an analogy of the form w′ is to w what w′′ is to ???. Using the example code, find three analogies for which the code gives a sensical answer, and three analogies for which it does not. In the nonsensical case, explain what you expected.
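A sketch of this analogy arithmetic, assuming the GloVe vectors have been loaded into a Python dict mapping words to numpy arrays (the variable and function names here are assumptions, not the notebook's):

```python
import numpy as np

def analogy(glove, w, w_prime, w_dprime, topk=5):
    # "w' is to w what w'' is to ???": rank words by cosine similarity to
    # phi(w) - phi(w') + phi(w'').
    target = glove[w] - glove[w_prime] + glove[w_dprime]
    target /= np.linalg.norm(target)
    scores = {
        cand: vec @ target / np.linalg.norm(vec)
        for cand, vec in glove.items()
        if cand not in (w, w_prime, w_dprime)     # exclude the query words
    }
    return sorted(scores, key=scores.get, reverse=True)[:topk]

# e.g., analogy(glove, "king", "man", "woman") often ranks "queen" highly
```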
What’s next? When trying to understand a sentence in English, it is important to figure out how words relate to each other grammatically (what is the object, subject. . . ). This goes by the name dependency parsing, and it’s a necessary task for translation and question answering. If you are Amazon’s Alexa, and you are told to do something, you must understand how the words spoken to you relate to entities in the real world and to actions you can take (e.g., play a song or turn lights on). This and more you will study in an NLP class. For now, if you want to get more of a taste for this area, go and read something about GPT-3; this will acquaint you with cutting-edge results in many NLP tasks.
4 Adversarial Examples (Theory)
Many machine learning models are known to be sensitive to small perturbations of the input. This phenomenon, dubbed adversarial examples, is an ongoing area of research in machine learning, and is viewed as a particularly challenging problem in many deep learning contexts. In the previous homework, we showed that many classifiers are vulnerable to adversarial examples and that there is a natural way to try to “robustify” these classifiers. This problem is a sequel and focuses on theoretical analysis in a simple setting. This example illustrates why the adversarial example phenomenon should perhaps be expected and why it has a kind of inevitability to it.
For this problem, we will consider data with input x and binary label y ∈ {+1, −1}. We assume that the class is balanced, i.e. P(y = +1) = P(y = −1) = 1/2. We are only concerned with linear classifiers of the form: f (x) = sign(w⊤ x).
(a) For now, let’s assume that the features x have the following distribution:
$$x_i \sim \mathcal{N}(\eta y, \sigma^2) \quad \forall i \in \{1, \dots, d\},$$
with different features being independent conditioned on the label y. In other words, all features of x are weakly correlated with the label y. A natural choice for a linear classifier in this case is to choose a uniform weight over all features, i.e., $w_{\text{unif}} = [\frac{1}{d}, \frac{1}{d}, \dots, \frac{1}{d}]$. For this choice of classifier, $f(x) = \text{sign}(w_{\text{unif}}^\top x)$, to achieve 99% accuracy on test data drawn i.i.d. from the same distribution above, what is approximately the smallest value of η in terms of σ and d?
(Hint: Look at the distribution of the random variable $w_{\text{unif}}^\top x$. What is it? What has to be true to have 99% accuracy?)
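A quick empirical sanity check of this setup (not a substitute for the derivation; the values of η, σ, and d below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, eta, n = 500, 1.0, 0.1, 20_000
w_unif = np.full(d, 1.0 / d)

y = rng.choice([-1, 1], size=n)                           # balanced labels
X = eta * y[:, None] + sigma * rng.standard_normal((n, d))  # x_i ~ N(eta*y, sigma^2)
acc = np.mean(np.sign(X @ w_unif) == y)
print(f"empirical accuracy of w_unif: {acc:.3f}")
```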
(b) The result from part (a) should suggest that η can be fairly small especially when d is large, and so it is easy to classify such data correctly with high accuracy.
Now consider a slightly different setting where we will test the same classifier against adversarial examples. For a given sample (x, y), derive adversarial examples $x_{\text{adv}}$ for the classifier from part (a) where the $\ell_\infty$-norm of the perturbation is constrained to ε, i.e., $\|\delta\|_\infty \le \varepsilon$. Then, argue why this classifier will have very low accuracy on the adversarial examples if the adversary can use $\varepsilon \ge 2\eta$.
(Hint: To derive the adversarial examples, you can use the result from the previous homework.)
(c) Use the same setting and the adversarial examples from part (b), and the assumption that the classes are balanced in this problem. What is the highest adversarial accuracy we can hope to achieve for any classifier with an adversary that powerful? Briefly justify your answer. No proof is needed for this part.
(d) For this part only, let’s consider a slightly different distribution for $x \in \mathbb{R}^{d+1}$:
$$x_1 = y, \qquad x_i \sim \mathcal{N}(\eta y, \sigma^2) \quad \forall i \in \{2, \dots, d+1\}.$$
Assume that $\eta = \frac{4\sigma}{\sqrt{d}} < 1$. We want to understand how robust a classifier can be to adversarial perturbations given this kind of input data.
Suppose that we insist that we do better than guessing for any adversarial budget $\varepsilon' = 2\eta < \varepsilon$. How big can ε be in this case?
Then think about what the most robust classifier is (i.e., what the weights of the classifier should look like if you want to achieve the highest accuracy on the adversarial examples). You should realize that you want to prioritize the first feature, and the robust weights should have the following form:
$$w_{\text{rob}} = [\alpha, \beta, \beta, \dots, \beta] \in \mathbb{R}^{d+1} \quad \text{for } \alpha > 0.$$
What is the range of values of β such that this robust classifier has perfect accuracy on the adversarial examples? Answer in terms of α, ε, d. Which value of β gives you the largest margin?
(e) (Trade-off between normal and adversarial accuracy.) Now consider the following data distribution, which is slightly modified from the previous part and is a bit more realistic. For $p > 0.5$,
$$x_1 = \begin{cases} +y & \text{w.p. } p \\ -y & \text{w.p. } 1-p \end{cases}, \qquad x_i \sim \mathcal{N}(\eta y, \sigma^2) \quad \forall i \in \{2, \dots, d+1\}.$$
Your classifier from the previous part should have perfect accuracy on the normal test set from the previous part, but that is because the features of x in the previous part are too easy. In practice, we are more likely to encounter a set of features where most of them are noisy and only weakly correlated with the true label, and only a few features are strongly correlated but still with some error. In this case, we choose the first feature to be the only “strong” feature and the rest of the d features to be the “weak” ones.
Once again suppose $\eta = \frac{4\sigma}{\sqrt{d}} < 1$ and $\varepsilon' = 2\eta < \varepsilon$ from the previous part. Argue that we may not be able to have a single linear classifier that is both robust and has high accuracy on un-perturbed data at the same time.
(Hint: What are the normal and the adversarial accuracies of $w_{\text{unif}}$ and $w_{\text{rob}}$?)

5 Meta-learning for Learning 1D functions
We will return to the by-now familiar setup of learning a 1D function from Fourier features in an overparameterized setting, introduced in Problem 4 of Homework 5. In the past we worked with a version in which the first s features were known to be useful and thus assigned larger weights before performing regression. Suppose now that our task is to learn not just one 1D function, but any of a class of 1D functions drawn from a task distribution $\mathcal{D}_T$.
You will notice a spiritual connection between this problem and the earlier homework problem on learning dimensionality reduction using auxiliary labels. This is not a coincidence and we hope this problem will help you better appreciate the interconnected nature of machine learning concepts and the power of actually understanding them from the foundations.
In this problem we consider all signals of the form
$$y = \sum_{s \in S} \alpha_s \phi_{u_s}(x).$$
The task distribution produces individual tasks which have true features with random coefficients in some a priori unknown set of indices S. Although we do not yet know the contents of S, we can sample tasks from DT . The important question is thus, how do we use sampled tasks in training to improve our performance on an unseen task drawn from DT at test time? One solution is to use our training tasks to learn a set of weights to apply to the features before performing regression through meta-learning.
In the original problem formulation, we chose feature weights $c_k$ to apply to the features $\phi_{u_k}(x)$ before learning coefficients $\hat{\beta}_k$ such that
$$\hat{y} = \sum_{k=0}^{d-1} \hat{\beta}_k c_k \phi_{u_k}(x).$$
We then performed the min-norm least-squares optimization
$$\hat{\beta} = \arg\min_{\beta} \|\beta\|_2 \quad \text{s.t.} \quad y = \sum_{k=0}^{d-1} \beta_k c_k \Phi_{u_k},$$
where $\Phi_u$ is the column vector of features $[\phi_{u_0}(x), \phi_{u_1}(x), \dots, \phi_{u_{d-1}}(x)]'$, which are orthonormal with respect to the test distribution. In our new formulation we want to learn the c which minimizes the expected loss of $\hat{\beta}$ over all tasks,
$$\arg\min_{c} \mathbb{E}_{\mathcal{D}_T}\left[ L_T\big(\hat{\beta}_T, c\big) \right],$$
where $L_T(\hat{\beta}_T, c)$ is the loss from learning $\hat{\beta}$ for a specific task with the original formulation and a given c vector. c is shared across all tasks and is what we will optimize with meta-learning.
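For reference, a minimal sketch of the inner regression above: scale each feature column by its weight $c_k$ and take the min-norm least-squares solution via the pseudoinverse (the function names are made up, not the notebook's):

```python
import numpy as np

def min_norm_coeffs(Phi, y, c):
    # Phi: n x d matrix of features phi_{u_k}(x_i); c: length-d feature weights.
    # Solve  min ||beta||_2  s.t.  y = (Phi @ diag(c)) beta  via the pseudoinverse.
    A = Phi * c            # broadcasting multiplies column k by c_k
    return np.linalg.pinv(A) @ y

def predict(Phi_test, beta_hat, c):
    # Apply the same feature weighting at prediction time.
    return (Phi_test * c) @ beta_hat
```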
There are many machine learning techniques which can fall under the nebulous heading of meta-learning, but we will focus on one with Berkeley roots called Model Agnostic Meta-Learning (MAML)¹, which optimizes the initial weights of a network to rapidly converge to low loss within the task distribution. The MAML algorithm as described by the original paper is shown in Fig. 1. This problem should help you understand how and why MAML works, without any mystification.
Figure 1: MAML algorithm. We will refer to the training steps on line 6 as the inner update, and the training step on line 8 as the meta update.
At a high level, MAML works by sampling a “mini-batch” of tasks {T_i} and using regular gradient descent updates to find a new set of parameters θ_i for each task, starting from the same initialization θ. Then the gradient w.r.t. the original θ is calculated for each task using the task-specific updated weights θ_i, and θ is updated with these ‘meta’ gradients. Fig. 2 illustrates the path the weights take with these updates.
¹C. Finn, P. Abbeel, S. Levine, “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017.
Figure 2: MAML gradient trajectory illustration
The end goal is to produce weights θ∗ which can reach a state useful for a particular task from DT after a few steps — needing to use less data to learn. If you want to understand the fine details of the algorithm and implementation, we recommend reading the original paper and diving into the code provided with this problem.
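To make the inner and meta updates concrete, here is a minimal MAML-style loop on toy linear-regression tasks, using PyTorch autograd for the second-order term (an illustration under invented dimensions, step sizes, and task sampler; it is not the homework's Fourier-feature setup or the provided code):

```python
import torch

d, alpha, beta_lr = 5, 0.01, 0.001           # inner and meta step sizes (arbitrary)
theta = torch.zeros(d, requires_grad=True)   # shared initialization

def loss(params, X, y):
    return ((X @ params - y) ** 2).mean()

for step in range(500):
    meta_loss = 0.0
    for _ in range(4):                        # mini-batch of tasks T_i
        w_true = torch.randn(d)               # hypothetical task parameters
        X_tr, X_val = torch.randn(10, d), torch.randn(10, d)
        y_tr, y_val = X_tr @ w_true, X_val @ w_true
        # inner update (line 6 of Fig. 1): one gradient step on this task's data
        g = torch.autograd.grad(loss(theta, X_tr, y_tr), theta, create_graph=True)[0]
        theta_i = theta - alpha * g
        # evaluate the adapted parameters theta_i on held-out task data
        meta_loss = meta_loss + loss(theta_i, X_val, y_val)
    # meta update (line 8): gradient w.r.t. the original theta, through the inner step
    meta_grad = torch.autograd.grad(meta_loss, theta)[0]
    with torch.no_grad():
        theta -= beta_lr * meta_grad
```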
(a) For now, assume that we have access to grid-spaced training data and we have M alias groups and |S| = s < M indices (max one feature from each alias group) that are actually present in any of the true signals that we are going to encounter. Argue that if the feature weights on the s ∈ S indices are large compared to the rest then regression tasks will work.
(Hint: Can you relate this to the familiar setting from the previous problem? Think about the differences in this setup and argue why they shouldn’t matter for the success of regression given the necessary assumptions in the previous problem.)
(b) For utter simplicity, suppose now that we have exactly one training point (x, y), one true feature $\phi_{u_t}(x)$, and one alias of the true feature $\phi_{u_a}(x)$ that agrees with the true feature on this one point, even though both features are orthonormal relative to the test distribution. The function we wish to learn is $y = \phi_{u_t}(x)$. We learn coefficients $\hat{\beta}$ using the training data. Note that both the coefficients and the feature weights are 2-d vectors; the first component corresponds to the true feature and the second one to the alias.
Derive the closed-form solution for βˆ with grid-spaced training data and feature weights c0, c1.
(c) Assume for simplicity that we have access to infinite data from the test distribution for the purpose of updating the feature weights c. Calculate the gradient of the expected test error with respect to the true and alias feature weights c0 and c1, respectively:
$$\frac{d}{dc}\, \mathbb{E}_{x_{\text{test}}, y_{\text{test}}}\left[ \frac{1}{2} \left\| y - \hat{\beta}_0 c_0 \phi_{u_t}(x) - \hat{\beta}_1 c_1 \phi_{u_a}(x) \right\|_2^2 \right].$$
(Hint: the features $\phi_{u_i}(x)$ are orthonormal under the test distribution.)
(d) Generate a plot showing that, for some initialization $c(0)$, as the number of iterations $i \to \infty$ the weights empirically converge to $c_0 = \|c(0)\|, c_1 = 0$ using gradient descent with a sufficiently small step size. Include the initialization and its norm and the final weights. What will β go to?
The rest of the parts (e)-(k) are in the accompanying jupyter notebooks ‘prob2-part1.ipynb’ and ‘prob2-part2.ipynb’. Follow the instructions in the notebook and write your answers in the cells provided.
(e) Meta-learning with closed-form min-norm least squares regression, all uniform random data.
(f) Impact of incorrect feature weightings.
(g) Meta-learning with closed-form min-norm least squares regression, grid-spaced training data and uniform random meta-update and test data.
(h) Replacing the closed-form solution with gradient descent.
(i) Impact of the number of gradient descent steps on meta-learning success.
(j) Meta-learning for classification with copy-pasted regression code.
(k) Meta-learning for classification with logistic regression.
6 Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous “Bloom’s Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g., by making yourself flashcards). But we don’t want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.
Contributors:
• Ana Tudor
• Anant Sahai
• Ashwin Pananjady
• Cathy Wu
• Chawin Sitawarin
• Inigo Incer
• Josh Sanz
• Mandi Zhao
• Mark Velednitsky
• Vignesh Subramanian
• Yang Gao
• Yaodong Yu
• Yichao Zhou