CS578 Fall 2021
HW 2
Handed Out: October 25, 2021 Due: November 7, 2021
Instructions for submission
1. Conceptual Part: Submit a single PDF on Gradescope with all your answers.
You should have been added, but if not the Entry Code is N8NWDK. Make
sure you select the page corresponding to the beginning of each answer, else
points might be deducted. Your homework must be typed and must contain your
name and Purdue ID. If you are using a late day, please indicate it at the top of the
document. Do not send the instructors emails about Late Days.
2. Coding part: To submit your assignment, log into data.cs.purdue.edu (physically
go to the lab or use ssh remotely) and follow these steps:
(a) Place all of your code (Execution.py, Perceptron.py, and any other files you
used) in a folder titled cs578-hw2. Make sure your folder is called this!
(b) Change directory to outside of cs578-hw2/ folder (run cd .. from inside cs578-hw2/
folder)
(c) Execute the following command to turn in your code: turnin -c cs578 -p hw2Fall2021
cs578-hw2
(d) To overwrite an old submission, simply execute this command again.
(e) To verify the contents of your submission, execute this command: turnin -v -c
cs578 -p hw2Fall2021.
Do not forget the -v option, else your submission will be overwritten with an
empty submission.
Questions
1. Given n Boolean variables (x1, …, xn), we define our target classification function f(·)
as f(x) = 1 if at least 3 variables are active, and f(x) = 0 otherwise. For n = 5, show
how this function can be represented as (1) a Boolean function and (2) a linear function.
2. Let CONB be the set of all different monotone conjunctions defined over n Boolean
variables. What is the size of CONB?
3. Let CONL be the set of all linear functions defined over n Boolean variables that are
consistent with the functions in CONB. We say that two functions fb ∈ CONB and
fl ∈ CONL are consistent if ∀x, fb(x) = fl(x). What is the size of CONL?
4. Define in one sentence: Mistake bound (your answer should include all the components
described in class).
5. Suggest a mistake bound algorithm for learning Boolean conjunctions (hint: recall
the elimination algorithm for monotone conjunctions). Show that your algorithm is a
mistake bound algorithm for Boolean conjunctions.
6. Given a linearly separable dataset consisting of 1000 positive examples and 1000 neg-
ative examples, we train two linear classifiers using the perceptron algorithm. We
provide the first classifier with a sorted dataset in which all the positive examples
appear first, and then the negative examples appear. The second classifier is trained
by randomly selecting examples at each training iteration. (1) Will both classifiers
converge? (2) What will be the training error of each of the classifiers?
7. We showed in class that using the Boolean kernel function K(x, y) = 2^same(x,y) we can
express all conjunctions over the attributes of x and y (with both positive and
negative literals). This representation can be too expressive. Suggest a kernel function
K(x, y) that can express all conjunctions of size at most k (where k is a parameter,
k < n).
Programming Assignment
In this part of the assignment you are required to implement the perceptron algorithm and
observe its performance on a binary sentiment classification task.
1. Dataset/Task The task is binary sentiment classification: the perceptron takes a
document of text as input and produces a binary output indicating whether the sentiment
is positive (1) or negative (0). The dataset is essentially a table with two columns,
where the first column contains a piece of text and the second column contains its
sentiment. Ideally, you want your trained perceptron to accurately differentiate between
positive and negative sentiment by looking at the words used in the text.
You will be given a single csv file, called data.csv, that contains all the data for this
task. One component of this task is to split the data into train, validation,
and test sets. The split ratios are up to you. You are required to use
5-fold cross validation when training your perceptron. You should not use the test set
when picking the optimal hyperparameter values. Simply report the performance of the
perceptron with the optimal hyperparameter values on the test set after observing
the performance on the training and validation datasets.
To program the splitting of the dataset, take a look at the function split_dataset in
Execution.py. You will have to program how the original dataset is split into train,
validation, and test datasets. It is imperative that you maintain the inputs and outputs
of the function as described in the program file. Additionally, each of the dataset
outputs must have a column called ‘Label’ containing either a 0 or a 1 to mark negative
or positive sentiment, respectively. For example, look at Fig. 1 below for a sample
format. It is important to keep this format because when your program is graded,
the datasets will be passed in with this format and the ground-truth labels for every
dataset will be extracted by simply reading the column ‘Label’ (look at Execution.py
for more information).
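A minimal sketch of one possible split_dataset implementation is shown below. The pandas usage, the 80/10/10 ratios, and the assumption that the function receives a single DataFrame are illustrative choices of ours; the actual signature and return values must match what the starter code in Execution.py specifies.

import pandas as pd

def split_dataset(df):
    # Shuffle once so positives and negatives are spread across all splits.
    # The 80/10/10 ratio below is an illustrative choice, not a requirement.
    shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
    n = len(shuffled)
    train_df = shuffled.iloc[: int(0.8 * n)]
    val_df = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
    test_df = shuffled.iloc[int(0.9 * n) :]
    # Each returned frame keeps the 'Label' column required by Execution.py.
    return train_df, val_df, test_df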
Figure 1: Sample pandas dataframe that contains the column ‘Label’ which represents the
sentiment for each row in column ‘Text’.
2. Feature Selection An important task within this homework is to find an effective
way to turn text into a meaningful quantitative representation to be used by the
perceptron. One example is the bag-of-words representation, which represents a piece
of text simply by the frequencies of the words that appear in it. For example, take
the two sentences below:
• John likes to play basketball.
• Mary prefers to play soccer.
Now, we build a vocabulary of words. So, our vocabulary would be the following set:
{John, Mary, likes, prefers, to, play, soccer, basketball}. Since we have 8 distinct words
in our vocabulary, we can use 8-dimensional vectors where each dimension corresponds
to a word and the value in each dimension represents the frequency. For example,
• (1, 0, 1, 0, 1, 1, 0, 1)
• (0, 1, 0, 1, 1, 1, 1, 0).
If we take the first vector, we see a 1 in the 1st, 3rd, 5th, 6th, and 8th dimensions,
corresponding to one occurrence each of John, likes, to, play, and basketball.
This is one way to translate text into a vector representation.
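To make this concrete, here is a minimal Python sketch of a bag-of-words vectorizer. The function name and the sorted vocabulary ordering are our own illustrative choices, not part of the starter code:

def bag_of_words(texts):
    # Build a vocabulary from every distinct word in the corpus.
    vocab = sorted({word for text in texts for word in text.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for text in texts:
        vec = [0] * len(vocab)
        for word in text.lower().split():
            vec[index[word]] += 1  # count word frequency
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["John likes to play basketball",
                               "Mary prefers to play soccer"])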
While bag-of-words is a fairly simple approach, it has several drawbacks: for example,
it does not preserve sentence structure and syntax that could be useful as features in
a model. In fact, if the features extracted from the text are not meaningful, the
performance of the machine learning model can suffer greatly. Therefore, it is highly
important to construct and preserve relevant features that will be used as input to
the perceptron model. For example, another widely used technique is n-gram
modeling, which you can learn more about by reading section 3.1 of the following pdf:
https://web.stanford.edu/~jurafsky/slp3/3.pdf.
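As an illustration, a word-level bigram extractor might look like the following hypothetical helper (not part of the starter code):

def ngrams(text, n=2):
    # Slide a window of n words across the tokenized text.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# ngrams("John likes to play basketball") returns
# [('john', 'likes'), ('likes', 'to'), ('to', 'play'), ('play', 'basketball')]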
You are encouraged to explore and try out several different feature engineering
processes and find the one that maximizes the performance of your model, including
feature processing techniques that are not listed as examples here.
3. Implementation The program Perceptron.py is where you will implement the
perceptron algorithm. It contains a class called Perceptron that is partially set up to
mimic the perceptron model. This class has two objectives:
• Train the perceptron on the train set.
• Predict labels for a given input data set.
As part of this assignment, you will have to implement a perceptron that trains on the
logistic loss using stochastic gradient descent. Additionally, there are two hyperparameters
that you must experiment with: the learning rate and the maximum number of iterations.
The learning rate controls how much each gradient update affects the weight vector, and
the maximum number of iterations controls how many examples the training algorithm
cycles through.
Essentially, your program performs two steps. First, it takes as input the .csv data
file that contains the train data and constructs the perceptron model using stochastic
gradient descent. Then, you pass data to the perceptron to predict labels and test the
model’s performance.
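For reference, a single stochastic-gradient step on the logistic loss (with labels in {0, 1}) could look like the sketch below, using numpy for brevity. The helper names are our own; how you organize this inside the Perceptron class is up to you:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(weights, x, y, lr):
    # For logistic loss L = -[y log p + (1 - y) log(1 - p)] with
    # p = sigmoid(w . x), the gradient with respect to w is (p - y) * x.
    p = sigmoid(np.dot(weights, x))
    return weights - lr * (p - y) * x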
You have the freedom to modify most of the starter code and implement it however you
feel is optimal. However, there are certain requirements you must follow, detailed
explicitly in the comments of the starter code. As a high-level overview: you may not
change the inputs to the __init__ function, and the input parameters and return
statements of train() and predict() may not change. You can customize the code inside
these functions, but these requirements must hold. The reason for this is to streamline
the evaluation of your program at test time on datasets that your model hasn’t seen
before.
The starter code and the comments contain explicit instructions on the requirements
and what pieces of the program you can modify.
4. Evaluation There is another program called Execution.py that is given to you to
evaluate the performance of your model on the datasets. Essentially, it instantiates
the Perceptron class, trains it on the train data set, produces predicted labels, and
then evaluates the accuracy and loss of those labels as compared to the ground-truth.
In this program, there are two functions you must implement: split_dataset and
logistic_loss. The function split_dataset takes the full csv dataset file, and you must
specify how to split it into train, validation, and test sets. The other function,
logistic_loss, computes the cumulative loss between the predicted labels and the true
labels. It is important that you code the logistic loss correctly so that you can
accurately assess the performance of the perceptron on the data.
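One common formulation of the cumulative logistic loss is sketched below; the clipping of predictions is a numerical-stability choice of ours, and the exact signature must match what the starter code specifies:

import numpy as np

def logistic_loss(y_true, y_pred):
    # Cumulative cross-entropy between true labels in {0, 1} and predicted
    # probabilities; clip predictions so the logarithms stay finite.
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))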
You can simply run Execution.py from the command line in the following way:
python3 Execution.py
It is important to note that the python command assumes that a python3 interpreter
is being used. Additionally, it is important that Execution.py, Perceptron.py, and
the .csv files stay in the same folder because Execution.py loads the data files
assuming they are in the same directory.
It is important to not alter the code for specific sections in this program because
these exact functions will be used to evaluate your code and model. More detailed
information is contained in the comments of the starter code.
5. Report You will need to submit a report detailing the results of your algorithm. Please
label the following sections in the order specified. The requirements for the report are:
(a) Explain how you split the dataset and why you selected those splits for the train-
ing, validation, and test sets.
(b) Explain your feature engineering processes, the different strategies you used, and
which method you ultimately used. Also, give an explanation and a simple ex-
ample of how your strategy worked in processing the raw dataset.
(c) Show how you selected your hyperparameter values. For each hyperparameter,
plot the values of the hyperparameter on the x-axis and a training objective on
the y-axis; for example, plot values of the learning rate (x-axis) vs. accuracy
(y-axis). Do this for both accuracy and logistic loss. Since 5-fold cross validation
is used to select these hyperparameters, plot each fold as a separate curve on these
graphs (see the plotting sketch after this list). Repeat both plots for the maximum
number of iterations. These plots only need to be produced for the validation set.
(d) Report the optimal values for the hyperparameters and explain how you arrived
at those values. Additionally, report the performance of your perceptron on the
test set with those hyperparameter values for both accuracy and logistic loss.
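For the hyperparameter plots in item (c), a minimal matplotlib sketch with one curve per fold is given below; the function name and data layout are our own assumptions:

import matplotlib.pyplot as plt

def plot_sweep(lr_values, fold_accuracies):
    # fold_accuracies[k] holds one validation accuracy per learning rate
    # in lr_values, measured on fold k of the 5-fold cross validation.
    for k, accs in enumerate(fold_accuracies):
        plt.plot(lr_values, accs, marker="o", label=f"fold {k + 1}")
    plt.xlabel("learning rate")
    plt.ylabel("validation accuracy")
    plt.legend()
    plt.show()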