
CS578 Fall 2021

HW 2

Handed Out: October 25, 2021 Due: November 7, 2021

Instructions for submission

1. Conceptual Part: Submit a single PDF on Gradescope with all your answers.
You should have been added, but if not the Entry Code is N8NWDK. Make
sure you select the page corresponding to the beginning of each answer, else
points might be deducted. Your homework must be typed and must contain your
name and Purdue ID. If you are using a late day, please indicate it at the top of the
document. Do not send the instructors emails about Late Days.

2. Coding part: To submit your assignment, log into data.cs.purdue.edu (physically
go to the lab or use ssh remotely) and follow these steps:

(a) Place all of your code (Execution.py, Perceptron.py, and any other files you
used) in a folder titled cs578-hw2. Make sure your folder is called this!

(b) Change directory to outside of cs578-hw2/ folder (run cd .. from inside cs578-hw2/
folder)

(c) Execute the following command to turnin your code: turnin -c cs578 -p hw2Fall2021
cs578-hw2

(d) To overwrite an old submission, simply execute this command again.

(e) To verify the contents of your submission, execute this command: turnin -v -c
cs578 -p hw2Fall2021.
Do not forget the -v option, else your submission will be overwritten with an
empty submission.

Questions

1. Given n boolean variables (x1, …, xn), we define our target classification function f(·)
as f(x) = 1 if at least 3 variables are active, and f(x) = 0 otherwise. For n = 5, show how
this function can be represented as (1) a Boolean function and (2) a linear function.

2. Let CONB be the set of all different monotone conjunctions defined over n boolean
variables. What is the size of CONB?

3. Let CONL be the set of all linear functions defined over n boolean variables that are
consistent with the functions in CONB . We say that two functions fb ∈ CONB and
fl ∈ CONL are consistent if ∀x; fb(x) = fl(x). What is the size of CONL?

4. Define in one sentence: Mistake bound (your answer should include all the components
described in class).


5. Suggest a mistake bound algorithm for learning Boolean conjunctions (hint: recall
the elimination algorithm for monotone conjunctions). Show that your algorithm is a
mistake bound algorithm for Boolean conjunctions.

6. Given a linearly separable dataset consisting of 1000 positive examples and 1000
negative examples, we train two linear classifiers using the perceptron algorithm. We
provide the first classifier with a sorted dataset in which all the positive examples
appear first, followed by the negative examples. The second classifier is trained
by randomly selecting examples at each training iteration. (1) Will both classifiers
converge? (2) What will be the training error of each classifier?

7. We showed in class that using the Boolean kernel function K(x, y) = 2^(same(x, y)) we can
express all conjunctions over the attributes of x and y (with both the positive and
negative literals). This representation can be too expressive. Suggest a kernel function
K(x, y) that can express all conjunctions of size at most k (where k is a parameter,
k < n).

Programming Assignment

In this part of the assignment you are required to implement the perceptron algorithm and observe its performance on a binary sentiment classification task.

1. Dataset/Task The task is binary sentiment classification: the perceptron takes a document of text as input and produces a binary output, either positive (1) or negative (0). The dataset is essentially a table of two columns, where the first column contains a piece of text and the second column contains its sentiment. Ideally, you want your trained perceptron to accurately differentiate between positive and negative sentiment by looking at the words used in the text. You will be given a single csv file, called data.csv, that contains all the data for this task.

One component of this task is to split the data into train, validation, and test sets. The splits are determined by you. You are required to use 5-fold cross validation when training your perceptron. You should not use the test set when picking hyperparameter values. Simply report the performance of the perceptron with the optimal hyperparameter values on the test set after observing the performance on the training and validation sets.

To program the splitting of the dataset, take a look at the function split_dataset in Execution.py. You will have to program how the original dataset is split into train, validation, and test datasets. It is imperative that you maintain the inputs and outputs of the function as described in the program file. Additionally, each of the dataset outputs must have a column called ‘Label’ which contains either a 0 or a 1 to label negative or positive sentiment, respectively. For example, look at Fig. 1 below to see a sample format.
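As one illustration of the splitting and cross-validation requirements above, here is a minimal sketch. The 80/10/10 fractions, the fixed seed, and the helper name k_fold_indices are illustrative choices of mine, not part of the starter code; the actual split_dataset signature in Execution.py takes precedence.

```python
import pandas as pd

def split_dataset(df, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the full dataset and split it into train/validation/test
    frames, each keeping the 'Text' and 'Label' columns."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test

def k_fold_indices(n_examples, k=5):
    """Partition example indices into k folds for cross validation; each
    fold serves once as the validation split while the rest train."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_examples
        folds.append((indices[:start] + indices[end:], indices[start:end]))
    return folds
```

Whatever fractions you choose, every output frame must retain the ‘Label’ column so the grading code can extract the ground truth.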
It’s important to keep this format because when your programs are graded, the datasets will be passed in with this format, and the ground-truth labels for every dataset will be extracted by simply calling the column ‘Label’ (look at Execution.py for more information).

Figure 1: Sample pandas dataframe that contains the column ‘Label’, which represents the sentiment for each row in column ‘Text’.

2. Feature Selection An important task within this homework is to find an effective way to turn text into a meaningful quantitative representation to be used by the perceptron. One example is a representation called bag-of-words, which represents a piece of text by the frequencies of the words that appear in it. For example, take the two sentences below:

• John likes to play basketball.
• Mary prefers to play soccer.

Now we build a vocabulary of words, which would be the following set: {John, Mary, likes, prefers, to, play, soccer, basketball}. Since we have 8 distinct words in our vocabulary, we can use 8-dimensional vectors where each dimension corresponds to a word and the value in each dimension represents that word’s frequency. For example,

• (1, 0, 1, 0, 1, 1, 0, 1)
• (0, 1, 0, 1, 1, 1, 1, 0).

If we take the first vector, we see that there is a 1 in the 1st, 3rd, 5th, 6th, and 8th dimensions, which corresponds to one occurrence each of John, likes, to, play, and basketball. This is one such representation for translating text into a vector.

While bag-of-words is a fairly simple approach, it has drawbacks: it does not preserve sentence structure or syntax that could be useful as features in a model. If the features extracted from the text are not meaningful, the performance of the machine learning model can suffer greatly. Therefore, it is highly important to construct and preserve relevant features that will be used as input to the perceptron model.
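The bag-of-words scheme described above can be sketched in a few lines; this is only an illustration (the function names are mine, not from the starter code), and it orders the vocabulary alphabetically, so its dimensions will be permuted relative to the example vectors above.

```python
from collections import Counter

def build_vocab(texts):
    """Collect the distinct lowercase words across all documents and
    assign each word a fixed dimension index."""
    words = sorted({w for text in texts for w in text.lower().split()})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(text, vocab):
    """Map one document to a word-frequency vector over the vocabulary."""
    vec = [0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec
```

On the two example sentences this yields a vocabulary of 8 distinct words, and each sentence maps to an 8-dimensional frequency vector with five 1s.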
For example, another widely used technique is n-gram modeling, which you can learn more about by reading Section 3.1 of the following pdf: https://web.stanford.edu/~jurafsky/slp3/3.pdf. You are encouraged to explore and try out several different feature engineering processes and find the one that maximizes the performance of your model. You are also highly encouraged to pursue other feature processing techniques that are not listed as examples here.

3. Implementation The program Perceptron.py is where you will implement the perceptron algorithm. There is a class called Perceptron that is partially set up to mimic the perceptron model. This class has two objectives:

• Train the perceptron on the train set.
• Predict labels on a given dataset passed as input.

As part of this assignment, you will have to implement a perceptron that trains on logistic loss using stochastic gradient descent. Additionally, there are two hyperparameters that you must experiment with: the learning rate and the maximum number of iterations. The learning rate controls how much each gradient update affects the weight vector, and the maximum number of iterations controls how many examples the training algorithm cycles through.

Essentially, your program performs two steps. First, it takes as input the .csv data file that contains the train data and constructs the perceptron model using stochastic gradient descent. Then, you pass data to the perceptron to predict labels and test the model’s performance.

You have the freedom to modify most of the starter code and implement it how you feel is optimal. However, there are certain requirements, detailed explicitly in the comments of the starter code, which you must follow. As a high-level overview, you may not change the inputs of the init function, and the input parameters and return statements of train() and predict() may not change.
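A minimal sketch of the training scheme described above, logistic loss minimized by stochastic gradient descent, is shown below. It assumes numeric feature vectors and uses its own class name to emphasize that it is not the starter code's Perceptron class; the required signatures in Perceptron.py take precedence.

```python
import numpy as np

class SketchPerceptron:
    """Toy logistic-loss perceptron; not the starter-code Perceptron class."""

    def __init__(self, n_features, learning_rate=0.1, max_iter=1000):
        self.w = np.zeros(n_features)   # weight vector
        self.b = 0.0                    # bias term
        self.learning_rate = learning_rate
        self.max_iter = max_iter

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(self, X, y):
        rng = np.random.default_rng(0)
        for _ in range(self.max_iter):
            i = rng.integers(len(X))              # one random example per step
            p = self._sigmoid(X[i] @ self.w + self.b)
            grad = p - y[i]                       # dLoss/dz for logistic loss
            self.w -= self.learning_rate * grad * X[i]
            self.b -= self.learning_rate * grad

    def predict(self, X):
        return (self._sigmoid(X @ self.w + self.b) >= 0.5).astype(int)
```

The per-example update is the defining property of stochastic gradient descent here: each step uses the gradient of the logistic loss on a single randomly chosen example rather than the whole training set.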
You can customize the code inside these functions, but these requirements must hold. The reason for this is to streamline the process of evaluating your programs at test time on datasets that your model hasn’t seen before. The starter code and its comments contain explicit instructions on the requirements and which pieces of the program you can modify.

4. Evaluation There is another program, called Execution.py, that is given to you to evaluate the performance of your model on the datasets. Essentially, it instantiates the Perceptron class, trains it on the train dataset, produces predicted labels, and then evaluates the accuracy and loss of those labels compared to the ground truth.

In this program, there are two functions which you must implement: split_dataset and logistic_loss. The function split_dataset takes in the full csv dataset file, and you must specify how to split it into train, validation, and test sets. The other function, logistic_loss, computes the cumulative loss between the predicted labels and the true labels. It is important that you code the logistic loss function correctly so that you can accurately assess the performance of the perceptron on the data.

You can simply run Execution.py from the command line in the following way:

python3 Execution.py

Note that this command assumes a python3 interpreter is being used. Additionally, it is important that Execution.py, Perceptron.py, and the .csv files stay in the same folder, because Execution.py will load the data files assuming they are in the same directory. It is important not to alter the code for the marked sections in this program, because these exact functions will be used to evaluate your code and model. More detailed information is contained in the comments of the starter code.

5. Report You will need to submit a report detailing the results of your algorithm. Please have the following sections labeled in the order specified.
Below are the requirements for the report:

(a) Explain how you split the dataset and why you selected those splits for the training, validation, and test sets.

(b) Explain your feature engineering process, the different strategies you used, and which method you ultimately chose. Also give an explanation and a simple example of how your strategy processes the raw dataset.

(c) It’s important to show how you selected your hyperparameter values. For each hyperparameter, plot the values of the hyperparameter on the x-axis and a training objective on the y-axis. For example, plot values of the learning rate (x-axis) vs. accuracy (y-axis). Do this for both accuracy and logistic loss. Since 5-fold cross validation is used to select these hyperparameters, plot each of the folds as a separate curve on these graphs. Repeat these two plots for the maximum number of iterations. These plots only need to be done for the validation set.

(d) Report the optimal values for the hyperparameters and explain how you arrived at those values. Additionally, report the performance of your perceptron on the test set with those hyperparameter values, for both accuracy and logistic loss.
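For reference when implementing the logistic_loss function in Execution.py and reporting it in (c) and (d), a minimal sketch of a cumulative logistic loss is given below. It assumes the predictions are probabilities of the positive class; whether your predict() emits probabilities or hard 0/1 labels depends on your implementation, so adapt accordingly.

```python
import numpy as np

def logistic_loss(pred_probs, true_labels):
    """Cumulative logistic (cross-entropy) loss over a dataset; clips
    probabilities away from 0 and 1 to avoid log(0)."""
    p = np.clip(np.asarray(pred_probs, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(true_labels, dtype=float)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

For example, predicting probability 0.5 for both a positive and a negative example gives a cumulative loss of 2 ln 2.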