Assignment 3: Text Classification
Niranjan Balsubramanian, Tianyi Zhao, and Tao Sun CSE 352 Spring 2021
1 Due Date and Collaboration
The assignment is due on April 9 at 11:59 pm. You have a total of four extra days for late submission across all the assignments. You can use all four days for a single assignment or one day for each assignment – whatever works best for you. Submissions between the fourth and fifth late day will be docked 20%. Submissions between the fifth and sixth late day will be docked 40%. Submissions after the sixth day will not be evaluated.
You can collaborate to discuss ideas to understand the concepts and math.
You should NOT collaborate at the code or pseudo-code level. This includes all implementation activities: design, coding, and debugging.
You should NOT use any code that you did not write to complete the assignment.
The homework will be cross-checked. Do not cheat! It is better to complete the homework partially than to copy code and receive a 0 for the whole homework.
2 Goals
Naive Bayes, Logistic Regression, Support Vector Machines, and Random Forests
This is an analysis assignment, where you will investigate the utility of different learning algorithms for a text classification task. You will be using the implementations available in the scikit-learn Python library here: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
The task is to learn classifiers that can classify input texts into one of K classes. You will create classifiers with three configurations:
1. Unigram Baseline (UB) – Basic sentence segmentation and tokenization. Use all words.
2. Bigram Baseline (BB) – Use all bigrams. (e.g., I ran a race => I ran, ran a, a race.)
3. My best configuration (MBC) – You can create your own configuration based on the design choices below.
3 Dataset
For this assignment we’re using the “20 Newsgroups” dataset, which contains 20,000 newsgroup documents partitioned into 20 news categories. We treat each of the categories as a class. We will use only 4 classes for this assignment.
1. rec.sport.hockey
2. sci.med
3. soc.religion.christian
4. talk.religion.misc
In each class, there are two sets of documents, one for training and one for testing. The format of each document is as follows:
Header – Consists of fields such as …
Body – …
Note that you can ignore the header when you’re extracting features.
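Since the header can be ignored, one simple option is to drop everything up to the first blank line. This is a minimal sketch, assuming the documents follow the usual newsgroup/e-mail layout in which a blank line separates header and body; the example string is illustrative only.

def strip_header(document):
    # Newsgroup posts follow the e-mail convention: header fields first,
    # then a blank line, then the body. Keep only the body.
    header, sep, body = document.partition("\n\n")
    return body if sep else document  # no blank line found: keep the whole text

raw = "From: someone@example.com\nSubject: example\n\nThe actual body text."
print(strip_header(raw))  # -> The actual body text.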
4 Design Choices for your best configuration
4.1 Feature representations
scikit-learn provides a few implementations of feature extractors. The extractors first segment the text into sentences and tokenize them into words. Each document is then represented as a vector based on the words that occur in it.
1. CountVectorizer – Uses the number of times each word was observed.
2. TfidfVectorizer – Uses relative frequencies normalized by the inverse of the number of documents in which the word was observed.
See Section 4.2.3 in http://scikit-learn.org/stable/modules/feature_extraction.html for details. You can also explore many different choices for getting a good representation. Preprocessing often has a big impact in text tasks. Some choices include the following (a short sketch combining these options with the vectorizers above appears after the list):
a. Lowercase the text and filter out stopwords (e.g., I ran a race => I, ran, race).
b. Apply stemming (e.g., running, runs, ran => run). You can find a stemmer in NLTK. Use the Porter stemmer shown here: http://www.nltk.org/howto/stem.html
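A minimal sketch of how these pieces could fit together with scikit-learn and NLTK; the toy docs list and the stem_tokenizer helper are illustrative, not required parts of the assignment.

import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokenizer(text):
    # Lowercase, keep word tokens of 2+ characters, then apply the Porter stemmer.
    return [stemmer.stem(tok) for tok in re.findall(r"\b\w\w+\b", text.lower())]

docs = ["I ran a race", "Running and runs are fun"]  # toy stand-ins for real documents

# (1) Raw unigram counts, lowercased, with English stopwords removed.
count_vec = CountVectorizer(lowercase=True, stop_words="english")
X_counts = count_vec.fit_transform(docs)

# (2) TF-IDF over unigrams and bigrams, using the stemming tokenizer above.
tfidf_vec = TfidfVectorizer(tokenizer=stem_tokenizer, ngram_range=(1, 2))
X_tfidf = tfidf_vec.fit_transform(docs)

print(X_counts.shape, X_tfidf.shape)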
4.2 Feature selection
This is often key to the performance of any ML-based application. Depending on the type of algorithm used, you will have different choices for doing feature selection. For instance, you can do external feature selection, where you find the most informative features, or you can try L1, L2, or Lasso regularizers. You should read this page and pick any two regularizations: http://scikit-learn.org/stable/modules/feature_selection.html
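As an illustration only, two of the available options are univariate chi-squared selection and model-based selection with an L1-penalized linear model; the random X and y below are stand-ins for a vectorized document matrix and its labels.

import numpy as np
from sklearn.feature_selection import SelectKBest, SelectFromModel, chi2
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
y = rng.randint(0, 4, size=200)          # stand-in for the 4 class labels
X = rng.rand(200, 50)                    # stand-in for a (documents x features) matrix
X[:, 0] += y                             # make one feature clearly class-dependent

# (1) "External" selection: keep the 20 features most informative by chi-squared.
X_chi2 = SelectKBest(chi2, k=20).fit_transform(X, y)

# (2) Model-based selection: an L1-penalized linear SVM drives weak feature
#     weights to zero; SelectFromModel keeps only the surviving features.
l1_svm = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000).fit(X, y)
X_l1 = SelectFromModel(l1_svm, prefit=True).transform(X)

print(X_chi2.shape, X_l1.shape)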
4.3 Hyperparameters
For most learning algorithms there are many different options that can be tuned. Finding the best set of options can be tricky and may require many rounds of exploration. Try the following:
1. Naive Bayes – No hyperparameters.
2. Logistic Regression – Regularization constant, number of iterations.
3. SVM – Regularization constant; linear, polynomial, or RBF kernels.
4. RandomForest – Number of trees and number of features to consider.
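One way to explore such settings is scikit-learn's GridSearchCV over a vectorizer-plus-classifier pipeline. This is a sketch; the toy documents and the parameter grid below are examples, not required values.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

docs = ["hockey game tonight", "new medical study", "church service times",
        "talk about religion"] * 10        # toy stand-ins for training documents
labels = [0, 1, 2, 3] * 10                 # toy stand-ins for the class labels

pipeline = Pipeline([("vec", TfidfVectorizer()), ("clf", SVC())])
param_grid = {"clf__C": [0.1, 1, 10],              # regularization constant
              "clf__kernel": ["linear", "rbf"]}    # kernel choice

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)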
5 Implementation
5.1 Requirements
1. You must implement in Python 3.7+. You can use the scikit-learn library for the algorithms.
2. You CAN discuss with your friends which algorithms to use and how to understand them, but you must not share or discuss code.
3. All code will be run through plagiarism detection software.
4. You should cite any web resources or friends you used/discussed with for pseudo-code or the algorithm itself. Failure to do so will result in an F.
5. Your code should be briefly documented at the method level so that it can be read and understood by the TAs.
6 What should you turn in?
• Basic Comparison with Baselines (50 points) – For all four methods (NB, LR, SVM, and RF), you should run both unigram and bigram baselines. You should turn in two results:
a UB_BB.py (20 points) – should run all 4 methods, each with 2 configurations.
python UB_BB.py
b (20 points) A learning curve (LC) result, where you show the performance of each classifier only with the unigram representation. The learning curve is a plot of the performance of the classifier (F1-score on the y-axis) on the evaluation data when trained on different amounts of training data (size of training data on the x-axis). You can choose different training sizes for this figure. For example, you can randomly sample 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of the training data. Note the output file only contains the results when using 100% of the training data. Plot all four learning curves in a single figure, and include the figure in the report. Use proper line styles and legends to clearly distinguish the four methods. An example plot is shown below (the values are not real). A code sketch of one way to produce these runs appears after this list.
[Example figure: learning curves for NB, LR, SVM, and RF; x-axis: size of training data (%) from 10 to 100; y-axis: F1-score from 0.0 to 1.0.]
c (10 points) Describe your findings and make arguments to explain why this is the case. You can use any online source to understand the relative performance of algorithms for varying training data sizes. Make arguments in your own words. Provide a citation.
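For items a and b above, a minimal sketch of the kind of loop involved is shown here; the toy train/test lists, the specific classifier settings, and the macro-averaged F1 are illustrative assumptions, not requirements.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real training and evaluation documents and labels.
train_texts = ["hockey game score", "doctor medical study", "church sunday sermon",
               "debate about religion"] * 10
train_labels = [0, 1, 2, 3] * 10
test_texts = ["the hockey team won", "a new medical trial"]
test_labels = [0, 1]

classifiers = {"NB": MultinomialNB(),
               "LR": LogisticRegression(max_iter=1000),
               "SVM": LinearSVC(),
               "RF": RandomForestClassifier()}
configs = {"UB": (1, 1), "BB": (2, 2)}   # unigram-only vs. bigram-only n-gram ranges

def macro_f1(clf, ngram_range, texts, labels):
    # Fit a CountVectorizer+classifier pipeline on the (sub)training set and
    # return macro-averaged F1 on the evaluation data.
    model = make_pipeline(CountVectorizer(ngram_range=ngram_range), clf)
    model.fit(texts, labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")

# Basic comparison: every method with both configurations.
for name, clf in classifiers.items():
    for cfg, ngrams in configs.items():
        print(name, cfg, round(macro_f1(clf, ngrams, train_texts, train_labels), 3))

# Learning curve: repeat the unigram runs on random 10%, 20%, ..., 100% subsets.
rng = np.random.RandomState(0)
for pct in range(10, 101, 10):
    size = max(1, len(train_texts) * pct // 100)
    idx = rng.choice(len(train_texts), size=size, replace=False)
    sub_texts = [train_texts[i] for i in idx]
    sub_labels = [train_labels[i] for i in idx]
    print(pct, round(macro_f1(MultinomialNB(), (1, 1), sub_texts, sub_labels), 3))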
• My best configuration (50 points) – Pick the best performing classification method from the above experiments and then use the design choices to find the best possible result on the evaluation data. You will create a model that we can run on a hidden test data.
a MBC_exploration.py (20 points) – For each of the four methods (NB, LR, SVM, and RF), pick at least two design choices and output evaluation results to a file.
python MBC_exploration.py
The output file should contain 4 lines, one per method, with each line ending in the corresponding F1-score. You can name the output file as you see fit; copy the content of your output file into your report.
b MBC_final.py (20 points) – Export your best-configuration trained model (you can use any combination of design choices) and give us code that we can run to get the performance of your model on a hidden test set. You can call helper functions from MBC_exploration.py if you want. The organization of the test data we use is the same as the evaluation data you are given. (A rough sketch of one way to export and reload a model appears after this list.)
python MBC_final.py
The output file should contain a single line, evaluated on the test data, in the same format as the previous ones. The grading for this part is decided by comparing all results among students; that is, the threshold for grading this question is based on the results of the whole class.
c Explanation (10 points) – Explain your result based on your best understanding of the configuration options. It should be no longer than 2 paragraphs. You can use any online source to understand the options and what they mean. Provide citations for the sources.
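For item b above, a rough sketch of one way to export a trained pipeline and reload it at test time with joblib; the file name best_model.joblib, the toy documents, and the particular pipeline are placeholders, not requirements.

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training side (e.g., in MBC_exploration.py): fit the chosen pipeline and save it.
train_texts = ["hockey game last night", "new vaccine trial results"]  # placeholders
train_labels = [0, 1]
best_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
best_model.fit(train_texts, train_labels)
joblib.dump(best_model, "best_model.joblib")

# Test side (e.g., in MBC_final.py): reload the saved pipeline and predict on new data.
loaded = joblib.load("best_model.joblib")
print(loaded.predict(["hockey highlights from last night"]))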
• Report (PDF) – Include all results/observations above into your report, named SBUID-LastName-FirstName-A3.pdf.
Please put all 4 files (the report, UB_BB.py, MBC_exploration.py, and MBC_final.py) in the same folder. Your submission should be one zip file named SBUID-LastName-FirstName-A3.zip, containing all source code and the report under a single folder named SBUID-LastName-FirstName-A3, with no other folders in the zip file. Please note that the file names for the source code must be exactly the same as instructed.
7 Grading
See the previous section for the point breakdown.