
Last Modified: April 4, 2018

BANA 290: Machine Learning for Text: Spring 2018

Homework 1: Rule-Based Classification
Conal Sathi and Sameer Singh (with help from Yoshitomo Matsubara)

https://canvas.eee.uci.edu/courses/9097

The first programming assignment will familiarize you with basic text processing methods and the use of prebuilt lexicons and rules for text classification. Submissions are due by midnight on April 13, 2018.

1 Task: Sentiment Classification

The primary objective for the assignment is to predict the sentiment of a movie review. In particular, we will be providing you with a dataset containing the text of the movie reviews from IMDB, and for each review, you have to predict whether the review is positive or negative. We will also provide some code to help you read and write the output files.

1.1 Data

The data for this task is available on the Kaggle website1. The primary data file is named data.zip, which contains the following:

◦ train.csv : List of files and associated sentiment labels, for evaluating your classifier.
◦ train/ : Folder of text files containing the reviews that are part of the labeled data.
◦ test/ : Folder of text files containing reviews that are not labeled.
◦ lexicon/ : Two sentiment lexicons. The code for reading them is included.
◦ test-basic.csv

1.2 Kaggle

Kaggle is a website that hosts machine learning competitions, and we will be using it to evaluate and compare the accuracy of your classifiers. We know the true sentiment for each of the unlabeled reviews, which we will use to evaluate your submissions, and thus your submission file to Kaggle should contain a predicted label for all the unlabeled reviews. In particular, the submission file should have the following format (code already does this):

◦ Start with a single header line: Fileindex,Category
◦ For each of the unlabeled reviews (sorted by name), there is a line containing an increasing integer index (i.e. line number - 1), then a comma, and then the string label predicted for that review.
◦ See the Kaggle site for an example.
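As a concrete illustration of the format above, the submission file can be written as sketched below. The provided code already generates this file, so this is only an illustration; the `predictions` list and the label strings are placeholders, not the dataset's actual labels.

```python
# Sketch of the submission-file format; run.py already produces this,
# so the snippet is illustrative only.
# `predictions` is a hypothetical list: one predicted label per unlabeled
# review, in filename order (label strings here are placeholders).
predictions = ["POSITIVE", "NEGATIVE", "POSITIVE"]

with open("test.csv", "w") as f:
    f.write("Fileindex,Category\n")
    # The index is line number - 1: the header is line 1, so indices start at 1.
    for index, label in enumerate(predictions, start=1):
        f.write("{},{}\n".format(index, label))
```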

You can make at most three submissions each day, so we encourage you to test your submission files early and observe the performance of your system. By the end of the submission period, you will have to select two submissions, the best of which will be judged as your final submission.

1.3 Source Code

Some initial code is available for you to develop your submission, at https://canvas.eee.uci.edu/courses/9097/assignments/182966. In order to run the code, unzip data.zip to a local folder, say data/, and execute:

  python3 run.py data test.csv

The file run.py contains methods for loading the data and lexicons, and for calling the methods to run and evaluate your classifier. It also contains the code to output the submission file from your classifier (test.csv in the above example) that you will submit to Kaggle2. The file sentiment.py contains the skeleton of your classifier; this is the only file you should need to modify.

1 https://www.kaggle.com/c/sentiment-analysis-sp18/
2 Try submitting this generated file even before making any other changes.

Homework 1 UC Irvine 1/ 2


2 What to submit?

Prepare and submit a single write-up (PDF, maximum 2 pages) and sentiment.py (compressed in a single zip file) to Canvas. Do not include your student ID number, since we might share the write-up with the class if it's worth highlighting. The write-up and code should address the following.

2.1 Preliminaries (5 points)

At the top of your write-up, include your Kaggle username, and the accuracy that your best submission obtained on Kaggle. You do not need to include any other details such as name, UCINet Id, etc.

2.2 Rule-Based Classifier (40 points)

Your main goal is to improve the basic classifier provided in sentiment.py . For this, you should consider doing both of the following:

◦ Lexicons: We have provided two lexicons for your use. Each lexicon is a dictionary containing words as keys and the sentiment as the value. For the Harvard Inquirer lexicon, inqtabs_dict (3), the value is a sentiment label: 0 for negative and 1 for positive. For the SentiWordNet lexicon, swn_dict (4), each value is a pair of positive and negative scores, respectively. Use them as you see fit.

◦ Regular Expressions: After looking at some reviews, you may have ideas for rules on the review text that you think will help predict the sentiment. Implement them using if/then and regular expressions.

Implement your suggestions in classify() in sentiment.py, and describe them in a few sentences in your report. The primary evaluation for this part will be the performance of your classifier, combined with how creative/interesting your proposed ideas are.
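To make the two suggestions concrete, here is a minimal sketch of how a classify() implementation might combine lexicon lookups with a regular-expression rule. The dictionary shapes follow the description above, but the exact signature of classify() in sentiment.py, the label strings, and the specific negation rule are assumptions, not the expected solution.

```python
import re

# Minimal rule-based classifier sketch. Assumed dictionary shapes:
# inqtabs_dict maps word -> 0/1 label; swn_dict maps word -> (pos, neg) scores.
# The signature and label strings are assumptions, not those of sentiment.py.
def classify(review_text, inqtabs_dict, swn_dict):
    pos, neg = 0.0, 0.0
    for word in re.findall(r"[a-z']+", review_text.lower()):
        if word in inqtabs_dict:
            # 1 = positive, 0 = negative in the Harvard Inquirer lexicon.
            if inqtabs_dict[word] == 1:
                pos += 1.0
            else:
                neg += 1.0
        if word in swn_dict:
            p, n = swn_dict[word]
            pos += p
            neg += n
    # Example if/then regex rule: negation near a positive word (hypothetical).
    if re.search(r"\b(not|never|no)\s+\w*\s*(good|great|enjoyable)\b",
                 review_text.lower()):
        neg += 2.0
    return "POSITIVE" if pos >= neg else "NEGATIVE"
```

A rule like the negation pattern shows the general shape: inspect the raw text with a regular expression, then adjust the lexicon-based scores accordingly.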

2.3 Examples (30 points)

In order to aid analysis, you also need to figure out the errors being made by your classifier, i.e. split each prediction into four categories: true positives, true negatives, false positives, and false negatives. If you look at get_error_type() in sentiment.py, there is an incorrect implementation of this method. Fix this code to print the appropriate examples, which will result in 4 files full of reviews, called fp.txt, fn.txt, tp.txt, and tn.txt. Include 2-3 examples from the false positives and false negatives in your report.
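For reference, the four categories can be sketched as below, treating the positive sentiment as the target class. The label strings and the signature of the real get_error_type() in sentiment.py are assumptions here.

```python
# Sketch of the four prediction categories (assumed label strings);
# the actual get_error_type() in sentiment.py may differ in signature.
def get_error_type(gold, predicted):
    if predicted == "POSITIVE":
        # Predicted positive: true positive only if the gold label agrees.
        return "tp" if gold == "POSITIVE" else "fp"
    # Predicted negative: true negative only if the gold label agrees.
    return "tn" if gold == "NEGATIVE" else "fn"
```

Each review would then be appended to the file matching its category (fp.txt, fn.txt, tp.txt, or tn.txt).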

2.4 Analysis (20 points)

Analyze the above false positives and false negatives in your write-up. In particular, in a few sentences, describe what is lacking in your approach, i.e. why you think the errors exist. Write a sentence or two about how you would address them if you had more time. You will be evaluated on how well you were able to identify the problems, and on the creativity of your proposed future solutions.

3 Statement of Collaboration (5 points)

It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below. Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.

All students are required to follow the academic honesty guidelines posted on the course website. For programming assignments, in particular, we encourage students to organize (perhaps using Piazza) to discuss the task descriptions, requirements, bugs in our code, and the relevant technical content before they start working on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes, referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict the discussion to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.

3 http://www.wjh.harvard.edu/~inquirer/
4 http://sentiwordnet.isti.cnr.it/

