Project Report
1 Problem statement
For this course project, I need to develop the best possible sentiment classification model based on the “Pros and Cons” dataset. This dataset was crawled from reviews on epinions.com covering products such as digital cameras, printers, and strollers. In the training dataset, each example carries an accurate target label (Pro or Con); in the test dataset, each example has only a dummy label. My task is to train a model on the training dataset and achieve the best generalization performance on the test dataset.
2 Background
The Web has dramatically changed the way that consumers express their opinions. They can now post reviews of products at merchant sites and express their views on almost anything in Internet forums, discussion groups, and blogs. This online word-of-mouth behavior represents new and measurable sources of information for marketing intelligence. Techniques are now being developed to exploit these sources to help companies and individuals to gain such information effectively and easily.
In the past few years, there has been growing interest in mining opinions from user-generated content (UGC) on the Web, e.g., customer reviews, forum posts, and blogs. One major focus is sentiment classification and opinion mining.
3 Method and Approach
3.1 Project Plan
The first step is data preprocessing. Our dataset consists mainly of comments and tags in XML format, so we need to convert the comments into numeric features to meet the requirements of data mining. Second, since the test set is unlabeled, training and evaluating the model must rely on the training set alone, which requires splitting the labeled training data into a training part and a validation part. Finally, we try different classifiers, train and evaluate each one, and select the model with the best classification performance to apply to the test set.
3.2 Solution
I use Python for the data preprocessing. First, I separate the comments and labels (‘Pros’, ‘Cons’, ‘Labs’) in the training set and the test set. Then I use the CountVectorizer class from the sklearn library to learn (fit) the word distribution of the comments in the training set and to transform the comments in both datasets into word-count vectors (shown in Fig 3.1). I choose several different thresholds and, for each one, ignore words that appear in the training set fewer times than the threshold, which generates multiple different datasets. If I counted every word, the number of attributes in the dataset would be very large, which may introduce a lot of redundant information and make the model unstable (although this experiment shows that the effect depends on the classifier). I also considered other ways to process text data; NLP probably offers related methods, but they exceed my ability, so I chose the most straightforward representation: word frequency counts.
Fig 3.1
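A minimal sketch of this preprocessing step is shown below, assuming the comments have already been parsed out of the XML into Python lists (the variable names and the tiny example corpus are illustrative, not my actual scripts; CountVectorizer's min_df parameter plays roughly the role of the frequency threshold, counting how many comments a word appears in):

    from sklearn.feature_extraction.text import CountVectorizer

    # Illustrative stand-ins for the comments extracted from the XML files.
    train_comments = ["great picture quality", "great compact lens", "poor battery life"]
    test_comments = ["battery ran out quickly"]

    # Words occurring in fewer than min_df training comments are ignored,
    # which approximates the thresholding described above.
    vectorizer = CountVectorizer(min_df=1)
    X_train = vectorizer.fit_transform(train_comments)  # learn the vocabulary on the training set only
    X_test = vectorizer.transform(test_comments)        # reuse that vocabulary for the test set

    print(vectorizer.get_feature_names_out())
    print(X_train.toarray())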
Since the labels of the test set are not available, the training and evaluation of the model can only use the training set. The easiest way is to divide the training set in two, one part for training and one for evaluation. However, a single split like this is arbitrary, and one evaluation is not enough to judge the quality of a model. So I use cross-validation with five folds.
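The two options look roughly as follows in scikit-learn (a sketch only; it continues the preprocessing example above and assumes y_train holds the Pro/Con labels of the full training set, encoded as 0/1):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    # Option 1: a single hold-out split. Simple, but the score depends on
    # which examples happen to land in the validation part.
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
    holdout_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)

    # Option 2: 5-fold cross-validation. Every example is used for
    # validation exactly once, and the five scores are averaged.
    cv_acc = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5).mean()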
The last step is training the models. I choose logistic regression, decision tree, support vector machine, and random forest as the classifiers for this experiment. Logistic regression and decision trees are the simplest classifiers, with a solid theoretical basis and strong interpretability. Support vector machines are among the most popular classifiers and generalize well on most datasets. Random forests are an ensemble-learning classifier with strong fitting ability, but they carry a certain risk of over-fitting.
3.3 Experimental design
I set twelve different thresholds to create twelve trainable datasets. I use these twelve datasets to train and evaluate four classifiers and compare their performance to find the best classifier and threshold.
The training and evaluation of all models is done on the Weka platform. For each classifier, 5-fold cross-validation is used to obtain averaged evaluation metrics. The evaluation metrics are accuracy, precision, recall, and F1-score.
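Although the experiments themselves were run in Weka, the comparison is equivalent to the following scikit-learn sketch (the classifier settings here are assumptions, not the actual Weka configuration; labels are again assumed to be encoded as 0/1):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(),
        "SVM": SVC(),
        "RF": RandomForestClassifier(n_estimators=100),
    }
    metrics = ["accuracy", "precision", "recall", "f1"]

    # 5-fold cross-validation of each classifier on one thresholded dataset;
    # the outer loop over the twelve datasets is omitted here.
    for name, clf in classifiers.items():
        cv = cross_validate(clf, X_train, y_train, cv=5, scoring=metrics)
        print(name, {m: round(cv[f"test_{m}"].mean(), 4) for m in metrics})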
4 Results
As shown in Table 4.1, the first row lists the different thresholds and the second row the number of attributes in each created dataset.
Table 4.1
Thresh       1     2     3    4    5    6    7    8    9    10   15   20
Attributes   2290  1099  768  600  489  407  353  307  277  247  170  131
As shown in Fig 4.1, I train the four classifiers (Logistic Regression, Decision Tree, Support Vector Machine, and Random Forest) on the twelve datasets, obtaining forty-eight models. For each model, I record accuracy, precision (Pro, Con), recall (Pro, Con), and F1-score. I also tried PCA to reduce the datasets' dimensionality, but it did not help, so those results are not shown.
Fig 4.1 (model names are threshold-classifier pairs, e.g. 1-RF is the random forest trained on the threshold-1 dataset)
Model    Accuracy  Pro-precision  Pro-recall  Pro-F1  Con-precision  Con-recall  Con-F1
1-LR     0.8075    0.82           0.779       0.779   0.797          0.835       0.816
1-DT     0.802     0.85           0.724       0.782   0.767          0.877       0.819
1-SVM    0.8895    0.914          0.855       0.884   0.869          0.922       0.895
1-RF     0.8945    0.906          0.876       0.891   0.884          0.913       0.898
2-LR     0.8105    0.805          0.809       0.807   0.816          0.812       0.814
2-DT     0.802     0.85           0.724       0.782   0.767          0.877       0.819
2-SVM    0.891     0.92           0.852       0.885   0.867          0.928       0.897
2-RF     0.891     0.904          0.871       0.887   0.88           0.911       0.895
3-LR     0.8115    0.807          0.809       0.808   0.816          0.814       0.815
3-DT     0.802     0.85           0.724       0.782   0.767          0.877       0.819
3-SVM    0.889     0.913          0.855       0.883   0.869          0.921       0.894
3-RF     0.8915    0.899          0.878       0.888   0.885          0.95        0.895
4-LR     0.773     0.765          0.775       0.77    0.781          0.771       0.776
4-DT     0.802     0.85           0.724       0.782   0.767          0.877       0.819
4-SVM    0.8895    0.921          0.847       0.883   0.863          0.93        0.896
4-RF     0.889     0.89           0.883       0.886   0.888          0.895       0.891
5-LR     0.759     0.744          0.776       0.759   0.775          0.743       0.759
5-DT     0.8025    0.852          0.723       0.782   0.767          0.879       0.819
5-SVM    0.89      0.922          0.847       0.883   0.864          0.931       0.896
5-RF     0.885     0.884          0.881       0.883   0.886          0.889       0.887
6-LR     0.8005    0.792          0.805       0.798   0.809          0.796       0.803
6-DT     0.8025    0.852          0.723       0.782   0.767          0.879       0.819
6-SVM    0.891     0.922          0.85        0.884   0.866          0.93        0.897
6-RF     0.884     0.883          0.88        0.882   0.885          0.888       0.886
7-LR     0.8305    0.822          0.835       0.829   0.839          0.826       0.832
7-DT     0.8025    0.852          0.723       0.782   0.767          0.879       0.819
7-SVM    0.892     0.925          0.849       0.885   0.865          0.933       0.898
7-RF     0.8745    0.867          0.879       0.873   0.882          0.87        0.876
8-LR     0.8365    0.83           0.838       0.834   0.843          0.835       0.839
8-DT     0.8025    0.852          0.723       0.782   0.767          0.879       0.819
8-SVM    0.8855    0.921          0.839       0.878   0.857          0.93        0.892
8-RF     0.873     0.863          0.881       0.872   0.883          0.866       0.874
9-LR     0.834     0.834          0.827       0.83    0.834          0.841       0.838
9-DT     0.8025    0.852          0.723       0.782   0.767          0.879       0.819
9-SVM    0.884     0.92           0.836       0.876   0.855          0.93        0.891
9-RF     0.8735    0.865          0.88        0.872   0.882          0.868       0.875
10-LR    0.8465    0.846          0.84        0.843   0.847          0.853       0.85
10-DT    0.8025    0.852          0.723       0.782   0.767          0.879       0.819
10-SVM   0.8815    0.902          0.851       0.876   0.864          0.911       0.887
10-RF    0.871     0.87           0.866       0.868   0.872          0.875       0.874
15-LR    0.866     0.874          0.849       0.861   0.859          0.882       0.87
15-DT    0.8015    0.854          0.719       0.78    0.765          0.881       0.819
15-SVM   0.876     0.921          0.818       0.866   0.841          0.932       0.885
15-RF    0.8655    0.868          0.856       0.862   0.863          0.874       0.869
20-LR    0.8615    0.874          0.838       0.856   0.85           0.884       0.867
20-DT    0.807     0.867          0.717       0.785   0.766          0.894       0.825
20-SVM   0.8635    0.913          0.798       0.852   0.827          0.926       0.874
20-RF    0.8595    0.874          0.834       0.853   0.847          0.884       0.865
5 Discussion
In the training dataset there are 1019 ‘Con’ and 981 ‘Pro’ examples, so this is a balanced dataset and I do not need to consider class imbalance. From the results of the models (shown in Fig 4.1), I also find that the metrics of the two classes show no obvious difference.
Because the two classes are balanced, I pay the most attention to the accuracy of each model. I extract the accuracy of each model and its corresponding threshold and plot them as a bar graph (shown in Fig 5.1), which makes the changes in accuracy easy to observe. From Fig 5.1 we can see that the three classifiers other than LR perform almost stably: RF and SVM reach an accuracy around 0.88, while DT stays around 0.80. RF's and SVM's accuracy does decrease a little as the threshold increases. Because increasing the threshold reduces the dataset's complexity (the number of attributes decreases), I think this could make RF and SVM overfit, which lowers the cross-validation performance. At the same time, LR's performance changes significantly, first decreasing and then increasing. My guess is that LR is a simple classifier without a strong ability to fit complex data: when the threshold is small and the dataset is complex, LR cannot learn much from the dataset, but when the threshold is larger and the dataset is simpler, LR learns it better.
My aim in setting different thresholds was to test whether filtering out low-frequency words in the comments would help the model learn better, but the results do not support this guess. The best accuracy is 0.8945, obtained by RF when the threshold equals 1. So I use this model to predict the labels of the preprocessed test dataset and submit these predictions.
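This final step amounts to something like the following sketch (scikit-learn again standing in for the actual Weka run, and n_estimators is an assumed setting):

    from sklearn.ensemble import RandomForestClassifier

    # Retrain the best configuration on the full training set and label the
    # test set; X_train and X_test are the threshold-1 (min_df=1) features.
    best_model = RandomForestClassifier(n_estimators=100, random_state=0)
    best_model.fit(X_train, y_train)
    test_predictions = best_model.predict(X_test)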
Fig 5.1: accuracy of LR, DT, SVM, and RF at each threshold (x-axis: Thresh; y-axis: Accuracy).
6 Conclusion
In this project, I tried my best to find the best model, and I think I found the best one within the bounds of my abilities. I gained a deeper understanding of the relationship between datasets and classifiers, and this project increased my interest in data mining. I think the data preprocessing could do more. For example, we could bring the contextual information of a sentence into the features, or use an NLP model that pays attention to the context of the comments. Also, I only count word frequencies without noting the word class (noun, verb, adjective, etc.), and more could be done there.