Report
1. Problem statement
2. Background
3. Method and Approach
4. Results
5. Discussion
6. Conclusion
1. Problem statement
The “Pros and Cons” dataset was crawled from reviews on epinions.com concerning products such as digital cameras, printers, and strollers. The training dataset has an accurate target label (Pro or Con) assigned to each example, while the test dataset has only a dummy label for each example.
I need to develop the best possible sentiment classification model based on the “Pros and Cons” dataset.
2. Background
The Web has dramatically changed the way that consumers express their opinions. They can now post reviews of products at merchant sites and express their views on almost anything in Internet forums, discussion groups, and blogs.
Techniques are now being developed to exploit these sources and help companies and individuals obtain this opinion information effectively and easily.
One major focus is sentiment classification and opinion mining.
3. Method and Approach
3.1 Project Plan
Data preprocessing
Splitting the training set into a training set and a validation set
Model selection
3. Method and Approach
3.2 Solution
Data preprocessing: use sklearn’s CountVectorizer to convert each word list into a numerical vector, with different minimum-frequency thresholds (words that appear fewer times than the threshold are ignored); a short sketch follows this list
Splitting: use five-fold cross-validation rather than a single training/validation split, to reduce the effect of randomness
Model: choose logistic regression, decision tree, support vector machine, and random forest as the classifiers for this experiment
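The preprocessing step above can be illustrated with a minimal sketch, assuming the raw reviews are available as a Python list of strings. The toy texts, labels, and the exact threshold value below are placeholders rather than data from the project, and CountVectorizer’s min_df filters on document frequency, which may differ slightly from the report’s word-count threshold.

```python
# Minimal sketch of the vectorization step: CountVectorizer with a
# minimum-frequency threshold so that rare words are dropped.
from sklearn.feature_extraction.text import CountVectorizer

# Toy placeholder reviews and labels; the real "Pros and Cons" data is not shown here.
texts = [
    "great camera great battery life",   # Pro
    "battery dies fast",                 # Con
    "great print quality",               # Pro
    "paper jams often",                  # Con
]
labels = ["Pro", "Con", "Pro", "Con"]

threshold = 2                                   # one of the twelve thresholds tried in the report
vectorizer = CountVectorizer(min_df=threshold)  # words below the threshold are ignored
X = vectorizer.fit_transform(texts)             # sparse document-term count matrix
print(X.shape)                                  # (number of reviews, number of words kept)
print(vectorizer.get_feature_names_out())       # the surviving vocabulary
```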
3. Method and Approach
3.2 Solution
Vectorized dataset
3. Method and Approach
3.3 Experimental design
Set twelve different thresholds to create twelve trainable datasets
The training and evaluation of all models is done on the Weka platform
For each classifier, 5-fold cross-validation is used to obtain averaged evaluation metrics (a rough sklearn sketch follows this list)
Evaluation metrics are accuracy, precision, recall, and F1-score
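The report’s training and evaluation are done in Weka, so the following is only a rough sklearn analogue of the design above: 5-fold cross-validation for each of the four classifiers, with accuracy plus per-class precision, recall, and F1. The synthetic data, the default hyperparameters, and the pooling of out-of-fold predictions are assumptions made for the sketch, not details from the report.

```python
# Rough sklearn analogue of the experimental design: 5-fold cross-validation
# for each classifier, then accuracy and per-class precision, recall, and F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized Pro/Con training data.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    # Out-of-fold predictions from 5-fold CV; Weka instead averages per-fold
    # metrics, which gives slightly different numbers.
    pred = cross_val_predict(clf, X, y, cv=5)
    print(name)
    print(classification_report(y, pred, digits=3))
```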
4. Results
As shown in Table 4.1, the first row gives the different thresholds and the second row gives the number of attributes in each created dataset.
Thresh 1 2 3 4 5 6 7 8 9 10 15 20
Attributes 2290 1099 768 600 489 407 353 307 277 247 170 131
Table 4.1
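The attribute counts in Table 4.1 come from re-vectorizing the reviews at each threshold. A small illustrative loop, reusing the toy `texts` list from the earlier sketch (so the printed counts will not match the table), might look like this:

```python
# Illustrative loop: vectorize the same texts at each threshold and report how
# many attributes (vocabulary terms) survive, mirroring the structure of Table 4.1.
from sklearn.feature_extraction.text import CountVectorizer

thresholds = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]
for t in thresholds:
    vec = CountVectorizer(min_df=t)
    try:
        X_t = vec.fit_transform(texts)       # `texts` from the earlier vectorization sketch
        print(f"threshold={t:>2}  attributes={X_t.shape[1]}")
    except ValueError:
        # every term was pruned; only happens on tiny toy corpora like this one
        print(f"threshold={t:>2}  attributes=0")
```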
4. Results
As shown in Fig 4.1, I train the four classifiers (Logistic Regression, Decision Tree, Support Vector Machine, and Random Forest) on the twelve datasets and obtain forty-eight models.
I record accuracy, precision (Pro, Con), recall (Pro, Con), and F1-score for each model.
I also try PCA to reduce the dataset’s dimensionality, but it does not improve the results, so those results are not shown (an illustrative sketch of such an attempt follows Fig 4.1).
Fig 4.1
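The PCA attempt is not described in any detail in the report, so the snippet below is only a guess at what it could have looked like; the component count is arbitrary and the densification step is needed because sklearn’s PCA does not accept sparse input.

```python
# Hypothetical reconstruction of the PCA attempt mentioned above (no details
# are given in the report, and it reportedly did not help).
from sklearn.decomposition import PCA

X_dense = X.toarray()                          # X is the count matrix from the vectorization sketch
n_components = min(100, min(X_dense.shape))    # 100 is an arbitrary choice, clamped for small data
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_dense)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```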
5. Discussion
In the training dataset there are 1019 ‘Cons’ and 981 ‘Pros’, which means this is a balanced dataset, so I do not need to worry about class imbalance.
Because the two classes are balanced, I pay most attention to the accuracy of each model. I extract the accuracy of each model at each threshold and plot a bar graph (shown in Fig 5.1), which makes the changes in accuracy across the models easy to observe.
Fig 5.1
Accuracy of each classifier at each threshold (data for Fig 5.1):
Thresh  1       2       3       4       5       6       7       8       9       10      15      20
LR      0.8075  0.8105  0.8115  0.7730  0.7590  0.8005  0.8305  0.8365  0.8340  0.8465  0.8660  0.8615
DT      0.8020  0.8020  0.8020  0.8020  0.8025  0.8025  0.8025  0.8025  0.8025  0.8025  0.8015  0.8070
SVM     0.8895  0.8910  0.8890  0.8895  0.8900  0.8910  0.8920  0.8855  0.8840  0.8815  0.8760  0.8635
RF      0.8945  0.8910  0.8915  0.8890  0.8850  0.8840  0.8745  0.8730  0.8735  0.8710  0.8655  0.8595
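Fig 5.1 can be reproduced from the accuracy table above with a grouped bar chart. A minimal matplotlib sketch is shown below, with only the first five thresholds filled in to keep it short; the values are copied from the table, while the plotting details are my own.

```python
# Minimal sketch of a grouped bar chart like Fig 5.1: accuracy per threshold,
# one bar per classifier. Only the first five thresholds are included here.
import matplotlib.pyplot as plt
import numpy as np

thresholds = [1, 2, 3, 4, 5]
accuracy = {
    "LR":  [0.8075, 0.8105, 0.8115, 0.7730, 0.7590],
    "DT":  [0.8020, 0.8020, 0.8020, 0.8020, 0.8025],
    "SVM": [0.8895, 0.8910, 0.8890, 0.8895, 0.8900],
    "RF":  [0.8945, 0.8910, 0.8915, 0.8890, 0.8850],
}

x = np.arange(len(thresholds))
width = 0.2
for i, (name, acc) in enumerate(accuracy.items()):
    plt.bar(x + i * width, acc, width, label=name)

plt.xticks(x + 1.5 * width, thresholds)
plt.xlabel("Threshold")
plt.ylabel("Accuracy")
plt.ylim(0.7, 0.95)
plt.legend()
plt.show()
```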
5. Discussion
The reason I set different thresholds is that I expected filtering out low-frequency words in the comments to help the models learn better. However, the results I obtain do not support this guess.
The highest accuracy is 0.8945, achieved by Random Forest when the threshold equals 1 (shown in Fig 5.2). So I use this model to predict the preprocessed test dataset and use its predictions as my submission.
Fig 5.2
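The final model was trained in Weka, so the sketch below is only an sklearn stand-in for the last step described above: refit a Random Forest at threshold 1 on all training reviews and predict the dummy-labeled test set. The placeholder reviews, the hyperparameters, and the submission file format are all assumptions.

```python
# Sketch of the final step: refit the best configuration (Random Forest with
# threshold 1, accuracy 0.8945 in cross-validation) and predict the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder reviews; the real training and test files are not reproduced here.
train_texts = ["great battery life", "lens cover falls off easily"]
train_labels = ["Pro", "Con"]
test_texts = ["easy to set up", "poor print quality"]

vectorizer = CountVectorizer(min_df=1)           # threshold 1 gave the best accuracy
X_train = vectorizer.fit_transform(train_texts)  # vocabulary is fit on the training set only
X_test = vectorizer.transform(test_texts)        # the test set reuses that vocabulary

model = RandomForestClassifier(random_state=0)   # hyperparameters are assumptions
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

with open("submission.txt", "w") as f:           # submission format is a guess
    for label in predictions:
        f.write(label + "\n")
```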
6. Conclusion
I have tried my best to find the best model, and I believe the selected model is the best I can achieve within the bounds of my ability.
In the future, we could bring the contextual information of a sentence into the features, or use an NLP model that pays attention to the context of the comments.
Also, I only count word frequencies without considering the word class (noun, verb, adjective, etc.); more could be done here, as sketched below.
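As a hedged illustration of the word-class idea (not something done in the report), part-of-speech tags could be added with NLTK so that, for example, only nouns and adjectives are counted; the resource names and the filtering rule below are assumptions for the sketch.

```python
# Illustration of the future-work idea: tag each word's part of speech so that
# features could distinguish nouns, verbs, adjectives, etc.
import nltk

# Both tagger resource names are requested because the name changed across NLTK versions.
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

review = "great battery life but the lens cover falls off"
tokens = review.split()                          # simple whitespace tokenization for the sketch
tags = nltk.pos_tag(tokens)                      # e.g. [('great', 'JJ'), ('battery', 'NN'), ...]
nouns_and_adjectives = [w for w, t in tags if t.startswith(("NN", "JJ"))]
print(tags)
print(nouns_and_adjectives)
```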
Thanks
Full evaluation metrics for the forty-eight models (rows labeled threshold-classifier):
Model   Accuracy  Pro-precision  Pro-recall  Pro-F1  Con-precision  Con-recall  Con-F1
1-LR    0.8075    0.820          0.779       0.779   0.797          0.835       0.816
1-DT    0.8020    0.850          0.724       0.782   0.767          0.877       0.819
1-SVM   0.8895    0.914          0.855       0.884   0.869          0.922       0.895
1-RF    0.8945    0.906          0.876       0.891   0.884          0.913       0.898
2-LR    0.8105    0.805          0.809       0.807   0.816          0.812       0.814
2-DT    0.8020    0.850          0.724       0.782   0.767          0.877       0.819
2-SVM   0.8910    0.920          0.852       0.885   0.867          0.928       0.897
2-RF    0.8910    0.904          0.871       0.887   0.880          0.911       0.895
3-LR    0.8115    0.807          0.809       0.808   0.816          0.814       0.815
3-DT    0.8020    0.850          0.724       0.782   0.767          0.877       0.819
3-SVM   0.8890    0.913          0.855       0.883   0.869          0.921       0.894
3-RF    0.8915    0.899          0.878       0.888   0.885          0.950       0.895
4-LR    0.7730    0.765          0.775       0.770   0.781          0.771       0.776
4-DT    0.8020    0.850          0.724       0.782   0.767          0.877       0.819
4-SVM   0.8895    0.921          0.847       0.883   0.863          0.930       0.896
4-RF    0.8890    0.890          0.883       0.886   0.888          0.895       0.891
5-LR    0.7590    0.744          0.776       0.759   0.775          0.743       0.759
5-DT    0.8025    0.852          0.723       0.782   0.767          0.879       0.819
5-SVM   0.8900    0.922          0.847       0.883   0.864          0.931       0.896
5-RF    0.8850    0.884          0.881       0.883   0.886          0.889       0.887
6-LR    0.8005    0.792          0.805       0.798   0.809          0.796       0.803
6-DT    0.8025    0.852          0.723       0.782   0.767          0.879       0.819
6-SVM   0.8910    0.922          0.850       0.884   0.866          0.930       0.897
6-RF    0.8840    0.883          0.880       0.882   0.885          0.888       0.886
7-LR    0.8305    0.822          0.835       0.829   0.839          0.826       0.832
7-DT    0.8025    0.852          0.723       0.782   0.767          0.879       0.819
7-SVM   0.8920    0.925          0.849       0.885   0.865          0.933       0.898
7-RF    0.8745    0.867          0.879       0.873   0.882          0.870       0.876
8-LR    0.8365    0.830          0.838       0.834   0.843          0.835       0.839
8-DT    0.8025    0.852          0.723       0.782   0.767          0.879       0.819
8-SVM   0.8855    0.921          0.839       0.878   0.857          0.930       0.892
8-RF    0.8730    0.863          0.881       0.872   0.883          0.866       0.874
9-LR    0.8340    0.834          0.827       0.830   0.834          0.841       0.838
9-DT    0.8025    0.852          0.723       0.782   0.767          0.879       0.819
9-SVM   0.8840    0.920          0.836       0.876   0.855          0.930       0.891
9-RF    0.8735    0.865          0.880       0.872   0.882          0.868       0.875
10-LR   0.8465    0.846          0.840       0.843   0.847          0.853       0.850
10-DT   0.8025    0.852          0.723       0.782   0.767          0.879       0.819
10-SVM  0.8815    0.902          0.851       0.876   0.864          0.911       0.887
10-RF   0.8710    0.870          0.866       0.868   0.872          0.875       0.874
15-LR   0.8660    0.874          0.849       0.861   0.859          0.882       0.870
15-DT   0.8015    0.854          0.719       0.780   0.765          0.881       0.819
15-SVM  0.8760    0.921          0.818       0.866   0.841          0.932       0.885
15-RF   0.8655    0.868          0.856       0.862   0.863          0.874       0.869
20-LR   0.8615    0.874          0.838       0.856   0.850          0.884       0.867
20-DT   0.8070    0.867          0.717       0.785   0.766          0.894       0.825
20-SVM  0.8635    0.913          0.798       0.852   0.827          0.926       0.874
20-RF   0.8595    0.874          0.834       0.853   0.847          0.884       0.865