Problem Definition
Assignment 4 – Text Mining
This assignment has three objectives:
1. Learn how Text Classification is used in business per the Research Paper provided in this assignment
2. Apply concepts you learned around Text Preprocessing and Naïve Bayes Classifier
3. Implement Naïve Bayes using Python code provided to you for this assignment
TextMining.zip File Content
You have been given TextMining.zip. This zip file contains a Research Paper, Data, and Python Code for this assignment. Unzip this file in the location where you have been developing your Python code.
Once unzipped, the TextMining directory will contain the following files:
• ICIS2015MousaviRaghuFrey.pdf: The Research Paper that applies Naïve Bayes to solve a Business Problem. “Assessing Order Effects in Online Community-Based Health Forums”, R. Mousavi, T. S. Raghu, Keith Frey, Thirty Sixth International Conference on Information Systems, 2015.
• Two datasets are provided. Select only one of them for this assignment:
o HealthProNonPro Directory: Contains Health data pertaining to the Research Paper. Read the cautionary note below (by the Authors of the Paper) before using the Health dataset.
CAUTIONARY NOTE:
The dataset we are providing your instructor is UNFILTERED REAL DATA from Yahoo! Answers and AskTheDoctor.com. There are many answers to health questions that you may find uncomfortable. The assignments DO NOT REQUIRE you to open the files and read the answers. If you are likely to be offended by the language in these files, just don’t read them. Let your program do the job for you!
If you choose not to use the Health Data Set for the above reasons, we have given you an analogous dataset of Movie Reviews that you can use instead. The only difference is that you would be classifying the Movie Reviews into Positive and Negative (instead of classifying the Health Data into Professional and Non-Professional).
o MoviePosNeg Directory: Contains data pertaining to Movie Reviews, which you can use if you choose not to use the Health data per the cautionary note above.
Copyright 2019, Arizona Board of Regents, Arizona State University Page 1 of 5
Requirements for this Assignment
The code in “SKLearnNB.py”, as it stands, executes the following steps:
• Reads in the Health (or Movie) Data (per your selection)
• Creates a “Pipeline” of data transformation and classification
• Uses a Naïve Bayes Classifier to classify the Health data into Professional and Non-Professional (or the Movie data into Positive and Negative)
• Uses 6-fold cross-validation
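The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the provided SKLearnNB.py: the toy corpus, the pipeline step names, and the variable names are hypothetical, and the sketch assumes a current scikit-learn release with the model_selection module, which may differ from the calls used in the provided script.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the files the real script reads.
texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "I loved every minute of it",
    "brilliant direction and a clever plot",
    "an enjoyable and funny movie",
    "truly excellent from start to finish",
    "a dull and boring film",
    "terrible acting and a weak story",
    "I hated every minute of it",
    "clumsy direction and a silly plot",
    "an awful and tedious movie",
    "truly dreadful from start to finish",
]
labels = ["pos"] * 6 + ["neg"] * 6

# Pipeline: raw text -> word counts -> TF-IDF weights -> Naive Bayes.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# 6-fold cross-validation: each fold is held out once for testing.
N_FOLDS = 6
folds = KFold(n_splits=N_FOLDS, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

With only twelve toy documents the scores carry no meaning; the point is the shape of the workflow, in which the vectorizer, transformer, and classifier are fitted on five folds and evaluated on the sixth.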
Follow the steps below to complete this assignment:
1. Watch the assignment introduction video posted in the course site in its entirety.
2. Read through all of the instructions in this assignment document.
3. Read through the Paper (ICIS2015MousaviRaghuFrey.pdf) to understand how Text Classification is used to solve Business problems.
4. Using Anaconda Spyder 3.7+, open and update the script file named “SKLearnNB.py” following the requirements in this document.
5. Decide for yourself whether you want to use the Health dataset (see cautionary note above) or the Movie Reviews dataset. You must select one dataset. Once you have decided, follow the instructions in script lines 19 to 29 to select it.
6. After selecting one of the two datasets, execute the entire script (as provided to you, before your updates) to ensure you do not have errors in your Python console output.
7. If you experience a Python console output error such as “NameError: name ‘SOURCES’ is not defined”, select a dataset as described in step 5 above.
8. If you experience Python console output errors or issues such as “Value Error” or “Cannot have number of folds n_folds=6 greater than the number of samples: 0”, review the video named “Value or n_folds resolution” posted in the course site.
9. After running the code as is, note the Accuracy Score of the Classification. An example Accuracy Score for the movie dataset accompanies this step. If you run this script many times, the accuracy output may change slightly; this is perfectly normal.
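The “number of samples: 0” message quoted above suggests that the script found no documents to classify, for example because the zip file was extracted to a different location. The course video remains the authoritative resolution. As a hedged aid, the sketch below checks that a dataset directory exists and holds at least as many files as there are folds. The function names count_files and check_dataset are hypothetical and are not part of SKLearnNB.py.

```python
import os
import tempfile


def count_files(directory):
    """Count regular files under directory, including subdirectories."""
    total = 0
    for _root, _dirs, files in os.walk(directory):
        total += len(files)
    return total


def check_dataset(directory, n_folds=6):
    """Return a short status message about the dataset directory."""
    if not os.path.isdir(directory):
        return "missing directory: " + directory
    n_files = count_files(directory)
    if n_files < n_folds:
        return "only {} file(s) found, need at least {}".format(n_files, n_folds)
    return "ok: {} file(s) found".format(n_files)


# Demonstration on a temporary directory rather than the real data.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = check_dataset(tmp)
    for i in range(6):
        with open(os.path.join(tmp, "doc{}.txt".format(i)), "w") as handle:
            handle.write("sample text")
    filled_result = check_dataset(tmp)

# The temporary directory has been deleted at this point.
missing_result = check_dataset(tmp)

print(empty_result)
print(filled_result)
print(missing_result)
```

In practice you would call check_dataset with the path to the HealthProNonPro or MoviePosNeg directory before running the full script.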
• Learn about the functions in lines 65, 66, 67 in the Script named SKLearnNB.py
• Then, add, remove, tweak, or update the function parameters to improve Accuracy of the Classifier.
• You can learn about the function parameters required in the script file lines 65, 66, and 67 in the following respective URL locations:
CountVectorizer
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
TfidfTransformer
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
MultinomialNB
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
• Try different combinations of function parameters in lines 65, 66, and 67 until you settle on your personal best results, measured as the highest accuracy score together with the fewest false positives and false negatives in the confusion matrix that appears in the Python console output below the Accuracy Score.
• To refresh your memory about confusion matrices (this should have been covered in prior stats classes), you may review the following article: https://en.wikipedia.org/wiki/Confusion_matrix
Important:
• Your updated script will be run as-is on the existing data, using the Anaconda Spyder 3.7+ installation required for this class, to cross-check your results.
• You must use only Naïve Bayes Classification for this assignment.
• You must use only the given data.
• You must use only 6-fold cross-validation as provided.
• You may add, remove, tweak, or update any of the function parameters in lines 65, 66, and 67 to improve the Accuracy of the Classification.
• Points will be deducted for a script or output that:
o is difficult to read or follow
o does not compile or run
o does not have all the required pieces of code
o has hard-coded values instead of using variables
o has logical errors
o does not result in the expected output
o does not have user-friendly output
o does not run efficiently following the practices and examples demonstrated in our lectures and problem sets (located in the course site)
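To make the accuracy and confusion matrix criteria above concrete, the following sketch counts true and false positives and negatives for a two-class problem and derives accuracy from those counts. The label lists are invented for illustration, and the helper names confusion_counts and accuracy are hypothetical. The provided script reports these figures itself, so this only shows how the numbers relate.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive):
    """Tally TP, FP, FN, TN for a binary problem with the given positive label."""
    counts = Counter()
    for actual_label, predicted_label in zip(actual, predicted):
        if predicted_label == positive:
            counts["TP" if actual_label == positive else "FP"] += 1
        else:
            counts["FN" if actual_label == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that were correct: (TP + TN) / total."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


# Invented labels for illustration only.
actual = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

counts = confusion_counts(actual, predicted, positive="pos")
print("TP:", counts["TP"], "FP:", counts["FP"])
print("FN:", counts["FN"], "TN:", counts["TN"])
print("Accuracy:", accuracy(counts))
```

Two parameter settings can reach the same accuracy with different mixes of false positives and false negatives, which is why both the score and the matrix are worth recording for each combination you try.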
Submission Requirements for this Assignment
You will submit two files for this Assignment:
File 1
In a single Python script file (note this file must have the .PY extension to be valid):
• Submit your final Updated Code that gave you BEST reported accuracy (2 points)
• Use the following naming convention for your final Updated Script file:
Example: EdgardLuqueAssignment4.py
File 2
In a single Word document (you may provide any name you like for this Word file):
• Provide a screenshot of the BEST Accuracy and Confusion Matrix you obtained with your Updated Code. Paste the screenshot in the Word document (1 point)
• Explain (in no more than half a page) the different combinations of parameters you tried for lines 65, 66, and 67 in your updated script file, and WHY you believe the final set of parameter values you picked gave you the best Classifier Accuracy. Provide this explanation in the Word document (2 points)
DO NOT attach a screenshot of your updated script, as this will receive zero grade points for this portion of the assignment.