
IFN647 Week 11 Workshop: Text Classification
Task 1. Understand how to Validate an IF or a Classification Model
A good information filtering or text classification model has two properties:
(1) It has a good predictive power, and


(2) It generalizes well to new documents or data it hasn't seen.
To achieve this, we can use a validation strategy that defines an error measure (how wrong the model is).
There are two common error measures: the classification error rate for classification problems and the mean squared error for regression (prediction) problems.
For example, the classification error rate is the percentage of observations in the test data set that your model mislabelled; the lower it is, the better.
The most common validation strategy is cross-validation, which partitions (manually or randomly) a labelled data collection into a training set containing X% of the observations and keeps the remaining (100-X)% as a test set. You can use this strategy to tune experimental parameters and update your model.
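The holdout strategy above can be sketched with scikit-learn as follows. The feature matrix X and labels y here are made-up toy values, not data from this workshop:

```python
# Sketch of the holdout validation strategy with scikit-learn.
# X and y are made-up toy values for illustration only.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = [[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [1, 2], [2, 2], [0, 2]]
y = [1, -1, 1, -1, 1, -1, 1, -1]

# Hold out 25% of the observations as the test set (i.e. X% = 75).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel='linear').fit(X_train, y_train)

# Classification error rate: fraction of mislabelled test observations.
error_rate = 1 - clf.score(X_test, y_test)
print(error_rate)
```

The choice of classifier here is arbitrary; the same split-train-score pattern applies to any model.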
k-fold cross-validation
This strategy divides the data set into k parts and uses each part once as the test set, training on the remaining k-1 parts. It has the advantage that all the data in the collection is used for both training and testing.
[Figure: k-fold cross-validation with k = 5; Experiments 1-5, each fold used once as the test set]
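A minimal sketch of 5-fold cross-validation with scikit-learn's KFold; the data below are made-up toy values:

```python
# Sketch of 5-fold cross-validation: each fold serves as the test set once.
# The data are made-up toy values for illustration only.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X = np.array([[i, i % 3] for i in range(10)])
y = np.array([1, -1] * 5)

error_rates = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    error_rates.append(1 - clf.score(X[test_idx], y[test_idx]))

# Average classification error over the 5 experiments.
print(np.mean(error_rates))
```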
Leave-one-out
This approach is the same as k-fold but with k = n, where n denotes the total number of examples: you always leave one observation out and train on the rest of the data. It is usually used only on small data sets, so it's more valuable to people evaluating laboratory experiments than to big data analysts.
[Figure: leave-one-out; Experiments 1 to n, one observation held out per experiment]
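Leave-one-out is available directly in scikit-learn. A minimal sketch, again on made-up toy data:

```python
# Sketch of leave-one-out validation: k = n, one observation held out
# per experiment. The data are made-up toy values for illustration.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]])
y = np.array([-1, -1, -1, 1, 1, 1])

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

# Leave-one-out error rate: mislabelled observations / n.
print(errors / len(X))
```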
Task 2. SVM classifier and confusion matrix
Given the following Multinomial Document Representation for 10 documents:
Let D = {d1, d2, …, d10} be a training set, and C = {spam, not spam}
Please use the Python scikit-learn package to learn an SVM classifier from D and C.

(a) Let X_train be the list of document vectors of the documents in D, and y_train be the list of the corresponding labels ("1" means spam, and "-1" means not spam). Please show the Python values of X_train and y_train, respectively.
(b) Design Python code to learn an SVM classifier "my_clf" using X_train and y_train, and test the classifier on the test set X_test, where X_test = [[1, 1, 1, 1, 2], [3, 0, 2, 0, 3], [0, 0, 0, 0, 0], [6, 0, 5, 0, 1], [4, 2, 0, 2, 1], [0, 0, 1, 1, 1], [0, 1, 0, 0, 1], [1, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 1, 2]].
(c) Produce the corresponding confusion matrix using the true labels y_test, where y_test = [-1, 1, 1, 1, 1, -1, -1, -1, -1, -1]; show the values of TP, FP, FN, and TN, and calculate the Accuracy.
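A sketch of parts (b) and (c). The X_train and y_train values below are HYPOTHETICAL placeholders, since the multinomial document table from the handout is not reproduced here; substitute the vectors and labels you read off that table. X_test and y_test are taken from the task description:

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# HYPOTHETICAL placeholders: replace with the document vectors and
# labels from the handout's multinomial representation table.
X_train = [[2, 0, 1, 0, 3], [4, 1, 2, 0, 2], [1, 2, 0, 2, 1],
           [3, 0, 2, 1, 2], [0, 1, 0, 1, 1], [1, 1, 0, 2, 0],
           [0, 2, 1, 1, 0], [2, 0, 3, 0, 2], [0, 1, 1, 2, 1],
           [1, 0, 0, 1, 0]]
y_train = [1, 1, 1, 1, -1, -1, -1, 1, -1, -1]

# X_test and y_test as given in the task description.
X_test = [[1, 1, 1, 1, 2], [3, 0, 2, 0, 3], [0, 0, 0, 0, 0],
          [6, 0, 5, 0, 1], [4, 2, 0, 2, 1], [0, 0, 1, 1, 1],
          [0, 1, 0, 0, 1], [1, 0, 0, 0, 1], [0, 1, 0, 0, 1],
          [1, 1, 0, 1, 2]]
y_test = [-1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

my_clf = SVC(kernel='linear')      # learn the SVM classifier
my_clf.fit(X_train, y_train)
y_pred = my_clf.predict(X_test)    # test on X_test

# With label order [-1, 1] and "1" = spam (the positive class),
# the matrix is laid out as [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])
tn, fp, fn, tp = cm.ravel()
accuracy = accuracy_score(y_test, y_pred)  # = (TP + TN) / 10
print(cm, accuracy)
```

The resulting TP/FP/FN/TN counts depend on the real training data, so run this only after filling in the table's values.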
Task 3. Naive Bayes classification & Rocchio Classification
If you have not completed Week 9 Lecture Review Questions 2 and 3, you can do them in this workshop. Week 9 Lecture Review Questions 2 and 3 show different data structures representing the training set. Please check that you fully understand these representations of training sets. Then you should understand the inputs and outputs of the functions TRAIN_MULTINOMIAL_NB() and train_Rocchio(). Finally, calculate the output based on the inputs.
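One way to check a hand calculation of TRAIN_MULTINOMIAL_NB() and train_Rocchio() is against scikit-learn: MultinomialNB with alpha=1 uses the same Laplace-smoothed term probabilities, and NearestCentroid computes the per-class centroids that Rocchio classification is built on. The count vectors below are made-up toy values, not the Week 9 data:

```python
# Cross-check sketch for Naive Bayes and Rocchio hand calculations.
# X_train/y_train are made-up toy term-count vectors, not Week 9 data.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestCentroid

X_train = np.array([[2, 1, 0], [1, 0, 1], [0, 2, 2], [0, 1, 3]])
y_train = np.array([1, 1, -1, -1])

# Multinomial NB with Laplace smoothing (alpha = 1), as in
# TRAIN_MULTINOMIAL_NB(): P(t|c) = (T_ct + 1) / (sum_t' T_ct' + |V|).
nb = MultinomialNB(alpha=1.0).fit(X_train, y_train)
print(np.exp(nb.feature_log_prob_))  # smoothed P(term | class) per row

# Rocchio-style class centroids (mean vector of each class).
roc = NearestCentroid().fit(X_train, y_train)
print(roc.classes_, roc.centroids_)
```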
