Workshop Solution: Text Classification
********************************************************************
Task 1. Understand how to Validate an IF or a Classification Model
A good information filtering or text classification model has two properties:
(1) It has good predictive power, and
(2) It generalizes well to new documents or data it hasn't seen.
To achieve this, we can use a validation strategy that defines an error measure (how wrong the model is).
There are two common error measures: the classification error rate for classification problems and the mean squared error for regression (prediction) problems.
For example, the classification error rate is the percentage of observations in the test data set that your model mislabelled; lower is better.
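The two error measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop's own code: the function names and the sample labels and values are hypothetical.

```python
def classification_error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels, for illustration only (1 = spam, -1 = not spam).
labels_true = [1, -1, -1, 1, -1]
labels_pred = [1, -1, 1, 1, -1]
error_rate = classification_error_rate(labels_true, labels_pred)  # 1 of 5 mislabelled

# Hypothetical regression targets and predictions.
values_true = [2.0, 4.0, 6.0]
values_pred = [2.0, 5.0, 4.0]
mse = mean_squared_error(values_true, values_pred)  # (0 + 1 + 4) / 3
```

Multiplying the error rate by 100 gives the percentage form used in the text.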
A common validation strategy is cross validation, which partitions (manually or randomly) a labelled data collection into a training set with X% of the observations and a test data set with the remaining (100-X)%. You can use this strategy to tune experimental parameters and update your model.
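A random partition of this kind might look as follows. This is a sketch under stated assumptions: the helper name holdout_split, the 80% default, the fixed seed, and the sample documents are all hypothetical and not part of the workshop material.

```python
import random


def holdout_split(observations, labels, train_fraction=0.8, seed=0):
    """Randomly split labelled data into a training part and a test part."""
    if len(observations) != len(labels):
        raise ValueError("observations and labels must have the same length")
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    indices = list(range(len(observations)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(train_fraction * len(indices)))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    X_tr = [observations[i] for i in train_idx]
    y_tr = [labels[i] for i in train_idx]
    X_te = [observations[i] for i in test_idx]
    y_te = [labels[i] for i in test_idx]
    return X_tr, y_tr, X_te, y_te


# Hypothetical collection of 10 labelled documents.
docs = ["doc%d" % i for i in range(10)]
doc_labels = [1, -1] * 5
X_tr, y_tr, X_te, y_te = holdout_split(docs, doc_labels, train_fraction=0.8)
```

With 10 documents and X = 80, the model is trained on 8 documents and evaluated on the remaining 2.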
k-fold cross validation
This strategy divides the data set into k parts and uses each part once as the test data set while using the other k-1 parts as the training data set. It has the advantage that we can use all the data available in the data collection.
[Figure: k-fold cross validation with k=5, showing Experiments 1 to 5]
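The k-fold partitioning described above can be sketched as an index generator. This is a minimal version that assumes contiguous, unshuffled folds; the function name k_fold_indices is hypothetical, and in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


# 10 observations and k=5: each experiment tests on 2 and trains on 8.
folds = list(k_fold_indices(n=10, k=5))
```

Each observation appears in exactly one test fold, which is why the whole collection ends up being used for evaluation.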
Leave-one-out
This approach is the same as k-fold cross validation but with k=n, where n denotes the total number of examples. You always leave one observation out and train on the rest of the data. It is usually used only on small data sets, so it's more valuable to people evaluating laboratory experiments than to big data analysts.
[Figure: leave-one-out cross validation, Experiment 1 … Experiment n]
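Leave-one-out can be sketched the same way: each of the n experiments holds out a single observation. The function name leave_one_out_indices is hypothetical and only illustrates the idea.

```python
def leave_one_out_indices(n):
    """Yield (train_indices, test_indices) with exactly one test observation each."""
    for i in range(n):
        train_idx = [j for j in range(n) if j != i]
        yield train_idx, [i]


# With n=4 observations there are 4 experiments.
splits = list(leave_one_out_indices(4))
```

Because the model is retrained n times, the cost grows with the size of the collection, which is consistent with the remark that this approach is mostly used on small data sets.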
Task 2. SVM classifier and confusion matrix
(a) Let X_train be the list of document vectors of documents in D, and y_train be the list of the corresponding labels ("1" means spam, and "-1" means not spam). Please show the Python values of X_train and y_train, respectively.
X_train=[[0, 0, 0, 0, 2], [3, 0, 1, 0, 1], [0, 0, 0, 0, 1], [2, 0, 3, 0, 2], [5, 2, 0, 0, 1], [0, 0, 1, 0, 1], [0, 1, 1, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [1, 1, 0, 1, 2]]
y_train=[-1, 1, -1, 1, 1, -1, -1, -1, -1, -1]
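Before fitting any classifier, it can help to confirm that the vectors and labels line up. The check below is only a sketch: the variable names vector_lengths and class_counts are hypothetical, and it is not a step the workshop itself describes.

```python
from collections import Counter

X_train = [[0, 0, 0, 0, 2], [3, 0, 1, 0, 1], [0, 0, 0, 0, 1], [2, 0, 3, 0, 2],
           [5, 2, 0, 0, 1], [0, 0, 1, 0, 1], [0, 1, 1, 0, 1], [0, 0, 0, 0, 1],
           [0, 0, 0, 0, 1], [1, 1, 0, 1, 2]]
y_train = [-1, 1, -1, 1, 1, -1, -1, -1, -1, -1]

# One label per document vector.
assert len(X_train) == len(y_train)

# Every document vector should have the same number of term features.
vector_lengths = {len(vector) for vector in X_train}

# Number of spam (1) and not-spam (-1) documents in the training data.
class_counts = Counter(y_train)
```

Here every vector has 5 components, and the labels contain 3 spam and 7 not-spam documents.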
(b) Produce the corresponding confusion matrix, show the values of TP, FP, FN, and TN, and calculate the Accuracy.
Answer: TP = 6

Accuracy = (TP + TN) / 10 = (6 + 3) / 10 = 90%
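The counting behind a confusion matrix can be sketched as follows. The labels used here are hypothetical and are not the workshop's test data, so the resulting counts differ from the answer above; the function names are also assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif t == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical true and predicted labels (1 = spam, -1 = not spam).
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1, -1]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```

Which class is treated as positive changes the four counts, so the choice should be stated alongside any reported TP, FP, FN, and TN values.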
Task 3. Naive Bayes classification & Rocchio Classification
Please see Week 9 Question Solution for lecture review Questions 2 and 3.
