Predictive Models
Logistic Regression
Logistic regression is a model that is easy to understand. It models the probability of the value of y given x.
x is the feature vector, y is the class label and θ is the coefficients. So the main idea is that we compute the linear function value of feature vector x and force it to range [0-1] as the probability of class y being 1using the logistic function g.
The graph of g(z) is:
During the training phase, out job is to estimate the coefficients θ using the training data.
We can use maximum likelihood method. The probability to generate the training data is:
We want to find the θ that maximize this likelihood value. We can use stochastic gradient ascent algorithm to optimize the θ.
In prediction, we use the trained θ to compute P(y = 1 | x; θ) = hθ (x). If it bigger than 0.5, we predict the label as 1, otherwise as 0.
Support Vector Machine
Assume the label y is either 1 or -1.
The classifier function for Support Vector Machine is:
is the hyper plane. If is a big positive value, it indicates it is more confident to classify as 1, and if is a big negative value, it indicates it is confident to classify as -1. So we want to be as larger as possible. Also note that the magnitude of doesn’t influence the result of the classifier, the result is only determined by the sign of , . So we can fix .
In training, out job is to estimate the parameter and . We want to maximize the minimum distance of all sample points to the hyper plane. This optimization can be modeled as follows:
In the prediction, we just apply to each sample using estimated and to get the class label.
Decision Tree
Decision tree is a tree structure. To classify a sample, we go from the root node, in each branch, see if it satisfies the condition, if the answer is yes, then take that branch, traverse the tree until we reach the leaf node and take the class value of the leaf node as the class label for the sample. It is a simple classifier and easy for human to interpret the classification process.
During the training, our job is to construct the decision tree using the training data. We can build the tree as follows. Starting from the node with the whole sample points, binary split the sample points to get the next level to children, recursively build the left and right tree. We stop split the nodes, when the sample size is very small, say < 10 points. Each leaf node is assigned the majority class of the node. In each splitting, we need to decide the splitting variable and the splitting point. We can use the following measure to determine the splitting variable and the splitting point. We find the splitting variable and the splitting point that makes the measure value minimum. Naïve Bayes The naïve Bayes is a conditional probability model. It assumes all features are conditionally independent given the class label. From Bayes' theorem, we have P(Y|X1,…., Xn) = P(X1,…., Xn|Y)P(Y) / P(X1,…., Xn) Which is: Posterior Probability = Likelihood * Class Prior Probability / (Predictor Prior Probability) The P(X1,…., Xn) is constant for class Y. So we only need to compute the value of P(X1,…., Xn|Y)P(Y). From independence assumption, we have: P(X1,…., Xn|Y)P(Y) = P(X1|Y)… P(Xn|Y)P(Y) So P(Y|X1,…., Xn) P(X1|Y)… P(Xn|Y)P(Y) If P(Y = Yes|X1,…., Xn) > P(Y = No|X1,…., Xn), we classify as Yes, otherwise classify as No.
For numeric attributes, we use Gaussian density function.
Tackling Imbalance Data
As stated in the section 2, the class labels of this training data set is very imbalanced. The number of negative samples is about 24 times the positive samples. In such situation, using accuracy is pointeless, because we can easily achieve 96% accuracy by simply classify all samples as negative. And we can’t simply use the various classifiers without considering this imbalance. Because with all samples with equal weight, after training the classifier will classify nearly all samples as negative.
To solve this imbalance problem, we should try to make positive samples and negative samples have about equal size. One solution is to random choose samples from negative set to make its size equal to positive set. Another solution is to duplicate the positive samples to make its size equal to negative set. Solution 1’s problem is that it makes the training data size too small, for example, for this data set, the training data size will decrease from 38010 to 2994, less than 1/10 of original size. This solution discard too many samples. With such small training samples, it is very easy for overfitting. Solution 2 will make the data size about 2 times the original size and avoid the problem 1. But duplication of samples don’t add more useful information and certainly will make the training more slow.
Fortunately, there is another solution that can both avoid the problem of solution 1 and solution 2. Instead physically and also inelegantly making the negative samples’ size equal to positive samples’ size, it assigns different weight to the class labels according to their sample size. For the training data in this project, the number of negative sample is about 24 times the number of positive samples. So the weight of positive class should be about 24 times the weight of negative samples. In another word, rare class is more important and one positive sample is 24 times more important than one negative sample.
The classifier will use the class weight to adjust the penalty function during training. For this data set, it means the classifier will receive about 24 times penalty when it classifies a positive as negative than classifying negative as positive.
Implementation
I use LogisticRegression class from sklearn.linear_model for the logistic regression classfier, the GaussianNB class from sklearn.naive_bayes module for the naïve bayes classifier, SVC class from sklearn.svm module for the SVM classifier and the DecisionTreeClassifier from sklearn.tree module for the decision tree classifier.
These classifiers are used with parameter class-weight set on “balance”, which means the class weight is inversely proportional to its number of samples in the data training set.
I use the class method fit and predict to do the training and prediction.
Model Evaluation
ROC(Receiver Operating Characteristic)Curve
The X axis of ROC curve is FPR(false positive rate) and the Y axis of it is TPR(true positive rate).
Suppose the number of true positives is T, the number of all false negative is F, the number of false positive is FP and the number if true negative is TN.
Then
FPR = FP / (FP + TN)
TPR = TP / (TP + FN)
Example ROC curve.
The corner 4 points (0,0), (0,1), (1,0), (1,1) of the ROC graph has special meaning.
(0,0) indicates that FPR = 0 and FPR = 0, which means FP=0 and TP=0, so this classifier
classify all samples as negative. (1,1) indicates that FPR = 1 and FPR = 1, which means TN=0 and FN=0, so this classifier classifies all samples as positive. (0,1) indicates that FPR = 0 and FPR = 1, which means FP=0 and FN=0, so this classifier classifies all samples correctly, so it is a perfect classifier. (1,0) indicates that FPR = 1 and FPR = 0, which means TN=0 and TP=0, so this classifier classifies all samples incorrectly, so it is the worst classifier. The more northwest the point lies in the graph, the better the classifier is.
For a specific classifier and test data, we can only get 1 classification result and 1 point in the graph. We want to get a curve, so we need to get a sequence of points. We can let the threshold changes from 0 to 1, to get different classification result and a sequence of points on the graph. Normally, we can let the classifier output the probability of a given sample being positive, and for a given threshold, the classifier classifies the sample as positive only when the probability greater than the threshold. For threshold = 0, the classifier will classify all points as positive, leading a point (1, 1) in the ROC graph, when threshold = 1, the classifier will classify all points as negative, leading a point (0, 0) in the ROC graph, the smaller the threshold is, the more likely, the classifier will classify a sample as positive, thus leading both true positive rate and false positive rate rising.
The AUC is the area under the ROC curve. Obviously the value of AUC is between 0 and 1. The bigger the AUC is, the better the classifier is. When AUC = 1, it indicates the classifier is perfect.
Among the many metrics, ROC and AUC have the advantage that when the distribution of positive samples and negative samples changes, the ROC curve can keep the same. AUC is much more appropriate than accuracy for data with imbalance class when positive samples are much more or much less than negative samples. The dataset of this project has such imbalance class, it has for fewer positive samples than negative samples in the training data.
Cross Validation
I use k-fold cross validation to select the best model. First I randomly shuffle the training samples. Then I divide the samples to k parts with equal size. Then I do the training and validation k times. In iteration i, I use the ith part as validation data and other parts as training data, and compute the AUC. In the end, compute the average AUC of the k AUCs of each iteration. The advantage of this approach is that every sample takes part in as validation data once and as training data k-1 times.
I do the validation for all models and select the model with minimum average AUC of the k-fold cross validation.
The larger the k is, the more time it need to finish the cross validation process. And larger k may better reflect the overall performance. I set k to 3 which is very quick for this data set.
Implementation
I use KFold function from sklearn.cross_validation module to shuffle the training data and split to 3 parts of training and validation sample points. In each fold, use fit function to train and predict_proba to get the prediction probability. Use roc_curve function from sklearn.metrics to compute the true positive rate and false positive rate for a range of thresholds. Use auc function from sklearn.metrics to compute the area under curve. Plot the roc curve and show the AUC of this fold in the graph. After 3 folds finish, plot the mean roc curve and show the mean AUC of all 3 folds.
Result
From above graphs, we can see that Logistic Regression has the largest mean AUC(0.78),
Gaussian Naïve Bayes has the second largest mean AUC(0.75). Decision Tree and SVM have both mean AUC = 0.70. So I choose Logistic Regression model as my final classifier and use it to train on the whole training data and predict on the testing data.
/docProps/thumbnail.jpeg