Geolocation of Twitter Users with Machine Learning: Project Report
Abstract
This project trains a classifier to predict a tweet's location based only on the tweet's content.
1. Introduction
Aside from the features selected in Project 1, I also add the timestamp of the tweet as a feature. I select three models (Logistic Regression, SVM, Naïve Bayes), train them on the data, evaluate their performance, and in the end choose Logistic Regression as the best model for this problem.
2. Adding a New Feature
People in different locations may have different habits: they may like tweeting at different times of the day or of the year. I therefore added the tweet's timestamp as a feature. Conveniently, Weka supports a date data type.
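For illustration, the relevant part of an ARFF header might look like the sketch below. The attribute names and class values are my assumptions based on this report; the date format string follows Java's SimpleDateFormat, which is what Weka's ARFF date type expects.

@relation tweets

@attribute text      string
@attribute timestamp date "yyyy-MM-dd HH:mm:ss"
@attribute location  {LA, NY, C, At, SF}

With the timestamp declared this way, Weka parses it as a date and internally treats it as a numeric attribute, so the classifiers can use it directly.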
3. The Models
Logistic Regression creates a discriminative model p(y|x) directly and classifies the data according to p(y|x).
A Support Vector Machine (SVM) creates a discriminant function f : X → Y and classifies according to f(x); it does not involve probabilities.
Naïve Bayes creates a generative model p(y), p(x|y), computes p(y|x) with Bayes' rule, and classifies the data according to p(y|x).
3.1 Logistic regression
An advantage of Logistic Regression is that correlated features do not hurt its performance much, whereas in Naïve Bayes correlated features do degrade the model's performance. Logistic Regression also has a nice probabilistic interpretation and, unlike decision trees or SVMs, the model can easily be updated to take in new data using an online gradient descent method.
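For example, a standard online gradient step on the log-likelihood for a single new example (x, y), with y \in \{0, 1\} and learning rate \eta (textbook notation, not taken from this report), is

w \leftarrow w + \eta \, (y - \sigma(w^\top x)) \, x

so each incoming tweet can update the weights without retraining from scratch.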
In statistics, logistic regression (or logit regression) is a type of probabilistic statistical classification model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. This function, also called the sigmoid function, models the probability of an outcome.
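Concretely, for a binary outcome y \in \{0, 1\} with feature vector x and coefficient vector w (standard notation, not from this report), the model is

p(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}

A five-class task like this one would use the usual multinomial generalization, with one coefficient vector per class.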
The regression coefficients are usually estimated using maximum likelihood estimation. Unlike linear regression, it is not possible to find an expression for the coefficient values that maximizes the likelihood function, so an iterative process must be used instead.
Newton's method is used for the iterative process. If we want to minimize a function and find the location of its minimum, we look for a zero of its first derivative, making use of the second derivative as well. Near the minimum we can make a Taylor expansion.
Newton’s method uses this fact, and minimizes a quadratic approximation to the function we are really interested in.
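Written out for a univariate objective f (a textbook derivation, not specific to this report), the second-order Taylor expansion around the current iterate x_n is

f(x) \approx f(x_n) + f'(x_n)(x - x_n) + \tfrac{1}{2} f''(x_n)(x - x_n)^2

and minimizing this quadratic in x gives the update

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}

For the logistic log-likelihood \ell(\beta), the multivariate version of the same step is the Newton-Raphson update \beta \leftarrow \beta - H^{-1} \nabla \ell(\beta), where \nabla \ell is the gradient and H is the Hessian.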
3.2 Support vector machine
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. SVMs are memory-intensive and hard to interpret in a statistical way.
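In the standard linear, separable formulation (notation mine, not from this report), with labels y_i \in \{-1, +1\}, the classifier is f(x) = \mathrm{sign}(w^\top x + b), and the "optimal" hyperplane is the one that maximizes the margin 2 / \lVert w \rVert:

\min_{w, b} \; \tfrac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 \text{ for all } i

Non-separable data is handled by adding slack variables (the soft-margin variant), which is what practical implementations such as Weka's SMO use.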
3.3 Naïve Bayes
A Naïve Bayes classifier assumes the features are independent given the class variable. Its main disadvantage is that it cannot learn interactions between features. Its performance on the tweet data is poorer than that of SVM and Logistic Regression because the independence assumption may be invalid there. On the other hand, Naïve Bayes is very simple: it works with simple counts, so it can be trained much more quickly than Logistic Regression and SVM.
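Under that conditional-independence assumption (standard notation, not from this report), the model for features x_1, \dots, x_n factorizes as

p(y \mid x_1, \dots, x_n) \propto p(y) \prod_{i=1}^{n} p(x_i \mid y)

and prediction picks the class \hat{y} = \arg\max_y \, p(y) \prod_i p(x_i \mid y). Each p(x_i \mid y) is estimated from simple counts, which is why training is so fast.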
4. Experiment
I use the Weka system to train and test the three models on the tweet data and evaluate their performance, both as overall accuracy and as detailed by-class accuracy. Their performance is listed as follows.
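A minimal sketch of this setup using Weka's Java API is shown below. The file names, the position of the class attribute, and the choice of SMO as the SVM implementation are my assumptions; the report only states that Weka was used with these three model families.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Assumed file names; features are already extracted and the
        // location label is the last attribute.
        Instances train = new DataSource("tweets-train.arff").getDataSet();
        Instances test = new DataSource("tweets-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // The three models compared in this report.
        Classifier[] models = { new Logistic(), new SMO(), new NaiveBayes() };
        for (Classifier model : models) {
            model.buildClassifier(train);              // fit on training data
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);           // "Error on test data"
            System.out.println(model.getClass().getSimpleName());
            System.out.printf("Correctly classified: %.4f %%%n", eval.pctCorrect());
            System.out.println(eval.toMatrixString()); // by-class confusion matrix
        }
    }
}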
4.1 Logistic regression performance
=== Error on test data ===

Correctly Classified Instances      11064      40.9202 %
Incorrectly Classified Instances    15974      59.0798 %
Kappa statistic                      0.1687
Mean absolute error                  0.291
Root mean squared error              0.3885
Total Number of Instances           27038

=== Confusion Matrix ===

    a     b     c     d     e   <-- classified as
 7816   312    92    22    60 |  a = LA
 5545  1099    89    17    47 |  b = NY
 3472   147  1326    25    34 |  c = C
 3059   110    29   219    17 |  d = At
 2707   145    37     8   604 |  e = SF
4.2 Support Vector Machine performance
=== Error on test data ===

Correctly Classified Instances      11182      41.3566 %
Incorrectly Classified Instances    15856      58.6434 %
Kappa statistic                      0.18
Mean absolute error                  0.2769
Root mean squared error              0.3723
Total Number of Instances           27038

=== Confusion Matrix ===

    a     b     c     d     e   <-- classified as
 7516   525   108    75    78 |  a = LA
 5177  1382   104    70    64 |  b = NY
 3199   337  1365    58    45 |  c = C
 2819   255    42   290    28 |  d = At
 2557   242    40    33   629 |  e = SF

4.3 Naïve Bayes performance
=== Error on test data ===

Correctly Classified Instances      10038      37.1255 %
Incorrectly Classified Instances    17000      62.8745 %
Kappa statistic                      0.1311
Mean absolute error                  0.2784
Root mean squared error              0.3926
Total Number of Instances           27038

=== Confusion Matrix ===

    a     b     c     d     e   <-- classified as
 6871   525   200   588   118 |  a = LA
 5024  1008   216   474    75 |  b = NY
 3144   205  1269   345    41 |  c = C
 2731   208    84   370    41 |  d = At
 2564   210    84   123   520 |  e = SF
5. Analysis
The performance of Logistic Regression and SVM is very similar, both in overall accuracy and in detailed accuracy by class, but Logistic Regression is a little better than SVM for this problem.
Naïve Bayes is much poorer than Logistic Regression and SVM. The reason is that the assumption of independent features, which Naïve Bayes depends on, is not valid in this project. Although its accuracy is not good, Naïve Bayes trains much faster than SVM and Logistic Regression because it only performs simple counts.
6. Conclusions
In this project, I added the date feature, analysed three kinds of classifier, and experimented with them using the Weka system, in the end finding the best model (Logistic Regression) for this problem. Through this process, I gained more perspective on feature and model selection and also became familiar with the Weka system, which is a great tool for machine learning study.
References
D. A. Salazar, J. Ivan Velez and J. C. Salazar, "Comparison between SVM and Logistic Regression: Which One is Better to Discriminate?", Revista Colombiana de Estadistica, vol. 35, no. 2, 2012, pp. 223-237.
Andrew Ng, CS229 Lecture Notes, Department of Computer Science, Stanford University, 2013. [ONLINE] Available at: http://cs229.stanford.edu/notes/cs229-notes3.pdf
Naive Bayes classifier. [ONLINE] Available at: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Logistic regression. [ONLINE] Available at: http://en.wikipedia.org/wiki/Logistic_regression