ECE 730 – Statistical Learning
Classifiers:
1. Logistic regression
2. Linear classifier based on the Gaussian generative model
3. Support vector machine
4. Decision tree
5. Random forest
6. Adaboost/tree
7. Adaboost/KNN
ECE 730 Project
Feature selection method
Compute the coefficient of variation (CoV) for each feature, and select the features whose CoV value is greater than a threshold. You can use other feature selection methods.
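As a minimal sketch of the CoV rule above, the following Python code computes the coefficient of variation of each feature as the standard deviation divided by the mean and keeps the features whose value exceeds a threshold. The function names, the toy data, the threshold of 0.2, the use of the absolute value, and the handling of zero-mean features are all assumptions made for illustration, not part of the assignment.

```python
import statistics

def coefficient_of_variation(values):
    """Return |std / mean| for one feature (assumed convention)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if mean == 0:
        # CoV is undefined for a zero mean; this fallback is an assumption.
        return float("inf") if std > 0 else 0.0
    return abs(std / mean)

def select_by_cov(X, threshold):
    """Return the column indices of X (a list of rows) with CoV > threshold."""
    n_features = len(X[0])
    keep = []
    for j in range(n_features):
        column = [row[j] for row in X]
        if coefficient_of_variation(column) > threshold:
            keep.append(j)
    return keep

# Toy data: feature 0 varies a lot relative to its mean, feature 1 barely varies.
X = [[1.0, 10.0], [5.0, 10.1], [9.0, 9.9], [3.0, 10.0]]
selected = select_by_cov(X, threshold=0.2)
X_reduced = [[row[j] for j in selected] for row in X]
print(selected)  # prints [0]
```

Note that the CoV becomes very large for features whose mean is close to zero, so the threshold may need to be chosen with the scale and centering of the data in mind.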
Simulation of training data
Suppose that there are $K \ge 2$ classes and $p$ features. The first $p_1 < p$ features determine the class label, and the remaining features are irrelevant. First, select $K$ distinct $p_1 \times 1$ vectors $m_k$, $k = 1, \ldots, K$, then append $p - p_1$ zeros to each of them to form the $p \times 1$ vectors $\mu_k$, $k = 1, \ldots, K$. Generate $n_k$ data points from the Gaussian distribution with mean $\mu_k$ and covariance $\sigma^2 I$, where $I$ is the $p \times p$ identity matrix; these are the $n_k$ data points from the $k$th class. Generating data points for all $K$ classes gives a training dataset of $n = \sum_{k=1}^{K} n_k$ samples.
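The simulation procedure above could be sketched as follows in Python, using only the standard library. The function name `simulate`, the example class means, and the values of p, sigma, and the class sizes are hypothetical choices for illustration. Because the covariance is sigma^2 I, each coordinate can be drawn independently with standard deviation sigma.

```python
import random

def simulate(means, p, sigma, n_per_class, seed=0):
    """Draw n_per_class[k] points from N(mu_k, sigma^2 I) for each class k.

    means: list of K distinct vectors m_k of length p1 (the relevant part).
    Returns (X, y), where X is a list of p-dimensional rows and y the labels.
    """
    rng = random.Random(seed)
    X, y = [], []
    for k, m_k in enumerate(means):
        mu_k = list(m_k) + [0.0] * (p - len(m_k))  # append p - p1 zeros
        for _ in range(n_per_class[k]):
            # Independent coordinates, each with standard deviation sigma.
            X.append([rng.gauss(mu, sigma) for mu in mu_k])
            y.append(k)
    return X, y

# Illustrative setting: K = 3 classes, p1 = 2 relevant features, p = 6.
means = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
X, y = simulate(means, p=6, sigma=0.5, n_per_class=[30, 20, 10])
print(len(X), len(X[0]), y.count(0), y.count(1), y.count(2))  # prints 60 6 30 20 10
```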
Real data
Two datasets from the book “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman: the spam dataset and the zip code dataset. Both are available at https://web.stanford.edu/~hastie/ElemStatLearn/
Please select one classifier and do the following:
1. Implement the classifier with and without feature selection. For logistic regression, do not use the CoV method to select features; instead, use the Lasso or elastic net penalty. That is, you will implement logistic regression both without regularization and with Lasso or elastic net regularization. For the other classifiers, use the CoV method to select features.
If you choose logistic regression, SVM, or decision tree, you need to use cross-validation to determine the parameters of the model. If you choose Adaboost, you can select any base classifier (not necessarily one on this list).
You are allowed to use existing software that implements these classifiers. For example, you can use the glmnet software for logistic regression regularized with the elastic net penalty.
2. Use simulated data to investigate the performance of the classifier. More specifically, investigate how the classification error is affected by the noise level ($\sigma^2$), the number of features $p$, the number of relevant features $p_1$, the number of training samples $n$, the feature selection method, etc.
3. Select one real dataset, run your classifier on it, and compare the performance of the classifier with and without feature selection.
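The cross-validation requirement in step 1 could be approached along the following lines. This is only a sketch: it uses a simple k-nearest-neighbour classifier as a stand-in model and tunes its number of neighbours, whereas in the project the same loop would be applied to the parameters of the chosen classifier (for example the SVM cost or the tree depth). All function names, the toy data, and the candidate values are assumptions made for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k nearly equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(X_train, y_train, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points."""
    dist = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = [label for _, label in dist[:n_neighbors]]
    return max(set(votes), key=votes.count)

def cv_error(X, y, n_neighbors, k=5, seed=0):
    """Misclassification rate pooled over the k held-out folds."""
    errors = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        X_tr = [X[i] for i in range(len(X)) if i not in held_out]
        y_tr = [y[i] for i in range(len(X)) if i not in held_out]
        errors += sum(knn_predict(X_tr, y_tr, X[i], n_neighbors) != y[i] for i in fold)
    return errors / len(X)

# Toy two-class data: 40 points around (0, 0) and 40 points around (3, 3).
rng = random.Random(1)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0.0, 3.0) for _ in range(40)]
y = [label for label in (0, 1) for _ in range(40)]

candidates = [1, 3, 5, 7]  # odd values avoid voting ties with two classes
scores = {m: cv_error(X, y, m) for m in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

The parameter value with the lowest cross-validated error is then used to refit the model on the full training set.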
Each student selects one classifier and one of the following two options for simulation and real data: 1) K=2 for computer simulation and the spam dataset, and 2) K > 2 for computer simulation and the zip code dataset. This yields 12 settings. No setting can be chosen by more than one student.
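For option 1 (K = 2), the noise-level study in step 2 could be organized roughly as below. The sketch uses a nearest-class-mean rule, which is what the linear classifier based on the Gaussian generative model reduces to when the classes share the covariance sigma^2 I and have equal priors. The function names, class means, sample sizes, and sigma values are hypothetical, and no feature selection is applied.

```python
import random

def simulate(means, p, sigma, n_per_class, rng):
    """Draw n_per_class points per class from N(mu_k, sigma^2 I)."""
    X, y = [], []
    for k, m_k in enumerate(means):
        mu_k = list(m_k) + [0.0] * (p - len(m_k))  # pad with irrelevant features
        for _ in range(n_per_class):
            X.append([rng.gauss(mu, sigma) for mu in mu_k])
            y.append(k)
    return X, y

def fit_centroids(X, y):
    """Estimate each class mean from the training data."""
    centroids = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest estimated mean."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def holdout_error(sigma, means, p, n_train, n_test, seed=0):
    """Train on simulated data and return the error on fresh simulated data."""
    rng = random.Random(seed)
    X_tr, y_tr = simulate(means, p, sigma, n_train, rng)
    X_te, y_te = simulate(means, p, sigma, n_test, rng)
    centroids = fit_centroids(X_tr, y_tr)
    wrong = sum(predict(centroids, x) != label for x, label in zip(X_te, y_te))
    return wrong / len(y_te)

means = [[1.0, 1.0], [-1.0, -1.0]]  # K = 2, p1 = 2 (illustrative values)
for sigma in [0.1, 1.0, 2.0, 5.0]:
    err = holdout_error(sigma, means, p=10, n_train=50, n_test=200)
    print(f"sigma^2 = {sigma**2:6.2f}   test error = {err:.3f}")
```

The same loop can be repeated over p, p1, and n, and with or without a feature selection step, averaging over several random seeds to reduce simulation noise.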
1. The report should be clearly written and include at least the following sections:
(a) Introduction. Clearly describe the problem to be solved and explain the purpose or motivation of the project. If you choose the project I assigned, you can use part of my problem statement, but you need to describe the purpose or motivation based on your own understanding.
(b) Describe your approach to solving the problem in detail. You can choose an appropriate title for this section.
(c) Results and Discussion. Describe your simulation setup and real data in enough detail that others can repeat your experiments. Present your results using figures, tables, etc., and interpret and discuss them.
(d) Conclusions. Summarize the project briefly and draw conclusions based on your results.
2. Submit your computer program as well; I may ask you to run your program to obtain the results in your report. You can use any programming language. My recommendation is R or Matlab.