Mariia Okuneva, M.Sc.
Data Mining Home Assignment 2
In this home assignment you will again use the Smarket data set, which is part of the ISLR library. In the first part, you will write your own function to fit an LDA model to the train set and compare its performance with the pre-implemented lda() function. Furthermore, you will use this fitted model to make predictions on the test set. In the second part, you will learn about another classification method called KNN (K nearest neighbours). You will program it from scratch and compare the results with the output of the existing knn() function. In both parts of this assignment, your goal is to predict the Direction of the market with Lag1 and Lag2, the percentage returns one and two days ago, respectively. Your task is to write an R script that contains the following parts. But first, download the script template HA2_yournames.R from OLAT.
Part 1. LDA
1. Write your own function mylda to fit an LDA model to the train data set. This function should work for any number of features, but only for binary classification. Of course, you are welcome to make the function more general, but this is not required for successful completion of the assignment. In part one, the train data set should include observations from the time period between 2001 and 2004. Please find additional information on the inputs and outputs of this function in the script template HA2_yournames.R. Note that the lda() function applies scaling to the discriminant function coefficients (the formula is given in the template; please see ESL, Chapter 4, Equations 4.15 and 4.16 for details). Compare your results with the output of the lda() function.
2. Visualize the data and a decision boundary.
3. Write a function my_predict which outputs the predicted classification for one observation. Use apply() and my_predict to produce predicted classes for the whole test data set. Calculate the accuracy of the predictions and produce a confusion matrix. Compare your results with the results based on the pre-implemented lda() function.
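The fitting and prediction steps of Part 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the interface required by the template: the names lda_fit_sketch and lda_predict_one, their arguments, and the returned list are hypothetical, and synthetic data stands in for Smarket. The sketch estimates priors, class means, and a pooled covariance matrix, then classifies by the largest linear discriminant score. It does not reproduce the coefficient scaling that lda() applies.

```r
# Hypothetical sketch: binary LDA with a pooled covariance matrix (base R only).
# X: numeric matrix (n x p) of features; y: labels with exactly two classes.
lda_fit_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- as.factor(y)
  stopifnot(nlevels(y) == 2, nrow(X) == length(y))
  classes <- levels(y)
  n <- nrow(X)
  # Class priors estimated as relative frequencies
  priors <- as.numeric(table(y)) / n
  names(priors) <- classes
  # Class means, one row per class
  means <- do.call(rbind, lapply(classes, function(k) {
    colMeans(X[y == k, , drop = FALSE])
  }))
  rownames(means) <- classes
  # Pooled within-class covariance with divisor n - K (here K = 2)
  pooled <- matrix(0, ncol(X), ncol(X))
  for (k in classes) {
    Xk <- sweep(X[y == k, , drop = FALSE], 2, means[k, ])
    pooled <- pooled + crossprod(Xk)
  }
  pooled <- pooled / (n - length(classes))
  list(classes = classes, priors = priors, means = means, sigma = pooled)
}

# Classify one observation x by the largest linear discriminant score:
# delta_k(x) = x' S^{-1} mu_k - 0.5 * mu_k' S^{-1} mu_k + log(pi_k)
lda_predict_one <- function(x, fit) {
  x <- as.numeric(x)
  scores <- sapply(fit$classes, function(k) {
    mu <- fit$means[k, ]
    a <- solve(fit$sigma, mu)  # S^{-1} mu_k
    sum(x * a) - 0.5 * sum(mu * a) + log(fit$priors[[k]])
  })
  fit$classes[which.max(scores)]
}

# Synthetic stand-in for the Lag1/Lag2 features (an assumption, not Smarket)
set.seed(1)
n <- 200
X <- rbind(matrix(rnorm(n, mean = -1), ncol = 2),
           matrix(rnorm(n, mean = 1), ncol = 2))
colnames(X) <- c("Lag1", "Lag2")
y <- factor(rep(c("Down", "Up"), each = n / 2))

fit <- lda_fit_sketch(X, y)
pred <- apply(X, 1, lda_predict_one, fit = fit)   # one prediction per row
conf_mat <- table(Predicted = pred, Actual = y)   # confusion matrix
accuracy <- mean(pred == y)
```

In the assignment itself, the fit would use the 2001 to 2004 observations and the predictions would be made on the remaining test observations rather than on the training rows as in this toy example.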
Part 2. KNN
1. Program the KNN classification algorithm from scratch (do not use any integrated knn functions). Detailed information on the inputs and output of the function myknn can be found in the template. In part two of this assignment, use the first half of the data set as a train set. Use Euclidean distance as a measure of similarity.
2. Write a function MER for calculating the misclassification error rate. Test the performance of myknn and MER with K=5. Additionally, produce a confusion matrix.
3. Compare the results you get from the pre-implemented knn() function with those of your own functions from Task 1 and Task 2.
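The KNN and error-rate tasks of Part 2 can be sketched along these lines. Again this is only an illustration under assumptions: knn_sketch and mer_sketch are hypothetical names whose arguments need not match myknn and MER in the template, the data is synthetic, and the tie-breaking rule is an arbitrary choice that may differ from what knn() does.

```r
# Hypothetical sketch: KNN classification with Euclidean distance (base R only).
# train_X, test_X: numeric matrices with the same columns;
# train_y: class labels of the training rows; k: number of neighbours.
knn_sketch <- function(train_X, train_y, test_X, k = 5) {
  train_X <- as.matrix(train_X)
  test_X <- as.matrix(test_X)
  train_y <- as.factor(train_y)
  stopifnot(ncol(train_X) == ncol(test_X), k >= 1, k <= nrow(train_X))
  pred <- apply(test_X, 1, function(x) {
    # Euclidean distance from x to every training row
    d <- sqrt(rowSums(sweep(train_X, 2, x)^2))
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_y[nearest])
    # Majority vote; a tie goes to the first level (arbitrary in this sketch)
    names(votes)[which.max(votes)]
  })
  factor(unname(pred), levels = levels(train_y))
}

# Misclassification error rate: share of predictions that differ from the truth
mer_sketch <- function(predicted, actual) {
  mean(as.character(predicted) != as.character(actual))
}

# Synthetic stand-in for the Lag1/Lag2 features (an assumption, not Smarket)
set.seed(2)
n <- 200
X <- rbind(matrix(rnorm(n, mean = -1), ncol = 2),
           matrix(rnorm(n, mean = 1), ncol = 2))
y <- factor(rep(c("Down", "Up"), each = n / 2))

# The toy rows are sorted by class, so shuffle before splitting into halves
idx <- sample(nrow(X))
train <- idx[1:100]
test <- idx[101:200]

knn_pred <- knn_sketch(X[train, ], y[train], X[test, ], k = 5)
error_rate <- mer_sketch(knn_pred, y[test])
conf_mat <- table(Predicted = knn_pred, Actual = y[test])  # confusion matrix
```

With two classes and an odd K such as 5, the majority vote cannot tie, which keeps the comparison with other implementations simpler.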
Remarks: Write comments for everything you do. Code that is not written using the template and/or that returns error messages will not be evaluated. If you are working in groups (not more than 5 students in one group), make sure to note down every participant’s name and ID.
Submission: Submit your scripts via email to mokuneva[at]stat-econ.uni-kiel.de by the end of June 10th (until 00:00:00, June 11th).