
Topic: Using PCA (Principal Components Analysis) and kNN (k-nearest neighbors) in R Studio to recognize handwritten digits

Individual Project: PCA + kNN Approach in Supervised Handwritten Digit Recognition
Data: A file, project.RData, is provided. This file contains the following three R objects:


trainD : A 2000 x 784 matrix with each row storing the 28 x 28 pixels of a handwritten digit. Hereafter, we call a matrix with this structure a digit data matrix.

TDigit : A vector of size 2000 with the i-th element being the digit corresponding to the i-th row of trainD. Hereafter, we call a vector with this structure a digit vector.

printDigit(v,d=NA) : A function that prints a digit image. Argument v is a vector of length 784 representing a handwritten digit, and d is the digit label for v (default NA). For example, printDigit(trainD[3,],TDigit[3]) displays the digit image stored in the 3rd row of trainD.
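For orientation, a minimal sketch of loading the data and viewing one image, assuming project.RData sits in the current working directory:

# Load the three supplied objects: trainD, TDigit, printDigit.
load("project.RData")
dim(trainD)        # expected: 2000 x 784
length(TDigit)     # expected: 2000
# Display the 3rd training image together with its label, as in the example above.
printDigit(trainD[3, ], TDigit[3])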

(Deliverables: one PDF report and one R file containing two functions)
(a) A PDF report: not more than five A4-size pages excluding appendices. It is recommended to put tables, figures, and program listings in the appendices.
(b) One R file containing two functions, "Prepare" and "Classify", as specified below (sample "Prepare" and "Classify" functions are given in Appendix C). No global variables may be used in the functions.

(i) Prepare(trainData,DigitV)
Input: (i) trainData: a digit data matrix; (ii) DigitV: the corresponding digit vector
Output: A list containing all necessary information to be used in the "Classify" function.

(ii) Classify(QueryData,OutPre)
Input: (i) QueryData: a digit data matrix to be classified; (ii) OutPre: output of the "Prepare" function.
Output: A vector containing the estimated digits of the query data in QueryData.
Each estimated digit is one of the following digits (-1, 0, 1, 2, ..., 9), with the digit "-1" meaning "unknown and to be classified manually".
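For illustration only, the intended calling sequence, assuming project.RData has been loaded and the sample "Prepare" and "Classify" functions of Appendix C have been sourced; the object names out and est are arbitrary:

# Build the classifier once, then classify some query images.
out <- Prepare(trainD, TDigit)        # everything "Classify" will need
est <- Classify(trainD[1:5, ], out)   # estimated digits for five query rows
est                                   # each entry in {-1, 0, 1, ..., 9}; -1 = unknown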
Restrictions on Methods:
1. Principal components analysis: (a) Free to perform any kind of transformation before principal components analysis; (b) Must use function "prcomp" for principal components analysis; (c) Free to determine the number of principal components chosen; (d) Can use "prcomp" a number of times.

2. Classifier: (a) You can only use the simplest form of the k-nearest neighbor algorithm (kNN) to classify handwritten digits, where k can be any positive integer (the classifier used in Section 3.8.2 is a 1-nearest neighbor classifier; see Appendix A for a brief introduction to kNN); (b) The input to the kNN algorithm can be the principal components or any transformed form of the principal components; (c) You can use the function "knn" in the "class" package for k-nearest neighbor classification (a combined "prcomp" + "knn" sketch follows this list).

3. Cross-validation: (a) It is recommended that you use cross-validation to assess the performance of several candidate classifiers and choose the best one as your final method; (b) You can use the simplest form of cross-validation, which is used in Section 3.8.2, or k-fold cross-validation (see Appendix B for a brief introduction and Appendix D for a sample program).
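For orientation only, a minimal sketch of one pipeline that satisfies restrictions 1 and 2, with the first 1900 rows treated as training data and the last 100 rows as query data; the 30 components and k = 3 are illustrative assumptions, not recommendations:

# Sketch: PCA via prcomp, then classification via class::knn.
library(class)
load("project.RData")

trIdx <- 1:1900
pca   <- prcomp(trainD[trIdx, ])          # restriction 1(b): must use prcomp
trS   <- pca$x[, 1:30]                    # restriction 1(c): number of PCs is free

# Project the query rows onto the same 30 principal components.
qS <- sweep(trainD[-trIdx, ], 2, pca$center) %*% pca$rotation[, 1:30]

est <- knn(train = trS, test = qS,
           cl = factor(TDigit[trIdx]), k = 3)   # restriction 2(c): class::knn
table(True = TDigit[-trIdx], Estimated = est)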

Appendix A: kNN algorithm:
Step 1: Select a positive integer k.
Step 2: Find the k nearest neighbors of a query point.
Step 3: Find the categories of the k neighbors and assign the query point to the majority category among them.
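A direct R translation of the three steps for a single query vector, sketched under the assumption that squared Euclidean distance is used (the same distance as in the sample "Classify" function of Appendix C); the name kNNone and its arguments are illustrative:

# q: query vector; trainS: matrix of training rows; trainLab: their digit labels.
kNNone <- function(q, trainS, trainLab, k = 3) {   # Step 1: choose k
  d    <- rowSums(sweep(trainS, 2, q)^2)           # squared distances to all rows
  nb   <- order(d)[1:k]                            # Step 2: k nearest neighbors
  vote <- table(trainLab[nb])                      # Step 3: categories of the k
  as.integer(names(vote)[which.max(vote)])         #         neighbors, majority vote
}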

Appendix B: k-fold cross-validation:
Step 1: Divide the available dataset of size n randomly into k roughly equal groups, say A1, ..., Ak.
Step 2: For i = 1, ..., k, do {
Use Ai as the test data and combine the remaining k - 1 groups to form the training data. Use the training data to build a classifier. Apply the classifier to the test data. Compute ai, the number of correct classifications. }
Step 3: The estimated correct classification rate is (a1 + a2 + ... + ak)/n.

Appendix C: Sample "Prepare" and "Classify" functions

Prepare <- function(trainData, DigitV) {
  # If needed, enter library command(s) here.
  d <- prcomp(trainData)
  list(mu = d$center,              # column means used for centering
       u = d$rotation[, 1:30],     # loadings of the first 30 principal components
       y = d$x[, 1:30],            # scores of the training digits
       Digit = DigitV,             # training labels
       epsilon = 25e5)             # distance threshold for "unknown" (-1)
}

Classify <- function(QueryData, OutPre) {
  # If needed, enter library command(s) here.
  m <- dim(QueryData)[1]
  r <- numeric(m)
  for (i in 1:m) {
    # Project the i-th query image onto the stored principal components.
    w <- t(OutPre$u) %*% (QueryData[i, ] - OutPre$mu)
    minD <- Inf
    # 1-nearest-neighbor search over the stored training scores.
    for (j in 1:(dim(OutPre$y)[1])) {
      dist <- sum((w - OutPre$y[j, ])^2)
      if (dist < minD) { r[i] <- OutPre$Digit[j]; minD <- dist }
    }
    if (minD > OutPre$epsilon) r[i] <- -1   # too far from every training digit
  }
  r   # return the vector of estimated digits
}

Appendix D: Sample k-fold cross-validation program

CValidate <- function(dataSet, TDigit, k) {
  # Perform k-fold cross-validation
  # for the provided "Prepare" and "Classify" functions.
  n <- dim(dataSet)[1]
  b <- sample(rep(1:k, length = n))    # random fold labels 1..k
  TrueDigit <- EstDigit <- NULL
  for (i in 1:k) {
    train <- dataSet[b != i, ]         # training data
    test  <- dataSet[b == i, ]         # test data
    v <- Prepare(train, TDigit[b != i])
    r <- Classify(test, v)
    TrueDigit <- c(TrueDigit, TDigit[b == i]); EstDigit <- c(EstDigit, r)
  }
  print(table(`True digit` = TrueDigit, `Estimated digit` = EstDigit))
}

Assessment Scheme: The performance of the "Prepare" and "Classify" functions will be evaluated using 1000 test images. The grade is determined by the following four factors:

(1) Correct classification rate for 1000 query data (40%): Rate = [(number of correctly classified digits) + 0.5 x (number of unknown digits)] / 1000. Fraction of mark obtained is max([(r - 0.9)/(MaxR - 0.9)] x 40%, 0), where r is the rate of the provided classifier and MaxR is the best rate in the whole class.

(2) Economy in storage (30%): Storage used is the size of the output of "Prepare". Fraction of mark is max([(120000 - s)/(120000 - MinS)] x 30%, 0), where s is the storage used by the provided classifier and MinS is the minimum storage used in the whole class.

(3) Elegance of method (20%).

(4) Report writing (10%).
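As a usage illustration only: the sample cross-validation program above can be run on the training data, and the rate used in assessment factor (1) can be read off the resulting confusion table. The 5 folds, the seed, and the object names tab, digits, correct, unknown, and rate are assumptions, not requirements:

# Run the sample 5-fold cross-validation; print() returns the table invisibly,
# so the result can be captured for further computation.
set.seed(1)
tab <- CValidate(trainD, TDigit, 5)

# Rate as in factor (1): correct digits count 1, "unknown" (-1) counts 0.5.
digits  <- intersect(rownames(tab), colnames(tab))
correct <- sum(sapply(digits, function(d) tab[d, d]))
unknown <- if ("-1" %in% colnames(tab)) sum(tab[, "-1"]) else 0
rate    <- (correct + 0.5 * unknown) / sum(tab)
rate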