COMP 4925
Final Exam
Each student must work on this lab on their own and submit their own original work. Your instructor may follow up with a one-on-one interview, if needed, to verify the authenticity of each submission.
All submissions that are similar to each other will receive 0!
Answer each of the following questions in a Word document (3 marks each):
• For a neural network, what are the drawbacks of a large hidden layer? What are the drawbacks of a small hidden layer?
If the hidden layer has too few nodes, the network may fail to learn the training data at all, or it underfits and performs poorly.
If the hidden layer is too large, the network tends to overfit: it memorizes the training set at the cost of generalization to the validation set.
• For a decision tree, what are the possible situations when no further splits can be made at a decision node?
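No further split can be made when all (or nearly all) of the examples at the node belong to the same class, when no features remain that distinguish among the examples, or when the tree has reached a predefined size limit. In those cases the branch ends in a terminal (leaf) node; otherwise it connects to another decision node.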
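The usual situations in which a decision node cannot be split further (a pure class, no distinguishing features left, a size limit reached) can be sketched as a small check in base R. The function name `can_split`, the `min_size` threshold, and the toy data frame are illustrative assumptions and do not come from any particular decision tree package.

```r
# Hypothetical helper: decide whether a decision node can be split further.
# `node` is a data frame of the examples at the node, `target` names the class
# column, and `min_size` is an assumed stopping threshold.
can_split <- function(node, target, min_size = 2) {
  features <- setdiff(names(node), target)
  # 1. All examples already share one class: the node is pure.
  if (length(unique(node[[target]])) <= 1) return(FALSE)
  # 2. No features left, or every example has identical feature values.
  if (length(features) == 0) return(FALSE)
  if (nrow(unique(node[, features, drop = FALSE])) <= 1) return(FALSE)
  # 3. A stopping rule (here: too few examples) has been reached.
  if (nrow(node) < min_size) return(FALSE)
  TRUE
}

# Toy examples (made up for illustration).
toy <- data.frame(
  sex    = c("Male", "Female", "Male", "Female"),
  income = c(">50K", "<=50K", "<=50K", "<=50K")
)
can_split(toy, "income")                         # TRUE: mixed classes, features differ
can_split(toy[toy$sex == "Female", ], "income")  # FALSE: the node is pure
```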
• What is meant by classification in kNN algorithm?
In kNN, a sample is classified by a majority vote of its neighbors: if most of the k nearest samples in the feature space belong to a certain category, the sample is assigned to that category as well. The neighbors are drawn from training samples whose categories are already known. The method therefore decides the category of an unlabeled sample using only the categories of its nearest one or several neighbors.
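The majority vote behind kNN classification can be illustrated with a minimal base R sketch. It assumes numeric features and Euclidean distance; the function name `knn_predict` and the toy values are made up for illustration, and in practice the features would normally be rescaled first so that no single column dominates the distance.

```r
# Minimal kNN classifier for numeric features (Euclidean distance, majority vote).
knn_predict <- function(train_x, train_y, query, k = 3) {
  train_x <- as.matrix(train_x)
  # Distance from the query point to every labeled training example.
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k nearest neighbors.
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; on a tie, the label that sorts first wins.
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Toy data (made up for illustration): age and hours worked per week.
train_x <- data.frame(age = c(25, 30, 45, 50, 52), hours = c(20, 25, 50, 45, 60))
train_y <- c("<=50K", "<=50K", ">50K", ">50K", ">50K")

knn_predict(train_x, train_y, query = c(48, 48), k = 3)  # ">50K"
knn_predict(train_x, train_y, query = c(27, 22), k = 3)  # "<=50K"
```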
Document each of the following steps in the Word document including R statements and screenshots:
• Consider the census income dataset at https://archive.ics.uci.edu/ml/datasets/Census+Income
• I have downloaded the dataset for you in data.csv. This is a smaller data set than the one from the website. The data describes people with different backgrounds and whether they make more than $50,000 (>50K) or at most $50,000 (<=50K) per year. The income class is the last column.
• There are other attributes provided for you
Age: continuous.
Working class: Private, Self-emp-not-inc, etc.
Education: Bachelors, Some-college, etc.
Education Years: Number of years in school
Marital status: Married-civ-spouse, Divorced, etc.
Occupation: Tech-support, Craft-repair, etc.
Relationship: Wife, Own-child, etc.
Race: White, Asian-Pac-Islander, etc.
Sex: Female, Male.
Hours per week: working hours per week.
Native Country: United-States, Cambodia, etc.
• You can get more information from the website.
• Part A: Decision Tree (5 marks)
• Decide on a good proportion to split the data into a training data set and a test data set.
• Create a decision tree to classify individuals who make <= 50K and those who make > 50K, and test it against the test data set.
• Show the confusion matrix and explain the results from the table.
• Try to improve the performance and recreate the confusion matrix. Is there any improvement?
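# Path on the author's machine; the folder name 期末考试 means "final exam".
eduData <- read.csv("D:\\COMP4925\\期末考试\\data.csv")
str(eduData)
# Shuffle the rows reproducibly before splitting.
set.seed(12345)
edu_rand <- eduData[order(runif(nrow(eduData))), ]
# The 288-row training set follows the original split, which assumed 500 rows in total.
edu_train <- edu_rand[1:288, ]
edu_test <- edu_rand[289:nrow(edu_rand), ]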
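The remaining Part A steps (fit a tree, predict on the test set, build the confusion matrix) might look like the sketch below. It is only an illustration under stated assumptions: it uses the `rpart` package rather than whichever tree package the course used, it runs on synthetic toy data instead of data.csv, and the column names (`age`, `hours`, `edu_years`, `income`) and the 75/25 split are hypothetical.

```r
library(rpart)  # recommended package that ships with standard R installations

# Toy stand-in for data.csv (values are synthetic; column names are assumed).
set.seed(12345)
n <- 400
toy <- data.frame(
  age       = sample(18:70, n, replace = TRUE),
  hours     = sample(10:70, n, replace = TRUE),
  edu_years = sample(8:16, n, replace = TRUE)
)
# Synthetic label so that the tree has a pattern to learn.
score <- toy$edu_years + toy$hours / 10 + rnorm(n)
toy$income <- factor(ifelse(score > 17, ">50K", "<=50K"))

# Shuffle, then hold out 25% of the rows for testing.
toy <- toy[order(runif(n)), ]
n_train <- round(0.75 * n)
train <- toy[1:n_train, ]
test  <- toy[(n_train + 1):n, ]

# Fit a classification tree and predict classes for the held-out rows.
fit  <- rpart(income ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf_mat <- table(actual = test$income, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

The diagonal cells of the table count correct predictions and the off-diagonal cells count the two kinds of error. To look for an improvement, one option is to change the tree's complexity settings (for example `cp` or `minsplit` in `rpart.control`), refit, and compare the new confusion matrix with the first one.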
• Part B: Association Rules (5 marks)
• Focus on the individuals who make <= 50K.
• Pick at least eight variables from the data set which will be used to create association rules. Age must be one of the eight variables.
• Pick support and confidence level to generate no more than 150 rules with minimum length of 3.
• Show some of the rules that have at least 5 items in them.
• What are your impressions of the rules? Explain.
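One possible way to approach Part B is sketched below with the `arules` package, which is assumed to be installed. The data are synthetic, the column names are hypothetical stand-ins for the real ones in data.csv, and the support and confidence values are placeholders that would need tuning on the real data to stay under 150 rules. The toy set has only six variables, whereas the task asks for at least eight.

```r
library(arules)  # assumed to be installed; not part of base R

# Toy stand-in for the <=50K subset (synthetic values, assumed column names).
set.seed(12345)
n <- 300
low <- data.frame(
  age       = sample(18:70, n, replace = TRUE),
  sex       = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  marital   = factor(sample(c("Married-civ-spouse", "Divorced", "Never-married"),
                            n, replace = TRUE)),
  workclass = factor(sample(c("Private", "Self-emp-not-inc"), n, replace = TRUE,
                            prob = c(0.8, 0.2))),
  race      = factor(sample(c("White", "Asian-Pac-Islander"), n, replace = TRUE,
                            prob = c(0.85, 0.15))),
  country   = factor(sample(c("United-States", "Cambodia"), n, replace = TRUE,
                            prob = c(0.9, 0.1)))
)

# Apriori works on categorical items, so bin the continuous age column first.
low$age <- cut(low$age, breaks = c(17, 30, 45, 60, 70),
               labels = c("18-30", "31-45", "46-60", "61-70"))

# Convert to transactions and mine rules with at least 3 items each.
trans <- as(low, "transactions")
rules <- apriori(trans,
                 parameter = list(support = 0.05, confidence = 0.8, minlen = 3))

length(rules)                               # check this against the 150-rule limit
inspect(head(sort(rules, by = "lift"), 5))  # strongest rules by lift

# Keep only the rules with at least 5 items (left-hand plus right-hand side).
long_rules <- rules[size(rules) >= 5]
inspect(head(long_rules, 5))
```

Raising `support` or `confidence` shrinks the rule set and lowering them grows it, so the two thresholds can be adjusted until `length(rules)` is at most 150.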