CSC 480: Previous Exam
NOTE: This is a very long exam! Please, do as much as you can! I will adjust the scores afterwards!
1. Decision Trees (15 points: 7, 6, 2)
Consider the following database of 10 voters described by three features (Age, Income, and Gender) and a binary class, Vote.
ID   Age         Income   Gender   Vote
1    Young       Low      Male     Trump
2    MiddleAged  Medium   Female   Trump
3    Old         High     Female   Clinton
4    Old         Low      Female   Trump
5    MiddleAged  High     Male     Trump
6    Young       Medium   Male     Clinton
7    MiddleAged  Low      Female   Clinton
8    Old         Medium   Male     Trump
9    Young       Low      Female   Clinton
10   Young       Medium   Male     Clinton
a. What feature would be best to place at the root of a decision tree according to an information-gain type of criterion? Explain your answer. [Important: Please do not calculate the actual information gain values of each feature. Instead, draw the three competing partial decision trees (i.e., the root + division into two or three subsets of data) and estimate the one that you believe would give you the highest information gain. Explain why you chose that one over the others].
b. Using the partial tree you chose in the last question, build a complete decision tree (it does not have to be optimal).
c. Would it make sense to prune your decision tree? Please explain why or why not.
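Part (a) asks for an estimate by inspection rather than a calculation. For checking that estimate afterwards, the following is a minimal sketch that computes the entropy-based information gain of each feature on the 10 voters. The names VOTERS, FEATURES, entropy, information_gain, gains and best_feature are illustrative and not part of the exam.

```python
from collections import Counter
from math import log2

# (Age, Income, Gender, Vote) for voters 1-10, copied from the table in question 1.
VOTERS = [
    ("Young", "Low", "Male", "Trump"),
    ("MiddleAged", "Medium", "Female", "Trump"),
    ("Old", "High", "Female", "Clinton"),
    ("Old", "Low", "Female", "Trump"),
    ("MiddleAged", "High", "Male", "Trump"),
    ("Young", "Medium", "Male", "Clinton"),
    ("MiddleAged", "Low", "Female", "Clinton"),
    ("Old", "Medium", "Male", "Trump"),
    ("Young", "Low", "Female", "Clinton"),
    ("Young", "Medium", "Male", "Clinton"),
]
FEATURES = {"Age": 0, "Income": 1, "Gender": 2}  # column index of each feature


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the whole set minus the weighted entropy after splitting on `column`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(VOTERS, col) for name, col in FEATURES.items()}
best_feature = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Highest gain:", best_feature)
```

The printed ranking can be compared with the one obtained from the three hand-drawn partial trees.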
2. Neural Networks (15 points: 6, 6, 3)
Consider the following matrices representing the initial weight values of a Multi-Layer Perceptron (MLP): the first table gives the weights going from the input layer to the hidden layer, and the second table gives the weights going from the hidden layer to the output layer. The network’s input layer i is of size 4, its hidden layer j is of size 3, and its output layer k is of size 1.
wij’s   j=1  j=2  j=3
i=1      5    8    2
i=2      3    2    4
i=3      1    3    2
i=4      4    1    2

Wjk’s   k=1
j=1      2
j=2      1
j=3      3
a. Draw the MLP network associated with these matrices. Indicate the weight values on your graph.
b. Calculate the value output by the network for the following instance:

i’s     1  2  3  4
value   1  2  1  3

To simplify calculations, we will assume that the g() function applied to the hidden layer and to the output layer is not the sigmoid function, but instead the following linear function: g(x) = 2*x + 3.
c. Assuming that the expected output for the above instance is 100, do you expect the delta value added to each weight during the first backward pass of backpropagation to be positive or negative? Why?
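For checking a hand calculation of part (b), the forward pass can be sketched as follows. This is only a sketch under stated assumptions: the weight tables are read with rows as the source unit and columns as the destination unit, no bias terms are used (none are given in the question), and the names W_IH, W_HO, x, hidden, output and error are illustrative.

```python
# Input-to-hidden weights: W_IH[i][j] is the weight from input i+1 to hidden unit j+1.
W_IH = [
    [5, 8, 2],
    [3, 2, 4],
    [1, 3, 2],
    [4, 1, 2],
]
# Hidden-to-output weights: W_HO[j] is the weight from hidden unit j+1 to the single output.
W_HO = [2, 1, 3]

x = [1, 2, 1, 3]   # the instance from part (b)
target = 100       # expected output from part (c)


def g(v):
    """Linear activation given in the question, used instead of the sigmoid."""
    return 2 * v + 3


# Each hidden unit applies g to the weighted sum of the inputs (no bias assumed).
hidden = [g(sum(x[i] * W_IH[i][j] for i in range(len(x)))) for j in range(len(W_HO))]

# The output unit applies g to the weighted sum of the hidden activations.
output = g(sum(hidden[j] * W_HO[j] for j in range(len(hidden))))

# Sign of (target - output) is the quantity part (c) asks you to reason about.
error = target - output

print("hidden activations:", hidden)
print("network output:", output)
print("target - output:", error)
```

Because g is linear with a positive slope and all inputs and weights here are positive, the sign of target - output is a useful starting point for the reasoning in part (c).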
3. Theory I (15 points: 3, 6, 3, 3)
a. What is the hypothesis space restriction bias (representational bias) of a classifier?
b. How is that bias implemented in:
i. The candidate elimination algorithm (the algorithm used with version spaces)
ii. Decision Trees
iii. Multi-Layer Perceptrons
c. What is the name of the other inductive bias we introduced?
d. Do Multi-Layer Perceptrons implement that other bias? If so, how?
4. Theory II (10 points: 3, 4, 3)
a. What is the dimensionality of a problem?
b. Why is this dimensionality sometimes a curse?
c. What can be done to remedy the so-called “curse of dimensionality”?
5. Theory III (15 points: 6, 4, 5)
Does a classifier that overfits the data have a tendency to exhibit an error dominated by bias or variance? Please answer this question going through the following steps:
a. Give a clear definition of bias and variance in the context of the bias/variance dilemma (in your own words)
b. Give a clear definition of overfitting (in your own words)
c. Answer the question by showing the reasoning that leads you to an answer.
Feel free to illustrate your answers with graphs or other kinds of pictures.
6. Evaluation (15 points: 3, 2, 4, 6)
Consider the following table listing the AUC obtained by C4.5 and MLP on 7 different domains, D1 to D7:

      C4.5   MLP
D1    .8     .7
D2    .7     .9
D3    .6     .5
D4    1      1
D5    .9     .6
D6    .8     .9
D7    .9     .9
a. In your opinion, which classifier performs better? Using only common sense, would you say that this result is statistically significant? Please explain your answer.
b. In a situation like this one, two statistical tests are recommended. One is Wilcoxon's Signed-Rank Test. What is the other one called?
c. What are the advantages and disadvantages of each of these tests?
d. Please compute the T_Wilcox statistic.
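For checking a hand computation of part (d), here is a minimal sketch of the signed-rank calculation on the AUC table. It makes two assumptions that should be compared against the convention taught in class: zero differences are dropped before ranking, and tied absolute differences share their average rank. The names AUC, average_ranks, w_plus, w_minus and t_wilcox are illustrative.

```python
# AUC per domain as (C4.5, MLP), copied from the table in question 6.
AUC = {
    "D1": (0.8, 0.7),
    "D2": (0.7, 0.9),
    "D3": (0.6, 0.5),
    "D4": (1.0, 1.0),
    "D5": (0.9, 0.6),
    "D6": (0.8, 0.9),
    "D7": (0.9, 0.9),
}


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


# Rounding removes floating-point noise so that equal differences really tie.
diffs = [round(c45 - mlp, 6) for c45, mlp in AUC.values()]
# Assumption: domains with a zero difference are discarded.
nonzero = [d for d in diffs if d != 0]
ranks = average_ranks([abs(d) for d in nonzero])

w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)   # C4.5 better than MLP
w_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)  # MLP better than C4.5
t_wilcox = min(w_plus, w_minus)

print("W+ =", w_plus, " W- =", w_minus, " T =", t_wilcox)
```

If the course convention splits the ranks of zero differences between the two sums instead of dropping them, the two sums change accordingly and the sketch would need to be adapted.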
7. Practical Problem (Keep this question for last as it is open ended and may take you a long time) (15 points: 8, 3, 4)
As a data scientist for a school board, you were asked to build a classifier to help the board hire suitable high school teachers for the district it represents. Assume that by the time your system kicks in, the candidates have already been interviewed and plenty of data is available about them. That same information and more is available about teachers (good ones and bad ones) who have been working in the system for a while.
a. How would you go about setting up such a system? Please, give a description of what instances would look like, including the features and class you would use for them. [Assume that any feature you wish for is available from both the background and interview data]
b. Would you treat the problem as a single problem or would you divide it into subproblems? Which ones and how?
c. In your heart of hearts, do you believe that this endeavour is a worthy one? Why or why not (don't worry, I won't fire you!)?