CptS 315 Introduction to Data Mining Midterm Exam 1, Spring 2021
Exam date: Mar 18 @ 9am to Mar 19 @ 5pm
Your Name and WSU ID:
Instructions.
• The maximum score of the exam is 100 points.
• Read all the questions before starting to answer. Answer the questions you find easiest first.
• Work efficiently. Most questions don’t require much work.
• Keep your answers short and simple.
Short Questions (20 points)
Please keep your answers short. For the True/False questions, please also provide a short justification.
1. (4 points) The Park-Chen-Yu (PCY) algorithm is always more efficient than the Apriori algorithm for finding frequent item-pairs. (True/False)
2. (4 points) User-user collaborative filtering and item-item collaborative filtering are duals of each other. Therefore, they are equally accurate when applied to real-world applications. (True/False)
3. (4 points) A classifier trained on more training data is more likely to over-fit. (True/False)
4. (4 points) Passive learning is label-efficient compared to the active learning protocol. (True/False)
5. (4 points) If you are given m data points, and use half for training and half for testing, the difference between training error and testing error decreases as m increases. (True/False) Please provide a one-sentence justification.
Frequent Itemset and Association Rule Mining (25 points)
6. (6 points) Suppose the support of {A} is 5, the support of {B} is 7, the support of {A, B} is 4, the support of {B, C} is 3, the support of {A, C} is 4, and the support of {A, B, C} is 2. What is the confidence of the following association rules?
6.1 A ⇒ {B, C}
6.2 {A, B} ⇒ C
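For reference, the confidence of a rule X ⇒ Y is defined as supp(X ∪ Y) / supp(X). A minimal Python sketch of this computation over the support counts given above (single-character item names are used only for convenience):

# Support counts from the question, keyed by frozenset of items.
support = {
    frozenset("A"): 5,
    frozenset("B"): 7,
    frozenset("AB"): 4,
    frozenset("BC"): 3,
    frozenset("AC"): 4,
    frozenset("ABC"): 2,
}

def confidence(lhs, rhs):
    # confidence(lhs => rhs) = support(lhs union rhs) / support(lhs)
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support[lhs | rhs] / support[lhs]

print(confidence("A", "BC"))    # rule 6.1: support of {A,B,C} over support of {A}
print(confidence("AB", "C"))    # rule 6.2: support of {A,B,C} over support of {A,B}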
7. (4 points) Describe the key property that is exploited by the Apriori algorithm for efficiently computing the frequent itemsets. Is it true that this property is applicable only to computing frequent pairs?
8. (5 points) Suppose we have a total of 100 items. The number of frequent items equals 10. How many candidate pairs will the Apriori algorithm consider for counting in Pass 2?
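In Pass 2, Apriori counts exactly the pairs whose two items are both frequent. A quick sketch of that count (the item names are placeholders):

from itertools import combinations

frequent_items = [f"item{i}" for i in range(10)]   # the 10 frequent items

# Pass 2 candidates: every pair of frequent items.
candidate_pairs = list(combinations(frequent_items, 2))
print(len(candidate_pairs))   # C(10, 2)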
9. (5 points) Suppose we have a total of 100 items. The number of frequent items equals 10, and the number of frequent pairs equals 15. What can you say about the number of candidate triples the Apriori algorithm will consider for counting in Pass 3?
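By the same monotonicity property, a triple is a Pass 3 candidate only if all three of its 2-subsets are frequent pairs. A sketch of this candidate-generation rule, with a made-up set of frequent pairs:

from itertools import combinations

def candidate_triples(frequent_pairs):
    # Triples {a, b, c} such that all three of their 2-subsets are frequent pairs.
    items = sorted(set().union(*frequent_pairs))
    return [
        trip for trip in combinations(items, 3)
        if all(frozenset(pair) in frequent_pairs for pair in combinations(trip, 2))
    ]

# Example with hypothetical frequent pairs:
pairs = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(candidate_triples(pairs))   # [('a', 'b', 'c')]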
10. (5 points) Suppose you have a real-world application with 1 billion items and 100 billion baskets. You have access to a lot of parallel computing resources. Which of the following frequent itemset mining algorithms will you employ?
a) Apriori algorithm
b) Park-Chen-Yu (PCY) algorithm
c) SON algorithm
d) Toivonen algorithm
Please write a one-sentence justification.
11. (5 points) Association rules with 100 percent confidence are useless. (True/False)
Recommender Systems (20 points)
12. (5 points) What is the key idea behind the content-based filtering algorithm for answering the basic filtering question: “will user U like item X?”
a) Look at what items U likes, and then check if X is similar to those items
b) Look at which users like X, and then check if U is similar to those users
13. (5 points) Root Mean Squared Error (RMSE) is the appropriate metric to evaluate the effectiveness of the predictions of recommendation algorithms. (True/False)
14. (5 points) When applying recommendation algorithms, what do we achieve by normalizing each row of the utility matrix (subtracting the row mean from the rating values)?
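As a reminder of what this normalization looks like, here is a small NumPy sketch that mean-centers each row of a toy utility matrix, with NaN marking missing ratings (the matrix values are made up):

import numpy as np

# Toy utility matrix: rows = users, columns = items, NaN = unrated.
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
])

# Subtract each row's mean over its rated entries only.
row_means = np.nanmean(R, axis=1, keepdims=True)
R_centered = R - row_means

print(R_centered)   # ratings now expressed relative to each user's average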
15. (5 points) List two drawbacks each for the content-based filtering and collaborative filtering approaches.
Machine Learning (30 points)
16. (5 points) Suppose we have binary classification data with classes Y ∈ {+1, −1} and d features, with each feature fi ∈ {+1, −1}. To improve the performance of the classifier, Jana decided to duplicate each feature. Hence, each training example now has 2d features with fd+i = fi for i = 1, 2, · · · , d. This question is about comparing the training problem with the original feature set and the doubled feature set. Assume that there are the same number of training examples for both the positive and negative class, and that in case of ties you will choose the positive class. For a Perceptron classifier, select all that apply.
a) Test accuracy with the original feature set could be higher
b) Test accuracy with the doubled feature set could be higher
c) Test accuracy will be the same with both the original and the doubled feature set
Please write a short justification.
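One way to reason about this question: with zero-initialized weights, duplicating every feature exactly doubles the activation on each example, so the sign, and hence every prediction and mistake, is unchanged. A small sketch that checks this claim on made-up data (the training loop is the standard mistake-driven Perceptron):

import numpy as np

def perceptron_train(X, y, passes=10):
    # Standard mistake-driven Perceptron; ties at 0 go to the positive class.
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= 0 else -1
            if pred != yi:
                w = w + yi * xi
    return w

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(20, 4))    # 20 examples, d = 4 features in {+1, -1}
y = np.where(X[:, 0] > 0, 1.0, -1.0)         # arbitrary labels for the demo

w_orig = perceptron_train(X, y)
w_dup = perceptron_train(np.hstack([X, X]), y)   # duplicate every feature: 2d columns

# The doubled run learns weights (w_orig, w_orig), so predictions are identical.
print(np.allclose(w_dup, np.tile(w_orig, 2)))    # True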
Consider the following Perceptron, whose inputs are the always-“1” feature and two binary features x1 ∈ {0, 1} and x2 ∈ {0, 1}. The output label is y ∈ {0, 1}. Let w0, w1, w2 denote the weights of the three features. The classification decision is made as follows: y = 1 if (w0 + w1 · x1 + w2 · x2) > 0; otherwise, y = 0.
17. (5 points) Which of the following choices for the weight vector (w0, w1, w2) can classify y as y = (x1 XOR x2)? XOR is the logical exclusive-OR operation, which equals ZERO when x1 equals x2, and ONE when x1 differs from x2.
a) (1, 1, 0)
b) (-2, 1, 1.5)
c) Any weights that satisfy (−w1 − w2) < w0 < min(0, −w1, −w2)
d) No weights can compute the XOR logical relation
18. (5 points) Which of the following choices for the weight vector (w0, w1, w2) can classify y as y = (x1 AND x2)? Here AND refers to the logical AND operation, which equals ONE when x1 = 1 and x2 = 1, and ZERO for all other combinations.
a) (1, 1, 0)
b) (-1.5, 1, 1)
c) (-2, 1, 1.5)
d) Any weights that satisfy (−w1 − w2) < w0 < min(0, −w1, −w2)
e) No weights can compute the AND logical relation
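For questions 17 and 18, any candidate weight vector can be checked by enumerating all four input combinations. A minimal sketch of such a checker:

def predicts(w0, w1, w2, target):
    # True if the unit y = [w0 + w1*x1 + w2*x2 > 0] computes target on all inputs.
    for x1 in (0, 1):
        for x2 in (0, 1):
            y = 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0
            if y != target(x1, x2):
                return False
    return True

XOR = lambda x1, x2: x1 ^ x2
AND = lambda x1, x2: x1 & x2

print(predicts(1, 1, 0, XOR))       # option a) of question 17
print(predicts(-2, 1, 1.5, XOR))    # option b) of question 17
print(predicts(-1.5, 1, 1, AND))    # option b) of question 18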
19. (5 points) As the number of passes over the training data increases for Perceptron-based learning, which of the following are False?
a) training accuracy increases
b) number of mistakes decreases
c) training accuracy decreases
d) number of mistakes increases
Please write a one-sentence justification.
20. (5 points) As the number of training examples used to learn a linear classifier increases, which of the following are False?
a) training accuracy decreases
b) testing accuracy decreases
c) training accuracy increases
d) testing accuracy increases
Please write a one-sentence justification.
21. (5 points) Suppose we are training a multi-class Perceptron for a task with three classes (good, bad, ugly). The current weights of the multi-class Perceptron classifier are as follows.
wgood = (-1, -1, -1, -1) for class good
wbad = (-1, +1, +1, -1) for class bad
wugly = (-1, -1, -1, -1) for class ugly
Consider the training example x = (0, 1, 1, −1) with correct label good.
a) Which classification label is predicted for the training example x with the current weights?
b) What are the weights (wgood, wbad, wugly) after the update that incorporates the training example x?
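For reference, the standard multi-class Perceptron predicts the class whose weight vector has the largest dot product with x, and on a mistake adds x to the correct class's weights and subtracts x from the predicted class's weights. A sketch with the numbers above (assuming that standard update rule):

import numpy as np

weights = {
    "good": np.array([-1.0, -1.0, -1.0, -1.0]),
    "bad":  np.array([-1.0,  1.0,  1.0, -1.0]),
    "ugly": np.array([-1.0, -1.0, -1.0, -1.0]),
}
x = np.array([0.0, 1.0, 1.0, -1.0])
true_label = "good"

# Predict: class with the highest score w_c . x (tie-breaking not specified here).
predicted = max(weights, key=lambda c: weights[c] @ x)
print(predicted)

# Standard multi-class Perceptron update on a mistake.
if predicted != true_label:
    weights[true_label] += x
    weights[predicted] -= x
print(weights)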
Extra Sheet