Data Mining Techniques (CS 6220) Unsupervised Machine Learning and Data Mining (DS 5230) Spring 2019 – Homework 1 – 100 points
Due Friday February 1 at 11:59 PM Eastern
This homework consists of problems covering association rules, frequent itemsets, and density estimation.
Here are a few instructions to make life easier for all of us:
• Turn in your answers in a single file, and submit it into Blackboard.
• Please write concisely and clearly. There are points for intermediate steps, but none for “talking problems to death.”
• Have the following at the top of every page: your last name, first name, and the page number out of the total number of pages.
  o Example: Smith, John [1/10]
• Your solution should have a cover-page that provides the following information:

  Problem Number     Solution Page Number
  1a
  1b
  1c
  …

  If you did not answer a problem, then enter “not done” instead of the solution page number.
• It is highly recommended that you turn in a typed copy of your homework (as opposed to a scanned hand-written copy). This is especially true for equations and plots.
• As always, your TAs (Deeksha Doddahonnaiah
[4 points] Question 1: Suppose there are 100 items, numbered 1 to 100, and also 100 baskets, also numbered 1 to 100. Item i is in basket b if and only if i divides b with no remainder. Thus, item 1 is in all the baskets, item 2 is in all fifty of the even-numbered baskets, and so on. Basket 12 consists of items {1, 2, 3, 4, 6, 12}, since these are all the integers that divide 12. If the support threshold is 5, which items are frequent? Provide a brief explanation.
[3 points] Question 2: Suppose there are 20 items, numbered 1 to 20. If we use a triangular matrix to count pairs, what pair’s count is in array[120]? Provide a brief explanation.
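If it helps to see how a triangular matrix indexes pairs, here is a minimal Python sketch of one common convention, the 1-based lexicographic layout from Mining of Massive Datasets, where pair {i, j} with i < j sits at position k = (i − 1)(n − i/2) + j − i. Treat the convention (and the example pair, which is not the answer to this question) as an assumption.

    # Minimal sketch (one possible convention, not necessarily the one intended here):
    # 1-based position of pair {i, j} in a triangular array over n items, with i < j.
    def triangular_index(i, j, n):
        assert 1 <= i < j <= n
        return int((i - 1) * (n - i / 2) + j - i)

    # Arbitrary example with n = 20 items: the pair {2, 5} lands at position 22.
    print(triangular_index(2, 5, 20))  # -> 22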
[4 points] Question 3:
[1.5 point] 3.a: For two itemsets X and Y, is the confidence for X→Y always equal to the confidence for
Y→X? Briefly explain your answer.
[2.5 points] 3.b: Suppose you have seven items A, B, C, D, E, F, and G. Which of the following association rules has a confidence that is certain to be at least as great as the confidence of AB→CDEFG and no greater than the confidence of ABCD→E? Briefly explain your answer.
(a) ABC→DEF  (b) ACD→BEF  (c) ABC→DFG  (d) ACD→BEG
[6 points] Question 4: Suppose you are given the following frequent pairs: {A, B}, {A, C}, {A, D}, {B, D}, {B, E}, {C, E}, and {D, E}.
[2 points] 4.a: List the candidate triples that satisfy Apriori’s Monotonicity Principle (which says that if a set of items I appears at least s times, so does every subset J of I). Briefly explain your answer.
[4 points] 4.b: List the candidate triples that do not satisfy Apriori’s Monotonicity Principle. Briefly explain your answer.
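As a hedged illustration of the Monotonicity Principle in action (generic code with a made-up pair list, not the pairs given in this question): a triple is a candidate only if all three of its 2-item subsets are frequent pairs.

    from itertools import combinations

    # Sketch: keep only triples whose every 2-item subset appears among the frequent pairs.
    def candidate_triples(frequent_pairs):
        pairs = {frozenset(p) for p in frequent_pairs}
        items = sorted({x for p in pairs for x in p})
        return [t for t in combinations(items, 3)
                if all(frozenset(s) in pairs for s in combinations(t, 2))]

    # Hypothetical frequent pairs, not the ones listed in Question 4.
    print(candidate_triples([("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Y", "W")]))  # -> [('X', 'Y', 'Z')]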
[6 points] Question 5: During a run of Toivonen’s Algorithm with a set of items {A,B,C,D,E,F,G,H}, a sample is found to have the following maximal frequent itemsets: {A,B}, {A,C}, {A,D}, {B,C}, {E}, {F}. Identify in the list below the sets that are NOT in the negative border. Give a brief explanation for your choice(s).
• {B,F}
• {B,E}
• {A,E}
• {A,B,D}
• {G}
• {G,H}
• {B,D}
• {D}
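For reference, here is a hedged sketch of the negative-border test used in Toivonen’s Algorithm, assuming “frequent in the sample” means “a subset of some maximal frequent itemset”; the maximal sets below are hypothetical, not the ones in Question 5.

    # Sketch: a set is in the negative border if it is NOT frequent in the sample,
    # but every immediate subset (the set with one item removed) IS frequent.
    def is_sample_frequent(itemset, maximal_frequent):
        return any(itemset <= m for m in maximal_frequent)

    def in_negative_border(itemset, maximal_frequent):
        if is_sample_frequent(itemset, maximal_frequent):
            return False
        return all(is_sample_frequent(itemset - {x}, maximal_frequent) for x in itemset)

    # Hypothetical maximal frequent itemsets, not the ones from the question.
    maximal = [frozenset(s) for s in ({"X", "Y"}, {"Z"})]
    print(in_negative_border(frozenset({"X", "Z"}), maximal))  # True: {X}, {Z} frequent, {X,Z} not
    print(in_negative_border(frozenset({"X", "W"}), maximal))  # False: {W} is not frequent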
[6 points] Question 6: Imagine there are 100 baskets, numbered 1, 2, …, 100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?
(i) {4,10,12} → 80   (ii) {3,4,5} → 42   (iii) {4,6} → 12   (iv) {2,3,5} → 45
[5 points] Question 7: Suppose ABC is a frequent itemset and BCDE is NOT a frequent itemset. Given this information, we can be sure that certain other itemsets are frequent and sure that certain itemsets are NOT frequent. Other itemsets may be either frequent or not. Which of the following is a correct classification of an itemset? Give a brief explanation for your choices.
(1) ABCDE is not frequent.
(2) BCD can be either frequent or not frequent.
(3) CDE is not frequent.
(4) AC can be either frequent or not frequent.
(5) BCE is frequent.
(6) BC can be either frequent or not frequent.
(7) ABCD is frequent.
(8) B can be either frequent or not frequent.
(9) BC is frequent.
(10) ABCD can be either frequent or not frequent.
[3 points] Question 8: Suppose you are given the following association rule A→B and are told that its confidence is the same as its support. What does that tell you about A and B? Briefly explain your answer. Recall that Support(A→B) = Support(A ∪ B).
[10 points] Question 9: Suppose we have transactions that satisfy the following assumptions:
• The support threshold s is 10,000.
• There are one million items, which are represented by the integers 0,1,…,999999.
• There are N frequent items – i.e., items that occur 10,000 times or more.
• There are one million pairs that occur 10,000 times or more.
• There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items; the
other M each have at least one non-frequent item.
• No other pairs occur.
• Integers are always represented by 4 bytes.
Suppose we run the A-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (which uses a triangular array count[i][j] that holds an integer count for each pair of items (i, j) with i < j) and a hash table of item-item-count triples. In the first case (i.e., the triangular-matrix method), neglect the space needed to translate between the original item numbers and the numbers for the frequent items. In the second case (i.e., the hash table of item-item-count triples), neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers. As a function of N and M, what is the minimum number of bytes of main memory
needed to execute the A-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate – i.e., to within 10%) minimum number of bytes of main memory S, needed for the A-priori algorithm to execute with this data.
(a) N = 50,000; M = 40,000,000; S = 800,000,000
(b) N = 10,000; M = 40,000,000; S = 200,000,000
(c) N = 30,000; M = 100,000,000; S = 500,000,000
(d) N = 10,000; M = 50,000,000; S = 600,000,000
Hints: In all possible choices, the values of N and M are such that you can ignore the space needed for the first pass; and you can ignore the space needed to list the frequent items on the second pass.
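If it helps, here is a hedged sketch of the per-structure space accounting implied by the setup above (the helper names and the example arguments are hypothetical and are not this question’s N and M): the triangular matrix reserves a 4-byte count for every pair of renumbered frequent items, while the triples method spends 12 bytes (item, item, count) per pair that is actually counted.

    # Sketch of the second-pass space accounting (hypothetical helpers, illustrative values only).
    def triangular_bytes(num_frequent_items):
        # One 4-byte count per unordered pair of frequent items.
        return 4 * num_frequent_items * (num_frequent_items - 1) // 2

    def triples_bytes(num_counted_pairs):
        # 12 bytes = two 4-byte item numbers plus a 4-byte count, per pair actually counted.
        return 12 * num_counted_pairs

    print(triangular_bytes(1000), triples_bytes(300))  # e.g. 1998000 3600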
[10 points] Question 10: Suppose we perform the PCY algorithm to find frequent pairs, with market-basket data having the following specifications:
• The support threshold s is 10,000.
• There are one million items, which are represented by the integers 0,1,...,999999.
• There are 250,000 frequent items – i.e., items that occur 10,000 times or more.
• There are one million pairs that occur 10,000 times or more.
• There are P pairs that occur exactly once and consist of 2 frequent items.
• No other pairs occur.
• Integers are always represented by 4 bytes.
• When we hash pairs, they distribute among buckets randomly, but as evenly as possible – i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.
Suppose there are S bytes of main memory. In order to run the PCY algorithm successfully, the number of buckets must be sufficiently large. In addition, on the second pass, there must be enough room to count all the candidate pairs. As a function of S, what is the largest value of P for which we can successfully run the PCY algorithm on this data? Demonstrate that you have the correct formula by indicating which of the following is a value for S and a value for P that is approximately (i.e., to within 10%) the largest possible value of P for that S.
(a) S = 500,000,000; P = 5,000,000,000
(b) S = 200,000,000; P = 400,000,000
(c) S = 1,000,000,000; P = 10,000,000,000
(d) S = 100,000,000; P = 120,000,000
Hints: Here are a few simplifying assumptions that hold for the choices of S and P in this question:
• The space needed on the first pass to count items is negligible.
• On the first pass, buckets without a frequent pair will not have a large enough count to be frequent.
• Almost all the candidate pairs will be infrequent – i.e., there are many more than 1,000,000 such pairs.
• Only buckets that contain one of the frequent pairs are frequent buckets.
• On the second pass, you can neglect the space needed to store the frequent items.
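To make the first PCY pass concrete, here is a hedged Python sketch (a hypothetical helper with toy baskets, and Python’s built-in hash standing in for the unspecified hash function): it counts items, hashes every pair in each basket to a bucket, and marks the buckets whose counts reach the support threshold.

    from collections import Counter
    from itertools import combinations

    # Sketch of PCY pass 1: count items, hash each pair of basket items to a bucket,
    # and record which buckets end up frequent.
    def pcy_pass1(baskets, num_buckets, support):
        item_counts = Counter()
        bucket_counts = Counter()
        for basket in baskets:
            item_counts.update(basket)
            for i, j in combinations(sorted(basket), 2):
                bucket_counts[hash((i, j)) % num_buckets] += 1
        frequent_buckets = {b for b, c in bucket_counts.items() if c >= support}
        return item_counts, frequent_buckets

    # Toy data, not the market-basket statistics described above.
    print(pcy_pass1([{1, 2, 3}, {2, 3}, {1, 3}], num_buckets=8, support=2))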
[6 points] Question 11: Suppose we perform the 3-pass Multistage Algorithm to find frequent pairs, with market-basket data meeting the following specifications:
• The support threshold is 10,000.
• There are one million items, represented by the integers 0, 1, ... , 999999.
• All items are frequent; that is, they occur at least 10,000 times.
• There are one million pairs that occur 10,000 times or more.
• There are P pairs that occur exactly once.
• No other pairs occur at all.
• Integers are always represented by 4 bytes.
• When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.
• The hash functions used on the first two passes are completely independent.
Suppose there are S bytes of main memory. As a function of S and P, what is the expected number of candidate pairs on the third pass of the Multistage Algorithm? That is, give a formula for N, the approximate expected number of candidate pairs, as a function of S and P.
Demonstrate the correctness of your formula by identifying which of the following triples of values for S, P, and N gives approximately (i.e., to within 10%) the expected number of candidate pairs for the third pass.
(a) S = 100,000,000; P = 1,000,000,000; N = 1,330,000
(b) S = 300,000,000; P = 100,000,000,000; N = 9,300,000
(c) S = 300,000,000; P = 100,000,000,000; N = 19,000,000
(d) S = 100,000,000; P = 1,000,000,000; N = 640,000
Hints:
• Neglect the space needed to count frequent items or to store the list of frequent items on later passes.
• You can neglect the fraction of space used by the bitmaps --- they are only 1/32 of S.
• Remember all items are frequent, so the only way to rule out a candidate pair is for it to appear in
an infrequent bucket.
• You may assume that all frequent pairs wind up in different buckets -- almost all will anyway.
• You may assume that there are not enough infrequent pairs to make a bucket frequent without having a frequent pair.
[18 points] Question 12: Consider kernel density estimation with a Boxcar Kernel. Given data, X1, X2, ..., Xn, the empirical PDF is defined as follows:
p̂_h(x) = (1/n) Σ_{i=1}^{n} [ I(|x − X_i| ≤ h) / (2h) ]
where |y| denotes the absolute value of y, and I(b) is an indicator function that is 1 if the Boolean argument b evaluates to true, and 0 otherwise. The Boxcar Kernel is sometimes referred to as the Rectangular or Tophat Kernel.
[6 points] 12.a: Draw p̂_h(x) for the data listed at http://eliassi.org/datamining/eruptionData.csv with h = 0.37 and domain of x going from −∞ to ∞. Report p̂_h(2) and p̂_h(4) to two significant digits.
Hint: You can use whatever programming language you like, but I recommend one of the following tools (a short sketch follows this list):
• Matlab’s dfittool, or
• Mathematica’s SmoothKernelDistribution with the InterpolationPoints
option, or
• Python’s sklearn.neighbors.KernelDensity, or
• R’s kde function.
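One possible starting point (a minimal sketch, not the official solution) uses scikit-learn’s tophat kernel, which matches the boxcar definition above; it assumes the data file has been downloaded locally as eruptionData.csv with one numeric value per line and no header.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    # Assumes a local copy of http://eliassi.org/datamining/eruptionData.csv,
    # one numeric value per line, no header.
    X = np.loadtxt("eruptionData.csv").reshape(-1, 1)
    h = 0.37

    # sklearn's "tophat" kernel with bandwidth h matches the boxcar estimator above.
    kde = KernelDensity(kernel="tophat", bandwidth=h).fit(X)
    print(np.exp(kde.score_samples([[2.0], [4.0]])))  # p_hat(2), p_hat(4); score_samples is log-density

    # Equivalent direct computation from the definition.
    def boxcar_kde(x, data, bandwidth):
        return np.mean(np.abs(x - data) <= bandwidth) / (2 * bandwidth)

    print(boxcar_kde(2.0, X.ravel(), h), boxcar_kde(4.0, X.ravel(), h))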
[6 points] 12.b: Let F(x) = P(X ≤ x) represent the true CDF over a random variable X. Show that

E[p̂_h(x)] = (F(x + h) − F(x − h)) / (2h)
Hints:
• E[X] = Σ_i (X_i × p(X_i))
• The expectation of a sum is equal to the sum of the expectations.
• Since I is an indicator function, E[I(|x − X| ≤ h)] = P(|x − X| ≤ h).
[6 points] 12.c: Determine if our kernel density estimator will be an unbiased estimator in the following scenarios. Let PDF(x) denote the true probability density function. Recall that an estimator is called unbiased if the difference between the estimator’s expected value and the true value of the parameter of interest is zero (the general expression is written out after the scenarios below). Hint: PDF(x) = dF(x) / dx, where F(x) is the true CDF over a random variable X.
• Scenario A:
  o PDF(x) is uniform between 0 and 1.
  o p̂_h(1/2) is estimated using h = 1/3.
• Scenario B:
  o PDF(x) = 2x for 0 ≤ x ≤ 1.
  o p̂_h(1/4) is estimated using h = 1/5.
• Scenario C:
  o PDF(x) = 1.5x² for −1 ≤ x ≤ 1.
  o p̂_h(0) is estimated using h = 1/5.
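As a hedged pointer (not the solution to any scenario): combining the result of 12.b with the hint gives the quantity to examine at the specified x and h,

  Bias(p̂_h(x)) = E[p̂_h(x)] − PDF(x) = (F(x + h) − F(x − h)) / (2h) − PDF(x),

and the estimator is unbiased in a scenario exactly when this bias is zero there.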
[19 points] Question 13: Suppose you have six items: 1 through 6. Suppose you have the following 12 baskets, each containing 3 items:
{1,2,3} {2,3,4} {3,4,5} {4,5,6} {1,3,5} {2,4,6} {1,3,4} {2,4,5} {3,5,6} {1,2,4} {2,3,5} {3,4,6}
Set the support threshold to 4. On the first pass of the PCY Algorithm use a hash table with 11 buckets, and the set {i, j} is hashed to bucket 𝑖 × 𝑗 𝑚𝑜𝑑 11.
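As a quick sanity check on the hash function (arithmetic only, using arbitrary pairs rather than the full answer to 13.b), a tiny sketch:

    # Bucket for pair {i, j} on the first pass, per the hash function given above.
    def pcy_bucket_pass1(i, j):
        return (i * j) % 11

    print(pcy_bucket_pass1(2, 5), pcy_bucket_pass1(3, 4))  # -> 10 1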
[3 points] 13.a: List the support for each item and each pair of items.
[3 points] 13.b: Which pairs hash to which buckets?
[3 points] 13.c: Which buckets are frequent?
[3 points] 13.d: Which pairs are counted on the second pass of the PCY Algorithm?
[3 points] 13.e: Let’s extend the above to be a Multistage algorithm. For the second pass, hash pairs to 9 buckets, using the hash function that hashes {i, j} to bucket 𝑖 + 𝑗 𝑚𝑜𝑑 9. Determine the counts of the buckets on the second pass. (Note that all items are frequent, so the only reason a pair would not be hashed on the second pass is if it hashed to an infrequent bucket on the first pass.)
[2 points] 13.f: Does the first pass reduce the set of candidate pairs?
[2 points] 13.g: Does the second pass reduce the set of candidate pairs?
————————————————– End of Homework 1 —————————————–