分布式算法代写 Exercise of frequent item set

Exercise of frequent item set

Master student: do at least 6 of the following exercise.

Research student: do all

(6.1.1) : Suppose there are 100 items, numbered 1 to 100, and also 100 baskets, also numbered 1 to 100. Item i is in basket b if and only if i divides b with no remainder. Thus, item 1 is in all the baskets, item 2 is in all fifty of the even-numbered baskets, and so on. Basket 12 consists of items {1, 2, 3, 4, 6, 12}, since these are all the integers that divide 12. Answer the following questions:

(a) If the support threshold is 5, which items are frequent?

(b) If the support threshold is 5, which pairs of items are frequent?

(6.2.3) : Let there be I items in a market-basket data set of B baskets. Suppose that every basket contains exactly K items. As a function of I, B, and K:

(a) How much space does the triangular-matrix method take to store the counts of all pairs of items, assuming four bytes per array element?

(b) What is the largest possible number of pairs with a nonzero count?

(6.2.4) : How would you count all itemsets of size 3 by a generalization of the triangular-matrix method? That is, arrange that in a one-dimensional array there is exactly one element for each set of three items.

(6.2.6) : Apply the A-Priori Algorithm with support threshold 5 to the data of Exercise 6.1.1.

(6.2.7) : Suppose we have market baskets that satisfy the following assumptions:
The support threshold is 10,000.
There are one million items, represented by the integers 0, 1, . . . , 999999.
There are N frequent items, that is, items that occur 10,000 times or more.
There are one million pairs that occur 10,000 times or more.
There are 2M pairs that occur exactly once. Of these pairs, M consist of two frequent items; the other M each have at least one nonfrequent item.
No other pairs occur at all.
Integers are always represented by 4 bytes.

Suppose we run the A-Priori Algorithm and can choose on the second pass between the triangular-matrix method for counting candidate pairs and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. As a function of N and M, what is the minimum number of bytes of main memory needed to execute the A-Priori Algorithm on this data?

(6.3.1) : Here is a collection of twelve baskets. Each contains three of the six items 1 through 6.

{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}

{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}

{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}

Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a hash table with 11 buckets, and the set {i, j} is hashed to bucket i×j mod 11.

(a) By any method, compute the support for each item and each pair of items.

(b) Which pairs hash to which buckets?

(d) Which pairs are counted on the second pass of the PCY Algorithm?

(6.3.3) : Suppose we run the Multihash Algorithm on the data of Exercise 6.3.1. We shall use two hash tables with five buckets each. For one, the set {i, j}, is hashed to bucket 2i+3j+4 mod 5, and for the other, the set is hashed to i+4j mod 5. Since these hash functions are not symmetric in i and j, order the items so that i < j when evaluating each hash function. Determine the counts of each of the 10 buckets. How large does the support threshold have to be for the Multistage Algorithm to eliminate more pairs than the PCY Algorithm would, using the hash table and function described in Exercise 6.3.1?

(6.3.4) : Suppose we perform the PCY Algorithm to find frequent pairs, with market-basket data meeting the following specifications:
The support threshold is 10,000.
There are one million items, represented by the integers 0, 1, . . . , 999999.
There are 250,000 frequent items, that is, items that occur 10,000 times or more.
There are one million pairs that occur 10,000 times or more.
There are P pairs that occur exactly once and consist of two frequent items.
No other pairs occur at all.
Integers are always represented by 4 bytes.
When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.

Suppose there are S bytes of main memory. In order to run the PCY Algorithm successfully, the number of buckets must be sufficiently large that most buckets are not frequent. In addition, on the second pass, there must be enough room to count all the candidate pairs. As a function of S, what is the largest value of P for which we can successfully run the PCY Algorithm on this data?

(6.3.5) : Under the assumptions given in Exercise 6.3.4, will the Multihash Algorithm reduce the main-memory requirements for the second pass? As a function of S and P, what is the optimum number of hash tables to use on the first pass?

(6.3.6) : Suppose we perform the 3-pass Multistage Algorithm to find frequent pairs, with market-basket data meeting the following specifications:
The support threshold is 10,000.
There are one million items, represented by the integers 0, 1, . . . , 999999. All items are frequent; that is, they occur at least 10,000 times.
There are one million pairs that occur 10,000 times or more.
There are P pairs that occur exactly once.
No other pairs occur at all.
Integers are always represented by 4 bytes.
When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.
The hash functions used on the first two passes are completely independent.

Suppose there are S bytes of main memory. As a function of S and P, what is the expected number of candidate pairs on the third pass of the Multistage Algorithm?

(6.4.2) : Apply Toivonen’s Algorithm to the data of Exercise 6.3.1, with a support threshold of 4. Take as the sample the first row of baskets:

{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, and {4, 5, 6}, i.e., one-third of the file. Our scaled down support theshold will be 1.

(a) What are the itemsets frequent in the sample?

(b) What is the negative border?

(c) What is the outcome of the pass through the full dataset? Are any of the itemsets in the negative border frequent in the whole?

(6.4.3) : Suppose item i appears exactly s times in a file of n baskets, where s is the support threshold. If we take a sample of n/100 baskets, and lower the support threshold for the sample to s/100, what is the probability that i will be found to be frequent? You may assume that both s and n are divisible by 100.

(6.5.1) : Suppose we are counting frequent itemsets in a decaying window with a decay constant c. Suppose also that with probability p, a given i and j. Additionally, with probability p the basket contains i but not j, and with probability p it contains j but not i. As a function of c and p, what is the fraction of time we shall be scoring the pair {i, j}?

Related Posts