Mining data stream
Anand Rajaraman, Jure Leskovec, Stanford Univ. and Jeffrey D. Ullman Stanford Univ.“Mining of Massive Datasets” , chapter 4.
Master student: do at least 6 of the following exercise.
Research student: do all
- (4.2.1) : Suppose we have a stream of tuples with the schema Grades(university, courseID, studentID, grade) Assume universities are unique, but a courseID is unique only within a university
(i.e., different universities may have different courses with the same ID, e.g., “CS101”) and likewise, studentID’s are unique only within a university (different universities may assign the same ID to different students). Suppose we want to answer certain queries approximately from a 1/20th sample of the data. For each of the queries below, indicate how you would construct the sample. That is, tell what the key attributes should be.
(a) For each university, estimate the average number of students in a course.
(b) Estimate the fraction of students who have a GPA of 3.5 or more.
(c) Estimate the fraction of courses where at least half the students got “A.”
- (4.3.1): For the situation of our running example (8 billion bits, 1 billion members of the set S), calculate the false-positive rate if we use three hash functions? What if we use four hash functions?
- (4.3.2) : Suppose we have n bits of memory available, and our set S has m members. Instead of using k hash functions, we could divide the n bits into k arrays, and hash once to each array. As a function of n, m, and k, what is the probability of a false positive? How does it compare with using k hash functions into a single array?
- (4.4.1) : Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash functions will all be of the form h(x) = ax+b mod 32 for some a and b. You should treat the result as a 5-bit binary integer. Determine the tail length for each stream element and the resulting estimate of the number of distinct elements if the hash function is:
(a) h(x) = 2x + 1 mod 32.
(b) h(x) = 3x + 7 mod 32.
(c) h(x) = 4x mod 32.
- (4.4.2) : Do you see any problems with the choice of hash functions in Exercise 4.4.1? What advice could you give someone who was going to use a hash function of the form h(x) = ax + b mod 2k?
- (4.5.3) : Suppose we are given the stream of 3, 1, 4, 1, 3, 4, 2, 1, 2.,
- to which we apply the Alon-Matias-Szegedy Algorithm to estimate the surprise number. For each possible value of i, if Xi is a variable starting position i, what is the value of Xi.value?
- if the intent of the variables is to compute third moments. What is the value of each variable at the end? What estimate of the third moment do you get from each variable? How does the average of these estimates compare with the true value of the third moment?
- (4.5.6) : If we wanted to compute fourth moments, how would we convert X.value to an estimate of the fourth moment?
- (4.5.2) : If a stream has n elements, of which m are distinct, what are the minimum and maximum possible surprise number, as a function of m and n?
- (4.6.1 ): Suppose the window is as shown in Fig. 4.2. Estimate the number of 1’s the the last k positions, for k = (a) 5 (b) 15. In each case, how far off the correct value is your estimate?
- (4.6.2) : There are several ways that the bit-stream 1001011011101 could be partitioned into buckets. Find all of them.