COMP 3801 Mid-Term 08 : 35 − 09 : 50 31.10.18
Name −−−<:
Student Number −>−−:
Instructions
1. 1 − (1 − r)b .. …. OK …now the spooky stuff ends ….
2. The exam consists of four questions, each worth 25 Marks.
1
3. You can do the rough work on the reverse/opposite side of the sheets. Ensure that the answers are provided within the allotted space (i.e. within the box). There is an extra sheet in the end, in case you run out of space.
4. Pleasedon’treprove/redesignwhathasbeenalreadybeendoneintheassignment/textbook/class – just cite and explain how you are using that.
5. If you need to, you can use a calculator.
Question
Question 1 Question 2 Question 3 Question 4
Total
Marks
2 Question 1: Assume that the underlying ‘web-graph’ is strongly-connected, i.e. between any two nodes u and v in the graph, there is a directed path from u to v and from v to u. We saw that the corresponding Markovian matrix M for this graph has all the nice prop- erties (sum of the column entries is 1, …), and the page rank corresponds to the principal Eigenvector determined by the principal Eigenvalue (λ = 1). Now suppose we incorporate the teleportation factor 0 < β < 1 in the calculation of the page rank. Consider the matrix M′ that is obtained by multiplying each of the entries of M by β and then adding (1−β)/n to it. Will the page rank with respect to teleportation correspond to the principal Eigen- vector corresponding to the principal Eigenvalue of M′. In other words, does M′ satisfy the Markovian matrix and the strongly-connected web-graph property? Answer: There was some confusion with this one. We take β · M, then add 1−β to every term in n M. We are essentially computing pagerank with teleportation using another method. I accepted two answers. You could prove that the principle Eigenvector of M′ was equal to the pagerank column vector you get using the iterative method with teleportation (that is, you multiply β · M times your pagerank vector, then add the teleportation element after, solve it iteratively or using the system of linear equations, then compare it to the principle Eigenvector of M′). This is complicated. Or you could show the resulting web-graph is strongly connected (which, since we are essentially adding the complete graph, it is), and show that M′ is Markovian. To show M′ is Markovian, do some version of the following. Let a1, a2, ..., an be some arbitrary column of M. Since M is Markovian, ni=1 ai = 1. This same column in M′ is βa1 + 1−β,βa2 + 1−β,...,βan + 1−β. Thus the sum is nnn n 1−β βai+ n i=1 n n1−β =βai + which means M′ is Markovian. i=1 i=1 n n =βai +1−β i=1 =β + 1 − β =1 3 Question 2: This question asks you to think about signature matrix and the banding technique under the Locality-Sensitive Hashing topic. Show that if two sets have Jaccard Distance of at most 0.3, then by choosing 40 signatures grouped in 10 bands where each band consists of four rows, then there is at least 93% chance that they will be detected similar. What will be the probability that they will be detected similar if their distance is atleast 0.7? Answer: I won’t do the computation, since it is straightforward. The formula (given on the first page) is 1 − (1 − sr)b. The only curveball was that the formula uses similarity, and you are given distance, which is (1 − s), if s is similarity. 4 Question 3: This problem refers to Count-Min Sketch. Suppose we have designed a beautiful CMS table with b columns and r rows for a streaming application that needs to be launched today afternoon. You worked extremely hard leading up to now to figure out the exact values of b and r. But, just now your manager informed you that you can actually increase the size of the CMS table by 50%. Should you add more rows? More columns? A combination of both? Justify your answer. Answer: Basically show that you understand the relationship between r, b, and your output. Some- thing like “I can increase the number of columns in the CMS table, and that would decrease the margin of error we expect. Or I can increase the number of rows, which would increase the probability that we are within that margin of error.” 5 Question 4: Suppose we have a stream of N = s + d numbers. In this stream, there are two types of numbers. 1. Singles - the numbers that occur exactly once, and let s be their total number. 2. Doubles - the numbers that occur exactly twice, and let d be their total number. Our task is to estimate the proportion of s and d in the stream, i.e. the quantities s/N and d/N. Since N is very very large and we have only the capability to handle approximately N/10 elements, the following method is suggested. Sample each element with probability 1/10, uniformly at random and independent of any other elements from the stream, and let the sampled set be S. Note that E[|S|] = N/10, thus we can count the number of singles and doubles in S. Let s′ be the number of singles in S and let d′ be the number of doubles in S. Are s′/|S| and d′/|S| good estimates of s/N and d/N, respectively? Answer: ForeachsinN,letX=1iss∈S,andX=0otherwise. Then E[X] = Pr[X]·1+Pr[X]·0 = s . This is not how many single numbers single∈N 10 are in S, but how many single numbers from N were brought to S. A similar operation to the one above tells us that the expected number of d ele d . The value 10 d′ indicates each element in S with a duplicate element also in S. The probability of picking both elements from N is 1 . Thus the expected number of d′ = d . That means 9d elements ofddonothavingamatchingelementinS,sowewouldcountthemins′. Thuss′ = s + 9d . 10 100 Thus d = 100 = d/10. Similarly s = 100 = s+(9d/10). Thus s′/|S| and d′/|S| are not |S| N N |S| N N good estimates of s/N and d/N. 100 100 100 ′ d ′ 10s+9d 10 10 6 Extra Sheet