COMPSCI 753
THE UNIVERSITY OF AUCKLAND
SEMESTER TWO 2020
Campus: City, Offshore Online, UoA CLC – Northeast Forestry, UoA CLC – Southwest University
Copyright By PowCoder代写 加微信 powcoder
COMPUTER SCIENCE Algorithms for Massive Data
(Time Allowed: TWO hours)
– Attempt ALL questions in this exam.
– There are 50 marks in this exam.
– The exam counts 50% towards your final mark.
Page 1 of 8
COMPSCI 753
By completing this assessment, I agree to the following declaration:
I understand that the University expects all students to complete course- work with integrity and honesty. I promise to complete this online as- sessment with the same academic integrity standards and values. Any identified form of poor academic practice or academic misconduct will be followed up and may result in disciplinary action.
As a member of the Universitys student body, I will complete this assess- ment in a fair, honest, responsible and trustworthy manner. This means that:
– I declare that this assessment is my own work.
– I will not seek out any unauthorized help in completing this assess-
– I declare that this work has not been submitted for academic credit
in another University of Auckland course, or elsewhere.
– I am aware that the University of Auckland may use Turnitin or any
other plagiarism detecting methods to check my content.
– I will not discuss the content of this assessment with anyone else in any form, including Canvas, Piazza, Facebook, Twitter or any other
social media / online platform within the assessment period.
– I will not reproduce the content of this assessment anywhere in any
Page 2 of 8
1 Locality-sensitive Hashing
1.1 Computing MinHash signatures
Given 4 sets:
S1 = {3,4,5},S2 = {0,1,2},S3 = {0,5}, Q = {0,1,2,3,4,5}.
1. Presentthesesetsasabinarymatrixwherethesetelementsare{0,1,2,3,4,5}. [1 mark]
2. Construct the MinHash signature matrix using 4 universal hash functions
h1(x)=x mod 6,
[1 mark] h2(x)=(x+1) mod 6,
h4(x) = (x+5) mod 6
J(S1, Q), J(S2, Q), and J(S3, Q). [1 mark]
4. Can we use the hash function in the form of h(x) = (x + a) mod 6, where a is an integer to simulate random permutations for our sets? Explain you answer. [2 marks]
1.2 Tuning parameters for LSH [2 marks]
Given the number of bands b and the number of rows per band r, let p = 1 − (1 − sr)b be the probability of being a candidate pair for the pair with Jaccard similarity s.
Given the following values of r and b: r = 3 and b = 10; r = 6 and b = 20; r = 5 and b = 50, we compute the value p for s in range {0.1,0.2,…,1} as follows:
h3(x) = (x+3) mod 6,
3. Consider the set Q as the query set, estimate the Jaccard similarities
0.1 0.0100
0.2 0.0772
0.3 0.2394
0.4 0.4839
0.5 0.7369
0.6 0.9123
0.7 0.9850
0.8 0.9992
0.9 1.0000
(6, 20) 0.0000 0.0013 0.0145 0.0788 0.2702 0.6154 0.9182 0.9977 1.0000
Page 3 of 8
(5, 50) 0.0005 0.0159 0.1145 0.4023 0.7956 0.9825 0.9999 1.0000 1.0000
COMPSCI 753
[10 marks]
COMPSCI 753
We would like to solve the near neighbor search problem using the Jaccard similarity. In particular, given a query set Q, we want to find all sets Si such that J(Si,Q) ≥ 0.5. Which settings of b and r above should we use such that:
1. The probability that any 50%-similar pair is a candidate pair is at least 70%. Explain your solution. [1 mark] 2. The probability that any 50%-similar pair is a candidate pair is at least 70% and the number of candidate pairs is minimized. [1 mark]
1.3 Linear time of LSH on finding all similar pairs [3 marks]
Assume that the average number of words in a document is constant. Without using the shingling technique, the running time of the na ̈ıve algorithm for finding all Jaccard similarity pairs is O(n2) where n is the number of docu- ments. In the lecture note, we state that “With LSH, we can approximately find all similar pairs in O(n) time.” Is the statement true or false? Explain your answer.
2 Streaming Algorithms [15 marks]
2.1 Reservoir Sampling [5 marks]
Fig. 1.: Illustration of how pRS1 works.
In our lecture, we have studied the reservoir sampling which samples an element
from a stream of size m with the same probability. If we use the reservoir Page 4 of 8
COMPSCI 753
sampling with the summary size s = 1, each element of a stream will be sampled with probability 1/m. We name this method as RS1. The generalized version of reservoir sampling with the summary size s > 1 guarantees that each element in a stream will be sampled with the same probability s/m. We name this method as RSs.
In the exam, we consider a new algorithm, called pRS1, that simulates RSs for s > 1 by running s independent RS1 instances in parallel. pRS1 also uses a summary of size s, as shown in Figure 1.
As a function of m and s, what is the probability an element of a stream is sampled by pRS1? [2 marks] Let fi be the number of occurrences of the element ai in a stream. Explain how we can use RSs and pRS1 for estimating fi. [3 marks]
Misra-Gries vs. Reservoir Sampling [5 marks]
Run the Misra-Gries summary with k = 2 counters for the stream below: {1,4,5,4,4,5,4,4}
1. Present the final summary, including the elements and their counter values, when the execution of the algorithm is finished. [1 mark]
2. If we use the generalized reservoir sampling RSs with s = 2 on this
stream, what is the probability that the element 4 is in our RSs summary?
3. On Assignment 2, there is a request to “report the average number of
times the reservoir summary has been updated over 5 runs”. As a function of the summary size s and the stream length m, what is the expected number of times the reservoir summary has been updated after processing the stream? [2 marks]
2.3 Count [5 marks]
Apply Count to estimate the frequency of each element in the stream below:
{1,4,5,4,4,5,4,4,1,4,5,4,4,5,4,4,1,4,5,4,4,5,4,4}
Our Count uses d = 3 arrays with the hash functions as follows:
h1(x) = (x+1) mod 3, h2(x)=(3x+1) mod 3, h3(x)=(5x+2) mod 3.
Page 5 of 8
COMPSCI 753
Present the Count summary after processing all elements and the estimated frequency of each element. [2 marks] Among Reservoir Sampling, Misra-Gries and Count , which algorithm we should use to find the top-1 frequent element in this stream.
Explain your choice.
Algorithms for Large Graphs
Computing PageRank
[15 marks]
1. Convert A into a column-stochastic matrix M.
2. In the lecture, we have shown that the PageRank r = M · r is the
eigenvector of the column-stochastic M corresponding to eigenvalue λ = 1. Compute the PageRank of all nodes in the above graph using the eigen equation. [2 marks]
3. From the resulting PageRank scores in question 2, explain the problem of running the power iteration algorithm r(t+1) = M · r(t) on M. Describe how to solve the problem. [2 marks]
3.2 Girvan-Newman [5 marks]
1. Compute the edge betweenness for all edges in the social network in Figure 2. Which edge will be removed to partition the graph into two parts using the Girvan-Newman method? [3 marks]
Fig. 2.: An example social network
Page 6 of 8
Given the following raw adjacency matrix of a graph:
1 0 0 0 A = 0 1 0 0 0 0 1 0
COMPSCI 753
2. In our lecture, we mentioned that we can use the Brandes’ algorithm to calculate the shortest path from a node to all others. Does the algorithm apply for a weighted graph? Explain your answer. [2 marks]
Influence Maximization [5 marks]
Compute the influence spread of the seed set S = {a} using the inde- pendent cascade model on the graph in Figure 3. Hint: Convert the stochastic graph to deterministic graphs. [3 marks]
Fig. 3.: A social network with activation probabilities on edges.
In the lecture, we gave the definition of submodular function as f(S ∪ {v})−f(S) ≥ f(T ∪{v})−f(T) for S ⊂ T ⊆ U, where U is the set of all items. Another definition of submodular function is that f(A) + f(B) ≥ f(A∪B)+f(A∩B) for any two sets A,B ⊆ U. Show the two definitions
are equivalent.
Recommender Systems
Collaborative Filtering
[10 marks]
Given the following transactions in the form of (user, item, rating) tuples in a recommender system.
(u1, p1, 1.5), (u1, p3, 4), (u1, p5, 0.5), (u2, p2, 4), (u2, p4, 2), (u3, p1, 4.5), (u3, p4, 2.5), (u3, p5, 5), (u4, p2, 2), (u4, p3, 3.5), (u4, p4, 4), (u4, p5, 2.5)
Let the set of users be {u1, u2, u3, u4} and the set of items be {p1, p2, p3, p4, p5}.
1. Construct the user-item interaction matrix based on the above transac- tions. Use question marks to denote missing values. [1 mark]
2. Apply the basic user-based collaborative filtering with the Pearson corre-
lation coefficient for user u2 without considering bias. Give the top-1 rec- ommended item to u2 and the corresponding predicted rating. [2 marks]
Page 7 of 8
COMPSCI 753
3. In the lecture, we discussed how to model the rating bias including the bias over all transactions, the bias of a user and the bias of an item. Give the predicted rating of user u2 to item p5 using the collaborative filtering that incorporates the above bias information. [3 marks]
Note: The predicted ratings should round to one decimal place.
4.2 Factorization Machine [4 marks]
Suppose you are asked to build a system to recommend events. Users {u1, u2, u3, u4} attend events from {e1, e2, e3} in groups. Events are held in one of the two stadiums s1 and s2. Table 1 shows the transactions.
Table1.: Transactions of the event recommendation system
Transaction ID Group of users Event Stadium 1 u1,u2 e1 s1 2 u1,u3,u4 e2 s1 3 u2,u4 e3 s2 4 u3,u4 e1 s2
1. Construct the input feature vectors for the factorization machine using the event transactions in Table 1. [1 mark] 2. Can factorization machine predict the rating that an individual user u2 may put on e2 held in s2? Explain your answer. [1 mark] 3. If we ignore the stadium information in the transactions and only consider users and events in the above example, does the factorization machine
reduce to the latent factor model? If yes, explain your answer. If no, explain in what situation the factorization machine reduces to the latent
factor model.
Page 8 of 8
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com