Sampling Techniques
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides from
http://mmds.org
Designed for the textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
Auckland, Aug 17, 2020
1
Outline
Sampling from a data stream Sampling a fixed proportion
Sampling a fixed-size (reservoir sampling)
2
Uniformly sampling from a stream
Motivation:
Since we cannot store the entire stream, we store a uniform sample of the
stream to answer queries.
Goal:
Out of the large number of elements a1, … , am, we choose a small number
of elements to keep in memory.
Sample from the stream a1, … , am a single element x uniformly at random, i.e. the probability of being the sample for each element is the same:
Pr[ai isthesamplex]=1/m
We draw a uniform sample over the stream, not over the universe U.
3
Two different problems
Problem 1 (easy):
Sample a fixed proportion of elements in the stream (1/10).
The sample size is m/10 given the stream of m elements. Problem 2 (hard):
Maintain a random sample of fixed size over an infinite stream. The sample size is fixed, i.e. s for at any time k.
Both problems: Pr [ ai is sampled ] = 1/m
4
Sampling a fixed proportion
Scenario:
Search engine receives a stream of tuples (user, query, time).
Query: How many users run the same query in a single day? Memory: We have enough space to store 1/10 of the stream.
Naïve solution to keep 1/10 stream by:
Generate a random integer in [0, … , 9] for each query. Store the query if the integer is 0; otherwise, discard.
Can we approximately answer the query? Yes
5
Problem with naïve approach
Scenario:
Search engine receives a stream of tuples (user, query, time).
Each user sends x queries once and d queries twice (total of x + 2d). Query: What fraction of queries by an average user are duplicates?
Memory: We have enough space to store 1/10 of the stream.
Solution: d/(x + d)
Can we approximately answer the query using previous approach?
No
6
Problem with naïve approach
Naïve solution to keep 1/10 stream for each query:
Sample contains x/10 singleton queries and 2d/10 duplicate queries at
least once.
Sample contains only d/100 pairs of duplicates.
o A query is sampled once with prob. 1/10.
o A query is sampled twice with prob. 1/10*1/10 = 1/100. Of d duplicates, 18d/100 appear exactly once in the sample.
o A query is sampled at the first time with prob. 1/10.
o A query is not sampled at the second time with prob. 9/10.
Solution:
𝒅 𝟏𝟎𝒙𝟏𝟗𝒅
7
Fix
Scenario:
Search engine receives a stream of tuples (user, query, time).
Each user sends x queries once and d queries twice (total of x + 2d). Query: What fraction of queries by an average user are duplicates?
Solution:
Pick 1/10 of users (not queries) and take all their searches in the sample.
Use a hash function to hash the user ID uniformly into 10 buckets and pick user IDs on the first bucket.
8
Generalized solution
Stream of tuples with keys:
Key is some subset of each tuple’s components. E.g. tuple(user, query, time) with user as a key. Choice of key depends on application.
To get a sample of a/b fraction of the stream: Hash each tuple’s key uniformly into b buckets.
Pick the tuples that hashed into the first a buckets. 12ab
9
Sampling vs. hashing
Sampling:
Hashing:
Source: Graham Cormode, Marios Hadjieleftheriou, CACM 2009
10
Outline
Sampling from a data stream Sampling a fixed proportion
Sampling a fixed-size (reservoir sampling)
11
Sample a fixed size
Challenge:
The sample set S has fixed size s regardless the length of the stream. Uniformly sampling element from the stream
Reservoir Sampling (s = 1) [Vitter’85]: s = a1
a2 is picked as s with prob. 1/2 a3 is picked as s with prob. 1/3 …
an is picked as s with prob. 1/n …
12
Decision tree of reservoir sampling
Reservoir sampling a stream a1, a2, a3, a4
Elena Ikonomovska, Mariano Zelke’13
13
Proof
For a stream a1, a2, … , aL, aL+1, … , am, we need to prove Pr[aL issampled]=1/m
Proof:
WhatistheprobabilitythatsomeaL isthefinalsfor1≤L≤m?
This happen if aL is chosen as s and the rest aL+1, … , am are not.
14
Generalized solution
Reservoir Sampling (s > 1) [Vitter’85]: Store all the first s elements of the stream to S.
Suppose we have seen m-1 elements, and now the mth element arrives (m > s).
Theorem:
o With prob. s/m, keep mth element, else discard it.
o If we keep mth element, then it replaces one of the s elements in S,
picked uniformly at random.
After m elements, the sample S contains each element seen so far with the same probability s/m.
15
Proof by induction
Proof:
Assumethataftermelements:Pr[aL issampled]=s/m,1≤L≤m. When the (m+1)th element arrives, we need to show
Pr[aL issampled]=s/(m+1),1≤L≤m+1. Base case:
Whenweseethefirstselements,Pr[aL issampled]=1,1≤L≤s. (m+1)th element am+1 arrives:
Pr[aL issampled]=s/m,1≤L≤m(hypothesis)
For aL already in S, the probability that aL is still in S is:
1
Pr[aL issampled]=(s/m)*(m/m+1)=s/m+1,1≤L≤m. The new element am+1 is sampled with prob. s/(m+1)
16
Homework
Implement the uniformly sampling algorithms on the dataset from Assignment 1:
Consider each line of data as a streaming transaction in e-commerce o docIDisauserID
o wordIDisanitemID
Queries:
o Whatisthemostfrequentitemshasbeenbought?
o Whatisanaveragenumberofitemsboughtbyauser?
17