程序代写代做代考 decision tree algorithm AI Sampling Techniques

Sampling Techniques
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides from
http://mmds.org
Designed for the textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
Auckland, Aug 17, 2020
1

Outline
 Sampling from a data stream  Sampling a fixed proportion
 Sampling a fixed-size (reservoir sampling)
2

Uniformly sampling from a stream
 Motivation:
 Since we cannot store the entire stream, we store a uniform sample of the
stream to answer queries.
 Goal:
 Out of the large number of elements a1, … , am, we choose a small number
of elements to keep in memory.
 Sample from the stream a1, … , am a single element x uniformly at random, i.e. the probability of being the sample for each element is the same:
Pr[ai isthesamplex]=1/m
 We draw a uniform sample over the stream, not over the universe U.
3

Two different problems
 Problem 1 (easy):
 Sample a fixed proportion of elements in the stream (1/10).
 The sample size is m/10 given the stream of m elements.  Problem 2 (hard):
 Maintain a random sample of fixed size over an infinite stream.  The sample size is fixed, i.e. s for at any time k.
 Both problems: Pr [ ai is sampled ] = 1/m
4

Sampling a fixed proportion
 Scenario:
 Search engine receives a stream of tuples (user, query, time).
 Query: How many users run the same query in a single day?  Memory: We have enough space to store 1/10 of the stream.
 Naïve solution to keep 1/10 stream by:
 Generate a random integer in [0, … , 9] for each query.  Store the query if the integer is 0; otherwise, discard.
 Can we approximately answer the query?  Yes
5

Problem with naïve approach
 Scenario:
 Search engine receives a stream of tuples (user, query, time).
 Each user sends x queries once and d queries twice (total of x + 2d).  Query: What fraction of queries by an average user are duplicates?
 Memory: We have enough space to store 1/10 of the stream.
 Solution: d/(x + d)
 Can we approximately answer the query using previous approach?
No
6

Problem with naïve approach
 Naïve solution to keep 1/10 stream for each query:
 Sample contains x/10 singleton queries and 2d/10 duplicate queries at
least once.
 Sample contains only d/100 pairs of duplicates.
o A query is sampled once with prob. 1/10.
o A query is sampled twice with prob. 1/10*1/10 = 1/100.  Of d duplicates, 18d/100 appear exactly once in the sample.
o A query is sampled at the first time with prob. 1/10.
o A query is not sampled at the second time with prob. 9/10.
 Solution:
􏰝􏰞 􏰝􏰞􏰞 􏰝􏰞􏰞
􏰜
𝒅 𝟏𝟎𝒙􏰓𝟏𝟗𝒅
􏰝􏰞􏰞
􏰟 􏰓 􏰜 􏰓􏰝􏰠􏰜
7

Fix
 Scenario:
 Search engine receives a stream of tuples (user, query, time).
 Each user sends x queries once and d queries twice (total of x + 2d).  Query: What fraction of queries by an average user are duplicates?
 Solution:
 Pick 1/10 of users (not queries) and take all their searches in the sample.
 Use a hash function to hash the user ID uniformly into 10 buckets and pick user IDs on the first bucket.
8

Generalized solution
 Stream of tuples with keys:
 Key is some subset of each tuple’s components.  E.g. tuple(user, query, time) with user as a key.  Choice of key depends on application.
 To get a sample of a/b fraction of the stream:  Hash each tuple’s key uniformly into b buckets.
 Pick the tuples that hashed into the first a buckets. 12ab
9

Sampling vs. hashing
 Sampling:
 Hashing:
Source: Graham Cormode, Marios Hadjieleftheriou, CACM 2009
10

Outline
 Sampling from a data stream  Sampling a fixed proportion
 Sampling a fixed-size (reservoir sampling)
11

Sample a fixed size
 Challenge:
 The sample set S has fixed size s regardless the length of the stream.  Uniformly sampling element from the stream
 Reservoir Sampling (s = 1) [Vitter’85]:  s = a1
 a2 is picked as s with prob. 1/2  a3 is picked as s with prob. 1/3 …
 an is picked as s with prob. 1/n …
12

Decision tree of reservoir sampling
 Reservoir sampling a stream a1, a2, a3, a4
Elena Ikonomovska, Mariano Zelke’13
13

Proof
 For a stream a1, a2, … , aL, aL+1, … , am, we need to prove Pr[aL issampled]=1/m
 Proof:
WhatistheprobabilitythatsomeaL isthefinalsfor1≤L≤m?
 This happen if aL is chosen as s and the rest aL+1, … , am are not.
14

Generalized solution
 Reservoir Sampling (s > 1) [Vitter’85]:  Store all the first s elements of the stream to S.
 Suppose we have seen m-1 elements, and now the mth element arrives (m > s).
 Theorem:
o With prob. s/m, keep mth element, else discard it.
o If we keep mth element, then it replaces one of the s elements in S,
picked uniformly at random.
 After m elements, the sample S contains each element seen so far with the same probability s/m.
15

Proof by induction
 Proof:
Assumethataftermelements:Pr[aL issampled]=s/m,1≤L≤m.  When the (m+1)th element arrives, we need to show
Pr[aL issampled]=s/(m+1),1≤L≤m+1.  Base case:
Whenweseethefirstselements,Pr[aL issampled]=1,1≤L≤s.  (m+1)th element am+1 arrives:
Pr[aL issampled]=s/m,1≤L≤m(hypothesis)
 For aL already in S, the probability that aL is still in S is:
1􏰉􏰡 􏰁􏰡􏰡􏰖􏰊􏰀􏰢 􏰢􏰓􏰊 􏰢􏰓􏰊 􏰡 􏰢􏰓􏰊
Pr[aL issampled]=(s/m)*(m/m+1)=s/m+1,1≤L≤m.  The new element am+1 is sampled with prob. s/(m+1)
16

Homework
 Implement the uniformly sampling algorithms on the dataset from Assignment 1:
 Consider each line of data as a streaming transaction in e-commerce o docIDisauserID
o wordIDisanitemID
 Queries:
o Whatisthemostfrequentitemshasbeenbought?
o Whatisanaveragenumberofitemsboughtbyauser?
17