Machine Learning at Scale
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 25, 2020
Outline
An overview of machine learning
Scale up supervised linear learning with Count Sketches
Scale up supervised nonlinear learning with Tensor Sketches
General overview of ML
Source: https://techgrabyte.com/10-machine-learning-algorithms-application/
Supervised learning
Source: https://medium.com/@gowthamy/machine-learning-supervised-learning-vs-unsupervised-learning-f1658e12a780/
Unsupervised learning
Source: https://medium.com/@gowthamy/machine-learning-supervised-learning-vs-unsupervised-learning-f1658e12a780/
Reinforcement learning
Source: https://www.quora.com/What-is-reinforcement-learning
Outline
An overview of machine learning
Scale up supervised linear learning with Count Sketches
Scale up supervised nonlinear learning with Tensor Sketches
Linear vs nonlinear in data space
A linear classifier exists (text data).
No linear classifier exists (image data).
Support vector machine
Introduced in COLT-92 by Boser, Guyon & Vapnik
Theoretically well motivated.
Empirically good performance.
Training and testing are fast.
(Figure: which hyperplane is optimal, in the data space? Source: https://en.wikipedia.org/wiki/Support-vector_machine)
Linear SVM: Basics
Training set: {(xᵢ, yᵢ)}, i = 1, ..., n
xᵢ is a d-dimensional point; yᵢ is the class label in {+1, −1}.
Hyperplane classifier: ⟨w, x⟩ + b = 0, where w is the normal vector.
Goal: find the maximum-margin hyperplane.
(Figure: maximum-margin hyperplane in the data space. Source: https://en.wikipedia.org/wiki/Support-vector_machine)
Linear SVM: Soft margin
Decision function: f(x) = ⟨w, x⟩ + b
Loss functions:
H₁(f(x), y) = max(0, 1 − y·f(x))  (hinge loss)
H₂(f(x), y) = |f(x) − y|  (absolute loss)
H₃(f(x), y) = (f(x) − y)²  (squared loss)
Optimization problem: min_w P(w) = ⟨w, w⟩ + Σᵢ H(f(xᵢ), yᵢ), i.e. maximize the margin while minimizing the training errors.
(Figure: soft margin with errors allowed inside the margin.)
Training linear SVM
Decision function: f(xᵢ) = ⟨w, xᵢ⟩
Loss function: H₃(f(xᵢ), yᵢ) = (f(xᵢ) − yᵢ)²
Simplified problem: given n training points (xᵢ, yᵢ), find w such that
min_w P(w) = ⟨w, w⟩ + Σᵢ (⟨w, xᵢ⟩ − yᵢ)²
The first term maximizes the margin; the second term minimizes the training errors (gradient derivation sketched below).
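To connect the simplified problem to the SGD update on the next slide, here is a short reconstructed derivation of the per-point gradient; absorbing the constant factor 2 into the learning rate α is my assumption about the slide's convention.

```latex
% Per-point objective for training point (x_i, y_i):
%   P_i(w) = <w, w> + (<w, x_i> - y_i)^2
\nabla P_i(\boldsymbol{w}) = 2\boldsymbol{w}
  + 2\big(\langle \boldsymbol{w}, \boldsymbol{x}_i \rangle - y_i\big)\,\boldsymbol{x}_i
% SGD step with learning rate \alpha (factor 2 absorbed into \alpha):
\boldsymbol{w} \leftarrow \boldsymbol{w} - \alpha \nabla P_i(\boldsymbol{w})
  = (1 - \alpha)\,\boldsymbol{w}
  - \alpha \big(\langle \boldsymbol{w}, \boldsymbol{x}_i \rangle - y_i\big)\,\boldsymbol{x}_i
```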
SGD for training linear SVM
Optimization problem:
min_w P(w) = ⟨w, w⟩ + Σᵢ (⟨w, xᵢ⟩ − yᵢ)²
Stochastic gradient descent (SGD): for each point xᵢ,
w_{t+1} = w_t − α·∇P(w_t) = (1 − α)·w_t − α·(⟨w_t, xᵢ⟩ − yᵢ)·xᵢ
where α is the learning rate and ∇P(w) = dP(w)/dw.
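To make the update rule concrete, here is a minimal NumPy sketch of the SGD loop above; the function name, epoch count, and default learning rate are illustrative choices rather than anything from the slides.

```python
import numpy as np

def sgd_linear_svm(X, y, alpha=0.01, epochs=10, seed=0):
    """Train w for min_w <w, w> + sum_i (<w, x_i> - y_i)^2 via SGD.

    X: (n, d) data matrix; y: labels in {+1, -1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # w <- (1 - alpha) * w - alpha * (<w, x_i> - y_i) * x_i
            w = (1 - alpha) * w - alpha * (w @ X[i] - y[i]) * X[i]
    return w

def predict(w, X):
    # Decision by the sign of <w, x>
    return np.sign(X @ w)
```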
SGD for training linear SVM
Optimization problem:
min_w P(w) = ⟨w, w⟩ + Σᵢ (⟨w, xᵢ⟩ − yᵢ)²
Stochastic gradient descent (SGD): for each point xᵢ,
w_{t+1} = (1 − α)·w_t − α·(⟨w_t, xᵢ⟩ − yᵢ)·xᵢ
Idea: use a Count Sketch to map points from the data space in d dimensions to a reduced space in d′ < d dimensions, so that ⟨w, xᵢ⟩ ≈ ⟨w′, x′ᵢ⟩.
Count Sketch for a vector
View a high-dimensional vector as a data stream and compute its Count Sketch with a single counter array, using hash functions
h: {1, ..., d} → {1, ..., d′} and s: {1, ..., d} → {+1, −1}.
(Figure: the vectors w and a in the data space of d dimensions are sketched into w′ and a′ in the reduced space of d′ < d dimensions, e.g. a′ = (a₁ − a₂, a₄, a₃ + a₅).)
Property: E_{h,s}[⟨w′, a′⟩] = ⟨w, a⟩
Linear learning in d dimensions can be approximated by linear learning in d′ dimensions.
Count Sketch for a vector
Example: d = 4, d′ = 3, a = (1, 2, 3, 1)
h = (0, 1, 2, 1) (bucket indices in {0, 1, 2}), s = (+1, −1, −1, +1)
⇒ a′ = (a′₁, a′₂, a′₃) = (+1, −1, −3)
Property: E_{h,s}[⟨w′, a′⟩] = ⟨w, a⟩
Consider w₁ & a₁: they are hashed into the same bucket, so the term w₁a₁ is preserved.
Consider wᵢ & aᵢ, i ≠ 1, which are hashed into bucket h(1) with probability 1/d′: the expected contribution of wᵢaᵢ is 0 due to the random sign s.
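A minimal Python sketch of the Count Sketch of a dense vector, reproducing the toy example above; the hash values h and signs s are hard-coded here, whereas in practice they come from random hash functions.

```python
import numpy as np

def count_sketch(x, h, s, d_reduced):
    """Sketch x in R^d down to R^{d_reduced} using bucket map h and signs s."""
    sketch = np.zeros(d_reduced)
    for i, xi in enumerate(x):
        sketch[h[i]] += s[i] * xi   # add s(i) * x_i to bucket h(i)
    return sketch

a = np.array([1.0, 2.0, 3.0, 1.0])   # d = 4
h = np.array([0, 1, 2, 1])           # bucket indices in {0, 1, 2}, d' = 3
s = np.array([+1, -1, -1, +1])       # random signs
print(count_sketch(a, h, s, 3))      # -> [ 1. -1. -3.]
```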
Count Sketch as feature hashing
(Figure: feature vectors of large, or even larger, dimension d are reduced by a Count Sketch.)
TensorFlow: tf.keras.preprocessing.text.hashing_trick
Spark ML: HashingTF Transformer
sklearn: FeatureHasher (usage sketch below)
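As a usage sketch of one of the implementations above, scikit-learn's FeatureHasher; the toy feature dictionaries and the tiny sketch size are invented for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash bag-of-words feature dicts down to 8 dimensions (with random signs,
# i.e. a Count-Sketch-style hashing trick).
hasher = FeatureHasher(n_features=8, input_type="dict")
emails = [
    {"cheap": 2, "meds": 1},
    {"meeting": 3, "agenda": 1},
]
X = hasher.transform(emails)   # sparse matrix of shape (2, 8)
print(X.toarray())
```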
Personalized spam email filtering
Setting:
n different users have n different sets of spam email labels (different training sets).
Users are often too lazy to label spam emails.
Aggregate spam emails from all users into a global user u₀ (global training set).
Training with Count Sketches (Weinberger et al.'09):
Hash the global user u₀ with H₀ = (h₀, s₀) and learn the weights w′₀.
Hash each user uᵢ with Hᵢ = (hᵢ, sᵢ) and learn the weights w′ᵢ, for 1 ≤ i ≤ n.
Final result: w′_h = w′₀ + (w′₁ + ... + w′ₙ)
Personalized spam email filtering
Testing with Count Sketches:
Given an email x for a user uᵢ, our decision is ⟨H₀(x) + Hᵢ(x), w′_h⟩
= ⟨H₀(x) + Hᵢ(x), w′₀ + w′₁ + ... + w′ᵢ + ... + w′ₙ⟩
Accuracy:
⟨H₀(x) + Hᵢ(x), w′_h⟩ = ⟨H₀(x), w′₀⟩ + ⟨Hᵢ(x), w′ᵢ⟩ + Σ_{j≠0} ⟨H₀(x), w′ⱼ⟩ + Σ_{j≠i} ⟨Hᵢ(x), w′ⱼ⟩
The first two terms are the global and personalized signals; the last two sums are error terms (one through the global hash H₀, one through the user hash Hᵢ), which are ≈ 0 in expectation due to the random signs.
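A rough Python sketch of this multitask hashing scheme, assuming the common implementation trick of prefixing each token with the user id to play the role of the per-user hash Hᵢ; the sketch size, feature names, and user id are made up, and the shared weight vector would be learned by SGD as in the earlier slides.

```python
from sklearn.feature_extraction import FeatureHasher
import numpy as np

D = 2 ** 18                                   # reduced dimension d'
hasher = FeatureHasher(n_features=D, input_type="string")

def hashed_features(tokens, user):
    # Global copy H0(x) plus a user-specific copy standing in for Hi(x),
    # obtained by prefixing every token with the user id.
    return hasher.transform([tokens + [f"{user}_{t}" for t in tokens]])

x = hashed_features(["cheap", "meds", "urgent"], user="u7")
w_h = np.zeros(D)                             # shared weights w'_h (to be trained)
decision = np.sign(x @ w_h)                   # sign of <H0(x) + Hi(x), w'_h>
```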
Outline
An overview of machine learning
Scale up supervised linear learning with Count Sketches
Scale up supervised nonlinear learning with Tensor Sketches
Linear vs nonlinear in data space
A linear classifier exists (text data).
No linear classifier exists (image data).
SVM: Non-linear case
SVM solution (Cortes & Vapnik’95):
Map data into a richer feature space including nonlinear features by using kernel tricks.
Construct a hyperplane in that space via the pairwise kernel function ⟨Φ(x), Φ(y)⟩ = K(x, y).
(Figure: the mapping Φ from the data space to the feature space.)
SVM with polynomial kernel
Source: https://www.youtube.com/watch?v=3liCbRZPrZA
Algorithms for non-linear SVM
Kernel tricks:
Implicit nonlinear mapping from the original data space into a high-dimensional feature space, in the hope that a linear structure is obtained.
Training using traditional decomposition methods:
O(dn³) time and O(n²) space complexity with n training points in d-dimensional Euclidean space.
(Figure: the mapping from the data space to the feature space.)
How to accelerate non-linear SVM
Source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Accelerate non-linear SVM
(Diagram: kernel tricks map the data space to the feature space, where training costs O(dn³); random features, as a dimensionality reduction, map it to a randomized feature space, where training costs O(dn).)
Random features
Random feature mapping f (Rahimi & Recht'07) from the original data space to the randomized feature space such that
E[⟨f(x), f(y)⟩] = K(x, y)
(Figure: f maps the data space directly to the randomized feature space.)
Random feature mapping for Gaussian kernels
Gaussian kernel:
K(x, y) = exp(−‖x − y‖²/2)
Random Fourier features (Rahimi & Recht'07):
f_RR(w, b, x): x ↦ f(x) = cos(⟨w, x⟩ + b)
where w is drawn from the multivariate normal distribution N(0, I) and b is drawn uniformly from [0, 2π].
Code of random Fourier features
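The slide's original code is not included above, so here is a minimal NumPy sketch of random Fourier features for the unit-bandwidth Gaussian kernel; the √(2/D) scaling, which makes ⟨f(x), f(y)⟩ an unbiased kernel estimate, is an addition of mine since the slide's formula omits it.

```python
import numpy as np

def random_fourier_features(X, D, seed=0):
    """Map X of shape (n, d) to (n, D) random Fourier features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, D))           # each column w ~ N(0, I)
    b = rng.uniform(0, 2 * np.pi, size=D)     # b ~ Uniform[0, 2*pi]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).standard_normal((5, 3))
Z = random_fourier_features(X, D=2000)
approx = Z @ Z.T                              # approximates K(x, y)
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(approx - exact).max())           # small for large D
```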
Random feature mapping for polynomial kernels
Polynomial kernel:
K(x, y) = ⟨x, y⟩²
Tensor Sketches (Pham & Pagh'13):
f_PP(x): x ↦ f(x) = IFFT(FFT(CS₁(x)) ∘ FFT(CS₂(x)))
where ∘ is the component-wise product and CSᵢ(x) is a Count Sketch of x (a polynomial representing a dimensionality reduction of x).
Convolution of Count Sketches
View count sketching as a polynomial transform:
P₁(ω) = Σ_{i=1..d} s₁(i)·xᵢ·ω^{h₁(i)}, with hash functions h₁ and s₁
P₂(ω) = Σ_{i=1..d} s₂(i)·xᵢ·ω^{h₂(i)}, with hash functions h₂ and s₂
P(ω) is a Count Sketch of the outer product x ⊗ x with hash functions H(i, j) = h₁(i) + h₂(j) mod D and S(i, j) = s₁(i)·s₂(j):
P(ω) = FFT⁻¹(FFT(P₁(ω)) ∘ FFT(P₂(ω)))
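Putting the last two slides together, a compact NumPy sketch of Tensor Sketch for the degree-2 polynomial kernel K(x, y) = ⟨x, y⟩²; the random hash tables, sketch size D, and test vectors are illustrative, and the convolution is computed with the FFT formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hashes(d, D):
    h = rng.integers(0, D, size=d)            # h: {1..d} -> {0..D-1}
    s = rng.choice([-1.0, 1.0], size=d)       # s: {1..d} -> {+1, -1}
    return h, s

def count_sketch(x, h, s, D):
    cs = np.zeros(D)
    np.add.at(cs, h, s * x)                   # bucket h(i) += s(i) * x_i
    return cs

def tensor_sketch(x, hashes, D):
    (h1, s1), (h2, s2) = hashes
    p1 = np.fft.fft(count_sketch(x, h1, s1, D))
    p2 = np.fft.fft(count_sketch(x, h2, s2, D))
    return np.real(np.fft.ifft(p1 * p2))      # FFT^-1(FFT(CS1) o FFT(CS2))

d, D = 10, 4096
hashes = [make_hashes(d, D), make_hashes(d, D)]
x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.dot(tensor_sketch(x, hashes, D), tensor_sketch(y, hashes, D)),
      np.dot(x, y) ** 2)                      # the two values should be close
```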
Tensor Sketches
Approximating non-linear kernels and accelerating the training of kernel machines.
10 lines of Matlab code; speeds up SVM training by up to 1600 times.
Implemented in public deep learning libraries (Caffe, MatConvNet)
Fukui et al., EMNLP’16
Summary
By using stream sketching techniques for dimensionality reduction, we can:
Scale up many linear and nonlinear learning methods in machine learning
Reduce dimensionality in deep learning
Scale up clustering
...
Reference
Bernhard E. Boser, Isabelle Guyon, Vladimir Vapnik. A Training Algorithm for Optimal Margin Classifiers. COLT, 1992.
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg. Feature Hashing for Large Scale Multitask Learning. ICML, 2009.
Corinna Cortes, Vladimir Vapnik. Support-Vector Networks. Machine Learning, 1995.
Ali Rahimi, Benjamin Recht. Random Features for Large-Scale Kernel Machines. NIPS, 2007.
Ninh Pham, Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD, 2013.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP, 2016.