
Machine Learning at Scale
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Auckland, Aug 25, 2020
1

Outline
 An overview of machine learning
 Scale up supervised linear learning with Count Sketches
 Scale up supervised nonlinear learning with Tensor Sketches
2

General overview of ML
Source: https://techgrabyte.com/10-machine-learning-algorithms-application/
3

Supervised learning
Source: https://medium.com/@gowthamy/machine-learning-supervised-learning-vs-unsupervised-learning-f1658e12a780/
4

Unsupervised learning
Source: https://medium.com/@gowthamy/machine-learning-supervised-learning-vs-unsupervised-learning-f1658e12a780/
5

Reinforcement learning
Source: https://www.quora.com/What-is-reinforcement-learning
6

Outline
 An overview of machine learning
 Scale up supervised linear learning with Count Sketches
 Scale up supervised nonlinear learning with Tensor Sketches
7

Linear vs nonlinear in data space
Left: a linear classifier exists (text data). Right: no linear classifier exists (image data).
8

Support vector machine
• Introduced in COLT-92 by Boser, Guyon & Vapnik
• Theoretically well motivated
• Empirically good performance
• Training & testing are fast
Which hyperplane is optimal?
Data space
Source: https://en.wikipedia.org/wiki/Support-vector_machine
9

Linear SVM: Basics
• Training set: pairs (x, y), where
  • x is a d-dimensional point
  • y is the class label in {+1, −1}
• Hyperplane classifier: ⟨w, x⟩ − b = 0
  • w is the normal vector
• Goal:
  • Find the maximum-margin hyperplane
Data space
Source: https://en.wikipedia.org/wiki/Support-vector_machine
10

Linear SVM: Soft margin
• Decision function:
  f(x) = ⟨w, x⟩ − b
• Loss functions:
  • H1(f(x), y) = max(0, 1 − y·f(x))   (hinge loss)
  • H2(f(x), y) = |f(x) − y|   (absolute loss)
  • H3(f(x), y) = (1/2)(f(x) − y)²   (squared loss)
• Optimization problem: minimize (1/2)⟨w, w⟩ + C Σi H(f(xi), yi) over w and b, i.e., maximize the margin while keeping the training errors small (a simplified instance is on the next slide).
[Figure: separating hyperplane with its margin and the error points]
11
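As a quick reference, the three losses can be written in a few lines of Python; f_x stands for the decision value f(x) = ⟨w, x⟩ − b, and the function names are mine, not from the lecture.

```python
def hinge_loss(f_x, y):      # H1: max(0, 1 - y f(x))
    return max(0.0, 1.0 - y * f_x)

def absolute_loss(f_x, y):   # H2: |f(x) - y|
    return abs(f_x - y)

def squared_loss(f_x, y):    # H3: (1/2)(f(x) - y)^2
    return 0.5 * (f_x - y) ** 2
```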

Training linear SVM
• Decision function: f(xi) = ⟨w, xi⟩
• Loss function:
  • H3(f(xi), yi) = (1/2)(f(xi) − yi)²
• Simplified problem:
  • Given n training data (xi, yi), find w such that
    min P(w) = (1/2)⟨w, w⟩ + (1/2) Σi (⟨w, xi⟩ − yi)²
    (the first term maximizes the margin; the second minimizes the training errors)
12
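A minimal numpy sketch of the objective P(w) above, assuming the training points are stacked as the rows of a matrix X; the variable names are mine.

```python
import numpy as np

def P(w, X, y):
    """P(w) = 1/2 <w, w> + 1/2 * sum_i (<w, x_i> - y_i)^2."""
    residuals = X @ w - y                  # <w, x_i> - y_i for every training point
    return 0.5 * np.dot(w, w) + 0.5 * np.sum(residuals ** 2)
```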

SGD for training linear SVM
• Optimization problem:
  min P(w) = (1/2)⟨w, w⟩ + (1/2) Σi (⟨w, xi⟩ − yi)²
• Stochastic gradient descent (SGD): for each point (xi, yi),
  w_t = w_{t-1} − α ∇P(w_{t-1}) = w_{t-1}(1 − α) − α (⟨w_{t-1}, xi⟩ − yi) xi
  where α > 0 is the learning rate.
[Figure: the curve P(w) and its gradient dP/dw, indicating the descent direction]
13
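A minimal numpy sketch of this SGD loop; the learning rate and the number of passes are illustrative choices, not values from the lecture.

```python
import numpy as np

def sgd_linear_svm(X, y, alpha=0.01, epochs=5, seed=0):
    """Run the update w_t = w_{t-1}(1 - alpha) - alpha * (<w_{t-1}, x_i> - y_i) * x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # one pass over the points in random order
            w = w * (1 - alpha) - alpha * (np.dot(w, X[i]) - y[i]) * X[i]
    return w
```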

SGD for training linear SVM
• Optimization problem:
  min P(w) = (1/2)⟨w, w⟩ + (1/2) Σi (⟨w, xi⟩ − yi)²
• Stochastic gradient descent (SGD): for each point (xi, yi),
  w_t = w_{t-1} − α ∇P(w_{t-1}) = w_{t-1}(1 − α) − α (⟨w_{t-1}, xi⟩ − yi) xi
• Idea: apply a Count Sketch to map the data space in d dimensions to a reduced space in d′ < d dimensions, so that ⟨w, xi⟩ ≈ ⟨w′, x′i⟩ and SGD can run on the sketched data (a minimal code sketch follows).
Data space in d dimensions → Count Sketch → Reduced space in d′ < d dimensions
14
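A minimal numpy sketch of the Count Sketch reduction referred to above; the hash functions h and s are those defined on the next slides, drawn here as random arrays, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_reduced = 10_000, 1024                # illustrative d and d' < d
h = rng.integers(0, d_reduced, size=d)     # bucket hash  h: {1,...,d} -> {1,...,d'}
s = rng.choice([-1.0, 1.0], size=d)        # sign hash    s: {1,...,d} -> {+1,-1}

def count_sketch(x):
    """One pass over x, adding s(i) * x[i] into counter h(i)."""
    sketch = np.zeros(d_reduced)
    np.add.at(sketch, h, s * x)            # unbuffered scatter-add into the buckets
    return sketch

# E_{h,s}[<w', x'>] = <w, x>: the sketched inner product is an unbiased estimate.
w = rng.normal(size=d)
x = w + rng.normal(size=d)                 # correlated with w, so <w, x> sits well above the sketching noise
print(np.dot(count_sketch(w), count_sketch(x)), np.dot(w, x))
```

Because the estimate is unbiased, the SGD update above can be run entirely on w′ and x′i in d′ dimensions.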
Count Sketch for a vector
• View a high-dimensional vector as a data stream and compute a Count Sketch with a single row of counters, using hash functions
  h: {1, ..., d} → {1, ..., d′} and s: {1, ..., d} → {+1, −1}
[Figure: vectors w = (w1, ..., w5) and a = (a1, ..., a5) in the data space of d dimensions are mapped by the Count Sketch to w′ = (w′1, w′2, w′3) and a′ = (a1 − a2, a4, a3 + a5) in the reduced space of d′ < d dimensions]
• E_{h,s}[⟨w′, a′⟩] = ⟨w, a⟩
• Linear learning in d dimensions can be approximated by linear learning in d′ dimensions.
15

Count Sketch for a vector
• Example (d = 4, d′ = 3):
  i:      1    2    3    4
  h(i):   0    1    2    1
  s(i):  +1   −1   −1   +1
  a:      1    2    3    1
  a′ = (+1, −1, −3), since bucket 0 gets +1·1, bucket 1 gets (−1)·2 + (+1)·1, and bucket 2 gets (−1)·3
• Property: E_{h,s}[⟨w′, a′⟩] = ⟨w, a⟩
  • Consider w1 & a1: they are hashed into the same bucket, so the term w1·a1 is preserved.
  • Consider wi & ai, i ≠ 1, which are hashed into the bucket h(1) with probability 1/d′.
  • The expected contribution of such a colliding term wi·ai is 0 due to the random sign s.
16

Count Sketch as feature hashing
[Figure: a large d-dimensional feature space (or an even larger derived one) is reduced with a Count Sketch]
• TensorFlow: tf.keras.preprocessing.text.hashing_trick
• Spark ML: HashingTF Transformer
• sklearn: FeatureHasher
17

Personalized spam email filtering
• Setting:
  • n different users have n different sets of spam email labels (different training sets)
  • Users are often too lazy to label spam emails
  • Pool the spam emails of all users into a global user u0 (global training set)
• Training with Count Sketches (Weinberger et al.'09):
  • Hash the global user u0 with H0 = {h0, s0} and learn the weights w′0
  • Hash each user ui with Hi = {hi, si} and learn the weights w′i, for any 1 ≤ i ≤ n
  • Final result: w′h = w′0 + (w′1 + ... + w′n)
18

Personalized spam email filtering
• Testing with Count Sketches: given an email x for a user ui, our decision is
  ⟨H0(x) + Hi(x), w′h⟩ = ⟨H0(x) + Hi(x), w′0 + w′1 + ... + w′i + ... + w′n⟩
  = ⟨H0(x), w′0⟩ + ⟨Hi(x), w′i⟩ + Σ_{j ≠ 0} ⟨H0(x), w′j⟩ + Σ_{j ≠ i} ⟨Hi(x), w′j⟩
• Accuracy: the last two sums are the error from the global user and the error from the other users; they vanish in expectation thanks to the independent random signs.
19

Outline
• An overview of machine learning
• Scale up supervised linear learning with Count Sketches
• Scale up supervised nonlinear learning with Tensor Sketches
20

Linear vs nonlinear in data space
Left: a linear classifier exists (text data). Right: no linear classifier exists (image data).
21

SVM: Non-linear case
• SVM solution (Cortes & Vapnik'95):
  • Map the data into a richer feature space that includes nonlinear features, using the kernel trick
  • Construct a hyperplane in that space via the pairwise kernel function Φ(x)ᵀΦ(y) = K(x, y)
Data space → Feature space
22

SVM with polynomial kernel (degree 2)
Source: https://www.youtube.com/watch?v=3liCbRZPrZA
23

Algorithms for non-linear SVM
• Kernel tricks:
  • Implicit nonlinear mapping of the data from the original data space into a high-dimensional feature space, in the expectation that a linear structure is gained
• Training using traditional decomposition methods:
  • O(dn³) time and O(n²) space complexity with n training points in d-dimensional Euclidean space
Data space → Feature space
24
How to accelerate non-linear SVM
Source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
25

Accelerate non-linear SVM
[Figure: Data space → kernel tricks → Feature space, O(dn³); Data space → dimensionality reduction via random features → Randomized feature space, O(dn)]
26

Accelerate non-linear SVM
[Figure: Data space → random features → Randomized feature space, O(dn), in place of the O(dn³) feature space]
27

Random features
• Random feature mapping f (Rahimi & Recht'07) from the original data space to the randomized feature space such that
  E[f(x)ᵀ f(y)] = K(x, y)
[Figure: Data space → f → Randomized feature space, standing in for the Feature space]
28

Random feature mapping for Gaussian kernels
• Gaussian kernel: K(x, y) = exp(−‖x − y‖²/2)
• Random Fourier features (Rahimi & Recht'07):
  f_RR(w, b, x): x ↦ f(x) = cos(⟨w, x⟩ + b)
  where w is a multivariate normal vector drawn from N(0, I) and b is drawn uniformly from [0, 2π].
29

Code of random Fourier features (a Python sketch is given after the reference slide)
30

Random feature mapping for polynomial kernels
• Polynomial kernel: K(x, y) = ⟨x, y⟩²
• Tensor Sketches (Pham & Pagh'13):
  f_PP(x): x ↦ f(x) = IFFT(FFT(CS1(x)) ∗ FFT(CS2(x)))
  where the CSi are independent Count Sketches of x (each a polynomial representing a dimensionality reduction of x) and ∗ is the element-wise product.
31

Convolution of Count Sketches
• View count sketching as transforming a vector x into a polynomial:
  P1(ω) = Σ_{i=1..d} s1(i) · xi · ω^h1(i), with hash functions h1 and s1
  P2(ω) = Σ_{i=1..d} s2(i) · xi · ω^h2(i), with hash functions h2 and s2
• P(ω) = P1(ω) · P2(ω) is a Count Sketch of the outer product x ⊗ x with hash functions H(i, j) = h1(i) + h2(j) mod D and S(i, j) = s1(i) · s2(j):
  P(ω) = FFT⁻¹(FFT(P1(ω)) ∗ FFT(P2(ω)))
32

Tensor Sketches
• Approximate non-linear kernels and accelerate the training of kernel machines
• 10 lines of Matlab code; speeds up SVMs by up to 1600 times
• Implemented in public deep learning libraries (Caffe, MatConvNet): Fukui et al., EMNLP'16
33

Summary
• By using stream sketching techniques as dimensionality reduction, we can
  • Scale up many linear and nonlinear learning methods in machine learning
  • Reduce dimensionality in deep learning
  • Scale up clustering, ...
34

Reference
• Bernhard E. Boser, Isabelle Guyon, Vladimir Vapnik. A Training Algorithm for Optimal Margin Classifiers. COLT, 1992.
• Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg. Feature Hashing for Large Scale Multitask Learning. ICML, 2009.
• Corinna Cortes, Vladimir Vapnik. Support-Vector Networks. Machine Learning, 1995.
• Ali Rahimi, Benjamin Recht. Random Features for Large-Scale Kernel Machines. NIPS, 2007.
• Ninh Pham, Rasmus Pagh. Fast and Scalable Polynomial Kernels via Explicit Feature Maps. KDD, 2013.
• Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP, 2016.
35
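For the random Fourier feature slide above, here is a minimal numpy sketch of the map f(x) = cos(⟨w, x⟩ + b); the √(2/D) scaling and all variable names are my additions, chosen so that f(x)ᵀf(y) is an unbiased estimate of the Gaussian kernel exp(−‖x − y‖²/2).

```python
import numpy as np

def random_fourier_features(X, D, seed=0):
    """Map each row of X (n x d) to D features cos(<w, x> + b),
    with w ~ N(0, I) and b ~ Uniform[0, 2*pi]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(d, D))               # one Gaussian direction per feature
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

A linear model trained on these D-dimensional features approximates a Gaussian-kernel SVM.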
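Likewise, a minimal numpy sketch of the Tensor Sketch map from the polynomial-kernel slides, composing two independent Count Sketches through the FFT; the sketch size D, the test vectors, and all helper names are my choices.

```python
import numpy as np

def make_hashes(d, D, seed):
    rng = np.random.default_rng(seed)
    return rng.integers(0, D, size=d), rng.choice([-1.0, 1.0], size=d)

def count_sketch(x, h, s, D):
    sketch = np.zeros(D)
    np.add.at(sketch, h, s * x)               # x'[h(i)] += s(i) * x[i]
    return sketch

def tensor_sketch(x, hashes1, hashes2, D):
    """IFFT(FFT(CS1(x)) * FFT(CS2(x))): a Count Sketch of the outer product of x with itself."""
    p1 = np.fft.fft(count_sketch(x, *hashes1, D))
    p2 = np.fft.fft(count_sketch(x, *hashes2, D))
    return np.real(np.fft.ifft(p1 * p2))      # element-wise product in the FFT domain = circular convolution

# <tensor_sketch(x), tensor_sketch(y)> estimates the degree-2 polynomial kernel <x, y>^2.
d, D = 1000, 4096
h1, h2 = make_hashes(d, D, seed=1), make_hashes(d, D, seed=2)
rng = np.random.default_rng(0)
x = rng.normal(size=d)
y = 0.5 * x + 0.5 * rng.normal(size=d)        # correlated vectors so the kernel value is well above the sketching noise
print(np.dot(tensor_sketch(x, h1, h2, D), tensor_sketch(y, h1, h2, D)), np.dot(x, y) ** 2)
```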