Lecture 11: Online Learning
Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
CMSC5741 Big Data Tech. & Apps.
1
A Motivating Example: Spam Filtering
2
(Figure: incoming emails pass through a spam filter, which separates spam from non-spam.)
Traditional Method: Training
3
A batch of emails
Feature extraction: x
Linear classifier: w^T x
Model selection
Traditional Method: Test
4
A new mail
Feature extraction: x
Linear classifier: w^T x
Determination
Prediction: non-spam, y' = +1; spam, y' = -1
Online Protocol
5
An online mail
Feature extraction: x
Linear classifier: w^T x
Prediction: non-spam, y' = +1; spam, y' = -1
The user judges the prediction
Update the model
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
6
Learning Paradigms Overview
Learning paradigms
Supervised learning
Semisupervised learning
Transductive learning
Unsupervised learning
Universum learning
Transfer learning
9
Supervised Learning
Train on labeled data
Test on test data
(Figure: horse and donkey; labeled training images marked X and -, test images marked ?)
11
Semisupervised Learning
Train on labeled and unlabeled data
Test on test data
(Figure: horse and donkey; labeled images marked X and -, unlabeled images marked o, test images marked ?)
12
Transductive Learning
Train on labeled and test data
Test on test data
(Figure: horse and donkey; labeled images marked X and -, unlabeled test images marked o and ?)
13
Unsupervised Learning
Train on unlabeled data
Test on test data (Test reconstruction error)
(Figure: unlabeled data plotted along "jump" and "height" axes, test points marked ?)
14
Universum Learning
Train on labeled and universum data
Test on test data
(Figure: train on horse and donkey labeled X and -; the universum class is mule; test images marked ?)
15
Transfer Learning
Train on labeled data from source and target domains
Test on test data
(Figure: source domain is goat and sheep; target domain is horse and donkey; test images marked ?)
16
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
17
What is Online Learning?
Batch/Offline learning
Observe a batch of training data
Learn a model from them
Predict new samples accurately
Online learning
Observe a sequence of data
Learn a model incrementally as instances come
Make the sequence of online predictions accurately
18
(Figure: online loop - make a prediction, the user gives the true response, update the model.)
Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of the correct answers to previous questions and possibly additional available information. [Shal11]
Difference between online and incremental learning:
Incremental learning includes informationally incremental learning and operationally incremental learning.
Informationally incremental learning: works incrementally and is not permitted to look back at the whole history of information presented.
Operationally incremental learning: is permitted to look back, but cannot exploit the past information in an effective way.
In short, incremental learning works incrementally without looking back.
Online learning: the key defining characteristic is that soon after a prediction is made, the true label of the instance is revealed. This information can then be used to refine the prediction hypothesis used by the algorithm. The goal of the algorithm is to make predictions that are close to the true labels.
Stochastic gradient descent: in machine learning, the problem is usually to minimize an objective function.
Online Prediction Algorithm
An initial prediction rule f_0(·)
For t = 1, 2, …
We observe x_t and make a prediction f_{t-1}(x_t)
We observe the true outcome y_t and then compute a loss l(f_{t-1}(x_t), y_t)
The online algorithm updates the prediction rule using the new example and constructs f_t
19
(Figure: online loop - predict, the user reveals y_t, update f_t.)
Online Prediction Algorithm
The total error of the method is Σ_{t=1}^T l(f_{t-1}(x_t), y_t)
Goal: make this error as small as possible (see the code sketch below)
Predicting the unknown future one step at a time: similar to generalization error
20
Regret Analysis
f*: the optimal prediction function from a class H, e.g., the class of linear classifiers,
with minimum error after seeing all examples: f* = argmin_{f ∈ H} Σ_{t=1}^T l(f(x_t), y_t)
Regret for the online learning algorithm:
regret = (1/T) Σ_{t=1}^T [ l(f_{t-1}(x_t), y_t) - l(f*(x_t), y_t) ]
21
We want the regret to be as small as possible
Why Low Regret?
Regret for the online learning algorithm: regret = (1/T) Σ_{t=1}^T [ l(f_{t-1}(x_t), y_t) - l(f*(x_t), y_t) ]
Advantages
We do not lose much from not knowing future events
We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
We can also compete with changing environment
22
Advantages of Online Learning
Meets many applications where data arrives sequentially and predictions are required on the fly
Avoids re-training when new data arrives
Applicable in adversarial and competitive environments (e.g., spam filtering, the stock market)
Strong adaptability to changing environment
High efficiency and excellent scalability
Simple to understand and easy to implement
Easy to be parallelized
Theoretical guarantees
23
Where to Apply Online Learning?
24
Online Learning
Social Media
Finance
Internet Security
Robot Motion Planning
Online Learning for Social Media
Recommendation, sentiment/emotion analysis
25
Online Learning for Robot Motion Planning
Tasks
Exploring an unknown terrain
Finding a destination
27
(Figure examples: Rock-Paper-Scissors, you vs. the computer; a robot dog)
Online Learning for Internet Security
Electronic business sectors
Spam email filtering
Fraudulent credit card transaction detection
Network intrusion detection systems, etc.
29
Online Learning for Financial Decision
Financial decision
Online portfolio selection
Sequential investment, etc.
31
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
32
Perceptron Algorithm (F. Rosenblatt, 1958)
One of the oldest machine learning algorithms
Online algorithm for learning a linear threshold function with small error
34
Perceptron Algorithm (F. Rosenblatt, 1958)
Goal: find a linear classifier with small error
35
If there is no error, keep the same weights; otherwise, update: w ← w + y_t x_t.
Intuition Explanation
Want a positive margin: y_t (w^T x_t) > 0
Effect of the Perceptron update on the margin: y_t (w + y_t x_t)^T x_t = y_t (w^T x_t) + ||x_t||^2
So the margin increases, since ||x_t||^2 > 0 (see the code sketch below)
36
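A minimal sketch of the Perceptron just described, assuming the data arrive as rows of a numpy array X with labels y in {-1, +1}; the epochs cap is an implementation convenience, not part of the original algorithm.

import numpy as np

def perceptron(X, y, epochs=10):
    # Update w <- w + y_t x_t only on mistakes, i.e., when y_t (w.x_t) <= 0.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:      # misclassified (or on the boundary)
                w += y_t * x_t                 # update increases the margin y_t (w.x_t)
                mistakes += 1
        if mistakes == 0:                      # separable data: stop once error-free
            break
    return w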
Geometric View
Initialization: start with w0
t=1: (x1, y1 = +1) is misclassified; update: w1 = w0 + y1 x1
t=2: (x2, y2 = -1) is misclassified; update: w2 = w1 + y2 x2
Continue…
41
In-class Practice
Go to practice
42
Perceptron Mistake Bound
Consider a w* that separates the data: y_i (w*^T x_i) > 0 for all i
Define the margin γ = min_i y_i (w*^T x_i) / ||w*||, and let R = max_i ||x_i||
The number of mistakes the Perceptron makes is at most (R/γ)^2
43
The larger the norm of the inputs (R), the larger the mistake bound
The larger the margin γ, the smaller the bound (the more confident the separation)
Proof of Perceptron Mistake Bound [Novikoff, 1963]
Proof: Let v_k be the hypothesis before the k-th mistake, and let u = w*/||w*|| (a unit vector), so y_i (x_i^T u) ≥ γ for all i. Assume the k-th mistake occurs on the input example (x_i, y_i).
First, v_{k+1}^T u = (v_k + y_i x_i)^T u = v_k^T u + y_i (x_i^T u) ≥ v_k^T u + γ; hence after k mistakes, v_{k+1}^T u ≥ k γ.
Second, ||v_{k+1}||^2 = ||v_k||^2 + 2 y_i (v_k^T x_i) + ||x_i||^2 ≤ ||v_k||^2 + R^2, since y_i (v_k^T x_i) ≤ 0 on a mistake and ||x_i||^2 ≤ R^2; hence ||v_{k+1}||^2 ≤ k R^2.
Hence, k γ ≤ v_{k+1}^T u ≤ ||v_{k+1}|| ≤ √k R, which gives k ≤ (R/γ)^2.
A small numerical check of this bound follows below.
44
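A small numerical check of the bound above, under the assumption that the data are generated to be separable with some margin relative to a chosen unit vector u; the dimension, sample size, and the 0.2 margin cutoff are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
d = 10
u = rng.normal(size=d)
u /= np.linalg.norm(u)                       # unit-norm separator w*/||w*||

# Sample points and keep only those at distance >= 0.2 from the hyperplane.
X = rng.normal(size=(2000, d))
X = X[np.abs(X @ u) >= 0.2]
y = np.sign(X @ u)

R = np.max(np.linalg.norm(X, axis=1))        # bound on the input norms
gamma = np.min(y * (X @ u))                  # margin achieved by u
bound = (R / gamma) ** 2                     # Novikoff mistake bound

w = np.zeros(d)
mistakes = 0
for _ in range(100):                         # cycle through the data until error-free
    errors = 0
    for x_t, y_t in zip(X, y):
        if y_t * np.dot(w, x_t) <= 0:
            w += y_t * x_t
            mistakes += 1
            errors += 1
    if errors == 0:
        break

print(f"mistakes = {mistakes}, bound = {bound:.1f}")   # mistakes never exceed the bound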
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
45
Overview
46
Online Learning
Non-sparse learning
Sparse learning
Unsupervised learning
Online/Stochastic Gradient Descent
Online gradient descent
Stochastic gradient descent
47
(Figure: online loop - make a prediction, observe y_t from the user, update f_t.)
In stochastic (or "online") gradient descent, the true gradient of the objective is approximated by the gradient at a single example: w ← w - η ∇Q_i(w), where Q_i denotes the loss on example i (see the sketch below).
As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles.
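A sketch of stochastic gradient descent with per-pass shuffling as the note above describes, assuming a linear model with squared loss; the function name and parameters are illustrative.

import numpy as np

def sgd(X, y, eta=0.01, passes=5, seed=0):
    # For each example i: w <- w - eta * gradient of 0.5 * (w.x_i - y_i)^2.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    idx = np.arange(len(X))
    for _ in range(passes):
        rng.shuffle(idx)                      # reshuffle each pass to prevent cycles
        for i in idx:
            grad = (np.dot(w, X[i]) - y[i]) * X[i]
            w -= eta * grad
    return w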
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
48
Online Non-Sparse Learning
The decision function can be linear (e.g., f(x) = w^T x) or non-linear (e.g., kernel-based)
First order learning methods
Online gradient descent (Zinkevich, 2003)
Passive aggressive learning (Crammer et al., 2006)
Others (including but not limited to)
ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001)
ROMMA: Relaxed Online Maximum Margin Algorithm (Li and Long, 2002)
MIRA: Margin Infused Relaxed Algorithm (Crammer and Singer, 2003)
DUOL: A Double Updating Approach for Online Learning (Zhao et al., 2009)
49
Online margin-based prediction algorithms
Online Gradient Descent (OGD)
(Zinkevich, 2003)
Online convex optimization
Consider a convex objective function f_t(w), where w ∈ S and S is a bounded convex set
Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):
w_{t+1} = Π_S(w_t - η_t ∇f_t(w_t)), where η_t is a learning rate
(the inner step is gradient descent; Π_S is the projection back onto S)
50
Provides a framework to prove regret bounds for online convex optimization.
Online learning does not necessarily need gradient descent.
But classification/regression problems have an objective function to minimize, and gradient descent is one typical solution; ∇f_t(w_t) is the gradient (derivative) with respect to w_t.
Stochastic gradient descent is equivalent to online gradient descent in this case.
The contribution of this work triggered the investigation of online convex optimization, where the objective is a convex function and optimization is performed in the online mode.
Online Gradient Descent (OGD) (Zinkevich, 2003)
For t = 1, 2, …
An unlabeled sample x_t arrives
Make a prediction based on the existing weights: ŷ_t = sgn(w_t^T x_t)
Observe the true class label y_t ∈ {+1, -1}
Update the weights by w_{t+1} = w_t - η_t ∇l(w_t; (x_t, y_t)), where η_t is a learning rate (see the sketch below)
51
(Figure: online loop - predict, the user reveals y_t, update w_{t+1}.)
This is the first generalized treatment of online convex optimization; an O(\sqrt{T}) regret bound is established for the gradient descent algorithm.
51
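A sketch of OGD with projection, assuming the feasible set S is an L2 ball of radius B and the per-round loss is the hinge loss; the decaying rate eta_t = eta0 / sqrt(t) is a common choice, not something prescribed by the slide.

import numpy as np

def project_l2_ball(w, B):
    # Projection onto S = {w : ||w||_2 <= B}.
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

def ogd(stream, d, B=10.0, eta0=0.5):
    w = np.zeros(d)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        eta_t = eta0 / np.sqrt(t)                      # decaying learning rate
        if y_t * np.dot(w, x_t) < 1.0:                 # subgradient of the hinge loss
            grad = -y_t * x_t
        else:
            grad = np.zeros(d)
        w = project_l2_ball(w - eta_t * grad, B)       # gradient step, then projection
    return w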
Passive-Aggressive Online Learning (Crammer et al., 2006)
Each example defines a set of consistent hypotheses: C_t = {w : l(w; (x_t, y_t)) = 0}
The new vector w_{t+1} is set to be the projection of w_t onto C_t
52
Classification
Regression
Uniclass
Assume there exist a weight vector w* and an insensitivity parameter ε for which the data are perfectly realizable. Denote the consistent set by C.
For classification, C is a half-space: C = {w : -y_t (w · x_t) ≤ ε}.
For regression, C is an ε-hyper-slab: C = {w : |w · x_t - y_t| ≤ ε}.
For uniclass, C is a ball of radius ε centered at x_t: C = {w : ||w - x_t|| ≤ ε}.
The optimization problem attempts to keep w_{t+1} as close to w_t as possible, while forcing w_{t+1} to achieve zero loss on the most recent example.
The resulting algorithm is passive whenever the loss is zero.
Passive-Aggressive
53
Passive: when the loss is zero, keep w unchanged. Aggressive: otherwise, update w to achieve zero loss on the most recent example.
Passive Aggressive Online Learning
(Crammer et al., 2006)
PA (Binary classification)
PA-I (C-SVM)
PA-II (Relaxed C-SVM)
Closed-form solution
54
PA: hinge loss for binary classification
PA-I: adds a slack variable with a linear (hinge-like) penalty
PA-II: the slack variable need not be non-negative; it is penalized quadratically (squared penalty)
Passive Aggressive Online Learning
(Crammer et al., 2006)
Algorithm
Objective: w_{t+1} = argmin_w (1/2)||w - w_t||^2 s.t. l(w; (x_t, y_t)) = 0 (PA-I and PA-II add a slack term with parameter C)
Closed-form solutions: w_{t+1} = w_t + τ_t y_t x_t, with τ_t = l_t/||x_t||^2 (PA), τ_t = min{C, l_t/||x_t||^2} (PA-I), τ_t = l_t/(||x_t||^2 + 1/(2C)) (PA-II), where l_t is the hinge loss at round t (see the code sketch after the notes)
55
The Perceptron corresponds to τ_t = 1.
A margin-based online learning framework covering binary classification, C-SVM-style formulations, and regression.
The core of the construction can be viewed as solving a support vector machine on a single example, while replacing the norm constraint of the SVM with a proximity constraint to the current classifier. The benefit is twofold: first, a closed-form solution for the next classifier; second, a unified analysis of the cumulative loss for various online algorithms (regression, uniclass prediction, multiclass problems, and sequence prediction tasks).
The resulting algorithm is passive whenever the hinge loss is zero. If the loss is positive, the algorithm aggressively forces w_{t+1} to satisfy the constraint regardless of the required step size.
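A sketch of the PA, PA-I, and PA-II closed-form updates for binary classification, following the step sizes τ_t stated above (Crammer et al., 2006); the function signature is illustrative.

import numpy as np

def pa_update(w, x, y, variant="PA", C=1.0):
    # One passive-aggressive step for binary classification, y in {-1, +1}.
    loss = max(0.0, 1.0 - y * np.dot(w, x))            # hinge loss on (x, y)
    if loss == 0.0:
        return w                                       # passive: constraint already met
    sq_norm = np.dot(x, x)
    if variant == "PA":
        tau = loss / sq_norm
    elif variant == "PA-I":
        tau = min(C, loss / sq_norm)
    else:                                              # "PA-II"
        tau = loss / (sq_norm + 1.0 / (2.0 * C))
    return w + tau * y * x                             # aggressive: move onto the zero-loss set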
Online Non-Sparse Learning
First order methods
Learn a linear weight vector of the model (first-order information only)
Pros and Cons
Simple and easy to implement
Efficient and scalable for high-dimensional data
Relatively slow convergence rate
56
Online Non-Sparse Learning
Second order online learning methods
Update the weight vector w by maintaining and exploiting both first-order and second-order information
Some representative methods (not an exhaustive list)
SOP: Second Order Perceptron (Cesa-Bianchi et al., 2005)
CW: Confidence Weighted learning (Dredze et al., 2008)
AROW: Adaptive Regularization of Weights (Crammer et al., 2009)
IELLIP: Online Learning by Ellipsoid Method (Yang et al., 2009)
NHERD: Gaussian Herding (Crammer & Lee 2010)
NAROW: New variant of AROW algorithm (Orabona & Crammer 2010)
SCW: Soft Confidence Weighted (SCW) (Hoi et al., 2012)
Pros and Cons
Faster convergence rate
Expensive for high-dimensional data
Relatively sensitive to noise
57
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
58
Sparse Learning
59
Natural Images
Learned bases (f1 , …, f64): “Edges”
Test example: x ≈ 0.8 * f36 + 0.3 * f42 + 0.5 * f63 (a sparse combination of the learned bases)
Online Sparse Learning
Motivation
Space constraint: RAM overflow
Test-time constraint
How to induce sparsity in the weights of online learning algorithms?
60
Online Sparse Learning
Objective function: an L1-regularized loss, e.g., min_w Σ_t L(w; z_t) + g ||w||_1
Problem in online learning
Standard stochastic gradient descent does not yield a sparse solution
Some representative work
Truncated gradient (Langford et al., 2009)
FOBOS: Forward Looking Subgradients (Duchi and Singer, 2009)
Dual averaging (Xiao, 2010)
etc.
61
Truncated Gradient (Langford et al., 2009)
Objective function: the L1-regularized loss above
Stochastic gradient descent: w_{i+1} = w_i - η ∇L(w_i; z_i)
Simple coefficient rounding: set a coefficient to 0 when it is small (|w_j| ≤ θ)
62
Truncated gradient: impose sparsity by modifying stochastic gradient descent. SGD alone does not attain a sparse solution; simple rounding truncates w_j to 0 whenever |w_j| falls below θ.
Simple Coefficient Rounding vs. Less aggressive truncation
Truncated Gradient (Langford et al., 2009)
63
If i/K is an integer, update by rounding. How to select K is a main drawback of the method:
If K is too large, the model keeps too many non-zero elements.
If K = 1 and the step size is small, rounding pulls the weights to zero prematurely.
Directly rounding to zero is too aggressive.
A less aggressive version shrinks the coefficients toward zero by a smaller amount.
Truncated Gradient (Langford et al., 2009)
The amount of shrinkage is measured by a gravity parameter g ≥ 0 (see the sketch below)
When g = 0, the update rule is identical to standard SGD
The truncation can be performed every K online steps
Loss functions:
Logistic
SVM (hinge)
Least square
64
The larger the parameters g and \theta are, the more sparsity is incurred.
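A sketch of the truncated-gradient update: a plain SGD step (here on the hinge loss), followed every K steps by the truncation operator with gravity g and threshold θ; the loss choice and the parameter values are illustrative.

import numpy as np

def truncate(w, alpha, theta):
    # T1 operator: shrink coefficients with |w_j| <= theta toward zero by at most alpha.
    out = w.copy()
    pos = (w >= 0) & (w <= theta)
    neg = (w < 0) & (w >= -theta)
    out[pos] = np.maximum(0.0, w[pos] - alpha)
    out[neg] = np.minimum(0.0, w[neg] + alpha)
    return out                                         # entries with |w_j| > theta are untouched

def truncated_gradient(stream, d, eta=0.1, g=0.05, theta=0.5, K=10):
    w = np.zeros(d)
    for i, (x, y) in enumerate(stream, start=1):
        if y * np.dot(w, x) < 1.0:                     # hinge-loss subgradient step (plain SGD)
            w += eta * y * x
        if i % K == 0:                                 # truncate every K online steps
            w = truncate(w, K * g * eta, theta)        # larger g (or theta) => more sparsity
    return w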
Theoretical result (T: No. of samples)
With an appropriately chosen learning rate, the regret is O(\sqrt{T})
Truncated Gradient (Langford et al., 2009)
65
An O(\sqrt{T}) regret bound is established.
Truncated gradient is consistently competitive with the other two online algorithms and significantly outperforms them on some problems, which suggests its effectiveness.
It is interesting to observe that the qualitative behavior of truncated gradient is often similar to LASSO, especially when very sparse weight vectors are allowed; LASSO usually performs worse when the allowed number of nonzero weights is set too large.
These experiments shed light on the relative strengths of the algorithms in terms of feature sparsification. For large data sets such as Big_Ads, only truncated gradient, coefficient rounding, and the sub-gradient algorithms are applicable.
Dual Averaging (Xiao, 2010)
Objective function: a regularized loss Σ_t L(w; z_t) + λ ||w||_1 (the same sparse-learning setting as above)
Problem: truncated gradient does not produce truly sparse weights when the learning rate is small
Fix: dual averaging, which keeps two state representations:
the parameter w_t
the average (sub)gradient vector ḡ_t = (1/t) Σ_{i=1}^t ∇L(w_i; z_i)
66
Dual Averaging (Xiao, 2010)
w_{t+1} has an entry-wise closed-form solution (see the sketch below)
Advantage: produces truly sparse weights
Disadvantage: must keep a (generally non-sparse) average subgradient vector
67
Convergence and Regret
Algorithm
Average regret
Theoretical bound: similar to gradient descent
68
An average regret bound of O(1/\sqrt{T}) is established, matching (stochastic) gradient descent.
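A sketch of L1-regularized dual averaging with the simple auxiliary function h(w) = ||w||^2 / 2, which yields the entry-wise closed form (soft-thresholding of the average gradient) mentioned above; the hinge-loss subgradient and the parameters lam, gamma are illustrative choices.

import numpy as np

def l1_rda(stream, d, lam=0.1, gamma=1.0):
    # Keep the running average subgradient g_bar; w_{t+1} has a closed form from g_bar.
    w = np.zeros(d)
    g_bar = np.zeros(d)
    for t, (x, y) in enumerate(stream, start=1):
        g = -y * x if y * np.dot(w, x) < 1.0 else np.zeros(d)   # hinge subgradient at w_t
        g_bar = ((t - 1) * g_bar + g) / t                       # average of all subgradients
        shrunk = np.sign(g_bar) * np.maximum(0.0, np.abs(g_bar) - lam)   # soft-threshold
        w = -(np.sqrt(t) / gamma) * shrunk                      # exactly 0 where |g_bar_j| <= lam
    return w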
Variants of Online Sparse Learning Models
Online feature selection (OFS)
A variant of sparse online learning
The key difference is that OFS focuses on selecting a fixed-size subset of features during the online learning process
Could be used as an alternative tool for batch feature selection when dealing with big data
Other existing work
Online learning for group lasso (Yang et al., 2010) and online learning for multi-task feature selection (Yang et al., 2013), which select features in a grouped manner or across similar tasks
69
Online Sparse Learning
Objective
Induce sparsity in the weights of online learning algorithms
Pros and Cons
Simple and easy to implement
Efficient and scalable for high-dimensional data
Relatively slow convergence rate
No perfect way to attain a sparse solution yet
70
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
71
Online Unsupervised Learning
Assumption: the data are generated from some underlying parametric probability density function
Goal: estimate the parameters of the density to give a suitable compact representation
72
Online Unsupervised Learning
Some representative work
Online singular value decomposition (SVD) (Brand, 2003)
Online principal component analysis (PCA) (Warmuth and Kuzmin, 2006)
Online dictionary learning for sparse coding (Mairal et al. 2009)
Online learning for latent Dirichlet allocation (LDA) (Hoffman et al., 2010)
Online variational inference for the hierarchical Dirichlet process (HDP) (Wang et al. 2011)
Online Learning for Collaborative Filtering (Ling et al. 2012)
…
73
SVD: Definition
A: input data matrix, an m × n matrix (e.g., m documents, n terms)
U: left singular vectors, an m × r matrix (documents, topics)
Σ: singular values, an r × r diagonal matrix (strength of each "topic"); r is the rank of matrix A, and the singular values are nonnegative and sorted
V: right singular vectors, an n × r matrix (terms, topics)
A = U Σ V^T = Σ_i σ_i u_i v_i^T, where each σ_i is a scalar and u_i, v_i are vectors (see the numpy example below)
74
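A small numpy illustration of the definition on a toy 4 × 3 matrix; the numbers are arbitrary.

import numpy as np

A = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [2., 1., 0.],
              [1., 1., 1.]])                          # m x n "document-term" matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)      # thin SVD: A = U diag(s) Vt
print(s)                                              # singular values: nonnegative, sorted
print(np.allclose(A, U @ np.diag(s) @ Vt))            # exact reconstruction -> True

k = 2                                                 # keep the k strongest "topics"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]           # best rank-k approximation of A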
Online SVD (Brand, 2003)
Challenges: storage and computation
Idea: an incremental algorithm computes the principal eigenvectors of a matrix without storing the entire matrix in memory
75
Online SVD (Brand, 2003)
76
Online SVD (Brand, 2003)
Complexity: each update works only with the current SVD factors and the new data, not the full matrix
Store: only the factors U, Σ, V
The online SVD has more error, but it is comparable to batch PCA (SVD)
77
Online SVD
Unsupervised learning: minimize the reconstruction error (an incremental-update sketch follows below)
Each update will increase the rank by at most one, until a user-specified ceiling is reached
Pros and Cons
Simple to implement
Fast computation
Comparable performance
Lack of theoretical guarantee
78
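A sketch of a Brand-style incremental update that appends one column to an existing thin SVD without ever storing the full matrix; this is a simplified rendition (no down-dating, naive rank control), not the exact 2003 algorithm.

import numpy as np

def svd_append_column(U, s, Vt, c, max_rank=None):
    # Given A ~ U @ diag(s) @ Vt, return a thin SVD of [A, c] using only the factors.
    m, r = U.shape
    p = U.T @ c                                  # coordinates of c inside the current subspace
    resid = c - U @ p                            # component of c orthogonal to that subspace
    k = np.linalg.norm(resid)
    j = resid / k if k > 1e-10 else np.zeros(m)  # new left basis direction (if any)

    # Small (r+1) x (r+1) core matrix K = [[diag(s), p], [0, k]]; [A, c] = [U, j] K [[Vt, 0], [0, 1]].
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = p
    K[r, r] = k if k > 1e-10 else 0.0
    Uk, sk, Vkt = np.linalg.svd(K)               # SVD of a tiny matrix: cheap

    U_new = np.hstack([U, j[:, None]]) @ Uk      # rotate the enlarged left basis
    n = Vt.shape[1]
    V_big = np.zeros((n + 1, r + 1))             # old right vectors, padded for the new column
    V_big[:n, :r] = Vt.T
    V_big[n, r] = 1.0
    Vt_new = (V_big @ Vkt.T).T

    if max_rank is not None and len(sk) > max_rank:    # enforce a user-specified rank ceiling
        U_new, sk, Vt_new = U_new[:, :max_rank], sk[:max_rank], Vt_new[:max_rank, :]
    return U_new, sk, Vt_new

# Sanity check: rebuild the SVD of a random matrix one column at a time.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A[:, :1], full_matrices=False)
for col in range(1, 4):
    U, s, Vt = svd_append_column(U, s, Vt, A[:, col])
print(np.allclose(A, U @ np.diag(s) @ Vt))             # True up to numerical error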
Outline
Introduction
Learning paradigms
Online learning and its applications
Online learning algorithms
Perceptron
Online non-sparse learning
Online sparse learning
Online unsupervised learning
Conclusion
79
One-slide Takeaway
Basic concepts
What is online learning?
What is regret analysis?
Online learning algorithms
Perceptron
Online gradient descent
Passive aggressive
Truncated gradient
Dual averaging
Online SVD
80
Resources
Book and Video:
Prediction, Learning, and Games. N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.
[Shal11] Online Learning and Online Convex Optimization. Shai Shalev-Shwartz. Foundations and Trends in Machine Learning, Vol. 4, No. 2, 2011, 107-194, DOI: 10.1561/2200000018
http://videolectures.net/site/search/?q=online+learning
Software:
Pegasos: http://www.cs.huji.ac.il/~shais/code/index.html
VW: hunch.net/~vw/
SGD by Leon Bottou: http://leon.bottou.org/projects/sgd
81
References
M. Brand. Fast online svd revisions for lightweight recommender systems. In SDM, 2003.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640–668, 2005.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages 414–422, 2009.
K. Crammer and D. D. Lee. Learning via gaussian herding. In NIPS, pages 451–459, 2010.
K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, pages 264–271, 2008.
C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent dirichlet allocation. In NIPS, pages 856–864, 2010.
S. C. H. Hoi, J. Wang, and P. Zhao. Exact soft confidence-weighted learning. In ICML, 2012.
82
References
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361–387, 2002.
Guang Ling, Haiqin Yang, Irwin King and M.R. Lyu. Online Learning for Collaborative Filtering, IJCNN, Brisbane, Australia, 2012.
P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal., 16(6):964–979, 1979.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, page 87, 2009.
Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.
F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages 1840–1848, 2010.
F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
83
References
C. Wang, J. W. Paisley, and D. M. Blei. Online variational inference for the hierarchical dirichlet process. Journal of Machine Learning Research – Proceedings Track, 15:752–760, 2011.
M. K. Warmuth and D. Kuzmin. Randomized pca algorithms with regret bounds that are logarithmic in the dimension. In NIPS, pages 1481–1488, 2006.
S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection. ACM Transactions on Knowledge Discovery from Data, 2013.
H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages 1191–1198, Haifa, Israel, 2010.
L. Yang, R. Jin, and J. Ye. Online learning by ellipsoid method. In ICML, page 145, 2009.
P. Zhao, S. C. H. Hoi, and R. Jin. Duol: A double updating approach for online learning. In NIPS, pages 2259–2267, 2009.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.
84
In-class Practice
We have two data points; how can we get a classifier by the Perceptron learning rule?
Assume
the first data point is in one class
the second data point is in the other class
The data points are linearly separable and can be presented repeatedly (for validation).
85