University of Toronto
STA 314: Statistical Methods for Machine Learning I
Lecture 9 – Matrix Factorization, Probabilistic Models
Generalization of PCA: matrix factorization. Unifying the course: probabilistic models
Recall: PCA
Dimensionality reduction aims to find a low-dimensional representation of the data.
PCA projects the data onto a subspace which maximizes the projected variance, or equivalently, minimizes the reconstruction error.
The optimal subspace is given by the top eigenvectors of the empirical covariance matrix.
PCA gives a set of decorrelated features.
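As an illustration (not from the original slides), here is a minimal NumPy sketch of this recipe on synthetic data; all variable names are mine:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy data: N=500 points in D=5
K = 2

mu = X.mean(axis=0)
Xc = X - mu                                   # center the data
cov = Xc.T @ Xc / Xc.shape[0]                 # empirical covariance matrix (D x D)

evals, evecs = np.linalg.eigh(cov)            # eigenvalues returned in ascending order
U = evecs[:, ::-1][:, :K]                     # top-K eigenvectors span the principal subspace
Z = Xc @ U                                    # projected (low-dimensional) representation

print(np.round(np.cov(Z, rowvar=False), 4))   # ~diagonal: the projected features are decorrelated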
Some recommender systems in action
Ideally, recommendations should combine global and seasonal interests, use your history if available, adapt over time, and be coherent and diverse.
The Netflix problem
Movie recommendation: Users watch movies and rate them out of 5⭑.
Movie       Rating
Thor        ⭑⭐⭐⭐⭐
Chained     ⭑⭑⭐⭐⭐
Frozen      ⭑⭑⭑⭐⭐
Chained     ⭑⭑⭑⭑⭐
Bambi       ⭑⭑⭑⭑⭑
Titanic     ⭑⭑⭑⭐⭐
Goodfellas  ⭑⭑⭑⭑⭑
Dumbo       ⭑⭑⭑⭑⭑
Twilight    ⭑⭑⭐⭐⭐
(each row is one user's rating of one movie; the user column is shown as icons in the slide)
Users only rate a few items, so we would like to infer their preferences for unrated items
Netflix Prize
PCA as a Matrix Factorization
Recall PCA: project data onto a low-dimensional subspace spanned by the top eigenvectors of the data covariance
Today we consider a generalization, matrix factorizations
▶ view PCA as a matrix factorization problem
▶ extend to matrix completion, where the data matrix is only partially observed
▶ extend to other matrix factorization models, which place different kinds of structure on the factors
PCA as Matrix Factorization
Recall PCA: each input vector x(i) ∈ R^D is approximated as
x(i) ≈ x̃(i) = μ̂ + Uz(i),
where μ̂ = (1/N) ∑_i x(i) is the data mean, U ∈ R^(D×K) is the orthogonal basis for the principal subspace, z(i) ∈ R^K is the code vector, and x̃(i) ∈ R^D is x(i)'s reconstruction or approximation.
Assume for simplicity that the data is centered: μˆ = 0. Then, the approximation looks like
x(i) ≈ ̃x(i) = Uz(i).
PCA as Matrix Factorization
PCA (on centered data): each input vector x(i) is approximated as x(i) ≈ Uz(i).
Writing this in matrix form, we have X ≈ ZU⊤, where X and Z are matrices with one row per data point:
X = [ [x(1)]⊤ ; [x(2)]⊤ ; ⋮ ; [x(N)]⊤ ] ∈ R^(N×D)   and   Z = [ [z(1)]⊤ ; [z(2)]⊤ ; ⋮ ; [z(N)]⊤ ] ∈ R^(N×K)
We can write the squared reconstruction error as
∑_{i=1}^N ∥x(i) − Uz(i)∥² = ∥X − ZU⊤∥²_F,
where ∥⋅∥_F denotes the Frobenius norm:
∥Y∥²_F = ∥Y⊤∥²_F = ∑_{i,j} y²_ij = ∑_i ∥y(i)∥².
PCA as Matrix Factorization
PCA as Matrix Factorization
So PCA is approximating X ≈ ZU⊤, or equivalently X⊤ ≈ UZ⊤.
Based on the sizes of the matrices, this is a rank-K approximation. Since U was chosen to minimize reconstruction error, this is the optimal rank-K approximation, in terms of the error ∥X⊤ − UZ⊤∥²_F.
Supplement: Singular-Value Decomposition (SVD)
There is a close relationship to the Singular Value Decomposition (SVD) of X, which is itself a matrix factorization technique. Consider an N × D matrix X ∈ R^(N×D) with SVD X = QSU⊤.
Q, S, and U⊤ provide a real-valued matrix factorization of X.
Q is an N × D matrix with orthonormal columns, Q⊤Q = I_D, where I_D is the D × D identity matrix.
U is an orthonormal D × D matrix, U⊤ = U⁻¹.
S is a D × D diagonal matrix, with non-negative singular values, s1,s2,…,sD, on the diagonal, where the singular values are conventionally ordered from largest to smallest.
Note that standard SVD notation is X = UDV⊤. We are using X = QSU⊤.
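As a supplementary sketch (not part of the slides), the relationship can be checked numerically with NumPy: truncating the SVD of centered data to the top K singular values gives exactly the rank-K reconstruction ZU⊤ that PCA produces.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X -= X.mean(axis=0)                  # work with centered data
K = 3

Q, s, Ut = np.linalg.svd(X, full_matrices=False)   # X = Q S U^T in the slide's notation
U = Ut.T[:, :K]                      # top-K right singular vectors = principal directions
Z = X @ U                            # code vectors
X_hat = Z @ U.T                      # PCA's rank-K reconstruction Z U^T

# keeping only the K largest singular values gives the same matrix
X_svd = Q[:, :K] @ np.diag(s[:K]) @ Ut[:K, :]
print(np.allclose(X_hat, X_svd))     # True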
Matrix Completion
We just saw that PCA gives the optimal low-rank matrix factorization to a matrix X.
Can we generalize this to the case where X is only partially observed?
▶ A sparse 1000 × 1000 matrix with 50,000 observations (only 5% observed).
▶ A rank 5 approximation requires only 10,000 parameters, so it’s reasonable to fit this.
▶ Unfortunately, there is no closed-form solution.
The Netflix problem
Recall: movie recommendation.
Movie       Rating
Thor        ⭑⭐⭐⭐⭐
Chained     ⭑⭑⭐⭐⭐
Frozen      ⭑⭑⭑⭐⭐
Chained     ⭑⭑⭑⭑⭐
Bambi       ⭑⭑⭑⭑⭑
Titanic     ⭑⭑⭑⭐⭐
Goodfellas  ⭑⭑⭑⭑⭑
Dumbo       ⭑⭑⭑⭑⭑
Twilight    ⭑⭑⭐⭐⭐
Matrix Completion
Matrix completion problem: transform the table into an N users by M movies matrix R.

Rating matrix R (rows: users, shown in the slide as ninja/cat/angel/neutral-face icons; columns: movies, e.g. Chained, Frozen, Bambi, Titanic, Goodfellas, Dumbo, …; ? = unobserved):
2 3 ? ? ? ? ? 1 ?
4 ? 5 ? ? ? ? ? ?
? ? ? 3 5 5 ? ? ?
? ? ? ? ? ? 2 ? ?
? 5 ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? 1

Data: users rate some movies, giving entries R_user,movie. Very sparse.
Task: predict missing entries, i.e. how a user would rate a movie they haven't previously rated.
Evaluation metric: squared error (used by the Netflix competition). Is this a reasonable metric?
Matrix Completion
In our current setting, latent factor models attempt to explain the ratings by characterizing both movies and users on a number of factors K inferred from the ratings patterns.
That is, we seek representations for movies and users as vectors in RK that can ultimately be translated to ratings.
For simplicity, we can associate these factors (i.e. the dimensions of the vectors) with idealized concepts, e.g. genres (such as how much a movie is a horror film)
▶ But also uninterpretable dimensions
Can we use the sparse ratings matrix R to find these latent factors automatically?
Matrix Completion
Let the representation of user i in the K-dimensional space be ui and the representation of movie j be zj
▶ Intuition: maybe the first entry of ui says how much the user likes horror films, and the first entry of zj says how much movie j is a horror film.
Assume the rating user i gives to movie j is given by a dot product: R_ij ≈ u_i⊤ z_j
In matrix form, if:
U = [ u_1⊤ ; ⋮ ; u_N⊤ ] (rows are the user vectors)   and   Z⊤ = [ z_1 ⋯ z_M ] (columns are the movie vectors)
then: R ≈ UZ⊤
This is a matrix factorization problem!
Matrix Completion
Recall PCA: to enforce X⊤ ≈ UZ⊤, we minimized
min_{U,Z} ∥X⊤ − UZ⊤∥²_F = ∑_{i,j} (x_ji − u_i⊤ z_j)²
where u_i and z_j are the i-th and j-th rows of the matrices U and Z, respectively.
What’s different about the Netflix problem?
▶ Most entries are missing!
▶ We only want to count the error for the observed entries.
Matrix Completion
Let O = {(n, m) ∶ entry (n, m) of matrix R is observed}
Using the squared error loss, matrix completion requires solving
min_{U,Z} (1/2) ∑_{(i,j)∈O} (R_ij − u_i⊤ z_j)²
The objective is non-convex in U and Z jointly, and in fact it’s generally
NP-hard to minimize the above cost function exactly.
As a function of either U or Z individually, the problem is convex and easy to optimize. We can use coordinate descent, just like with K-means and mixture models!
Alternating Least Squares (ALS): fix Z and optimize U, then fix U and optimize Z, and repeat until convergence.
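As a small illustration (my own sketch, not course code), the objective above can be evaluated in NumPy by representing the observed set O as a boolean mask:

import numpy as np

def completion_loss(R, observed, U, Z):
    """0.5 * sum over observed (i, j) of (R_ij - u_i^T z_j)^2."""
    E = R - U @ Z.T                            # residuals for every entry
    return 0.5 * np.sum((E * observed) ** 2)   # mask zeroes out unobserved entries

# tiny example: 3 users, 4 movies, rank 2 (here a 0 rating marks a missing entry)
R = np.array([[5., 0., 3., 0.],
              [0., 2., 0., 1.],
              [4., 0., 0., 5.]])
observed = R > 0
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 2))
Z = rng.normal(size=(4, 2))
print(completion_loss(R, observed, U, Z))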
Alternating Least Squares
Want to minimize the squared error cost with respect to the factor U. (The case of Z is exactly symmetric.)
We can decompose the cost into a sum of independent terms:
∑_{(i,j)∈O} (R_ij − u_i⊤ z_j)² = ∑_i [ ∑_{j:(i,j)∈O} (R_ij − u_i⊤ z_j)² ]
where each bracketed inner sum depends only on u_i.
This can be minimized independently for each u_i.
This is a linear regression problem in disguise. Its optimal solution is:
u_i = ( ∑_{j:(i,j)∈O} z_j z_j⊤ )⁻¹ ( ∑_{j:(i,j)∈O} R_ij z_j )
Alternating Least Squares
ALS for Matrix Completion problem
Initialize U and Z randomly
repeat until convergence:
    for i = 1, …, N do
        u_i ← ( ∑_{j:(i,j)∈O} z_j z_j⊤ )⁻¹ ∑_{j:(i,j)∈O} R_ij z_j
    for j = 1, …, M do
        z_j ← ( ∑_{i:(i,j)∈O} u_i u_i⊤ )⁻¹ ∑_{i:(i,j)∈O} R_ij u_i
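One possible NumPy translation of this pseudocode (a sketch with my own naming, not the official course implementation; it also adds a tiny ridge term for numerical stability, which the pseudocode does not have):

import numpy as np

def als(R, observed, K=2, n_iters=50, lam=1e-6, seed=0):
    """Alternating Least Squares for matrix completion (sketch)."""
    N, M = R.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(N, K))
    Z = rng.normal(size=(M, K))
    ridge = lam * np.eye(K)                  # keeps the K x K systems invertible
    for _ in range(n_iters):
        for i in range(N):                   # update each user vector u_i
            obs = observed[i]                # movies rated by user i
            A = Z[obs].T @ Z[obs] + ridge
            b = Z[obs].T @ R[i, obs]
            U[i] = np.linalg.solve(A, b)
        for j in range(M):                   # update each movie vector z_j
            obs = observed[:, j]             # users who rated movie j
            A = U[obs].T @ U[obs] + ridge
            b = U[obs].T @ R[obs, j]
            Z[j] = np.linalg.solve(A, b)
    return U, Z

With the toy R and observed mask from the masked-loss sketch above, U, Z = als(R, observed, K=2) followed by U @ Z.T gives predicted ratings for every entry, including the missing ones.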
So far we have motivated unsupervised learning by discovering latent structure in data: clusters or linear structure.
Now we will start to put together a more formulaic (perhaps more principled) view of unsupervised learning as a probabilistic method.
Actually, this point of view will also unify supervised learning.
So far in the course we have adopted a modular perspective, in which the model, loss function, optimizer, and regularizer are specified separately.
Today we will begin putting together a probabilistic interpretation of the choice of model and loss, and introduce the concept of maximum likelihood estimation.
Let’s start with a simple biased coin example.
▶ You flip a coin N = 100 times and get outcomes {x1,…,xN} where xi ∈ {0, 1} and xi = 1 is interpreted as heads H.
▶ Suppose you had NH = 55 heads and NT = 45 tails.
▶ What is the probability it will come up heads if we flip again? Let's design a model for this scenario and fit the model. We can then use the fitted model to predict the next outcome.
The coin is possibly loaded. So, we can assume that one coin flip outcome x is a Bernoulli random variable for some currently unknown parameter θ ∈ [0, 1].
p(x = 1 ∣ θ) = θ and p(x = 0 ∣ θ) = 1 − θ, or more succinctly p(x ∣ θ) = θ^x (1 − θ)^(1−x)
It’s sensible to assume that {x1, . . . , xN } are independent and identically distributed (i.i.d.) Bernoullis.
Thus the joint probability of the outcome {x1,…,xN} is
p(x_1, …, x_N ∣ θ) = ∏_{i=1}^N θ^(x_i) (1 − θ)^(1−x_i)
We call the probability mass (or density for continuous) of the observed data the likelihood function (as a function of the parameters θ):
L(θ) = ∏_{i=1}^N θ^(x_i) (1 − θ)^(1−x_i)
We usually work with log-likelihoods:
ℓ(θ) = ∑_{i=1}^N x_i log θ + (1 − x_i) log(1 − θ)
How can we choose θ? Good values of θ should assign high probability to the observed data. This motivates the maximum likelihood criterion, that we should pick the parameters that maximize the likelihood:
θ̂_ML = arg max_{θ∈[0,1]} ℓ(θ)
Maximum Likelihood Estimation for the Coin Example
Remember how we found the optimal solution to linear regression by setting derivatives to zero? We can do that again for the coin example.
dℓ/dθ = d/dθ ( ∑_{i=1}^N x_i log θ + (1 − x_i) log(1 − θ) )
       = d/dθ ( N_H log θ + N_T log(1 − θ) )
       = N_H/θ − N_T/(1 − θ),
where N_H = ∑_i x_i and N_T = N − ∑_i x_i.
Setting this to zero gives the maximum likelihood estimate:
θ̂_ML = N_H / (N_H + N_T).
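A quick numerical sanity check (my own snippet, using N_H = 55 and N_T = 45 from the example): evaluating ℓ(θ) on a grid shows the maximum at N_H / (N_H + N_T) = 0.55.

import numpy as np

N_H, N_T = 55, 45
theta = np.linspace(0.01, 0.99, 981)                      # grid over (0, 1)
log_lik = N_H * np.log(theta) + N_T * np.log(1 - theta)   # the log-likelihood from above
print(theta[np.argmax(log_lik)])                          # ~0.55
print(N_H / (N_H + N_T))                                  # 0.55, the closed-form MLE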
Maximum Likelihood Estimation
Notice that in the coin example we are actually minimizing cross-entropies!
θ̂ = arg max_{θ∈[0,1]} ℓ(θ)
   = arg min_{θ∈[0,1]} −ℓ(θ)
   = arg min_{θ∈[0,1]} ∑_{i=1}^N −x_i log θ − (1 − x_i) log(1 − θ)
This is an example of maximum likelihood estimation.
▶ define a model that assigns a probability (or probability density) to a dataset
▶ maximize the likelihood (or minimize the neg. log-likelihood).
Many examples we’ve considered fall in this framework! Let’s consider classification again.
Generative vs Discriminative
Two approaches to classification:
Discriminative approach: estimate parameters of decision boundary/class separator directly from labeled examples.
▶ Model p(t∣x) directly (logistic regression models)
▶ Learn mappings from inputs to classes (linear/logistic regression,
decision trees etc)
▶ Tries to solve: How do I separate the classes?
Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier).
▶ Model p(x∣t)
▶ Apply Bayes Rule to derive p(t∣x).
▶ Tries to solve: What does each class "look" like?
Key difference: is there a distributional assumption over inputs?
A Generative Model: Bayes Classifier
Aim to classify text into spam/not-spam (yes c = 1; no c = 0).
Example: "You are one of the very few who have been selected as a winners for the free $1000 Gift Card."
Use bag-of-words features, get binary vector x for each email
Vocabulary:
▶ "a": 1
▶ "car": 0
▶ "card": 1
▶ "win": 0
▶ "winner": 1
▶ "winter": 0
▶ …
▶ "you": 1
Bayes Classifier
Given features x = [x1,x2,⋯,xD]T we want to compute class probabilities using Bayes Rule:
More formally,
p(c ∣ x) = p(x, c) / p(x) = p(x ∣ c) p(c) / p(x)
that is, posterior = (class likelihood × prior) / evidence,
where p(c ∣ x) is the probability of the class given the words, and p(x ∣ c) is the probability of the words given the class.
How can we compute p(x) for the two-class case? (Do we need to?)
p(x) = p(x ∣ c = 0) p(c = 0) + p(x ∣ c = 1) p(c = 1)
To compute p(c ∣ x) we need: p(x ∣ c) and p(c)
Naïve Bayes
Assume we have two classes: spam and non-spam. We have a dictionary of D words, and binary features x = [x1,…,xD] saying whether each word appears in the e-mail.
If we define a joint distribution p(c,x1,…,xD), this gives enough information to determine p(c) and p(x∣c).
Problem: specifying a joint distribution over D + 1 binary variables requires 2^(D+1) − 1 entries. This is computationally prohibitive and would require an absurd amount of data to fit.
We’d like to impose structure on the distribution such that:
▶ it can be compactly represented
▶ learning and inference are both tractable
Naïve Bayes
Naïve assumption: Naïve Bayes assumes that the word features x_i are conditionally independent given the class c.
▶ This means xi and xj are independent under the conditional distribution p(x∣c).
▶ Note: this doesn’t mean they’re independent.
▶ Mathematically,
p(c, x1, . . . , xD ) = p(c)p(x1∣c)⋯p(xD ∣c).
Compact representation of the joint distribution
▶ Prior probability of class: p(c = 1) = π (e.g. spam email)
▶ Conditional probability of word feature given class: p(xj = 1∣c) = θjc
(e.g. word ”price” appearing in spam)
▶ 2D + 1 parameters total (vs. 2^(D+1) − 1 before)
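To make the saving concrete, a tiny check for a hypothetical vocabulary of D = 10 words (these numbers are mine, not from the slides):

D = 10
full_joint = 2 ** (D + 1) - 1    # entries for an unrestricted joint over c, x_1, ..., x_D
naive_bayes = 2 * D + 1          # pi, plus theta_j0 and theta_j1 for each word j
print(full_joint, naive_bayes)   # 2047 vs 21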
Bayes Nets
We can represent this model using a directed graphical model, or Bayesian network (the class node c has a directed edge to each feature node x_1, …, x_D):
This graph structure means the joint distribution factorizes as a product of conditional distributions for each variable given its parent(s).
Intuitively, you can think of the edges as reflecting a causal structure. But mathematically, this doesn’t hold without additional assumptions.
Naïve Bayes: Learning
The parameters can be learned efficiently because the log-likelihood decomposes into independent terms for each feature.
ℓ(θ) = ∑_{i=1}^N log p(c(i), x(i)) = ∑_{i=1}^N log { p(x(i) ∣ c(i)) p(c(i)) }
     = ∑_{i=1}^N log { p(c(i)) ∏_{j=1}^D p(x_j(i) ∣ c(i)) }
     = ∑_{i=1}^N [ log p(c(i)) + ∑_{j=1}^D log p(x_j(i) ∣ c(i)) ]
     = ∑_{i=1}^N log p(c(i))   +   ∑_{j=1}^D ∑_{i=1}^N log p(x_j(i) ∣ c(i))
       (Bernoulli log-likelihood of labels)   (Bernoulli log-likelihood for feature x_j)
Each of these log-likelihood terms depends on different sets of parameters, so they can be optimized independently.
Naïve Bayes: Learning
We can handle these terms separately. For the prior we maximize: ∑_{i=1}^N log p(c(i))
This is a minor variant of our coin flip example. Let p(c(i) = 1) = π. Then p(c(i)) = π^(c(i)) (1 − π)^(1−c(i)).
Log-likelihood:
∑_{i=1}^N log p(c(i)) = ∑_{i=1}^N c(i) log π + ∑_{i=1}^N (1 − c(i)) log(1 − π)
Obtain MLEs by setting derivatives to zero:
π̂ = ( ∑_i 1[c(i) = 1] ) / N = (# spams in dataset) / (total # samples)
Naïve Bayes: Learning
Each θ_jc can be treated separately: maximize ∑_{i=1}^N log p(x_j(i) ∣ c(i))
This is (again) a minor variant of our coin flip example.
Let θ_jc = p(x_j(i) = 1 ∣ c). Note that p(x_j(i) ∣ c) = θ_jc^(x_j(i)) (1 − θ_jc)^(1−x_j(i)).
Log-likelihood:
∑_{i=1}^N log p(x_j(i) ∣ c(i)) = ∑_{i=1}^N c(i) { x_j(i) log θ_j1 + (1 − x_j(i)) log(1 − θ_j1) }
                               + ∑_{i=1}^N (1 − c(i)) { x_j(i) log θ_j0 + (1 − x_j(i)) log(1 − θ_j0) }
Obtain MLEs by setting derivatives to zero:
θ̂_jc = ( ∑_i 1[x_j(i) = 1 & c(i) = c] ) / ( ∑_i 1[c(i) = c] )
which for c = 1 is (# times word j appears in spams) / (# spams in dataset).
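These counting formulas translate directly into code. A minimal NumPy sketch (my own variable names; X is the N × D binary word-appearance matrix and c the binary labels, with no smoothing, exactly as in the MLE above):

import numpy as np

def fit_naive_bayes(X, c):
    """MLE for Bernoulli naive Bayes. X: (N, D) binary features, c: (N,) binary labels."""
    pi = c.mean()                              # p(c = 1): fraction of spam
    D = X.shape[1]
    theta = np.zeros((D, 2))
    for k in (0, 1):
        theta[:, k] = X[c == k].mean(axis=0)   # p(x_j = 1 | c = k), one count ratio per word
    return pi, theta

# toy data: 6 e-mails, 3 vocabulary words
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])
c = np.array([1, 0, 1, 1, 0, 0])
pi, theta = fit_naive_bayes(X, c)
print(pi)        # 0.5
print(theta)     # theta[j, k] = estimated p(word j present | class k)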
Naïve Bayes: Inference
We predict the category by performing inference in the model. Apply Bayes' Rule:
p(c ∣ x) = p(c) p(x ∣ c) / ∑_{c′} p(c′) p(x ∣ c′) = p(c) ∏_{j=1}^D p(x_j ∣ c) / ∑_{c′} p(c′) ∏_{j=1}^D p(x_j ∣ c′)
We need not compute the denominator if we’re simply trying to determine the most likely c.
Shorthand notation:
p(c ∣ x) ∝ p(c) ∏_{j=1}^D p(x_j ∣ c)
For input x, predict by comparing the values of p(c) ∏_{j=1}^D p(x_j ∣ c) for different c (e.g. choose the largest).
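Continuing the sketch above (same assumed pi and theta layout; in practice one often adds Laplace smoothing, not covered here, to avoid log(0) when a count is zero), prediction compares log p(c) + ∑_j log p(x_j ∣ c) across classes:

import numpy as np

def predict_naive_bayes(x, pi, theta):
    """Return p(c = 1 | x) for one binary feature vector x, given pi and theta[j, c]."""
    log_prior = np.log(np.array([1 - pi, pi]))
    # sum_j log p(x_j | c), for c = 0 and c = 1
    log_lik = (x[:, None] * np.log(theta)
               + (1 - x[:, None]) * np.log(1 - theta)).sum(axis=0)
    log_joint = log_prior + log_lik              # log p(c) + log p(x | c)
    log_joint -= log_joint.max()                 # subtract the max for numerical stability
    post = np.exp(log_joint)
    return post[1] / post.sum()

pi = 0.5                                         # e.g. the values fit in the sketch above
theta = np.array([[1/3, 2/3],
                  [1/3, 2/3],
                  [1/3, 2/3]])
x_new = np.array([1, 0, 1])
print(predict_naive_bayes(x_new, pi, theta))     # ~0.67: more likely spam than not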
Naïve Bayes
Naïve Bayes is an amazingly cheap learning algorithm!
Training time: estimate parameters using maximum likelihood
▶ Compute co-occurrence counts of each feature with the labels.
▶ Requires only one pass through the data!
Test time: apply Bayes’ Rule
▶ Cheap because of the model structure. (For more general models,
Bayesian inference can be very expensive and/or complicated.)
We covered the Bernoulli case for simplicity. But our analysis easily extends to other probability distributions.
Unfortunately, it's usually less accurate in practice compared to discriminative models due to its "naïve" independence assumption.