Matrix Factorization Methods
Lecture 7: Recommender Systems / Matrix Factorization
1
CMSC5741 Big Data Tech. & Apps.
Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
The Netflix Problem
Netflix database
About half a million users
About 18,000 movies
People assign ratings to movies
A sparse matrix
2
The Netflix Problem
Netflix database
Over 480,000 users
About 18,000 movies
Over 100,000,000 ratings
People assign ratings to movies
A sparse matrix
Only 1.16% of the full matrix is observed
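This sparsity figure follows directly from the counts above:
$$\frac{100{,}000{,}000}{480{,}000 \times 18{,}000} \approx \frac{10^8}{8.64 \times 10^9} \approx 1.16\%$$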
3
The Netflix Problem
Netflix database
About half a million users
About 18,000 movies
People assign ratings to movies
A sparse matrix
Challenge:
Complete the “Netflix Matrix”
Many such problems: collaborative filtering, partially filled out surveys …
4
BellKor Recommender System
The winner of the Netflix Challenge!
Multi-scale modeling of the data:
Combine top-level, “regional” modeling of the data with a refined, local view:
Global:
Overall deviations of users/movies
Factorization:
Addressing “regional” effects
Collaborative filtering:
Extract local patterns
5
Modeling Local & Global Effects
Global:
Mean movie rating: 3.7 stars
The Sixth Sense is 0.5 stars above avg.
Joe rates 0.2 stars below avg.
Baseline estimation:
Joe will rate The Sixth Sense 4 stars
Local neighborhood (CF/NN):
Joe didn’t like related movie Signs
Final estimate:
Joe will rate The Sixth Sense 3.8 stars
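In equation form, with $\mu$ the global mean and $b_u$, $b_i$ the user and movie deviations:
$$\hat{r}_{ui} = \mu + b_u + b_i = 3.7 - 0.2 + 0.5 = 4.0 \text{ stars}$$
The local neighborhood correction of $-0.2$ stars (Joe disliked the related movie Signs) then yields the final estimate of 3.8 stars.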
6
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
7
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
8
High Dimensional Data
9
High dim. data: Locality Sensitive Hashing, Clustering, Dimensionality Reduction
Graph data: PageRank, SimRank, Community Detection, Spam Detection
Infinite data: Filtering Data Streams, Web Advertising, Queries on Streams
Machine learning: SVM, Decision Trees, Perceptron, kNN
Apps: Recommender Systems, Association Rules, Duplicate Document Detection
Matrix Completion
Given a matrix, we observe only a subset of its entries
Can we guess the missing entries?
Everyone would agree this looks impossible.
10
Massive High-dimensional Data
Engineering/scientific applications:
Unknown matrix often has (approx.) low rank.
Images
Videos
Text
Web data
High-dimensional, but often with low-dimensional structure
11
Matrix Recovery Algorithm
Recovery by minimum complexity (assuming no noise): try to recover the lowest-complexity (rank) matrix that agrees with the observations,
$$\min_{X'} \operatorname{rank}(X') \quad \text{s.t.} \quad X'_{ij} = X_{ij} \text{ for all observed } (i,j)$$
Rank minimization is NP-hard: not feasible for N > 10!
Resort to other approaches
Select a low rank K, and approximate X by a rank-K matrix X'
12
Low Rank Factorization
Assume X can be recovered by a rank-K matrix X'
Then X' can be factorized into the product of two rank-K factors, $X' = U^\top V$ with $U \in \mathbb{R}^{K \times N}$ and $V \in \mathbb{R}^{K \times M}$
Let $\ell(X, U^\top V) = \sum_{(i,j)\,\text{observed}} (X_{ij} - U_i^\top V_j)^2$ be the loss function
Recovery by a rank-K matrix: $\min_{U,V} \ell(X, U^\top V)$
13
Overview of Matrix Factorization Methods
Some methods are traditional mathematical ways of factorizing a fully observed matrix:
SVD, LU, Eigen Decomposition
Some methods can factorize a partially observed matrix:
PMF, SVD++, MMMF
Some methods have multiple applications:
NMF in image processing
NMF in collaborative filtering
14
MMMF – Maximum Margin Matrix Factorization
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
15
LU Decomposition
16
LU Decomposition
The LU Decomposition factors a matrix as the product of a lower triangular matrix (L) and an upper triangular matrix (U).
Lower triangular matrix: every entry above the main diagonal is zero.
Upper triangular matrix: every entry below the main diagonal is zero.
LU Decomposition
LU Decomposition is useful when
Solving a system of linear equations
Inverting a matrix
Computing the determinant of a matrix
LU Decomposition can be computed using a method similar to Gaussian Elimination
17
LU Decomposition
Computing LU Decomposition of a matrix A
Using Gaussian elimination to compute U
Apply inverse operation on the corresponding entry to I to get L
18
(Worked example figure: A is reduced to U by the row operations R2 − 2R1, R3 − 3R1, R2 × (−1/8), R3 + 15R2, R3 × (−1/12).)
LU Decomposition
Computing LU Decomposition of a matrix A
Using Gaussian elimination to compute U
Apply inverse operation on the corresponding entry to I to get L
Any row operation that adds a multiple of one row to another, say $R_i + kR_j$, puts the value $-k$ in the i-th row, j-th column of the identity matrix.
Any row operation that produces a leading one on the main diagonal, say $kR_i$, puts the value $1/k$ in the position of the identity matrix where the leading one occurs.
19
LU Decomposition
Computing LU Decomposition of a matrix A
Using Gaussian elimination to compute U
Apply inverse operation on the corresponding entry to I to get L
20
(Worked example figure: I is transformed into L by the inverse operations R2 + 2R1, R3 + 3R1, R2 × (−8), R3 − 15R2, R3 × (−12).)
LU Decomposition
Computing LU Decomposition of a matrix A
Using Gaussian elimination to compute U
Apply inverse operation on the corresponding entry to I to get L
21
(Figure: the resulting factorization A = LU.)
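A minimal Python/NumPy sketch of this procedure (a Doolittle-style variant without pivoting, so it assumes all pivots encountered are non-zero; unlike the slides it leaves the pivots in U rather than rescaling rows to leading ones, and the 3×3 matrix is an arbitrary illustration):

```python
import numpy as np

def lu_no_pivot(A):
    """Gaussian-elimination LU decomposition without pivoting.
    Mirrors the slides' bookkeeping: each row operation R_i - k*R_j
    applied to U records the multiplier k at position (i, j) of L."""
    U = A.astype(float).copy()
    n = U.shape[0]
    L = np.eye(n)
    for j in range(n - 1):
        for i in range(j + 1, n):
            k = U[i, j] / U[j, j]   # multiplier for R_i <- R_i - k*R_j
            U[i, :] -= k * U[j, :]  # eliminate the entry below the pivot
            L[i, j] = k             # record the inverse operation in L
    return L, U

A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
L, U = lu_no_pivot(A)
assert np.allclose(L @ U, A)  # A = LU
```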
In-class Practice 1
Go to practice
22
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
23
Singular Value Decomposition
The Singular Value Decomposition (SVD) of an N×M matrix A is a factorization of the form
$$A = U\Sigma V^*$$
$U$ is an N×N orthonormal matrix, i.e., $U^*U = I$
$\Sigma$ is an N×M rectangular diagonal matrix with non-negative entries
$V^*$ is the conjugate transpose of $V$, an M×M orthonormal matrix, i.e., $V^*V = I$
24
Singular Value Decomposition
The SVD of an N×M matrix A: $A = U\Sigma V^*$
Diagonal entries of $\Sigma$ are called the singular values of A.
Columns of U and V are called the left singular vectors and right singular vectors of A, respectively.
The singular values are arranged in descending order in $\Sigma$.
25
Singular Value Decomposition
SVD v.s. Eigen Decomposition (for $A = U\Sigma V^*$):
The left singular vectors of A are eigenvectors of $AA^*$, because $AA^* = U\Sigma V^* V \Sigma^* U^* = U(\Sigma\Sigma^*)U^*$
The right singular vectors of A are eigenvectors of $A^*A$, because $A^*A = V\Sigma^* U^* U \Sigma V^* = V(\Sigma^*\Sigma)V^*$
The singular values of A are the square roots of the eigenvalues of both $AA^*$ and $A^*A$
26
SVD Example
We give an example of a simple SVD decomposition
27
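Any linear algebra library reproduces such a decomposition; a minimal NumPy sketch (the 2×2 matrix below is an arbitrary stand-in for the slide's example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Full SVD: A = U @ diag(s) @ Vt, with singular values s in descending order
U, s, Vt = np.linalg.svd(A)
print(s)                                    # [4. 2.] for this symmetric example
assert np.allclose(U @ np.diag(s) @ Vt, A)  # reconstruction check
```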
SVD as Low Rank Approximation
Low Rank Approximation: $\min_{\operatorname{rank}(B)=r} \|A - B\|_F$
SVD gives the optimal solution.
Solution (Eckart–Young Theorem)
Let $A = U\Sigma V^*$ be the SVD of A, and let $\Sigma_r$ be the same as $\Sigma$ but keeping only the largest r singular values (setting the rest to zero). Then $B = U\Sigma_r V^*$
is the solution to the above problem.
28
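The theorem is easy to verify numerically: the optimal rank-r Frobenius error equals the norm of the discarded singular values (a sketch; the random matrix is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # best rank-r approximation

# Eckart-Young: ||A - A_r||_F = sqrt(sum of squared discarded singular values)
err = np.linalg.norm(A - A_r, "fro")
assert np.isclose(err, np.sqrt((s[r:] ** 2).sum()))
```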
SVD as Low Rank Approximation
It works when A is fully observed.
What if A is only partially observed?
29
Low Rank Approximation for Partially Observed Matrix
$$\min_{U,V} \sum_{i,j} I_{ij}\big(X_{ij} - U_i^\top V_j\big)^2$$
$I_{ij}$ is the indicator that equals 1 if $X_{ij}$ is observed and 0 otherwise
We consider only the observed entries.
A natural probabilistic extension of the above formulation is Probabilistic Matrix Factorization
30
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
31
Probabilistic Matrix Factorization
A popular collaborative filtering (CF) method
Follow the low rank matrix factorization framework
32
Collaborative Filtering
The goal of collaborative filtering (CF) is to infer user preferences for items given a large but incomplete collection of preferences for many users.
For example:
Suppose you infer from the data that most of the users who like “Star Wars” also like “Lord of the Rings” and dislike “Dune”.
Then if a user watched and liked “Star Wars”, you would recommend him/her “Lord of the Rings” but not “Dune”.
Preferences can be explicit or implicit:
Explicit preferences
Ratings assigned to items
Facebook “Like”, Google “Plus“
Implicit preferences
Catalog browse history
Items rented or bought by users
33
Content Based Filtering vs. Collaborative Filtering
Collaborative Filtering
User preferences are inferred from ratings
Item features are inferred from ratings
Cannot recommend new items
Very effective with sufficient data
Content Based Filtering
Analyze the content of the item
Match the item features with users’ preferences
Item features are hard to extract
Music, Movies
Can recommend new items
34
CF as Matrix Completion
CF can be viewed as a matrix completion problem
Task: given a user/item matrix with only a small subset of entries present, fill in (some of) the missing entries.
PMF approach: low rank matrix factorization.
(Figure: user × item rating matrix with most entries missing.)
35
Collaborative Filtering and Matrix Factorization
Collaborative filtering can be formulated as a matrix factorization problem.
Many matrix factorization methods can be used to solve the collaborative filtering problem (e.g., pLSA, PMF, SVD++, MMMF).
This is only a partial list.
36
Notations
Suppose we have M items, N users, and integer rating values from 1 to D.
Let the ij-th entry of X, $X_{ij}$, be the rating of user i for item j.
$U \in \mathbb{R}^{K \times N}$ is the latent user feature matrix; $U_i$ denotes the latent feature vector for user i.
$V \in \mathbb{R}^{K \times M}$ is the latent item feature matrix; $V_j$ denotes the latent feature vector for item j.
(Here K denotes the number of latent dimensions.)
37
Matrix Factorization: the Non-probabilistic View
To predict the rating given by user i to item j, compute the inner product $\hat{X}_{ij} = U_i^\top V_j = \sum_{k=1}^{K} U_{ki} V_{kj}$.
Intuition
The item feature vector can be viewed as the input.
The user feature vector can be viewed as the weight vector.
The predicted rating is the output.
Unlike in linear regression, where inputs are given and weights are learned, we learn both the weights and the input by minimizing squared error.
The model is symmetric in items and users.
38
Probabilistic Matrix Factorization
PMF is a simple probabilistic linear model with Gaussian observation noise.
Given the feature vectors for the user and the item, the distribution of the corresponding rating is:
$$p(X_{ij} \mid U_i, V_j, \sigma^2) = \mathcal{N}(X_{ij} \mid U_i^\top V_j, \sigma^2)$$
The user and item feature vectors adopt zero-mean spherical Gaussian priors:
$$p(U \mid \sigma_U^2) = \prod_{i=1}^{N} \mathcal{N}(U_i \mid 0, \sigma_U^2 I), \qquad p(V \mid \sigma_V^2) = \prod_{j=1}^{M} \mathcal{N}(V_j \mid 0, \sigma_V^2 I)$$
39
Probabilistic matrix factorization (PMF): unlike most other approaches (with the exception of the other matrix-factorization-based ones), PMF scales well to large datasets. Furthermore, unlike most existing algorithms, which have trouble making accurate predictions for users who have very few ratings, PMF performs well on very sparse and imbalanced datasets, such as the Netflix dataset.
Zero-mean spherical Gaussian priors on U and V: each row of U is drawn from a multivariate Gaussian with mean μ = 0 and a precision that is some multiple of the identity matrix I (α_U for U, α_V for V).
Given small precision parameters, the priors on U and V ensure that the latent variables do not grow too far from 0. This prevents overly strong user preferences and item factor compositions from being learned. This is commonly known as complexity control, where the complexity of the model here is measured by the magnitude of the latent variables. Controlling complexity like this helps prevent overfitting, which allows the model to generalize better to unseen data. We must also choose an appropriate α value for the normal distribution of the ratings R, so the challenge becomes choosing appropriate values for α_U, α_V, and α.
Probabilistic Matrix Factorization
Maximum A Posteriori (MAP): maximize the log-posterior over user and item features with fixed hyper-parameters.
MAP is equivalent to minimizing the following objective function:
PMF objective function:
$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij}\big(X_{ij} - U_i^\top V_j\big)^2 + \frac{\lambda_U}{2} \sum_{i=1}^{N} \|U_i\|^2 + \frac{\lambda_V}{2} \sum_{j=1}^{M} \|V_j\|^2$$
40
Probabilistic Matrix Factorization
Here $\lambda_U = \sigma^2/\sigma_U^2$, $\lambda_V = \sigma^2/\sigma_V^2$, and $I_{ij}$ is the indicator of whether user i rated item j.
The first term is the sum-of-squared errors.
The second and third terms are quadratic regularization terms to avoid over-fitting.
(PMF objective function, as above)
41
When the observation noise variance σ² and the prior variances σ_U² and σ_V² are all kept fixed, maximizing the log-posterior is equivalent to minimizing the sum-of-squared-errors objective function with quadratic regularization terms.
In-class Practice 2
Go to practice
42
Probabilistic Matrix Factorization
If all ratings were observed, the objective reduces to the SVD objective in the limit of prior variances going to infinity.
PMF can be viewed as a probabilistic extension of SVD.
PMF objective function
43
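A compact sketch of minimizing this objective by full-batch gradient descent, using the notation above (the paper uses momentum-based stochastic updates; hyper-parameters and initialization here are illustrative choices):

```python
import numpy as np

def pmf(X, I, K=10, lam_u=0.1, lam_v=0.1, lr=0.005, iters=2000, seed=0):
    """Minimize 0.5*sum_ij I_ij (X_ij - U_i^T V_j)^2
                + 0.5*lam_u*||U||_F^2 + 0.5*lam_v*||V||_F^2
    over user features U (N x K) and item features V (M x K)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    U = 0.1 * rng.standard_normal((N, K))
    V = 0.1 * rng.standard_normal((M, K))
    for _ in range(iters):
        E = I * (X - U @ V.T)      # residuals on observed entries only
        dU = -E @ V + lam_u * U    # gradient of the objective w.r.t. U
        dV = -E.T @ U + lam_v * V  # gradient of the objective w.r.t. V
        U -= lr * dU
        V -= lr * dV
    return U, V
```

Predicted ratings are then `U @ V.T`; for Netflix-scale data, stochastic updates over the observed entries scale much better than these full-matrix operations.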
Probabilistic Matrix Factorization
PMF objective function
44
A trick to improve stability (restricting the range of rating values):
Map ratings to [0,1] by $t(x) = (x-1)/(D-1)$
Pass $U_i^\top V_j$ through the logistic function $g(x) = 1/(1+e^{-x})$ so predictions also lie in [0,1]
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
45
Non-negative Matrix Factorization
NMF is a popular method that is widely used in:
Image Mining
Text Mining
Metagenes Study
Collaborative Filtering
46
Non-negative Matrix Factorization
NMF fits in the low-rank matrix factorization framework with additional non-negativity constraints.
NMF factorizes a non-negative matrix A into a non-negative basis matrix W and a non-negative weight matrix H: $A \approx WH$, $W \ge 0$, $H \ge 0$
47
Interpretation with NMF
Columns of W are the underlying basis vectors, i.e., each of the M columns of A can be built from the K columns of W.
Columns of H give the weights associated with each basis vector.
W, H ≥ 0 enforces an additive, parts-based representation.
48
What is the significance of the approximation in A ≈ WH? It can be rewritten column by column as a ≈ Wh, where a and h are the corresponding columns of A and H. In other words, each data vector a is approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W can be regarded as containing a basis that is optimized for the linear approximation of the data in A. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data.
NMF in Image Mining
Additive parts-based representation
49
NMF in Image Mining
In image processing, we often assume Poisson noise.
The objective function can be changed accordingly; the non-negativity constraint is more important than the exact form of the objective function.
NMF with Poisson noise (generalized KL divergence, up to terms constant in W, H): $\min_{W,H \ge 0} \sum_{i,j}\big[(WH)_{ij} - A_{ij}\log (WH)_{ij}\big]$
NMF with Gaussian noise: $\min_{W,H \ge 0} \|A - WH\|_F^2$
50
Inference of NMF
Convex in W or H separately, but not in both.
The global minimum is generally not achievable.
A large number of unknowns: N×K for W and K×M for H (or M×K for $H^\top$)
NMF with Gaussian noise: $\min_{W,H \ge 0} \|A - WH\|_F^2$
51
Inference of NMF
Alternating gradient descent can reach a local minimum
NMF with Gaussian noise: $\min_{W,H \ge 0} \|A - WH\|_F^2$
52
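The slides use alternating gradient descent; a widely used alternative for the Gaussian-noise objective is Lee and Seung's multiplicative update rule, a rescaled gradient step that keeps W and H non-negative by construction (a sketch; dimensions, iteration count, and data are illustrative):

```python
import numpy as np

def nmf(A, K=5, iters=500, eps=1e-9, seed=0):
    """Approximate a non-negative A (N x M) as W @ H, with W (N x K) and
    H (K x M) non-negative, minimizing ||A - WH||_F^2."""
    rng = np.random.default_rng(seed)
    N, M = A.shape
    W = rng.random((N, K))
    H = rng.random((K, M))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update weights; H stays >= 0
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update basis;   W stays >= 0
    return W, H
```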
Properties of NMF
Basis vectors $W_i$ are not orthogonal
$W_k, H_k \ge 0$: entries have an immediate interpretation
Example: a large $w_{ij}$ implies that basis vector $W_i$ is mostly about term j
Example: $h_{i1}$ denotes how much sample i points in the “direction” of topic vector $W_1$
NMF is algorithm-dependent: W and H are not unique
53
Outline
Introduction
LU Decomposition
Singular Value Decomposition
Probabilistic Matrix Factorization
Non-negative Matrix Factorization
Recent Development of Matrix Factorization methods in Collaborative Filtering
54
Recent Development of MF methods in Collaborative Filtering
The basic form of matrix factorization has been extended to improve prediction accuracy
SVD++ [Yehuda Koren 2008]
RLFM [Agarwal 2009]
Etc.
55
SVD++
SVD++ is a matrix factorization model which makes use of implicit feedback.
In general, implicit feedback can refer to any kind of user history information that helps indicate users’ preferences.
56
1st Tier
The first term is the baseline rate; it takes into account the global mean and the biases of both the user and the item: $b_{ui} = \mu + b_u + b_i$.
57
2nd Tier
The second term is similar to the original SVD prediction, but it takes into account the implicit feedback present in the set of rated items N(u): $q_i^\top\big(p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j\big)$
58
3rd Tier
The third and fourth terms are the neighborhood terms. The former weights the differences between the actual ratings and the baseline estimates; the latter is the local effect of the implicit feedback.
59
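Putting the three tiers together, the integrated model of [Yehuda Koren 2008] predicts (here $R^k(i;u)$ and $N^k(i;u)$ restrict the sums to the k items most similar to item i):
$$\hat{r}_{ui} = \underbrace{\mu + b_u + b_i}_{\text{1st tier}} + \underbrace{q_i^\top\Big(p_u + |N(u)|^{-\frac{1}{2}}\sum_{j \in N(u)} y_j\Big)}_{\text{2nd tier}} + \underbrace{|R^k(i;u)|^{-\frac{1}{2}}\sum_{j \in R^k(i;u)} (r_{uj} - b_{uj})\,w_{ij} + |N^k(i;u)|^{-\frac{1}{2}}\sum_{j \in N^k(i;u)} c_{ij}}_{\text{3rd tier}}$$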
RLFM
Regression-based Latent Factor Model makes use of the side information that is available in many recommender systems
User demographic information
Properties of items (e.g. director, leading actor of a movie, genre of a movie)
60
One-slide Takeaway
Matrix Factorization is the key to recommender systems
LU-decomposition
Decompose a matrix into a lower triangular matrix and an upper triangular matrix
SVD decomposition
Decompose a matrix into $U\Sigma V^*$, where U and V are orthonormal matrices and $\Sigma$ is a diagonal matrix whose entries are called singular values
Probabilistic Matrix Factorization
Factorize a partially observed matrix into the product of two low-rank matrices; usually used in recommender systems
Non-negative Matrix Factorization
Factorize a matrix into the product of two non-negative matrices; can be used to learn the “parts”
61
References
[Salakhutdinov 2007] Salakhutdinov, R. and Mnih, A. Probabilistic Matrix Factorization. NIPS 2007.
[Yehuda Koren 2008] Koren, Y. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. KDD 2008.
[Agarwal 2009] Agarwal, D. and Chen, B.-C. Regression-based Latent Factor Models. KDD 2009.
[Chen 2011] Chen, B.-C. et al. User Reputation in a Comment Rating Environment. KDD 2011.
Emmanuel Candès. Low-rank modeling. http://videolectures.net/site/normal_dl/tag=623106/mlss2011_candes_lowrank_01.pdf
Andriy Mnih and Ruslan Salakhutdinov. Matrix factorization methods for collaborative filtering.
Barbara Ball, Atina Brooks and Amy Langville. The nonnegative matrix factorization: a tutorial.
Danny Tarlow. Netflix Prize Tribute Recommendation Algorithm in Python.
Shiu-Yuen Cheng. LU Decomposition. http://www.math.ust.hk/~macheng/math111/LU_Decomposition.pdf
62
In-class Practice 1
LU decomposition
Perform LU decomposition of the following matrix A:
63
$$A = \begin{pmatrix} 1 & -2 & 3 \\ 2 & -5 & 12 \\ 0 & 2 & -10 \end{pmatrix}$$
In-class Practice 2
PMF objective function (as in the lecture):
$$E = \frac{1}{2} \sum_{i,j} I_{ij}\big(X_{ij} - U_i^\top V_j\big)^2 + \frac{\lambda_U}{2} \sum_i \|U_i\|^2 + \frac{\lambda_V}{2} \sum_j \|V_j\|^2$$
Write out the partial derivatives of the above objective function with respect to $U_i$ and $V_j$.
We will explain how to solve the equation using the partial derivatives.
64
(Figure: taxonomy of matrix factorization methods. Fully observed matrix: SVD, LU, Eigen Decomposition. Partially observed matrix: PMF, NMF, SVD++, MMMF.)
(Figure: taxonomy of collaborative filtering. Memory-based methods: user-based, item-based. Model-based methods: pLSA and matrix factorization, which splits into the fully vs. partially observed methods above.)