Recommender Systems
COMPSCI 753 Kaiqi Zhao
Memory-based collaborative filtering
§ So far, we have learned memory-based CF
§ Use the entire user-item rating matrix
§ Find nearest neighbors based on the ratings / interpolation weights
§ Challenges:
§ Sparsity: <1% on real datasets
§ Each user often purchases only hundreds out of millions of items
§ The similarity computed from the sparse user rating vector $r_i$ results in a poor similarity measure
§ Need dimension reduction
§ Nearest neighbor is not scalable
§ Computation cost grows with the number of users and items, i.e., $O(nm)$
§ Need to store all the training transactions in memory
49
Model-based collaborative filtering
§ Learn a model to uncover user preference
[Figure: a user profile and an item profile, learned by machine learning models, are combined to predict the rating $\hat{r}_{ij}$]
§ Dimension reduction → handles sparse data
[Figure: ratings on Item 1, Item 2, Item 3 (rating representation) are mapped to a latent representation, e.g., a factor derived from Item 1 and Item 2]
50
Matrix Factorization
§ MF assumes there are latent vector representations of users and items
§ The latent vectors of users and items can reconstruct the rating matrix
$$\underbrace{R}_{\substack{\text{users} \times \text{items} \\ n \times m}} \approx \underbrace{Q}_{\substack{\text{user profiles} \\ n \times k}} \times \underbrace{P^T}_{\substack{\text{item profiles} \\ k \times m}}, \qquad k \text{ factors}$$
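As a quick illustration (not part of the original slides; all numbers are made up), the following Python sketch reconstructs a small rating matrix from hand-picked user and item profiles with $k = 2$ factors:

```python
import numpy as np

# Minimal sketch with illustrative numbers and k = 2 factors:
# reconstruct a ratings matrix from user and item profiles.
Q = np.array([[0.9, 0.1],   # user profiles, shape (n, k)
              [0.2, 1.1],
              [0.8, 0.3]])
P = np.array([[1.0, 0.2],   # item profiles, shape (m, k)
              [0.1, 0.9],
              [0.7, 0.5],
              [0.3, 1.2]])
R_hat = Q @ P.T             # approximated rating matrix, shape (n, m)
print(R_hat.round(2))
```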
51
Example
§ Netflix – (# factors = 3)
[Figure: a users × items rating matrix $R$ with many missing entries ('?') is approximated by the product of a user-factor matrix $Q$ and an item-factor matrix $P^T$: $R \approx Q \cdot P^T$. Fig from mmds.org]
52
Latent space of user and item
[Figure: movies placed in a two-dimensional latent space whose axes range from 'geared towards females' to 'geared towards males' and from 'serious' to 'funny'; examples include Ocean's 11, The Lion King, The Color Purple, Sense and Sensibility, Amadeus, Braveheart, Lethal Weapon, The Princess Diaries, Independence Day, and Dumb and Dumber. Fig from mmds.org]
53
Singular value decomposition (SVD)
$$\underbrace{R}_{n \times m} \approx \underbrace{U}_{n \times k}\ \underbrace{\Lambda}_{k \times k}\ \underbrace{V^T}_{k \times m}, \qquad
\begin{pmatrix} r_{11} & \cdots & r_{1m} \\ \vdots & \ddots & \vdots \\ r_{n1} & \cdots & r_{nm} \end{pmatrix} \approx
\begin{pmatrix} u_{11} & \cdots & u_{1k} \\ \vdots & \ddots & \vdots \\ u_{n1} & \cdots & u_{nk} \end{pmatrix}
\begin{pmatrix} \lambda_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_{kk} \end{pmatrix}
\begin{pmatrix} v_{11} & \cdots & v_{1m} \\ \vdots & \ddots & \vdots \\ v_{k1} & \cdots & v_{km} \end{pmatrix}$$
§ $R$ – rating matrix
§ $U$ – user preference on (latent) factors
§ $\Lambda$ – singular values, importance of the factors
§ $V$ – item quality on (latent) factors
$$R \approx Q P^T, \qquad Q = U, \quad P^T = \Lambda V^T$$
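A minimal Python sketch (illustrative only; the small matrix and the choice $k = 2$ are assumptions) showing the SVD of a fully observed rating matrix and the mapping $Q = U$, $P^T = \Lambda V^T$:

```python
import numpy as np

# Minimal sketch: SVD of a small, fully observed rating matrix,
# then a rank-k approximation with Q = U and P^T = Lambda V^T.
R = np.array([[5., 4., 1., 1.],
              [4., 5., 1., 2.],
              [1., 1., 5., 4.],
              [2., 1., 4., 5.]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2
Q  = U[:, :k]                            # user preferences on the k factors
Pt = np.diag(s[:k]) @ Vt[:k]             # P^T = Lambda V^T, item profiles
print(np.allclose(U.T @ U, np.eye(4)))   # U is orthogonal
print((Q @ Pt).round(2))                 # rank-k approximation of R
```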
54
Singular value decomposition (SVD)
§ Properties of SVD $R = U \Lambda V^T$:
§ $U$ and $V$ are orthogonal matrices: $U^T U = I$ and $V^T V = I$
§ Each column of $U$ is an eigenvector of $R R^T$: $R R^T = U \Lambda V^T V \Lambda U^T = U \Lambda^2 U^T$
§ Each column of $V$ is an eigenvector of $R^T R$: $R^T R = V \Lambda U^T U \Lambda V^T = V \Lambda^2 V^T$
55
Matrix factorization
§ SVD is both user-based and item-based:
$$R = U \Lambda V^T \;\Rightarrow\; \begin{cases} R R^T u = \lambda u \\ R^T R v = \lambda v \end{cases}$$
$U$ and $V$ are in a latent space that encodes similarities:
$$R R^T = \begin{pmatrix} \sum_j r_{1j} r_{1j} & \cdots & \sum_j r_{1j} r_{nj} \\ \vdots & \ddots & \vdots \\ \sum_j r_{nj} r_{1j} & \cdots & \sum_j r_{nj} r_{nj} \end{pmatrix}_{n \times n} \quad \text{(user dot-product matrix)}$$
$$R^T R = \begin{pmatrix} \sum_i r_{i1} r_{i1} & \cdots & \sum_i r_{i1} r_{im} \\ \vdots & \ddots & \vdots \\ \sum_i r_{im} r_{i1} & \cdots & \sum_i r_{im} r_{im} \end{pmatrix}_{m \times m} \quad \text{(item dot-product matrix)}$$
56
Solving SVD as Optimization
§ Minimize the Sum of Squared Errors
$$\min_{U, \Lambda, V} \sum_{i, j} \left( R_{ij} - \left( U \Lambda V^T \right)_{ij} \right)^2$$
§ Pros:
§ It directly optimizes the SSE and thus minimizes the RMSE
§ Cons:
§ SVD needs to run on all the entries!
§ Missing values will also be used to calculate the loss!
57
Latent factor model
§ A special way to solve the matrix factorization problem by counting the loss only on the available ratings!
§ In this model, we aim to predict ratings
$$\hat{r}_{ij} = q_i \cdot p_j$$
$$\min_{Q, P} \sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j \right)^2$$
§ Additional merit - it is easy to incorporate the bias terms!
$$\hat{r}_{ij} = b_g + b_i + b_j + q_i \cdot p_j$$
where $b_g$, $b_i$ and $b_j$ are the global, user and item biases
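A minimal sketch (not from the slides; the ratings, profiles and bias values are illustrative) of evaluating the biased prediction and accumulating the squared error only over the observed ratings:

```python
import numpy as np

# Minimal sketch: the loss is summed only over observed (user, item) pairs.
observed = {(0, 1): 4.0, (1, 0): 5.0, (2, 2): 2.0}   # (user, item) -> rating
n_users, n_items, k = 3, 3, 2
rng = np.random.default_rng(0)
Q = rng.normal(scale=0.1, size=(n_users, k))          # user latent vectors
P = rng.normal(scale=0.1, size=(n_items, k))          # item latent vectors
b_g, b_user, b_item = 3.5, np.zeros(n_users), np.zeros(n_items)

sse = 0.0
for (i, j), r in observed.items():
    r_hat = b_g + b_user[i] + b_item[j] + Q[i] @ P[j]  # biased prediction
    sse += (r - r_hat) ** 2
print(sse)
```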
58
Finding latent factors
§ Our goal:
$$\min_{Q, P} \sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j \right)^2$$
§ The number of factors $k$:
§ Too large? – overfitting on the limited data, i.e., the model fits perfectly to the current data, but may perform poorly on unseen data.
§ Too small? – not precise in prediction.
59
Regularization
§ Objective with Regularization:
$$\min_{Q, P} \underbrace{\sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j \right)^2}_{\text{SSE}} + \underbrace{\lambda_1 \sum_{i \in U} \lVert q_i \rVert^2 + \lambda_2 \sum_{j \in I} \lVert p_j \rVert^2}_{\text{length of the vectors}}$$
$\lambda_1$, $\lambda_2$ are the hyperparameters to balance SSE and the regularization terms
60
Stochastic Gradient Descent
§ Our loss function:
$$L = \sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j \right)^2 + \lambda_1 \sum_{i \in U} \lVert q_i \rVert^2 + \lambda_2 \sum_{j \in I} \lVert p_j \rVert^2$$
§ Stochastic Gradient Decent (SGD)
§ Initialize $P$, $Q$ and all the bias terms randomly
§ Do gradient descent (for each observed rating $r_{ij}$, at step $t$):
§ $q_{if}^{(t+1)} \leftarrow q_{if}^{(t)} - \eta \nabla_{q_{if}} L^{(t)}$, with $\nabla_{q_{if}} L^{(t)} = -2 \left( r_{ij} - q_i^{(t)} \cdot p_j^{(t)} \right) p_{jf}^{(t)} + 2 \lambda_1 q_{if}^{(t)}$
§ $p_{jf}^{(t+1)} \leftarrow p_{jf}^{(t)} - \eta \nabla_{p_{jf}} L^{(t)}$, with $\nabla_{p_{jf}} L^{(t)} = -2 \left( r_{ij} - q_i^{(t)} \cdot p_j^{(t)} \right) q_{if}^{(t)} + 2 \lambda_2 p_{jf}^{(t)}$
$q_{if}$: the $f$-th factor in the latent vector $q_i$; $\quad p_{jf}$: the $f$-th factor in the latent vector $p_j$
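A minimal Python sketch of one SGD epoch implementing the update rules above (illustrative, not the lecture's code; the function name and hyperparameter defaults are assumptions), vectorized over the $k$ factors:

```python
import numpy as np

# Minimal sketch: one SGD epoch over the observed ratings,
# updating whole factor vectors at once.
def sgd_epoch(ratings, Q, P, eta=0.01, lam1=0.1, lam2=0.1):
    for (i, j), r in ratings.items():
        err = r - Q[i] @ P[j]            # r_ij - q_i . p_j
        q_old = Q[i].copy()              # keep old q_i for p_j's update
        Q[i] += eta * (2 * err * P[j] - 2 * lam1 * Q[i])
        P[j] += eta * (2 * err * q_old - 2 * lam2 * P[j])
    return Q, P
```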
61
Incorporating bias to LFM
§ Consider predicting the rating of a user $i$ to an item $j$ with bias: $\hat{r}_{ij} = b_g + b_i + b_j + q_i \cdot p_j$
§ Our goal:
$$\min_{Q, P, b_i, b_j} \sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j - b_g - b_i - b_j \right)^2$$
62
Regularization
§ Objective with Regularization:
$$\min_{Q, P, b_i, b_j} \underbrace{\sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j - b_g - b_i - b_j \right)^2}_{\text{SSE}} + \underbrace{\lambda_1 \sum_{i \in U} \lVert q_i \rVert^2 + \lambda_2 \sum_{j \in I} \lVert p_j \rVert^2 + \lambda_3 \sum_{i \in U} b_i^2 + \lambda_4 \sum_{j \in I} b_j^2}_{\text{length of the vectors}}$$
$\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the hyperparameters to balance SSE and the regularization terms
63
Stochastic Gradient Descent
§ Our loss function:
$$L = \sum_{r_{ij} \in \mathbf{R}} \left( r_{ij} - q_i \cdot p_j - b_g - b_i - b_j \right)^2 + \lambda_1 \sum_{i \in U} \lVert q_i \rVert^2 + \lambda_2 \sum_{j \in I} \lVert p_j \rVert^2 + \lambda_3 \sum_{i \in U} b_i^2 + \lambda_4 \sum_{j \in I} b_j^2$$
§ Stochastic Gradient Descent (SGD)
§ Initialize $P$, $Q$ and all the bias terms randomly
§ Run SGD for each rating:
§ Compute the predicted rating: $\hat{r}_{ij}^{(t)} = q_i^{(t)} \cdot p_j^{(t)} + b_g + b_i^{(t)} + b_j^{(t)}$
§ $q_{if}^{(t+1)} \leftarrow q_{if}^{(t)} - \eta \nabla_{q_{if}} L^{(t)} = q_{if}^{(t)} - \eta \left[ -2 \left( r_{ij} - \hat{r}_{ij}^{(t)} \right) p_{jf}^{(t)} + 2 \lambda_1 q_{if}^{(t)} \right]$
§ $p_{jf}^{(t+1)} \leftarrow p_{jf}^{(t)} - \eta \nabla_{p_{jf}} L^{(t)} = p_{jf}^{(t)} - \eta \left[ -2 \left( r_{ij} - \hat{r}_{ij}^{(t)} \right) q_{if}^{(t)} + 2 \lambda_2 p_{jf}^{(t)} \right]$
§ $b_i^{(t+1)} \leftarrow b_i^{(t)} - \eta \nabla_{b_i} L^{(t)} = b_i^{(t)} - \eta \left[ -2 \left( r_{ij} - \hat{r}_{ij}^{(t)} \right) + 2 \lambda_3 b_i^{(t)} \right]$
§ $b_j^{(t+1)} \leftarrow b_j^{(t)} - \eta \nabla_{b_j} L^{(t)} = b_j^{(t)} - \eta \left[ -2 \left( r_{ij} - \hat{r}_{ij}^{(t)} \right) + 2 \lambda_4 b_j^{(t)} \right]$
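A minimal Python sketch of the full training loop with biases, following the updates above (illustrative, not the lecture's code; the function name, hyperparameter defaults and the mean/zero bias initialization are assumptions — the slides initialize all parameters randomly):

```python
import numpy as np

# Minimal sketch of LFM training with biases via SGD.
def train_lfm(ratings, n_users, n_items, k=2, eta=0.01,
              lams=(0.1, 0.1, 0.1, 0.1), epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.normal(scale=0.1, size=(n_users, k))       # user latent vectors
    P = rng.normal(scale=0.1, size=(n_items, k))       # item latent vectors
    b_user, b_item = np.zeros(n_users), np.zeros(n_items)
    b_g = float(np.mean(list(ratings.values())))       # global bias
    l1, l2, l3, l4 = lams
    for _ in range(epochs):
        for (i, j), r in ratings.items():
            r_hat = b_g + b_user[i] + b_item[j] + Q[i] @ P[j]
            err = r - r_hat
            q_old = Q[i].copy()
            Q[i]      += eta * (2 * err * P[j] - 2 * l1 * Q[i])
            P[j]      += eta * (2 * err * q_old - 2 * l2 * P[j])
            b_user[i] += eta * (2 * err - 2 * l3 * b_user[i])
            b_item[j] += eta * (2 * err - 2 * l4 * b_item[j])
    return Q, P, b_g, b_user, b_item
```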
64
A probabilistic view of LFM
§ Probabilistic matrix factorization (A Mnih NIPS’08)
§ Assume the latent factors and ratings are continuous random variables:
§ $\boldsymbol{q}_i \sim \mathcal{N}(\boldsymbol{q}_i \mid 0, \lambda_1^{-1} \boldsymbol{I})$
§ $\boldsymbol{p}_j \sim \mathcal{N}(\boldsymbol{p}_j \mid 0, \lambda_2^{-1} \boldsymbol{I})$
§ $r_{ij} \sim \mathcal{N}(r_{ij} \mid \boldsymbol{q}_i \cdot \boldsymbol{p}_j, \sigma^2)$
§ This is equivalent to modeling the rating as the dot product of the two latent vectors plus a Gaussian error:
$$r_{ij} = \boldsymbol{q}_i \cdot \boldsymbol{p}_j + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
[Figure: graphical model of PMF — hyperparameters $\lambda_1$, $\lambda_2$ govern the priors on $\boldsymbol{q}_i$ ($i = 1, \ldots, n$) and $\boldsymbol{p}_j$ ($j = 1, \ldots, m$), which together with the noise $\sigma$ generate the ratings $r_{ij}$]
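A minimal Python sketch (the hyperparameter values are assumed) that samples a rating matrix from this generative model:

```python
import numpy as np

# Minimal sketch: sample ratings from the PMF generative model
#   q_i ~ N(0, (1/lam1) I), p_j ~ N(0, (1/lam2) I), r_ij ~ N(q_i . p_j, sigma^2)
n, m, k = 4, 5, 2
lam1, lam2, sigma = 1.0, 1.0, 0.5
rng = np.random.default_rng(0)
Q = rng.normal(0.0, np.sqrt(1.0 / lam1), size=(n, k))   # user factors
P = rng.normal(0.0, np.sqrt(1.0 / lam2), size=(m, k))   # item factors
R = Q @ P.T + rng.normal(0.0, sigma, size=(n, m))       # noisy ratings
print(R.round(2))
```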
65
Probabilistic matrix factorization
§ The probabilistic matrix factorization (PMF):
$\boldsymbol{q}_i \sim \mathcal{N}(\boldsymbol{q}_i \mid 0, \lambda_1^{-1} \boldsymbol{I})$, $\quad \boldsymbol{p}_j \sim \mathcal{N}(\boldsymbol{p}_j \mid 0, \lambda_2^{-1} \boldsymbol{I})$, $\quad r_{ij} \sim \mathcal{N}(r_{ij} \mid \boldsymbol{q}_i \cdot \boldsymbol{p}_j, \sigma^2)$
§ Incorporating bias: $r_{ij} \sim \mathcal{N}(r_{ij} \mid \boldsymbol{q}_i \cdot \boldsymbol{p}_j + b_g + b_i + b_j, \sigma^2)$
§ Flexibility of the model:
§ Modeling different types of response, e.g., binary/counts, by replacing the Gaussian distribution of the ratings with others, e.g., categorical/Poisson distributions.
§ Putting constraints on the latent factors, e.g., non-negativity, by replacing the Gaussian of the latent factors with a rectified Gaussian.
§ But this lecture will focus only on the above formulation
66
Parameter estimation of PMF
§ Maximum A Posteriori (MAP)
§ Posterior:
$$p(Q, P \mid R) = \frac{p(Q, P, R)}{p(R)} \propto p(R \mid P, Q)\, p(P)\, p(Q)$$
with $\boldsymbol{q}_i \sim \mathcal{N}(\boldsymbol{q}_i \mid 0, \lambda_1^{-1} \boldsymbol{I})$, $\boldsymbol{p}_j \sim \mathcal{N}(\boldsymbol{p}_j \mid 0, \lambda_2^{-1} \boldsymbol{I})$, $r_{ij} \sim \mathcal{N}(r_{ij} \mid \boldsymbol{q}_i \cdot \boldsymbol{p}_j, \sigma^2)$
§ The log-posterior:
$$\log p(R \mid P, Q)\, p(P)\, p(Q) = \log p(R \mid P, Q) + \log p(P) + \log p(Q)$$
$$= \log \prod_{ij} p(r_{ij} \mid q_i, p_j) + \log \prod_i p(q_i) + \log \prod_j p(p_j)$$
$$= \sum_{ij} \log p(r_{ij} \mid q_i, p_j) + \sum_i \log p(q_i) + \sum_j \log p(p_j) \qquad \text{(log of a product = sum of logs)}$$
67
Parameter estimation of PMF
§ Maximum A Posteriori (MAP)
§ Log-posterior:
$$\mathcal{L} = \sum_{ij} \log p(r_{ij} \mid q_i, p_j) + \sum_i \log p(q_i) + \sum_j \log p(p_j)$$
with $\boldsymbol{q}_i \sim \mathcal{N}(\boldsymbol{q}_i \mid 0, \lambda_1^{-1} \boldsymbol{I})$, $\boldsymbol{p}_j \sim \mathcal{N}(\boldsymbol{p}_j \mid 0, \lambda_2^{-1} \boldsymbol{I})$, $r_{ij} \sim \mathcal{N}(r_{ij} \mid \boldsymbol{q}_i \cdot \boldsymbol{p}_j, \sigma^2)$
§ Plugging in the Gaussian densities and ignoring constants, we have
$$\mathcal{L} = \sum_{ij} -\frac{1}{2\sigma^2} \left( r_{ij} - q_i \cdot p_j \right)^2 + \sum_i -\frac{\lambda_1}{2} \lVert q_i \rVert^2 + \sum_j -\frac{\lambda_2}{2} \lVert p_j \rVert^2$$
Maximizing the above log-posterior is equivalent to minimizing the loss + regularization in LFM!
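To make this equivalence explicit (a short derivation not spelled out on the slide), multiply the negated log-posterior by $2\sigma^2$:
$$-2\sigma^2 \mathcal{L} = \sum_{ij} \left( r_{ij} - q_i \cdot p_j \right)^2 + \sigma^2 \lambda_1 \sum_i \lVert q_i \rVert^2 + \sigma^2 \lambda_2 \sum_j \lVert p_j \rVert^2$$
This is exactly the regularized LFM objective with regularization weights $\sigma^2 \lambda_1$ and $\sigma^2 \lambda_2$, so maximizing the log-posterior minimizes the LFM loss.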
68
Summary
§ Content-based methods
§ Using item attributes and text content to profile items and users
§ Collaborative filtering
§ User-based & Item-based – k-nearest neighbors
§ Interpolation method
§ Matrix factorization and Latent factor model
§ Evaluation of recommender systems
69