Recommender Systems
COMPSCI 753 Kaiqi Zhao
Roadmap
§ Recap collaborative filtering and latent factor model
§ Context-aware recommendation
§ Session-based recommendation
User-based collaborative filtering
[Figure: two users with similar taste both liked the same item; another item liked by one of them is recommended to the other.]
User-based collaborative filtering
§ Basic steps:
§ Search similar users – according to a similarity function over the historical behaviors on the items
§ Prediction – for each item adopted by similar users, estimate the rating that would be given by the target user
§ Ranking – Pick the top $N$ items and recommend them to the user
[Figure: the target user is connected to similar users by weights $w$; the ratings those users gave to Item 1 and Item 2 are combined by a weighted average, e.g., Item 1: 5, Item 2: 4.5.]
Similarity calculation
§ Let $U = \{u_{1:n}\}$ be the set of users and $P = \{p_{1:m}\}$ be the set of items, and let $r_{ij}$ be the rating from user $u_i$ to item $p_j$
§ Similarity between two users ($u_i$ and $u_k$):
§ Cosine similarity
$$w_{ik} = \frac{\sum_j r_{ij} r_{kj}}{\sqrt{\sum_j r_{ij}^2} \cdot \sqrt{\sum_j r_{kj}^2}}$$
§ Problem: treats missing ratings as "negative" (they are implicitly filled with 0)
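A minimal sketch of this computation (missing ratings filled with 0, using the rating matrix from the example on the next slide):

```python
import numpy as np

def cosine_similarity(r_i, r_k):
    """Cosine similarity between two users' rating vectors (0 = missing rating)."""
    denom = np.linalg.norm(r_i) * np.linalg.norm(r_k)
    return float(r_i @ r_k) / denom if denom > 0 else 0.0

# rows = users u1..u5, columns = items p1..p4, 0 = missing
R = np.array([[5, 0, 2, 0],
              [3, 4, 0, 0],
              [4, 0, 1, 4],
              [0, 3, 0, 2],
              [5, 0, 0, 5]], dtype=float)

print(cosine_similarity(R[0], R[1]))   # 3/sqrt(29) ≈ 0.557
```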
Example
§ Cosine similarity
Rating matrix (rows = users $u_1,\dots,u_5$; columns = items $p_1,\dots,p_4$; ? = missing):
u1:  5  ?  2  ?
u2:  3  4  ?  ?
u3:  4  ?  1  4
u4:  ?  3  ?  2
u5:  5  ?  ?  5
With missing ratings filled as 0: $\mathbf{r}_1 = (5,0,2,0)$, $\mathbf{r}_2 = (3,4,0,0)$
$$w_{12} = \frac{\mathbf{r}_1 \cdot \mathbf{r}_2}{\|\mathbf{r}_1\| \cdot \|\mathbf{r}_2\|} = \frac{5\cdot 3 + 0\cdot 4 + 2\cdot 0 + 0\cdot 0}{\sqrt{5^2+2^2} \cdot \sqrt{3^2+4^2}} = \frac{15}{5\sqrt{29}} = \frac{3}{\sqrt{29}} \approx 0.56$$
Similarity calculation
§ Let $U = \{u_{1:n}\}$ be the set of users and $P = \{p_{1:m}\}$ be the set of items, and let $r_{ij}$ be the rating from user $u_i$ to item $p_j$
§ Similarity between two users ($u_i$ and $u_k$):
§ Pearson correlation coefficient
$$w_{ik} = \frac{\sum_{j \in S_{ik}} (r_{ij} - \bar{r}_i)(r_{kj} - \bar{r}_k)}{\sqrt{\sum_{j \in S_{ik}} (r_{ij} - \bar{r}_i)^2} \cdot \sqrt{\sum_{j \in S_{ik}} (r_{kj} - \bar{r}_k)^2}}$$
§ $S_{ik}$ – the set of items rated by both users $u_i$ and $u_k$
§ $\bar{r}_i$ and $\bar{r}_k$ – the average ratings of users $u_i$ and $u_k$
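A sketch of the same computation with Pearson correlation, restricted to co-rated items (0 marks a missing rating; each user is centred by the mean of their own observed ratings):

```python
import numpy as np

def pearson_similarity(r_i, r_k):
    """Pearson correlation over the items rated by both users (0 = missing)."""
    common = (r_i > 0) & (r_k > 0)                  # S_ik: co-rated items
    if not common.any():
        return 0.0
    d_i = r_i[common] - r_i[r_i > 0].mean()         # centre by user i's mean rating
    d_k = r_k[common] - r_k[r_k > 0].mean()         # centre by user k's mean rating
    denom = np.sqrt((d_i**2).sum()) * np.sqrt((d_k**2).sum())
    return float((d_i * d_k).sum()) / denom if denom > 0 else 0.0

# On the example matrix from the previous slides, pearson_similarity(R[0], R[1]) = -1
```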
Example
§ Pearson correlation coefficient
$$w_{12} = \frac{\sum_{j \in S_{12}} (r_{1j} - \bar{r}_1)(r_{2j} - \bar{r}_2)}{\sqrt{\sum_{j \in S_{12}} (r_{1j} - \bar{r}_1)^2} \cdot \sqrt{\sum_{j \in S_{12}} (r_{2j} - \bar{r}_2)^2}}$$
Same rating matrix as before; the only item rated by both $u_1$ and $u_2$ is the 1st item, so $S_{12} = \{p_1\}$.
User means: $\bar{r}_1 = (5+2)/2 = 3.5$, $\bar{r}_2 = (3+4)/2 = 3.5$
$$w_{12} = \frac{(r_{11} - \bar{r}_1)(r_{21} - \bar{r}_2)}{|r_{11} - \bar{r}_1| \cdot |r_{21} - \bar{r}_2|} = \frac{(5-3.5)(3-3.5)}{|5-3.5| \cdot |3-3.5|} = -1$$
Prediction
§ Prediction – use the similarities to weight the ratings given to item $p_j$ by other users $u_k$:
$$\hat{r}_{ij} = \frac{\sum_{u_k \in U_j} w_{ik}\, r_{kj}}{\sum_{u_k \in U_j} w_{ik}}$$
§ $U_j$ – the set of most similar users who rated item $p_j$
§ Any problem if we use the Pearson correlation coefficient? The weights can be negative, e.g., Sim(George, Bob) = 1 but Sim(Andy, Bob) = -1, so normalize by the absolute weights instead:
$$\hat{r}_{ij} = \frac{\sum_{u_k \in U_j} w_{ik}\, r_{kj}}{\sum_{u_k \in U_j} |w_{ik}|}$$
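A sketch of the prediction step; `sims` is assumed to be a precomputed user–user similarity matrix (cosine or Pearson), and normalising by |w| keeps the prediction well behaved when Pearson weights are negative:

```python
import numpy as np

def predict_user_based(i, j, R, sims, top_n=2):
    """Predict user i's rating for item j from the most similar users who rated j."""
    raters = [k for k in range(R.shape[0]) if k != i and R[k, j] > 0]
    neighbours = sorted(raters, key=lambda k: abs(sims[i, k]), reverse=True)[:top_n]
    num = sum(sims[i, k] * R[k, j] for k in neighbours)
    den = sum(abs(sims[i, k]) for k in neighbours)
    return num / den if den > 0 else 0.0
```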
Item-based collaborative filtering
Another view:
[Figure: items liked by the same users are similar; an item similar to those the target user already liked is recommended.]
Item-based collaborative filtering
§ Basic steps:
§ Search similar items – use the historical ratings from other users
§ Prediction – Weighted average on the most similar items that have been rated by the target user.
§ Ranking – Pick the top 𝑁 items and recommend them to the user
[Figure: the target user rated Item 1: 2, Item 2: 5, Item 3: 4.5; the rating of Item 4 is predicted as a weighted sum of these, e.g., 4.2.]
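A sketch of the item-based variant; `item_sims` is assumed to be a precomputed item–item similarity matrix (e.g., cosine over the columns of R):

```python
import numpy as np

def predict_item_based(i, j, R, item_sims, top_n=2):
    """Predict user i's rating for item j from similar items that user i has rated."""
    rated = [q for q in range(R.shape[1]) if q != j and R[i, q] > 0]
    neighbours = sorted(rated, key=lambda q: abs(item_sims[j, q]), reverse=True)[:top_n]
    num = sum(item_sims[j, q] * R[i, q] for q in neighbours)
    den = sum(abs(item_sims[j, q]) for q in neighbours)
    return num / den if den > 0 else 0.0
```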
Implementation details: global and local effects
§ Global effect:
§ Mean pizza stores rating: 3.5 stars
§ Pizzahut (item) is 0.5 stars above avg.
§ Kate (user) rates 0.2 stars below avg. ⇒ Kate may rate Pizzahut 3.8 stars
§ Local neighborhood (CF):
§ Other users who share similar interests with Kate gave low ratings to Pizzahut
⇒ Kate may rate Pizzahut 3.6 stars
Modeling Bias
§ Use the similarities to weight the bias-corrected ratings of item $p_j$ from other users $u_k$:
$$\hat{r}_{ij} = b_{ij} + \frac{\sum_{u_k \in U_j} w_{ik}\,(r_{kj} - b_{kj})}{\sum_{u_k \in U_j} |w_{ik}|}, \qquad b_{ij} = b_g + b_i + b_j$$
§ $b_g$ – average rating over all transactions
§ $b_i$ – average rating of user $i$, minus $b_g$
§ $b_j$ – average rating of item $j$, minus $b_g$
§ Global effect:
§ Mean pizza stores rating: 3.5 stars
§ Pizzahut (item) is 0.5 stars above avg.
§ Kate (user) rates 0.2 stars below avg.
⇒ Kate may rate Pizzahut 3.8 stars
§ Local neighborhood (CF):
§ Other users who share similar interests with Kate gave low ratings to Pizzahut
⇒ Kate may rate Pizzahut 3.6 stars
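A sketch of the bias-corrected prediction; the baselines are estimated from the observed (non-zero) entries of R, and `sims` is again an assumed precomputed user–user similarity matrix:

```python
import numpy as np

def baseline(i, j, R):
    """b_ij = b_g + b_i + b_j, estimated from the observed (non-zero) ratings."""
    mu = R[R > 0].mean()                                    # global average b_g
    b_i = R[i, R[i] > 0].mean() - mu if (R[i] > 0).any() else 0.0
    b_j = R[R[:, j] > 0, j].mean() - mu if (R[:, j] > 0).any() else 0.0
    return mu + b_i + b_j

def predict_with_bias(i, j, R, sims, top_n=2):
    """Neighbourhood prediction on the bias-corrected (residual) ratings."""
    raters = [k for k in range(R.shape[0]) if k != i and R[k, j] > 0]
    neighbours = sorted(raters, key=lambda k: abs(sims[i, k]), reverse=True)[:top_n]
    num = sum(sims[i, k] * (R[k, j] - baseline(k, j, R)) for k in neighbours)
    den = sum(abs(sims[i, k]) for k in neighbours)
    return baseline(i, j, R) + (num / den if den > 0 else 0.0)
```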
Pros and cons – collaborative filtering
§ Pros
(+) Doesn't need extra information about the items
(+) Leverages the information from other users/items
§ Cons
(−) Cold-start problem
§ New users – a user with no/few ratings; many rating records are needed to construct an accurate user profile
§ New items – an item with no/few ratings will not be recommended until it has been rated several times by users
(−) Biased towards popular items
Memory-based collaborative filtering
§ Memory-based CF
§ Use the entire user-item rating matrix
§ Find nearest neighbors based on the ratings
§ Challenges:
§ Sparsity: <1% on real datasets
§ Each user typically purchases only hundreds out of millions of items
§ The similarity computed from the sparse user rating vector $r_i$ is a poor similarity measure
§ Need dimension reduction
§ Nearest neighbor is not scalable
§ Computation cost grows with the number of users and items, i.e., $O(nm)$
§ Need to store all the training transactions in memory
Model-based collaborative filtering
§ Learn a model to uncover user preference
[Figure: a user profile and an item profile are combined to produce the predicted rating $\hat{r}_{ij}$.]
§ Learn user and item profiles using machine learning models
§ Dimension reduction → handles sparse data
[Figure: a user's rating representation over Item 1, Item 2, Item 3 is mapped to a dense latent representation, e.g., a factor derived from Item 1 and Item 2.]
Matrix Factorization
§ MF assumes there are latent vector representation of users and items. § The latent vector of users and items can reconstruct the rating matrix
$$\boldsymbol{R} \approx \boldsymbol{Q}\,\boldsymbol{P}^T$$
[Figure: the users × items rating matrix $R$ is approximated by the product of the user-profile matrix $Q$ ($n \times k$) and the item-profile matrix $P^T$ ($k \times m$), with $k$ latent factors.]
Latent space of user and item
[Figure (from mmds.org): movies placed in a 2-D latent space, one axis running from "geared towards females" to "geared towards males" and the other from "serious" to "funny". Example movies include The Color Purple, Sense and Sensibility, Ocean's 11, The Lion King, The Princess Diaries, Amadeus, Braveheart, Lethal Weapon, Independence Day, and Dumb and Dumber.]
Singular value decomposition (SVD)
$$\underbrace{R}_{n \times m} \approx \underbrace{U}_{n \times k}\, \underbrace{S}_{k \times k}\, \underbrace{V^T}_{k \times m}$$
§ $R$ – rating matrix
§ $U$ – user preference on (latent) factors
§ $S$ – singular values, importance of the factors
§ $V$ – item quality on (latent) factors
$$R \approx Q P^T, \quad Q = U, \quad P^T = S V^T$$
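A small sketch of the decomposition with NumPy, keeping k = 2 factors; note that it fills missing ratings with 0, which is exactly the weakness discussed on the next slide:

```python
import numpy as np

R = np.array([[5, 0, 2, 0],
              [3, 4, 0, 0],
              [4, 0, 1, 4],
              [0, 3, 0, 2],
              [5, 0, 0, 5]], dtype=float)   # 0 = missing

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
Q  = U[:, :k]                        # user profiles, n x k  (Q = U)
Pt = np.diag(s[:k]) @ Vt[:k, :]      # item profiles, k x m  (P^T = S V^T)
R_hat = Q @ Pt                       # rank-k approximation of R
```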
Solving SVD as Optimization
§ Minimize the Sum of Squared Errors
$$\min_{U,S,V} \sum_{i \in U,\, j \in P} \left( R_{ij} - (U S V^T)_{ij} \right)^2$$
§ Pros:
§ It directly optimizes the SSE and thus minimizes the RMSE
§ Cons:
§ SVD needs to run on all the entries!
§ Missing values will also be used to calculate the loss!
Latent factor model
§ A special way to solve the matrix factorization problem by only counting the loss on available ratings!
§ In this model, we aim to predict ratings as
$$\hat{r}_{ij} = q_i \cdot p_j$$
and learn the factors by minimizing the loss over the available ratings only:
$$\min_{Q,P} \sum_{r_{ij} \in \boldsymbol{R}} \left( r_{ij} - q_i \cdot p_j \right)^2$$
§ Additional merit – it is easy to incorporate the bias terms!
$$\hat{r}_{ij} = b_g + b_i + b_j + q_i \cdot p_j$$
Regularization
§ Objective with Regularization:
$$\min_{Q,P} \underbrace{\sum_{r_{ij} \in \boldsymbol{R}} \left( r_{ij} - q_i \cdot p_j \right)^2}_{\text{SSE}} + \underbrace{\lambda_1 \sum_{i \in U} \|q_i\|^2 + \lambda_2 \sum_{j \in P} \|p_j\|^2}_{\text{length of the vectors}}$$
§ $\lambda_1, \lambda_2$ are hyperparameters that balance the SSE and the regularization terms
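A minimal SGD sketch of this objective, looping only over the observed ratings (for brevity a single λ is used for both regularizers, and constant factors are folded into the learning rate):

```python
import numpy as np

def train_latent_factors(R, k=2, lr=0.01, lam=0.1, epochs=200, seed=0):
    """min sum_{observed} (r_ij - q_i.p_j)^2 + lam*(||q_i||^2 + ||p_j||^2) via SGD."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    Q = 0.1 * rng.standard_normal((n, k))      # user factors
    P = 0.1 * rng.standard_normal((m, k))      # item factors
    obs = [(i, j, R[i, j]) for i in range(n) for j in range(m) if R[i, j] > 0]
    for _ in range(epochs):
        rng.shuffle(obs)
        for i, j, r in obs:
            q_i, p_j = Q[i].copy(), P[j].copy()
            err = r - q_i @ p_j                # prediction error on one observed rating
            Q[i] += lr * (err * p_j - lam * q_i)
            P[j] += lr * (err * q_i - lam * p_j)
    return Q, P
```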
Roadmap
§ Recap collaborative filtering and latent factor model
§ Context-aware recommendation
§ Session-based recommendation
Context-aware recommendation
§ Traditional model
§ User × Item → Ratings
§ It does not consider the context of the user (e.g., where, when and what?)
§ Context is important for making personalized recommendations!
[Figure: the same purchase can mean different things depending on the context, e.g., time (Feb 14), location, and purpose (a book for my kid, or a book for me?).]
Context as filters
§ Candidate items that are not within the context can be filtered
[Figure: candidate items are filtered using contextual attributes such as location and season/festival tags.]
Context-aware recommendation
§ User × Item × Context → Ratings
§ Three paradigms for incorporating context information (Adomavicius et al., 2008)
§ Contextual prefiltering
§ Select relevant data using context
§ Contextual postfiltering
§ Select recommendation results using context
§ Contextual modeling
§ Incorporating context in learning user preferences
Three paradigms for incorporating context information
Adomavicius, et al. 2008
Challenges
§ Contextual prefiltering
§ Over-specification – the prefiltering makes the data even more sparse!
§ E.g., a user has only 10 ratings on restaurants in total, with just 1 in the city centre
§ Contextual postfiltering
§ Based on context similarity.
§ E.g., Tom only watches comedies on weekends → filter out movies that are not comedies when recommending on weekends
§ It is not efficient for online recommendation
§ It does not uncover the user’s preferences within different contexts
We will focus on contextual modeling!
Modeling the context with CF
§ User-Social-Geo (USG) model (Ye, SIGIR’11) – recommending locations
The USG model
§ Three types of information
§ User preference (modeled by collaborative filtering)
§ Social influence from friends (social context)
§ Geographical influence (geo context)
Social influence from friends – a check-in $c_{kl} \in \{0,1\}$ records whether user $u_k$ visited location $l$:
$$\hat{c}^{(s)}_{il} = \frac{\sum_{u_k \in F_i} SI_{ik}\, c_{kl}}{\sum_{u_k \in F_i} SI_{ik}}$$
§ Social influence of user $u_k$ on user $u_i$: $SI_{ik} = \dfrac{|F_i \cap F_k|}{|F_i \cup F_k|}$, where $F_i$ is the friend set of $u_i$
§ Friends with a closer social tie are trusted more
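A sketch of the social estimator with hypothetical data structures: `friends[u]` is the friend set F_u and `checkins[u]` maps a location to a 0/1 check-in:

```python
def social_influence(F_i, F_k):
    """SI_ik = |F_i ∩ F_k| / |F_i ∪ F_k| (Jaccard similarity of friend sets)."""
    union = F_i | F_k
    return len(F_i & F_k) / len(union) if union else 0.0

def social_score(i, loc, friends, checkins):
    """c_hat^(s)_il: SI-weighted average of friends' check-ins at location loc."""
    weights = {k: social_influence(friends[i], friends[k]) for k in friends[i]}
    den = sum(weights.values())
    num = sum(w * checkins[k].get(loc, 0) for k, w in weights.items())
    return num / den if den > 0 else 0.0
```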
The USG model
§ Three types of information
§ User preference (modeled by collaborative filtering)
§ Social influence from friends (social context)
§ Geographical influence (geo context)
Geographical influence
The USG model
§ Three types of information
§ User preference (modeled by collaborative filtering)
§ Social influence from friends (social context)
§ Geographical influence (geo context)
Geographical influence – the probability that user $u$ visits location $l$, given the set $L_u$ of locations the user has already visited:
$$p(l \mid L_u) = \frac{p(l \cup L_u)}{p(L_u)} = \prod_{l_n \in L_u} p(l, l_n), \qquad p(L_u) = \prod_{l_n \ne l_m \in L_u} p(l_n, l_m)$$
The probability $p(l_n, l_m)$ that a user visits both locations follows a power-law distribution in their distance.
The USG model
§ Three types of information
§ User preference (modeled by collaborative filtering)
§ Social influence from friends (social context)
§ Geographical influence (geo context)
Geographical influence – the probability that a user visits both locations follows a power law of their distance:
$$p(l_n, l_m) = y = a \cdot x^b, \qquad x = dist(l_n, l_m)$$
Taking logs gives $\log y = \log a + b \log x$, so fitting $a$ and $b$ is a linear regression problem:
$$L = \frac{1}{2} \sum_i \left( \log a + b \log x_i - \log y_i \right)^2$$
Example:
x = dist(l_n, l_m):          1     2     3      4
y = prob. of visiting both:  1/2   1/8   1/18   1/32
Fitted parameters: $a = 1/2$, $b = -2$
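A quick check of the fit in log space with NumPy, using the numbers from the slide:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # dist(l_n, l_m)
y = np.array([1/2, 1/8, 1/18, 1/32])        # prob. of visiting both locations

# log y = log a + b log x  ->  ordinary least squares in log space
b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
print(np.exp(log_a), b)                     # ≈ 0.5 and -2, i.e. p = 0.5 * x^(-2)
```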
The USG model
§ Three types of information
§ User preference (modeled by collaborative filtering)
§ Social influence from friends (social context)
§ Geographical influence (geo context)
§ Linear fusion of the three types of information:
$$\hat{c}_{il} = (1 - \alpha - \beta) \cdot \hat{c}^{(u)}_{il} + \alpha \cdot \hat{c}^{(s)}_{il} + \beta \cdot \hat{c}^{(g)}_{il}$$
where $\hat{c}^{(u)}_{il} = \dfrac{\sum_{u_k \in N_i} w_{ik} c_{kl}}{\sum_{u_k \in N_i} w_{ik}}$ (user preference, CF), $\hat{c}^{(s)}_{il} = \dfrac{\sum_{u_k \in F_i} SI_{ik} c_{kl}}{\sum_{u_k \in F_i} SI_{ik}}$ (social influence), and $\hat{c}^{(g)}_{il} = p(l \mid L_i)$ (geographical influence, with $L_i$ the locations $u_i$ has visited).
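The fusion itself is just a weighted sum of the three pre-computed estimators; a one-line sketch (α and β are hyperparameters tuned on held-out check-ins):

```python
def usg_score(c_user, c_social, c_geo, alpha=0.2, beta=0.2):
    """Linear fusion of the CF, social, and geographical check-in estimators."""
    return (1 - alpha - beta) * c_user + alpha * c_social + beta * c_geo
```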
The USG model
§ Pros
§ Very flexible to incorporate different context information, by adding new estimators of the check-ins
§ Can use different approaches to model very different contextual information (consider the geo influence model)
§ Cons
§ Cannot capture interdependencies between contexts, because each context is modeled independently
Tensor decomposition
§ SVD
§ User × Item → Ratings
§ $\hat{r}_{ij} = U_i S V_j^T = \sum_f s_{ff}\, u_{if}\, v_{jf}$
[Figure: the rating matrix (users × items) factorized into user profiles ($n \times k$), a diagonal matrix of singular values ($k \times k$), and item profiles ($k \times m$).]
§ Tensor decomposition
§ User × Item × Context → Ratings
§ Tucker decomposition with factor matrices $U$ ($n \times k$), $V$ ($m \times k$), $C$ ($p \times k$) and core tensor $S$ ($k \times k \times k$); with a superdiagonal core the prediction is
$$\hat{r}_{ijc} = S \times_1 U \times_2 V \times_3 C = \sum_{f=1}^{k} s_f\, u_{if}\, v_{jf}\, c_{cf}$$
Terminology
§ A tensor is an N-way array
§ E.g. a matrix is a 2-way array
§ A 3-way tensor has three modes – columns, rows, and tubes
$\boldsymbol{X} \in \mathbb{R}^{I \times J \times K}$
Kolda & Bader 2009
n-mode matrix product
§ Let $\boldsymbol{X}$ be an N-way tensor of size $I_1 \times I_2 \times \dots \times I_n \times \dots \times I_N$, and let $\boldsymbol{U}$ be a matrix of size $J \times I_n$. The n-mode matrix product is
$$(\boldsymbol{X} \times_n \boldsymbol{U})_{i_1, \dots, i_{n-1},\, j,\, i_{n+1}, \dots, i_N} = \boldsymbol{X}_{i_1, \dots, i_{n-1},\, :,\, i_{n+1}, \dots, i_N} \cdot \boldsymbol{U}_{j,:} = \sum_{i_n=1}^{I_n} x_{i_1, i_2, \dots, i_n, \dots, i_N} \cdot u_{j, i_n}$$
i.e., each mode-$n$ fibre of $\boldsymbol{X}$ is replaced by its dot products with the rows of $\boldsymbol{U}$.
§ Example: 1-mode matrix product
$$\boldsymbol{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}, \quad \boldsymbol{U} \in \mathbb{R}^{J \times I_1} \;\Rightarrow\; \boldsymbol{X} \times_1 \boldsymbol{U} \in \mathbb{R}^{J \times I_2 \times I_3}$$
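A sketch of the n-mode product with NumPy (zero-based mode index): move mode n to the front, contract it with the columns of U, and move the new axis back:

```python
import numpy as np

def mode_n_product(X, U, n):
    """(X x_n U): contract mode n of tensor X (size I_n) with matrix U of size J x I_n."""
    Xn = np.moveaxis(X, n, 0)                    # bring mode n to the front: I_n x ...
    Yn = np.tensordot(U, Xn, axes=([1], [0]))    # J x ...
    return np.moveaxis(Yn, 0, n)                 # put the new mode (size J) back at position n

X = np.random.rand(3, 4, 5)     # I1 x I2 x I3
U = np.random.rand(2, 3)        # J x I1
print(mode_n_product(X, U, 0).shape)   # (2, 4, 5), i.e. J x I2 x I3
```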
Tensor decomposition
§ Example
$$\hat{r}_{ijc} = S \times_1 U \times_2 V \times_3 C = \sum_{f=1}^{k} s_f\, u_{if}\, v_{jf}\, c_{cf}, \qquad S = I_{k \times k \times k}$$
[Figure, spanning several slides: a numerical walk-through on a small user × item × context rating tensor $R$ with example factor matrices $U$, $V$, $C$ and a superdiagonal core $S$. The mode products are applied one at a time: $S \times_1 U$ picks out a user's latent factors, applying $\times_2 V$ then mixes in the item factors, and finally applying $\times_3 C$ mixes in the context factors, yielding the predicted rating $\hat{r}_{ijc}$ for a missing entry of $R$.]
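With a superdiagonal core, the prediction above reduces to an element-wise product of the three factor rows; a small sketch with hypothetical factor matrices U (users × k), V (items × k), C (contexts × k):

```python
import numpy as np

def predict_rating(i, j, c, U, V, C, s=None):
    """r_hat_ijc = sum_f s_f * u_if * v_jf * c_cf  (s_f = 1 for the identity core)."""
    s = np.ones(U.shape[1]) if s is None else s
    return float(np.sum(s * U[i] * V[j] * C[c]))
```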
Factorization machine
§ Tensor decomposition
§ Models context, user and item in a single principled way
§ Similar training methods to matrix factorization
§ However, it does not scale well w.r.t. the context!
§ E.g., 1,000 regions and 48 hours → 48,000 different contexts
§ Not efficient – tensor multiplication is $O(mnpk)$
§ Factorization machine (Rendle et al., 2010)
§ Prediction: $U \times I \times C_1 \times \dots \times C_L \to R$
§ Idea:
§ Represent user, item and contexts as features
§ Capture pairwise interactions of features
Factorization machine
§ FM treats all inputs (including user, item and context) in the same way
§ Each record from the input can be represented as $[\,\text{user};\ \text{item};\ c_1;\ c_2;\ \dots;\ c_L\,]$
§ The records can be represented by real values, e.g.,
§ Discrete input: user or item → [0 1 0 0]
§ Context as a set, e.g., the user group → [0 0.5 0.5 0]
§ Context with a real value, e.g., humidity, distance
§ $[\,\text{user};\ \text{item};\ c_1;\ c_2;\ \dots;\ c_L\,]$ → input feature vector $(x_1, x_2, \dots, x_d)$
§ Prediction of the target $y$ (e.g., ratings):
$$\hat{y} = \underbrace{w_0}_{\text{global bias}} + \underbrace{\sum_{i=1}^{d} w_i x_i}_{\text{bias of features}} + \underbrace{\sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle\, x_i x_j}_{\text{factorization of pairwise interaction}}, \qquad \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle = \sum_{f=1}^{k} v_{if} v_{jf}$$
Example
§ Movie data – users can view a movie as a group
Data format: (User, Movie, Group, Rating)
Encodings:
§ User: A → [1, 0, 0], B → [0, 1, 0], C → [0, 0, 1]
§ Movie: TI → [1, 0, 0, 0], NH → [0, 1, 0, 0], SW → [0, 0, 1, 0], ST → [0, 0, 0, 1]
§ Group: {C} → [0, 0, 1], {B, C} → [0, 1, 1]
Example record: $x^{(1)} = [1, 0, 0 \mid 1, 0, 0, 0 \mid 0, 0, 1]$, $y^{(1)} = 5$ (user A watched movie TI with group {C} and rated it 5)
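A sketch of the feature construction for this example (the group block could also be normalised to sum to 1, as on the previous slide):

```python
import numpy as np

users  = ["A", "B", "C"]
movies = ["TI", "NH", "SW", "ST"]

def encode(user, movie, group):
    """Concatenate one-hot user, one-hot movie, and a multi-hot group block."""
    x_user  = np.array([u == user for u in users], dtype=float)
    x_movie = np.array([m == movie for m in movies], dtype=float)
    x_group = np.array([u in group for u in users], dtype=float)
    return np.concatenate([x_user, x_movie, x_group])

print(encode("A", "TI", {"C"}))   # [1 0 0 | 1 0 0 0 | 0 0 1], with y = 5 in the slide
```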
Example
§ Construct the feature vector (figure from Rendle et al., 2010) – more contexts can easily be inserted as columns!
§ Prediction:
$$\hat{y} = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle\, x_i x_j$$
Factorization machine
§ Higher-order tensor → pairs of features
§ Computation cost:
§ Tensor decomposition: $O(mn|c_1||c_2|\dots|c_L|\,k)$
§ Original form of FM: $\hat{y} = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle x_i x_j$ → $O(d^2 k)$
§ Equivalent form (Rendle et al., 2010):
$$\hat{y} = w_0 + \sum_{i=1}^{d} w_i x_i + \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{d} v_{if} x_i \right)^2 - \sum_{i=1}^{d} v_{if}^2 x_i^2 \right] \quad \Rightarrow \quad O(dk)$$
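A sketch of the O(dk) prediction rule: x is the feature vector, w0 the global bias, w the per-feature weights, and V the d × k matrix of latent feature vectors:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction in O(dk): w0 + w.x + 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    Vx = V.T @ x                                   # k-vector: sum_i v_if * x_i
    pairwise = 0.5 * np.sum(Vx**2 - (V**2).T @ (x**2))
    return w0 + w @ x + pairwise
```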
Factorization machine
§ Parameters to learn:
§ Global bias: $w_0$
§ Weights: $w_{1:d}$
§ Latent vector of each feature: $\boldsymbol{v}_{1:d}$
$$\hat{y} = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle\, x_i x_j$$
§ Learning the parameters – Stochastic Gradient Descent (SGD)
§ Square loss w.r.t. a single record $(x, y)$:
$$L_{x,y} = \left( y - w_0 - \sum_{i=1}^{d} w_i x_i - \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \boldsymbol{v}_i, \boldsymbol{v}_j \rangle x_i x_j \right)^2$$
§ Gradients:
$$\frac{\partial L_{x,y}}{\partial w_0} = -2\,(y - \hat{y}), \qquad \frac{\partial L_{x,y}}{\partial w_i} = -2\,(y - \hat{y})\, x_i, \qquad \frac{\partial L_{x,y}}{\partial v_{if}} = -2\,(y - \hat{y})\, x_i \sum_{j \ne i} v_{jf}\, x_j$$
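A sketch of one SGD update for a single record (x, y), reusing `fm_predict` from the previous sketch (constant step size, no regularisation):

```python
import numpy as np

def fm_sgd_step(x, y, w0, w, V, lr=0.01):
    """One SGD step on the squared loss L = (y - y_hat)^2."""
    err = y - fm_predict(x, w0, w, V)                        # (y - y_hat)
    w0 = w0 + lr * 2 * err                                   # -dL/dw0  = 2 (y - y_hat)
    w  = w  + lr * 2 * err * x                               # -dL/dw_i = 2 (y - y_hat) x_i
    Vx = V.T @ x
    # -dL/dv_if = 2 (y - y_hat) * x_i * (sum_j v_jf x_j - v_if x_i)
    V  = V  + lr * 2 * err * (x[:, None] * Vx[None, :] - (x**2)[:, None] * V)
    return w0, w, V
```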