
COMP20008 Elements of Data Processing
Semester 1 2020 Recommender Systems

Plan today
• Why recommender systems?
• Recommender systems and collaborative filtering
– Popularity based
– Collaborative filtering – memory based
• Item-Item
• User-User
• Friday’s lecture
– Content-based and model-based collaborative filtering
– System evaluation

Why recommender systems?
• Scarcity to Abundance
• The Internet changed shopping behaviours
• Online business is heavily dependent on recommender systems.
[Figure: the long-tail distribution – popularity vs. items sorted by popularity]

Why recommender systems?
• The Long Tail by Chris Anderson: “In 1988, a British mountain climber named Joe Simpson wrote a book called ‘Touching the Void’, a harrowing account of near death in the Peruvian Andes. It got good reviews but, only a modest success, it was soon forgotten. Then, a decade later, a strange thing happened. Jon Krakauer wrote ‘Into Thin Air’, another book about a mountain-climbing tragedy, which became a publishing sensation. Suddenly Touching the Void started to sell again”.
• “A lot of times, people don’t know what they want until you show it to them” – Steve Jobs

Recommender systems – examples
• LinkedIn
• Facebook
• Twitter
• Youtube
• Netflix
• Amazon
[Screenshots: Amazon’s “Customers Who Bought This Item Also Bought” panel for The Martian (showing The Revenant, The Life We Bury, Ready Player One, …) and the Netflix Kids home page (“Recently watched”, “Popular”, “Top Picks for Kids”).]
Recommender systems
• “75% of what people watch is from some sort of recommendation” (Netflix)
• “If I have 3 million customers on the web, I should have 3 million stores on the web.” (Amazon CEO)
• Movie recommender systems:
– find the best-matched movies,
– reduce search time and frustration.
Users    Titanic  Batman  Inception  Superman  The Martian  Jurassic World
Harry    3        2       –          –         1            –
Ming     –        –       1          2         –            –
Peter    1        –       –          3         2            1
Recommender systems – How it works
• An online system where many users interact with many items.
• Each user has a profile
• Users rate items
– Explicitly: by giving a score
– Implicitly: via web usage mining – time spent viewing the item, etc.
• The system does the rest. How?

Popularity based recommendation
• Show the most popular items.
• Which items are popular? (e.g. most viewed, most purchased)
• Simple, but not personalised.
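As a minimal sketch (the interaction log and the `top_n_popular` helper are made up for illustration), popularity-based recommendation just counts interactions per item and shows every user the same top-N list:

```python
from collections import Counter

# Hypothetical interaction log: (user, item) pairs.
interactions = [
    ("Harry", "Titanic"), ("Harry", "Batman"),
    ("Ming", "Inception"), ("Ming", "Superman"),
    ("Peter", "Titanic"), ("Peter", "Superman"), ("Peter", "The Martian"),
]

def top_n_popular(interactions, n=2):
    """Count interactions per item; every user gets the same top-n list."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]

print(top_n_popular(interactions))  # → ['Titanic', 'Superman']
```

Note that the output does not depend on which user is asking – exactly the “not personalised” limitation above.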

Collaborative filtering
• Collaborative filtering: making predictions about a user’s missing data according to the collective behaviour of many other users
– Look at users’ collective behaviour (e.g. ratings)
– Look at the active user’s history
– Combine!
• Item-based collaborative filtering (Item-Item)
• User-based collaborative filtering (User-User)

Collaborative filtering – A framework

[Figure: an m × n rating matrix R; rows u1 … um are users, columns i1 … in are items; some cells hold known ratings (e.g. 3, 1, 4, 2, 5), and the cell r_ij is unknown.]

• Given
– A set of m users U and a set of n items I
– An m × n interaction matrix (rating matrix) R
• Find the unknown ratings r_ij
– i.e. learn a function f: U × I → R

Item-based method: Intuition
People like things similar to other things they like
• Search for similarities among items
– Many users like both Batman and Superman → the two movies are similar.
– Similarity is collective similarity in ratings by many users.
• Recommend items similar to those rated by the target user.
– Superman and Batman are similar; if Peter liked Batman, then recommend Superman to Peter.

Item based collaborative filtering
Three questions to address:
• How to measure item similarities?
• How to find similar items?
• How to combine ratings of these items?

Q1: Measure item-similarity
Example: similarity between item i_i and item i_j, rated by seven users.
• Euclidean distance with mean imputation – missing ratings are imputed with that item’s mean:

      i_i (raw)  i_i (imputed)  i_j (raw)  i_j (imputed)
u1    3          3              –          3.75
u2    –          2.7            3.5        3.5
u3    –          2.7            3          3
u4    3          3              3.5        3.5
u5    2          2              4          4
u6    –          2.7            4          4
u7    –          2.7            4.5        4.5

• mean(i_i) = (3 + 3 + 2) / 3 ≈ 2.7;  mean(i_j) = (3.5 + 3 + 3.5 + 4 + 4 + 4.5) / 6 = 3.75
• Similarity score based on Euclidean distance:
sim(i_i, i_j) = 1 / (1 + d(i_i, i_j)),  where  d(i_i, i_j) = √(Σ_{k=1}^{m} (r_{k,i} − r_{k,j})²)
• d(i_i, i_j) = √((3 − 3.75)² + (2.7 − 3.5)² + (2.7 − 3)² + (3 − 3.5)² + (2 − 4)² + (2.7 − 4)² + (2.7 − 4.5)²) = 3.24
• sim(i_i, i_j) = 1 / (1 + 3.24) = 0.24
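This similarity computation can be sketched as a short Python function. The function name is illustrative, and the imputed mean is rounded to one decimal place to mirror the slides’ arithmetic (e.g. 8/3 ≈ 2.7):

```python
import math

def sim_with_mean_imputation(a, b):
    """Euclidean-distance similarity between two items' rating columns.

    a, b: lists of ratings, with None marking a missing rating.
    Each missing entry is imputed with that item's own mean rating.
    """
    def impute(col):
        known = [r for r in col if r is not None]
        mean = round(sum(known) / len(known), 1)  # slides round to one decimal
        return [mean if r is None else r for r in col]
    a, b = impute(a), impute(b)
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + d)

# The worked example above: two items rated by seven users.
i_i = [3, None, None, 3, 2, None, None]   # mean = (3 + 3 + 2) / 3 ≈ 2.7
i_j = [None, 3.5, 3, 3.5, 4, 4, 4.5]      # mean = 22.5 / 6 = 3.75
print(round(sim_with_mean_imputation(i_i, i_j), 2))  # → 0.24
```

Because of the one-decimal rounding of the means, d comes out as ≈3.25 rather than the slides’ 3.24, but the similarity still rounds to 0.24.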

Q2: How to find similar items?
• We have an answer to Q1: for item x1, we have the similarities between it and the other items:

sim(x1, x2)   sim(x1, x3)   sim(x1, x4)   sim(x1, x5)   sim(x1, x6)
0.48          0.4           0.20          0.33          0.35

• The target user a has rated some items:

     x1   x2   x3   x4   x5   x6
a    ?    4    –    5    3    3

• Choose a number k; find the k most similar items to x1 among the items rated by user a.
• Let k = 3. Which 3 items?
– Items x2, x6, and x5 (x3 is more similar than x6 or x5, but user a has not rated it)
– scores 0.48, 0.35, and 0.33.

Q3: How to combine ratings of similar items?
• Predict the rating of item x1 for user a.
• From Q1 and Q2, we get:
– For user a, the 3 (k = 3) most similar items to x1: x2, x6, x5

sim(x1, x2)   sim(x1, x3)   sim(x1, x4)   sim(x1, x5)   sim(x1, x6)
0.48          0.4           0.20          0.33          0.35

– The ratings of these 3 items by user a: 4, 3, 3

     x1   x2   x3   x4   x5   x6
a    ?    4    –    5    3    3

• Rating = weighted average over the ratings of the 3 most similar items
– r_a,x1 = (0.48 × 4 + 0.33 × 3 + 0.35 × 3) / (0.48 + 0.33 + 0.35) = 3.41

Item-based Collaborative Filtering – Algorithm
• Phase 1 – For each item j,
– compute similarities between j and the other items
(similarity: e.g. Euclidean distance with mean imputation).
– Batch, off-line.
• Phase 2 – Predict the rating of item j by user a based on the k most similar items (among the items rated by a).
– Predicted rating = weighted average over the ratings of the k most similar items:

r_aj = Σ_{i ∈ k-similar-items} sim(i, j) × r_ai / Σ_{i ∈ k-similar-items} sim(i, j)

– Online.

Item-based filtering – Practice example
• Predict r_aj (a = Tim; j = Inception)

Users     Titanic  Batman  Inception  Superman  The Martian  Jurassic World
Michelle  2.5      –       3          3.5       2.5          3
Tom       3        3.5     –          5         3            3.5
Lao       2.5      3       –          3.5       –            4
Chan      –        3.5     3          4         2.5          –
Mary      –        4       2          3         2            3
Tim       3        4       r_aj?      5         3.5          3
John      –        4.5     –          4         1            –

Phase 1 – offline: similarities between Inception and the other movies

sim(Inception, Titanic)         0.48 (d = 1.08)
sim(Inception, Batman)          0.24 (d = 3.24)
sim(Inception, Superman)        0.20 (d = 3.89)
sim(Inception, The Martian)     0.33 (d = 2.05)
sim(Inception, Jurassic World)  0.34 (d = 1.97)

Item-based filtering – Practice example cont.
• Phase 2 – online:
– Select the 3 most similar items (k = 3) w.r.t. (Tim, Inception):

sim(Inception, Titanic)         0.48
sim(Inception, Batman)          0.24
sim(Inception, Superman)        0.20
sim(Inception, The Martian)     0.33
sim(Inception, Jurassic World)  0.34

– Tim has rated: Titanic (3), Batman (4), Superman (5), The Martian (3.5), Jurassic World (3).
– The 3 most similar of these are Titanic (0.48), Jurassic World (0.34), and The Martian (0.33).
– Weighted average over the ratings of the 3 most similar items:
– r_aj = (0.48 × 3 + 0.33 × 3.5 + 0.34 × 3) / (0.48 + 0.33 + 0.34) = 3.14
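The two phases can be put together in a short Python sketch over the ratings from the practice example. The helper names are illustrative; as in the slides, imputed means are rounded to one decimal place, so intermediate distances may differ from the slides in the last decimal, while the final prediction still comes out as 3.14:

```python
import math

# Ratings from the practice example; None marks a missing rating.
movies = ["Titanic", "Batman", "Inception", "Superman", "The Martian", "Jurassic World"]
ratings = {
    "Michelle": [2.5, None, 3,    3.5, 2.5,  3],
    "Tom":      [3,   3.5,  None, 5,   3,    3.5],
    "Lao":      [2.5, 3,    None, 3.5, None, 4],
    "Chan":     [None, 3.5, 3,    4,   2.5,  None],
    "Mary":     [None, 4,   2,    3,   2,    3],
    "Tim":      [3,   4,    None, 5,   3.5,  3],
    "John":     [None, 4.5, None, 4,   1,    None],
}

def item_column(j):
    """All users' ratings for item j (one column of the matrix)."""
    return [row[j] for row in ratings.values()]

def similarity(a, b):
    """Phase 1: Euclidean similarity with mean imputation."""
    def impute(col):
        known = [r for r in col if r is not None]
        mean = round(sum(known) / len(known), 1)  # slides round to one decimal
        return [mean if r is None else r for r in col]
    a, b = impute(a), impute(b)
    return 1 / (1 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

def predict(user, j, k=3):
    """Phase 2: weighted average over the k most similar items rated by `user`."""
    col_j = item_column(j)
    sims = {i: similarity(item_column(i), col_j)
            for i in range(len(movies))
            if i != j and ratings[user][i] is not None}
    top_k = sorted(sims, key=sims.get, reverse=True)[:k]
    return (sum(sims[i] * ratings[user][i] for i in top_k)
            / sum(sims[i] for i in top_k))

inception = movies.index("Inception")
print(round(similarity(item_column(0), item_column(inception)), 2))  # → 0.48
print(round(predict("Tim", inception), 2))                           # → 3.14
```

Note how phase 1 (`similarity`) depends only on the matrix and can be batch-precomputed, while phase 2 (`predict`) is a cheap online lookup-and-average.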

Item-based collaborative filtering Summary
• Item-similarity computation is off-line,
• so it is efficient at runtime.
• Developed by Amazon; suited to situations where #users >> #items.
• What do we do with new items?

User-based collaborative filtering: Intuition
People like things liked by other people with similar taste
• Search for similarities among users
– Two users, Jane and Bob, tend to like the same movies; they have similar taste in movies.
• Recommend items liked by users similar to the target user.
– Jane and Bob have similar rating behaviours (taste);
– if Jane liked Batman, then recommend Batman to Bob.
• Mathematically similar to item-based methods.

User-based method
Target user
Q1: how to measure similarity?
Q2: how to find similar users?
Q3: how to combine?

Q1: How to measure similarity between users
• Euclidean distance with mean imputation

     i1   i2   i3   i4   i5   i6
u1   17   –    20   18   17   18.5
u2   8    –    –    17   14   17.5

• Compute the mean of u1’s known ratings to impute u1’s missing values (18.1)
• Compute the mean of u2’s known ratings to impute u2’s missing values (14.1)
• Compute the Euclidean distance between the resulting rows:
d(u1, u2) = √((17 − 8)² + (18.1 − 14.1)² + (20 − 14.1)² + (18 − 17)² + (17 − 14)² + (18.5 − 17.5)²) = 11.9
• Convert the distance into a similarity (high similarity for low distance, low similarity for high distance):
sim(u1, u2) = 1 / (1 + d(u1, u2)) = 0.08
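The same computation in Python (a sketch; the function name is illustrative, and the imputed means are rounded to one decimal, as on the slide):

```python
import math

def user_similarity(u, v):
    """Euclidean similarity between two users' rating rows, with mean imputation."""
    def impute(row):
        known = [r for r in row if r is not None]
        mean = round(sum(known) / len(known), 1)  # e.g. 18.1 for u1, 14.1 for u2
        return [mean if r is None else r for r in row]
    u, v = impute(u), impute(v)
    return 1 / (1 + math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))))

u1 = [17, None, 20, 18, 17, 18.5]   # mean of known ratings = 18.1
u2 = [8, None, None, 17, 14, 17.5]  # mean of known ratings ≈ 14.1
print(round(user_similarity(u1, u2), 2))  # → 0.08
```

Note that where both users are missing a rating (item i2), the squared difference of the two imputed means, (18.1 − 14.1)², still contributes to the distance.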

User-based: Q2: How to find similar users? Q3: How to combine ratings?
• Selecting similar users and making prediction
• With respect to user a and item 𝑗:
– Choose 𝒌 most similar users who have rated item 𝒋.
– Prediction of rating is weighted average of the ratings of item 𝑗 from the top-k similar users.

User-based method
• Mathematically similar to the item-based method. However:
– Item-based performs better in many practical cases: movies, books, etc.
– User preferences are dynamic, while item similarities are relatively static, so user-based methods need a high update frequency for the offline-calculated information.
– Sparsity problem with the user-based method.
• No recommendations for new users
• Scalability issues
– As the number of users increases, it becomes more costly to find similar users.
– Mitigation: offline clustering of users.

Scale-up search of k-similar users
• Offline step: cluster the users, so that at runtime the k most similar users are searched for only within the target user’s cluster.
[Figure: users grouped into clusters offline to narrow the online search.]
Options for Q1: Similarity metrics
• Item-item: considers similar items
• User-user: considers similar users
• We looked at Euclidean-distance-based similarity.
• Two other popular similarity measures are
– Cosine similarity and
– Pearson correlation (centred cosine similarity).

Cosine similarity
• Cosine similarity is a measure of similarity between two vectors X, Y:
– the normalised dot product of X and Y
– user-user: X, Y are the rating vectors of user x and user y
– item-item: X, Y are the rating vectors of item x and item y

cos(X, Y) = (X · Y) / (‖X‖ × ‖Y‖) = Σ_{i=1}^{n} X_i × Y_i / (√(Σ_{i=1}^{n} X_i²) × √(Σ_{i=1}^{n} Y_i²))

[Figure: vectors X, Y, Z; the angle θ1 between X and Y is smaller than the angle θ2 between X and Z, so cos(X, Y) > cos(X, Z).]
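A direct translation of the formula; the three rating vectors below are made-up illustrations of the figure’s X, Y, Z (not data from the slides):

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors: dot product over the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# X and Y point in similar directions (small angle), Z does not.
X, Y, Z = [4, 5, 1], [5, 5, 2], [1, 2, 5]
print(round(cosine(X, Y), 2))  # → 0.99
print(round(cosine(X, Z), 2))  # → 0.54
```

As in the figure, cos(X, Y) > cos(X, Z): the smaller the angle, the higher the similarity.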

Centred cosine similarity
• Cosine similarity
– Missing values in vectors are imputed with the value 0
– Issue: 0 has very different meanings in different vector contexts
• Two users, one a tough rater and one an easy rater
• Two items, one with higher and one with lower ratings
• → misleading results
• Let X_norm and Y_norm be the mean-centred (normalised) versions of X and Y:

centred_cos(X, Y) = cos(X_norm, Y_norm) = Σ_{i=1}^{n} (X_i − x̄) × (Y_i − ȳ) / (√(Σ_{i=1}^{n} (X_i − x̄)²) × √(Σ_{i=1}^{n} (Y_i − ȳ)²))

• Centred cosine similarity is Pearson correlation.
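Centring each vector on its own mean and then taking the cosine gives Pearson correlation. The two example raters below are hypothetical: a “tough” and an “easy” rater with the same relative preferences, whom centring correctly recognises as identical in taste:

```python
import math

def centred_cosine(x, y):
    """Pearson correlation computed as a cosine of mean-centred vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / (math.sqrt(sum(a * a for a in xc)) *
                  math.sqrt(sum(b * b for b in yc)))

# A "tough" rater and an "easy" rater with the same relative preferences:
tough = [1, 2, 3]
easy  = [3, 4, 5]
print(round(centred_cosine(tough, easy), 2))  # → 1.0
```

After centring, both vectors become [−1, 0, 1], so the similarity is a perfect 1.0 even though the raw scores differ by the raters’ overall harshness.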

Summary
• We learnt:
– Popularity-based recommendation.
– Item-based and user-based collaborative filtering (Gen 1)
• Simple but reasonably powerful.
• Achieves some level of personalisation.
– Different measures of similarity.
– Some limitations of these approaches:
• Cold-start problem – new items/users
• Scalability issues