COMM7380 Recommender Systems for Digital Media
In [1]:
# Important: display plots inline, right below the cell that creates them
%matplotlib inline
# Install Matplotlib, Pandas and NumPy with pip into the current Jupyter kernel
import sys
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
User Behaviour and the User-Item Matrix
Importing and knowing your data
In [2]:
import pandas as pd
import numpy as np
In [4]:
evidence = pd.read_csv('collector_log.csv')
In [5]:
# Check the type and take a glance at the first rows
print(type(evidence))
evidence.head(5)
Out[5]:
   id           created  content_id        event  session_id  user_id
0   3  14/01/2020 17:54     4501244      details      794773   400003
1   4  14/01/2020 17:54     3521164  moreDetails      794773   400003
2   5  14/01/2020 17:54     3640424      details      441002   400005
3   6  14/01/2020 17:54     2823054  moreDetails      885440   400001
4   7  14/01/2020 17:54     3553976    genreView      441003   400005
Examining the attributes of the DataFrame (standard procedures)
• df.shape (“dim” in R)
• df.columns (check the variables, like “names” in R)
• df.index (check the index of the “rows”)
• df.info()
• df.describe() (descriptive statistics for numerical variables)
In [6]:
evidence.shape
# (the number of cases/observations, the number of variables)
Out[6]:
(100000, 6)
In [7]:
evidence.columns
Out[7]:
Index(['id', 'created', 'content_id', 'event', 'session_id', 'user_id'], dtype='object')
In [8]:
evidence.index
Out[8]:
RangeIndex(start=0, stop=100000, step=1)
In [9]:
evidence.info()
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
id 100000 non-null int64
created 100000 non-null object
content_id 100000 non-null int64
event 100000 non-null object
session_id 100000 non-null int64
user_id 100000 non-null int64
dtypes: int64(4), object(2)
memory usage: 4.6+ MB
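The `info()` output above reports `created` as `object`, that is, plain strings. A minimal sketch of converting such a column to timestamps, assuming the day/month/year layout visible in the `head()` output. The toy frame `sample` and the column name `created_ts` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy rows mimicking the `created` column of collector_log.csv (day/month/year order).
sample = pd.DataFrame({"created": ["14/01/2020 17:54", "02/03/2020 09:05"]})

# An explicit format removes the ambiguity between day-first and month-first dates.
sample["created_ts"] = pd.to_datetime(sample["created"], format="%d/%m/%Y %H:%M")

print(sample.dtypes)
print(sample["created_ts"].dt.month.tolist())
```

With a proper datetime column, the age of each event can be computed by plain subtraction, which is useful later when decay is introduced.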
In [10]:
evidence.describe()
Out[10]:
                  id    content_id     session_id        user_id
count  100000.000000  1.000000e+05  100000.000000  100000.000000
mean    50002.500000  2.987698e+06  585135.625220  400003.502640
std     28867.657797  1.230240e+06  317096.306234       1.703597
min         3.000000  4.752900e+05   42450.000000  400001.000000
25%     25002.750000  1.972591e+06  404784.000000  400002.000000
50%     50002.500000  3.040964e+06  794789.000000  400004.000000
75%     75002.250000  4.034228e+06  886267.000000  400005.000000
max    100002.000000  5.700672e+06  935091.000000  400006.000000
In [11]:
users = evidence.user_id.unique()
content = evidence.content_id.unique()
print(type(content))
print(len(content))
103
Implicit Ratings
Binary Matrix
Let’s create a user-item binary matrix from the “buy” events.
In [12]:
#Create a user-item binary matrix
uiBuyMatrix = pd.DataFrame(columns=content, index=users)
uiBuyMatrix.head(2)
Out[12]:
       4501244 3521164 3640424 2823054 3553976 3470600 4513674 4698684 3315342 3874544  \
400003     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
400005     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

         … 4196776 2948356 1355644 3300542 5247022 2140479 1083452 1179933 3410834 3553442
400003   …     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
400005   …     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

2 rows × 103 columns
In [13]:
evidence.event.unique()
Out[13]:
array(['details', 'moreDetails', 'genreView', 'addToList', 'buy'],
      dtype=object)
Select only the “buy” events
In [14]:
buyEvidence = evidence[evidence['event'] == 'buy']
buyEvidence.head(5)
Out[14]:
      id           created  content_id event  session_id  user_id
92    95  14/01/2020 17:54     4501244   buy      794776   400003
131  134  14/01/2020 17:54     2937696   buy      885441   400001
358  361  14/01/2020 17:54     3874544   buy      885444   400001
612  615  14/01/2020 17:54     3949660   buy      885445   400001
707  710  14/01/2020 17:54     5512872   buy       42460   400006
Create the user-item matrix uiBuyMatrix for the buy events
In [22]:
# Mark every (user, item) pair that appears in a "buy" event with 1
for index, row in buyEvidence.iterrows():
    currentUser = row['user_id']
    currentContent = row['content_id']
    uiBuyMatrix.at[currentUser, currentContent] = 1
In [23]:
print(uiBuyMatrix)
       4501244 3521164 3640424 2823054 3553976 3470600 4513674 4698684  \
400003       1       1       1       1       1       1       1       1
400005     NaN     NaN       1     NaN       1     NaN       1     NaN
400001       1       1       1     NaN       1       1       1       1
400006       1     NaN       1       1     NaN       1       1       1
400002     NaN     NaN       1       1       1       1       1     NaN
400004     NaN     NaN       1     NaN     NaN     NaN     NaN     NaN

       3315342 3874544   … 4196776 2948356 1355644 3300542 5247022 2140479  \
400003       1       1   …     NaN     NaN       1     NaN     NaN       1
400005       1     NaN   …     NaN     NaN       1     NaN     NaN       1
400001       1       1   …     NaN     NaN       1     NaN       1       1
400006     NaN       1   …     NaN     NaN     NaN     NaN       1     NaN
400002       1     NaN   …       1       1       1     NaN       1     NaN
400004       1     NaN   …       1       1     NaN       1     NaN     NaN

       1083452 1179933 3410834 3553442
400003     NaN       1     NaN       1
400005       1       1     NaN     NaN
400001       1       1     NaN       1
400006     NaN     NaN     NaN       1
400002       1     NaN       1       1
400004     NaN     NaN       1     NaN

[6 rows x 103 columns]
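The row-by-row loop is easy to follow but slow on large logs. As a hedged alternative, the same kind of binary matrix can be sketched with `pd.crosstab`. The toy frame `toy_buys` and its item ids are made up for illustration; only the column names `user_id` and `content_id` come from the notebook:

```python
import pandas as pd

# Toy buy log (hypothetical values) with the same column names as buyEvidence.
toy_buys = pd.DataFrame({
    "user_id":    [400001, 400001, 400003, 400003],
    "content_id": [111, 222, 111, 111],   # user 400003 bought item 111 twice
})

# crosstab counts each (user, item) pair; clipping at 1 turns counts into 0/1 flags.
counts = pd.crosstab(toy_buys["user_id"], toy_buys["content_id"])
binary = counts.clip(upper=1)

# Replace 0 with NaN to mirror the NaN-for-missing convention of the loop-based matrix.
binary_nan = binary.where(binary > 0)
print(binary_nan)
```

Note that `crosstab` only creates rows and columns for users and items that occur in the input, so a `reindex` over the full `users` and `content` arrays would be needed to obtain the same shape as `uiBuyMatrix`.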
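Behavioural Implicit Ratings
Using the formula introduced during the lecture, where ${\#event}_k$ is the number of events of type $k$ that user $u$ generated on item $i$ and $w_k$ is the weight assigned to that event type:
$${IR}_{(i,u)} = \left(w_1*{\#event}_1\right)+\left(w_2*{\#event}_2\right)+\dots+\left(w_n*{\#event}_n\right)$$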
In [24]:
#Create a user-item matrix
uiMatrix = pd.DataFrame(columns=content, index=users)
uiMatrix.head(2)
Out[24]:
       4501244 3521164 3640424 2823054 3553976 3470600 4513674 4698684 3315342 3874544  \
400003     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
400005     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

         … 4196776 2948356 1355644 3300542 5247022 2140479 1083452 1179933 3410834 3553442
400003   …     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
400005   …     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

2 rows × 103 columns
Types of events recorded in the logs
In [25]:
eventTypes = evidence.event.unique()
print(eventTypes)
['details' 'moreDetails' 'genreView' 'addToList' 'buy']
Give a weight to each of them
In [26]:
eventWeights = {
    'details': 15,
    'moreDetails': 50,
    'genreView': 0,
    'addToList': 0,
    'buy': 100}
Compute the Implicit Rating for each user-item combination. Populate the user-item matrix uiMatrix with the IR values.
In [27]:
# Iterate over the evidence
for index, row in evidence.iterrows():
    # Select the user and item involved
    currentUser = row['user_id']
    currentContent = row['content_id']
    # Look up the weight for this event type
    w = eventWeights[row['event']]
    # Read the value currently stored for this user-item combination, if any
    currentValue = uiMatrix.at[currentUser, currentContent]
    if np.isnan(currentValue):
        currentValue = 0
    # Each row is one event occurrence, so add (1 * w) and update the user-item matrix
    updatedValue = currentValue + w
    uiMatrix.at[currentUser, currentContent] = updatedValue
In [28]:
print(uiMatrix)
       4501244 3521164 3640424 2823054 3553976 3470600 4513674 4698684  \
400003    4820    4645    4110    3875    2785    4165    7070    4155
400005     NaN     NaN    7555     NaN    7785     NaN    7840     NaN
400001    4810    4590    4255    4000    2335    5230    6825    4520
400006    7615    7675     195    8700     NaN    9240    8835    8600
400002    2245    2420    5890    2815    1710    2345    4015    1900
400004     NaN     NaN    8320     NaN     NaN     NaN     NaN     NaN

       3315342 3874544   … 4196776 2948356 1355644 3300542 5247022 2140479  \
400003    3920    4200   …    1625    2040    2745    1620    3580    2840
400005    7965     NaN   …      50      80    7630      15     NaN    8090
400001    4235    4445   …    1630    1715    2150    1570    4600    2445
400006      30    9540   …      60     230     NaN     150    7725     NaN
400002    6160    2160   …    4800    4060    1665    3630    2530    2090
400004    8200     NaN   …    8425    8120     NaN    8775     NaN     NaN

       1083452 1179933 3410834 3553442
400003    2885    2665    1990    3655
400005    9135    7945      15     NaN
400001    2325    2535    1830    4010
400006     NaN     NaN      30    8220
400002    1800    1395    4240    2415
400004     NaN     NaN    8720     NaN

[6 rows x 103 columns]
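The implicit-rating loop can also be expressed without explicit iteration. The sketch below assumes the same event weights as above and uses a small made-up log (`toy_log`, with hypothetical item ids) so that the arithmetic can be checked by hand:

```python
import pandas as pd

# Same weights as in the notebook's eventWeights dictionary.
event_weights = {"details": 15, "moreDetails": 50, "genreView": 0,
                 "addToList": 0, "buy": 100}

# Toy log (hypothetical values) with the notebook's column names.
toy_log = pd.DataFrame({
    "user_id":    [400001, 400001, 400001, 400003],
    "content_id": [111, 111, 111, 222],
    "event":      ["details", "details", "buy", "moreDetails"],
})

# Look up the weight of every logged event, then sum the weights per (user, item).
toy_log["weight"] = toy_log["event"].map(event_weights)
ir = toy_log.pivot_table(index="user_id", columns="content_id",
                         values="weight", aggfunc="sum")
print(ir)
```

For user 400001 and item 111 this gives 2 × 15 + 1 × 100 = 130, which is the formula applied to two `details` events and one `buy` event. Pairs with no events stay `NaN`, as in `uiMatrix`.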
Exercise 2
Limit the number of relevant events to a specific threshold (e.g. 10).
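One possible reading of this exercise, offered only as a starting sketch: count the events per (user, item, event type) and cap each count at the threshold before applying the weights. The names `MAX_EVENTS`, `toy_log` and `ir_capped` are hypothetical, and other interpretations of "relevant events" (for example, keeping only the most recent ones) are equally plausible:

```python
import pandas as pd

event_weights = {"details": 15, "moreDetails": 50, "genreView": 0,
                 "addToList": 0, "buy": 100}
MAX_EVENTS = 10  # assumed threshold, taken from the "e.g. 10" in the exercise text

# Toy log (hypothetical values): 12 "details" events and 1 "buy" for one user-item pair.
toy_log = pd.DataFrame({
    "user_id":    [400001] * 13,
    "content_id": [111] * 13,
    "event":      ["details"] * 12 + ["buy"],
})

# Count events per (user, item, event type) and cap each count at the threshold.
counts = (toy_log.groupby(["user_id", "content_id", "event"]).size()
          .clip(upper=MAX_EVENTS).rename("n").reset_index())

# Apply the weights to the capped counts and sum per (user, item).
counts["weighted"] = counts["n"] * counts["event"].map(event_weights)
ir_capped = counts.pivot_table(index="user_id", columns="content_id",
                               values="weighted", aggfunc="sum")
print(ir_capped)
```

Here the 12 `details` events count as 10, so the rating is 10 × 15 + 1 × 100 = 250 instead of the uncapped 280.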
In [ ]:
Exercise 3
Add a decay threshold. Older events are less informative about the user’s current behaviour, so their contribution should shrink with age. Check the sample Python function below and adapt the code according to the following formulation.
Behavioural Implicit Ratings with Decay
We modify the formula introduced during the lecture
$${IR}_{(i,u)} = \sum_{k=1}^n w_k*{\#event}_k = \left(w_1*{\#event}_1\right)+\left(w_2*{\#event}_2\right)+\dots+\left(w_n*{\#event}_n\right)$$
to
$${IRDecay}_{(i,u)} = \sum_{k=1}^n w_k*{\#event}_k*d\left({\#event}_k\right) = \left(w_1*{\#event}_1*d\left({\#event}_1\right)\right)+\left(w_2*{\#event}_2*d\left({\#event}_2\right)\right)+\dots+\left(w_n*{\#event}_n*d\left({\#event}_n\right)\right)$$
where $d(\cdot)$ is a decay factor that shrinks the contribution of older events.
Computing decay
In [27]:
from datetime import date, timedelta, datetime

def compute_decay(eventDate, decayDays):
    # Age of the event, in whole periods of decayDays days
    age = (date.today() - datetime.strptime(eventDate, '%d/%m/%Y %H:%M').date()) // timedelta(days=decayDays)
    #print("Age of event:", age)
    # Simple decay; events younger than one period keep full weight (avoids division by zero)
    decay = 1 / max(age, 1)
    #print("Decay factor:", decay)
    return decay

createdEvent = evidence.at[0, 'created']
thresholdDays = 2 # Number of days
# The result depends on date.today(), so it changes with the day the cell is run
decayFactor = compute_decay(createdEvent, thresholdDays)
print(decayFactor)
0.1111111111111111
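As a hedged starting point for the exercise, the sketch below applies a decay factor to each logged event before summing the weights per user-item pair. It is one interpretation of the decayed formula, not the reference solution. The function `decay_factor`, the fixed `reference` date (used instead of today's date so the result is reproducible) and the toy log are all assumptions made for illustration:

```python
from datetime import datetime

import pandas as pd

event_weights = {"details": 15, "moreDetails": 50, "buy": 100}

def decay_factor(created, reference, decay_days):
    """Return 1/age, with age counted in whole periods of decay_days (at least 1)."""
    event_time = datetime.strptime(created, "%d/%m/%Y %H:%M")
    age = (reference.date() - event_time.date()).days // decay_days
    return 1 / max(age, 1)

# Hypothetical "today", fixed so that the example always gives the same numbers.
reference = datetime(2020, 2, 1)

# Toy log (hypothetical values) with the notebook's column names.
toy_log = pd.DataFrame({
    "user_id":    [400001, 400001],
    "content_id": [111, 111],
    "event":      ["buy", "details"],
    "created":    ["14/01/2020 17:54", "30/01/2020 10:00"],
})

# Weight of each event multiplied by its own decay factor.
toy_log["decayed"] = [
    event_weights[e] * decay_factor(c, reference, decay_days=2)
    for e, c in zip(toy_log["event"], toy_log["created"])
]
ir_decay = toy_log.pivot_table(index="user_id", columns="content_id",
                               values="decayed", aggfunc="sum")
print(ir_decay)
```

The `buy` event is 18 days old, that is 9 periods of 2 days, so it contributes 100 / 9; the `details` event is 2 days old (1 period) and contributes the full 15.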
In [ ]:
• Course Instructor: Dr. Paolo Mengoni (Visiting Scholar, School of Communication, Hong Kong Baptist University)
▪ pmengoni@hkbu.edu.hk
• The code in this notebook takes inspiration from various sources. All code is for educational purposes only and is released under the CC1.0.