Introduction to Machine Learning: Maximum Likelihood Estimates
Prof. Kutty
collaborative filtering
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Matrix Completion
call this the utility (or user-item) matrix Y
How to solve for the missing ratings?
1) Matrix factorization
2) Nearest neighbor prediction
Low-Rank Factorization: example
COMEDY and ACTION are the latent factors; rows are user vectors, columns are item vectors.
based on example from Aggarwal 2016
[figure: the utility matrix Y is approximated by the product of a user-factor matrix and an item-factor matrix]
Low-Rank Factorization Algorithm
$$J(U, V) = \frac{1}{2} \sum_{(a,i) \in D} \left( Y_{ai} - \bar{u}^{(a)} \cdot \bar{v}^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{a=1}^{n} \left\| \bar{u}^{(a)} \right\|^2 + \frac{\lambda}{2} \sum_{i=1}^{m} \left\| \bar{v}^{(i)} \right\|^2$$
where $D$ is the set of observed entries $(a, i)$ in $Y$.
Initialize item features $\bar{v}^{(1)}, \ldots, \bar{v}^{(m)}$ to small (random) values.
Iterate until convergence:
– fix $\bar{v}^{(1)}, \ldots, \bar{v}^{(m)}$; solve for each user $a \in \{1, \ldots, n\}$:
$$\min_{\bar{u}^{(a)}} \frac{1}{2} \sum_{(a,i) \in D} \left( Y_{ai} - \bar{u}^{(a)} \cdot \bar{v}^{(i)} \right)^2 + \frac{\lambda}{2} \left\| \bar{u}^{(a)} \right\|^2$$
– for $\bar{u}^{(1)}, \ldots, \bar{u}^{(n)}$ obtained above, solve for each item $i \in \{1, \ldots, m\}$:
$$\min_{\bar{v}^{(i)}} \frac{1}{2} \sum_{(a,i) \in D} \left( Y_{ai} - \bar{u}^{(a)} \cdot \bar{v}^{(i)} \right)^2 + \frac{\lambda}{2} \left\| \bar{v}^{(i)} \right\|^2$$
The prediction for user $a$, item $i$ can then be computed as $\hat{Y}_{ai} = \bar{u}^{(a)} \cdot \bar{v}^{(i)}$.
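This alternating minimization can be sketched in a few lines of NumPy. A minimal illustration under assumed inputs (a dense matrix Y plus a boolean observed-entry mask; the rank d, regularization lam, and iteration count are placeholder values), not the exact implementation from lecture:

```python
import numpy as np

def als(Y, mask, d=2, lam=0.1, n_iters=50):
    """Alternating least squares for low-rank matrix completion.

    Y    : (n, m) utility matrix (arbitrary values where unobserved)
    mask : (n, m) boolean array, True where Y is observed
    d    : rank of the approximation (a hyperparameter)
    lam  : regularization strength lambda
    """
    n, m = Y.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, d))   # user factors u^(a)
    V = rng.normal(scale=0.1, size=(m, d))   # item factors v^(i), small random init
    for _ in range(n_iters):
        # fix V: each user subproblem is ridge regression with solution
        # (sum_i v_i v_i^T + lam*I) u = sum_i Y_ai v_i
        for a in range(n):
            Va = V[mask[a]]                  # items rated by user a
            U[a] = np.linalg.solve(Va.T @ Va + lam * np.eye(d),
                                   Va.T @ Y[a, mask[a]])
        # fix U: symmetric subproblem for each item
        for i in range(m):
            Ui = U[mask[:, i]]               # users who rated item i
            V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(d),
                                   Ui.T @ Y[mask[:, i], i])
    return U, V

# predicted rating for user a, item i: U[a] @ V[i]
```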
in-class Question
• How do you specify the rank of the approximating matrix?
It is a hyperparameter.
How to solve for the missing ratings?
1) Matrix factorization
2) Nearest neighbor prediction
Nearest Neighbor Prediction
Suppose user a has not rated movie i. To predict the rating:
– compute similarity between user a and all other users in the system
– find the k 'nearest neighbors' of user a who have rated movie i
– compute a prediction based on these users' ratings of i
$$\mathrm{sim}(a, b) = \frac{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})(Y_{bj} - \tilde{Y}_{b:a})}{\sqrt{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})^2} \sqrt{\sum_{j \in R(a,b)} (Y_{bj} - \tilde{Y}_{b:a})^2}}$$
values range over [-1, 1]
-1, +1 → perfect linear relationship
0 → no linear correlation
absolute value is a measure of degree of correlation
$R(a, b)$: set of movies rated by both users a and b
$$\tilde{Y}_{a:b} = \frac{1}{|R(a, b)|} \sum_{j \in R(a,b)} Y_{aj}$$
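As a sketch, this similarity can be computed directly from two users' rating dictionaries. The data layout (dicts mapping movie id to rating) is an assumption made for illustration:

```python
import numpy as np

def sim(ratings_a, ratings_b):
    """Pearson correlation between two users, restricted to co-rated movies.

    ratings_a, ratings_b: dicts mapping movie id -> rating.
    """
    common = sorted(set(ratings_a) & set(ratings_b))  # R(a, b)
    if not common:
        return 0.0                                    # no co-rated movies
    ya = np.array([ratings_a[j] for j in common], dtype=float)
    yb = np.array([ratings_b[j] for j in common], dtype=float)
    da = ya - ya.mean()   # subtract Y~_{a:b}, the mean over co-rated movies
    db = yb - yb.mean()   # subtract Y~_{b:a}
    return (da @ db) / np.sqrt((da @ da) * (db @ db))
```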
$$\hat{Y}_{ai} = \bar{Y}_a + \frac{1}{\sum_{b \in kNN(a,i)} |\mathrm{sim}(a, b)|} \sum_{b \in kNN(a,i)} \mathrm{sim}(a, b) \left( Y_{bi} - \bar{Y}_b \right)$$
where $\bar{Y}_a$ is user $a$'s mean rating and $kNN(a, i)$ is the set of the $k$ users most similar to $a$ who have rated item $i$.
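A companion sketch of the prediction rule, reusing sim from above; user_mean and the dict-of-dicts ratings layout are illustrative assumptions:

```python
def user_mean(r):
    """Mean rating of one user over all movies they have rated."""
    return sum(r.values()) / len(r)

def predict(a, i, ratings, k=1):
    """Neighborhood prediction of user a's rating of item i (Y^hat_ai).

    ratings: dict mapping user id -> dict (movie id -> rating).
    """
    # candidate neighbors: users other than a who have rated movie i
    candidates = [b for b in ratings if b != a and i in ratings[b]]
    # keep the k users most similar to a (sort by similarity, largest first)
    neighbors = sorted(candidates,
                       key=lambda b: sim(ratings[a], ratings[b]),
                       reverse=True)[:k]
    num = sum(sim(ratings[a], ratings[b]) * (ratings[b][i] - user_mean(ratings[b]))
              for b in neighbors)
    den = sum(abs(sim(ratings[a], ratings[b])) for b in neighbors)
    return user_mean(ratings[a]) + num / den
```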
in-class exercise: what is the similarity of users a and b?
$$\mathrm{sim}(a, b) = \frac{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})(Y_{bj} - \tilde{Y}_{b:a})}{\sqrt{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})^2} \sqrt{\sum_{j \in R(a,b)} (Y_{bj} - \tilde{Y}_{b:a})^2}}$$
d) strictly between -1 and 0
e) strictly between 0 and 1
(ratings table for users a and b shown on slide)
Example: similarity
$$\mathrm{sim}(a, b) = \frac{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})(Y_{bj} - \tilde{Y}_{b:a})}{\sqrt{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})^2} \sqrt{\sum_{j \in R(a,b)} (Y_{bj} - \tilde{Y}_{b:a})^2}}$$
in-class exercise: what is the prediction for item 3?
k “nearest neighbors”
i.e., the k most similar users to a, who have rated movie i
Assume we are using 1NN and b is the user most similar to a
a) 0 b) 1 c) 2.5 d) 3
$$\hat{Y}_{ai} = \bar{Y}_a + \frac{1}{\sum_{b \in kNN(a,i)} |\mathrm{sim}(a, b)|} \sum_{b \in kNN(a,i)} \mathrm{sim}(a, b) \left( Y_{bi} - \bar{Y}_b \right)$$
Example: prediction
k “nearest neighbors”
i.e., the k most similar users to a, who have rated movie i
Assume we are using 1NN and b is the user most similar to a
$$\hat{Y}_{ai} = \bar{Y}_a + \frac{1}{\sum_{b \in kNN(a,i)} |\mathrm{sim}(a, b)|} \sum_{b \in kNN(a,i)} \mathrm{sim}(a, b) \left( Y_{bi} - \bar{Y}_b \right)$$
How to choose k?
• Some heuristics:
– all users are considered as neighbors
– Why could limiting neighborhood size result in more accurate predictions?
• neighbors with low correlation tend to introduce noise
– Variable k
• varying neighborhoods are selected for each item based on a similarity threshold (see the sketch below)
– offline analysis: values in the 20–50 range are a reasonable starting point in many domains.
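One possible reading of the variable-k heuristic, reusing sim and the ratings layout from the earlier sketches; the 0.5 threshold is an arbitrary illustrative value:

```python
def variable_neighbors(a, i, ratings, threshold=0.5):
    """Variable k: keep every user who rated item i whose similarity to a
    clears a threshold, instead of fixing the neighborhood size."""
    return [b for b in ratings
            if b != a and i in ratings[b]
            and sim(ratings[a], ratings[b]) > threshold]
```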
Discriminative vs Generative Models
Discriminative Models
E.g., classification → learn a separator to discriminate two classes
*internal structure of the classes is not captured
[figure: scatter of labeled points with a learned decision boundary]
in-class Question: Where does this point belong?
For each plot indicate to which cluster the given datapoint belongs and why: cluster 1, cluster 2, Both OR Neither
Why do we care about generative models?
Better understanding of where our data came from; how it was ‘generated’
• describes internal structure of the data
• can also be used for classification
We can use this as a basis for soft clustering
We can use this as a basis for graphical models
+ ++ – ++ +
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimate
i.i.d.: independently and identically distributed
We assume data are generated i.i.d. from an unknown Bernoulli distribution that has parameter p
each of these coin flips is with the same coin (same bias towards head) and each coin flip is independent of previous flips
Use MLE to determine the likeliest parameter value, given the dataset
Maximum Likelihood Estimation (MLE) Bernoulli
• Coin could be biased. Estimate the generating distribution given these observations: 1, 0, 1, 1.
• $P(x = 1) = p$; $P(x = 0) = 1 - p$ (identically distributed)
• $p(1, 0, 1, 1) = p \cdot (1 - p) \cdot p \cdot p = p^3(1 - p)$ (independently distributed)
• Maximize this likelihood: take the derivative of this product wrt $p$, set it to zero, and solve for $p$.
In fact any other value of $p$ would lead to a lower likelihood of the data.
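Carrying out that maximization for the example above (a short derivation, with $L(p)$ denoting the likelihood):
$$\begin{aligned}
L(p) &= p^3(1 - p) \\
\frac{dL}{dp} &= 3p^2(1 - p) - p^3 = p^2(3 - 4p) \\
\frac{dL}{dp} = 0 &\;\Longrightarrow\; \hat{p} = \frac{3}{4}
\end{aligned}$$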
generative story with i.i.d. assumption for Bernoulli
Given $S_n = \{\bar{x}^{(i)}\}_{i=1}^{n}$
Assume each $\bar{x}^{(i)} \sim \mathrm{Bern}(x; p)$ (identically distributed)
i.e., each $\bar{x}^{(i)} = 1$ with probability $p$ and $\bar{x}^{(i)} = 0$ with probability $1 - p$
• $\forall i \neq j$: $p(\bar{x}^{(i)}, \bar{x}^{(j)}) = \mathrm{Bern}(\bar{x}^{(i)}; p) \, \mathrm{Bern}(\bar{x}^{(j)}; p)$ (independently distributed)
e.g., $p(\bar{x}^{(1)} = 1, \bar{x}^{(2)} = 0, \bar{x}^{(3)} = 1, \bar{x}^{(4)} = 1) = p^3(1 - p)$
Consequently,
$$p(S_n) = \prod_{i=1}^{n} p(\bar{x}^{(i)})$$
Goal: Determine $p$
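A minimal numerical sketch of the same computation in NumPy; the data is the 1, 0, 1, 1 sequence from the example, and the fact that the Bernoulli MLE is the sample mean is used here without derivation:

```python
import numpy as np

x = np.array([1, 0, 1, 1])  # the observed flips from the example

def log_likelihood(p, x):
    """log p(S_n) = sum_i log Bern(x^(i); p), using the i.i.d. factorization."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

p_hat = x.mean()  # Bernoulli MLE: the fraction of 1s, here 3/4

# any other value of p gives a lower likelihood of the data
for p in (0.25, 0.5, 0.9):
    assert log_likelihood(p_hat, x) > log_likelihood(p, x)
```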
Maximum Likelihood Estimate: intuition
We assume data are generated i.i.d. from an unknown Gaussian distribution that has parameters $\mu$ and $\sigma^2$
each datapoint was drawn from the same ‘bell curve’
Use MLE to determine the likeliest parameter values, given the dataset
!"#,%! = 1 exp−"−#! 2)%! 2%!
examples: inches of snowfall, heights of people etc.
Gaussian (normal) Distribution
Suppose the data is generated from a (univariate) normal distribution:
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
$$E[x] = \mu, \qquad \mathrm{var}(x) = E\left[ (x - E[x])^2 \right] = \sigma^2$$
generative story with i.i.d. assumption for univariate Gaussian
Given $S_n = \{\bar{x}^{(i)}\}_{i=1}^{n}$ drawn i.i.d.
Assume each $\bar{x}^{(i)} \sim N(\bar{x} \mid \mu, \sigma^2)$ (identically distributed)
• $\forall i \neq j$: $p(\bar{x}^{(i)}, \bar{x}^{(j)}) = N(\bar{x}^{(i)} \mid \mu, \sigma^2) \, N(\bar{x}^{(j)} \mid \mu, \sigma^2)$ (independently distributed)
Consequently,
$$p(S_n) = \prod_{i=1}^{n} N(\bar{x}^{(i)} \mid \mu, \sigma^2)$$
Goal: Determine $\mu, \sigma^2$
• Want to maximize $p(S_n)$ wrt $\mu$
• Want to maximize $p(S_n)$ wrt $\sigma^2$
MLE for the univariate Gaussian
• Given $S_n = \{x^{(i)}\}_{i=1}^{n}$ drawn i.i.d.
$$p(S_n) = \prod_{i=1}^{n} p(x^{(i)})$$
• Want to maximize $p(S_n)$ wrt $\mu$:
$$\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}$$
• Want to maximize $p(S_n)$ wrt $\sigma^2$:
$$\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} - \hat{\mu}_{MLE} \right)^2$$
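A sketch of both estimators in NumPy, on synthetic data (the true parameters below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)  # synthetic draws: mu=3, sigma=2

mu_mle = x.mean()                      # (1/n) sum_i x^(i)
var_mle = np.mean((x - mu_mle) ** 2)   # (1/n) sum_i (x^(i) - mu_mle)^2
# note the MLE divides by n (not n - 1), so it is a biased variance estimate
print(mu_mle, var_mle)
```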
Multivariate Gaussian
Distribution for $\bar{x} \in \mathbb{R}^d$, $d \geq 2$
Multivariate Gaussian (normal) Distribution
$d \times 1$ mean vector $\bar{\mu}$; $d \times d$ covariance matrix $\Sigma$
$$N(\bar{x} \mid \bar{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\bar{x} - \bar{\mu})^T \Sigma^{-1} (\bar{x} - \bar{\mu}) \right]$$
e.g., for $d = 2$:
$$\bar{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = E\left[ (\bar{x} - \bar{\mu})(\bar{x} - \bar{\mu})^T \right] = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}$$
$\Sigma_{ij}$ measures the covariance between $x_i$ and $x_j$
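A sketch of evaluating this density for d = 2, written out directly from the formula and checked against scipy; the mean and covariance values are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])  # Sigma_12 = Sigma_21: covariance of x_1 and x_2

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a d-dimensional Gaussian, from the formula above."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([0.5, 0.5])
assert np.isclose(mvn_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```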
Likelihood of the Spherical Gaussian
$$N(\bar{x} \mid \bar{\mu}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{1}{2\sigma^2} \|\bar{x} - \bar{\mu}\|^2 \right)$$
• Given $S_n = \{\bar{x}^{(i)}\}_{i=1}^{n}$ drawn i.i.d. according to $N(\bar{x} \mid \bar{\mu}, \sigma^2)$
• Want to find the parameters $\bar{\theta} = (\bar{\mu}, \sigma^2)$ that maximize $p(S_n)$:
$$p(S_n) = \prod_{i=1}^{n} p(\bar{x}^{(i)}) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{1}{2\sigma^2} \|\bar{x}^{(i)} - \bar{\mu}\|^2 \right)$$
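As a sketch, the log of this likelihood and its maximizers in NumPy; the closed-form estimators (the sample mean, and the average squared distance divided by d) are the standard spherical-Gaussian MLEs, used here without the full derivation, and the data is illustrative:

```python
import numpy as np

def spherical_log_likelihood(X, mu, sigma2):
    """log p(S_n) for X of shape (n, d) under N(x | mu, sigma^2 I)."""
    n, d = X.shape
    sq = np.sum((X - mu) ** 2)  # sum_i ||x^(i) - mu||^2
    return -0.5 * n * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.5, size=(500, 2))  # illustrative data

mu_hat = X.mean(axis=0)  # MLE for the mean: the sample mean
sigma2_hat = np.mean(np.sum((X - mu_hat) ** 2, axis=1)) / X.shape[1]  # MLE for sigma^2
```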