• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Matrix Factorization
− Motivation & Approach
− Regularization & Sparsity
− Further Factorization Models
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Motivation: The Netflix Prize
• Training data
– 100 million ratings, 480,000 users, 17,770 movies
– 6 years of data: 2000-2005
• Test data
– Last few ratings of each user (2.8 million)
– Root Mean Square Error (RMSE)
– Netflix’s system RMSE: 0.9514
• Competition
– 2,700+ teams
– $1 million prize for 10% improvement
Evaluating Recommender Systems
• 𝑆 = set of tuples (𝑢, 𝑖) of users 𝑢 that have rated item 𝑖 with a rating of 𝑟_𝑢𝑖
• RMSE = √( (1/|𝑆|) · Σ_{(𝑢,𝑖)∈𝑆} (𝑟_𝑢𝑖 − 𝑟̂_𝑢𝑖)² )
– 𝑟_𝑢𝑖 = true rating of user 𝑢 for item 𝑖, 𝑟̂_𝑢𝑖 = predicted rating
[Figure: 480,000 users × 17,770 movies rating matrix with entries marked as training and validation data, test data, or missing (no rating)]
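A minimal Python sketch of this metric; the list-of-tuples format for 𝑆 and the predict callback are illustrative assumptions, not the actual Netflix data format:

```python
import numpy as np

def rmse(observed, predict):
    """observed: iterable of (u, i, r_ui) tuples, i.e. the rated set S.
    predict: function (u, i) -> predicted rating."""
    sq_errors = [(r_ui - predict(u, i)) ** 2 for u, i, r_ui in observed]
    return np.sqrt(np.mean(sq_errors))
```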
Regression vs. Recommendation
[Figure: two data matrices with entries marked as training and validation data, test data, or missing data; one panel labeled "Regression", the other "Recommendation"]
SVD on Rating Data
• Goal: Make good recommendations
– Good performance on observed (user, item) ratings, i.e. low RMSE
– Generalize to the unseen test data
• Can we use SVD to obtain the solution?
– SVD on the rating matrix 𝑹 ∈ ℝ^{𝑛×𝑑}, where we replace missing entries with zeros
– 𝑹 ≈ 𝑸 · 𝑷ᵀ    // SVD: 𝑹 = 𝑼𝜮𝑽ᵀ ⟶ 𝑸 = 𝑼𝜮, 𝑷 = 𝑽
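A rough sketch of this zero-filling approach with NumPy; the toy matrix and the choice 𝑘 = 2 are made up for illustration:

```python
import numpy as np

# Toy rating matrix; np.nan marks missing entries (values are made up)
R = np.array([[5.0,    np.nan, 1.0,    np.nan],
              [4.0,    2.0,    np.nan, 1.0],
              [np.nan, 1.0,    5.0,    4.0]])

R_filled = np.nan_to_num(R, nan=0.0)            # replace missing entries with zeros
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)

k = 2                                           # number of latent factors to keep
Q = U[:, :k] * s[:k]                            # Q = U·Σ (truncated)
P = Vt[:k, :].T                                 # P = V (truncated)
R_hat = Q @ P.T                                 # rank-k approximation of R
```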
SVD on Rating Data
• SVD gives minimum reconstruction error (Sum of Squared Errors):
min_{𝑼,𝜮,𝑽} Σ_{u=1…n} Σ_{i=1…d} ( 𝑹_u𝑖 − (𝑼𝜮𝑽ᵀ)_u𝑖 )²
– SSE and RMSE are monotonically related: RMSE = (1/√c) · √SSE, where c is the number of summed entries
− Great news: SVD is minimizing RMSE
• Complication:
– The sum in the SVD error term is over all entries (no-rating = zero-rating)
– But our 𝑹 has missing entries!
– (Classical) SVD isn’t defined when entries are missing!
• Also: We actually don’t care about orthogonality and normalization
Discussion: SVD/PCA
• Optimal low-rank approximation
− In terms of Frobenius norm (and also in terms of the spectral norm)
• Missing elements (we have no information/not observed) are treated as 0 (a very low rating)
– Critical for many applications (e.g. recommender systems)
– General problem: two kinds of “zeros” (not observed/missing ≠ 0)
• Lack of Sparsity
– Singular vectors are dense
– Potential interpretability problem
• Orthogonality really required/useful?
Latent Factor Models
• Our goal: Find 𝑸 ∈ ℝ^{𝑛×𝑘} and 𝑷 ∈ ℝ^{𝑑×𝑘} such that:
  min_{𝑷,𝑸} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²      // 𝒒_𝑢 = row 𝑢 of 𝑸, 𝒑_𝑖 = row 𝑖 of 𝑷
– We only sum over existing entries, i.e. the set 𝑆 = { (𝑢, 𝑖) | 𝑟_𝑢𝑖 ≠ missing }
– We don’t require columns of 𝑷, 𝑸 to be orthogonal or unit length
• Use standard optimization techniques to solve this problem
Finding the Latent Factors (I)
• Goal: min_{𝑷,𝑸} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )² =: min_{𝑷,𝑸} 𝑓(𝑷, 𝑸)
• One approach: Alternating Optimization // a.k.a. block coordinate minimization
– Pick initial values for 𝑷 and Q
– Alternatingly keep one variable fixed and optimize for the other
– Repeat until convergence
• Pseudo-Code:
1. initialize 𝑷(0), 𝑸(0), 𝑡 = 0
2. 𝑷(𝑡+1) = argmin_𝑷 𝑓(𝑷, 𝑸(𝑡))
3. 𝑸(𝑡+1) = argmin_𝑸 𝑓(𝑷(𝑡+1), 𝑸)
4. 𝑡 ← 𝑡 + 1
5. goto 2 until convergence
Finding the Latent Factors (II)
• Initialization of 𝑷 and 𝑸:
– Use, e.g., SVD where missing entries are replaced by 0 (or overall mean)
• How to solve 𝑷(𝑡+1) = argmin_𝑷 𝑓(𝑷, 𝑸(𝑡)) = argmin_𝑷 Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²
– Observation 1: Since 𝑸 is fixed, the problem can be solved for each vector 𝒑_𝑖 independently:
  min_𝑷 Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )² = Σ_{𝑖=1…𝑑} min_{𝒑_𝑖} Σ_{𝑢∈𝑆_{∗,𝑖}} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²
  where 𝑆_{∗,𝑖} = { 𝑢 | (𝑢, 𝑖) ∈ 𝑆 }    // = only users 𝑢 who have rated item 𝑖
– Equivalently for vector 𝒒_𝑢 based on the set 𝑆_{𝑢,∗} = { 𝑖 | (𝑢, 𝑖) ∈ 𝑆 }
Finding the Latent Factors (III)
• Observation 2: min_{𝒑_𝑖} Σ_{𝑢∈𝑆_{∗,𝑖}} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )² is an ordinary least squares regression problem
– min_𝒘 Σ_{j=1…𝑛} ( 𝑦_𝑗 − 𝒘ᵀ𝒙_𝑗 )²    // 𝑦_𝑗 is scalar, 𝒙_𝑗 and 𝒘 (column) vectors
– Optimal solution in closed form: 𝒘 = ( (1/𝑛) Σ_{j=1…𝑛} 𝒙_𝑗 𝒙_𝑗ᵀ )⁻¹ ⋅ ( (1/𝑛) Σ_{j=1…𝑛} 𝒙_𝑗 𝑦_𝑗 )
– In our case: 𝒑_𝑖ᵀ = ( (1/|𝑆_{∗,𝑖}|) Σ_{𝑢∈𝑆_{∗,𝑖}} 𝒒_𝑢ᵀ 𝒒_𝑢 )⁻¹ ⋅ ( (1/|𝑆_{∗,𝑖}|) Σ_{𝑢∈𝑆_{∗,𝑖}} 𝒒_𝑢ᵀ 𝑟_𝑢𝑖 )
– Equivalently for 𝒒_𝑢
• Computation of argmin_𝑷 𝑓(𝑷, 𝑸(𝑡)) reduces to a standard problem
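A compact sketch of the whole alternating scheme; it uses NumPy's least-squares solver for each per-vector OLS problem instead of spelling out the normal equations above, and the function name, random initialization, and (u, i, r) tuple format are illustrative assumptions:

```python
import numpy as np

def als(ratings, n_users, n_items, k=10, n_iters=20, seed=0):
    """Alternating optimization for min sum_{(u,i) in S} (r_ui - q_u . p_i)^2.
    ratings: iterable of (u, i, r_ui) tuples, i.e. the observed set S."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(scale=0.1, size=(n_users, k))   # random init (SVD init also possible)
    P = rng.normal(scale=0.1, size=(n_items, k))
    by_item, by_user = {}, {}                      # S_{*,i} and S_{u,*}
    for u, i, r in ratings:
        by_item.setdefault(i, []).append((u, r))
        by_user.setdefault(u, []).append((i, r))
    for _ in range(n_iters):
        for i, pairs in by_item.items():           # P-step: one OLS problem per item, Q fixed
            A = Q[[u for u, _ in pairs]]           # |S_{*,i}| x k design matrix
            y = np.array([r for _, r in pairs])
            P[i] = np.linalg.lstsq(A, y, rcond=None)[0]
        for u, pairs in by_user.items():           # Q-step: one OLS problem per user, P fixed
            A = P[[i for i, _ in pairs]]
            y = np.array([r for _, r in pairs])
            Q[u] = np.linalg.lstsq(A, y, rcond=None)[0]
    return Q, P
```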
Alternating Optimization: Discussion
• May provide solution to difficult optimization problems
• Here: sequence of simple OLS problems
– Overall algorithm can be implemented in a few lines in, e.g., Python
– Quite efficient: since data is sparse, the sets 𝑆∗,𝑖 (and 𝑆𝑢,∗) are relatively small
• Drawback of Alternating Optimization:
– Solution is only an approximation
– No guarantee that the result is close to the optimal solution
– Highly depends on initial solution
SGD for Matrix Factorization
• Stochastic Gradient Descent as an alternative
• SGD on the objective 𝐿 := Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²
– Pick a random user 𝑢 and a random item 𝑖 with rating 𝑟_𝑢𝑖 (batch size 1)
– Compute the gradients w.r.t. the parameters: ∂𝐿/∂𝒒_𝑢 and ∂𝐿/∂𝒑_𝑖
– Update the parameters: 𝒒_𝑢 ← 𝒒_𝑢 − 𝜂 ∂𝐿/∂𝒒_𝑢,   𝒑_𝑖 ← 𝒑_𝑖 − 𝜂 ∂𝐿/∂𝒑_𝑖
• In this case the gradient update step is rather simple
– 𝑒_𝑢𝑖 ← 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ    // helper variable, the current error
– 𝒒_𝑢 ← 𝒒_𝑢 + 2𝜂 𝑒_𝑢𝑖 𝒑_𝑖
– 𝒑_𝑖 ← 𝒑_𝑖 + 2𝜂 𝑒_𝑢𝑖 𝒒_𝑢
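A hedged sketch of this SGD loop; the hyperparameters and the (u, i, r) tuple format are illustrative choices:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=10, lr=0.01, n_epochs=20, seed=0):
    """SGD (batch size 1) on L = sum_{(u,i) in S} (r_ui - q_u . p_i)^2."""
    ratings = list(ratings)
    rng = np.random.default_rng(seed)
    Q = rng.normal(scale=0.1, size=(n_users, k))
    P = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):  # visit observed entries in random order
            u, i, r = ratings[idx]
            q_u, p_i = Q[u].copy(), P[i].copy()    # use old values for both gradient steps
            e = r - q_u @ p_i                      # current error e_ui
            Q[u] += 2 * lr * e * p_i               # q_u <- q_u + 2*eta*e_ui*p_i
            P[i] += 2 * lr * e * q_u               # p_i <- p_i + 2*eta*e_ui*q_u
    return Q, P
```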
Rating Prediction
How to estimate the missing rating of user 𝑢 for item 𝑖?
𝑟̂_𝑢𝑖 = 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ = Σ_{𝑓=1…𝑘} 𝑞_𝑢𝑓 ⋅ 𝑝_𝑖𝑓      // 𝒒_𝑢 = row 𝑢 of 𝑸, 𝒑_𝑖 = row 𝑖 of 𝑷
Side-note: The BellKor System
• The winner of the Netflix Challenge used matrix factorization as one building block
– The overall model is a combination of multiple ideas
• Multi-scale modeling of the data: combine a top-level, “regional” modeling of the data with a refined, local view:
– Global effects:
  − Overall deviations of users/movies
– Factorization:
  − Addressing “regional” effects
– Collaborative filtering:
  − Extract local patterns
User and Item Biases
• Certain users might give overly optimistic or pessimistic ratings
• Certain movies tend to always have low ratings
• We introduce additional bias terms to capture these effects
– Each user 𝑢 can have a bias term 𝑏𝑢
– Each item 𝑖 can have a bias term 𝑏_𝑖
– Additional global bias term 𝑏 shared by everyone
• The resulting optimization problem can easily be solved with SGD:
  min_{𝑸,𝑷,𝑏_𝑢,𝑏_𝑖,𝑏} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − (𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ + 𝑏_𝑢 + 𝑏_𝑖 + 𝑏) )²
• The cold start problem is another important issue
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Matrix Factorization
− Motivation & Approach
− Regularization & Sparsity
− Further Factorization Models
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Challenge: Overfitting
• Final Goal: Minimize SSE for unseen test data
• Proxy: Minimize SSE on training data
– Want large 𝑘 (number of factors) to capture all the signals
– But, SSE on test data begins to rise for larger 𝑘
• Classical example of overfitting:
– With too many degrees of freedom (too many free parameters) the model starts fitting noise
– The model fits the training data too well but does not generalize to unseen test data
Challenge: Overfitting
• Problem can easily be seen in our scenario:
– In each step of the alternating optimization we solve an OLS regression
  − Number of regression parameters = 𝑘 = number of latent factors
  − Number of data points used for regression = cardinality of 𝑆_{∗,𝑖} / 𝑆_{𝑢,∗}
– Lots of parameters but not enough data points → regression can overfit
– If 𝑘 is large this might even lead to an underdetermined system of equations
– Problem is ill-posed
• Solution: Regularization
– Interpretation for underdetermined systems: “In the absence of any other information, the parameter vector should be small (i.e. only small effect of features)“
– Equivalently: We put a prior on the weights
Regularization
• To solve overfitting we introduce regularization:
– Allow a rich model where we have sufficient data
– Shrink aggressively where data are scarce
min_{𝑷,𝑸} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²  +  𝜆₁ Σ_𝑢 ‖𝒒_𝑢‖² + 𝜆₂ Σ_𝑖 ‖𝒑_𝑖‖²
        “error”                                       “length”
𝜆₁ and 𝜆₂ are user-defined regularization parameters
• Note: We do not care about the actual value of the objective function, but we care about the 𝑷, 𝑸 that achieve the minimum of the objective
The Effect of Regularization
[Figure (shown in several animation steps): movies embedded in a two-dimensional latent factor space along a “less action” ⟷ “more action” axis; titles include Ocean’s 11, The Lion King, The Color Purple, Sense and Sensibility, Braveheart, Lethal Weapon, The Princess Diaries, and Dumb and Dumber]
min_{𝑷,𝑸} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²  +  𝜆₁ Σ_𝑢 ‖𝒒_𝑢‖² + 𝜆₂ Σ_𝑖 ‖𝒑_𝑖‖²
        “error”                                       “length”
Regularization in Our Use Case
• Regularized Problem:
  min_{𝑷,𝑸} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )²  +  𝜆₁ Σ_𝑢 ‖𝒒_𝑢‖² + 𝜆₂ Σ_𝑖 ‖𝒑_𝑖‖²
• Recap of unregularized problem: In each iteration of the alternating optimization we solved an OLS regression problem
• For the regularized version this becomes ridge regression
  min_{𝒑_𝑖} Σ_{𝑢∈𝑆_{∗,𝑖}} ( 𝑟_𝑢𝑖 − 𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ )² + 𝜆₂ ‖𝒑_𝑖‖²
– Effect: parameter values are forced to become smaller
– Large values that only capture noise are avoided
• You know how to solve this!
– Closed form solution (see linear regression slides)
– Gradient descent
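For reference, one ridge-regression step of the regularized alternating optimization could look like the following; the function name and default λ₂ are illustrative:

```python
import numpy as np

def solve_item_factor(Q_users, r_users, lam2=0.1):
    """Solve min_{p_i} sum_u (r_ui - q_u . p_i)^2 + lam2 * ||p_i||^2 in closed form.
    Q_users: |S_{*,i}| x k matrix stacking the q_u of users who rated item i.
    r_users: vector of their ratings r_ui."""
    k = Q_users.shape[1]
    A = Q_users.T @ Q_users + lam2 * np.eye(k)   # regularized normal equations
    b = Q_users.T @ r_users
    return np.linalg.solve(A, b)                 # the new p_i
```

Compared to the unregularized OLS step, the only change is the λ₂·I term added to the normal equations, which keeps the system well-conditioned even when item 𝑖 has very few ratings.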
SGD for MF with Regularization and Biases
• SGD on the objective
  𝐿 := min_{𝑸,𝑷,𝑏_𝑢,𝑏_𝑖,𝑏} Σ_{(𝑢,𝑖)∈𝑆} ( 𝑟_𝑢𝑖 − (𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ + 𝑏_𝑢 + 𝑏_𝑖 + 𝑏) )²  +  𝜆₁ Σ_𝑢 ‖𝒒_𝑢‖² + 𝜆₂ Σ_𝑖 ‖𝒑_𝑖‖²
• Pick a random user 𝑢 and a random item 𝑖 with rating 𝑟_𝑢𝑖 (batch size 1)
• The gradient update step is very similar to before
– 𝑒_𝑢𝑖 ← 𝑟_𝑢𝑖 − (𝒒_𝑢 ⋅ 𝒑_𝑖ᵀ + 𝑏_𝑢 + 𝑏_𝑖 + 𝑏)    // helper variable, the current error
– 𝒒_𝑢 ← 𝒒_𝑢 + 2𝜂 (𝑒_𝑢𝑖 𝒑_𝑖 − 𝜆₁ 𝒒_𝑢)
– 𝒑_𝑖 ← 𝒑_𝑖 + 2𝜂 (𝑒_𝑢𝑖 𝒒_𝑢 − 𝜆₂ 𝒑_𝑖)
– 𝑏_𝑢 ← 𝑏_𝑢 + 𝜂 𝑒_𝑢𝑖
– 𝑏_𝑖 ← 𝑏_𝑖 + 𝜂 𝑒_𝑢𝑖
– Usually we directly set 𝑏 to the global mean: 𝑏 = (1/|𝑆|) Σ_{(𝑢,𝑖)∈𝑆} 𝑟_𝑢𝑖
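Putting the pieces together, a sketch of the full SGD training loop with regularization and biases; all hyperparameter values are illustrative defaults, not tuned:

```python
import numpy as np

def sgd_mf_biased(ratings, n_users, n_items, k=10, lr=0.01,
                  lam1=0.05, lam2=0.05, n_epochs=20, seed=0):
    """SGD for regularized matrix factorization with user/item/global biases."""
    ratings = list(ratings)                        # list of (u, i, r_ui) tuples
    rng = np.random.default_rng(seed)
    Q = rng.normal(scale=0.1, size=(n_users, k))
    P = rng.normal(scale=0.1, size=(n_items, k))
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    b = np.mean([r for _, _, r in ratings])        # global bias = mean observed rating
    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):  # batch size 1, random order
            u, i, r = ratings[idx]
            q_u, p_i = Q[u].copy(), P[i].copy()
            e = r - (q_u @ p_i + b_u[u] + b_i[i] + b)   # current error e_ui
            Q[u] += 2 * lr * (e * p_i - lam1 * q_u)
            P[i] += 2 * lr * (e * q_u - lam2 * p_i)
            b_u[u] += lr * e
            b_i[i] += lr * e
    return Q, P, b_u, b_i, b

# Prediction for a pair (u, i): r_hat = Q[u] @ P[i] + b_u[u] + b_i[i] + b
```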
L2 vs. L1 Regularization
• Comparison: L2 vs. L1 regularization
– L2 tries to “shrink” the parameter vector 𝒘 equally
− Large values are highly penalized due to the square in the L2 norm
− Unlikely that any component will be exactly 0
– L1 allows large values in individual components of 𝒘 by shrinking other components to 0
– L1 is suited to enforce sparsity of the parameter vector
• Why sparsity?
– Better interpretation
− Only few values contribute to result
− It is unintuitive that sparse input data would be generated from a dense signal
– Less storage, faster processing
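A small synthetic illustration of the L1-vs-L2 contrast above with scikit-learn; the data, the alpha values, and the zero-threshold are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)            # L2 regularization
lasso = Lasso(alpha=0.1).fit(X, y)            # L1 regularization
print("non-zero coefficients (L2):", np.sum(np.abs(ridge.coef_) > 1e-3))  # typically all 20
print("non-zero coefficients (L1):", np.sum(np.abs(lasso.coef_) > 1e-3))  # typically ~3
```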
L2 vs. L1 Intuition
• An L1 constraint promotes sparsity
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Matrix Factorization
− Motivation & Approach
− Regularization & Sparsity
− Further Factorization Models
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Further Factorization Models
• Matrix factorization is extremely powerful
– Dimensionality reduction, data analysis/data understanding, prediction of
missing values, …
• Various extensions have been proposed
– Enforcing different constraints or operating on different data types
– Important goal: better interpretation of result
• Non-Negative Matrix Factorization (next slides)
• Boolean Matrix Factorization
– Factorize Boolean 𝑨 in Boolean 𝑸 and 𝑷
Non-Negative Matrix Factorization
• Often data is given in form of non-negative values
– Rating values between 1 and 5; income, age, … of persons; number of words in a
document; grayscale images; etc.
• However: SVD (and the other approaches presented before) might lead to
factors containing negative values
– Difficult to interpret: non-negative data is “generated” based on negative
factors; what do these negative factors mean?
– Predicted values might also become negative
• Solution: Non-Negative Matrix Factorization
Non-Negative Matrix Factorization
• Task: Factorize non-negative 𝑨 in non-negative 𝑸 and 𝑷, i.e. 𝐀 ≈ 𝑸 ⋅ 𝑷𝑻
• Formally:
– Given 𝑨 ∈ ℝ^{𝑛×𝑑} with 𝑨_𝑖𝑗 ≥ 0 and an integer 𝑘, find 𝑸 ∈ ℝ^{𝑛×𝑘} and 𝑷 ∈ ℝ^{𝑑×𝑘} such that ‖𝑨 − 𝑸 ⋅ 𝑷ᵀ‖_F is minimized subject to 𝑸 ≥ 0 and 𝑷 ≥ 0
  min_{𝑷≥0, 𝑸≥0} ‖𝑨 − 𝑸 ⋅ 𝑷ᵀ‖_F
– Constrained optimization
NNMF Example: Olivetti Faces Data
[Figure: non-negative basis components learned by NMF from the Olivetti faces data; results according to scikit-learn]
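The faces example can be reproduced roughly with scikit-learn; the number of components and the initialization below are illustrative choices, not necessarily those used for the slide's figure:

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

faces = fetch_olivetti_faces(shuffle=True, random_state=0)  # downloads the data on first use
X = faces.data                                              # 400 images x 4096 non-negative pixel values

nmf = NMF(n_components=16, init="nndsvda", max_iter=400, random_state=0)
Q = nmf.fit_transform(X)           # per-image coefficients (n x k), non-negative
P_T = nmf.components_              # basis images / "parts"  (k x d), non-negative
X_hat = Q @ P_T                    # non-negative rank-k reconstruction
```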
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Matrix Factorization
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Preserving global vs. preserving local similarity
• PCA tries to find a global structure
– Can lead to local inconsistencies
– Far-away points can become nearest neighbors
• Illustration (during lecture):
• Idea: Preserve local structure instead
• Example: https://projector.tensorflow.org/
Neighbor Graph Methods
• All methods we have seen so far are based on matrix factorizations
• An alternative class of methods based on neighbor graphs
High-dimensional data ⟶ Neighbor / similarity graph
• Idea: Low dimensional neighborhood similar to the original neighborhood
Neighbor Graph Methods
1. Construct neighbor graph of high-dimensional data
2. Initialize points in low-dimensional space
3. Optimize coordinates in low-dimensional space s.t. similarities align
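Step 1 can be sketched with scikit-learn's nearest-neighbor graph construction; the toy data and the choice of 15 neighbors are illustrative, and steps 2 and 3 depend on the specific method (e.g. t-SNE on the next slide):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(0).normal(size=(500, 50))   # toy high-dimensional data

# Step 1: k-nearest-neighbor graph of the high-dimensional points
A = kneighbors_graph(X, n_neighbors=15, mode="connectivity", include_self=False)
print(A.shape, A.nnz)   # sparse 500 x 500 adjacency matrix

# Steps 2 and 3: initialize low-dimensional coordinates and optimize them so that
# neighborhoods in the embedding match this graph (method-specific, e.g. t-SNE, UMAP).
```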
t-SNE: t-Distributed Stochastic Neighbor Embedding
• High-dimensional similarities for input 𝒙_𝑖:
  𝑝_{𝑗|𝑖} = exp( −‖𝒙_𝑖 − 𝒙_𝑗‖² / (2𝜎_𝑖²) ) / Σ_{𝑘≠𝑖} exp( −‖𝒙_𝑖 − 𝒙_𝑘‖² / (2𝜎_𝑖²) )
  𝑝_𝑖𝑗 = ( 𝑝_{𝑗|𝑖} + 𝑝_{𝑖|𝑗} ) / (2𝑛)
– 𝜎_𝑖 chosen to achieve a fixed perplexity (effective number of neighbors)
• Low-dimensional similarities for (to be optimized) parameters 𝒚_𝑖:
  𝑞_𝑖𝑗 = ( 1 + ‖𝒚_𝑖 − 𝒚_𝑗‖² )⁻¹ / Σ_{𝑘≠𝑙} ( 1 + ‖𝒚_𝑘 − 𝒚_𝑙‖² )⁻¹,      𝑞_𝑖𝑖 = 0
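In practice these formulas are rarely implemented by hand; a typical usage sketch with scikit-learn, where the dataset and hyperparameters are illustrative, is:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1,797 digit images, 64 features each

# perplexity ~ effective number of neighbors (controls sigma_i per point)
emb = TSNE(n_components=2, perplexity=30.0, init="pca",
           random_state=0).fit_transform(X)
print(emb.shape)                             # (1797, 2): the low-dimensional coordinates y_i
```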