Machine learning lecture slides
COMS 4771 Fall 2020
Overview
Questions
- Please use Piazza Live Q&A
Outline
- A “bird’s eye view” of machine learning
- About COMS 4771
Figure 1: Predict the bird species depicted in a given image.
[Slide shows an excerpt from an IEEE Computer cover-feature article on matrix factorization techniques for recommender systems. In brief: matrix factorization models map users and items to a joint latent factor space of dimensionality f, so that user-item interactions are modeled as inner products in that space; because the user-item ratings matrix is sparse (conventional SVD is undefined with missing entries, and imputation is expensive and can distort the data), the factor vectors are learned from the observed ratings only, with regularization to avoid overfitting. The excerpt includes the article’s Figure 2, a simplified illustration of the latent factor approach that characterizes both users and movies using two axes: geared toward males versus geared toward females, and serious versus escapist.]
Figure 2: Predict how a given user would rate a given movie.
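In the model from the excerpt above, each item i is associated with a vector q_i ∈ R^f and each user u with a vector p_u ∈ R^f, and user u’s rating of item i, denoted r_ui, is estimated as

    \hat{r}_{ui} = q_i^\top p_u \qquad (1)

The factor vectors are learned by minimizing the regularized squared error over the set \kappa of (u, i) pairs with known ratings (the training set):

    \min_{q^*,\, p^*} \sum_{(u,i) \in \kappa} \left( r_{ui} - q_i^\top p_u \right)^2 + \lambda \left( \| q_i \|^2 + \| p_u \|^2 \right) \qquad (2)

where the constant \lambda controls the extent of regularization. (The training-set symbol \kappa and the constant \lambda did not survive extraction and are restored here as in the original article.)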
Figure 3: Predict the French translation of a given English sentence.
Figure 4: Predict the “win probability” of a given move in a given game state.
How to “solve” problems without ML?
- Image classification:
  - Recruit a “bird expert” to teach you about different birds’ features (e.g., beak shape, feather color, typical environment)
  - Recognize these features in a given image, and then come up with a best guess of the bird species
- Recommender system:
  - Ask the user to self-report specific movie genres of interest (e.g., horror, sci-fi)
  - Ask movie suppliers to categorize movies into the same genres
  - Predict a high rating for any movie in a user’s genre of interest, and a low rating for all other movies (see the sketch below)
- Machine translation: . . .
- Chess: . . .
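For concreteness, here is a minimal Python sketch of the hand-engineered recommender described above; the genre lists and cutoff ratings are hypothetical, purely for illustration:

    # Hypothetical rule-based recommender: no learning involved.
    # Users self-report genres of interest; suppliers categorize movies.
    user_genres = {"alice": {"horror", "sci-fi"}}
    movie_genres = {"Alien": {"horror", "sci-fi"}, "Amadeus": {"drama"}}

    def predict_rating(user: str, movie: str) -> float:
        # High rating for any movie in a genre of interest; low rating otherwise.
        if user_genres[user] & movie_genres[movie]:
            return 5.0
        return 1.0

    print(predict_rating("alice", "Alien"))    # 5.0
    print(predict_rating("alice", "Amadeus"))  # 1.0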
Work in ML
- Applied ML
  - Collect/prepare data, build/train models, analyze performance/errors/etc.
- ML developer
  - Implement ML algorithms and infrastructure
- ML research
  - Design/analyze models and algorithms
Note: These roles are not mutually exclusive!
Mathematical and computational prerequisites
- Math
  - Linear algebra, probability, multivariable calculus
  - Understand and reason about the concepts (not just the calculations)
- Software/programming
  - Much ML work is implemented in Python with libraries such as NumPy and PyTorch (a small example follows below)
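As a small illustration of why such libraries are used (this example is not from the slides), NumPy replaces explicit loops with vectorized array operations:

    import numpy as np

    # Evaluate a linear function on a batch of inputs in one step.
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])    # two inputs, two features each
    w = np.array([0.5, -0.25])    # weight vector
    yhat = X @ w                  # matrix-vector product: all predictions at once
    print(yhat)                   # [0.  0.5]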
Basic setting: supervised learning
- Training data: a dataset consisting of labeled examples
- Labeled example: a pair of the form (input, label)
  - Input: what you see before you make a prediction (a.k.a. context, side-information, features, etc.)
  - Label: output value (a.k.a. output, response, target, etc.)
- Goal: learn a predictor (i.e., a prediction function) to predict the label from the input for new examples (a minimal sketch follows below)
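A minimal sketch of this setting; the data and the hand-picked predictor here are made up for illustration:

    # Training data: labeled examples, i.e., (input, label) pairs.
    train = [((5.0, 1.2), "bird"), ((0.3, 8.9), "plane"), ((4.1, 0.7), "bird")]

    # A predictor maps an input to a predicted label.
    def predict(x):
        # e.g., a hand-picked rule: "bird" whenever the first feature is large
        return "bird" if x[0] > 2.0 else "plane"

    # At prediction time, only the input of a new example is available.
    print(predict((3.5, 2.0)))  # "bird"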
Figure 5: Decision tree
Figure 6: Linear classifier (“Perceptron”)
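As a minimal sketch of how such a linear classifier can be trained, here is the classic Perceptron update rule; the toy data and epoch count are assumptions for illustration:

    import numpy as np

    def perceptron(X, y, epochs=10):
        # X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (w @ x_i) <= 0:  # mistake: predicted sign disagrees with label
                    w += y_i * x_i        # Perceptron update
        return w

    # Linearly separable toy data
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w = perceptron(X, y)
    print(np.sign(X @ w))  # [ 1.  1. -1. -1.]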
Figure 7: Neural network (inputs, hidden units, outputs)
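As a minimal sketch matching the figure’s input/hidden/output structure, a forward pass through one hidden layer looks like this; the layer sizes, weights, and ReLU nonlinearity are assumptions, not specified on the slide:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    # One hidden layer: 3 inputs -> 4 hidden units -> 1 output.
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

    x = np.array([0.5, -1.0, 2.0])
    h = relu(W1 @ x + b1)  # hidden unit activations
    out = W2 @ h + b2      # network output
    print(out)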
Types of prediction problems
- Binary classification
  - Given an email, is it spam or not?
  - (What’s the probability that it is spam?)
- Multi-class classification
  - Given an image, what animal is depicted?
  - (Or which animals are depicted?)
- Regression
  - Given clinical measurements, what is the level of tumor antigens?
  - (On an absolute scale? On a log scale?)
- Structured output prediction
  - Given a sentence, what is its grammatical parse tree?
  - (Or its dependency tree?)
- . . . (the label spaces are contrasted below)
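One way to keep these problem types straight is by their label spaces; a purely illustrative snippet:

    # Same kind of pipeline, different label spaces.
    y_binary = 1                         # spam or not: label in {0, 1}
    y_multiclass = "cat"                 # one of K animal classes
    y_regression = 3.72                  # a real number (e.g., antigen level)
    y_structured = ("S", ("NP", "VP"))   # a (toy) parse tree as nested tuples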
Template of supervised learning pipeline
- Get data
- Determine a representation of the data and a predictive model for it
- Train the predictor (a.k.a. model fitting, parameter estimation)
- Evaluate the predictor (test the “goodness of fit”)
- Deploy the predictor in an application (an end-to-end sketch follows below)
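A minimal end-to-end sketch of this template, using synthetic data and a 1-nearest-neighbor predictor; both are assumptions for illustration, not the course’s prescribed approach:

    import numpy as np

    # 1. Get data (synthetic: two Gaussian blobs with labels 0 and 1).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # 2. Representation/model: 1-nearest neighbor under Euclidean distance.
    # 3. "Training" for 1-NN just memorizes the training examples.
    X_train, y_train = X[::2], y[::2]
    X_test, y_test = X[1::2], y[1::2]

    def predict(x):
        return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

    # 4. Evaluate the predictor on held-out data.
    accuracy = np.mean([predict(x) == t for x, t in zip(X_test, y_test)])
    print(f"test accuracy: {accuracy:.2f}")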
Questions
- What is the core prediction problem?
- What features (i.e., predictor variables) are available?
- Will these features be available at the time of prediction?
- Is there enough information (e.g., training data, features) to learn the relationship between the input and output?
- What are the modeling assumptions?
- Is high-accuracy prediction a useful goal for the application?
Where do assumptions / domain expertise come in?
- Form of the prediction function
- Choice of features
- Choice of training data
- Choice of learning algorithm
- Choice of objective function and constraints
Challenges
- Might not have the “right” data
- Might not have the “right” model
- Might under-fit the data
- Might over-fit the data
- Data might be corrupted, noisy, . . .
Key statistical/algorithmic ideas in ML
- Plug-in principle
- Inductive bias
- Linearity
- Mathematical optimization
About COMS 4771
- Basic principles and methods of supervised machine learning
  1. Appetizer: nearest neighbor rules (a “non-parametric” method)
  2. Statistical model for prediction
  3. Regression
     - Why? Clean, simple, and illustrates important concepts (linearity, inductive bias, regularization, kernels)
  4. Classification
  5. Optimization methods for machine learning
     - Convex optimization, neural networks
  6. Maybe one other topic, if time permits . . .
- This is not a course about how to use sklearn, tensorflow, etc.
  - Also not about the latest nonsense on arXiv
- Good stuff beyond COMS 4771:
  - COMS 4252, 4773: Mathematical theory of learning
  - COMS 4774: Unsupervised learning
  - COMS 4775: Causal inference
  - . . .
About me
- Professor Daniel Hsu
  - Okay to call me “Daniel”!
  - “Professor Hsu” is also okay
  - “Professor Daniel” is a little weird
- At Columbia since 2013
  - Previously at Microsoft Research, Rutgers, UPenn, UC San Diego, UC Berkeley, . . .
- Research interests: algorithms, statistics, & combining the two
- Good at: LaTeX-hacking
- Bad at: making slides
About you
- I assume you have fluency in
  - multivariable calculus,
  - linear algebra, and
  - elementary probability (no measure theory needed)
- I also assume you can read and write programs in Python
  - (and can read online documentation to learn, e.g., how to do I/O with CSV files)
- See Courseworks for a “Python basics” Jupyter notebook to brush up on Python, NumPy, etc.
- Let me know why you are interested in ML!
  - Part of HW 1.
Administrative stuff
- Website: https://www.cs.columbia.edu/~djhsu/ML
  - Schedule for office hours/lectures/homework/quizzes/exam
  - Syllabus
- Course format:
  - Lecture/recitation: online over Zoom
  - “On Campus” people: check email about in-person lectures
- Course assistants (CAs):
  - Andy, Andrea, Wonjun, Serena
  - Links for online office hours will be posted on Courseworks
- Technology:
  - Piazza: communicate with course staff (live Q&A and offline)
  - Courseworks: retrieve assignments, quizzes, data files, etc.
  - Gradescope: submit homework write-ups and code
  - Slack: discussion with fellow classmates
- Disability services:
  - Please make arrangements with disability services ASAP
Academic rules of conduct
- See the syllabus
- Cheating: don’t do it
  - If unsure about something, ask!
  - The consequence is an automatic fail
  - Cheating out of desperation is also cheating
  - Instead: get help early
  - We are here to help
- Okay to work on homework in groups of ≤ 3
  - No collaboration across groups
  - No diffusion of responsibility
- No collaboration at all on quizzes or exams
Reading assignments
- There are some required reading assignments (mostly from handouts posted on the course website)
- Unfortunately, most textbooks on ML are not appropriate for this course
  - The closest is “A Course in Machine Learning” by Daumé
- I have selected some optional reading assignments from a few books to supplement the lectures
  - All of these books are available online
Questions?