LECTURE 5 TERM 2:
MSIN0097
Predictive Analytics
A P MOORE
Individual coursework
Individual Coursework assignment has been extended by one week
to Friday 5th March 2021 at 10:00 am
USING OTHER PEOPLE’S CODE
— Wojciech Zaremba (@woj_zaremba) February 4, 2021
MACHINE LEARNING JARGON
— Model
— Interpolating / Extrapolating
— Data Bias
— Noise / Outliers
— Learning algorithm
— Inference algorithm
— Supervised learning
— Unsupervised learning
— Classification
— Regression
— Clustering
— Decomposition
— Parameters
— Optimisation
— Training data
— Testing data
— Error metric
— Linear model
— Parametric model
— Model variance
— Model bias
— Model generalization
— Overfitting
— Goodness-of-fit
— Hyper-parameters
— Failure modes
— Confusion matrix
— True Positive
— False Negative
— Partition
— Margin
— Data density
— Hidden parameter
— High dimensional space
— Low dimensional space
— Separable data
— Manifold / Decision surface
— Hyper cube / volume / plane
A – B – C – D ALGORITHMIC APPROACHES
Supervised
A. Classification
B. Regression
Unsupervised
C. Clustering
D. Decomposition
Hidden variables
Density estimation
Manifolds
Subspaces
QUESTIONS
— How would I know whether my data would benefit from a transformation to a higher- or lower-dimensional space?
CURSE OF DIMENSIONALITY
https://www.nature.com/articles/s41592-018-0019-x
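One symptom of the curse of dimensionality can be checked numerically: as the number of dimensions grows, distances from a query point to random points become nearly equal, so "nearest" and "farthest" neighbours are hard to tell apart. The sketch below is only an illustration under assumed settings (uniform points in the unit hypercube, 200 points, the function name `distance_contrast` is made up for this note).

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative spread (max - min) / min of the Euclidean distances from one
    random query point to n_points uniform points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Assumed dimensions, chosen only to show the trend.
contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dim={d:5d}  relative contrast={c:.3f}")
```

The relative contrast shrinks as the dimension increases, which is one reason distance-based methods such as K-means can struggle in very high-dimensional spaces.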
QUESTIONS
— Would I always have to visualize the data in 2D or 3D to see whether they are more separable? (That would seem to defeat the purpose of moving to a higher-dimensional space, which cannot be visualized.)
SUMMARY STATISTICS
Anscombe’s quartet
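Anscombe's quartet is the standard warning that summary statistics alone can hide structure. The sketch below types in the four published datasets and computes their means, variances and Pearson correlations using only the standard library; the variable names and the helper `pearson` are made up for this note.

```python
import math
from statistics import mean, variance

# Anscombe (1973): four datasets of 11 (x, y) pairs each.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


summary = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

The four rows of statistics are nearly identical, yet scatter plots of the four datasets look completely different, which is the argument for plotting before modelling.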
SUMMARY STATISTICS
https://seaborn.pydata.org/examples/scatterplot_matrix.html
QUESTIONS?
— Do I have to go all the way through modelling (e.g. classification), evaluate a metric such as the Gini coefficient, and then compare the Gini scores obtained after adding extra dimensions?
QUESTIONS?
— Am I right that in some cases it is better to go up a dimension, and in other cases it is better to go down a dimension?
MULTIPLE MODELS
K-means
K-MEANS LLOYD–FORGY ALGORITHM
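As a rough sketch of Lloyd's algorithm with Forgy initialisation (pick k data points at random as the starting centroids, then alternate assignment and update steps until nothing changes). The toy data, the function name `kmeans` and its defaults are assumptions made for this note, not the lecture's own code.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Forgy initialisation
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (assumed): two round blobs centred at (0, 0) and (5, 5).
rng = random.Random(1)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

centroids, labels = kmeans(points, k=2)
print(sorted(centroids))
```

Because every point is assigned to exactly one centroid by Euclidean distance, this works well on round, similarly sized blobs and less well on elongated or overlapping clusters.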
K-MEANS
— Advantages
— Disadvantages
ELLIPSOIDAL DISTRIBUTED DATA
Gaussian mixtures
PARTITIONAL
MIXTURE OF GAUSSIANS (1D)
HIDDEN (LATENT) VARIABLES
MIXTURE OF GAUSSIANS (2D)
GRAPHICAL MODELS GAUSSIAN MIXTURES
PLATE NOTATION
— parameters (squares, solid circles, bullet)
— random variables (circles)
— conditional dependencies (solid arrows)
FAMILIES OF MODELS
Gaussian mixture, t-distribution mixture, factor analysis
TWO STEP – EM ALGORITHM
EM ALGORITHM
EXPECTATION MAXIMIZATION
MIXTURE OF GAUSSIANS AS MARGINALIZATION
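One way to write the slide title as an equation: introduce a discrete hidden variable h that picks the component, then sum it out. The symbols below (weights λ_k, means μ_k, covariances Σ_k, parameters θ) are assumed notation for this note.

```latex
\begin{aligned}
\Pr(h = k \mid \theta) &= \lambda_k, \qquad \sum_{k=1}^{K} \lambda_k = 1, \\
\Pr(x \mid h = k, \theta) &= \mathrm{Norm}_x[\mu_k, \Sigma_k], \\
\Pr(x \mid \theta) &= \sum_{k=1}^{K} \Pr(x, h = k \mid \theta)
                    = \sum_{k=1}^{K} \lambda_k \, \mathrm{Norm}_x[\mu_k, \Sigma_k].
\end{aligned}
```

The mixture density is therefore the marginal of a joint distribution over the data and the hidden component label.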
E-STEP
M-STEP
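The two steps can be sketched for a 1D mixture of Gaussians as follows. This is a minimal illustration under assumptions: starting means are spread evenly over the data range, the iteration count is fixed, and a small variance floor is added. The names `em_gmm_1d` and `normal_pdf` are made up for this note.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a 1D normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, k=2, n_iter=200):
    n = len(data)
    lo, hi = min(data), max(data)
    # Assumed initialisation: equal weights, evenly spaced means, shared variance.
    weights = [1.0 / k] * k
    means = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    grand_mean = sum(data) / n
    variances = [sum((x - grand_mean) ** 2 for x in data) / n] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj + 1e-6

    log_lik = sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )
    return weights, means, variances, log_lik


# Toy data (assumed): 300 points near -2 (sd 0.5) and 200 points near 3 (sd 1).
rng = random.Random(0)
data = [rng.gauss(-2, 0.5) for _ in range(300)] + [rng.gauss(3, 1.0) for _ in range(200)]
weights, means, variances, log_lik = em_gmm_1d(data, k=2)
print(weights, means, variances, log_lik)
```

Unlike K-means, each point contributes to every component in proportion to its responsibility, so the assignment is soft rather than hard.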
MANIPULATING THE LOWER BOUND
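The bound being manipulated can be written out with Jensen's inequality. In this sketch, q(h) is any distribution over the hidden variable and the superscript [t] counts iterations; both are assumed notation for this note.

```latex
\begin{aligned}
\log \Pr(x \mid \theta)
  &= \log \sum_{h} q(h) \, \frac{\Pr(x, h \mid \theta)}{q(h)}
   \;\ge\; \sum_{h} q(h) \log \frac{\Pr(x, h \mid \theta)}{q(h)}
   \;=\; \mathcal{B}[q, \theta], \\
\text{E-step:} \quad q^{[t]}(h) &= \Pr(h \mid x, \theta^{[t]})
   \quad \text{(the bound touches the log likelihood at } \theta^{[t]}\text{)}, \\
\text{M-step:} \quad \theta^{[t+1]} &= \arg\max_{\theta} \sum_{h} q^{[t]}(h) \log \Pr(x, h \mid \theta).
\end{aligned}
```

Each E-step raises the bound with respect to q and each M-step raises it with respect to θ, so the log likelihood never decreases, but only a local maximum is guaranteed.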
LOCAL MAXIMA
Repeatedly fitting a mixture of Gaussians model from different starting points gives different models, because each fit converges to a different local maximum.
The log likelihoods are (a) 98.76, (b) 96.97 and (c) 94.35 respectively, indicating that (a) is the best fit.
COVARIANCE COMPONENTS
a) Full covariances.
b) Diagonal covariances.
c) Identical diagonal covariances.
LEARNING GMM PSEUDO CODE
ANOMALY DETECTION
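A common way to use a fitted mixture for anomaly detection is to flag the points that fall in low-density regions. The sketch below assumes the mixture has already been fitted (its parameters are simply written down here) and uses an assumed cut-off at the lowest 2% of densities; the data and planted outliers are invented for illustration.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Density of a 1D Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical, already-fitted mixture parameters.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]

# Toy data drawn from that mixture, plus two planted outliers.
rng = random.Random(0)
data = [rng.gauss(-2, 1) if rng.random() < 0.5 else rng.gauss(2, 1)
        for _ in range(500)]
data += [9.0, -9.0]

densities = [mixture_pdf(x, weights, means, variances) for x in data]

# Assumed rule: anything below the 2nd-percentile density is an anomaly.
threshold = sorted(densities)[int(0.02 * len(densities))]
anomalies = [x for x, d in zip(data, densities) if d < threshold]
print(sorted(anomalies))
```

The cut-off fraction is a modelling choice: raising it catches more true anomalies at the price of more false positives.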
BIC AND AIC
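Both criteria trade goodness-of-fit against model size: BIC = k ln(n) - 2 ln(L) and AIC = 2k - 2 ln(L), where k is the number of free parameters, n the number of samples and L the maximised likelihood. Lower is better. The sketch below counts the free parameters of a Gaussian mixture and compares candidate models; the log likelihood values and sample size are hypothetical numbers chosen for illustration.

```python
import math


def gmm_free_parameters(n_components, n_features, covariance="full"):
    """Number of free parameters in a Gaussian mixture model."""
    weights = n_components - 1              # weights sum to one
    means = n_components * n_features
    if covariance == "full":
        cov = n_components * n_features * (n_features + 1) // 2
    elif covariance == "diag":
        cov = n_components * n_features
    else:
        raise ValueError(f"unsupported covariance type: {covariance}")
    return weights + means + cov


def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def aic(log_likelihood, n_params):
    return 2.0 * n_params - 2.0 * log_likelihood


# Hypothetical total log likelihoods for 1, 2 and 3 components on 2D data.
n_samples = 1000
candidates = {1: -2400.0, 2: -2100.0, 3: -2090.0}
scores = {
    k: bic(ll, gmm_free_parameters(k, 2), n_samples)
    for k, ll in candidates.items()
}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In this made-up example the third component improves the likelihood only slightly, so the BIC penalty for its six extra parameters outweighs the gain and two components are preferred.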
GAUSSIAN MIXTURES
BAYESIAN GMMS
CONCENTRATION PRIORS
The more data we have, however, the less the priors matter. In fact, to plot diagrams with such large differences, you must use very strong priors and little data.
TWO MOONS DATA
PROBLEMS WITH MULTI-VARIATE NORMAL DENSITY
Types of models
GENERATIVE VS DISCRIMINATIVE
CLASSIFICATION (DISCRIMINATIVE)
LOGISTIC REGRESSION REVISITED
MODEL CONTINGENCY OF THE WORLD ON DATA
World state: Linear model Bernoulli distribution
Probability / Decision surface
CLASSIFICATION (GENERATIVE)
GAUSSIAN MIXTURE
MODEL CONTINGENCY OF DATA ON THE WORLD
WHAT SORT OF MODEL SHOULD WE USE? TL;DR NO DEFINITIVE ANSWER
— Inference is generally simpler with discriminative models.
— Generative models compute the posterior probability of the world state indirectly, via Bayes’ rule, and this sometimes requires a computationally expensive algorithm.
— Generative models might waste modelling power. The data are generally of much higher dimension than the world state, and modelling them is costly. Moreover, many aspects of the data may not influence the state.
— With discriminative approaches it is harder to exploit prior knowledge about how the data were generated: essentially we have to re-learn these phenomena from the data.
— Sometimes parts of the training or test data vector x may be missing. Here, generative models are preferred.
— It is harder to impose prior knowledge in a principled way in discriminative models.
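The difference in inference between the two families can be sketched in one dimension. A generative classifier models the data given the class and inverts it with Bayes' rule; a discriminative classifier (here logistic regression) models the class given the data directly. All parameter values below are assumptions chosen for illustration. With equal class variances the two posteriors coincide when the logistic coefficients are set as shown.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def generative_posterior(x, priors, means, var):
    """Pr(w | x) via Bayes' rule from class-conditional Gaussians Pr(x | w)."""
    joint = [p * normal_pdf(x, m, var) for p, m in zip(priors, means)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def discriminative_posterior(x, phi0, phi1):
    """Pr(w | x) modelled directly as a Bernoulli with a logistic sigmoid."""
    p1 = 1.0 / (1.0 + math.exp(-(phi0 + phi1 * x)))
    return [1.0 - p1, p1]


# Assumed generative parameters: two classes with a shared variance.
priors, means, var = [0.5, 0.5], [0.0, 2.0], 1.0

# Logistic coefficients implied by those parameters (equal-variance case).
phi1 = (means[1] - means[0]) / var
phi0 = (means[0] ** 2 - means[1] ** 2) / (2 * var) + math.log(priors[1] / priors[0])

for x in (-1.0, 0.5, 1.0, 3.0):
    g = generative_posterior(x, priors, means, var)
    d = discriminative_posterior(x, phi0, phi1)
    print(f"x={x:4.1f}  generative={g[1]:.4f}  discriminative={d[1]:.4f}")
```

The generative route needs five parameters (priors, means, variance) and a normalisation step to produce the same decision surface that the discriminative model reaches with two coefficients, which illustrates the "wasted modelling power" point; in return, the generative model can also evaluate how likely a data point is.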
SUMMARY OF APPROACHES
Best practice…
BEST PRACTICE…
Source: https://www.marekrei.com/blog/ml-and-nlp-publications-in-2019/
Percentage of papers mentioning GitHub (indicating that the code is made available):
ACL 70%, EMNLP 69%, NAACL 68%, ICLR 56%, NeurIPS 46%, ICML 45%, AAAI 31%.
It seems the NLP papers are releasing their code much more freely.
PAPERS WITH CODE
https://paperswithcode.com/
PERCEPTIONS OF PROBABILITY
DEPLOYMENT
@SOCIAL
@chipro @random_forests @zacharylipton @yudapearl @svpino @jackclarkSF
TEACHING TEAM
Dr Alastair Moore Senior Teaching Fellow
a.p.moore@ucl.ac.uk
@latticecut
Kamil Tylinski Teaching Assistant
kamil.tylinski.16@ucl.ac.uk
Jiangbo Shangguan Teaching Assistant
j.shangguan.17@ucl.ac.uk
Individual Coursework workshop
to Thursday 11th Feb 2021 at 12:00 am