Solution – Lab 8 – Unsupervised learning
Lab 8: Unsupervised Learning from Unlabelled Data
A: Principal component analysis; B: Clustering; C: Autoencoder
Haiping Lu – COM4509/6509 MLAI2021 @ The University of Sheffield
Accompanying lectures: YouTube video lectures recorded in Year 2020/21.
Sources: Part A is based on the Dimensionality Reduction: Latent Variable Modelling notebook from [MLAI2015]. Part B covers clustering briefly and uses Scikit-learn examples. Part C is based on the Autoencoder notebook from APS360, University of Toronto, and Lab 2 of my SimplyDeep notebooks.
There are six questions in this notebook.
Why unsupervised learning?
So far we have focused mainly on supervised learning problems, including regression and classification, where the training samples are all labelled. Now we turn to a different form of learning, commonly known as unsupervised learning. In unsupervised learning, our data are not labelled, but we want models that give us a better understanding of the data; in some contexts this is called exploratory data analysis. Unsupervised learning is a fundamental problem in machine learning, and its focus is learning from unlabelled data. Unsupervised feature learning, or representation learning, is widely used to extract useful information from the often abundant unlabelled data available in many applications.
A: Principal Component Analysis (PCA)
PCA example in Scikit-learn
Run the Scikit-learn example Faces recognition example using eigenfaces and SVMs. More examples of PCA are at the bottom of the PCA documentation.
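Before working through the eigenfaces example, the following minimal sketch may help you get a feel for the PCA API; the digits dataset and the choice of 10 components are arbitrary illustrations, not part of the lab exercise.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# A small built-in dataset: 1797 8x8 digit images, i.e. 64 features per sample.
X, _ = load_digits(return_X_y=True)

# Keep the first 10 principal components (an arbitrary choice for illustration).
pca = PCA(n_components=10)
Z = pca.fit_transform(X)   # project the data onto the principal components

print(Z.shape)                              # (1797, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained by 10 components
```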
The following discusses the background and history of PCA, complementary to the lecture materials. The latent-factor perspective will be useful for understanding generative models in the next session. If you are interested only in how to use PCA, you may safely skip the remaining part of this section (A1 and A2), including the two questions.
A1. Latent variable and latent factor analysis
This section discusses the context where PCA was first proposed.
Latent variables
Latent means hidden. Latent variables are hidden, unobservable variables. Recall that artificial neural networks have three types of nodes (a.k.a. units, or artificial neurons): input, hidden, and output neurons.
Latent variables are also called latent factors (see below) when the focus is more on analysing and interpreting the data. In the following, let us look at the factor analysis problem first.
Factor analysis model as a multi-output regression problem
Note that in the following, $\mathbf{y}$ is observed and known while $\mathbf{x}$ is hidden, unknown, and to be estimated. This can be confusing because in the lecture, the input is $\mathbf{x}$ and the output (projections) is $\mathbf{y}$ so please mind the difference.
If we are given a high-dimensional vector of features (perhaps questionnaire answers) associated with an individual, $\mathbf{y}$, we assume that these features are actually generated from a low-dimensional vector of latent traits, or latent variables, $\mathbf{x}$, which determine the personality.
\mathbf{y} = \mathbf{f}(\mathbf{x}) + \boldsymbol{\epsilon}
where $\mathbf{f}(\mathbf{x})$ is a vector valued function that is dependent on the latent traits and $\boldsymbol{\epsilon}$ is some corrupting noise. For simplicity, we assume that the function is given by a linear relationship,
\mathbf{f}(\mathbf{x}) = \mathbf{W}\mathbf{x}
where we have introduced a matrix $\mathbf{W}$ that is sometimes referred to as the factor loadings, and which, as we will see, is closely related to our multivariate linear regression models. That is because our vector-valued function is of the form
\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_p(\mathbf{x})\end{bmatrix}
where there are $p$ features associated with the individual. If we consider any of these functions individually we have a prediction function that looks like a regression model,
f_j(\mathbf{x}) = \mathbf{w}_{j, :} \mathbf{x},
for each element of the vector-valued function, where $\mathbf{w}_{j, :}$ is the $j$th row of the matrix $\mathbf{W}$. In the context of regression, each row of $\mathbf{W}$ is a vector of regression weights for one output. This is a multiple-input, multiple-output regression: our inputs (or covariates) have dimensionality greater than 1 and our outputs (or response variables) also have dimensionality greater than one, unlike the single-output predictions of our previous sessions. Just as in a standard regression, we assume that we don't observe the function directly (note that this also makes the function a type of latent variable), but we observe some corrupted variant of the function, where the corruption is given by $\boldsymbol{\epsilon}$. Just as in linear regression we can assume that this corruption is given by Gaussian noise, where the noise for the $j$th element of $\mathbf{y}$ is given by
\epsilon_j \sim \mathcal{N}(0, \sigma^2_j).
Of course, just as in a regression problem we also need to make an assumption across the individual data points to form our full likelihood. Our data set now consists of many observations of $\mathbf{y}$ for different individuals. We store these observations in a design matrix, $\mathbf{Y}$, where each row of $\mathbf{Y}$ contains the observation for one individual. To emphasize that $\mathbf{y}$ is a vector derived from a row of $\mathbf{Y}$, we represent the observation of the features associated with the $i$th individual by $\mathbf{y}_{i, :}$, and place each individual in our data matrix,
\mathbf{Y} = \begin{bmatrix} \mathbf{y}_{1, :}^\top \\ \mathbf{y}_{2, :}^\top \\ \vdots \\ \mathbf{y}_{n, :}^\top\end{bmatrix},
where we have $n$ data points. Our data matrix therefore has $n$ rows and $p$ columns. The point to notice here is that each data observation appears as a row vector in the design matrix (thus the transpose operation inside the brackets). Our prediction functions are now actually a matrix-valued function,
\mathbf{F} = \mathbf{X}\mathbf{W}^\top,
where, for each matrix, the data points are in the rows and the data features are in the columns. This implies that if we have $q$ inputs to the function, then $\mathbf{F}\in \Re^{n\times p}$, $\mathbf{W} \in \Re^{p \times q}$ and $\mathbf{X} \in \Re^{n\times q}$. Note that $\mathbf{Y}$ equals $\mathbf{F}$ plus noise; these shapes are illustrated in the short sketch after the list below.
$n$: number of data points
$p$: input feature dimension
$q$: latent feature dimension
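As a sanity check on these shapes, here is a minimal NumPy sketch of sampling from the linear latent variable model $\mathbf{Y} = \mathbf{X}\mathbf{W}^\top + \boldsymbol{\epsilon}$; the values of $n$, $p$, $q$ and the noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 10, 2                 # data points, observed features, latent dimension (arbitrary)

X = rng.standard_normal((n, q))      # latent variables, one row per data point
W = rng.standard_normal((p, q))      # factor loadings
F = X @ W.T                          # noise-free function values
Y = F + 0.1 * rng.standard_normal((n, p))   # observations = F + independent Gaussian noise

print(X.shape, W.shape, F.shape, Y.shape)   # (200, 2) (10, 2) (200, 10) (200, 10)
```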
Question 1
Show that, given all the definitions above, if,
\mathbf{F} = \mathbf{X}\mathbf{W}^\top
and the elements of $\mathbf{F}$ are given by
f_{i, j} = f_j(\mathbf{x}_{i, :}),
where $\mathbf{x}_{i, :}$ is the $i$th row of the latent variables, $\mathbf{X}$, then
$$f_j(\mathbf{x}_{i, :}) = \mathbf{x}_{i, :} \mathbf{w}_{j, :} ^\top,$$
where $\mathbf{w}_{j, :}$ is the $j$th row of $\mathbf{W}$.
Writing $\mathbf{X}$ in terms of its rows and $\mathbf{W}^\top$ in terms of its columns,
\mathbf{X}\mathbf{W}^\top = \begin{bmatrix}\mathbf{x}_{1, :}\\ \mathbf{x}_{2, :}\\ \vdots \\ \mathbf{x}_{n, :}\end{bmatrix} \begin{bmatrix}\mathbf{w}_{1, :}^\top & \mathbf{w}_{2, :}^\top & \cdots & \mathbf{w}_{p, :}^\top\end{bmatrix},
so the $(i, j)$ entry of the product is the inner product of the $i$th row of $\mathbf{X}$ with the $j$th row of $\mathbf{W}$:
\mathbf{X}\mathbf{W}^\top = \begin{bmatrix}
\mathbf{x}_{1, :}\mathbf{w}_{1, :}^\top & \mathbf{x}_{1, :}\mathbf{w}_{2, :}^\top & \cdots & \mathbf{x}_{1, :}\mathbf{w}_{p, :}^\top\\
\mathbf{x}_{2, :}\mathbf{w}_{1, :}^\top & \ddots & & \vdots \\
\vdots & & \mathbf{x}_{i, :}\mathbf{w}_{j, :}^\top & \vdots \\
\mathbf{x}_{n, :}\mathbf{w}_{1, :}^\top & \cdots & \cdots & \mathbf{x}_{n, :}\mathbf{w}_{p, :}^\top
\end{bmatrix}
Comparing this entry by entry with
\mathbf{F} = \begin{bmatrix}
f_{1,1} & f_{1,2} & \cdots & f_{1,p}\\
f_{2,1} & \ddots & & \vdots \\
\vdots & & f_{i,j} & \vdots \\
f_{n,1} & \cdots & \cdots & f_{n,p}
\end{bmatrix}
gives $f_{i, j} = f_j(\mathbf{x}_{i, :}) = \mathbf{x}_{i, :} \mathbf{w}_{j, :}^\top$, as required.
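The identity is also easy to confirm numerically; the sizes below are arbitrary, and the check simply verifies that the $(i, j)$ entry of $\mathbf{X}\mathbf{W}^\top$ is the inner product of the $i$th row of $\mathbf{X}$ with the $j$th row of $\mathbf{W}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 5, 4, 3                   # small arbitrary sizes
X = rng.standard_normal((n, q))
W = rng.standard_normal((p, q))

F = X @ W.T                         # F = X W^T, shape (n, p)

i, j = 2, 1                         # any valid pair of indices
assert np.isclose(F[i, j], X[i, :] @ W[j, :])   # f_{i,j} = x_{i,:} w_{j,:}^T
```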
Gaussian modelling of the factor analysis model
The difference between this factor analysis model and a multiple-output regression is that in the regression case we are provided with the covariates $\mathbf{X}$; here they are latent variables, and therefore unknown.
Just as we have done in the past for unknowns, we now treat them with a probability distribution. In factor analysis we assume that the latent variables have a Gaussian density which is independent both across the latent variables associated with the different data points and across those associated with different data features, so we have
x_{i,j} \sim \mathcal{N}(0, 1),
and we can write the density governing the latent variable associated with a single point as,
\mathbf{x}_{i, :} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
If we consider the values of the function for the $i$th data point as
\mathbf{f}_{i, :} = \mathbf{f}(\mathbf{x}_{i, :}) = \mathbf{W}\mathbf{x}_{i, :}
then we can use the rules for multivariate Gaussian relationships to write that
\mathbf{f}_{i, :} \sim \mathcal{N}(\mathbf{0}, \mathbf{W}\mathbf{W}^\top)
which implies that the distribution for $\mathbf{y}_{i, :}$ is given by
\mathbf{y}_{i, :} \sim \mathcal{N}(\mathbf{0}, \mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma})
where $\boldsymbol{\Sigma}$ is the covariance of the noise variable $\boldsymbol{\epsilon}_{i, :}$, which for factor analysis is a diagonal matrix (because we have assumed that the noise is independent across the features),
\boldsymbol{\Sigma} = \begin{bmatrix}\sigma^2_{1} & 0 & 0 & 0\\
0 & \sigma^2_{2} & 0 & 0\\
0 & 0 & \ddots & 0\\
0 & 0 & 0 & \sigma^2_p\end{bmatrix}.
For completeness, we could also add in a mean for the data vector $\boldsymbol{\mu}$,
\mathbf{y}_{i, :} = \mathbf{W} \mathbf{x}_{i, :} + \boldsymbol{\mu} + \boldsymbol{\epsilon}_{i, :}
which would give our marginal distribution for $\mathbf{y}_{i, :}$ a mean $\boldsymbol{\mu}$. However, the maximum likelihood solution for $\boldsymbol{\mu}$ turns out to equal the empirical mean of the data,
\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^n \mathbf{y}_{i, :},
regardless of the form of the covariance, $\mathbf{C} = \mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma}$. As a result it is very common to simply preprocess the data and ensure it is zero mean.
Since the prior density over the latent variables is independent across data points, and so is the likelihood, the marginal likelihood here is also independent over the data points.
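A quick numerical check of this marginal covariance (the loadings, noise variances and sample size below are arbitrary): sample many points from the model and compare the empirical covariance of $\mathbf{Y}$ with $\mathbf{W}\mathbf{W}^\top + \boldsymbol{\Sigma}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 100_000, 4, 2                     # many samples so the empirical covariance is accurate

W = rng.standard_normal((p, q))             # factor loadings (arbitrary)
sigma2 = np.array([0.1, 0.2, 0.3, 0.4])     # per-feature noise variances, the diagonal of Sigma

X = rng.standard_normal((n, q))                      # x_{i,:} ~ N(0, I)
E = rng.standard_normal((n, p)) * np.sqrt(sigma2)    # eps_{i,j} ~ N(0, sigma_j^2)
Y = X @ W.T + E                                      # y_{i,:} ~ N(0, W W^T + Sigma)

C_model = W @ W.T + np.diag(sigma2)
C_empirical = np.cov(Y, rowvar=False)
print(np.abs(C_model - C_empirical).max())  # small: the two matrices should be close
```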
Factor analysis was developed mainly in psychology and the social sciences for understanding personality and intelligence. Charles Spearman, who was concerned with the measurement of “the abilities of man”, is credited with the earliest version of factor analysis.
A2. PCA background
This section discusses how PCA was first proposed and the mathematical background for PCA.
Background on PCA and factor analysis (history)
In 1933 Hotelling published the first paper on principal component analysis. His inspiration was to provide a mathematical foundation for the factor analysis methods that were by then widely used within psychology and the social sciences. His model was a factor analysis model, but he considered the noiseless ‘limit’ of the model. In other words, he took $\sigma^2_i \rightarrow 0$ so that he had
\mathbf{y}_{i, :} \sim \lim_{\sigma^2 \rightarrow 0} \mathcal{N}(\mathbf{0}, \mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I}).
The paper had two unfortunate effects.
Firstly, the resulting model is no longer valid probabilistically, because the covariance of this Gaussian is ‘degenerate’: $\mathbf{W}\mathbf{W}^\top$ has rank of at most $q$, where $q < p$, so the covariance matrix is not full rank.
When $p > n$ (generally the case in practice), you also need to consider how to solve the large eigenvalue problem efficiently without large demands on computer memory.
When the data are quite high dimensional, solving the eigenvalue problem in the high-dimensional space can take some time. At this point we turn to a neat trick: you don't have to solve the full eigenvalue problem for the $p\times p$ covariance; you can instead solve the related eigenvalue problem in the $n \times n$ space, and in this case $n=200$, which is much smaller than $p$.
The original eigenvalue problem has the form
\mathbf{Y}^\top\mathbf{Y} \mathbf{u} = \lambda\mathbf{u}
Let us look at a related eigenvalue problem.
\mathbf{Y}\mathbf{Y}^\top \mathbf{v} = \sigma\mathbf{v}
Multiplying both sides on the left by $\mathbf{Y}^\top$, we have
\mathbf{Y}^\top\mathbf{Y}(\mathbf{Y}^\top \mathbf{v}) = \sigma(\mathbf{Y}^\top\mathbf{v}).
Note that $\mathbf{Y}^\top\mathbf{v}$ is therefore an eigenvector of $\mathbf{Y}^\top\mathbf{Y}$ with the same eigenvalue. Thus, we can obtain the eigenvectors of the $p\times p$ matrix $\mathbf{Y}^\top\mathbf{Y}$ by computing the eigenvectors of the $n\times n$ matrix $\mathbf{Y}\mathbf{Y}^\top$ and then multiplying them by $\mathbf{Y}^\top$.
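The trick is easy to verify numerically. The sketch below, with an arbitrary small $n$ and larger $p$, solves the $n \times n$ eigenvalue problem and maps an eigenvector back with $\mathbf{Y}^\top$, checking that the result is an eigenvector of the $p \times p$ matrix $\mathbf{Y}^\top\mathbf{Y}$ with the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 500                       # few data points, many features (arbitrary sizes)
Y = rng.standard_normal((n, p))
Y -= Y.mean(axis=0)                  # centre the data

# Solve the small n x n eigenvalue problem instead of the p x p one.
evals, V = np.linalg.eigh(Y @ Y.T)   # columns of V are eigenvectors of Y Y^T

v = V[:, -1]                         # eigenvector with the largest eigenvalue
u = Y.T @ v                          # Y^T v is an eigenvector of Y^T Y with the same eigenvalue
u /= np.linalg.norm(u)               # normalise it

assert np.allclose((Y.T @ Y) @ u, evals[-1] * u)
```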
B: Clustering
Clustering associates each data point, $\mathbf{y}_{i, :}$ with one of $k$ different discrete groups. For example, clustering animals into discrete groups or clustering diseases into different sub-types. This is a fundamental unsupervised learning task and also essential in data mining.
$k$-means clustering
The $k$-means algorithm
The simplest and most popular clustering algorithm.
Require: set $k$ and a stopping criterion.
1. Initialize cluster centres as randomly selected data points.
2. Assign each data point to its nearest cluster centre (centroid).
3. Update each cluster centre by setting it to the mean of its assigned data points.
4. Repeat 2 and 3 until the stopping criterion is reached (e.g., cluster allocations do not change).
There is only one hyperparameter ($k$) to set.
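A minimal NumPy sketch of these steps on some arbitrary toy data (two Gaussian blobs); in practice you would normally use an existing implementation such as sklearn.cluster.KMeans.

```python
import numpy as np

rng = np.random.default_rng(4)
# Arbitrary toy data: two well-separated Gaussian blobs in 2-D.
Y = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(3.0, 0.5, (100, 2))])

k = 2
centres = Y[rng.choice(len(Y), size=k, replace=False)]  # 1. initialise with random data points

for _ in range(100):
    # 2. assign each point to its nearest centre
    dists = np.linalg.norm(Y[:, None, :] - centres[None, :, :], axis=2)  # shape (n, k)
    labels = dists.argmin(axis=1)
    # 3. move each centre to the mean of its assigned points
    #    (empty clusters are not handled in this simple sketch)
    new_centres = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
    # 4. stop when the centres (and hence the allocations) no longer change
    if np.allclose(new_centres, centres):
        break
    centres = new_centres

print(centres)   # one centre near (0, 0), the other near (3, 3)
```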
The objective function of $k$-means
The standard $k$-means algorithm minimizes the objective (compactness)
E=\sum_{j=1}^k \sum_{i\ \text{allocated to}\ j} \left(\mathbf{y}_{i, :} - \boldsymbol{\mu}_{j, :}\right)^\top\left(\mathbf{y}_{i, :} - \boldsymbol{\mu}_{j, :}\right)
i.e. it minimizes the sum of squared Euclidean distances between points and their associated centres. This is equivalent to minimizing the trace of the pooled within-cluster covariance/scatter matrix. The minimum found is not, in general, the global minimum: the algorithm only converges to a local minimum of this objective.
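This equivalence is easy to confirm numerically; the sketch below uses arbitrary data and arbitrary cluster allocations, and computes the compactness $E$ both directly and as the trace of the pooled within-cluster scatter matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.standard_normal((50, 3))            # arbitrary data: n = 50 points, 3 features
labels = rng.integers(0, 3, size=50)        # arbitrary allocations to k = 3 clusters
centres = np.array([Y[labels == j].mean(axis=0) for j in range(3)])

# Sum of squared Euclidean distances between points and their allocated centres.
E = sum(((Y[labels == j] - centres[j]) ** 2).sum() for j in range(3))

# Trace of the pooled within-cluster scatter matrix gives the same value.
S = sum((Y[labels == j] - centres[j]).T @ (Y[labels == j] - centres[j]) for j in range(3))
assert np.isclose(E, np.trace(S))
```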