Final Review
April 19th, 2022
Neural Networks
Neural Networks
Neural Networks: another supervised machine learning technique
● Input layer (x)
● Hidden layer (h)
● Output layer (z)
● Weights (w)
Example: Neural Network
● Two hidden layers
● Sigmoid activation
● Loss function L
Ex: Forward and Backward Propagation
Activation function:
Ex: Forward Computation
Activation function:
Ex: Forward and Backward Propagation
Repeat for remaining layers… Activation function:
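A minimal numpy sketch of the forward computation for a network like the one in the example: two hidden layers, sigmoid activations, and a loss L. The layer sizes, weights, squared-error loss, and absence of bias terms are illustrative assumptions, not the slide's exact setup.

```python
import numpy as np

def sigmoid(a):
    # Elementwise sigmoid activation.
    return 1.0 / (1.0 + np.exp(-a))

# Made-up layer sizes: 2 inputs -> 3 hidden -> 3 hidden -> 1 output (no biases).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # weights into hidden layer 1
W2 = rng.normal(size=(3, 3))   # weights into hidden layer 2
W3 = rng.normal(size=(1, 3))   # weights into the output layer

x = np.array([0.5, -1.0])      # input
y = np.array([1.0])            # target

# Forward propagation: each layer computes a pre-activation (W @ input)
# and then applies the activation function.
h1 = sigmoid(W1 @ x)
h2 = sigmoid(W2 @ h1)
z = sigmoid(W3 @ h2)

# Squared-error loss L (one common choice; the slide's exact L is not shown).
L = 0.5 * np.sum((z - y) ** 2)
print(h1, h2, z, L)
```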
Backpropagation
Work backwards to compute weights:
Weight for layer 3
Backpropagation
Work backwards:
Weight for layer 2
Backpropagation
Work backwards:
Weight for layer 1
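Continuing the sketch above, backpropagation works backwards from the loss: compute the gradient for the layer-3 weights first, then layer 2, then layer 1. The same assumptions apply (sigmoid activations, squared-error loss, no biases, made-up sizes).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(3, 3))
W3 = rng.normal(size=(1, 3))
x = np.array([0.5, -1.0])
y = np.array([1.0])

# Forward pass, keeping the intermediate activations for the backward pass.
h1 = sigmoid(W1 @ x)
h2 = sigmoid(W2 @ h1)
z = sigmoid(W3 @ h2)

# Backward pass: sigmoid'(a) = s(a)(1 - s(a)), written in terms of the outputs.
delta3 = (z - y) * z * (1 - z)           # error signal at the output layer
grad_W3 = np.outer(delta3, h2)           # gradient for the layer-3 weights

delta2 = (W3.T @ delta3) * h2 * (1 - h2)
grad_W2 = np.outer(delta2, h1)           # gradient for the layer-2 weights

delta1 = (W2.T @ delta2) * h1 * (1 - h1)
grad_W1 = np.outer(delta1, x)            # gradient for the layer-1 weights

# One gradient-descent step with a made-up learning rate.
lr = 0.1
W3 -= lr * grad_W3
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```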
Convolutional Layers
To evaluate a conv layer:
● Slide the filter across the width and height of the input volume
● Compute dot products between the entries of the filter and the input
● This creates a 2D activation map
● Stack the activation maps together to create the output
● Activation maps store spatial data for a given filter
● The network starts to pick up on patterns in the input
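A minimal numpy sketch of evaluating a convolutional layer as described above: slide each filter across the input volume, take dot products, and stack the resulting 2D activation maps. Stride 1, no padding, and the input/filter sizes are assumptions for illustration.

```python
import numpy as np

def conv_layer(x, filters):
    """x: input volume (C, H, W); filters: (K, C, FH, FW).
    Returns stacked activation maps of shape (K, H-FH+1, W-FW+1)."""
    C, H, W = x.shape
    K, _, FH, FW = filters.shape
    out = np.zeros((K, H - FH + 1, W - FW + 1))
    for k in range(K):                      # one activation map per filter
        for i in range(H - FH + 1):         # slide over height
            for j in range(W - FW + 1):     # slide over width
                patch = x[:, i:i + FH, j:j + FW]
                out[k, i, j] = np.sum(patch * filters[k])  # dot product
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))            # e.g. a 3-channel 8x8 input
filters = rng.normal(size=(20, 3, 4, 4))  # 20 filters of size 4x4x3
maps = conv_layer(x, filters)
print(maps.shape)                         # (20, 5, 5)
```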
Pooling Layers
Use: insert between convolutional layers to decrease dimensionality
● Reduces overfitting
● Reduces computational cost
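A sketch of a max-pooling layer, the kind of layer described above for downsampling between convolutional layers; the 2x2 pool size and stride of 2 are assumptions.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """x: activation maps (K, H, W). Returns spatially downsampled maps."""
    K, H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((K, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[:, i * stride:i * stride + size, j * stride:j * stride + size]
            out[:, i, j] = window.max(axis=(1, 2))  # keep the max of each window
    return out

maps = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(max_pool(maps).shape)  # (2, 2, 2) -- spatial dimensions halved
```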
CNN Architecture
How many filters are in convolutional layer 1?
How many learnable parameters are in convolutional layer 1?
CNN Architecture
How many filters are in convolutional layer 1? 20
How many learnable parameters are in convolutional layer 1? 4*4*3*20 + 20
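A quick arithmetic check of the answer above, assuming 4x4 filters over a 3-channel input, 20 filters, and one bias per filter.

```python
filter_h, filter_w, in_channels, num_filters = 4, 4, 3, 20
weights = filter_h * filter_w * in_channels * num_filters  # 4*4*3*20 = 960
biases = num_filters                                        # one bias per filter
print(weights + biases)                                     # 980
```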
Activation Functions
Each layer can be transformed by a non-linear activation function in order to capture non-linear relationships among features.
Important Properties:
● Nonlinear
● Easy to compute
● Subdifferentiable
Example Activation Functions
Common for CNN:
Common for NN:
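The slide's specific examples are not reproduced here; as a sketch, ReLU (commonly used in CNNs) and sigmoid/tanh (common in classic fully connected networks) illustrate the desired properties: nonlinear, cheap to compute, and subdifferentiable.

```python
import numpy as np

def relu(a):
    # Common in CNNs: nonlinear, cheap, subdifferentiable at 0.
    return np.maximum(0.0, a)

def sigmoid(a):
    # Common in classic fully connected networks; squashes values to (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Zero-centered alternative to the sigmoid; squashes values to (-1, 1).
    return np.tanh(a)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(a), sigmoid(a), tanh(a))
```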
Recurrent Neural Networks (RNNs)
● We learned CNNs are good at recognizing spatial relationships
● By contrast, RNNs are good at recognizing temporal relationships, because they can learn dependencies between timesteps
○ Ex. Stock market price prediction, sentence translation, and video data
Sample Final Exam Question
Sample Final Exam Question
No, because points a and d both map to (0,0) even though they have different labels. In later layers, these points can never be separated, so the neural network will always mislabel at least one of them.
Sample Final Exam Question
Yes, because you could learn four different classifiers, together forming a decision boundary between squares and circles on each of the four sides. More generally, by the Universal Approximation Theorem, a neural network with one hidden layer can approximate any function given sufficient width (enough hidden units).
Sample Final Exam Question
False, training performance gives no guarantee of test performance, as test data is completely held out from the training stage of the model.
Sample Final Exam Question
Sample Final Exam Question
Number of parameters = size of filter x number of filters = (1x2x2)x1 = 4
Output size: 1x4x1
Number of parameters = input size*output size = 2*1 = 2
Clustering Methods
Unsupervised Machine Learning
● Before: Supervised Machine Learning
○ Requires labels
○ Hard to collect and can introduce noise
● Now: Unsupervised Machine Learning
○ No explicit reliance on labels
○ Use similarity between groups to extract meaningful labels
○ Two groups: clustering and matrix completion
K-means Clustering – Algorithm
● Algorithm: minimize the distance between points in the same cluster; maximize the distance between points in different clusters
○ Given a data set with n points
○ (1) randomly choose k cluster means
○ (2) iterate between two main steps until convergence:
■ (a) assign each data point to the cluster i whose mean it is closest to
■ (b) recompute each cluster mean based on all data points assigned to cluster i
● Best with
○ Globular clusters
○ Similar variances amongst clusters
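A minimal numpy sketch of the algorithm above: randomly choose k means from the data, then alternate between assigning points to the nearest mean and recomputing each mean until convergence. The toy data is made up.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) randomly choose k cluster means from the data points.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (2a) assign each point to the cluster with the closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2b) recompute each cluster mean from its assigned points.
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):  # convergence check
            break
        means = new_means
    return means, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5])])
means, labels = kmeans(X, k=2)
print(means)
```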
Spectral Clustering
Spectral Clustering: a graph-partitioning algorithm where each data point is a node, with edges connecting similar points
● The edge weight w_ij is the similarity between points x_i and x_j (e.g., computed with cosine similarity)
Spectral Clustering Components
● Similarity Matrix (W):
○ Symmetric
● Degree Matrix (D):
○ Diagonal
● Laplacian Matrix: L = D – W
Spectral Clustering Intuition
● Want to partition graph into k connected components that minimizes the cost of those cuts (closest data points in each of k splits)
● Split graph that has minimal effect on graph of Laplacian
● Therefore, perform splits according to the k eigenvectors corresponding to the k smallest eigenvalues of L
● To figure out clusters, run k-means (or another clustering algorithm) on the rows of these ordered eigenvectors
Spectral Clustering
● The multiplicity of the eigenvalue 0 is the number of connected components in the graph
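A sketch of the recipe above: build a similarity matrix W, form the Laplacian L = D - W, take the eigenvectors for the k smallest eigenvalues, and run k-means on the rows. A Gaussian (RBF) similarity is used here as one concrete choice of edge weight; the data is made up.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Similarity matrix W (symmetric); a Gaussian similarity is one choice.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))   # degree matrix (diagonal)
    L = D - W                    # graph Laplacian
    # Eigenvectors for the k smallest eigenvalues of L (eigh sorts ascending).
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]           # rows = spectral embedding of each point
    # Cluster the rows of U with k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in ([0, 0], [3, 3])])
print(spectral_clustering(X, k=2))
```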
Hierarchical Clustering
Agglomerative Clustering: start with every point in its own cluster and then repeatedly merge clusters based on some criterion of similarity
Divisive Clustering: start with one big cluster and then repeatedly split clusters based on some criterion
Linkage Criteria
Agglomerative Clustering Algorithm
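A short sketch of agglomerative clustering with scipy, assuming Euclidean distance and single linkage as the merge criterion (other linkage criteria such as complete or average work the same way); the toy data is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2)) for c in ([0, 0], [4, 4])])

# Start with every point in its own cluster and repeatedly merge the two
# closest clusters; 'single' linkage uses the minimum pairwise distance.
Z = linkage(X, method="single", metric="euclidean")

# One run yields a whole dendrogram, so we can read off any number of
# clusters afterwards without re-running.
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_4 = fcluster(Z, t=4, criterion="maxclust")
print(labels_2, labels_4)
```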
Comparison of Clustering Methods
K-means: struggles with spirals and concentric circles (non-globular clusters)
Hierarchical: gives many different numbers of clusters from one run; doesn't depend on initialization
Spectral: handles concentric circles; less efficient (still requires running k-means)
Sample Final Exam Question
No, objective function of K-means monotonically decreases
Sample Final Exam Question
Sample Final Exam Question
[0.8,0.1,0.6,0]
[0,0,2.4,0,0]
[-0.8,0,1.4,-0.6,0]
Sample Final Exam Question
{P,X},{Q,Y},{Z}
Sample Final Exam Question
Plot A corresponds to the lower value of σ²
Collaborative Filtering
Collaborative Filtering
A type of recommendation system that does not require substantial expert knowledge and that improves the quality of recommendations.
Main Idea: solve a regression problem for the expected rating that a user would give to an item using a metric of similarity between users
● Like users will rate items similarly
Matrix Factorization
Given an observed user-item matrix with potential missing cells, construct an approximation that is completely filled.
Set up a ridge regression problem:
● Y_ai : observed rating that user a assigned to item i
● X : the filled approximation of the user-item matrix
● D : set of observed (a, i) ratings
● λ : tunable regularization hyperparameter
Matrix Factorization
Without constraining X to be low rank, we get trivial solutions of 0 for the missing observations. Instead, define X = UVᵀ, where U has one d-dimensional row per user and V has one d-dimensional row per item (d is the chosen rank).
Our objective function can be rewritten as:
Optimizing U and V
How do we go about minimizing J(U, V)?
We can minimize iteratively, alternating between holding one factor constant and minimizing over the other:
1. Treat V as constant and minimize J(U, V) w.r.t U
2. Treat U as constant and minimize J(U, V) w.r.t V
3. Repeat until some stopping criteria
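A sketch of the alternating minimization above, assuming the ridge-regression objective over observed entries with rank d and regularization λ; each sub-problem has a closed-form solution. The tiny rating matrix is made up, and treating zeros as missing is a simplification for the example.

```python
import numpy as np

def als(Y, observed, d=2, lam=0.1, n_iters=50, seed=0):
    """Y: user-item rating matrix; observed: boolean mask of known entries."""
    n, m = Y.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, d))  # user factors
    V = rng.normal(scale=0.1, size=(m, d))  # item factors
    for _ in range(n_iters):
        # Hold V fixed: solve a ridge regression for each user's row of U.
        for a in range(n):
            idx = observed[a]
            A = V[idx].T @ V[idx] + lam * np.eye(d)
            U[a] = np.linalg.solve(A, V[idx].T @ Y[a, idx])
        # Hold U fixed: solve a ridge regression for each item's row of V.
        for i in range(m):
            idx = observed[:, i]
            A = U[idx].T @ U[idx] + lam * np.eye(d)
            V[i] = np.linalg.solve(A, U[idx].T @ Y[idx, i])
    return U @ V.T  # completed approximation X = U V^T

Y = np.array([[5, 4, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [0, 1, 5, 4]], dtype=float)
observed = Y > 0  # treat zeros as missing for this toy example
print(np.round(als(Y, observed), 2))
```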
K Nearest Neighbors
The idea here is to make predictions using information from the k most similar users.
General steps:
1. Define a similarity metric between users
2. Compute similarity between user a and all other users
3. Rate item i for user a using the K most similar users
Similarity Metric
Allows us to quantify how alike two users a and b are (a correlation).
Here, Ȳ_a denotes the average rating from user a among items rated by both user a and user b.
Similarity Metric
For our similarity metric, values will range between [-1, 1].
● 0 : Complete lack of correlation
● +1 : Perfect positive correlation
● -1: Perfect negative correlation
Which similarity values would be most useful for our KNN prediction method?
KNN to Predict Ratings
Once we have similarities computed between all users in our user-item matrix, we can go to predict unseen ratings.
For user a missing a rating on item i, compute the predicted rating to be:
“Weighted average from K nearest neighbors of user a with rating on item i”
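A sketch of the KNN prediction above, assuming the correlation-style similarity over co-rated items and one common form of the weighted average (each user's mean offset weighted by similarity); the slide's exact formula is not shown, and the tiny rating matrix is made up.

```python
import numpy as np

def similarity(Y, observed, a, b):
    """Correlation between users a and b over the items both have rated."""
    both = observed[a] & observed[b]
    if both.sum() < 2:
        return 0.0
    ya = Y[a, both] - Y[a, both].mean()  # center by each user's mean on co-rated items
    yb = Y[b, both] - Y[b, both].mean()
    denom = np.sqrt((ya ** 2).sum() * (yb ** 2).sum())
    return float(ya @ yb / denom) if denom > 0 else 0.0

def predict(Y, observed, a, i, K=2):
    """One common form of the weighted average over the K most similar users."""
    others = [b for b in range(len(Y)) if b != a and observed[b, i]]
    top = sorted(others, key=lambda b: similarity(Y, observed, a, b), reverse=True)[:K]
    sims = np.array([similarity(Y, observed, a, b) for b in top])
    devs = np.array([Y[b, i] - Y[b, observed[b]].mean() for b in top])
    if np.abs(sims).sum() == 0:
        return Y[a, observed[a]].mean()
    return Y[a, observed[a]].mean() + sims @ devs / np.abs(sims).sum()

Y = np.array([[5, 4, 0, 1],
              [4, 5, 3, 2],
              [1, 2, 5, 4],
              [2, 1, 4, 5]], dtype=float)
observed = Y > 0
print(round(predict(Y, observed, a=0, i=2), 2))  # predict user 0's missing rating on item 2
```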
Maximum Likelihood Estimation
Maximum Likelihood Estimate
Idea: find parameters of a probabilistic model 𝜃 that maximize the likelihood of the data X being sampled from the distribution
● Before: discriminative model (classification and regression)
● Now: generative model (parameterize a probability distribution)
1. Compute log-likelihood of distribution
2. Optimize with respect to the variable of interest:
a. Take the derivative
b. Set equal to 0
c. Solve for variable
“Likelihood of dataset X”
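A small worked sketch of the recipe above for a univariate Gaussian: write the log-likelihood, differentiate, set it to zero, and solve; the closed-form answers (sample mean and variance) are checked numerically. The "true" parameters and sample size are made up.

```python
import numpy as np

# Sample data X from a Gaussian with made-up "true" parameters.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=10_000)

# Log-likelihood of X under N(mu, sigma^2):
#   sum_i [ -0.5*log(2*pi*sigma^2) - (x_i - mu)^2 / (2*sigma^2) ]
# Setting the derivative w.r.t. mu to 0 gives mu_hat = mean(X);
# setting the derivative w.r.t. sigma^2 to 0 gives sigma2_hat = mean((X - mu_hat)^2).
mu_hat = X.mean()
sigma2_hat = ((X - mu_hat) ** 2).mean()

def log_likelihood(X, mu, sigma2):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu) ** 2 / (2 * sigma2))

# The MLE should score at least as well as nearby parameter settings.
print(mu_hat, sigma2_hat)
print(log_likelihood(X, mu_hat, sigma2_hat) >= log_likelihood(X, mu_hat + 0.1, sigma2_hat))
```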
Sample Final Exam Question
y = 8x + n N=10n
Gaussian Mixture Models
Spherical Gaussian Distribution
Definition: when data points are evenly distributed around some center point
● PDF:
Gaussian Mixture Models
Mixture Model: special kind of probability distribution that consists of a weighted combination of multiple distributions
Gaussian Mixture Model (GMM): each underlying distribution is a Gaussian distribution*
● Training a GMM requires learning each distribution's mean and covariance matrix, plus the mixing weights
*Assume underlying distribution is spherical
GMM Notation
● Assume Gaussians are spherical, learned GMM:
Complete Data Case
● Assume we know the cluster assignment z for each data point
○ “Hidden” or “latent” values
● Assume points are distributed based on our mixing weights (“priors”); the probability of sampling a data point x is:
Maximum Likelihood Estimation
Process: similar to other optimization problems
(1) take the gradient with respect to the parameter of interest,
(2) set it equal to 0,
(3) and solve
Incomplete Data Case
Recall the complete data log-likelihood formulation:
Incomplete Data Case: p(j | i) is unknown
● We don’t know which cluster a point is assigned to
● Initially, the best we can do is just randomly guess →
Expectation-Maximization (EM)
Expectation-Maximization
1. Expectation: given our current guess of the parameters, calculate a “soft” assignment. Basically, construct a probability distribution over which cluster each point belongs to:
2. Maximization: since we already know the MLE for the complete-data case, re-estimate the parameters using our soft assignments
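A minimal EM sketch for a 1D (spherical) GMM following the two steps above: the E-step computes soft assignments p(j | i), and the M-step plugs those soft counts into the complete-data MLE formulas for the means, variances, and mixing weights. The data and initialization are made up.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)  # initial guess of the means
    var = np.full(k, x.var())                  # initial variances
    pi = np.full(k, 1.0 / k)                   # mixing weights ("priors")
    for _ in range(n_iters):
        # E-step: soft assignment p(j | i) for each point i and cluster j.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)  # shape (n, k)
        # M-step: weighted MLE updates using the soft counts.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return mu, var, pi

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
print(em_gmm_1d(x, k=2))
```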
Expectation-Maximization Visualized
Sample Final Exam Question
Sample Final Exam Question
p(1|1) > p(2|1)
p(1|1) = p(2|1)
p(1|1) < p(2|1)
p(1|1) > p(2|1)
p(1|1) = p(2|1)
p(1|1) < p(2|1)
Sample Final Exam Question
The EM algorithm always converges to the global optimum
The log-likelihood of an objective function can decrease after an EM iteration
The EM algorithm can only be used with GMM
None of the above
The EM algorithm always converges to the global optimum
The log-likelihood of an objective function can decrease after an EM iteration
The EM algorithm can only be used with GMM
None of the above
Bayesian Networks
Recall: Bayesian Networks
Definition: a general framework for learning probability distributions
● Assume data is generated by a set of probability distributions with parameters
● Drawn as a graph to express dependencies
○ Probability distributions can depend on each other
Fundamental Assumption of Bayesian Networks: Given its parents, a node x is conditionally independent from all other nodes in the graph
● Helps write the pdf: p(x_1, ..., x_n) = Π_i p(x_i | pa(x_i)), where pa(x_i) is the set of parents of node x_i
Example PDF
D-Separation
Goal: determine independencies (conditional or marginal)
Process:
1. “Ancestral graph”: keep the variables of interest and all nodes that have a directed path to those nodes (their ancestors)
2. “Moralize graph”: add undirected edges between any two nodes in the ancestral graph that share a child
3. Adjust all edges to be undirected
4. Read off independence
D-Separation: Independencies
Marginal: if there is no path from X1 to X2, then they are marginally independent
Conditional: if there are no paths from X1 to X2 or all such paths go through X3, then X1 and X2 are conditionally independent given X3
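A sketch that follows the four-step procedure above on an adjacency-list representation of a Bayesian network: build the ancestral graph, moralize, drop edge directions, and check whether every path between the two variables is blocked by the conditioning set. The toy network at the bottom is a hypothetical example, not one from the slides.

```python
from collections import deque

def d_separated(parents, x1, x2, given):
    """Check independence of x1 and x2 given a set of nodes.
    `parents` maps each node to the list of its parents."""
    # 1. Ancestral graph: keep the variables of interest and all their ancestors.
    keep, stack = set(), list({x1, x2} | set(given))
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, []))
    # 2. Moralize: connect any two parents that share a child; 3. make edges undirected.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:
            adj[p].add(child); adj[child].add(p)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 4. Read off independence: independent iff every path is blocked by `given`.
    blocked = set(given)
    frontier, seen = deque([x1]), {x1}
    while frontier:
        node = frontier.popleft()
        if node == x2:
            return False  # found an unblocked path -> not independent
        for nb in adj[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb); frontier.append(nb)
    return True

# Hypothetical toy network: A -> C <- B, C -> D.
parents = {"C": ["A", "B"], "D": ["C"]}
print(d_separated(parents, "A", "B", given=[]))     # True: marginally independent
print(d_separated(parents, "A", "B", given=["D"]))  # False: conditioning on the collider's descendant
```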
Sample Final Exam Question
P(Z5,Z7) = P(Z5|Z7)P(Z7) P(Z1,Z3|Z2) = P(Z1)P(Z2)P(Z3|Z1,Z2) P(Z6,Z8|Z3) = P(Z6|Z3)P(Z8|Z3) P(Z1)P(Z2) = P(Z1,Z2)
None of the above
P(Z5,Z7) = P(Z5|Z7)P(Z7)
P(Z1,Z3|Z2) = P(Z1)P(Z2)P(Z3|Z1,Z2) P(Z6,Z8|Z3) = P(Z6|Z3)P(Z8|Z3) P(Z1)P(Z2) = P(Z1,Z2)
None of the above
Sample Final Exam Question
Sample Final Exam Question
Hidden Markov Models
Definition: type of graphical model where we assume data comes from a sequence of hidden states, where:
● Each observation is associated with a state
● Each state is independent of all other states and observations given the
previous state
● Each observation is independent of all other observations and states given the
associated state
Example Hidden Markov Model
Hidden states are the top row (z)
Observations are in the bottom row (x) -- typically known; we want to estimate the corresponding hidden states
Probability of a sequence of observations and states: p(x_1, ..., x_T, z_1, ..., z_T) = p(z_1) p(x_1 | z_1) Π_{t=2}^{T} p(z_t | z_{t-1}) p(x_t | z_t)
Sample Final Exam Question
Since each day’s lunch can include any one or more of bread, soda, and/or fruits, there are 2^3-1=7 situations resulting in 7 hidden states
There are two previous states contributing to the current state. The transition probability for the 7th state can be fully determined by the transition probabilities for the 6 other hidden states. Hence there are 7*7*(7-1)=294 independent parameters
There are 7 hidden states, each can emit either asleep or awake. The emission probability for asleep or awake can be fully determined by the other. The number of independent parameters is therefore 7*(2-1)=7
Viterbi Algorithm
The Viterbi Algorithm allows us to efficiently figure out the most likely hidden state sequence z1,...,zT given a sequence of observations x1,...,xT
Viterbi Algorithm
Once we have a Viterbi table set up, we can backtrack and figure out the most likely sequence of hidden states with the following equations
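A sketch of the Viterbi algorithm: fill a table of best-path probabilities forward in time, keep backpointers, then backtrack to read off the most likely hidden state sequence. The 2-state, 2-symbol HMM parameters here are made up for illustration, not the ones from the exam question.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial probs (S,);
    A: transition probs (S, S); B: emission probs (S, O)."""
    S, T = len(pi), len(obs)
    V = np.zeros((T, S))  # V[t, s] = best prob of any path ending in state s at time t
    back = np.zeros((T, S), dtype=int)
    V[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = V[t - 1] * A[:, s] * B[s, obs[t]]
            back[t, s] = scores.argmax()  # remember which previous state was best
            V[t, s] = scores.max()
    # Backtrack from the best final state.
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1, 0], pi, A, B))
```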
Sample Final Exam Question
You can calculate this using brute force.
There are 9 possible sequences of hidden states for the observed sequence ZX:
DE = 0.5*0.1*0.1*0.1 = .0005
DF = 0.5*0.1*0.1*0.3 = .0015
DD = 0.5*0.8*0.1*0.7 = .028 → DD has the highest likelihood
EF = 0.4*0.4*0.1*0.3 = .0048
ED = 0.4*0.2*0.1*0.7 = .0056
EE = 0.4*0.4*0.1*0.1 = .0016
FD = 0.1*0*0.7*0.7 = 0
FE = 0.1*0.2*0.7*0.1 = .0014
FF = 0.1*0.8*0.7*0.3 = .0168
Sample Final Exam Question
Now calculate the answer using the viterbi table provided.
First, we can compute the hidden state at time t=2 by taking the max value of the second row of the Viterbi table. We compute max{0.028, 0.0016, 0.0168} = 0.028, so z2 = D
Next, we can find the hidden state at time t=1 by following the second equation given in the previous slide. We compute max{0.05*0.8, 0.04*0.2, 0.07*0} = 0.04, so z1 = D.
Therefore, the most likely hidden sequence is DD.