Final Review
April 19th, 2022

Neural Networks

Neural Networks
Neural Networks: another supervised machine learning technique
● Input layer (x)
● Hidden layer (h)
● Output layer (z)
● Weights (w)

Example: Neural Network
● Two hidden layers
● Sigmoid activation
● Loss function L

Ex: Forward and Backward Propagation
Activation function:

Ex: Forward Computation
Activation function:

Ex: Forward and Backward Propagation
Repeat for remaining layers… Activation function:

Backpropagation
Work backwards to compute the gradient of the loss with respect to each layer's weights:
Weight for layer 3

Backpropagation
Work backwards:
Weight for layer 2

Backpropagation
Work backwards:
Weight for layer 1
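
A minimal sketch of the forward and backward passes described above, for a small fully connected network with sigmoid activations and squared-error loss. The layer sizes, variable names (W1, W2, W3), and learning rate are assumptions for illustration, not necessarily the example worked on the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed toy dimensions: 2 inputs -> 3 hidden -> 3 hidden -> 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1))          # input column vector
y = np.array([[1.0]])                # target
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(3, 3))
W3 = rng.normal(size=(1, 3))

# Forward pass: each layer applies a linear map followed by the sigmoid.
h1 = sigmoid(W1 @ x)
h2 = sigmoid(W2 @ h1)
z = sigmoid(W3 @ h2)
loss = 0.5 * np.sum((z - y) ** 2)    # squared-error loss L

# Backward pass: work backwards, reusing each layer's delta.
# sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a)) = h * (1 - h)
delta3 = (z - y) * z * (1 - z)       # dL/d(pre-activation of layer 3)
dW3 = delta3 @ h2.T                  # gradient for the layer-3 weights
delta2 = (W3.T @ delta3) * h2 * (1 - h2)
dW2 = delta2 @ h1.T                  # gradient for the layer-2 weights
delta1 = (W2.T @ delta2) * h1 * (1 - h1)
dW1 = delta1 @ x.T                   # gradient for the layer-1 weights

# One gradient-descent step with an assumed learning rate.
lr = 0.1
W3 -= lr * dW3
W2 -= lr * dW2
W1 -= lr * dW1
```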

Convolutional Layers
To evaluate a conv layer:
● Slide the filter across the width and height of the input volume (see the sketch after this list)
● Compute dot products between the entries of the filter and the input
● This creates a 2D activation map
● Stack the activation maps together to create the output
● Activation maps store spatial data for a given filter
● The network starts to pick up on patterns in the input
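
A hedged sketch of the steps above: one filter slides over an input volume and produces a single 2D activation map (stride 1, no padding assumed); several maps are then stacked into the output volume. The shapes and helper name are illustrative, not the course's reference implementation.

```python
import numpy as np

def conv2d_single_filter(x, w, b=0.0):
    """Slide one filter over an input volume (C, H, W); return a 2D activation map."""
    C, H, W = x.shape
    Cf, kh, kw = w.shape
    assert C == Cf, "filter depth must match input depth"
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the filter and the input patch it currently covers.
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w) + b
    return out

# Stack the maps from several filters to create the output volume.
x = np.random.randn(3, 8, 8)                     # e.g. a 3-channel 8x8 input
filters = [np.random.randn(3, 4, 4) for _ in range(20)]
output = np.stack([conv2d_single_filter(x, w) for w in filters])  # shape (20, 5, 5)
```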

Pooling Layers
Use: insert in between convolutional layers to decrease dimensionality
● Reduces overfitting
● Reduces computational cost

CNN Architecture
How many filters are in convolutional layer 1?
How many learnable parameters are in convolutional layer 1?

CNN Architecture
How many filters are in convolutional layer 1? 20
How many learnable parameters are in convolutional layer 1? 4*4*3*20 + 20
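
A quick sanity check of that count (each of the 20 filters has 4×4×3 weights plus one bias, so 4*4*3*20 + 20 = 980); the layer dimensions come from the figure and are repeated here as assumptions.

```python
filter_h, filter_w, in_channels, num_filters = 4, 4, 3, 20
weights = filter_h * filter_w * in_channels * num_filters  # 4*4*3*20 = 960
biases = num_filters                                       # one bias per filter
print(weights + biases)                                    # 980
```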

Activation Functions
Each layer can be transformed by a non-linear activation function in order to capture non-linear relationships among features.
Important Properties:
● Nonlinear
● Easy to compute
● Subdifferentiable

Example Activation Functions
Common for CNN:
Common for NN:
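
The specific functions pictured on this slide are not reproduced here; as a hedged illustration, these are activations commonly seen in practice (ReLU is typical for CNNs, sigmoid and tanh for classic fully connected networks).

```python
import numpy as np

def relu(a):        # common for CNNs; subdifferentiable at 0
    return np.maximum(0.0, a)

def sigmoid(a):     # common for classic NNs; squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):        # zero-centered alternative to the sigmoid
    return np.tanh(a)
```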

Recurrent Neural Networks
● We learned that CNNs are good at recognizing spatial relationships
● RNNs, in turn, are good at recognizing temporal relationships, because they can learn dependencies between timesteps (see the sketch below)
○ Ex. stock market price prediction, sentence translation, and video data
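
As a hedged illustration of "dependencies between timesteps": a minimal vanilla RNN cell, where the hidden state at each step depends on the previous hidden state. The weight names, sizes, and the tanh nonlinearity are assumptions.

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, h0):
    """Run a vanilla RNN over a sequence. xs: list of input vectors; h0: initial hidden state."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)   # new hidden state depends on the previous one
        ys.append(Why @ h)               # output at this timestep
    return ys, h

# Example usage with assumed sizes: 4-dim inputs, 8-dim hidden state, 2-dim outputs.
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]
ys, h = rnn_forward(xs, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)),
                    rng.normal(size=(2, 8)), np.zeros(8))
```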

Sample Final Exam Question

Sample Final Exam Question

No, because points a and d both map to (0,0) even though they have different labels. In later layers these points can never be separated, so the neural network will always mislabel at least one of them.

Sample Final Exam Question

Yes, because you could learn four different classifiers, forming a decision boundary between squares and circles on each of the four sides. In general, by the Universal Approximation Theorem, a neural network with one hidden layer can approximate any continuous function given sufficient width (enough hidden units).

Sample Final Exam Question

False; training performance gives no indication of test performance, as test data is completely held out from the training stage of a model.

Sample Final Exam Question

Sample Final Exam Question

Number of parameters = size of filter × number of filters = (1×2×2)×1 = 4
Output size: 1×4×1
Number of parameters = input size × output size = 2×1 = 2

Clustering Methods

Unsupervised Machine Learning
● Before: Supervised Machine Learning
○ Requires labels
○ Labels are hard to collect and can introduce noise
● Now: Unsupervised Machine Learning
○ No explicit reliance on labels
○ Use similarity between groups to extract meaningful labels
○ Two groups: clustering and matrix completion

K-means Clustering – Algorithm
● Objective: minimize the distance between points in the same cluster; maximize the distance between points in different clusters
○ Given a data set with n points
○ (1) Randomly choose k cluster means
○ (2) Iterate between two main steps until convergence (a sketch follows below):
■ (a) Assign each data point to the cluster i whose mean is closest to it
■ (b) Recompute each cluster mean from all data points assigned to cluster i
● Best with
○ Globular clusters
○ Similar variances amongst clusters
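
A minimal sketch of the two-step loop above, assuming Euclidean distance and initialization by sampling k data points (one common choice, not necessarily the one used in lecture).

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """X: (n, d) data matrix. Returns (means, assignments)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # (1) random initial means
    for _ in range(n_iters):                               # (2) iterate until convergence
        # (a) assign each point to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # (b) recompute each cluster mean from its assigned points
        new_means = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, assign
```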

Spectral Clustering
Spectral Clustering: a graph-partitioning algorithm where each data point is a node and edges connect the data points
● The edge weight w_ij is the similarity between x_i and x_j, computed using cosine similarity

Spectral Clustering Components
● Similarity Matrix (W):
○ Symmetric
● Degree Matrix (D):
○ Diagonal
● Laplacian Matrix: L = D – W

Spectral Clustering Intuition
● Want to partition the graph into k connected components while minimizing the cost of the cuts (so the most similar data points stay together in each of the k splits)
● Split the graph in a way that has minimal effect on the graph Laplacian
● Therefore, perform splits according to the k eigenvectors corresponding to the smallest k eigenvalues of L
● To read off the clusters, run k-means (or another clustering algorithm) on the rows of the matrix formed by these eigenvectors (see the sketch below)

Spectral Clustering
● The multiplicity of the eigenvalue 0 is the number of connected components in the graph
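
A minimal sketch of this pipeline: build W from cosine similarity, form D and L = D − W, take the eigenvectors for the k smallest eigenvalues, then run k-means on their rows. Using scikit-learn's KMeans and the unnormalized Laplacian are illustrative choices, not necessarily the course's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k):
    """X: (n, d) data. Returns a cluster label for each row."""
    # Similarity matrix W using cosine similarity (symmetric).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)
    # Degree matrix D (diagonal) and Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Eigenvectors for the k smallest eigenvalues of L (eigh sorts ascending).
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                      # (n, k): one row per data point
    # Cluster the rows of U to read off the partition.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```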

Hierarchical Clustering
Agglomerative Clustering: start with every point in its own cluster, then repeatedly merge clusters based on some similarity criterion
Divisive Clustering: start with one big cluster, then repeatedly split clusters based on some criterion

Linkage Criteria

Agglomerative Clustering Algorithm
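
A minimal sketch of agglomerative clustering, assuming single linkage (minimum pairwise distance between clusters) as the merge criterion; the lecture may have used a different linkage, and this naive version is quadratic-plus in cost.

```python
import numpy as np

def agglomerative(X, k):
    """Naive single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(X))]        # start: every point is its own cluster
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if link < best[2]:
                    best = (a, b, link)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    return clusters
```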

Comparison of Clustering Methods
K-means: struggles with non-globular shapes such as spirals and concentric circles
Hierarchical: gives many different numbers of clusters in a single run; doesn't depend on initialization
Spectral: handles concentric circles and spirals, but is less efficient (requires an eigendecomposition plus k-means)

Sample Final Exam Question

No; the K-means objective function monotonically decreases (it never increases across iterations).

Sample Final Exam Question

Sample Final Exam Question

[0.8,0.1,0.6,0]
[0,0,2.4,0,0]
[-0.8,0,1.4,-0.6,0]

Sample Final Exam Question

{P,X},{Q,Y},{Z}

Sample Final Exam Question

Plot A corresponds to the lower value of σ²

Collaborative Filtering

Collaborative Filtering
A type of recommendation system that does not require substantial expert knowledge and improves the quality of recommendations.
Main Idea: solve a regression problem for the expected rating that a user would give to an item, using a metric of similarity between users
● Similar users will rate items similarly

Matrix Factorization
Given an observed user-item matrix with potential missing cells, construct an approximation that is completely filled.
Set up a ridge regression problem:
● Y: the observed user-item matrix, with Y_{ai} the rating that user a assigned to item i
● X: the filled approximation of the user-item matrix
● D: the set of observed (a, i) ratings
● λ: a tunable regularization hyperparameter

Matrix Factorization
Without constraining X to be low rank, we get trivial solutions of 0 for the missing observations. Instead, define X = U Vᵀ, where U is n×k (users) and V is m×k (items).
Our objective function can then be rewritten as:
J(U, V) = Σ_{(a,i)∈D} (Y_{ai} − [U Vᵀ]_{ai})² + λ(‖U‖²_F + ‖V‖²_F)

Optimizing U and V
How do we go about minimizing J(U, V) with respect to U and V?
We can minimize iteratively, alternating between holding one factor constant and minimizing over the other:
1. Treat V as constant and minimize J(U, V) w.r.t U
2. Treat U as constant and minimize J(U, V) w.r.t V
3. Repeat until some stopping criteria
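
A minimal sketch of this alternating scheme for the factorization X = U Vᵀ, assuming the ridge objective written above with observed entries given as a binary mask. The closed-form per-row ridge solves are a standard choice, not necessarily the exact update derived in lecture.

```python
import numpy as np

def als(Y, mask, k=5, lam=0.1, n_iters=20, seed=0):
    """Y: (n, m) ratings (arbitrary values where mask == 0).
    mask: (n, m) binary matrix of observed entries. Returns (U, V) with Y ≈ U @ V.T."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    I = np.eye(k)
    for _ in range(n_iters):
        # Hold V fixed, minimize J(U, V) w.r.t. U: one ridge regression per user (row).
        for a in range(n):
            Va = V[mask[a] == 1]                    # items rated by user a
            ya = Y[a, mask[a] == 1]
            U[a] = np.linalg.solve(Va.T @ Va + lam * I, Va.T @ ya)
        # Hold U fixed, minimize J(U, V) w.r.t. V: one ridge regression per item (column).
        for i in range(m):
            Ui = U[mask[:, i] == 1]                 # users who rated item i
            yi = Y[mask[:, i] == 1, i]
            V[i] = np.linalg.solve(Ui.T @ Ui + lam * I, Ui.T @ yi)
    return U, V
```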

K Nearest Neighbors
The idea here is to make predictions using information from the k most similar users.
General steps:
1. Define a similarity metric between users
2. Compute similarity between user a and all other users
3. Rate item i for user a using the K most similar users

Similarity Metric
Allows us to quantify how alike two users, a and b, are (a correlation):
sim(a, b) = Σ_i (Y_{ai} − Ȳ_a)(Y_{bi} − Ȳ_b) / sqrt( Σ_i (Y_{ai} − Ȳ_a)² · Σ_i (Y_{bi} − Ȳ_b)² )
where the sums run over items i rated by both users, and Ȳ_a is the avg. rating from user a among items rated by both user a and user b.

Similarity Metric
For our similarity metric, values will range between [-1, 1].
● 0 : Complete lack of correlation
● +1 : Perfect positive correlation
● -1: Perfect negative correlation
Which similarity values would be most useful for our KNN prediction method?

KNN to Predict Ratings
Once we have similarities computed between all users in our user-item matrix, we can go to predict unseen ratings.
For user a missing a rating on item i, compute the predicted rating ŷ_{ai} to be:
"a weighted average from the K nearest neighbors of user a with a rating on item i" (see the sketch below)
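
A minimal sketch of the whole pipeline, assuming the Pearson-style similarity above and a simple similarity-weighted average over the K most similar users who rated the item (NaN marks a missing rating). The exact weighting used in lecture may differ.

```python
import numpy as np

def similarity(Y, a, b):
    """Correlation between users a and b over items both rated (Y uses NaN for missing)."""
    both = ~np.isnan(Y[a]) & ~np.isnan(Y[b])
    if both.sum() == 0:
        return 0.0
    ya = Y[a, both] - Y[a, both].mean()
    yb = Y[b, both] - Y[b, both].mean()
    denom = np.sqrt((ya ** 2).sum() * (yb ** 2).sum())
    return float(ya @ yb / denom) if denom > 0 else 0.0

def predict_rating(Y, a, i, K=3):
    """Predict user a's rating of item i from the K most similar users who rated item i."""
    candidates = [b for b in range(len(Y)) if b != a and not np.isnan(Y[b, i])]
    sims = sorted(((similarity(Y, a, b), b) for b in candidates), reverse=True)[:K]
    num = sum(s * Y[b, i] for s, b in sims)
    den = sum(abs(s) for s, b in sims)
    return num / den if den > 0 else np.nan
```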

Maximum Likelihood Estimation

Maximum Likelihood Estimate
Idea: find parameters of a probabilistic model 𝜃 that maximize the likelihood of the data X being sampled from the distribution
● Before: discriminative model (classification and regression)
● Now: generative model (parameterize a probability distribution)
1. Compute the log-likelihood of the data under the distribution
2. Optimize with respect to the parameter of interest:
a. Take the derivative
b. Set it equal to 0
c. Solve for the parameter
“Likelihood of dataset X”: p(X | θ) = ∏_i p(x_i | θ), assuming the samples are drawn independently
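
A worked illustration of these three steps (not taken from the slides): the MLE for the mean μ of a one-dimensional Gaussian with known variance σ².

```latex
\begin{align*}
\log p(X \mid \mu)
  &= \sum_{i=1}^{n} \log p(x_i \mid \mu)
   = -\tfrac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2} \\
% Step (a): take the derivative of the log-likelihood with respect to \mu.
\frac{\partial}{\partial \mu}\log p(X \mid \mu)
  &= \sum_{i=1}^{n} \frac{x_i-\mu}{\sigma^2} \\
% Steps (b) and (c): set the derivative equal to 0 and solve for \mu.
0 = \sum_{i=1}^{n}(x_i-\mu)
  &\;\Longrightarrow\;
  \hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i
\end{align*}
```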

Sample Exam Final Question

y = 8x + n N=10n

Gaussian Mixture Models

Spherical Gaussian Distribution
Definition: when data points are evenly distributed around some center point
● PDF: p(x | μ, σ²) = (2πσ²)^(−d/2) · exp(−‖x − μ‖² / (2σ²))

Gaussian Mixture Models
Mixture Model: special kind of probability distribution that consists of a weighted combination of multiple distributions
Gaussian Mixture Model (GMM): underlying distribution is a Gaussian distribution*
● Training a GMM requires learning each distribution's mean and covariance matrix, as well as the mixing weights
*Assume underlying distribution is spherical

GMM Notation
● Assume Gaussians are spherical, learned GMM:

Complete Data Case
● Assume we know the cluster assignment z for each data point
○ These are the "hidden" or "latent" values
● Assume points are distributed according to our mixing weights ("priors"); the probability of sampling a data point x is:

Maximum Likelihood Estimation
Process: similar to other optimization problems
(1) take the gradient with respect to the parameter of interest,
(2) set it equal to 0,
(3) and solve

Incomplete Data Case
Recall the complete data log-likelihood formulation:
Incomplete Data Case: p(j | i) is unknown
● We don’t know which cluster a point is assigned to
● Initially, the best we can do is just randomly guess →
Expectation-Maximization (EM)

Expectation-Maximization
1. Expectation: given our current guess of the parameters, calculate a "soft" assignment. Basically, construct a probability distribution over which cluster each point belongs to:
2. Maximization: since we already know the MLE formulas for complete data, re-estimate the parameters by plugging in our soft assignments (see the sketch below)
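
A minimal sketch of these two steps for a spherical GMM, assuming per-cluster variances σ_j² and mixing weights π_j. The update formulas are the standard ones and may be notated differently in lecture.

```python
import numpy as np

def em_spherical_gmm(X, k, n_iters=50, seed=0):
    """X: (n, d) data. Returns mixing weights pi, means mu (k, d), variances var (k,)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)]      # random initial guess
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: soft assignments p(j | i) proportional to pi_j * N(x_i; mu_j, var_j * I).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (n, k)
        logp = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
        logp -= logp.max(axis=1, keepdims=True)                           # for stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)                                 # responsibilities
        # M-step: plug the soft counts into the complete-data MLE formulas.
        Nj = r.sum(axis=0)                                                # effective counts
        pi = Nj / n
        mu = (r.T @ X) / Nj[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (d * Nj)
    return pi, mu, var
```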

Expectation-Maximization Visualized

Sample Final Exam Question

Sample Final Exam Question
● p(1|1) > p(2|1)
● p(1|1) = p(2|1)
● p(1|1) < p(2|1)

● p(1|1) > p(2|1)
● p(1|1) = p(2|1)
● p(1|1) < p(2|1)

Sample Final Exam Question
● The EM algorithm always converges to the global optimum
● The log-likelihood of an objective function can decrease after an EM iteration
● The EM algorithm can only be used with GMM
● None of the above

● The EM algorithm always converges to the global optimum
● The log-likelihood of an objective function can decrease after an EM iteration
● The EM algorithm can only be used with GMM
● None of the above

Bayesian Networks

Recall: Bayesian Networks
Definition: a general framework for learning probability distributions
● Assume the data is generated by a set of probability distributions with parameters
● Drawn as a graph to express dependencies
○ Probability distributions can depend on each other
Fundamental Assumption of Bayesian Networks: given its parents, a node x is conditionally independent of all other nodes in the graph
● Helps write the pdf: p(x_1, …, x_n) = ∏_i p(x_i | parents(x_i)), where parents(x_i) is the set of parents of node x_i

Example PDF

D-Separation
Goal: determine independencies (conditional or marginal)
Process
1. "Ancestral graph": keep the variables of interest, plus all nodes that have a directed path to those nodes
2. "Moralize graph": add undirected edges between any two nodes in the ancestral graph that share a child
3. Adjust all edges to be undirected
4. Read off the independencies

D-Separation: Independencies
Marginal: if there is no path from X1 to X2, then they are marginally independent
Conditional: if there are no paths from X1 to X2, or all such paths go through X3, then X1 and X2 are conditionally independent given X3

Sample Final Exam Question
● P(Z5,Z7) = P(Z5|Z7)P(Z7)
● P(Z1,Z3|Z2) = P(Z1)P(Z2)P(Z3|Z1,Z2)
● P(Z6,Z8|Z3) = P(Z6|Z3)P(Z8|Z3)
● P(Z1)P(Z2) = P(Z1,Z2)
● None of the above

● P(Z5,Z7) = P(Z5|Z7)P(Z7)
● P(Z1,Z3|Z2) = P(Z1)P(Z2)P(Z3|Z1,Z2)
● P(Z6,Z8|Z3) = P(Z6|Z3)P(Z8|Z3)
● P(Z1)P(Z2) = P(Z1,Z2)
● None of the above

Sample Final Exam Question

Sample Final Exam Question

Hidden Markov Models
Definition: a type of graphical model where we assume the data comes from a sequence of hidden states, where:
● Each observation is associated with a state
● Each state is independent of all other states and observations given the previous state
● Each observation is independent of all other observations and states given its associated state

Example Hidden Markov Model
Hidden states are in the top row (z)
Observations are in the bottom row (x) -- typically known; we want to estimate the corresponding hidden states
Probability of a sequence of observations and states: p(x_1, …, x_T, z_1, …, z_T) = p(z_1) p(x_1 | z_1) ∏_{t=2}^{T} p(z_t | z_{t-1}) p(x_t | z_t)

Sample Final Exam Question
Since each day's lunch can include any one or more of bread, soda, and/or fruits, there are 2^3 − 1 = 7 situations, resulting in 7 hidden states.
There are two previous states contributing to the current state. The transition probability for the 7th state can be fully determined by the transition probabilities for the 6 other hidden states. Hence there are 7*7*(7-1) = 294 independent parameters.
There are 7 hidden states, and each can emit either asleep or awake. The emission probability for asleep or awake can be fully determined by the other. The number of independent parameters is therefore 7*(2-1) = 7.

Viterbi Algorithm
The Viterbi Algorithm allows us to efficiently figure out the most likely hidden state sequence z1, ..., zT given a sequence of observations x1, ..., xT

Viterbi Algorithm
Once we have a Viterbi table set up, we can backtrack and figure out the most likely sequence of hidden states with the following equations

Sample Final Exam Question
You can calculate this using brute force.
There are 9 possible sequences of hidden states for the observed sequence ZX:
DE = 0.5*0.1*0.1*0.1 = .0005
DF = 0.5*0.1*0.1*0.3 = .0015
DD = 0.5*0.8*0.1*0.7 = .028 → DD has the highest likelihood
EF = 0.4*0.4*0.1*0.3 = .0048
ED = 0.4*0.2*0.1*0.7 = .0056
EE = 0.4*0.4*0.1*0.1 = .0016
FE = 0.1*0.2*0.7*0.1 = .0014
FF = 0.1*0.8*0.7*0.3 = .0168

Sample Final Exam Question
Now calculate the answer using the Viterbi table provided.
First, we can compute the hidden state at time t=2 by taking the max value of the second row of the Viterbi table. We compute max{0.028, 0.0016, 0.0168} = 0.028, so z2 = D.
Next, we can find the hidden state at time t=1 by following the second equation given on the previous slide. We compute max{0.05*0.8, 0.04*0.2, 0.07*0} = 0.04, so z1 = D.
Therefore, the most likely hidden sequence is DD.
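
A minimal sketch of the Viterbi computation described above, assuming the model is given as initial, transition, and emission probability tables indexed by integers; the function and variable names are illustrative, and no values from the exam question are hard-coded.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: (S,) initial probs;
    A: (S, S) transition probs, A[i, j] = p(z_{t+1}=j | z_t=i);
    B: (S, O) emission probs, B[i, o] = p(x=o | z=i).
    Returns the most likely hidden state sequence."""
    S, T = len(pi), len(obs)
    table = np.zeros((T, S))            # table[t, s] = best score of any path ending in s at t
    back = np.zeros((T, S), dtype=int)  # back[t, s] = best previous state
    table[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = table[t - 1][:, None] * A * B[None, :, obs[t]]   # (S, S): prev -> current
        back[t] = scores.argmax(axis=0)
        table[t] = scores.max(axis=0)
    # Backtrack from the best final state.
    z = [int(table[-1].argmax())]
    for t in range(T - 1, 0, -1):
        z.append(int(back[t, z[-1]]))
    return list(reversed(z))
```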