Linear Regression and Classification
AIMA 19.6, and Additional Material
CMPSC 442
Week 12, Meeting 34, Three Segments
Outline
● Linear Regression
● Logistic Regression
● Perceptrons and Neural Networks
Week 12, Meeting 34, Segment 1 of 3: Linear Regression
Fitting a Univariate Linear Regression
● Price (y) as a function of floor space (x) in houses for sale in Berkeley, CA, July 2009 (left figure)
● Squared error loss
○ A convex function with no local minima
○ Loss for different values of the weight w1 and bias w0 (right figure)
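As a concrete illustration of the squared error loss, here is a minimal Python sketch; the model h(x) = w1·x + w0 follows the slide, but the helper names and the floor-space/price values are made up, not the Berkeley data.

```python
# Univariate linear model h(x) = w1*x + w0 and its squared-error loss.

def h(x, w0, w1):
    return w1 * x + w0

def squared_error_loss(xs, ys, w0, w1):
    # Sum of squared residuals over all (x, y) pairs.
    return sum((y - h(x, w0, w1)) ** 2 for x, y in zip(xs, ys))

floor_space = [1000, 1500, 2000, 2500]   # hypothetical square feet
price = [320, 410, 505, 590]             # hypothetical prices in $1000s

print(squared_error_loss(floor_space, price, w0=50, w1=0.22))
```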
Solving a Univariate Regression
● Loss is minimized when the partial derivatives of the loss function with respect
to w1 and w0 are both 0
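In standard form (following AIMA 19.6), the squared error loss, the conditions for its minimum, and the resulting closed-form solution are:

```latex
\[
  \mathrm{Loss}(h_{\mathbf{w}}) \;=\; \sum_{j=1}^{N} \bigl(y_j - (w_1 x_j + w_0)\bigr)^2
\]
\[
  \frac{\partial}{\partial w_0}\,\mathrm{Loss}(h_{\mathbf{w}}) = 0,
  \qquad
  \frac{\partial}{\partial w_1}\,\mathrm{Loss}(h_{\mathbf{w}}) = 0
\]
\[
  w_1 = \frac{N\sum_j x_j y_j - \bigl(\sum_j x_j\bigr)\bigl(\sum_j y_j\bigr)}
             {N\sum_j x_j^{2} - \bigl(\sum_j x_j\bigr)^{2}},
  \qquad
  w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}
\]
```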
Analytic Solution: Univariate Case
● The univariate linear model can be easily solved using the above
equations, based on finding the values of the weights where the partial
derivatives equal zero
● An alternative method can be applied that relies on hill-climbing
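A minimal sketch of the analytic (closed-form) solution above, assuming NumPy; the function name fit_univariate and the data values are made up, standing in for the floor-space/price examples:

```python
import numpy as np

# Closed-form univariate least squares, matching the equations above.
def fit_univariate(x, y):
    n = len(x)
    w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / n
    return w0, w1

x = np.array([1000.0, 1500.0, 2000.0, 2500.0])  # hypothetical floor space
y = np.array([320.0, 410.0, 505.0, 590.0])      # hypothetical prices
print(fit_univariate(x, y))                      # (w0, w1) for the best-fit line
```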
Gradient Descent Hill Climbing
● Where w is a vector of the weights (including the bias) and α is the step size or learning rate, apply the update rule shown below
● For univariate regression, the loss is quadratic so the partial derivative
will be linear
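In standard form, the gradient descent update applied to each weight wi is:

```latex
\[
  w_i \;\leftarrow\; w_i \;-\; \alpha\,\frac{\partial}{\partial w_i}\,\mathrm{Loss}(\mathbf{w})
\]
```

The rule is applied to every weight on each step, repeating until the loss stops decreasing.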
Gradient Descent
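A minimal descent loop in Python, using an illustrative one-weight loss f(w) = (w − 3)² whose gradient can be written by hand; the function, starting point, and step size are assumptions for the sketch, not the loss from the housing example:

```python
# Gradient descent on a simple convex loss f(w) = (w - 3)**2.

def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    return 2.0 * (w - 3.0)   # derivative of f

w = 0.0        # initial weight
alpha = 0.1    # step size (learning rate)
for step in range(50):
    w = w - alpha * grad_f(w)   # w <- w - alpha * dLoss/dw

print(w, f(w))   # w approaches 3, the minimizer of f
```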
Consider a Single Training Example
● Chain rule: apply the chain rule to the squared-error loss for a single example (x, y)
● Derivatives: the resulting partial derivatives with respect to w0 and w1 are shown below
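For a single example (x, y) with hw(x) = w1·x + w0, the standard derivation is:

```latex
\[
  \frac{\partial}{\partial w_i}\,\mathrm{Loss}(\mathbf{w})
  \;=\; \frac{\partial}{\partial w_i}\bigl(y - h_{\mathbf{w}}(x)\bigr)^{2}
  \;=\; 2\bigl(y - h_{\mathbf{w}}(x)\bigr)\,
        \frac{\partial}{\partial w_i}\bigl(y - h_{\mathbf{w}}(x)\bigr)
\]
\[
  \frac{\partial}{\partial w_0}\,\mathrm{Loss}(\mathbf{w}) = -2\bigl(y - h_{\mathbf{w}}(x)\bigr),
  \qquad
  \frac{\partial}{\partial w_1}\,\mathrm{Loss}(\mathbf{w}) = -2\bigl(y - h_{\mathbf{w}}(x)\bigr)\,x
\]
```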
Batch Gradient Descent for Univariate Case
● Batch: sum over all data points (one epoch)
● This translates into one update rule for each weight (see the sketch below)
○ α is the learning rate, with the constant 2 folded into α
○ Guaranteed to converge if α is small
○ Increasingly slow as N increases
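A minimal sketch of batch gradient descent for the univariate case, assuming NumPy; the function name, the fixed epoch count, the value of α, and the data are all illustrative:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=1e-8, epochs=1000):
    # One epoch = one update of (w0, w1) using the sum over all N examples,
    # with the constant 2 folded into alpha.
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        err = y - (w1 * x + w0)        # residuals for all N points
        w0 += alpha * np.sum(err)      # w0 <- w0 + alpha * sum(y - h(x))
        w1 += alpha * np.sum(err * x)  # w1 <- w1 + alpha * sum((y - h(x)) * x)
    return w0, w1

x = np.array([1000.0, 1500.0, 2000.0, 2500.0])  # hypothetical data again
y = np.array([320.0, 410.0, 505.0, 590.0])
print(batch_gradient_descent(x, y))
```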
Stochastic Gradient Descent (SGD): Univariate Case
● Randomly select m data points at a time, m << N
● Comparison to batch: assume N = 10,000 and m = 100
○ Each SGD step is 100 times faster than a batch gradient descent step
○ The standard error of the gradient estimate grows only with the square root of the reduction in examples, here a factor of 10
Convergence Properties
● With fixed α, there is no guarantee of convergence
● With decreasing α, convergence is guaranteed
● What previous topic is this similar to?
Extension to Multivariate Case
● Each example xj is an n-dimensional vector
● The linear equation sums over all xj,i and adds a bias weight
● The weights are therefore an (n+1)-dimensional vector, so we define a dummy input attribute xj,0 = 1
Learning the Multivariate Regression
● Essentially the same update rule
● Need to regularize in the multivariate case
Week 12, Meeting 34, Segment 2 of 3: Logistic Regression
Regression versus Classification
● Seismic data (1982-1990, Asia & Middle East): body wave magnitude (x1) and surface wave magnitude (x2) for earthquakes (orange circles) and nuclear explosions (green circles)
● Decision boundary for a linearly separable subset of the data (left)
● All of the data: not linearly separable
Seismic Data Can be Classified
● Given the weight vector w learned for the seismic data and the 2-D example vectors x, the classification decision boundary can be operationalized as a threshold on w · x: predict one class when w · x ≥ 0 and the other when w · x < 0
Hard Threshold Function for Classification
● Turns a linear regression into a classifier that uses a hard threshold function
● At zero, the decisions switch to the other class
○ Values of the function above 0 are in one class
○ Values of the function below 0 are in the other class
Perceptron Learning Rule
● Learning the weights for the classifier with a hard threshold
○ Cannot use gradient descent because the gradients of the threshold function are either zero or undefined
○ Can use the update rule sketched below (for a single example), which converges to a solution provided the data are linearly separable
● This is called the perceptron learning rule
● It has the same form as the update rule for linear regression!
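In standard form the perceptron rule is wi ← wi + α (y − hw(x)) xi, where hw(x) applies the hard threshold to w · x. A minimal Python sketch, assuming 0/1 labels, a dummy input x0 = 1 for the bias, and two made-up 2-D examples in place of the seismic data (the function names are also made up):

```python
import numpy as np

def threshold(z):
    # Hard threshold: class 1 if w . x >= 0, else class 0.
    return 1 if z >= 0 else 0

def perceptron_update(w, x, y, alpha=0.1):
    # One application of the rule: w_i <- w_i + alpha * (y - h_w(x)) * x_i
    y_hat = threshold(np.dot(w, x))
    return w + alpha * (y - y_hat) * x

w = np.zeros(3)                                # [bias, w1, w2]
examples = [(np.array([1.0, 4.5, 3.5]), 0),    # hypothetical earthquake
            (np.array([1.0, 5.5, 4.5]), 1)]    # hypothetical explosion
for _ in range(10):                            # a few passes over the toy data
    for x, y in examples:
        w = perceptron_update(w, x, y)
print(w)
```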
Convergence Behavior of Perceptron Rule
Seismic data: training accuracy by number of epochs
● Left: perceptron learning rule, separable data (700 updates)
● Middle: non-separable data (75,000 updates)
● Right: same as middle, but with a decaying learning rate α(t) = 1000/(1000 + t)
Linear Classification with Logistic Regression
● The hard threshold decision rule is non-differentiable, so SGD cannot be used
● Replacing the hard threshold with the sigmoid function gives a differentiable decision function
● Because the values of the sigmoid function are in [0,1], they can be interpreted as the probability of the class for each example
Training a Logistic Regression
● The derivative g' of the logistic function g satisfies g'(z) = g(z)(1 − g(z))
● Thus the update rule has a somewhat different form than for multivariate regression with a hard threshold:
○ wi ← wi + α (y − hw(x)) hw(x) (1 − hw(x)) xi
Logistic Function Converges Much More Smoothly
Seismic data: training accuracy by number of epochs
● Left: 5,000 updates
● Middle: non-separable data (75,000 updates)
● Right: same as middle, but with a decaying learning rate α(t) = 1000/(1000 + t)
Week 12, Meeting 34, Segment 3 of 3: Perceptrons and Neural Networks
A Perceptron is a Classifier
● A perceptron is a single neuron (unit)
● A neural net: neurons linked by directed arcs
● A sigmoid perceptron = a logistic regression classifier
One Neuron (Unit)
● A link from neuron input ai to output aj propagates activation through the network
● Each input link has a weight wi,j for the strength and sign of the activation
● A neuron's input inj is a weighted sum over the input links (including a dummy input a0 = 1 for the bias weight w0,j)
● A neuron's output activation aj is a function g of the input inj
● A neural network: neurons linked by directed arcs
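A minimal sketch of the single-unit computation just described (inj = Σi wi,j ai with a0 = 1, and aj = g(inj)), assuming a sigmoid activation g; the function name unit_output and the weight values are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, inputs, g=sigmoid):
    # in_j = sum_i w_ij * a_i, with a_0 = 1 prepended for the bias weight.
    a = [1.0] + list(inputs)
    in_j = sum(w * ai for w, ai in zip(weights, a))
    return g(in_j)   # a_j = g(in_j)

# Hypothetical weights [w0j, w1j, w2j] and inputs [a1, a2].
print(unit_output([-1.5, 1.0, 1.0], [1.0, 1.0]))   # ~0.62: both inputs active
print(unit_output([-1.5, 1.0, 1.0], [0.0, 1.0]))   # ~0.38: only one input active
```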
Types of Neural Networks
● Feedforward network: a directed acyclic graph
○ Information propagates in one direction
○ Output is a function of the input
● Recurrent network: has cycles
○ Outputs can recur as inputs
○ Output is a function of the initial state, which depends on previous inputs
○ Dependence on earlier inputs amounts to short-term memory
● Single layer versus multi-layer
Biological Neuron
● Input
○ A neuron's dendritic tree is connected to a thousand neighboring neurons; when one fires, a positive or negative charge is received
○ The strengths of all the received charges are added together
● Output
○ If the aggregate input is greater than the axon hillock's threshold value, then the neuron fires
○ The physical and neurochemical characteristics of each synapse determine the strength and polarity of the new signal
Single Feed Forward Neuron: AND Gate
● White circles: output function is 0
● Black circle: output function is 1
● Decision threshold = 1.5
AND truth table:
x1  x2  Out
0   0   0
0   1   0
1   0   0
1   1   1
Single Feed Forward Neuron: OR Gate
● White circle: output function is 0
● Black circles: output function is 1
● Decision threshold = 0.5
OR truth table:
x1  x2  Out
0   0   0
0   1   1
1   0   1
1   1   1
Single Feed Forward Neuron Cannot be an XOR Gate
● White circle: output function is 0
● Black circles: output function is 1
● The data is not linearly separable
● Two layers are needed: the solution uses two neurons
Two Layer XOR Gate
● A hidden unit in the middle with a threshold of 1.5 turns on only if both inputs are 1
● The three weights on the inputs to the final output unit ensure:
○ If the input is 1, 1 then the weighted sum is 0
○ If the input is 1, 0 then the weighted sum is 1
○ If the input is 0, 1 then the weighted sum is 1
○ If the input is 0, 0 then the weighted sum is 0
Two Layer Feed Forward Network
● Two inputs
● One hidden layer with two neurons
● Two output neurons
○ Output is a 2-D vector: (a5, a6)
● Fully connected feed forward network
Two Layer Feed Forward Network
● The output of a network with m output nodes is a length-m vector
● Given a loss function that is additive
○ Learning decomposes into m learning problems
● Loss must be back-propagated
○ Each node j in the current hidden layer contributes to the output error Errm
○ Node error depends on the weights
● The activation function must be differentiable
Summary
● Linear regression can be learned through stochastic gradient descent
● Logistic regression is linear regression with a sigmoid output function
● A perceptron, or neuron, is a classifier
○ Sigmoid is one possible activation function for a neuron
● Neural networks consist of networks of neurons