Linear models: Recap
▶ Perceptron:
  score(y, x; θ) = θ · f(x, y)
▶ Naïve Bayes:
  log P(x, y; θ) = log P(x | y; φ) + log P(y; μ) = log B(x) + θ · f(x, y)
▶ Logistic Regression:
  log P(y | x; θ) = θ · f(x, y) − log Σ_{y'∈Y} exp(θ · f(x, y'))
Features and weights in linear models: Recap
▶ Feature representation: f(x, y)
  f(x, y = 1) = [x; 0; 0; ···; 0], where the zeros have total length (K − 1) × V
  f(x, y = 2) = [0; 0; ···; 0; x; 0; 0; ···; 0], with V zeros before x and (K − 2) × V zeros after
  f(x, y = K) = [0; 0; ···; 0; x], where the zeros have total length (K − 1) × V
▶ Weights: θ
  θ = [θ_1; θ_2; ···; θ_V; θ_1; θ_2; ···; θ_V; ···; θ_1; θ_2; ···; θ_V], one block of V weights for each class y = 1, y = 2, ···, y = K
Rearranging the features and weights
▶ Represent the features x as a column vector of length V, and represent the weights as a K × V matrix Θ:
  x = [x_1; x_2; ···; x_V]
  Θ = [ θ_{1,1}  θ_{1,2}  ···  θ_{1,V} ]   (row y = 1)
      [ θ_{2,1}  θ_{2,2}  ···  θ_{2,V} ]   (row y = 2)
      [   ···      ···    ···    ···   ]
      [ θ_{K,1}  θ_{K,2}  ···  θ_{K,V} ]   (row y = K)
▶ What is Θx?
Scores for each class
▶ Verify that
  ψ = Θx = [θ_1 · x; θ_2 · x; ···; θ_K · x] = [ψ_1; ψ_2; ···; ψ_K]
  where θ_k is the k-th row of Θ, so the entries correspond to the scores for classes 1, 2, ···, K
Implementation in Pytorch
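A minimal PyTorch sketch of computing the score vector Θx, with illustrative sizes V = 5 and K = 3 (not the slide's original code):

import torch

V, K = 5, 3                    # illustrative: number of features and number of classes
Theta = torch.randn(K, V)      # weight matrix, one row of weights per class
x = torch.randn(V)             # feature vector (e.g., bag-of-words counts)

scores = Theta @ x             # matrix-vector product: one score per class
print(scores.shape)            # torch.Size([3])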
Digression: Matrix multiplication
▶ Matrix A with m rows and n columns: A ∈ R^(m×n). For B ∈ R^(n×p), the product C = AB ∈ R^(m×p), where C_ij = Σ_{k=1}^{n} A_ik B_kj
▶ Example:
  [ 2  3 ]   [ 1  0  2 ]   [ 2  3  7 ]
  [ 1  2 ] × [ 0  1  1 ] = [ 1  2  4 ]
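As a quick check, the same product can be computed with torch.matmul (a sketch):

import torch

A = torch.tensor([[2., 3.],
                  [1., 2.]])        # 2 x 2
B = torch.tensor([[1., 0., 2.],
                  [0., 1., 1.]])    # 2 x 3
C = torch.matmul(A, B)              # 2 x 3
print(C)                            # [[2., 3., 7.], [1., 2., 4.]]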
Digression: 3-D matrix multiplication
Tensor shape: (batch-size, sentence-length, embedding size)
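A sketch of how such batched ("3-D") multiplication looks in PyTorch, with made-up sizes; torch.matmul broadcasts over the leading batch dimension, while torch.bmm takes explicit batches:

import torch

batch_size, sent_len, emb_size, hidden = 4, 7, 16, 32   # illustrative sizes
X = torch.randn(batch_size, sent_len, emb_size)         # (batch-size, sentence-length, embedding size)
W = torch.randn(emb_size, hidden)

H = torch.matmul(X, W)                                  # broadcast over the batch: (4, 7, 32)
W_batched = W.unsqueeze(0).expand(batch_size, -1, -1)   # explicit batch dimension for bmm
H2 = torch.bmm(X, W_batched)
print(H.shape, torch.allclose(H, H2))                   # torch.Size([4, 7, 32]) True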
SoftMax
▶ SoftMax, also known as the normalized exponential function:
  SoftMax(ψ)_i = exp(ψ_i) / Σ_{j=1}^{K} exp(ψ_j), for i = 1, 2, ···, K
▶ Applying SoftMax turns the scores into a probability distribution:
  SoftMax(ψ) = [P(y = 1); P(y = 2); ···; P(y = K)]
▶ Verify that this is exactly logistic regression
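A small sketch, with made-up score values, showing torch.softmax turning scores ψ into a distribution that sums to one:

import torch

psi = torch.tensor([2.0, 0.5, -1.0])    # made-up scores for K = 3 classes
probs = torch.softmax(psi, dim=0)       # exp(psi_i) / sum_j exp(psi_j)
print(probs)                            # approximately [0.786, 0.175, 0.039]
print(probs.sum())                      # approximately 1.0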
Logistic regression as a neural network
y = SoftMax(Θx)
[Figure: a single-layer network with V = 5 input units and K = 3 output classes]
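A minimal sketch of this single-layer network in PyTorch, using the V = 5 and K = 3 sizes above; nn.Linear would normally add a bias term, which the equation omits, so it is disabled here:

import torch
from torch import nn

V, K = 5, 3
model = nn.Sequential(
    nn.Linear(V, K, bias=False),   # computes Theta x
    nn.Softmax(dim=-1),            # turns the K scores into P(y | x)
)
x = torch.randn(V)
print(model(x))                    # a length-3 probability vector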
Going deep
▶ There is no reason why we can't add layers in the middle:
  z = σ(Θ_1 x)
  y = SoftMax(Θ_2 z)
Going even deeper
▶ There is no reason why we can't add layers in the middle
▶ But why?
  z_1 = σ(Θ_1 x)
  z_2 = σ(Θ_2 z_1)
  y = SoftMax(Θ_3 z_2)
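A sketch of this deeper network in PyTorch, with purely illustrative layer sizes and taking σ to be the sigmoid activation:

import torch
from torch import nn

V, H1, H2, K = 5000, 256, 64, 3              # illustrative sizes
model = nn.Sequential(
    nn.Linear(V, H1), nn.Sigmoid(),          # z1 = sigma(Theta_1 x)
    nn.Linear(H1, H2), nn.Sigmoid(),         # z2 = sigma(Theta_2 z1)
    nn.Linear(H2, K), nn.Softmax(dim=-1),    # y = SoftMax(Theta_3 z2)
)
x = torch.randn(V)
print(model(x).shape)                        # torch.Size([3])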
Non-linear classification
Linear models like logistic regression map data into a high-dimensional vector space, are expressive enough for many NLP problems, and work well in practice. Why, then, do we need more complex non-linear models?
▶ There have been rapid advances in deep learning, a family of non-linear methods that learn complex functions of the input through multiple layers of computation.
▶ Deep learning facilitates the incorporation of word embeddings, dense vector representations of words that can be learned from massive amounts of unlabeled data.
▶ Word embeddings have evolved from early static embeddings (e.g., Word2Vec, GloVe) to recent dynamic, contextualized embeddings (ELMo, BERT, XLNet).
▶ There have been rapid advances in specialized hardware called graphics processing units (GPUs), and many deep learning models can be implemented efficiently on GPUs.
Feedforward Neural networks: an intuitive justification
▶ In image classification, instead of using the input (pixels) to predict the image type directly, you can imagine first predicting the shapes of parts of the image: a mouth, a hand, an ear.
▶ In text processing, we can imagine a similar scenario. Say we want to classify movie reviews (or the movies themselves) into the label set {Good, Bad, OK}. Instead of predicting these labels directly, we first predict a set of composite features, such as the story, acting, soundtrack, cinematography, etc., from the raw input (the words in the text).
Face Recognition
Feedforward neural networks
Formally, this is what we do:
▶ Use the text x to predict the features z. Specifically, train a logistic regression classifier to compute P(z_k | x) for each k ∈ {1, 2, ···, K_z}
▶ Use the features z to predict the label y. Train a logistic regression classifier to compute P(y | z). z is unknown, or "hidden", so we use P(z | x) as the features.
Caveat: it is easy to demonstrate that this is what the model does for image processing, but it is hard to show that this is what actually goes on in language processing. Interpretability is a major issue in neural models for language processing.
The hidden layer: computing the composite features
▶ If we assume each z_k is binary, that is, z_k ∈ {0, 1}, then P(z_k | x) can be modeled with binary logistic regression:
  P(z_k = 1 | x; Θ^(x→z)) = σ(θ_k^(x→z) · x) = (1 + exp(−θ_k^(x→z) · x))^(−1)
▶ The weight matrix Θ^(x→z) ∈ R^(K_z × V) is constructed by stacking (not concatenating, as in linear models) the weight vectors for each z_k:
  Θ^(x→z) = [θ_1^(x→z), θ_2^(x→z), ···, θ_{K_z}^(x→z)]^T
▶ We assume an offset/bias term is included in x and that its parameter is included in each θ_k^(x→z)
Notation: Θ^(x→z) ∈ R^(K_z × V) is a real-valued matrix with K_z rows and V columns
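A minimal sketch of the hidden-layer computation, with illustrative sizes, treating the last position of x as the offset term the slide assumes:

import torch

V, Kz = 6, 4                         # illustrative: input size (incl. offset) and number of hidden units
Theta_xz = torch.randn(Kz, V)        # weight matrix Theta^(x->z), one row per hidden unit z_k
x = torch.randn(V)
x[-1] = 1.0                          # offset/bias term folded into the input

p_z = torch.sigmoid(Theta_xz @ x)    # P(z_k = 1 | x) for each k
print(p_z)                           # Kz values, each in (0, 1)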
The output layer
▶ The output layer is computed by the multiclass logistic regression probability:
  P(y = j | z; Θ^(z→y), b) = exp(θ_j^(z→y) · z + b_j) / Σ_{j'∈Y} exp(θ_{j'}^(z→y) · z + b_{j'})
▶ The weight matrix Θ^(z→y) ∈ R^(K_y × K_z) is again constructed by stacking the weight vectors for each y_k:
  Θ^(z→y) = [θ_1^(z→y), θ_2^(z→y), ···, θ_{K_y}^(z→y)]^T
▶ The vector of probabilities over each possible value of y is denoted:
  P(y | z; Θ^(z→y), b) = SoftMax(Θ^(z→y) z + b)
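Continuing the sketch with made-up sizes, the output layer applies Θ^(z→y) and a bias b, followed by SoftMax:

import torch

Kz, Ky = 4, 3
Theta_zy = torch.randn(Ky, Kz)                 # weight matrix Theta^(z->y), one row per label
b = torch.randn(Ky)                            # bias vector
z = torch.rand(Kz)                             # stand-in for P(z | x) from the hidden layer

p_y = torch.softmax(Theta_zy @ z + b, dim=0)   # P(y | z) = SoftMax(Theta^(z->y) z + b)
print(p_y.sum())                               # approximately 1.0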
Activation functions
▶ Sigmoid: the range of the sigmoid function is (0, 1):
  σ(x) = 1 / (1 + e^(−x))
▶ Tanh: the range of the tanh activation function is (−1, 1):
  tanh(x) = (e^(2x) − 1) / (e^(2x) + 1)
▶ ReLU: the rectified linear unit (ReLU) is zero for negative inputs and linear for positive inputs:
  ReLU(x) = max(x, 0) = { 0 if x < 0; x otherwise }
Sigmoid and tanh are sometimes described as squashing functions.
Activation functions in Pytorch
from torch import nn
import torch

input = torch.randn(4)     # four random input values
sigmoid = nn.Sigmoid()
output = sigmoid(input)    # squashed into (0, 1)

tanh = nn.Tanh()
output = tanh(input)       # squashed into (-1, 1)

relu = nn.ReLU()
output = relu(input)       # negative values clipped to 0
Output and loss functions
In a multi-class classification setting, a SoftMax output produces a probability distribution over the possible labels. It works well together with the negative conditional log-likelihood (just as in logistic regression),
  L = −Σ_{i=1}^{N} log P(y^(i) | x^(i); Θ),
or the cross-entropy loss: writing ỹ_j ≜ P(y = j | x^(i); Θ),
  L = −Σ_{i=1}^{N} e_{y^(i)} · log ỹ
where e_{y^(i)} is a one-hot vector of zeros with a value of one at position y^(i)
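In PyTorch this loss is usually computed with nn.CrossEntropyLoss, which applies the SoftMax internally and therefore expects the unnormalized scores; a sketch with made-up shapes and labels:

import torch
from torch import nn

N, K = 4, 3                           # tiny batch size and number of classes (illustrative)
scores = torch.randn(N, K)            # unnormalized scores, no SoftMax applied
labels = torch.tensor([0, 2, 1, 2])   # gold labels y^(i)

loss_fn = nn.CrossEntropyLoss()       # SoftMax + negative log-likelihood, averaged over the batch
print(loss_fn(scores, labels))        # a scalar loss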
Output and loss function
▶ There are alternatives to SoftMax and the cross-entropy loss, just as there are alternatives in linear models.
▶ Pairing an affine transformation (remember the perceptron) with a margin loss:
  ψ(y; x^(i), Θ) = θ_y^(z→y) · z + b_y
  ℓ_MARGIN(Θ; x^(i), y^(i)) = max_{y ≠ y^(i)} ( 1 + ψ(y; x^(i), Θ) − ψ(y^(i); x^(i), Θ) )_+
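A minimal sketch of this margin loss in plain PyTorch; margin_loss is an illustrative helper, not a library function (the built-in nn.MultiMarginLoss is related, but it averages over all competing labels rather than taking the max):

import torch

def margin_loss(scores: torch.Tensor, gold: int) -> torch.Tensor:
    """max over y != gold of (1 + psi(y) - psi(gold)), clipped at zero."""
    margins = 1.0 + scores - scores[gold]        # 1 + psi(y) - psi(y_gold) for every y
    margins[gold] = 0.0                          # exclude the gold label from the max
    return torch.clamp(margins.max(), min=0.0)   # keep only the positive part

scores = torch.tensor([1.2, 0.4, 1.0])           # made-up scores psi for K = 3 labels
print(margin_loss(scores, gold=0))               # tensor(0.8000): the third label violates the margin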
Inputs and Lookup layers
▶ Assume a bag-of-words model, where the input x holds the count of each word x_j (this can be generalized to feature counts).
▶ To compute the hidden unit z_k:
  z_k = Σ_{j=1}^{V} θ_{j,k}^(x→z) x_j
▶ This text representation is particularly suited for feedforward networks.
▶ The connections from word j to each of the hidden units z_k form a vector θ_j^(x→z), which is sometimes described as the embedding of word j. Word embeddings can be learned from unlabeled data, using techniques such as Word2Vec and GloVe (see the sketch below).
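An illustrative sketch checking that multiplying Θ^(x→z) by the count vector is the same as summing the word embeddings (columns) weighted by their counts:

import torch

V, Kz = 10, 4                           # illustrative sizes
Theta_xz = torch.randn(Kz, V)           # column j is the embedding of word j
x = torch.zeros(V)
x[2], x[7] = 2.0, 1.0                   # toy document: word 2 appears twice, word 7 once

z_pre = Theta_xz @ x                                # sum_j theta_j * x_j
z_lookup = 2 * Theta_xz[:, 2] + Theta_xz[:, 7]      # explicit embedding lookup and weighted sum
print(torch.allclose(z_pre, z_lookup))              # True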
Alternative text representations
▶ Alternatively, a text can be represented as a sequence of word tokens w_1, w_2, w_3, ···, w_M. This view is useful for models such as Convolutional Neural Networks (ConvNets), which process text as a sequence.
▶ Each word token w_m is represented as a one-hot vector e_{w_m} with dimension V. The complete document can be represented by the horizontal concatenation of these one-hot vectors: W = [e_{w_1}, e_{w_2}, ···, e_{w_M}] ∈ R^(V×M)
▶ To show that this is equivalent to the bag-of-words model, we can recover the word counts from the matrix-vector product W[1, 1, ···, 1]^T ∈ R^V
▶ The matrix product Θ^(x→z)W ∈ R^(K_z×M) contains the horizontally concatenated embeddings of each word in the document (see the sketch below)
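A short sketch, with made-up sizes, verifying both claims: W[1, ···, 1]^T recovers the word counts, and Θ^(x→z)W stacks the per-token embeddings:

import torch
import torch.nn.functional as F

V, Kz = 10, 4
tokens = torch.tensor([2, 7, 2])                 # a three-token document w1, w2, w3
W = F.one_hot(tokens, num_classes=V).T.float()   # V x M matrix of one-hot columns

counts = W @ torch.ones(W.shape[1])              # recovers the bag-of-words counts (length V)
print(counts[2], counts[7])                      # tensor(2.), tensor(1.)

Theta_xz = torch.randn(Kz, V)
E = Theta_xz @ W                                 # Kz x M: the embedding of each token, in order
print(E.shape)                                   # torch.Size([4, 3])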