Convolutional Networks for text classification
Convolutional Networks
- A convolutional network is designed to identify informative local patterns in a large structure and to combine them, via a pooling function, into a fixed-size vector representation that captures the local aspects most informative for the prediction task at hand.
- Unlike a feedforward network, a convolutional network is not fully connected.
- It has been tremendously successful in image processing (or computer vision), where the input is the raw pixels of an image.
- In NLP, it has been shown to be effective in tasks such as sentence classification.
Why it has been so effective in image processing
Image pixels
Four operations in a convolutional network
- Convolution
- Non-linear activation (ReLU)
- Pooling or subsampling (Max)
- Classification with a fully connected layer
Image convolution
$$\underbrace{\begin{pmatrix} 2 & 2 \\ 0 & 2 \end{pmatrix}}_{\text{output } O \text{, or ``feature map''}} = \mathrm{conv}\Bigg(\underbrace{\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}}_{\text{input image } X},\ \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{\text{``filter'' } U}\Bigg)$$

Kernel size = (2,2), strides = 1
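Below is a minimal NumPy sketch (not part of the original slides) of this sliding-window computation; the input X, the filter U, and the expected output are taken from the example above.

```python
import numpy as np

# Values assumed from the slide example: 3x3 input, 2x2 filter.
X = np.array([[1, 1, 1],
              [0, 1, 1],
              [0, 0, 1]])
U = np.array([[1, 0],
              [0, 1]])

def conv2d(X, U, stride=1):
    """Slide the filter U over X (no padding) and sum the element-wise products."""
    h, w = U.shape
    out_rows = (X.shape[0] - h) // stride + 1
    out_cols = (X.shape[1] - w) // stride + 1
    O = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            window = X[i * stride:i * stride + h, j * stride:j * stride + w]
            O[i, j] = np.sum(window * U)
    return O

print(conv2d(X, U))   # [[2. 2.] [0. 2.]]
```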
Image convolution
conv(X, U), step by step, with kernel size = (2,2) and strides = 1: the filter slides across the input, and at each position we sum the element-wise products.

$$\begin{aligned}
o_{11} &= 1\cdot 1 + 1\cdot 0 + 0\cdot 0 + 1\cdot 1 = 2 \\
o_{12} &= 1\cdot 1 + 1\cdot 0 + 1\cdot 0 + 1\cdot 1 = 2 \\
o_{21} &= 0\cdot 1 + 1\cdot 0 + 0\cdot 0 + 0\cdot 1 = 0 \\
o_{22} &= 1\cdot 1 + 1\cdot 0 + 0\cdot 0 + 1\cdot 1 = 2
\end{aligned}$$
Forward computation
$$\begin{aligned}
o_{11} &= x_{11}u_{11} + x_{12}u_{12} + x_{21}u_{21} + x_{22}u_{22} \\
o_{12} &= x_{12}u_{11} + x_{13}u_{12} + x_{22}u_{21} + x_{23}u_{22} \\
o_{21} &= x_{21}u_{11} + x_{22}u_{12} + x_{31}u_{21} + x_{32}u_{22} \\
o_{22} &= x_{22}u_{11} + x_{23}u_{12} + x_{32}u_{21} + x_{33}u_{22}
\end{aligned}$$
ReLU
- Nonlinear transformation with ReLU:
  Output = ReLU(input) = max(0, input)
- ReLU is an element-wise transformation that does not change the dimension of the feature map
- ReLU replaces all negative values in the feature map with 0
Image ReLU
ReLU activation and Max pooling
- ReLU activation is a component-wise function and does not change the dimension of the input:
$$\begin{pmatrix} 2 & 2 \\ 0 & 2 \end{pmatrix} = \mathrm{ReLU}\left(\begin{pmatrix} 2 & 2 \\ 0 & 2 \end{pmatrix}\right)$$
- Max pooling does change the dimension of the input; we need to specify the pool size and the strides:
$$\begin{pmatrix} 2 \end{pmatrix} = \mathrm{Max}\left(\begin{pmatrix} 2 & 2 \\ 0 & 2 \end{pmatrix}\right)$$
pool size = (2, 2), strides = 2
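The following short sketch (an illustrative addition, reusing the 2×2 feature map from the example above) shows ReLU leaving the shape unchanged and max pooling with pool size (2,2) and stride 2 shrinking the map to a single value.

```python
import numpy as np

O = np.array([[2., 2.],
              [0., 2.]])          # feature map from the convolution example

relu = np.maximum(0, O)           # element-wise; shape unchanged: [[2. 2.] [0. 2.]]

def max_pool(X, pool=(2, 2), stride=2):
    """Non-overlapping max pooling; shrinks the feature map."""
    rows = (X.shape[0] - pool[0]) // stride + 1
    cols = (X.shape[1] - pool[1]) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = X[i * stride:i * stride + pool[0],
                          j * stride:j * stride + pool[1]].max()
    return out

print(max_pool(relu))             # [[2.]]
```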
Training a CNN
- Loss functions: cross-entropy loss, squared error loss
- What are the parameters of a CNN?
- The filters (kernels), the weight matrices for the feedforward network on top of the convolution and pooling layers, and the biases
- Computing the gradient for the convolution layers is different from a feedforward neural network…
Computing the gradient on U
$$\begin{aligned}
\frac{\partial E}{\partial u_{11}} &= \frac{\partial E}{\partial o_{11}}\frac{\partial o_{11}}{\partial u_{11}} + \frac{\partial E}{\partial o_{12}}\frac{\partial o_{12}}{\partial u_{11}} + \frac{\partial E}{\partial o_{21}}\frac{\partial o_{21}}{\partial u_{11}} + \frac{\partial E}{\partial o_{22}}\frac{\partial o_{22}}{\partial u_{11}}
= \frac{\partial E}{\partial o_{11}}x_{11} + \frac{\partial E}{\partial o_{12}}x_{12} + \frac{\partial E}{\partial o_{21}}x_{21} + \frac{\partial E}{\partial o_{22}}x_{22} \\
\frac{\partial E}{\partial u_{12}} &= \frac{\partial E}{\partial o_{11}}\frac{\partial o_{11}}{\partial u_{12}} + \frac{\partial E}{\partial o_{12}}\frac{\partial o_{12}}{\partial u_{12}} + \frac{\partial E}{\partial o_{21}}\frac{\partial o_{21}}{\partial u_{12}} + \frac{\partial E}{\partial o_{22}}\frac{\partial o_{22}}{\partial u_{12}}
= \frac{\partial E}{\partial o_{11}}x_{12} + \frac{\partial E}{\partial o_{12}}x_{13} + \frac{\partial E}{\partial o_{21}}x_{22} + \frac{\partial E}{\partial o_{22}}x_{23} \\
\frac{\partial E}{\partial u_{21}} &= \frac{\partial E}{\partial o_{11}}\frac{\partial o_{11}}{\partial u_{21}} + \frac{\partial E}{\partial o_{12}}\frac{\partial o_{12}}{\partial u_{21}} + \frac{\partial E}{\partial o_{21}}\frac{\partial o_{21}}{\partial u_{21}} + \frac{\partial E}{\partial o_{22}}\frac{\partial o_{22}}{\partial u_{21}}
= \frac{\partial E}{\partial o_{11}}x_{21} + \frac{\partial E}{\partial o_{12}}x_{22} + \frac{\partial E}{\partial o_{21}}x_{31} + \frac{\partial E}{\partial o_{22}}x_{32} \\
\frac{\partial E}{\partial u_{22}} &= \frac{\partial E}{\partial o_{11}}\frac{\partial o_{11}}{\partial u_{22}} + \frac{\partial E}{\partial o_{12}}\frac{\partial o_{12}}{\partial u_{22}} + \frac{\partial E}{\partial o_{21}}\frac{\partial o_{21}}{\partial u_{22}} + \frac{\partial E}{\partial o_{22}}\frac{\partial o_{22}}{\partial u_{22}}
= \frac{\partial E}{\partial o_{11}}x_{22} + \frac{\partial E}{\partial o_{12}}x_{23} + \frac{\partial E}{\partial o_{21}}x_{32} + \frac{\partial E}{\partial o_{22}}x_{33}
\end{aligned}$$

Each filter component sums up the errors from all of the outputs it has contributed to.
Reverse Convolution
The computation of the gradient on the filter can be vectorized as a reverse convolution:
$$\begin{pmatrix} \frac{\partial E}{\partial u_{11}} & \frac{\partial E}{\partial u_{12}} \\[4pt] \frac{\partial E}{\partial u_{21}} & \frac{\partial E}{\partial u_{22}} \end{pmatrix} = \mathrm{conv}\left( \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix},\ \begin{pmatrix} \frac{\partial E}{\partial o_{11}} & \frac{\partial E}{\partial o_{12}} \\[4pt] \frac{\partial E}{\partial o_{21}} & \frac{\partial E}{\partial o_{22}} \end{pmatrix} \right)$$
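As a sanity check, here is a small NumPy sketch of this reverse convolution; the output gradients dE/dO are made-up values, since in practice they would come from the layer above.

```python
import numpy as np

X = np.array([[1., 1., 1.],
              [0., 1., 1.],
              [0., 0., 1.]])
delta_O = np.array([[0.1, -0.2],
                    [0.3,  0.4]])   # dE/do_ij, illustrative values

def conv2d(X, K):
    h, w = K.shape
    out = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + h, j:j + w] * K)
    return out

grad_U = conv2d(X, delta_O)   # dE/dU, same 2x2 shape as the filter

# Check one entry against the hand-derived sum:
# dE/du11 = d11*x11 + d12*x12 + d21*x21 + d22*x22
manual = (delta_O[0, 0] * X[0, 0] + delta_O[0, 1] * X[0, 1]
          + delta_O[1, 0] * X[1, 0] + delta_O[1, 1] * X[1, 1])
assert np.isclose(grad_U[0, 0], manual)
```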
Computing the gradient on X (if this is not the input layer)
$$\begin{aligned}
\frac{\partial E}{\partial x_{11}} &= \frac{\partial E}{\partial o_{11}}u_{11} + \frac{\partial E}{\partial o_{12}}\cdot 0 + \frac{\partial E}{\partial o_{21}}\cdot 0 + \frac{\partial E}{\partial o_{22}}\cdot 0 \\
\frac{\partial E}{\partial x_{12}} &= \frac{\partial E}{\partial o_{11}}u_{12} + \frac{\partial E}{\partial o_{12}}u_{11} + \frac{\partial E}{\partial o_{21}}\cdot 0 + \frac{\partial E}{\partial o_{22}}\cdot 0 \\
\frac{\partial E}{\partial x_{13}} &= \frac{\partial E}{\partial o_{11}}\cdot 0 + \frac{\partial E}{\partial o_{12}}u_{12} + \frac{\partial E}{\partial o_{21}}\cdot 0 + \frac{\partial E}{\partial o_{22}}\cdot 0 \\
\frac{\partial E}{\partial x_{21}} &= \frac{\partial E}{\partial o_{11}}u_{21} + \frac{\partial E}{\partial o_{12}}\cdot 0 + \frac{\partial E}{\partial o_{21}}u_{11} + \frac{\partial E}{\partial o_{22}}\cdot 0 \\
\frac{\partial E}{\partial x_{22}} &= \frac{\partial E}{\partial o_{11}}u_{22} + \frac{\partial E}{\partial o_{12}}u_{21} + \frac{\partial E}{\partial o_{21}}u_{12} + \frac{\partial E}{\partial o_{22}}u_{11} \\
\frac{\partial E}{\partial x_{23}} &= \frac{\partial E}{\partial o_{11}}\cdot 0 + \frac{\partial E}{\partial o_{12}}u_{22} + \frac{\partial E}{\partial o_{21}}\cdot 0 + \frac{\partial E}{\partial o_{22}}u_{12} \\
\frac{\partial E}{\partial x_{31}} &= \frac{\partial E}{\partial o_{11}}\cdot 0 + \frac{\partial E}{\partial o_{12}}\cdot 0 + \frac{\partial E}{\partial o_{21}}u_{21} + \frac{\partial E}{\partial o_{22}}\cdot 0 \\
\frac{\partial E}{\partial x_{32}} &= \frac{\partial E}{\partial o_{11}}\cdot 0 + \frac{\partial E}{\partial o_{12}}\cdot 0 + \frac{\partial E}{\partial o_{21}}u_{22} + \frac{\partial E}{\partial o_{22}}u_{21} \\
\frac{\partial E}{\partial x_{33}} &= \frac{\partial E}{\partial o_{11}}\cdot 0 + \frac{\partial E}{\partial o_{12}}\cdot 0 + \frac{\partial E}{\partial o_{21}}\cdot 0 + \frac{\partial E}{\partial o_{22}}u_{22}
\end{aligned}$$
Full convolution
$$\begin{pmatrix} \frac{\partial E}{\partial x_{11}} & \frac{\partial E}{\partial x_{12}} & \frac{\partial E}{\partial x_{13}} \\[4pt] \frac{\partial E}{\partial x_{21}} & \frac{\partial E}{\partial x_{22}} & \frac{\partial E}{\partial x_{23}} \\[4pt] \frac{\partial E}{\partial x_{31}} & \frac{\partial E}{\partial x_{32}} & \frac{\partial E}{\partial x_{33}} \end{pmatrix} = \mathrm{full\_conv}\left( \begin{pmatrix} \frac{\partial E}{\partial o_{11}} & \frac{\partial E}{\partial o_{12}} \\[4pt] \frac{\partial E}{\partial o_{21}} & \frac{\partial E}{\partial o_{22}} \end{pmatrix},\ \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix} \right)$$
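A corresponding NumPy sketch for the gradient on X, assuming the common convention that the full convolution zero-pads the output gradients and slides the 180°-rotated filter over them (consistent with the per-entry equations above); U and delta_O below are illustrative values.

```python
import numpy as np

U = np.array([[1., 0.],
              [0., 1.]])
delta_O = np.array([[0.1, -0.2],
                    [0.3,  0.4]])

def full_conv2d(D, U):
    h, w = U.shape
    D_pad = np.pad(D, ((h - 1, h - 1), (w - 1, w - 1)))  # zero padding on all sides
    U_rot = U[::-1, ::-1]                                 # rotate the filter by 180 degrees
    out = np.zeros((D_pad.shape[0] - h + 1, D_pad.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(D_pad[i:i + h, j:j + w] * U_rot)
    return out

grad_X = full_conv2d(delta_O, U)   # 3x3, same shape as the input X
# e.g. grad_X[0, 0] == delta_O[0, 0] * u11, matching dE/dx11 above
```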
Gradient on X if it is not the inputs
[Figure: the full convolution visualized step by step; the filter slides over the zero-padded grid of output gradients δo, and each position produces one entry of ∂E/∂X, matching the equations above.]
Why convolutional networks for NLP?
- Even though bag-of-words models are simple and work well in some text classification tasks, they don't account for cases where multiple words combine to create meaning, such as "not interesting".
- The analogy with image processing: a bag-of-words model is like treating each pixel as a separate, independent feature (though the analogy might be going too far).
Alternative text representations
- When using feedforward networks for text classification, the input text is represented as a sparse vector that encodes bag-of-words features.
- Alternatively, a text can be represented as a sequence of word tokens $w_1, w_2, w_3, \cdots, w_M$. This view is useful for models such as Convolutional Neural Networks (ConvNets), which process text as a sequence.
- Each word token $w_m$ is represented as a one-hot vector $e_{w_m}$ with dimension $V$. The complete document can be represented by the horizontal concatenation of these one-hot vectors: $W = [e_{w_1}, e_{w_2}, \cdots, e_{w_M}] \in \mathbb{R}^{V \times M}$.
- To see that this is equivalent to the bag-of-words model, note that we can recover the word counts from the matrix-vector product $W[1, 1, \cdots, 1]^\top \in \mathbb{R}^{V}$ (see the sketch below).
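A tiny sketch of this representation; the vocabulary size and the toy document below are made-up values for illustration.

```python
import numpy as np

V = 5                                  # vocabulary size (assumed)
tokens = [3, 1, 1, 4]                  # word indices w1..wM (assumed)
M = len(tokens)

W = np.zeros((V, M))
W[tokens, np.arange(M)] = 1.0          # column m is the one-hot vector e_{w_m}

counts = W @ np.ones(M)                # recovers the bag-of-words count vector in R^V
print(counts)                          # [0. 2. 0. 1. 1.]
```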
Input to a convolutional network in a text classification task
When using ConvNets for text classification, it is common to first "look up" the pretrained word embedding (e.g., from the weight matrix produced by word2vec or GloVe¹) for each word in the sequence:

$$X^{(0)} = \Theta^{(x \to z)}[e_{w_1}, e_{w_2}, \cdots, e_{w_M}], \qquad X^{(0)} \in \mathbb{R}^{K_e \times M}$$

where $e_{w_m}$ is a column vector of zeros with a 1 at position $w_m$, and $K_e$ is the size of the embeddings.

¹ https://nlp.stanford.edu/projects/glove
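The lookup amounts to selecting columns of the embedding matrix, as in this sketch (a random matrix stands in for pretrained word2vec/GloVe embeddings):

```python
import numpy as np

V, K_e = 5, 4
Theta = np.random.randn(K_e, V)   # stand-in for pretrained embeddings, one column per word type

tokens = [3, 1, 1, 4]             # word indices w1..wM (assumed toy document)
X0 = Theta[:, tokens]             # equivalent to Theta @ W; shape (K_e, M)
```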
“Convolve” the input with a set of filters (kernels)
- A filter is a weight matrix $C^{(k)} \in \mathbb{R}^{K_e \times h}$, where $C^{(k)}$ denotes the $k$th filter. Note that the first dimension of the filter is the same as the size of the embedding.
- Unlike in image processing, where filters are small in both dimensions, here the filter spans the full embedding dimension and slides only along the sequence; it does not have to cover the full width (length) of the text.
- To merge adjacent words, we convolve $X^{(0)}$ by sliding a set of filters across it:
$$X^{(1)} = f(b + C * X^{(0)})$$
where $f$ is an activation function (e.g., tanh or ReLU), $b$ is a vector of bias terms, and $*$ is the convolution operator.
Computing the convolution
- At each position $m$ (the $m$th word in the sequence), we compute the element-wise product of the $k$th filter and the window of $h$ words starting at $m$ (think of it as an n-gram of length $h$), and take its sum: $C^{(k)} \odot X^{(0)}_{m:m+h-1}$.
- The value at the $m$th position with the $k$th filter can therefore be computed as:
$$x^{(1)}_{k,m} = f\left(b_k + \sum_{k'=1}^{K_e} \sum_{n=1}^{h} c^{(k)}_{k',n} \times x^{(0)}_{k',\, m+n-1}\right)$$
- When we finish the convolution step, if we have $K_f$ filters of dimension $\mathbb{R}^{K_e \times h}$, then $X^{(1)} \in \mathbb{R}^{K_f \times (M-h+1)}$.
- In practice, filters of different sizes are often used to capture n-grams of different lengths, so $X^{(1)}$ will consist of $K_f$ vectors of variable lengths; writing the width of the $k$th filter as $h_k$, its output vector has length $M - h_k + 1$.
Convolution step when processing text
$$\underbrace{\begin{pmatrix} 2.9 & 3.1 & 3.4 & ? & ? & ? \end{pmatrix}}_{X^{(1)}} = \mathrm{conv}\Bigg(\underbrace{\begin{pmatrix} 0.4 & 0.7 & 0.8 & 0.3 & 2.3 & 3.2 & 0.7 \\ 0.5 & 1.3 & 0.2 & 1.5 & 0.8 & 0.6 & 1.8 \\ 1.2 & 2.2 & 1.1 & 4.3 & 0.5 & 0.3 & 3.4 \end{pmatrix}}_{\text{input text sequence } X^{(0)}},\ \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}}_{\text{``filter'' } U}\Bigg)$$
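The figure can be reproduced with a few lines of NumPy (a sketch; only the first three outputs appear on the slide, so only those are checked here):

```python
import numpy as np

# Values taken from the figure above: a 3-dimensional embedding per word, 7 words,
# and a filter that spans the full embedding dimension with a window of h = 2 words.
X0 = np.array([[0.4, 0.7, 0.8, 0.3, 2.3, 3.2, 0.7],
               [0.5, 1.3, 0.2, 1.5, 0.8, 0.6, 1.8],
               [1.2, 2.2, 1.1, 4.3, 0.5, 0.3, 3.4]])
U = np.array([[1., 0.],
              [0., 1.],
              [1., 0.]])

h = U.shape[1]
M = X0.shape[1]
X1 = np.array([np.sum(X0[:, m:m + h] * U) for m in range(M - h + 1)])
print(X1[:3])   # [2.9 3.1 3.4], matching the first three outputs in the figure
```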
Padding
- To deal with the beginning and end of the input, the base matrix is often padded with $h-1$ column vectors of zeros at the beginning and end. This is called wide convolution.
- If no padding is applied, then the output of each convolution layer will be $h-1$ units smaller than the input. This is known as narrow convolution (see the sketch below).
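A quick sketch of the resulting output lengths; M and h below are arbitrary illustrative values.

```python
# Narrow vs. wide convolution output lengths for a window of h words.
M, h = 7, 3
narrow_len = M - h + 1                   # no padding: output shrinks by h - 1
wide_len = (M + 2 * (h - 1)) - h + 1     # pad h - 1 zero columns on each side
print(narrow_len, wide_len)              # 5 9
```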
Pooling
- After $D$ convolutional layers, assuming the filters have identical lengths, we have a representation of the document as a matrix $X^{(D)} \in \mathbb{R}^{K_z \times M}$.
- Documents will very likely be of different lengths, so we need to turn them into matrices of the same size before feeding them to a feedforward network to perform classification.
- This can be done by pooling over time (i.e., across the sequence of words).
Prediction and training with a CNN
- The output of the CNN needs to be fed into a feedforward network to make a prediction $\hat{y}$ and compute the loss $\ell^{(i)}$ during training.
- The parameters of a CNN include the weight matrices of the feedforward network, the filters $C^{(k)}$ of the CNN, and the biases.
- The parameters can be updated with backpropagation, which may involve computing the gradient of the max pooling function:
$$z_k = \max\left(x^{(D)}_{k,1}, x^{(D)}_{k,2}, \cdots, x^{(D)}_{k,M}\right)$$
$$\frac{\partial z_k}{\partial x^{(D)}_{k,m}} = \begin{cases} 1, & x^{(D)}_{k,m} = \max\left(x^{(D)}_{k,1}, x^{(D)}_{k,2}, \cdots, x^{(D)}_{k,M}\right) \\ 0, & \text{otherwise} \end{cases}$$
Different pooling methods
- Max pooling
$$z_k = \max\left(x^{(D)}_{k,1}, x^{(D)}_{k,2}, \cdots, x^{(D)}_{k,M}\right)$$
- Average pooling
$$z_k = \frac{1}{M}\sum_{m=1}^{M} x^{(D)}_{k,m}$$
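Both pooling operations reduce over the sequence dimension, so every document ends up as a fixed-size vector of filter activations. A short sketch, with a random feature map standing in for $X^{(D)}$:

```python
import numpy as np

X_D = np.random.randn(3, 10)    # K_f = 3 filters, M = 10 positions (toy values)

z_max = X_D.max(axis=1)         # max pooling over time, shape (3,)
z_avg = X_D.mean(axis=1)        # average pooling over time, shape (3,)
```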
A graphic representation of a CNN
Using Conv Nets in PyTorch
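A minimal end-to-end sketch of a text-classification ConvNet in PyTorch; the hyperparameters, layer names, and the use of nn.Conv1d with several window sizes are illustrative choices rather than the lecture's prescribed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=64,
                 window_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per window size h; in_channels = K_e, out_channels = K_f.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes]
        )
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, word_ids):                 # word_ids: (batch, M)
        x = self.embedding(word_ids)             # (batch, M, K_e)
        x = x.transpose(1, 2)                    # (batch, K_e, M) for Conv1d
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))                  # (batch, K_f, M - h + 1)
            pooled.append(c.max(dim=2).values)   # max pooling over time -> (batch, K_f)
        z = torch.cat(pooled, dim=1)             # fixed-size document vector
        return self.fc(z)                        # logits for a cross-entropy loss

# Usage sketch: logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (8, 50)))
```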