
Transformer: Neural machine translation without recurrence
▶ It is possible to replace recurrence with self-attention within the encoder and decoder, as in the transformer architecture.
For each token m at level i, we compute self-attention over the entire source sentence. The keys, values, and queries are all projections of the vector h^{(i−1)}:

z_m^{(i)} = \sum_{n=1}^{M} \alpha^{(i)}_{m \to n} \left( \Theta_v h_n^{(i-1)} \right)

h_m^{(i)} = \Theta_2 \, \mathrm{ReLU}\left( \Theta_1 z_m^{(i)} + b_1 \right) + b_2

▶ The attention scores α^{(i)}_{m→n} are computed using a scaled form of softmax attention,

\alpha_{m \to n} \propto \exp\left( \psi_{\alpha}(m, n) / \sqrt{M} \right)

“Self-attention”
Figure 18.7: The transformer encoder's computation of z_m^{(i)} from h^{(i−1)}. The key, value, and query are shown for token m−1. For example, ψ_α^{(i)}(m, m−1) is computed from the key Θ_k h_{m−1}^{(i−1)} and the query Θ_q h_m^{(i−1)}, and the gate α^{(i)}_{m→m−1} operates on the value Θ_v h_{m−1}^{(i−1)}. The figure shows a minimal version of the architecture, with a single attention head. With multiple heads, it is possible to attend to different properties of multiple words.
Here M is the length of the input. This encourages the attention to be more evenly dispersed across the input. Self-attention is applied across multiple "heads", each using different projections of h^{(i−1)} to form the keys, values, and queries. This architecture is shown in Figure 18.7. The output of the self-attentional layer is the representation z_m^{(i)}, which is then passed through a two-layer feed-forward network, yielding the input to the next layer, h^{(i)}. This self-attentional architecture can be applied in the decoder as well, but this requires that there is zero attention to future words: α_{m→n} = 0 for all n > m.
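The computation above can be sketched in a few lines of NumPy. This is an illustrative single-head implementation, not the book's code; the function and parameter names (including the explicit projection matrices Theta_q, Theta_k, Theta_v) are assumptions, and the causal flag corresponds to the decoder's zero-attention-to-future-words constraint:

```python
import numpy as np

def self_attention(H, Theta_q, Theta_k, Theta_v, causal=False):
    """Single-head scaled softmax self-attention.

    H is an (M x d) matrix of token representations h^{(i-1)};
    the keys, values, and queries are all projections of the same H.
    """
    M = H.shape[0]
    Q = H @ Theta_q.T                    # queries
    K = H @ Theta_k.T                    # keys
    V = H @ Theta_v.T                    # values
    psi = (Q @ K.T) / np.sqrt(M)         # scaled attention scores psi(m, n)
    if causal:
        # decoder: zero attention to future words, alpha_{m->n} = 0 for n > m
        future = np.triu(np.ones((M, M), dtype=bool), k=1)
        psi = np.where(future, -np.inf, psi)
    alpha = np.exp(psi - psi.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # softmax over n
    return alpha @ V                     # z_m = sum_n alpha_{m->n} (Theta_v h_n)
```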
To ensure that information about word order in the source is integrated into the model, the encoder augments the base layer of the network with positional encodings of the indices of each word in the source. These encodings are vectors for each position m ∈ {1, 2, ..., M}. The transformer sets these encodings equal to a set of sinusoidal functions of m,
e_{2i-1}(m) = \sin\left( m / 10000^{2i/K_e} \right)   [18.45]

“Self-attention”
▶ Attention is an effective mechanism for modeling dependencies.
▶ By varying the keys, values, and queries, we get different variants of attention. We get self-attention by making the key, value, and query projections of tokens in the same sentence.
▶ Intuitively, self-attention contextualizes a word token within a sentence (or any sequence of word tokens), which helps in interpreting that token, which in turn helps other tasks that build on an understanding of the words.

Multiple Attention “heads”

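As noted in the caption of Figure 18.7, using several heads lets the model attend to different properties of different words. Below is a minimal sketch of multi-head self-attention, reusing the self_attention function from the earlier sketch; the head count, dimensions, and names are illustrative assumptions, not the original slide code:

```python
import numpy as np

def multi_head_self_attention(H, heads, Theta_o):
    """Run several attention heads in parallel, each with its own
    query/key/value projections, then project the concatenated
    head outputs back to the model dimension with Theta_o."""
    outputs = [self_attention(H, Tq, Tk, Tv) for (Tq, Tk, Tv) in heads]
    return np.concatenate(outputs, axis=-1) @ Theta_o.T

# Example: 4 heads of size 16 over a model dimension of 64.
rng = np.random.default_rng(0)
M, d_model, d_head, n_heads = 10, 64, 16, 4
H = rng.normal(size=(M, d_model))
heads = [tuple(rng.normal(size=(d_head, d_model)) for _ in range(3))
         for _ in range(n_heads)]
Theta_o = rng.normal(size=(d_model, n_heads * d_head))
Z = multi_head_self_attention(H, heads, Theta_o)   # shape (M, d_model)
```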

What does self-attention do?
▶ Contextualization.
▶ Why is contextualization a good thing?
▶ "Work out the solution in your head."
▶ "Heat the solution to 75 degrees Celsius."
▶ Having the same embedding for both instances of "solution" doesn't make sense.

The Transformer architecture in (almost) its entirety
Figure 1: The Transformer – model architecture.
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
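The shared sub-layer pattern LayerNorm(x + Sublayer(x)) can be written as a small wrapper. This is a sketch under the assumption that sublayer_fn is the sub-layer itself (multi-head attention or the feed-forward network) and norm_fn is layer normalization (sketched later in the layer normalization section); the names are illustrative:

```python
def residual_sublayer(x, sublayer_fn, norm_fn):
    """Transformer sub-layer wrapper: LayerNorm(x + Sublayer(x))."""
    return norm_fn(x + sublayer_fn(x))

# One encoder layer is then two wrapped sub-layers applied in sequence:
#   h = residual_sublayer(h, self_attention_sublayer, layer_norm)
#   h = residual_sublayer(h, feed_forward_sublayer, layer_norm)
```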
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

Position Embedding
The original Transformer includes an absolute, precomputed position embedding.
▶ For word w at position pos ∈ [0, L−1] in the sequence w = {w_0, ..., w_{L−1}}, with a 4-dimensional embedding e_w and d_model = 4, the operation would be:

e'_w = e_w + \left[ \sin\!\left(\tfrac{pos}{10000^{0/4}}\right),\; \cos\!\left(\tfrac{pos}{10000^{0/4}}\right),\; \sin\!\left(\tfrac{pos}{10000^{2/4}}\right),\; \cos\!\left(\tfrac{pos}{10000^{2/4}}\right) \right]

     = e_w + \left[ \sin(pos),\; \cos(pos),\; \sin\!\left(\tfrac{pos}{100}\right),\; \cos\!\left(\tfrac{pos}{100}\right) \right]
▶ The formula to calculate the position embedding (sketched in code below) is:

PE(pos, 2i) = \sin\!\left( \frac{pos}{10000^{2i/d_{\mathrm{model}}}} \right)

PE(pos, 2i+1) = \cos\!\left( \frac{pos}{10000^{2i/d_{\mathrm{model}}}} \right)

for i ∈ [0, d_model/2 − 1]
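A short NumPy sketch of this formula; the function name and the assumption that d_model is even are illustrative:

```python
import numpy as np

def positional_encoding(L, d_model):
    """Absolute sinusoidal position embeddings of shape (L, d_model).

    Column 2i holds sin(pos / 10000^(2i/d_model));
    column 2i+1 holds the matching cos.
    """
    pos = np.arange(L)[:, None]                       # (L, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((L, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The rows of positional_encoding(L, d_model) are added to the word
# embeddings before the first layer.
```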

Position-wise feedforward neural network
▶ A feed-forward network is applied to each position separately and identically, i.e., with weights shared across positions in the sequence (see the sketch after this list):

FFN(x) = \Theta_2 \, \mathrm{ReLU}\left( \Theta_1 x + b_1 \right) + b_2

▶ This is equivalent to applying two one-dimensional convolutions with kernel size 1 to the sequence.
▶ Question to think about: why not just use a one-dimensional convolution?

Layer normalization
▶ Given a minibatch of inputs of size m, B = {x_1, x_2, ..., x_m}, applying layer normalization to each sample in the batch yields a transformed minibatch B′ = {y_1, y_2, ..., y_m}, where y_i = LN_{γ,β}(x_i).
Specifically, layer normalization involves the following steps (a code sketch follows the list):
▶ Compute the mean and variance of each sample in the batch (K is the number of elements in x_i):

\mu_i = \frac{1}{K} \sum_{k=1}^{K} x_{i,k}

\sigma_i^2 = \frac{1}{K} \sum_{k=1}^{K} \left( x_{i,k} - \mu_i \right)^2

▶ Normalize each sample so that its elements have zero mean and unit variance:

\hat{x}_{i,k} = \frac{x_{i,k} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}

▶ Finally, scale and shift with γ and β:

y_i = \gamma \odot \hat{x}_i + \beta \;\equiv\; \mathrm{LN}_{\gamma,\beta}(x_i)
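These steps translate directly into a few lines of NumPy. This is a sketch; gamma and beta are the learned parameters above, and eps is the small constant under the square root:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Layer-normalize a single sample x with K elements:
    subtract its mean, divide by its standard deviation,
    then scale and shift with the learned gamma and beta."""
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Applied independently to every sample in the minibatch:
# B_prime = [layer_norm(x_i, gamma, beta) for x_i in B]
```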

Reducing the output space with BPE
BPE is a simple data compression technique
▶ Initialize the vocabulary with the character vocabulary, and segment each word into a sequence of characters, plus an end-of-word symbol '.'.
▶ Iteratively count all symbol pairs, and replace the most frequent pair ('A', 'B') with a new symbol 'AB'. Each merge produces a new symbol which represents a character n-gram. Frequent character n-grams are eventually merged into a single symbol.
▶ The final vocabulary size equals the initial vocabulary size plus the number of merge operations.
▶ The number of merge operations is a hyperparameter.

Python implementation
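A minimal sketch of the BPE learning loop, along the lines of the reference implementation published with Sennrich et al. (2016); the toy vocabulary, the '</w>' end-of-word marker, and the number of merges are illustrative:

```python
import re, collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the chosen pair with the merged symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Words are pre-segmented into characters plus an end-of-word symbol.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10                      # a hyperparameter
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)                      # e.g. ('e', 's'), ('es', 't'), ...
```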


Tutorials
▶ NMT with Tensorflow: https://github.com/tensorflow/nmt
▶ Attention visualization: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
▶ Illustrated transformer: http://jalammar.github.io/illustrated-transformer/

BERT: Bidirectional Encoder Representations from Transformers
Input Representations:
▶ The input can be either a single sentence or a pair of sentences.
▶ "Sentence" here just means a sequence of words.
▶ Sentences are broken down into WordPieces (token vocabulary size = 30K).
▶ The first token of the input sequence is a special symbol [CLS].
▶ Sentence pairs are packed together into one sequence and separated with a special token [SEP] (see the sketch below).
▶ Separate segment embeddings distinguish the first sentence from the second sentence in a sentence pair.
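A sketch of how a single sentence or a sentence pair is packed into one input sequence; the helper name is an assumption, and the tokens are assumed to be WordPiece tokens already:

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Pack one sentence or a sentence pair into a single BERT input:
    [CLS] A [SEP] (B [SEP]), plus per-token segment ids
    (0 for the first sentence, 1 for the second)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# build_bert_input(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
# -> ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'play', '##ing', '[SEP]'],
#    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```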

The pretrain/fine-tune paradigm

Pre-Training BERT
▶ The Transformer is trained with a masked language model (MLM) task, and a next sentence prediction (NSP) task for sentence pairs.
▶ The MLM task (also known as the Cloze task) is trained by randomly masking 15% of the tokens and predicting them (a simplified sketch follows).
▶ The final hidden vectors corresponding to the masked tokens are fed into a softmax over the token vocabulary.
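A simplified sketch of preparing MLM training examples by masking 15% of the tokens; the full BERT recipe also sometimes keeps the original token or substitutes a random one, which is omitted here, and the helper name is illustrative:

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly replace ~15% of the (non-special) tokens with [MASK].

    Returns the masked sequence and a parallel list of labels that holds
    the original token at masked positions and None elsewhere; only the
    masked positions contribute to the MLM loss."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue                      # never mask the special symbols
        if random.random() < mask_rate:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels
```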

BERT Pretraining with MLM

BERT next sentence prediction

BERT: fine-tune to specific tasks