CS224N Assignment A4: Machine Translation


1. Neural Machine Translation with RNNs (45 points)

(a)

def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    The paddings should be at the end of each sentence.
    @param sents (list[list[str]]): list of sentences, where each sentence
        is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    """
    sents_padded = []

    ### YOUR CODE HERE (~6 Lines)
    # Pad every sentence up to the length of the longest sentence in the batch.
    max_len = max(len(s) for s in sents)
    for s in sents:
        sents_padded.append(s + [pad_token] * (max_len - len(s)))
    ### END YOUR CODE
    return sents_padded
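
A quick sanity check of the padding behavior; the sentences and pad token below are made up for illustration:

# Hypothetical usage of pad_sents (illustrative input, not from the assignment data).
sents = [["I", "love", "NLP"], ["hello"]]
padded = pad_sents(sents, "<pad>")
print(padded)
# [['I', 'love', 'NLP'], ['hello', '<pad>', '<pad>']]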

(b)

class ModelEmbeddings(nn.Module):
    """
    Class that converts input words to their embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layers.

        @param embed_size (int): Embedding size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size = embed_size

        # default values
        self.source = None
        self.target = None

        src_pad_token_idx = vocab.src['<pad>']
        tgt_pad_token_idx = vocab.tgt['<pad>']

        ### YOUR CODE HERE (~2 Lines)
        ### TODO - Initialize the following variables:
        ###     self.source (Embedding Layer for source language)
        ###     self.target (Embedding Layer for target language)
        ###
        ### Note:
        ###     1. `vocab` object contains two vocabularies:
        ###            `vocab.src` for source
        ###            `vocab.tgt` for target
        ###     2. You can get the length of a specific vocabulary by running:
        ###            `len(vocab.<specific_vocabulary>)`
        ###     3. Remember to include the padding token for the specific vocabulary
        ###        when creating your Embedding.
        ###
        ### Use the following docs to properly initialize these variables:
        ###     Embedding Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
        self.source = nn.Embedding(len(vocab.src), embed_size, padding_idx=src_pad_token_idx)
        self.target = nn.Embedding(len(vocab.tgt), embed_size, padding_idx=tgt_pad_token_idx)

        ### END YOUR CODE

(c)

def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
    """ Init NMT Model.

    @param embed_size (int): Embedding size (dimensionality)
    @param hidden_size (int): Hidden Size, the size of hidden states (dimensionality)
    @param vocab (Vocab): Vocabulary object containing src and tgt languages
                          See vocab.py for documentation.
    @param dropout_rate (float): Dropout probability, for attention
    """
    super(NMT, self).__init__()
    self.model_embeddings = ModelEmbeddings(embed_size, vocab)
    self.hidden_size = hidden_size
    self.dropout_rate = dropout_rate
    self.vocab = vocab

    # default values
    self.encoder = None
    self.decoder = None
    self.h_projection = None
    self.c_projection = None
    self.att_projection = None
    self.combined_output_projection = None
    self.target_vocab_projection = None
    self.dropout = None
    # For sanity check only, not relevant to implementation
    self.gen_sanity_check = False
    self.counter = 0

    ### YOUR CODE HERE (~9 Lines)
    ### TODO - Initialize the following variables IN THIS ORDER:
    ###     self.post_embed_cnn (Conv1d layer with kernel size 2, input and output channels = embed_size,
    ###         padding = same to preserve output shape)
    ###     self.encoder (Bidirectional LSTM with bias)
    ###     self.decoder (LSTM Cell with bias)
    ###     self.h_projection (Linear Layer with no bias), called W_{h} in the PDF.
    ###     self.c_projection (Linear Layer with no bias), called W_{c} in the PDF.
    ###     self.att_projection (Linear Layer with no bias), called W_{attProj} in the PDF.
    ###     self.combined_output_projection (Linear Layer with no bias), called W_{u} in the PDF.
    ###     self.target_vocab_projection (Linear Layer with no bias), called W_{vocab} in the PDF.
    ###     self.dropout (Dropout Layer)
    ###
    ### Use the following docs to properly initialize these variables:
    ###     LSTM:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
    ###     LSTM Cell:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
    ###     Linear Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
    ###     Dropout Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout
    ###     Conv1D Layer:
    ###         https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # The padding strategy is "same", so the convolution preserves the sequence length.
    self.post_embed_cnn = nn.Conv1d(in_channels=embed_size, out_channels=embed_size, kernel_size=2, padding="same")
    self.encoder = nn.LSTM(input_size=embed_size, hidden_size=hidden_size, bidirectional=True, bias=True)
    self.decoder = nn.LSTMCell(input_size=embed_size + hidden_size, hidden_size=hidden_size, bias=True)
    self.h_projection = nn.Linear(in_features=2 * hidden_size, out_features=hidden_size, bias=False)
    self.c_projection = nn.Linear(in_features=2 * hidden_size, out_features=hidden_size, bias=False)
    self.att_projection = nn.Linear(in_features=2 * hidden_size, out_features=hidden_size, bias=False)
    self.combined_output_projection = nn.Linear(in_features=3 * hidden_size, out_features=hidden_size, bias=False)
    # Note that the output dimension here is the size of the target vocabulary.
    self.target_vocab_projection = nn.Linear(in_features=hidden_size, out_features=len(vocab.tgt), bias=False)
    self.dropout = nn.Dropout(p=dropout_rate)

    ### END YOUR CODE

(d)

def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[
        torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
    """ Apply the encoder to source sentences to obtain encoder hidden states.
    Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

    @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
        b = batch_size, src_len = maximum source sentence length. Note that
        these have already been sorted in order of longest to shortest sentence.
    @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
    @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
        b = batch size, src_len = maximum source sentence length, h = hidden size.
    @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
        hidden state and cell. Both tensors should have shape (2, b, h).
    """
    enc_hiddens, dec_init_state = None, None

    ### YOUR CODE HERE (~ 11 Lines)
    ### TODO:
    ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
    ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
    ###         that there is no initial hidden state or cell for the encoder.
    ###     2. Apply the post_embed_cnn layer. Before feeding X into the CNN, first use torch.permute to change the
    ###         shape of X to (b, e, src_len). After getting the output from the CNN, still stored in the X variable,
    ###         remember to use torch.permute again to revert X back to its original shape.
    ###     3. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
    ###         - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
    ###         - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
    ###         - Note that the shape of the tensor output returned by the encoder RNN is (src_len, b, h*2) and we want to
    ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`, so you may need to do more permuting.
    ###         - Note on using pad_packed_sequence -> For batched inputs, you need to make sure that each of the
    ###           individual input examples has the same shape.
    ###     4. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
    ###         - `init_decoder_hidden`:
    ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
    ###             This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
    ###         - `init_decoder_cell`:
    ###             `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    ###             Apply the c_projection layer to this in order to compute init_decoder_cell.
    ###             This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size
    ###
    ### See the following docs, as you may need to use some of the following functions in your implementation:
    ###     Pack the padded sequence X before passing to the encoder:
    ###         https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html
    ###     Pad the packed sequence, enc_hiddens, returned by the encoder:
    ###         https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/generated/torch.cat.html
    ###     Tensor Permute:
    ###         https://pytorch.org/docs/stable/generated/torch.permute.html
    ###     Tensor Reshape (a possible alternative to permute):
    ###         https://pytorch.org/docs/stable/generated/torch.Tensor.reshape.html
    # Note the model structure here: ModelEmbeddings was defined above, and its `source`
    # attribute is an nn.Embedding, so it can embed the source sentences directly.
    X = self.model_embeddings.source(source_padded)
    # Conv1d expects (b, e, src_len); permute, convolve, then permute back to (src_len, b, e).
    X = self.post_embed_cnn(X.permute(1, 2, 0))
    X = X.permute(2, 0, 1)
    enc_hiddens, (last_hidden, last_cell) = self.encoder(pack_padded_sequence(X, source_lengths))
    enc_hiddens = pad_packed_sequence(enc_hiddens)[0].permute(1, 0, 2)
    # Index 0 is the forward direction, index 1 is the backward direction
    # (this ordering feels inconsistent with the PDF, but it follows PyTorch's convention).
    last_hidden = torch.cat((last_hidden[0], last_hidden[1]), dim=1)
    last_cell = torch.cat((last_cell[0], last_cell[1]), dim=1)
    init_decoder_hidden = self.h_projection(last_hidden)
    init_decoder_cell = self.c_projection(last_cell)
    dec_init_state = (init_decoder_hidden, init_decoder_cell)
    ### END YOUR CODE
    return enc_hiddens, dec_init_state


(e)

def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
           dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
    """Compute combined output vectors for a batch.

    @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
        b = batch size, src_len = maximum source sentence length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
        b = batch size, src_len = maximum source sentence length.
    @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
    @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
        tgt_len = maximum target sentence length, b = batch size.

    @returns combined_outputs (Tensor): combined output tensor (tgt_len, b, h), where
        tgt_len = maximum target sentence length, b = batch_size, h = hidden size
    """
    # Chop off the <END> token for max length sentences.
    target_padded = target_padded[:-1]

    # Initialize the decoder state (hidden and cell)
    dec_state = dec_init_state

    # Initialize previous combined output vector o_{t-1} as zero
    batch_size = enc_hiddens.size(0)
    o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

    # Initialize a list we will use to collect the combined output o_t on each step
    combined_outputs = []

    ### YOUR CODE HERE (~9 Lines)
    ### TODO:
    ###     1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
    ###         which should be shape (b, src_len, h),
    ###         where b = batch size, src_len = maximum source length, h = hidden size.
    ###         This is applying W_{attProj} to h^enc, as described in the PDF.
    ###     2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
    ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
    ###     3. Use the torch.split function to iterate over the time dimension of Y.
    ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
    ###             - Squeeze Y_t into a tensor of dimension (b, e).
    ###             - Construct Ybar_t by concatenating Y_t with o_prev on their last dimension
    ###             - Use the step function to compute the Decoder's next (cell, state) values
    ###               as well as the new combined output o_t.
    ###             - Append o_t to combined_outputs
    ###             - Update o_prev to the new o_t.
    ###     4. Use torch.stack to convert combined_outputs from a list length tgt_len of
    ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
    ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
    ###
    ### Note:
    ###    - When using the squeeze() function make sure to specify the dimension you want to squeeze
    ###      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    ###
    ### You may find some of these functions useful:
    ###     Zeros Tensor:
    ###         https://pytorch.org/docs/stable/torch.html#torch.zeros
    ###     Tensor Splitting (iteration):
    ###         https://pytorch.org/docs/stable/torch.html#torch.split
    ###     Tensor Dimension Squeezing:
    ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tensor Stacking:
    ###         https://pytorch.org/docs/stable/torch.html#torch.stack
    # enc_hiddens: (b, src_len, h*2) -> enc_hiddens_proj: (b, src_len, h)
    enc_hiddens_proj = self.att_projection(enc_hiddens)
    # Embed the gold target sentences with self.model_embeddings.target
    Y = self.model_embeddings.target(target_padded)
    Y = torch.split(Y, 1, dim=0)
    for y_t in Y:
        # Specify the dimension so the batch dimension is not removed when batch_size = 1.
        y_t = torch.squeeze(y_t, dim=0)
        ybar_t = torch.cat((y_t, o_prev), dim=1)
        dec_state, o_t, e_t = self.step(ybar_t, dec_state, enc_hiddens, enc_hiddens_proj, enc_masks)
        combined_outputs.append(o_t)
        o_prev = o_t
    combined_outputs = torch.stack(combined_outputs, dim=0)
    ### END YOUR CODE

    return combined_outputs


(f)

def step(self, Ybar_t: torch.Tensor,
         dec_state: Tuple[torch.Tensor, torch.Tensor],
         enc_hiddens: torch.Tensor,
         enc_hiddens_proj: torch.Tensor,
         enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
    """ Compute one forward step of the LSTM decoder, including the attention computation.

    @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
        where b = batch size, e = embedding size, h = hidden size.
    @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
    @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
        src_len = maximum source length, h = hidden size.
    @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
        where b = batch size, src_len = maximum source length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
        where b = batch size, src_len is maximum source length.

    @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's new hidden state, second tensor is decoder's new cell.
    @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
    @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
        Note: You will not use this outside of this function.
              We are simply returning this value so that we can sanity check
              your implementation.
    """

    combined_output = None

    ### YOUR CODE HERE (~3 Lines)
    ### TODO:
    ###     1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
    ###     2. Split dec_state into its two parts (dec_hidden, dec_cell)
    ###     3. Compute the attention scores e_t, a Tensor shape (b, src_len).
    ###        Note: b = batch_size, src_len = maximum source length, h = hidden size.
    ###
    ### Hints:
    ###       - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
    ###       - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
    ###       - Use batched matrix multiplication (torch.bmm) to compute e_t (be careful about the input/output shapes!)
    ###       - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
    ###       - When using the squeeze() function make sure to specify the dimension you want to squeeze
    ###         over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    ###
    ### Use the following docs to implement this functionality:
    ###     Batch Multiplication:
    ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
    ###     Tensor Unsqueeze:
    ###         https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
    ###     Tensor Squeeze:
    ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
    dec_state = self.decoder(Ybar_t, dec_state)
    dec_hidden = dec_state[0]
    dec_cell = dec_state[1]
    # enc_hiddens_proj: (b, src_len, h), dec_hidden: (b, h), e_t: (b, src_len)
    # torch.squeeze removes a dimension of size 1; torch.unsqueeze inserts a size-1 dimension at the given index.
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(-1)).squeeze(-1)
    ### END YOUR CODE

    # Set e_t to -inf where enc_masks has 1
    if enc_masks is not None:
        e_t.data.masked_fill_(enc_masks.bool(), -float('inf'))

    ### YOUR CODE HERE (~6 Lines)
    ### TODO:
    ###     1. Apply softmax to e_t to yield alpha_t
    ###     2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
    ###        attention output vector, a_t.
    ### Hints:
    ###       - alpha_t is shape (b, src_len)
    ###       - enc_hiddens is shape (b, src_len, 2h)
    ###       - a_t should be shape (b, 2h)
    ###       - You will need to do some squeezing and unsqueezing.
    ###     Note: b = batch size, src_len = maximum source length, h = hidden size.
    ### TODO:
    ###     3. Concatenate dec_hidden with a_t to compute tensor U_t
    ###     4. Apply the combined output projection layer to U_t to compute tensor V_t
    ###     5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
    ###
    ### Use the following docs to implement this functionality:
    ###     Softmax:
    ###         https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.softmax
    ###     Batch Multiplication:
    ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
    ###     Tensor View:
    ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tanh:
    ###         https://pytorch.org/docs/stable/torch.html#torch.tanh
    # Softmax over the source positions gives the attention distribution alpha_t: (b, src_len).
    alpha_t = F.softmax(e_t, dim=1)
    # (b, 1, src_len) x (b, src_len, 2h) -> (b, 1, 2h) -> (b, 2h)
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(dim=1)
    U_t = torch.cat((a_t, dec_hidden), dim=1)
    V_t = self.combined_output_projection(U_t)
    O_t = self.dropout(torch.tanh(V_t))

    ### END YOUR CODE

    combined_output = O_t
    return dec_state, combined_output, e_t

(g)

Masking the encoder's padded inputs marks the positions that were filled in by padding, so that they can be ignored during training and do not influence the attention computation. When the masks are applied, padded positions get a mask value of 1 (i.e., bool = True); the attention scores at those positions are then set to $-\infty$, and since $\exp(-\infty) \rightarrow 0$ under the softmax, the model effectively ignores the padded part of each sentence.
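
A minimal sketch of this effect; the score values and mask below are made up for illustration:

import torch
import torch.nn.functional as F

# Hypothetical attention scores for one sentence of length 5,
# where the last two positions are padding.
e_t = torch.tensor([[1.2, 0.3, -0.5, 0.0, 0.7]])
enc_masks = torch.tensor([[0, 0, 0, 1, 1]])  # 1 marks a padded position

# Same operation as in step(): padded positions get -inf before the softmax.
e_t = e_t.masked_fill(enc_masks.bool(), -float('inf'))
alpha_t = F.softmax(e_t, dim=1)
print(alpha_t)  # the attention weights on the padded positions are exactly 0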

(h)

epoch 3, iter 18610, avg. loss 58.13, avg. ppl 8.95 cum. examples 320, speed 1491.16 words/sec, time elapsed 3291.41 sec
epoch 3, iter 18620, avg. loss 54.11, avg. ppl 8.00 cum. examples 640, speed 5900.33 words/sec, time elapsed 3292.82 sec
epoch 3, iter 18630, avg. loss 56.25, avg. ppl 8.79 cum. examples 960, speed 5968.53 words/sec, time elapsed 3294.21 sec
epoch 3, iter 18640, avg. loss 58.12, avg. ppl 9.20 cum. examples 1280, speed 5329.38 words/sec, time elapsed 3295.78 sec
epoch 3, iter 18650, avg. loss 51.07, avg. ppl 7.36 cum. examples 1600, speed 4701.92 words/sec, time elapsed 3297.52 sec
epoch 3, iter 18660, avg. loss 55.50, avg. ppl 8.40 cum. examples 1920, speed 5755.93 words/sec, time elapsed 3298.97 sec
epoch 3, iter 18670, avg. loss 53.45, avg. ppl 8.33 cum. examples 2240, speed 5521.04 words/sec, time elapsed 3300.44 sec
epoch 3, iter 18680, avg. loss 55.00, avg. ppl 8.06 cum. examples 2560, speed 5861.04 words/sec, time elapsed 3301.88 sec
epoch 3, iter 18690, avg. loss 54.89, avg. ppl 8.64 cum. examples 2880, speed 5247.28 words/sec, time elapsed 3303.43 sec
epoch 3, iter 18700, avg. loss 57.09, avg. ppl 8.81 cum. examples 3200, speed 5595.19 words/sec, time elapsed 3304.93 sec
epoch 3, iter 18710, avg. loss 59.10, avg. ppl 9.15 cum. examples 3520, speed 5764.15 words/sec, time elapsed 3306.41 sec
epoch 3, iter 18720, avg. loss 51.66, avg. ppl 7.38 cum. examples 3840, speed 4835.91 words/sec, time elapsed 3308.12 sec
epoch 3, iter 18730, avg. loss 58.08, avg. ppl 9.23 cum. examples 4160, speed 5069.38 words/sec, time elapsed 3309.77 sec
epoch 3, iter 18740, avg. loss 58.46, avg. ppl 8.77 cum. examples 4480, speed 5673.07 words/sec, time elapsed 3311.29 sec
epoch 3, iter 18750, avg. loss 58.70, avg. ppl 9.46 cum. examples 4800, speed 5863.14 words/sec, time elapsed 3312.72 sec
epoch 4, iter 18760, avg. loss 51.31, avg. ppl 6.98 cum. examples 5120, speed 5585.10 words/sec, time elapsed 3314.23 sec
epoch 4, iter 18770, avg. loss 49.36, avg. ppl 6.88 cum. examples 5440, speed 6081.40 words/sec, time elapsed 3315.58 sec
epoch 4, iter 18780, avg. loss 50.36, avg. ppl 7.06 cum. examples 5760, speed 5890.12 words/sec, time elapsed 3316.98 sec
epoch 4, iter 18790, avg. loss 48.79, avg. ppl 6.37 cum. examples 6080, speed 5914.69 words/sec, time elapsed 3318.40 sec
epoch 4, iter 18800, avg. loss 51.09, avg. ppl 7.16 cum. examples 6400, speed 5709.77 words/sec, time elapsed 3319.86 sec
epoch 4, iter 18800, cum. loss 54.53, cum. ppl 8.10 cum. examples 6400
begin validation ...
validation: iter 18800, dev. ppl 12.352464
hit patience 1
hit #4 trial
load previously best model and decay learning rate to 0.000031
restore parameters of the optimizers
epoch 4, iter 18810, avg. loss 52.73, avg. ppl 7.19 cum. examples 320, speed 1987.35 words/sec, time elapsed 3324.16 sec
epoch 4, iter 18820, avg. loss 50.84, avg. ppl 7.38 cum. examples 640, speed 5747.25 words/sec, time elapsed 3325.58 sec
epoch 4, iter 18830, avg. loss 51.07, avg. ppl 7.29 cum. examples 960, speed 5666.00 words/sec, time elapsed 3327.03 sec
epoch 4, iter 18840, avg. loss 50.78, avg. ppl 6.95 cum. examples 1280, speed 5808.57 words/sec, time elapsed 3328.47 sec
epoch 4, iter 18850, avg. loss 50.70, avg. ppl 6.89 cum. examples 1600, speed 5786.21 words/sec, time elapsed 3329.93 sec
epoch 4, iter 18860, avg. loss 52.06, avg. ppl 7.29 cum. examples 1920, speed 5404.75 words/sec, time elapsed 3331.48 sec
epoch 4, iter 18870, avg. loss 51.83, avg. ppl 7.13 cum. examples 2240, speed 5924.19 words/sec, time elapsed 3332.90 sec
epoch 4, iter 18880, avg. loss 51.02, avg. ppl 7.25 cum. examples 2560, speed 5307.42 words/sec, time elapsed 3334.46 sec
epoch 4, iter 18890, avg. loss 49.71, avg. ppl 6.98 cum. examples 2880, speed 5497.55 words/sec, time elapsed 3335.95 sec
epoch 4, iter 18900, avg. loss 51.77, avg. ppl 7.19 cum. examples 3200, speed 5847.27 words/sec, time elapsed 3337.38 sec
epoch 4, iter 18910, avg. loss 50.58, avg. ppl 6.99 cum. examples 3520, speed 5952.13 words/sec, time elapsed 3338.78 sec
epoch 4, iter 18920, avg. loss 53.59, avg. ppl 7.55 cum. examples 3840, speed 5720.31 words/sec, time elapsed 3340.26 sec
epoch 4, iter 18930, avg. loss 49.79, avg. ppl 6.85 cum. examples 4160, speed 5844.42 words/sec, time elapsed 3341.68 sec
epoch 4, iter 18940, avg. loss 53.56, avg. ppl 7.98 cum. examples 4480, speed 5481.18 words/sec, time elapsed 3343.19 sec
epoch 4, iter 18950, avg. loss 48.81, avg. ppl 6.52 cum. examples 4800, speed 5548.51 words/sec, time elapsed 3344.69 sec
epoch 4, iter 18960, avg. loss 51.65, avg. ppl 7.16 cum. examples 5120, speed 5802.95 words/sec, time elapsed 3346.14 sec
epoch 4, iter 18970, avg. loss 48.99, avg. ppl 6.55 cum. examples 5440, speed 5652.33 words/sec, time elapsed 3347.61 sec
epoch 4, iter 18980, avg. loss 52.00, avg. ppl 7.15 cum. examples 5760, speed 5564.60 words/sec, time elapsed 3349.13 sec
epoch 4, iter 18990, avg. loss 49.65, avg. ppl 6.67 cum. examples 6080, speed 5997.98 words/sec, time elapsed 3350.53 sec
epoch 4, iter 19000, avg. loss 49.06, avg. ppl 6.51 cum. examples 6400, speed 5984.15 words/sec, time elapsed 3351.93 sec
epoch 4, iter 19000, cum. loss 51.01, cum. ppl 7.06 cum. examples 6400
begin validation ...
validation: iter 19000, dev. ppl 12.401339
hit patience 1
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
load test source sentences from [./zh_en_data/test.zh]
load test target sentences from [./zh_en_data/test.en]
load model from model.bin
Decoding: 0% 0/1001 [00:00<?, ?it/s]/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py:309: UserWarning: Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at ../aten/src/ATen/native/Convolution.cpp:895.)
return F.conv1d(input, weight, bias, self.stride,
/content/drive/MyDrive/a4/nmt_model.py:376: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
alpha_t = F.softmax(e_t)
/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py:1956: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Decoding: 100% 1001/1001 [00:43<00:00, 28.36it/s]
Corpus BLEU: 19.97235090067466

(i)

(i)

Advantage: compared with multiplicative attention, dot-product attention has fewer parameters (none at all) and a lower computational cost.

Disadvantage: it requires the query vector $s$ and the value vectors $h$ to have the same dimensionality.

(ii)

Advantage: additive attention turns the attention score into a small trainable network of its own, so the attention weights can be learned jointly with the rest of the model, which tends to improve its effectiveness.

Disadvantage: computing the additive form at scale is much slower than a plain matrix multiplication.
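
To make the comparison concrete, here is a minimal sketch of the three attention-score variants (dot-product, multiplicative, additive). The shapes and layer names are illustrative and are not part of the assignment code:

import torch
import torch.nn as nn

b, src_len, h = 4, 7, 16          # batch, source length, hidden size
s = torch.randn(b, h)             # decoder query vector s_t
H = torch.randn(b, src_len, h)    # encoder vectors h_1..h_m

# Dot-product attention: e_i = s^T h_i  (no parameters; requires dim(s) == dim(h))
e_dot = torch.bmm(H, s.unsqueeze(2)).squeeze(2)               # (b, src_len)

# Multiplicative attention: e_i = s^T W h_i
W = nn.Linear(h, h, bias=False)
e_mul = torch.bmm(W(H), s.unsqueeze(2)).squeeze(2)            # (b, src_len)

# Additive attention: e_i = v^T tanh(W1 h_i + W2 s)
W1, W2 = nn.Linear(h, h, bias=False), nn.Linear(h, h, bias=False)
v = nn.Linear(h, 1, bias=False)
e_add = v(torch.tanh(W1(H) + W2(s).unsqueeze(1))).squeeze(2)  # (b, src_len)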

2. Analyzing NMT Systems (25 points)

(a)

A combination of several words or morphemes can have a meaning that is quite different from its parts. By mixing a few adjacent words/morphemes together, the convolutional layer extracts more of this information from the corpus, which helps the later stages of the NMT system understand the sentence.
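
As a concrete illustration, the sketch below shows the kernel-size-2 convolution mixing adjacent embeddings, mirroring the permute/convolve/permute-back pattern used in encode(); the dimensions here are made up:

import torch
import torch.nn as nn

src_len, b, e = 10, 4, 32
X = torch.randn(src_len, b, e)                 # embedded source, as in encode()

conv = nn.Conv1d(in_channels=e, out_channels=e, kernel_size=2, padding="same")

# Conv1d expects (batch, channels, length), so permute, convolve, permute back.
X_conv = conv(X.permute(1, 2, 0)).permute(2, 0, 1)   # still (src_len, b, e)

# Each output position now mixes two neighboring word embeddings,
# letting the encoder see short multi-word units (e.g. two-character Chinese words).
print(X_conv.shape)   # torch.Size([10, 4, 32])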

(b)

(i)

The difference between "the culprits were" and "the culprit was" is a singular/plural error: the model has not fully learned which grammatical number the sentence requires. This might be addressed by increasing the amount of training data or enlarging the hidden size.

(ii)

"resources have been exhausted" is repeated twice. This might be fixed by introducing a self-attention mechanism so that, while generating, the model also takes the coherence of its own partially generated sentence into account.

(iii)

"a national mourning today" appears only once in the English training data, and the "今天是XX日" ("today is ... day") sentence structure is also relatively hard to learn. The model's ability to handle this structure could be improved by adding more targeted training data.

(iv)

"唔做唔错" is dialectal (Cantonese), whereas the model's training data is Mandarin, so the trained model handles this kind of dialect poorly. This could be addressed by adding dialect data or training a dialect-specific model.

(c)

(i)

For $c_1$: $p_1=\frac{4}{9}$, $p_2=\frac{3}{8}$, $BP\approx 0.8$, $BLEU\approx 0.32$

For $c_2$: $p_1=1$, $p_2=\frac{3}{5}$, $BP=1$, $BLEU\approx 0.77$

Since $BLEU(c_1) < BLEU(c_2)$, $c_2$ is judged the better translation. I do not agree with this result.

(ii)

For $c_1$: $p_1=\frac{4}{9}$, $p_2=\frac{3}{8}$, $BP\approx 1$, $BLEU\approx 0.4$

For $c_2$: $p_1=\frac{1}{2}$, $p_2=\frac{1}{5}$, $BP=1$, $BLEU\approx 0.32$

Since $BLEU(c_1) > BLEU(c_2)$, $c_1$ is judged the better translation. I agree with this result.
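
For reference, a small script that plugs the reported $p_1$, $p_2$, and BP values into the BLEU formula and reproduces the scores above (the n-gram counts themselves are taken from the answers, not recomputed here):

import math

def bleu(p1, p2, bp, lambdas=(0.5, 0.5)):
    """BLEU with uniform weights over unigram and bigram precisions."""
    return bp * math.exp(lambdas[0] * math.log(p1) + lambdas[1] * math.log(p2))

# Part (i): both references available
print(bleu(4/9, 3/8, 0.8))   # c1 -> ~0.33
print(bleu(1.0, 3/5, 1.0))   # c2 -> ~0.77

# Part (ii): only reference r1 available
print(bleu(4/9, 3/8, 1.0))   # c1 -> ~0.41
print(bleu(1/2, 1/5, 1.0))   # c2 -> ~0.32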

(iii)

With only a single reference translation, the BLEU score is noisy: a candidate may score too high or too low simply depending on which wording the one reference happens to use. With multiple references, the NMT system has more freedom in how it interprets and phrases the source sentence and is more likely to receive a fair BLEU score.

(iv)

Advantages:

1. BLEU provides a single automatic score and a fixed evaluation metric, which makes model evaluation much simpler.

2. It is language independent and easy to interpret.

Disadvantages:

1. It still falls noticeably short of direct human judgment.

2. It does not take grammar or sentence structure into account.

