Lab12: Sequence to Sequence Learning with Attention

In this lab we introduce the basic attention (cross-attention) and self-attention mechanisms, applying them to the same square sequence prediction problem we saw in Lab11's Sequence to Sequence Model.
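As a warm-up, the sketch below shows the core of cross-attention: the decoder hidden state acts as a query, and a softmax over similarity scores produces a weighted sum (context vector) of the encoder outputs. This is a minimal dot-product variant, assuming the decoder hidden state and encoder outputs share the same hidden size; the function and variable names are illustrative, not the ones used in the lab notebooks.

```python
import torch
import torch.nn.functional as F

def cross_attention(decoder_hidden, encoder_outputs):
    """decoder_hidden:  (batch, hidden)          -- the query
       encoder_outputs: (batch, src_len, hidden) -- the keys/values"""
    # One similarity score per source position.
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2))  # (batch, src_len, 1)
    weights = F.softmax(scores, dim=1)                                # normalise over src_len
    # Context vector: attention-weighted sum of the encoder outputs.
    context = torch.bmm(weights.transpose(1, 2), encoder_outputs)     # (batch, 1, hidden)
    return context.squeeze(1), weights.squeeze(2)

# Example shapes
dec_h = torch.randn(4, 256)
enc_out = torch.randn(4, 10, 256)
context, attn = cross_attention(dec_h, enc_out)   # context: (4, 256), attn: (4, 10)
```

Self-attention follows the same pattern, except that the queries, keys, and values all come from the same sequence.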

The second part of this lab turns to a real neural machine translation task and builds two attention-based models: one with no sampling (i.e. using teacher forcing 100% of the time), and another with scheduled sampling (i.e. using teacher forcing only some of the time). Since the encoder is a bidirectional GRU (biGRU), padding tokens would otherwise be fed through the backward layer and affect the results, so we take this opportunity to introduce PyTorch's PackedSequence for handling variable-length sequences. Sketches of both ideas follow.
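The first sketch contrasts the two training regimes inside the decoder loop: with probability `teacher_forcing_ratio` the ground-truth token is fed in at the next step, otherwise the model's own prediction is used; setting the ratio to 1.0 reproduces the "no sampling" variant. The toy GRUCell decoder and the variable names here are illustrative, not the lab's actual model.

```python
import random
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 50, 32, 64
embedding = nn.Embedding(vocab_size, emb_dim)
rnn_cell = nn.GRUCell(emb_dim, hid_dim)
out_proj = nn.Linear(hid_dim, vocab_size)

trg = torch.randint(0, vocab_size, (4, 8))   # (batch, trg_len) target batch
hidden = torch.zeros(4, hid_dim)             # would come from the encoder in practice
teacher_forcing_ratio = 0.5                  # 1.0 == pure teacher forcing

input_tok = trg[:, 0]                        # <sos> token for each sequence
logits_per_step = []
for t in range(1, trg.size(1)):
    hidden = rnn_cell(embedding(input_tok), hidden)
    logits = out_proj(hidden)                # (batch, vocab_size)
    logits_per_step.append(logits)
    use_teacher = random.random() < teacher_forcing_ratio
    # Teacher forcing feeds the ground-truth token; otherwise feed the model's own prediction.
    input_tok = trg[:, t] if use_teacher else logits.argmax(dim=1)
```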
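The second sketch shows why PackedSequence matters for the biGRU encoder: packing tells the RNN each sequence's true length, so the backward direction starts from the real last token rather than from padding. This is a minimal example with `pack_padded_sequence` / `pad_packed_sequence`; the layer sizes and the padding index are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embedding = nn.Embedding(1000, 128, padding_idx=0)
gru = nn.GRU(128, 256, bidirectional=True, batch_first=True)

src = torch.randint(1, 1000, (4, 10))        # (batch, src_len), pad index is 0
src[2, 7:] = 0                               # simulate one shorter sequence
lengths = (src != 0).sum(dim=1)              # true lengths before padding

embedded = embedding(src)
packed = pack_padded_sequence(embedded, lengths.cpu(),
                              batch_first=True, enforce_sorted=False)
packed_out, hidden = gru(packed)
# Unpack back to a padded tensor; padded positions never entered the recurrence.
outputs, _ = pad_packed_sequence(packed_out, batch_first=True)  # (batch, src_len, 2*256)
```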

Credit: the notebooks are adapted from: