1. Recurrent Neural Networks - Introduction

In this notebook, we will look at the basics of how to use nn.RNNCell and nn.RNN to create RNN cells and RNN layers, and illustrate the tensor shapes that RNNs expect, which are important to get right.

1.1. Imports

import torch
import torch.nn as nn

1.2. RNN Cell

Given an input feature dimension \(n\) and a hidden dimension \(d\), if the transformed (new hidden) dimension is \(m\), then the dimensionality of \(W_{ih}\) should be \(m \times n\), and that of \(W_{hh}\) should be \(m \times d\). In nn.RNNCell the new hidden state has the same size as the old one, so \(m = d = \) hidden_size; PyTorch stores weight_ih with shape (hidden_size, input_size) and weight_hh with shape (hidden_size, hidden_size).

The following code randomly initialises a (\(2\times 2\)) input-to-hidden matrix and a (\(2\times 2\)) hidden-to-hidden matrix.

n_features = 2
hidden_dim = 2

torch.manual_seed(19)
rnn_cell = nn.RNNCell(input_size=n_features, hidden_size=hidden_dim)
rnn_cell.state_dict()
OrderedDict([('weight_ih',
              tensor([[ 0.6627, -0.4245],
                      [ 0.5373,  0.2294]])),
             ('weight_hh',
              tensor([[-0.4015, -0.5385],
                      [-0.1956, -0.6835]])),
             ('bias_ih', tensor([0.4954, 0.6533])),
             ('bias_hh', tensor([-0.3565, -0.2904]))])
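
To see how these matrices are used, here is a minimal sketch (with a hypothetical input x and an all-zero hidden state h) checking that a single nn.RNNCell step computes the Elman update \(h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})\):

x = torch.randn(1, n_features)   # hypothetical input: (batch, input features)
h = torch.zeros(1, hidden_dim)   # initial hidden state: (batch, hidden dimension)
h_new = rnn_cell(x, h)
manual = torch.tanh(x @ rnn_cell.weight_ih.T + rnn_cell.bias_ih
                    + h @ rnn_cell.weight_hh.T + rnn_cell.bias_hh)
torch.allclose(h_new, manual)    # should be True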

Your Turn

Modify the input feature and hidden dimension, to observe how the weight matrices change.

1.3. RNN Layer

An RNN cell requires us to manually feed the previous hidden state, together with the current input, back into the same cell inside a for loop (see the forward() method of the ElmanRNN model class for an example; a minimal sketch of such a loop follows below). Luckily, PyTorch provides the nn.RNN module, which takes care of this recurrent behaviour for us.
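
The sketch below assumes a hypothetical sequence-first input of shape (L, N, F) and simply reuses the same cell at every step:

xs = torch.randn(4, 1, n_features)   # hypothetical sequence: L=4 steps, N=1, F=2
hidden = torch.zeros(1, hidden_dim)  # initial hidden state
hidden_states = []
for x_t in xs:                       # loop over the L time steps
    hidden = rnn_cell(x_t, hidden)   # feed the previous hidden state back into the SAME cell
    hidden_states.append(hidden)
len(hidden_states), hidden.shape     # L hidden states; the final one is (1, hidden_dim)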

The example below creates the same set of weights, but with an l0 suffix on the parameter keys, indicating that these weights belong to the first layer of the RNN.

n_features = 2
hidden_dim = 2

torch.manual_seed(19)
rnn = nn.RNN(input_size=n_features, hidden_size=hidden_dim)
rnn.state_dict()
OrderedDict([('weight_ih_l0',
              tensor([[ 0.6627, -0.4245],
                      [ 0.5373,  0.2294]])),
             ('weight_hh_l0',
              tensor([[-0.4015, -0.5385],
                      [-0.1956, -0.6835]])),
             ('bias_ih_l0', tensor([0.4954, 0.6533])),
             ('bias_hh_l0', tensor([-0.3565, -0.2904]))])
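
As a quick sanity check (assuming the rnn_cell from above is still in scope), the layer's l0 parameters are identical to the cell's, since both modules were created right after torch.manual_seed(19):

# Both comparisons should return tensor(True), matching the values printed above
(rnn.weight_ih_l0 == rnn_cell.weight_ih).all(), (rnn.weight_hh_l0 == rnn_cell.weight_hh).all()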

1.4. Sequence-first shape default in RNN

1.4.1. Generate some synthetic data

The code below generates random sequences of four points (e.g. A, B, C, D). Each point has two values, so it can be thought of as a data point in a 2D space. The points in each sequence are ordered either clockwise or counter-clockwise (the direction). This is a simplified version of sentences:

  • sentences are sequences (of words);

  • the order or direction of the above sequences is analogous to classes, e.g. sentiment or news categories;

  • words are the elements in the sequence, represented as vectors with a certain number of dimensions.

The example data points have a two-dimensional feature space, so they can be easily visualised in an x-y Cartesian coordinate system, whereas word vectors may have 50, 100, 300, etc. dimensions, depending on the embedding method.

import numpy as np

def generate_sequences(n=128, variable_len=False, seed=13):
    # The four corners of a square, in a fixed order
    basic_corners = np.array([[-1, -1], [-1, 1], [1, 1], [1, -1]])
    np.random.seed(seed)
    # Random starting corner for each sequence
    bases = np.random.randint(4, size=n)
    if variable_len:
        lengths = np.random.randint(3, size=n) + 2
    else:
        lengths = [4] * n
    # Random direction (0 or 1) for each sequence; this is the class label
    directions = np.random.randint(2, size=n)
    # Walk the four corners starting at each base, reverse the walk for one of the
    # two directions, truncate to the chosen length, and add a little Gaussian noise
    points = [basic_corners[[(b + i) % 4 for i in range(4)]][slice(None, None, d * 2 - 1)][:l]
              + np.random.randn(l, 2) * 0.1
              for b, d, l in zip(bases, directions, lengths)]
    return points, directions

points, directions = generate_sequences(n=128, seed=13)
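
Each generated item is a sequence of noisy corner points together with a 0/1 direction label, for example:

points[0].shape, directions[0]   # a (4, 2) array of points and its 0/1 direction label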

1.4.2. Batch-first tensor

We now take three (N=3) sequences, each of which has four data points (L=4), with two features (F=2) per data point. This is an example of a batch-first input tensor (N, L, F), as shown below.

batch = torch.as_tensor(points[:3]).float()
batch.shape
torch.Size([3, 4, 2])

1.4.3. How to use RNN with correctly shaped tensors

However, nn.RNN expects sequence-first tensors (L, N, F) by default, so we need to make our tensor RNN-friendly. There are two options:

  1. We could explicitly change the shape of the batch using permute() to flip the first two dimensions.

  2. We could use the batch_first argument in the RNN layer construction.

1.4.3.1. Option 1

# From a batch-first tensor to sequence-first
permuted_batch = batch.permute(1, 0, 2)
permuted_batch.shape
torch.Size([4, 3, 2])

Once the data is in an “RNN-friendly” shape, we can run it through a regular RNN and get two sequence-first tensors back:

torch.manual_seed(19)
rnn = nn.RNN(input_size=n_features, hidden_size=hidden_dim)
out, final_hidden = rnn(permuted_batch)
out.shape, final_hidden.shape
(torch.Size([4, 3, 2]), torch.Size([1, 3, 2]))

Once we’re done with the RNN, we can turn the outputs back into our familiar batch-first shape:

batch_output = out.permute(1, 0, 2)           # (N, L, H)
batch_hidden = final_hidden.permute(1, 0, 2)  # (N, 1, H)
batch_output.shape, batch_hidden.shape
(torch.Size([3, 4, 2]), torch.Size([3, 1, 2]))

Your Turn

In the code above, the hidden state dimension (hidden_dim) happens to be the same as the number of input features (n_features), which is 2. These two do not have to agree. For example, the word embedding dimension could be 100 (n_features=100) while the hidden dimension is 50 (hidden_dim=50). Change these two parameters to observe how the shapes change and to get familiar with the sequence-first and batch-first tensor shapes; a sketch of the expected shapes follows.
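
As a hint, here is a minimal sketch using the hypothetical dimensions mentioned above (100 input features, 50 hidden units) with random data:

wide_input = torch.randn(4, 3, 100)                # sequence-first: (L=4, N=3, F=100)
rnn_wide = nn.RNN(input_size=100, hidden_size=50)  # sequence-first by default
out_wide, hidden_wide = rnn_wide(wide_input)
out_wide.shape, hidden_wide.shape                  # (L, N, H) and (1, N, H)
(torch.Size([4, 3, 50]), torch.Size([1, 3, 50]))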

1.4.3.2. Option 2

Option 1 requires a lot of bookkeeping. Instead, we can set the RNN’s batch_first argument to True, so we can use the batch above without any modifications.

Warning

Even with batch_first=True, you get two distinct shapes as a result: batch-first (N, L, H) for the output and sequence-first (1, N, H) for the final hidden state.

On the one hand, this can lead to confusion. On the other hand, most of the time we will not be handling the hidden state directly; we will work with the batch-first output instead. So we can stick with batch-first for now and, when the time comes to handle the hidden state, we will highlight the difference in shapes once again.

torch.manual_seed(19)
rnn_batch_first = nn.RNN(input_size=n_features, hidden_size=hidden_dim, batch_first=True)
out, final_hidden = rnn_batch_first(batch)
out.shape, final_hidden.shape
(torch.Size([3, 4, 2]), torch.Size([1, 3, 2]))

Note

For simple RNNs, the last element of the output IS the final hidden state!

out = out.permute(1, 0, 2)   # back to sequence-first, so out[-1] is the last time step
(out[-1] == final_hidden).all()
tensor(True)

Summary

The RNN’s default behavior is to handle tensors having the shape (L, N, H) for the output (the sequence of hidden states) and (L, N, F) for the input sequences of data points.

Datasets and data loaders, unless customized otherwise, will produce data points in the shape (N,L,F).

To address this difference, we’ll be using Option 2, the batch_first argument, to turn both the inputs and the outputs into this familiar batch-first shape. But be aware of the shape difference between the hidden state and the output. In other words, with batch_first set to True, the input and output are “batch-first”, but the hidden state is still “sequence-first”.

1.5. Stacked RNN with Two Layers

Passing num_layers=2 to nn.RNN stacks two RNN layers, so the state_dict below contains a second set of weights and biases with an l1 suffix.

torch.manual_seed(19)
rnn_stacked = nn.RNN(input_size=2, hidden_size=2,
        num_layers=2, batch_first=True)
state = rnn_stacked.state_dict()
state
OrderedDict([('weight_ih_l0',
              tensor([[ 0.6627, -0.4245],
                      [ 0.5373,  0.2294]])),
             ('weight_hh_l0',
              tensor([[-0.4015, -0.5385],
                      [-0.1956, -0.6835]])),
             ('bias_ih_l0', tensor([0.4954, 0.6533])),
             ('bias_hh_l0', tensor([-0.3565, -0.2904])),
             ('weight_ih_l1',
              tensor([[-0.6701, -0.5811],
                      [-0.0170, -0.5856]])),
             ('weight_hh_l1',
              tensor([[ 0.1159, -0.6978],
                      [ 0.3241, -0.0983]])),
             ('bias_ih_l1', tensor([-0.3163, -0.2153])),
             ('bias_hh_l1', tensor([ 0.0722, -0.3242]))])

1.5.1. Manually Stacking Two RNNs

Let’s replicate the above with two RNNs, “manually” stacked together.

list(state.items())
[('weight_ih_l0',
  tensor([[ 0.6627, -0.4245],
          [ 0.5373,  0.2294]])),
 ('weight_hh_l0',
  tensor([[-0.4015, -0.5385],
          [-0.1956, -0.6835]])),
 ('bias_ih_l0', tensor([0.4954, 0.6533])),
 ('bias_hh_l0', tensor([-0.3565, -0.2904])),
 ('weight_ih_l1',
  tensor([[-0.6701, -0.5811],
          [-0.0170, -0.5856]])),
 ('weight_hh_l1',
  tensor([[ 0.1159, -0.6978],
          [ 0.3241, -0.0983]])),
 ('bias_ih_l1', tensor([-0.3163, -0.2153])),
 ('bias_hh_l1', tensor([ 0.0722, -0.3242]))]

Your Turn

Given a string k, what does k[:-1] do? As an example, here is a slice dropping the last two characters:

s = 'test'
s[:-2]
'te'

# Rename the l1 keys to l0 so they can be loaded into a single-layer RNN
dict([(k[:-1] + '0', v) for k, v in list(state.items())[4:]])
{'weight_ih_l0': tensor([[-0.6701, -0.5811],
         [-0.0170, -0.5856]]),
 'weight_hh_l0': tensor([[ 0.1159, -0.6978],
         [ 0.3241, -0.0983]]),
 'bias_ih_l0': tensor([-0.3163, -0.2153]),
 'bias_hh_l0': tensor([ 0.0722, -0.3242])}
# Create two RNNs
rnn_layer0 = nn.RNN(input_size=2, hidden_size=2, batch_first=True)
rnn_layer1 = nn.RNN(input_size=2, hidden_size=2, batch_first=True)
# Load the same weights from above
rnn_layer0.load_state_dict(dict(list(state.items())[:4]))
# Note: the layer labels in the keys need to change from l1 to l0
rnn_layer1.load_state_dict(dict([(k[:-1]+'0', v) for k, v in list(state.items())[4:]]))
<All keys matched successfully>

Step 0: A batch sequence from the sample (N=1, L=4, F=2)

x = torch.as_tensor(points[0:1]).float()

Step 1: Feed the input to the first RNN layer

out0, h0 = rnn_layer0(x)

Step 2: Feed the output of the first layer to the second RNN layer

out1, h1 = rnn_layer1(out0)

The overall output of the stacked RNN must have two elements as well:

  • a sequence of hidden states, those produced by the last layer (out1)

  • the concatenation of final hidden states of all layers

out1, torch.cat([h0, h1])
(tensor([[[-0.7533, -0.7711],
          [-0.0566, -0.5960],
          [ 0.4324, -0.2908],
          [ 0.1563, -0.5152]]], grad_fn=<TransposeBackward1>),
 tensor([[[-0.5297,  0.3551]],
 
         [[ 0.1563, -0.5152]]], grad_fn=<CatBackward>))

This should be the same as running a stacked RNN, as shown below.

out, hidden = rnn_stacked(x)
out, hidden
(tensor([[[-0.7533, -0.7711],
          [-0.0566, -0.5960],
          [ 0.4324, -0.2908],
          [ 0.1563, -0.5152]]], grad_fn=<TransposeBackward1>),
 tensor([[[-0.5297,  0.3551]],
 
         [[ 0.1563, -0.5152]]], grad_fn=<StackBackward>))

Important

For stacked RNNs, the last element of the output is the final hidden state of the LAST LAYER!

But, since we’re using a batch_first layer, we need to permute the hidden state’s dimensions to batch-first as well:

(out[:, -1] == hidden.permute(1, 0, 2)[:, -1]).all()
tensor(True)
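
For contrast, the same comparison against the first layer’s final hidden state fails, confirming that only the last layer’s hidden states appear in the output:

(out[:, -1] == hidden.permute(1, 0, 2)[:, 0]).all()
tensor(False)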

1.6. Bidirectional RNN

Setting bidirectional=True makes nn.RNN run a second layer over the reversed sequence; its parameters appear in the state_dict below with a _reverse suffix.

torch.manual_seed(19)
rnn_bidirect = nn.RNN(input_size=2, hidden_size=2,
            bidirectional=True, batch_first=True)
state = rnn_bidirect.state_dict()
state
OrderedDict([('weight_ih_l0',
              tensor([[ 0.6627, -0.4245],
                      [ 0.5373,  0.2294]])),
             ('weight_hh_l0',
              tensor([[-0.4015, -0.5385],
                      [-0.1956, -0.6835]])),
             ('bias_ih_l0', tensor([0.4954, 0.6533])),
             ('bias_hh_l0', tensor([-0.3565, -0.2904])),
             ('weight_ih_l0_reverse',
              tensor([[-0.6701, -0.5811],
                      [-0.0170, -0.5856]])),
             ('weight_hh_l0_reverse',
              tensor([[ 0.1159, -0.6978],
                      [ 0.3241, -0.0983]])),
             ('bias_ih_l0_reverse', tensor([-0.3163, -0.2153])),
             ('bias_hh_l0_reverse', tensor([ 0.0722, -0.3242]))])

1.6.1. Manually Created Bidirectional RNN

Once again, we can create two simple RNNs and use the weights and biases above to set their parameters accordingly. Each RNN will behave as one direction of the bidirectional one:

rnn_forward = nn.RNN(input_size=2, hidden_size=2, batch_first=True)
rnn_reverse = nn.RNN(input_size=2, hidden_size=2, batch_first=True)
# The forward RNN takes the l0 weights as they are
rnn_forward.load_state_dict(dict(list(state.items())[:4]))
# The reverse RNN takes the *_reverse weights; k[:-8] strips the '_reverse' suffix (8 characters)
rnn_reverse.load_state_dict(dict([(k[:-8], v) for k, v in list(state.items())[4:]]))
<All keys matched successfully>

1.6.1.1. Step 0: A batch sequence and its reverse

x = torch.as_tensor(points[0:1]).float()
print(x)
x_rev = torch.flip(x, dims=[1])  # flip along dim 1, the sequence (L) dimension of the (N, L, F) tensor
print(x_rev)
tensor([[[ 1.0349,  0.9661],
         [ 0.8055, -0.9169],
         [-0.8251, -0.9499],
         [-0.8670,  0.9342]]])
tensor([[[-0.8670,  0.9342],
         [-0.8251, -0.9499],
         [ 0.8055, -0.9169],
         [ 1.0349,  0.9661]]])

1.6.1.2. Step 1: Feed each RNN with its corresponding sequence

Since there is no dependency between the two layers, we just need to feed each layer its corresponding sequence (regular and reversed), and remember to flip the reversed layer’s sequence of hidden states back afterwards.

out, h = rnn_forward(x)
out_rev, h_rev = rnn_reverse(x_rev)
out_rev_back = torch.flip(out_rev, dims=[1])

1.6.1.3. Step 2: Tidy up the output

The overall output of the bidirectional RNN must have two elements as well:

  • a concatenation side-by-side of both sequences of hidden states (out and out_rev_back)

  • the concatenation of final hidden states of both layers

out
tensor([[[ 0.3924,  0.8146],
         [ 0.4347, -0.0481],
         [-0.1521, -0.3367],
         [-0.5297,  0.3551]]], grad_fn=<TransposeBackward1>)
out_rev_back
tensor([[[-0.9355, -0.8353],
         [-0.1766,  0.2596],
         [ 0.8829,  0.0425],
         [-0.2032, -0.7901]]], grad_fn=<FlipBackward>)
torch.cat([out, out_rev_back], dim=2), torch.cat([h, h_rev])
(tensor([[[ 0.3924,  0.8146, -0.9355, -0.8353],
          [ 0.4347, -0.0481, -0.1766,  0.2596],
          [-0.1521, -0.3367,  0.8829,  0.0425],
          [-0.5297,  0.3551, -0.2032, -0.7901]]], grad_fn=<CatBackward>),
 tensor([[[-0.5297,  0.3551]],
 
         [[-0.9355, -0.8353]]], grad_fn=<CatBackward>))

Let’s double-check the results against the bidirectional RNN itself:

out, hidden = rnn_bidirect(x)
out, hidden
(tensor([[[ 0.3924,  0.8146, -0.9355, -0.8353],
          [ 0.4347, -0.0481, -0.1766,  0.2596],
          [-0.1521, -0.3367,  0.8829,  0.0425],
          [-0.5297,  0.3551, -0.2032, -0.7901]]], grad_fn=<TransposeBackward1>),
 tensor([[[-0.5297,  0.3551]],
 
         [[-0.9355, -0.8353]]], grad_fn=<StackBackward>))

Important

For bidirectional RNNs, the last element of the output ISN’T the final hidden state! Once again, since we’re using a batch_first layer, we need to permute the hidden state’s dimensions to batch-first as well:

out[:, -1] == hidden.permute(1, 0, 2).view(1, -1)
tensor([[ True,  True, False, False]])

Bidirectional RNNs are different because the final hidden state corresponds to the last element in the sequence for the forward layer, and to the first element in the sequence for the reverse layer. The output, on the other hand, is aligned to the original sequence order, hence the difference, as the check below confirms.
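
Using the tensors above, we can check this directly: the forward direction’s final hidden state sits in the first half of the LAST output step, while the reverse direction’s final hidden state sits in the second half of the FIRST output step:

# forward direction: first half of the last step; reverse direction: second half of the first step
(out[:, -1, :2] == hidden[0]).all(), (out[:, 0, 2:] == hidden[1]).all()
(tensor(True), tensor(True))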