Neural Networks in PyTorch
======================================
## A neural network with one hidden layer 

Extending the simple perceptron in the openning page of this lab, let's build a simple neural network that takes two binary inputs, and simulate the logical operation of XOR. The network has two sets of weights and biases, 
one set between the input and the hidden layer with two nodes, another set between the hidden layer and the single output, as shown in the figure below.

```{figure} ../images/neural-network-xor.png
:alt: The XOR Neural Network
:class: bg-primary mb-1
:width: 400px
```

```{note}
Compare the code below with the `Perceptron` code at the front page of this lab to have a better understanding of the building blocks. 
``` 

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

### Defining the model class

Let's define the XOR neural network model inherited from the `torch.nn.Module` class using an Object Oriented approach.

In [2]:
class XOR(nn.Module):
    """
    An XOR is similuated using neural network with 
    two fully connected linear layers 
    """
    
    def __init__(self, input_dim, output_dim):
        """
        Args:
            input_dim (int): size of the input features
            output_dim (int): size of the output
        """
        super(XOR, self).__init__()
        self.fc1 = nn.Linear(input_dim, 2)
        self.fc2 = nn.Linear(2, output_dim)

    def forward(self, x_in):
        """The forward pass of the perceptron

        Args:
            x_in (torch.Tensor): an input data tensor
                x_in.shape should be (batch, num_features)
        Returns:
            the resulting tensor. tensor.shape should be (batch,).
        """
        hidden = torch.sigmoid(self.fc1(x_in))
        yhat = torch.sigmoid(self.fc2(hidden))
        return yhat

### Object Orientation in PyTorch

In PyTorch, a model (e.g. the `XOR` model) is represented by a regular Python class that inherits from the Module class.

:::{important}
IMPORTANT: If you are uncomfortable with object-oriented
programming (OOP) concepts like *classes*, *constructors*, *methods/class methods*, *instances*, and *attributes*, it is strongly recommended to follow tutorials such as [Real Python's Objected-Oriented Programming (OOP) in Python 3](https://realpython.com/python3-object-oriented-programming/)
:::

The most fundamental methods a model class needs to implement are:
- `__init__(self)`: it defines the parts that make up the model — in our case,
two parameters, b and w.
- `forward(self, x)`: it performs the actual computation, that is, it outputs a
prediction, given the input x.

:::{note}
Do not forget to include `super().__init__()` to execute
the `__init__()` method of the parent class `(nn.Module)` before
your own.
:::

## XOR Model
 
Now let's create a Linear Model with two inputs (features) and one output. 

Calling the XOR constructor with an `input_dim=2` and `output_dim=1`, namely `XOR(2,1)` results in two fully connected linear models, one `nn.Linear(2,2)` and ``nn.Linear(2,1)`, which will create a model with two input features, one output feature with biases at the input layer, hidden layer and output layer.

In [3]:
model = XOR(2,1)

### Obtain all model parameters using `state_dict()`

We can get the current values of all parameters using our model’s
state_dict() method.

In [4]:
model.state_dict()

OrderedDict([('fc1.weight',
              tensor([[ 0.5827,  0.0717],
                      [-0.2969, -0.1214]])),
             ('fc1.bias', tensor([-0.5186, -0.6406])),
             ('fc2.weight', tensor([[0.3564, 0.5904]])),
             ('fc2.bias', tensor([0.3660]))])

We used to manually assign random values to these weights and biases. Now PyTorch does it for us automatically. 

The `state_dict()` of a given model is simply a Python dictionary that maps each attribute/parameter to its corresponding tensor. But only learnable parameters are included, as its purpose is to keep track of parameters that are going to be updated by the optimizer.

The optimizer itself has a `state_dict()` too, which contains its internal
state, as well as other hyper-parameters. Let's take a quick look at it:

In [5]:
lr = 0.01 
optimizer = optim.SGD(model.parameters(), lr=lr)
optimizer.state_dict()

{'state': {},
 'param_groups': [{'lr': 0.01,
   'momentum': 0,
   'dampening': 0,
   'weight_decay': 0,
   'nesterov': False,
   'params': [0, 1, 2, 3]}]}

### Model and Data need to be on the same device

:::{important} 
We need to send our model to the same device
where the data is. If our data is made of GPU tensors, our model
must "live" inside the GPU as well.
:::

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
model = model.to(device)

## Training XOR 

Now let's put it all together to train an neural XOR model.

### Training Data Preparation

In [8]:
# Training data preparation

x_train_tensor = torch.tensor([[0,0],[0,1],[1,1],[1,0]], device=device).float()
y_train_tensor = torch.tensor([0,1,1,0], device=device).view(4,1).float()

x_val_tensor = torch.clone(x_train_tensor)
y_val_tensor = torch.clone(y_train_tensor)

In [9]:
# Verify the shape of the output tensor
y_train_tensor.shape

torch.Size([4, 1])

### Hyperparameter setup

We need to set up the learning rate and the number of epochs, and then select the three key compoenents of a neural model: model, optimiser and loss function before training. 

In [10]:
# Sets learning rate - this is "eta" ~ the "n" like
# Greek letter
lr = 0.01

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
# Now we can create a model and send it at once to the device
model = XOR(2,1)
model = model.to(device)

# Defines a SGD optimizer to update the parameters
# (now retrieved directly from the model)
optimizer = optim.SGD(model.parameters(), lr=lr)

# Defines a MSE loss function
loss_fn = nn.MSELoss(reduction='mean')

# Defines number of epochs
n_epochs = 100000

### Training

Now we are ready to train. 

In [11]:
for epoch in range(n_epochs):
    #for j in range(steps):
    model.train() # What is this?!?

    # Step 1 - Computes model's predicted output - forward pass
    # No more manual prediction!
    yhat = model(x_train_tensor)

    # Step 2 - Computes the loss
    loss = loss_fn(yhat, y_train_tensor)
    
    # Step 3 - Computes gradients for both "b" and "w" parameters
    loss.backward()

    # Step 4 - Updates parameters using gradients and
    # the learning rate
    optimizer.step()
    optimizer.zero_grad()
    if (epoch % 500 == 0):
        print("Epoch: {0}, Loss: {1}, ".format(epoch, loss.to("cpu").detach().numpy()))

# We can also inspect its parameters using its state_dict
print(model.state_dict())

Epoch: 0, Loss: 0.27375736832618713, 
Epoch: 500, Loss: 0.2555079460144043, 
Epoch: 1000, Loss: 0.25011080503463745, 
Epoch: 1500, Loss: 0.24655020236968994, 
Epoch: 2000, Loss: 0.24295492470264435, 
Epoch: 2500, Loss: 0.238966703414917, 
Epoch: 3000, Loss: 0.23448437452316284, 
Epoch: 3500, Loss: 0.22930260002613068, 
Epoch: 4000, Loss: 0.2232476770877838, 
Epoch: 4500, Loss: 0.21623794734477997, 
Epoch: 5000, Loss: 0.20812170207500458, 
Epoch: 5500, Loss: 0.19883160293102264, 
Epoch: 6000, Loss: 0.18850544095039368, 
Epoch: 6500, Loss: 0.1771014928817749, 
Epoch: 7000, Loss: 0.16488364338874817, 
Epoch: 7500, Loss: 0.15211743116378784, 
Epoch: 8000, Loss: 0.139143705368042, 
Epoch: 8500, Loss: 0.12626223266124725, 
Epoch: 9000, Loss: 0.11392521113157272, 
Epoch: 9500, Loss: 0.10234947502613068, 
Epoch: 10000, Loss: 0.09172774851322174, 
Epoch: 10500, Loss: 0.08199518918991089, 
Epoch: 11000, Loss: 0.0733182281255722, 
Epoch: 11500, Loss: 0.06560301780700684, 
Epoch: 12000, Loss: 0.05

:::{important}
In PyTorch, models have a train() method which, somewhat
disappointingly, does NOT perform a training step. Its only
purpose is to set the model to **training mode**.
Why is this important? Some models may use mechanisms like
Dropout, for instance, which have distinct behaviors during
training and evaluation phases.
:::

It is good practice to call `model.train()` in the training loop. It is also possible to set a model to evaluation mode. We will see this in later labs. 

:::{admonition} Your Turn
Put the returned weights and biases into the XOR neural network diagram and try to work out the output when the input is `[0,0]`.
:::

### Inference (Forward Pass)

Instead of verifying it maually, we can test out our XOR model, with input [0,1] by called the model with the input. Note we do not call the forward function directly, instead we provide input to the model. 

In [12]:
model(torch.tensor([0.,1.]).to(device))

tensor([0.9715], device='cuda:0', grad_fn=<SigmoidBackward>)

## Logging the model training for Visualisation in TensorBoard

TensorBoard by TensorFlow is a very useful tool for visualising training progress and model architectures, despite being a competiting platform, PyTorch provides classes and methods for us to integrate it with our model.

TensorBoard can be loaded inside Jupyter notebooks, or can be started externally from a command line. Before we run TensorBoard, we need to change to the directory of your model code, and create a folder, e.g. `runs` to keep a log of the training progress. 

The examples in this lab are running Tensorboard inside a notebook, but it is a good idea to run TensorBoard on the command line by giving the log directory. Assuming you are one level up the `runs` directory:

```
tensorboard --logdir runs
```
If successful, it will say that `TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)`, copy and paste the URL to your browser to see TensorBoard in action. 

In [13]:
%load_ext tensorboard

### TensorBoard running in Jupyter Notebook

In [None]:
%tensorboard --logdir runs

```{figure} ../images/tensorboard_1.png
:alt: TensorBoard
:class: bg-primary mb-1
:width: 600px
```

### SummaryWriter

It all starts with the creation of a SummaryWriter. TensorBoard to look for logs inside the `runs` folder, it only makes sense to actually log to that folder. Moreover, to be able to distinguish between different experiments or models, we should also specify a sub-folder: test.

If we do not specify any folder, TensorBoard will default to `runs/CURRENT_DATETIME_HOSTNAME`, which is not such a great name if you’d be looking for your experiment results in the future.

So, it is recommended to try to name it in a more meaningful way, like runs/test or runs/simple_linear_regression. It will then create a subfolder inside runs (the folder we specified when we started TensorBoard).

Even better, you should name it in a meaningful way and add datetime or a sequential number as a suffix, like `runs/test_001` or `runs/test_20200502172130`, to avoid writing data of multiple runs into the same folder (we'll see why this is bad in the add_scalars section below).

In [15]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/test')

### `add_graph`

It will produce an input-output graph that allows you to interactively inspect parameters, which is different from the TorchViz's computation graph (a static visualisation - not interactive). 

In [None]:
writer.add_graph(model, x_train_tensor)
%tensorboard --logdir runs

```{figure} ../images/tensorboard_2.png
:alt: TensorBoard
:class: bg-primary mb-1
:width: 600px
```

### `add_scalar`

We can send the loss values to TensorBoard using the `add_scalars` method to send multiple scalar values at once, and it needs three arguments:
- main_tag: the parent name of the tags or, the "group tag"
- tag_scalar_dict: the dictionary containing the key: value pairs for the scalars you want to keep track of (can be training and validation losses)
- global_step: step value, that is, the index you're associating with the values you're sending in the dictionary - the epoch comes to mind in our case, as losses are computed for each epoch

In [17]:
writer.add_scalars(main_tag='loss',
    tag_scalar_dict={'training': loss,
                    'validation': loss},
    global_step=epoch
)

If you run the code above after performing the model training, it will just send both loss values computed for the last epoch.

In [None]:
%tensorboard --logdir runs

```{figure} ../images/tensorboard_3.png
:alt: TensorBoard
:class: bg-primary mb-1
:width: 600px
```

In [19]:
from datetime import datetime
# Sets learning rate - this is "eta" ~ the "n" like
# Greek letter
lr = 0.1

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
# Now we can create a model and send it at once to the device
model = XOR(2,1)
model = model.to(device)

# Defines a SGD optimizer to update the parameters
# (now retrieved directly from the model)
optimizer = optim.SGD(model.parameters(), lr=lr)

# Defines a MSE loss function
loss_fn = nn.MSELoss(reduction='mean')

# Tensorboard setup
writer = SummaryWriter('runs/XOR' + datetime.now().strftime("%Y%m%d-%H%M%S"))
writer.add_graph(model, x_train_tensor.to(device))

# Defines number of epochs
n_epochs = 10000

losses = []
val_losses = [] # note we did not use the validation data
for epoch in range(n_epochs):
    #for j in range(steps):
    model.train() # What is this?!?

    # Step 1 - Computes model's predicted output - forward pass
    # No more manual prediction!
    yhat = model(x_train_tensor)

    # Step 2 - Computes the loss
    loss = loss_fn(yhat, y_train_tensor)
    
    # Step 3 - Computes gradients for both "b" and "w" parameters
    loss.backward()

    # Step 4 - Updates parameters using gradients and
    # the learning rate
    optimizer.step()
    optimizer.zero_grad()
    if (epoch % 500 == 0):
        print("Epoch: {0}, Loss: {1}, ".
            format(epoch, loss.to("cpu").detach().numpy()))

    losses.append(loss)
    writer.add_scalars(main_tag='loss', 
                       tag_scalar_dict={'training': loss,
                                      'validation': loss},
                       global_step=epoch)

writer.close()                        


Epoch: 0, Loss: 0.27375736832618713, 
Epoch: 500, Loss: 0.2081705629825592, 
Epoch: 1000, Loss: 0.09176724404096603, 
Epoch: 1500, Loss: 0.03289078176021576, 
Epoch: 2000, Loss: 0.016263967379927635, 
Epoch: 2500, Loss: 0.009998900815844536, 
Epoch: 3000, Loss: 0.006978640332818031, 
Epoch: 3500, Loss: 0.005261328537017107, 
Epoch: 4000, Loss: 0.00416548689827323, 
Epoch: 4500, Loss: 0.0034290249459445477, 
Epoch: 5000, Loss: 0.002894176635891199, 
Epoch: 5500, Loss: 0.002497166395187378, 
Epoch: 6000, Loss: 0.0021936693228781223, 
Epoch: 6500, Loss: 0.0019435270223766565, 
Epoch: 7000, Loss: 0.0017444714903831482, 
Epoch: 7500, Loss: 0.0015824229922145605, 
Epoch: 8000, Loss: 0.001446553273126483, 
Epoch: 8500, Loss: 0.0013299942947924137, 
Epoch: 9000, Loss: 0.00122724543325603, 
Epoch: 9500, Loss: 0.0011413537431508303, 


In [None]:
%tensorboard --logdir runs

```{figure} ../images/tensorboard_4.png
:alt: TensorBoard
:class: bg-primary mb-1
:width: 600px
```

:::{note}
In the TensorBoard logged run, we increased the learning rate and shortened the number of epochs. Play with these two parameters to see what you can get. Or change the loss function to BCE or BCEWithLogits to see how your training loss changes. 
:::