3. Linear Models in PyTorch¶
Now that we have some basic knowledge of Torch tensors, let’s see how we can implement the earlier linear model using PyTorch.
3.1. Data Preparation¶
Let’s first repeat the same data preparation process: first we generate a synthetic dataset of 100 data points, then we perform an 80%/20% split into training and validation sets, respectively.
import numpy as np
import torch
true_b = 1
true_w = 2
N = 100
# Data Generation
np.random.seed(42)
x = np.random.rand(N, 1)
# Gaussian noise to add some randomness to y
epsilon = (.1 * np.random.randn(N, 1))
y = true_b + true_w * x + epsilon
# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)
# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]
# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
Whether or not you have a GPU, the best practice is to use the .to(device) method to make your code GPU-ready.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Our data was in Numpy arrays, but we need to transform them
# into PyTorch's Tensors and then we send them to the
# chosen device
x_train_tensor = torch.as_tensor(x_train).float().to(device)
y_train_tensor = torch.as_tensor(y_train).float().to(device)
# Here we can see the difference - notice that .type() is more
# useful since it also tells us WHERE the tensor is (device)
print(type(x_train), type(x_train_tensor), x_train_tensor.type())
<class 'numpy.ndarray'> <class 'torch.Tensor'> torch.cuda.FloatTensor
We can normally turn a tensor back into a NumPy array using sourceTensor.numpy(), but now we have a GPU tensor, which NumPy cannot handle directly. We have to move it back to a CPU tensor first before converting it to a NumPy array.
back_to_numpy = x_train_tensor.cpu().numpy()
Good Practice
It is good practice to always call cpu() first and then numpy(), even if you are using a CPU. It follows the same principle as to(device): you may share your code with others who may be using a GPU.
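As a quick self-contained check (a sketch with a throwaway tensor, not the lab’s data), the cpu() + numpy() combination works on both devices, whereas calling numpy() directly on a GPU tensor raises an error.
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
t = torch.rand(3, 1).to(device)

# t.numpy() would fail with a TypeError if t lived on the GPU,
# so we always go through .cpu() first.
arr = t.cpu().numpy()
print(type(arr), arr.shape)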
3.2. Creating Parameters¶
What distinguishes a tensor used for training data (or validation, or test) — like the ones we’ve just created — from a tensor used as a (trainable) parameter/weight?
The latter (a parameter) requires the computation of its gradients, so that we can update its value. In PyTorch, we use the requires_grad=True argument to tell PyTorch to compute gradients for us.
A tensor for a learnable parameter requires a gradient!
Good Practice
To write GPU-ready code, we should specify the device at the moment of creation. If we created the parameter first and only then sent it to the device with .to(device), the result would be a new, non-leaf tensor, and its gradients would no longer be stored in its grad attribute.
# We can specify the device at the moment of creation
# RECOMMENDED!
# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(b, w)
tensor([0.1940], device='cuda:0', requires_grad=True) tensor([0.1391], device='cuda:0', requires_grad=True)
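For contrast, here is a small sketch (not part of the lab code) of the pattern the Good Practice note warns against. On a GPU machine, .to(device) returns a new tensor produced by the copy, so the result is no longer a leaf tensor and its grad would stay None after backward(); on a CPU-only machine the call is a no-op and the problem does not show.
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(42)

# Created on the CPU first, then sent to the device afterwards
b_bad = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)

# On a GPU, b_bad.is_leaf is False: gradients will not be stored in
# b_bad.grad, which is exactly what the note tells us to avoid.
print(b_bad.is_leaf, b_bad.requires_grad)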
Note
Notice that even with the same seed value, because PyTorch and NumPy are two different packages, they have different implementations of the randn() method and thus produce different results.
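A quick way to see this for yourself (a standalone sketch): seed both libraries with the same value and draw one standard normal sample from each; the two numbers will differ.
import numpy as np
import torch

np.random.seed(42)
torch.manual_seed(42)

# Same seed, different libraries: the generators (and hence the
# drawn values) are different.
print(np.random.randn(1))
print(torch.randn(1))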
3.3. Autograd¶
In PyTorch, we don’t need to worry about partial derivatives, the chain rule, or anything like that: Autograd, PyTorch’s automatic differentiation package, takes care of all of it.
3.3.1. backward()¶
To tell PyTorch to compute all gradients, we use the backward() method. It computes the gradients for all gradient-requiring tensors involved in the computation of a given variable. Recall that we need the partial derivatives of the loss function w.r.t. our parameters, so we invoke the backward() method from the corresponding Python variable: loss.backward().
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train_tensor)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()
# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
# b_grad = 2 * error.mean()
# w_grad = 2 * (x_tensor * error).mean()
loss.backward()
We have set requires_grad=True on both b and w, so they are obviously included in the gradient computation. Since we use them both to compute yhat, yhat also makes it onto the list, and since we use yhat to compute the error, error is on the list too. x_train_tensor and y_train_tensor, however, are not gradient-requiring tensors, so backward() does not care about them.
print(error.requires_grad, yhat.requires_grad, b.requires_grad, w.requires_grad)
print(y_train_tensor.requires_grad, x_train_tensor.requires_grad)
True True True True
False False
3.3.2. grad¶
We can inspect the actual values of the gradients by looking at the grad attribute of a tensor.
print(b.grad, w.grad)
tensor([-3.3881], device='cuda:0') tensor([-1.9439], device='cuda:0')
3.3.3. Accumulated Gradients¶
Let’s run the backward function again:
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train_tensor)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()
# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
# b_grad = 2 * error.mean()
# w_grad = 2 * (x_tensor * error).mean()
loss.backward()
print(b.grad, w.grad)
tensor([-6.7762], device='cuda:0') tensor([-3.8878], device='cuda:0')
Note
If we run the code above again, the gradients of \(b\) and \(w\) exactly double. This is because PyTorch accumulates gradients, which helps circumvent hardware limitations: if a mini-batch is still too big to fit in memory, we can split it further into “sub-mini-batches” and accumulate their gradients before updating the parameters. That is when accumulated gradients become useful.
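To make this concrete, below is a minimal sketch (standalone code with made-up synthetic data, not part of the lab) of intentional gradient accumulation: the batch is processed in two halves and backward() is called on each, so the gradients add up to the full-batch gradients before a single update.
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(42)
x = torch.rand(80, 1, device=device)
y = 1 + 2 * x + 0.1 * torch.randn(80, 1, device=device)

b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# Process the batch in two halves; each backward() call ADDS its
# gradients to b.grad and w.grad instead of overwriting them.
for x_sub, y_sub in ((x[:40], y[:40]), (x[40:], y[40:])):
    yhat = b + w * x_sub
    # Halve each sub-batch MSE so the accumulated gradient equals
    # the gradient of the full-batch MSE.
    loss = ((yhat - y_sub) ** 2).mean() / 2
    loss.backward()

# b.grad and w.grad now hold the full-batch gradients,
# ready for one parameter update.
print(b.grad, w.grad)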
3.3.4. zero_¶
For training problems that do not run into memory limitations, every time we use the gradients to update the parameters we need to zero the gradients afterward. This is what zero_() is for.
# This code will be placed _after_ Step 4
# (updating the parameters)
b.grad.zero_(), w.grad.zero_()
(tensor([0.], device='cuda:0'), tensor([0.], device='cuda:0'))
Important
In PyTorch, every method whose name ends with an underscore (_), like zero_() above or requires_grad_(), makes changes in-place; in other words, it modifies the underlying tensor.
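As a tiny standalone illustration of this convention (not part of the lab code): add() returns a new tensor and leaves the original untouched, while add_() modifies it in place.
import torch

t = torch.zeros(3)
t.add(1)         # returns a NEW tensor; t itself is unchanged
print(t)         # tensor([0., 0., 0.])
t.add_(1)        # in-place version: modifies t directly
print(t)         # tensor([1., 1., 1.])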
3.4. Put it all together¶
# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1
# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
# Defines number of epochs
n_epochs = 1000
for epoch in range(n_epochs):
    # Step 1 - Computes model's predicted output - forward pass
    yhat = b + w * x_train_tensor
    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient
    # descent. How wrong is our model? That's the error!
    error = (yhat - y_train_tensor)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()
    # Step 3 - Computes gradients for both "b" and "w"
    # parameters. No more manual computation of gradients!
    # b_grad = 2 * error.mean()
    # w_grad = 2 * (x_tensor * error).mean()
    # We just tell PyTorch to work its way BACKWARDS
    # from the specified loss!
    loss.backward()
    # Step 4 - Updates parameters using gradients and
    # the learning rate. But not so fast...
    # FIRST ATTEMPT - just using the same code as before
    # AttributeError: 'NoneType' object has no attribute 'zero_'
    # b = b - lr * b.grad
    # w = w - lr * w.grad
    # print(b)
    # SECOND ATTEMPT - using in-place Python assignment
    # RuntimeError: a leaf Variable that requires grad
    # has been used in an in-place operation.
    # b -= lr * b.grad
    # w -= lr * w.grad
    # THIRD ATTEMPT - NO_GRAD for the win!
    # We need to use NO_GRAD to keep the update out of
    # the gradient computation. Why is that? It boils
    # down to the DYNAMIC GRAPH that PyTorch uses...
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad
    # PyTorch is "clingy" to its computed gradients, we
    # need to tell it to let it go...
    b.grad.zero_()
    w.grad.zero_()
print(b, w)
tensor([1.0235], device='cuda:0', requires_grad=True) tensor([1.9690], device='cuda:0', requires_grad=True)
In the first attempt, if we use the same update structure as in our NumPy code, we get a weird error, but we can get a hint of what is going on by looking at the tensor itself: once again, we “lost” the gradient while reassigning the update results to our parameters. The reassigned tensor’s grad attribute turns out to be None, and that is what raises the error.
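The following standalone sketch (a toy loss, not the lab’s model) reproduces that first-attempt problem: after the reassignment, b points to a new, non-leaf tensor whose grad attribute is None, so a subsequent b.grad.zero_() raises the AttributeError.
import torch

b = torch.randn(1, requires_grad=True)
loss = (b ** 2).mean()
loss.backward()
print(b.grad)            # a proper gradient tensor

b = b - 0.1 * b.grad     # reassignment: "b" now names a NEW, non-leaf tensor
print(b.grad)            # None (PyTorch may also warn about accessing .grad
                         # on a non-leaf tensor), so b.grad.zero_() would fail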
Important
We use with torch.no_grad(): to ensure the update is not tracked by PyTorch’s dynamic computation graph mechanism. We will talk about computation graphs in the next lab.
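A short standalone sketch (toy tensors, not the lab’s parameters) of what torch.no_grad() does: operations inside the block are not recorded in the graph, so the in-place update of a leaf parameter is allowed and tensors created inside do not require gradients.
import torch

w = torch.randn(1, requires_grad=True)
(w * 2).sum().backward()

with torch.no_grad():
    # Not tracked by the dynamic computation graph:
    w -= 0.1 * w.grad    # in-place update of a leaf tensor is allowed here
    y = w * 3            # y is created with gradient tracking switched off

print(w.requires_grad, w.is_leaf)   # True True: w is still a trainable leaf
print(y.requires_grad)              # False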