2. Dataset and DataLoader¶
2.1. Dataset¶
In PyTorch, a dataset is represented by a regular Python class that inherits from the Dataset class. You can think of it as a list of tuples, each tuple corresponding to one data point (features, label).
The most fundamental methods it needs to implement are:
__init__(self): it takes whatever arguments are needed to build a list of tuples, for example, the name of a CSV file that will be loaded and processed; or two tensors, one for features, another one for labels; or anything else, depending on the task at hand.
Important
There is no need to load the whole dataset in the constructor method (__init__). If your dataset is big (tens of thousands of files, for instance), loading it all at once would not be memory efficient. It is recommended to load the samples on demand (whenever __getitem__ is called).
__getitem__(self, index): this method allows the dataset to be indexed, so that it can work like a list (dataset[i]). It must return a tuple (features, label) corresponding to the requested data point. We can either return the corresponding slices of our pre-loaded dataset or, as mentioned above, load them on demand.
__len__(self): it should simply return the size of the whole dataset, so that whenever it is sampled, the indexing is bounded by that size.
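To make the on-demand loading from the note above concrete, here is a minimal sketch of a dataset that keeps only a list of file paths in its constructor and reads each sample when __getitem__ is called. The CSV layout (one file per sample, features in all columns except the last, label in the last column) is an assumption made just for this illustration.
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyCSVDataset(Dataset):
    # Hypothetical example: one small CSV file per sample
    def __init__(self, file_paths):
        # Only the list of paths is kept in memory, not the data itself
        self.file_paths = file_paths

    def __getitem__(self, index):
        # The file is read only when this particular sample is requested
        row = np.loadtxt(self.file_paths[index], delimiter=',')
        features = torch.as_tensor(row[:-1]).float()
        label = torch.as_tensor(row[-1]).float()
        return (features, label)

    def __len__(self):
        # The dataset size is simply the number of files
        return len(self.file_paths)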
2.1.1. Building a Dataset class with Two Tensors¶
Let’s build a simple custom dataset that takes two tensors as arguments: one for the features, one for the labels.
Let’s first get the generated data from the perceptron.ipynb notebook.
import import_ipynb
from perceptron import get_toy_data
data_size = 1000
x_data, y_truth = get_toy_data(data_size);
import torch
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    def __init__(self, x_tensor, y_tensor):
        self.x = x_tensor
        self.y = y_tensor

    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.x)
# Wait, is this a CPU tensor now? Why? Where is .to(device)?
x_train_tensor = torch.as_tensor(x_data).float()
y_train_tensor = torch.as_tensor(y_truth).float()
train_data = CustomDataset(x_train_tensor, y_train_tensor)
# Now we can iterate through our data points using simple indices
print(train_data[0])
(tensor([ 2.5058, -2.7603]), tensor(1.))
Why are we not sending the Numpy arrays to the GPU?
We don’t want our whole training data to be loaded into GPU tensors, as we have been doing in our example so far, because it takes up space in our precious graphics card’s RAM.
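We can check where a tensor lives with its device attribute and move only the slice we are about to use; a minimal sketch using the tensors defined above (the slice size of 32 is arbitrary):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The full training set stays in CPU memory
print(x_train_tensor.device)  # cpu

# Only the small piece we are about to use is sent to the device
x_small_batch = x_train_tensor[:32].to(device)
print(x_small_batch.device)   # cpu, or cuda:0 if a GPU is available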
2.1.2. Use TensorDataset¶
Why go through all this trouble to wrap a couple of tensors in a class?
If our dataset is nothing more than a couple of tensors, we can use PyTorch’s TensorDataset class, which will do pretty much the same as our custom dataset above.
The full-fledged custom dataset class may seem like a stretch, but this is a design pattern that we will use repeatedly in later chapters.
from torch.utils.data import TensorDataset
train_data = TensorDataset(x_train_tensor, y_train_tensor)
print(train_data[0])
(tensor([ 2.5058, -2.7603]), tensor(1.))
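TensorDataset accepts any number of tensors, as long as they all have the same size along the first dimension, and indexing it returns a tuple with one element per tensor. A quick sketch with a hypothetical per-sample weight tensor:
# Hypothetical per-sample weights, just to show a third tensor
w_train_tensor = torch.ones(data_size)

train_data_w = TensorDataset(x_train_tensor, y_train_tensor, w_train_tensor)
print(train_data_w[0])  # (features, label, weight)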
2.2. DataLoader¶
Until now, we have used a single data point from the training data at every training step, so we have been doing stochastic gradient descent all along. This is fine for a small dataset, but for larger datasets we should use mini-batch gradient descent. Thus, we need mini-batches, and we need to slice our dataset accordingly. The DataLoader saves us from doing it manually! All we need to do is to tell the DataLoader:
the dataset
the size of the mini-batch
whether we want to shuffle it or not
Your Turn
Recall what batch, mini-batch, and stochastic gradient descent mean.
A data loader will behave like an iterator, so we can loop over it and fetch a different mini-batch every time.
How do I choose my mini-batch size?
It is typical to use powers of two for mini-batch sizes, like 16, 32, 64, or 128; 32 and 64 are popular choices.
from torch.utils.data import DataLoader
train_loader = DataLoader(dataset=train_data, batch_size=1000, shuffle=True)
To retrieve a mini-batch, one can simply run the command below — it will return a list containing two tensors, one for the features, another one for the labels:
next(iter(train_loader))
[tensor([[ 3.6587, -1.7854],
[ 1.2505, -2.9371],
[ 3.3038, 2.9329],
...,
[ 3.9159, 3.5062],
[ 3.3351, -2.6869],
[ 3.7338, 2.9684]]),
 tensor([1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0., 1.,
         ...,
         0., 1., 0., 0., 0., 0., 1., 0., 1., 0.])]
len(train_loader)
1
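Note that len(train_loader) returns the number of mini-batches, not the number of data points: with the default drop_last=False, that is the ceiling of len(dataset) / batch_size. Since batch_size=1000 matches our 1000 data points, we get a single batch. A quick check with a smaller batch size:
loader_64 = DataLoader(dataset=train_data, batch_size=64, shuffle=True)
print(len(loader_64))  # ceil(1000 / 64) = 16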
To shuffle or not to shuffle
In the absolute majority of cases, you should set shuffle=True
for your training set to improve the performance of gradient descent. There are a few exceptions, though, like time series problems, where shuffling actually leads to data leakage.
So, always ask yourself: “do I have a reason NOT to shuffle the data?”
“What about the validation and test sets?” There is no need to shuffle them since we are not computing gradients with them.
There is more to a DataLoader: for example, it is also possible to use it together with a sampler to fetch mini-batches that compensate for imbalanced classes.
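A minimal sketch of such a sampler, using PyTorch’s WeightedRandomSampler to draw samples with probabilities inversely proportional to their class frequencies (our toy data is roughly balanced, so this is purely illustrative; also note that sampler and shuffle are mutually exclusive):
from torch.utils.data import WeightedRandomSampler

# Weight each sample by the inverse frequency of its class
class_counts = torch.bincount(y_train_tensor.long())
sample_weights = 1.0 / class_counts[y_train_tensor.long()].float()

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)

# When a sampler is given, shuffle must be left off
balanced_loader = DataLoader(dataset=train_data, batch_size=64, sampler=sampler)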
Your Turn
Note that we put all the data in one “mini-batch” because we set batch_size=1000. Set batch_size to a smaller number and observe the training time and performance.
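If you want to time the comparison, here is a minimal sketch; it only loops over the loader, so replace the pass with the train_step call defined in the next section (exact numbers depend on your hardware):
import time

for bs in (32, 128, 1000):
    loader = DataLoader(dataset=train_data, batch_size=bs, shuffle=True)
    start = time.time()
    for x_batch, y_batch in loader:
        pass  # replace with a real training step (see the next section)
    print(bs, time.time() - start)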
2.3. Putting it all together¶
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A higher-order function is a function that returns another function. We will use one to build our training step from the key elements of our training loop: the model, the loss function, and the optimizer. The actual training step function to be returned will have
input: two arguments, namely, features and labels, and
return: the corresponding loss value.
def make_train_step(model, loss_fn, optimizer):
    # Builds function that performs a step in the train loop
    def perform_train_step(x_batch, y_batch):
        # Sets model to TRAIN mode
        model.train()
        # Step 1 - Computes model's predictions - forward pass
        yhat = model(x_batch).squeeze()
        y_batch = y_batch.squeeze()
        # Step 2 - Computes the loss
        loss = loss_fn(yhat, y_batch)
        # Step 3 - Computes gradients for parameters
        loss.backward()
        # Step 4 - Updates parameters using gradients and
        # the learning rate
        optimizer.step()
        optimizer.zero_grad()
        # Returns the loss
        return loss.item()
    # Returns the function that will be called inside the
    # train loop
    return perform_train_step
import import_ipynb
from perceptron import Perceptron
import torch.nn as nn
import torch.optim as optim
import numpy as np
input_dim = 2
lr = 0.01
n_epochs = 60
perceptron = Perceptron(input_dim=input_dim).to(device)
optimizer = optim.Adam(params=perceptron.parameters(), lr=lr)
bce_loss = nn.BCELoss()
train_step = make_train_step(perceptron, bce_loss, optimizer)
losses = []
change = 1.0
last = 10.0
epsilon = 1e-3
epoch = 0
while change > epsilon or epoch < n_epochs or last > 0.3:
    # For each epoch...
    # for epoch in range(n_epochs):
    # inner loop
    mini_batch_losses = []
    for x_batch, y_batch in train_loader:
        # the dataset "lives" in the CPU, so do our mini-batches
        # therefore, we need to send those mini-batches to the
        # device where the model "lives"
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        # Performs one train step and returns the
        # corresponding loss for this mini-batch
        mini_batch_loss = train_step(x_batch, y_batch)
        mini_batch_losses.append(mini_batch_loss)
    # Computes average loss over all mini-batches
    # That's the epoch loss
    loss = np.mean(mini_batch_losses)
    change = abs(last - loss)
    last = loss
    epoch += 1
    losses.append(loss)
print(loss)
0.10094598680734634
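Since the epoch losses were accumulated in losses, we can also plot the training curve; a minimal sketch:
import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel('epoch')
plt.ylabel('mean mini-batch loss')
plt.title('Training loss')
plt.show()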
perceptron = perceptron.to('cpu')
import import_ipynb
from perceptron import visualize_results
import matplotlib.pyplot as plt
left_x = []
right_x = []
left_colors = []
right_colors = []
# Construct a stack of values and colors
for x_i, y_true_i in zip(x_data, y_truth):
    color = 'black'
    if y_true_i == 0:
        left_x.append(x_i)
        left_colors.append(color)
    else:
        right_x.append(x_i)
        right_colors.append(color)
left_x = np.stack(left_x)
right_x = np.stack(right_x)
_, axes = plt.subplots(1,2,figsize=(12,4))
axes[0].scatter(left_x[:, 0], left_x[:, 1], facecolor='white',edgecolor='black', marker='o', s=300)
axes[0].scatter(right_x[:, 0], right_x[:, 1], facecolor='white', edgecolor='black', marker='*', s=300)
axes[0].axis('off')
visualize_results(perceptron, x_data, y_truth, epoch=None, levels=[0.5], ax=axes[1])
axes[1].axis('off')
The resulting figure has two panels: the left one shows the raw toy data (one class as circles, the other as stars), and the right one shows the decision boundary learned by the perceptron, drawn by visualize_results at the 0.5 level.