1. Frankenstein Dataset At a Glance

Here we will build a text dataset from a digitized version of Mary Shelley’s novel Frankenstein, available via Project Gutenberg. This section walks through preprocessing the raw text, building a PyTorch Dataset class for the resulting dataset, and finally splitting it into training, validation, and test sets.

1.1. Import

import os

from argparse import Namespace
import collections
import nltk.data
import numpy as np
import pandas as pd
import re
import string
from tqdm.notebook import tqdm
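
The Punkt sentence tokenizer model loaded below has to be present on disk. If it is missing, it can be fetched once with NLTK's standard downloader (an optional setup step, not part of the original listing):

import nltk
# Download the Punkt sentence tokenizer models if they are not already installed
nltk.download('punkt')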

1.2. Setting up

args = Namespace(
    raw_dataset_txt="../data/books/frankenstein.txt",
    window_size=5,
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="../data/books/frankenstein_with_splits.csv",
    seed=1337
)
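
Since these three proportions are later used to slice the dataset, a quick guard that they sum to one can catch typos early. The assert below is an optional addition, not part of the original setup:

# Sanity check: the split proportions should cover the whole dataset
assert abs(args.train_proportion + args.val_proportion + args.test_proportion - 1.0) < 1e-6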

1.3. Preprocessing - Tokenizer

Starting with the raw text file that Project Gutenberg distributes, the preprocessing is minimal: we use NLTK’s Punkt tokenizer to split the text into separate sentences, then each sentence is converted to lowercase, the basic punctuation marks (periods, commas, exclamation and question marks) are split off as their own tokens, and every other non-alphabetic character is removed. This preprocessing allows us to later split strings on whitespace in order to retrieve a list of tokens.

# Split the raw text book into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open(args.raw_dataset_txt) as fp:
    book = fp.read()
sentences = tokenizer.tokenize(book)
print (len(sentences), "sentences")
print ("Sample:", sentences[100])
3427 sentences
Sample: No incidents have hitherto befallen us that would make a figure in a
letter.
# Clean sentences
def preprocess_text(text):
    # Lowercase every word
    text = ' '.join(word.lower() for word in text.split(" "))
    # Put spaces around the punctuation we keep (. , ! ?) so each mark becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace every other run of non-alphabetic characters with a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

cleaned_sentences = [preprocess_text(sentence) for sentence in sentences]
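
Applied to the sample sentence printed above, the function produces roughly the following (the output comment is a sketch of the expected result, not captured notebook output):

preprocess_text("No incidents have hitherto befallen us that would make a figure in a letter.")
# -> 'no incidents have hitherto befallen us that would make a figure in a letter . '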

1.4. CBOW training data preparation

CBOW uses the context words inside a specified window to predict the center word. We therefore need to take the list of tokens in each sentence and, treating every token in turn as the center word, collect the surrounding context words that fall within the window.
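
To make the windowing concrete before the real code, here is a toy sketch with window_size=2 (the actual run uses args.window_size=5): each sentence is padded with window_size MASK tokens on both sides, and a window of 2 * window_size + 1 tokens is slid across it, so every real token appears once as the middle element.

import nltk

MASK_TOKEN = "<MASK>"
window_size = 2
tokens = "the modern prometheus".split(' ')
# Pad both ends so edge tokens still sit in the middle of some window
padded = [MASK_TOKEN] * window_size + tokens + [MASK_TOKEN] * window_size
for window in nltk.ngrams(padded, window_size * 2 + 1):
    print(window)
# ('<MASK>', '<MASK>', 'the', 'modern', 'prometheus')
# ('<MASK>', 'the', 'modern', 'prometheus', '<MASK>')
# ('the', 'modern', 'prometheus', '<MASK>', '<MASK>')

The middle element of each window is the target; the MASK padding is dropped when the context string is assembled below.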

Note

tqdm instantly makes your loops show a smart progress meter: just wrap any iterable with tqdm(iterable) in your for loop, and a progress bar will appear while the loop runs.
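
For instance, wrapping a plain range shows the meter (a throwaway illustration; the loops below wrap cleaned_sentences and windows in exactly the same way):

# A minimal tqdm demo: the bar updates as the loop advances
for _ in tqdm(range(100000)):
    pass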

# Global vars
MASK_TOKEN = "<MASK>"
# Create windows based on the window_size
# Helper that flattens a list of lists into a single list
flatten = lambda outer_list: [item for inner_list in outer_list for item in inner_list]
# Pad each sentence with window_size MASK tokens on both sides, slide a window of
# 2 * window_size + 1 tokens over it, then flatten the per-sentence window lists
windows = flatten([list(nltk.ngrams([MASK_TOKEN] * args.window_size + sentence.split(' ') + \
    [MASK_TOKEN] * args.window_size, args.window_size * 2 + 1)) \
    for sentence in tqdm(cleaned_sentences)])

# Create cbow data (extract target center word and context words)
data = []
for window in tqdm(windows):
    # The middle token of the window is the prediction target
    target_token = window[args.window_size]
    context = []
    for i, token in enumerate(window):
        # Skip padding tokens and the target itself
        if token == MASK_TOKEN or i == args.window_size:
            continue
        else:
            context.append(token)
    data.append([' '.join(context), target_token])

# Convert to dataframe
cbow_data = pd.DataFrame(data, columns=["context", "target"])

1.5. Training, Validation and Test data split

# Create split data: assign each row to train/val/test based on its position
n = len(cbow_data)
def get_split(row_num):
    # First train_proportion of rows -> train, next val_proportion -> val, the rest -> test
    if row_num <= n*args.train_proportion:
        return 'train'
    elif (row_num > n*args.train_proportion) and (row_num <= n*args.train_proportion + n*args.val_proportion):
        return 'val'
    else:
        return 'test'
cbow_data['split'] = cbow_data.apply(lambda row: get_split(row.name), axis=1)
cbow_data.head()
context target split
0 , or the modern prometheus frankenstein train
1 frankenstein or the modern prometheus by , train
2 frankenstein , the modern prometheus by mary or train
3 frankenstein , or modern prometheus by mary wo... the train
4 frankenstein , or the prometheus by mary wolls... modern train
# Write split data to file
cbow_data.to_csv(args.output_munged_csv, index=False)
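
As an optional sanity check (not part of the original notebook), the split sizes can be inspected before moving on; value_counts is the standard pandas call for this:

# Inspect how many CBOW examples landed in each split
print(cbow_data.split.value_counts())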