1. Frankenstein Dataset At a Glance

Here we will build a text dataset from a digitized version of Mary Shelley’s novel Frankenstein, available via Project Gutenberg. This section walks through preprocessing the raw text, building a PyTorch Dataset class for the resulting dataset, and finally splitting it into training, validation, and test sets.

1.1. Import

import os

from argparse import Namespace
import collections
import nltk.data
import numpy as np
import pandas as pd
import re
import string
from tqdm.notebook import tqdm
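
The Punkt sentence tokenizer model loaded below has to be present on disk. If it is missing, it can be fetched once with NLTK's standard downloader (an optional setup step, not part of the original listing):

import nltk
# Download the Punkt sentence tokenizer models if they are not already installed
nltk.download('punkt')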

1.2. Setting up

args = Namespace(
    raw_dataset_txt="../data/books/frankenstein.txt",
    window_size=5,
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="../data/books/frankenstein_with_splits.csv",
    seed=1337
)
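
Since these three proportions are later used to slice the dataset, a quick guard that they sum to one can catch typos early. The assert below is an optional addition, not part of the original setup:

# Sanity check: the split proportions should cover the whole dataset
assert abs(args.train_proportion + args.val_proportion + args.test_proportion - 1.0) < 1e-6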

1.3. Preprocessing - Tokenizer

Starting with the raw text file that Project Gutenberg distributes, the preprocessing is minimal: we use NLTK’s Punkt tokenizer to split the text into separate sentences, then each sentence is converted to lowercase, the basic punctuation marks (periods, commas, exclamation and question marks) are split off as their own tokens, and every other non-alphabetic character is removed. This preprocessing allows us to later split strings on whitespace in order to retrieve a list of tokens.

# Split the raw text book into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open(args.raw_dataset_txt) as fp:
    book = fp.read()
sentences = tokenizer.tokenize(book)
print (len(sentences), "sentences")
print ("Sample:", sentences[100])
3427 sentences
Sample: No incidents have hitherto befallen us that would make a figure in a
letter.
# Clean sentences
def preprocess_text(text):
    # Lowercase every word
    text = ' '.join(word.lower() for word in text.split(" "))
    # Put spaces around the punctuation we keep (. , ! ?) so each mark becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace every other run of non-alphabetic characters with a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

cleaned_sentences = [preprocess_text(sentence) for sentence in sentences]
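
Applied to the sample sentence printed above, the function produces roughly the following (the output comment is a sketch of the expected result, not captured notebook output):

preprocess_text("No incidents have hitherto befallen us that would make a figure in a letter.")
# -> 'no incidents have hitherto befallen us that would make a figure in a letter . '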

1.4. CBOW training data preparation

CBOW uses the context words inside a specified window to predict the center word. We therefore need to take the list of tokens in each sentence and, treating every token in turn as the center word, collect the surrounding context words that fall within the window.
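
To make the windowing concrete before the real code, here is a toy sketch with window_size=2 (the actual run uses args.window_size=5): each sentence is padded with window_size MASK tokens on both sides, and a window of 2 * window_size + 1 tokens is slid across it, so every real token appears once as the middle element.

import nltk

MASK_TOKEN = "<MASK>"
window_size = 2
tokens = "the modern prometheus".split(' ')
# Pad both ends so edge tokens still sit in the middle of some window
padded = [MASK_TOKEN] * window_size + tokens + [MASK_TOKEN] * window_size
for window in nltk.ngrams(padded, window_size * 2 + 1):
    print(window)
# ('<MASK>', '<MASK>', 'the', 'modern', 'prometheus')
# ('<MASK>', 'the', 'modern', 'prometheus', '<MASK>')
# ('the', 'modern', 'prometheus', '<MASK>', '<MASK>')

The middle element of each window is the target; the MASK padding is dropped when the context string is assembled below.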

Note

tqdm instantly makes your loops show a smart progress meter: just wrap any iterable with tqdm(iterable) in your for loop, and a progress bar will appear while the loop runs.
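
For instance, wrapping a plain range shows the meter (a throwaway illustration; the loops below wrap cleaned_sentences and windows in exactly the same way):

# A minimal tqdm demo: the bar updates as the loop advances
for _ in tqdm(range(100000)):
    pass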

# Global vars
MASK_TOKEN = "<MASK>"
# Create windows based on the window_size
# Helper that flattens a list of lists into a single list
flatten = lambda outer_list: [item for inner_list in outer_list for item in inner_list]
# Pad each sentence with window_size MASK tokens on both sides, slide a window of
# 2 * window_size + 1 tokens over it, then flatten the per-sentence window lists
windows = flatten([list(nltk.ngrams([MASK_TOKEN] * args.window_size + sentence.split(' ') + \
    [MASK_TOKEN] * args.window_size, args.window_size * 2 + 1)) \
    for sentence in tqdm(cleaned_sentences)])

# Create cbow data (extract target center word and context words)
data = []
for window in tqdm(windows):
    # The middle token of the window is the prediction target
    target_token = window[args.window_size]
    context = []
    for i, token in enumerate(window):
        # Skip padding tokens and the target itself
        if token == MASK_TOKEN or i == args.window_size:
            continue
        else:
            context.append(token)
    data.append([' '.join(context), target_token])

# Convert to dataframe
cbow_data = pd.DataFrame(data, columns=["context", "target"])

1.5. Training, Validation and Test data split

# Create split data: assign each row to train/val/test based on its position
n = len(cbow_data)
def get_split(row_num):
    # First train_proportion of rows -> train, next val_proportion -> val, the rest -> test
    if row_num <= n*args.train_proportion:
        return 'train'
    elif (row_num > n*args.train_proportion) and (row_num <= n*args.train_proportion + n*args.val_proportion):
        return 'val'
    else:
        return 'test'
cbow_data['split'] = cbow_data.apply(lambda row: get_split(row.name), axis=1)
cbow_data.head()
context target split
0 , or the modern prometheus frankenstein train
1 frankenstein or the modern prometheus by , train
2 frankenstein , the modern prometheus by mary or train
3 frankenstein , or modern prometheus by mary wo... the train
4 frankenstein , or the prometheus by mary wolls... modern train
# Write split data to file
cbow_data.to_csv(args.output_munged_csv, index=False)
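
As an optional sanity check (not part of the original notebook), the split sizes can be inspected before moving on; value_counts is the standard pandas call for this:

# Inspect how many CBOW examples landed in each split
print(cbow_data.split.value_counts())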