3. Neural Machine Translation

For the following two notebooks, we use a dataset of English–French sentence pairs from the Tatoeba Project. Such a paired dataset is referred to as a parallel corpus: each English sentence is matched with its corresponding French translation.

3.1. Imports

from argparse import Namespace
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
# paths, split proportions, and random seed used throughout the notebook
args = Namespace(
    source_data_path="../data/nmt/eng-fra.txt",
    output_data_path="../data/nmt/simplest_eng_fra.csv",
    perc_train=0.7,
    perc_val=0.15,
    perc_test=0.15,
    seed=1337
)

assert args.perc_test > 0 and (args.perc_test + args.perc_val + args.perc_train == 1.0)

3.2. Preprocessing

The data preprocessing begins by reading in the lines and lowercasing all sentences; we then apply NLTK's English and French tokenizers to each sentence pair.

with open(args.source_data_path, encoding="utf-8") as fp:
    lines = fp.readlines()
    
lines = [line.replace("\n", "").lower().split("\t") for line in lines]
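For reference, each raw line holds an English sentence and its French translation separated by a tab. The following minimal sketch shows what the parsing above produces on a made-up example line (not taken from the actual file):

# Hypothetical raw line, shown only to illustrate the tab-separated format
# and the effect of the parsing above.
example_line = "I am cold.\tJ'ai froid.\n"
print(example_line.replace("\n", "").lower().split("\t"))
# ['i am cold.', "j'ai froid."]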

Next, we apply NLTK’s language-specific word tokenizer to create a list of tokens. Even though we perform further computations, described next, these token lists already constitute a preprocessed dataset.

data = []
for english_sentence, french_sentence in lines:
    data.append({"english_tokens": word_tokenize(english_sentence, language="english"),
                 "french_tokens": word_tokenize(french_sentence, language="french")})

3.3. Selecting a subset of data

The subset of the data we select consists of the English sentences that begin with “i am”, “he is”, “she is”, “they are”, “you are”, or “we are”. This reduces the dataset from 135,842 sentence pairs to 13,062 sentence pairs, roughly a factor of 10.

filter_phrases = (
    ("i", "am"), ("i", "'m"), 
    ("he", "is"), ("he", "'s"),
    ("she", "is"), ("she", "'s"),
    ("you", "are"), ("you", "'re"),
    ("we", "are"), ("we", "'re"),
    ("they", "are"), ("they", "'re")
)

We create an empty list as a placeholder for each filter phrase.

data_subset = {phrase: [] for phrase in filter_phrases}
data_subset
{('i', 'am'): [],
 ('i', "'m"): [],
 ('he', 'is'): [],
 ('he', "'s"): [],
 ('she', 'is'): [],
 ('she', "'s"): [],
 ('you', 'are'): [],
 ('you', "'re"): [],
 ('we', 'are'): [],
 ('we', "'re"): [],
 ('they', 'are'): [],
 ('they', "'re"): []}

We use the first two tokens of each English sentence as a key; if the key is among the prepared data_subset keys, we append the datum to that key's list.

for datum in data:
    key = tuple(datum['english_tokens'][:2])
    if key in data_subset:
        data_subset[key].append(datum)
counts = {k: len(v) for k,v in data_subset.items()}
counts, sum(counts.values())
({('i', 'am'): 805,
  ('i', "'m"): 4760,
  ('he', 'is'): 1069,
  ('he', "'s"): 787,
  ('she', 'is'): 504,
  ('she', "'s"): 316,
  ('you', 'are'): 449,
  ('you', "'re"): 2474,
  ('we', 'are'): 181,
  ('we', "'re"): 1053,
  ('they', 'are'): 194,
  ('they', "'re"): 470},
 13062)

3.4. Training, Validation, and Test Split

To finalize the learning setup, we split the subset of 13,062 sentence pairs into 70% training, 15% validation, and 15% test sets. The proportion of sentences beginning with each of the phrases listed above is held constant across the splits: we first group by sentence beginning, create the splits within each group, and then merge the splits from all groups.

np.random.seed(args.seed)

dataset_stage3 = []
for phrase, datum_list in sorted(data_subset.items()):
    # splitting each phrase group separately keeps the train/val/test
    # proportions constant across sentence beginnings
    np.random.shuffle(datum_list)
    n_train = int(len(datum_list) * args.perc_train)
    n_val = int(len(datum_list) * args.perc_val)

    for datum in datum_list[:n_train]:
        datum['split'] = 'train'
        
    for datum in datum_list[n_train:n_train+n_val]:
        datum['split'] = 'val'
        
    for datum in datum_list[n_train+n_val:]:
        datum['split'] = 'test'
    
    dataset_stage3.extend(datum_list)    
# here we pop and assign into the dictionary, thus modifying in place
for datum in dataset_stage3:
    datum['source_language'] = " ".join(datum.pop('english_tokens'))
    datum['target_language'] = " ".join(datum.pop('french_tokens'))
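As a quick sanity check (not part of the original pipeline), we can confirm that the overall split sizes come out close to the requested 70/15/15 proportions:

from collections import Counter

# Count how many of the 13,062 examples landed in each split.
split_counts = Counter(datum['split'] for datum in dataset_stage3)
print(split_counts)
# expected to be roughly 70% train, 15% val, 15% test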

3.5. A glimpse of the processed dataset

nmt_df = pd.DataFrame(dataset_stage3)
nmt_df.head()
split source_language target_language
0 train he 's the cutest boy in town . c'est le garçon le plus mignon en ville .
1 train he 's a nonsmoker . il est non-fumeur .
2 train he 's smarter than me . il est plus intelligent que moi .
3 train he 's a lovely young man . c'est un adorable jeune homme .
4 train he 's three years older than me . il a trois ans de plus que moi .

3.6. Write the processed dataset to disk

nmt_df.to_csv(args.output_data_path)
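If you later read the CSV back with pandas, note that to_csv as called above also writes the DataFrame index as an unnamed first column; a minimal sketch of reloading it:

# Sketch: reload the processed dataset (index_col=0 absorbs the index column
# written by to_csv's default index=True).
reloaded_df = pd.read_csv(args.output_data_path, index_col=0)
print(reloaded_df.shape)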