3. Surname Dataset Processing

In this example, we introduce the surnames dataset, a collection of 10,000 surnames from 18 different nationalities collected by the authors from different name sources on the internet. This dataset has several properties that make it interesting.

  • The first property is that it is fairly imbalanced. The top three classes account for more than 60% of the data: 27% are English, 21% are Russian, and 14% are Arabic. The remaining 15 nationalities appear with decreasing frequency, a long-tailed pattern that is also endemic to language.

  • The second property is that there is a valid and intuitive relationship between nationality of origin and surname orthography (spelling). Some spelling variations are strongly tied to nation of origin (such as in O'Neill, Antonopoulos, Nagasawa, or Zhu).

3.1. Imports

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace
args = Namespace(
    raw_dataset_csv="../data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="../data/surnames/surnames_with_splits.csv",
    seed=1337
)
# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)
surnames.head()
    surname  nationality
0  Woodford      English
1      Coté       French
2      Kore      English
3     Koury       Arabic
4    Lebzak      Russian
# Unique classes
set(surnames.nationality)
{'Arabic',
 'Chinese',
 'Czech',
 'Dutch',
 'English',
 'French',
 'German',
 'Greek',
 'Irish',
 'Italian',
 'Japanese',
 'Korean',
 'Polish',
 'Portuguese',
 'Russian',
 'Scottish',
 'Spanish',
 'Vietnamese'}
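The class imbalance described at the start of this section can be checked directly from the nationality column. The following is a minimal sketch, assuming a small hypothetical frame `toy_surnames` stands in for the loaded `surnames` DataFrame. The rows are made up for illustration and do not reflect the dataset's real proportions.

```python
import pandas as pd

# Hypothetical stand-in for the `surnames` frame loaded above.
toy_surnames = pd.DataFrame({
    "surname": ["Woodford", "Kore", "Smith", "Koury", "Lebzak", "Coté"],
    "nationality": ["English", "English", "English", "Arabic", "Russian", "French"],
})

# Fraction of rows per class, largest first.
class_share = toy_surnames["nationality"].value_counts(normalize=True)
print(class_share)

# Combined share of the three most frequent classes.
top3_share = class_share.head(3).sum()
print(f"Top 3 classes cover {top3_share:.0%} of the rows")
```

Swapping `toy_surnames` for `surnames` applies the same check to the real data.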
# Group rows by nationality so that each class can be split on its own
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())
# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_proportion*n)
    n_val = int(args.val_proportion*n)
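    # The test split is not sized from test_proportion: it takes the rows
    # left over after train and val, so truncation never drops a row.

    # Give each data point a split attribute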
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'
    
    # Add to final list
    final_list.extend(item_list)
# Collect the split data into a DataFrame and check the split sizes
final_surnames = pd.DataFrame(final_list)
final_surnames.split.value_counts()
train    7680
test     1660
val      1640
Name: split, dtype: int64
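The test count is slightly larger than the validation count even though both proportions are 0.15. This follows from the splitting loop: `int()` truncates the train and validation sizes for each nationality, and the test slice takes every remaining row. A small sketch of that arithmetic, using a hypothetical helper `split_sizes` and an arbitrary class size of 73:

```python
def split_sizes(n, train_proportion=0.7, val_proportion=0.15):
    """Return (n_train, n_val, n_test) the way the loop slices one class.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    n_train = int(train_proportion * n)   # truncated toward zero
    n_val = int(val_proportion * n)       # truncated toward zero
    n_test = n - n_train - n_val          # remainder goes to test
    return n_train, n_val, n_test


# 73 is an arbitrary class size chosen so the truncation is visible.
print(split_sizes(73))  # (51, 10, 12)
```

Summed over 18 classes, these per-class remainders would account for a test split that is a little larger than the validation split.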
final_surnames.head()
   surname  nationality  split
0   Gerges       Arabic  train
1   Nassar       Arabic  train
2   Kanaan       Arabic  train
3    Hadad       Arabic  train
4  Tannous       Arabic  train
# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)

Your Turn

Visualise the distribution of nationalities to observe the class imbalance in this dataset. What can you do to create a balanced dataset?
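As one possible starting point for the second question, each class could be downsampled to the size of the rarest class. The sketch below assumes a hypothetical toy frame in place of `final_surnames`, with made-up placeholder rows. Downsampling discards data, and alternatives such as oversampling or class-weighted losses are not covered here.

```python
import pandas as pd

# Hypothetical stand-in for `final_surnames`; values are illustrative only.
toy = pd.DataFrame({
    "surname": ["s0", "s1", "s2", "s3", "s4", "s5", "s6"],
    "nationality": ["English"] * 4 + ["Russian"] * 2 + ["Arabic"],
})

# The size of the rarest class sets the per-class sample size.
n_min = toy["nationality"].value_counts().min()

# Draw n_min rows from every class without replacement.
balanced = toy.groupby("nationality").sample(n=n_min, random_state=1337)
print(balanced["nationality"].value_counts())
```

If the balanced frame is meant for training, it is probably safer to downsample within each split rather than before splitting, so that the split assignment stays intact.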