3. Surname Dataset Processing¶
In this example, we introduce the surnames dataset, a collection of 10,000 surnames from 18 different nationalities collected by the authors from different name sources on the internet. This dataset has several properties that make it interesting.
The first property is that it is fairly imbalanced. The top three classes account for more than 60% of the data: 27% are English, 21% are Russian, and 14% are Arabic. The remaining 15 nationalities occur with decreasing frequency, a long-tailed distribution that is itself endemic to language.
The second property is that there is a valid and intuitive relationship between nationality of origin and surname orthography (spelling). There are spelling variations that are strongly tied to nation of origin (as in O'Neill, Antonopoulos, Nagasawa, or Zhu).
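These orthographic regularities can be made concrete with a toy heuristic. The cue list below is purely illustrative (it is not part of the chapter's code, and real models learn such patterns from data rather than hand-written rules):

```python
# Toy illustration: a few hand-picked orthographic cues that hint at origin.
# The cues and their labels are illustrative assumptions, not a real classifier.
def spelling_cues(surname):
    """Return a list of illustrative orthographic cues found in a surname."""
    cues = []
    if surname.startswith("O'"):
        cues.append("Irish prefix O'")
    if surname.endswith("opoulos"):
        cues.append("Greek suffix -opoulos")
    if surname.endswith("awa"):
        cues.append("Japanese suffix -awa")
    return cues

print(spelling_cues("O'Neill"))       # ["Irish prefix O'"]
print(spelling_cues("Antonopoulos"))  # ['Greek suffix -opoulos']
```

A character-level model trained on this dataset ends up exploiting exactly these kinds of subword patterns, without anyone enumerating them by hand.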
3.1. Imports¶
import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

args = Namespace(
    raw_dataset_csv="../data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="../data/surnames/surnames_with_splits.csv",
    seed=1337
)
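One lightweight sanity check worth adding (it is not in the original notebook) is that the three split proportions sum to 1, so every row will be assigned to exactly one split:

```python
from argparse import Namespace

# Re-declared here so the check is standalone; mirrors the args above
args = Namespace(train_proportion=0.7, val_proportion=0.15, test_proportion=0.15)

# Floating-point tolerance because 0.7 and 0.15 are not exact binary fractions
total = args.train_proportion + args.val_proportion + args.test_proportion
assert abs(total - 1.0) < 1e-9, "split proportions should cover the whole dataset"
```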
# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)
surnames.head()
| | surname | nationality |
|---|---|---|
| 0 | Woodford | English |
| 1 | Coté | French |
| 2 | Kore | English |
| 3 | Koury | Arabic |
| 4 | Lebzak | Russian |
# Unique classes
set(surnames.nationality)
{'Arabic',
'Chinese',
'Czech',
'Dutch',
'English',
'French',
'German',
'Greek',
'Irish',
'Italian',
'Japanese',
'Korean',
'Polish',
'Portuguese',
'Russian',
'Scottish',
'Spanish',
'Vietnamese'}
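The imbalance described earlier can be quantified directly with `value_counts`. This sketch uses a tiny made-up frame standing in for the real `surnames` DataFrame (which is much larger), just to show the call:

```python
import pandas as pd

# Tiny synthetic stand-in for the real surnames frame (illustrative only)
surnames = pd.DataFrame({
    'surname': ['Smith', 'Ivanov', 'Haddad', 'Jones', 'Petrov'],
    'nationality': ['English', 'Russian', 'Arabic', 'English', 'Russian'],
})

# Class frequencies as proportions, sorted in descending order of count
proportions = surnames.nationality.value_counts(normalize=True)
print(proportions)
```

Running the same call on the full dataset reproduces the 27% / 21% / 14% figures quoted above for English, Russian, and Arabic.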
# Split data by nationality
# Create dict mapping each nationality to its list of rows
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())
# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_proportion * n)
    n_val = int(args.val_proportion * n)
    n_test = int(args.test_proportion * n)

    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in item_list[n_train + n_val:]:
        item['split'] = 'test'

    # Add to final list
    final_list.extend(item_list)
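Because each nationality is shuffled and sliced independently, this is a stratified split: every class keeps roughly the same train/val/test proportions as the whole dataset. A standalone sketch of the same logic on synthetic data (the labels and counts here are invented):

```python
import collections
import numpy as np

np.random.seed(0)
# Synthetic rows: 10 items of class 'A' and 6 of class 'B'
rows = ([{'label': 'A', 'x': i} for i in range(10)] +
        [{'label': 'B', 'x': i} for i in range(6)])

by_label = collections.defaultdict(list)
for row in rows:
    by_label[row['label']].append(row)

final = []
for _, items in sorted(by_label.items()):
    np.random.shuffle(items)
    n_train = int(0.7 * len(items))   # same truncating arithmetic as above
    n_val = int(0.15 * len(items))
    for item in items[:n_train]:
        item['split'] = 'train'
    for item in items[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in items[n_train + n_val:]:
        item['split'] = 'test'
    final.extend(items)

counts = collections.Counter(r['split'] for r in final)
print(counts)  # Counter({'train': 11, 'test': 4, 'val': 1})
```

Note one quirk of the truncating `int()` arithmetic: very small classes can end up with zero validation items (class 'B' above gets none), while the leftover rows all fall into the test slice.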
# Write split data to file
final_surnames = pd.DataFrame(final_list)
final_surnames.split.value_counts()
train 7680
test 1660
val 1640
Name: split, dtype: int64
final_surnames.head()
| | surname | nationality | split |
|---|---|---|---|
| 0 | Gerges | Arabic | train |
| 1 | Nassar | Arabic | train |
| 2 | Kanaan | Arabic | train |
| 3 | Hadad | Arabic | train |
| 4 | Tannous | Arabic | train |
# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)
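With `index=False`, the CSV round-trips cleanly back into an identical DataFrame. A quick standalone check using an in-memory buffer instead of the notebook's output path (the single row below is just sample data):

```python
import io
import pandas as pd

# One illustrative row; the real frame has all surnames with their splits
df = pd.DataFrame([{'surname': 'Woodford', 'nationality': 'English',
                    'split': 'train'}])

# Write without the index, then read back from the same buffer
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

assert df.equals(df2)  # no stray index column, identical contents
```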
Your Turn
Visualise the distribution of nationalities to observe the class imbalance in this dataset. What can you do to create a balanced dataset?