3. Surname Dataset Processing¶
In this example, we introduce the surnames dataset, a collection of 10,000 surnames from 18 different nationalities collected by the authors from different name sources on the internet. This dataset has several properties that make it interesting.
The first property is that it is fairly imbalanced. The top three classes account for more than 60% of the data: 27% are English, 21% are Russian, and 14% are Arabic. The remaining 15 nationalities occur with decreasing frequency, a long-tailed distribution that is itself endemic to language.
The second property is that there is a valid and intuitive relationship between nationality of origin and surname orthography (spelling). There are spelling variations that are strongly tied to nation of origin (as in O'Neill, Antonopoulos, Nagasawa, or Zhu).
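These orthographic regularities can be made concrete with a toy heuristic. The cue list below is purely illustrative (it is not part of the chapter's code, and real models learn such patterns from data rather than hand-written rules):

```python
# Toy illustration: a few hand-picked orthographic cues that hint at origin.
# The cues and their labels are illustrative assumptions, not a real classifier.
def spelling_cues(surname):
    """Return a list of illustrative orthographic cues found in a surname."""
    cues = []
    if surname.startswith("O'"):
        cues.append("Irish prefix O'")
    if surname.endswith("opoulos"):
        cues.append("Greek suffix -opoulos")
    if surname.endswith("awa"):
        cues.append("Japanese suffix -awa")
    return cues

print(spelling_cues("O'Neill"))       # ["Irish prefix O'"]
print(spelling_cues("Antonopoulos"))  # ['Greek suffix -opoulos']
```

A character-level model trained on this dataset ends up exploiting exactly these kinds of subword patterns, without anyone enumerating them by hand.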
3.1. Imports¶
import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

args = Namespace(
    raw_dataset_csv="../data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="../data/surnames/surnames_with_splits.csv",
    seed=1337
)
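One lightweight sanity check worth adding (it is not in the original notebook) is that the three split proportions sum to 1, so every row will be assigned to exactly one split:

```python
from argparse import Namespace

# Re-declared here so the check is standalone; mirrors the args above
args = Namespace(train_proportion=0.7, val_proportion=0.15, test_proportion=0.15)

# Floating-point tolerance because 0.7 and 0.15 are not exact binary fractions
total = args.train_proportion + args.val_proportion + args.test_proportion
assert abs(total - 1.0) < 1e-9, "split proportions should cover the whole dataset"
```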
# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)
surnames.head()
| | surname | nationality |
|---|---|---|
| 0 | Woodford | English |
| 1 | Coté | French |
| 2 | Kore | English |
| 3 | Koury | Arabic |
| 4 | Lebzak | Russian |
# Unique classes
set(surnames.nationality)
{'Arabic',
'Chinese',
'Czech',
'Dutch',
'English',
'French',
'German',
'Greek',
'Irish',
'Italian',
'Japanese',
'Korean',
'Polish',
'Portuguese',
'Russian',
'Scottish',
'Spanish',
'Vietnamese'}
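The imbalance described earlier can be quantified directly with `value_counts`. This sketch uses a tiny made-up frame standing in for the real `surnames` DataFrame (which is much larger), just to show the call:

```python
import pandas as pd

# Tiny synthetic stand-in for the real surnames frame (illustrative only)
surnames = pd.DataFrame({
    'surname': ['Smith', 'Ivanov', 'Haddad', 'Jones', 'Petrov'],
    'nationality': ['English', 'Russian', 'Arabic', 'English', 'Russian'],
})

# Class frequencies as proportions, sorted in descending order of count
proportions = surnames.nationality.value_counts(normalize=True)
print(proportions)
```

Running the same call on the full dataset reproduces the 27% / 21% / 14% figures quoted above for English, Russian, and Arabic.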
# Split data by nationality
# Create dict mapping each nationality to its list of rows
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())
# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_proportion * n)
    n_val = int(args.val_proportion * n)
    n_test = int(args.test_proportion * n)

    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in item_list[n_train + n_val:]:
        item['split'] = 'test'

    # Add to final list
    final_list.extend(item_list)
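Because each nationality is shuffled and sliced independently, this is a stratified split: every class keeps roughly the same train/val/test proportions as the whole dataset. A standalone sketch of the same logic on synthetic data (the labels and counts here are invented):

```python
import collections
import numpy as np

np.random.seed(0)
# Synthetic rows: 10 items of class 'A' and 6 of class 'B'
rows = ([{'label': 'A', 'x': i} for i in range(10)] +
        [{'label': 'B', 'x': i} for i in range(6)])

by_label = collections.defaultdict(list)
for row in rows:
    by_label[row['label']].append(row)

final = []
for _, items in sorted(by_label.items()):
    np.random.shuffle(items)
    n_train = int(0.7 * len(items))   # same truncating arithmetic as above
    n_val = int(0.15 * len(items))
    for item in items[:n_train]:
        item['split'] = 'train'
    for item in items[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in items[n_train + n_val:]:
        item['split'] = 'test'
    final.extend(items)

counts = collections.Counter(r['split'] for r in final)
print(counts)  # Counter({'train': 11, 'test': 4, 'val': 1})
```

Note one quirk of the truncating `int()` arithmetic: very small classes can end up with zero validation items (class 'B' above gets none), while the leftover rows all fall into the test slice.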
# Write split data to file
final_surnames = pd.DataFrame(final_list)
final_surnames.split.value_counts()
train 7680
test 1660
val 1640
Name: split, dtype: int64
final_surnames.head()
| | surname | nationality | split |
|---|---|---|---|
| 0 | Gerges | Arabic | train |
| 1 | Nassar | Arabic | train |
| 2 | Kanaan | Arabic | train |
| 3 | Hadad | Arabic | train |
| 4 | Tannous | Arabic | train |
# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)
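With `index=False`, the CSV round-trips cleanly back into an identical DataFrame. A quick standalone check using an in-memory buffer instead of the notebook's output path (the single row below is just sample data):

```python
import io
import pandas as pd

# One illustrative row; the real frame has all surnames with their splits
df = pd.DataFrame([{'surname': 'Woodford', 'nationality': 'English',
                    'split': 'train'}])

# Write without the index, then read back from the same buffer
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

assert df.equals(df2)  # no stray index column, identical contents
```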
Your Turn
Visualise the distribution of nationalities to observe the class imbalance in this dataset. What can you do to create a balanced dataset?