{
"cells": [
{
"cell_type": "markdown",
"source": [
"Surname Dataset Processing\r\n",
"==========================\r\n",
"\r\n",
"In this example, we introduce the surnames dataset, a collection of 10,000 surnames from 18 different nationalities collected by the authors from different name sources on the internet. This dataset has several properties that make it interesting. \r\n",
"\r\n",
"- The first property is that it is fairly imbalanced. The top three classes account for more than 60% of the data: 27% are English, 21% are Russian, and 14% are Arabic. The remaining 15 nationalities have decreasing frequency, a long-tailed property that is endemic to language as well. \r\n",
"- The second property is that there is a valid and intuitive relationship between nationality of origin and surname orthography (spelling). Some spelling variations are strongly tied to nation of origin (as in `O'Neill`, `Antonopoulos`, `Nagasawa`, or `Zhu`)."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Imports"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 1,
"source": [
"import collections\r\n",
"import numpy as np\r\n",
"import pandas as pd\r\n",
"\r\n",
"from argparse import Namespace"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 2,
"source": [
"args = Namespace(\r\n",
" raw_dataset_csv=\"../data/surnames/surnames.csv\",\r\n",
" train_proportion=0.7,\r\n",
" val_proportion=0.15,\r\n",
" test_proportion=0.15,\r\n",
" output_munged_csv=\"../data/surnames/surnames_with_splits.csv\",\r\n",
" seed=1337\r\n",
")"
],
"outputs": [],
"metadata": {}
},
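{
"cell_type": "markdown",
"source": [
"As a quick sanity check (an optional step added here, not part of the original workflow), we can confirm that the three split proportions sum to one before using them:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional sanity check: the three split proportions should sum to 1\r\n",
"assert np.isclose(args.train_proportion + args.val_proportion + args.test_proportion, 1.0)"
],
"outputs": [],
"metadata": {}
},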
{
"cell_type": "code",
"execution_count": 3,
"source": [
"# Read raw data\r\n",
"surnames = pd.read_csv(args.raw_dataset_csv, header=0)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"surnames.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" surname nationality\n",
"0 Woodford English\n",
"1 Coté French\n",
"2 Kore English\n",
"3 Koury Arabic\n",
"4 Lebzak Russian"
]
},
"metadata": {},
"execution_count": 4
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"# Unique classes\r\n",
"set(surnames.nationality)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'Arabic',\n",
" 'Chinese',\n",
" 'Czech',\n",
" 'Dutch',\n",
" 'English',\n",
" 'French',\n",
" 'German',\n",
" 'Greek',\n",
" 'Irish',\n",
" 'Italian',\n",
" 'Japanese',\n",
" 'Korean',\n",
" 'Polish',\n",
" 'Portuguese',\n",
" 'Russian',\n",
" 'Scottish',\n",
" 'Spanish',\n",
" 'Vietnamese'}"
]
},
"metadata": {},
"execution_count": 5
}
],
"metadata": {}
},
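{
"cell_type": "markdown",
"source": [
"To see the imbalance described in the introduction, we can inspect the relative frequency of each class (an optional exploratory step; the exact percentages may differ slightly from those quoted above):"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Relative frequency of each nationality, most common first\r\n",
"surnames.nationality.value_counts(normalize=True)"
],
"outputs": [],
"metadata": {}
},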
{
"cell_type": "code",
"execution_count": 6,
"source": [
"# Splitting train by nationality\r\n",
"# Create dict\r\n",
"by_nationality = collections.defaultdict(list)\r\n",
"for _, row in surnames.iterrows():\r\n",
" by_nationality[row.nationality].append(row.to_dict())"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 8,
"source": [
"# Create split data\r\n",
"final_list = []\r\n",
"np.random.seed(args.seed)\r\n",
"for _, item_list in sorted(by_nationality.items()):\r\n",
" np.random.shuffle(item_list)\r\n",
" n = len(item_list)\r\n",
" n_train = int(args.train_proportion*n)\r\n",
" n_val = int(args.val_proportion*n)\r\n",
" n_test = int(args.test_proportion*n)\r\n",
" \r\n",
" # Give data point a split attribute\r\n",
" for item in item_list[:n_train]:\r\n",
" item['split'] = 'train'\r\n",
" for item in item_list[n_train:n_train+n_val]:\r\n",
" item['split'] = 'val'\r\n",
" for item in item_list[n_train+n_val:]:\r\n",
" item['split'] = 'test' \r\n",
" \r\n",
" # Add to final list\r\n",
" final_list.extend(item_list)"
],
"outputs": [],
"metadata": {}
},
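{
"cell_type": "markdown",
"source": [
"Because the shuffling and slicing above happen within each nationality, the resulting splits are stratified. The following optional check sketches one way to confirm that every class keeps roughly the 70/15/15 proportions:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional check: per-nationality split proportions (stratified splitting)\r\n",
"pd.crosstab(\r\n",
"    [item['nationality'] for item in final_list],\r\n",
"    [item['split'] for item in final_list],\r\n",
"    normalize='index'\r\n",
")"
],
"outputs": [],
"metadata": {}
},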
{
"cell_type": "code",
"execution_count": 9,
"source": [
"# Write split data to file\r\n",
"final_surnames = pd.DataFrame(final_list)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 10,
"source": [
"final_surnames.split.value_counts()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"train 7680\n",
"test 1660\n",
"val 1640\n",
"Name: split, dtype: int64"
]
},
"metadata": {},
"execution_count": 10
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 11,
"source": [
"final_surnames.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" surname nationality split\n",
"0 Gerges Arabic train\n",
"1 Nassar Arabic train\n",
"2 Kanaan Arabic train\n",
"3 Hadad Arabic train\n",
"4 Tannous Arabic train"
]
},
"metadata": {},
"execution_count": 11
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 12,
"source": [
"# Write munged data to CSV\r\n",
"final_surnames.to_csv(args.output_munged_csv, index=False)"
],
"outputs": [],
"metadata": {}
},
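{
"cell_type": "markdown",
"source": [
"As a final optional check, we can read the munged CSV back and confirm that the row count and columns survive the round trip (the expected column order assumes Python 3.7+ dict ordering when the DataFrame was built from `final_list`):"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Optional: verify the written CSV round-trips cleanly\r\n",
"check_df = pd.read_csv(args.output_munged_csv)\r\n",
"assert len(check_df) == len(final_surnames)\r\n",
"assert list(check_df.columns) == ['surname', 'nationality', 'split']"
],
"outputs": [],
"metadata": {}
},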
{
"cell_type": "markdown",
"source": [
":::{admonition} Your Turn\r\n",
"Visualise the distribution of nationalities to observe the class imbalance in this dataset. What can you do to create a balanced dataset?\r\n",
":::"
],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.10 64-bit ('cits4012': conda)"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "12px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": "5",
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false
},
"interpreter": {
"hash": "d990147e05fc0cc60dd3871899a6233eb6a5324c1885ded43d013dc915f7e535"
}
},
"nbformat": 4,
"nbformat_minor": 2
}