{ "cells": [ { "cell_type": "markdown", "source": [ "Surname Dataset Processing\r\n", "==========================\r\n", "\r\n", "In this example, we introduce the surnames dataset, a collection of 10,000 surnames from 18 different nationalities collected by the authors from different name sources on the internet. This dataset has several properties that make it interesting.\r\n", "\r\n", "- The first property is that it is fairly imbalanced. The top three classes account for more than 60% of the data: 27% are English, 21% are Russian, and 14% are Arabic. The remaining 15 nationalities occur with decreasing frequency, a long-tailed property that is endemic to language as well.\r\n", "- The second property is that there is a valid and intuitive relationship between nationality of origin and surname orthography\r\n", "(spelling). There are spelling variations that are strongly tied to nation of origin (such as in `O'Neill`, `Antonopoulos`, `Nagasawa`, or `Zhu`)." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Imports" ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "import collections\r\n", "import numpy as np\r\n", "import pandas as pd\r\n", "import re\r\n", "\r\n", "from argparse import Namespace" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "args = Namespace(\r\n", "    raw_dataset_csv=\"../data/surnames/surnames.csv\",\r\n", "    train_proportion=0.7,\r\n", "    val_proportion=0.15,\r\n", "    test_proportion=0.15,\r\n", "    output_munged_csv=\"../data/surnames/surnames_with_splits.csv\",\r\n", "    seed=1337\r\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "# Read raw data\r\n", "surnames = pd.read_csv(args.raw_dataset_csv, header=0)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "surnames.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "<div>\n",
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
surnamenationality
0WoodfordEnglish
1CotéFrench
2KoreEnglish
3KouryArabic
4LebzakRussian
\n", "
" ], "text/plain": [ " surname nationality\n", "0 Woodford English\n", "1 Coté French\n", "2 Kore English\n", "3 Koury Arabic\n", "4 Lebzak Russian" ] }, "metadata": {}, "execution_count": 4 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "# Unique classes\r\n", "set(surnames.nationality)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'Arabic',\n", " 'Chinese',\n", " 'Czech',\n", " 'Dutch',\n", " 'English',\n", " 'French',\n", " 'German',\n", " 'Greek',\n", " 'Irish',\n", " 'Italian',\n", " 'Japanese',\n", " 'Korean',\n", " 'Polish',\n", " 'Portuguese',\n", " 'Russian',\n", " 'Scottish',\n", " 'Spanish',\n", " 'Vietnamese'}" ] }, "metadata": {}, "execution_count": 5 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "# Splitting train by nationality\r\n", "# Create dict\r\n", "by_nationality = collections.defaultdict(list)\r\n", "for _, row in surnames.iterrows():\r\n", " by_nationality[row.nationality].append(row.to_dict())" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "# Create split data\r\n", "final_list = []\r\n", "np.random.seed(args.seed)\r\n", "for _, item_list in sorted(by_nationality.items()):\r\n", " np.random.shuffle(item_list)\r\n", " n = len(item_list)\r\n", " n_train = int(args.train_proportion*n)\r\n", " n_val = int(args.val_proportion*n)\r\n", " n_test = int(args.test_proportion*n)\r\n", " \r\n", " # Give data point a split attribute\r\n", " for item in item_list[:n_train]:\r\n", " item['split'] = 'train'\r\n", " for item in item_list[n_train:n_train+n_val]:\r\n", " item['split'] = 'val'\r\n", " for item in item_list[n_train+n_val:]:\r\n", " item['split'] = 'test' \r\n", " \r\n", " # Add to final list\r\n", " final_list.extend(item_list)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "# Write split data to file\r\n", "final_surnames = pd.DataFrame(final_list)" ], 
"outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "final_surnames.split.value_counts()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "train 7680\n", "test 1660\n", "val 1640\n", "Name: split, dtype: int64" ] }, "metadata": {}, "execution_count": 10 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "final_surnames.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
surnamenationalitysplit
0GergesArabictrain
1NassarArabictrain
2KanaanArabictrain
3HadadArabictrain
4TannousArabictrain
\n", "
" ], "text/plain": [ " surname nationality split\n", "0 Gerges Arabic train\n", "1 Nassar Arabic train\n", "2 Kanaan Arabic train\n", "3 Hadad Arabic train\n", "4 Tannous Arabic train" ] }, "metadata": {}, "execution_count": 11 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "# Write munged data to CSV\r\n", "final_surnames.to_csv(args.output_munged_csv, index=False)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ ":::{admonition} Your Turn\r\n", "Visualise the distribution of nationalities to observe the class imbalance in this dataset. What can you do to create a balanced dataset?\r\n", ":::" ], "metadata": {} } ], "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3.8.10 64-bit ('cits4012': conda)" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": "5", "toc_cell": false, "toc_section_display": "block", "toc_window_display": false }, "interpreter": { "hash": "d990147e05fc0cc60dd3871899a6233eb6a5324c1885ded43d013dc915f7e535" } }, "nbformat": 4, "nbformat_minor": 2 }