{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "Surname Dataset Processing\r\n",
    "==========================\r\n",
    "\r\n",
    "In this example, we introduce the surnames dataset, a collection of 10,000 surnames from 18 different nationalities collected by the authors from different name sources on the internet. This dataset has several properties that make it interesting. \r\n",
    "\r\n",
    "- The first property is that it is fairly imbalanced. The top three classes account for more than 60% of the data: 27% are English, 21% are Russian, and 14% are Arabic. The remaining 15 nationalities have decreasing frequency - a property that is endemic to language, as well. \r\n",
    "- The second property is that there is a valid and intuitive relationship between nationality of origin and surname orthography\r\n",
    "(spelling). There are spelling variations that are strongly tied to nation of origin (such in `O'Neill`, `Antonopoulos`, `Nagasawa`, or `Zhu`)."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Imports"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "source": [
    "import collections\r\n",
    "import numpy as np\r\n",
    "import pandas as pd\r\n",
    "import re\r\n",
    "\r\n",
    "from argparse import Namespace"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "args = Namespace(\r\n",
    "    raw_dataset_csv=\"../data/surnames/surnames.csv\",\r\n",
    "    train_proportion=0.7,\r\n",
    "    val_proportion=0.15,\r\n",
    "    test_proportion=0.15,\r\n",
    "    output_munged_csv=\"../data/surnames/surnames_with_splits.csv\",\r\n",
    "    seed=1337\r\n",
    ")"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "source": [
    "# Read raw data\r\n",
    "surnames = pd.read_csv(args.raw_dataset_csv, header=0)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "source": [
    "surnames.head()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>surname</th>\n",
       "      <th>nationality</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Woodford</td>\n",
       "      <td>English</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Coté</td>\n",
       "      <td>French</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Kore</td>\n",
       "      <td>English</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Koury</td>\n",
       "      <td>Arabic</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Lebzak</td>\n",
       "      <td>Russian</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    surname nationality\n",
       "0  Woodford     English\n",
       "1      Coté      French\n",
       "2      Kore     English\n",
       "3     Koury      Arabic\n",
       "4    Lebzak     Russian"
      ]
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "source": [
    "# Unique classes\r\n",
    "set(surnames.nationality)"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "{'Arabic',\n",
       " 'Chinese',\n",
       " 'Czech',\n",
       " 'Dutch',\n",
       " 'English',\n",
       " 'French',\n",
       " 'German',\n",
       " 'Greek',\n",
       " 'Irish',\n",
       " 'Italian',\n",
       " 'Japanese',\n",
       " 'Korean',\n",
       " 'Polish',\n",
       " 'Portuguese',\n",
       " 'Russian',\n",
       " 'Scottish',\n",
       " 'Spanish',\n",
       " 'Vietnamese'}"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "source": [
    "# Splitting train by nationality\r\n",
    "# Create dict\r\n",
    "by_nationality = collections.defaultdict(list)\r\n",
    "for _, row in surnames.iterrows():\r\n",
    "    by_nationality[row.nationality].append(row.to_dict())"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "source": [
    "# Create split data\r\n",
    "final_list = []\r\n",
    "np.random.seed(args.seed)\r\n",
    "for _, item_list in sorted(by_nationality.items()):\r\n",
    "    np.random.shuffle(item_list)\r\n",
    "    n = len(item_list)\r\n",
    "    n_train = int(args.train_proportion*n)\r\n",
    "    n_val = int(args.val_proportion*n)\r\n",
    "    n_test = int(args.test_proportion*n)\r\n",
    "    \r\n",
    "    # Give data point a split attribute\r\n",
    "    for item in item_list[:n_train]:\r\n",
    "        item['split'] = 'train'\r\n",
    "    for item in item_list[n_train:n_train+n_val]:\r\n",
    "        item['split'] = 'val'\r\n",
    "    for item in item_list[n_train+n_val:]:\r\n",
    "        item['split'] = 'test'  \r\n",
    "    \r\n",
    "    # Add to final list\r\n",
    "    final_list.extend(item_list)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "source": [
    "# Write split data to file\r\n",
    "final_surnames = pd.DataFrame(final_list)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "source": [
    "final_surnames.split.value_counts()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "train    7680\n",
       "test     1660\n",
       "val      1640\n",
       "Name: split, dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "source": [
    "final_surnames.head()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>surname</th>\n",
       "      <th>nationality</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Gerges</td>\n",
       "      <td>Arabic</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Nassar</td>\n",
       "      <td>Arabic</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Kanaan</td>\n",
       "      <td>Arabic</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hadad</td>\n",
       "      <td>Arabic</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Tannous</td>\n",
       "      <td>Arabic</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   surname nationality  split\n",
       "0   Gerges      Arabic  train\n",
       "1   Nassar      Arabic  train\n",
       "2   Kanaan      Arabic  train\n",
       "3    Hadad      Arabic  train\n",
       "4  Tannous      Arabic  train"
      ]
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "source": [
    "# Write munged data to CSV\r\n",
    "final_surnames.to_csv(args.output_munged_csv, index=False)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    ":::{admonition} Your Turn\r\n",
    "Visualise the distribution of nationalities to observe the class imbalance in this dataset. What can you do to create a balanced dataset?\r\n",
    ":::"
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.8.10 64-bit ('cits4012': conda)"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "12px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": "5",
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  },
  "interpreter": {
   "hash": "d990147e05fc0cc60dd3871899a6233eb6a5324c1885ded43d013dc915f7e535"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}