{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "Yelp Dataset at a glance\r\n",
    "=========================\r\n",
    "\r\n",
    "The Yelp dataset, pairs review texts with their sentiment labels (positive or negative). In this notebook, we take a look at the dataset by loading a csv file into a Pandas Data Frame, which will give us the foundation of motivating a more object-oriented data handling in PyTorch. "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "source": [
    "import collections\r\n",
    "import numpy as np\r\n",
    "import pandas as pd\r\n",
    "import re\r\n",
    "\r\n",
    "from argparse import Namespace"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Group all initial settings together"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "source": [
    "args = Namespace(\r\n",
    "    raw_train_dataset_csv=\"../data/yelp/raw_train.csv\",\r\n",
    "    raw_test_dataset_csv=\"../data/yelp/raw_test.csv\",\r\n",
    "    train_proportion=0.7,\r\n",
    "    val_proportion=0.3,\r\n",
    "    output_munged_csv=\"../data/yelp/reviews_with_splits_full.csv\",\r\n",
    "    seed=1337\r\n",
    ")"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Use `pandas.read_csv` to process CSV files"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "source": [
    "# Read raw data\r\n",
    "train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])\r\n",
    "train_reviews = train_reviews[~pd.isnull(train_reviews.review)]\r\n",
    "test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])\r\n",
    "test_reviews = test_reviews[~pd.isnull(test_reviews.review)]"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "source": [
    "train_reviews.head()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>I don't know what Dr. Goldberg was like before...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>I'm writing this review to give you a heads up...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>All the food is great here. But the best thing...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rating                                             review\n",
       "0       1  Unfortunately, the frustration of being Dr. Go...\n",
       "1       2  Been going to Dr. Goldberg for over 10 years. ...\n",
       "2       1  I don't know what Dr. Goldberg was like before...\n",
       "3       1  I'm writing this review to give you a heads up...\n",
       "4       2  All the food is great here. But the best thing..."
      ]
     },
     "metadata": {},
     "execution_count": 7
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "source": [
    "test_reviews.head()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Ordered a large Mango-Pineapple smoothie. Stay...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Quite a surprise!  \\n\\nMy wife and I loved thi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>First I will say, this is a nice atmosphere an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>I was overall pretty impressed by this hotel. ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Video link at bottom review. Worst service I h...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   rating                                             review\n",
       "0       1  Ordered a large Mango-Pineapple smoothie. Stay...\n",
       "1       2  Quite a surprise!  \\n\\nMy wife and I loved thi...\n",
       "2       1  First I will say, this is a nice atmosphere an...\n",
       "3       2  I was overall pretty impressed by this hotel. ...\n",
       "4       1  Video link at bottom review. Worst service I h..."
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "source": [
    "# Unique classes\r\n",
    "set(train_reviews.rating)"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "{1, 2}"
      ]
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Splitting the training dataset \r\n",
    "\r\n",
    "For turning the hyper-parameters, we often want to retain a small proportion of validation data from the original training data. "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "source": [
    "# Splitting train by rating\r\n",
    "# Create dict\r\n",
    "by_rating = collections.defaultdict(list)\r\n",
    "for _, row in train_reviews.iterrows():\r\n",
    "    by_rating[row.rating].append(row.to_dict())"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "source": [
    "# Create split data\r\n",
    "final_list = []\r\n",
    "np.random.seed(args.seed)\r\n",
    "\r\n",
    "for _, item_list in sorted(by_rating.items()):\r\n",
    "\r\n",
    "    np.random.shuffle(item_list)\r\n",
    "    \r\n",
    "    n_total = len(item_list)\r\n",
    "    n_train = int(args.train_proportion * n_total)\r\n",
    "    n_val = int(args.val_proportion * n_total)\r\n",
    "    \r\n",
    "    # Give data point a split attribute\r\n",
    "    for item in item_list[:n_train]:\r\n",
    "        item['split'] = 'train'\r\n",
    "    \r\n",
    "    for item in item_list[n_train:n_train+n_val]:\r\n",
    "        item['split'] = 'val'\r\n",
    "\r\n",
    "    # Add to final list\r\n",
    "    final_list.extend(item_list)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "source": [
    "for _, row in test_reviews.iterrows():\r\n",
    "    row_dict = row.to_dict()\r\n",
    "    row_dict['split'] = 'test'\r\n",
    "    final_list.append(row_dict)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Simple preprocessing and Write to file"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "source": [
    "final_reviews = pd.DataFrame(final_list)\r\n",
    "final_reviews.split.value_counts()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "train    392000\n",
       "val      168000\n",
       "test      38000\n",
       "Name: split, dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "source": [
    "final_reviews.review.head()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    The entrance was the #1 impressive thing about...\n",
       "1    I'm a Mclover, and I had no problem\\nwith the ...\n",
       "2    Less than good here, not terrible, but I see n...\n",
       "3    I don't know if I can ever bring myself to go ...\n",
       "4    Food was OK/Good but the service was terrible....\n",
       "Name: review, dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 14
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "source": [
    "final_reviews[pd.isnull(final_reviews.review)]"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating</th>\n",
       "      <th>review</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [rating, review, split]\n",
       "Index: []"
      ]
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "source": [
    "# Preprocess the reviews\r\n",
    "# You can come up with better pre-processing\r\n",
    "def preprocess_text(text):\r\n",
    "    if type(text) == float:\r\n",
    "        print(text)\r\n",
    "    text = text.lower()\r\n",
    "    text = re.sub(r\"([.,!?])\", r\" \\1 \", text)\r\n",
    "    text = re.sub(r\"[^a-zA-Z.,!?]+\", r\" \", text)\r\n",
    "    return text\r\n",
    "    \r\n",
    "final_reviews.review = final_reviews.review.apply(preprocess_text)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "source": [
    "final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "source": [
    "final_reviews.head()"
   ],
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating</th>\n",
       "      <th>review</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>negative</td>\n",
       "      <td>the entrance was the impressive thing about th...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>negative</td>\n",
       "      <td>i m a mclover , and i had no problem nwith the...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>negative</td>\n",
       "      <td>less than good here , not terrible , but i see...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>negative</td>\n",
       "      <td>i don t know if i can ever bring myself to go ...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>negative</td>\n",
       "      <td>food was ok good but the service was terrible ...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     rating                                             review  split\n",
       "0  negative  the entrance was the impressive thing about th...  train\n",
       "1  negative  i m a mclover , and i had no problem nwith the...  train\n",
       "2  negative  less than good here , not terrible , but i see...  train\n",
       "3  negative  i don t know if i can ever bring myself to go ...  train\n",
       "4  negative  food was ok good but the service was terrible ...  train"
      ]
     },
     "metadata": {},
     "execution_count": 18
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "source": [
    "final_reviews.to_csv(args.output_munged_csv, index=False)"
   ],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.8.10 64-bit ('cits4012': conda)"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "12px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": "5",
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  },
  "interpreter": {
   "hash": "d990147e05fc0cc60dd3871899a6233eb6a5324c1885ded43d013dc915f7e535"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}