{ "cells": [ { "cell_type": "markdown", "source": [ "Yelp Dataset at a glance\r\n", "=========================\r\n", "\r\n", "The Yelp dataset, pairs review texts with their sentiment labels (positive or negative). In this notebook, we take a look at the dataset by loading a csv file into a Pandas Data Frame, which will give us the foundation of motivating a more object-oriented data handling in PyTorch. " ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "import collections\r\n", "import numpy as np\r\n", "import pandas as pd\r\n", "import re\r\n", "\r\n", "from argparse import Namespace" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Group all initial settings together" ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "args = Namespace(\r\n", " raw_train_dataset_csv=\"../data/yelp/raw_train.csv\",\r\n", " raw_test_dataset_csv=\"../data/yelp/raw_test.csv\",\r\n", " train_proportion=0.7,\r\n", " val_proportion=0.3,\r\n", " output_munged_csv=\"../data/yelp/reviews_with_splits_full.csv\",\r\n", " seed=1337\r\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Use `pandas.read_csv` to process CSV files" ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "# Read raw data\r\n", "train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])\r\n", "train_reviews = train_reviews[~pd.isnull(train_reviews.review)]\r\n", "test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])\r\n", "test_reviews = test_reviews[~pd.isnull(test_reviews.review)]" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "train_reviews.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " rating review\n", "0 1 Unfortunately, the frustration of being Dr. Go...\n", "1 2 Been going to Dr. Goldberg for over 10 years. ...\n", "2 1 I don't know what Dr. Goldberg was like before...\n", "3 1 I'm writing this review to give you a heads up...\n", "4 2 All the food is great here. But the best thing..." ] }, "metadata": {}, "execution_count": 7 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "test_reviews.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " rating review\n", "0 1 Ordered a large Mango-Pineapple smoothie. Stay...\n", "1 2 Quite a surprise! \\n\\nMy wife and I loved thi...\n", "2 1 First I will say, this is a nice atmosphere an...\n", "3 2 I was overall pretty impressed by this hotel. ...\n", "4 1 Video link at bottom review. Worst service I h..." ] }, "metadata": {}, "execution_count": 8 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "# Unique classes\r\n", "set(train_reviews.rating)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{1, 2}" ] }, "metadata": {}, "execution_count": 9 } ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Splitting the training dataset \r\n", "\r\n", "For turning the hyper-parameters, we often want to retain a small proportion of validation data from the original training data. " ], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "# Splitting train by rating\r\n", "# Create dict\r\n", "by_rating = collections.defaultdict(list)\r\n", "for _, row in train_reviews.iterrows():\r\n", " by_rating[row.rating].append(row.to_dict())" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "# Create split data\r\n", "final_list = []\r\n", "np.random.seed(args.seed)\r\n", "\r\n", "for _, item_list in sorted(by_rating.items()):\r\n", "\r\n", " np.random.shuffle(item_list)\r\n", " \r\n", " n_total = len(item_list)\r\n", " n_train = int(args.train_proportion * n_total)\r\n", " n_val = int(args.val_proportion * n_total)\r\n", " \r\n", " # Give data point a split attribute\r\n", " for item in item_list[:n_train]:\r\n", " item['split'] = 'train'\r\n", " \r\n", " for item in item_list[n_train:n_train+n_val]:\r\n", " item['split'] = 'val'\r\n", "\r\n", " # Add to final list\r\n", " final_list.extend(item_list)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "for _, row in 
test_reviews.iterrows():\r\n", "    row_dict = row.to_dict()\r\n", "    row_dict['split'] = 'test'\r\n", "    final_list.append(row_dict)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Simple preprocessing and writing to file" ], "metadata": {} }, { "cell_type": "code", "execution_count": 13, "source": [ "final_reviews = pd.DataFrame(final_list)\r\n", "final_reviews.split.value_counts()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "train 392000\n", "val 168000\n", "test 38000\n", "Name: split, dtype: int64" ] }, "metadata": {}, "execution_count": 13 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 14, "source": [ "final_reviews.review.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 The entrance was the #1 impressive thing about...\n", "1 I'm a Mclover, and I had no problem\\nwith the ...\n", "2 Less than good here, not terrible, but I see n...\n", "3 I don't know if I can ever bring myself to go ...\n", "4 Food was OK/Good but the service was terrible....\n", "Name: review, dtype: object" ] }, "metadata": {}, "execution_count": 14 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 15, "source": [ "final_reviews[pd.isnull(final_reviews.review)]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [rating, review, split]\n", "Index: []" ] }, "metadata": {}, "execution_count": 15 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 16, "source": [ "# Preprocess the reviews\r\n", "# You can come up with better pre-processing\r\n", "def preprocess_text(text):\r\n", " if type(text) == float:\r\n", " print(text)\r\n", " text = text.lower()\r\n", " text = re.sub(r\"([.,!?])\", r\" \\1 \", text)\r\n", " text = re.sub(r\"[^a-zA-Z.,!?]+\", r\" \", text)\r\n", " return text\r\n", " \r\n", "final_reviews.review = final_reviews.review.apply(preprocess_text)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 17, "source": [ "final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 18, "source": [ "final_reviews.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " rating review split\n", "0 negative the entrance was the impressive thing about th... train\n", "1 negative i m a mclover , and i had no problem nwith the... train\n", "2 negative less than good here , not terrible , but i see... train\n", "3 negative i don t know if i can ever bring myself to go ... train\n", "4 negative food was ok good but the service was terrible ... train" ] }, "metadata": {}, "execution_count": 18 } ], "metadata": {} }, { "cell_type": "code", "execution_count": 19, "source": [ "final_reviews.to_csv(args.output_munged_csv, index=False)" ], "outputs": [], "metadata": {} } ], "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3.8.10 64-bit ('cits4012': conda)" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": "5", "toc_cell": false, "toc_section_display": "block", "toc_window_display": false }, "interpreter": { "hash": "d990147e05fc0cc60dd3871899a6233eb6a5324c1885ded43d013dc915f7e535" } }, "nbformat": 4, "nbformat_minor": 2 }