{
"cells": [
{
"cell_type": "markdown",
"source": [
"Yelp Dataset at a glance\r\n",
"=========================\r\n",
"\r\n",
"The Yelp dataset pairs review texts with sentiment labels (positive or negative). In this notebook, we take a look at the dataset by loading a CSV file into a pandas DataFrame, which lays the groundwork for the more object-oriented data handling used in PyTorch."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"import collections\r\n",
"import numpy as np\r\n",
"import pandas as pd\r\n",
"import re\r\n",
"\r\n",
"from argparse import Namespace"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Group all initial settings together"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"args = Namespace(\r\n",
"    raw_train_dataset_csv=\"../data/yelp/raw_train.csv\",\r\n",
"    raw_test_dataset_csv=\"../data/yelp/raw_test.csv\",\r\n",
"    train_proportion=0.7,\r\n",
"    val_proportion=0.3,\r\n",
"    output_munged_csv=\"../data/yelp/reviews_with_splits_full.csv\",\r\n",
"    seed=1337\r\n",
")"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Use `pandas.read_csv` to process CSV files"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
"source": [
"# Read raw data\r\n",
"train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])\r\n",
"train_reviews = train_reviews[~pd.isnull(train_reviews.review)]\r\n",
"test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])\r\n",
"test_reviews = test_reviews[~pd.isnull(test_reviews.review)]"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 7,
"source": [
"train_reviews.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" rating review\n",
"0 1 Unfortunately, the frustration of being Dr. Go...\n",
"1 2 Been going to Dr. Goldberg for over 10 years. ...\n",
"2 1 I don't know what Dr. Goldberg was like before...\n",
"3 1 I'm writing this review to give you a heads up...\n",
"4 2 All the food is great here. But the best thing..."
]
},
"metadata": {},
"execution_count": 7
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 8,
"source": [
"test_reviews.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" rating review\n",
"0 1 Ordered a large Mango-Pineapple smoothie. Stay...\n",
"1 2 Quite a surprise! \\n\\nMy wife and I loved thi...\n",
"2 1 First I will say, this is a nice atmosphere an...\n",
"3 2 I was overall pretty impressed by this hotel. ...\n",
"4 1 Video link at bottom review. Worst service I h..."
]
},
"metadata": {},
"execution_count": 8
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 9,
"source": [
"# Unique classes\r\n",
"set(train_reviews.rating)"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{1, 2}"
]
},
"metadata": {},
"execution_count": 9
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Splitting the training dataset\r\n",
"\r\n",
"To tune hyperparameters, we hold out part of the original training data as a validation set. Here the split is 70% train and 30% validation, applied separately within each rating class so that both splits keep the original class balance."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 10,
"source": [
"# Splitting train by rating\r\n",
"# Create dict\r\n",
"by_rating = collections.defaultdict(list)\r\n",
"for _, row in train_reviews.iterrows():\r\n",
"    by_rating[row.rating].append(row.to_dict())"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 11,
"source": [
"# Create split data\r\n",
"final_list = []\r\n",
"np.random.seed(args.seed)\r\n",
"\r\n",
"for _, item_list in sorted(by_rating.items()):\r\n",
"\r\n",
"    np.random.shuffle(item_list)\r\n",
"\r\n",
"    n_total = len(item_list)\r\n",
"    n_train = int(args.train_proportion * n_total)\r\n",
"    n_val = int(args.val_proportion * n_total)\r\n",
"\r\n",
"    # Give data point a split attribute\r\n",
"    for item in item_list[:n_train]:\r\n",
"        item['split'] = 'train'\r\n",
"\r\n",
"    for item in item_list[n_train:n_train+n_val]:\r\n",
"        item['split'] = 'val'\r\n",
"\r\n",
"    # Add to final list\r\n",
"    final_list.extend(item_list)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 12,
"source": [
"for _, row in test_reviews.iterrows():\r\n",
"    row_dict = row.to_dict()\r\n",
"    row_dict['split'] = 'test'\r\n",
"    final_list.append(row_dict)"
],
"outputs": [],
"metadata": {}
},
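{
"cell_type": "markdown",
"source": [
"The per-class shuffle-and-slice above can also be written with vectorized pandas operations instead of `iterrows`. The following is a minimal sketch under stated assumptions, not the notebook's own method: the helper name `assign_splits` and the toy DataFrame are hypothetical, and, unlike the loop above, every row not chosen for training is labelled `val`.\r\n",
"\r\n",
"```python\r\n",
"import numpy as np\r\n",
"import pandas as pd\r\n",
"\r\n",
"# Hypothetical helper: label each row 'train' or 'val', stratified by label_col.\r\n",
"# Assumes the DataFrame index is unique.\r\n",
"def assign_splits(df, label_col, train_proportion, seed):\r\n",
"    rng = np.random.RandomState(seed)  # local generator, global state untouched\r\n",
"    out = df.copy()\r\n",
"    out['split'] = 'val'  # default; the training rows are overwritten below\r\n",
"    for _, idx in out.groupby(label_col).groups.items():\r\n",
"        shuffled = rng.permutation(np.asarray(idx))  # shuffle this class's row labels\r\n",
"        n_train = int(train_proportion * len(shuffled))\r\n",
"        out.loc[shuffled[:n_train], 'split'] = 'train'\r\n",
"    return out\r\n",
"\r\n",
"# Toy data standing in for the real reviews (an assumption for illustration)\r\n",
"toy = pd.DataFrame({'rating': [1] * 10 + [2] * 10, 'review': ['text'] * 20})\r\n",
"toy_split = assign_splits(toy, 'rating', train_proportion=0.7, seed=1337)\r\n",
"print(toy_split.groupby(['rating', 'split']).size())\r\n",
"```\r\n",
"\r\n",
"Seeding a local `RandomState` rather than the global generator keeps the split reproducible without affecting other random draws."
],
"metadata": {}
},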
{
"cell_type": "markdown",
"source": [
"## Simple preprocessing and writing to file"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 13,
"source": [
"final_reviews = pd.DataFrame(final_list)\r\n",
"final_reviews.split.value_counts()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"train 392000\n",
"val 168000\n",
"test 38000\n",
"Name: split, dtype: int64"
]
},
"metadata": {},
"execution_count": 13
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 14,
"source": [
"final_reviews.review.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 The entrance was the #1 impressive thing about...\n",
"1 I'm a Mclover, and I had no problem\\nwith the ...\n",
"2 Less than good here, not terrible, but I see n...\n",
"3 I don't know if I can ever bring myself to go ...\n",
"4 Food was OK/Good but the service was terrible....\n",
"Name: review, dtype: object"
]
},
"metadata": {},
"execution_count": 14
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 15,
"source": [
"final_reviews[pd.isnull(final_reviews.review)]"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Empty DataFrame\n",
"Columns: [rating, review, split]\n",
"Index: []"
]
},
"metadata": {},
"execution_count": 15
}
],
"metadata": {}
},
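{
"cell_type": "code",
"execution_count": 16,
"source": [
"# Preprocess the reviews\r\n",
"# You can come up with better pre-processing\r\n",
"def preprocess_text(text):\r\n",
"    # Rows with missing reviews were dropped on load; coerce any stray\r\n",
"    # non-string value (e.g. a NaN float) to str so .lower() cannot fail\r\n",
"    if not isinstance(text, str):\r\n",
"        print(text)\r\n",
"        text = str(text)\r\n",
"    text = text.lower()\r\n",
"    # Surround sentence punctuation with spaces so it becomes its own token\r\n",
"    text = re.sub(r\"([.,!?])\", r\" \\1 \", text)\r\n",
"    # Collapse every run of other non-letter characters into one space\r\n",
"    text = re.sub(r\"[^a-zA-Z.,!?]+\", r\" \", text)\r\n",
"    return text\r\n",
"\r\n",
"final_reviews.review = final_reviews.review.apply(preprocess_text)"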
],
"outputs": [],
"metadata": {}
},
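{
"cell_type": "markdown",
"source": [
"To make the effect of the two substitutions concrete, here is a small stand-alone sketch that applies the same regular expressions to a made-up sentence. The function name `demo_preprocess` and the sample text are illustrative assumptions and are not part of the dataset.\r\n",
"\r\n",
"```python\r\n",
"import re\r\n",
"\r\n",
"# Hypothetical stand-alone copy of the two regex steps used above\r\n",
"def demo_preprocess(text):\r\n",
"    text = text.lower()\r\n",
"    # Put spaces around . , ! ? so each becomes a separate token\r\n",
"    text = re.sub(r'([.,!?])', r' \\1 ', text)\r\n",
"    # Replace every run of characters other than letters and . , ! ? with one space\r\n",
"    text = re.sub(r'[^a-zA-Z.,!?]+', r' ', text)\r\n",
"    return text\r\n",
"\r\n",
"# repr() makes leading and trailing spaces visible\r\n",
"print(repr(demo_preprocess(\"Great food, isn't it? 10/10!\")))\r\n",
"```\r\n",
"\r\n",
"In this sketch the apostrophe and the digits are replaced by spaces, so `isn't` becomes `isn t` and `10/10` disappears. The same rule likely explains tokens such as `nwith` in the processed reviews: the raw text appears to contain a literal backslash followed by `n`, and only the backslash is removed."
],
"metadata": {}
},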
{
"cell_type": "code",
"execution_count": 17,
"source": [
"final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 18,
"source": [
"final_reviews.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" rating review split\n",
"0 negative the entrance was the impressive thing about th... train\n",
"1 negative i m a mclover , and i had no problem nwith the... train\n",
"2 negative less than good here , not terrible , but i see... train\n",
"3 negative i don t know if i can ever bring myself to go ... train\n",
"4 negative food was ok good but the service was terrible ... train"
]
},
"metadata": {},
"execution_count": 18
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 19,
"source": [
"final_reviews.to_csv(args.output_munged_csv, index=False)"
],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.10 64-bit ('cits4012': conda)"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "12px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": "5",
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false
},
"interpreter": {
"hash": "d990147e05fc0cc60dd3871899a6233eb6a5324c1885ded43d013dc915f7e535"
}
},
"nbformat": 4,
"nbformat_minor": 2
}