3. Yelp Dataset at a glance
The Yelp dataset pairs review texts with their sentiment labels (positive or negative). In this notebook, we take a first look at the dataset by loading its CSV files into a pandas DataFrame, which lays the groundwork for motivating a more object-oriented approach to data handling in PyTorch.
import collections
import numpy as np
import pandas as pd
import re
from argparse import Namespace
3.1. Group all initial settings together
args = Namespace(
    raw_train_dataset_csv="../data/yelp/raw_train.csv",
    raw_test_dataset_csv="../data/yelp/raw_test.csv",
    train_proportion=0.7,
    val_proportion=0.3,
    output_munged_csv="../data/yelp/reviews_with_splits_full.csv",
    seed=1337,
)
3.2. Use pandas.read_csv to process CSV files
# Read raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]
train_reviews.head()
| | rating | review |
|---|---|---|
| 0 | 1 | Unfortunately, the frustration of being Dr. Go... |
| 1 | 2 | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 1 | I don't know what Dr. Goldberg was like before... |
| 3 | 1 | I'm writing this review to give you a heads up... |
| 4 | 2 | All the food is great here. But the best thing... |
test_reviews.head()
| | rating | review |
|---|---|---|
| 0 | 1 | Ordered a large Mango-Pineapple smoothie. Stay... |
| 1 | 2 | Quite a surprise! \n\nMy wife and I loved thi... |
| 2 | 1 | First I will say, this is a nice atmosphere an... |
| 3 | 2 | I was overall pretty impressed by this hotel. ... |
| 4 | 1 | Video link at bottom review. Worst service I h... |
# Unique classes
set(train_reviews.rating)
{1, 2}
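Knowing which classes exist is a start; before splitting, it can also help to check how many reviews each class has. A minimal sketch on a small made-up frame (`toy_reviews` is hypothetical and only stands in for `train_reviews`):

```python
import pandas as pd

# Hypothetical stand-in for train_reviews; it has the same two columns.
toy_reviews = pd.DataFrame({
    "rating": [1, 2, 1, 2, 2],
    "review": ["bad", "good", "awful", "great", "fine"],
})

# Count rows per class, ordered by class label rather than by frequency.
class_counts = toy_reviews.rating.value_counts().sort_index()
print(class_counts.to_dict())
```

The same call on `train_reviews.rating` would show whether the two classes are balanced.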
3.3. Splitting the training dataset
To tune hyperparameters, we usually hold out a portion of the original training data as a validation set. Here each rating class is split separately, so the training and validation sets keep roughly the same class balance.
# Splitting train by rating
# Create dict
by_rating = collections.defaultdict(list)
for _, row in train_reviews.iterrows():
    by_rating[row.rating].append(row.to_dict())
# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_rating.items()):
    np.random.shuffle(item_list)
    n_total = len(item_list)
    n_train = int(args.train_proportion * n_total)
    n_val = int(args.val_proportion * n_total)
    # Give data point a split attribute
    # Note: int() truncates, so any rows past n_train + n_val get no split
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    # Add to final list
    final_list.extend(item_list)
# Append the test reviews, all labeled 'test'
for _, row in test_reviews.iterrows():
    row_dict = row.to_dict()
    row_dict['split'] = 'test'
    final_list.append(row_dict)
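The loop above walks the frame row by row with `iterrows`, which tends to be slow on a dataset of this size. As a rough sketch of the same per-class idea done with `groupby`, the hypothetical helper below (`add_stratified_split` and its parameters are not part of the notebook) shuffles each class and labels rows by position. It assumes the validation set is simply everything left after the training rows, and its shuffling order differs from `np.random.shuffle`, so it will not reproduce the exact split made above.

```python
import numpy as np
import pandas as pd

def add_stratified_split(df, train_proportion, seed, label_col="rating"):
    """Return a shuffled copy of df with a 'split' column ('train' or 'val').

    Hypothetical helper: rows are shuffled within each class, the first
    int(train_proportion * n) rows of a class become 'train', and the
    remaining rows of that class become 'val'.
    """
    rng = np.random.RandomState(seed)
    parts = []
    for _, group in df.groupby(label_col, sort=True):
        # Shuffle the rows of this class reproducibly.
        shuffled = group.sample(frac=1.0, random_state=rng)
        n_train = int(train_proportion * len(shuffled))
        # Label by position so every row receives exactly one split.
        split = np.where(np.arange(len(shuffled)) < n_train, "train", "val")
        parts.append(shuffled.assign(split=split))
    return pd.concat(parts, ignore_index=True)

# Toy data: 10 reviews per class.
toy = pd.DataFrame({
    "rating": [1] * 10 + [2] * 10,
    "review": [f"review {i}" for i in range(20)],
})
toy_split = add_stratified_split(toy, train_proportion=0.7, seed=1337)
print(toy_split.groupby(["rating", "split"]).size())
```

Because the validation rows are "the rest", no row is left without a split even when the proportions do not divide the class size evenly.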
3.4. Simple preprocessing and writing to file
final_reviews = pd.DataFrame(final_list)
final_reviews.split.value_counts()
train 392000
val 168000
test 38000
Name: split, dtype: int64
final_reviews.review.head()
0 The entrance was the #1 impressive thing about...
1 I'm a Mclover, and I had no problem\nwith the ...
2 Less than good here, not terrible, but I see n...
3 I don't know if I can ever bring myself to go ...
4 Food was OK/Good but the service was terrible....
Name: review, dtype: object
final_reviews[pd.isnull(final_reviews.review)]
| | rating | review | split |
|---|---|---|---|
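# Preprocess the reviews
# You can come up with better pre-processing
def preprocess_text(text):
    # Debug aid: a float here means a missing review (NaN) slipped through,
    # and the call to lower() below would then fail
    if isinstance(text, float):
        print(text)
    text = text.lower()
    # Put spaces around punctuation so each mark becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace every run of other non-letter characters with a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text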
final_reviews.review = final_reviews.review.apply(preprocess_text)
final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)
final_reviews.head()
| | rating | review | split |
|---|---|---|---|
| 0 | negative | the entrance was the impressive thing about th... | train |
| 1 | negative | i m a mclover , and i had no problem nwith the... | train |
| 2 | negative | less than good here , not terrible , but i see... | train |
| 3 | negative | i don t know if i can ever bring myself to go ... | train |
| 4 | negative | food was ok good but the service was terrible ... | train |
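To see what the two regular expressions in `preprocess_text` do, it may help to run the same steps on a single sentence. The sketch below repeats the function so it runs on its own; the `sample` string is made up for illustration.

```python
import re

def preprocess_text(text):
    # Same steps as above: lowercase, pad punctuation, collapse everything else.
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

sample = "I'm a Mclover, and I had no problem!"
cleaned = preprocess_text(sample)
print(repr(cleaned))   # 'i m a mclover , and i had no problem ! '
print(cleaned.split())
```

The apostrophe is replaced by a space (as digits and other symbols would be), while the comma and exclamation mark survive as separate whitespace-delimited tokens.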
final_reviews.to_csv(args.output_munged_csv, index=False)