3. Yelp Dataset at a glance

The Yelp dataset pairs review texts with sentiment labels (positive or negative). In this notebook, we take a first look at the dataset by loading its CSV files into pandas DataFrames, which sets the stage for the more object-oriented data handling used with PyTorch.

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

3.1. Group all initial settings together

args = Namespace(
    raw_train_dataset_csv="../data/yelp/raw_train.csv",
    raw_test_dataset_csv="../data/yelp/raw_test.csv",
    train_proportion=0.7,
    val_proportion=0.3,
    output_munged_csv="../data/yelp/reviews_with_splits_full.csv",
    seed=1337
)
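
As a brief aside on why `Namespace` is convenient for grouping settings: each setting becomes an attribute, and `vars()` turns the whole object back into a plain dictionary for logging or saving. The sketch below uses a hypothetical `demo_args` holding a subset of the settings above; it is an illustration, not part of the notebook's pipeline.

```python
from argparse import Namespace

# Hypothetical object for illustration; mirrors a few of the settings above
demo_args = Namespace(train_proportion=0.7, val_proportion=0.3, seed=1337)

# Settings are read as plain attributes
print(demo_args.seed)

# vars() exposes the same settings as a dictionary
settings = vars(demo_args)
print(sorted(settings))

# Sanity check: the two proportions should cover the whole training set
total = demo_args.train_proportion + demo_args.val_proportion
print(abs(total - 1.0) < 1e-9)
```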

3.2. Use pandas.read_csv to process CSV files

# Read raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]
train_reviews.head()
   rating                                             review
0       1  Unfortunately, the frustration of being Dr. Go...
1       2  Been going to Dr. Goldberg for over 10 years. ...
2       1  I don't know what Dr. Goldberg was like before...
3       1  I'm writing this review to give you a heads up...
4       2  All the food is great here. But the best thing...
test_reviews.head()
   rating                                             review
0       1  Ordered a large Mango-Pineapple smoothie. Stay...
1       2  Quite a surprise! \n\nMy wife and I loved thi...
2       1  First I will say, this is a nice atmosphere an...
3       2  I was overall pretty impressed by this hotel. ...
4       1  Video link at bottom review. Worst service I h...
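
The boolean-mask filter used when reading the raw files drops rows whose `review` is missing. A minimal sketch on a hypothetical toy frame (`toy` is made up for illustration, not taken from the dataset) suggests that `DataFrame.dropna(subset=...)` gives the same result:

```python
import pandas as pd

# Hypothetical toy data with one missing review
toy = pd.DataFrame({
    "rating": [1, 2, 1],
    "review": ["bad service", None, "never again"],
})

# Filter in the style used above: keep rows where review is not null
masked = toy[~pd.isnull(toy.review)]

# Equivalent filter using dropna restricted to the review column
dropped = toy.dropna(subset=["review"])

print(masked.equals(dropped))
```

Both forms keep the original row index, so the surviving rows here are still labeled 0 and 2.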
# Unique classes
set(train_reviews.rating)
{1, 2}
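
Beyond listing the unique classes, it is often useful to see how many rows each class has. The following sketch runs `value_counts` on a hypothetical toy frame (the data is invented for illustration); the same call should work on `train_reviews.rating`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Yelp files
toy = pd.DataFrame({
    "rating": [1, 2, 2, 1, 2],
    "review": ["bad", "good", "great", "awful", "nice"],
})

# Count rows per class, ordered by class label
counts = toy.rating.value_counts().sort_index()
print(counts)
```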

3.3. Splitting the training dataset

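To tune hyperparameters, we usually hold out a portion of the original training data as a validation set. Here the split is made separately within each rating class, so the training and validation sets keep the same class balance.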

# Splitting train by rating
# Create dict
by_rating = collections.defaultdict(list)
for _, row in train_reviews.iterrows():
    by_rating[row.rating].append(row.to_dict())
# Create split data
final_list = []
np.random.seed(args.seed)

for _, item_list in sorted(by_rating.items()):

    np.random.shuffle(item_list)
    
    n_total = len(item_list)
    n_train = int(args.train_proportion * n_total)
    n_val = int(args.val_proportion * n_total)
    
    # Give data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'

    # Add to final list
    final_list.extend(item_list)
for _, row in test_reviews.iterrows():
    row_dict = row.to_dict()
    row_dict['split'] = 'test'
    final_list.append(row_dict)
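
One thing to watch in the per-class loop above: `int()` truncates, so for class sizes that do not divide evenly, `n_train + n_val` can be smaller than `n_total`, and the leftover rows would end up without a `split` value. The sketch below is a variant under stated assumptions, not the notebook's own method: it uses hypothetical toy rows, a local `RandomState` instead of the global seed, and sends every row after the training slice to validation so that no row is left unlabeled.

```python
import collections
import numpy as np

# Hypothetical toy rows standing in for the reviews (10 per class)
rows = [{"rating": 1 + (i % 2), "review": f"review {i}"} for i in range(20)]
train_proportion = 0.7

# Group rows by class label
by_rating = collections.defaultdict(list)
for row in rows:
    by_rating[row["rating"]].append(dict(row))

# Local random state so the shuffle is reproducible (assumption: seed 1337)
rng = np.random.RandomState(1337)

split_rows = []
for _, items in sorted(by_rating.items()):
    rng.shuffle(items)
    n_train = int(train_proportion * len(items))
    # Everything after the training slice goes to validation
    for i, item in enumerate(items):
        item["split"] = "train" if i < n_train else "val"
    split_rows.extend(items)

# Count rows per (class, split) pair
counts = collections.Counter((r["rating"], r["split"]) for r in split_rows)
print(counts)
```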

3.4. Simple preprocessing and writing to file

final_reviews = pd.DataFrame(final_list)
final_reviews.split.value_counts()
train    392000
val      168000
test      38000
Name: split, dtype: int64
final_reviews.review.head()
0    The entrance was the #1 impressive thing about...
1    I'm a Mclover, and I had no problem\nwith the ...
2    Less than good here, not terrible, but I see n...
3    I don't know if I can ever bring myself to go ...
4    Food was OK/Good but the service was terrible....
Name: review, dtype: object
final_reviews[pd.isnull(final_reviews.review)]
rating review split
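# Preprocess the reviews.
# This is a deliberately simple scheme; you can come up with better preprocessing.
def preprocess_text(text):
    # Debugging aid: a float here means a missing review (NaN) slipped through,
    # so print it before .lower() fails on it.
    if isinstance(text, float):
        print(text)
    text = text.lower()
    # Put spaces around basic punctuation so each mark becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Collapse every run of other non-letter characters into a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text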
    
final_reviews.review = final_reviews.review.apply(preprocess_text)
final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)
final_reviews.head()
     rating                                             review  split
0  negative  the entrance was the impressive thing about th...  train
1  negative  i m a mclover , and i had no problem nwith the...  train
2  negative  less than good here , not terrible , but i see...  train
3  negative  i don t know if i can ever bring myself to go ...  train
4  negative  food was ok good but the service was terrible ...  train
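
To make the effect of the two regular expressions concrete, the sketch below applies the same two substitutions to a made-up sentence. `demo_preprocess` is a hypothetical copy of the preprocessing function above, written out so the example runs on its own. Note that digits, `#`, and apostrophes are all replaced by spaces, which is why `i'm` becomes `i m` in the output above.

```python
import re

def demo_preprocess(text):
    # Hypothetical stand-alone copy of the notebook's preprocessing steps
    text = text.lower()
    # Surround basic punctuation with spaces
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace runs of any other non-letter characters with one space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

# Made-up sample sentence for illustration
sample = "The #1 thing? I'm a Mclover!"
tokens = demo_preprocess(sample).split()
print(tokens)
```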
final_reviews.to_csv(args.output_munged_csv, index=False)
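
Passing `index=False` keeps the row index out of the file, so reading it back yields only the data columns. A small round-trip sketch, using an in-memory buffer and a hypothetical toy frame rather than the real output path, illustrates this:

```python
import io
import pandas as pd

# Hypothetical toy frame with the same columns as the final output
toy = pd.DataFrame({
    "rating": ["negative", "positive"],
    "review": ["food was ok", "great place"],
    "split": ["train", "val"],
})

# Write without the index, then read the text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```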