3. Yelp Dataset at a glance

The Yelp dataset pairs review texts with sentiment labels (positive or negative). In this notebook, we take a first look at the dataset by loading a CSV file into a pandas DataFrame, which gives us a baseline that motivates the more object-oriented data handling in PyTorch.

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

3.1. Group all initial settings together

args = Namespace(
    raw_train_dataset_csv="../data/yelp/raw_train.csv",
    raw_test_dataset_csv="../data/yelp/raw_test.csv",
    train_proportion=0.7,
    val_proportion=0.3,
    output_munged_csv="../data/yelp/reviews_with_splits_full.csv",
    seed=1337
)

3.2. Use `pandas.read_csv` to process CSV files

# Read raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]

train_reviews.head()

	rating	review
0	1	Unfortunately, the frustration of being Dr. Go...
1	2	Been going to Dr. Goldberg for over 10 years. ...
2	1	I don't know what Dr. Goldberg was like before...
3	1	I'm writing this review to give you a heads up...
4	2	All the food is great here. But the best thing...

test_reviews.head()

	rating	review
0	1	Ordered a large Mango-Pineapple smoothie. Stay...
1	2	Quite a surprise! \n\nMy wife and I loved thi...
2	1	First I will say, this is a nice atmosphere an...
3	2	I was overall pretty impressed by this hotel. ...
4	1	Video link at bottom review. Worst service I h...

# Unique classes
set(train_reviews.rating)

{1, 2}
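Beyond the set of unique labels, it can be useful to check how many examples each class has. The following is a minimal sketch on a small in-memory CSV; the `sample_csv` contents are made up for illustration and are not taken from the Yelp files.

```python
import io

import pandas as pd

# Hypothetical stand-in for raw_train.csv: no header row, two columns.
sample_csv = io.StringIO(
    '1,"Terrible service."\n'
    '2,"Loved the food!"\n'
    '2,\n'
    '1,"Would not return."\n'
)

reviews = pd.read_csv(sample_csv, header=None, names=["rating", "review"])

# Drop rows whose review text is missing, as in the cells above.
reviews = reviews[reviews.review.notnull()]

# Count examples per class to see whether the labels are balanced.
class_counts = reviews.rating.value_counts().sort_index()
print(class_counts)
```

On the real data, the same `value_counts()` call on `train_reviews.rating` would show whether the two classes are roughly the same size before splitting.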

3.3. Splitting the training dataset

To tune hyperparameters, we usually hold out part of the original training data as a validation set. Here, 70% of each rating class goes to training and the remaining 30% to validation.

# Splitting train by rating
# Create dict
by_rating = collections.defaultdict(list)
for _, row in train_reviews.iterrows():
    by_rating[row.rating].append(row.to_dict())

# Create split data
final_list = []
np.random.seed(args.seed)

for _, item_list in sorted(by_rating.items()):

    np.random.shuffle(item_list)
    
    n_total = len(item_list)
    n_train = int(args.train_proportion * n_total)
    n_val = int(args.val_proportion * n_total)
    
    # Give data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'

    # Add to final list
    final_list.extend(item_list)

for _, row in test_reviews.iterrows():
    row_dict = row.to_dict()
    row_dict['split'] = 'test'
    final_list.append(row_dict)
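One detail of the split above is worth noting: because `int()` truncates, `n_train + n_val` can fall short of `n_total` for some class sizes, which would leave a few rows without a `split` value. The counts in this notebook divide evenly, so it does not happen here. A minimal sketch of one way to guard against it is to assign everything after the training slice to validation; the helper name `assign_splits` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np


def assign_splits(items, train_proportion, rng):
    """Shuffle items and label each one 'train' or 'val'.

    Everything after the first n_train items goes to 'val', so no
    item is left unlabeled when the proportions do not divide evenly.
    """
    items = list(items)
    rng.shuffle(items)
    n_train = int(train_proportion * len(items))
    for i, item in enumerate(items):
        item["split"] = "train" if i < n_train else "val"
    return items


# Toy data: 11 items, a size for which int(0.7 * 11) + int(0.3 * 11) is 10.
rng = np.random.default_rng(1337)
toy_items = [{"rating": 1, "review": f"review {i}"} for i in range(11)]
labeled = assign_splits(toy_items, train_proportion=0.7, rng=rng)

n_train = sum(item["split"] == "train" for item in labeled)
n_val = sum(item["split"] == "val" for item in labeled)
print(n_train, n_val)
```

This sketch uses `np.random.default_rng` rather than the global `np.random.seed`, so it will not reproduce the exact shuffle order of the cell above.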

3.4. Simple preprocessing and writing to file

final_reviews = pd.DataFrame(final_list)
final_reviews.split.value_counts()

train    392000
val      168000
test      38000
Name: split, dtype: int64

final_reviews.review.head()

  The entrance was the #1 impressive thing about...
  I'm a Mclover, and I had no problem\nwith the ...
  Less than good here, not terrible, but I see n...
  I don't know if I can ever bring myself to go ...
  Food was OK/Good but the service was terrible....
Name: review, dtype: object

final_reviews[pd.isnull(final_reviews.review)]

	rating	review	split

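# Preprocess the reviews
# This is deliberately simple; better preprocessing is possible
def preprocess_text(text):
    # Debugging aid: a missing review is read as float NaN, so print any that slipped through
    if isinstance(text, float):
        print(text)
    text = text.lower()
    # Pad sentence punctuation with spaces so each mark becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace every run of other non-letter characters with a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text
    
final_reviews.review = final_reviews.review.apply(preprocess_text)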
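To see what the two regular expressions do, it may help to run them on a single string. The sketch below restates the lowercasing and regex steps of `preprocess_text` on a made-up sentence (`sample` is an illustrative input, not a row from the dataset).

```python
import re


def preprocess_text(text):
    text = text.lower()
    # Surround . , ! ? with spaces so they split off as separate tokens.
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace any run of characters other than letters and . , ! ?
    # (digits, slashes, '#', apostrophes, whitespace) with one space.
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text


sample = "Food was OK/Good, but the #1 problem? Service!"
cleaned = preprocess_text(sample)
tokens = cleaned.split()
print(tokens)
```

Digits and symbols such as `#1` and `/` disappear, and apostrophes turn into spaces, which is why a word like "don't" shows up as "don t" in the processed reviews.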

final_reviews['rating'] = final_reviews.rating.apply({1: 'negative', 2: 'positive'}.get)
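The line above converts numeric ratings to label names by applying a dictionary's `get` method to each value. An equivalent way to write this, shown here as a sketch on a toy Series rather than on `final_reviews`, is `Series.map` with the dictionary itself; the names `ratings` and `label_names` are illustrative.

```python
import pandas as pd

ratings = pd.Series([1, 2, 2, 1])
label_names = {1: "negative", 2: "positive"}

# Series.map looks each value up in the dict; values with no
# matching key would become NaN.
labels = ratings.map(label_names)
print(labels.tolist())
```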

final_reviews.head()

	rating	review	split
0	negative	the entrance was the impressive thing about th...	train
1	negative	i m a mclover , and i had no problem nwith the...	train
2	negative	less than good here , not terrible , but i see...	train
3	negative	i don t know if i can ever bring myself to go ...	train
4	negative	food was ok good but the service was terrible ...	train

final_reviews.to_csv(args.output_munged_csv, index=False)
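Passing `index=False` keeps the row index out of the file, so reading it back yields only the `rating`, `review`, and `split` columns. A minimal round-trip sketch, using a tiny made-up frame and a temporary directory instead of the real output path:

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same three columns as final_reviews.
frame = pd.DataFrame(
    {
        "rating": ["negative", "positive"],
        "review": ["bad food . ", "great place ! "],
        "split": ["train", "val"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this demonstration.
    csv_path = os.path.join(tmp_dir, "reviews_with_splits_demo.csv")
    frame.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.columns.tolist())
print(reloaded["split"].value_counts().to_dict())
```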
