NLP Project
Welcome to the NLP Project for this section of the course. In this NLP project you will be attempting to classify Yelp Reviews into 1 star or 5 star categories based off the text content in the reviews. This will be a simpler procedure than the lecture, since we will utilize the pipeline methods for more complex tasks.
We will use the Yelp Review Data Set from Kaggle.
Each observation in this dataset is a review of a particular business by a particular user.
The “stars” column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.
The “cool” column is the number of “cool” votes this review received from other Yelp users.
All reviews start with 0 “cool” votes, and there is no limit to how many “cool” votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
The “useful” and “funny” columns are similar to the “cool” column.
Let’s get started! Just follow the directions below!
Imports
**Import the usual suspects. :) **
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
%matplotlib inline
The Data
Read the yelp.csv file and set it as a dataframe called yelp.
yelp = pd.read_csv("yelp.csv")
** Check the head, info , and describe methods on yelp.**
yelp.head()
business_id | date | review_id | stars | text | type | user_id | cool | useful | funny | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |
yelp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 business_id 10000 non-null object
1 date 10000 non-null object
2 review_id 10000 non-null object
3 stars 10000 non-null int64
4 text 10000 non-null object
5 type 10000 non-null object
6 user_id 10000 non-null object
7 cool 10000 non-null int64
8 useful 10000 non-null int64
9 funny 10000 non-null int64
dtypes: int64(4), object(6)
memory usage: 781.4+ KB
yelp.describe()
stars | cool | useful | funny | |
---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 3.777500 | 0.876800 | 1.409300 | 0.701300 |
std | 1.214636 | 2.067861 | 2.336647 | 1.907942 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 4.000000 | 0.000000 | 1.000000 | 0.000000 |
75% | 5.000000 | 1.000000 | 2.000000 | 1.000000 |
max | 5.000000 | 77.000000 | 76.000000 | 57.000000 |
Create a new column called “text length” which is the number of words in the text column.
yelp['text length'] = yelp['text'].apply(lambda x:len(x))
yelp['word length'] = yelp['text'].apply(lambda x: len(x.split(' ')))
EDA
Let’s explore the data
Imports
Import the data visualization libraries if you haven’t done so already.
import seaborn as sns
sns.jointplot(data=yelp, x='text length', y='word length')
Use FacetGrid from the seaborn library to create a grid of 5 histograms of text length based off of the star ratings. Reference the seaborn documentation for hints on this
sns.displot(data=yelp, x="text length", col="stars")
Create a boxplot of text length for each star category.
plt.figure(figsize=[15,6])
sns.boxplot(data=yelp, x="stars", y="text length")
Create a countplot of the number of occurrences for each type of star rating.
plt.figure(figsize=[15,6])
sns.countplot(data=yelp, x="stars")
** Use groupby to get the mean values of the numerical columns, you should be able to create this dataframe with the operation:**
yelp.groupby('stars').mean()
cool | useful | funny | text length | word length | |
---|---|---|---|---|---|
stars | |||||
1 | 0.576769 | 1.604806 | 1.056075 | 826.515354 | 156.013351 |
2 | 0.719525 | 1.563107 | 0.875944 | 842.256742 | 158.508091 |
3 | 0.788501 | 1.306639 | 0.694730 | 758.498289 | 143.043806 |
4 | 0.954623 | 1.395916 | 0.670448 | 712.923142 | 132.921441 |
5 | 0.944261 | 1.381780 | 0.608631 | 624.999101 | 116.054840 |
star_mean = yelp.groupby('stars').mean().reset_index()
Use the corr() method on that groupby dataframe to produce this dataframe:
star_mean.corr()
stars | cool | useful | funny | text length | word length | |
---|---|---|---|---|---|---|
stars | 1.000000 | 0.964758 | -0.761288 | -0.950389 | -0.950171 | -0.953951 |
cool | 0.964758 | 1.000000 | -0.743329 | -0.944939 | -0.857664 | -0.865650 |
useful | -0.761288 | -0.743329 | 1.000000 | 0.894506 | 0.699881 | 0.690255 |
funny | -0.950389 | -0.944939 | 0.894506 | 1.000000 | 0.843461 | 0.844066 |
text length | -0.950171 | -0.857664 | 0.699881 | 0.843461 | 1.000000 | 0.999648 |
word length | -0.953951 | -0.865650 | 0.690255 | 0.844066 | 0.999648 | 1.000000 |
Then use seaborn to create a heatmap based off that .corr() dataframe:
plt.figure(figsize=[14,10])
sns.heatmap(star_mean.corr(), annot=True)
NLP Classification Task
Let’s move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.
Create a dataframe called yelp_class that contains the columns of yelp dataframe but for only the 1 or 5 star reviews.
yelp_class = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]
** Create two objects X and y. X will be the ‘text’ column of yelp_class and y will be the ‘stars’ column of yelp_class. (Your features and target/labels)**
X = yelp_class['text']
y = yelp_class['stars']
Import CountVectorizer and create a CountVectorizer object.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
** Use the fit_transform method on the CountVectorizer object and pass in X (the ‘text’ column). Save this result by overwriting X.**
X = cv.fit_transform(X)
Train Test Split
Let’s split our data into training and testing data.
** Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101 **
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Training a Model
Time to train a model!
** Import MultinomialNB and create an instance of the estimator and call is nb **
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
Now fit nb using the training data.
nb.fit(X_train, y_train)
MultinomialNB()
Predictions and Evaluations
Time to see how our model did!
Use the predict method off of nb to predict labels from X_test.
pred_nb = nb.predict(X_test)
** Create a confusion matrix and classification report using these predictions and y_test **
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(pred_nb, y_test))
print(classification_report(pred_nb, y_test))
[[159 22]
[ 69 976]]
precision recall f1-score support
1 0.70 0.88 0.78 181
5 0.98 0.93 0.96 1045
accuracy 0.93 1226
macro avg 0.84 0.91 0.87 1226
weighted avg 0.94 0.93 0.93 1226
Great! Let’s see what happens if we try to include TF-IDF to this process using a pipeline.
Using Text Processing
** Import TfidfTransformer from sklearn. **
from sklearn.feature_extraction.text import TfidfTransformer
** Import Pipeline from sklearn. **
from sklearn.pipeline import Pipeline
** Now create a pipeline with the following steps:CountVectorizer(), TfidfTransformer(),MultinomialNB()**
pipeline = Pipeline([
('bow', CountVectorizer()), # strings to token integer counts
('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores
('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier
])
Using the Pipeline
Time to use the pipeline! Remember this pipeline has all your pre-process steps in it already, meaning we’ll need to re-split the original data (Remember that we overwrote X as the CountVectorized version. What we need is just the text
Train Test Split
Redo the train test split on the yelp_class object.
X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)
Now fit the pipeline to the training data. Remember you can’t use the same training data as last time because that data has already been vectorized. We need to pass in just the text and labels
pipeline.fit(X_train, y_train)
Pipeline(steps=[('bow', CountVectorizer()), ('tfidf', TfidfTransformer()),
('classifier', MultinomialNB())])
Predictions and Evaluation
** Now use the pipeline to predict from the X_test and create a classification report and confusion matrix. You should notice strange results.**
pred_pipe = pipeline.predict(X_test)
print(confusion_matrix(pred_pipe, y_test))
print(classification_report(pred_pipe, y_test))
[[ 0 0]
[228 998]]
precision recall f1-score support
1 0.00 0.00 0.00 0
5 1.00 0.81 0.90 1226
accuracy 0.81 1226
macro avg 0.50 0.41 0.45 1226
weighted avg 1.00 0.81 0.90 1226
c:\users\saltfarmer\appdata\local\programs\python\python38\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Looks like Tf-Idf actually made things worse! That is it for this project. But there is still a lot more you can play with:
Some other things to try…. Try going back and playing around with the pipeline steps and seeing if creating a custom analyzer like we did in the lecture helps (note: it probably won’t). Or recreate the pipeline with just the CountVectorizer() and NaiveBayes. Does changing the ML model at the end to another classifier help at all?
Leave a comment