Text Classification Assessment - Solution
Goal: Given a set of text movie reviews that have been labeled negative or positive, train a model that predicts the label of a review.
For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/
Complete the tasks in bold below!
Task: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from '../DATA/moviereviews.csv'.
# CODE HERE
import numpy as np
import pandas as pd
df = pd.read_csv('../DATA/moviereviews.csv')
df.head()
TASK: Check to see if there are any missing values in the dataframe.
#CODE HERE
# Check for NaN values:
df.isnull().sum()
TASK: Remove any reviews that are NaN
df = df.dropna()
TASK: Check to see if any reviews are blank strings and not just NaN. Note: This means a review text could just be "" or " " or some longer run of whitespace. How would you check for this? There are many ways. Once you've found the reviews that are blank strings, remove them as well.
df['review'].str.isspace().sum()
df[df['review'].str.isspace()]
df = df[~df['review'].str.isspace()]
df.info()
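One caveat with the check above: `str.isspace()` returns False for an empty string, so a review that is exactly "" would not be flagged. With `pd.read_csv` defaults, empty fields are normally parsed as NaN and are already handled by `dropna()`, so this may never come up for this file. As a minimal sketch on a made-up DataFrame (the `toy` name and its rows are illustrative assumptions, not taken from the dataset), stripping whitespace and comparing against '' catches both cases:

```python
import pandas as pd

# Toy data (hypothetical rows, not from moviereviews.csv)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['great film', '', '   ', 'dull plot'],
})

# isspace() is False for the empty string, so only the whitespace-only row is flagged
only_whitespace = toy['review'].str.isspace()

# Stripping whitespace and comparing to '' flags both '' and '   '
blank = toy['review'].str.strip() == ''

# Keep only the rows that contain actual text
cleaned = toy[~blank]
print(cleaned)
```

On the real data, the same mask could be built from `df['review']` in place of `toy['review']`.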
TASK: Confirm the value counts per label:
#CODE HERE
df['label'].value_counts()
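Raw counts are easier to judge as proportions, since class balance affects how accuracy should be read later on. A small sketch, using a hypothetical `labels` Series rather than the real `df['label']` column:

```python
import pandas as pd

# Hypothetical labels for illustration only
labels = pd.Series(['pos', 'neg', 'pos', 'neg', 'pos', 'pos'])

# Absolute counts per class
counts = labels.value_counts()

# normalize=True returns the share of each class instead of the count
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```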
EDA on Bag of Words
Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type? Note: this is a bonus task, as we did not show this in the lectures, but a quick Google search should put you on the right path.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
matrix = cv.fit_transform(df[df['label']=='neg']['review'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
freqs = zip(cv.get_feature_names_out(), matrix.sum(axis=0).tolist()[0])
# sort from largest to smallest
print("Top 20 words used for Negative reviews.")
print(sorted(freqs, key=lambda x: -x[1])[:20])
matrix = cv.fit_transform(df[df['label']=='pos']['review'])
freqs = zip(cv.get_feature_names_out(), matrix.sum(axis=0).tolist()[0])
# sort from largest to smallest
print("Top 20 words used for Positive reviews.")
print(sorted(freqs, key=lambda x: -x[1])[:20])
Training and Data
TASK: Split the data into features and a label (X and y) and then perform a train/test split. You may use whatever settings you like. To compare your results to the solution notebook, use test_size=0.20, random_state=101
#CODE HERE
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
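Fixing `random_state` is what makes the results comparable to the solution notebook: the same seed yields the same split on every run. A minimal sketch on made-up data (the ten placeholder reviews are an assumption for illustration) checks both the split sizes and the reproducibility:

```python
from sklearn.model_selection import train_test_split

# Ten placeholder reviews with alternating labels (hypothetical data)
X_toy = [f"review {i}" for i in range(10)]
y_toy = ['pos', 'neg'] * 5

# test_size=0.2 holds out 20% of the rows, here 2 of 10
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101)

# Repeating the call with the same seed reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101)

print(len(Xa_train), len(Xa_test))
print(Xa_test == Xb_test)
```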
Training a Model
TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.
#CODE HERE
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
# Feed the training data through the pipeline
pipe.fit(X_train, y_train)
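Because the vectorizer lives inside the pipeline, a fitted pipeline accepts raw strings directly: `predict` applies the learned TF-IDF transform and then the classifier. A small sketch, assuming a made-up four-review training set (the `toy_pipe` name and all texts are hypothetical) in place of the real `X_train` and `y_train`:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training reviews and labels, for illustration only
train_texts = [
    "loved it, wonderful acting",
    "wonderful story, loved the cast",
    "terrible plot, awful acting",
    "awful and terrible film",
]
train_labels = ['pos', 'pos', 'neg', 'neg']

toy_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
toy_pipe.fit(train_texts, train_labels)

# Raw text goes straight in; no manual vectorizing step is needed
new_preds = toy_pipe.predict(["a wonderful film", "an awful story"])

# Individual steps stay accessible by name, e.g. to inspect the learned vocabulary
vocab_size = len(toy_pipe.named_steps['tfidf'].vocabulary_)
print(new_preds, vocab_size)
```

The same pattern applies to the fitted `pipe` above, for example `pipe.predict(["some new review text"])`.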
TASK: Create a classification report and plot a confusion matrix based on the results of your Pipeline.
#CODE HERE
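# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
preds = pipe.predict(X_test)
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)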
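To read the numbers behind the plot, `confusion_matrix` returns the same information as an array, with rows for true labels and columns for predicted labels. A minimal sketch on hypothetical labels (the `y_true` and `y_pred` lists are made up for illustration and are not results from the model above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels
y_true = ['neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'pos', 'pos', 'pos', 'neg']

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=['neg', 'pos'])

# Accuracy is the diagonal (correct predictions) divided by the total
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

Here the diagonal holds 1 correct 'neg' and 2 correct 'pos' predictions out of 5 reviews, so the accuracy is 0.6.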