{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"\n",
"\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Classification Assessment \n",
"\n",
"### Goal: Given a set of text movie reviews that have been labeled negative or positive, build a model that predicts the label of a review\n",
"\n",
"For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/\n",
"\n",
"## Complete the tasks in bold below!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task: Perform imports and load the dataset into a pandas DataFrame**\n",
"For this exercise you can load the dataset from `'../DATA/moviereviews.csv'`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('../DATA/moviereviews.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>label</th><th>review</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>neg</td><td>how do films like mouse hunt get into theatres...</td></tr>\n",
"    <tr><th>1</th><td>neg</td><td>some talented actresses are blessed with a dem...</td></tr>\n",
"    <tr><th>2</th><td>pos</td><td>this has been an extraordinary year for austra...</td></tr>\n",
"    <tr><th>3</th><td>pos</td><td>according to hollywood movies made in last few...</td></tr>\n",
"    <tr><th>4</th><td>neg</td><td>my first press screening of 1998 and already i...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"0 neg how do films like mouse hunt get into theatres...\n",
"1 neg some talented actresses are blessed with a dem...\n",
"2 pos this has been an extraordinary year for austra...\n",
"3 pos according to hollywood movies made in last few...\n",
"4 neg my first press screening of 1998 and already i..."
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Check to see if there are any missing values in the dataframe.**"
]
},
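{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to count missing values, sketched on a small made-up DataFrame (`toy_df` and its contents are hypothetical, not the course data). `isnull()` flags missing entries and `sum()` totals them per column.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# toy_df is a made-up stand-in for the real reviews DataFrame\n",
"toy_df = pd.DataFrame({\n",
"    'label': ['neg', 'pos', 'pos'],\n",
"    'review': ['dull plot', np.nan, 'a great film'],\n",
"})\n",
"\n",
"# isnull() marks missing entries; sum() counts them per column\n",
"missing_per_column = toy_df.isnull().sum()\n",
"print(missing_per_column)\n",
"```\n",
"\n",
"On the real data the same chain would be called on `df`."
]
},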
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"label 0\n",
"review 35\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Remove any reviews that are NaN**"
]
},
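{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of dropping rows whose review is missing, using a made-up `toy_df` (hypothetical data, not the course file). Passing `subset` restricts the check to the named column, which is an optional choice here rather than a requirement of the task.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# toy_df is a made-up stand-in for the real reviews DataFrame\n",
"toy_df = pd.DataFrame({\n",
"    'label': ['neg', 'pos', 'pos'],\n",
"    'review': ['dull plot', np.nan, 'a great film'],\n",
"})\n",
"\n",
"# Drop every row where the review column is NaN\n",
"cleaned = toy_df.dropna(subset=['review'])\n",
"\n",
"# Compare row counts before and after the drop\n",
"print(len(toy_df), len(cleaned))\n",
"```"
]
},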
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Check to see if any reviews are blank strings and not just NaN. Note: This means a review text could just be: \"\" or \" \" or some other larger blank string. How would you check for this? Note: There are many ways! Once you've discovered the reviews that are blank strings, go ahead and remove them as well. [Click me for a big hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.isspace.html)**"
]
},
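{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the many possible checks, sketched on a made-up `toy_df` (hypothetical data). Note that `str.isspace()` returns False for the empty string `''`, so this sketch compares the stripped text to `''` instead, which catches both empty and whitespace-only reviews. It assumes NaN rows have already been removed.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy_df is a made-up stand-in for the real reviews DataFrame\n",
"toy_df = pd.DataFrame({\n",
"    'label': ['neg', 'pos', 'pos', 'neg'],\n",
"    'review': ['dull plot', '   ', '', 'fine acting'],\n",
"})\n",
"\n",
"# True where the review is empty or contains only whitespace\n",
"is_blank = toy_df['review'].str.strip() == ''\n",
"print(is_blank.sum())\n",
"\n",
"# Keep only the rows that contain real text\n",
"toy_df = toy_df[~is_blank]\n",
"print(toy_df)\n",
"```"
]
},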
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"27"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>label</th><th>review</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>57</th><td>neg</td><td></td></tr>\n",
"    <tr><th>71</th><td>pos</td><td></td></tr>\n",
"    <tr><th>147</th><td>pos</td><td></td></tr>\n",
"    <tr><th>151</th><td>pos</td><td></td></tr>\n",
"    <tr><th>283</th><td>pos</td><td></td></tr>\n",
"    <tr><th>307</th><td>pos</td><td></td></tr>\n",
"    <tr><th>313</th><td>neg</td><td></td></tr>\n",
"    <tr><th>323</th><td>pos</td><td></td></tr>\n",
"    <tr><th>343</th><td>pos</td><td></td></tr>\n",
"    <tr><th>351</th><td>neg</td><td></td></tr>\n",
"    <tr><th>427</th><td>pos</td><td></td></tr>\n",
"    <tr><th>501</th><td>neg</td><td></td></tr>\n",
"    <tr><th>633</th><td>pos</td><td></td></tr>\n",
"    <tr><th>675</th><td>neg</td><td></td></tr>\n",
"    <tr><th>815</th><td>neg</td><td></td></tr>\n",
"    <tr><th>851</th><td>neg</td><td></td></tr>\n",
"    <tr><th>977</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1079</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1299</th><td>pos</td><td></td></tr>\n",
"    <tr><th>1455</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1493</th><td>pos</td><td></td></tr>\n",
"    <tr><th>1525</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1531</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1763</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1851</th><td>neg</td><td></td></tr>\n",
"    <tr><th>1905</th><td>pos</td><td></td></tr>\n",
"    <tr><th>1993</th><td>pos</td><td></td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"57 neg \n",
"71 pos \n",
"147 pos \n",
"151 pos \n",
"283 pos \n",
"307 pos \n",
"313 neg \n",
"323 pos \n",
"343 pos \n",
"351 neg \n",
"427 pos \n",
"501 neg \n",
"633 pos \n",
"675 neg \n",
"815 neg \n",
"851 neg \n",
"977 neg \n",
"1079 neg \n",
"1299 pos \n",
"1455 neg \n",
"1493 pos \n",
"1525 neg \n",
"1531 neg \n",
"1763 neg \n",
"1851 neg \n",
"1905 pos \n",
"1993 pos "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 1938 entries, 0 to 1999\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 label 1938 non-null object\n",
" 1 review 1938 non-null object\n",
"dtypes: object(2)\n",
"memory usage: 45.4+ KB\n"
]
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Confirm the value counts per label:**"
]
},
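{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of counting rows per label with `value_counts()`, shown on a made-up `toy_df` (hypothetical data, not the course file).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy_df is a made-up stand-in for the cleaned reviews DataFrame\n",
"toy_df = pd.DataFrame({\n",
"    'label': ['neg', 'pos', 'pos', 'neg'],\n",
"    'review': ['dull plot', 'great cast', 'moving story', 'weak ending'],\n",
"})\n",
"\n",
"# Number of rows for each distinct label\n",
"print(toy_df['label'].value_counts())\n",
"```"
]
},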
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pos 969\n",
"neg 969\n",
"Name: label, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## EDA on Bag of Words\n",
"\n",
"**Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type? Note: this is a bonus task, as we did not show this in the lectures, but a quick Google search should put you on the right path. [Click me for a big hint](https://stackoverflow.com/questions/16288497/find-the-most-common-term-in-scikit-learn-classifier)**"
]
},
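{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of one way to rank words by total count with `CountVectorizer`. The list `toy_reviews` and the names `cv`, `totals` and `top_words` are made up for illustration. On the real data the same steps would presumably be run once per label, for example on `df[df['label'] == 'neg']['review']`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Made-up reviews standing in for the reviews of a single label\n",
"toy_reviews = [\n",
"    'great story and great acting',\n",
"    'the story was moving',\n",
"    'great cast',\n",
"]\n",
"\n",
"# stop_words='english' drops common words such as 'the' and 'and'\n",
"cv = CountVectorizer(stop_words='english')\n",
"counts = cv.fit_transform(toy_reviews)\n",
"\n",
"# Sum each column of the sparse document-term matrix to get total counts\n",
"totals = np.asarray(counts.sum(axis=0)).ravel()\n",
"\n",
"# vocabulary_ maps each word to its column index in the matrix\n",
"word_totals = [(word, int(totals[idx])) for word, idx in cv.vocabulary_.items()]\n",
"\n",
"# Sort from most to least frequent and keep at most 20 entries\n",
"top_words = sorted(word_totals, key=lambda pair: pair[1], reverse=True)[:20]\n",
"print(top_words)\n",
"```"
]
},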
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top 20 words used for Negative reviews.\n",
"[('film', 4063), ('movie', 3131), ('like', 1808), ('just', 1480), ('time', 1127), ('good', 1117), ('bad', 997), ('character', 926), ('story', 908), ('plot', 888), ('characters', 838), ('make', 813), ('really', 743), ('way', 734), ('little', 696), ('don', 683), ('does', 666), ('doesn', 648), ('action', 635), ('scene', 634)]\n"
]
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top 20 words used for Positive reviews.\n",
"[('film', 5002), ('movie', 2389), ('like', 1721), ('just', 1273), ('story', 1199), ('good', 1193), ('time', 1175), ('character', 1037), ('life', 1032), ('characters', 957), ('way', 864), ('films', 851), ('does', 828), ('best', 788), ('people', 769), ('make', 764), ('little', 751), ('really', 731), ('man', 728), ('new', 702)]\n"
]
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training and Data\n",
"\n",
"**TASK: Split the data into features and a label (X and y) and then perform a train/test split. You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.20, random_state=101`**"
]
},
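{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the split, using the settings named in the task on a made-up `toy_df` (hypothetical data). The column names `review` and `label` match the DataFrame shown earlier in this notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# toy_df is a made-up stand-in for the cleaned reviews DataFrame\n",
"toy_df = pd.DataFrame({\n",
"    'label': ['neg', 'pos'] * 5,\n",
"    'review': ['made-up review number %d' % i for i in range(10)],\n",
"})\n",
"\n",
"# Features are the raw review text; the target is the label column\n",
"X = toy_df['review']\n",
"y = toy_df['label']\n",
"\n",
"# Hold out 20% of the rows for testing, with a fixed seed for repeatability\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.20, random_state=101)\n",
"print(len(X_train), len(X_test))\n",
"```"
]
},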
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training a Model\n",
"\n",
"**TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.**"
]
},
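{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of such a pipeline. The step names and the `LinearSVC` choice mirror the `Pipeline(...)` output recorded further down in this notebook, but any supervised classifier would fit the task. The training texts and labels here are made up for illustration.\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"# Made-up training data standing in for X_train and y_train\n",
"train_texts = ['awful boring mess', 'wonderful moving story',\n",
"               'boring and awful', 'a wonderful film']\n",
"train_labels = ['neg', 'pos', 'neg', 'pos']\n",
"\n",
"# The vectorizer turns raw text into TF-IDF features for the classifier\n",
"pipe = Pipeline([\n",
"    ('tfidf', TfidfVectorizer()),\n",
"    ('svc', LinearSVC()),\n",
"])\n",
"\n",
"# Fitting the pipeline fits the vectorizer and the classifier in order\n",
"pipe.fit(train_texts, train_labels)\n",
"print(pipe.predict(['boring mess']))\n",
"```\n",
"\n",
"Because the vectorizer lives inside the pipeline, `predict` accepts raw strings directly."
]
},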
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**TASK: Create a classification report and plot a confusion matrix based on the results of your Pipeline.**"
]
},
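{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the reporting step, assuming matplotlib is installed. The lists `y_true` and `y_pred` are made up; in the exercise they would presumably be `y_test` and the pipeline's predictions on `X_test`. `ConfusionMatrixDisplay` is used here as one plotting option among several.\n",
"\n",
"```python\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import (ConfusionMatrixDisplay, classification_report,\n",
"                             confusion_matrix)\n",
"\n",
"# Made-up true labels and predictions for illustration\n",
"y_true = ['neg', 'neg', 'pos', 'pos', 'pos']\n",
"y_pred = ['neg', 'pos', 'pos', 'pos', 'neg']\n",
"\n",
"# Precision, recall and f1-score for each label\n",
"print(classification_report(y_true, y_pred))\n",
"\n",
"# Rows are true labels, columns are predicted labels, in the order given\n",
"labels = ['neg', 'pos']\n",
"cm = confusion_matrix(y_true, y_pred, labels=labels)\n",
"\n",
"# Draw the matrix as a labeled heat map\n",
"ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels).plot()\n",
"plt.show()\n",
"```"
]
},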
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"#CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" neg 0.81 0.86 0.83 191\n",
" pos 0.85 0.81 0.83 197\n",
"\n",
" accuracy 0.83 388\n",
" macro avg 0.83 0.83 0.83 388\n",
"weighted avg 0.83 0.83 0.83 388\n",
"\n"
]
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"