{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NLP and Supervised Learning\n",
"## Classification of Text Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Data\n",
"\n",
"Source: https://www.kaggle.com/crowdflower/twitter-airline-sentiment?select=Tweets.csv\n",
"\n",
"This data originally came from Crowdflower's Data for Everyone library.\n",
"\n",
"As the original source says:\n",
"\n",
"> A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as \"late flight\" or \"rude service\").\n",
"\n",
"#### The Goal: Create a machine learning algorithm that can predict whether a tweet is positive, neutral, or negative. In the future, we could use such an algorithm to automatically read and flag tweets so that an airline's customer service agent can reach out to the customer."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../DATA/airline_tweets.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tweet_id</th>\n",
" <th>airline_sentiment</th>\n",
" <th>airline_sentiment_confidence</th>\n",
" <th>negativereason</th>\n",
" <th>negativereason_confidence</th>\n",
" <th>airline</th>\n",
" <th>airline_sentiment_gold</th>\n",
" <th>name</th>\n",
" <th>negativereason_gold</th>\n",
" <th>retweet_count</th>\n",
" <th>text</th>\n",
" <th>tweet_coord</th>\n",
" <th>tweet_created</th>\n",
" <th>tweet_location</th>\n",
" <th>user_timezone</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>570306133677760513</td>\n",
" <td>neutral</td>\n",
" <td>1.0000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Virgin America</td>\n",
" <td>NaN</td>\n",
" <td>cairdin</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>@VirginAmerica What @dhepburn said.</td>\n",
" <td>NaN</td>\n",
" <td>2015-02-24 11:35:52 -0800</td>\n",
" <td>NaN</td>\n",
" <td>Eastern Time (US & Canada)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>570301130888122368</td>\n",
" <td>positive</td>\n",
" <td>0.3486</td>\n",
" <td>NaN</td>\n",
" <td>0.0000</td>\n",
" <td>Virgin America</td>\n",
" <td>NaN</td>\n",
" <td>jnardino</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>@VirginAmerica plus you've added commercials t...</td>\n",
" <td>NaN</td>\n",
" <td>2015-02-24 11:15:59 -0800</td>\n",
" <td>NaN</td>\n",
" <td>Pacific Time (US & Canada)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>570301083672813571</td>\n",
" <td>neutral</td>\n",
" <td>0.6837</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Virgin America</td>\n",
" <td>NaN</td>\n",
" <td>yvonnalynn</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>@VirginAmerica I didn't today... Must mean I n...</td>\n",
" <td>NaN</td>\n",
" <td>2015-02-24 11:15:48 -0800</td>\n",
" <td>Lets Play</td>\n",
" <td>Central Time (US & Canada)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>570301031407624196</td>\n",
" <td>negative</td>\n",
" <td>1.0000</td>\n",
" <td>Bad Flight</td>\n",
" <td>0.7033</td>\n",
" <td>Virgin America</td>\n",
" <td>NaN</td>\n",
" <td>jnardino</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>@VirginAmerica it's really aggressive to blast...</td>\n",
" <td>NaN</td>\n",
" <td>2015-02-24 11:15:36 -0800</td>\n",
" <td>NaN</td>\n",
" <td>Pacific Time (US & Canada)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>570300817074462722</td>\n",
" <td>negative</td>\n",
" <td>1.0000</td>\n",
" <td>Can't Tell</td>\n",
" <td>1.0000</td>\n",
" <td>Virgin America</td>\n",
" <td>NaN</td>\n",
" <td>jnardino</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>@VirginAmerica and it's a really big bad thing...</td>\n",
" <td>NaN</td>\n",
" <td>2015-02-24 11:14:45 -0800</td>\n",
" <td>NaN</td>\n",
" <td>Pacific Time (US & Canada)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" tweet_id airline_sentiment airline_sentiment_confidence \\\n",
"0 570306133677760513 neutral 1.0000 \n",
"1 570301130888122368 positive 0.3486 \n",
"2 570301083672813571 neutral 0.6837 \n",
"3 570301031407624196 negative 1.0000 \n",
"4 570300817074462722 negative 1.0000 \n",
"\n",
" negativereason negativereason_confidence airline \\\n",
"0 NaN NaN Virgin America \n",
"1 NaN 0.0000 Virgin America \n",
"2 NaN NaN Virgin America \n",
"3 Bad Flight 0.7033 Virgin America \n",
"4 Can't Tell 1.0000 Virgin America \n",
"\n",
" airline_sentiment_gold name negativereason_gold retweet_count \\\n",
"0 NaN cairdin NaN 0 \n",
"1 NaN jnardino NaN 0 \n",
"2 NaN yvonnalynn NaN 0 \n",
"3 NaN jnardino NaN 0 \n",
"4 NaN jnardino NaN 0 \n",
"\n",
" text tweet_coord \\\n",
"0 @VirginAmerica What @dhepburn said. NaN \n",
"1 @VirginAmerica plus you've added commercials t... NaN \n",
"2 @VirginAmerica I didn't today... Must mean I n... NaN \n",
"3 @VirginAmerica it's really aggressive to blast... NaN \n",
"4 @VirginAmerica and it's a really big bad thing... NaN \n",
"\n",
" tweet_created tweet_location user_timezone \n",
"0 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada) \n",
"1 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada) \n",
"2 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada) \n",
"3 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada) \n",
"4 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada) "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='airline', ylabel='count'>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "(base64-encoded PNG omitted: count plot of tweets per airline, split by airline_sentiment)",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(data=df,x='airline',hue='airline_sentiment')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted: count plot of tweets per negativereason)",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(data=df,x='negativereason')\n",
"plt.xticks(rotation=90);"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='airline_sentiment', ylabel='count'>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "(base64-encoded PNG omitted: count plot of tweets per airline_sentiment)",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(data=df,x='airline_sentiment')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"negative 9178\n",
"neutral 3099\n",
"positive 2363\n",
"Name: airline_sentiment, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['airline_sentiment'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features and Label"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"data = df[['airline_sentiment','text']]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>airline_sentiment</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>neutral</td>\n",
" <td>@VirginAmerica What @dhepburn said.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>positive</td>\n",
" <td>@VirginAmerica plus you've added commercials t...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>neutral</td>\n",
" <td>@VirginAmerica I didn't today... Must mean I n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>negative</td>\n",
" <td>@VirginAmerica it's really aggressive to blast...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>negative</td>\n",
" <td>@VirginAmerica and it's a really big bad thing...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" airline_sentiment text\n",
"0 neutral @VirginAmerica What @dhepburn said.\n",
"1 positive @VirginAmerica plus you've added commercials t...\n",
"2 neutral @VirginAmerica I didn't today... Must mean I n...\n",
"3 negative @VirginAmerica it's really aggressive to blast...\n",
"4 negative @VirginAmerica and it's a really big bad thing..."
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"y = df['airline_sentiment']\n",
"X = df['text']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train Test Split"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)"
]
},
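{
"cell_type": "markdown",
"metadata": {},
"source": [
"The value counts earlier in the notebook show that the classes are imbalanced (9178 negative, 3099 neutral, 2363 positive). As an optional variation, which the rest of this notebook does not use, a stratified split keeps those proportions the same in the training and test sets. The cell below is a minimal sketch, assuming the `X` and `y` defined above; the `_strat` variable names are hypothetical and exist only for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional variation: stratify on the label so both splits keep the class proportions\n",
"X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(\n",
"    X, y, test_size=0.2, random_state=101, stratify=y)\n",
"\n",
"# Compare the class proportions of the two splits\n",
"print(y_train_strat.value_counts(normalize=True))\n",
"print(y_test_strat.value_counts(normalize=True))"
]
},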
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorization"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"tfidf = TfidfVectorizer(stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"TfidfVectorizer(stop_words='english')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfidf.fit(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"X_train_tfidf = tfidf.transform(X_train)\n",
"X_test_tfidf = tfidf.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11712x12971 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 107073 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train_tfidf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Do not call `.todense()` on a sparse matrix this large: the dense copy would store every zero explicitly and use far more memory.**"
]
},
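{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why, compare the memory the sparse matrix actually holds with what a dense copy would need. With the shape reported above (11712 x 12971, float64), a dense array would hold about 152 million values, roughly 1.2 GB, while only 107073 entries are actually stored. The cell below is a minimal sketch, assuming the `X_train_tfidf` matrix from the cells above; `sparse_bytes` and `dense_bytes` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Bytes held by the CSR matrix: the stored values plus its two index arrays\n",
"sparse_bytes = (X_train_tfidf.data.nbytes\n",
"                + X_train_tfidf.indices.nbytes\n",
"                + X_train_tfidf.indptr.nbytes)\n",
"\n",
"# Bytes a dense copy would need: one value per cell, including every zero\n",
"n_rows, n_cols = X_train_tfidf.shape\n",
"dense_bytes = n_rows * n_cols * X_train_tfidf.dtype.itemsize\n",
"\n",
"print(f'sparse: {sparse_bytes / 1e6:.1f} MB')\n",
"print(f'dense:  {dense_bytes / 1e6:.1f} MB')"
]
},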
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Comparisons - Naive Bayes, LogisticRegression, LinearSVC"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB()"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"nb = MultinomialNB()\n",
"nb.fit(X_train_tfidf,y_train)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(max_iter=1000)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"log = LogisticRegression(max_iter=1000)\n",
"log.fit(X_train_tfidf,y_train)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearSVC()"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.svm import LinearSVC\n",
"svc = LinearSVC()\n",
"svc.fit(X_train_tfidf,y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Performance Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# ConfusionMatrixDisplay.from_estimator supersedes plot_confusion_matrix (removed in scikit-learn 1.2)\n",
"from sklearn.metrics import ConfusionMatrixDisplay,classification_report"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"def report(model):\n",
"    # Print per-class metrics, then plot the confusion matrix on the test set\n",
"    preds = model.predict(X_test_tfidf)\n",
"    print(classification_report(y_test,preds))\n",
"    ConfusionMatrixDisplay.from_estimator(model,X_test_tfidf,y_test)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NB MODEL\n",
" precision recall f1-score support\n",
"\n",
" negative 0.66 0.99 0.79 1817\n",
" neutral 0.79 0.15 0.26 628\n",
" positive 0.89 0.14 0.24 483\n",
"\n",
" accuracy 0.67 2928\n",
" macro avg 0.78 0.43 0.43 2928\n",
"weighted avg 0.73 0.67 0.59 2928\n",
"\n"
]
},
{
"data": {
"text/plain": [
|
||
|
"<Figure size 432x288 with 2 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {
|
||
|
"needs_background": "light"
|
||
|
},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"print(\"NB MODEL\")\n",
|
||
|
"report(nb)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 24,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"Logistic Regression\n",
|
||
|
" precision recall f1-score support\n",
|
||
|
"\n",
|
||
|
" negative 0.80 0.93 0.86 1817\n",
|
||
|
" neutral 0.63 0.47 0.54 628\n",
|
||
|
" positive 0.82 0.58 0.68 483\n",
|
||
|
"\n",
|
||
|
" accuracy 0.77 2928\n",
|
||
|
" macro avg 0.75 0.66 0.69 2928\n",
|
||
|
"weighted avg 0.77 0.77 0.76 2928\n",
|
||
|
"\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"<Figure size 432x288 with 2 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {
|
||
|
"needs_background": "light"
|
||
|
},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"print(\"Logistic Regression\")\n",
|
||
|
"report(log)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 25,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"SVC\n",
|
||
|
" precision recall f1-score support\n",
|
||
|
"\n",
|
||
|
" negative 0.82 0.89 0.86 1817\n",
|
||
|
" neutral 0.59 0.52 0.55 628\n",
|
||
|
" positive 0.76 0.64 0.69 483\n",
|
||
|
"\n",
|
||
|
" accuracy 0.77 2928\n",
|
||
|
" macro avg 0.73 0.68 0.70 2928\n",
|
||
|
"weighted avg 0.76 0.77 0.77 2928\n",
|
||
|
"\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"<Figure size 432x288 with 2 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {
|
||
|
"needs_background": "light"
|
||
|
},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"print('SVC')\n",
|
||
|
"report(svc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"### Finalizing a Pipeline for Deployment on New Tweets\n",
|
||
|
"\n",
|
||
|
"Once we are satisfied with a model's performance, we can set up a pipeline that takes raw tweet text directly, with no separate vectorization step."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 26,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn.pipeline import Pipeline"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 27,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"# Chain TF-IDF vectorization and a linear SVM so raw text goes in and a label comes out\n",
"pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 28,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"Pipeline(steps=[('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 28,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# Fit on the full dataset (all tweets and labels), not only the training split\n",
"pipe.fit(df['text'], df['airline_sentiment'])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 29,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array(['positive'], dtype=object)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 29,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"new_tweet = ['good flight']\n",
|
||
|
"pipe.predict(new_tweet)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 30,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array(['negative'], dtype=object)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 30,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"new_tweet = ['bad flight']\n",
|
||
|
"pipe.predict(new_tweet)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 31,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array(['neutral'], dtype=object)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 31,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"new_tweet = ['ok flight']\n",
|
||
|
"pipe.predict(new_tweet)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.8.5"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 4
|
||
|
}
|