___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

NLP and Supervised Learning

Classification of Text Data

The Data

Source: https://www.kaggle.com/crowdflower/twitter-airline-sentiment?select=Tweets.csv

This data originally came from Crowdflower's Data for Everyone library.

As the original source says,

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

The Goal: Create a machine learning model that predicts whether a tweet is positive, neutral, or negative. In the future, an airline could use such a model to automatically read and flag tweets so that a customer service agent can reach out to the customer.

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:

df = pd.read_csv("../DATA/airline_tweets.csv")

In [3]:

df.head()

Out[3]:

	tweet_id	airline_sentiment	airline_sentiment_confidence	negativereason	negativereason_confidence	airline	airline_sentiment_gold	name	negativereason_gold	text	tweet_coord	tweet_created	tweet_location	user_timezone
0	570306133677760513	neutral	1.0000	NaN	NaN	Virgin America	NaN	cairdin	NaN	@VirginAmerica What @dhepburn said.	NaN	2015-02-24 11:35:52 -0800	NaN	Eastern Time (US & Canada)
1	570301130888122368	positive	0.3486	NaN	0.0000	Virgin America	NaN	jnardino	NaN	@VirginAmerica plus you've added commercials t...	NaN	2015-02-24 11:15:59 -0800	NaN	Pacific Time (US & Canada)
2	570301083672813571	neutral	0.6837	NaN	NaN	Virgin America	NaN	yvonnalynn	NaN	@VirginAmerica I didn't today... Must mean I n...	NaN	2015-02-24 11:15:48 -0800	Lets Play	Central Time (US & Canada)
3	570301031407624196	negative	1.0000	Bad Flight	0.7033	Virgin America	NaN	jnardino	NaN	@VirginAmerica it's really aggressive to blast...	NaN	2015-02-24 11:15:36 -0800	NaN	Pacific Time (US & Canada)
4	570300817074462722	negative	1.0000	Can't Tell	1.0000	Virgin America	NaN	jnardino	NaN	@VirginAmerica and it's a really big bad thing...	NaN	2015-02-24 11:14:45 -0800	NaN	Pacific Time (US & Canada)

In [4]:

sns.countplot(data=df,x='airline',hue='airline_sentiment')

Out[4]:

<AxesSubplot:xlabel='airline', ylabel='count'>

In [5]:

sns.countplot(data=df,x='negativereason')
plt.xticks(rotation=90);

In [6]:

sns.countplot(data=df,x='airline_sentiment')

Out[6]:

<AxesSubplot:xlabel='airline_sentiment', ylabel='count'>

In [7]:

df['airline_sentiment'].value_counts()

Out[7]:

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
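
The counts above are heavily skewed toward the negative class, which matters when reading accuracy scores later. As a rough sketch (the counts are copied by hand from the output above, and the variable names are illustrative), the class proportions and the accuracy of always guessing the majority class can be worked out like this:

```python
# Class counts copied from the value_counts() output above
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A model that always predicts the most common class gets this accuracy
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in proportions.items():
    print(f"{label:>8}: {share:.3f}")
print(f"majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.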

Features and Label

In [8]:

data = df[['airline_sentiment','text']]

In [9]:

data.head()

Out[9]:

	airline_sentiment	text
0	neutral	@VirginAmerica What @dhepburn said.
1	positive	@VirginAmerica plus you've added commercials t...
2	neutral	@VirginAmerica I didn't today... Must mean I n...
3	negative	@VirginAmerica it's really aggressive to blast...
4	negative	@VirginAmerica and it's a really big bad thing...

In [10]:

y = df['airline_sentiment']
X = df['text']

Train Test Split

In [11]:

from sklearn.model_selection import train_test_split

In [12]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

Vectorization

In [13]:

from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:

tfidf = TfidfVectorizer(stop_words='english')

In [15]:

tfidf.fit(X_train)

Out[15]:

TfidfVectorizer(stop_words='english')

In [16]:

X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [17]:

X_train_tfidf

Out[17]:

<11712x12971 sparse matrix of type '<class 'numpy.float64'>'
	with 107073 stored elements in Compressed Sparse Row format>

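Do NOT call .todense() on a sparse matrix this large! The dense version would hold all 11712 x 12971 entries, nearly all of them zeros, instead of only the 107073 stored values.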
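
To put rough numbers on that warning, here is a back-of-the-envelope sketch using the shape and stored-element count from the matrix repr above. It assumes 8-byte float64 values and 4-byte integer indices in the CSR arrays; the actual index width can differ by platform and SciPy version.

```python
# Shape and stored-element count copied from the sparse matrix repr above
n_rows, n_cols, n_stored = 11712, 12971, 107073

FLOAT64_BYTES = 8
INT32_BYTES = 4  # assumed index width for the CSR arrays

# Dense storage keeps every entry, zeros included
dense_bytes = n_rows * n_cols * FLOAT64_BYTES

# CSR keeps the non-zero values, their column indices, and one row pointer per row (+1)
csr_bytes = (n_stored * FLOAT64_BYTES
             + n_stored * INT32_BYTES
             + (n_rows + 1) * INT32_BYTES)

print(f"dense: {dense_bytes / 1e9:.2f} GB")
print(f"CSR:   {csr_bytes / 1e6:.2f} MB")
print(f"ratio: about {dense_bytes / csr_bytes:.0f}x")
```

The scikit-learn estimators used below accept the sparse matrix directly, so there is no need to densify it.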

Model Comparisons - Naive Bayes, LogisticRegression, LinearSVC

In [18]:

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_tfidf,y_train)

Out[18]:

MultinomialNB()

In [19]:

from sklearn.linear_model import LogisticRegression
log = LogisticRegression(max_iter=1000)
log.fit(X_train_tfidf,y_train)

Out[19]:

LogisticRegression(max_iter=1000)

In [20]:

from sklearn.svm import LinearSVC
svc = LinearSVC()
svc.fit(X_train_tfidf,y_train)

Out[20]:

LinearSVC()

Performance Evaluation

In [21]:

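from sklearn.metrics import ConfusionMatrixDisplay, classification_report

In [22]:

def report(model):
    # Print per-class metrics and plot the confusion matrix on the held-out test set
    preds = model.predict(X_test_tfidf)
    print(classification_report(y_test,preds))
    # plot_confusion_matrix was removed in scikit-learn 1.2; this is its replacement
    ConfusionMatrixDisplay.from_estimator(model,X_test_tfidf,y_test)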

In [23]:

print("NB MODEL")
report(nb)

NB MODEL
              precision    recall  f1-score   support

    negative       0.66      0.99      0.79      1817
     neutral       0.79      0.15      0.26       628
    positive       0.89      0.14      0.24       483

    accuracy                           0.67      2928
   macro avg       0.78      0.43      0.43      2928
weighted avg       0.73      0.67      0.59      2928

In [24]:

print("Logistic Regression")
report(log)

Logistic Regression
              precision    recall  f1-score   support

    negative       0.80      0.93      0.86      1817
     neutral       0.63      0.47      0.54       628
    positive       0.82      0.58      0.68       483

    accuracy                           0.77      2928
   macro avg       0.75      0.66      0.69      2928
weighted avg       0.77      0.77      0.76      2928

In [25]:

print('SVC')
report(svc)

SVC
              precision    recall  f1-score   support

    negative       0.82      0.89      0.86      1817
     neutral       0.59      0.52      0.55       628
    positive       0.76      0.64      0.69       483

    accuracy                           0.77      2928
   macro avg       0.73      0.68      0.70      2928
weighted avg       0.76      0.77      0.77      2928
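
The "macro avg" and "weighted avg" rows in the reports above can tell quite different stories on imbalanced data. As a small sketch, the recall averages of the Naive Bayes report can be reproduced from its per-class rows (the numbers are copied by hand from that report, so they carry its two-decimal rounding):

```python
# Per-class recall and support copied from the Naive Bayes report above
recall = {"negative": 0.99, "neutral": 0.15, "positive": 0.14}
support = {"negative": 1817, "neutral": 628, "positive": 483}

# Macro average: every class counts equally, regardless of size
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
total_support = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total_support

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted figure is dominated by the large negative class, while the macro figure exposes how rarely the Naive Bayes model recovers neutral and positive tweets. That gap is one reason to compare the three models on the macro rows and per-class scores rather than on accuracy alone.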

Finalizing a Pipeline for Deployment on New Tweets

Once we are satisfied with a model's performance, we should set up a pipeline that takes raw tweet text directly, so vectorization and prediction happen in a single call.

In [26]:

from sklearn.pipeline import Pipeline

In [27]:

pipe = Pipeline([('tfidf',TfidfVectorizer()),('svc',LinearSVC())])

In [28]:

pipe.fit(df['text'],df['airline_sentiment'])

Out[28]:

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])

In [29]:

new_tweet = ['good flight']
pipe.predict(new_tweet)

Out[29]:

array(['positive'], dtype=object)

In [30]:

new_tweet = ['bad flight']
pipe.predict(new_tweet)

Out[30]:

array(['negative'], dtype=object)

In [31]:

new_tweet = ['ok flight']
pipe.predict(new_tweet)

Out[31]:

array(['neutral'], dtype=object)
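
For actual deployment, the fitted pipeline usually has to outlive the notebook session. One common option, sketched here as an assumption rather than as part of the original workflow, is to save it with joblib and load it back wherever predictions are needed. The tiny corpus, the `toy_pipe` name, and the file name below are all made up so that the example runs on its own; in practice you would dump the `pipe` object fitted above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only here so the sketch is self-contained
texts = ["great crew and smooth flight", "lost my bag and nobody helped",
         "what time does boarding start", "love the new seats",
         "delayed three hours again", "is there wifi on this route"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

toy_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", LinearSVC())])
toy_pipe.fit(texts, labels)

# Hypothetical file name; any writable path works
model_path = os.path.join(tempfile.mkdtemp(), "tweet_sentiment_pipe.joblib")
joblib.dump(toy_pipe, model_path)

# Later, possibly in another process: load and predict on raw text
restored_pipe = joblib.load(model_path)
same_predictions = list(restored_pipe.predict(texts)) == list(toy_pipe.predict(texts))
print(same_predictions)
```

Because the vectorizer is saved together with the classifier, the loaded object accepts raw strings exactly as `pipe.predict` does above. Only load joblib files from sources you trust, and keep the scikit-learn version consistent between saving and loading.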
