The Data
Source: https://www.kaggle.com/crowdflower/twitter-airline-sentiment?select=Tweets.csv
This data originally came from Crowdflower's Data for Everyone library.
As the original source says,
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").
The Goal: Create a machine learning model that predicts whether a tweet is positive, neutral, or negative. In the future, an airline could use such a model to automatically read and flag tweets so that a customer service agent can reach out to the customer.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
df = pd.read_csv("../DATA/airline_tweets.csv")
In [3]:
df.head()
Out[3]:
In [4]:
sns.countplot(data=df,x='airline',hue='airline_sentiment')
Out[4]:
In [5]:
sns.countplot(data=df,x='negativereason')
plt.xticks(rotation=90);
In [6]:
sns.countplot(data=df,x='airline_sentiment')
Out[6]:
In [7]:
df['airline_sentiment'].value_counts()
Out[7]:
Features and Label
In [8]:
data = df[['airline_sentiment','text']]
In [9]:
data.head()
Out[9]:
In [10]:
y = df['airline_sentiment']
X = df['text']
Train Test Split
In [11]:
from sklearn.model_selection import train_test_split
In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
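The split above is a plain random split. If the sentiment classes turn out to be imbalanced, one option is to pass `stratify` so both splits keep the same class proportions. The sketch below uses a hypothetical toy dataset (not the airline tweets) purely to show the effect; the names `texts` and `labels` are illustrative.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 placeholder tweets with a 10/5/5 class imbalance.
texts = [f"tweet {i}" for i in range(20)]
labels = ["negative"] * 10 + ["neutral"] * 5 + ["positive"] * 5

# stratify=labels asks scikit-learn to preserve class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=101, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With stratification, the 2:1:1 ratio of the toy labels is kept in both the training and the test portion.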
Vectorization
In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [14]:
tfidf = TfidfVectorizer(stop_words='english')
In [15]:
tfidf.fit(X_train)
Out[15]:
In [16]:
X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
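The vectorizer is fitted on the training text only and then reused to transform the test text, so the vocabulary is learned without seeing the test set. A minimal sketch of the consequence, using made-up documents: words that never appeared during `fit` are simply ignored at `transform` time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the airline dataset.
train_docs = ["late flight", "rude service"]
test_docs = ["late service", "amazing upgrade"]

vec = TfidfVectorizer()
vec.fit(train_docs)              # vocabulary comes from the training text only
X_new = vec.transform(test_docs) # test text is mapped onto that fixed vocabulary

vocab = sorted(vec.vocabulary_)
nonzeros_per_row = list(X_new.getnnz(axis=1))
print(vocab)
print(nonzeros_per_row)
```

The second test document contains only unseen words, so its row has no non-zero entries.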
In [17]:
X_train_tfidf
Out[17]:
Do not call .todense() on a sparse matrix this large: the dense version stores every zero explicitly and can exhaust memory.
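To see why the sparse format matters, the sketch below compares how many cells a TF-IDF matrix actually stores with how many a dense array would need. It uses three made-up documents, so the numbers are tiny; on thousands of tweets the gap grows with the vocabulary size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the airline dataset.
docs = ["flight delayed again", "great crew and great flight", "lost my bag"]
X = TfidfVectorizer().fit_transform(docs)

n_rows, n_cols = X.shape
stored_values = X.nnz            # entries the sparse matrix actually keeps
total_cells = n_rows * n_cols    # entries a dense array would allocate

print(f"shape: {X.shape}")
print(f"stored values: {stored_values} of {total_cells} cells")
print(f"fill ratio: {stored_values / total_cells:.2%}")
```

Each document only touches the few columns for the words it contains, which is why most cells are zero and never stored.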
Model Comparisons - Naive Bayes, LogisticRegression, LinearSVC
In [18]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_tfidf,y_train)
Out[18]:
In [19]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(max_iter=1000)
log.fit(X_train_tfidf,y_train)
Out[19]:
In [20]:
from sklearn.svm import LinearSVC
svc = LinearSVC()
svc.fit(X_train_tfidf,y_train)
Out[20]:
Performance Evaluation
In [21]:
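from sklearn.metrics import ConfusionMatrixDisplay,classification_report
In [22]:
def report(model):
    # plot_confusion_matrix was removed in scikit-learn 1.2;
    # ConfusionMatrixDisplay.from_estimator is its replacement.
    preds = model.predict(X_test_tfidf)
    print(classification_report(y_test,preds))
    ConfusionMatrixDisplay.from_estimator(model,X_test_tfidf,y_test)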
In [23]:
print("NB MODEL")
report(nb)
In [24]:
print("Logistic Regression")
report(log)
In [25]:
print('SVC')
report(svc)
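When reading the three reports, it can help to look beyond overall accuracy, since a model that favors the most common class can score well on accuracy while missing the other classes. The sketch below illustrates this with hand-written labels and a hypothetical "always predict negative" baseline; none of these numbers come from the airline data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels with a 6/2/2 class imbalance.
y_true = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2

# Hypothetical baseline that always predicts the majority class.
y_pred = ["negative"] * 10

acc = accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores with equal weight per class;
# zero_division=0 scores classes that were never predicted as 0.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy: {acc:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

The baseline reaches 0.60 accuracy but only 0.25 macro F1, which is why the per-class rows of `classification_report` are worth comparing across the three models.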
Finalizing a Pipeline for Deployment on New Tweets
Once we are satisfied with a model's performance, we can set up a pipeline that takes raw tweet text directly, so vectorization and prediction happen in a single call.
In [26]:
from sklearn.pipeline import Pipeline
In [27]:
pipe = Pipeline([('tfidf',TfidfVectorizer()),('svc',LinearSVC())])
In [28]:
pipe.fit(df['text'],df['airline_sentiment'])
Out[28]:
In [29]:
new_tweet = ['good flight']
pipe.predict(new_tweet)
Out[29]:
In [30]:
new_tweet = ['bad flight']
pipe.predict(new_tweet)
Out[30]:
In [31]:
new_tweet = ['ok flight']
pipe.predict(new_tweet)
Out[31]:
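For actual deployment, the fitted pipeline would usually be saved to disk and reloaded by whatever service reads the tweets. The sketch below shows one possible way to do that with `joblib`; the toy corpus, the name `toy_pipe`, and the file name are assumptions for illustration, not part of the notebook above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the airline tweets.
texts = ["good flight", "great crew", "bad flight", "terrible delay"]
labels = ["positive", "positive", "negative", "negative"]

toy_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", LinearSVC())])
toy_pipe.fit(texts, labels)

# Save the fitted pipeline and load it back, as a serving process might do.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_sentiment_pipe.joblib")  # hypothetical file name
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

samples = ["good crew", "terrible flight"]
original_preds = list(toy_pipe.predict(samples))
restored_preds = list(restored.predict(samples))
print(original_preds, restored_preds)
```

The restored object carries both the fitted vocabulary and the classifier, so it accepts raw strings just like the original. Files saved this way should only be loaded from trusted sources, since loading them can execute arbitrary code.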