Logistic Regression¶
Imports¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Data¶
An experiment was conducted on 5000 participants to study the effects of age and physical health on hearing loss, specifically the ability to hear high-pitched tones. The data records the results of the study: participants were evaluated and scored for physical ability, then took an audio test (pass/no pass) that evaluated their ability to hear high frequencies. Each participant's age was also noted. Is it possible to build a model that predicts someone's likelihood of hearing the high-frequency sound based solely on their features (age and physical score)?
Features
- age - Age of participant in years
- physical_score - Score achieved during physical exam
Label/Target
- test_result - 0 if no pass, 1 if test passed
df = pd.read_csv('../DATA/hearing_test.csv')
df.head()
Exploratory Data Analysis and Visualization¶
Feel free to explore the data further on your own.
df.info()
df.describe()
df['test_result'].value_counts()
sns.countplot(data=df,x='test_result')
sns.boxplot(x='test_result',y='age',data=df)
sns.boxplot(x='test_result',y='physical_score',data=df)
sns.scatterplot(x='age',y='physical_score',data=df,hue='test_result')
sns.pairplot(df,hue='test_result')
sns.heatmap(df.corr(),annot=True)
sns.scatterplot(x='physical_score',y='test_result',data=df)
sns.scatterplot(x='age',y='test_result',data=df)
Easily discover new plot types with a Google search! Searching for "3d matplotlib scatter plot" quickly takes you to: https://matplotlib.org/3.1.1/gallery/mplot3d/scatter3d.html
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'],df['physical_score'],df['test_result'],c=df['test_result'])
Train | Test Split and Scaling¶
X = df.drop('test_result',axis=1)
y = df['test_result']
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
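The scaler is fit on the training split only, and the test split is then transformed with the training statistics, so nothing about the test set leaks into preprocessing. Below is a minimal sketch of that arithmetic in plain Python. The `train_ages` and `test_ages` values and the `standardize` helper are made up for illustration and are not taken from the hearing test data. The sketch uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Hypothetical values for illustration only (not from hearing_test.csv)
train_ages = [30.0, 40.0, 50.0, 60.0]
test_ages = [45.0, 70.0]

# "fit": learn the mean and standard deviation from the training split only
mu = fmean(train_ages)
sigma = pstdev(train_ages)  # population standard deviation (ddof=0)

def standardize(values, mu, sigma):
    """Hypothetical helper: subtract the mean, divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]

# "transform": both splits use the training statistics
scaled_train = standardize(train_ages, mu, sigma)
scaled_test = standardize(test_ages, mu, sigma)

print(scaled_train)
print(scaled_test)
```

After this step the training column has mean 0 and standard deviation 1, while the test column generally does not, because it was scaled with statistics it did not contribute to.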
Logistic Regression Model¶
from sklearn.linear_model import LogisticRegression
# help(LogisticRegression)
# help(LogisticRegressionCV)
log_model = LogisticRegression()
log_model.fit(scaled_X_train,y_train)
Coefficient Interpretation¶
Things to remember:
- These coefficients relate to the odds and cannot be interpreted directly as in linear regression.
- We trained on a scaled version of the data, so a one-unit increase in a feature means an increase of one standard deviation.
- It is much easier to understand and interpret the relationship between the coefficients than it is to interpret each coefficient's relationship with the probability of the target/label class.
Make sure to watch the video explanation, and check out the links below:
- https://stats.idre.ucla.edu/stata/faq/how-do-i-interpret-odds-ratios-in-logistic-regression/
- https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
The odds ratio¶
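For a continuous independent variable the odds ratio can be defined as:
$$\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$
This exponential relationship provides an interpretation for $$\beta_1$$:
The odds multiply by $$e^{\beta_1}$$ for every 1-unit increase in x.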
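The claim that the odds multiply by a constant factor for every 1-unit increase in x can be checked numerically. The sketch below assumes a one-feature logistic model with a hypothetical intercept `beta_0` and slope `beta_1`, chosen only for illustration.

```python
import math

# Hypothetical intercept and slope, for illustration only
beta_0, beta_1 = -1.0, 0.5

def odds(x):
    """Odds p / (1 - p) under the logistic model p = sigmoid(beta_0 + beta_1 * x)."""
    p = 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))
    return p / (1.0 - p)

# Ratio of the odds after and before a 1-unit increase, at several starting points
ratios = [odds(x + 1) / odds(x) for x in (-2.0, 0.0, 3.0)]
odds_ratio = math.exp(beta_1)

print(ratios)
print(odds_ratio)
```

Every ratio matches `math.exp(beta_1)` up to floating-point error, whatever the starting value of x. The change in probability, by contrast, depends on where x starts.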
log_model.coef_
This means:
- We can expect the odds of passing the test to decrease (the coefficient is negative) per unit increase in age.
- We can expect the odds of passing the test to increase (the coefficient is positive) per unit increase in physical score.
- Based on the relative magnitudes of the coefficients, physical_score is a stronger predictor than age.
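One way to make these statements concrete is to exponentiate each coefficient into an odds multiplier. The sketch below uses made-up coefficients and standard deviations, so the printed numbers are not results from this model. After fitting, the real values would come from `log_model.coef_[0]` and `scaler.scale_`. Dividing a scaled coefficient by the feature's standard deviation gives the effect per original unit (per year of age, per point of physical score).

```python
import math

# Hypothetical values for illustration only; after fitting, the real ones
# would come from log_model.coef_[0] and scaler.scale_
scaled_coefs = {"age": -1.0, "physical_score": 3.0}
feature_stds = {"age": 11.0, "physical_score": 8.0}

results = {}
for name, beta in scaled_coefs.items():
    per_std = math.exp(beta)                         # odds multiplier per 1 standard deviation
    per_unit = math.exp(beta / feature_stds[name])   # odds multiplier per 1 original unit
    results[name] = (per_std, per_unit)
    print(f"{name}: x{per_std:.3f} per std, x{per_unit:.3f} per original unit")
```

A multiplier below 1 means the odds shrink as the feature grows, and a multiplier above 1 means they grow.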
Model Performance on Classification Tasks¶
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,ConfusionMatrixDisplay
y_pred = log_model.predict(scaled_X_test)
accuracy_score(y_test,y_pred)
confusion_matrix(y_test,y_pred)
# ConfusionMatrixDisplay.from_estimator requires scikit-learn 1.0 or newer
ConfusionMatrixDisplay.from_estimator(log_model,scaled_X_test,y_test)
# Normalized over the true labels, so each row sums to 1
ConfusionMatrixDisplay.from_estimator(log_model,scaled_X_test,y_test,normalize='true')
print(classification_report(y_test,y_pred))
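The numbers in the classification report all derive from the four cells of the confusion matrix. As a rough sketch, the code below computes them by hand for the positive class from a small set of hypothetical labels (not the hearing test results). For labels 0 and 1, scikit-learn's `confusion_matrix` lays the cells out as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class.

```python
# Hypothetical true labels and predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 1, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_hat))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # of the predicted passes, how many really passed
recall = tp / (tp + fn)             # of the real passes, how many were found
f1 = 2 * precision * recall / (precision + recall)

print([[tn, fp], [fn, tp]])
print(accuracy, precision, recall, f1)
```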
X_train.iloc[0]
y_train.iloc[0]
# The model was trained on scaled features, so pass the scaled version of this row
# predict_proba columns: [P(class 0), P(class 1)]
log_model.predict_proba(scaled_X_train[0].reshape(1, -1))
log_model.predict(scaled_X_train[0].reshape(1, -1))
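For a binary logistic regression, the probability of class 1 is the sigmoid of a weighted sum of the features, and the predicted class is 1 when that probability exceeds 0.5. The sketch below reproduces this by hand with hypothetical weights, intercept, and sample. Because the model was trained on scaled data, the sample must be in scaled units too. Raw values such as an age in years would push the weighted sum far from zero and give probabilities close to 0 or 1.

```python
import math

# Hypothetical fitted parameters and one already-scaled sample (illustration only)
weights = [-1.0, 3.0]        # stand-ins for log_model.coef_[0]
intercept = 0.5              # stand-in for log_model.intercept_[0]
scaled_sample = [0.2, 1.0]   # [age, physical_score] in standard-deviation units

# Linear score, then sigmoid to get P(class 1)
z = intercept + sum(w * x for w, x in zip(weights, scaled_sample))
p_pass = 1.0 / (1.0 + math.exp(-z))

proba = [1.0 - p_pass, p_pass]  # same column order as predict_proba: class 0, class 1
predicted = int(p_pass > 0.5)   # class 1 when P(class 1) exceeds 0.5

print(proba, predicted)
```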
Evaluating Curves and AUC¶
Make sure to watch the video on this!
from sklearn.metrics import precision_recall_curve,PrecisionRecallDisplay,RocCurveDisplay
# The from_estimator constructors require scikit-learn 1.0 or newer
PrecisionRecallDisplay.from_estimator(log_model,scaled_X_test,y_test)
RocCurveDisplay.from_estimator(log_model,scaled_X_test,y_test)
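The AUC reported on the ROC plot has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way from hypothetical labels and predicted probabilities. It illustrates the definition and is not how scikit-learn computes the value internally.

```python
# Hypothetical labels and predicted probabilities of class 1 (illustration only)
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# Compare every positive with every negative: 1 for a win, 0.5 for a tie, 0 for a loss
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))

print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.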