
___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Logistic Regression

Imports

In [30]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Data

An experiment was conducted on 5,000 participants to study the effects of age and physical health on hearing loss, specifically the ability to hear high-pitched tones. This data displays the results of the study, in which participants were evaluated and scored for physical ability and then took an audio test (pass/no pass) that evaluated their ability to hear high frequencies. The age of each participant was also recorded. Is it possible to build a model that predicts someone's likelihood of hearing the high-frequency sound based solely on their features (age and physical score)?

  • Features

    • age - Age of participant in years
    • physical_score - Score achieved during physical exam
  • Label/Target

    • test_result - 0 if no pass, 1 if test passed
In [31]:
df = pd.read_csv('../DATA/hearing_test.csv')
In [32]:
df.head()
Out[32]:
age physical_score test_result
0 33.0 40.7 1
1 50.0 37.2 1
2 52.0 24.7 0
3 56.0 31.0 0
4 35.0 42.9 1

Exploratory Data Analysis and Visualization

Feel free to explore the data further on your own.

In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             5000 non-null   float64
 1   physical_score  5000 non-null   float64
 2   test_result     5000 non-null   int64  
dtypes: float64(2), int64(1)
memory usage: 117.3 KB
In [34]:
df.describe()
Out[34]:
age physical_score test_result
count 5000.000000 5000.000000 5000.000000
mean 51.609000 32.760260 0.600000
std 11.287001 8.169802 0.489947
min 18.000000 -0.000000 0.000000
25% 43.000000 26.700000 0.000000
50% 51.000000 35.300000 1.000000
75% 60.000000 38.900000 1.000000
max 90.000000 50.000000 1.000000
In [35]:
df['test_result'].value_counts()
Out[35]:
1    3000
0    2000
Name: test_result, dtype: int64
In [36]:
sns.countplot(data=df,x='test_result')
Out[36]:
<AxesSubplot:xlabel='test_result', ylabel='count'>
In [37]:
sns.boxplot(x='test_result',y='age',data=df)
Out[37]:
<AxesSubplot:xlabel='test_result', ylabel='age'>
In [38]:
sns.boxplot(x='test_result',y='physical_score',data=df)
Out[38]:
<AxesSubplot:xlabel='test_result', ylabel='physical_score'>
In [39]:
sns.scatterplot(x='age',y='physical_score',data=df,hue='test_result')
Out[39]:
<AxesSubplot:xlabel='age', ylabel='physical_score'>
In [40]:
sns.pairplot(df,hue='test_result')
Out[40]:
<seaborn.axisgrid.PairGrid at 0x19ceae2fd08>
In [41]:
sns.heatmap(df.corr(),annot=True)
Out[41]:
<AxesSubplot:>
In [42]:
sns.scatterplot(x='physical_score',y='test_result',data=df)
Out[42]:
<AxesSubplot:xlabel='physical_score', ylabel='test_result'>
In [43]:
sns.scatterplot(x='age',y='test_result',data=df)
Out[43]:
<AxesSubplot:xlabel='age', ylabel='test_result'>

You can easily discover new plot types with a Google search! Searching for "3d matplotlib scatter plot" quickly takes you to: https://matplotlib.org/3.1.1/gallery/mplot3d/scatter3d.html

In [44]:
from mpl_toolkits.mplot3d import Axes3D 
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'],df['physical_score'],df['test_result'],c=df['test_result'])
Out[44]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x19ceaf878c8>

Train | Test Split and Scaling

In [45]:
X = df.drop('test_result',axis=1)
y = df['test_result']
In [46]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
In [48]:
scaler = StandardScaler()
In [49]:
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

Logistic Regression Model

In [50]:
from sklearn.linear_model import LogisticRegression
In [51]:
# help(LogisticRegression)
In [52]:
# help(LogisticRegressionCV)
In [53]:
log_model = LogisticRegression()
In [54]:
log_model.fit(scaled_X_train,y_train)
Out[54]:
LogisticRegression()

Coefficient Interpretation

Things to remember:

  • These coefficients relate to the odds and cannot be directly interpreted as in linear regression.
  • We trained on a scaled version of the data, so a "unit" here is one standard deviation of the original feature.
  • It is much easier to understand and interpret the relationship between the coefficients themselves than to interpret each coefficient's relationship with the probability of the target/label class (see the odds-ratio sketch after the coefficient output below).

Make sure to watch the video explanation, and check out the links below.

The odds ratio

For a continuous independent variable, the odds ratio can be defined as:

$$\text{OR} = \frac{\text{odds}(x+1)}{\text{odds}(x)} = e^{\beta_1}$$

This exponential relationship provides an interpretation for $\beta_1$: the odds multiply by $e^{\beta_1}$ for every 1-unit increase in $x$.

In [55]:
log_model.coef_
Out[55]:
array([[-0.94953524,  3.45991194]])

This means:

  • We can expect the odds of passing the test to decrease (the coefficient is negative) as age increases.
  • We can expect the odds of passing the test to increase (the coefficient is positive) as physical score increases.
  • Because both features are on the same standardized scale, comparing coefficient magnitudes is meaningful: physical_score is a stronger predictor than age.
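
As a quick sketch (not part of the original run), exponentiating the fitted coefficients converts log-odds into odds ratios; because the model was trained on standardized features, each ratio applies per one-standard-deviation increase:

In [ ]:
# Sketch: convert log-odds coefficients to odds ratios.
# exp(-0.9495) ≈ 0.39 for age, exp(3.4599) ≈ 31.8 for physical_score.
odds_ratios = np.exp(log_model.coef_[0])
for name, ratio in zip(X.columns, odds_ratios):
    print(f"{name}: odds multiply by {ratio:.3f} per 1 SD increase")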

Model Performance on Classification Tasks

In [56]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, plot_confusion_matrix
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions use ConfusionMatrixDisplay.from_estimator instead.
In [57]:
y_pred = log_model.predict(scaled_X_test)
In [58]:
accuracy_score(y_test,y_pred)
Out[58]:
0.93
In [59]:
confusion_matrix(y_test,y_pred)
Out[59]:
array([[172,  21],
       [ 14, 293]], dtype=int64)
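
As a sanity check (a quick sketch, not from the original run), the 0.93 accuracy above can be recovered directly from these counts:

In [ ]:
# Accuracy = (TN + TP) / total = (172 + 293) / 500 = 0.93
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print((tn + tp) / (tn + fp + fn + tp))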
In [60]:
plot_confusion_matrix(log_model,scaled_X_test,y_test)
Out[60]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x19ceb65e588>
In [61]:
# normalize='true' normalizes over the true (row) labels, so each row sums to 1
plot_confusion_matrix(log_model,scaled_X_test,y_test,normalize='true')
Out[61]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x19ceb691b88>
In [62]:
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.92      0.89      0.91       193
           1       0.93      0.95      0.94       307

    accuracy                           0.93       500
   macro avg       0.93      0.92      0.93       500
weighted avg       0.93      0.93      0.93       500
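The per-class numbers in the report follow directly from the confusion matrix above; a quick check for class 1:

In [ ]:
# precision(1) = TP / (TP + FP) = 293 / (293 + 21) ≈ 0.93
# recall(1)    = TP / (TP + FN) = 293 / (293 + 14) ≈ 0.95
print(293 / (293 + 21), 293 / (293 + 14))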

In [63]:
X_train.iloc[0]
Out[63]:
age               32.0
physical_score    43.0
Name: 141, dtype: float64
In [64]:
y_train.iloc[0]
Out[64]:
1
In [65]:
# With this raw (unscaled) row the predicted probabilities saturate:
# ~0% for class 0 and ~100% for class 1. Note the model was fit on
# scaled features, so strictly the row should be scaled first (see below).
log_model.predict_proba(X_train.iloc[0].values.reshape(1, -1))
Out[65]:
array([[0., 1.]])
In [66]:
log_model.predict(X_train.iloc[0].values.reshape(1, -1))
Out[66]:
array([1], dtype=int64)
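
Since the model was fit on scaled features, a more careful version of this prediction transforms the row with the fitted scaler first. A minimal sketch (the resulting probabilities will be less extreme than the saturated 0/1 above):

In [ ]:
# Scale the raw row with the already-fitted scaler; double brackets keep it 2-D.
sample = scaler.transform(X_train.iloc[[0]])
print(log_model.predict_proba(sample))
print(log_model.predict(sample))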

Evaluating Curves and AUC

Make sure to watch the video on this!

In [67]:
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve, plot_roc_curve
# Note: the plot_* helpers were removed in scikit-learn 1.2; newer versions use
# PrecisionRecallDisplay.from_estimator / RocCurveDisplay.from_estimator instead.
In [70]:
plot_precision_recall_curve(log_model,scaled_X_test,y_test)
Out[70]:
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x19cec76dac8>
In [71]:
plot_roc_curve(log_model,scaled_X_test,y_test)
Out[71]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x19ceb5c4288>
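
The ROC curve can be summarized in a single number with roc_auc_score; a minimal sketch using the positive-class probabilities:

In [ ]:
from sklearn.metrics import roc_auc_score
# AUC of the ROC curve: 0.5 = chance-level ranking, 1.0 = perfect.
roc_auc_score(y_test, log_model.predict_proba(scaled_X_test)[:, 1])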

