Seaborn Exercises - Solutions¶
Imports¶
Run the cell below to import the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
The Data¶
DATA SOURCE: https://www.kaggle.com/rikdifos/credit-card-approval-prediction
Data Information:
Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so the bank can decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.
Feature Information:
application_record.csv | | |
---|---|---|
Feature name | Explanation | Remarks |
ID | Client number | |
CODE_GENDER | Gender | |
FLAG_OWN_CAR | Is there a car | |
FLAG_OWN_REALTY | Is there a property | |
CNT_CHILDREN | Number of children | |
AMT_INCOME_TOTAL | Annual income | |
NAME_INCOME_TYPE | Income category | |
NAME_EDUCATION_TYPE | Education level | |
NAME_FAMILY_STATUS | Marital status | |
NAME_HOUSING_TYPE | Way of living | |
DAYS_BIRTH | Birthday | Count backwards from current day (0), -1 means yesterday |
DAYS_EMPLOYED | Start date of employment | Count backwards from current day (0). If positive, it means the person is currently unemployed. |
FLAG_MOBIL | Is there a mobile phone | |
FLAG_WORK_PHONE | Is there a work phone | |
FLAG_PHONE | Is there a phone | |
FLAG_EMAIL | Is there an email | |
OCCUPATION_TYPE | Occupation | |
CNT_FAM_MEMBERS | Family size | |
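The remarks for DAYS_BIRTH and DAYS_EMPLOYED describe a day-offset encoding that is easy to misread. The snippet below is a minimal sketch of how that encoding might be decoded, using a few made-up rows rather than the real file. The `toy` frame and the `AGE_DAYS` and `UNEMPLOYED` column names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Made-up rows in the same encoding as application_record.csv.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-12005, -21474, -19110],
    "DAYS_EMPLOYED": [-4542, 365243, -3051],
})

# Offsets count backwards from day 0, so negating gives elapsed days.
toy["AGE_DAYS"] = -toy["DAYS_BIRTH"]

# Per the remarks column, a positive DAYS_EMPLOYED marks an unemployed person.
toy["UNEMPLOYED"] = toy["DAYS_EMPLOYED"] > 0

print(toy)
```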
df = pd.read_csv('application_record.csv')
df.head()
df.info()
TASKS¶
Recreate the plots shown in the markdown image cells. Each plot also contains a brief description of what it is trying to convey. Note that these are meant to be quite challenging. Start by replicating the most basic form of the plot, then adjust its styling and parameters to match the given image.¶
In general, do not worry about matching the coloring, styling, or sizing exactly. Instead, focus on the content of the plot itself. Our goal is not to test whether you can recognize figsize=(10,8); it's to test whether you can look at a requested plot and reproduce it.
NOTE: You may need to perform extra calculations on the pandas dataframe before calling seaborn to create the plot.
TASK: Recreate the Scatter Plot shown below¶
The scatterplot shows the relationship between days employed and the age of the person (DAYS_BIRTH) for people who were not unemployed. To reproduce this chart you must first remove unemployed people from the dataset. Also note the sign of the axes: both columns are transformed to be positive. Finally, feel free to adjust the alpha and linewidth parameters in the scatterplot, since so many points are stacked on top of each other.
# CODE HERE TO RECREATE THE PLOT SHOWN ABOVE
import warnings
warnings.simplefilter('ignore')
plt.figure(figsize=(12,8))
# REMOVE UNEMPLOYED PEOPLE
# .copy() gives an independent frame, so the assignments below
# do not trigger pandas' SettingWithCopyWarning
employed = df[df['DAYS_EMPLOYED']<0].copy()
# MAKE BOTH POSITIVE
employed['DAYS_EMPLOYED'] = -1*employed['DAYS_EMPLOYED']
employed['DAYS_BIRTH'] = -1*employed['DAYS_BIRTH']
# With so many points, alpha has to be tiny, which might be an indication
# that a scatterplot is not the right choice!
sns.scatterplot(y='DAYS_EMPLOYED',x='DAYS_BIRTH',data=employed,
alpha=0.01,linewidth=0)
plt.savefig('task_one.jpg')
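The comment in the solution suggests that a scatterplot may not suit this many points. One way to check how heavily points overlap is to bin them on a grid and count them per cell. The sketch below does this with `numpy.histogram2d` on synthetic data; the `age_days` and `employed_days` arrays are made-up stand-ins for the two positive columns, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the positive DAYS_BIRTH and DAYS_EMPLOYED columns.
age_days = rng.uniform(7500, 25000, size=50_000)
employed_days = rng.uniform(0, 1, size=50_000) * (age_days - 6500)

# Count how many points fall into each cell of a 50 x 50 grid.
counts, x_edges, y_edges = np.histogram2d(age_days, employed_days, bins=50)

print(counts.shape)       # grid dimensions
print(int(counts.max()))  # points stacked in the busiest cell
```

A large maximum count relative to the number of cells is a sign that a density-style plot may convey the data better than individual markers.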
TASK: Recreate the Distribution Plot shown below:¶
Note, you will need to figure out how to calculate "Age in Years" from one of the columns in the DF. Think carefully about this. Don't worry too much if you are unable to replicate the styling exactly.
# CODE HERE TO RECREATE THE PLOT SHOWN ABOVE
plt.figure(figsize=(8,4))
df['YEARS'] = -1*df['DAYS_BIRTH']/365
sns.histplot(data=df,x='YEARS',linewidth=2,edgecolor='black',
color='red',bins=45,alpha=0.4)
plt.xlabel("Age in Years")
plt.savefig('DistPlot_solution.png')
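As a quick check of the age conversion used above, the sketch below applies the same formula to a few made-up DAYS_BIRTH values. Dividing by 365 ignores leap days, so the result is an approximation of age in years.

```python
import pandas as pd

# Made-up DAYS_BIRTH values (negative offsets from day 0).
days_birth = pd.Series([-12005, -21474, -7300])

# Negate to get elapsed days, then divide by 365 for approximate years.
years = -days_birth / 365

print(years.round(1).tolist())  # [32.9, 58.8, 20.0]
```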
TASK: Recreate the Categorical Plot shown below:¶
This plot shows information only for the bottom half of income earners in the data set. It shows a boxplot of total income (AMT_INCOME_TOTAL) for each category of the NAME_FAMILY_STATUS column, with the "FLAG_OWN_REALTY" column as the hue. Note: You will need to filter the dataframe, or take only part of it, before recreating this plot.
# CODE HERE
plt.figure(figsize=(12,5))
# Keep the half of the rows with the lowest AMT_INCOME_TOTAL
bottom_half_income = df.nsmallest(n=int(0.5*len(df)),columns='AMT_INCOME_TOTAL')
sns.boxplot(x='NAME_FAMILY_STATUS',y='AMT_INCOME_TOTAL',data=bottom_half_income,hue='FLAG_OWN_REALTY')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.,title='FLAG_OWN_REALTY')
plt.title('Income Totals per Family Status for Bottom Half of Earners')
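"Bottom half" can be read in more than one way. The solution keeps exactly half of the rows with `nsmallest`; another reading keeps every row at or below the median income. The sketch below compares the two on made-up incomes with ties at the median. The `toy` frame and variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up incomes with several values tied at the median.
toy = pd.DataFrame({"AMT_INCOME_TOTAL": [90, 120, 150, 150, 150, 400]})

# Approach used in the solution: keep exactly half of the rows.
half_rows = toy.nsmallest(n=int(0.5 * len(toy)), columns="AMT_INCOME_TOTAL")

# Alternative reading: keep every row at or below the median income.
median_income = toy["AMT_INCOME_TOTAL"].median()
at_or_below_median = toy[toy["AMT_INCOME_TOTAL"] <= median_income]

print(len(half_rows), len(at_or_below_median))
```

With ties at the median, the two selections differ in size, so the resulting boxplots can differ slightly depending on which reading is used.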
TASK: Recreate the Heat Map shown below:¶
This heatmap shows the correlation between the columns in the dataframe. You can get the correlations with .corr(). Also note that the FLAG_MOBIL column has NaN correlation with every other column, so you should drop it before calling .corr().
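# numeric_only=True (available from pandas 1.5) skips the string columns;
# pandas >= 2.0 raises an error on them without it
df.corr(numeric_only=True)
sns.heatmap(df.drop('FLAG_MOBIL',axis=1).corr(numeric_only=True),cmap="viridis")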
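A column that never varies has zero variance, so its Pearson correlation is undefined and pandas reports NaN. The sketch below illustrates this on made-up data. That FLAG_MOBIL is constant in the real file is an assumption here; the source only states that its correlations are NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "FLAG_MOBIL": [1, 1, 1, 1],       # constant column (assumed behaviour)
    "CNT_CHILDREN": [0, 1, 2, 0],
    "CNT_FAM_MEMBERS": [2, 3, 4, 1],
})

# The constant column yields NaN against every column.
corr = toy.corr()
print(corr)

# Dropping it first leaves a fully defined correlation matrix.
clean_corr = toy.drop("FLAG_MOBIL", axis=1).corr()
print(clean_corr)
```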