___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Categorical Plots - Distribution within Categories¶

So far we've seen how to apply a statistical estimation (like mean or count) to categories and compare them to one another. Let's now explore how to visualize the distribution within categories. We already know about distplot() which allows to view the distribution of a single feature, now we will break down that same distribution per category.

Imports¶

In [2]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The Data¶

In [6]:

df = pd.read_csv("StudentsPerformance.csv")

In [7]:

df.head()

Out[7]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	group B	bachelor's degree	standard	none	72	72	74
1	female	group C	some college	standard	completed	69	90	88
2	female	group B	master's degree	standard	none	90	95	93
3	male	group A	associate's degree	free/reduced	none	47	57	44
4	male	group C	some college	standard	none	76	78	75

Boxplot¶

As described in the video, a boxplot display distribution through the use of quartiles and an IQR for outliers.

In [17]:

plt.figure(figsize=(12,6))
sns.boxplot(x='parental level of education',y='math score',data=df)

Out[17]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

Adding hue for further segmentation¶

In [19]:

plt.figure(figsize=(12,6))
sns.boxplot(x='parental level of education',y='math score',data=df,hue='gender')

# Optional move the legend outside
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Out[19]:

<matplotlib.legend.Legend at 0x2116e6b9948>

Boxplot Styling Parameters¶

Orientation¶

In [26]:

# NOTICE HOW WE HAVE TO SWITCH X AND Y FOR THE ORIENTATION TO MAKE SENSE!
sns.boxplot(x='math score',y='parental level of education',data=df,orient='h')

Out[26]:

<AxesSubplot:xlabel='math score', ylabel='parental level of education'>

Width¶

In [29]:

plt.figure(figsize=(12,6))
sns.boxplot(x='parental level of education',y='math score',data=df,hue='gender',width=0.3)

Out[29]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

Violinplot¶

A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [30]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df)

Out[30]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

In [31]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,hue='gender')

Out[31]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

Violinplot Parameters¶

split¶

When using hue nesting with a variable that takes two levels, setting split to True will draw half of a violin for each level. This can make it easier to directly compare the distributions.

In [32]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,hue='gender',split=True)

Out[32]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

inner¶

Representation of the datapoints in the violin interior. If box, draw a miniature boxplot. If quartiles, draw the quartiles of the distribution. If point or stick, show each underlying datapoint. Using None will draw unadorned violins.

In [38]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,inner=None)

Out[38]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

In [39]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,inner='box')

Out[39]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

In [35]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,inner='quartile')

Out[35]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

In [43]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,inner='stick')

Out[43]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

orientation¶

In [45]:

# Simply switch the continuous variable to y and the categorical to x
sns.violinplot(x='math score',y='parental level of education',data=df,)

Out[45]:

<AxesSubplot:xlabel='math score', ylabel='parental level of education'>

bandwidth¶

Similar to bandwidth argument for kdeplot

In [48]:

plt.figure(figsize=(12,6))
sns.violinplot(x='parental level of education',y='math score',data=df,bw=0.1)

Out[48]:

<AxesSubplot:xlabel='parental level of education', ylabel='math score'>

Advanced Plots¶

We can use a boxenplot and swarmplot to achieve the same effect as the boxplot and violinplot, but with slightly more information included. Be careful when using these plots, as they often require you to educate the viewer with how the plot is actually constructed. Only use these if you are sure your audience will understand the visualization.

In [49]:

df.head()

Out[49]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	group B	bachelor's degree	standard	none	72	72	74
1	female	group C	some college	standard	completed	69	90	88
2	female	group B	master's degree	standard	none	90	95	93
3	male	group A	associate's degree	free/reduced	none	47	57	44
4	male	group C	some college	standard	none	76	78	75

swarmplot¶

In [50]:

sns.swarmplot(x='math score',data=df)

Out[50]:

<AxesSubplot:xlabel='math score'>

In [53]:

sns.swarmplot(x='math score',y='race/ethnicity',data=df)

Out[53]:

<AxesSubplot:xlabel='math score', ylabel='race/ethnicity'>

In [54]:

sns.swarmplot(x='race/ethnicity',y='math score',data=df)

Out[54]:

<AxesSubplot:xlabel='race/ethnicity', ylabel='math score'>

In [56]:

plt.figure(figsize=(12,6))
sns.swarmplot(x='race/ethnicity',y='math score',data=df,hue='gender')

Out[56]:

<AxesSubplot:xlabel='race/ethnicity', ylabel='math score'>

In [57]:

plt.figure(figsize=(12,6))
sns.swarmplot(x='race/ethnicity',y='math score',data=df,hue='gender',dodge=True)

Out[57]:

<AxesSubplot:xlabel='race/ethnicity', ylabel='math score'>

boxenplot (letter-value plot)¶

Official Paper on this plot: https://vita.had.co.nz/papers/letter-value-plot.html

This style of plot was originally named a “letter value” plot because it shows a large number of quantiles that are defined as “letter values”. It is similar to a box plot in plotting a nonparametric representation of a distribution in which all features correspond to actual observations. By plotting more quantiles, it provides more information about the shape of the distribution, particularly in the tails.

In [59]:

sns.boxenplot(x='math score',y='race/ethnicity',data=df)

Out[59]:

<AxesSubplot:xlabel='math score', ylabel='race/ethnicity'>

In [60]:

sns.boxenplot(x='race/ethnicity',y='math score',data=df)

Out[60]:

<AxesSubplot:xlabel='race/ethnicity', ylabel='math score'>

In [62]:

plt.figure(figsize=(12,6))
sns.boxenplot(x='race/ethnicity',y='math score',data=df,hue='gender')

Out[62]:

<AxesSubplot:xlabel='race/ethnicity', ylabel='math score'>

1.1 MiB Raw Blame History