Dealing with Outliers¶
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.
Remember that even if a data point is an outlier, it's still a data point! Carefully consider your data, its sources, and your goals whenever deciding to remove an outlier. Each case is different!
Lecture Goals¶
- Understand different mathematical definitions of outliers
- Use Python tools to recognize outliers and remove them
Useful Links¶
Imports¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Generating Data¶
# Choose a mean, standard deviation, and number of samples
def create_ages(mu=50,sigma=13,num_samples=100,seed=42):
    # Set a random seed in the same cell as the random call to get the same values as us
    # We set seed to 42 (42 is an arbitrary choice from Hitchhiker's Guide to the Galaxy)
    np.random.seed(seed)
    sample_ages = np.random.normal(loc=mu,scale=sigma,size=num_samples)
    sample_ages = np.round(sample_ages,decimals=0)
    return sample_ages
sample = create_ages()
sample
Visualize and Describe the Data¶
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(sample,bins=10,kde=False)
sns.boxplot(x=sample)
ser = pd.Series(sample)
ser.describe()
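The points a box plot draws beyond its whiskers follow a simple rule: by default seaborn places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule is below. The helper name `iqr_fences` and the small `ages` series are made up for illustration and are not the lecture's sample.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages, chosen by hand so the quartiles are easy to check
ages = pd.Series([16, 42, 45, 48, 50, 52, 55, 58, 61])
lower, upper = iqr_fences(ages)

# Anything outside the fences is what a box plot would draw as a separate point
flagged = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(flagged.tolist())
```

Being flagged by this rule only means a point is far from the bulk of the data. Whether to remove it is still a judgment call.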
Trimming or Fixing Based Off Domain Knowledge¶
If we know we're dealing with a dataset pertaining to voting age (18 years old in the USA), then it makes sense to either drop anything less than that OR fix values lower than 18 and push them up to 18.
# Use >= so that people who are exactly 18 (old enough to vote) are kept
ser[ser >= 18]
# It dropped one person
len(ser[ser >= 18])
def fix_values(age):
    if age < 18:
        return 18
    else:
        return age
# "Fixes" one person's age
ser.apply(fix_values)
len(ser.apply(fix_values))
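Both options can also be written without a helper function. The sketch below uses a small made-up series (not the lecture's sample) and assumes that 18 itself counts as a valid voting age. `Series.clip` is a vectorized equivalent of applying a function like `fix_values` element by element.

```python
import pandas as pd

# Hypothetical ages; 16 is the only value below the voting age
ages = pd.Series([16.0, 18.0, 35.0, 70.0])

# Option 1: trim, i.e. drop every row below 18
trimmed = ages[ages >= 18]

# Option 2: fix, i.e. raise every value below 18 up to 18
floored = ages.clip(lower=18)

print(len(ages), len(trimmed), len(floored))
print(floored.tolist())
```

Trimming changes the number of rows, while clipping keeps every row and only changes the offending values.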
There are many ways to identify and remove outliers:
- Trimming based off a provided value
- Capping based off IQR or STD
- https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
- https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623
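As a rough illustration of capping based on IQR or STD, the sketch below clips values to a lower and upper bound instead of dropping them. The function name `cap_outliers`, its defaults (1.5 IQRs, 3 standard deviations), and the synthetic data are assumptions chosen for this example. They are common conventions, not fixed rules.

```python
import numpy as np
import pandas as pd

def cap_outliers(series, method="iqr", k=None):
    """Clip values to bounds derived from the IQR or the standard deviation."""
    if method == "iqr":
        k = 1.5 if k is None else k
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        spread = q3 - q1
        lower, upper = q1 - k * spread, q3 + k * spread
    elif method == "std":
        k = 3.0 if k is None else k
        mean, std = series.mean(), series.std()
        lower, upper = mean - k * std, mean + k * std
    else:
        raise ValueError("method must be 'iqr' or 'std'")
    return series.clip(lower=lower, upper=upper)

# Synthetic data with one injected extreme value
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(50, 13, size=200))
values.iloc[0] = 500.0

capped_iqr = cap_outliers(values, method="iqr")
capped_std = cap_outliers(values, method="std")
print(values.max(), capped_iqr.max(), capped_std.max())
```

Note that the mean and standard deviation are themselves pulled by extreme values, so the STD bounds tend to be looser than the IQR bounds on data like this.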
Ames Data Set¶
Let's explore any extreme outliers in our Ames Housing Data Set.
df = pd.read_csv("../DATA/Ames_Housing_Data.csv")
df.head()
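# numeric_only=True skips the text columns; recent pandas versions raise an error without it
sns.heatmap(df.corr(numeric_only=True))
df.corr(numeric_only=True)['SalePrice'].sort_values()
# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a similar plot
sns.histplot(df["SalePrice"],kde=True)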
sns.scatterplot(x='Overall Qual',y='SalePrice',data=df)
df[(df['Overall Qual']>8) & (df['SalePrice']<200000)]
sns.scatterplot(x='Gr Liv Area',y='SalePrice',data=df)
df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)]
df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)].index
ind_drop = df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)].index
df = df.drop(ind_drop,axis=0)
sns.scatterplot(x='Gr Liv Area',y='SalePrice',data=df)
sns.scatterplot(x='Overall Qual',y='SalePrice',data=df)
df.to_csv("../DATA/Ames_outliers_removed.csv",index=False)
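The filter-then-drop pattern used above can be checked on a small frame. The sketch below uses a made-up `toy` DataFrame (not the Ames data) that borrows the same two column names and thresholds, and shows that dropping by index and keeping the rows where the mask is false give the same result.

```python
import pandas as pd

# Hypothetical rows: two large, cheap houses and one large, expensive one
toy = pd.DataFrame({
    "Gr Liv Area": [1500, 2200, 4500, 5600, 4300],
    "SalePrice": [180000, 250000, 180000, 160000, 700000],
})

# Boolean mask for rows that are both very large and unusually cheap
mask = (toy["Gr Liv Area"] > 4000) & (toy["SalePrice"] < 400000)

# Two equivalent ways to remove the masked rows
dropped = toy.drop(toy[mask].index, axis=0)
cleaned = toy[~mask]

print(len(toy), "->", len(cleaned))
print(cleaned.index.tolist())
```

The large, expensive house stays because it fails the price condition. Only rows matching both conditions are removed.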