Missing Data¶
Make sure to review the video for a full discussion of strategies for dealing with missing data.
What Null/NA/nan objects look like:¶
Source: https://github.com/pandas-dev/pandas/issues/28095
A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type
import numpy as np
import pandas as pd
np.nan
pd.NA
pd.NaT
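As a rough illustration of where each marker shows up, the sketch below builds three small Series with made-up values. The variable names and data are illustrative only; the `"Int64"` dtype string is the nullable integer type mentioned above.

```python
import numpy as np
import pandas as pd

# Default numeric handling: the missing entry forces float64 and is stored as np.nan
float_ser = pd.Series([1, None, 3])

# Nullable integer dtype (capital "I"): the missing entry is stored as pd.NA
int_ser = pd.Series([1, None, 3], dtype="Int64")

# Datetime data: the missing entry is stored as pd.NaT
date_ser = pd.Series(pd.to_datetime(["2020-01-01", None]))

print(float_ser.dtype, float_ser.iloc[1])
print(int_ser.dtype, int_ser.iloc[1])
print(date_ser.dtype, date_ser.iloc[1])
```

Whichever marker is used, `isnull()` and `notnull()` treat all three as missing.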
Note! Avoid ordinary comparisons with missing values¶
- https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
- https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true
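The reasoning is that a missing value is unknown, so there is no way to tell whether two missing values are equal to each other. That is why an equality test against np.nan returns False, and a comparison involving pd.NA returns pd.NA rather than True or False.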
np.nan == np.nan
np.nan in [np.nan]
np.nan is np.nan
pd.NA == pd.NA
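Since `==` cannot be relied on, a safer pattern is to use the dedicated helper functions. A minimal sketch, where `value` and `result` are just illustrative names:

```python
import numpy as np
import pandas as pd

value = np.nan

# Equality is always False for NaN, so this check never detects it
print(value == np.nan)

# The helper functions are the reliable way to test for missing values
print(np.isnan(value))   # works for float NaN
print(pd.isna(value))    # works for np.nan, None, pd.NA and pd.NaT
print(pd.isna(pd.NA))
print(pd.isna(None))

# pd.NA propagates: the comparison gives back pd.NA, not a boolean
result = pd.NA == pd.NA
print(result is pd.NA)
```

Because `pd.NA == pd.NA` is itself `pd.NA`, using it directly in an `if` statement raises a `TypeError`, which is another reason to prefer `pd.isna`.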
Data¶
People were asked to score their opinions of actors on a 1-10 scale before and after watching one of their movies. However, some of the data is missing.
df = pd.read_csv('movie_scores.csv')
df
Checking and Selecting for Null Values¶
df
df.isnull()
df.notnull()
df['first_name']
df[df['first_name'].notnull()]
df[(df['pre_movie_score'].isnull()) & df['sex'].notnull()]
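The same selections can be tried without the CSV file. The sketch below uses a small made-up frame (`toy`) whose column names mirror the ones used above; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the movie scores data
toy = pd.DataFrame({
    "first_name": ["Tom", None, "Hugh"],
    "sex": ["m", None, "m"],
    "pre_movie_score": [8.0, np.nan, np.nan],
})

# Each call returns a boolean Series that can be used as a row mask
name_known = toy["first_name"].notnull()
score_missing = toy["pre_movie_score"].isnull()

# Combine masks with & (and | for "or"); Python's `and`/`or` do not work here
missing_score_known_sex = toy[score_missing & toy["sex"].notnull()]

# Summing a boolean frame counts the missing values per column
missing_counts = toy.isnull().sum()

print(missing_score_known_sex)
print(missing_counts)
```

When conditions are written inline, each comparison should be wrapped in parentheses because `&` binds more tightly than comparison operators.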
Drop Data¶
df
help(df.dropna)
df.dropna()
df.dropna(thresh=1)
df.dropna(axis=1)
df.dropna(thresh=4,axis=1)
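The `thresh` and `axis` arguments are easy to mix up, so here is a small sketch on a made-up frame. `thresh=n` keeps a row (or column, with `axis=1`) only if it has at least `n` non-missing values.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, row 1 has one value, row 2 is empty
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

rows_complete = toy.dropna()                    # keep rows with no missing values
rows_any_value = toy.dropna(thresh=1)           # keep rows with at least 1 value
cols_complete = toy.dropna(axis=1)              # keep columns with no missing values
cols_two_values = toy.dropna(axis=1, thresh=2)  # keep columns with at least 2 values

print(rows_complete.index.tolist())
print(rows_any_value.index.tolist())
print(cols_complete.columns.tolist())
print(cols_two_values.columns.tolist())
```

None of these calls modify `toy`; each returns a new DataFrame, so the result has to be assigned if it should be kept.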
Fill Data¶
df
df.fillna("NEW VALUE!")
df['first_name'].fillna("Empty")
df['first_name'] = df['first_name'].fillna("Empty")
df
df['pre_movie_score'].mean()
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())
df.fillna(df.mean(numeric_only=True))
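To see how filling with column means behaves, the sketch below uses a made-up frame. The `post_movie_score` column name is an assumption for illustration. In recent pandas versions, taking the mean of a frame that contains text columns raises an error unless `numeric_only=True` is passed, so the sketch passes it explicitly.

```python
import numpy as np
import pandas as pd

# Made-up data; post_movie_score is a hypothetical column name
toy = pd.DataFrame({
    "first_name": ["Tom", None, "Hugh"],
    "pre_movie_score": [8.0, np.nan, 6.0],
    "post_movie_score": [10.0, 7.0, np.nan],
})

# One mean per numeric column, returned as a Series indexed by column name
col_means = toy.mean(numeric_only=True)

# Passing that Series to fillna fills each column with its own mean
filled = toy.fillna(col_means)

print(col_means)
print(filled)
```

Columns that are not in `col_means`, such as `first_name` here, are left untouched, so their missing values remain.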
Filling with Interpolation¶
Be careful with this technique. Interpolation estimates a missing value from its neighbors, so it only makes sense when the rows have a meaningful order; make sure that is a valid assumption for your data. Several methods are available, and the default is linear.
Full Docs on this Method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}
ser = pd.Series(airline_tix)
ser
ser.interpolate()
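To make the default behavior concrete, the sketch below interpolates a Series with a made-up numeric index. The default `'linear'` method ignores the index and treats the values as equally spaced, while `method='index'` uses the index values as positions.

```python
import numpy as np
import pandas as pd

# Made-up numeric index with uneven spacing between the points
ser_num = pd.Series([100.0, np.nan, 50.0, 30.0], index=[0, 1, 4, 5])

# Default linear method: halfway between the neighbors 100 and 50
linear = ser_num.interpolate()

# Index-aware method: position 1 is a quarter of the way from 0 to 4,
# so the estimate is 100 + 0.25 * (50 - 100)
by_index = ser_num.interpolate(method="index")

print(linear.iloc[1])    # 75.0
print(by_index.iloc[1])  # 87.5
```

Which of the two is appropriate depends on whether the index carries real spacing information, such as timestamps or distances.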
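# Raises an error as written: method='spline' needs an explicit order and a numeric or datetime index
ser.interpolate(method='spline')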
df = pd.DataFrame(ser,columns=['Price'])
df
df.interpolate()
df = df.reset_index()
df
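# Interpolate only the numeric column (the 'index' column holds text); spline methods require SciPy
df[['Price']].interpolate(method='spline', order=2)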