{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missing Data\n",
"\n",
"Make sure to review the video for a full discussion of strategies for dealing with missing data.\n",
"\n",
"--------\n",
"\n",
"\n",
"## What Null/NA/nan objects look like:\n",
"\n",
"Source: https://github.com/pandas-dev/pandas/issues/28095\n",
"\n",
"pandas 1.0 introduced `pd.NA`, a singleton that represents scalar missing values. Before that, pandas used several different values for missing data: `np.nan` for float data, `np.nan` or `None` for object-dtype data, and `pd.NaT` for datetime-like data. The goal of `pd.NA` is to provide a “missing” indicator that can be used consistently across data types. `pd.NA` is currently used by the nullable integer and boolean data types and the new string data type."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<NA>"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NaT"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NaT"
]
},
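{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"## Note! Avoid ordinary comparisons with missing values\n",
"\n",
"* https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b\n",
"* https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true\n",
"\n",
"The reasoning is that a missing value is unknown, so we cannot tell whether two missing values are equal. That is why `np.nan == np.nan` is `False` and `pd.NA == pd.NA` returns `<NA>`. `np.nan in [np.nan]` and `np.nan is np.nan` are `True` only because Python checks object identity there (membership tests try `is` before `==`). To test for missing values, use `pd.isna()` or the `isnull()` and `notnull()` methods instead."
]
},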
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan == np.nan"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan in [np.nan]"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan is np.nan"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<NA>"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA == pd.NA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"People were asked to score their opinions of actors on a 1-10 scale before and after watching one of their movies. However, some data is missing."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('movie_scores.csv')"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"1 NaN NaN NaN NaN NaN NaN\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking and Selecting for Null Values"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"1 NaN NaN NaN NaN NaN NaN\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 False False False False False False\n",
"1 True True True True True True\n",
"2 False False False False True True\n",
"3 False False False False False False\n",
"4 False False False False False False"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull()"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 True True True True True True\n",
"1 False False False False False False\n",
"2 True True True True False False\n",
"3 True True True True True True\n",
"4 True True True True True True"
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.notnull()"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Tom\n",
"1 NaN\n",
"2 Hugh\n",
"3 Oprah\n",
"4 Emma\n",
"Name: first_name, dtype: object"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['first_name']"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['first_name'].notnull()]"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"2 Hugh Jackman 51.0 m NaN NaN"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Rows where the pre-movie score is missing but sex is recorded\n",
"df[df['pre_movie_score'].isnull() & df['sex'].notnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Drop Data"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"1 NaN NaN NaN NaN NaN NaN\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on method dropna in module pandas.core.frame:\n",
"\n",
"dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance\n",
" Remove missing values.\n",
" \n",
" See the :ref:`User Guide <missing_data>` for more on which values are\n",
" considered missing, and how to work with missing data.\n",
" \n",
" Parameters\n",
" ----------\n",
" axis : {0 or 'index', 1 or 'columns'}, default 0\n",
" Determine if rows or columns which contain missing values are\n",
" removed.\n",
" \n",
" * 0, or 'index' : Drop rows which contain missing values.\n",
" * 1, or 'columns' : Drop columns which contain missing value.\n",
" \n",
" .. versionchanged:: 1.0.0\n",
" \n",
" Pass tuple or list to drop on multiple axes.\n",
" Only a single axis is allowed.\n",
" \n",
" how : {'any', 'all'}, default 'any'\n",
" Determine if row or column is removed from DataFrame, when we have\n",
" at least one NA or all NA.\n",
" \n",
" * 'any' : If any NA values are present, drop that row or column.\n",
" * 'all' : If all values are NA, drop that row or column.\n",
" \n",
" thresh : int, optional\n",
" Require that many non-NA values.\n",
" subset : array-like, optional\n",
" Labels along other axis to consider, e.g. if you are dropping rows\n",
" these would be a list of columns to include.\n",
" inplace : bool, default False\n",
" If True, do operation inplace and return None.\n",
" \n",
" Returns\n",
" -------\n",
" DataFrame\n",
" DataFrame with NA entries dropped from it.\n",
" \n",
" See Also\n",
" --------\n",
" DataFrame.isna: Indicate missing values.\n",
" DataFrame.notna : Indicate existing (non-missing) values.\n",
" DataFrame.fillna : Replace missing values.\n",
" Series.dropna : Drop missing values.\n",
" Index.dropna : Drop missing indices.\n",
" \n",
" Examples\n",
" --------\n",
" >>> df = pd.DataFrame({\"name\": ['Alfred', 'Batman', 'Catwoman'],\n",
" ... \"toy\": [np.nan, 'Batmobile', 'Bullwhip'],\n",
" ... \"born\": [pd.NaT, pd.Timestamp(\"1940-04-25\"),\n",
" ... pd.NaT]})\n",
" >>> df\n",
" name toy born\n",
" 0 Alfred NaN NaT\n",
" 1 Batman Batmobile 1940-04-25\n",
" 2 Catwoman Bullwhip NaT\n",
" \n",
" Drop the rows where at least one element is missing.\n",
" \n",
" >>> df.dropna()\n",
" name toy born\n",
" 1 Batman Batmobile 1940-04-25\n",
" \n",
" Drop the columns where at least one element is missing.\n",
" \n",
" >>> df.dropna(axis='columns')\n",
" name\n",
" 0 Alfred\n",
" 1 Batman\n",
" 2 Catwoman\n",
" \n",
" Drop the rows where all elements are missing.\n",
" \n",
" >>> df.dropna(how='all')\n",
" name toy born\n",
" 0 Alfred NaN NaT\n",
" 1 Batman Batmobile 1940-04-25\n",
" 2 Catwoman Bullwhip NaT\n",
" \n",
" Keep only the rows with at least 2 non-NA values.\n",
" \n",
" >>> df.dropna(thresh=2)\n",
" name toy born\n",
" 1 Batman Batmobile 1940-04-25\n",
" 2 Catwoman Bullwhip NaT\n",
" \n",
" Define in which columns to look for missing values.\n",
" \n",
" >>> df.dropna(subset=['name', 'born'])\n",
" name toy born\n",
" 1 Batman Batmobile 1940-04-25\n",
" \n",
" Keep the DataFrame with valid entries in the same variable.\n",
" \n",
" >>> df.dropna(inplace=True)\n",
" >>> df\n",
" name toy born\n",
" 1 Batman Batmobile 1940-04-25\n",
"\n"
]
}
],
"source": [
"help(df.dropna)"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop every row that contains at least one missing value\n",
"df.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the rows that have at least 1 non-NA value\n",
"df.dropna(thresh=1)"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: [0, 1, 2, 3, 4]"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop every column that contains at least one missing value (here, all of them)\n",
"df.dropna(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex\n",
"0 Tom Hanks 63.0 m\n",
"1 NaN NaN NaN NaN\n",
"2 Hugh Jackman 51.0 m\n",
"3 Oprah Winfrey 66.0 f\n",
"4 Emma Stone 31.0 f"
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the columns that have at least 4 non-NA values\n",
"df.dropna(thresh=4, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fill Data"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"1 NaN NaN NaN NaN NaN NaN\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63</td>\n",
" <td>m</td>\n",
" <td>8</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NEW VALUE!</td>\n",
" <td>NEW VALUE!</td>\n",
" <td>NEW VALUE!</td>\n",
" <td>NEW VALUE!</td>\n",
" <td>NEW VALUE!</td>\n",
" <td>NEW VALUE!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51</td>\n",
" <td>m</td>\n",
" <td>NEW VALUE!</td>\n",
" <td>NEW VALUE!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66</td>\n",
" <td>f</td>\n",
" <td>6</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31</td>\n",
" <td>f</td>\n",
" <td>7</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score \\\n",
"0 Tom Hanks 63 m 8 \n",
"1 NEW VALUE! NEW VALUE! NEW VALUE! NEW VALUE! NEW VALUE! \n",
"2 Hugh Jackman 51 m NEW VALUE! \n",
"3 Oprah Winfrey 66 f 6 \n",
"4 Emma Stone 31 f 7 \n",
"\n",
" post_movie_score \n",
"0 10 \n",
"1 NEW VALUE! \n",
"2 NEW VALUE! \n",
"3 8 \n",
"4 9 "
]
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Replace every missing value with one constant (numeric columns become object dtype)\n",
"df.fillna(\"NEW VALUE!\")"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Tom\n",
"1 Empty\n",
"2 Hugh\n",
"3 Oprah\n",
"4 Emma\n",
"Name: first_name, dtype: object"
]
},
"execution_count": 151,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['first_name'].fillna(\"Empty\")"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [],
"source": [
"# Assign the filled Series back to the column to keep the change\n",
"df['first_name'] = df['first_name'].fillna(\"Empty\")"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.0</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Empty</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.0</td>\n",
" <td>m</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.0</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.0</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.0 m 8.0 10.0\n",
"1 Empty NaN NaN NaN NaN NaN\n",
"2 Hugh Jackman 51.0 m NaN NaN\n",
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
"4 Emma Stone 31.0 f 7.0 9.0"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7.0"
]
},
"execution_count": 154,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['pre_movie_score'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 8.0\n",
"1 7.0\n",
"2 7.0\n",
"3 6.0\n",
"4 7.0\n",
"Name: pre_movie_score, dtype: float64"
]
},
"execution_count": 155,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['pre_movie_score'].fillna(df['pre_movie_score'].mean())"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>pre_movie_score</th>\n",
" <th>post_movie_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tom</td>\n",
" <td>Hanks</td>\n",
" <td>63.00</td>\n",
" <td>m</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Empty</td>\n",
" <td>NaN</td>\n",
" <td>52.75</td>\n",
" <td>NaN</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Hugh</td>\n",
" <td>Jackman</td>\n",
" <td>51.00</td>\n",
" <td>m</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oprah</td>\n",
" <td>Winfrey</td>\n",
" <td>66.00</td>\n",
" <td>f</td>\n",
" <td>6.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Emma</td>\n",
" <td>Stone</td>\n",
" <td>31.00</td>\n",
" <td>f</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name age sex pre_movie_score post_movie_score\n",
"0 Tom Hanks 63.00 m 8.0 10.0\n",
"1 Empty NaN 52.75 NaN 7.0 9.0\n",
"2 Hugh Jackman 51.00 m 7.0 9.0\n",
"3 Oprah Winfrey 66.00 f 6.0 8.0\n",
"4 Emma Stone 31.00 f 7.0 9.0"
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(df.mean(numeric_only=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filling with Interpolation\n",
"\n",
"Be careful with this technique: make sure you understand whether interpolation is a valid choice for your data. Several methods are available; the default is linear, which treats the values as equally spaced.\n",
"\n",
"Full Docs on this Method:\n",
"https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html"
]
},
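{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the `method` argument changes the result, assuming a small hypothetical Series `demo` with an unevenly spaced numeric index (it is not part of the data above). The default `linear` method treats the rows as equally spaced, while `method='index'` uses the index values as positions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical example data (an assumption for illustration):\n",
"# the index values 0, 1, 3 are unevenly spaced on purpose\n",
"demo = pd.Series([10.0, np.nan, 40.0], index=[0, 1, 3])\n",
"\n",
"# Default 'linear' ignores the index and treats rows as equally spaced\n",
"linear_fill = demo.interpolate()\n",
"\n",
"# 'index' uses the numeric index values as the x-coordinates\n",
"index_fill = demo.interpolate(method='index')\n",
"\n",
"print(linear_fill)\n",
"print(index_fill)"
]
},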
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [],
"source": [
"airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}"
]
},
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [],
"source": [
"ser = pd.Series(airline_tix)"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"first 100.0\n",
"business NaN\n",
"economy-plus 50.0\n",
"economy 30.0\n",
"dtype: float64"
]
},
"execution_count": 166,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"first 100.0\n",
"business 75.0\n",
"economy-plus 50.0\n",
"economy 30.0\n",
"dtype: float64"
]
},
"execution_count": 167,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser.interpolate()"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-163-106f2287918c>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minterpolate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'spline'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32mc:\\users\\marcial\\anaconda3\\envs\\ml_master\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36minterpolate\u001b[1;34m(self, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)\u001b[0m\n\u001b[0;32m 6992\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mmethods\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mis_numeric_or_datetime\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6993\u001b[0m raise ValueError(\n\u001b[1;32m-> 6994\u001b[1;33m \u001b[1;34m\"Index column must be numeric or datetime type when \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 6995\u001b[0m \u001b[1;34mf\"using {method} method other than linear. \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6996\u001b[0m \u001b[1;34m\"Try setting a numeric or datetime index column before \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating."
]
}
],
"source": [
"ser.interpolate(method='spline')"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(ser,columns=['Price'])"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>first</th>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>business</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>economy-plus</th>\n",
" <td>50.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>economy</th>\n",
" <td>30.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Price\n",
"first 100.0\n",
"business NaN\n",
"economy-plus 50.0\n",
"economy 30.0"
]
},
"execution_count": 170,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 171,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>first</th>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>business</th>\n",
" <td>75.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>economy-plus</th>\n",
" <td>50.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>economy</th>\n",
" <td>30.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Price\n",
"first 100.0\n",
"business 75.0\n",
"economy-plus 50.0\n",
"economy 30.0"
]
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.interpolate()"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [],
"source": [
"df = df.reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>Price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>first</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>business</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>economy-plus</td>\n",
" <td>50.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>economy</td>\n",
" <td>30.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index Price\n",
"0 first 100.0\n",
"1 business NaN\n",
"2 economy-plus 50.0\n",
"3 economy 30.0"
]
},
"execution_count": 175,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>Price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>first</td>\n",
" <td>100.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>business</td>\n",
" <td>73.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>economy-plus</td>\n",
" <td>50.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>economy</td>\n",
" <td>30.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index Price\n",
"0 first 100.000000\n",
"1 business 73.333333\n",
"2 economy-plus 50.000000\n",
"3 economy 30.000000"
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.interpolate(method='spline',order=2)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}