{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
"___\n",
"<center><em>Copyright by Pierian Data Inc.</em></center>\n",
"<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missing Data\n",
"\n",
"Make sure to review the video for a full discussion of strategies for dealing with missing data.\n",
"\n",
"--------\n",
"\n",
"\n",
"## What Null/NA/nan objects look like:\n",
"\n",
"Source: https://github.com/pandas-dev/pandas/issues/28095\n",
"\n",
"A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<NA>"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NaT"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NaT"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"## Note! Avoid typical comparisons with missing values\n",
"\n",
"* https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b\n",
"* https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true\n",
"\n",
"The reasoning is that these values are unknown, so we cannot know whether they are equal to each other. That is why `np.nan == np.nan` is `False` and `pd.NA == pd.NA` returns `<NA>`, while the identity and membership checks (`is`, `in`) still return `True` because they see the same object."
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan == np.nan"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan in [np.nan]"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan is np.nan"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<NA>"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA == pd.NA"
]
},
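{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `==` cannot detect a missing value, such checks are usually written with `pd.isna` instead. The cell below is a minimal sketch and not part of the original lesson; the `values` Series is a made-up example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical example data, not taken from movie_scores.csv\n",
"values = pd.Series([1.0, np.nan, 3.0])\n",
"\n",
"# Comparing with == never matches NaN, so every element is False\n",
"print((values == np.nan).tolist())\n",
"\n",
"# pd.isna flags the missing entry\n",
"print(pd.isna(values).tolist())\n",
"\n",
"# pd.isna also recognizes the other missing-value markers\n",
"print(pd.isna(pd.NA), pd.isna(pd.NaT), pd.isna(None))"
]
},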
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"People were asked to rate their opinion of actors on a 1-10 scale before and after watching one of the actor's movies. However, some of the data is missing."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('movie_scores.csv')"
]
},
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 136,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"1 NaN NaN NaN NaN NaN NaN\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 136,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Checking and Selecting for Null Values"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 137,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"1 NaN NaN NaN NaN NaN NaN\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 137,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 138,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 False False False False False False\n",
|
||
|
"1 True True True True True True\n",
|
||
|
"2 False False False False True True\n",
|
||
|
"3 False False False False False False\n",
|
||
|
"4 False False False False False False"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 138,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.isnull()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 139,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" <td>False</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" <td>True</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 True True True True True True\n",
|
||
|
"1 False False False False False False\n",
|
||
|
"2 True True True True False False\n",
|
||
|
"3 True True True True True True\n",
|
||
|
"4 True True True True True True"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 139,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.notnull()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 140,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0 Tom\n",
|
||
|
"1 NaN\n",
|
||
|
"2 Hugh\n",
|
||
|
"3 Oprah\n",
|
||
|
"4 Emma\n",
|
||
|
"Name: first_name, dtype: object"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 140,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df['first_name']"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 141,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 141,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df[df['first_name'].notnull()]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 142,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 142,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df[df['pre_movie_score'].isnull() & df['sex'].notnull()]"
|
||
|
]
|
||
|
},
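{
"cell_type": "markdown",
"metadata": {},
"source": [
"The boolean mask from `isnull()` can also be summed to count the missing values in each column. This is a small sketch under stated assumptions: `df_demo` is a hypothetical stand-in that copies two columns from the output shown above, so the cell runs without `movie_scores.csv`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for two columns of movie_scores.csv\n",
"df_demo = pd.DataFrame({\n",
"    'first_name': ['Tom', np.nan, 'Hugh', 'Oprah', 'Emma'],\n",
"    'pre_movie_score': [8.0, np.nan, np.nan, 6.0, 7.0],\n",
"})\n",
"\n",
"# True counts as 1, so summing the mask counts missing values per column\n",
"print(df_demo.isnull().sum())\n",
"\n",
"# isna() is the same check under its other name\n",
"print(df_demo.isna().equals(df_demo.isnull()))"
]
},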
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Drop Data"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 143,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"1 NaN NaN NaN NaN NaN NaN\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 143,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 144,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"Help on method dropna in module pandas.core.frame:\n",
|
||
|
"\n",
|
||
|
"dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance\n",
|
||
|
" Remove missing values.\n",
|
||
|
" \n",
|
||
|
" See the :ref:`User Guide <missing_data>` for more on which values are\n",
|
||
|
" considered missing, and how to work with missing data.\n",
|
||
|
" \n",
|
||
|
" Parameters\n",
|
||
|
" ----------\n",
|
||
|
" axis : {0 or 'index', 1 or 'columns'}, default 0\n",
|
||
|
" Determine if rows or columns which contain missing values are\n",
|
||
|
" removed.\n",
|
||
|
" \n",
|
||
|
" * 0, or 'index' : Drop rows which contain missing values.\n",
|
||
|
" * 1, or 'columns' : Drop columns which contain missing value.\n",
|
||
|
" \n",
|
||
|
" .. versionchanged:: 1.0.0\n",
|
||
|
" \n",
|
||
|
" Pass tuple or list to drop on multiple axes.\n",
|
||
|
" Only a single axis is allowed.\n",
|
||
|
" \n",
|
||
|
" how : {'any', 'all'}, default 'any'\n",
|
||
|
" Determine if row or column is removed from DataFrame, when we have\n",
|
||
|
" at least one NA or all NA.\n",
|
||
|
" \n",
|
||
|
" * 'any' : If any NA values are present, drop that row or column.\n",
|
||
|
" * 'all' : If all values are NA, drop that row or column.\n",
|
||
|
" \n",
|
||
|
" thresh : int, optional\n",
|
||
|
" Require that many non-NA values.\n",
|
||
|
" subset : array-like, optional\n",
|
||
|
" Labels along other axis to consider, e.g. if you are dropping rows\n",
|
||
|
" these would be a list of columns to include.\n",
|
||
|
" inplace : bool, default False\n",
|
||
|
" If True, do operation inplace and return None.\n",
|
||
|
" \n",
|
||
|
" Returns\n",
|
||
|
" -------\n",
|
||
|
" DataFrame\n",
|
||
|
" DataFrame with NA entries dropped from it.\n",
|
||
|
" \n",
|
||
|
" See Also\n",
|
||
|
" --------\n",
|
||
|
" DataFrame.isna: Indicate missing values.\n",
|
||
|
" DataFrame.notna : Indicate existing (non-missing) values.\n",
|
||
|
" DataFrame.fillna : Replace missing values.\n",
|
||
|
" Series.dropna : Drop missing values.\n",
|
||
|
" Index.dropna : Drop missing indices.\n",
|
||
|
" \n",
|
||
|
" Examples\n",
|
||
|
" --------\n",
|
||
|
" >>> df = pd.DataFrame({\"name\": ['Alfred', 'Batman', 'Catwoman'],\n",
|
||
|
" ... \"toy\": [np.nan, 'Batmobile', 'Bullwhip'],\n",
|
||
|
" ... \"born\": [pd.NaT, pd.Timestamp(\"1940-04-25\"),\n",
|
||
|
" ... pd.NaT]})\n",
|
||
|
" >>> df\n",
|
||
|
" name toy born\n",
|
||
|
" 0 Alfred NaN NaT\n",
|
||
|
" 1 Batman Batmobile 1940-04-25\n",
|
||
|
" 2 Catwoman Bullwhip NaT\n",
|
||
|
" \n",
|
||
|
" Drop the rows where at least one element is missing.\n",
|
||
|
" \n",
|
||
|
" >>> df.dropna()\n",
|
||
|
" name toy born\n",
|
||
|
" 1 Batman Batmobile 1940-04-25\n",
|
||
|
" \n",
|
||
|
" Drop the columns where at least one element is missing.\n",
|
||
|
" \n",
|
||
|
" >>> df.dropna(axis='columns')\n",
|
||
|
" name\n",
|
||
|
" 0 Alfred\n",
|
||
|
" 1 Batman\n",
|
||
|
" 2 Catwoman\n",
|
||
|
" \n",
|
||
|
" Drop the rows where all elements are missing.\n",
|
||
|
" \n",
|
||
|
" >>> df.dropna(how='all')\n",
|
||
|
" name toy born\n",
|
||
|
" 0 Alfred NaN NaT\n",
|
||
|
" 1 Batman Batmobile 1940-04-25\n",
|
||
|
" 2 Catwoman Bullwhip NaT\n",
|
||
|
" \n",
|
||
|
" Keep only the rows with at least 2 non-NA values.\n",
|
||
|
" \n",
|
||
|
" >>> df.dropna(thresh=2)\n",
|
||
|
" name toy born\n",
|
||
|
" 1 Batman Batmobile 1940-04-25\n",
|
||
|
" 2 Catwoman Bullwhip NaT\n",
|
||
|
" \n",
|
||
|
" Define in which columns to look for missing values.\n",
|
||
|
" \n",
|
||
|
" >>> df.dropna(subset=['name', 'born'])\n",
|
||
|
" name toy born\n",
|
||
|
" 1 Batman Batmobile 1940-04-25\n",
|
||
|
" \n",
|
||
|
" Keep the DataFrame with valid entries in the same variable.\n",
|
||
|
" \n",
|
||
|
" >>> df.dropna(inplace=True)\n",
|
||
|
" >>> df\n",
|
||
|
" name toy born\n",
|
||
|
" 1 Batman Batmobile 1940-04-25\n",
|
||
|
"\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"help(df.dropna)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 145,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 145,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.dropna()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 146,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 146,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.dropna(thresh=1)"
|
||
|
]
|
||
|
},
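{
"cell_type": "markdown",
"metadata": {},
"source": [
"`thresh` sets how many non-missing values a row needs in order to be kept, so `thresh=1` gives the same result as `how='all'`. The sketch below compares the two with a stricter threshold. `df_demo` is a hypothetical three-column stand-in copied from the output above, not the full dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for three columns of movie_scores.csv\n",
"df_demo = pd.DataFrame({\n",
"    'first_name': ['Tom', np.nan, 'Hugh', 'Oprah', 'Emma'],\n",
"    'age': [63.0, np.nan, 51.0, 66.0, 31.0],\n",
"    'pre_movie_score': [8.0, np.nan, np.nan, 6.0, 7.0],\n",
"})\n",
"\n",
"# Drop only the rows in which every value is missing (row 1)\n",
"print(df_demo.dropna(how='all').index.tolist())\n",
"\n",
"# Keep rows with at least 1 non-missing value: same rows as how='all'\n",
"print(df_demo.dropna(thresh=1).index.tolist())\n",
"\n",
"# Keep rows with at least 3 non-missing values: row 2 is dropped as well\n",
"print(df_demo.dropna(thresh=3).index.tolist())"
]
},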
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 147,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
"Empty DataFrame\n",
|
||
|
"Columns: []\n",
|
||
|
"Index: [0, 1, 2, 3, 4]"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 147,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.dropna(axis=1)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 148,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex\n",
|
||
|
"0 Tom Hanks 63.0 m\n",
|
||
|
"1 NaN NaN NaN NaN\n",
|
||
|
"2 Hugh Jackman 51.0 m\n",
|
||
|
"3 Oprah Winfrey 66.0 f\n",
|
||
|
"4 Emma Stone 31.0 f"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 148,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.dropna(thresh=4,axis=1)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Fill Data"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 149,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"1 NaN NaN NaN NaN NaN NaN\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 149,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 150,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8</td>\n",
|
||
|
" <td>10</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" <td>NEW VALUE!</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6</td>\n",
|
||
|
" <td>8</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7</td>\n",
|
||
|
" <td>9</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score \\\n",
|
||
|
"0 Tom Hanks 63 m 8 \n",
|
||
|
"1 NEW VALUE! NEW VALUE! NEW VALUE! NEW VALUE! NEW VALUE! \n",
|
||
|
"2 Hugh Jackman 51 m NEW VALUE! \n",
|
||
|
"3 Oprah Winfrey 66 f 6 \n",
|
||
|
"4 Emma Stone 31 f 7 \n",
|
||
|
"\n",
|
||
|
" post_movie_score \n",
|
||
|
"0 10 \n",
|
||
|
"1 NEW VALUE! \n",
|
||
|
"2 NEW VALUE! \n",
|
||
|
"3 8 \n",
|
||
|
"4 9 "
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 150,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.fillna(\"NEW VALUE!\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 151,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0 Tom\n",
|
||
|
"1 Empty\n",
|
||
|
"2 Hugh\n",
|
||
|
"3 Oprah\n",
|
||
|
"4 Emma\n",
|
||
|
"Name: first_name, dtype: object"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 151,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df['first_name'].fillna(\"Empty\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 152,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"df['first_name'] = df['first_name'].fillna(\"Empty\")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 153,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>Empty</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.0</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.0</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.0 m 8.0 10.0\n",
|
||
|
"1 Empty NaN NaN NaN NaN NaN\n",
|
||
|
"2 Hugh Jackman 51.0 m NaN NaN\n",
|
||
|
"3 Oprah Winfrey 66.0 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.0 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 153,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 154,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"7.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 154,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df['pre_movie_score'].mean()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 155,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0 8.0\n",
|
||
|
"1 7.0\n",
|
||
|
"2 7.0\n",
|
||
|
"3 6.0\n",
|
||
|
"4 7.0\n",
|
||
|
"Name: pre_movie_score, dtype: float64"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 155,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df['pre_movie_score'].fillna(df['pre_movie_score'].mean())"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 156,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>first_name</th>\n",
|
||
|
" <th>last_name</th>\n",
|
||
|
" <th>age</th>\n",
|
||
|
" <th>sex</th>\n",
|
||
|
" <th>pre_movie_score</th>\n",
|
||
|
" <th>post_movie_score</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>Tom</td>\n",
|
||
|
" <td>Hanks</td>\n",
|
||
|
" <td>63.00</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" <td>10.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>Empty</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>52.75</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>Hugh</td>\n",
|
||
|
" <td>Jackman</td>\n",
|
||
|
" <td>51.00</td>\n",
|
||
|
" <td>m</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>Oprah</td>\n",
|
||
|
" <td>Winfrey</td>\n",
|
||
|
" <td>66.00</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>8.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>Emma</td>\n",
|
||
|
" <td>Stone</td>\n",
|
||
|
" <td>31.00</td>\n",
|
||
|
" <td>f</td>\n",
|
||
|
" <td>7.0</td>\n",
|
||
|
" <td>9.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" first_name last_name age sex pre_movie_score post_movie_score\n",
|
||
|
"0 Tom Hanks 63.00 m 8.0 10.0\n",
|
||
|
"1 Empty NaN 52.75 NaN 7.0 9.0\n",
|
||
|
"2 Hugh Jackman 51.00 m 7.0 9.0\n",
|
||
|
"3 Oprah Winfrey 66.00 f 6.0 8.0\n",
|
||
|
"4 Emma Stone 31.00 f 7.0 9.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 156,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.fillna(df.mean())"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Filling with Interpolation\n",
|
||
|
"\n",
|
||
|
"Be careful with this technique, you should try to really understand whether or not this is a valid choice for your data. You should also note there are several methods available, the default is a linear method.\n",
|
||
|
"\n",
|
||
|
"Full Docs on this Method:\n",
|
||
|
"https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 164,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 165,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"ser = pd.Series(airline_tix)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 166,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"first 100.0\n",
|
||
|
"business NaN\n",
|
||
|
"economy-plus 50.0\n",
|
||
|
"economy 30.0\n",
|
||
|
"dtype: float64"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 166,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"ser"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 167,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"first 100.0\n",
|
||
|
"business 75.0\n",
|
||
|
"economy-plus 50.0\n",
|
||
|
"economy 30.0\n",
|
||
|
"dtype: float64"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 167,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"ser.interpolate()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 163,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"ename": "ValueError",
|
||
|
"evalue": "Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.",
|
||
|
"output_type": "error",
|
||
|
"traceback": [
|
||
|
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
|
||
|
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
|
||
|
"\u001b[1;32m<ipython-input-163-106f2287918c>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minterpolate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'spline'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
|
||
|
"\u001b[1;32mc:\\users\\marcial\\anaconda3\\envs\\ml_master\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36minterpolate\u001b[1;34m(self, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)\u001b[0m\n\u001b[0;32m 6992\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mmethods\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mis_numeric_or_datetime\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6993\u001b[0m raise ValueError(\n\u001b[1;32m-> 6994\u001b[1;33m \u001b[1;34m\"Index column must be numeric or datetime type when \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 6995\u001b[0m \u001b[1;34mf\"using {method} method other than linear. \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6996\u001b[0m \u001b[1;34m\"Try setting a numeric or datetime index column before \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
|
||
|
"\u001b[1;31mValueError\u001b[0m: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating."
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"ser.interpolate(method='spline')"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 169,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"df = pd.DataFrame(ser,columns=['Price'])"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 170,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>Price</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>first</th>\n",
|
||
|
" <td>100.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>business</th>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>economy-plus</th>\n",
|
||
|
" <td>50.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>economy</th>\n",
|
||
|
" <td>30.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" Price\n",
|
||
|
"first 100.0\n",
|
||
|
"business NaN\n",
|
||
|
"economy-plus 50.0\n",
|
||
|
"economy 30.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 170,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 171,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>Price</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>first</th>\n",
|
||
|
" <td>100.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>business</th>\n",
|
||
|
" <td>75.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>economy-plus</th>\n",
|
||
|
" <td>50.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>economy</th>\n",
|
||
|
" <td>30.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" Price\n",
|
||
|
"first 100.0\n",
|
||
|
"business 75.0\n",
|
||
|
"economy-plus 50.0\n",
|
||
|
"economy 30.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 171,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.interpolate()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 174,
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"df = df.reset_index()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 175,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>index</th>\n",
|
||
|
" <th>Price</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>first</td>\n",
|
||
|
" <td>100.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>business</td>\n",
|
||
|
" <td>NaN</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>economy-plus</td>\n",
|
||
|
" <td>50.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>economy</td>\n",
|
||
|
" <td>30.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" index Price\n",
|
||
|
"0 first 100.0\n",
|
||
|
"1 business NaN\n",
|
||
|
"2 economy-plus 50.0\n",
|
||
|
"3 economy 30.0"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 175,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 178,
|
||
|
"metadata": {},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>index</th>\n",
|
||
|
" <th>Price</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>first</td>\n",
|
||
|
" <td>100.000000</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>business</td>\n",
|
||
|
" <td>73.333333</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>economy-plus</td>\n",
|
||
|
" <td>50.000000</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>economy</td>\n",
|
||
|
" <td>30.000000</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" index Price\n",
|
||
|
"0 first 100.000000\n",
|
||
|
"1 business 73.333333\n",
|
||
|
"2 economy-plus 50.000000\n",
|
||
|
"3 economy 30.000000"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 178,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"df.interpolate(method='spline',order=2)"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"anaconda-cloud": {},
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.7.6"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 1
|
||
|
}
|