{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "\n",
    "<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>\n",
    "___\n",
    "<center><em>Copyright by Pierian Data Inc.</em></center>\n",
    "<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missing Data\n",
    "\n",
    "Make sure to review the video for a full discussion on the strategies of dealing with missing data.\n",
    "\n",
    "--------\n",
    "\n",
    "\n",
    "## What Null/NA/nan objects look like:\n",
    "\n",
    "Source: https://github.com/pandas-dev/pandas/issues/28095\n",
    "\n",
    "A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<NA>"
      ]
     },
     "execution_count": 129,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.NA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "NaT"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.NaT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "------\n",
    "## Note! Typical comparisons should be avoided with Missing Values\n",
    "\n",
    "* https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b\n",
    "* https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true\n",
    "\n",
    "This is generally because the logic here is, since we don't know these values, we can't know if they are equal to each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 131,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nan == np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nan in [np.nan]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 133,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nan is np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<NA>"
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.NA == pd.NA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "People were asked to score their opinions of actors from a 1-10 scale before and after watching one of their movies. However, some data is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('movie_scores.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age  sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0    m              8.0              10.0\n",
       "1        NaN       NaN   NaN  NaN              NaN               NaN\n",
       "2       Hugh   Jackman  51.0    m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0    f              6.0               8.0\n",
       "4       Emma     Stone  31.0    f              7.0               9.0"
      ]
     },
     "execution_count": 136,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Checking and Selecting for Null Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age  sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0    m              8.0              10.0\n",
       "1        NaN       NaN   NaN  NaN              NaN               NaN\n",
       "2       Hugh   Jackman  51.0    m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0    f              6.0               8.0\n",
       "4       Emma     Stone  31.0    f              7.0               9.0"
      ]
     },
     "execution_count": 137,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   first_name  last_name    age    sex  pre_movie_score  post_movie_score\n",
       "0       False      False  False  False            False             False\n",
       "1        True       True   True   True             True              True\n",
       "2       False      False  False  False             True              True\n",
       "3       False      False  False  False            False             False\n",
       "4       False      False  False  False            False             False"
      ]
     },
     "execution_count": 138,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   first_name  last_name    age    sex  pre_movie_score  post_movie_score\n",
       "0        True       True   True   True             True              True\n",
       "1       False      False  False  False            False             False\n",
       "2        True       True   True   True            False             False\n",
       "3        True       True   True   True             True              True\n",
       "4        True       True   True   True             True              True"
      ]
     },
     "execution_count": 139,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      Tom\n",
       "1      NaN\n",
       "2     Hugh\n",
       "3    Oprah\n",
       "4     Emma\n",
       "Name: first_name, dtype: object"
      ]
     },
     "execution_count": 140,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['first_name']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0   m              8.0              10.0\n",
       "2       Hugh   Jackman  51.0   m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0   f              6.0               8.0\n",
       "4       Emma     Stone  31.0   f              7.0               9.0"
      ]
     },
     "execution_count": 141,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['first_name'].notnull()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age sex  pre_movie_score  post_movie_score\n",
       "2       Hugh   Jackman  51.0   m              NaN               NaN"
      ]
     },
     "execution_count": 142,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[(df['pre_movie_score'].isnull()) & df['sex'].notnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Drop Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age  sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0    m              8.0              10.0\n",
       "1        NaN       NaN   NaN  NaN              NaN               NaN\n",
       "2       Hugh   Jackman  51.0    m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0    f              6.0               8.0\n",
       "4       Emma     Stone  31.0    f              7.0               9.0"
      ]
     },
     "execution_count": 143,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on method dropna in module pandas.core.frame:\n",
      "\n",
      "dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance\n",
      "    Remove missing values.\n",
      "    \n",
      "    See the :ref:`User Guide <missing_data>` for more on which values are\n",
      "    considered missing, and how to work with missing data.\n",
      "    \n",
      "    Parameters\n",
      "    ----------\n",
      "    axis : {0 or 'index', 1 or 'columns'}, default 0\n",
      "        Determine if rows or columns which contain missing values are\n",
      "        removed.\n",
      "    \n",
      "        * 0, or 'index' : Drop rows which contain missing values.\n",
      "        * 1, or 'columns' : Drop columns which contain missing value.\n",
      "    \n",
      "        .. versionchanged:: 1.0.0\n",
      "    \n",
      "           Pass tuple or list to drop on multiple axes.\n",
      "           Only a single axis is allowed.\n",
      "    \n",
      "    how : {'any', 'all'}, default 'any'\n",
      "        Determine if row or column is removed from DataFrame, when we have\n",
      "        at least one NA or all NA.\n",
      "    \n",
      "        * 'any' : If any NA values are present, drop that row or column.\n",
      "        * 'all' : If all values are NA, drop that row or column.\n",
      "    \n",
      "    thresh : int, optional\n",
      "        Require that many non-NA values.\n",
      "    subset : array-like, optional\n",
      "        Labels along other axis to consider, e.g. if you are dropping rows\n",
      "        these would be a list of columns to include.\n",
      "    inplace : bool, default False\n",
      "        If True, do operation inplace and return None.\n",
      "    \n",
      "    Returns\n",
      "    -------\n",
      "    DataFrame\n",
      "        DataFrame with NA entries dropped from it.\n",
      "    \n",
      "    See Also\n",
      "    --------\n",
      "    DataFrame.isna: Indicate missing values.\n",
      "    DataFrame.notna : Indicate existing (non-missing) values.\n",
      "    DataFrame.fillna : Replace missing values.\n",
      "    Series.dropna : Drop missing values.\n",
      "    Index.dropna : Drop missing indices.\n",
      "    \n",
      "    Examples\n",
      "    --------\n",
      "    >>> df = pd.DataFrame({\"name\": ['Alfred', 'Batman', 'Catwoman'],\n",
      "    ...                    \"toy\": [np.nan, 'Batmobile', 'Bullwhip'],\n",
      "    ...                    \"born\": [pd.NaT, pd.Timestamp(\"1940-04-25\"),\n",
      "    ...                             pd.NaT]})\n",
      "    >>> df\n",
      "           name        toy       born\n",
      "    0    Alfred        NaN        NaT\n",
      "    1    Batman  Batmobile 1940-04-25\n",
      "    2  Catwoman   Bullwhip        NaT\n",
      "    \n",
      "    Drop the rows where at least one element is missing.\n",
      "    \n",
      "    >>> df.dropna()\n",
      "         name        toy       born\n",
      "    1  Batman  Batmobile 1940-04-25\n",
      "    \n",
      "    Drop the columns where at least one element is missing.\n",
      "    \n",
      "    >>> df.dropna(axis='columns')\n",
      "           name\n",
      "    0    Alfred\n",
      "    1    Batman\n",
      "    2  Catwoman\n",
      "    \n",
      "    Drop the rows where all elements are missing.\n",
      "    \n",
      "    >>> df.dropna(how='all')\n",
      "           name        toy       born\n",
      "    0    Alfred        NaN        NaT\n",
      "    1    Batman  Batmobile 1940-04-25\n",
      "    2  Catwoman   Bullwhip        NaT\n",
      "    \n",
      "    Keep only the rows with at least 2 non-NA values.\n",
      "    \n",
      "    >>> df.dropna(thresh=2)\n",
      "           name        toy       born\n",
      "    1    Batman  Batmobile 1940-04-25\n",
      "    2  Catwoman   Bullwhip        NaT\n",
      "    \n",
      "    Define in which columns to look for missing values.\n",
      "    \n",
      "    >>> df.dropna(subset=['name', 'born'])\n",
      "           name        toy       born\n",
      "    1    Batman  Batmobile 1940-04-25\n",
      "    \n",
      "    Keep the DataFrame with valid entries in the same variable.\n",
      "    \n",
      "    >>> df.dropna(inplace=True)\n",
      "    >>> df\n",
      "         name        toy       born\n",
      "    1  Batman  Batmobile 1940-04-25\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(df.dropna)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0   m              8.0              10.0\n",
       "3      Oprah   Winfrey  66.0   f              6.0               8.0\n",
       "4       Emma     Stone  31.0   f              7.0               9.0"
      ]
     },
     "execution_count": 145,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0   m              8.0              10.0\n",
       "2       Hugh   Jackman  51.0   m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0   f              6.0               8.0\n",
       "4       Emma     Stone  31.0   f              7.0               9.0"
      ]
     },
     "execution_count": 146,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(thresh=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: [0, 1, 2, 3, 4]"
      ]
     },
     "execution_count": 147,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age  sex\n",
       "0        Tom     Hanks  63.0    m\n",
       "1        NaN       NaN   NaN  NaN\n",
       "2       Hugh   Jackman  51.0    m\n",
       "3      Oprah   Winfrey  66.0    f\n",
       "4       Emma     Stone  31.0    f"
      ]
     },
     "execution_count": 148,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(thresh=4,axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fill Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age  sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0    m              8.0              10.0\n",
       "1        NaN       NaN   NaN  NaN              NaN               NaN\n",
       "2       Hugh   Jackman  51.0    m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0    f              6.0               8.0\n",
       "4       Emma     Stone  31.0    f              7.0               9.0"
      ]
     },
     "execution_count": 149,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63</td>\n",
       "      <td>m</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NEW VALUE!</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51</td>\n",
       "      <td>m</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "      <td>NEW VALUE!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66</td>\n",
       "      <td>f</td>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31</td>\n",
       "      <td>f</td>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   first_name   last_name         age         sex pre_movie_score  \\\n",
       "0         Tom       Hanks          63           m               8   \n",
       "1  NEW VALUE!  NEW VALUE!  NEW VALUE!  NEW VALUE!      NEW VALUE!   \n",
       "2        Hugh     Jackman          51           m      NEW VALUE!   \n",
       "3       Oprah     Winfrey          66           f               6   \n",
       "4        Emma       Stone          31           f               7   \n",
       "\n",
       "  post_movie_score  \n",
       "0               10  \n",
       "1       NEW VALUE!  \n",
       "2       NEW VALUE!  \n",
       "3                8  \n",
       "4                9  "
      ]
     },
     "execution_count": 150,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(\"NEW VALUE!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      Tom\n",
       "1    Empty\n",
       "2     Hugh\n",
       "3    Oprah\n",
       "4     Emma\n",
       "Name: first_name, dtype: object"
      ]
     },
     "execution_count": 151,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['first_name'].fillna(\"Empty\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['first_name'] = df['first_name'].fillna(\"Empty\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.0</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Empty</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.0</td>\n",
       "      <td>m</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.0</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.0</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name   age  sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.0    m              8.0              10.0\n",
       "1      Empty       NaN   NaN  NaN              NaN               NaN\n",
       "2       Hugh   Jackman  51.0    m              NaN               NaN\n",
       "3      Oprah   Winfrey  66.0    f              6.0               8.0\n",
       "4       Emma     Stone  31.0    f              7.0               9.0"
      ]
     },
     "execution_count": 153,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.0"
      ]
     },
     "execution_count": 154,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['pre_movie_score'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    8.0\n",
       "1    7.0\n",
       "2    7.0\n",
       "3    6.0\n",
       "4    7.0\n",
       "Name: pre_movie_score, dtype: float64"
      ]
     },
     "execution_count": 155,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['pre_movie_score'].fillna(df['pre_movie_score'].mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>pre_movie_score</th>\n",
       "      <th>post_movie_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Tom</td>\n",
       "      <td>Hanks</td>\n",
       "      <td>63.00</td>\n",
       "      <td>m</td>\n",
       "      <td>8.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Empty</td>\n",
       "      <td>NaN</td>\n",
       "      <td>52.75</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hugh</td>\n",
       "      <td>Jackman</td>\n",
       "      <td>51.00</td>\n",
       "      <td>m</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Oprah</td>\n",
       "      <td>Winfrey</td>\n",
       "      <td>66.00</td>\n",
       "      <td>f</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Emma</td>\n",
       "      <td>Stone</td>\n",
       "      <td>31.00</td>\n",
       "      <td>f</td>\n",
       "      <td>7.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name    age  sex  pre_movie_score  post_movie_score\n",
       "0        Tom     Hanks  63.00    m              8.0              10.0\n",
       "1      Empty       NaN  52.75  NaN              7.0               9.0\n",
       "2       Hugh   Jackman  51.00    m              7.0               9.0\n",
       "3      Oprah   Winfrey  66.00    f              6.0               8.0\n",
       "4       Emma     Stone  31.00    f              7.0               9.0"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(df.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filling with Interpolation\n",
    "\n",
    "Be careful with this technique, you should try to really understand whether or not this is a valid choice for your data. You should also note there are several methods available, the default is a linear method.\n",
    "\n",
    "Full Docs on this Method:\n",
    "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [],
   "source": [
    "airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {},
   "outputs": [],
   "source": [
    "ser = pd.Series(airline_tix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "first           100.0\n",
       "business          NaN\n",
       "economy-plus     50.0\n",
       "economy          30.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 166,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "first           100.0\n",
       "business         75.0\n",
       "economy-plus     50.0\n",
       "economy          30.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 167,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ser.interpolate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-163-106f2287918c>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minterpolate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'spline'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32mc:\\users\\marcial\\anaconda3\\envs\\ml_master\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36minterpolate\u001b[1;34m(self, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)\u001b[0m\n\u001b[0;32m   6992\u001b[0m             \u001b[1;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mmethods\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mis_numeric_or_datetime\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   6993\u001b[0m                 raise ValueError(\n\u001b[1;32m-> 6994\u001b[1;33m                     \u001b[1;34m\"Index column must be numeric or datetime type when \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m   6995\u001b[0m                     \u001b[1;34mf\"using {method} method other than linear. \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   6996\u001b[0m                     \u001b[1;34m\"Try setting a numeric or datetime index column before \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mValueError\u001b[0m: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating."
     ]
    }
   ],
   "source": [
    "ser.interpolate(method='spline')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(ser,columns=['Price'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>first</th>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>business</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>economy-plus</th>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>economy</th>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Price\n",
       "first         100.0\n",
       "business        NaN\n",
       "economy-plus   50.0\n",
       "economy        30.0"
      ]
     },
     "execution_count": 170,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 171,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>first</th>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>business</th>\n",
       "      <td>75.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>economy-plus</th>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>economy</th>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Price\n",
       "first         100.0\n",
       "business       75.0\n",
       "economy-plus   50.0\n",
       "economy        30.0"
      ]
     },
     "execution_count": 171,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.interpolate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 174,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>first</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>business</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>economy-plus</td>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>economy</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          index  Price\n",
       "0         first  100.0\n",
       "1      business    NaN\n",
       "2  economy-plus   50.0\n",
       "3       economy   30.0"
      ]
     },
     "execution_count": 175,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>first</td>\n",
       "      <td>100.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>business</td>\n",
       "      <td>73.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>economy-plus</td>\n",
       "      <td>50.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>economy</td>\n",
       "      <td>30.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          index       Price\n",
       "0         first  100.000000\n",
       "1      business   73.333333\n",
       "2  economy-plus   50.000000\n",
       "3       economy   30.000000"
      ]
     },
     "execution_count": 178,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.interpolate(method='spline',order=2)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}