{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "\n", "\n", "___\n", "
Copyright by Pierian Data Inc.
\n", "
For more information, visit us at www.pieriandata.com
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing Data\n", "\n", "Make sure to review the video for a full discussion of strategies for dealing with missing data.\n", "\n", "--------\n", "\n", "\n", "## What Null/NA/nan objects look like:\n", "\n", "Source: https://github.com/pandas-dev/pandas/issues/28095\n", "\n", "A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type." ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<NA>" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.NA" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NaT" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.NaT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "------\n", "## Note! 
Typical comparisons should be avoided with Missing Values\n", "\n", "* https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b\n", "* https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true\n", "\n", "The reasoning is that a missing value is unknown, so we can't know whether two missing values are equal to each other. Note that `np.nan in [np.nan]` and `np.nan is np.nan` return `True` only because they check object identity, not equality." ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan == np.nan" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan in [np.nan]" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan is np.nan" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<NA>" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.NA == pd.NA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "People were asked to score their opinions of actors on a scale of 1-10 before and after watching one of their movies. However, some data is missing." ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('movie_scores.csv')" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>1</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "1 NaN NaN NaN NaN NaN NaN\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking and Selecting for Null Values" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>1</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "1 NaN NaN NaN NaN NaN NaN\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
" <tr><th>1</th><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td></tr>\n",
" <tr><th>2</th><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td><td>True</td></tr>\n",
" <tr><th>3</th><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
" <tr><th>4</th><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 False False False False False False\n", "1 True True True True True True\n", "2 False False False False True True\n", "3 False False False False False False\n", "4 False False False False False False" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull()" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td></tr>\n",
" <tr><th>1</th><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
" <tr><th>2</th><td>True</td><td>True</td><td>True</td><td>True</td><td>False</td><td>False</td></tr>\n",
" <tr><th>3</th><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td></tr>\n",
" <tr><th>4</th><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td><td>True</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 True True True True True True\n", "1 False False False False False False\n", "2 True True True True False False\n", "3 True True True True True True\n", "4 True True True True True True" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.notnull()" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Tom\n", "1 NaN\n", "2 Hugh\n", "3 Oprah\n", "4 Emma\n", "Name: first_name, dtype: object" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['first_name']" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['first_name'].notnull()]" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
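<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>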
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "2 Hugh Jackman 51.0 m NaN NaN" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df['pre_movie_score'].isnull()) & df['sex'].notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop Data" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>1</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "1 NaN NaN NaN NaN NaN NaN\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method dropna in module pandas.core.frame:\n", "\n", "dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance\n", " Remove missing values.\n", " \n", " See the :ref:`User Guide ` for more on which values are\n", " considered missing, and how to work with missing data.\n", " \n", " Parameters\n", " ----------\n", " axis : {0 or 'index', 1 or 'columns'}, default 0\n", " Determine if rows or columns which contain missing values are\n", " removed.\n", " \n", " * 0, or 'index' : Drop rows which contain missing values.\n", " * 1, or 'columns' : Drop columns which contain missing value.\n", " \n", " .. versionchanged:: 1.0.0\n", " \n", " Pass tuple or list to drop on multiple axes.\n", " Only a single axis is allowed.\n", " \n", " how : {'any', 'all'}, default 'any'\n", " Determine if row or column is removed from DataFrame, when we have\n", " at least one NA or all NA.\n", " \n", " * 'any' : If any NA values are present, drop that row or column.\n", " * 'all' : If all values are NA, drop that row or column.\n", " \n", " thresh : int, optional\n", " Require that many non-NA values.\n", " subset : array-like, optional\n", " Labels along other axis to consider, e.g. 
if you are dropping rows\n", " these would be a list of columns to include.\n", " inplace : bool, default False\n", " If True, do operation inplace and return None.\n", " \n", " Returns\n", " -------\n", " DataFrame\n", " DataFrame with NA entries dropped from it.\n", " \n", " See Also\n", " --------\n", " DataFrame.isna: Indicate missing values.\n", " DataFrame.notna : Indicate existing (non-missing) values.\n", " DataFrame.fillna : Replace missing values.\n", " Series.dropna : Drop missing values.\n", " Index.dropna : Drop missing indices.\n", " \n", " Examples\n", " --------\n", " >>> df = pd.DataFrame({\"name\": ['Alfred', 'Batman', 'Catwoman'],\n", " ... \"toy\": [np.nan, 'Batmobile', 'Bullwhip'],\n", " ... \"born\": [pd.NaT, pd.Timestamp(\"1940-04-25\"),\n", " ... pd.NaT]})\n", " >>> df\n", " name toy born\n", " 0 Alfred NaN NaT\n", " 1 Batman Batmobile 1940-04-25\n", " 2 Catwoman Bullwhip NaT\n", " \n", " Drop the rows where at least one element is missing.\n", " \n", " >>> df.dropna()\n", " name toy born\n", " 1 Batman Batmobile 1940-04-25\n", " \n", " Drop the columns where at least one element is missing.\n", " \n", " >>> df.dropna(axis='columns')\n", " name\n", " 0 Alfred\n", " 1 Batman\n", " 2 Catwoman\n", " \n", " Drop the rows where all elements are missing.\n", " \n", " >>> df.dropna(how='all')\n", " name toy born\n", " 0 Alfred NaN NaT\n", " 1 Batman Batmobile 1940-04-25\n", " 2 Catwoman Bullwhip NaT\n", " \n", " Keep only the rows with at least 2 non-NA values.\n", " \n", " >>> df.dropna(thresh=2)\n", " name toy born\n", " 1 Batman Batmobile 1940-04-25\n", " 2 Catwoman Bullwhip NaT\n", " \n", " Define in which columns to look for missing values.\n", " \n", " >>> df.dropna(subset=['name', 'born'])\n", " name toy born\n", " 1 Batman Batmobile 1940-04-25\n", " \n", " Keep the DataFrame with valid entries in the same variable.\n", " \n", " >>> df.dropna(inplace=True)\n", " >>> df\n", " name toy born\n", " 1 Batman Batmobile 1940-04-25\n", "\n" ] } ], 
"source": [ "help(df.dropna)" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(thresh=1)" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
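<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th></tr>\n",
" <tr><th>1</th></tr>\n",
" <tr><th>2</th></tr>\n",
" <tr><th>3</th></tr>\n",
" <tr><th>4</th></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>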
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [0, 1, 2, 3, 4]" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(axis=1)" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td></tr>\n",
" <tr><th>1</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex\n", "0 Tom Hanks 63.0 m\n", "1 NaN NaN NaN NaN\n", "2 Hugh Jackman 51.0 m\n", "3 Oprah Winfrey 66.0 f\n", "4 Emma Stone 31.0 f" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(thresh=4,axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fill Data" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>1</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "1 NaN NaN NaN NaN NaN NaN\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63</td><td>m</td><td>8</td><td>10</td></tr>\n",
" <tr><th>1</th><td>NEW VALUE!</td><td>NEW VALUE!</td><td>NEW VALUE!</td><td>NEW VALUE!</td><td>NEW VALUE!</td><td>NEW VALUE!</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51</td><td>m</td><td>NEW VALUE!</td><td>NEW VALUE!</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66</td><td>f</td><td>6</td><td>8</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31</td><td>f</td><td>7</td><td>9</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score \\\n", "0 Tom Hanks 63 m 8 \n", "1 NEW VALUE! NEW VALUE! NEW VALUE! NEW VALUE! NEW VALUE! \n", "2 Hugh Jackman 51 m NEW VALUE! \n", "3 Oprah Winfrey 66 f 6 \n", "4 Emma Stone 31 f 7 \n", "\n", " post_movie_score \n", "0 10 \n", "1 NEW VALUE! \n", "2 NEW VALUE! \n", "3 8 \n", "4 9 " ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(\"NEW VALUE!\")" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Tom\n", "1 Empty\n", "2 Hugh\n", "3 Oprah\n", "4 Emma\n", "Name: first_name, dtype: object" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['first_name'].fillna(\"Empty\")" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "df['first_name'] = df['first_name'].fillna(\"Empty\")" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.0</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>1</th><td>Empty</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.0</td><td>m</td><td>NaN</td><td>NaN</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.0</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.0</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.0 m 8.0 10.0\n", "1 Empty NaN NaN NaN NaN NaN\n", "2 Hugh Jackman 51.0 m NaN NaN\n", "3 Oprah Winfrey 66.0 f 6.0 8.0\n", "4 Emma Stone 31.0 f 7.0 9.0" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.0" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['pre_movie_score'].mean()" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 8.0\n", "1 7.0\n", "2 7.0\n", "3 6.0\n", "4 7.0\n", "Name: pre_movie_score, dtype: float64" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['pre_movie_score'].fillna(df['pre_movie_score'].mean())" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>first_name</th><th>last_name</th><th>age</th><th>sex</th><th>pre_movie_score</th><th>post_movie_score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Tom</td><td>Hanks</td><td>63.00</td><td>m</td><td>8.0</td><td>10.0</td></tr>\n",
" <tr><th>1</th><td>Empty</td><td>NaN</td><td>52.75</td><td>NaN</td><td>7.0</td><td>9.0</td></tr>\n",
" <tr><th>2</th><td>Hugh</td><td>Jackman</td><td>51.00</td><td>m</td><td>7.0</td><td>9.0</td></tr>\n",
" <tr><th>3</th><td>Oprah</td><td>Winfrey</td><td>66.00</td><td>f</td><td>6.0</td><td>8.0</td></tr>\n",
" <tr><th>4</th><td>Emma</td><td>Stone</td><td>31.00</td><td>f</td><td>7.0</td><td>9.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " first_name last_name age sex pre_movie_score post_movie_score\n", "0 Tom Hanks 63.00 m 8.0 10.0\n", "1 Empty NaN 52.75 NaN 7.0 9.0\n", "2 Hugh Jackman 51.00 m 7.0 9.0\n", "3 Oprah Winfrey 66.00 f 6.0 8.0\n", "4 Emma Stone 31.00 f 7.0 9.0" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(df.mean(numeric_only=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filling with Interpolation\n", "\n", "Be careful with this technique: make sure you understand whether it is a valid choice for your data. Several interpolation methods are available; the default is linear.\n", "\n", "Full docs on this method:\n", "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "ser = pd.Series(airline_tix)" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first 100.0\n", "business NaN\n", "economy-plus 50.0\n", "economy 30.0\n", "dtype: float64" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first 100.0\n", "business 75.0\n", "economy-plus 50.0\n", "economy 30.0\n", "dtype: float64" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.interpolate()" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "Index column must be numeric or datetime type when using spline method other than linear. 
Try setting a numeric or datetime index column before interpolating.", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minterpolate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'spline'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32mc:\\users\\marcial\\anaconda3\\envs\\ml_master\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36minterpolate\u001b[1;34m(self, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)\u001b[0m\n\u001b[0;32m 6992\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mmethods\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mis_numeric_or_datetime\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6993\u001b[0m raise ValueError(\n\u001b[1;32m-> 6994\u001b[1;33m \u001b[1;34m\"Index column must be numeric or datetime type when \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 6995\u001b[0m \u001b[1;34mf\"using {method} method other than linear. \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6996\u001b[0m \u001b[1;34m\"Try setting a numeric or datetime index column before \"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mValueError\u001b[0m: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating." 
] } ], "source": [ "ser.interpolate(method='spline')" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(ser,columns=['Price'])" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
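<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>first</th><td>100.0</td></tr>\n",
" <tr><th>business</th><td>NaN</td></tr>\n",
" <tr><th>economy-plus</th><td>50.0</td></tr>\n",
" <tr><th>economy</th><td>30.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>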
" ], "text/plain": [ " Price\n", "first 100.0\n", "business NaN\n", "economy-plus 50.0\n", "economy 30.0" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
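<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>first</th><td>100.0</td></tr>\n",
" <tr><th>business</th><td>75.0</td></tr>\n",
" <tr><th>economy-plus</th><td>50.0</td></tr>\n",
" <tr><th>economy</th><td>30.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>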
" ], "text/plain": [ " Price\n", "first 100.0\n", "business 75.0\n", "economy-plus 50.0\n", "economy 30.0" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.interpolate()" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "df = df.reset_index()" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>index</th><th>Price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>first</td><td>100.0</td></tr>\n",
" <tr><th>1</th><td>business</td><td>NaN</td></tr>\n",
" <tr><th>2</th><td>economy-plus</td><td>50.0</td></tr>\n",
" <tr><th>3</th><td>economy</td><td>30.0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " index Price\n", "0 first 100.0\n", "1 business NaN\n", "2 economy-plus 50.0\n", "3 economy 30.0" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>index</th><th>Price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>first</td><td>100.000000</td></tr>\n",
" <tr><th>1</th><td>business</td><td>73.333333</td></tr>\n",
" <tr><th>2</th><td>economy-plus</td><td>50.000000</td></tr>\n",
" <tr><th>3</th><td>economy</td><td>30.000000</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " index Price\n", "0 first 100.000000\n", "1 business 73.333333\n", "2 economy-plus 50.000000\n", "3 economy 30.000000" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.interpolate(method='spline',order=2)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }