Useful Methods¶
Let's cover some useful methods and functions built into pandas. This is just a small sampling of the functions and methods available in pandas, but they are some of the most commonly used. The documentation is a great resource for continuing to explore more methods and functions (we will introduce more further along in the course). Here is a list of the functions and methods we'll cover here (click one to jump to that section of the notebook):
- apply() method
- apply() with a function
- apply() with a lambda expression
- apply() on multiple columns
- describe()
- sort_values()
- corr()
- idxmin and idxmax
- value_counts
- replace
- unique and nunique
- map
- duplicated and drop_duplicates
- between
- sample
- nlargest
Make sure to view the video lessons to get the full explanation!
The .apply() method¶
Here we will learn about a very useful method known as apply on a DataFrame. It allows us to apply and broadcast custom functions across a DataFrame column.
import pandas as pd
import numpy as np
df = pd.read_csv('tips.csv')
df.head()
apply with a function¶
df.info()
def last_four(num):
    return str(num)[-4:]
df['CC Number'][0]
last_four(3560325168603410)
df['last_four'] = df['CC Number'].apply(last_four)
df.head()
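Since tips.csv may not be on hand outside the course environment, here is a minimal self-contained sketch of the same pattern on a toy DataFrame (the card numbers below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the tips.csv 'CC Number' column (hypothetical values)
toy = pd.DataFrame({'CC Number': [3560325168603410, 6011181790317731]})

def last_four(num):
    # Cast to string, then slice off the final four characters
    return str(num)[-4:]

toy['last_four'] = toy['CC Number'].apply(last_four)
```

Note that apply calls the function once per element, so it returns a new Series of the same length.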
Using .apply() with more complex functions¶
df['total_bill'].mean()
def yelp(price):
    if price < 10:
        return '$'
    elif price >= 10 and price < 30:
        return '$$'
    else:
        return '$$$'
df['Expensive'] = df['total_bill'].apply(yelp)
# df
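The same binning logic in a self-contained sketch, with toy bill amounts chosen to hit each branch once:

```python
import pandas as pd

bills = pd.Series([5.0, 15.0, 45.0])  # one value per price tier

def yelp(price):
    if price < 10:
        return '$'
    elif price < 30:   # price >= 10 is already implied by the elif
        return '$$'
    else:
        return '$$$'

labels = bills.apply(yelp)
```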
apply with lambda¶
def simple(num):
    return num*2
lambda num: num*2
df['total_bill'].apply(lambda bill:bill*0.18)
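The same lambda one-liner on a toy Series, computing an 18% tip for each bill:

```python
import pandas as pd

bills = pd.Series([10.0, 20.0, 50.0])
# Apply the anonymous function to every element
tips = bills.apply(lambda bill: bill * 0.18)
```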
apply that uses multiple columns¶
Note, there are several ways to do this:
df.head()
def quality(total_bill,tip):
    if tip/total_bill > 0.25:
        return "Generous"
    else:
        return "Other"
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)
df.head()
import numpy as np
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
df.head()
So, which one is faster?
import timeit
# code snippet to be executed only once
setup = '''
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill,tip):
    if tip/total_bill > 0.25:
        return "Generous"
    else:
        return "Other"
'''
# code snippet whose execution time is to be measured
stmt_one = '''
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)
'''
stmt_two = '''
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''
timeit.timeit(setup = setup,
stmt = stmt_one,
number = 1000)
timeit.timeit(setup = setup,
stmt = stmt_two,
number = 1000)
Wow! Vectorization is much faster! Keep np.vectorize() in mind for the future.
Full Details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
df.describe for statistical summaries¶
df.describe()
df.describe().transpose()
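A quick sketch of describe() on a toy Series, showing the summary statistics it returns:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# Returns count, mean, std, min, the quartiles, and max as a labeled Series
summary = s.describe()
```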
sort_values()¶
df.sort_values('tip')
# Helpful if you want to reorder after a sort
# https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
df.sort_values(['tip','size'])
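A self-contained sketch of sort_values on a toy DataFrame; the ascending parameter (True by default) flips the order:

```python
import pandas as pd

toy = pd.DataFrame({'tip': [3.0, 1.0, 2.0], 'size': [2, 4, 2]})
# Largest tip first
by_tip = toy.sort_values('tip', ascending=False)
```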
corr()¶
df.corr()
# note: on newer pandas versions you may need df.corr(numeric_only=True), since the dataset contains text columns
df[['total_bill','tip']].corr()
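A minimal sketch of corr() on two perfectly correlated toy columns (Pearson correlation is the default method):

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
# y = 2x, so the Pearson correlation should be exactly 1
c = d.corr()
```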
idxmin and idxmax¶
df.head()
df['total_bill'].max()
df['total_bill'].idxmax()
df['total_bill'].idxmin()
df.iloc[67]
df.iloc[170]
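The difference between max() and idxmax() in a toy sketch: one returns the value, the other returns the index label of the row holding that value:

```python
import pandas as pd

s = pd.Series([10.75, 50.81, 3.07])
highest = s.idxmax()  # index label of the maximum, not the maximum itself
lowest = s.idxmin()
```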
value_counts¶
Nice method to quickly get a count per category. Only makes sense on categorical columns.
df.head()
df['sex'].value_counts()
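A self-contained sketch of value_counts() on a toy categorical Series:

```python
import pandas as pd

s = pd.Series(['Male', 'Female', 'Male'])
# One row per distinct value, sorted by count descending
counts = s.value_counts()
```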
replace¶
df.head()
df['Tip Quality'].replace(to_replace='Other',value='Ok')
df['Tip Quality'] = df['Tip Quality'].replace(to_replace='Other',value='Ok')
df.head()
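The same swap in a self-contained sketch; note that replace returns a new Series, so you must reassign it (as done above) to keep the change:

```python
import pandas as pd

s = pd.Series(['Other', 'Generous', 'Other'])
# Every matching value is swapped; non-matching values pass through unchanged
replaced = s.replace(to_replace='Other', value='Ok')
```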
unique and nunique¶
df['size'].unique()
df['size'].nunique()
df['time'].unique()
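The pair in one toy sketch: unique() returns the distinct values themselves, nunique() just counts them:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])
distinct = s.unique()    # array of distinct values, in order of appearance
how_many = s.nunique()   # just the count of distinct values
```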
map¶
my_map = {'Dinner':'D','Lunch':'L'}
df['time'].map(my_map)
df.head()
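The same mapping in a self-contained sketch; unlike replace, map converts values missing from the dictionary to NaN, so make sure every category is covered:

```python
import pandas as pd

s = pd.Series(['Dinner', 'Lunch', 'Dinner'])
# Each value is looked up in the dict; missing keys would become NaN
mapped = s.map({'Dinner': 'D', 'Lunch': 'L'})
```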
duplicated and drop_duplicates¶
# Returns True for rows that duplicate an earlier row; the first occurrence is marked False
df.duplicated()
simple_df = pd.DataFrame([1,2,2],['a','b','c'])
simple_df
simple_df.duplicated()
simple_df.drop_duplicates()
between¶
- left: a scalar value that defines the left boundary
- right: a scalar value that defines the right boundary
- inclusive: whether the boundaries are included. In older pandas this is a Boolean (True by default); from pandas 1.3 it is a string such as 'both' (the default) or 'neither'.
df['total_bill'].between(10,20,inclusive=True)
df[df['total_bill'].between(10,20,inclusive=True)]
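A self-contained sketch of between on a toy Series; omitting the inclusive argument keeps both endpoints included on any pandas version:

```python
import pandas as pd

s = pd.Series([5, 10, 15, 25])
# Both endpoints are included by default
mask = s.between(10, 20)
```

The Boolean mask is then typically used to filter, as in the DataFrame example above.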
sample¶
df.sample(5)
df.sample(frac=0.1)
nlargest and nsmallest¶
df.nlargest(10,'tip')
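Both halves of the pair in one toy sketch; nlargest returns the top rows sorted descending by the given column, nsmallest the bottom rows sorted ascending:

```python
import pandas as pd

toy = pd.DataFrame({'tip': [1.0, 5.0, 3.0, 2.0]})
top_two = toy.nlargest(2, 'tip')      # two biggest tips, descending
bottom_two = toy.nsmallest(2, 'tip')  # two smallest tips, ascending
```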