Text Methods¶
A regular Python string has a variety of built-in methods available:
In [2]:
mystring = 'hello'
In [3]:
mystring.capitalize()
Out[3]:
In [4]:
mystring.isdigit()
Out[4]:
In [5]:
help(str)
Pandas and Text¶
Pandas can do much more with text than we show here. The full online documentation, covering topics such as advanced string indexing and regular expressions, is available at: https://pandas.pydata.org/docs/user_guide/text.html
Text Methods on Pandas String Column¶
In [6]:
import pandas as pd
In [7]:
names = pd.Series(['andrew','bobo','claire','david','4'])
In [8]:
names
Out[8]:
In [9]:
names.str.capitalize()
Out[9]:
In [10]:
names.str.isdigit()
Out[10]:
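The boolean Series returned by names.str.isdigit() can also be used as a mask to filter rows. The following is a minimal sketch that rebuilds the names Series from above; the variable names is_numeric and alpha_names are illustrative and not part of the original notebook.

```python
import pandas as pd

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])

# Boolean Series: True where every character of the string is a digit
is_numeric = names.str.isdigit()

# Keep only the entries that are not purely digits, then capitalize them
alpha_names = names[~is_numeric].str.capitalize()

print(alpha_names.tolist())  # ['Andrew', 'Bobo', 'Claire', 'David']
```

Because every .str method returns a new Series, the mask and the capitalization can be combined without modifying names itself.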
Splitting, Grabbing, and Expanding¶
In [14]:
tech_finance = ['GOOG,AAPL,AMZN','JPM,BAC,GS']
In [15]:
len(tech_finance)
Out[15]:
In [16]:
tickers = pd.Series(tech_finance)
In [17]:
tickers
Out[17]:
In [18]:
tickers.str.split(',')
Out[18]:
In [19]:
tickers.str.split(',').str[0]
Out[19]:
In [21]:
tickers.str.split(',',expand=True)
Out[21]:
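The three steps in this section (split, grab, expand) can be seen side by side in the sketch below. It uses its own small sample Series, and the column labels and the use of the n parameter are illustrative additions rather than something the original notebook does.

```python
import pandas as pd

# Illustrative sample data: comma-separated ticker symbols per row
sample = pd.Series(['GOOG,AMZN,MSFT', 'JPM,BAC,GS'])

# Splitting: str.split returns one list per row
pieces = sample.str.split(',')

# Grabbing: .str[i] indexes into each row's list
first = pieces.str[0]

# Expanding: expand=True spreads the pieces across DataFrame columns
expanded = sample.str.split(',', expand=True)
expanded.columns = ['first', 'second', 'third']  # hypothetical labels

# n=1 limits the number of splits, so each row yields at most two pieces
head_rest = sample.str.split(',', n=1, expand=True)

print(first.tolist())         # ['GOOG', 'JPM']
print(expanded.shape)         # (2, 3)
print(head_rest[1].tolist())  # ['AMZN,MSFT', 'BAC,GS']
```

Expanding into a DataFrame is usually the more convenient form when each piece should end up as its own column.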
Cleaning or Editing Strings¶
In [22]:
messy_names = pd.Series(["andrew ","bo;bo"," claire "])
In [27]:
# Notice the "mis-alignment" on the right hand side due to spacing in "andrew " and " claire "
messy_names
Out[27]:
In [28]:
messy_names.str.replace(";","")
Out[28]:
In [29]:
messy_names.str.strip()
Out[29]:
In [31]:
messy_names.str.replace(";","").str.strip()
Out[31]:
In [32]:
messy_names.str.replace(";","").str.strip().str.capitalize()
Out[32]:
Alternative with Custom apply() call¶
In [33]:
def cleanup(name):
    # Remove stray semicolons, trim surrounding whitespace, then capitalize
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name
In [34]:
messy_names
Out[34]:
In [35]:
messy_names.apply(cleanup)
Out[35]:
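Before comparing speed, it is worth confirming that the chained .str calls and the apply() version produce the same values. The sketch below is one way to check this. It passes regex=False explicitly, which is an assumption added here so that ";" is treated as a literal string regardless of the installed pandas version.

```python
import pandas as pd

messy_names = pd.Series(["andrew ", "bo;bo", " claire "])

def cleanup(name):
    # Same three steps as the chained .str version, applied to one string
    return name.replace(";", "").strip().capitalize()

# Vectorized string methods, chained
chained = messy_names.str.replace(";", "", regex=False).str.strip().str.capitalize()

# Element-wise Python function
applied = messy_names.apply(cleanup)

print(chained.tolist())                      # ['Andrew', 'Bobo', 'Claire']
print(chained.tolist() == applied.tolist())  # True
```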
Which one is more efficient?¶
In [43]:
import timeit
# code snippet to be executed only once
setup = '''
import pandas as pd
import numpy as np
messy_names = pd.Series(["andrew ","bo;bo"," claire "])
def cleanup(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name
'''
# code snippet whose execution time is to be measured
stmt_pandas_str = '''
messy_names.str.replace(";","").str.strip().str.capitalize()
'''
stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''
stmt_pandas_vectorize='''
np.vectorize(cleanup)(messy_names)
'''
In [44]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_str,
              number=10000)
Out[44]:
In [45]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_apply,
              number=10000)
Out[45]:
In [46]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_vectorize,
              number=10000)
Out[46]:
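Wow! While the .str methods can be extremely convenient, when it comes to performance, don't forget about np.vectorize()! Keep in mind that this comparison uses a Series of only three strings, where fixed per-call overhead dominates. NumPy documents np.vectorize() as essentially a for loop provided for convenience, so the ranking may change on larger data, and it returns a NumPy array rather than a Series. Review the "Useful Methods" lecture for a deeper discussion of np.vectorize().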
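Since the timings above come from a three-element Series, one way to see how the approaches scale is to repeat the measurement at several sizes. The following is a rough sketch only: the helper time_all, the chosen sizes, and the repeat count are all hypothetical, and the numbers it prints will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

def cleanup(name):
    """Remove semicolons, trim whitespace, and capitalize one string."""
    return name.replace(";", "").strip().capitalize()

base = ["andrew ", "bo;bo", " claire "]

def time_all(n_rows, repeats=20):
    """Return total seconds per approach on a Series of about n_rows strings."""
    s = pd.Series(base * (n_rows // len(base)))
    candidates = {
        "str": lambda: s.str.replace(";", "", regex=False).str.strip().str.capitalize(),
        "apply": lambda: s.apply(cleanup),
        "vectorize": lambda: np.vectorize(cleanup)(s),
    }
    return {label: timeit.timeit(fn, number=repeats) for label, fn in candidates.items()}

# Compare a tiny Series with a larger one; results depend on the machine
for n_rows in (3, 3_000):
    print(n_rows, time_all(n_rows))

# np.vectorize hands back a NumPy array, not a pandas Series
result = np.vectorize(cleanup)(pd.Series(base))
print(type(result).__name__)  # ndarray
```

If the result needs to stay aligned with the original index, wrapping it as pd.Series(result, index=...) is one option.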