DataFrames¶
Throughout the course, most of our data exploration will be done with DataFrames. DataFrames are an extremely powerful tool and a natural extension of the Pandas Series. The formal definition is short:
A Pandas DataFrame consists of multiple Pandas Series that share index values.
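To make that definition concrete, here is a minimal sketch that builds a DataFrame out of two Series with the same index. The names `jan`, `feb`, and `sales` and the numbers are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
jan = pd.Series([10, 20, 30], index=['CA', 'NY', 'AZ'], name='Jan')
feb = pd.Series([15, 25, 35], index=['CA', 'NY', 'AZ'], name='Feb')

# Placing them side by side gives a DataFrame aligned on the shared index
sales = pd.concat([jan, feb], axis=1)
print(sales)

# Pulling a column back out returns a Series with that same index
print(type(sales['Jan']))
print(sales.index.equals(jan.index))
```

Each column of the result is one of the original Series, which is why column selection later in this notebook returns a Series.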
Imports¶
import numpy as np
import pandas as pd
Creating a DataFrame from Python Objects¶
# help(pd.DataFrame)
# Make sure the seed is in the same cell as the random call
# https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do
np.random.seed(101)
mydata = np.random.randint(0,101,(4,3))
mydata
myindex = ['CA','NY','AZ','TX']
mycolumns = ['Jan','Feb','Mar']
df = pd.DataFrame(data=mydata)
df
df = pd.DataFrame(data=mydata,index=myindex)
df
df = pd.DataFrame(data=mydata,index=myindex,columns=mycolumns)
df
df.info()
CSV¶
Comma Separated Values files are text files that use commas as field delimiters.
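As a small illustration of the format, the sketch below parses CSV text held in a string rather than a file on disk. The content of `csv_text` is invented for the example; `io.StringIO` just lets `pd.read_csv` treat the string like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line holds the column names,
# every following line is one record with comma-separated fields
csv_text = """state,Jan,Feb
CA,95,11
NY,81,70
"""

# read_csv accepts a file path or a file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
```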
Unless you're running the virtual environment included with the course, you may need to install xlrd and openpyxl. pandas uses these libraries to read Excel files; reading a plain CSV file does not require them.
In your terminal/command prompt run:
conda install xlrd
conda install openpyxl
Then restart Jupyter Notebook (or use pip install if you aren't using the Anaconda Distribution).
Understanding File Paths¶
You have two options when reading a file with pandas:
If your .py file or .ipynb notebook is located in the exact same folder location as the .csv file you want to read, simply pass in the file name as a string, for example:
df = pd.read_csv('some_file.csv')
Pass in the entire file path if you are located in a different directory. The file path must be 100% correct in order for this to work. For example:
df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")
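If typing full paths by hand feels error-prone, one possible approach is to build them with `pathlib`. The sketch below writes a throwaway CSV into a temporary folder so it can run anywhere; the folder, file name, and contents are assumptions for the demo, not part of the course files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a throwaway folder and CSV file so the example is self-contained
folder = Path(tempfile.mkdtemp())
csv_path = folder / 'some_file.csv'
csv_path.write_text('a,b\n1,2\n3,4\n')

# pathlib joins the parts with the right separator for the operating system,
# and read_csv accepts the resulting Path object directly
demo = pd.read_csv(csv_path)
print(csv_path.exists(), demo.shape)

# On Windows, a raw string is another way to avoid doubling every backslash:
# pd.read_csv(r"C:\Users\myself\files\some_file.csv")
```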
Print your current directory file path with pwd¶
pwd
List the files in your current directory with ls¶
ls
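The `pwd` and `ls` commands above are notebook conveniences. In a plain .py script, a rough equivalent using only the standard library might look like this:

```python
import os
from pathlib import Path

# Equivalent of pwd: the current working directory
current_dir = os.getcwd()
print(current_dir)

# Equivalent of ls: names of the entries in that directory
entries = sorted(os.listdir(current_dir))
print(entries)

# Only the CSV files, which is usually what you are looking for
csv_files = sorted(p.name for p in Path(current_dir).glob('*.csv'))
print(csv_files)
```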
df = pd.read_csv('tips.csv')
df
About this DataSet (in case you are interested)
Description
- One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables, listed under Details.
Format
- A data frame with 244 rows and 7 variables
Details
- tip in dollars,
- bill in dollars,
- sex of the bill payer,
- whether there were smokers in the party,
- day of the week,
- time of day,
- size of the party.
In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).
References
- Bryant, P. G. and Smith, M. (1995). Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.
Note: We created some additional columns with fake data, including Name, CC Number, and Payment ID.
DataFrames¶
Obtaining Basic Information About DataFrame¶
df.columns
df.index
df.head(3)
df.tail(3)
df.info()
len(df)
df.describe()
df.describe().transpose()
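One detail worth knowing about `describe()`: by default it summarizes only the numeric columns. The sketch below uses a tiny made-up frame (`small` is not the tips data) to show this, along with the `include='all'` option.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one text column
small = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'day': ['Sun', 'Sun', 'Sat'],
})

# Default: only numeric columns are summarized
numeric_summary = small.describe()
print(numeric_summary)

# Transposing puts one row per column, which is easier to read for wide data
transposed = numeric_summary.transpose()
print(transposed)

# include='all' adds the non-numeric columns to the summary as well
full_summary = small.describe(include='all')
print(full_summary)
```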
Selection and Indexing¶
Let's learn how to retrieve information from a DataFrame.
COLUMNS¶
We will begin by learning how to extract information based on the columns.
df.head()
Grab a Single Column¶
df['total_bill']
type(df['total_bill'])
Grab Multiple Columns¶
# Note how it's a Python list of column names! Thus the double brackets.
df[['total_bill','tip']]
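The bracket style determines the type you get back. Here is a short sketch on a made-up frame (`small` and its values are illustrative only):

```python
import pandas as pd

small = pd.DataFrame({
    'total_bill': [16.99, 10.34],
    'tip': [1.01, 1.66],
    'size': [2, 3],
})

# Single brackets with one name: a Series
one_col = small['total_bill']

# Double brackets (a list of names): always a DataFrame,
# even when the list holds just one name
one_col_frame = small[['total_bill']]
two_cols = small[['total_bill', 'tip']]

print(type(one_col).__name__, type(one_col_frame).__name__, two_cols.shape)
```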
Create New Columns¶
df['tip_percentage'] = 100* df['tip'] / df['total_bill']
df.head()
df['price_per_person'] = df['total_bill'] / df['size']
df.head()
help(np.round)
Adjust Existing Columns¶
# Because pandas is based on numpy, we get awesome capabilities with numpy's universal functions!
df['price_per_person'] = np.round(df['price_per_person'],2)
df.head()
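Column arithmetic and rounding both work element by element, with no explicit loop. A minimal sketch on invented numbers (chosen so the results are easy to check by hand):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'total_bill': [20.0, 10.0], 'size': [3, 4]})

# Division is applied row by row: 20/3 and 10/4
small['price_per_person'] = small['total_bill'] / small['size']

# np.round on a Series returns a Series rounded to 2 decimals
rounded_np = np.round(small['price_per_person'], 2)

# The Series.round method is an equivalent spelling
rounded_method = small['price_per_person'].round(2)

print(rounded_np.tolist())
```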
Remove Columns¶
# df.drop('tip_percentage',axis=1)
df = df.drop("tip_percentage",axis=1)
df.head()
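The reason for the reassignment above is that `drop` returns a new DataFrame and leaves the original untouched. A small sketch on a made-up frame, which also shows the `columns=` keyword as an alternative to `axis=1`:

```python
import pandas as pd

small = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})

# drop returns a new DataFrame; small still has the 'tip' column afterwards
dropped = small.drop('tip', axis=1)
print('tip' in small.columns, 'tip' in dropped.columns)

# columns= says the same thing without needing to remember the axis number
same = small.drop(columns='tip')
print(dropped.equals(same))
```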
Index Basics¶
Before going over the same retrieval tasks for rows, let's build some basic understanding of the pandas DataFrame Index.
df.head()
df.index
# set_index returns a new DataFrame; df itself is unchanged until we reassign it
df.set_index('Payment ID')
df.head()
df = df.set_index('Payment ID')
df.head()
df = df.reset_index()
df.head()
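The round trip between a column and the index can be checked on a tiny frame. In this sketch, `Sun2959` is taken from the data used in this notebook; `Sun4608` and the tip values are made up.

```python
import pandas as pd

small = pd.DataFrame({
    'Payment ID': ['Sun2959', 'Sun4608'],
    'tip': [1.01, 1.66],
})

# Moves 'Payment ID' out of the columns and into the index (as a new frame)
indexed = small.set_index('Payment ID')
print(indexed.index.name, list(indexed.columns))

# reset_index moves it back to a regular column and restores 0, 1, ...
restored = indexed.reset_index()
print(restored.equals(small))
```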
ROWS¶
Let's now explore these same concepts but with Rows.
df.head()
df = df.set_index('Payment ID')
df.head()
Grab a Single Row¶
# Integer Based
df.iloc[0]
# Name Based
df.loc['Sun2959']
Grab Multiple Rows¶
df.iloc[0:4]
df.loc[['Sun2959','Sun5260']]
Remove Row¶
Typically our datasets will be large enough that we won't remove rows like this, since we won't know their row location for some specific condition. Instead, we drop rows based on conditions such as missing data or column values. The next lecture will cover this in a lot more detail.
df.head()
df.drop('Sun2959',axis=0).head()
# Raises a KeyError here: drop() matches index labels, and 0 is not a label
# once 'Payment ID' is the index
# df.drop(0,axis=0).head()
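As a brief preview of removing rows by condition rather than by label, here is one possible sketch. The frame and the `size > 2` rule are invented for the example.

```python
import pandas as pd

small = pd.DataFrame(
    {'total_bill': [16.99, 10.34, 21.01], 'size': [2, 3, 3]},
    index=['Sun2959', 'Sun4608', 'Sun4458'],
)

# Keep only the rows where the condition is True
kept = small[small['size'] > 2]

# The same result expressed with drop: look up the labels of the
# unwanted rows first, then drop those labels
unwanted_labels = small[small['size'] <= 2].index
same = small.drop(unwanted_labels, axis=0)

print(list(kept.index), kept.equals(same))
```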
Insert a New Row¶
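It is pretty rare to add a single row like this. Usually you use pd.concat() to add many rows at once. Older code used the DataFrame .append() method with pd.Series() objects, but .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is used here instead. Either way, you won't see us do this with realistic real-world data.
one_row = df.iloc[0]
one_row
type(one_row)
df.tail()
# one_row is a Series, so turn it into a one-row DataFrame before concatenating
pd.concat([df, one_row.to_frame().T]).tail()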
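For the more common case of adding several rows at once, a minimal sketch with `pd.concat()` might look like this. Both frames are made up; the only requirement is that they share column names.

```python
import pandas as pd

existing = pd.DataFrame(
    {'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]},
    index=['Sun2959', 'Sun4608'],
)

# Hypothetical new records with the same columns
new_rows = pd.DataFrame(
    {'total_bill': [21.01, 23.68], 'tip': [3.50, 3.31]},
    index=['Sun4458', 'Sun5260'],
)

# concat stacks the frames vertically and returns a new DataFrame
combined = pd.concat([existing, new_rows])
print(combined)
```

Neither input is modified, so assign the result to a name if you want to keep it.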