
___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Inputs and Outputs

NOTE: Typically we will either be reading csv files directly or using pandas-datareader to pull data from the web. Consider this lecture just a quick overview of what is possible with pandas (we won't be working with SQL or Excel files in this course).

Data Input and Output

This notebook is the reference code for input and output. Pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:

In [52]:
import numpy as np
import pandas as pd

Check out the references here!

This is the best online resource for how to read/write to a variety of data sources!

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html





Reading in a CSV

Comma Separated Values files are text files that use commas as field delimiters.
Unless you're running the virtual environment included with the course, you may need to install xlrd and openpyxl.
In your terminal/command prompt run:

conda install xlrd
conda install openpyxl

Then restart Jupyter Notebook (or use pip install if you aren't using the Anaconda distribution).

Understanding File Paths

You have two options when reading a file with pandas:

  1. If your .py file or .ipynb notebook is located in the same folder as the .csv file you want to read, simply pass in the file name as a string, for example:

     df = pd.read_csv('some_file.csv')
    
    
  2. Pass in the entire file path if you are located in a different directory. The file path must be 100% correct in order for this to work. For example:

     df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")
    

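Hard-coded Windows paths like the one above won't work on other operating systems. A small sketch using pathlib builds the path portably; the file name some_file.csv is made up for illustration, and the snippet creates a tiny file first so it runs end to end:

```python
from pathlib import Path
import pandas as pd

# Build an OS-independent path instead of hard-coding backslashes.
# 'some_file.csv' is a hypothetical name; substitute your own file.
csv_path = Path.cwd() / 'some_file.csv'

# Create a tiny example file so the snippet is runnable as-is.
csv_path.write_text('a,b\n1,2\n3,4\n')

df = pd.read_csv(csv_path)
print(df.shape)  # (2, 2)
```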
Print your current directory file path with pwd

In [53]:
pwd
Out[53]:
'C:\\Users\\Marcial\\Pierian-Data-Courses\\Machine-Learning-MasterClass\\03-Pandas'

List the files in your current directory with ls

In [54]:
ls
 Volume in drive C has no label.
 Volume Serial Number is 3652-BD2F

 Directory of C:\Users\Marcial\Pierian-Data-Courses\Machine-Learning-MasterClass\03-Pandas

07/04/2020  06:10 PM    <DIR>          .
07/04/2020  06:10 PM    <DIR>          ..
07/02/2020  05:40 PM    <DIR>          .ipynb_checkpoints
06/30/2020  04:51 PM           565,390 00-Series.ipynb
07/01/2020  12:48 PM           208,957 01-DataFrames.ipynb
07/01/2020  12:48 PM           194,591 02-Conditional-Filtering.ipynb
07/02/2020  07:02 PM           196,047 03-Useful-Methods.ipynb
07/01/2020  03:32 PM            64,227 04-Missing-Data.ipynb
07/04/2020  01:28 PM           219,627 05-Groupby-Operations-and-MultiIndex.ipynb
07/04/2020  03:19 PM            62,966 06-Combining-DataFrames.ipynb
07/02/2020  07:02 PM            29,356 07-Text-Methods.ipynb
07/02/2020  06:38 PM            35,705 08-Time-Methods.ipynb
07/04/2020  06:10 PM            53,097 09-Inputs-and-Outputs.ipynb
07/02/2020  05:34 PM             1,095 10-Pivot-Tables.ipynb
07/02/2020  05:34 PM               951 11-Pandas-Project-Exercise.ipynb
07/02/2020  05:34 PM             1,118 12-Pandas-Project-Exercise-Solution.ipynb
07/04/2020  05:39 PM                51 example.csv
07/04/2020  06:02 PM             5,022 example.xlsx
02/07/2020  12:26 PM               177 movie_scores.csv
07/01/2020  03:56 PM            17,727 mpg.csv
07/04/2020  05:58 PM             5,022 my_excel_file.xlsx
07/04/2020  05:56 PM                51 new_file.csv
07/02/2020  05:56 PM             5,459 RetailSales_BeerWineLiquor.csv
07/04/2020  05:56 PM               555 simple.html
01/27/2020  02:28 PM            18,752 tips.csv
              22 File(s)      1,685,943 bytes
               3 Dir(s)  82,818,367,488 bytes free

NOTE: Common confusion point! All read input methods are called directly from pandas with pd.read_ , while all output methods are called off the DataFrame itself with df.to_


CSV Input

In [55]:
df = pd.read_csv('example.csv')
In [56]:
df
Out[56]:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
In [57]:
df = pd.read_csv('example.csv',index_col=0)
In [58]:
df
Out[58]:
     b   c   d
a
0    1   2   3
4    5   6   7
8    9  10  11
12  13  14  15
In [59]:
df = pd.read_csv('example.csv')
In [60]:
df
Out[60]:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

CSV Output

Set index=False if you do not want to save the index; otherwise a new column will be added to the .csv file containing your index, and it will be called "Unnamed: 0" on read-back if your index did not have a name. If you do want to save your index, simply set index=True (the default value).

In [61]:
df.to_csv('new_file.csv',index=False)
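A minimal sketch of the pitfall described above, using a made-up DataFrame (the file names here are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 4], 'b': [1, 5]})

# The default index=True writes the index as an unnamed first column,
# which read_csv then labels 'Unnamed: 0'.
df.to_csv('with_index.csv')
print(pd.read_csv('with_index.csv').columns.tolist())  # ['Unnamed: 0', 'a', 'b']

# index=False keeps only the data columns.
df.to_csv('no_index.csv', index=False)
print(pd.read_csv('no_index.csv').columns.tolist())  # ['a', 'b']
```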

HTML

Pandas can read tables off of HTML pages. This only works if your firewall isn't blocking pandas from accessing the internet!

Unless you're running the virtual environment included with the course, you may need to install lxml, htmllib5, and BeautifulSoup4.
In your terminal/command prompt run:

conda install lxml

or

pip install lxml

Then restart Jupyter Notebook (you may need to restart your computer). Use pip install if you aren't using the Anaconda distribution.

read_html

HTML Input

Pandas' read_html function will read tables off of a webpage and return a list of DataFrame objects. NOTE: This only works with well-defined table elements in the page's HTML; it cannot magically read in tables that are images on a page.

In [62]:
tables = pd.read_html('https://en.wikipedia.org/wiki/World_population')
In [63]:
len(tables) #tables
Out[63]:
26

Not Useful Tables

Pandas found 26 tables on that page. Some are not useful:

In [64]:
tables[0]
Out[64]:
0 1
0 NaN An editor has expressed concern that this arti...

Tables that need formatting

Some will be misaligned, meaning you need to do extra work to fix the columns and rows:

In [65]:
tables[1]
Out[65]:
World population (millions, UN estimates)[14]
# Top ten most populous countries 2000 2015 2030[A]
0 1 China[B] 1270 1376 1416
1 2 India 1053 1311 1528
2 3 United States 283 322 356
3 4 Indonesia 212 258 295
4 5 Pakistan 136 208 245
5 6 Brazil 176 206 228
6 7 Nigeria 123 182 263
7 8 Bangladesh 131 161 186
8 9 Russia 146 146 149
9 10 Mexico 103 127 148
10 NaN World total 6127 7349 8501
11 Notes: ^ 2030 = Medium variant. ^ China exclud... Notes: ^ 2030 = Medium variant. ^ China exclud... Notes: ^ 2030 = Medium variant. ^ China exclud... Notes: ^ 2030 = Medium variant. ^ China exclud... Notes: ^ 2030 = Medium variant. ^ China exclud...
In [66]:
world_pop = tables[1]
In [67]:
world_pop.columns
Out[67]:
MultiIndex([('World population (millions, UN estimates)[14]', ...),
            ('World population (millions, UN estimates)[14]', ...),
            ('World population (millions, UN estimates)[14]', ...),
            ('World population (millions, UN estimates)[14]', ...),
            ('World population (millions, UN estimates)[14]', ...)],
           )
In [68]:
world_pop = world_pop['World population (millions, UN estimates)[14]'].drop('#',axis=1)
In [69]:
world_pop.columns
Out[69]:
Index(['Top ten most populous countries', '2000', '2015', '2030[A]'], dtype='object')
In [70]:
world_pop.columns = ['Countries', '2000', '2015', '2030 Est.']
world_pop = world_pop.drop(11,axis=0)
In [71]:
world_pop
Out[71]:
Countries 2000 2015 2030 Est.
0 China[B] 1270 1376 1416
1 India 1053 1311 1528
2 United States 283 322 356
3 Indonesia 212 258 295
4 Pakistan 136 208 245
5 Brazil 176 206 228
6 Nigeria 123 182 263
7 Bangladesh 131 161 186
8 Russia 146 146 149
9 Mexico 103 127 148
10 World total 6127 7349 8501

Tables that are intact

In [72]:
tables[6]
Out[72]:
Rank Country Population Area (km2) Density (Pop. per km2)
0 1 Singapore 5703600 710 8033
1 2 Bangladesh 168870000 143998 1173
2 3 Lebanon 6855713 10452 656
3 4 Taiwan 23604265 36193 652
4 5 South Korea 51780579 99538 520
5 6 Rwanda 12374397 26338 470
6 7 Haiti 11577779 27065 428
7 8 Netherlands 17480000 41526 421
8 9 Israel 9220000 22072 418
9 10 India 1364080000 3287240 415

Write to html Output

If you are working on a website and want to quickly output an .html file, you can use to_html:

In [73]:
df.to_html('simple.html',index=False)
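As a quick round trip, a file written by to_html can be read straight back with read_html, which also accepts a local file path. This sketch uses made-up data and file names and assumes lxml is installed:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 4], 'b': [1, 5]})
df.to_html('simple_example.html', index=False)

# read_html accepts a URL or a local file path and returns a list of DataFrames.
tables_back = pd.read_html('simple_example.html')
print(tables_back[0])
```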

read_html is not perfect, but it's quite powerful for such a simple method call!

Excel Files

Pandas can read in basic Excel files (it will raise errors if there are macros or extensive formulas relying on outside Excel files); in general, pandas can only grab the raw information from an .xlsx file.

NOTE: Requires the openpyxl and xlrd libraries! They're provided for you in our environment, or simply install with:

pip install openpyxl
pip install xlrd

Heavy excel users may want to check out this website: https://www.python-excel.org/

You can think of an Excel file as a workbook containing sheets, which for pandas means each sheet can be a DataFrame.

Excel file input with read_excel()

In [74]:
df = pd.read_excel('my_excel_file.xlsx',sheet_name='First_Sheet')
In [75]:
df
Out[75]:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

What if you don't know the sheet name? Or want to run a for loop for certain sheet names? Or want every sheet?

Several ways to do this: https://stackoverflow.com/questions/17977540/pandas-looking-up-the-list-of-sheets-in-an-excel-file

In [76]:
# Returns a list of sheet_names
pd.ExcelFile('my_excel_file.xlsx').sheet_names
Out[76]:
['First_Sheet']

Grab all sheets

In [77]:
excel_sheets = pd.read_excel('my_excel_file.xlsx',sheet_name=None)
In [78]:
type(excel_sheets)
Out[78]:
dict
In [79]:
excel_sheets.keys()
Out[79]:
dict_keys(['First_Sheet'])
In [80]:
excel_sheets['First_Sheet']
Out[80]:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
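To process every sheet, you can loop over the dictionary returned by sheet_name=None. A sketch with a made-up two-sheet workbook (the sheet names are invented for the example; requires openpyxl):

```python
import pandas as pd

# Write a small two-sheet workbook so the loop below has something to iterate.
# 'multi_sheet.xlsx' and the sheet names are made up for this example.
with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    pd.DataFrame({'x': [1, 2]}).to_excel(writer, sheet_name='Sheet_A', index=False)
    pd.DataFrame({'x': [3, 4]}).to_excel(writer, sheet_name='Sheet_B', index=False)

# sheet_name=None returns a dict of {sheet name: DataFrame}.
sheets = pd.read_excel('multi_sheet.xlsx', sheet_name=None)
for name, sheet_df in sheets.items():
    print(name, sheet_df.shape)
```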

Write to Excel File

In [81]:
df.to_excel('example.xlsx',sheet_name='First_Sheet',index=False)

SQL Connections

NOTE: We highly recommend you explore specific libraries for your specific SQL engine. Simply search for your database + Python in Google and the top results should include an API.

Let's review pandas capabilities by using SQLite, which comes built in with Python.

Example SQL Database (temporary in your RAM)

You will need to install sqlalchemy with:

pip install sqlalchemy

to follow along. To understand how to make a connection to your own database, make sure to review: https://docs.sqlalchemy.org/en/13/core/connections.html

In [82]:
from sqlalchemy import create_engine
In [83]:
temp_db = create_engine('sqlite:///:memory:')

Write to Database

In [85]:
tables[6]
Out[85]:
Rank Country Population Area (km2) Density (Pop. per km2)
0 1 Singapore 5703600 710 8033
1 2 Bangladesh 168870000 143998 1173
2 3 Lebanon 6855713 10452 656
3 4 Taiwan 23604265 36193 652
4 5 South Korea 51780579 99538 520
5 6 Rwanda 12374397 26338 470
6 7 Haiti 11577779 27065 428
7 8 Netherlands 17480000 41526 421
8 9 Israel 9220000 22072 418
9 10 India 1364080000 3287240 415
In [86]:
pop = tables[6]
In [87]:
pop.to_sql(name='populations',con=temp_db)

Read from SQL Database

In [89]:
# Read in an entire table
pd.read_sql(sql='populations',con=temp_db)
Out[89]:
index Rank Country Population Area (km2) Density (Pop. per km2)
0 0 1 Singapore 5703600 710 8033
1 1 2 Bangladesh 168870000 143998 1173
2 2 3 Lebanon 6855713 10452 656
3 3 4 Taiwan 23604265 36193 652
4 4 5 South Korea 51780579 99538 520
5 5 6 Rwanda 12374397 26338 470
6 6 7 Haiti 11577779 27065 428
7 7 8 Netherlands 17480000 41526 421
8 8 9 Israel 9220000 22072 418
9 9 10 India 1364080000 3287240 415
In [92]:
# Read in with a SQL Query
pd.read_sql_query(sql="SELECT Country FROM populations",con=temp_db)
Out[92]:
Country
0 Singapore
1 Bangladesh
2 Lebanon
3 Taiwan
4 South Korea
5 Rwanda
6 Haiti
7 Netherlands
8 Israel
9 India
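Two to_sql parameters worth knowing beyond the calls above are index and if_exists: the extra "index" column in the read_sql output earlier comes from to_sql's default index=True, and if_exists='replace' lets you re-run a cell without a "table already exists" error. A minimal sketch with made-up data (assumes sqlalchemy is installed):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'Country': ['Singapore', 'India'],
                   'Population': [5703600, 1364080000]})

# index=False avoids writing the DataFrame index as an extra column;
# if_exists='replace' drops and recreates the table on re-runs.
df.to_sql(name='populations', con=engine, index=False, if_exists='replace')

result = pd.read_sql_query(
    "SELECT Country FROM populations WHERE Population > 10000000", con=engine)
print(result)
```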

It is difficult to generalize pandas and SQL due to a wide array of issues, including permissions, security, online access, varying SQL engines, etc. Use these ideas as a starting point, and you will most likely need to do your own research for your own situation.
