{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "\n", "\n", "___\n", "
Copyright by Pierian Data Inc.
\n", "
For more information, visit us at www.pieriandata.com
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Inputs and Outputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
NOTE: In this course we will typically either read CSV files directly or use pandas-datareader to pull data from the web. Consider this lecture a quick overview of what is possible with pandas (we won't be working with SQL or Excel files in this course).
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Input and Output\n", "\n", "This notebook is the reference code for getting input and output, pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check out the references here! \n", "\n", "**This is the best online resource for how to read/write to a variety of data sources!**\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\n", "\n", "----\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| Format Type | Data Description | Reader | Writer |
| --- | --- | --- | --- |
| text | CSV | read_csv | to_csv |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | OpenDocument | read_excel | |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | Msgpack | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google Big Query | read_gbq | to_gbq |
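
The reader/writer pairing in the table can be tried without touching the disk. The snippet below is a minimal sketch and not part of the original lecture: the frame `demo` and its values are made up for illustration. It relies on `to_csv` returning the CSV text as a string when no file path is given, which `pd.read_csv` can then read back through `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical demo data, invented for this illustration
demo = pd.DataFrame({'a': [0, 4, 8], 'b': [1, 5, 9]})

# Writers are called off the DataFrame: df.to_<format>(...)
# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = demo.to_csv(index=False)

# Readers are called off pandas itself: pd.read_<format>(...)
restored = pd.read_csv(io.StringIO(csv_text))

# Check that the round trip preserved the data
print(restored.equals(demo))
```

The same pattern applies to the other rows of the table that list both a reader and a writer, although the binary formats generally need a real file or a bytes buffer rather than a text buffer.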
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading in a CSV\n", "Comma Separated Values files are text files that use commas as field delimeters.
\n", "Unless you're running the virtual environment included with the course, you may need to install xlrd and openpyxl.
\n", "In your terminal/command prompt run:\n", "\n", " conda install xlrd\n", " conda install openpyxl\n", "\n", "Then restart Jupyter Notebook.\n", "(or use pip install if you aren't using the Anaconda Distribution)\n", "\n", "## Understanding File Paths\n", "\n", "You have two options when reading a file with pandas:\n", "\n", "1. If your .py file or .ipynb notebook is located in the **exact** same folder location as the .csv file you want to read, simply pass in the file name as a string, for example:\n", " \n", " df = pd.read_csv('some_file.csv')\n", " \n", "2. Pass in the entire file path if you are located in a different directory. The file path must be 100% correct in order for this to work. For example:\n", "\n", " df = pd.read_csv(\"C:\\\\Users\\\\myself\\\\files\\\\some_file.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Print your current directory file path with pwd" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'C:\\\\Users\\\\Marcial\\\\Pierian-Data-Courses\\\\Machine-Learning-MasterClass\\\\03-Pandas'" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### List the files in your current directory with ls" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Volume in drive C has no label.\n", " Volume Serial Number is 3652-BD2F\n", "\n", " Directory of C:\\Users\\Marcial\\Pierian-Data-Courses\\Machine-Learning-MasterClass\\03-Pandas\n", "\n", "07/04/2020 06:10 PM .\n", "07/04/2020 06:10 PM ..\n", "07/02/2020 05:40 PM .ipynb_checkpoints\n", "06/30/2020 04:51 PM 565,390 00-Series.ipynb\n", "07/01/2020 12:48 PM 208,957 01-DataFrames.ipynb\n", "07/01/2020 12:48 PM 194,591 02-Conditional-Filtering.ipynb\n", "07/02/2020 07:02 PM 196,047 
03-Useful-Methods.ipynb\n", "07/01/2020 03:32 PM 64,227 04-Missing-Data.ipynb\n", "07/04/2020 01:28 PM 219,627 05-Groupby-Operations-and-MultiIndex.ipynb\n", "07/04/2020 03:19 PM 62,966 06-Combining-DataFrames.ipynb\n", "07/02/2020 07:02 PM 29,356 07-Text-Methods.ipynb\n", "07/02/2020 06:38 PM 35,705 08-Time-Methods.ipynb\n", "07/04/2020 06:10 PM 53,097 09-Inputs-and-Outputs.ipynb\n", "07/02/2020 05:34 PM 1,095 10-Pivot-Tables.ipynb\n", "07/02/2020 05:34 PM 951 11-Pandas-Project-Exercise.ipynb\n", "07/02/2020 05:34 PM 1,118 12-Pandas-Project-Exercise-Solution.ipynb\n", "07/04/2020 05:39 PM 51 example.csv\n", "07/04/2020 06:02 PM 5,022 example.xlsx\n", "02/07/2020 12:26 PM 177 movie_scores.csv\n", "07/01/2020 03:56 PM 17,727 mpg.csv\n", "07/04/2020 05:58 PM 5,022 my_excel_file.xlsx\n", "07/04/2020 05:56 PM 51 new_file.csv\n", "07/02/2020 05:56 PM 5,459 RetailSales_BeerWineLiquor.csv\n", "07/04/2020 05:56 PM 555 simple.html\n", "01/27/2020 02:28 PM 18,752 tips.csv\n", " 22 File(s) 1,685,943 bytes\n", " 3 Dir(s) 82,818,367,488 bytes free\n" ] } ], "source": [ "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "#### NOTE! Common confusion point! Take note that all read input methods are called directly from pandas with pd.read_ , all output methods are called directly off the dataframe with df.to_\n", "\n", "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV Input" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('example.csv')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
00123
14567
2891011
312131415
\n", "
" ], "text/plain": [ " a b c d\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('example.csv',index_col=0)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bcd
a
0123
4567
891011
12131415
\n", "
" ], "text/plain": [ " b c d\n", "a \n", "0 1 2 3\n", "4 5 6 7\n", "8 9 10 11\n", "12 13 14 15" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('example.csv')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
00123
14567
2891011
312131415
\n", "
" ], "text/plain": [ " a b c d\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV Output\n", "\n", "Set index=False if you do not want to save the index , otherwise it will add a new column to the .csv file that includes your index and call it \"Unnamed: 0\" if your index did not have a name. If you do want to save your index, simply set it to True (the default value)." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "df.to_csv('new_file.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HTML\n", "\n", "Pandas can read table tabs off of HTML. This only works if your firewall isn't blocking pandas from accessing the internet!\n", "\n", "Unless you're running the virtual environment included with the course, you may need to install lxml, htmllib5, and BeautifulSoup4.
\n", "In your terminal/command prompt run:\n", "\n", " conda install lxml\n", " \n", " or\n", " \n", " pip install lxml\n", " \n", "Then restart Jupyter Notebook (you may need to restart your computer).\n", "(or use pip install if you aren't using the Anaconda Distribution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## read_html\n", "\n", "### HTML Input\n", "\n", "Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects. NOTE: This only works with well defined objects in the html on the page, this can not magically read in tables that are images on a page." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "tables = pd.read_html('https://en.wikipedia.org/wiki/World_population')" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "26" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(tables) #tables" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Not Useful Tables\n", "Pandas found 26 tables on that page. Some are not useful:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
0NaNAn editor has expressed concern that this arti...
\n", "" ], "text/plain": [ " 0 1\n", "0 NaN An editor has expressed concern that this arti..." ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables that need formatting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some will be misaligned, meaning you need to do extra work to fix the columns and rows:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
World population (millions, UN estimates)[14]
#Top ten most populous countries200020152030[A]
01China[B]127013761416
12India105313111528
23United States283322356
34Indonesia212258295
45Pakistan136208245
56Brazil176206228
67Nigeria123182263
78Bangladesh131161186
89Russia146146149
910Mexico103127148
10NaNWorld total612773498501
11Notes: ^ 2030 = Medium variant. ^ China exclud...Notes: ^ 2030 = Medium variant. ^ China exclud...Notes: ^ 2030 = Medium variant. ^ China exclud...Notes: ^ 2030 = Medium variant. ^ China exclud...Notes: ^ 2030 = Medium variant. ^ China exclud...
\n", "
" ], "text/plain": [ " World population (millions, UN estimates)[14] \\\n", " # \n", "0 1 \n", "1 2 \n", "2 3 \n", "3 4 \n", "4 5 \n", "5 6 \n", "6 7 \n", "7 8 \n", "8 9 \n", "9 10 \n", "10 NaN \n", "11 Notes: ^ 2030 = Medium variant. ^ China exclud... \n", "\n", " \\\n", " Top ten most populous countries \n", "0 China[B] \n", "1 India \n", "2 United States \n", "3 Indonesia \n", "4 Pakistan \n", "5 Brazil \n", "6 Nigeria \n", "7 Bangladesh \n", "8 Russia \n", "9 Mexico \n", "10 World total \n", "11 Notes: ^ 2030 = Medium variant. ^ China exclud... \n", "\n", " \\\n", " 2000 \n", "0 1270 \n", "1 1053 \n", "2 283 \n", "3 212 \n", "4 136 \n", "5 176 \n", "6 123 \n", "7 131 \n", "8 146 \n", "9 103 \n", "10 6127 \n", "11 Notes: ^ 2030 = Medium variant. ^ China exclud... \n", "\n", " \\\n", " 2015 \n", "0 1376 \n", "1 1311 \n", "2 322 \n", "3 258 \n", "4 208 \n", "5 206 \n", "6 182 \n", "7 161 \n", "8 146 \n", "9 127 \n", "10 7349 \n", "11 Notes: ^ 2030 = Medium variant. ^ China exclud... \n", "\n", " \n", " 2030[A] \n", "0 1416 \n", "1 1528 \n", "2 356 \n", "3 295 \n", "4 245 \n", "5 228 \n", "6 263 \n", "7 186 \n", "8 149 \n", "9 148 \n", "10 8501 \n", "11 Notes: ^ 2030 = Medium variant. ^ China exclud... 
" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[1]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "world_pop = tables[1]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex([('World population (millions, UN estimates)[14]', ...),\n", " ('World population (millions, UN estimates)[14]', ...),\n", " ('World population (millions, UN estimates)[14]', ...),\n", " ('World population (millions, UN estimates)[14]', ...),\n", " ('World population (millions, UN estimates)[14]', ...)],\n", " )" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_pop.columns" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "world_pop = world_pop['World population (millions, UN estimates)[14]'].drop('#',axis=1)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Top ten most populous countries', '2000', '2015', '2030[A]'], dtype='object')" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_pop.columns" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "world_pop.columns = ['Countries', '2000', '2015', '2030 Est.']\n", "world_pop = world_pop.drop(11,axis=0)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Countries200020152030 Est.
0China[B]127013761416
1India105313111528
2United States283322356
3Indonesia212258295
4Pakistan136208245
5Brazil176206228
6Nigeria123182263
7Bangladesh131161186
8Russia146146149
9Mexico103127148
10World total612773498501
\n", "
" ], "text/plain": [ " Countries 2000 2015 2030 Est.\n", "0 China[B] 1270 1376 1416\n", "1 India 1053 1311 1528\n", "2 United States 283 322 356\n", "3 Indonesia 212 258 295\n", "4 Pakistan 136 208 245\n", "5 Brazil 176 206 228\n", "6 Nigeria 123 182 263\n", "7 Bangladesh 131 161 186\n", "8 Russia 146 146 149\n", "9 Mexico 103 127 148\n", "10 World total 6127 7349 8501" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "world_pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables that are intact" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankCountryPopulationArea (km2)Density (Pop. per km2)
01Singapore57036007108033
12Bangladesh1688700001439981173
23Lebanon685571310452656
34Taiwan2360426536193652
45South Korea5178057999538520
56Rwanda1237439726338470
67Haiti1157777927065428
78Netherlands1748000041526421
89Israel922000022072418
910India13640800003287240415
\n", "
" ], "text/plain": [ " Rank Country Population Area (km2) Density (Pop. per km2)\n", "0 1 Singapore 5703600 710 8033\n", "1 2 Bangladesh 168870000 143998 1173\n", "2 3 Lebanon 6855713 10452 656\n", "3 4 Taiwan 23604265 36193 652\n", "4 5 South Korea 51780579 99538 520\n", "5 6 Rwanda 12374397 26338 470\n", "6 7 Haiti 11577779 27065 428\n", "7 8 Netherlands 17480000 41526 421\n", "8 9 Israel 9220000 22072 418\n", "9 10 India 1364080000 3287240 415" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write to html Output\n", "\n", "If you are working on a website and want to quickly output the .html file, you can use to_html" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "df.to_html('simple.html',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**read_html** is not perfect, but its quite powerful for such a simple method call!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Excel Files\n", "\n", "Pandas can read in basic excel files (it will get errors if there are macros or extensive formulas relying on outside excel files), in general, pandas can only grab the raw information from an .excel file.\n", "\n", "#### NOTE: Requires the openpyxl and xlrd library! 
They are provided for you in our environment, or simply install them with:\n", "\n", " pip install openpyxl\n", " pip install xlrd\n", " \n", "Heavy Excel users may want to check out this website: https://www.python-excel.org/\n", "\n", "You can think of an Excel file as a Workbook containing sheets, which for pandas means each sheet can be a DataFrame.\n", "\n", "## Excel file input with read_excel()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "df = pd.read_excel('my_excel_file.xlsx',sheet_name='First_Sheet')" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
00123
14567
2891011
312131415
\n", "
" ], "text/plain": [ " a b c d\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What if you don't know the sheet name? Or want to run a for loop for certain sheet names? Or want every sheet?\n", "\n", "Several ways to do this: https://stackoverflow.com/questions/17977540/pandas-looking-up-the-list-of-sheets-in-an-excel-file" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['First_Sheet']" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Returns a list of sheet_names\n", "pd.ExcelFile('my_excel_file.xlsx').sheet_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Grab all sheets" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "excel_sheets = pd.read_excel('my_excel_file.xlsx',sheet_name=None)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(excel_sheets)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['First_Sheet'])" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "excel_sheets.keys()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
00123
14567
2891011
312131415
\n", "
" ], "text/plain": [ " a b c d\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "excel_sheets['First_Sheet']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write to Excel File" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "df.to_excel('example.xlsx',sheet_name='First_Sheet',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SQL Connections\n", "\n", "#### NOTE: Highly recommend you explore specific libraries for your specific SQL Engine. Simple search for your database+python in Google and the top results should hopefully include an API.\n", "\n", "* [MySQL](https://www.google.com/search?q=mysql+python)\n", "* [PostgreSQL](https://www.google.com/search?q=postgresql+python)\n", "* [MS SQL Server](https://www.google.com/search?q=MSSQLserver+python)\n", "* [Orcale](https://www.google.com/search?q=oracle+python)\n", "* [MongoDB](https://www.google.com/search?q=mongodb+python)\n", "\n", "Let's review pandas capabilities by using SQLite, which comes built in with Python.\n", "\n", "## Example SQL Database (temporary in your RAM)\n", "\n", "You will need to install sqlalchemy with:\n", "\n", " pip install sqlalchemy\n", " \n", "to follow along. To understand how to make a connection to your own database, make sure to review: https://docs.sqlalchemy.org/en/13/core/connections.html" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import create_engine" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "temp_db = create_engine('sqlite:///:memory:')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write to Database" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankCountryPopulationArea (km2)Density (Pop. per km2)
01Singapore57036007108033
12Bangladesh1688700001439981173
23Lebanon685571310452656
34Taiwan2360426536193652
45South Korea5178057999538520
56Rwanda1237439726338470
67Haiti1157777927065428
78Netherlands1748000041526421
89Israel922000022072418
910India13640800003287240415
\n", "
" ], "text/plain": [ " Rank Country Population Area (km2) Density (Pop. per km2)\n", "0 1 Singapore 5703600 710 8033\n", "1 2 Bangladesh 168870000 143998 1173\n", "2 3 Lebanon 6855713 10452 656\n", "3 4 Taiwan 23604265 36193 652\n", "4 5 South Korea 51780579 99538 520\n", "5 6 Rwanda 12374397 26338 470\n", "6 7 Haiti 11577779 27065 428\n", "7 8 Netherlands 17480000 41526 421\n", "8 9 Israel 9220000 22072 418\n", "9 10 India 1364080000 3287240 415" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[6]" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "pop = tables[6]" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "pop.to_sql(name='populations',con=temp_db)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read from SQL Database" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexRankCountryPopulationArea (km2)Density (Pop. per km2)
001Singapore57036007108033
112Bangladesh1688700001439981173
223Lebanon685571310452656
334Taiwan2360426536193652
445South Korea5178057999538520
556Rwanda1237439726338470
667Haiti1157777927065428
778Netherlands1748000041526421
889Israel922000022072418
9910India13640800003287240415
\n", "
" ], "text/plain": [ " index Rank Country Population Area (km2) Density (Pop. per km2)\n", "0 0 1 Singapore 5703600 710 8033\n", "1 1 2 Bangladesh 168870000 143998 1173\n", "2 2 3 Lebanon 6855713 10452 656\n", "3 3 4 Taiwan 23604265 36193 652\n", "4 4 5 South Korea 51780579 99538 520\n", "5 5 6 Rwanda 12374397 26338 470\n", "6 6 7 Haiti 11577779 27065 428\n", "7 7 8 Netherlands 17480000 41526 421\n", "8 8 9 Israel 9220000 22072 418\n", "9 9 10 India 1364080000 3287240 415" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read in an entire table\n", "pd.read_sql(sql='populations',con=temp_db)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Country
0Singapore
1Bangladesh
2Lebanon
3Taiwan
4South Korea
5Rwanda
6Haiti
7Netherlands
8Israel
9India
\n", "
" ], "text/plain": [ " Country\n", "0 Singapore\n", "1 Bangladesh\n", "2 Lebanon\n", "3 Taiwan\n", "4 South Korea\n", "5 Rwanda\n", "6 Haiti\n", "7 Netherlands\n", "8 Israel\n", "9 India" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read in with a SQL Query\n", "pd.read_sql_query(sql=\"SELECT Country FROM populations\",con=temp_db)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is difficult to generalize pandas and SQL, due to a wide array of issues, including permissions,security, online access, varying SQL engines, etc... Use these ideas as a starting off point, and you will most likely need to do your own research for your own situation." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }