___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Pivot Tables¶

Pivoting data can sometimes help clarify relationships and connections.

Full documentation on a variety of related pivot methods: https://pandas.pydata.org/docs/user_guide/reshaping.html

Data¶

In [1]:

import numpy as np
import pandas as pd

In [2]:

df = pd.read_csv('Sales_Funnel_CRM.csv')

In [3]:

df

Out[3]:

	Account Number	Company	Contact	Account Manager	Product	Licenses	Sale Price	Status
0	2123398	Google	Larry Pager	Edward Thorp	Analytics	150	2100000	Presented
1	2123398	Google	Larry Pager	Edward Thorp	Prediction	150	700000	Presented
2	2123398	Google	Larry Pager	Edward Thorp	Tracking	300	350000	Under Review
3	2192650	BOBO	Larry Pager	Edward Thorp	Analytics	150	2450000	Lost
4	420496	IKEA	Elon Tusk	Edward Thorp	Analytics	300	4550000	Won
5	636685	Tesla Inc.	Elon Tusk	Edward Thorp	Analytics	300	2800000	Under Review
6	636685	Tesla Inc.	Elon Tusk	Edward Thorp	Prediction	150	700000	Presented
7	1216870	Microsoft	Will Grates	Edward Thorp	Tracking	300	350000	Under Review
8	2200450	Walmart	Will Grates	Edward Thorp	Analytics	150	2450000	Lost
9	405886	Apple	Cindy Phoner	Claude Shannon	Analytics	300	4550000	Won
10	470248	Exxon Mobile	Cindy Phoner	Claude Shannon	Analytics	150	2100000	Presented
11	698032	ATT	Cindy Phoner	Claude Shannon	Tracking	150	350000	Under Review
12	698032	ATT	Cindy Phoner	Claude Shannon	Prediction	150	700000	Presented
13	902797	CVS Health	Emma Gordian	Claude Shannon	Tracking	450	490000	Won
14	2046943	Salesforce	Emma Gordian	Claude Shannon	Analytics	750	7000000	Won
15	2169499	Cisco	Emma Gordian	Claude Shannon	Analytics	300	4550000	Lost
16	2169499	Cisco	Emma Gordian	Claude Shannon	GPS Positioning	300	350000	Presented

The pivot() method¶

The pivot method reshapes data based on column values and reassignment of the index. Keep in mind, it doesn't always make sense to pivot data. In our machine learning lessons, we will see that our data doesn't need to be pivoted. Pivot methods are mainly for data analysis,visualization, and exploration.

Here is an image showing the idea behind a pivot() call:

In [4]:

help(pd.pivot)

Help on function pivot in module pandas.core.reshape.pivot:

pivot(data:'DataFrame', index=None, columns=None, values=None) -> 'DataFrame'
    Return reshaped DataFrame organized by given index / column values.
    
    Reshape data (produce a "pivot" table) based on column values. Uses
    unique values from specified `index` / `columns` to form axes of the
    resulting DataFrame. This function does not support data
    aggregation, multiple values will result in a MultiIndex in the
    columns. See the :ref:`User Guide <reshaping>` for more on reshaping.
    
    Parameters
    ----------
    data : DataFrame
    index : str or object, optional
        Column to use to make new frame's index. If None, uses
        existing index.
    columns : str or object
        Column to use to make new frame's columns.
    values : str, object or a list of the previous, optional
        Column(s) to use for populating new frame's values. If not
        specified, all remaining columns will be used and the result will
        have hierarchically indexed columns.
    
        .. versionchanged:: 0.23.0
           Also accept list of column names.
    
    Returns
    -------
    DataFrame
        Returns reshaped DataFrame.
    
    Raises
    ------
    ValueError:
        When there are any `index`, `columns` combinations with multiple
        values. `DataFrame.pivot_table` when you need to aggregate.
    
    See Also
    --------
    DataFrame.pivot_table : Generalization of pivot that can handle
        duplicate values for one index/column pair.
    DataFrame.unstack : Pivot based on the index values instead of a
        column.
    
    Notes
    -----
    For finer-tuned control, see hierarchical indexing documentation along
    with the related stack/unstack methods.
    
    Examples
    --------
    >>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
    ...                            'two'],
    ...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    ...                    'baz': [1, 2, 3, 4, 5, 6],
    ...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
    >>> df
        foo   bar  baz  zoo
    0   one   A    1    x
    1   one   B    2    y
    2   one   C    3    z
    3   two   A    4    q
    4   two   B    5    w
    5   two   C    6    t
    
    >>> df.pivot(index='foo', columns='bar', values='baz')
    bar  A   B   C
    foo
    one  1   2   3
    two  4   5   6
    
    >>> df.pivot(index='foo', columns='bar')['baz']
    bar  A   B   C
    foo
    one  1   2   3
    two  4   5   6
    
    >>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
          baz       zoo
    bar   A  B  C   A  B  C
    foo
    one   1  2  3   x  y  z
    two   4  5  6   q  w  t
    
    A ValueError is raised if there are any duplicates.
    
    >>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
    ...                    "bar": ['A', 'A', 'B', 'C'],
    ...                    "baz": [1, 2, 3, 4]})
    >>> df
       foo bar  baz
    0  one   A    1
    1  one   A    2
    2  two   B    3
    3  two   C    4
    
    Notice that the first two rows are the same for our `index`
    and `columns` arguments.
    
    >>> df.pivot(index='foo', columns='bar', values='baz')
    Traceback (most recent call last):
       ...
    ValueError: Index contains duplicate entries, cannot reshape

Note: Common Point of Confusion: Students often just randomly pass in index,column, and value choices in an attempt to see the changes. This often just leads to formatting errors. You should first go through this checklist BEFORE running a pivot():¶

What question are you trying to answer?
What would a dataframe that answers the question look like? Does it need a pivot()
What you want the resulting pivot to look like? Do you need all the original columns?

In [5]:

df

Out[5]:

	Account Number	Company	Contact	Account Manager	Product	Licenses	Sale Price	Status
0	2123398	Google	Larry Pager	Edward Thorp	Analytics	150	2100000	Presented
1	2123398	Google	Larry Pager	Edward Thorp	Prediction	150	700000	Presented
2	2123398	Google	Larry Pager	Edward Thorp	Tracking	300	350000	Under Review
3	2192650	BOBO	Larry Pager	Edward Thorp	Analytics	150	2450000	Lost
4	420496	IKEA	Elon Tusk	Edward Thorp	Analytics	300	4550000	Won
5	636685	Tesla Inc.	Elon Tusk	Edward Thorp	Analytics	300	2800000	Under Review
6	636685	Tesla Inc.	Elon Tusk	Edward Thorp	Prediction	150	700000	Presented
7	1216870	Microsoft	Will Grates	Edward Thorp	Tracking	300	350000	Under Review
8	2200450	Walmart	Will Grates	Edward Thorp	Analytics	150	2450000	Lost
9	405886	Apple	Cindy Phoner	Claude Shannon	Analytics	300	4550000	Won
10	470248	Exxon Mobile	Cindy Phoner	Claude Shannon	Analytics	150	2100000	Presented
11	698032	ATT	Cindy Phoner	Claude Shannon	Tracking	150	350000	Under Review
12	698032	ATT	Cindy Phoner	Claude Shannon	Prediction	150	700000	Presented
13	902797	CVS Health	Emma Gordian	Claude Shannon	Tracking	450	490000	Won
14	2046943	Salesforce	Emma Gordian	Claude Shannon	Analytics	750	7000000	Won
15	2169499	Cisco	Emma Gordian	Claude Shannon	Analytics	300	4550000	Lost
16	2169499	Cisco	Emma Gordian	Claude Shannon	GPS Positioning	300	350000	Presented

** What type of question does a pivot help answer?**

Imagine we wanted to know, how many licenses of each product type did Google purchase? Currently the way the data is formatted is hard to read. Let's pivot it so this is clearer, we will take a subset of the data for the question at hand.

In [6]:

# Let's take a subset, otherwise we'll get an error due to duplicate rows and data
licenses = df[['Company','Product','Licenses']]
licenses

Out[6]:

	Company	Product	Licenses
0	Google	Analytics	150
1	Google	Prediction	150
2	Google	Tracking	300
3	BOBO	Analytics	150
4	IKEA	Analytics	300
5	Tesla Inc.	Analytics	300
6	Tesla Inc.	Prediction	150
7	Microsoft	Tracking	300
8	Walmart	Analytics	150
9	Apple	Analytics	300
10	Exxon Mobile	Analytics	150
11	ATT	Tracking	150
12	ATT	Prediction	150
13	CVS Health	Tracking	450
14	Salesforce	Analytics	750
15	Cisco	Analytics	300
16	Cisco	GPS Positioning	300

In [7]:

pd.pivot(data=licenses,index='Company',columns='Product',values='Licenses')

Out[7]:

Product	Analytics	GPS Positioning	Prediction	Tracking
Company
Google	150.0	NaN	150.0	300.0
ATT	NaN	NaN	150.0	150.0
Apple	300.0	NaN	NaN	NaN
BOBO	150.0	NaN	NaN	NaN
CVS Health	NaN	NaN	NaN	450.0
Cisco	300.0	300.0	NaN	NaN
Exxon Mobile	150.0	NaN	NaN	NaN
IKEA	300.0	NaN	NaN	NaN
Microsoft	NaN	NaN	NaN	300.0
Salesforce	750.0	NaN	NaN	NaN
Tesla Inc.	300.0	NaN	150.0	NaN
Walmart	150.0	NaN	NaN	NaN

The pivot_table() method¶

Similar to the pivot() method, the pivot_table() can add aggregation functions to a pivot call.

In [8]:

df

Out[8]:

	Account Number	Company	Contact	Account Manager	Product	Licenses	Sale Price	Status
0	2123398	Google	Larry Pager	Edward Thorp	Analytics	150	2100000	Presented
1	2123398	Google	Larry Pager	Edward Thorp	Prediction	150	700000	Presented
2	2123398	Google	Larry Pager	Edward Thorp	Tracking	300	350000	Under Review
3	2192650	BOBO	Larry Pager	Edward Thorp	Analytics	150	2450000	Lost
4	420496	IKEA	Elon Tusk	Edward Thorp	Analytics	300	4550000	Won
5	636685	Tesla Inc.	Elon Tusk	Edward Thorp	Analytics	300	2800000	Under Review
6	636685	Tesla Inc.	Elon Tusk	Edward Thorp	Prediction	150	700000	Presented
7	1216870	Microsoft	Will Grates	Edward Thorp	Tracking	300	350000	Under Review
8	2200450	Walmart	Will Grates	Edward Thorp	Analytics	150	2450000	Lost
9	405886	Apple	Cindy Phoner	Claude Shannon	Analytics	300	4550000	Won
10	470248	Exxon Mobile	Cindy Phoner	Claude Shannon	Analytics	150	2100000	Presented
11	698032	ATT	Cindy Phoner	Claude Shannon	Tracking	150	350000	Under Review
12	698032	ATT	Cindy Phoner	Claude Shannon	Prediction	150	700000	Presented
13	902797	CVS Health	Emma Gordian	Claude Shannon	Tracking	450	490000	Won
14	2046943	Salesforce	Emma Gordian	Claude Shannon	Analytics	750	7000000	Won
15	2169499	Cisco	Emma Gordian	Claude Shannon	Analytics	300	4550000	Lost
16	2169499	Cisco	Emma Gordian	Claude Shannon	GPS Positioning	300	350000	Presented

In [9]:

# Notice Account Number sum() doesn't make sense to keep/use
pd.pivot_table(df,index="Company",aggfunc='sum')

Out[9]:

	Account Number	Licenses	Sale Price
Company
Google	6370194	600	3150000
ATT	1396064	300	1050000
Apple	405886	300	4550000
BOBO	2192650	150	2450000
CVS Health	902797	450	490000
Cisco	4338998	600	4900000
Exxon Mobile	470248	150	2100000
IKEA	420496	300	4550000
Microsoft	1216870	300	350000
Salesforce	2046943	750	7000000
Tesla Inc.	1273370	450	3500000
Walmart	2200450	150	2450000

In [10]:

# Either grab the columns
pd.pivot_table(df,index="Company",aggfunc='sum')[['Licenses','Sale Price']]

Out[10]:

	Licenses	Sale Price
Company
Google	600	3150000
ATT	300	1050000
Apple	300	4550000
BOBO	150	2450000
CVS Health	450	490000
Cisco	600	4900000
Exxon Mobile	150	2100000
IKEA	300	4550000
Microsoft	300	350000
Salesforce	750	7000000
Tesla Inc.	450	3500000
Walmart	150	2450000

In [11]:

# Or state them as wanted values
pd.pivot_table(df,index="Company",aggfunc='sum',values=['Licenses','Sale Price'])

Out[11]:

	Licenses	Sale Price
Company
Google	600	3150000
ATT	300	1050000
Apple	300	4550000
BOBO	150	2450000
CVS Health	450	490000
Cisco	600	4900000
Exxon Mobile	150	2100000
IKEA	300	4550000
Microsoft	300	350000
Salesforce	750	7000000
Tesla Inc.	450	3500000
Walmart	150	2450000

In [12]:

df.groupby('Company').sum()[['Licenses','Sale Price']]

Out[12]:

	Licenses	Sale Price
Company
Google	600	3150000
ATT	300	1050000
Apple	300	4550000
BOBO	150	2450000
CVS Health	450	490000
Cisco	600	4900000
Exxon Mobile	150	2100000
IKEA	300	4550000
Microsoft	300	350000
Salesforce	750	7000000
Tesla Inc.	450	3500000
Walmart	150	2450000

In [13]:

pd.pivot_table(df,index=["Account Manager","Contact"],values=['Sale Price'],aggfunc='sum')

Out[13]:

		Sale Price
Account Manager	Contact
Claude Shannon	Cindy Phoner	7700000
Claude Shannon	Emma Gordian	12390000
Edward Thorp	Elon Tusk	8050000
	Larry Pager	5600000
	Will Grates	2800000

Columns are optional - they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.

In [14]:

pd.pivot_table(df,index=["Account Manager","Contact"],values=["Sale Price"],columns=["Product"],aggfunc=[np.sum])

Out[14]:

		sum
		Sale Price
	Product	Analytics	GPS Positioning	Prediction	Tracking
Account Manager	Contact
Claude Shannon	Cindy Phoner	6650000.0	NaN	700000.0	350000.0
Claude Shannon	Emma Gordian	11550000.0	350000.0	NaN	490000.0
Edward Thorp	Elon Tusk	7350000.0	NaN	700000.0	NaN
	Larry Pager	4550000.0	NaN	700000.0	350000.0
	Will Grates	2450000.0	NaN	NaN	350000.0

In [15]:

pd.pivot_table(df,index=["Account Manager","Contact"],values=["Sale Price"],columns=["Product"],aggfunc=[np.sum],fill_value=0)

Out[15]:

		sum
		Sale Price
	Product	Analytics	GPS Positioning	Prediction	Tracking
Account Manager	Contact
Claude Shannon	Cindy Phoner	6650000	0	700000	350000
Claude Shannon	Emma Gordian	11550000	350000	0	490000
Edward Thorp	Elon Tusk	7350000	0	700000	0
	Larry Pager	4550000	0	700000	350000
	Will Grates	2450000	0	0	350000

In [16]:

# Can add multiple agg functions
pd.pivot_table(df,index=["Account Manager","Contact"],values=["Sale Price"],columns=["Product"],
               aggfunc=[np.sum,np.mean],fill_value=0)

Out[16]:

		sum				mean
		Sale Price				Sale Price
	Product	Analytics	GPS Positioning	Prediction	Tracking	Analytics	GPS Positioning	Prediction	Tracking
Account Manager	Contact
Claude Shannon	Cindy Phoner	6650000	0	700000	350000	3325000	0	700000	350000
Claude Shannon	Emma Gordian	11550000	350000	0	490000	5775000	350000	0	490000
Edward Thorp	Elon Tusk	7350000	0	700000	0	3675000	0	700000	0
	Larry Pager	4550000	0	700000	350000	2275000	0	700000	350000
	Will Grates	2450000	0	0	350000	2450000	0	0	350000

In [17]:

# Can add on multiple columns
pd.pivot_table(df,index=["Account Manager","Contact"],values=["Sale Price","Licenses"],columns=["Product"],
               aggfunc=[np.sum],fill_value=0)

Out[17]:

		sum
		Licenses				Sale Price
	Product	Analytics	GPS Positioning	Prediction	Tracking	Analytics	GPS Positioning	Prediction	Tracking
Account Manager	Contact
Claude Shannon	Cindy Phoner	450	0	150	150	6650000	0	700000	350000
Claude Shannon	Emma Gordian	1050	300	0	450	11550000	350000	0	490000
Edward Thorp	Elon Tusk	600	0	150	0	7350000	0	700000	0
	Larry Pager	300	0	150	300	4550000	0	700000	350000
	Will Grates	150	0	0	300	2450000	0	0	350000

In [18]:

# Can add on multiple columns
pd.pivot_table(df,index=["Account Manager","Contact","Product"],values=["Sale Price","Licenses"],
               aggfunc=[np.sum],fill_value=0)

Out[18]:

			sum
			Licenses	Sale Price
Account Manager	Contact	Product
Claude Shannon	Cindy Phoner	Analytics	450	6650000
		Prediction	150	700000
		Tracking	150	350000
	Emma Gordian	Analytics	1050	11550000
		GPS Positioning	300	350000
		Tracking	450	490000
Edward Thorp	Elon Tusk	Analytics	600	7350000
	Elon Tusk	Prediction	150	700000
	Larry Pager	Analytics	300	4550000
		Prediction	150	700000
		Tracking	300	350000
	Will Grates	Analytics	150	2450000
	Will Grates	Tracking	300	350000

In [19]:

# get Final "ALL" with margins = True
# Can add on multiple columns
pd.pivot_table(df,index=["Account Manager","Contact","Product"],values=["Sale Price","Licenses"],
               aggfunc=[np.sum],fill_value=0,margins=True)

Out[19]:

			sum
			Licenses	Sale Price
Account Manager	Contact	Product
Claude Shannon	Cindy Phoner	Analytics	450	6650000
		Prediction	150	700000
		Tracking	150	350000
	Emma Gordian	Analytics	1050	11550000
		GPS Positioning	300	350000
		Tracking	450	490000
Edward Thorp	Elon Tusk	Analytics	600	7350000
	Elon Tusk	Prediction	150	700000
	Larry Pager	Analytics	300	4550000
		Prediction	150	700000
		Tracking	300	350000
	Will Grates	Analytics	150	2450000
	Will Grates	Tracking	300	350000
All			4500	36540000

In [20]:

pd.pivot_table(df,index=["Account Manager","Status"],values=["Sale Price"],
               aggfunc=[np.sum],fill_value=0,margins=True)

Out[20]:

		sum
		Sale Price
Account Manager	Status
Claude Shannon	Lost	4550000
	Presented	3150000
	Under Review	350000
	Won	12040000
Edward Thorp	Lost	4900000
	Presented	3500000
	Under Review	3500000
	Won	4550000
All		36540000

99 KiB Raw Blame History

Pivot Tables¶

Data¶

The pivot() method¶

Note: Common Point of Confusion: Students often just randomly pass in index,column, and value choices in an attempt to see the changes. This often just leads to formatting errors. You should first go through this checklist BEFORE running a pivot():¶

The pivot_table() method¶

99 KiB

Raw Blame History