
<html> <head> </head>

___

Copyright Pierian Data. For more information, visit us at www.pieriandata.com

Time Resampling

Let's learn how to resample time series data! This will be useful later on in the course!

In [1]:
import pandas as pd
%matplotlib inline

Import the data

For this exercise we'll look at Starbucks stock data from 2015 to 2018 which includes daily closing prices and trading volumes.

In [2]:
df = pd.read_csv('../Data/starbucks.csv', index_col='Date', parse_dates=True)

Note: the above code is a more concise way of doing the following:

df = pd.read_csv('../Data/starbucks.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
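
To check that the two approaches agree without depending on the course's CSV file, here is a minimal sketch that reads a small in-memory CSV both ways. The sample_csv text and the variable names are made up for illustration; the three rows are copied from the head of the Starbucks data.

```python
import io
import pandas as pd

# Hypothetical stand-in for starbucks.csv (first three rows only)
sample_csv = """Date,Close,Volume
2015-01-02,38.0061,6906098
2015-01-05,37.2781,11623796
2015-01-06,36.9748,7664340
"""

# One-step version: parse the 'Date' column and use it as the index
one_step = pd.read_csv(io.StringIO(sample_csv), index_col='Date', parse_dates=True)

# Three-step version: read, convert, then set the index
three_step = pd.read_csv(io.StringIO(sample_csv))
three_step['Date'] = pd.to_datetime(three_step['Date'])
three_step.set_index('Date', inplace=True)

# Both frames should carry the same dates and the same values
same_index = one_step.index.equals(three_step.index)
same_values = (one_step.values == three_step.values).all()
print(same_index, same_values)
```
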
In [3]:
df.head()
Out[3]:
Close Volume
Date
2015-01-02 38.0061 6906098
2015-01-05 37.2781 11623796
2015-01-06 36.9748 7664340
2015-01-07 37.8848 9732554
2015-01-08 38.4961 13170548

resample()

A common operation with time series data is resampling based on the time series index. Let's see how to use the resample() method. [reference]

In [4]:
# Our index
df.index
Out[4]:
DatetimeIndex(['2015-01-02', '2015-01-05', '2015-01-06', '2015-01-07',
               '2015-01-08', '2015-01-09', '2015-01-12', '2015-01-13',
               '2015-01-14', '2015-01-15',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)

When calling .resample() you first need to pass in a rule parameter, then you need to call some sort of aggregation function.

The rule parameter describes the frequency with which to apply the aggregation function (daily, monthly, yearly, etc.)
It is passed in using an "offset alias" - refer to the table below. [reference]

The aggregation function is needed because resampling groups several rows into each period, and we need some sort of mathematical rule to combine them into a single value (mean, sum, count, etc.)
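
As a quick illustration of the two steps (rule, then aggregation), the sketch below resamples a small made-up DataFrame rather than the Starbucks data. It uses the 'MS' (month start) alias, so each result row is labeled with the first day of its month. The toy values are assumptions chosen so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Toy data (illustrative only): three rows in January, two in February
toy = pd.DataFrame(
    {'Close': [10.0, 12.0, 14.0, 20.0, 22.0]},
    index=pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-30',
                          '2015-02-03', '2015-02-27']),
)

# Step 1: the rule groups rows by calendar month.
# Step 2: the aggregation collapses each group to a single row.
monthly_mean = toy.resample(rule='MS').mean()    # January: 12.0, February: 21.0
monthly_count = toy.resample(rule='MS').count()  # January: 3, February: 2

print(monthly_mean)
print(monthly_count)
```

The same rule can be paired with any aggregation, which is why the two steps are kept separate.
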

TIME SERIES OFFSET ALIASES

ALIAS     DESCRIPTION
B         business day frequency
C         custom business day frequency (experimental)
D         calendar day frequency
W         weekly frequency
M         month end frequency
SM        semi-month end frequency (15th and end of month)
BM        business month end frequency
CBM       custom business month end frequency
MS        month start frequency
SMS       semi-month start frequency (1st and 15th)
BMS       business month start frequency
CBMS      custom business month start frequency
Q         quarter end frequency
BQ        business quarter end frequency
QS        quarter start frequency
BQS       business quarter start frequency
A         year end frequency
BA        business year end frequency
AS        year start frequency
BAS       business year start frequency
BH        business hour frequency
H         hourly frequency
T, min    minutely frequency
S         secondly frequency
L, ms     milliseconds
U, us     microseconds
N         nanoseconds
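
Each alias is shorthand for a pandas date offset object. The small sketch below, which assumes the to_offset helper from pandas.tseries.frequencies, shows the mapping for a few aliases from the table. Be aware that the spelling of some aliases depends on the pandas version (newer releases prefer 'ME' over 'M' and 'YE' over 'A', for example), so this sketch is limited to aliases we expect both older and newer releases to accept.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# A handful of aliases from the table above
aliases = ['B', 'D', 'W', 'MS', 'QS']

# to_offset converts an alias string into the offset object it stands for
offsets = {alias: to_offset(alias) for alias in aliases}

for alias, offset in offsets.items():
    print(f'{alias:>3} -> {offset!r}')
```
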
In [5]:
# Yearly Means
df.resample(rule='A').mean()
Out[5]:
Close Volume
Date
2015-12-31 50.078100 8.649190e+06
2016-12-31 53.891732 9.300633e+06
2017-12-31 55.457310 9.296078e+06
2018-12-31 56.870005 1.122883e+07

Resampling rule 'A' takes all of the data points in a given year, applies the aggregation function (in this case we calculate the mean), and reports the result as the last day of that year.
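
To make that description concrete, here is a rough manual equivalent built with groupby on a made-up Series (the values and dates are illustrative, not taken from the Starbucks data). It is a sketch of the idea, not how pandas implements resample() internally.

```python
import pandas as pd

# Toy series (illustrative only): two rows in 2015, two rows in 2016
toy = pd.Series(
    [1.0, 3.0, 10.0, 20.0],
    index=pd.to_datetime(['2015-03-02', '2015-09-15',
                          '2016-01-04', '2016-12-30']),
    name='Close',
)

# Step 1: take all of the data points in a given year and aggregate them
by_year = toy.groupby(toy.index.year).mean()

# Step 2: report each result as the last calendar day of its year
by_year.index = pd.to_datetime([f'{year}-12-31' for year in by_year.index])

print(by_year)
```

Note that the label is the calendar year end even when no row carries that date, which is why the yearly output above shows dates such as 2016-12-31 that never appear in the original index.
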

Custom Resampling Functions

We're not limited to pandas built-in summary functions (min/max/mean etc.). We can define our own function:

In [6]:
def first_day(entry):
    """
    Returns the first instance of the period, regardless of sampling rate.
    """
    if len(entry):  # handles the case of missing data
        return entry.iloc[0]  # .iloc selects by position, independent of the index labels
In [7]:
df.resample(rule='A').apply(first_day)
Out[7]:
Close Volume
Date
2015-12-31 38.0061 6906098
2016-12-31 55.0780 13521544
2017-12-31 53.1100 7809307
2018-12-31 56.3243 7215978
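
As another sketch in the same spirit, a custom function can compute something that no single built-in covers, such as the spread between the highest and lowest value in each period. The price_range function, the toy Series, and the 'MS' (month start) rule below are illustrative assumptions.

```python
import pandas as pd

def price_range(entry):
    """Return max minus min for one resampling period (NaN if the period is empty)."""
    if len(entry):  # same missing-data guard as first_day
        return entry.max() - entry.min()
    return float('nan')

# Toy series (illustrative only): rows in January and March, none in February
toy = pd.Series(
    [10.0, 14.0, 11.0, 30.0, 25.0],
    index=pd.to_datetime(['2015-01-02', '2015-01-15', '2015-01-30',
                          '2015-03-03', '2015-03-20']),
)

# February has no rows, so its result is NaN
monthly_range = toy.resample('MS').apply(price_range)
print(monthly_range)
```
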

Plotting

In [8]:
df['Close'].resample('A').mean().plot.bar(title='Yearly Mean Closing Price for Starbucks');

Pandas treats each sample as its own trace, and by default assigns different colors to each one. If you want, you can pass a color argument to assign your own color collection, or to set a uniform color. For example, color='#1f77b4' sets a uniform "steel blue" color.

Also, the above code can be broken into two lines for improved readability.

In [9]:
title = 'Yearly Mean Closing Price for Starbucks'
df['Close'].resample('A').mean().plot.bar(title=title, color=['#1f77b4']);
In [10]:
title = 'Monthly Max Closing Price for Starbucks'
df['Close'].resample('M').max().plot.bar(figsize=(16,6), title=title, color='#1f77b4');

That is it! Up next we'll learn about time shifts!

</html>