%matplotlib inline
Exploring dates and times in Python
So many dates and times
For many of us in the business analytics world, working with dates and times is bread and butter stuff. We can work all kinds of magic with Excel’s “serial” datetimes and worksheet functions like DATEVALUE(), WEEKDAY(), MONTH(), DATE() and many more. We can wield VBA functions like DateDiff with the best of them. We all know that, according to MS, time started on 1/1/1900 (or 1904 if you’re on a Mac) and we just live with the fact that we can’t do reliable datetime work before those magic epochs. I even wrote and released a free and open source MS Access add-in called Hillmaker many years ago that did statistical analysis of time interval data by day of week and time of day. That tool still finds lots of use in the healthcare world for analyzing occupancy of things like nursing units, emergency departments, recovery room, and many other places with entry and exit timestamp data.
So, when I started working with R and Python, one of the first things I did was see how they handled dates and times. I learned about the UTC epoch and the POSIX time standard. In R, I quickly learned to love the lubridate package. And then I started to learn about dates and times in Python and began to understand what programmers mean when they say “there be dragons in there”. So, while I’m far from an expert, I did put together this little intro to Python dates and times aimed at business analytics types who are new to Python. It is by no means comprehensive, but it does cover the basics of the base Python datetime and timedelta classes along with the Numpy datetime64 and timedelta64 classes. Oh yeah, it also covers the pandas Timestamp and Timedelta classes (confused yet?). I also show one particular dragon that bit me - conversion between Numpy datetimes and Pandas datetimes - even though Pandas uses Numpy under the hood.
Python’s datetime and timedelta classes
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from datetime import datetime
from datetime import timedelta
Python has a module called datetime as part of its base system.
https://docs.python.org/3/library/datetime.html
The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.
Datetimes can be naive or aware
- naive objects don’t know about timezones and their interpretation is application (coder) dependent
- aware objects do know about timezones and are unambiguous
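Here’s a minimal sketch of the difference using the standard library’s timezone class (the fixed UTC-4 offset is just for illustration):
from datetime import timezone

naive_dt = datetime(2012, 5, 1, 8, 30)                                         # no tzinfo attached
aware_dt = datetime(2012, 5, 1, 8, 30, tzinfo=timezone(timedelta(hours=-4)))   # fixed UTC-4 offset
print(naive_dt.tzinfo)       # None
print(aware_dt.tzinfo)       # UTC-04:00
print(aware_dt.utcoffset())  # -1 day, 20:00:00 (i.e. -4 hours)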
dates = [datetime(2012, 5, 1, 8, 30),
         datetime(2012, 5, 1, 9, 25),
         datetime(2012, 5, 1, 15, 30)]
dates
[datetime.datetime(2012, 5, 1, 8, 30),
datetime.datetime(2012, 5, 1, 9, 25),
datetime.datetime(2012, 5, 1, 15, 30)]
type(dates)
list
type(dates[0])
datetime.datetime
intime_dt = dates[0]
outtime_dt = dates[1]
type(intime_dt)
datetime.datetime
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
# but not in Python datetime objects
print ('Datetime: {}'.format(intime_dt))
print ('Date: {}'.format(intime_dt.date()))
print ('Month: {}'.format(intime_dt.month))
print ('Day: {}'.format(intime_dt.day))
#print 'DayOfWeek: {}'.format(intime_dt.dayofweek)
print ('Weekday: {}'.format(intime_dt.weekday()))
#print 'DayOfYear: {}'.format(intime_dt.dayofyear)
#print 'WeekOfYear: {}'.format(intime_dt.weekofyear)
#print 'Quarter: {}'.format(intime_dt.quarter)
print ('Hour: {}'.format(intime_dt.hour))
print ('Minute: {}'.format(intime_dt.minute))
print ('Second: {}'.format(intime_dt.second))
print ('Microsecond: {}'.format(intime_dt.microsecond))
Datetime: 2012-05-01 08:30:00
Date: 2012-05-01
Month: 5
Day: 1
Weekday: 1
Hour: 8
Minute: 30
Second: 0
Microsecond: 0
los_td = outtime_dt - intime_dt
print(los_td)
type(los_td)
0:55:00
datetime.timedelta
List out the attributes for timedelta objects. Notice how we are filtering out any attributes containing a double underscore. Change that if you want to see what’s getting filtered out.
atts = [att for att in dir(los_td) if '__' not in att]
#atts = [att for att in dir(los_td)]
print (atts)
['days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']
print ('Timedelta: {}'.format(los_td))
print ('Seconds: {}'.format(los_td.seconds))
print ('Total Seconds: {}'.format(los_td.total_seconds()))
print ('Microseconds: {}'.format(los_td.microseconds))
print ('Resolution: {}'.format(los_td.resolution))
print ('Min: {}'.format(los_td.min))
print ('Max: {}'.format(los_td.max))
Timedelta: 0:55:00
Seconds: 3300
Total Seconds: 3300.0
Microseconds: 0
Resolution: 0:00:00.000001
Min: -999999999 days, 0:00:00
Max: 999999999 days, 23:59:59.999999
So, you can use the total_seconds() method to get a duration in seconds from a timedelta object. If the duration is less than 1 second, it returns fractional seconds. To convert to other time units, just do the simple math, as shown below.
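For example, here’s the simple math for our 55 minute timedelta from above (just a sketch; divide by the appropriate number of seconds, or divide one timedelta by another):
los_td.total_seconds() / 60.0      # 55.0 minutes
los_td.total_seconds() / 3600.0    # ~0.92 hours
los_td / timedelta(minutes=1)      # 55.0 - dividing one timedelta by another also works in Python 3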
# timedelta takes (days, seconds, microseconds) as its first three positional arguments
timedelta(0, 0, 10)
datetime.timedelta(microseconds=10)
timedelta(0, 0, 17).total_seconds()
1.7e-05
atts = [att for att in dir(intime_dt) if '__' not in att]
print (atts)
['astimezone', 'combine', 'ctime', 'date', 'day', 'dst', 'fold', 'fromisoformat', 'fromordinal', 'fromtimestamp', 'hour', 'isocalendar', 'isoformat', 'isoweekday', 'max', 'microsecond', 'min', 'minute', 'month', 'now', 'replace', 'resolution', 'second', 'strftime', 'strptime', 'time', 'timestamp', 'timetuple', 'timetz', 'today', 'toordinal', 'tzinfo', 'tzname', 'utcfromtimestamp', 'utcnow', 'utcoffset', 'utctimetuple', 'weekday', 'year']
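Two of the more useful items in that list are strptime and strftime for parsing and formatting datetimes. A quick sketch (the format strings are just examples):
parsed_dt = datetime.strptime('2012-05-01 08:30', '%Y-%m-%d %H:%M')   # string -> datetime
parsed_dt.strftime('%A %B %d, %Y at %I:%M %p')                        # 'Tuesday May 01, 2012 at 08:30 AM'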
Numpy’s newish datetime64 and timedelta64 classes
The Numpy library (v 1.7.0) introduced a new data type called datetime64. Numpy uses this name because there is already a datetime data type in Python’s datetime library. Like R, dates and times in Python are stored based on the notion of POSIX time - the number of seconds elapsed since Jan 1, 1970 00:00:00 UTC, the Unix epoch. This doesn’t mean that dates and times before that cannot be represented - we just use negative numbers. Representing and working with dates and times for computing is a pretty interesting topic, fraught with complexities like time zones, daylight saving time, and leap seconds. A few good places to start are:
Numpy docs - http://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html
Pandas docs - http://pandas.pydata.org/pandas-docs/dev/timeseries.html
See http://cr.yp.to/proto/utctai.html for gory details on UTC and TAI and leap seconds and the problems with the POSIX standard.
http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
Here’s the actual numpy proposal for datetime64 and timedelta64, authored by Travis Oliphant.
https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst
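To make the epoch idea concrete, here’s a small sketch showing that both base Python and numpy count from 1970-01-01 00:00:00 UTC, and that negative values reach back before the epoch:
from datetime import timezone

datetime.fromtimestamp(0, tz=timezone.utc)   # 1970-01-01 00:00:00+00:00
np.datetime64(0, 's')                        # numpy.datetime64('1970-01-01T00:00:00')
np.datetime64(-86400, 's')                   # one day before the epoch: 1969-12-31T00:00:00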
np.datetime64('2012-05-01 08:30:00')
numpy.datetime64('2012-05-01T08:30:00')
dates64 = [np.datetime64('2012-05-01 08:30:00'),
           np.datetime64('2012-05-01 09:25:00'),
           np.datetime64('2012-05-01 15:30:00')]
dates64
[numpy.datetime64('2012-05-01T08:30:00'),
numpy.datetime64('2012-05-01T09:25:00'),
numpy.datetime64('2012-05-01T15:30:00')]
As you can see below, UTC is four hours ahead of EDT. In other words, the EDT timezone offset is UTC-4.
dt_utcnow = datetime.utcnow()
dt_utcnow
datetime.datetime(2019, 3, 5, 15, 31, 47, 583555)
# Can create datetime64's from dt's.
dt64_utcnow = np.datetime64(dt_utcnow)
dt64_utcnow
numpy.datetime64('2019-03-05T15:31:47.583555')
curdates64 = [np.datetime64('2014-06-01 08:30:00'),
              np.datetime64('2014-06-01 09:25:00'),
              np.datetime64('2014-06-01 15:30:00')]
curdates64
[numpy.datetime64('2014-06-01T08:30:00'),
numpy.datetime64('2014-06-01T09:25:00'),
numpy.datetime64('2014-06-01T15:30:00')]
intime_dt64 = dates64[0]
outtime_dt64 = dates64[1]
print(type(intime_dt64))
print(intime_dt64)
intime_dt64
<class 'numpy.datetime64'>
2012-05-01T08:30:00
numpy.datetime64('2012-05-01T08:30:00')
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
# but not in numpy datetime64 objects
print ('Datetime: {}'.format(str(intime_dt64)))
#print 'Date: {}'.format(intime_dt64.date())
#print 'Month: {}'.format(intime_dt64.month)
#print 'Day: {}'.format(intime_dt.day)
#print 'DayOfWeek: {}'.format(intime_dt.dayofweek)
#print 'Weekday: {}'.format(intime_dt.weekday())
#print 'DayOfYear: {}'.format(intime_dt.dayofyear)
#print 'WeekOfYear: {}'.format(intime_dt.weekofyear)
#print 'Quarter: {}'.format(intime_dt.quarter)
#print 'Hour: {}'.format(intime_dt.hour)
#print 'Minute: {}'.format(intime_dt.minute)
#print 'Second: {}'.format(intime_dt.second)
#print 'Microsecond: {}'.format(intime_dt.microsecond)
Datetime: 2012-05-01T08:30:00
You can subtract two datetime64 objects and get a timedelta64 object in return.
los_dt64 = outtime_dt64 - intime_dt64
los_dt64
numpy.timedelta64(3300,'s')
# Coerce it to other units
np.timedelta64(los_dt64, 'm')
numpy.timedelta64(55,'m')
Unfortunately, timedelta64 objects have almost no useful attributes for doing datetime math. On the other hand, the datetime.timedelta object (and its pandas counterpart) has a number of useful attributes.
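One workaround (a sketch, not the only option) is to use plain timedelta64 arithmetic to pull out numeric durations in whatever units you want:
los_dt64 / np.timedelta64(1, 'm')                # 55.0 - duration as a float number of minutes
los_dt64 / np.timedelta64(1, 'h')                # ~0.92 hours
los_dt64.astype('timedelta64[m]').astype(int)    # 55 - integer minutes via unit coercion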
Pandas and datetimes
Pandas prides itself on its handling of time series data (speed, flexibility, power) and is built on top of the numpy datetime64 and timedelta64 data types (instead of Python’s base datetime and timedelta types). Pandas also wanted to be very time-zone aware and time-zone capable. So, Pandas has its own data type for timestamps called, well, Timestamp. It’s based on numpy’s datetime64 type but works like Python’s datetime (I think). The following StackOverflow post (the “welcome to hell” post) is quite popular and relevant.
http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
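Before diving in, here’s a sketch of how to convert between the three scalar types using the objects created earlier (this is the territory where the dragons from the intro live):
ts_from_dt = pd.Timestamp(intime_dt)      # datetime.datetime -> pandas Timestamp
ts_from_dt64 = pd.Timestamp(intime_dt64)  # numpy datetime64 -> pandas Timestamp
ts_from_dt.to_pydatetime()                # Timestamp -> datetime.datetime
ts_from_dt.to_datetime64()                # Timestamp -> numpy datetime64 (nanosecond resolution)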
Pandas time series indexes can be timestamp or time period based.
ts = Series(np.random.randint(1,100,3), index=dates)
ts
2012-05-01 08:30:00 35
2012-05-01 09:25:00 45
2012-05-01 15:30:00 11
dtype: int64
ts.index
DatetimeIndex(['2012-05-01 08:30:00', '2012-05-01 09:25:00',
'2012-05-01 15:30:00'],
dtype='datetime64[ns]', freq=None)
If we have demand data by month, a period based index might make more sense.
from pandas import Period, PeriodIndex
periods = PeriodIndex([Period('2012-01'), Period('2012-02'), Period('2012-03')])
ts2 = Series(np.random.randint(1,100,3), index=periods)
ts2
2012-01 22
2012-02 72
2012-03 54
Freq: M, dtype: int64
ts2.index
PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')
From the pandas docs:
Starting with 0.8, pandas allows you to capture both representations and convert between them. Under the hood, pandas represents timestamps using instances of Timestamp and sequences of timestamps using instances of DatetimeIndex. For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans.
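A small sketch of converting between the two representations (the dates are just the ones we’ve been using):
pd.Timestamp('2012-05-01 08:30:00').to_period('M')   # Timestamp -> Period('2012-05', 'M')
pd.Period('2012-05', freq='M').to_timestamp()        # Period -> Timestamp('2012-05-01 00:00:00')
ts.to_period('M')                                    # timestamp-indexed Series -> period-indexed Series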
from pandas import Timestamp
Timestamp('20120501 08:30:00')
Timestamp('2012-05-01 08:30:00')
timestamps = [Timestamp('20120501 08:30:00'), Timestamp('20120501 09:25:00'), Timestamp('20120501 15:30:00')]
intime_ts = timestamps[0]
outtime_ts = timestamps[1]
type(intime_ts)
pandas._libs.tslibs.timestamps.Timestamp
los_ptd = outtime_ts - intime_ts
print(los_ptd)
type(los_ptd)
0 days 00:55:00
pandas._libs.tslibs.timedeltas.Timedelta
# The pandas Timestamp data type has a number of useful attributes (dayofweek, dayofyear, weekofyear)
# that the base Python datetime type does not.
print ('Datetime: {}'.format(intime_ts))
print ('Date: {}'.format(intime_ts.date()))
print ('Month: {}'.format(intime_ts.month))
print ('Day: {}'.format(intime_ts.day))
print ('DayOfWeek: {}'.format(intime_ts.dayofweek))
print ('Weekday: {}'.format(intime_ts.weekday()))
print ('DayOfYear: {}'.format(intime_ts.dayofyear))
print ('WeekOfYear: {}'.format(intime_ts.weekofyear))
print ('Quarter: {}'.format(intime_ts.quarter))
print ('Hour: {}'.format(intime_ts.hour))
print ('Minute: {}'.format(intime_ts.minute))
print ('Second: {}'.format(intime_ts.second))
print ('Microsecond: {}'.format(intime_ts.microsecond))
Datetime: 2012-05-01 08:30:00
Date: 2012-05-01
Month: 5
Day: 1
DayOfWeek: 1
Weekday: 1
DayOfYear: 122
WeekOfYear: 18
Quarter: 2
Hour: 8
Minute: 30
Second: 0
Microsecond: 0
print ('Timedelta: {}'.format(los_ptd))
print ('Days: {}'.format(los_ptd.days))
print ('Seconds: {}'.format(los_ptd.seconds))
print ('Total Seconds: {}'.format(los_ptd.total_seconds()))
print ('Microseconds: {}'.format(los_ptd.microseconds))
print ('Resolution: {}'.format(los_ptd.resolution))
print ('Min: {}'.format(los_ptd.min))
print ('Max: {}'.format(los_ptd.max))
Timedelta: 0 days 00:55:00
Days: 0
Seconds: 3300
Total Seconds: 3300.0
Microseconds: 0
Resolution: T
Min: -106752 days +00:12:43.145224
Max: 106751 days 23:47:16.854775
Pandas and time series
Generally, a time series in pandas is a Series object with a DatetimeIndex. Pandas has convenience functions for creating time series indexes. The default frequency for date_range is a calendar day while the default for bdate_range is a business day.
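For example (a quick sketch; bdate_range simply skips weekends by default):
pd.date_range('2014-01-01', periods=7)    # seven consecutive calendar days
pd.bdate_range('2014-01-01', periods=7)   # seven business days (Mon-Fri)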
index = pd.date_range('2000-1-1', periods=1000, freq='M')
index
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-30', '2000-10-31',
...
'2082-07-31', '2082-08-31', '2082-09-30', '2082-10-31',
'2082-11-30', '2082-12-31', '2083-01-31', '2083-02-28',
'2083-03-31', '2083-04-30'],
dtype='datetime64[ns]', length=1000, freq='M')
start = datetime(2014, 1, 1)
end = datetime(2014, 6, 12)
rng = pd.date_range(start, end)
rng
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
'2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
'2014-01-09', '2014-01-10',
...
'2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06',
'2014-06-07', '2014-06-08', '2014-06-09', '2014-06-10',
'2014-06-11', '2014-06-12'],
dtype='datetime64[ns]', length=163, freq='D')
pd.date_range(start, end, freq='W')
DatetimeIndex(['2014-01-05', '2014-01-12', '2014-01-19', '2014-01-26',
'2014-02-02', '2014-02-09', '2014-02-16', '2014-02-23',
'2014-03-02', '2014-03-09', '2014-03-16', '2014-03-23',
'2014-03-30', '2014-04-06', '2014-04-13', '2014-04-20',
'2014-04-27', '2014-05-04', '2014-05-11', '2014-05-18',
'2014-05-25', '2014-06-01', '2014-06-08'],
dtype='datetime64[ns]', freq='W-SUN')
data = np.random.randint(1, 100, size=163)
df = DataFrame(data, index=rng)
df.head()
| | 0 |
|---|---|
| 2014-01-01 | 12 |
| 2014-01-02 | 29 |
| 2014-01-03 | 84 |
| 2014-01-04 | 4 |
| 2014-01-05 | 6 |
# Date index slicing is easy but, BE CAREFUL, the end point is included (unlike usual Python slicing behavior).
df['2014-01-02':'2014-01-12']
| | 0 |
|---|---|
| 2014-01-02 | 29 |
| 2014-01-03 | 84 |
| 2014-01-04 | 4 |
| 2014-01-05 | 6 |
| 2014-01-06 | 25 |
| 2014-01-07 | 56 |
| 2014-01-08 | 63 |
| 2014-01-09 | 39 |
| 2014-01-10 | 16 |
| 2014-01-11 | 10 |
| 2014-01-12 | 72 |
df[datetime(2014,1,2):datetime(2014,1,12)]
| | 0 |
|---|---|
| 2014-01-02 | 29 |
| 2014-01-03 | 84 |
| 2014-01-04 | 4 |
| 2014-01-05 | 6 |
| 2014-01-06 | 25 |
| 2014-01-07 | 56 |
| 2014-01-08 | 63 |
| 2014-01-09 | 39 |
| 2014-01-10 | 16 |
| 2014-01-11 | 10 |
| 2014-01-12 | 72 |
Pandas DateOffset objects
Usually base frequencies are specified with a string alias such as ‘W’ or ‘M’. However, you can also work directly with DateOffset objects such as Hour and Minute.
rng2 = pd.date_range(start, end, freq='M')
len(rng2)
5
data[:len(rng2)]
array([12, 29, 84, 4, 6])
df2 = DataFrame(data[:len(rng2)], index=rng2)
df2
from pandas.tseries.offsets import Hour, Minute
# We rarely use these but they are available
one_hour = Hour()
four_hours = Hour(4)
one_hour + four_hours
one_hour + Minute(35)
Week of month dates
rng = pd.date_range('9/1/2012', '1/1/2014', freq='M')
rng
list(rng)
rng = pd.date_range('9/1/2012', '1/1/2014', freq='WOM-2TUE')
list(rng)
Shifting (Leading and Lagging) Data
Move data back and forth through time. Series and DataFrame objects have a shift method for doing naive shifts forward and backward while leaving the index unchanged.
ts = Series(np.random.randint(1,10,4), index=pd.date_range('1/2/2013', periods=4, freq='M'))
ts
ts.shift(periods=2)
# Common use is pct change
ts/ts.shift(1) - 1.0
# Since naive shifts leave index unchanged, some data is lost. To shift both data and index, pass in the frequency
ts.shift(1)
ts.shift(1, freq='M')
Periods and period arithmetic
Periods represent time spans like months or quarters or years.
p = pd.Period('6/1/2014', freq='M')
p
p + 6
# Ranges of periods are created much like ranges of dates
rng = pd.period_range('1/1/2010', '6/1/2014', freq='Q')
rng
list(rng)
p.asfreq('M', 'start')
p.asfreq('M', 'end')
Resampling and frequency conversion
Resampling - convert time series from one frequency to another.
- Downsampling - higher frequency to lower frequency (e.g. daily to monthly)
- Upsampling - lower frequency to higher frequency (e.g. annual to monthly)
Downsampling
Take a daily time series and convert to monthly.
rng = pd.date_range('9/1/2012', periods=100, freq='D')
ts = Series(np.random.randint(1,25,len(rng)), index=rng)
ts
ts.resample('M', kind='period').mean()
ts.resample('M', kind='period').sum()
ts.resample('M').sum()
Whenever you are downsampling, you need to think about two bin boundary issues:
- is the interval closed on the left or the right?
- is the bin label based on the left or right bin boundary?
With resampling from days to months, the question of left or right closure isn’t really an issue, but consider changing from 15 minute time bins to 30 minute time bins.
rng15 = pd.date_range('9/1/2012', periods=96, freq='15T')
ts15 = Series(np.random.randint(1,25,len(rng15)), index=rng15)
ts15
Now, if we resample to 30 minute bins and take the mean across each set of two 15 minute bins, we need to specify whether times on the half hour such as 22:30 in the original series should go in the bin from (22:15, 22:30] or [22:30, 22:45). The first is referred to as closed on the right and the second as closed on the left. The default depends on the target frequency, so it’s safest to be explicit. You can also control whether the left or right edge of the bin is used as the label (independent of which side “closes” the interval).
ts15.resample('30min', closed='right', label='left').mean()
Notice how in the example above, the first record is labelled with ‘2012-08-31’ since it’s the left boundary and we picked ‘right’ for closing intervals and ‘left’ for labelling. We might want to close and label on the left.
ts30 = ts15.resample('30min', closed='left', label='left').mean()
ts30
A common financial aggregation is to compute the first (open), last (close), max (high), and min (low) values for each bucket. If you use the ohlc() aggregation on the resampler, you get a DataFrame with these columns computed automatically.
ts60 = ts15.resample('60min', closed='left', label='left').ohlc()
ts60
Upsampling
Upsampling brings an entirely different set of challenges. If we increase the frequency of the samples, how do we fill in the new time series values? For example, consider the reverse of the last example, and then check out the filling options shown below.
ts30
ts30.resample('15min').asfreq()
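The new 15 minute slots show up as NaN above. Here’s a sketch of the common fill choices using the method-style resampling API:
ts30.resample('15min').ffill()   # forward fill from the previous 30 minute value
ts30.resample('15min').bfill()   # backward fill from the next 30 minute value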
Time Series Plotting
Pandas tries to make it a little easier to create nice looking time series plots using matplotlib than if you just used matplotlib yourself. Let’s read in some stock price data from Yahoo! Finance that has already been saved to a csv file.
close_px_all = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
close_px_all.head()
close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px[1:10]
close_px['AAPL'].plot()
close_px.plot()
# And here's what it looks like if we just plot one year. Notice the auto x-axis formatting.
close_px.loc['2009'].plot()
close_px.index
list(close_px.index)[1:25]
close_px['AAPL'].loc['2011-01':'2011-03'].plot()
Pandas does a particularly nice job with quarterly data.
appl_q = close_px['AAPL'].resample('Q-DEC').ffill()
appl_q
appl_q.loc['2009':].plot()
Moving window functions
Pandas makes it easy to do all kinds of moving window operations like moving average, cumulative sums, and even things like exponential moving averages (“exponential smoothing”).
close_px.AAPL.plot()
close_px.AAPL.rolling(250).mean().plot()
close_px.AAPL.rolling(250).mean()
This example shows one problem with a moving average - by default you can’t compute it until you have data for the number of periods corresponding to the window size (250 in this case). Pandas lets you specify an alternate minimum number of data points (min_periods) to get around this problem.
close_px.AAPL.rolling(250, min_periods=10).mean()
close_px.AAPL.plot()
close_px.AAPL.rolling(250, min_periods=10).mean().plot()
# Things like rolling standard deviation are also possible
appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
appl_std250.plot()
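The exponential moving averages mentioned above live behind the ewm() method in current pandas. A quick sketch (span=60 is just an illustrative smoothing parameter):
close_px.AAPL.plot()
close_px.AAPL.ewm(span=60).mean().plot()   # exponentially weighted moving average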
Learn more about pandas timeseries
Another intro tutorial is http://earthpy.org/pandas-basics.html
Wes McKinney’s 2012 PyCon talk on pandas and timeseries - http://pyvideo.org/video/1198/time-series-data-analysis-with-pandas