Exploring dates and times in Python

So many dates and times

python
datetime
Author

Mark Isken

Published

June 25, 2014

Note

This post is a bit dated. It could use an update. For one thing, timezone handling has changed.

For many of us in the business analytics world, working with dates and times is bread and butter stuff. We can work all kinds of magic with Excel’s “serial” datetimes and worksheet functions like DATEVALUE(), WEEKDAY(), MONTH(), DATE() and many more. We can wield VBA functions like DateDiff with the best of them. We all know that, according to MS, time started on 1/1/1900 (or 1904 if you’re on a Mac) and we just live with the fact that we can’t do reliable datetime work before those magic epochs. I even wrote and released a free and open source MS Access add-in called Hillmaker many years ago that did statistical analysis of time interval data by day of week and time of day. That tool still finds lots of use in the healthcare world for analyzing occupancy of things like nursing units, emergency departments, recovery room, and many other places with entry and exit timestamp data.

So, when I started working with R and Python, one of the first things I did was see how they handled dates and times. I learned about the UTC epoch and the POSIX time standard. In R, I quickly learned to love the lubridate package. And then I started to learn about dates and times in Python and began to understand what programmers mean when they say “there be dragons in there”. So, while I’m far from an expert, I did put together this little intro to Python dates and times aimed at business analytics types who are new to Python. It is by no means comprehensive, but it does cover the basics of the base Python datetime and timedelta classes along with the Numpy datetime64 and timedelta64 classes. Oh yeah, it also covers the pandas Timestamp and Timedelta classes (confused yet?). I also show one particular dragon that bit me - conversion between Numpy datetimes and Pandas datetimes - even though Pandas uses Numpy under the hood.

Python’s datetime and timedelta classes

%matplotlib inline
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from datetime import datetime
from datetime import timedelta

Python has a module called datetime as part of its base system.

https://docs.python.org/3/library/datetime.html

The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.

Datetimes can be naive or aware

  • naive objects don’t know about timezones and their interpretation is application (coder) dependent
  • aware objects do know about timezones and are unambiguous
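For example, here’s a quick sketch contrasting the two, using the standard library’s timezone class (available in Python 3.2+):

from datetime import datetime, timezone
naive_dt = datetime(2012, 5, 1, 8, 30)                       # no tzinfo - interpretation is up to the coder
aware_dt = datetime(2012, 5, 1, 8, 30, tzinfo=timezone.utc)  # pinned to UTC
print(naive_dt.tzinfo)       # None
print(aware_dt.utcoffset())  # 0:00:00

The examples below stick with naive datetimes.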
dates = [datetime(2012, 5, 1, 8, 30), 
         datetime(2012, 5, 1, 9, 25), 
         datetime(2012, 5, 1, 15, 30)]
dates
[datetime.datetime(2012, 5, 1, 8, 30),
 datetime.datetime(2012, 5, 1, 9, 25),
 datetime.datetime(2012, 5, 1, 15, 30)]
type(dates)
list
type(dates[0])
datetime.datetime
intime_dt = dates[0]
outtime_dt = dates[1]
type(intime_dt)
datetime.datetime
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
#   but not in Python datetime objects
print ('Datetime: {}'.format(intime_dt))
print ('Date: {}'.format(intime_dt.date()))
print ('Month: {}'.format(intime_dt.month))
print ('Day: {}'.format(intime_dt.day))
#print 'DayOfWeek: {}'.format(intime_dt.dayofweek)
print ('Weekday: {}'.format(intime_dt.weekday()))
#print 'DayOfYear: {}'.format(intime_dt.dayofyear)
#print 'WeekOfYear: {}'.format(intime_dt.weekofyear)
#print 'Quarter: {}'.format(intime_dt.quarter)
print ('Hour: {}'.format(intime_dt.hour))
print ('Minute: {}'.format(intime_dt.minute))
print ('Second: {}'.format(intime_dt.second))
print ('Microsecond: {}'.format(intime_dt.microsecond))
Datetime: 2012-05-01 08:30:00
Date: 2012-05-01
Month: 5
Day: 1
Weekday: 1
Hour: 8
Minute: 30
Second: 0
Microsecond: 0
los_td = outtime_dt - intime_dt
print (los_td)
type(los_td)
0:55:00
datetime.timedelta

List out the attributes for timedelta objects. Notice how we are filtering out any attributes starting with the double underscore. Change that if you want to see what’s getting filtered out.

atts = [att for att in dir(los_td) if '__' not in att]
#atts = [att for att in dir(los_td)]
print (atts)
['days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']
print ('Timedelta: {}'.format(los_td))
print ('Seconds: {}'.format(los_td.seconds))
print ('Total Seconds: {}'.format(los_td.total_seconds()))
print ('Microseconds: {}'.format(los_td.microseconds))
print ('Resolution: {}'.format(los_td.resolution))
print ('Min: {}'.format(los_td.min))
print ('Max: {}'.format(los_td.max))
Timedelta: 0:55:00
Seconds: 3300
Total Seconds: 3300.0
Microseconds: 0
Resolution: 0:00:00.000001
Min: -999999999 days, 0:00:00
Max: 999999999 days, 23:59:59.999999

So, you can use the total_seconds() method to get a duration in seconds from a timedelta object. If the duration is less than 1 second, it returns a fractional number of seconds. To convert to other time units, just do the simple math.

timedelta(0,0,10)
datetime.timedelta(microseconds=10)
timedelta(0,0,17).total_seconds()
1.7e-05
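For example, here’s the 55 minute length of stay from above expressed in minutes and hours - just simple division:

los_td.total_seconds() / 60     # 55.0 minutes
los_td.total_seconds() / 3600   # 0.9166666666666666 hours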
atts = [att for att in dir(intime_dt) if '__' not in att]
print (atts)
['astimezone', 'combine', 'ctime', 'date', 'day', 'dst', 'fold', 'fromisoformat', 'fromordinal', 'fromtimestamp', 'hour', 'isocalendar', 'isoformat', 'isoweekday', 'max', 'microsecond', 'min', 'minute', 'month', 'now', 'replace', 'resolution', 'second', 'strftime', 'strptime', 'time', 'timestamp', 'timetuple', 'timetz', 'today', 'toordinal', 'tzinfo', 'tzname', 'utcfromtimestamp', 'utcnow', 'utcoffset', 'utctimetuple', 'weekday', 'year']
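A couple of those, strftime and strptime, are worth a quick sketch since they handle formatting datetimes as strings and parsing strings back into datetimes:

s = intime_dt.strftime('%Y-%m-%d %H:%M')
print(s)                                       # 2012-05-01 08:30
print(datetime.strptime(s, '%Y-%m-%d %H:%M'))  # 2012-05-01 08:30:00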

Numpy’s newish datetime64 and timedelta64 classes

The Numpy library (v 1.7.0) introduced a new datatype called datetime64. Numpy uses this name because Python’s datetime module already has a datetime type. Like R, Python stores dates and times based on the notion of POSIX time - the number of seconds elapsed since Jan 1, 1970 00:00:00 UTC - the so-called UTC epoch. This doesn’t mean that dates and times before that cannot be represented - we just use negative numbers. Representing and working with dates and times for computing is a pretty interesting topic, fraught with complexities like time zones, daylight saving time, and leap seconds. A few good places to start are:

Numpy docs - http://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html

Pandas docs - http://pandas.pydata.org/pandas-docs/dev/timeseries.html

See http://cr.yp.to/proto/utctai.html for gory details on UTC and TAI and leap seconds and the problems with the POSIX standard.

http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64

Here’s the actual numpy proposal for datetime64 and timedelta64, authored by Travis Oliphant.

https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst
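To see the epoch idea in action, integer values are just offsets from 1970-01-01 and negative values are perfectly legal. A minimal sketch:

np.datetime64(0, 's')                        # numpy.datetime64('1970-01-01T00:00:00')
np.datetime64(-86400, 's')                   # numpy.datetime64('1969-12-31T00:00:00')
np.datetime64('1969-12-31').astype('int64')  # -1, i.e. one day before the epoch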

np.datetime64('2012-05-01 08:30:00')
numpy.datetime64('2012-05-01T08:30:00')
dates64 = [np.datetime64('2012-05-01 08:30:00'), 
           np.datetime64('2012-05-01 09:25:00'), 
           np.datetime64('2012-05-01 15:30:00')]
dates64
[numpy.datetime64('2012-05-01T08:30:00'),
 numpy.datetime64('2012-05-01T09:25:00'),
 numpy.datetime64('2012-05-01T15:30:00')]

UTC is four hours ahead of EDT - in other words, the EDT timezone offset is UTC-4. The examples below grab the current UTC time.

dt_utcnow = datetime.utcnow()
dt_utcnow
datetime.datetime(2019, 3, 5, 15, 31, 47, 583555)
# Can create datetime64's from dt's.
dt64_utcnow = np.datetime64(dt_utcnow)
dt64_utcnow
numpy.datetime64('2019-03-05T15:31:47.583555')
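A quick way to check your own offset is to compare the local and UTC clocks (the output, of course, depends on the timezone of the machine running this):

print(datetime.now())     # local wall-clock time
print(datetime.utcnow())  # current UTC time; on a US Eastern machine in the summer, this is 4 hours ahead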
curdates64 = [np.datetime64('2014-06-01 08:30:00'), 
              np.datetime64('2014-06-01 09:25:00'), 
              np.datetime64('2014-06-01 15:30:00')]
curdates64
[numpy.datetime64('2014-06-01T08:30:00'),
 numpy.datetime64('2014-06-01T09:25:00'),
 numpy.datetime64('2014-06-01T15:30:00')]
intime_dt64 = dates64[0]
outtime_dt64 = dates64[1]
print(type(intime_dt64))
print(intime_dt64)
intime_dt64
<class 'numpy.datetime64'>
2012-05-01T08:30:00
numpy.datetime64('2012-05-01T08:30:00')
# The commented out lines represent attributes that ARE available in pandas Timestamp objects
#   but not in numpy datetime64 objects
print ('Datetime: {}'.format(str(intime_dt64)))
#print 'Date: {}'.format(intime_dt64.date())
#print 'Month: {}'.format(intime_dt64.month)
#print 'Day: {}'.format(intime_dt.day)
#print 'DayOfWeek: {}'.format(intime_dt.dayofweek)
#print 'Weekday: {}'.format(intime_dt.weekday())
#print 'DayOfYear: {}'.format(intime_dt.dayofyear)
#print 'WeekOfYear: {}'.format(intime_dt.weekofyear)
#print 'Quarter: {}'.format(intime_dt.quarter)
#print 'Hour: {}'.format(intime_dt.hour)
#print 'Minute: {}'.format(intime_dt.minute)
#print 'Second: {}'.format(intime_dt.second)
#print 'Microsecond: {}'.format(intime_dt.microsecond)
Datetime: 2012-05-01T08:30:00

You can subtract two datetime64 objects and get a timedelta64 object in return.

los_dt64 = outtime_dt64 - intime_dt64
los_dt64
numpy.timedelta64(3300,'s')
# Coerce it to other units
np.timedelta64(los_dt64,'m')
numpy.timedelta64(55,'m')

Unfortunately, timedelta64 objects have almost no useful attributes for doing datetime math. On the other hand, the datetime.timedelta object (and pandas) provides a number of useful attributes.
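One workaround is to divide by a one-unit timedelta64, which returns a plain float in that unit. A minimal sketch using the length of stay from above:

los_dt64 / np.timedelta64(1, 'm')   # 55.0
los_dt64 / np.timedelta64(1, 'h')   # 0.9166666666666666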

Pandas and datetimes

Pandas prides itself on its handling of time series data (speed, flexibility, power) and is built on top of the numpy datetime64 and timedelta64 data types (instead of Python’s base datetime and timedelta types). Pandas also wanted to be very time-zone aware and time-zone capable. So, pandas has its own data type for timestamps called, well, Timestamp. It’s based on numpy’s datetime64 type but works like Python’s datetime (I think). The following StackOverflow post (the “welcome to hell” post) is quite popular and relevant.

http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
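Here’s a minimal sketch of hopping among the three types (assuming a reasonably recent pandas) - the Timestamp constructor accepts datetime and datetime64 values, and the to_pydatetime() and to_datetime64() methods go back the other way:

tstamp = pd.Timestamp(np.datetime64('2012-05-01T08:30:00'))  # datetime64 -> Timestamp
tstamp.to_pydatetime()   # Timestamp -> datetime.datetime(2012, 5, 1, 8, 30)
tstamp.to_datetime64()   # Timestamp -> numpy.datetime64('2012-05-01T08:30:00.000000000')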

Pandas time series indexes can be timestamp or time period based.

ts = Series(np.random.randint(1,100,3), index=dates)
ts
2012-05-01 08:30:00    35
2012-05-01 09:25:00    45
2012-05-01 15:30:00    11
dtype: int64
ts.index
DatetimeIndex(['2012-05-01 08:30:00', '2012-05-01 09:25:00',
               '2012-05-01 15:30:00'],
              dtype='datetime64[ns]', freq=None)

If we have demand data by month, a period based index might make more sense.

from pandas import Period, PeriodIndex
periods = PeriodIndex([Period('2012-01'), Period('2012-02'),Period('2012-03')])
ts2 = Series(np.random.randint(1,100,3), index=periods)
ts2
2012-01    22
2012-02    72
2012-03    54
Freq: M, dtype: int64
ts2.index
PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')

From the pandas docs:

Starting with 0.8, pandas allows you to capture both representations and convert between them. Under the hood, pandas represents timestamps using instances of Timestamp and sequences of timestamps using instances of DatetimeIndex. For regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans.
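For example, the timestamp indexed series ts from above can be converted to a period representation and back (a quick sketch):

ts.to_period('M')                 # index becomes a monthly PeriodIndex (all 2012-05 here)
ts.to_period('M').to_timestamp()  # back to a DatetimeIndex, anchored at the start of each period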

from pandas import Timestamp
Timestamp('20120501 08:30:00')
Timestamp('2012-05-01 08:30:00')
timestamps = [Timestamp('20120501 08:30:00'), Timestamp('20120501 09:25:00'), Timestamp('20120501 15:30:00')]
intime_ts = timestamps[0]
outtime_ts = timestamps[1]
type(intime_ts)
pandas._libs.tslibs.timestamps.Timestamp
los_ptd = outtime_ts - intime_ts
print (los_ptd)
type(los_ptd)
0 days 00:55:00
pandas._libs.tslibs.timedeltas.Timedelta
# The pandas Timestamp data type has a number of useful attributes (dayofweek, dayofyear, weekofyear)
#   that the base Python datetime type does not.
print ('Datetime: {}'.format(intime_ts))
print ('Date: {}'.format(intime_ts.date()))
print ('Month: {}'.format(intime_ts.month))
print ('Day: {}'.format(intime_ts.day))
print ('DayOfWeek: {}'.format(intime_ts.dayofweek))
print ('Weekday: {}'.format(intime_ts.weekday()))
print ('DayOfYear: {}'.format(intime_ts.dayofyear))
print ('WeekOfYear: {}'.format(intime_ts.weekofyear))
print ('Quarter: {}'.format(intime_ts.quarter))
print ('Hour: {}'.format(intime_ts.hour))
print ('Minute: {}'.format(intime_ts.minute))
print ('Second: {}'.format(intime_ts.second))
print ('Microsecond: {}'.format(intime_ts.microsecond))
Datetime: 2012-05-01 08:30:00
Date: 2012-05-01
Month: 5
Day: 1
DayOfWeek: 1
Weekday: 1
DayOfYear: 122
WeekOfYear: 18
Quarter: 2
Hour: 8
Minute: 30
Second: 0
Microsecond: 0
print ('Timedelta: {}'.format(los_ptd))
print ('Days: {}'.format(los_ptd.days))
print ('Seconds: {}'.format(los_ptd.seconds))
print ('Total Seconds: {}'.format(los_ptd.total_seconds()))
print ('Microseconds: {}'.format(los_ptd.microseconds))
print ('Resolution: {}'.format(los_ptd.resolution))
print ('Min: {}'.format(los_ptd.min))
print ('Max: {}'.format(los_ptd.max))
Timedelta: 0 days 00:55:00
Days: 0
Seconds: 3300
Total Seconds: 3300.0
Microseconds: 0
Resolution: T
Min: -106752 days +00:12:43.145224
Max: 106751 days 23:47:16.854775

Pandas and time series

Generally, a time series in pandas is a Series object with a DatetimeIndex. Pandas has convenience functions for creating time series indexes. The default frequency for date_range is a calendar day, while the default for bdate_range is a business day.
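For instance, here’s a quick sketch of bdate_range skipping a weekend (assuming I have the January 2014 calendar right - the 3rd was a Friday):

pd.bdate_range('2014-01-03', periods=4)
# DatetimeIndex(['2014-01-03', '2014-01-06', '2014-01-07', '2014-01-08'], dtype='datetime64[ns]', freq='B')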

index = pd.date_range('2000-1-1', periods=1000, freq='M')
index
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-30', '2000-10-31',
               ...
               '2082-07-31', '2082-08-31', '2082-09-30', '2082-10-31',
               '2082-11-30', '2082-12-31', '2083-01-31', '2083-02-28',
               '2083-03-31', '2083-04-30'],
              dtype='datetime64[ns]', length=1000, freq='M')
start = datetime(2014, 1, 1)
end = datetime(2014, 6, 12)
rng = pd.date_range(start, end)
rng
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10',
               ...
               '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06',
               '2014-06-07', '2014-06-08', '2014-06-09', '2014-06-10',
               '2014-06-11', '2014-06-12'],
              dtype='datetime64[ns]', length=163, freq='D')
pd.date_range(start, end, freq='W')
DatetimeIndex(['2014-01-05', '2014-01-12', '2014-01-19', '2014-01-26',
               '2014-02-02', '2014-02-09', '2014-02-16', '2014-02-23',
               '2014-03-02', '2014-03-09', '2014-03-16', '2014-03-23',
               '2014-03-30', '2014-04-06', '2014-04-13', '2014-04-20',
               '2014-04-27', '2014-05-04', '2014-05-11', '2014-05-18',
               '2014-05-25', '2014-06-01', '2014-06-08'],
              dtype='datetime64[ns]', freq='W-SUN')
data = np.random.randint(1, 100, size=163)
df = DataFrame(data, index=rng)
df.head()
0
2014-01-01 12
2014-01-02 29
2014-01-03 84
2014-01-04 4
2014-01-05 6
# Date index slicing is easy but, BE CAREFUL, the end point is included (unlike usual Python slicing behavior).
df['2014-01-02':'2014-01-12']
0
2014-01-02 29
2014-01-03 84
2014-01-04 4
2014-01-05 6
2014-01-06 25
2014-01-07 56
2014-01-08 63
2014-01-09 39
2014-01-10 16
2014-01-11 10
2014-01-12 72
df[datetime(2014,1,2):datetime(2014,1,12)]
0
2014-01-02 29
2014-01-03 84
2014-01-04 4
2014-01-05 6
2014-01-06 25
2014-01-07 56
2014-01-08 63
2014-01-09 39
2014-01-10 16
2014-01-11 10
2014-01-12 72

Pandas DateOffset objects

Usually base frequencies are specified with a string alias such as ‘W’ or ‘M’. However, these aliases are just shorthand for DateOffset objects, which can also be imported, combined, and passed as the frequency directly.

rng2 = pd.date_range(start, end, freq='M')
len(rng2)
5
data[:len(rng2)]
array([12, 29, 84,  4,  6])
df2 = DataFrame(data[:len(rng2)], index=rng2)
df2
from pandas.tseries.offsets import Hour, Minute
#We rarely use these but they are available
one_hour = Hour()
four_hours = Hour(4)
one_hour + four_hours
one_hour + Minute(35)

Week of month dates

rng = pd.date_range('9/1/2012','1/1/2014',freq='M')
rng
list(rng)
rng = pd.date_range('9/1/2012','1/1/2014',freq='WOM-2TUE')
list(rng)

Shifting (Leading and Lagging) Data

Move data back and forth through time. Series and DataFrame objects have a shift method for doing naive shifts forward and backward, leaving the index unchanged.

ts = Series(np.random.randint(1,10,4),index = pd.date_range('1/2/2013',periods=4,freq='M'))
ts
ts.shift(periods = 2)
# Common use is pct change
ts/ts.shift(1) - 1.0
# Since naive shifts leave index unchanged, some data is lost. To shift both data and index, pass in the frequency
ts.shift(1)
ts.shift(1, freq='M')

Periods and period arithmetic

Periods represent time spans like months or quarters or years.

p = pd.Period('6/1/2014',freq='M')
p
p+6
# Ranges of periods are created much like ranges of dates
rng = pd.period_range('1/1/2010','6/1/2014',freq='Q')
rng
list(rng)
p.asfreq('M','start')
p.asfreq('M','end')

Resampling and frequency conversion

Resampling - convert time series from one frequency to another.

  • Downsampling - higher frequency to lower frequency (e.g. daily to monthly)
  • Upsampling - lower frequency to higher frequency (e.g. annual to monthly)

Downsampling

Take a daily time series and convert to monthly.

rng = pd.date_range('9/1/2012',periods=100,freq='D')
ts = Series(np.random.randint(1,25,len(rng)),index=rng)
ts
ts.resample('M', kind='period').mean()
ts.resample('M', kind='period').sum()
ts.resample('M').sum()

Whenever you are downsampling, you need to think about two bin boundary issues:

  • is the interval closed on the left or the right?
  • is the bin label based on the left or right bin boundary?

With resampling from days to months, the question of left or right closure isn’t really an issue, but consider changing from 15 minute time bins to 30 minute time bins.

rng15 = pd.date_range('9/1/2012',periods=96,freq='15T')
ts15 = Series(np.random.randint(1,25,len(rng15)),index=rng15)
ts15

Now, if we resample to 30 minute bins and take the mean across each set of two 15 minute bins, we need to specify whether times on the half hour, such as 22:30, in the original series should go in the bin (22:15, 22:30] or the bin [22:30, 22:45). The first is referred to as closed on the right and the second as closed on the left. The default depends on the frequency (and has changed across pandas versions), so it’s safest to specify it explicitly. You can also control whether the left or right edge of the bin is used as the label (independent of which side “closes” the interval).

ts15.resample('30min', closed='right', label='left').mean()

Notice how in the example above, the first record is labelled with ‘2012-08-31’ since it’s the left boundary and we picked ‘right’ for closing intervals and ‘left’ for labelling. We might want to close and label on the left.

ts30 = ts15.resample('30min', closed='left', label='left').mean()
ts30

A common financial aggregation is to compute the first (open), last (close), max (high), min (low) values for each bucket. If you use the ohlc() aggregation method, you get a DataFrame with these aggregations computed automatically.

ts60 = ts15.resample('60min', closed='left', label='left').ohlc()
ts60

Upsampling

Upsampling brings an entirely different set of challenges. If we increase the frequency of the samples, how do we fill in the new time series values? For example, consider the reverse of the last example. Then check out fill options such as asfreq() (which leaves the new slots as NaN) and ffill().

ts30
ts30.resample('15min').asfreq()
ts30.resample('15min').ffill()

Time Series Plotting

Pandas tries to make it a little easier to create nice looking time series plots using matplotlib than if you just used matplotlib yourself. Let’s read in some stock price data from Yahoo! Finance that has already been saved to a csv file.

close_px_all = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
close_px_all.head()
close_px = close_px_all[['AAPL','MSFT','XOM']]
close_px[1:10]
close_px['AAPL'].plot()
close_px.plot()
# And here's what it looks like if we just plot one year. Notice the auto x-axis formatting.
close_px.loc['2009'].plot()
close_px.index
list(close_px.index)[1:25]
close_px['AAPL'].loc['2011-01':'2011-03'].plot()

Pandas does a particularly nice job with quarterly data.

appl_q = close_px['AAPL'].resample('Q-DEC').ffill()
appl_q
appl_q.loc['2009':].plot()

Moving window functions

Pandas makes it easy to do all kinds of moving window operations like moving average, cumulative sums, and even things like exponential moving averages (“exponential smoothing”).

close_px.AAPL.plot()
close_px.AAPL.rolling(250).mean().plot()
close_px.AAPL.rolling(250).mean()

This example shows one problem with a moving average - by default you usually can’t compute it until you have data for the number of periods corresponding to the window size (250 in this case). Pandas lets you specify an alternate minimum number of data points to get around this problem.

close_px.AAPL.rolling(250, min_periods=10).mean()
close_px.AAPL.plot()
close_px.AAPL.rolling(250, min_periods=10).mean().plot()
# Things like rolling standard deviation are also possible
appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
appl_std250.plot()
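Finally, for the exponential moving averages (“exponential smoothing”) mentioned above, here’s a sketch using the ewm accessor (assuming pandas 0.18 or later):

close_px.AAPL.ewm(span=60).mean().plot()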

Learn more about pandas timeseries

Another intro tutorial is http://earthpy.org/pandas-basics.html

Wes Mckinney’s 2012 PyCon talk on pandas and timeseries - http://pyvideo.org/video/1198/time-series-data-analysis-with-pandas