## General imports
from itertools import izip
## Import datetime related moudles
from datetime import datetime
from datetime import timedelta
from dateutil.parser import parse
import time
## Pandas setup
from pandas import DataFrame
import pandas as pd
Computing occupancy statistics with Python - Part 1 of 3
Occupancy by datetime
Many years ago I created an MS Access add-in called Hillmaker for doing time of day and day of week based occupancy analysis in health care delivery systems. A typical use would be to find the mean and 95th percentile of occupancy in a set of nursing units or an emergency department. Hillmaker was released as an open source project back in 2005 and has gotten quite a bit of use. However, I never really did any more development on it even though there were a number of enhancements that I (and many others) would like to have seen. The only attention it has gotten from me was in response to the object library problems caused by new releases of MS Office. No fun.
While VBA is fine, I’ve always wanted to make Hillmaker less MS-centric and to decouple it from MS Access. Installing Access add-ins can be a hassle as they require changes to the Windows registry and some corporate settings frown on this. So, I hacked together a proof of concept version using Python with the pandas module for some of the number crunching and with matplotlib for plotting. Seems to work. Seems to be fast. Made it work with text files, Excel, Access and SQL Server based data. For this tutorial, I’ll call it HillPy.
In this tutorial I’ll focus on the basic ideas behind computing occupancy statistics with Python. The data is fictitious data from a hospital short stay unit. Patients flow through a short stay unit for a variety of procedures, tests or therapies. Let’s assume patients can be classified into one of five categories of patient types: ART (arterialgram), CAT (post cardiac-cath), MYE (myelogram), IVT (IV therapy), and OTH (other). From one of our hospital information systems we were able to get raw data about the entry and exit times of each patient. For simplicity, the data is in a csv file. Let’s start by importing a bunch of modules we’ll need.
Preliminaries
I created a module to house some functions specific to HillPy. The details will be discussed when we encounter these functions.
## HillPy library
import hillpylib as hlib
The original Hillmaker application has a bunch of key input parameters. For this demo, I’m just going to hard code them.
## Input parameters - just hard coding them for demonstration purposes
## The name of the csv file containing the raw stop record data
= 'data/ShortStay.csv'
file_stopdata
## A string scenario identifier. It gets appended to output filenames.
= 'shortstay_csv'
scenario_name
## The field in the raw data file corresponding to the category by which statistics are computed.
= 'PatType'
cat_fld_name
## A string used as the category name for the overall total occupancy statistics.
= 'Total'
cat_total_str
## The field in the raw data file with the timestamp of the patient entry time.
= 'InRoomTS'
in_fld_name
## The field in the raw data file with the timestamp of the patient exit time.
= 'OutRoomTS'
out_fld_name
## Each day is broken up into time bins. This is the bin size in minutes. Typical
## values are 15, 30, 60, 120, 240, or 480.
= 30
binsize_mins
## We need a date range for the analysis. More on this below.
= '1/2/1996'
start_analysis = '3/31/1996 23:45'
end_analysis ## Convert string dates to actual datetimes
= parse(start_analysis)
start_analysis_dt = parse(end_analysis)
end_analysis_dt = [start_analysis_dt,end_analysis_dt] analysis_range
The stop records
It’s important that you choose an analysis date range for which you have no missing data. It’s also important to take system ‘warm up’ issues into account. This issue is discussed in a paper I wrote a bunch of years ago:
Isken, M.W. (2002) “Modeling and Analysis of Occupancy Data: A Healthcare Capacity Planning Application,” International Journal of Information Technology and Decision Making, 1, 4 (December) 707-729.
For this example it’s not a big deal since length of stay is on the order of a few hours and many of the patients are scheduled during the standard business day.
Let’s read in the raw “stop records” from a csv file. We refer to them as stop records since each row is a single stop (with associated entry and exit timestamps) by a single patient.
## Read sample data set and convert string dates to datetimes
= pd.read_csv(file_stopdata,parse_dates=[in_fld_name,out_fld_name]) df
df.head()
PatID | InRoomTS | OutRoomTS | PatType | |
---|---|---|---|---|
0 | 1 | 1996-01-01 07:44:00 | 1996-01-01 08:50:00 | IVT |
1 | 2 | 1996-01-01 08:28:00 | 1996-01-01 09:20:00 | IVT |
2 | 3 | 1996-01-01 11:44:00 | 1996-01-01 13:30:00 | MYE |
3 | 4 | 1996-01-01 11:51:00 | 1996-01-01 12:55:00 | CAT |
4 | 5 | 1996-01-01 12:10:00 | 1996-01-01 13:00:00 | IVT |
df.tail()
PatID | InRoomTS | OutRoomTS | PatType | |
---|---|---|---|---|
59872 | 59873 | 1996-09-30 19:31:00 | 1996-09-30 20:15:00 | IVT |
59873 | 59874 | 1996-09-30 20:23:00 | 1996-09-30 21:30:00 | IVT |
59874 | 59875 | 1996-09-30 21:00:00 | 1996-09-30 22:45:00 | CAT |
59875 | 59876 | 1996-09-30 21:57:00 | 1996-09-30 22:40:00 | IVT |
59876 | 59877 | 1996-09-30 22:45:00 | 1996-09-30 23:35:00 | CAT |
While we don’t need to compute length of stay to do the occupancy analysis, let’s do it anyway. We usually want to do descriptive statistics on length of stay.
## Compute LOS - the results is a timedelta value
'LOS'] = df[out_fld_name] - df[in_fld_name]
df[
## Compute LOS in minutes using one of the functions in hillpylib
'LOSmins'] = hlib.vtd_to_mins(df['LOS']) df[
df.head()
PatID | InRoomTS | OutRoomTS | PatType | LOS | LOSmins | |
---|---|---|---|---|---|---|
0 | 1 | 1996-01-01 07:44:00 | 1996-01-01 08:50:00 | IVT | 1:06:00 | 66 |
1 | 2 | 1996-01-01 08:28:00 | 1996-01-01 09:20:00 | IVT | 0:52:00 | 52 |
2 | 3 | 1996-01-01 11:44:00 | 1996-01-01 13:30:00 | MYE | 1:46:00 | 106 |
3 | 4 | 1996-01-01 11:51:00 | 1996-01-01 12:55:00 | CAT | 1:04:00 | 64 |
4 | 5 | 1996-01-01 12:10:00 | 1996-01-01 13:00:00 | IVT | 0:50:00 | 50 |
Here’s what the relevant functions from hillpylib look like:
import numpy as np
import pandas as pd
def td_to_mins(x):
"""
Converts a timedelta object to minutes
"""
return x.days * 24.0 * 60 + x.seconds / 60.0 + x.microseconds / 6000.0
= np.vectorize(td_to_mins) # Make usable with list like things vtd_to_mins
Creating a seeded by date table for occupancy, arrivals, and departures
Now we are ready for the main computational tasks of computing occupancy for every datetime bin and category (PatType in this example) in the analysis date range. In addition to computing occupancy, we’ll also count the number of patient arrivals and departures in each datetime bin. Historically, this table is called the by date table (even though it’s really by datetime bin by category. Our strategy is to first create a “fully seeded” table and then fill in the values by processing the raw patient stop records. By fully seeded we mean that there is a record for each combination of datetime bin and category value in the analysis date range and all of the measures (occupancy, arrivals, and departures) are initialized to 0. Let’s start by creating the fully seeded table.
# Create date and range and convert it from a pandas DateTimeIndex to a
# reqular old array of datetimes to try to get around the weird problems
# in computing day of week on datetime64 values.
= str(binsize_mins) + 'min'
bin_freq = pd.date_range(start_analysis_dt, end_analysis_dt,
rng_bydate =bin_freq).to_pydatetime() freq
rng_bydate
array([datetime.datetime(1996, 1, 2, 0, 0),
datetime.datetime(1996, 1, 2, 0, 30),
datetime.datetime(1996, 1, 2, 1, 0), ...,
datetime.datetime(1996, 3, 31, 22, 30),
datetime.datetime(1996, 3, 31, 23, 0),
datetime.datetime(1996, 3, 31, 23, 30)], dtype=object)
Now let’s create a list of the unique category values (the PatType column in this example).
# Get the unique category values
= [c for c in df[cat_fld_name].unique()]
categories categories
['IVT', 'MYE', 'CAT', 'ART', 'OTH']
Create a list of column names for the by date table and then an empty data frame based on these columns.
=['category','datetime','arrivals','departures','occupancy']
columns# Create an empty ByDate data frame
= DataFrame(columns=columns) bydate_df
Now we’ll build up the seeded by date table a category at a time. Along the way we’ll initialize all the measures to 0.
= len(rng_bydate)
len_bydate for cat in categories:
= {'category':[cat]*len_bydate,
bydate_data 'datetime':rng_bydate,
'arrivals':[0.0]*len_bydate,
'departures':[0.0]*len_bydate,
'occupancy':[0.0]*len_bydate,
}= DataFrame(bydate_data,columns=['category',
bydate_df_cat 'datetime',
'arrivals',
'departures',
'occupancy'])
= pd.concat([bydate_df,bydate_df_cat]) bydate_df
bydate_df.head()
category | datetime | arrivals | departures | occupancy | |
---|---|---|---|---|---|
0 | IVT | 1996-01-02 00:00:00 | 0 | 0 | 0 |
1 | IVT | 1996-01-02 00:30:00 | 0 | 0 | 0 |
2 | IVT | 1996-01-02 01:00:00 | 0 | 0 | 0 |
3 | IVT | 1996-01-02 01:30:00 | 0 | 0 | 0 |
4 | IVT | 1996-01-02 02:00:00 | 0 | 0 | 0 |
bydate_df.tail()
category | datetime | arrivals | departures | occupancy | |
---|---|---|---|---|---|
4315 | OTH | 1996-03-31 21:30:00 | 0 | 0 | 0 |
4316 | OTH | 1996-03-31 22:00:00 | 0 | 0 | 0 |
4317 | OTH | 1996-03-31 22:30:00 | 0 | 0 | 0 |
4318 | OTH | 1996-03-31 23:00:00 | 0 | 0 | 0 |
4319 | OTH | 1996-03-31 23:30:00 | 0 | 0 | 0 |
Notice that each row in bdate_df
corresponds to a half-hour time bin for a specific PatType (category). It’s not readily apparent that we have created duplicate index values in the default pandas index.
4315:4330] bydate_df[
category | datetime | arrivals | departures | occupancy | |
---|---|---|---|---|---|
4315 | IVT | 1996-03-31 21:30:00 | 0 | 0 | 0 |
4316 | IVT | 1996-03-31 22:00:00 | 0 | 0 | 0 |
4317 | IVT | 1996-03-31 22:30:00 | 0 | 0 | 0 |
4318 | IVT | 1996-03-31 23:00:00 | 0 | 0 | 0 |
4319 | IVT | 1996-03-31 23:30:00 | 0 | 0 | 0 |
0 | MYE | 1996-01-02 00:00:00 | 0 | 0 | 0 |
1 | MYE | 1996-01-02 00:30:00 | 0 | 0 | 0 |
2 | MYE | 1996-01-02 01:00:00 | 0 | 0 | 0 |
3 | MYE | 1996-01-02 01:30:00 | 0 | 0 | 0 |
4 | MYE | 1996-01-02 02:00:00 | 0 | 0 | 0 |
5 | MYE | 1996-01-02 02:30:00 | 0 | 0 | 0 |
6 | MYE | 1996-01-02 03:00:00 | 0 | 0 | 0 |
7 | MYE | 1996-01-02 03:30:00 | 0 | 0 | 0 |
8 | MYE | 1996-01-02 04:00:00 | 0 | 0 | 0 |
9 | MYE | 1996-01-02 04:30:00 | 0 | 0 | 0 |
# Now create a hierarchical multiindex to replace the default index (since it
# has dups from the concatenation). We keep the columns used in the index as
# regular columns as well since it's hard
# to do a column transformation using a specific level of a multiindex.
# http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64?rq=1
= bydate_df.set_index(['category', 'datetime'], drop=False) bydate_df
bydate_df
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 21600 entries, (IVT, 1996-01-02 00:00:00) to (OTH, 1996-03-31 23:30:00)
Data columns:
category 21600 non-null values
datetime 21600 non-null values
arrivals 21600 non-null values
departures 21600 non-null values
occupancy 21600 non-null values
dtypes: float64(3), object(2)
4315:4330] bydate_df[
category | datetime | arrivals | departures | occupancy | ||
---|---|---|---|---|---|---|
category | datetime | |||||
IVT | 1996-03-31 21:30:00 | IVT | 1996-03-31 21:30:00 | 0 | 0 | 0 |
1996-03-31 22:00:00 | IVT | 1996-03-31 22:00:00 | 0 | 0 | 0 | |
1996-03-31 22:30:00 | IVT | 1996-03-31 22:30:00 | 0 | 0 | 0 | |
1996-03-31 23:00:00 | IVT | 1996-03-31 23:00:00 | 0 | 0 | 0 | |
1996-03-31 23:30:00 | IVT | 1996-03-31 23:30:00 | 0 | 0 | 0 | |
MYE | 1996-01-02 00:00:00 | MYE | 1996-01-02 00:00:00 | 0 | 0 | 0 |
1996-01-02 00:30:00 | MYE | 1996-01-02 00:30:00 | 0 | 0 | 0 | |
1996-01-02 01:00:00 | MYE | 1996-01-02 01:00:00 | 0 | 0 | 0 | |
1996-01-02 01:30:00 | MYE | 1996-01-02 01:30:00 | 0 | 0 | 0 | |
1996-01-02 02:00:00 | MYE | 1996-01-02 02:00:00 | 0 | 0 | 0 | |
1996-01-02 02:30:00 | MYE | 1996-01-02 02:30:00 | 0 | 0 | 0 | |
1996-01-02 03:00:00 | MYE | 1996-01-02 03:00:00 | 0 | 0 | 0 | |
1996-01-02 03:30:00 | MYE | 1996-01-02 03:30:00 | 0 | 0 | 0 | |
1996-01-02 04:00:00 | MYE | 1996-01-02 04:00:00 | 0 | 0 | 0 | |
1996-01-02 04:30:00 | MYE | 1996-01-02 04:30:00 | 0 | 0 | 0 |
Again, for now we aren’t dropping the redundant category and datetime columns. We are going to compute three additional columns based on the value of datetime and it’s easier to do such transformations on a regular data frame column than on a specific level of a multi-index. The three columns we want to add are:
- day of week - 0..6 with 0=Monday
- bin of day - 0..(num_bins_per_day-1) where num_bins_per_day=48 if binsize_mins=30, num_bins_per_day=24 if binsize_mins=60, etc.
- bin of week - 0..(num_bins_per_week-1) where num_bins_per_week = 7 * num_bins_per_day
There is a built in weekday()
function for datetimes. For bin of day and bin of week, we created custom functions in hillpylib. Notice the avoidance of loops via the use of map()
and a lambda function.
'dayofweek'] = bydate_df['datetime'].map(lambda x: x.weekday())
bydate_df['binofday'] = bydate_df['datetime'].map(lambda x: hlib.binofday(x,binsize_mins))
bydate_df['binofweek'] = bydate_df['datetime'].map(lambda x: hlib.binofweek(x,binsize_mins)) bydate_df[
Instead of lambda functions, I did try to use vectorized versions of binofday and binofweek, but pandas didn’t seem to like them. Here are the bin of day and week functions:
def binofday(dt, binsizemins):
"""
Computes bin of day based on bin size for a datetime.
Parameters
----------
dt : datetime.datetime object, default now.
binsizemins : Size of bin in minutes; default 30 minutes.
Returns
-------
0 to (n-1) where n is number of bins per day.
Examples
--------
dt = datetime(2013,1,7,1,45)
bin = binofday(dt,30)
# bin = 3
"""
if dt == None: dt = datetime.now()
# YES, I know that type checking is NOT Pythonic!!!
# However, the hell that is numpy.datetime64 data types has
# caused me to give up and do it anyway.
if not isinstance(dt,datetime):
= pd.Timestamp(dt)
dt
= (dt.hour * 60) + dt.minute
mins bin = math.trunc(mins/binsizemins)
return bin
def binofweek(dt, binsizemins):
"""
Computes bin of week based on bin size for a datetime.
Based on .weekday() convention of 0=Monday.
Parameters
----------
dt : datetime.datetime object, default now.
binsizemins : Size of bin in minutes; default 30 minutes.
Returns
-------
0 to (n-1) where n is number of bins per week.
Examples
--------
dt = datetime(2013,1,9,1,45)
bin = binofweek(dt,30)
# bin = 99
"""
if dt == None: dt = datetime.now()
# YES, I know that type checking is NOT Pythonic!!!
# However, the hell that is numpy.datetime64 data types has
# caused me to give up and do it anyway.
if not isinstance(dt,datetime):
= pd.Timestamp(dt)
dt
= (dt.weekday() * 1440) + (dt.hour * 60) + dt.minute
mins bin = math.trunc(mins/binsizemins)
return bin
bydate_df.head()
category | datetime | arrivals | departures | occupancy | dayofweek | binofday | binofweek | ||
---|---|---|---|---|---|---|---|---|---|
category | datetime | ||||||||
IVT | 1996-01-02 00:00:00 | IVT | 1996-01-02 00:00:00 | 0 | 0 | 0 | 1 | 0 | 48 |
1996-01-02 00:30:00 | IVT | 1996-01-02 00:30:00 | 0 | 0 | 0 | 1 | 1 | 49 | |
1996-01-02 01:00:00 | IVT | 1996-01-02 01:00:00 | 0 | 0 | 0 | 1 | 2 | 50 | |
1996-01-02 01:30:00 | IVT | 1996-01-02 01:30:00 | 0 | 0 | 0 | 1 | 3 | 51 | |
1996-01-02 02:00:00 | IVT | 1996-01-02 02:00:00 | 0 | 0 | 0 | 1 | 4 | 52 |
Arrival and departure bins and occupancy contributions
A few more preliminary concepts are needed before we can churn the crank and process the stop records and fill in the bydate table with occupancy, arrival, and departure info.
Let’s look at the first few stop records.
df.head()
PatID | InRoomTS | OutRoomTS | PatType | LOS | LOSmins | |
---|---|---|---|---|---|---|
0 | 1 | 1996-01-01 07:44:00 | 1996-01-01 08:50:00 | IVT | 1:06:00 | 66 |
1 | 2 | 1996-01-01 08:28:00 | 1996-01-01 09:20:00 | IVT | 0:52:00 | 52 |
2 | 3 | 1996-01-01 11:44:00 | 1996-01-01 13:30:00 | MYE | 1:46:00 | 106 |
3 | 4 | 1996-01-01 11:51:00 | 1996-01-01 12:55:00 | CAT | 1:04:00 | 64 |
4 | 5 | 1996-01-01 12:10:00 | 1996-01-01 13:00:00 | IVT | 0:50:00 | 50 |
'InRoomTS'][0] # Entry time for first record df[
datetime.datetime(1996, 1, 1, 7, 44)
'OutRoomTS'][0] # Exit time for first record df[
datetime.datetime(1996, 1, 1, 8, 50)
So, the first patient arrived in the 30 minute time bin beginning on 1996-01-01 07:30:00 and departed during the bin beginning 1996-01-01 08:30:00. We call 1996-01-01 07:30:00 the arrival bin and 1996-01-01 08:30:00 the departure bin. The first patient was in the system for all or part of 3 time bins:
- 1996-01-01 07:30:00 - 1996-01-01 08:00:00 (16 mins –> 16/30 = 0.5333 of bin is the arrival bin fraction)
- 1996-01-01 08:00:00 - 1996-01-01 08:30:00 (30 mins –> 30/30 = 1.0000 of bin)
- 1996-01-01 08:30:00 - 1996-01-01 09:00:00 (20 mins –> 20/30 = 0.6667 of bin is the departure bin fraction)
The standard convention in Hillmaker is to use the arrival (departure) bin fraction as the contribution to occupancy in the arrival (departure) bin. For all bins between the arrival and departure bins, the patient is in the system for the entire bin length and thus the contribution to occupancy for each bin is 1.0. If you want to be more conservative, you could give full credit of 1.0 for the arrival and departure bins (this is an option in Hillmaker).
So, it seems we just need to step through all the rows in the stop record data frame, df
, and:
- get the PatType value
- find the arrival and departure bins
- compute the arrival and departure bin fractions,
- and increment the appropriate rows in the
bydate_df
data frame.
Stop record types
Well, it’s almost that easy. There’s one more complication to deal with. Recall that as part of the input parameters we specified start and end dates for the analysis (the analysis date range):
start_analysis = '1/2/1996'
end_analysis = '3/31/1996 23:45'
Obviously, a patient stop record having an arrival time after end_analysis
can be ignored as can those patients who departed before start_analysis
. For those patients whose stop falls strictly between start_analysis
and end_analysis
, they will end up with occupancy contributions for all of the time bins corresponding to their stay. However, what about patients who arrive during the analysis date range but depart after end_analysis
or those who arrive before start_analysis
and depart within the analysis date range. As you’d expect, we simply ignore time bins that fall outside of the analysis date range. Note we are NOT ignoring the entire record. We are simply computing occupancy contributions only for time bins falling within the analysis date range. To make our life easier, we created a function to determine the stop record type. You’ll see in the comments below the function declaration that there is a diagram illustrating the different record types.
def stoprec_analysis_rltnshp(stoprecrange, analysisrange):
"""
Determines relationship type of stop record to analysis date range.
Parameters
----------
stoprecrange: list consisting of [rec_in, rec_out]
analysisrange: list consisting of [a_start, a_end]
Returns
-------
Returns a string, either 'inner', 'left', 'right, 'outer',
'backwards', 'none' depending
on the relationship between the stop record being analyzed and the
analysis date range.
Type 'inner':
|-------------------------|
a_start a_end
|--------------|
rec_in rec_out
Type 'left':
|-------------------------|
a_start a_end
|--------------|
rec_in rec_out
Type 'right':
|-------------------------|
a_start a_end
|--------------|
rec_in rec_out
Type 'outer':
|-------------------------|
a_start a_end
|-------------------------------------|
rec_in rec_out
Type 'backwards':
The exit time is BEFORE the entry time. This is a BAD record.
Type 'none':
Ranges do not overlap
"""
= stoprecrange[0]
rec_in = stoprecrange[1]
rec_out = analysisrange[0]
a_start = analysisrange[1]
a_end
if (rec_in > rec_out):
return 'backwards'
elif (a_start <= rec_in < a_end) and (a_start <= rec_out < a_end):
return 'inner'
elif (a_start <= rec_in < a_end) and (rec_out >= a_end):
return 'right'
elif (rec_in < a_start) and (a_start <= rec_out < a_end):
return 'left'
elif (rec_in < a_start) and (rec_out >= a_end):
return 'outer'
else:
return 'none'
Main processing loop
Finally, we can step through the patient stop records and fill in the by date table.
# Main occupancy, arrivals, departures loop. Process each record in the
# stop data file.
for intime, outtime, cat in izip(df[in_fld_name], df[out_fld_name], df[cat_fld_name]):
= True
good_rec = hlib.stoprec_analysis_rltnshp([intime,outtime],analysis_range)
rectype
if rectype in ['backwards']:
# print "ERROR_{}: {} {} {}".format(rectype,intime,outtime,cat)
= False
good_rec
if good_rec and rectype != 'none':
= hlib.rounddownTime(intime,binsize_mins)
indtbin = hlib.rounddownTime(outtime,binsize_mins)
outdtbin = hlib.occ_frac([intime, outtime], binsize_mins)
inout_occ_frac # print "{} {} {} {} {:.3f} {:.3f} {:.3f}".format(intime, outtime, cat,
# rectype, time.clock(), inout_occ_frac[0], inout_occ_frac[1])
if rectype == 'inner':
'occupancy'] += inout_occ_frac[0]
bydate_df.ix[(cat,indtbin), 'occupancy'] += inout_occ_frac[1]
bydate_df.ix[(cat,outdtbin), 'arrivals'] += 1.0
bydate_df.ix[(cat,indtbin), 'departures'] += 1.0
bydate_df.ix[(cat,outdtbin),
if hlib.isgt2bins(indtbin, outdtbin, binsize_mins):
bin = indtbin + timedelta(minutes=binsize_mins)
while bin < outdtbin:
bin), 'occupancy'] += 1.0
bydate_df.ix[(cat,bin += timedelta(minutes=binsize_mins)
elif rectype == 'right':
# departure is outside analysis window
'occupancy'] += inout_occ_frac[0]
bydate_df.ix[(cat,indtbin), 'arrivals'] += 1.0
bydate_df.ix[(cat,indtbin),
if hlib.isgt2bins(indtbin, outdtbin, binsize_mins):
bin = indtbin + timedelta(minutes=binsize_mins)
while bin <= end_analysis_dt:
bin), 'occupancy'] += 1.0
bydate_df.ix[(cat,bin += timedelta(minutes=binsize_mins)
elif rectype == 'left':
# arrival is outside analysis window
'occupancy'] += inout_occ_frac[1]
bydate_df.ix[(cat,outdtbin), 'departures'] += 1.0
bydate_df.ix[(cat,outdtbin),
if hlib.isgt2bins(indtbin, outdtbin, binsize_mins):
bin = start_analysis_dt + timedelta(minutes=binsize_mins)
while bin < outdtbin:
bin), 'occupancy'] += 1.0
bydate_df.ix[(cat,bin += timedelta(minutes=binsize_mins)
elif rectype == 'outer':
# arrival and departure sandwich analysis window
if hlib.isgt2bins(indtbin, outdtbin, binsize_mins):
bin = start_analysis_dt
while bin <= end_analysis_dt:
bin), 'occupancy'] += 1.0
bydate_df.ix[(cat,bin += timedelta(minutes=binsize_mins)
else:
pass
print "Done processing stop recs: {}".format(time.clock())
Done processing stop recs: 2.56635211744e-06
At this point, bydate_df
contains occupancy, arrival and departure counts for each time bin and category. Here’s a look into the heart of the by date table.
1320:1350] bydate_df[
category | datetime | arrivals | departures | occupancy | dayofweek | binofday | binofweek | ||
---|---|---|---|---|---|---|---|---|---|
category | datetime | ||||||||
IVT | 1996-01-29 12:00:00 | IVT | 1996-01-29 12:00:00 | 9 | 9 | 21.266667 | 0 | 24 | 24 |
1996-01-29 12:30:00 | IVT | 1996-01-29 12:30:00 | 9 | 6 | 22.333333 | 0 | 25 | 25 | |
1996-01-29 13:00:00 | IVT | 1996-01-29 13:00:00 | 12 | 12 | 22.266667 | 0 | 26 | 26 | |
1996-01-29 13:30:00 | IVT | 1996-01-29 13:30:00 | 8 | 9 | 23.100000 | 0 | 27 | 27 | |
1996-01-29 14:00:00 | IVT | 1996-01-29 14:00:00 | 8 | 6 | 22.933333 | 0 | 28 | 28 | |
1996-01-29 14:30:00 | IVT | 1996-01-29 14:30:00 | 9 | 11 | 22.300000 | 0 | 29 | 29 | |
1996-01-29 15:00:00 | IVT | 1996-01-29 15:00:00 | 7 | 6 | 23.900000 | 0 | 30 | 30 | |
1996-01-29 15:30:00 | IVT | 1996-01-29 15:30:00 | 6 | 9 | 22.600000 | 0 | 31 | 31 | |
1996-01-29 16:00:00 | IVT | 1996-01-29 16:00:00 | 9 | 11 | 19.700000 | 0 | 32 | 32 | |
1996-01-29 16:30:00 | IVT | 1996-01-29 16:30:00 | 7 | 4 | 19.533333 | 0 | 33 | 33 | |
1996-01-29 17:00:00 | IVT | 1996-01-29 17:00:00 | 5 | 9 | 18.033333 | 0 | 34 | 34 | |
1996-01-29 17:30:00 | IVT | 1996-01-29 17:30:00 | 2 | 9 | 13.600000 | 0 | 35 | 35 | |
1996-01-29 18:00:00 | IVT | 1996-01-29 18:00:00 | 3 | 5 | 9.566667 | 0 | 36 | 36 | |
1996-01-29 18:30:00 | IVT | 1996-01-29 18:30:00 | 2 | 4 | 7.166667 | 0 | 37 | 37 | |
1996-01-29 19:00:00 | IVT | 1996-01-29 19:00:00 | 1 | 3 | 5.833333 | 0 | 38 | 38 | |
1996-01-29 19:30:00 | IVT | 1996-01-29 19:30:00 | 3 | 2 | 6.366667 | 0 | 39 | 39 | |
1996-01-29 20:00:00 | IVT | 1996-01-29 20:00:00 | 0 | 3 | 4.833333 | 0 | 40 | 40 | |
1996-01-29 20:30:00 | IVT | 1996-01-29 20:30:00 | 0 | 3 | 1.000000 | 0 | 41 | 41 | |
1996-01-29 21:00:00 | IVT | 1996-01-29 21:00:00 | 0 | 0 | 0.000000 | 0 | 42 | 42 | |
1996-01-29 21:30:00 | IVT | 1996-01-29 21:30:00 | 0 | 0 | 0.000000 | 0 | 43 | 43 | |
1996-01-29 22:00:00 | IVT | 1996-01-29 22:00:00 | 0 | 0 | 0.000000 | 0 | 44 | 44 | |
1996-01-29 22:30:00 | IVT | 1996-01-29 22:30:00 | 0 | 0 | 0.000000 | 0 | 45 | 45 | |
1996-01-29 23:00:00 | IVT | 1996-01-29 23:00:00 | 0 | 0 | 0.000000 | 0 | 46 | 46 | |
1996-01-29 23:30:00 | IVT | 1996-01-29 23:30:00 | 0 | 0 | 0.000000 | 0 | 47 | 47 | |
1996-01-30 00:00:00 | IVT | 1996-01-30 00:00:00 | 0 | 0 | 0.000000 | 1 | 0 | 48 | |
1996-01-30 00:30:00 | IVT | 1996-01-30 00:30:00 | 0 | 0 | 0.000000 | 1 | 1 | 49 | |
1996-01-30 01:00:00 | IVT | 1996-01-30 01:00:00 | 0 | 0 | 0.000000 | 1 | 2 | 50 | |
1996-01-30 01:30:00 | IVT | 1996-01-30 01:30:00 | 0 | 0 | 0.000000 | 1 | 3 | 51 | |
1996-01-30 02:00:00 | IVT | 1996-01-30 02:00:00 | 0 | 0 | 0.000000 | 1 | 4 | 52 | |
1996-01-30 02:30:00 | IVT | 1996-01-30 02:30:00 | 0 | 0 | 0.000000 | 1 | 5 | 53 |
Computing totals
Before computing statistics such as means and percentiles, we compute the total (over all categories) occupancy, arrivals and departures by time bin.
# Compute totals for arrivals, departures and occupancy by using
# Pandas aggregation methods
= bydate_df.groupby(level='datetime') ['occupancy','arrivals','departures']
bydate_dfgrp1 = bydate_dfgrp1.sum() # Compute the totals
bydate_tot
=True) # Moves the index column to a data column
bydate_tot.reset_index(inplace'category'] = cat_total_str # Set the category for the total rows
bydate_tot[
# Set the index to conform to the main bydate data frame. The inplace=True
# avoids creating a new object
'category','datetime'], inplace=True, drop=False)
bydate_tot.set_index([
# Concatenate the totals to the main data frame
= pd.concat([bydate_df,bydate_tot])
bydate_df
# Update dayofweek, binofday, binofweek
'dayofweek'][bydate_df['category'] == "Total"] = bydate_df['datetime'][bydate_df['category'] == "Total"].map(lambda x: pd.Timestamp(x).weekday())
bydate_df['binofday'][bydate_df['category'] == 'Total'] = bydate_df['datetime'][bydate_df['category'] == "Total"].map(lambda x: hlib.binofday(x,binsize_mins))
bydate_df['binofweek'][bydate_df['category'] == 'Total'] = bydate_df['datetime'][bydate_df['category'] == "Total"].map(lambda x: hlib.binofweek(x,binsize_mins))
bydate_df[
# Drop the redundant category and datetime fields (since they comprise the multi-index)
del bydate_df['category']
del bydate_df['datetime']
# Look at the tail to see example of totals
bydate_df.tail()
arrivals | binofday | binofweek | dayofweek | departures | occupancy | ||
---|---|---|---|---|---|---|---|
category | datetime | ||||||
Total | 1996-03-31 21:30:00 | 0 | 43 | 331 | 6 | 0 | 0.000000 |
1996-03-31 22:00:00 | 0 | 44 | 332 | 6 | 0 | 0.000000 | |
1996-03-31 22:30:00 | 4 | 45 | 333 | 6 | 0 | 1.600000 | |
1996-03-31 23:00:00 | 0 | 46 | 334 | 6 | 0 | 4.000000 | |
1996-03-31 23:30:00 | 0 | 47 | 335 | 6 | 4 | 0.666667 |
Write out bydate_df
to a csv file and take a look at it. With this data frame we can compute all kinds of interesting summary statistics by day of week and time of day. That will be the subject of part 2 of this tutorial.
= 'bydate_' + scenario_name + '.csv'
file_bydate_csv bydate_df.to_csv(file_bydate_csv)
About this IPython notebook
You can find the data and the .ipynb
file in my hselab-tutorials github repo. Clone or download a zip.
Check out the IPython doc on the notebook format to learn all about working with IPython notebooks. A few highlights include:
- IPython notebooks are JSON text files with a
.ipynb
extension. - You can download a notebook as regular
.py
file with a File|Download As… and setting the download filetype to py. - If you add the proper comments to a regular
.py
file, you can open it as a notebook file by dragging and dropping the file into the notebook dashboard file list area. See the doc link above for the details on how to comment your Python file so that this works well. - To create a static HTML or PDF of your notebook, do a File|Print and then just save or print or whatever from the resulting browser window.
Reuse
Citation
@online{isken2013,
author = {Mark Isken},
title = {Computing Occupancy Statistics with {Python} - {Part} 1 of 3},
date = {2013-03-25},
langid = {en}
}