Computing occupancy statistics with Python - Part 2 of 3

Occupancy by datetime

Categories: python, hillmaker, occupancy analysis

Author: Mark Isken

Published: April 8, 2013

UPDATE 2023-01-16

Hillmaker no longer works this way. See https://github.com/misken/hillmaker for the latest version.

In the second part of this series, we will use Python to compute summary occupancy statistics (such as means and percentiles) by time of day, day of week, and patient category (recall that this example is from a hospital short stay unit - go back to Part 1 for all of the background info). Computation of percentiles by one or more grouping fields is a pain using tools like Excel, Access and SQL Server. With Python+pandas it’s easy.

You can find the data and the .ipynb file in my hselab-tutorials GitHub repo. Clone it or download a zip.

Preliminaries

import pandas as pd

At the end of Part 1 of this tutorial series, we ended up with a csv file called bydate_shortstay_csv.csv. Let’s read it in and take a look at it.

## Read sample data set and convert string dates to datetimes
bydate_df = pd.read_csv('data/bydate_shortstay_csv.csv', parse_dates=['datetime'])
bydate_df.head()
category datetime arrivals binofday binofweek dayofweek departures occupancy
0 IVT 1996-01-02 00:00:00 0.0 0.0 48.0 1.0 0.0 0.0
1 IVT 1996-01-02 00:30:00 0.0 1.0 49.0 1.0 0.0 0.0
2 IVT 1996-01-02 01:00:00 0.0 2.0 50.0 1.0 0.0 0.0
3 IVT 1996-01-02 01:30:00 0.0 3.0 51.0 1.0 0.0 0.0
4 IVT 1996-01-02 02:00:00 0.0 4.0 52.0 1.0 0.0 0.0
bydate_df[1320:1350]
category datetime arrivals binofday binofweek dayofweek departures occupancy
1320 IVT 1996-01-29 12:00:00 9.0 24.0 24.0 0.0 9.0 21.266667
1321 IVT 1996-01-29 12:30:00 9.0 25.0 25.0 0.0 6.0 22.333333
1322 IVT 1996-01-29 13:00:00 12.0 26.0 26.0 0.0 12.0 22.266667
1323 IVT 1996-01-29 13:30:00 8.0 27.0 27.0 0.0 9.0 23.100000
1324 IVT 1996-01-29 14:00:00 8.0 28.0 28.0 0.0 6.0 22.933333
1325 IVT 1996-01-29 14:30:00 9.0 29.0 29.0 0.0 11.0 22.300000
1326 IVT 1996-01-29 15:00:00 7.0 30.0 30.0 0.0 6.0 23.900000
1327 IVT 1996-01-29 15:30:00 6.0 31.0 31.0 0.0 9.0 22.600000
1328 IVT 1996-01-29 16:00:00 9.0 32.0 32.0 0.0 11.0 19.700000
1329 IVT 1996-01-29 16:30:00 7.0 33.0 33.0 0.0 4.0 19.533333
1330 IVT 1996-01-29 17:00:00 5.0 34.0 34.0 0.0 9.0 18.033333
1331 IVT 1996-01-29 17:30:00 2.0 35.0 35.0 0.0 9.0 13.600000
1332 IVT 1996-01-29 18:00:00 3.0 36.0 36.0 0.0 5.0 9.566667
1333 IVT 1996-01-29 18:30:00 2.0 37.0 37.0 0.0 4.0 7.166667
1334 IVT 1996-01-29 19:00:00 1.0 38.0 38.0 0.0 3.0 5.833333
1335 IVT 1996-01-29 19:30:00 3.0 39.0 39.0 0.0 2.0 6.366667
1336 IVT 1996-01-29 20:00:00 0.0 40.0 40.0 0.0 3.0 4.833333
1337 IVT 1996-01-29 20:30:00 0.0 41.0 41.0 0.0 3.0 1.000000
1338 IVT 1996-01-29 21:00:00 0.0 42.0 42.0 0.0 0.0 0.000000
1339 IVT 1996-01-29 21:30:00 0.0 43.0 43.0 0.0 0.0 0.000000
1340 IVT 1996-01-29 22:00:00 0.0 44.0 44.0 0.0 0.0 0.000000
1341 IVT 1996-01-29 22:30:00 0.0 45.0 45.0 0.0 0.0 0.000000
1342 IVT 1996-01-29 23:00:00 0.0 46.0 46.0 0.0 0.0 0.000000
1343 IVT 1996-01-29 23:30:00 0.0 47.0 47.0 0.0 0.0 0.000000
1344 IVT 1996-01-30 00:00:00 0.0 0.0 48.0 1.0 0.0 0.000000
1345 IVT 1996-01-30 00:30:00 0.0 1.0 49.0 1.0 0.0 0.000000
1346 IVT 1996-01-30 01:00:00 0.0 2.0 50.0 1.0 0.0 0.000000
1347 IVT 1996-01-30 01:30:00 0.0 3.0 51.0 1.0 0.0 0.000000
1348 IVT 1996-01-30 02:00:00 0.0 4.0 52.0 1.0 0.0 0.000000
1349 IVT 1996-01-30 02:30:00 0.0 5.0 53.0 1.0 0.0 0.000000
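
Before grouping, it helps to see what the bin fields encode: binofday is the half-hour bin within the day (0-47) and binofweek is the half-hour bin within the week (0-335, with Monday as dayofweek 0). These fields were built in Part 1; the sketch below just illustrates the encoding (you can check it against the rows above).

# Illustration only - these columns already exist in bydate_df
binofday = bydate_df['datetime'].dt.hour * 2 + bydate_df['datetime'].dt.minute // 30
binofweek = bydate_df['dayofweek'] * 48 + binofday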

With this data frame we can compute all kinds of interesting summary statistics by category, day of week, and time of day. To facilitate this type of "group by" analysis, pandas takes what is known as the split-apply-combine approach. The pandas documentation has a nice discussion of it. To really understand split-apply-combine, check out the article by Hadley Wickham, who created the plyr package for R. I also created a tutorial, Getting started with Python (with pandas and matplotlib) for group by analysis, that covers some of the basics. A companion tutorial shows how to do the same analysis using R instead of Python.
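
Here's a minimal sketch of the idea on made-up data (not the short stay data): groupby() splits the rows into pieces by key, an aggregation is applied to each piece, and pandas combines the per-group results into a new object.

import pandas as pd

toy = pd.DataFrame({'dayofweek': [0, 0, 0, 1, 1, 1],
                    'occupancy': [10.0, 14.0, 12.0, 3.0, 5.0, 4.0]})
# Split on dayofweek, apply mean() to each piece, combine into a Series
print(toy.groupby('dayofweek')['occupancy'].mean())
# Percentiles work exactly the same way - this is the part that's painful
# in Excel or SQL
print(toy.groupby('dayofweek')['occupancy'].quantile(0.95))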

Pandas provides a GroupBy object to facilitate computing aggregate statistics by grouping fields.

# Create a GroupBy object for the summary stats
bydate_dfgrp1 = bydate_df.groupby(['category', 'binofweek'])
# Having a GroupBy object makes it easy to compute statistics such as the mean
# of all of the fields other than the grouping fields. You'll see that the
# result is simply another DataFrame.
bydate_dfgrp1.mean()
arrivals binofday dayofweek departures occupancy
category binofweek
ART 0.0 0.000000 0.0 0.0 0.000000 0.000000
1.0 0.000000 1.0 0.0 0.000000 0.000000
2.0 0.000000 2.0 0.0 0.000000 0.000000
3.0 0.000000 3.0 0.0 0.000000 0.000000
4.0 0.000000 4.0 0.0 0.000000 0.000000
... ... ... ... ... ... ...
Total 331.0 0.717949 43.0 6.0 0.538462 0.837607
332.0 0.179487 44.0 6.0 0.538462 0.652137
333.0 0.717949 45.0 6.0 0.358974 0.682051
334.0 0.179487 46.0 6.0 0.179487 0.831624
335.0 0.179487 47.0 6.0 0.717949 0.412821

2016 rows × 5 columns

# Let's explore some of the means.
bydate_dfgrp1.mean()[100:120]
arrivals binofday dayofweek departures occupancy
category binofweek
ART 100.0 0.000000 4.0 2.0 0.000000 0.000000
101.0 0.000000 5.0 2.0 0.000000 0.000000
102.0 0.000000 6.0 2.0 0.000000 0.000000
103.0 0.000000 7.0 2.0 0.000000 0.000000
104.0 0.000000 8.0 2.0 0.000000 0.000000
105.0 0.000000 9.0 2.0 0.000000 0.000000
106.0 0.000000 10.0 2.0 0.000000 0.000000
107.0 1.538462 11.0 2.0 0.000000 0.782051
108.0 1.769231 12.0 2.0 0.000000 2.361538
109.0 3.384615 13.0 2.0 0.000000 5.058974
110.0 1.769231 14.0 2.0 0.000000 7.674359
111.0 1.538462 15.0 2.0 3.076923 7.584615
112.0 1.692308 16.0 2.0 3.384615 5.225641
113.0 1.692308 17.0 2.0 1.538462 5.300000
114.0 2.153846 18.0 2.0 1.923077 5.282051
115.0 1.846154 19.0 2.0 1.923077 5.412821
116.0 1.153846 20.0 2.0 1.923077 5.228205
117.0 1.461538 21.0 2.0 1.230769 4.800000
118.0 1.692308 22.0 2.0 2.000000 4.764103
119.0 2.153846 23.0 2.0 1.461538 5.064103

Now that we’ve seen how the a GroupBy object works, let’s see how we can compute a whole bunch of summary statistics at once. Specifically we want to compute the mean, standard deviation, min, max and several percentiles. First let’s create a slightly different GroupBy object.

bydate_dfgrp2 = bydate_df.groupby(['category', 'dayofweek', 'binofday'])

Now let’s define a function that will return a bunch of statistics in a dictionary for a column of data.

def get_occstats(group, stub=''):
    # Return a dict of summary stats for a column of data. The optional
    # stub is prefixed to each statistic's name.
    return {stub+'count': group.count(), stub+'mean': group.mean(),
            stub+'min': group.min(),
            stub+'max': group.max(), stub+'stdev': group.std(),
            stub+'p50': group.quantile(0.5), stub+'p55': group.quantile(0.55),
            stub+'p60': group.quantile(0.6), stub+'p65': group.quantile(0.65),
            stub+'p70': group.quantile(0.7), stub+'p75': group.quantile(0.75),
            stub+'p80': group.quantile(0.8), stub+'p85': group.quantile(0.85),
            stub+'p90': group.quantile(0.9), stub+'p95': group.quantile(0.95),
            stub+'p975': group.quantile(0.975),
            stub+'p99': group.quantile(0.99)}

Now we can use the apply() method to apply the get_occstats() function to each group in a data series. We'll create separate output series for occupancy, arrivals, and departures.

occ_stats = bydate_dfgrp2['occupancy'].apply(get_occstats)
arr_stats = bydate_dfgrp2['arrivals'].apply(get_occstats)
dep_stats = bydate_dfgrp2['departures'].apply(get_occstats)

So, what is occ_stats?

type(occ_stats)
pandas.core.series.Series

It’s a pandas Series object. What does its index look like?

occ_stats.index
MultiIndex([(  'ART', 0.0,  0.0, 'count'),
            (  'ART', 0.0,  0.0,  'mean'),
            (  'ART', 0.0,  0.0,   'min'),
            (  'ART', 0.0,  0.0,   'max'),
            (  'ART', 0.0,  0.0, 'stdev'),
            (  'ART', 0.0,  0.0,   'p50'),
            (  'ART', 0.0,  0.0,   'p55'),
            (  'ART', 0.0,  0.0,   'p60'),
            (  'ART', 0.0,  0.0,   'p65'),
            (  'ART', 0.0,  0.0,   'p70'),
            ...
            ('Total', 6.0, 47.0,   'p60'),
            ('Total', 6.0, 47.0,   'p65'),
            ('Total', 6.0, 47.0,   'p70'),
            ('Total', 6.0, 47.0,   'p75'),
            ('Total', 6.0, 47.0,   'p80'),
            ('Total', 6.0, 47.0,   'p85'),
            ('Total', 6.0, 47.0,   'p90'),
            ('Total', 6.0, 47.0,   'p95'),
            ('Total', 6.0, 47.0,  'p975'),
            ('Total', 6.0, 47.0,   'p99')],
           names=['category', 'dayofweek', 'binofday', None], length=34272)

Notice it’s a MultiIndex with 4 levels: category, dayofweek, binofday, statistic. It would be nice to “un-pivot” the statistic from the index and have it correspond to a set of columns. That’s what unstack() will do. It will leave us with a DataFrame with all of the statistics as columns and a 3 level multi-index of category, dayofweek and binofday. Perfect for plotting.

occ_stats.unstack()
count mean min max stdev p50 p55 p60 p65 p70 p75 p80 p85 p90 p95 p975 p99
category dayofweek binofday
ART 0.0 0.0 12.0 0.000000 0.0 0.0 0.000000 0.0 0.00 0.00 0.000000 0.00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1.0 12.0 0.000000 0.0 0.0 0.000000 0.0 0.00 0.00 0.000000 0.00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2.0 12.0 0.000000 0.0 0.0 0.000000 0.0 0.00 0.00 0.000000 0.00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3.0 12.0 0.000000 0.0 0.0 0.000000 0.0 0.00 0.00 0.000000 0.00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4.0 12.0 0.000000 0.0 0.0 0.000000 0.0 0.00 0.00 0.000000 0.00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Total 6.0 43.0 39.0 0.837607 0.0 4.0 1.185640 0.0 0.45 0.66 0.960000 1.00 1.300000 1.840000 2.000000 2.560000 3.520000 4.000000 4.000000
44.0 39.0 0.652137 0.0 8.0 1.584774 0.0 0.00 0.00 0.000000 0.00 0.000000 0.813333 1.620000 2.240000 3.400000 4.200000 6.480000
45.0 39.0 0.682051 0.0 4.0 1.020631 0.0 0.27 0.56 0.646667 0.82 1.100000 1.253333 1.620000 2.080000 2.733333 3.366667 3.746667
46.0 39.0 0.831624 0.0 4.0 1.279230 0.0 0.15 0.44 0.853333 1.00 1.000000 1.920000 2.000000 2.400000 4.000000 4.000000 4.000000
47.0 39.0 0.412821 0.0 4.0 0.838915 0.0 0.00 0.00 0.166667 0.32 0.466667 0.666667 0.766667 1.226667 2.066667 2.733333 3.493333

2016 rows × 17 columns

occ_stats_summary = occ_stats.unstack()
arr_stats_summary = arr_stats.unstack()
dep_stats_summary = dep_stats.unstack()

occ_stats_summary[200:220] # Let's peek into the middle of the table.
count mean min max stdev p50 p55 p60 p65 p70 p75 p80 p85 p90 p95 p975 p99
category dayofweek binofday
ART 4.0 8.0 13.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9.0 13.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
10.0 13.0 0.010256 0.000000 0.133333 0.036980 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.053333 0.093333 0.117333
11.0 13.0 0.325641 0.000000 1.600000 0.485179 0.000000 0.060000 0.180000 0.420000 0.553333 0.633333 0.653333 0.680000 0.720000 1.080000 1.340000 1.496000
12.0 13.0 1.800000 0.000000 4.500000 1.290923 1.966667 2.026667 2.146667 2.386667 2.493333 2.533333 2.713333 2.886667 3.046667 3.660000 4.080000 4.332000
13.0 13.0 5.143590 2.266667 7.966667 1.619007 4.933333 5.073333 5.226667 5.406667 5.840000 6.400000 6.480000 6.740000 7.360000 7.726667 7.846667 7.918667
14.0 13.0 8.497436 6.266667 11.166667 1.368661 7.966667 8.366667 8.733333 9.033333 9.226667 9.366667 9.466667 9.700000 10.200000 10.686667 10.926667 11.070667
15.0 13.0 8.707692 5.166667 10.733333 1.572987 9.200000 9.480000 9.666667 9.666667 9.680000 9.700000 9.860000 10.040000 10.260000 10.493333 10.613333 10.685333
16.0 13.0 6.046154 2.833333 8.666667 1.833815 6.533333 6.613333 6.680000 6.720000 6.880000 7.100000 7.480000 7.866667 8.266667 8.506667 8.586667 8.634667
17.0 13.0 6.279487 3.500000 9.066667 1.587074 6.333333 6.353333 6.480000 6.820000 6.960000 7.000000 7.060000 7.473333 8.593333 9.006667 9.036667 9.054667
18.0 13.0 6.310256 4.000000 8.800000 1.491729 6.500000 6.760000 6.966667 7.066667 7.220000 7.400000 7.620000 7.800000 7.900000 8.280000 8.540000 8.696000
19.0 13.0 6.074359 3.833333 7.566667 1.215955 5.966667 6.466667 6.840000 6.960000 7.053333 7.133333 7.193333 7.280000 7.420000 7.506667 7.536667 7.554667
20.0 13.0 5.784615 2.533333 9.500000 1.736142 5.533333 5.873333 6.153333 6.313333 6.473333 6.633333 6.753333 7.013333 7.553333 8.440000 8.970000 9.288000
21.0 13.0 6.271795 3.833333 9.900000 1.718303 5.700000 5.780000 5.860000 5.940000 6.566667 7.466667 7.646667 7.946667 8.486667 9.160000 9.530000 9.752000
22.0 13.0 6.343590 2.633333 9.000000 1.634092 6.633333 6.673333 6.793333 7.073333 7.246667 7.366667 7.526667 7.646667 7.686667 8.220000 8.610000 8.844000
23.0 13.0 5.692308 3.533333 9.100000 1.697337 5.100000 5.440000 5.806667 6.226667 6.660000 7.100000 7.300000 7.446667 7.486667 8.140000 8.620000 8.908000
24.0 13.0 5.592308 2.800000 8.900000 1.593921 5.433333 5.453333 5.566667 5.866667 6.113333 6.333333 6.633333 6.993333 7.473333 8.140000 8.520000 8.748000
25.0 13.0 5.282051 3.000000 8.566667 1.492674 5.166667 5.226667 5.340000 5.560000 5.753333 5.933333 6.173333 6.486667 6.946667 7.686667 8.126667 8.390667
26.0 13.0 4.779487 2.300000 8.333333 1.769627 4.800000 4.920000 5.106667 5.426667 5.533333 5.533333 6.013333 6.400000 6.600000 7.333333 7.833333 8.133333
27.0 13.0 4.817949 2.266667 8.066667 1.493295 4.833333 4.933333 5.073333 5.293333 5.393333 5.433333 5.653333 5.906667 6.226667 7.026667 7.546667 7.858667
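
As an aside, newer versions of pandas can build this same kind of table in a single step: agg() accepts a list of aggregation functions and labels each output column with the function's __name__. Here's a sketch of that alternative (my suggestion, not part of the original workflow; assumes a reasonably recent pandas version):

def percentile(q):
    # Build a quantile aggregator whose __name__ becomes the column label
    def pctile(group):
        return group.quantile(q)
    pctile.__name__ = 'p{:g}'.format(q * 100)
    return pctile

occ_stats_alt = bydate_dfgrp2['occupancy'].agg(
    ['count', 'mean', 'min', 'max', 'std'] +
    [percentile(q) for q in [0.5, 0.75, 0.9, 0.95, 0.975, 0.99]])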

Wouldn’t it be nice if Excel Pivot Tables could produce the output above? Why can’t they? Because they can’t do things like percentiles (or other custom aggregate functions). I love spreadsheets. I teach spreadsheet modeling. However, I find myself using either Python+pandas+matplotlib or R+plyr+ggplot2 more and more frequently for things I used to do in Excel.

Let’s fire these guys out to csv files so we can check them out and maybe play with them in spreadsheet.

occ_stats_summary.to_csv('occ_stats_summary.csv')
arr_stats_summary.to_csv('arr_stats_summary.csv')
dep_stats_summary.to_csv('dep_stats_summary.csv')

The real reason I exported them to csv was to make it easy to read these results back in for Part 3 of this series of tutorials. In Part 3, we’ll create some plots using matplotlib based on these summary statistics.
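
If you want to poke at the summaries yourself before then, the csv files read back in with their three-level index intact. A quick sketch, assuming the default layout written by to_csv() above:

occ_stats_summary = pd.read_csv('occ_stats_summary.csv',
                                index_col=['category', 'dayofweek', 'binofday'])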

Citation

BibTeX citation:
@online{isken2013,
  author = {Mark Isken},
  title = {Computing Occupancy Statistics with {Python} - {Part} 2 of 3},
  date = {2013-04-08},
  langid = {en}
}
For attribution, please cite this work as:
Mark Isken. 2013. “Computing Occupancy Statistics with Python - Part 2 of 3.” April 8, 2013.